
AMDGPU Linux Driver Enabling New "LSDMA" Block


  • coder
    replied
    Originally posted by arQon View Post
    In a nutshell, the problem is that you need to keep submitting draws *for the loadscreen itself* while ("interspersed with", really) actually loading the assets. For technical, historical, or practical reasons, those two activities are essentially single-threaded even if they aren't literally so. If vsync is on, the loadscreen causes the renderer to block, so you spend 14ms waiting for vsync for every 2-3ms you spend doing actual work.
    So, you're saying they load some pre-determined number of assets in between screen redraws. And each redraw, you have to wait for the next vsync. That's indeed unfortunate.

    Originally posted by arQon View Post
    You could upgrade from SATA to NVME and it wouldn't make any difference, because the only thing you'd be changing is the ratio of idle to active time over the exact same period.
    Ah, so that's why ridiculously high-refresh monitors exist. I already thought 240 Hz was extreme, so imagine my surprise when 360 Hz and even 480 Hz monitors came along!
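    The arithmetic behind this is easy to sanity-check: if a fixed batch of assets is loaded per redraw and every redraw costs a full refresh interval, the wall time depends only on frame count, not storage speed. A minimal model (the batch-per-frame behavior is taken from the description above; the specific numbers are illustrative):

```python
# Rough model of a vsync-gated load loop: a fixed batch of assets is
# loaded per frame, then the renderer blocks until the next vsync.
def load_wall_time_ms(n_assets, assets_per_frame, refresh_hz):
    frame_ms = 1000.0 / refresh_hz
    frames = -(-n_assets // assets_per_frame)  # ceil division
    return frames * frame_ms

# 10,000 assets, 10 per 60 Hz frame -> 1000 frames -> ~16.7 s of wall
# time, regardless of how fast each individual asset reads off disk:
print(load_wall_time_ms(10_000, 10, 60))
```

    Note that a faster drive leaves `frames` unchanged (it only shrinks the active slice of each frame), while a higher refresh rate shrinks `frame_ms` directly, which is exactly the joke about 360 Hz and 480 Hz monitors.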



  • arQon
    replied
    Originally posted by Linuxxx View Post
    could You give a concrete example where this is actually the case?
    Sure: how about "Anything based on Gamebryo", i.e. Skyrim / Fallout / etc, since that one has user-level proof available for you. The comments on https://www.nexusmods.com/fallout4/mods/10283 are mostly along the lines of a 5x-10x improvement (though I think there are some minor other factors at work there too for the more extreme cases).

    > on a SATA SSD, let alone a NVMe one.

    It doesn't matter even if all the data is in *cache*, let alone something as slow as NVMe. That's the whole point.

    Like I say, it's inherent in some cases - including the only graphics API available on Linux when you registered your account here. The chances of you genuinely never having encountered it are basically zero: it was measurable even 20 years ago, it's just that back then we were using a thousandth of the assets that modern games do, and there was an expectation that loads would be slow anyway because of "the disk".



  • Linuxxx
    replied
    Originally posted by arQon View Post
    If vsync is on, the loadscreen causes the renderer to block, so you spend 14ms waiting for vsync for every 2-3ms you spend doing actual work. You could upgrade from SATA to NVME and it wouldn't make any difference, because the only thing you'd be changing is the ratio of idle to active time over the exact same period.

    Like a lot of things in this line of work, it's not at all obvious until you become aware of it, but then it is.
    I'm sorry to intervene like this, but could You give a concrete example where this is actually the case?
    Because I always keep Vsync on in games (more consistent frametimes), yet I have never experienced excruciatingly long loading times on a SATA SSD, let alone an NVMe one.
    Heck, last time I tested it, even on an HDD either the BFQ or the Kyber I/O scheduler would ensure quicker start-up than using none at all.
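    (For anyone wanting to check which scheduler a drive is actually using: the kernel exposes it in sysfs, with the active one in brackets. A small sketch of parsing that format; the path and example string follow the standard sysfs layout, nothing here is engine-specific:)

```python
def active_scheduler(line: str) -> str:
    """Extract the bracketed (active) scheduler from a sysfs
    queue/scheduler line, e.g. 'mq-deadline kyber [bfq] none' -> 'bfq'."""
    start = line.index("[") + 1
    return line[start:line.index("]")]

# On a real system you'd read e.g. /sys/block/sda/queue/scheduler:
print(active_scheduler("mq-deadline kyber [bfq] none"))  # bfq
```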

    So, which one is that high-profile engine that still murders load times when Vsync is on, I wonder?



  • arQon
    replied
    Originally posted by Mahboi View Post
    I also didn't think LOD code was that heavy on the systems...
    It's not. I simplified things down to a concept that would be easy to understand.

    As a starting point, consider standing on a street: you can see at least a dozen buildings, each potentially with different construction, but at a minimum different siding, windows, roofs, paint, decorations, trees / bushes / rocks in the yard, fences, cars in the driveway, etc etc etc. The cars alone will be tens of thousands of triangles, made of dozens of materials, each composed of several textures. The bookkeeping is there so that when you turn the corner onto a street that doesn't have e.g. any red Ford trucks on it, all the geometry and textures can be discarded (or at least, marked as discardABLE, but that's splitting hairs really).
    Now scale that up from "a couple of streets" to "several hundred square miles", mapped down to a 1-inch resolution, and hopefully you'll start to get an idea of how much "work" is involved in keeping track of it all.
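    That bookkeeping boils down to tracking which live objects still reference each asset, so assets with no remaining users can be marked discardable rather than freed outright. A toy sketch of the idea (all names here are made up for illustration; a real engine tracks far more, e.g. residency, LOD level, and GPU handles):

```python
# Toy asset tracker: assets are refcounted by the objects using them;
# when the last user goes away the asset is only *marked* discardable,
# so it can be resurrected cheaply if the player turns back around.
class AssetTracker:
    def __init__(self):
        self.refs = {}          # asset name -> refcount
        self.discardable = set()

    def acquire(self, asset):
        self.refs[asset] = self.refs.get(asset, 0) + 1
        self.discardable.discard(asset)

    def release(self, asset):
        self.refs[asset] -= 1
        if self.refs[asset] == 0:
            self.discardable.add(asset)  # eviction candidate, not freed yet

tracker = AssetTracker()
tracker.acquire("red_ford_truck.mesh")   # truck visible on this street
tracker.acquire("red_ford_truck.mesh")   # two of them, same asset
tracker.release("red_ford_truck.mesh")
tracker.release("red_ford_truck.mesh")   # turned the corner: no users left
print(tracker.discardable)               # {'red_ford_truck.mesh'}
```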

    It's not that I/O doesn't matter, but it's very much NOT just a case of "how quickly can I get one texture from disk into VRAM?". There's so much other stuff going on (including some pretty heavy operations like compiling shaders etc) that I/O just isn't important *enough* that avoiding a copy or two really matters much. You still have to pull it from the disk in the first place, and that alone is enough of a bottleneck for the copy to be irrelevant even before you start getting into the rest of the code. (System memory is, what, ~15x faster than an NVME drive these days? More? Even in the worst case, you're still only interposing from a 4x PCIE SSD to a 16x PCIE card, though the overheads are different. Still, it's a lot of headroom).
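    The headroom claim at the end is easy to put numbers on: even if every byte takes an extra trip through system RAM, the copy time is small next to the disk read itself. A back-of-the-envelope sketch (the ~7 GB/s and ~100 GB/s figures are ballpark Gen4-NVMe and DDR numbers, not measurements):

```python
# Time spent reading from disk vs. time spent on the RAM bounce copy,
# for a load staged through system memory.
def staged_load_time_s(bytes_total, disk_gb_per_s, ram_gb_per_s):
    GB = 1e9
    read = bytes_total / (disk_gb_per_s * GB)   # NVMe -> RAM
    copy = bytes_total / (ram_gb_per_s * GB)    # RAM bounce copy
    return read, copy

read_s, copy_s = staged_load_time_s(8e9, 7, 100)   # 8 GB of assets
print(read_s, copy_s)   # ~1.14 s reading vs ~0.08 s copying
```

    So eliminating the bounce copy saves well under 10% of the transfer time; the disk read dominates, which is the point about the copy being irrelevant.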



  • arQon
    replied
    Originally posted by coder View Post
    Please explain.
    Not sure I understand what needs explaining. Did I confuse things by talking about "base" loads when we were mostly discussing streamed loads? (i.e. for sandboxes).

    Loading screens are (generally, at least) just a specialized case of the normal render path: you have to show SOMETHING there, whether it's a static background with a progress bar, or animated 3D models (actors, weapons, etc), or whatever. The thing is, the code that handles that is nearly always either in the same thread as, or at a minimum synced with, the code that's doing the loading. (In OGL this is basically a requirement, because a context can only be current in a single thread, but even when the API allows for better options it's just not something that people want to deal with: there's a ton of dependencies and other bookkeeping to manage in the asset loading, it's never considered "important" until the load times get seriously awful, and very few renderers support concurrent submissions and/or async rendering anyway).

    You start with the area or region the player's in (as determined by the savegame or transition), then you pull in the landscape etc of that level / cell / whatever, plus flora, fauna, monsters, features, items, and so on. That gets you a nice big list of the assets you need to be able to actually place them in a realized world, but you can't do so until all those assets are loaded - and that can take a long time, hence loadscreens. Since part of the point of the loadscreen though is that you know the game hasn't just hung, you refresh it with updates - progress / animation / etc - but since you don't know *ahead* of time how long any particular subgroup of assets is going to take to load, you post the updates pretty aggressively.

    y'know, this is going to take all day and you don't need the details to understand it, so TLDR time:

    In a nutshell, the problem is that you need to keep submitting draws *for the loadscreen itself* while ("interspersed with", really) actually loading the assets. For technical, historical, or practical reasons, those two activities are essentially single-threaded even if they aren't literally so. If vsync is on, the loadscreen causes the renderer to block, so you spend 14ms waiting for vsync for every 2-3ms you spend doing actual work. You could upgrade from SATA to NVME and it wouldn't make any difference, because the only thing you'd be changing is the ratio of idle to active time over the exact same period.

    Like a lot of things in this line of work, it's not at all obvious until you become aware of it, but then it is.
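    Stripped of any real graphics API, the structure being described is roughly this single-threaded loop (all function names below are placeholders, not any engine's actual API; the `present` step is where a vsynced buffer swap would block):

```python
# Skeleton of a classic single-threaded loading loop: load a few assets,
# redraw the loadscreen, present (which blocks on vsync), repeat.
def run_load_loop(pending_assets, batch_size, present):
    loaded = []
    while pending_assets:
        # Do a small slice of real work (the 2-3 ms part)...
        batch = pending_assets[:batch_size]
        pending_assets = pending_assets[batch_size:]
        loaded.extend(batch)
        # ...then redraw the loadscreen so the player knows we haven't
        # hung, and swap. With vsync on, this blocks for the rest of
        # the frame (the 14 ms part).
        present()
    return loaded

frames = 0
def fake_present():        # counts frames instead of swapping buffers
    global frames
    frames += 1

assets = [f"asset_{i}" for i in range(100)]
loaded = run_load_loop(assets, batch_size=10, present=fake_present)
print(frames)   # 10 frames -> 10 full vsync waits for this load
```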
    Last edited by arQon; 11 May 2022, 02:40 AM.



  • coder
    replied
    Originally posted by Mahboi View Post
    @agd5f and you are essentially saying that we have no hardware issues at all and it's all about software.
    That's not how I read agd5f's post. It pretty clearly states that sometimes it won't work. Worse yet, software has no way of knowing whether it will.

    Here's Nvidia's rather bespoke solution:


    Just skimming it, I don't see them address the issue of hardware compatibility, but I think they're assuming that any serious users requiring GPUDirect access to massive datasets will be using a system carefully spec'd for the task.

    Here's yet more detail, which discusses a compatibility mode for systems not supporting the native capability:

    Last edited by coder; 11 May 2022, 03:45 AM.



  • coder
    replied
    Originally posted by arQon View Post
    (My all-time favorite is vsync, which absolutely murders load times and is *still* an issue with some very high-profile engines to this day...)
    Please explain.



  • Mahboi
    replied
    Originally posted by arQon View Post
    all the work that goes into LOD management etc, without which you don't have a working system at all. That management is what makes loads a background task that's *already* transparent to players even when running off a SATA HDD, once the initial work is done,

    Games with infamously-long load times like GTA or (I think?) Witcher have nothing to do with I/O bandwidth, and everything to do with "bad" code, e.g. having to do format conversions, being limited to a single thread, and so on. (As evidenced by some of them getting 4x-10x speedups on load times in later patches).
    (My all-time favorite is vsync, which absolutely murders load times and is *still* an issue with some very high-profile engines to this day...)

    io_uring is also mostly about IOPS rather than throughput, and we don't really care about IOPS.
    Interesting. I assumed it all stemmed from unnecessary CPU/RAM usage and lack of parallel data loading. I also didn't think LOD code was that heavy on the systems...
    I know io_uring is about IOPS but I assumed that it was the problem: throughput isn't a question anymore (I'm willing to bet any modern M.2 SSD is fast enough that most people can't even tell the difference between a PCIe 3 capped and a PCIe 4 capped one), so I thought it was about having a good software tool for massive parallelisation. I learned ITT that parallel DMA has all it needs, but... I'm honestly a bit shocked by what you're saying.

    @agd5f and you are essentially saying that we have no hardware issues at all and it's all about software.

    Originally posted by agd5f View Post
    There are 3 major blockers from a Linux perspective:
    1. Lack of an API at the OpenGL/Vulkan level to make it easy for applications to take advantage of this.
    2. General reluctance to use peer to peer DMA more widely at the kernel level. Part of this is due to the fact that the PCI spec doesn't address peer to peer DMA or make any claims about whether it should work or not or provide a way for the platform to determine whether it works, coupled with the fact that it doesn't work on every platform due to hardware limitations. It does work on all AMD Zen CPUs. It generally works on all recent Intel CPUs, except for some cases where devices cross certain root ports. Beyond that, it's less clear.
    3. Lack of a peer to peer DMA and fencing API at the kernel level that both nvme and GPU drivers support.

    The story was not much better on the windows side until recently due to 2.
    So we have:
    • the obvious PCI-E specs issue which isn't going to be solved anytime soon
    • the lack of an OpenGL/Vulkan API is an obvious problem, but I don't see why it couldn't eventually be done if need be
    Single threading vs multi is a problem that isn't going away anytime soon, alright, but everything else? Am I understanding this right, that essentially 90% of slow game loads are due to poor code for LODs, or model compression/formats, or poor multithreading? If we just had those done well, everything would be very fast? We currently have no hardware bottlenecks and it's pretty much all software?



  • arQon
    replied
    Originally posted by Mahboi View Post
    It's a wet dream but I want to believe in a world of instant gaming with zero loads. If that gets put in place actually, the world will start marking the difference between the era when you had to wait for apps to load and the era where you just click and it all gets loaded pretty much instantly.
    Not to rain on your parade, but we've had "streaming" asset handling for about 15 years now.

    There are too many basic errors in your imaginary vision to address them all, but asset *management* needs to run through the CPU for bookkeeping regardless of the asset *loading* specifics, and that becomes more of a factor as time goes by, not less. Being able to DMA textures etc without having to shuffle them through RAM first is nice, certainly, as that does burn a ton of bandwidth, but it's a pretty small piece of the puzzle compared to all the work that goes into LOD management etc, without which you don't have a working system at all. That management is what makes loads a background task that's *already* transparent to players even when running off a SATA HDD, once the initial work is done, and a couple of seconds at startup when the player is looking at a menu anyway is absolutely the wrong piece of code to be caring about.

    Games with infamously-long load times like GTA or (I think?) Witcher have nothing to do with I/O bandwidth, and everything to do with "bad" code, e.g. having to do format conversions, being limited to a single thread, and so on. (As evidenced by some of them getting 4x-10x speedups on load times in later patches).
    (My all-time favorite is vsync, which absolutely murders load times and is *still* an issue with some very high-profile engines to this day...)

    io_uring is also mostly about IOPS rather than throughput, and we don't really care about IOPS. Loading an atlas or a chunk of polysoup for a game is about as unlike running DB queries and inserts for thousands of concurrent users as you can get.
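    The IOPS-vs-throughput distinction is just about request counts: a game pulling large, mostly sequential assets issues orders of magnitude fewer I/O requests than a database touching the same data as small random reads. A quick illustration (the chunk sizes are made-up but representative):

```python
# How many I/O requests it takes to move the same amount of data,
# depending on request size.
def requests_needed(total_bytes, request_bytes):
    return -(-total_bytes // request_bytes)  # ceil division

# A level load: 4 GB of assets in ~4 MB sequential chunks.
game_reqs = requests_needed(4 * 2**30, 4 * 2**20)
# A DB-style workload: the same 4 GB touched as 4 KB random reads.
db_reqs = requests_needed(4 * 2**30, 4 * 2**10)
print(game_reqs, db_reqs)   # 1024 vs 1048576 requests
```

    io_uring's strength is amortizing per-request overhead, which matters enormously for the second case and hardly at all for the first.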



  • skeevy420
    replied
    Originally posted by agd5f View Post

    There is nothing standing in the way of this today from a hardware perspective. You just need to stream data directly from the nvme to the GPU's vram rather than taking a trip through system memory. AMD did this years ago when we built GPUs with nvme on the GPU board; unfortunately, at the time there was not much interest in the industry. There are 3 major blockers from a Linux perspective:
    1. Lack of an API at the OpenGL/Vulkan level to make it easy for applications to take advantage of this.
    2. General reluctance to use peer to peer DMA more widely at the kernel level. Part of this is due to the fact that the PCI spec doesn't address peer to peer DMA or make any claims about whether it should work or not or provide a way for the platform to determine whether it works, coupled with the fact that it doesn't work on every platform due to hardware limitations. It does work on all AMD Zen CPUs. It generally works on all recent Intel CPUs, except for some cases where devices cross certain root ports. Beyond that, it's less clear.
    3. Lack of a peer to peer DMA and fencing API at the kernel level that both nvme and GPU drivers support.

    The story was not much better on the windows side until recently due to 2.
    I'm hoping, for 1., that MS (and/or Valve) will port the DirectStorage API to Mesa/Vulkan, and that after that 2. and 3. will follow for those with supported hardware. From my layman's POV it makes sense to toss it into Vulkan, since nearly every graphics standard has a to-Vulkan conversion layer.
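    For flavor: a DirectStorage-style API is essentially a queue of "read this file range into that GPU buffer" requests submitted in bulk and waited on as a batch. Nothing below is the real DirectStorage (or any Vulkan extension) API; it's a hypothetical sketch of the shape such an interface takes:

```python
from dataclasses import dataclass, field

@dataclass
class ReadRequest:
    path: str
    offset: int
    length: int
    dest_buffer: str   # stand-in for a GPU buffer handle

@dataclass
class StorageQueue:
    """Hypothetical DirectStorage-like queue: batch many requests,
    submit once, and wait on a fence instead of blocking per read."""
    pending: list = field(default_factory=list)
    completed: list = field(default_factory=list)

    def enqueue(self, req: ReadRequest):
        self.pending.append(req)

    def submit(self):
        # A real implementation would hand these to the kernel / GPU
        # DMA engine; here we just mark them all complete.
        self.completed.extend(self.pending)
        self.pending.clear()

q = StorageQueue()
q.enqueue(ReadRequest("level1.pak", 0, 4 << 20, "vram_tex_atlas"))
q.enqueue(ReadRequest("level1.pak", 4 << 20, 4 << 20, "vram_mesh_pool"))
q.submit()
print(len(q.completed))   # 2
```

    The batching is the whole point: per-request CPU overhead is paid once per submit, not once per read.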

