AMD Proposing Redesign For How Linux GPU Drivers Work - Explicit Fences Everywhere

  • marios
    Senior Member
    • Jul 2020
    • 273

    #21
    Originally posted by jrch2k8 View Post

    There is a reason this is not easy, the performance hit is massive and the security implications are not trivial either
    About the security implications, that is partially true, but they can be solved. On the other hand, I would expect massive performance gains, not a performance hit. How can something that does almost everything in user-space be slower than something that ping-pongs through the kernel?

    The reason this is not easy is that it needs more HW/SW co-design. What makes things even tougher is the need to support that shitty platform called Windows as well.

    Comment

    • jrch2k8
      Senior Member
      • Jun 2009
      • 2095

      #22
      Originally posted by marios View Post
      1.) About the security implications, it is partially true. But they can be solved.
      2.) I would expect massive performance gains and not performance hit. How can something that does almost everything in user-space be slower than something that ping-pongs the kernel?

      1.) It is true that it is solvable, but it is not nearly as trivial as you make it out to be.
      2.) That is not how the hardware works. The only way to avoid a massive penalty is if the hardware functions independently, i.e. it doesn't require memory access, sync, CPU interoperability, bus access, etc., because the moment you need any of those, execution rings and context switches come into play, among a myriad of other things that will take huge bites out of your performance/latency. Sure, for some trivial hardware like sensors it should be fast enough, but for a GPU it will not.
      3.) Sure, HW specifically designed for this type of access could be made (at least partially), but I don't think it will be trivial or cost-effective for something as massive as a GPU.

      Comment

      • Mez'
        Senior Member
        • Mar 2009
        • 1180

        #23
        I didn't understand a word of that proposal. It seems to be a good thing, right?

        That's reassuring for me and my job, though. There's not a chance in the world anyone on the business/operational side of things would understand any of this. That's why there are translators (of sorts) along the way: FAs and BAs.

        That's exactly why I'm doing my job: so that in the end such sentences can be understood by more than 0.001% of the population.
        I'm a translator of sorts. I'd need an FA here though, for a first translation.
        Last edited by Mez'; 20 April 2021, 08:21 PM.

        Comment

        • Zan Lynx
          Senior Member
          • Jan 2012
          • 901

          #24
          Originally posted by marios View Post

          About the security implications, it is partially true. But they can be solved. On the other hand, I would expect massive performance gains and not performance hit. How can something that does almost everything in user-space be slower than something that ping-pongs the kernel?
           For one example, see user-space spin locks. They are generally slower and worse than kernel locking. Setting them up to beat the kernel requires ridiculous conditions such as real-time scheduling and CPU core pinning. Even then, the implementation usually assumes it is running on a non-NUMA Intel architecture and utterly fails on dual-socket Xeon, AMD, or ARM.

          Other tricks like RCU cannot be done in user-space at all.

          Comment

          • discordian
            Senior Member
            • Sep 2009
            • 1131

            #25
            Originally posted by smitty3268 View Post
            Then 5 years after they get that working for everyone NVidia will mention they have some better solution that everyone else should switch to instead.
             You mean like cooking up and proposing their own kernel GPU model, different from the old one that everyone uses? Those bastards!

             I guess that unlike Nvidia, they don't have to wait 5+ years until the competition gets through primary school, is able to read and understand the proposal, and might even have run into similar problems with existing solutions...

            Comment

            • discordian
              Senior Member
              • Sep 2009
              • 1131

              #26
              Originally posted by Zan Lynx View Post

              For one example see user-space spin locks. They are generally slower and worse than kernel locking. In order to set them up to be better than the kernel requires ridiculous conditions such as real-time scheduling and CPU core pinning. Even then the implementation usually assumes it is running on a non-NUMA, Intel architecture and utterly fails on dual socket Xeon, or AMD or ARM.
               It's how you use it. User-space spin locks are great for optimizing the "fast path"; glibc et al. use them thoroughly for mutexes and other primitives, but fall back to kernel synchronization if needed. Outside of fringe cases this gives you a huge boost.
              Originally posted by Zan Lynx View Post
              Other tricks like RCU cannot be done in user-space at all.
               What is liburcu doing then? One of its flavors (QSBR) is a pure userspace implementation.

              Comment

              • marios
                Senior Member
                • Jul 2020
                • 273

                #27
                Originally posted by jrch2k8 View Post


                1.) It is true that is solvable but is not nearly as trivial as you make it to be
                2.) That is not how the hardware works and the only way to avoid a massive penalty is if the hardware functions independently, aka it doesn't require memory access, sync, CPU interoperability, BUS access, etc. because the moment you need any of those then execution rings and context switches come into play among a myriad of other stuff that will take huge bites at your performance/latency. Sure for some trivial hardware like sensors it should be fast enough but for a GPU it will not.
                3.) Sure, HW specifically designed for this type of access could be made(at least partially) but i don't think will be trivial or cost effective for something as massive as a GPU
                Originally posted by Zan Lynx View Post
                For one example see user-space spin locks. They are generally slower and worse than kernel locking. In order to set them up to be better than the kernel requires ridiculous conditions such as real-time scheduling and CPU core pinning. Even then the implementation usually assumes it is running on a non-NUMA, Intel architecture and utterly fails on dual socket Xeon, or AMD or ARM.

                Other tricks like RCU cannot be done in user-space at all.
                 I am not talking about doing everything in user-space. I am talking about common operations. I am confident that it will work, because similar things are being used in other performance-sensitive domains. I never said it was trivial. Also, the gains might be marginal, making it not worth the effort. As an example of using similar ideas in a different domain, I am leaving this here: https://github.com/ofiwg/ofi-guide/b...ardware-access. The vDSO is another, simpler example.

                 PS. User-space spinlocks might be bad, but having a process make a kernel call in each iteration of the busy waiting (if it is not busy waiting, it is not a spinlock, so you are comparing apples to oranges) is even worse. In practice, mutexes are used that only call the kernel when taking the lock fails.

                Comment

                • Termy
                  Senior Member
                  • Jun 2016
                  • 332

                  #28
                   Am I wrong, or does that sound like preparation for optimizing around the RDNA3 chiplet design?

                  Comment

                  • gigi
                    Junior Member
                    • Apr 2021
                    • 41

                    #29
                     I would first ask: has AMD done enough homework in the Linux kernel for the recently released Zen 3 processors? Does the latest kernel at least recognise Zen 3 processors by default?

                    Comment

                    • jrch2k8
                      Senior Member
                      • Jun 2009
                      • 2095

                      #30
                      Originally posted by marios View Post
                      I am not talking about doing everything in user-space. I am talking about common operations. I am confident that it will work, because similar things are being used in other performance sensitive domains. I never said it was trivial. Also the gains might be marginal, making it not worth the effort. As an example of using similar ideas in different domains, I am leaving this here: https://github.com/ofiwg/ofi-guide/b...ardware-access. VDSO is another simpler example.

                      PS. User space spinlocks might be bad, but having a process make a kernel call in each iteration of the busy waiting (if it is not busy waiting it is not a spinlock, so you are comparing apples to oranges) is even worse. In practice mutexes are used, that only call the kernel when taking the lock fails.
                       Network cards are extremely simple (compared to a GPU) and a lot easier to handle in this scenario, but it will not work for a GPU, because the GPU is bound by basically every hardware operation (memory access, CPU, I/O, bus, cache coherency, etc.), and any extra latency will trigger cache flushes, context-switch delays on the GPU, and so on.

                       Also, you seem to have a big misunderstanding on the software side here, so a few things:

                       1.) Userspace runs at execution ring 3; it is slow and does not have direct hardware access.
                       2.) The kernel runs at execution ring 0; it is really fast and does have direct hardware access.
                       3.) Calling kernel operations from userspace is not that slow and is basically the way the hardware is designed to work (see **).
                       4.) They are not proposing this change because userspace switches are that expensive, but because the current sync primitive in use is way too overkill for their needs, and the extra step for syncs introduces a bit of unneeded latency. That makes sense, since those primitives were designed primarily for CPU operations, not modern GPU sync.
                       5.) OpenFabrics is viable in theory, since the NPU basically handles I/O buffers with DMA and they simply want to bypass kernel verification by writing directly to a page. Personally, I don't love the idea and believe the FreeBSD approach is way better: the NPU writes and reads directly into kernel buffers and passes the file descriptors to userspace (boy, is it fast, btw).
                       6.) On the other hand, unlike 5.), something as simple as drawing a triangle on a GPU could trigger a couple thousand context switches, a metric ton of I/O operations, a huge amount of cache operations, huge DMA reads/writes and subsequent PCIe sync operations trying to upload/download data for sync, etc. It basically turns all your hardware into a Christmas tree. Allowing kernel bypass here would mean having to write a pseudo-kernel just to handle sync.

                       ** Calling kernel code from userspace is very, very fast, and a lot faster than using userspace equivalents (it is basically hardware acceleration), BUT it is a double-edged sword if you don't fully understand what you are doing, because not all primitives are fast for every scenario/task. This is the most common mistake developers make when trying to use syscalls directly, and it is what has given rise to the notion that going to the kernel is slow.

                      Comment
