Heterogeneous Memory System (HMS) Prototype Published For The Linux Kernel

  • Heterogeneous Memory System (HMS) Prototype Published For The Linux Kernel

    Phoronix: Heterogeneous Memory System (HMS) Prototype Published For The Linux Kernel

    For the past several years, Red Hat developer Jerome Glisse has been working on Heterogeneous Memory Management (HMM) for the Linux kernel to handle the mirroring of process address spaces, system memory that can be transparently used by any device process, and similar functionality around today's GPU computing needs and other devices. Today Jerome published the next step in his low-level memory-device management work: the Heterogeneous Memory System (HMS), for exposing the complex memory topologies of today's systems...
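    HMS itself is still just a prototype with no settled user-space interface, so as a rough, hedged illustration the C sketch below only walks the NUMA topology the kernel already exposes under /sys/devices/system/node - the kind of per-node memory information HMS aims to extend to device-attached memory.

    /*
     * Illustration only (not an HMS API): enumerate the system's memory
     * nodes via the existing sysfs NUMA interface and print the first
     * line of each node's meminfo (its MemTotal).
     */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DIR *d = opendir("/sys/devices/system/node");
        if (!d) {
            perror("opendir");
            return 1;
        }

        struct dirent *ent;
        while ((ent = readdir(d)) != NULL) {
            /* Memory nodes appear as node0, node1, ... */
            if (strncmp(ent->d_name, "node", 4) != 0 ||
                ent->d_name[4] < '0' || ent->d_name[4] > '9')
                continue;

            char path[288];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/%s/meminfo", ent->d_name);

            FILE *f = fopen(path, "r");
            if (!f)
                continue;

            char line[256];
            if (fgets(line, sizeof(line), f))   /* e.g. "Node 0 MemTotal: ..." */
                printf("%s", line);
            fclose(f);
        }
        closedir(d);
        return 0;
    }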

  • #2
    Makes me wonder how long it will be before we see the advantages in Michael's benchmarking. The potential should be pretty good for any GPU compute apps that opt in.

  • #3
    Originally posted by wizard69
    Makes me wonder how long it will be before we see the advantages in Michael's benchmarking. The potential should be pretty good for any GPU compute apps that opt in.

    Yes and no. In a sense, this is where benchmarking can prove its value. The way I see it, HMS is more about opening possibilities than about raw performance; used without proper care, it could be counterproductive - particularly, in my mind, in the hands of cheap programmers, given they will have easier-to-implement but worse-performing alternatives for achieving their desired functionality.

  • #4
    From the post's title, I thought it was related to the thing that is supposed to replace GBM and NVIDIA's EGLStreams.

  • #5
    I imagine they'll either need new cgroups or need to update the existing ones for this too.

  • #6
    Decades ago we had a problem with scientific processing on mainframes. I've seen indications that GPU computing has the same problem, and I wonder whether HSA might be a partial solution - and a complete solution on some hardware. At the time it was referred to as the "gather-scatter problem."

    Our problem at the time was solving sparse matrices - really circuit simulation, not SPICE, but a rough equivalent. The problem lived in a sparse matrix in main memory and we wanted to push it through the (very fast) vector pipes. We found that in spite of the speed of the vector pipes, the net speed improvement was only 2x, because of the time spent gathering the data for the vector pipes and then scattering the results back into the original data structures.

    GPUs are better than the vector pipes in that they have their own working memory. But if for one reason or another you can't keep the entire problem resident in GPU memory, either because it's not big enough or because some of the processing must be done on the CPU instead of the GPU, then you're going to chew up time moving data back and forth instead of doing useful work.

    I can see how HSA might help keep that data GPU-resident longer. I can also see how, on a UMA machine (Kaveri, for instance), HSA might mean never having any extra data movement at all. In that case, the UMA machines might get better performance in spite of somewhat anemic hardware, simply because less time is spent moving data between CPU and GPU.

    However, I don't know whether HSA - or CUDA or OpenCL - mitigates any of these problems, or whether it's orthogonal to them and they remain.
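    To make phred14's gather-scatter point concrete, here is a small, hypothetical C sketch (the names csr_spmv, row_ptr and col_idx are made up for illustration, not anything from HMS or HSA): a compressed-sparse-row matrix-vector product where the indirect col_idx lookups are the "gather" and writing each result back into the dense output is the "scatter" - on a discrete GPU those same scattered accesses decide how much data has to cross the bus before any fast arithmetic happens.

    #include <stddef.h>

    /* y = A * x for a CSR matrix A with nrows rows. */
    void csr_spmv(size_t nrows,
                  const size_t *row_ptr,   /* nrows + 1 entries     */
                  const size_t *col_idx,   /* one entry per nonzero */
                  const double *val,       /* one entry per nonzero */
                  const double *x,         /* dense input vector    */
                  double *y)               /* dense output vector   */
    {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            /* The "gather": col_idx[j] jumps around x unpredictably, so a
             * vector unit (or a GPU holding only part of x) spends much of
             * its time fetching operands rather than multiplying. */
            for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                sum += val[j] * x[col_idx[j]];
            /* The "scatter": the result goes back into the original
             * dense structure. */
            y[i] = sum;
        }
    }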

  • #7
    Originally posted by phred14
    Decades ago we had a problem with scientific processing on mainframes. I've seen indications that GPU computing has the same problem, and I wonder whether HSA might be a partial solution - and a complete solution on some hardware. At the time it was referred to as the "gather-scatter problem."

    Our problem at the time was solving sparse matrices - really circuit simulation, not SPICE, but a rough equivalent. The problem lived in a sparse matrix in main memory and we wanted to push it through the (very fast) vector pipes. We found that in spite of the speed of the vector pipes, the net speed improvement was only 2x, because of the time spent gathering the data for the vector pipes and then scattering the results back into the original data structures.

    GPUs are better than the vector pipes in that they have their own working memory. But if for one reason or another you can't keep the entire problem resident in GPU memory, either because it's not big enough or because some of the processing must be done on the CPU instead of the GPU, then you're going to chew up time moving data back and forth instead of doing useful work.

    I can see how HSA might help keep that data GPU-resident longer. I can also see how, on a UMA machine (Kaveri, for instance), HSA might mean never having any extra data movement at all. In that case, the UMA machines might get better performance in spite of somewhat anemic hardware, simply because less time is spent moving data between CPU and GPU.

    However, I don't know whether HSA - or CUDA or OpenCL - mitigates any of these problems, or whether it's orthogonal to them and they remain.

    Out of interest I was also following HSA development closely, and I read from people more knowledgeable than myself that data movement is indeed the main problem in leveraging dGPUs for general-purpose compute, as the interconnect is currently neither fast enough nor cache coherent. More infrastructure effort is therefore still needed on the hardware side as well (CCIX, Gen-Z and the new SFF-TA-1002 connector seem to be the solutions on the horizon), and OS integration is also still an issue.

    As I see it, Jerome's current work tackles some major issues on the OS side of things, and I hope there is similar work underway for Windows, as it would also help with the problems highlighted at the AMD 2990WX launch, where the OS doesn't yet know that some CPU cores need to be treated differently due to differences in their access to memory. The more widespread and efficient use of APUs, dGPUs and other accelerator devices will profit from this work as well. Jerome briefly mentioned the Heterogeneous Memory Attributes Table (HMAT), a different solution coming from the ACPI 6.2 standard. Maybe someone could shed some light on both approaches and how they might interact.
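    For anyone curious whether their own firmware ships the HMAT at all, here is a small, hedged C sketch (nothing HMS-specific is assumed): the kernel exposes raw ACPI tables under /sys/firmware/acpi/tables/, so reading the standard ACPI header of the HMAT entry - if present, and with root access - shows its signature, size and revision.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Raw ACPI tables are exposed by signature under this directory. */
        FILE *f = fopen("/sys/firmware/acpi/tables/HMAT", "rb");
        if (!f) {
            printf("No HMAT exposed by the firmware (or not running as root).\n");
            return 0;
        }

        /* First fields of the standard ACPI table header. */
        struct {
            char     signature[4];
            uint32_t length;
            uint8_t  revision;
        } __attribute__((packed)) hdr;

        if (fread(&hdr, sizeof(hdr), 1, f) == 1)
            printf("Found %.4s, %u bytes, revision %u\n",
                   hdr.signature, (unsigned)hdr.length, (unsigned)hdr.revision);
        fclose(f);
        return 0;
    }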

  • #8
    This "gather-scatter" problem sounds like a much lower-level problem - possibly even lower than the kernel. I would have thought issues like this would already have been solved with algorithms or with specific hardware designs, but maybe the technology that works is buried under some NDA.

  • #9
    Didn't AMD's first attempt at HyperTransport have accelerators you could just plug into a 2P CPU socket instead of a CPU? https://en.wikipedia.org/wiki/HyperT...r_interconnect There's OpenCAPI too: https://en.wikipedia.org/wiki/Cohere...ssor_Interface

    If the interconnect needs to be faster than PCI-E and be coherent, such interconnects are already available in CPUs; we just need 2P/3P/4P boards made for the consumer, with GPUs installed just like high-end CPUs (as an example). I think this is all getting a bit costly for the average consumer, but it's not like the hardware doesn't exist or hasn't been tried before. I do think the software is the bigger issue, so I'm looking forward to seeing this work merged into Linux with NUMA awareness.

  • #10
    Originally posted by audir8
    Didn't AMD's first attempt at HyperTransport have accelerators you could just plug into a 2P CPU socket instead of a CPU? https://en.wikipedia.org/wiki/HyperT...r_interconnect There's OpenCAPI too: https://en.wikipedia.org/wiki/Cohere...ssor_Interface

    If the interconnect needs to be faster than PCI-E and be coherent, such interconnects are already available in CPUs; we just need 2P/3P/4P boards made for the consumer, with GPUs installed just like high-end CPUs (as an example). I think this is all getting a bit costly for the average consumer, but it's not like the hardware doesn't exist or hasn't been tried before. I do think the software is the bigger issue, so I'm looking forward to seeing this work merged into Linux with NUMA awareness.

    Yes, OpenCAPI is - next to Gen-Z and CCIX - yet another approach to solving this. By the way, there is a presentation on YouTube by an AMD employee covering all three standards, which might be helpful for some.

    For everyone interested, have a look at these articles from the Gen-Z Consortium - at the end they list the major advantages, e.g. cost.

    1) Your next graphics card might look like these: https://genzconsortium.org/meeting-c...ds-with-pecff/
    2) The connector in more detail: https://genzconsortium.org/gen-z-sca...sal-connector/
