Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations


  • Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations

    Phoronix: Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations

    A set of patches that continues to be worked on for the Linux kernel reconciles NUMA balancing decisions with the load balancer. Ultimately this series is about reducing unnecessary task and page migrations and other NUMA balancing overhead...
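
    For context, automatic NUMA balancing tries to converge on the kind of placement an application could also request explicitly. Below is a minimal libnuma sketch of that explicit placement, purely as an illustration and not code from the patch series; node 0 is just an example, and the build needs -lnuma.

    Code:
    /* Minimal sketch: bind a task and its memory to one NUMA node with
     * libnuma.  Automatic NUMA balancing tries to converge on a placement
     * like this without hints; the patches discussed here aim to get there
     * with fewer task and page migrations.
     */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int node = 0;                       /* example: first NUMA node */
        numa_run_on_node(node);             /* restrict scheduling to that node */

        size_t len = 64 << 20;              /* 64 MiB of node-local memory */
        char *buf = numa_alloc_onnode(len, node);
        if (!buf)
            return 1;

        memset(buf, 0, len);                /* fault the pages in locally */
        numa_free(buf, len);
        return 0;
    }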


  • #2
    NUMA (and other high-end system) support is critical for the large enterprises that (in)directly pay for a lot of Linux development, but such patches always remind me of: https://xkcd.com/619/

    • #3
      Typo:

      Originally posted by phoronix View Post
      Phoronix: Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations

      A set of patches that continue to be worked on for the Linux kernek is reconciling NUMA balancing decisions with the load balancer. Ultimately this series is about reducing unnecessary task and page migrations and other NUMA balancing overhead...

      http://www.phoronix.com/scan.php?pag...ile-Balance-V4

      • #4
        What about I/O? Does Linux take into account which PCIe interface is connected to which NUMA node?
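
        The attachment point at least is exported through sysfs in each device's numa_node attribute; a quick sketch that reads it (the PCI address is only a placeholder, substitute one from lspci -D, and -1 means the firmware didn't report a node):

        Code:
        /* Sketch: read which NUMA node a PCIe device hangs off, as reported
         * in sysfs.  The address below is only an example BDF.
         */
        #include <stdio.h>

        int main(void)
        {
            const char *path = "/sys/bus/pci/devices/0000:01:00.0/numa_node";
            FILE *f = fopen(path, "r");
            int node = -1;

            if (!f) {
                perror(path);
                return 1;
            }
            if (fscanf(f, "%d", &node) != 1)
                node = -1;                  /* attribute unreadable */
            fclose(f);

            printf("%s -> node %d\n", path, node);
            return 0;
        }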

        • #5
          Originally posted by CommunityMember View Post
          NUMA (and other high-end system) support is critical for the large enterprises that (in)directly pay for a lot of Linux development, but such patches always remind me of: https://xkcd.com/619/
          Not solely. With Threadripper (at least the Zen1/1+ versions) I believe the higher core counts were NUMA, since some CCXs hopped the Infinity Fabric to get to the other die's memory controllers. Likewise even with Zen2 Threadripper, there's a MUCH lower latency/hop difference, but it's still properly NUMA. This may help with that even more. Still more workstation/high end enthusiast, but not purely "enterprise".

          • #6
            Originally posted by Drizzt321 View Post
            Still more workstation/high end enthusiast, but not purely "enterprise".
            But it is not "critical" for most of those, just a nice-to-have (you get the benefit of improvements done primarily for others). And if you are actually paying for a Linux subscription (RH, Oracle, Canonical, etc.) that helps to pay for development, you are one of the few (and I am sure everyone thanks you for your contributions - Thank you!).

            • #7
              Originally posted by Drizzt321 View Post
              Likewise even with Zen2 Threadripper, there's a MUCH lower latency/hop difference, but it's still properly NUMA.
              It is not NUMA since the IMC (and PCIe) is on the shared IO die that connects to all CCDs via Infinity Fabric. It, along with EPYC Rome, can be configured to present itself as NUMA in the BIOS. I'm not sure exactly how, but it probably binds memory channels to CCDs in this configuration.
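
              One way to see what the firmware actually exposes after flipping that BIOS option is to dump the distance table (SLIT) the kernel reports; a small libnuma sketch (build with -lnuma; 10 means local, larger values mean remote):

              Code:
              /* Sketch: print the NUMA distance matrix (SLIT) exposed by the
               * firmware.  With a single UMA domain this is just "10"; with
               * multiple NUMA domains enabled in the BIOS the off-node entries
               * show the extra relative cost.
               */
              #include <numa.h>
              #include <stdio.h>

              int main(void)
              {
                  if (numa_available() < 0)
                      return 1;

                  int max = numa_max_node();
                  for (int i = 0; i <= max; i++) {
                      for (int j = 0; j <= max; j++)
                          printf("%4d", numa_distance(i, j));
                      printf("\n");
                  }
                  return 0;
              }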

              • #8
                Originally posted by numacross View Post

                It is not NUMA since the IMC (and PCIe) is on the shared IO die that connects to all CCDs via Infinity Fabric. It, along with EPYC Rome, can be configured to present itself as NUMA in the BIOS. I'm not sure exactly how, but it probably binds memory channels to CCDs in this configuration.
                Is NUMA solely defined by connections to the memory, or the latency to different chunks of memory?

                As per https://www.anandtech.com/show/15044...cores-on-7nm/3, each of the CCXs is connected to a different quadrant of the IO die, which the article describes as follows:

                For Rome, AMD had explained that the latency differences between accessing memory on the local quadrant versus accessing remote memory controllers is ~+6-8ns and ~+8-10ns for adjacent quadrants (because of the rectangular die, the quadrants adjacent on the long side have larger latency than adjacent quadrants on the short side), and ~+20-25ns for the diagonally opposing quadrants. While for EPYC, AMD provides options to change the NUMA configuration of the system to optimize for either latency (quadrants are their own NUMA domain) or bandwidth (one big UMA domain), the Threadripper systems simply appear as one UMA domain, with the memory controllers of the quadrants being interleaved in the virtual memory space.
                Granted, for TR this specifically appears as a single UMA domain, so my argument doesn't hold up in the real world, but it's still likely that there are slight memory latency differences in TR even if it's not presented as NUMA.

                • #9
                  Originally posted by Drizzt321 View Post
                  Is NUMA solely defined by connections to the memory, or the latency to different chunks of memory?
                  It probably depends on the scale of the differences you want to consider, since even classical UMA systems can have very small differences between memory modules/channels. For me NUMA means that disjoint nodes have to access memory over a significantly (orders of magnitude) slower bus than normal.

                  Originally posted by Drizzt321 View Post
                  Granted, for TR this specifically appears as a single UMA domain, so my argument doesn't hold up in the real world, but it's still likely that there are slight memory latency differences in TR even if it's not presented as NUMA.
                  That is probably the reason for the BIOS option I mentioned, to control even this minuscule latency difference.

                  • #10
                    Let's just clear this up: TR3 is only NUMA for the sake of cache and latency between the cores... main memory access is equal. Basically it is just there to prevent threads from hopping between cores and thrashing the caches, as well as the Infinity Fabric, for no reason.
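
                    The manual version of that is simply pinning a thread so the scheduler can't bounce it across CCX boundaries; a minimal sketch with sched_setaffinity, where CPU 0 is only an example:

                    Code:
                    /* Sketch: pin the calling thread to one CPU so the scheduler
                     * cannot migrate it across CCX/cache boundaries.  CPU 0 is
                     * only an example.
                     */
                    #define _GNU_SOURCE
                    #include <sched.h>
                    #include <stdio.h>

                    int main(void)
                    {
                        cpu_set_t set;

                        CPU_ZERO(&set);
                        CPU_SET(0, &set);           /* keep this task on CPU 0 */

                        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                            perror("sched_setaffinity");
                            return 1;
                        }

                        printf("pinned to CPU 0\n");
                        return 0;
                    }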
