Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations


  • coder
    replied
    Also, I'd imagine this is detrimental to NUMA performance: https://www.phoronix.com/scan.php?pa...c-thread-cache

    ...in cases where a thread in one NUMA domain is communicating with a thread in another domain (e.g. buffers being passed down a GStreamer pipeline, with the respective threads being scheduled on different physical CPUs). In the worst case, the downstream malloc cache gets polluted entirely with buffers from the wrong NUMA domain, leading to it handing out non-local memory whenever it serves allocations.

    What's needed is either:
    1. Tag allocations with their NUMA domain and bypass the per-thread cache if the freed memory is from a different NUMA domain (roughly what the sketch below checks for).
    2. Explicitly tell the kernel to schedule a subset of inter-communicating threads to run in the same NUMA domain.
    Ideally, both.
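
    Roughly what I mean for 1., i.e. asking the kernel which node actually backs a buffer before recycling it into a per-thread cache. Untested sketch, needs libnuma (link with -lnuma); the print at the end stands in for the allocator's decision, no real allocator does it exactly this way:

    #define _GNU_SOURCE         /* for sched_getcpu() */
    #include <numa.h>           /* numa_available(), numa_node_of_cpu() */
    #include <numaif.h>         /* get_mempolicy(), MPOL_F_NODE, MPOL_F_ADDR */
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ask the kernel which NUMA node backs the page containing ptr (-1 on error). */
    static int node_of(void *ptr)
    {
        int node = -1;
        if (get_mempolicy(&node, NULL, 0, ptr, MPOL_F_NODE | MPOL_F_ADDR) != 0)
            return -1;
        return node;
    }

    int main(void)
    {
        if (numa_available() == -1)
            return 1;

        char *buf = malloc(1 << 20);
        if (!buf)
            return 1;
        buf[0] = 0;  /* touch it so the kernel actually places the page somewhere */

        int here  = numa_node_of_cpu(sched_getcpu());
        int there = node_of(buf);

        /* A NUMA-aware free path would recycle the buffer into the per-thread
         * cache only when there == here, and hand it back to the global pool
         * otherwise. */
        printf("thread on node %d, buffer on node %d -> %s\n",
               here, there, there == here ? "cache locally" : "bypass cache");

        free(buf);
        return 0;
    }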


  • coder
    replied
    So, in a multi-threaded process, how does the kernel know which threads are communicating the most? Is there any way to explicitly associate a subset of the threads within a process, similar to OpenCL's notion of work groups?
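
    Right now the best I can do from user space seems to be pinning the cooperating threads to one node myself, something like this (untested sketch with libnuma, link with -lnuma -pthread; the node number and the stage function are just placeholders):

    #define _GNU_SOURCE
    #include <numa.h>       /* numa_available(), numa_run_on_node(), numa_set_preferred() */
    #include <pthread.h>
    #include <stdio.h>

    #define GROUP_NODE 0    /* assumption: the node the cooperating threads should share */

    static void *pipeline_stage(void *arg)
    {
        /* Keep this thread on GROUP_NODE's CPUs and prefer its memory, so the
         * buffers it trades with the other stage stay node-local. */
        numa_run_on_node(GROUP_NODE);
        numa_set_preferred(GROUP_NODE);

        /* ... actual producer/consumer work would go here ... */
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        if (numa_available() == -1) {
            fprintf(stderr, "no NUMA support on this kernel\n");
            return 1;
        }
        pthread_t producer, consumer;
        pthread_create(&producer, NULL, pipeline_stage, NULL);
        pthread_create(&consumer, NULL, pipeline_stage, NULL);
        pthread_join(producer, NULL);
        pthread_join(consumer, NULL);
        return 0;
    }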


  • coder
    replied
    Originally posted by Drizzt321 View Post
    Ah yea, that bit, forgot that that piece of it realistically is NUMA.
    Ah yea, that meme, forgot that that piece of it realistically is Numa.


  • Drizzt321
    replied
    Originally posted by cb88 View Post
    Let's just clear this up... TR3 is only NUMA for the sake of cache and latency between the cores; main memory access is equal. Basically it is just there to prevent threads from hopping between cores and thrashing the caches, as well as the Infinity Fabric, for no reason.
    Ah yea, that bit, forgot that that piece of it realistically is NUMA.


  • kobblestown
    replied
    Originally posted by numacross View Post
    For me NUMA means that disjoint nodes have to access memory over a significantly (orders of magnitude) slower bus than normal.
    It doesn't have to be that drastic to have dire consequences. Consider the behaviour of the Ryzen TR 2990WX on Windows versus Linux, and there the latency difference is only 60%. Or at least that's how the firmware reports it: 10 for local memory vs. 16 for memory on a neighbouring chiplet.
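
    Those 10 vs 16 figures are presumably just the ACPI SLIT distances; numactl --hardware dumps them, or a few lines of libnuma (link with -lnuma) will do the same:

    #include <numa.h>   /* numa_available(), numa_max_node(), numa_distance() */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() == -1) {
            fprintf(stderr, "kernel reports no NUMA support\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        for (int from = 0; from < nodes; from++) {
            for (int to = 0; to < nodes; to++)
                printf("%4d", numa_distance(from, to));  /* 10 = local */
            printf("\n");
        }
        return 0;
    }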


  • andrewjoy
    replied
    Linux: We are putting in a patch to improve our already very good NUMA support.
    Windows: What's NUMA?


  • cb88
    replied
    Let's just clear this up... TR3 is only NUMA for the sake of cache and latency between the cores; main memory access is equal. Basically it is just there to prevent threads from hopping between cores and thrashing the caches, as well as the Infinity Fabric, for no reason.
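
    And you can already stop that hopping by hand with a plain affinity mask, roughly like this (sketch only, compile with -pthread; the assumption that CPUs 0-3 share one L3 is mine, check lscpu or sysfs for the real layout):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one CCX so its working set stays in that CCX's L3. */
    static void pin_to_ccx(const int *cpus, int ncpus)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < ncpus; i++)
            CPU_SET(cpus[i], &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        /* Assumption: CPUs 0-3 share an L3 on this box. */
        const int ccx0[] = { 0, 1, 2, 3 };
        pin_to_ccx(ccx0, 4);
        printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }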


  • numacross
    replied
    Originally posted by Drizzt321 View Post
    Is NUMA solely defined by connections to the memory, or the latency to different chunks of memory?
    It probably depends on the scale of differences you want to consider, since even classical UMA systems can have very small differences between memory modules/channels. For me NUMA means that disjoint nodes have to access memory over a significantly (orders of magnitude) slower bus than normal.

    Originally posted by Drizzt321 View Post
    Granted for TR this specifically appears as a single UMA domain, so my argument doesn't hold up in the real world, but it's still likely that there are slight memory latency differences in TR even if it's not presented as NUMA.
    That is probably the reason for the BIOS option I mentioned, to control even this minuscule latency difference.


  • Drizzt321
    replied
    Originally posted by numacross View Post

    It is not NUMA since the IMC (and PCIe) is on the shared IO die that connects to all CCDs via Infinity Fabric. It, along with EPYC Rome, can be configured to present itself as NUMA in the BIOS. While I'm not sure exactly how, it probably binds memory channels to CCDs in this configuration.
    Is NUMA solely defined by connections to the memory, or the latency to different chunks of memory?

    As per https://www.anandtech.com/show/15044...cores-on-7nm/3, each of the CCXs is connected to a different quadrant of the IO die:

    For Rome, AMD had explained that the latency differences between accessing memory on the local quadrant versus accessing remote memory controllers is ~+6-8ns and ~+8-10ns for adjacent quadrants (because of the rectangular die, the quadrants adjacent on the long side have larger latency than adjacent quadrants on the short side), and ~+20-25ns for the diagonally opposing quadrants. While for EPYC, AMD provides options to change the NUMA configuration of the system to optimize for either latency (quadrants are their own NUMA domain) or bandwidth (one big UMA domain), the Threadripper systems simply appear as one UMA domain, with the memory controllers of the quadrants being interleaved in the virtual memory space.
    Granted for TR this specifically appears as a single UMA domain, so my argument doesn't hold up in the real world, but it's still likely that there are slight memory latency differences in TR even if it's not presented as NUMA.
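
    Easy enough to check what a given box actually presents, by the way; something like this just walks the standard sysfs node directories (assuming contiguous node numbering):

    #include <stdio.h>

    int main(void)
    {
        char path[64], cpus[4096];

        /* Walk node0, node1, ... until one is missing. */
        for (int node = 0; ; node++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/cpulist", node);
            FILE *f = fopen(path, "r");
            if (!f)
                break;
            if (fgets(cpus, sizeof(cpus), f))
                printf("node%d cpus: %s", node, cpus);
            fclose(f);
        }
        return 0;
    }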


  • numacross
    replied
    Originally posted by Drizzt321 View Post
    Likewise even with Zen2 Threadripper, there's a MUCH lower latency/hop difference, but it's still properly NUMA.
    It is not NUMA since the IMC (and PCIe) is on the shared IO die that connects to all CCDs via Infinity Fabric. It, along with EPYC Rome, can be configured to present itself as NUMA in the BIOS. While I'm not sure exactly how, it probably binds memory channels to CCDs in this configuration.
