Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations
For me, NUMA means that disjoint nodes have to access memory over a significantly slower (orders of magnitude) bus than normal.
It doesn't have to be that drastic to have dire consequences. Consider the behaviour of the Ryzen TR 2990WX on Windows versus Linux, and there the difference in latency is only 60%. Or at least that's how the firmware reports it: 10 for local memory vs. 16 for neighbouring-chiplet memory.
Let's just clear this up: TR3 is only NUMA for the sake of cache and latency between the cores; main memory access is equal. Basically it is just there to prevent threads from hopping between cores and thrashing the cache, as well as the Infinity Fabric, for no reason.
Ah yeah, that bit. I forgot about that; that piece of it realistically is NUMA.
So, in a multi-threaded process, how does the kernel know which threads are communicating the most? Is there any way to explicitly associate a subset of the threads within a process, similar to OpenCL's notion of work groups?
...in cases where a thread in one NUMA domain is communicating with a thread in another domain (e.g. buffers being passed down a GStreamer pipeline, with the respective threads being scheduled on different physical CPUs). In the worst case, the downstream malloc cache will get polluted entirely with buffers from the wrong NUMA domain, leading to it getting non-local memory whenever it does allocations.
What's needed is either:
1. Tag allocations with their NUMA domain and bypass the per-thread cache if the freed memory is from a different NUMA domain, or
2. Explicitly tell the kernel to schedule a subset of inter-communicating threads to run in the same NUMA domain.
(Rough sketches of both approaches are below.)
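For option 1, a user-space approximation seems possible today: get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR reports which node backs the page containing a given address, and the free path could skip the per-thread cache when that node isn't the caller's. A hedged sketch follows; cache_push_local() and free_to_global_arena() are hypothetical stand-ins for an allocator's internals, not a real API.

[CODE]
/* Sketch of option 1: keep a per-thread malloc cache from being
 * polluted with remote-node buffers. Link with -lnuma. */
#define _GNU_SOURCE
#include <numaif.h>     /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <numa.h>       /* numa_node_of_cpu */
#include <sched.h>      /* sched_getcpu */
#include <stddef.h>
#include <stdlib.h>

/* Placeholder stand-ins for the allocator's internals. */
static void cache_push_local(void *ptr, size_t size)
{
    (void)size;
    free(ptr);          /* a real allocator would keep this in its per-thread cache */
}

static void free_to_global_arena(void *ptr, size_t size)
{
    (void)size;
    free(ptr);          /* a real allocator would return this to a shared arena */
}

/* NUMA node backing the page that contains 'ptr', or -1 on error. */
static int node_of_ptr(void *ptr)
{
    int node = -1;
    if (get_mempolicy(&node, NULL, 0, ptr,
                      MPOL_F_NODE | MPOL_F_ADDR) != 0)
        return -1;
    return node;
}

/* Free path: only cache the buffer per-thread if it is local to the
 * node this thread is currently running on. */
void numa_aware_free(void *ptr, size_t size)
{
    int buf_node = node_of_ptr(ptr);
    int my_node  = numa_node_of_cpu(sched_getcpu());

    if (buf_node >= 0 && buf_node == my_node)
        cache_push_local(ptr, size);       /* local: safe to cache */
    else
        free_to_global_arena(ptr, size);   /* remote or unknown: bypass the cache */
}
[/CODE]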
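And for option 2 (and the earlier question about explicitly grouping threads): as far as I know the kernel has no OpenCL-style work-group concept for threads, but a process can approximate it itself by pinning the inter-communicating threads to one node's CPUs, e.g. with libnuma. A minimal sketch, assuming node 0 is the target node:

[CODE]
/* Sketch of option 2: constrain a group of communicating threads to
 * one NUMA node using libnuma (link with -lnuma -lpthread). */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    int node = *(int *)arg;

    /* Restrict this thread to the node's CPUs and prefer allocating
     * its memory there, so the group stays node-local. */
    numa_run_on_node(node);
    numa_set_preferred(node);

    /* ... thread work; allocations made here tend to be node-local ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int node = 0;               /* example target node */
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, &node);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
[/CODE]

The same effect can be had from outside the process with cpusets or numactl, but only for the whole process, not for a subset of its threads.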