Linux NUMA Patches Aim To Reduce Overhead, Avoid Unnecessary Migrations
For me, NUMA means that disjoint nodes have to access memory over a significantly slower (orders of magnitude) bus than normal.
It doesn't have to be that drastic to have dire consequences. Consider the behaviour of the Ryzen TR 2990WX on Windows versus Linux, and there the difference in latency is only 60%. Or at least that's how the firmware reports it: 10 for local memory vs. 16 for neighbouring-chiplet memory.
Let's just clear this up: TR3 is only NUMA for the sake of cache and latency between the cores; main memory access is equal. Basically it is just there to prevent threads from hopping between cores and thrashing the cache, as well as the Infinity Fabric, for no reason.
Ah yeah, that bit. I forgot about that; that piece of it realistically is NUMA.
So, in a multi-threaded process, how does the kernel know which threads are communicating the most? Is there any way to explicitly associate a subset of the threads within a process, similar to OpenCL's notion of work groups?
...in cases where a thread in one NUMA domain is communicating with a thread in another domain (e.g. buffers being passed down a GStreamer pipeline, with the respective threads being scheduled on different physical CPUs). In the worst case, the downstream malloc cache will get polluted entirely with buffers from the wrong NUMA domain, leading to it getting non-local memory whenever it does allocations.
What's needed is either:
1. Tag allocations with their NUMA domain and bypass the per-thread cache if the freed memory is from a different NUMA domain, or
2. Explicitly tell the kernel to schedule a subset of inter-communicating threads to run in the same NUMA domain.
(Rough sketches of both approaches are below.)
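For option 1, a user-space approximation seems possible today: get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR reports which node backs the page containing a given address, and the free path could skip the per-thread cache when that node isn't the caller's. A hedged sketch follows; cache_push_local() and free_to_global_arena() are hypothetical stand-ins for an allocator's internals, not a real API.

[CODE]
/* Sketch of option 1: keep a per-thread malloc cache from being
 * polluted with remote-node buffers. Link with -lnuma. */
#define _GNU_SOURCE
#include <numaif.h>     /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <numa.h>       /* numa_node_of_cpu */
#include <sched.h>      /* sched_getcpu */
#include <stddef.h>
#include <stdlib.h>

/* Placeholder stand-ins for the allocator's internals. */
static void cache_push_local(void *ptr, size_t size)
{
    (void)size;
    free(ptr);          /* a real allocator would keep this in its per-thread cache */
}

static void free_to_global_arena(void *ptr, size_t size)
{
    (void)size;
    free(ptr);          /* a real allocator would return this to a shared arena */
}

/* NUMA node backing the page that contains 'ptr', or -1 on error. */
static int node_of_ptr(void *ptr)
{
    int node = -1;
    if (get_mempolicy(&node, NULL, 0, ptr,
                      MPOL_F_NODE | MPOL_F_ADDR) != 0)
        return -1;
    return node;
}

/* Free path: only cache the buffer per-thread if it is local to the
 * node this thread is currently running on. */
void numa_aware_free(void *ptr, size_t size)
{
    int buf_node = node_of_ptr(ptr);
    int my_node  = numa_node_of_cpu(sched_getcpu());

    if (buf_node >= 0 && buf_node == my_node)
        cache_push_local(ptr, size);       /* local: safe to cache */
    else
        free_to_global_arena(ptr, size);   /* remote or unknown: bypass the cache */
}
[/CODE]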
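And for option 2 (and the earlier question about explicitly grouping threads): as far as I know the kernel has no OpenCL-style work-group concept for threads, but a process can approximate it itself by pinning the inter-communicating threads to one node's CPUs, e.g. with libnuma. A minimal sketch, assuming node 0 is the target node:

[CODE]
/* Sketch of option 2: constrain a group of communicating threads to
 * one NUMA node using libnuma (link with -lnuma -lpthread). */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    int node = *(int *)arg;

    /* Restrict this thread to the node's CPUs and prefer allocating
     * its memory there, so the group stays node-local. */
    numa_run_on_node(node);
    numa_set_preferred(node);

    /* ... thread work; allocations made here tend to be node-local ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int node = 0;               /* example target node */
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, &node);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
[/CODE]

The same effect can be had from outside the process with cpusets or numactl, but only for the whole process, not for a subset of its threads.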