Originally posted by milkylainen
Modern processor designs do away with the traditional northbridge chipset; its functionality is integrated directly into the CPU, and all PCI-e lanes are connected to and managed by the processor. If the processor has enough PCI-e lanes, you get the full bandwidth out of this setup. That is the case for Threadripper, but not necessarily for Ryzen or Intel desktop processors. Those often only have enough PCI-e lanes for a single x16 slot, so boards with two physical x16 slots run them as either 1x x16 or 2x x8 (e.g. for two-GPU SLI configurations). If you don't need a dedicated GPU, you can still use the first x16 slot of such a system for four NVMe drives.
In general, on a desktop system you need to understand how the lanes are wired and which components share lanes to know how much hardware you can add at full performance. An Epyc CPU, in contrast, provides 128 PCI-e lanes, so a big chunk of the server boards available right now don't even expose all of them physically.
Also, the CPU needs to support PCI-e bifurcation, which all modern high-end processors do (including all Zen parts), and the functionality needs to be exposed in the BIOS. Modern CPUs appear to operate on bundles of four lanes, so x4 is the smallest unit the CPU manages directly (x1 slots or x2 NVME slots get routed through the south bridge); I am not 100% sure on that though. This allows an x16 slot to be split into x8/x8, x8/x4/x4 or x4/x4/x4/x4.
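On Linux you can verify how the lanes actually ended up allocated by reading the link attributes from sysfs. Here is a minimal sketch (assuming a Linux box with sysfs in the usual place): on a bifurcated x16 slot each NVMe drive should report an x4 link, and any device that negotiated fewer lanes than it supports stands out immediately.

```python
#!/usr/bin/env python3
# Minimal sketch: list negotiated vs. maximum PCI-e link width/speed per device.
# Assumes a Linux system with sysfs mounted at /sys (standard kernel paths).
from pathlib import Path

def read(dev: Path, attr: str) -> str:
    f = dev / attr
    return f.read_text().strip() if f.exists() else "n/a"

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    cur_w, max_w = read(dev, "current_link_width"), read(dev, "max_link_width")
    cur_s, max_s = read(dev, "current_link_speed"), read(dev, "max_link_speed")
    if cur_w == "n/a":
        continue  # devices without a PCI-e link report nothing useful here
    flag = "  <-- fewer lanes than the device supports" if cur_w != max_w else ""
    print(f"{dev.name}: x{cur_w} @ {cur_s} (max x{max_w} @ {max_s}){flag}")
```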
When it comes to interrupt handling, in theory you don't run into problems, because interrupts can be distributed across CPU cores. So you could pin the interrupts of each NVMe to a separate core to allow parallel processing of interrupts. There is one big catch though. Modern processors with many cores distribute them over several NUMA nodes. Each PCI-e lane is attached to one of those nodes, and if you communicate through that lane from a CPU core that resides on a different NUMA node, the data has to be routed between the two nodes. This adds latency and there is a bandwidth limit involved as well. The operating system is therefore well advised to pin NVMe interrupts to CPU cores on the NUMA node the respective lanes are attached to. In the case of 4x4 lanes on a single x16 slot, all the lanes are obviously attached to the same NUMA node. Note that memory channels are also attached to NUMA nodes; first-generation Epyc's eight memory channels, for example, come from a dual-channel memory controller on each of its four nodes. So you get higher latency on DMA transfers when the memory region happens to be handled by another node. Note, however, that the bandwidth of a single x16 link alone won't saturate the inter-node fabric; it only becomes a concern if a lot of other traffic is crossing nodes at the same time.
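To check whether the kernel got this right, you can look up which NUMA node an NVMe controller hangs off and where its queue interrupts are allowed to run. A rough sketch (assuming Linux; the device name nvme0 is just an illustrative example, and numa_node reads -1 on non-NUMA systems):

```python
#!/usr/bin/env python3
# Rough sketch: show the NUMA node of an NVMe controller and the CPU affinity of its IRQs.
# Assumes Linux; "nvme0" is an example device name, adjust as needed.
from pathlib import Path

dev = "nvme0"
numa_node = Path(f"/sys/class/nvme/{dev}/device/numa_node").read_text().strip()
node_cpus = Path(f"/sys/devices/system/node/node{numa_node}/cpulist").read_text().strip()
print(f"{dev} is attached to NUMA node {numa_node} (CPUs {node_cpus})")

# /proc/interrupts has one line per IRQ; NVMe queue interrupts are named e.g. "nvme0q1".
for line in Path("/proc/interrupts").read_text().splitlines():
    if dev + "q" in line:
        irq = line.split(":")[0].strip()
        affinity = Path(f"/proc/irq/{irq}/smp_affinity_list").read_text().strip()
        print(f"IRQ {irq} ({line.split()[-1]}) may run on CPUs {affinity}")
```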
In the end, if you have a CPU like Threadripper, you don't have to worry about getting the full bandwidth out of 4 NVMe drives. TR has two NUMA nodes, and there are enough CPU cores on each to handle both the interrupts of four NVMe drives and all the filesystem work. Obviously the performance is best if the process handling the data stream also resides on the same NUMA node.
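If you want to make sure of that last point, you can pin the consuming process to the cores of that node yourself. A small sketch of the idea (assuming Linux; nvme0 is again just an example, and numactl --cpunodebind would achieve the same from the shell):

```python
#!/usr/bin/env python3
# Small sketch: pin the current process to the CPUs of the NUMA node an NVMe drive sits on.
# Assumes Linux; "nvme0" is an illustrative device name.
import os
from pathlib import Path

numa_node = Path("/sys/class/nvme/nvme0/device/numa_node").read_text().strip()
cpulist = Path(f"/sys/devices/system/node/node{numa_node}/cpulist").read_text().strip()

# Expand a cpulist like "0-7,16-23" into a set of CPU ids.
cpus = set()
for part in cpulist.split(","):
    lo, _, hi = part.partition("-")
    cpus.update(range(int(lo), int(hi or lo) + 1))

os.sched_setaffinity(0, cpus)  # 0 = the calling process
print(f"pinned to NUMA node {numa_node}, CPUs {sorted(cpus)}")
```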