Originally posted by milkylainen
Modern processor designs do away with the traditional northbridge chipset; its functionality is integrated directly into the CPU, and all PCI-e lanes are connected to and managed by the processor. If the processor has enough PCI-e lanes, you get the full bandwidth out of this setup. That is the case for Threadripper, but not necessarily for Ryzen or Intel desktop processors. Those often only have enough PCI-e lanes for a single x16 slot, so boards with two physical x16 slots run them as either 1x x16 or 2x x8 (e.g. for two-GPU SLI configurations). If you don't need a dedicated GPU, you can still use the first x16 slot of such a system for four NVMe drives.
In general, on a desktop system you need to understand how the lanes are wired and which components share lanes to know how much hardware you can add at full performance. An Epyc CPU, in contrast, provides 128 PCI-e lanes, so a big chunk of the server boards available right now don't even expose all of them physically.
Also, the CPU needs to support PCI-e bifurcation, which all modern high-end processors do (including all Zen parts), and the functionality needs to be exposed in the BIOS. Modern CPUs appear to operate on bundles of four lanes, so x4 is the smallest unit the CPU manages directly (x1 slots or x2 NVME slots get routed through the south bridge); I am not 100% sure on that though. This allows an x16 slot to be split into x8/x8, x8/x4/x4 or x4/x4/x4/x4.
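On Linux you can verify how the lanes actually ended up allocated by reading the link attributes from sysfs. Here is a minimal sketch (assuming a Linux box with sysfs in the usual place): on a bifurcated x16 slot each NVMe drive should report an x4 link, and any device that negotiated fewer lanes than it supports stands out immediately.

```python
#!/usr/bin/env python3
# Minimal sketch: list negotiated vs. maximum PCI-e link width/speed per device.
# Assumes a Linux system with sysfs mounted at /sys (standard kernel paths).
from pathlib import Path

def read(dev: Path, attr: str) -> str:
    f = dev / attr
    return f.read_text().strip() if f.exists() else "n/a"

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    cur_w, max_w = read(dev, "current_link_width"), read(dev, "max_link_width")
    cur_s, max_s = read(dev, "current_link_speed"), read(dev, "max_link_speed")
    if cur_w == "n/a":
        continue  # devices without a PCI-e link report nothing useful here
    flag = "  <-- fewer lanes than the device supports" if cur_w != max_w else ""
    print(f"{dev.name}: x{cur_w} @ {cur_s} (max x{max_w} @ {max_s}){flag}")
```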
When it comes to interrupt handling, in theory you don't run into problems, because interrupts can be distributed across CPU cores. So you could pin the interrupts of each NVMe to a separate core to allow parallel processing of interrupts. There is one big catch though. Modern processors with many cores distribute them over several NUMA nodes. Each PCI-e lane is attached to one of those nodes, and if you communicate through that lane from a CPU core that resides on a different NUMA node, the data has to be routed between the two nodes. This adds latency and there is a bandwidth limit involved as well. The operating system is therefore well advised to pin NVMe interrupts to CPU cores on the NUMA node the respective lanes are attached to. In the case of 4x4 lanes on a single x16 slot, all the lanes are obviously attached to the same NUMA node. Note that memory channels are also attached to NUMA nodes; first-generation Epyc's eight memory channels, for example, come from a dual-channel memory controller on each of its four nodes. So you get higher latency on DMA transfers when the memory region happens to be handled by another node. Note, however, that the bandwidth of a single x16 link alone won't saturate the inter-node fabric; it only becomes a concern if a lot of other traffic is crossing nodes at the same time.
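To check whether the kernel got this right, you can look up which NUMA node an NVMe controller hangs off and where its queue interrupts are allowed to run. A rough sketch (assuming Linux; the device name nvme0 is just an illustrative example, and numa_node reads -1 on non-NUMA systems):

```python
#!/usr/bin/env python3
# Rough sketch: show the NUMA node of an NVMe controller and the CPU affinity of its IRQs.
# Assumes Linux; "nvme0" is an example device name, adjust as needed.
from pathlib import Path

dev = "nvme0"
numa_node = Path(f"/sys/class/nvme/{dev}/device/numa_node").read_text().strip()
node_cpus = Path(f"/sys/devices/system/node/node{numa_node}/cpulist").read_text().strip()
print(f"{dev} is attached to NUMA node {numa_node} (CPUs {node_cpus})")

# /proc/interrupts has one line per IRQ; NVMe queue interrupts are named e.g. "nvme0q1".
for line in Path("/proc/interrupts").read_text().splitlines():
    if dev + "q" in line:
        irq = line.split(":")[0].strip()
        affinity = Path(f"/proc/irq/{irq}/smp_affinity_list").read_text().strip()
        print(f"IRQ {irq} ({line.split()[-1]}) may run on CPUs {affinity}")
```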
In the end, if you have a CPU like Threadripper, you don't have to worry about getting the full bandwidth out of 4 NVMe drives. TR has two NUMA nodes, and there are enough CPU cores on each to handle both the interrupts of four NVMe drives and all the filesystem work. Obviously the performance is best if the process handling the data stream also resides on the same NUMA node.
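If you want to make sure of that last point, you can pin the consuming process to the cores of that node yourself. A small sketch of the idea (assuming Linux; nvme0 is again just an example, and numactl --cpunodebind would achieve the same from the shell):

```python
#!/usr/bin/env python3
# Small sketch: pin the current process to the CPUs of the NUMA node an NVMe drive sits on.
# Assumes Linux; "nvme0" is an illustrative device name.
import os
from pathlib import Path

numa_node = Path("/sys/class/nvme/nvme0/device/numa_node").read_text().strip()
cpulist = Path(f"/sys/devices/system/node/node{numa_node}/cpulist").read_text().strip()

# Expand a cpulist like "0-7,16-23" into a set of CPU ids.
cpus = set()
for part in cpulist.split(","):
    lo, _, hi = part.partition("-")
    cpus.update(range(int(lo), int(hi or lo) + 1))

os.sched_setaffinity(0, cpus)  # 0 = the calling process
print(f"pinned to NUMA node {numa_node}, CPUs {sorted(cpus)}")
```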