Linux RAID Benchmarks With EXT4 + XFS Across Four Samsung NVMe SSDs


  • #11
    Originally posted by milkylainen View Post
    Are there really no drawbacks to running 4x x4 over an x16 link? I know the data/cmd runs separately on each PCIe SerDes phy... But on the host side? Interrupt handling, iomapping/translation, DMA entrypoints? All at one host point? Or does each PCIe channel have a full set of handling features, regardless of whether it is x1, x4, x8 or x16?

    Implementation dependent?
    Short answer: there is no drawback in this setup, and yes, it is implementation-dependent. Some more details:

    Modern processor designs do away with the traditional northbridge chipset; its functionality is instead fully integrated into the CPU, and all PCI-e lanes are directly connected to and managed by the processor. If the processor has enough PCI-e lanes, you get the full bandwidth out of this setup. This is the case for Threadripper, but not necessarily for Ryzen or Intel desktop processors. Those often have only enough PCI-e lanes to support a single x16 slot, so boards with two x16 slots provide either one x16 link or two x8 links on these slots (e.g. for two-GPU SLI configurations). If you don't need a dedicated GPU, you can still use the first x16 slot of such a system for four NVMe drives.

    In general, in a desktop system you need to understand how the lanes are wired and which components share lanes to know how many devices you can add at full performance. In contrast, an Epyc CPU provides 128 PCI-e lanes, so a big chunk of the server boards available right now don't even expose all of them physically.

    Also, the CPU needs to support PCI-e bifurcation, which all modern high-end processors do (including all Zen parts), and the functionality needs to be exposed in the BIOS. Modern CPUs appear to operate on four-lane bundles, so x4 is the smallest unit managed by the CPU directly (x1 slots or x2 NVMe slots get routed through the south bridge); I am not 100% sure on that, though. This allows an x16 link to be split into x8x8, x8x4x4 or x4x4x4x4.
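    To verify that bifurcation actually worked, you can check the negotiated link width of each drive from the OS. A sketch using lspci (the grep pattern for selecting the controllers is an assumption about how they are listed; requires pciutils and root for the capability dump):

    ```shell
    # Sketch: print the link capability and negotiated link status (width and
    # speed) for every NVMe controller in the system.
    for dev in $(lspci -D | grep -i 'non-volatile memory controller' | cut -d' ' -f1); do
        echo "== $dev =="
        sudo lspci -s "$dev" -vv | grep -E 'LnkCap:|LnkSta:'
    done
    ```

    On a properly bifurcated x4x4x4x4 slot, each drive should report `LnkSta` with `Width x4`.
    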

    When it comes to interrupt handling, in theory you don't run into problems, as interrupts can be distributed across CPU cores. You could pin the interrupts of each NVMe drive to a separate core to allow parallel processing of interrupts.

    There is one big catch, though. Modern processors with many cores distribute them over several NUMA nodes. Each PCI-e lane is attached to one of those nodes, and if you communicate through a lane from a CPU core that resides on a different NUMA node, the data needs to be routed between the two nodes. This adds latency, and there is a bandwidth limit involved as well. The operating system is therefore wise to pin NVMe interrupts to CPU cores on the NUMA node the respective lanes are attached to. In the case of 4x x4 lanes on an x16 slot, all the lanes are obviously attached to the same NUMA node.

    Note that memory channels are also attached to NUMA nodes; for example, Epyc's eight memory channels result from a two-channel memory controller on each of its four nodes. So DMA transfers incur higher latency when the memory region happens to be handled by another node. Note, however, that the mere x16 bandwidth alone would not hit a bottleneck there; it only might if there is much more going on in other threads.
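    A minimal sketch of such pinning, assuming drives nvme0 through nvme3 and the standard sysfs/procfs paths (a running irqbalance daemon may override these settings again):

    ```shell
    # Sketch: pin each NVMe drive's queue interrupts to the CPUs of the NUMA
    # node its PCIe lanes attach to. Requires root; drive names are assumptions.
    for n in 0 1 2 3; do
        node=$(cat /sys/class/nvme/nvme${n}/device/numa_node)
        cpus=$(cat /sys/devices/system/node/node${node}/cpulist)
        # NVMe queue IRQs show up as nvme0q0, nvme0q1, ... in /proc/interrupts
        for irq in $(awk -v d="nvme${n}q" '$0 ~ d {sub(":", "", $1); print $1}' /proc/interrupts); do
            echo "$cpus" | sudo tee /proc/irq/${irq}/smp_affinity_list >/dev/null
        done
    done
    ```
    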

    In the end, if you have a CPU like Threadripper, you don't have to worry about getting the full bandwidth out of four NVMe drives. TR has two NUMA nodes, and there are enough CPU cores on each to handle both the interrupts of four NVMe drives and all the filesystem functionality. Obviously, performance is best if the process handling the data stream also resides on the same NUMA node.
    Last edited by ypnos; 26 August 2018, 07:40 AM.



    • #12
      Splitting the load across both of the 'connected' CCXs gives a massive increase in some sequential tests, but really hurts performance in more general/random tests.


      Type: Sequential Read - IO Engine: Linux AIO - Buffered: No - Direct: Yes - Block Size: 2MB

      EXT4: 4-Disk RAID0 ............................ 2236
      EXT4: 4-Disk RAID0 2 per CCX .................. 7385

      XFS: 4-Disk RAID0 ............................. 2235
      XFS: 4-Disk RAID0 2 per CCX ................... 7404


      https://openbenchmarking.org/result/...RA-1808249RA82
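      For anyone reproducing this, an array like the above can be built with mdadm. The device names, chunk size and filesystem below are assumptions for illustration, not the exact benchmark setup:

      ```shell
      # Sketch: stripe four NVMe drives into one RAID0 md array and format it.
      sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=512K \
          /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
      sudo mkfs.ext4 /dev/md0    # or: sudo mkfs.xfs /dev/md0
      sudo mount /dev/md0 /mnt/raid
      ```
      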



      • #13
        Originally posted by torsionbar28 View Post
        The ATA interfaces like SATA and PATA are all half-duplex. But I thought NVMe was full-duplex, just like SAS? Is this not true?
        I'm not sure about the interfaces/standards; I was actually talking about the disks themselves.
        In my experience, when you try to do heavy read/write ops at the same time, the performance usually dips really low, almost to zero.



        • #14
          It seems there is a bug in the test suite that prevents fio results above 10 GB/s from being reported.
          Manually running the test module works fine though.

          What's the best way to get a bug report to Michael?

          It's a pity the tests that would show the advantages of a split-CCX connection won't work within the test suite. For moving large files around, this configuration is fast.

          EXT4: 4-Disk RAID0 2MB Random Read = 14.5GB/s
          EXT4: 4-Disk RAID0 2MB Random Write = 10.7GB/s
          EXT4: 4-Disk RAID0 2MB Sequential Write = 10.7GB/s
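          The pts/fio parameters below correspond roughly to an invocation like this (a sketch: runtime, size and target path are assumptions, and the actual job file shipped with the test profile may differ):

          ```shell
          # Sketch: 2MB random reads via libaio with O_DIRECT and queue depth 64,
          # time-based for 20 seconds, matching the first result block below.
          fio --name=test --rw=randread --bs=2M --ioengine=libaio --iodepth=64 \
              --direct=1 --time_based --runtime=20 --size=32G \
              --filename=/mnt/raid/fio-testfile
          ```
          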

          Test data:
          Code:
          pts/fio-1.11.2 [Type: Random Read - IO Engine: Linux AIO - Buffered: No - Direct: Yes - Block Size: 2MB - Disk Target: Default Test Directory]
          
          # ./fio test.fio
          test: (g=0): rw=randread, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=64
          fio-3.1
          Starting 1 process
          Jobs: 1 (f=0): [f(1)][100.0%][r=10.3GiB/s,w=0KiB/s][r=5256,w=0 IOPS][eta 00m:00s]
          test: (groupid=0, jobs=1): err= 0: pid=9330: Thu Aug 30 02:05:56 2018
             read: IOPS=6918, BW=13.5GiB/s (14.5GB/s)(271GiB/20010msec)
               lat percentiles (nsec):
               |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
               | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
               | 70.00th=[    0], 80.00th=[    0], 90.00th=[    0], 95.00th=[    0],
               | 99.00th=[    0], 99.50th=[    0], 99.90th=[    0], 99.95th=[    0],
               | 99.99th=[    0]
             bw (  MiB/s): min=13816, max=13852, per=100.00%, avg=13844.78, stdev= 9.24, samples=40
             iops        : min= 6908, max= 6926, avg=6922.35, stdev= 4.60, samples=40
            cpu          : usr=0.97%, sys=48.46%, ctx=128912, majf=0, minf=39649
            IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=124.9%
               submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
               complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
               issued rwt: total=138434,0,0, short=0,0,0, dropped=0,0,0
               latency   : target=0, window=0, percentile=100.00%, depth=64
          
          Run status group 0 (all jobs):
             READ: bw=13.5GiB/s (14.5GB/s), 13.5GiB/s-13.5GiB/s (14.5GB/s-14.5GB/s), io=271GiB (290GB), run=20010-20010msec
          
          ----------------------------------------------------------------------------------
          
          pts/fio-1.11.2 [Type: Random Write - IO Engine: Linux AIO - Buffered: No - Direct: Yes - Block Size: 2MB - Disk Target: Default Test Directory]
          
          # ./fio test.fio
          test: (g=0): rw=randwrite, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=64
          fio-3.1
          Starting 1 process
          Jobs: 1 (f=0): [f(1)][100.0%][r=0KiB/s,w=7828MiB/s][r=0,w=3914 IOPS][eta 00m:00s]
          test: (groupid=0, jobs=1): err= 0: pid=9786: Thu Aug 30 02:12:48 2018
            write: IOPS=5094, BW=9.96GiB/s (10.7GB/s)(199GiB/20019msec)
               lat percentiles (nsec):
               |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
               | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
               | 70.00th=[    0], 80.00th=[    0], 90.00th=[    0], 95.00th=[    0],
               | 99.00th=[    0], 99.50th=[    0], 99.90th=[    0], 99.95th=[    0],
               | 99.99th=[    0]
             bw (  MiB/s): min= 9928, max=10316, per=100.00%, avg=10197.79, stdev=75.76, samples=40
             iops        : min= 4964, max= 5158, avg=5098.88, stdev=37.90, samples=40
            cpu          : usr=25.57%, sys=31.57%, ctx=100310, majf=0, minf=6
            IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=124.6%
               submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
               complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
               issued rwt: total=0,101983,0, short=0,0,0, dropped=0,0,0
               latency   : target=0, window=0, percentile=100.00%, depth=64
          
          Run status group 0 (all jobs):
            WRITE: bw=9.96GiB/s (10.7GB/s), 9.96GiB/s-9.96GiB/s (10.7GB/s-10.7GB/s), io=199GiB (214GB), run=20019-20019msec
          
          ----------------------------------------------------------------------------------
          
          pts/fio-1.11.2 [Type: Sequential Write - IO Engine: Linux AIO - Buffered: No - Direct: Yes - Block Size: 2MB - Disk Target: Default Test Directory]
          
          # ./fio test.fio
          test: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=64
          fio-3.1
          Starting 1 process
          Jobs: 1 (f=0): [f(1)][100.0%][r=0KiB/s,w=7788MiB/s][r=0,w=3893 IOPS][eta 00m:00s]
          test: (groupid=0, jobs=1): err= 0: pid=10514: Thu Aug 30 02:18:07 2018
            write: IOPS=5095, BW=9.96GiB/s (10.7GB/s)(199GiB/20020msec)
               lat percentiles (nsec):
               |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
               | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
               | 70.00th=[    0], 80.00th=[    0], 90.00th=[    0], 95.00th=[    0],
               | 99.00th=[    0], 99.50th=[    0], 99.90th=[    0], 99.95th=[    0],
               | 99.99th=[    0]
             bw (  MiB/s): min=10004, max=10316, per=100.00%, avg=10202.23, stdev=63.00, samples=40
             iops        : min= 5002, max= 5158, avg=5101.07, stdev=31.50, samples=40
            cpu          : usr=25.41%, sys=32.45%, ctx=100507, majf=0, minf=17164
            IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=124.7%
               submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
               complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
               issued rwt: total=0,102013,0, short=0,0,0, dropped=0,0,0
               latency   : target=0, window=0, percentile=100.00%, depth=64
          
          Run status group 0 (all jobs):
            WRITE: bw=9.96GiB/s (10.7GB/s), 9.96GiB/s-9.96GiB/s (10.7GB/s-10.7GB/s), io=199GiB (214GB), run=20020-20020msec



          • #15
            Originally posted by nomadewolf View Post
            I'm not sure about the interfaces/standards; I was actually talking about the disks themselves.
            In my experience, when you try to do heavy read/write ops at the same time, the performance usually dips really low, almost to zero.
            You observed this behavior on NVMe SSD's? Or on SATA SSD's?



            • #16
              Originally posted by torsionbar28 View Post
              You observed this behavior on NVMe SSD's? Or on SATA SSD's?
              SATA.
              Never tested on NVMe...



              • #17
                Since many of us are using ZFS on Linux instead of the still poorly performing Btrfs and the much less flexible MD RAID, it would be very interesting to see how ZFS compares to the already tested filesystems on the same 4x NVMe SSD hardware. Hopefully Michael can find the time to run the same benchmarks on ZFS; I'm really interested in how it scales with these extremely fast drives.
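                For comparison, a roughly RAID0-equivalent ZFS setup would be a pool striping across four top-level vdevs. A sketch, assuming ZFS on Linux is installed; the pool name, device names and tuning values are placeholders:

                ```shell
                # Sketch: create a pool that stripes across all four drives (no
                # redundancy), with ashift=12 for 4K-sector flash and a large
                # recordsize for streaming I/O.
                sudo zpool create -o ashift=12 nvmepool \
                    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
                sudo zfs set recordsize=1M nvmepool
                sudo zfs set compression=off nvmepool
                ```
                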



                • #18
                  I'm using your test setup as the basis for some server NVME performance comparison. Thank you!

                   1. I'd like to see *all* of the mkfs* commands used in order to validate apples-to-apples with some of my NVMe testing on these 970 EVOs. Have I missed them being documented somewhere? (On the "mount" command you do say "default".)

                  2. I would like to fully validate that the "delayed initialization" for at least ext4 and xfs filesystems was allowed to fully complete before testing.

                   3. Did you really do a 4-disk RAID1? That is, a 4-way mirror? I, and the commenter in #10, would then like to see the mdadm "--layout=f2" option.
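                   Regarding point 2, ext4's delayed initialization can also be disabled outright at mkfs time so that no background zeroing runs during the benchmark at all; a sketch (the device name is a placeholder):

                   ```shell
                   # Sketch: disable ext4 lazy initialization so the inode tables and
                   # journal are fully written up front instead of in the background
                   # after the first mount.
                   sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/md0
                   ```
                   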
                  Last edited by pjwelsh; 22 December 2018, 12:38 PM.

