NVMe "Simple Copy" Offloaded Copy Support Being Prepared For The Linux Kernel

  • NVMe "Simple Copy" Offloaded Copy Support Being Prepared For The Linux Kernel

    Phoronix: NVMe "Simple Copy" Offloaded Copy Support Being Prepared For The Linux Kernel

    One of the NVMe specification additions that was ratified this year is the "simple copy" command that allows for copying multiple contiguous ranges to a single destination. That simple copy operation is offloaded to the SSD controller. The Linux kernel support for NVMe simple copy is now being prepared...
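As a rough illustration of the semantics described above (several source ranges gathered into one contiguous destination), here is a toy in-memory model. It is only a sketch: the `simple_copy` function, the `BLOCK_SIZE` constant, and the bytearray "disk" are hypothetical stand-ins, not the NVMe command format or any kernel interface.

```python
from typing import List, Tuple

BLOCK_SIZE = 512  # assumed logical block size in bytes, for illustration only


def simple_copy(disk: bytearray, sources: List[Tuple[int, int]], dest_lba: int) -> int:
    """Toy model: copy each (start_lba, num_blocks) source range, in order,
    into one contiguous destination starting at dest_lba.

    Returns the number of blocks copied.
    """
    total_blocks = len(disk) // BLOCK_SIZE
    chunks = []
    for start, count in sources:
        if start < 0 or count < 0 or start + count > total_blocks:
            raise ValueError("source range exceeds device size")
        # Snapshot the source data first so overlapping ranges behave predictably.
        chunks.append(bytes(disk[start * BLOCK_SIZE:(start + count) * BLOCK_SIZE]))

    payload = b"".join(chunks)
    offset = dest_lba * BLOCK_SIZE
    if dest_lba < 0 or offset + len(payload) > len(disk):
        raise ValueError("destination range exceeds device size")

    # The host never sees this data in the real command; here it is just a slice assignment.
    disk[offset:offset + len(payload)] = payload
    return len(payload) // BLOCK_SIZE
```

In this model, `simple_copy(disk, [(1, 1), (5, 1)], 10)` places block 1 followed by block 5 at blocks 10 and 11.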


  • #2
    Wow, this is huge for my application. Reordering append-only tracts without reading them is a huge win.



    • #3
      Originally posted by atomsymbol

      Physically, the blocks are still being read and then written, but the reads&writes do not involve the PCI Express bus, the main memory and the CPU.

      Without NVMe-simple-copy being faster than 7 GB/s (PCIe x4 4.0) it is mostly pointless of course and would be beneficial only in case applications running on the CPU are fully saturating the DDR4 memory bandwidth.
      What about power consumption? Not having to deal with it on the CPU side should also lead to longer battery life.



      • #4
        I had heard that duplicate copies of data are sometimes optimized away to a single copy on the same disk (internally by the controller, independent of the OS, which still thinks there is more than one copy).

        Is this meant to be similar to cut/paste on the same disk, where some inodes are updated instead of an actual transfer? Or does it actually write new copies internally?

        ---

        I remember, long ago, back before SSDs were mainstream and when I was on Windows, using a program called TeraCopy. If I was copying files from various locations to a new destination (same disk or another disk, I can't recall), performance would slow to a crawl without that software. Perhaps it was queue-depth related? It seemed that simultaneous transfers were being attempted instead of queued up (maybe what was effectively sequential I/O became random I/O without a transfer queue?).

        It made me a bit paranoid about running multiple copies like that, but perhaps it's a non-issue on SSDs, especially NVMe disks? If anyone is familiar with that experience, is it something Linux can still suffer from with HDD storage? And if so, is there an equivalent to TeraCopy? UltraCopier seems to be the equivalent, but it's not as transparent to adopt as TeraCopy was on Windows. E.g. for KDE/Plasma, anything using KIO needs support, and anything not using KIO needs its own copy method modified to delegate to UltraCopier (or an equivalent queue solution): https://bugs.kde.org/show_bug.cgi?id=161017

        The description here suggests it's mostly an HDD issue due to access latency from stacking concurrent transfers:


        Originally posted by atomsymbol
        Without NVMe-simple-copy being faster than 7 GB/s (PCIe x4 4.0) it is mostly pointless of course
        Embedded devices support NVMe, but sometimes it's PCIe 2.0 or, if you're lucky, 3.0. Sometimes the number of lanes is only x2 or x1, IIRC. There are plenty of PCIe 3.0 PCs too, some of which only use x2 lanes for M.2. External USB SSDs come to mind as well, since some of those are NVMe with a PCIe 3.0 x2 M.2-to-USB bridge chipset.

        Regarding the 7 GB/s sequential I/O peak on PCIe 4.0, that's hitting the limit of four PCIe 4.0 lanes; the drives could possibly exceed that internally if the link is already the bottleneck. I don't know enough about the low-level details of NAND and controllers for those disks to comment further, though.
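For a sense of scale across the link configurations mentioned above, here is a back-of-the-envelope calculation. It is a sketch that uses the published per-lane transfer rates and line encodings for PCIe 2.0 through 4.0 and ignores packet and protocol overhead, so real drives land somewhat below these figures. The function and table names are made up for this example.

```python
# (transfer rate in GT/s, line-encoding efficiency) per PCIe generation
PCIE_GENERATIONS = {
    "2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
    "4.0": (16.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Approximate one-direction link bandwidth in GB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer moves one bit per lane; divide by 8 to convert bits to bytes.
    return gt_per_s * efficiency / 8 * lanes


if __name__ == "__main__":
    for generation, lanes in [("2.0", 1), ("3.0", 2), ("3.0", 4), ("4.0", 4)]:
        print(f"PCIe {generation} x{lanes}: {link_bandwidth_gb_s(generation, lanes):.2f} GB/s")
```

By this estimate a PCIe 4.0 x4 link tops out just under 7.9 GB/s and a PCIe 3.0 x2 link just under 2 GB/s, so an on-device copy has far more room to beat the bus on the narrower links.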



        • #5
          Originally posted by karolherbst

          What about power consumption? Not having to deal with it on the CPU side should also lead to longer battery life.
          I suspect such power saving would be negligible, if at all noticeable.



          • #6
            Originally posted by atomsymbol

            Physically, the blocks are still being read and then written, but the reads&writes do not involve the PCI Express bus, the main memory and the CPU.

            Without NVMe-simple-copy being faster than 7 GB/s (PCIe x4 4.0) it is mostly pointless of course and would be beneficial only in case applications running on the CPU are fully saturating the DDR4 memory bandwidth.
            Are you sure it can't be done entirely through remapping, at least some of the time? It's not like your SSDs are exposing the flash directly to you to begin with, so I would expect this to affect the indexes/hash tries but not so much involve rewriting the data itself.

            On second thought I guess what I'm thinking of is a move and not a copy.



            • #7
              Originally posted by atomsymbol

              According to a random PDF about NVMe-simple-copy, it seems to me that the granularity of a simple copy command is 512 bytes which is most likely much smaller than the internal physical granularity of an SSD block. Quick Internet search for the term "ssd typical block size" yields "between 256 KB and 4 MB".
              FWIW, 512-byte sectors are a legacy matter. When you issue a copy command, even if it is 512-byte aligned rather than aligned to the SSD's page size, there is a chance that a complete page is somewhere in the range.



              • #8
                Originally posted by atomsymbol
                I don't understand the point of your post.
                Even if you had to process part of the copy with actual copying/writing, any page-aligned subset of that copy operation can be done entirely through the pagetables.
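To make the "page-aligned subset" idea concrete, here is a small sketch that splits a sector-addressed copy range into an unaligned head, a run of whole pages, and an unaligned tail. The 16 KiB page size is a made-up value, and the function name is hypothetical. It also only looks at source alignment: remapping the middle part would further require the destination to sit at the same offset within a page, and whether any controller actually works this way is not something this thread establishes.

```python
SECTOR_SIZE = 512      # logical sector size in bytes
PAGE_SIZE = 16 * 1024  # hypothetical internal page size, for illustration only


def split_copy_range(start_sector, num_sectors,
                     sector_size=SECTOR_SIZE, page_size=PAGE_SIZE):
    """Split a copy range into (head, middle, tail), each an (offset, length)
    pair in bytes. Only the middle part consists of whole, aligned pages."""
    start = start_sector * sector_size
    end = start + num_sectors * sector_size

    first_page = -(-start // page_size) * page_size  # round start up to a page boundary
    last_page = end // page_size * page_size         # round end down to a page boundary

    if first_page >= last_page:
        # No complete page inside the range: everything must be copied normally.
        return (start, end - start), (end, 0), (end, 0)

    head = (start, first_page - start)
    middle = (first_page, last_page - first_page)
    tail = (last_page, end - last_page)
    return head, middle, tail
```

For example, a 100-sector copy starting at sector 1 covers bytes 512 to 51712; with 16 KiB pages, two whole pages (32768 bytes) fall inside it, leaving 15872 bytes at the head and 2560 bytes at the tail to be copied the ordinary way.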



                • #9
                  Originally posted by atomsymbol
                  Yes. But if the pagetables aren't stored in some special kind of area on the SSD then any modification to the page table will result in writing a whole SSD block (SSD block size is from 256 KB to 4 MB). If NVMe-simple-copy is moving less than 256KB-4MB of data (such as: less than 4 MB per second) it does not matter from performance perspective whether it is implemented (A) just via page table modifications or (B) via additionally copying the data to an unused SSD block. Several gigabytes of data via NVMe-simple-copy would need to be moved in order to see a performance difference between (A) and (B).
                  That's a pretty bold claim. I don't think the breakeven would be that high if the pagetable mechanism were designed with this in mind.



                  • #10
                    Originally posted by atomsymbol

                    Just a note: There exists for example the copy_file_range() syscall in Linux, but in reality most file copy operations are using the read&write syscalls. This basically means that NVMe-simple-copy won't be beneficial to most kinds of real-world file copy operations. Running "strace /bin/cp file1 file2" yields:

                    Code:
                    openat() = 3
                    fstat(3) = 0
                    openat() = 4
                    fstat(4) = 0
                    fadvise64(3) = 0
                    read(3, 131072) = 8477 <<<<
                    write(4, 8477) = 8477 <<<<
                    read(3, 131072) = 0
                    fchmod(4) = 0
                    flistxattr(3) = 0
                    flistxattr(3) = 0
                    fchmod(4, 0400) = 0
                    fgetxattr(3) = -1 ENODATA
                    fstat(3) = 0
                    fsetxattr(4) = 0
                    close(4) = 0
                    close(3) = 0
                    which uses the read&write syscalls to copy the file.

                    In summary: The real-world impact of NVMe-simple-copy outside of a small number of special cases is (currently) very limited.
                    Well yes, I'm talking about my application which has its own on-disk format.
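For contrast with the read/write loop in the strace output, here is a minimal sketch of how an application could ask the kernel to do the copy via copy_file_range(), using Python's `os.copy_file_range` (Linux, Python 3.8+). The helper name and the fallback path are assumptions made for this example. Whether the kernel then offloads anything to the drive depends on the kernel version, filesystem, and device, which this sketch does not check.

```python
import os


def copy_with_offload(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path, preferring copy_file_range().

    Returns the name of the method that ended up doing the copy.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        try:
            while remaining > 0:
                # The kernel moves the bytes; they never enter this process's buffers.
                copied = os.copy_file_range(src.fileno(), dst.fileno(),
                                            min(chunk, remaining))
                if copied == 0:
                    break  # source ended earlier than fstat() reported
                remaining -= copied
            return "copy_file_range"
        except (AttributeError, OSError):
            # Not available or not supported here: restart with plain read/write.
            src.seek(0)
            dst.seek(0)
            dst.truncate()
            while True:
                buf = src.read(chunk)
                if not buf:
                    break
                dst.write(buf)
            return "read/write"
```

An application with its own on-disk format could use the same call with explicit `offset_src` and `offset_dst` arguments to rearrange extents within a single file instead of copying whole files.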
