phoronix, this topic has been gaining traction lately: there is a v17 patchset for "Implement copy offload support" that was posted just last month. This would be a great Linux kernel feature, and it deserves more attention. Once copy offload is added, there will be many optimization opportunities in filesystems for any CoW operation (e.g. BTRFS, bcachefs, dm-thin), in the block layer (e.g. dm-kcopyd, dm-integrity journal copy-out), and of course anywhere a journal replay is applied.
-Eric
NVMe "Simple Copy" Offloaded Copy Support Being Prepared For The Linux Kernel
-
Originally posted by atomsymbol
Just a note: there is, for example, the copy_file_range() syscall in Linux, but in practice most file copy operations still use plain read/write syscalls. This basically means that NVMe-simple-copy won't benefit most kinds of real-world file copy operations. Running "strace /bin/cp file1 file2" yields:
Code:
openat()            = 3
fstat(3)            = 0
openat()            = 4
fstat(4)            = 0
fadvise64(3)        = 0
read(3, 131072)     = 8477   <<<<
write(4, 8477)      = 8477   <<<<
read(3, 131072)     = 0
fchmod(4)           = 0
flistxattr(3)       = 0
flistxattr(3)       = 0
fchmod(4, 0400)     = 0
fgetxattr(3)        = -1 ENODATA
fstat(3)            = 0
fsetxattr(4)        = 0
close(4)            = 0
close(3)            = 0
In summary: The real-world impact of NVMe-simple-copy outside of a small number of special cases is (currently) very limited.
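For reference (my own illustration, not something from the patchset): a minimal sketch of a copy that goes through copy_file_range() instead, which gives the kernel a chance to reflink or offload the copy rather than bouncing every byte through userspace. Paths and error handling are simplified.
Code:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst using copy_file_range(2) -- illustrative sketch only. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s src dst\n", argv[0]);
        return 1;
    }

    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); return 1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* The kernel decides how: a reflink, an offloaded copy, or an
           in-kernel read/write loop -- userspace never touches the data. */
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n < 0) { perror("copy_file_range"); return 1; }
        if (n == 0) break;   /* unexpected EOF */
        remaining -= n;
    }
    return 0;
}
If I remember correctly, newer coreutils releases do try copy_file_range() before falling back to read/write, so how much of this reaches real-world copies depends a lot on the cp version in use.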
-
Originally posted by atomsymbol
Yes. But if the page tables aren't stored in some special kind of area on the SSD, then any modification to the page table will result in writing a whole SSD block (SSD block sizes range from 256 KB to 4 MB). If NVMe-simple-copy is moving less than 256 KB-4 MB of data (say, less than 4 MB per second), it does not matter from a performance perspective whether it is implemented (A) just via page table modifications or (B) by additionally copying the data to an unused SSD block. Several gigabytes of data would need to be moved via NVMe-simple-copy to see a performance difference between (A) and (B).
-
Originally posted by atomsymbol
I don't understand the point of your post.
-
Originally posted by atomsymbol
According to a random PDF about NVMe-simple-copy, it seems to me that the granularity of a simple copy command is 512 bytes, which is most likely much smaller than the internal physical granularity of an SSD block. A quick Internet search for the term "ssd typical block size" yields "between 256 KB and 4 MB".
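That matches my reading too: the copy command addresses logical blocks, not flash erase blocks. A rough C sketch of a copy source-range entry as I understand it (the exact field offsets are my assumption from skimming the NVMe base spec, so double-check before relying on them):
Code:
#include <stdint.h>

/* Sketch of an NVMe Copy command source-range entry as I read the spec --
   offsets are my understanding, not authoritative.  The key point: ranges
   are expressed as a starting LBA plus a count of logical blocks
   (512 B or 4 KiB units), not as flash erase blocks. */
struct nvme_copy_source_range {
    uint8_t  rsvd0[8];
    uint64_t slba;      /* starting logical block address of this range */
    uint16_t nlb;       /* number of logical blocks to copy, 0-based    */
    uint8_t  rsvd1[14]; /* end-to-end protection fields etc., elided    */
};                      /* 32 bytes per entry                           */
How the controller actually moves (or merely remaps) those blocks behind its mapping table is entirely up to the firmware; the 512-byte figure is just the addressing granularity the host sees.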
-
Originally posted by atomsymbol
Physically, the blocks are still being read and then written, but the reads and writes do not involve the PCI Express bus, the main memory or the CPU.
Without NVMe-simple-copy being faster than 7 GB/s (PCIe 4.0 x4) it is mostly pointless of course, and it would be beneficial only when applications running on the CPU are fully saturating the DDR4 memory bandwidth.
On second thought I guess what I'm thinking of is a move and not a copy.
-
I had heard that copies of data were sometimes optimized away to a single copy on the same disk? (internally by the controller, nothing to do with the OS, which otherwise thinks there is more than one copy)
Is this meant to be similar to cut/paste on the same disk, where some inodes are updated instead of an actual transfer? Or does it actually write new copies internally?
---
I remember, long ago before SSDs were mainstream and when I was still on Windows, using a program called TeraCopy: if I was copying files from various locations to a new destination (same disk or another disk, I can't recall), performance would slow to a crawl without that software. Perhaps it was queue-depth related? It just seemed that simultaneous transfers were being attempted instead of being queued up (maybe what was effectively sequential I/O became random I/O as a result, without a transfer queue?).
It made me a bit paranoid about running multiple copies like that, but perhaps it's a non-issue on SSDs, especially NVMe disks? If anyone is familiar with that experience, is it something Linux can still run into with HDD storage? And if so, is there an equivalent to TeraCopy? UltraCopier seems to be the equivalent, but it's not as transparent to adopt as TeraCopy was on Windows, e.g. for KDE/Plasma, anything using KIO needs support, and anything not using KIO would need its copy handling modified to delegate to UltraCopier (or an equivalent queue solution): https://bugs.kde.org/show_bug.cgi?id=161017
The description here suggests it's mostly an HDD issue due to access latency from stacking concurrent transfers:
Originally posted by atomsymbol
Without NVMe-simple-copy being faster than 7 GB/s (PCIe 4.0 x4) it is mostly pointless of course
Regarding the 7 GB/s sequential I/O peak on PCIe 4.0: that's hitting the limit of an x4 PCIe 4.0 link, so drives could possibly exceed it if they're already bottlenecked there. I don't know enough about the low-level details of NAND and the controllers in those disks to comment further.
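Back-of-the-envelope (my numbers, so treat them as an estimate): PCIe 4.0 runs at 16 GT/s per lane with 128b/130b encoding, which is roughly 1.97 GB/s of raw payload per lane, or about 7.9 GB/s for an x4 link; after TLP/DLLP protocol overhead the practical ceiling lands right around the ~7 GB/s these drives advertise.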
-
Originally posted by atomsymbol
Physically, the blocks are still being read and then written, but the reads and writes do not involve the PCI Express bus, the main memory or the CPU.
Without NVMe-simple-copy being faster than 7 GB/s (PCIe 4.0 x4) it is mostly pointless of course, and it would be beneficial only when applications running on the CPU are fully saturating the DDR4 memory bandwidth.
-
Wow, this is huge for my application. Reordering append-only tracts without reading them is a huge win.