Originally posted by Michael
If you ask, I am sure that people with actual expertise would be happy to share. Brendan described how easy it is to produce misleading numbers fairly well:
Originally posted by Brendan Gregg
You could run the fs micro-benchmarks Brendan Gregg published:
https://gist.github.com/brendangregg...9698c70d9e7496
The rationale of each is well defined and the results can be meaningful in that context, but they do not provide a complete picture as soon as you step outside of it.
To give an example, here is another "benchmark", although it is really just a sanity test:
http://kevinclosson.net/2012/03/06/y...s-versus-ext4/
That does not work very well on ext4 because all writes through `ext4_file_write_iter()` serialize on the inode lock, and that includes AIO and Direct I/O. The test will not work on ZFS at all because we do not support O_DIRECT (due to the lack of standardization and the XFS semantics being incompatible with CoW in general), but you can modify the script to use oflag=sync rather than oflag=direct, which does work and reveals a real scaling issue.
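If you want to reproduce that, here is a minimal sketch of the modified test. The file path, block size and writer count are placeholders rather than Kevin's exact parameters, so adjust them to taste:

Code:
#!/bin/sh
# Launch several dd writers against different regions of one file with
# oflag=sync, so they only scale if the filesystem allows concurrent
# synchronous writes to a single inode. All parameters are examples.
FILE=/tank/test/datafile
WRITERS=8
BLOCKS=10000

# Preallocate the file so the writers are not extending it.
dd if=/dev/zero of="$FILE" bs=8k count=$((WRITERS * BLOCKS)) conv=fsync

i=0
while [ "$i" -lt "$WRITERS" ]; do
    dd if=/dev/zero of="$FILE" bs=8k count=$BLOCKS \
        seek=$((i * BLOCKS)) conv=notrunc oflag=sync &
    i=$((i + 1))
done
wait

Timing the whole thing at 1, 2, 4 and 8 writers is enough to see whether synchronous writes to one file scale on a given filesystem.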
That scaling issue is that, although ZFS employs fine grained locking to ensure that writes to different regions of a file can be done concurrently, synchronous operations (e.g. fsync, O_SYNC/O_DSYNC/O_FSYNC, msync, etcetera) serialize on the ZIL commit of the per-dataset ZIL. There is some batching that allows multiple committers to share a ZIL commit, but once a log commit has started, every committer that missed that batch must wait for it to finish before being aggregated into the next write out. Consequently, ZFS will scale better than ext4 does with multiple synchronous writers, but neither can presently touch XFS on synchronous I/O when it uses O_DIRECT (which implies O_DSYNC on XFS). XFS does phenomenally here by avoiding serialization at both the inode and the log level.
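If you want to see the serialization rather than take my word for it, you can watch the ZIL counters while the test runs. This is a sketch for ZFS on Linux; the kstat path and field names (e.g. zil_commit_count) vary somewhat between versions, tank is a placeholder pool name, and without a dedicated log device the second command just shows where the log writes land:

Code:
# Sample the ZIL kstats before and after a run; the difference shows
# how many log commits the synchronous writers were funneled through.
cat /proc/spl/kstat/zfs/zil

# Watch writes hitting the pool (and the log vdev, if there is one)
# while the test is running.
zpool iostat -v tank 1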
As for the actual relevance of concurrent synchronous writes, they are important for keeping latencies down in workloads that rely on synchronous writes, such as databases (atomic commits and logging) and virtual machines (flushes), on low latency solid state media. This did not matter on rotational media, but it becomes a scaling bottleneck once things go solid state. The bottleneck can be fixed by changing the code to pick the location of the next log commit at the start of the in-progress one, allow later committers to proceed once that location is picked, and block them only on the completion of their in-flight predecessors. That is easier said than done, but it is doable, and it would be a lower latency version of what we have today without a disk format change. Lower latencies still are possible with a disk format change, but that would logically come after pipelining the intent log.
It is important to note here that the hardware needs to actually support queue depths greater than one and have sufficient headroom for concurrency to matter. The former is not the case on certain early SATA disks that are internally PATA and use PATA-to-SATA bridge chips, on various PATA hardware (unless it supported ATA's crippled TCQ), and most likely on other hardware of which I am unaware. The latter is rarely a problem on most hardware, although it is not impossible to run out of headroom; running the device through something slow like USB 2.0 would be an obvious way to do it.
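Checking the first condition is easy on Linux. The paths below assume a SATA/SCSI disk named sda; NVMe devices report their queue limits differently, so treat this as a sketch:

Code:
# Maximum number of outstanding commands the kernel will issue to the
# device (1 means concurrent writers gain nothing at the device).
cat /sys/block/sda/device/queue_depth

# For SATA disks, hdparm reports the queue depth and whether NCQ is
# supported at all.
hdparm -I /dev/sda | grep -i queue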
More generally, you should try to abide by best practices when configuring things. In the context of the slightly modified variation of Kevin Closson's test, ZFS has a default 128KB recordsize, so even if it were performing well on synchronous I/O, we would see a performance penalty unless the recordsize is configured to match the test's write size. Similarly, a database administrator is not going to run a database doing 8KB I/O on a dataset with a 128KB recordsize, or put a pool into production that suffers read-modify-write on disk sectors (e.g. a pool created assuming 512-byte sectors on drives with 4096-byte sectors). Incidentally, that is one reason why separate datasets are recommended for databases' data files and logs, with recordsize optimization being another. Such advice is well documented.
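For illustration, the ZFS side of that advice looks something like the following. The pool and dataset names are made up, and the recordsize values are examples for a database doing 8KB I/O rather than universal recommendations:

Code:
# Create the pool with ashift matching the physical sector size
# (ashift=12 means 4096-byte sectors), so there is no sector-level
# read-modify-write.
zpool create -o ashift=12 tank mirror sda sdb

# Separate datasets for data files and logs, with recordsize matched
# to how the database writes each of them.
zfs create tank/db
zfs create -o recordsize=8K tank/db/data
zfs create -o recordsize=128K tank/db/log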
The remark about block device sector sizes also applies to other filesystems such as XFS, which used 512-byte sectors by default the last time I checked, although not to ext4 unless the device sector size exceeds its 4096-byte default block size. When automating things, it is important to check these details. If you want numbers from worst case configuration scenarios, which do have value when taken as such, you could do separate runs for properly configured and unconfigured storage stacks.
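Checking the relevant sizes is easy to script; the device, mount point and pool names below are placeholders:

Code:
# Logical and physical sector sizes reported by the block device.
blockdev --getss --getpbsz /dev/sda

# Sector size an existing XFS filesystem was created with.
xfs_info /mnt/xfs | grep sectsz

# Sector size ZFS assumed for each vdev in a pool
# (ashift=9 is 512 bytes, ashift=12 is 4096 bytes).
zdb -C tank | grep ashift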
I can say more, but I am mostly saying this to demonstrate that the advice can be made available, should the people interested in publishing benchmarks want numbers that are actually meaningful.