@baryluk I think Michael did it right. Anomalies will always happen in computing, depending on caching and so on. I don't think it's Michael's duty to dig into the source code (or wherever else) to find which performance regressions exist and where they come from.
For example, Ext4 is known to be slower with its *defaults* (by a visible few percent), but in the long run, since it has live defragmentation and other features, the anti-fragmentation work pays off: after years, a one-year-old Ext4 machine may be faster than an Ext3 one.
So benchmarks in general are limited, and I think Michael does a wonderful job promoting Linux and the BSDs. A good caching implementation that happens in one FS and isn't reproduced in all the other implementations may just be a signal to report as a bug to the other FS implementers, not a reason to shoot the messenger.
@ciplogic -- what you seem to be failing to understand is that the ZFS random write results aren't actually possible; something else is going on. Without an explanation of what that something is, or WHY the numbers are possible, all of the results published in this article are rubbish. 100% meaningless. So sure, examining why errant results occur might not be his job, but if that's the case, and he lets the article exist as-is, he will be disseminating gross misinformation.
The credibility of Phoronix is pretty poor already; I suspect they will simply let this be another nail in the coffin.
Running any benchmark that excludes the applications you actually run is meaningless.
Originally Posted by thesjg
As far as I can tell, it appears to be just better caching behavior. Since this benchmark is unlikely to make the OS flush its cache, things may end up looking "too fast". The issue is: if your application uses the same access pattern, will it fly as fast?
My point was that anomalies always appear in benchmarking. Also, since disk is two orders of magnitude slower than memory (and random disk access slower still), I don't think this is a fault of the Phoronix suite. All Michael can do is run the tests and check that there are no statistical problems (which is a feature of PTS).
Accuracy issues aside, I had never heard of the HAMMER filesystem. If this is a new effort with only a few developers, then congratulations are in order -- this filesystem is a significant improvement over what you already have on BSD, and in some cases it can even remain competitive with the big three Linux filesystems. It's always nice to have another option in the open source world. If I find myself using BSD for some reason, I might check this out.
Right, but the most important thing overall is to know WHAT you are benchmarking. Unless you know that, any benchmark used to test a specific thing is null and void and pseudoscientific. Want to test caching? Fine, but then do so properly, with different amounts of cache available to the VFS cache subsystem. Want to test gzip times? All good and fine, but do so properly, with at least two different CPUs and GCC set at different optimization levels, since those affect gzip performance. Simply put: know WHAT you want to benchmark, then isolate THAT while holding other things equal, and you have something that at least passes the low-water mark for validity.
Originally Posted by ciplogic
And if you don't know how your application exercises the system, then rolling a die is just as valid.
Originally Posted by ciplogic
Err, that is where and why the bad reputation comes from. Anyone can run a benchmark (as is apparent at this site), but that does not a valid benchmark make! Like I said above, if you're out to benchmark the performance of a filesystem, then you need to make sure your testing and benchmarks actually test that, and not caching/disks/CPU etc. If you don't, then your testing is invalid in that context.
Originally Posted by ciplogic
Well, there are numerous problems with the benchmark. Take blogbench for example. Blogbench has simultaneous read and write threads, where the write activity creates an ever-increasing data set (starting at 0) and the read activity reads from that same data set. Thus if write performance is poor, the data set simply does not grow large enough to blow out the system's filesystem buffer cache, and read performance will appear to be very high. If write performance is high, then the data set will grow beyond what memory can cache, and read performance will wind up being very poor. So treating the numbers as separate entities, without being cognizant of whether the test blew out the buffer cache or not, basically makes the results garbage.
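To make the coupling concrete, here is a toy model of the effect (all the cache sizes and speeds are made-up illustrative numbers, not measurements of any real system): the writer grows the data set while the reader reads from it, so a slow writer keeps the working set inside the buffer cache and posts *better* read numbers.

```python
CACHE_MB = 1024          # assumed filesystem buffer cache size
RAM_READ_MBPS = 4000     # assumed cached (in-memory) read speed
DISK_READ_MBPS = 80      # assumed uncached (disk) read speed

def apparent_read_speed(write_mbps, seconds):
    """Model of blogbench's read score as a function of write speed."""
    dataset_mb = write_mbps * seconds               # data set grows with write speed
    cached_fraction = min(1.0, CACHE_MB / max(dataset_mb, 1))
    # Reads hit RAM for the cached fraction and disk for the rest.
    return cached_fraction * RAM_READ_MBPS + (1 - cached_fraction) * DISK_READ_MBPS

slow_writer = apparent_read_speed(write_mbps=5, seconds=60)    # 300 MB data set
fast_writer = apparent_read_speed(write_mbps=100, seconds=60)  # 6000 MB data set
print(slow_writer > fast_writer)  # the slower writer "wins" the read benchmark
```

Under this model the filesystem with the worst write throughput reports the best read throughput, which is exactly the inversion described above.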
Another very serious problem is when these benchmarks are run on filesystems with filesystem compression or de-dup. The problem is that most of these tests don't actually write anything meaningful to the file. They will write all zeros or some simple pattern that is trivially compressed and, poof, you are suddenly not testing the filesystem or disk performance at all.
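The compression point is easy to demonstrate. This sketch uses zlib as a stand-in for whatever compressor the filesystem actually uses; the exact sizes will differ, but the ratio is the point:

```python
import os
import zlib

zeros = b"\x00" * (1 << 20)        # 1 MiB of zeros, like many benchmark "writes"
random_data = os.urandom(1 << 20)  # 1 MiB of incompressible data

# With compression enabled, the zero-filled write barely touches the disk,
# so the benchmark measures the compressor, not the filesystem or the media.
print(len(zlib.compress(zeros)))        # on the order of 1 KB
print(len(zlib.compress(random_data)))  # slightly over 1 MiB
```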
A third problem, relating in particular to transaction tests, is how often the benchmark program calls fsync() and what its expectations are versus what the filesystem actually does.
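As a rough illustration of how much that choice matters, here is a sketch (the `write_records` helper and its record size are made up for illustration) that writes the same workload with two different fsync() cadences:

```python
import os
import tempfile
import time

def write_records(n, fsync_every):
    """Write n 512-byte records, calling fsync() every `fsync_every` records."""
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    for i in range(n):
        os.write(fd, b"x" * 512)
        if (i + 1) % fsync_every == 0:
            # What this actually flushes (and how far) depends on the filesystem.
            os.fsync(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

# Same "transaction" workload, two fsync() policies; on most systems the
# per-record variant is dramatically slower, yet both write the same data.
print(write_records(200, 1), write_records(200, 100))
```

A benchmark that calls fsync() per transaction is measuring something very different from one that batches, and different filesystems honor fsync() with very different strictness.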
A fourth is, well, you do realize that HAMMER maintains a fine-grained history (30-60 second granularity), and you can access a snapshot of the filesystem at any point in that history. The whole point of using the filesystem, apart from the instant crash recovery, is to have access to historical data, so it's kinda like comparing apples to oranges if you don't normalize the feature set.
A fifth is the compiler, which is obvious in the gzip tests (which are CPU-bound, NOT filesystem-bound in any way).
These problems are obvious just by looking at the crazy results that were posted, and the author should have realized this and tracked down the WHY. Benchmarks only work when you understand what they actually do.
There are numerous other issues... whether the system was set to AHCI mode or not (DragonFly's AHCI driver is far better than its ATA driver). Whether the OS was tuned for benchmarking or for real-world activities with regard to how much memory the OS is willing to dedicate to filesystem caches. How often the OS feels it should sync the filesystem. Filesystem characteristics such as de-dup, compression, and history. fsync handling. Safety considerations (how much backlog the filesystem or OS caches before it starts trying to flush to the media... more is not necessarily better in a production environment). Characteristics of real load situations which require system memory for things other than caching filesystem data. And I could go on.
In short, these benchmarks are fairly worthless.
Now HAMMER does have issues, but DragonFly also has solutions for those issues. In a real system where performance matters you are going to have secondary storage, such as a small SSD, and in DragonFly, setting up an SSD with swapcache to cache filesystem meta-data alongside the slower 'normal' 1-3TB HD(s) is kinda what HAMMER is tuned for. Filesystem performance testing on a laptop is a bit of an oxymoron, since 99.999% of what you do normally will be cached in memory anyway and the filesystem will be irrelevant. But on the whole our users like HAMMER because it operates optimally for most workloads, and being able to access live snapshots of everything going back in time however long you want (based on storage use versus how much storage you have) is actually rather important. Near-real-time mirroring streams to onsite and/or offsite backups, not to mention being able to run multiple mirroring streams in parallel with very low overhead, is also highly desirable. It takes a little tuning (e.g. there is no reason to keep long histories for /usr/obj or /tmp), but it's easy.
The only reproducible way to perform a good benchmark is to trace all filesystem events (excluding the actual data content, if compression/deduplication is not used) on a real-world system (for example, monitor the /home directory of a real desktop, or / of a real medium-sized mail server) for a long time, say one month. Then replay them quickly as a benchmark. This will include a very big mix of possible workloads: random reads and writes, large-file writes and reads, parallel combinations of them, metadata operations, filesystem traversals, deletions in parallel with other operations, fragmentation of files and free space, data locality, waste of space, complex caching behaviour, etc.
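A minimal sketch of the record-and-replay idea; the trace format and the `replay` helper here are hypothetical, not any existing tool, and a real tracer would capture events via strace/dtrace/kprobes rather than hard-coding them:

```python
import json
import os

# Hypothetical trace format: one record per filesystem event, data content
# anonymized away (only offsets and lengths survive, per the scheme above).
TRACE = [
    {"op": "create", "path": "a.txt"},
    {"op": "write",  "path": "a.txt", "offset": 0, "length": 4096},
    {"op": "fsync",  "path": "a.txt"},
    {"op": "unlink", "path": "a.txt"},
]

def replay(trace, root):
    """Replay recorded events as fast as possible (the 'accelerated ageing')."""
    for ev in trace:
        path = os.path.join(root, ev["path"])
        if ev["op"] == "create":
            open(path, "wb").close()
        elif ev["op"] == "write":
            with open(path, "r+b") as f:
                f.seek(ev["offset"])
                f.write(b"\x00" * ev["length"])  # content was anonymized away
        elif ev["op"] == "fsync":
            with open(path, "r+b") as f:
                os.fsync(f.fileno())
        elif ev["op"] == "unlink":
            os.unlink(path)
```

Because the trace is just data (e.g. one JSON object per line on disk), the same month of recorded activity can be replayed against any filesystem, which is what makes the comparison reproducible.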
Simple microbenchmarks are good for filesystem developers, because they can use them to infer what is going on in a particular part of the code (just like in science), but they are not any ultimate measure of quality and performance. They are only useful for improving code, not really for comparing multiple different filesystems.
Most microbenchmarks are also repeated multiple times on a completely clean filesystem, which excludes lots of factors from the equation (simple and full caching, other operations on the same or other filesystems, fragmentation). So benchmarks often recreate the whole filesystem and drop all caches between runs, or delete everything on the filesystem (which need not be the same thing!). This at least recreates somewhat similar conditions. But how do you recreate the complex conditions of a desktop that has been used for a few months? If one performs a benchmark in a subfolder of a filesystem and then deletes the files afterwards, it is highly probable that the end state will be far from the beginning condition, so one cannot actually perform the benchmark again. It is also hard for another person to reproduce on another box.
The most robust way to fix these problems is to use accelerated ageing of the filesystem, by replaying predetermined (recorded, aka traced) operations coming from a real workload. One can also prepare such a trace log with some information anonymized, like data contents and actual filenames; just make them of similar length and structure, so directory operations behave in a similar way as on the recorded system. Such logs will include all operations, including timestamps, pids, threads, locks, read, write, open, close, fsync, sync, unmount/mount/reboot, seek, tell, create, unlink, link, symlink, aio, O_DIRECT, O_SYNC, mmap, fadvise, fallocate, error conditions, etc.
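For the anonymization step, here is a sketch of one possible approach (the `anonymize_path` helper is hypothetical) that hides the actual names while preserving name lengths and directory depth, so directory-entry handling behaves similarly to the recorded system:

```python
import hashlib

def anonymize_path(path):
    """Replace each path component with a hash truncated to the same length,
    so structure and name lengths survive but the real names do not.
    (Components longer than 64 chars would be truncated -- a known limitation.)"""
    parts = path.strip("/").split("/")
    anon = [hashlib.sha256(p.encode()).hexdigest()[:len(p)] for p in parts]
    return "/" + "/".join(anon)

print(anonymize_path("/home/user/secret-report.odt"))
```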
There are numerous projects which provide such tools for Linux (and a few other systems) with very low performance overhead. They are very often used with network filesystems (like NFS), because tracing is just equivalent to running a sniffer between client and server, removing the actual file data content, and appending to a (compressed) log. On a local filesystem one will need to use some generic monitoring layer (like perf, kprobes, dtrace, or bio event monitoring) or modules designed for this. They can all hook between userspace requests of all kinds involving the filesystem and store the logs somewhere else for later inspection or replay. Other possible approaches include a stacked filesystem (in-kernel or FUSE) or a generic VFS API for this (which we do not currently have, AFAIK).
One can also replay such traces, stop them at predetermined points (or at the end), and compare multiple filesystems (or the same one multiple times), checking (by direct comparison, or by saved checksums if the file content data is also replayed) whether the content of the filesystem is the same, for regression testing, conformance testing and other purposes.
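A sketch of that comparison (the `tree_checksum` helper is hypothetical): after replaying the same trace onto two filesystems, hash each tree deterministically and compare the digests:

```python
import hashlib
import os

def tree_checksum(root):
    """Hash a file tree's structure and contents in a deterministic order,
    so two filesystems can be compared after replaying the same trace."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                    # fix traversal order across systems
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()
```

Equal digests mean the two filesystems ended the replay with identical visible state; a mismatch flags a conformance or regression problem worth investigating.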
The other kind of tracing is block-device tracing, which is mostly useful for filesystem developers, but can also be of great importance for users (especially when benchmarking part of a device, or when using multiple devices, like RAID or zfs/btrfs). Simple access graphs (time vs. sector number), or just the cumulative sum of read and write requests, and of course IO/s and MB/s over time, can provide really interesting measures.
For tracing, one should read this material:
Unfortunately most of them are somewhat old and need small adjustments to work on the newest kernels, but they are very useful for benchmarking real-world filesystem operations.
I hope Phoronix will start using these tools for more robust benchmarks. (One can actually do this easily, because a recorded trace log can be replayed repeatedly using userspace tools.)
The last three posts really summed things up. Wow.
well call me DIABLO everyone knows that other than the Fast File System on a RAD everything else is a roll of the DICE :wave: at dillon