Large HDD/SSD Linux 2.6.38 File-System Comparison

  • mtippett
    replied
    Originally posted by tytso View Post
    So my impression is that you have a framework which will iterate over a number of file systems, reformat the partition to file system $foo, mount the partition as $foo, and then run the suite of benchmarks. Do you not? Or are you doing this manually, by hand? If you do have such a framework, then the file system aging function needs to be inserted after the mkfs step. I'm not sure what is formally considered part of the PTS, and what is considered part of the test framework. Presumably whatever uploads the results to the open benchmarking web site is also part of the test framework, which I thought was part of PTS ---- so I had assumed the mkfs was part of the PTS subsystem.
    Those steps are not part of the framework itself. My involvement is primarily Phoronix Test Suite & OpenBenchmarking, not Phoronix.com. I believe that Michael does the system prep manually. We do have the concept of a "context" for a test, which covers preparing either the system or the configuration under test, but that isn't fully fleshed out.

    It's impractical for PTS to include detailed system preparation steps within the suite itself. The preparation is intensely focused on whatever the configuration or the variant part of the test run is. Just for filesystems, it could be mount options only, new vs. old (aged), an alternate FS, or different kernels and their impact on a given FS. Obviously this is meaningless for, say, a compiler comparison. So it comes down to a routine similar to...

    1. Prepare System Under Test
    2. Prepare Configuration Under Test (optional if the variant part is really the System Under Test)
    3. Invoke "phoronix-test-suite benchmark <test>"
    4. Upload to OpenBenchmarking for further discussion
    5. Go to 1 for as many variants as you want.
    6. Upload full comparison to OpenBenchmarking
    7. If Michael is running the test, then generate an article.

    We have talked about a way to take a collection of contexts and call out to a locally configured script to put the system in each context for running the tests. That would effectively automate steps 2-6. It won't be considered part of PTS, but it will lower the manual effort for people doing broader comparisons (or for use in software development). My mental picture for a context file that might be useful for benchmarking is something like

    Code:
      <context-name> <context-information>
    You would then have a script that can take you to the particular context. So for filesystems you might have a file such as

    Code:
       ext3-nobarrier  100GB-70%Cap-3%Frag-opts=nobarrier
       ext3-barrier  100GB-70%Cap-3%Frag-opts=barrier
       ext3-discard  100GB-70%Cap-3%Frag-opts=discard
    The person executing the comparison would need to write a script that is invoked as "set-context.sh 100GB-70%Cap-3%Frag-opts=nobarrier" which would then do the system preparation (100GB, 70% capacity, 3% fragmentation, mount opts=xxx).
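    Purely as an illustrative sketch of the routine above (the context-file format and set-context.sh are the hypothetical pieces just described, and this driver is not part of PTS itself), the glue could be as simple as:

    Code:
      #!/usr/bin/env python3
      # Hypothetical driver for the context idea above: read the two-column
      # context file, hand the context information to a user-written
      # set-context.sh, then run the benchmark. Not part of PTS itself.
      import subprocess
      import sys

      def run_contexts(context_file, test):
          with open(context_file) as f:
              for line in f:
                  line = line.strip()
                  if not line or line.startswith("#"):
                      continue
                  name, info = line.split(None, 1)
                  # Steps 1/2: prepare the System/Configuration Under Test.
                  subprocess.check_call(["./set-context.sh", info])
                  # Step 3: run the benchmark in that context.
                  subprocess.check_call(["phoronix-test-suite", "benchmark", test])
                  print("finished context:", name)

      if __name__ == "__main__":
          run_contexts(sys.argv[1], sys.argv[2])
    Uploading each result and the full comparison (steps 4-6) is left out here; that part would still be manual.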

    I assume that you can see that the same structure could easily be extended to do bisection across an ordered set of kernels or git commits.
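    For illustration only (the commit IDs are placeholders, and the file format is just the sketch above), a kernel-bisection context file might look like:

    Code:
       kernel-good      git-commit=<known-good-sha>
       kernel-midpoint  git-commit=<midpoint-sha>
       kernel-bad       git-commit=<suspect-sha>
    with set-context.sh checking out, building and booting the named commit before the benchmark runs.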

    I can't speak for others, but I haven't really taken too much advantage of the Phoronix test suite because the signal-to-noise ratio has been too low. The focus on competition between file systems, as opposed to watching for regressions, isn't really useful for developers.
    Phoronix Test Suite is effectively an independent project that grew out of the personal discussions that Michael and I would have regarding the results being presented on Phoronix.com.

    Phoronix Test Suite itself is merely a test execution environment. The results that it generates, and the feeding of that information into articles on Phoronix.com, are independent of it. I'm sure there is actually a lot of value that you could get out of the suite itself - from making available simplified, repeatable test cases to monitoring updates to your code as you make them.

    ...

    Now, I don't blame you for that --- in the end, your primary responsibility is to continued success of the commercial enterprise of this web site, which means if sensationalism drives web hits, then sensationalism it shall be.
    Again for the record, I am not involved in a direct way with Phoronix.com. My involvement is tangential, through Phoronix Test Suite and OpenBenchmarking. My day job is driving teams of engineers; it's just that I have a bent for seeing good engineering done, and Phoronix Test Suite is a way that I can help the industry.

    But the fact remains that developers are also extremely busy folks, and if they have to spend a huge amount of time figuring out what the results might mean, they're likely not going to bother. Developers also tend to prefer benchmarks which test specific parts of the file system, one at a time. This is why we tend to use benchmarks such as FFSB, with different profiles such as "large file create", "random writes", "random reads", "large sequential reads", etc. Another favorite benchmark is fs_mark, which tests efficiency of fsync() and journaling subsystems. I don't mind looking at the application centric benchmarks, but I'm not likely to try to set them up. But if you give me lock_stat and oprofile runs, I'll very happily look at them, and discuss what the results might mean, and then work to improve those workloads as part of my future development efforts.
    This part is really hard: each developer has their own sub-component or subsystem that they care about, and for each of those there is a set of metrics that directly affects it. But there are a hell of a lot of subsystems covering vastly different areas. So the middle ground is finding benchmarks and tests that serve as the canary in the coal mine and trigger the deeper digging; deeper digging into one particular sub-domain inevitably marginalizes the other domains.

    That said, neither Michael nor I would shy away from deep-diving when the canary indicates that something is wrong. It is a two-way street: integrating the tools and methodology for a domain of expertise needs leadership from outside the project, and that integration is where the biggest win is.

    The bottom line is that benchmarking for the sake of improving the file system requires close cooperation with the developers. I'm not sure whether that's compatible with Phoronix's mission. If so, I'd be happy to work with you more closely. And if it's not Phoronix's cup of tea, that's OK. There is room for multiple different approaches to benchmarking. All I ask is that they not be too misleading, but that's more for the sake of not leading naive users down the primrose path.....
    So long as the developers are engaged in looking at the problem, rather than blaming the tool, we've got no concerns working with any developer (be it under the OpenBenchmarking or the Phoronix banner).

    I don't do articles on Phoronix.com, but do blog postings on OpenBenchmarking.org, so there are ways of getting messages out through that too.

    From this thread, some areas where PTS can immediately add value are:

    1. Distributed end-user testing - you can get people to run a single command and get consistent results from a broad set of users.
    2. Regression Management - We have trackers at http://phoromatic.com/kernel-tracker.php; setting one up is _very_ easy. Currently that one watches the ubuntu-upstream-kernel builds, but it could easily do a "git pull; make" sort of cycle (a rough sketch of such a cycle follows this list). This is very interesting since you can have distributed systems that are used for testing.
    3. Reproducing scenarios - If an end user sees an issue with a particular behaviour, capturing a test case allows it to be more easily reproduced internally by developers.
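
    As a rough, hypothetical sketch of the "git pull; make" cycle mentioned in point 2 (the kernel tree path and the test profile name are placeholders):

    Code:
      #!/usr/bin/env python3
      # Hypothetical regression-tracking cycle: update a local kernel tree,
      # build/install it (site-specific, left as a placeholder), then run a
      # PTS benchmark so the tracker picks up the new result.
      import subprocess

      KERNEL_TREE = "/usr/src/linux"   # placeholder: local kernel git checkout

      def cycle(test="pts/disk"):      # placeholder test profile name
          subprocess.check_call(["git", "pull"], cwd=KERNEL_TREE)
          # make, make modules_install, install and reboot into the new kernel
          # would go here; that part is too site-specific to spell out.
          subprocess.check_call(["phoronix-test-suite", "benchmark", test])

      if __name__ == "__main__":
          cycle()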

    One concrete thing we'd like to see is suggestions for improvements to the test cases or benchmarks themselves. If there are suites of tests that characterize a filesystem's behaviour, integrating them isn't much of a problem.



  • tytso
    replied
    Originally posted by mtippett View Post
    My view at this stage is that the application of aging or similar approaches to position the system under test (SUT) in a particular form is something that is for the most part outside the purview of PTS and is more a matter of SUT preparation. (Obviously a comparison between an aged and unaged filesystem is a completely different issue, since the variant portion is actually the aged filesystem.)
    So my impression is that you have a framework which will iterate over a number of file systems, reformat the partition to file system $foo, mount the partition as $foo, and then run the suite of benchmarks. Do you not? Or are you doing this manually, by hand? If you do have such a framework, then the file system aging function needs to be inserted after the mkfs step. I'm not sure what is formally considered part of the PTS, and what is considered part of the test framework. Presumably whatever uploads the results to the open benchmarking web site is also part of the test framework, which I thought was part of PTS ---- so I had assumed the mkfs was part of the PTS subsystem.

    Thanks. I generally don't see the development community attempt to reproduce issues, but that is a different issue.
    I can't speak for others, but I haven't really taken too much advantage of the Phoronix test suite because the signal-to-noise ratio has been too low. The focus on competition between file systems, as opposed to watching for regressions, isn't really useful for developers. (*Especially* when you're not filtering out things like barrier vs. nobarrier issues.) And when there have been fluctuations, there's been no attempt to explain why they might be happening. I suspect that for some, there has been some distaste over the sensationalism over the fsync() changes, which were needed to protect data safety, and since there was clearly no understanding of what was happening, the write-ups sometimes assumed that one-time changes would translate into long-term trends.

    Now, I don't blame you for that --- in the end, your primary responsibility is to continued success of the commercial enterprise of this web site, which means if sensationalism drives web hits, then sensationalism it shall be.

    But the fact remains that developers are also extremely busy folks, and if they have to spend a huge amount of time figuring out what the results might mean, they're likely not going to bother. Developers also tend to prefer benchmarks which test specific parts of the file system, one at a time. This is why we tend to use benchmarks such as FFSB, with different profiles such as "large file create", "random writes", "random reads", "large sequential reads", etc. Another favorite benchmark is fs_mark, which tests efficiency of fsync() and journaling subsystems. I don't mind looking at the application centric benchmarks, but I'm not likely to try to set them up. But if you give me lock_stat and oprofile runs, I'll very happily look at them, and discuss what the results might mean, and then work to improve those workloads as part of my future development efforts.

    Eric Whitney at HP, who I've mentioned before, does all of these things, and he's been extremely helpful. He invests time to assist ext4 development, and in exchange, we help him out by figuring out why his 48-core system was crashing --- it turns out there was a bug in the block layer that his benchmarking efforts were tripping over, and as a result of the collaboration, it will be fixed before 2.6.38 ships. Just today, we spent half of the ext4 weekly concall talking about his recent results testing patches that will be going into the 2.6.39 merge window: http://free.linux.hp.com/~enw/ext4/2.6.38-rc5/

    The bottom line is that benchmarking for the sake of improving the file system requires close cooperation with the developers. I'm not sure whether that's compatible with Phoronix's mission. If so, I'd be happy to work with you more closely. And if it's not Phoronix's cup of tea, that's OK. There is room for multiple different approaches to benchmarking. All I ask is that they not be too misleading, but that's more for the sake of not leading naive users down the primrose path.....



  • mtippett
    replied
    Originally posted by tytso View Post
    This probably explains why it still uses the debugfs hack and not a more general FIEMAP interface. So hopefully we can get this up on github, and people can help get the tool in shape so it can be used by folks such as your benchmarking operation. Some assembly will still be necessary, but it shouldn't be that much work.
    My view at this stage is that the application of aging or similar approaches to position the system under test (SUT) in a particular form is something that is for the most part outside the purview of PTS and is more a matter of SUT preparation. (Obviously a comparison between an aged and unaged filesystem is a completely different issue, since the variant portion is actually the aged filesystem.)

    One thing that I definitely need to give you guys kudos for is that you do document your hardware configurations for the System and Configuration Under Test, and you do strive for strong reproducibility. That's all good stuff.
    Thanks. I generally don't see the development community attempt to reproduce issues, but that is a different issue.

    One of the things which they do which is incredibly helpful to file system developers is that they will do oprofile and (very important on larger CPU count machines) lockstat runs. Enabling oprofile and/or lockstat will of course skew the benchmark results, so they have to be done separately, and the performance results discarded, but the oprofile and lockstat information is very useful in showing what are the next things that can be optimized to further improve the file system.

    Another very useful analysis tool for understanding why the results are the way they are is to use blktrace. ...
    PTS has a MONITOR capability. It is generally used for monitoring load, temperature and so on. Hooking in other tools that add more value for developers should be fairly easy. Feel free to contact me (matthew at phoronix.com) to discuss further.
    Last edited by mtippett; 14 March 2011, 02:12 PM. Reason: Remove extraneous spaces.



  • tytso
    replied
    Originally posted by deanjo View Post
    The only distro that I know of that has barriers enabled by default on EXT3 is SuSE/openSUSE.
    I am fairly sure that Red Hat enabled barriers by default in RHEL 6, and although I'm not 100% sure, I believe in recent Fedora releases as well. My source on this is Ric Wheeler, formerly of EMC and now the file system manager at Red Hat.

    -- Ted



  • deanjo
    replied
    Originally posted by loonyphoenix View Post
    Ext3 has barrier=0 as default? Really? Seems strange.

    Isn't that a distro-specific thing, though, default mount options?
    The only distro that I know of that has barriers enabled by default on EXT3 is SuSE/openSUSE.



  • tytso
    replied
    Originally posted by energyman View Post
    Which means that ext4 with its 'sometimes a crash can mean original and destination are both 0' isn't good enough either
    That's due to buggy applications that don't use the POSIX interfaces correctly (like it or not, open(), close(), read(), write(), and fsync() are all POSIX interfaces and the need to use fsync() correctly goes back decades --- and the need to do this is true for all modern file systems). But that's another discussion/flame war....
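
    For what it's worth, the decades-old pattern being referred to looks roughly like the sketch below (not code from any particular application): write the new contents to a temporary file, fsync() it, then rename() it over the original, so a crash leaves either the old or the new file, never an empty one.

    Code:
      #!/usr/bin/env python3
      # Sketch of crash-safe file replacement using the POSIX interfaces
      # mentioned above; not taken from any particular application.
      import os

      def atomic_replace(path, data):
          tmp = path + ".tmp"
          with open(tmp, "wb") as f:
              f.write(data)
              f.flush()
              os.fsync(f.fileno())   # push the new contents to stable storage
          os.rename(tmp, path)       # atomically swap the new file into place
          # Optionally fsync the directory so the rename itself is durable.
          dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
          try:
              os.fsync(dirfd)
          finally:
              os.close(dirfd)

      if __name__ == "__main__":
          atomic_replace("settings.conf", b"key=value\n")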



  • energyman
    replied
    Originally posted by tytso View Post
    Ext3 is actually not my creation. Credit for implementing ext3 journaling, which was the key new feature for ext3, belongs to Stephen Tweedie. I ported and integrated Daniel Phillips' htree code and Andreas Gruenbacher's acl and xattr code into ext3, but I've never been the maintainer of the ext3 subsystem. That honor fell to Andrew Morton, and more recently, the maintainer is now Jan Kara of SuSE.

    I actually pushed strongly for changing ext3's defaults to enable barriers, but that was vetoed by the then-maintainer of ext3, which was Andrew Morton. As I said earlier, we really should try again now that Jan Kara is the ext3 maintainer, since SuSE ships with the defaults changed in their enterprise product.
    Then I am sorry for blaming you. Please excuse my behaviour.
    Hopefully that push comes soon. A filesystem optimized for benchmarks is not a filesystem I want to use. I want a filesystem that puts the data first.

    Which means that ext4 with its 'sometimes a crash can mean original and destination are both 0' isn't good enough either



  • tytso
    replied
    Originally posted by energyman View Post
    you mean reiser4? Which yells loudly if barriers are not supported and goes into sync mode?
    Why not use your own creation as an example of dumb defaults - ext3?
    Ext3 is actually not my creation. Credit for implementing ext3 journaling, which was the key new feature for ext3, belongs to Stephen Tweedie. I ported and integrated Daniel Phillips' htree code and Andreas Gruenbacher's acl and xattr code into ext3, but I've never been the maintainer of the ext3 subsystem. That honor fell to Andrew Morton, and more recently, the maintainer is now Jan Kara of SuSE.

    I actually pushed strongly for changing ext3's defaults to enable barriers, but that was vetoed by the then-maintainer of ext3, which was Andrew Morton. As I said earlier, we really should try again now that Jan Kara is the ext3 maintainer, since SuSE ships with the defaults changed in their enterprise product.



  • tytso
    replied
    Originally posted by mtippett View Post
    Has this been previously done? The impressions tool presentation talks only about making a filesystem look similar to an old one. It didn't actually attempt to benchmark it within that paper. I understand intellectually the value of it, but I would also assume that some filesystems would behave very differently between the two.
    The need to use aged file systems to catch both performance and functional problems is something which is well known to industry practitioners. For example, xfstests (which was a functional test suite for file systems developed by SGI, but which has since been extended so it can be used on many file systems in general, and has started to have ext4-specific tests contributed to it) has provisions so that one file system is constantly reformatted, and another file system is kept across test runs so it can be "aged" --- since some problems can only be reproduced on an aged file system.
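
    For illustration (device paths are placeholders, not from any real setup), an xfstests configuration along those lines would look roughly like:

    Code:
      # hypothetical xfstests local.config; device names are placeholders
      TEST_DEV=/dev/sdb1        # long-lived filesystem, kept (aged) across runs
      TEST_DIR=/mnt/test
      SCRATCH_DEV=/dev/sdb2     # scratch filesystem, re-created by the tests
      SCRATCH_MNT=/mnt/scratch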

    However, there haven't been good tools to generate aged file systems; some people may have had some ad hoc tools, but nothing general. And the academic community hasn't had this insight until very recently. (If it hasn't been published in a tenure-track journal, it doesn't exist as far as the academic community is concerned, and so for them it's a new idea. :-)

    The author of the Impressions paper gave a talk at Google, and he used some of the same slides that he presented at the FAST conference. In those slides, he did show graphs that showed how much difference there was between freshly created file systems and aged file systems --- and the difference was quite noticeable. (Not a surprise.)

    My opinion of the tool is that functionally it's O.K., but it could be better. It currently models the number, size, and distribution of files. And it also measures file fragmentation, which is a great way to measure future performance when reading the existing files. So that's all good. The one thing it doesn't measure is free-space fragmentation, which would be a good predictor of future performance for newly created files. So there's room for improvement on that front.

    On the implementation side, my primary complaint is that its method for measuring file fragmentation talks to debugfs over a pipe, and parses the output of debugfs, which is (a) an ugly kludge, and (b) limits its functionality to ext2/3/4 file systems. Linux has a file-system-independent ioctl, FIEMAP, which will return the same information and would allow the file fragmentation module of the fs impressions tool to be used across multiple file systems.
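
    A rough sketch of what that looks like from user space (Python only for brevity; the ioctl number and struct layouts come from linux/fiemap.h, and error handling is omitted):

    Code:
      #!/usr/bin/env python3
      # Count a file's extents with the FS_IOC_FIEMAP ioctl instead of
      # parsing debugfs output; works on any filesystem implementing FIEMAP.
      import fcntl
      import struct
      import sys

      FS_IOC_FIEMAP = 0xC020660B      # _IOWR('f', 11, struct fiemap)
      FIEMAP_FLAG_SYNC = 0x00000001   # sync the file before mapping

      def count_extents(path):
          with open(path, "rb") as f:
              # struct fiemap header: fm_start, fm_length, fm_flags,
              # fm_mapped_extents, fm_extent_count, fm_reserved
              buf = bytearray(struct.pack("=QQIIII", 0, 0xFFFFFFFFFFFFFFFF,
                                          FIEMAP_FLAG_SYNC, 0, 0, 0))
              # fm_extent_count == 0 asks the kernel only for the extent count.
              fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
              return struct.unpack("=QQIIII", bytes(buf))[3]

      if __name__ == "__main__":
          for name in sys.argv[1:]:
              print(name, count_extents(name))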

    The thing which makes the fs impressions tool really useful is that you can both provide a statistical model for the number of files, distribution of files, and fragmentation of the files --- and he has some initial work to measure these statistics on an existing file system, so that you can create a statistical model that matches a current file system. So if you have a file server which has been in service for 18 months, you could use his tool to gather statistics, and then use that to create a model that is reproducible, and has the same characteristics as the source file system.

    I talked to the author of the paper, and he's agreed in principle to put the source code of the tool up on github, and allow community members to submit patches to improve the tool. What is there is a good starting point, but he was the first to admit that it was a research vehicle, and once it was done and the researchers moved on to other research interests, the tool was never improved and productionalized. This probably explains why it still uses the debugfs hack and not a more general FIEMAP interface. So hopefully we can get this up on github, and people can help get the tool in shape so it can be used by folks such as your benchmarking operation. Some assembly will still be necessary, but it shouldn't be that much work.

    100% agree. There are thousands of measures and thousands of conditions that can be applied. What Michael and I try to listen for is the scenario and the potential measure that can be used. OpenBenchmarking and PTS provide for the visibility and repeatability respectively. The harder part is determining the Configuration Under Test and preparing the System Under Test to suit.
    One thing that I definitely need to give you guys kudos for is that you do document your hardware configurations for the System and Configuration Under Test, and you do strive for strong reproducibility. That's all good stuff.

    Other folks who do a really good job are Eric Whitney at HP, who has helped me greatly in ext4 development, and Stephen Pratt at IBM. Both have done benchmarking professionally, and it shows. For an example of their work, see: http://free.linux.hp.com/~enw/ext4/2.6.36-rc6/ and http://btrfs.boxacle.net/.

    One of the things which they do which is incredibly helpful to file system developers is that they will do oprofile and (very important on larger CPU count machines) lockstat runs. Enabling oprofile and/or lockstat will of course skew the benchmark results, so they have to be done separately, and the performance results discarded, but the oprofile and lockstat information is very useful in showing what are the next things that can be optimized to further improve the file system.

    Another very useful analysis tool for understanding why the results are the way they are is to use blktrace. The only caveat with blktrace is that the results can be very misleading on non-aged file systems. For example, btrfs is unquestionably better than ext4 at avoiding seeks on freshly created file systems. Chris Mason has some animations of blktrace output which make this very clear, and no doubt this is why btrfs performs better than ext4 on freshly created file systems on 1-2 processor systems (where lock contention isn't as important), and on workloads where a lot of files are created and written sequentially on a new file system (as opposed to database workloads which have a lot of random read/write operations). But I've been hesitant to put in some very simple and easy-to-make changes that would improve ext4's sequential file creation on freshly created file systems, because it would mean turning off the anti-fragmentation measures that we have put in to try to assure that ext4 will age more gracefully over the long term. As with any engineering discipline, a file system engineer often has to balance and trade off competing goals.

    At the same time, I know we haven't done enough work to make sure ext4 could be improved on the long-term file system aging point of view. So much work to be done, and not enough time. :-)



  • testerus
    replied
    Reiser4

    Originally posted by ayumu View Post
    As usual, reiser4 is missing. A shame.
    Is there a chance that Reiser4 gets added later on?

