Real World Benchmarks Of The EXT4 File-System


  • #46
    bonnie nonsense, and XFS tweaks

    First, the bonnie++ benchmark is nonsense. I downloaded the benchmark suite, and
    pts/test-resources/bonnie/ makes a bonnie script that will run
    ./bonnie_/sbin/bonnie++ -d scratch_dir/ -s $2 > $LOG_FILE 2>&1
    -s controls the size of the big file used in sequential write/rewrite/read and lseek tests, and has no impact on the multiple file creation/read/deletion test. The defaults for that are -n 10:0:0:0, IIRC. That means bonnie++ will create 10 * 1024 empty files in the scratch directory. This mostly tests the kernel's in-memory cache structures, since that's not big enough to fill up the memory, so you're not waiting for anything to happen on disk. The deletion does have to happen on disk for anything that made it to disk before being deleted, which can be a bottleneck.
    -n 30:50000:200:8 would be a more interesting test, probably. (file sizes between 50kB (not kiB) and 200B, 30*1024 files spread over 8 subdirs)
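
    To make the -n fields concrete, here is a minimal sketch that decodes a spec of the form number:max:min:num-directories, as described above. The function name describe_n_spec is made up for this example, and treating omitted fields as 0 is an assumption of the sketch, not a statement about bonnie++'s own defaults.

```shell
#!/bin/sh
# Decode a bonnie++ "-n" spec: number:max:min:num-directories.
# "number" is counted in multiples of 1024 files.
# Assumption of this sketch: omitted fields are reported as 0.
describe_n_spec() {
    spec=$1
    old_ifs=$IFS
    IFS=:
    # Split the spec on ":" into positional parameters.
    set -- $spec
    IFS=$old_ifs
    count=${1:-0} max=${2:-0} min=${3:-0} dirs=${4:-0}
    echo "files=$((count * 1024)) max_bytes=$max min_bytes=$min dirs=$dirs"
}

describe_n_spec 10:0:0:0
describe_n_spec 30:50000:200:8
```

    The first call reports 10240 files of size 0, the second 30720 files between 200 and 50000 bytes spread over 8 directories.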

    A few people have pointed out that XFS has stupid defaults, but nobody posted a good recommendation. I've played with XFS extensively and benchmarked a few different kinds of workloads on HW RAID5 and on single disks. And I've been using it on my desktop for several years now. For general purpose use, I would recommend:

    mkfs.xfs -l lazy-count=1,size=128m -L yourlabel /dev/yourdisk
    mount with -o noatime,logbsize=256k  (put that in /etc/fstab)
    lazy-count: don't keep the counters in the superblock up to date all the time, since there's enough info elsewhere. fewer writes = good.

    -l size=128m: XFS likes to have big logs, and this is the max size.

    mount -o logbsize=256k: That's log buffer size = 256kiB (of kernel memory). The default (and max with v1 logs) is 32kiB. This makes a factor of > 2 performance difference on a lot of small-file workloads. I think logbufs=8 has a similar effect (the default is 2 log bufs of size 32k). I haven't tested logbufs=8,logbsize=256k. The XFS devs frequently recommend logbsize=256k to people asking about perf tuning on the mailing list, but they don't mention increasing logbufs too.

    If you have an older mkfs.xfs, get the latest xfsprogs; 2.10.1 has better defaults for mkfs (e.g. unless you set RAID stripe params, agcount=4, which is about as much parallelism as a single disk can give you anyway. The old default was a much higher agcount, which could slow things down when the disk started to get full).

    Or just use your old mkfs.xfs and specify agcount:
    mkfs.xfs -l lazy-count=1,size=128m -L label /dev/disk  -d agcount=4 -i attr=2

    If you want to start tuning, read up on XFS a bit. (unfortunately, there's no good tuning guide anywhere obvious on the web site). Read the man page for mkfs.

    You can't change the number of allocation groups without a fresh mkfs, but you can enable version 2 logs, and lazy-count, without mkfs. xfs_admin -j -c1 will switch to v2 logs with lazy-count enabled. xfs_growfs says growing the log size isn't supported, which is a problem if you have less than the max size of 128MB, since XFS loves large logs. A large log lets XFS keep more metadata ops in flight, instead of being forced to write them out sooner.

    If your FS is bigger than 1TB, you should mount with -o inode64, too. Note that contrary to the docs, noikeep is the default. I checked the kernel sources, and that's been the case for a while, I think. Otherwise I would recommend using noikeep to reduce fragmentation.

    If you're making a filesystem of only a couple GB, like a root fs, a 128MB log will take a serious chunk of the available space. You might be better off with JFS. I'm currently benchmarking XFS with tons of different option combinations for use as a root fs... (XFS block size, and log size, lazy-count=0/1, mount -o logbsize=, and block dev readahead and io elevator)

    I use LVM for /usr, /home, /var/tmp (includes /var/cache and /usr/local/src), so my root FS currently is a 1.5GB JFS filesystem that is 54% full. It's on a software RAID1.
    Since I run Ubuntu, my /var/lib/dpkg/info has 9373 files out of the total 20794 regular files (27687 inodes) on the filesystem, most of them small.

    export LESS=iM
    find / -xdev -type f -ls | sort -n -k7 | less -S
    then look at the % in less's status line. or type 50% to go to 50% of the file position.
    <= 1k: 45%
    <= 2k: 52%
    <= 3k: 58% (mostly /var/lib/dpkg/info)
    <= 4k: 59%
    <= 6k: 62%
    <= 8k: 64%
    <= 16k: 71% (a lot of kernel modules...)
    <= 32k: 85%
    <= 64k: 93%
    <= 128k: 96%

    > 1M: 0.2% (57 files)

    (I started doing this with find without -type f, and there are lots of small directories (that don't need any blocks outside the inode): < 1k: 59%; < 2k: 64%; < 3k: 68%)
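
    The same cumulative table can be produced without paging through less. This is only a sketch: it assumes GNU find (for -printf) and awk, and the function name size_histogram is invented here. Thresholds are given in bytes.

```shell
#!/bin/sh
# Print the cumulative percentage of regular files whose size is at or
# below each threshold (bytes). Assumes GNU find (-printf) and awk.
size_histogram() {
    dir=$1
    shift
    find "$dir" -xdev -type f -printf '%s\n' |
    awk -v limits="$*" '
        BEGIN { n = split(limits, lim, " ") }
        {
            total++
            # Count this file under every threshold it fits below.
            for (i = 1; i <= n; i++)
                if (($1 + 0) <= (lim[i] + 0)) le[i]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "<= %d: %.0f%%\n", lim[i], total ? 100 * le[i] / total : 0
        }'
}

# Example (commented out because it walks the whole root filesystem):
# size_histogram / 1024 2048 4096 8192 16384 32768 65536 131072
```

    Dropping -type f from the find call would include directories, as in the parenthetical above.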

    Every time dpkg upgrades a package, or I even run dpkg -S, it reads /var/lib/dpkg/info/*.list (and maybe more). (Although dlocate usually works as a replacement for dpkg -S.) This usually takes several seconds when the cache is cold on my current JFS filesystem that I created ~2 years ago when I installed the system. This is what I notice as slow on my root filesystem currently. JFS is fine with hot caches, e.g. for /lib, /etc, /bin, and so on. But dpkg is always very slow the first time.

    Those small files are probably pretty scattered now, and probably not stored in anything like readdir() order or alphabetical order. I'm hoping XFS will do better than JFS at keeping down fragmentation, although it probably won't. It writes files created at the same time all nearby (it actually tries to make contiguous writes out of dirty data). It doesn't look at where old files in the same directory are stored when trying to decide where to put new files, AFAIK. So I'll probably end up with more scattered files. At least with XFS's batched writeout, mkdir; cp -a info/*; mv ... ; rm -r ...; will work to make a defragged copy of the directory and files in it. (to just defrag the directory, mkdir; ln info/*; That can make readdir order = alphabetical order. Note using *, which expands to a sorted list, instead of using just cp -a, which will operate in readdir order. dpkg doesn't read in readdir order, it goes (mostly?) alphabetically by package name (based on its status file).)
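
    The copy-and-swap recipe above could be wrapped up roughly like this. It is a sketch under assumptions: the .new and .old suffixes and the name recopy_dir are made up for illustration, nothing else may touch the directory while it runs, dotfiles are skipped because * does not match them, and the new directory gets default ownership and permissions rather than those of the original.

```shell
#!/bin/sh
# Rewrite a directory by copying its entries, in sorted glob order, into a
# fresh sibling directory and then swapping the two.
# ".new" and ".old" are hypothetical names chosen for this sketch.
recopy_dir() {
    dir=${1%/}
    [ -d "$dir" ] || return 1
    mkdir "$dir.new" || return 1
    # "*" expands to a sorted list, so files are created alphabetically
    # instead of in readdir order. Fails if the directory is empty.
    cp -a -- "$dir"/* "$dir.new"/ || return 1
    mv -- "$dir" "$dir.old" &&
    mv -- "$dir.new" "$dir" &&
    rm -r -- "$dir.old"
}

# Example (run as root, with dpkg idle):
# recopy_dir /var/lib/dpkg/info
```

    Whether the result is actually less fragmented depends on the filesystem's allocator; xfs_bmap -v on a few files before and after is one way to check.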

    Anyway, I'm considering using a smaller data block size, like -b size=2k or size=1k, (but -n size=8k, I definitely don't want smaller blocks for directories. There are a lot of tiny directories, but they won't waste 8k because there's room in the inode for their data. See directory sizes with e.g. ls -ld. Larger directory block sizes help to reduce directory fragmentation. And most of the directories on my root filesystem that aren't tiny are fairly large. xfs_bmap -v works on directories, too, BTW). XFS is extent-based, so a small block size doesn't make huge block bitmaps even for large files.

    I think I was finding that smaller data block sizes were using more CPU than the default 4k (=max=page size) in hot-cache situations. I compared some results I've already generated, and 1k or 2k does seem slightly faster for untarring the whole FS; drop_caches; tar c | wc -c (so stat+read) ; drop_caches; untar again (overwrite); drop_caches; read some more, timing each component of that. My desktop has been in single-user mode for 1.5 days testing this. I should post my results somewhere when I'm done... And I need to find a good way to explore the 5 (or higher) dimensional data (time as a function of block size, log size, logbuf size, lazy-count=0/1, and deadline vs. cfq, and blockdev --setra 256, 512, or 1024 if I let my tests run that long...).
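
    To give a feel for how quickly that parameter space grows, the sketch below just enumerates the combinations; it does not run any benchmark. The specific values are assumptions pulled from numbers mentioned in this post (1k/2k/4k blocks, 32k vs. 256k log buffers, three readahead settings), log size is held fixed, and list_configs is a made-up name.

```shell
#!/bin/sh
# Enumerate one test configuration per line. The value lists are
# illustrative assumptions; log size is left out (held fixed).
list_configs() {
    for bsize in 1k 2k 4k; do
        for logbsize in 32k 256k; do
            for lazy in 0 1; do
                for sched in deadline cfq; do
                    for ra in 256 512 1024; do
                        echo "bsize=$bsize logbsize=$logbsize lazy-count=$lazy elevator=$sched readahead=$ra"
                    done
                done
            done
        done
    done
}

# Count the configurations.
list_configs | wc -l
```

    Even this reduced grid has 3 * 2 * 2 * 2 * 3 = 72 configurations, each needing a fresh mkfs plus the full untar/read/overwrite cycle.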

    BTW, JFS is good, and does use less CPU. That won't reduce CPU wakeups to save power, though. FS code mostly runs when called by processes doing a read(2), or open(2), or whatever. Filesystems do usually start a thread to do async tasks, though. But those threads shouldn't be waking up at all when there's no I/O going on.
    I decided to use JFS for my root FS a couple years ago after reading I probably would have used XFS, but I hadn't realized that to work around the grub-install issue you just have to boot GRUB from a USB stick or whatever, and type root (hd0,0); setup (hd0). I recently set up a bioinformatics cluster using XFS for root and all other filesystems. It works fine, except that getting GRUB installed is a hassle.

    Also BTW, there's a lot of good reading on e.g. suggestions for setting up software RAID, and lots of filesystem stuff.

    XFS is wonderful for large files, and has some neat other features. If you download torrents, you usually get fragmented files because they start sparse and are written in the order the blocks come in. XFS can preallocate space without actually writing it, so you end up with a minimally-fragmented file. Azureus has an option to use xfs_io's resvsp command. Linux now has a fallocate(2) system call which should work for XFS and ext4. posix_fallocate(3) should use it. I'm not sure if fallocate is actually implemented for XFS yet, but I would hope so since its semantics are the same. And I don't know what glibc version includes a fallocate(2) backend for posix_fallocate(3).
    And xfs has nice tools, like xfs_bmap to show you the fragmentation of any file.
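
    For the preallocation point, a small sketch of reserving space before a download starts. It assumes the fallocate(1) utility from util-linux is installed on a current system, and falls back to writing zeros with dd where the utility or filesystem support is missing. The name preallocate is invented for this example.

```shell
#!/bin/sh
# Reserve space for a file before writing into it.
# Tries fallocate(1) from util-linux first; if that is unavailable or the
# filesystem does not support it, falls back to writing zeros with dd.
preallocate() {
    file=$1
    mib=$2
    if command -v fallocate >/dev/null 2>&1 &&
       fallocate -l "$((mib * 1024 * 1024))" "$file" 2>/dev/null; then
        return 0
    fi
    dd if=/dev/zero of="$file" bs=1048576 count="$mib" 2>/dev/null
}

# Example: reserve 700 MiB, then inspect the extent layout on XFS.
# preallocate download.iso 700 && xfs_bmap -v download.iso
```

    The dd fallback gives the same file size but actually writes the data, so it only helps with fragmentation, not with speed.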
    Last edited by llama; 12-06-2008, 07:09 PM.


    • #47
      Next time consider io_thrash for a benchmark. It is open source, well documented, and creates a workload that simulates a high end transaction processing database engine.

      Disclosure: I manage the project / product (GT.M) that released io_thrash.


      • #48
        The bonnie++ options used in the benchmarks at:

        were bonnie++ -n128:128k:0

        The -n128 means that the test wrote, read and deleted 128k (131,072) files. These were first sequentially, then randomly, written/read/deleted to/from the directory.

        The :128k:0 means that every file had a random size between 128k (131,072 bytes) and zero. So the average file-size was 64k.
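
        As a rough check on what those options imply, the arithmetic can be spelled out. This is a sketch that assumes sizes are uniformly distributed between min and max and that all files exist at the same time before deletion; bonnie_volume is a made-up helper name.

```shell
#!/bin/sh
# Expected data volume for a bonnie++ small-file run.
# Args: file count in units of 1024, max size in bytes, min size in bytes.
# Assumes a uniform size distribution, so the mean is (max + min) / 2.
bonnie_volume() {
    files=$(($1 * 1024))
    avg=$((($2 + $3) / 2))
    total=$((files * avg))
    echo "files=$files avg_bytes=$avg total_bytes=$total total_gib=$((total / 1024 / 1024 / 1024))"
}

# -n128:128k:0 with 128k taken as 131072 bytes
bonnie_volume 128 131072 0
```

        Under those assumptions the run holds about 8 GiB of file data at its peak, which is well beyond what fits in the page cache on most desktop machines.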


        • #49
          Originally posted by Kazade
          I'll be honest, I'm a little confused about using games as a benchmark for a filesystem. Games load resources from the disk before the game play starts, everything from that point on is stored in either RAM or VRAM while the game is in play (unless of course you run out of memory). Only an insane game developer would read or write from the disk during gameplay because it would kill frame rate.

          If you were timing the loading times (or game saves) fair enough, but using the frame rate as a bench mark seems pointless.
          Some games certainly do load textures on the fly. Guild Wars is such a game.

          I think testing game performance isn't a bad idea, but average FPS isn't a good indicator. A utility that works like Fraps, which shows lowest and highest FPS, should be used instead. The lowest FPS score would be the more interesting statistic in a game known to load textures on the fly, even if running under Wine.
          Last edited by psycho_driver; 12-04-2008, 02:17 PM.


          • #50
            Originally posted by kjgust
            Oh dear.. Well first off how can I say this.. You just made me CHOKE on my coffee. Haha, you know, the only time I used reiserFS, it was a bad experience, eventually. So even if it is faster, it's definitely not as proven or as reliable as something like EXT3. I personally wouldn't be surprised to see ReiserFS3 be removed from the Linux Kernel eventually. Because from my experience at least, and what I've heard from others, it's really not that good.
            I knew Jade would make an appearance in this thread. His obsession with ReiserFS isn't healthy.

            I've used ReiserFS twice, and both times I had catastrophic filesystem failures within about a year.


            • #51
              No comment from drag (or anybody else) on this?

              Originally posted by drag
              Suse was a early adopter and proponent of ReiserFsv3. They have ReiserFS developers on staff.
              At least this statement of yours is true.

              Suse has supported and distributed Reiser3 ever since January 2000 (in SuSE 6.3).

              They show no sign of moving to support v4 in any meaningful way.
              This is TOTAL CRAP. Suse supported and distributed Reiser4 for years.

              They were almost the only ones supporting it.

              They too depend heavily on the ability of Linux to compete with Unix, Windows, and especially Redhat.
              This is TOTAL CRAP as well.

              Reiser4 was supported by SuSE till they were bought out by the Jews. The Jews already owned Redhat, so there was no competition.

              So you would think that if v4 offered a substantial advantage over the more mundane Linux file systems then they would jump at the chance to push their OS forward.
              The (German company SuSE) did "jump at the chance," as a non-fairy tale version of history substantiates.

              When the Jews bought it out, they worked hard on getting rid of KDE, Reiser3, Reiser4, mp3 support and NTFS support.

              Destruction of Linux NTFS support got away from them when Szabolcs Szakacsits released his NTFS driver.

              They removed mp3 support from SuSE 10. Thus I stopped using SuSE, so I don't know if it is still sabotaged in this way.

              Reiser4 has been successfully shut down by sabotage of the Linux kernel code due to Andrew Morton.

              They are still trying to kill Reiser3, but too many people know that for years it was the best filesystem available and it is proving hard for them to get rid of it.

              There was a huge user rebellion against the move to Gnome and KDE stayed,... at least for now.


              • #52
                Suse supporting Reiserfs in a meaningful way would mean that they support using v4 as an install option. Which they don't.

                Originally posted by Jade

                Reiser4 was supported by SuSE till they were bought out by the Jews. The Jews already owned Redhat, so there was no competition.

                So the Jews hate ReiserFS?


                My good sir.

                You are either serious and happen to be borderline insane, or are a batshit insane troll. Either way you have too much time on your hands and seem to have an almost complete lack of critical thinking skills.

                I suggest a double dose of a BS degree in liberal arts at a very high quality private university (the more conservative the better) combined with counseling. From looking at your website you seem to have some serious delusions and possibly schizophrenic tendencies. If you already have a person you're seeing, fire him/her, and if you already have a degree try to get your money back to pay for a better one.

                And probably some better religion, if you're into that sort of thing. It's helped lots of people in the past get a better grounding.

                I may sound insulting, but it's really for your own benefit.
                Last edited by drag; 12-04-2008, 06:15 PM.


                • #53
                  reiserfs IS NOT reiser4

                  god people, don't you know anything?

                  reiserfs = reiserfs 3.5&3.6
                  reiser4 = reiser4

                  no 'reiserfs' for 4, and no 4 in reiserfs. Two completely different file systems.

                  Get your facts straight, before you look completely silly, ok?
                  Oh, and jade - all the points you might have are invalidated by your idiotic (yes, I said it) conspiracy theories and jew hating.


                  • #54

                    You know, it is nice that you share your opinions about the jews with us. But those are not facts, as you cannot back them up with links. So please stay on topic when you discuss your opinions. Back to file systems.


                    • #55
                      Why is no JFS benched? It's the fastest FS for Linux and it has supported 64-bit for many years.


                      • #56
                        Originally posted by thacrazze
                        Why is no JFS benched? It's the fastest FS for Linux
                        No it is not. Reiser4 is the fastest. See:



                        • #57
                          Hahaha. What is with this guy? I thought ReiserFS was okay but geez. ReiserFS and Reiser4 are already associated with one weirdo too many.


                          • #58
                            Chewi, it is associated with one 'weirdo' because of people like you who can't see behind technical merits.


                            • #59
                              I've been using reiserfs ('3') for at least 5 years and I have never had any data loss or filesystem corruption. It perfectly handled every power loss I encountered. It's rock stable and eats ext3 alive when it comes to performance.
                              Recovering accidentally *deleted* files is another topic, as it may be problematic, especially when you have some reiserfs file image on the partition being recovered.
                              The other disadvantages (the only ones, actually) of reiserfs are longer mount time and somewhat higher CPU usage. But please stop spreading bull**** about its unreliability.

                              On the other hand, I remember having problems with ext3 quite often, which caused me to forget about ext* forever for anything except /boot.
                              Last edited by reavertm; 12-06-2008, 12:57 AM.


                              • #60
                                Originally posted by reavertm
                                But please stop spreading bull**** about its unreliability.
                                You're lucky; that was not the case 3 years ago when openSUSE dropped it from being the default filesystem. Feel free to look around at the mailing lists from back then.