
Real World Benchmarks Of The EXT4 File-System


  • #41
    Originally posted by Jade View Post
    RESULT: With compression, REISER4, absolutely SMASHED the other filesystems.
    Oh dear.. Well, how can I say this.. You just made me CHOKE on my coffee. Haha, you know, the only time I used ReiserFS it was a bad experience eventually. So even if it is faster, it's definitely not as proven or as reliable as something like EXT3. I personally wouldn't be surprised to see ReiserFS3 removed from the Linux kernel eventually, because from my experience at least, and from what I've heard from others, it's really not that good.


    • #42
      Originally posted by drag View Post
      SUSE was an early adopter and proponent of ReiserFS v3. They have ReiserFS developers on staff. They show no sign of moving to support v4 in any meaningful way. They too depend heavily on the ability of Linux to compete with Unix, Windows, and especially Red Hat. So you would think that if v4 offered a substantial advantage over the more mundane Linux filesystems, they would jump at the chance to push their OS forward.
      Just to add on:

      SUSE dropped Reiser as its default filesystem because of several technical problems, as well as maintenance problems, especially after Chris Mason left (the people remaining were basically left holding the bag on maintaining it). That left basically Mahoney to look after it, and with its bug-ridden past it just became too big a headache. It also wasn't so shit-hot in performance or reliability either. I wouldn't be surprised if it is soon dropped from SUSE's supported filesystems altogether.

      ReiserFS has no future. It's effectively dead. Time to put it up on the shelf with other innovations like the Superdisk 120 and the 80186.


      • #43
        Which bit of this didn't you understand?

        (includes Reiser4 and Ext4)

        Some Amazing Filesystem Benchmarks. Which Filesystem is Best?

        RESULT: With compression, REISER4, absolutely SMASHED the other filesystems.

        No other filesystem came close (not even remotely close).

        Using REISER4 (gzip), rather than EXT2/3/4, saves you a truly amazing 816 - 213 = 603 MB (a 74% saving in disk space), and this, with little, or no, loss of performance when storing 655 MB of raw data. In fact, substantial performance increases were achieved in the bonnie++ benchmarks.

        We use the following filesystems:

        REISER4 gzip: Reiser4 using transparent gzip compression.
        REISER4 lzo: Reiser4 using transparent lzo compression.
        REISER4: Standard Reiser4 (with extents).
        EXT4 default: Standard ext4.
        EXT4 extents: ext4 with extents.
        NTFS3g: Szabolcs Szakacsits' NTFS user-space driver.
        NTFS: NTFS with the Windows XP driver.

        Disk Usage in megabytes. Time in seconds. SMALLER is better.

        |File         |Disk |Copy |Copy |Tar  |Unzip| Del |
        |System       |Usage|655MB|655MB|Gzip |UnTar| 2.5 |
        |Type         | (MB)| (1) | (2) |655MB|655MB| Gig |
        |REISER4 gzip | 213 | 148 |  68 |  83 |  48 |  70 |
        |REISER4 lzo  | 278 | 138 |  56 |  80 |  34 |  84 |
        |REISER4 tails| 673 | 148 |  63 |  78 |  33 |  65 |
        |REISER4      | 692 | 148 |  55 |  67 |  25 |  56 |
        |NTFS3g       | 772 |1333 |1426 | 585 | 767 | 194 |
        |NTFS         | 779 | 781 | 173 |   X |   X |   X |
        |REISER3      | 793 | 184 |  98 |  85 |  63 |  22 |
        |XFS          | 799 | 220 | 173 | 119 |  90 | 106 |
        |JFS          | 806 | 228 | 202 |  95 |  97 | 127 |
        |EXT4 extents | 806 | 162 |  55 |  69 |  36 |  32 |
        |EXT4 default | 816 | 174 |  70 |  74 |  42 |  50 |
        |EXT3         | 816 | 182 |  74 |  73 |  43 |  51 |
        |EXT2         | 816 | 201 |  82 |  73 |  39 |  67 |
        |FAT32        | 988 | 253 | 158 | 118 |  81 |  95 |

        The raw data (without filesystem meta-data, block alignment wastage, etc) was 655MB.
        It comprised 3 different copies of the Linux kernel sources.

        Disk Usage: The amount of disk used to store the data.
        Copy 655MB (1): Time taken to copy the data over a partition boundary.
        Copy 655MB (2): Time taken to copy the data within a partition.
        Tar Gzip 655MB: Time taken to Tar and Gzip the data.
        Unzip UnTar 655MB: Time taken to UnGzip and UnTar the data.
        Del 2.5 Gig: Time taken to Delete everything just written (about 2.5 Gig).
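        For anyone wanting to reproduce the timed steps, here is a minimal sketch of the sequence described above, using a tiny throwaway tree instead of the 655 MB of kernel sources (all paths are illustrative assumptions, not the original benchmark script):

        ```shell
        #!/bin/sh
        set -e
        SRC=$(mktemp -d); DST=$(mktemp -d)
        # Build a small sample tree (stand-in for the kernel sources).
        for i in 1 2 3; do
            mkdir -p "$SRC/tree$i"
            head -c 4096 /dev/urandom > "$SRC/tree$i/file.bin"
        done
        cp -a "$SRC" "$DST/copy"                      # "Copy (2)": copy within a partition
        tar czf "$DST/data.tar.gz" -C "$SRC" .        # "Tar Gzip"
        mkdir "$DST/extract"
        tar xzf "$DST/data.tar.gz" -C "$DST/extract"  # "Unzip UnTar"
        rm -r "$DST/copy" "$DST/extract" "$DST/data.tar.gz"  # "Del"
        echo done
        ```

        Wrap each step in time(1) to get numbers comparable to the table; the real benchmark averaged five runs.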

        Each test was performed 5 times and the average value recorded.

        To get a feel for the performance increases that can be achieved by using compression, we look at the total time (in seconds) to run the test:

        bonnie++ -n128:128k:0 (bonnie++ is Version 1.93c)

        | FILESYSTEM | TIME |
        |REISER4 lzo |  1938|
        |REISER4 gzip|  2295|
        |REISER4     |  3462|
        |EXT4        |  4408|
        |EXT2        |  4092|
        |JFS         |  4225|
        |EXT3        |  4421|
        |XFS         |  4625|
        |REISER3     |  6178|
        |FAT32       | 12342|
        |NTFS-3g     |>10414|
        The top two results use Reiser4 with compression. Since bonnie++ writes test files which are almost all zeros, compression speeds things up dramatically. That this is not the case in real world examples can be seen in the first test above where compression often does not speed things up. However, more importantly, it does not slow things down much, either.
        Last edited by Jade; 12-04-2008, 05:23 AM.


        • #44
          Originally posted by drag View Post
          but keep in mind that, unlike Ext2->Ext3->Ext4, each new Reiser filesystem is rewritten from scratch, and they are not related to one another in any direct manner.
          Then again, ext4 with extents is not compatible with ext3 or ext2 at all. You can never mount a fully featured ext4 filesystem as either of them. I actually found it odd that this benchmark lacked fsck tests for ext3 and ext4; the speedup there is the main thing I'm looking forward to. My conclusion from the benchmarks would be that ext4 excels with big files, probably due to the new extents, but the performance difference otherwise is not significant. The differences would likely have been smaller with smaller test files. Maybe the other tests didn't deal with gigabytes of data? "Extents are introduced to replace the traditional block mapping scheme used by ext2/3 filesystems. An extent is a range of contiguous physical blocks, improving large file performance and reducing fragmentation. A single extent in ext4 can map up to 128MiB of contiguous space with a 4KiB block size." (Wikipedia.) And yeah, AFAIK you need to fully reformat the hard disk to get the full benefits of ext4 (that is, extents for old files too).


          • #45
            Originally posted by mctop View Post

            First of all, thanks for the article and benchmark.

            We are planning to buy a new RAID system with around 4 TB of storage capacity (currently we have 2 TB on ext3). On monthly scheduled administration days we reboot the main server for maintenance (new kernel, and of course kicking off all NFS clients...). So, from time to time, the RAID system will check the data on boot (tune2fs could avoid this, but for safety reasons we perform the complete disk check). This takes hours where you can just wait and wait...

            So, if ext4 would reduce this checking time, I would change immediately.

            Any experiences, or a way to check this?

            Thanks in advance
            Have you tried Solaris and ZFS? ZFS has no fsck. Instead, it has something called "scrub", but your ZFS raid stays online and fully functioning meanwhile. I've heard that fsck on a large ext3 filesystem took one week!

            Here is a Linux admin comparing ZFS with linux filesystems:

            Here is a Linux guy setting up a home file server ZFS:

            ZFS + 48 SATA discs + dual Opteron and no hardware raid (just plain SATA controller), writes more than 2 GB/sec:

            And Sun is selling a new storage device, the 7000. Read about "The Killer App". You could download and play with that analysis software, which uses the unique DTrace, in a VMware image (which simulates several discs with a ZFS raid):

            Create a ZFS raid:
            # zpool create myZFSraid disc0 disc1 disc2 disc3
            and that is all. No formatting needed, just bang away immediately. Dead simple administration.
            Last edited by kebabbert; 12-04-2008, 06:16 AM.


            • #46
              bonnie nonsense, and XFS tweaks

              First, the bonnie++ benchmark is nonsense. I downloaded the benchmark suite, and pts/test-resources/bonnie/ generates a bonnie wrapper script that runs:
              ./bonnie_/sbin/bonnie++ -d scratch_dir/ -s $2 > $LOG_FILE 2>&1
              -s controls the size of the big file used in sequential write/rewrite/read and lseek tests, and has no impact on the multiple file creation/read/deletion test. The defaults for that are -n 10:0:0:0, IIRC. That means bonnie++ will create 10 * 1024 empty files in the scratch directory. This mostly tests the kernel's in-memory cache structures, since that's not big enough to fill up the memory, so you're not waiting for anything to happen on disk. The deletion does have to happen on disk for anything that made it to disk before being deleted, which can be a bottleneck.
              -n 30:50000:200:8 would be a more interesting test, probably. (file sizes between 50kB (not kiB) and 200B, 30*1024 files spread over 8 subdirs)

              A few people have pointed out that XFS has stupid defaults, but nobody posted a good recommendation. I've played with XFS extensively and benchmarked a few different kinds of workloads on HW RAID5 and on single disks. And I've been using it on my desktop for several years now. For general purpose use, I would recommend:

              mkfs.xfs -l lazy-count=1,size=128m -L yourlabel /dev/yourdisk
              mount with -o noatime,logbsize=256k  (put that in /etc/fstab)
              lazy-count: don't keep the counters in the superblock up to date all the time, since there's enough info elsewhere. fewer writes = good.

              -l size=128m: XFS likes to have big logs, and this is the max size.

              mount -o logbsize=256k: That's log buffer size = 256kiB (of kernel memory). The default (and the max with v1 logs) is 32kiB. This makes a factor of > 2 performance difference on a lot of small-file workloads. I think logbufs=8 has a similar effect (the default is 2 log buffers of size 32k). I haven't tested logbufs=8,logbsize=256k. The XFS devs frequently recommend logbsize=256k to people asking about perf tuning on the mailing list, but they don't mention increasing logbufs too.
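              For reference, here's what those mount options look like as an /etc/fstab line (the device, mount point, and remaining fields are placeholders, not from the original post):

              ```
              # /etc/fstab
              /dev/yourdisk  /data  xfs  noatime,logbsize=256k  0  2
              ```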

              If you have an older mkfs.xfs, get the latest xfsprogs, 2.10.1 has better defaults for mkfs (e.g. unless you set RAID stripe params, agcount=4, which is about as much parallelism as a single disk can give you anyway. The old default was much higher agcount, which could slow down when the disk started to get full.)

              Or just use your old mkfs.xfs and specify agcount:
              mkfs.xfs -l lazy-count=1,size=128m -L label /dev/disk  -d agcount=4 -i attr=2

              If you want to start tuning, read up on XFS a bit. (unfortunately, there's no good tuning guide anywhere obvious on the web site). Read the man page for mkfs.

              You can't change the number of allocation groups without a fresh mkfs, but you can enable version 2 logs, and lazy-count, without one: xfs_admin -j -c1 will switch to v2 logs with lazy-count enabled. xfs_growfs says growing the log size isn't supported, which is a problem if you have less than the max size of 128MB, since XFS loves large logs. A large log lets it have more metadata ops in flight, instead of being forced to write them out sooner.

              if your FS is bigger than 1TB, you should mount with -o inode64, too. Note that contrary to the docs, noikeep is the default. I checked the kernel sources, and that's been the case for a while, I think. Otherwise I would recommend using noikeep to reduce fragmentation.

              If you're making a filesystem of only a couple of GB, like a root fs, a 128MB log will take a serious chunk of the available space. You might be better off with JFS. I'm currently benchmarking XFS with tons of different option combinations for use as a root fs... (XFS block size, log size, lazy-count=0/1, mount -o logbsize=, and block dev readahead and I/O elevator.)

              I use LVM for /usr, /home, /var/tmp (includes /var/cache and /usr/local/src), so my root FS currently is a 1.5GB JFS filesystem that is 54% full. It's on a software RAID1.
              Since I run Ubuntu, my /var/lib/dpkg/info has 9373 files out of the total 20794 regular files (27687 inodes) on the filesystem, most of them small.

              export LESS=iM
              find / -xdev -type f -ls | sort -n -k7 | less -S
              then look at the % in less's status line. or type 50% to go to 50% of the file position.
              <= 1k: 45%
              <= 2k: 52%
              <= 3k: 58% (mostly /var/lib/dpkg/info)
              <= 4k: 59%
              <= 6k: 62%
              <= 8k: 64%
              <= 16k: 71% (a lot of kernel modules...)
              <= 32k: 85%
              <= 64k: 93%
              <= 128k: 96%

              > 1M: 0.2% (57 files)

              (I started doing this with find without -type f, and there are lots of small directories (that don't need any blocks outside the inode): < 1k: 59%; < 2k: 64%; < 3k: 68%)

              Every time dpkg upgrades a package, or I even run dpkg -S, it reads /var/lib/dpkg/info/*.list (and maybe more) (although dlocate usually works as a replacement for dpkg -S). This usually takes several seconds when the cache is cold on my current JFS filesystem, which I created ~2 years ago when I installed the system. This is what I currently notice as slow on my root filesystem. JFS is fine with hot caches, e.g. for /lib, /etc, /bin, and so on. But dpkg is always very slow the first time.

              Those small files are probably pretty scattered now, and probably not stored in anything like readdir() order or alphabetical order. I'm hoping XFS will do better than JFS at keeping down fragmentation, although it probably won't. It writes files created at the same time all nearby (it actually tries to make contiguous writes out of dirty data). It doesn't look at where old files in the same directory are stored when trying to decide where to put new files, AFAIK. So I'll probably end up with more scattered files. At least with XFS's batched writeout, mkdir; cp -a info/*; mv ... ; rm -r ...; will work to make a defragged copy of the directory and files in it. (to just defrag the directory, mkdir; ln info/*; That can make readdir order = alphabetical order. Note using *, which expands to a sorted list, instead of using just cp -a, which will operate in readdir order. dpkg doesn't read in readdir order, it goes (mostly?) alphabetically by package name (based on its status file).)
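              Spelled out as commands, the copy-to-defrag trick above looks roughly like this (the directory layout is a throwaway stand-in; on a real system you'd operate on /var/lib/dpkg/info with dpkg idle):

              ```shell
              set -e
              base=$(mktemp -d)                 # stand-in for /var/lib/dpkg
              mkdir "$base/info"
              touch "$base/info/b.list" "$base/info/a.list"
              mkdir "$base/info.new"
              # The shell expands info/* in sorted order, so the fresh directory's
              # entries are created alphabetically, and batched writeout can place
              # the new files near each other on disk.
              cp -a "$base"/info/* "$base/info.new/"
              # Swap the copy into place, then drop the fragmented original.
              mv "$base/info" "$base/info.old"
              mv "$base/info.new" "$base/info"
              rm -r "$base/info.old"
              ```

              Using ln instead of cp in the same pattern rewrites only the directory (readdir order becomes alphabetical) without copying file data.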

              Anyway, I'm considering using a smaller data block size, like -b size=2k or size=1k, (but -n size=8k, I definitely don't want smaller blocks for directories. There are a lot of tiny directories, but they won't waste 8k because there's room in the inode for their data. See directory sizes with e.g. ls -ld. Larger directory block sizes help to reduce directory fragmentation. And most of the directories on my root filesystem that aren't tiny are fairly large. xfs_bmap -v works on directories, too, BTW). XFS is extent-based, so a small block size doesn't make huge block bitmaps even for large files.

              I think I was finding that smaller data block sizes were using more CPU than the default 4k (=max=page size) in hot-cache situations. I compared some results I've already generated, and 1k or 2k does seem slightly faster for untarring the whole FS; drop_caches; tar c | wc -c (so stat+read) ; drop_caches; untar again (overwrite); drop_caches; read some more, timing each component of that. My desktop has been in single-user mode for 1.5 days testing this. I should post my results somewhere when I'm done... And I need to find a good way to explore the 5 (or higher) dimensional data (time as a function of block size, log size, logbuf size, lazy-count=0/1, and deadline vs. cfq, and blockdev --setra 256, 512, or 1024 if I let my tests run that long...).

              BTW, JFS is good, and does use less CPU. That won't reduce CPU wakeups to save power, though. FS code mostly runs when called by processes doing a read(2), or open(2), or whatever. Filesystems do usually start a thread to do async tasks, though. But those threads shouldn't be waking up at all when there's no I/O going on.
              I decided to use JFS for my root FS a couple of years ago after reading:
              I probably would have used XFS, but I hadn't realized that to work around the grub-install issue you just have to boot GRUB from a USB stick or whatever, and type root (hd0,0); setup (hd0). I recently set up a bioinformatics cluster using XFS for root and all other filesystems. It works fine, except that getting GRUB installed is a hassle.

              Also BTW, there's a lot of good reading on, e.g., suggestions for setting up software RAID, and lots of filesystem stuff:

              XFS is wonderful for large files, and has some other neat features. If you download torrents, you usually get fragmented files, because they start sparse and are written in the order the blocks come in. XFS can preallocate space without actually writing it, so you end up with a minimally fragmented file. Azureus has an option to use xfs_io's resvsp command. Linux now has an fallocate(2) system call which should work for XFS and ext4, and posix_fallocate(3) should use it. I'm not sure if fallocate is actually implemented for XFS yet, but I would hope so, since its semantics are the same. And I don't know which glibc version includes an fallocate(2) backend for posix_fallocate(3).
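              As a concrete sketch of the preallocation idea, using the util-linux fallocate(1) front-end to the fallocate(2) call mentioned above (the size and filename are arbitrary; this needs a filesystem that supports fallocate):

              ```shell
              f=$(mktemp)
              # Reserve 1 MiB up front; the filesystem can now pick contiguous
              # extents instead of growing the file write by write.
              fallocate -l 1048576 "$f"
              stat -c '%s' "$f"   # the file's size is now 1048576 bytes
              rm -f "$f"
              ```

              On XFS specifically, xfs_io -c "resvsp 0 1m" file makes the same reservation without changing the visible file size.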
              And xfs has nice tools, like xfs_bmap to show you the fragmentation of any file.
              Last edited by llama; 12-06-2008, 07:09 PM.


              • #47
                Next time consider io_thrash for a benchmark

                io_thrash is open source, well documented, and creates a workload that simulates a high-end transaction processing database engine.

                Disclosure: I manage the product (GT.M) whose team released io_thrash.


                • #48
                  The bonnie++ options used in the benchmarks at:


                  were bonnie++ -n128:128k:0

                  The -n128 means that the test wrote, read, and deleted 128k (131,072) files: first sequentially, then randomly, written to / read from / deleted from the directory.

                  The :128k:0 means that every file had a random size between 128k (131,072 bytes) and zero. So the average file-size was 64k.
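                  As a rough sanity check of what that workload amounts to, here's the arithmetic (assuming the uniform size distribution described above):

                  ```shell
                  files=$((128 * 1024))      # -n128 -> 131072 files
                  avg=$((64 * 1024))         # sizes uniform in [0, 128k] -> ~64 KiB average
                  total=$((files * avg))
                  echo "$total bytes"        # about 8 GiB of file data per pass
                  ```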


                  • #49
                    Originally posted by Kazade View Post
                    I'll be honest, I'm a little confused about using games as a benchmark for a filesystem. Games load resources from the disk before the game play starts, everything from that point on is stored in either RAM or VRAM while the game is in play (unless of course you run out of memory). Only an insane game developer would read or write from the disk during gameplay because it would kill frame rate.

                    If you were timing the loading times (or game saves), fair enough, but using the frame rate as a benchmark seems pointless.
                    Some games certainly do load textures on the fly. Guild Wars is such a game.

                    I think testing game performance isn't a bad idea, but average FPS isn't a good indicator. A utility that works like FRAPS should be used, which will show lowest/highest FPS. The lowest FPS would be the more interesting statistic in a game known to load textures on the fly, even if running under Wine.
                    Last edited by psycho_driver; 12-04-2008, 02:17 PM.


                    • #50
                      Originally posted by kjgust View Post
                      Oh dear.. Well, how can I say this.. You just made me CHOKE on my coffee. Haha, you know, the only time I used ReiserFS it was a bad experience eventually. So even if it is faster, it's definitely not as proven or as reliable as something like EXT3. I personally wouldn't be surprised to see ReiserFS3 removed from the Linux kernel eventually, because from my experience at least, and from what I've heard from others, it's really not that good.
                      I knew Jade would make an appearance in this thread. His obsession with ReiserFS isn't healthy.

                      I've used ReiserFS twice, and both times I had catastrophic filesystem failures within about a year.