
ZFS vs EXT4 - ZFS wins


  • ZFS vs EXT4 - ZFS wins

    I have said this numerous times and now I am able to back my claims with results

    When given multiple dedicated disks, ZFS is considerably faster than EXT4. The notable exceptions are:
    a) No-Sync/Fsync tests - meaningless in any case other than for academic purposes.
    b) Very simple configurations, such as a single disk, a single HW Raid volume, or a simple mirror.

    The reason for this is that testing of software cannot be hardware blind. Limited hardware will always make simpler software look better. All the light-weight desktop environments are a case in point: on older systems with limited hardware resources, the heavy-weights simply do not have the leg room to come into their own.

    Similarly, ZFS has a lot of features which are really good when you have the right hardware, but on limited hardware the reduced performance may outweigh the benefits, depending on your requirements.

    Two points that I must raise:

    Firstly: testing ZFS on a solid-state disk eliminates its ability to hide the latencies of physically rotating disk drives. Don't get me wrong: SSDs are not bad for ZFS; on the contrary, traditional hard disk drives are bad for EXT2/3/4.

    Secondly: I watched CPU and run-queue while I performed the tests. With ZFS there was typically around 80% CPU idle; with EXT4 it varied around the 66% mark. This is on an otherwise idle system.

    The importance of this is two-fold. Firstly, the CPU was not the bottleneck in this case, running at 2Gbps. I will soon replace the disk subsystem with one that supports 8Gbps. The disks/busses will then be better able to keep up with the requests, which means that the CPU will have to work harder to keep the disks fed with requests. If everything scales as expected, EXT4 will run into a CPU bottleneck first.

    Secondly, even with the existing disk/FC configuration, if one adds a workload there will be less CPU time available for the file system. ZFS will then suffer less.

  • #2
    Hardware info states that you have a kernel derived from 2.6.32, which makes me wonder how recent the ext4 code you're running is. Back when the actual 2.6.32 kernel was released, ext4 was still in its infancy. I know enterprise distros backport some things to older kernels, but this benchmark doesn't look representative of the current state of ext4 as it is now.

    Could you also provide the ZFS code version you're using? It's out-of-tree code so I can't even guesstimate it based on the kernel version.
    Last edited by Shnatsel; 25 April 2013, 05:08 PM.


    • #3
      Software versions

      This server is running RHEL 6.4 with current patches as well as ZFS-on-Linux 0.6.1 using the ZFS-on-Linux team's repository.

      RHEL is in fact quite conservative about kernel versions, but I do NOT believe that current EXT4 versions are much faster. Very often the opposite is true - bug fixes very often close shortcuts that were taken, and as a result introduce extra logic, extra precautions, extra work for the system. Of course some bugs are performance bugs, e.g. where sloppy code was used or where someone discovers a smarter algorithm for the same thing.

      How can I check the versions of the mdadm, lvm and ext4 drivers? (Sorry for my lack of knowledge but the bulk of my experience is Unix, not Linux.)
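      For reference, the userland tool and loaded kernel-module versions (though not the backported-patch state) can at least be read like this; the `modinfo` check only works where ext4 is built as a module:

```shell
mdadm --version                              # mdadm userland version
lvm version                                  # LVM2 tool and library versions
modinfo ext4 2>/dev/null | grep -i '^ver'    # ext4 module version info, if built as a module
cat /sys/module/zfs/version                  # ZFS-on-Linux module version
```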


      • #4
        Originally posted by hartz View Post
        I have said this numerous times and now I am able to back my claims with results
        First of all, thank you for your time and effort in doing these benchmarks and sharing your results. Thanks a lot!
        ZFS benchmarks seem very rare and I admit, doing them properly is difficult due to the high number of variables involved.

        Thus, may I ask you to clarify: I would like to know more about the disks used. What disks were used in what layout? All 22 mentioned disks at once? In one large ZFS pool / mdadm config? So a mirror would be 11 drives mirroring the other 11? How about Raid5 or RaidZ? How was the Raid10 laid out? What disk controller was used? Did you use separate ZIL / cache devices?

        To what percentage was the pool / raid filled during the test? Was fragmentation an issue?

        Would be glad if you could clarify.


        • #5
          Disk configuration

          I decided that separate ZIL / ARC etc. caches would be an unfair advantage for ZFS. I wanted to compare like-for-like, functionality-wise, as closely as possible.

          So for EXT4 I used Ext4 on LVM on mdadm-raid on multipathd. mdadm provides the striping/data protection. LVM provides functionality such as snapshots and volume grow/create/resize.

          For ZFS I used ZFS directly on multipathd. ZFS provides all the functionality.

          The disks are connected via both ports on a QLogic 2632 dual-port card, connected directly (No switch involved) to an EMC CX-310. The Clariion provides 10 x single-disk LUNs (As close as I can get to a JBoD) as well as a single 5-disk RAID-5 LUN, used in the HW tests.

          All of the JBoD LUNs are in Tray0, the Raid5 LUN is in Tray1.

          Phoronix-Test-Suite reports 10 + 10 disks, but this is not accurate. I suspect that it is confused by seeing the disks both via multipathing (/dev/mapper/mpath{a..j}) and also in one of the several other locations, e.g. /dev/sd*, /dev/disk/*/* ... I don't know which exactly.

          Ditto for the HW-Raid5 LUN.

          The CX310 only supports 2Gbps connection speed. I will re-do the tests with a VNX 5300 at full 8Gbps some time next week when I get access to that storage.

          OK, What else? Oh the actual configuration for each test:

          - JHMultidisk-ZFS-RaidZ-JBoD (5-disk Raid-Z)

          zpool create -o ashift=12 POOL raidz /dev/mapper/mpath{a..e}

          - JHMultidisk-ZFS-Raid10 (4 pairs)

          zpool create -o ashift=12 POOL mirror /dev/mapper/mpath[ab] mirror /dev/mapper/mpath[cd] mirror /dev/mapper/mpath[ef] mirror /dev/mapper/mpath[gh]

          - JHMultidisk-ZFS-Mirror-JBod

          zpool create -o ashift=12 POOL mirror /dev/mapper/mpath[ab]

          - JHMultidisk-EXT4-LVM-mdRaid5-JBoD (5-disk Raid-5)

          mdadm --create /dev/md2 --level=5 --raid-devices=5 /dev/mapper/mpath{f..j}

          - JHMultidisk-EXT4-LVM-mdMirror

          mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/mapper/mpath[ab]

          - JHMultidisk-EXT4-LVM-HWRaid5

          The arrays were destroyed and re-created between tests, so the only data on them was the test data.

          The mount command for ZFS was:
          zfs create -o atime=off -o mountpoint=/test_mountpoint POOL/test

          The mount command (process) for Ext4 was:
          pvcreate /dev/md2; vgcreate testgroup /dev/md2; lvcreate -l xxxx -n ext4vol testgroup
          mkfs -t ext4 /dev/mapper/testgroup-ext4vol
          mount -o noatime /dev/mapper/testgroup-ext4vol /test_mountpoint

          I have actually completed a test for ZFS on the HW-raid LUN. The performance was dismal, but I accidentally put the EXT4 "description" on the test and I don't know how to fix that, so it is not part of the group at the moment.

          I have a few more interesting ideas for tests:

          Ext4 on a zVol

          ZFS Raid-Z2 with all 10 drives (I will do this test now)
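          For the record, these two follow-ups would look something like the sketch below, following the same conventions as the commands above; the pool/volume names and the zvol size are placeholders, not the actual test configuration:

```shell
# 10-disk Raid-Z2 across all JBoD LUNs
zpool create -o ashift=12 POOL raidz2 /dev/mapper/mpath{a..j}

# Ext4 on a zVol: carve a block device out of the pool, then format and mount it
zfs create -V 100G POOL/ext4vol
mkfs -t ext4 /dev/zvol/POOL/ext4vol
mount -o noatime /dev/zvol/POOL/ext4vol /test_mountpoint
```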

          P.S. I have been trying to get an mdadm raid-10 test done but failing dismally... The system crashes when I try to run pvcreate, and I then have to disconnect the FC cables to get it to complete the boot-up. I have basically given up on mdadm raid-10 testing.


          • #6
            New Kernel

            I just patched again and got a few updates, including a new kernel.

            # uname -a
            Linux emc-grid 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux


            • #7
              Some notes/observations on the results

              1. When running ZFS with sync=disabled I actually get worse performance.

              2. I did not get better results when running ZFS with checksum=off. The results were the same to within the margin of error. The conclusion is that the bottleneck is not the calculation of checksums. This will likely be different on a system where the CPU is actually being stressed.

              3. Disabling file access time tracking is done to make the tests run quicker. I did some initial tests to check its impact and it appears to be about 10%, consistently, for both ZFS and Ext4.

              4. I once ran a test on Ext2 (I forgot to add the -t ext4 flag to mkfs). It is remarkably faster than ext4!!!

              5. In case anybody wants to know: I googled and googled and did not find a documented way to make fs-mark run against a specified mountpoint. What I did was replace the "-s scratch" in the file /root/.phoronix-test-suite/test-profiles/pts/fs-mark-1.0.0/test-definition.xml with -s /test_mountpoint/scratch ... I wish the test would just prompt you for a file system/directory to test. In any case, the test results report only the file system type on which the directory /root/.phoronix-test-suite/installed-tests/pts/fs-mark-1.0.0/fs_mark-3.3/scratch/ resides, irrespective of what you are actually testing.
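              For points 1 and 2, these properties can be toggled per dataset between runs, e.g. on the dataset from the earlier zfs create:

```shell
zfs set sync=disabled POOL/test    # point 1: bypass synchronous write semantics
zfs set checksum=off  POOL/test    # point 2: skip block checksum calculation
# restore the pool defaults afterwards
zfs inherit sync POOL/test
zfs inherit checksum POOL/test
```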


              • #8
                RaidZ2 test added to results

                Now including ZFS RaidZ2 over 10 disks



                • #9
                  ext2 being faster than ext4 is not surprising because ext2 doesn't have journaling. You'd have to disable journaling on ext4 for a fair comparison.
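                  For such a comparison, the journal can be left out at mkfs time or stripped from an existing (unmounted) ext4 volume; e.g. on the LV from the earlier setup:

```shell
mkfs.ext4 -O ^has_journal /dev/mapper/testgroup-ext4vol   # create ext4 without a journal
# or remove the journal from an existing, unmounted filesystem
tune2fs -O ^has_journal /dev/mapper/testgroup-ext4vol
```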

                  Also, I see you bypass LVM in your ZFS setup. This makes me wonder where you actually get the speed boost from: the fact that ZFS bypasses LVM and provides all the functionality at the filesystem level, i.e. you're comparing LVM/mdadm-raid with the relevant part of ZFS, or something to do with the filesystem itself, i.e. you're comparing ext4 with ZFS. I personally suspect the former, because ext4 beats ZFS in single-disk benchmarks.

                  I'm not aware of any way to look up the filesystem version used in your setup. There's no filesystem-specific versioning in the kernel, and even if there were, it would be useless because enterprise distributions cherry-pick patches in no particular order. So to get an idea of your ext4 state you'd have to get a list of all ext4 patches that were ever mainlined, figure out which of those have been backported to your kernel, and the lists of applied and unapplied patches would be your ext4 version. It may be easier to install the latest mainline kernel and test the latest ext4 than to figure out what combination of patches your enterprise kernel has.

                  Thanks for conducting the benchmarks, the results are very interesting indeed!


                  • #10

                    ZFS provides all the file system and volume/space/redundancy/management features that you get from mdadm+LVM+Ext4 and more. Much more, in fact, but let me not get distracted - it is worth a discussion thread on its own.

                    I did not follow a formal documented process. I suggested this testing long ago (somewhere on these forums, perhaps) but never had access to the right hardware, so now that I managed to get the hardware I could not skip the opportunity.

                    So the test objective is basically to do a ZFS vs Ext4 performance comparison with the following in mind:
                    a) Follow best practices, particularly around configuring ZFS with "whole disks", ashift correctly configured, etc.
                    b) Since the storage management aspect of ZFS is such a major, integrated component, I wanted to utilize it fully. That means multiple dedicated hard drives.
                    c) Since all the extra features of ZFS could be considered "bloat" when they are not being used, I wanted to level the playing field by activating the Linux native equivalents, which means adding the extra layers for LVM and mdadm.
                    d) Proper performance testing requires that all system resources be monitored while running the tests. Ideally these should be logged and analyzed to identify the bottlenecks.

                    The test environment is a server with sufficient disks to allow "complex" storage configurations - multi-pathed access to disks, striping and redundancy, etc.

                    I don't have the skill or time to properly analyze the performance statistics. I will record SAR stats when I run the tests again on the faster disks, and I will take detailed configuration snapshots, recording more detail on how the disks are configured for each test (e.g. zpool status, zpool get all, mdadm --detail, etc.) as well as start times for each test run.
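                    One simple way to capture those SAR stats is to log samples to a binary file for the duration of a run and replay it afterwards (the file name here is just an example):

```shell
# sample all activity every 10 seconds into a binary file, in the background
sar -o /tmp/sar-run.bin 10 > /dev/null 2>&1 &
# ... run the benchmark ...
kill %1
# replay later: CPU utilisation and per-device disk stats
sar -u -f /tmp/sar-run.bin
sar -d -f /tmp/sar-run.bin
```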

                    Maybe somebody will take the time to look at this all :-)

                    I am open to suggestions as to what to capture and/or how to perform the tests, but please keep in mind that this must fit into my other work demands and the system needs to go out to the customer once I have finished testing / ensured that it is OK for use in production, so my time is not unlimited. :-(