Benchmarking ZFS On FreeBSD vs. EXT4 & Btrfs On Linux


  • #21
    For desktop use, I am happy using ext4. If distros decide to enable btrfs by default later, that's fine, too. None of the desktop systems I've built use RAID; if a machine has two hard disks, I just use them as two independent disks. I usually have a different operating system on each anyway, which makes RAIDing them impossible (in the case of a Windows and Linux dual boot). Sure, my desktop does a lot of I/O, but what I really care about is read performance (for load times); the only really heavy writing I do is when installing software, which is a one-time hit per application. For the smaller writes that occur as part of normal application usage (such as maintaining configuration settings in a SQLite database), what I want is responsiveness: blocking on the app side should be absolutely minimal, so the app can continue with what it's doing.

    I tend to back up the data I really care about, so industrial-strength data integrity is less important. But read performance... man... I boot and reboot all the time, testing things and installing software that requires a reboot. Loading apps grinds the disk like crazy. Booting my main Ubuntu desktop (GNOME) takes almost as long as loading the desktop on my loaded-down Windows 7. This is a simple consequence of having a very large number of programs installed, some of which start things when you log in to your session, and all those icons have to be read from disk, etc. I want the best read performance for the desktop, and it looks like ext4 is it.

    I also run a dedicated server with four 1.5TB HDDs. I have been seriously debating which file system (and indeed, which operating system) to use for this server. For isolation and security, I have a policy of having an absolutely minimal host OS on the server. The host OS should be as clean and reliable as possible. It needs to set up networking, bring up SSH, and start the guest OSes that contain the actual services. It's a multi-purpose server, with different people in control of sometimes an entire guest instance, but I have a fairly good idea of exactly what software runs in each guest, even when it's under the control of someone else.

    The server has a Core i7 975 @ 3.33 GHz, 12GB DDR3, the aforementioned 4 x 1.5TB 7200rpm HDDs, and a 100Mbps symmetrical uplink sitting on a Level3 Frankfurt, Germany backbone.

    Currently, the host OS is Fedora 13, but with most of the default programs purged. For security and bugfixes, I have periodic planned downtimes where I update the core packages on the host from the Fedora repos, as well as compile the latest stable Linux kernel. I have a .config that I migrate from version to version and I carefully consider each option, to cut down on the number of modules that are built, disable potential security risk features like Kprobes and /dev/mem, etc.
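    The config audit described above can be sketched roughly like this (the option list is an illustrative subset, not my real one):

```python
# Minimal sketch of auditing a migrated kernel .config: after `make oldconfig`,
# confirm that security-sensitive options stayed disabled. The RISKY set here
# is an illustrative subset, not the full list.
RISKY = {"CONFIG_KPROBES", "CONFIG_DEVMEM", "CONFIG_DEVKMEM"}

def enabled_risky(config_text: str) -> set:
    """Return risky options that are set to y or m in kernel .config text."""
    found = set()
    for line in config_text.splitlines():
        line = line.strip()
        if "=" in line and not line.startswith("#"):
            name, value = line.split("=", 1)
            if name in RISKY and value in ("y", "m"):
                found.add(name)
    return found

sample = "CONFIG_KPROBES=y\n# CONFIG_DEVMEM is not set\nCONFIG_EXT4_FS=y\n"
print(enabled_risky(sample))  # a non-empty set means fix the .config before building
```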

    Since some of my users want to run services that require very fast packet I/O (real-time FPS gaming), I use a voluntary preemption model, but I stick with the 100 Hz timer. I haven't had any complaints about responsiveness with these settings.

    For guest isolation I use Linux-VServer. Obviously this means I have to patch my kernel with the latest Linux-VServer patches. Therefore I can't update my kernel until upstream Linux-VServer maintainers release the VServer patch against the latest stable kernel. The wait is usually not long.

    The interesting part about Linux-VServer is that it has zero guest overhead, because it's just a container solution. All the guests run the same kernel as the host, but each guest environment is "private": the host OS can poke into the guests, but guests can't poke at each other. This is true even though all the guests share the same filesystem (as far as the host and the disk are concerned). Something similar to chroot is used for filesystem isolation, though presumably more robust. Guests can't load kernel modules, but I don't care; they don't need to. The biggest advantage for me is that a single filesystem serves all the guests, which is faster than running a filesystem on top of another filesystem (as you normally do with full virtualization: the guest's image is stored as a file on the host filesystem, and within that image is the guest filesystem).

    I also run a KVM (Kernel Virtual Machine) guest, Windows Server 2008 R2 Standard. This is to support one of my users whose in-house server software is currently not ported to run on Linux. Obviously this is slower than Linux-VServer, but the performance is acceptable for the apps running in there.

    Now, for the relevant stuff regarding this article.

    Currently, I am using the Linux md (Multiple Devices) RAID5 subsystem, with LVM2 on top of that and ext4 on top of that. Reads are fast, as expected, but write speed is terrible! More importantly, the CPU usage from the md RAID5 kernel thread is simply out of this world when any significant amount of writing is taking place. And since the work happens in a kernel thread, it basically blocks everything else out while the writes take place.

    This isn't a problem for small writes, but copying a 5GB file can effectively suspend all networking and make the entire box unresponsive (yes, even from a guest) while the RAID5 thread calculates all the parity bits. That makes me really uneasy about the stability of the system, even though very large writes don't happen often enough to make a difference in practice.

    I have tried many md tweaks to resolve this, but it seems to be an inherent misfeature of Linux md RAID5.
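    For reference, the tweaks in question are of this general shape (array name and values are examples only; this sketch records the writes instead of performing them, since the real thing needs root):

```python
# Sketch of the kind of md RAID5 tuning attempted above: grow the stripe
# cache (so parity work batches better) and cap resync bandwidth (so
# rebuilds don't starve the guests). Array name and values are examples,
# not recommendations.
def tune_md_raid5(write, array="md0", stripe_cache_pages=8192, resync_max_kbps=20000):
    # stripe_cache_size is in pages; speed_limit_max is in KB/s
    write(f"/sys/block/{array}/md/stripe_cache_size", str(stripe_cache_pages))
    write("/proc/sys/dev/raid/speed_limit_max", str(resync_max_kbps))

# Dry run: record what would be written instead of touching sysfs.
planned = []
tune_md_raid5(write=lambda path, value: planned.append((path, value)))
for path, value in planned:
    print(path, "<-", value)
```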

    Right now, I don't think I am in a position to move to OpenSolaris (because I am not confident in Oracle's willingness to continue to support it), nor FreeBSD (because I am not confident in FreeBSD's ability to act as a virtualization host for *any* full OS virt solution). So, although I really want the data integrity features and predictable performance of ZFS, I can't switch to an operating system that supports ZFS at production quality level (ZFS-FUSE is not production quality).

    That leaves btrfs. I know it doesn't support RAID5; maybe I can use its RAID1+0 instead. But btrfs is still experimental! I would be somewhat timid putting it on my server now. I consider myself a fairly agile sysadmin; I'm not the sort who runs RHEL and only uses the filesystem that is recommended by default by RHEL. But I am also not going to take chances with my data. If the btrfs authors themselves don't claim their filesystem is stable, how can I deploy it on my server?

    My opinion on this seems to change every week. One week, I will get a swell of confidence in OpenSolaris and say that my plan is to migrate to OSOL as my host, use VirtualBox or Xen to virtualize Windows, and use Zones to replace Linux-VServer. The next week, I will say FreeBSD is the best, because it supports ZFS as well as OSOL does, but at least the people maintaining it are still around. Then again, I will come back and say: what if I ever need to deploy binary-only Linux software, or open source software that uses something unique to the Linux kernel? So then I think that maybe I should stay with Linux; but I really, really dislike Multiple Devices, and btrfs isn't ready.

    So, I have been in a holding pattern for about 6 months on this. I know that I really need to get away from Linux md RAID5; that I can (and should) move to RAID 1+0; and that my safest bet is probably to wait for btrfs to mature, as this requires the minimum amount of reconfiguration and learning new things. And I can continue to reap the benefits of the Linux kernel's performance, as demonstrated in every benchmark comparing Linux against BSD and Solaris. I also don't want to give up Linux-VServer, since it has served me very well to date, and KVM really speeds up network performance in the Windows guest thanks to the virtio-net drivers.
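    The capacity cost of that RAID5-to-RAID1+0 move is easy to put numbers on for the four 1.5TB disks:

```python
# What moving from RAID5 to RAID1+0 costs in capacity on the server's
# 4 x 1.5TB disks. Straightforward arithmetic for the standard layouts;
# no assumptions beyond those.
DISKS, SIZE_TB = 4, 1.5

raid5_usable = (DISKS - 1) * SIZE_TB    # one disk's worth of space goes to parity
raid10_usable = DISKS // 2 * SIZE_TB    # mirrored pairs: half the raw space

print(f"RAID5  usable: {raid5_usable} TB (survives any 1 disk failure)")
print(f"RAID10 usable: {raid10_usable} TB (survives 1 failure per mirror pair)")
print(f"cost of switching: {raid5_usable - raid10_usable} TB")
```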

    I'm still very undecided on this, and I'd love to know what others think about my predicament. My inaction on this will probably either result in the RAID array getting degraded and then a catastrophic data loss; or I will just have to deal with repeated instances (every few months) of the server appearing to "go down" when some user decides to copy a multi-gigabyte file. What I want to do is jump ship just before the ship I'm on sinks into the water, and land feet first on a much sturdier ship. It's critical that I time this right: if I migrate too soon, I might end up with an unsupported OS (if Oracle pulls the plug on OpenSolaris), or a buggy filesystem that's either too slow or loses data (if I use btrfs before the kinks are worked out). And if I use FreeBSD, I'm really venturing into the unknown, because I have no idea how I am going to support containers there, or how to virtualize Windows there. Not to mention my only substantial sysadmin work so far has been on OpenSolaris, Fedora and Ubuntu; I've barely touched any BSD system.


    • #22
      Originally posted by RahulSundaram View Post

      Yes. Btrfs does provide complete data integrity and not just metadata integrity by default. Aside from potential filesystem bugs, this is part of the design.
      That is cool. But there is a long way from wanting Btrfs to provide complete data integrity to Btrfs actually delivering it in practice. It is not as easy as just adding some checksums here and there; it is much more difficult. Today, all these years after the first hard drive was sold, ZFS is the only common filesystem whose data integrity has actually been validated by computer science researchers in published papers. No one else has succeeded.

      ZFS also has a good track record of providing safety; there are lots of success stories. For example, when a network card at CERN(?) was faulty and injected bad bits into the data stream, ZFS on another server immediately detected it. CERN is now migrating to ZFS machines, away from Linux hardware RAID.

      So when you say that Btrfs offers complete data integrity, that is very doubtful, because there are many strange situations that Btrfs needs to be able to handle, and we don't know if it does. It is like saying "yes, Btrfs is totally guaranteed bug free"; that is impossible to promise. So I don't agree with the web page you link to. Maybe complete data integrity is only a goal of Btrfs, but it is a good goal.

      I read somewhere (can't find the link now; can someone link to the interview I mean?) that the Btrfs developer said he was convinced to add data integrity to Btrfs only after Sun talked a lot about it; before that, he didn't understand its importance. So Btrfs was not designed from scratch to offer data integrity; it is an afterthought and an add-on. ZFS, on the other hand, was designed from scratch to offer data integrity.

      Sun and Solaris developers have decades of experience with enterprise storage, and they were the first to notice the need for data integrity and the problems that arise in enterprise server halls. Other filesystems followed Sun/Solaris; Sun showed the way. Doing complete data integrity correctly requires vast experience. There appears to be only one full-time Btrfs developer, according to your link; does he have the required experience, and a large catalogue of all the problems that have occurred in storage halls? I doubt it. Complete data integrity is extremely difficult to get right.

      But even if Btrfs does not succeed in providing complete data integrity, it is better that Btrfs gives us some rudimentary protection than no protection at all. Sure, XFS, JFS, ReiserFS, etc. also have checksums all over the place to protect data, but in formal academic studies their protection turned out to be very bad in reality.

      In short, it remains to be seen how good Btrfs's data integrity really is. It is good, though, that people are becoming aware of the silent corruption problem and are demanding data integrity. The more filesystems that provide data integrity, the better for us users. Exciting times indeed!


      • #23
        Originally posted by d2kx View Post
        Performance wise, btrfs destroyed the competition.
        I haven't looked at the benchmarks, but I am convinced you are correct.

        But instead of benchmarking performance, if we benchmarked data safety, I am sure ZFS would destroy them all. Btrfs is not ready yet; I doubt it offers good data protection. Read my long post above.

        Performance is not that important. Is your data safe? No? Then it is your choice to use a fast but data-corrupting filesystem! I prefer to use a safe filesystem.


        • #24
          Originally posted by sbergman27 View Post
          btrfs is impressive. Chris judiciously selected his feature-set for technical reasons, as opposed to silly marketing reasons. I mean, come on, a 128 bit filesystem? What was Sun thinking? That's a bullet point for glossies aimed at managers who don't understand exponents. (Disk storage space has been increasing at a uniform rate of about a bit every 1.5 years for the 23 years that I've been watching. And Linus would probably have a brain hemorrhage if presented with a set of patches for a 128 bit filesystem.) Pragmatic design decisions have allowed btrfs development to progress with remarkable agility.
          Hmmm... You haven't read about ZFS, I guess? I bet you and Linus T. were among the guys who said "no one needs more than 640KB of RAM".

          "Logically, the next question is if ZFS' 128 bits is enough. The answer is no. If Moore's Law holds, in 10 to 15 years people will need a 65th bit. As a 128-bit system, ZFS is designed to support more storage, more file systems, more snapshots, more directory entries, and more files than can possibly be created in the foreseeable future."

          128 bits suffice for all future needs; 64 bits do not. If you tried to fill ONE single 128-bit filesystem, writing that many bits would take so much energy that you could literally boil all the oceans on Earth. Has humanity ever created that much energy? Here are the calculations which show we will never, ever need more than 128 bits:

          (It is like using one atom to store one bit. There are only about 10^79 atoms in the entire universe, so a filesystem never needs to handle more bits than that. Now, 2^128 is far less than that number, but it is still a huge number that humanity will never need.)

          I am glad Sun engineers plan further ahead than some other people.
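          The magnitudes involved are easy to check (the 10^79 atom count is the rough figure cited above; the rest is plain integer arithmetic):

```python
# Comparing the magnitudes discussed above. The 10^79 atom count is the
# rough figure cited in the post; everything else is exact integer math.
bits_64 = 2**64            # address space of a 64-bit filesystem
bits_128 = 2**128          # address space of a 128-bit filesystem
atoms = 10**79             # rough atom count of the observable universe, as cited

print(f"2^64  ~ {bits_64:.2e}")
print(f"2^128 ~ {bits_128:.2e}")
assert bits_128 // bits_64 == bits_64   # going from 64 to 128 bits squares the space
assert bits_128 < atoms                 # yet 2^128 is still far below 10^79
```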


          • #25
            Originally posted by KDesk View Post
            I would like to know more about this, can you provide me some more info or links?
            First of all, there are lots of stories where people lost data, for instance the CERN study. There are several links here:

            Also, computer science researchers have tested common filesystems (XFS, JFS, ReiserFS, ext3, NTFS, etc.), injecting faulty bits to see how well the errors were handled. The result was depressing: they could not repair all the errors. They could not even detect all the errors! How can you repair an error you do not detect?
            (Linked article: "56% of data loss due to system & hardware problems" - Ontrack: "Data loss is painful and all too common. Why?")

            Then researchers stressed ZFS in the same way, and ZFS succeeded in detecting all the errors. It could not repair them all, because the ZFS setup under test had no RAID; you need a second disk to fetch a good copy of a bad bit. The important part is detecting all errors: if you know there is an error, you can repair it. ZFS detects all errors:
            (Linked article: "File systems are supposed to protect your data, but most are 20-30 year old architectures that risk data with every I/O. The open source ZFS from Sun/Oracle claims high data integrity - and now that claim has been independently tested.")

            Silent corruption is more common than you think.
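            The detect-versus-repair distinction is easy to illustrate with a toy model (an invented illustration, not ZFS internals): a checksum alone detects a corrupted block, but repairing it needs a second, redundant copy.

```python
# Toy model of detection vs repair: a stored checksum detects a flipped
# bit in any single copy, but only a redundant copy allows repair.
# All names here are invented for illustration; this is not ZFS code.
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def read_with_repair(copies, expected):
    """Return (block, index) of the first copy matching the checksum."""
    for i, block in enumerate(copies):
        if checksum(block) == expected:
            return block, i   # a good copy exists: the bad one can be rewritten
    return None, None         # corruption detected, but nothing left to repair from

good = b"important data"
bad = b"imp0rtant data"       # simulated silent bit corruption
csum = checksum(good)

# Single disk: detection works, repair is impossible.
print("single disk:", read_with_repair([bad], csum))
# Mirror: the intact second copy satisfies the read.
print("mirror:", read_with_repair([bad, good], csum))
```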


            • #26
              Originally posted by cjcox View Post
              ZFS means a LOT to Solaris users, because before ZFS, logical volume management and SW RAID meant Solstice DiskSuite (aka Solaris Volume Manager). And while it worked "ok", it was a real mess and had lots of frustrating limitations.

              ZFS looks REALLY interesting if you DO NOT look at LVM and other *ix OS's solutions. So for a Solaris person who knows the pains of SVM, ZFS looks absolutely grand!

              With regards to "feature" differences, I'd recommend just looking at the features from both sites. From my own experience, ZFS users tend to be "blind" and unwilling to look at other technologies... they've decided that they're #1. With regards to Linux distributions, I'd say greater than 50% of the so called "top" engineers I've met from Sun haven't used Linux since Red Hat 7.2.

              Btrfs is an IMPORTANT filesystem... like ZFS. Btrfs means enterprise like features, like ZFS. However, unlike ZFS, you will be able to use it with a Linux distribution (arguably, you could port it to your own Linux today... you just can't distribute that custom edition).

              So... ZFS, IMHO, doesn't bring too much to the table... UNLESS you're a Sun SVM + UFS user... then it's fantastic! I'm serious, UFS and SVM are ugly and have been very, very, very, very, very, very problematic in the past. ZFS came late for Sun and a bit late for the *ix world and N/A when it comes to a Linux distribution.

              I'm not saying that ZFS isn't ok... it is ok. But the future belongs to things like btrfs... NOT ZFS.
              You say that ZFS only looks good next to the bad old Solaris solutions, and that compared with the Linux solutions it doesn't bring much to the table. That is not correct.

              In comparison with anything, ZFS is the best. If you read the stories, you see that lots of Linux companies switch to OpenSolaris just because of ZFS. ZFS is the best thing since sliced bread, they say; far better than anything any other OS can offer, including Linux.

              Linux company switches:

              "When it comes to storing data, you?ll pry OpenSolaris (and ZFS) out of our cold dead hands. We won?t deploy databases on anything else."

              There are lots of other stories.
              (Linked post: "So, at the $DAYJOB, we were faced with building a large operational data store. Large has many meanings to many people. I've written about this before, but I'll reiterate the scope: > 1TB data, thousands of tables, several tables with around one billion rows. So, for a variety of reasons, we chose PostgreSQL. I've written about that choice a few times, but didn't write about the choice to use Solaris.")

              Linux is really bad as a large enterprise server; not because of the filesystems, but because of limitations in the Linux kernel:
              (Linked article: "I am frequently asked by potential customers with high I/O requirements if they can use Linux instead of AIX or Solaris. No one ever asks me about" ...)


              • #27
                Thanks for the articles, but they seem a little outdated: they are from 2008, and btrfs wasn't developed then.

                "Linux is really bad as a Large Enterprise server. It is not because of the bad filesystems, but because of limitations in the Linux kernel"

                There are numerous stories about senior consultants with 20+ years of experience being wrong. I am not saying this guy is one of them, but his data is from 2008 too, and he doesn't show any benchmarks or quantitative results. Michael's test showed that btrfs was superior on threaded I/O, at least on a single desktop HDD.

                Btrfs is mainly being developed because ZFS was clearly superior (features and data integrity), and only now (2010) is btrfs beginning to show some maturity. So new data and benchmarks on large enterprise storage hardware are needed to settle those points.


                • #28
                  Good old kebabbert, repeating the same things over and over again 50 times.

                  Btrfs will provide just as much data integrity as ZFS, since that was one of the goals. The only difference is that ZFS is a bit more mature since it's been in use longer. Give Btrfs another couple years and it will be ready for serious use. Looks like it will be ready for the desktop this fall.


                  • #29
                    Linux is really bad as a Large Enterprise server. It is not because of the bad filesystems, but because of limitations in the Linux kernel



                    • #30
                      Originally posted by kebabbert View Post
                      "Logically, the next question is if ZFS' 128 bits is enough. The answer is no. If Moore's Law holds, in 10 to 15 years people will need a 65th bit.
                      Moore's law is about transistor density on silicon; it has nothing to do with disk space. (It would apply if we all moved to SSD storage, but that move will be a big speed bump in the march of disk size increases.)

                      That said, at the current geometric rate of expansion, the move from 32 to 64 bits represents about 48 years. Some applications today absolutely need more than 32 bits, so let's be *very* generous and say that the largest applications today have a 48-bit requirement. Assuming that expansion continues at the historical rate (and that's a big if; it likely won't), we have about 25 years before anyone at all would care about the 64-bit "barrier".

                      An increase in bitness today, beyond 64, is not totally ridiculous or beyond the pale. But 128 *is* ridiculous. That's just a big waste of memory and processor. (Unsurprisingly, memory and processor use are ZFS's two main points of suckiness today.) Why design a filesystem today that sacrifices performance today in order to scale to sizes that we won't care about for 120 years? Does anyone really think that ZFS will be around in 120 years?! More likely, management and marketing like the idea of being able to claim 128 bitness as a bullet point in their glossies.

                      You, Kebabbert, are apparently one of the folks I was referring to who do not understand exponential growth. Exactly the sort of person who might be impressed by such a claim in a Sun/Oracle glossy.

                      Or at least you have not bothered to do the simple arithmetic that demonstrates how silly ZFS's 128 bitness really is.
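                      That arithmetic, using this post's own assumptions (one additional bit of required address space every 1.5 years, and a generous 48-bit requirement for today's largest applications):

```python
# The "simple arithmetic" behind the post above, using its own assumptions:
# required address space gains one bit roughly every 1.5 years, and today's
# largest applications need about 48 bits. Both figures come from the post.
YEARS_PER_BIT = 1.5

def years_until(target_bits, current_bits):
    """Years of growth needed to go from current_bits to target_bits."""
    return (target_bits - current_bits) * YEARS_PER_BIT

print(years_until(64, 32))    # the 32 -> 64 bit jump spans ~48 years of growth
print(years_until(64, 48))    # ~24 years before a 48-bit app hits the 64-bit wall
print(years_until(128, 48))   # ~120 years before 128 bits would matter at all
```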