Btrfs Battles EXT4 With The Linux 2.6.33 Kernel

  • #11
    Originally posted by intgr View Post
    Ironically enough, compressed reiser4 would blow everything else out of the water in these benchmarks.
    Well, I don't know about that. I've been doing a Java port of my company's IDE and language, working with a 4.9 GB database (real customer shipping data), and it seems here, whether on an SSD or a regular HD, that btrfs is edging out r4.



    • #12
      Originally posted by intgr View Post
      [...] compressed reiser4 would blow everything else out of the water in these benchmarks.
      None of these safer* fs "blow everything else out of the water" ... ext2 would probably come the closest, but who wants to race boats without at least a lifejacket, so that when one gets tossed into the water, at least there's a chance of survival.


      *safer as in journaled or CoW (or something) to get the data back when errors knock the fs out of whack.



      • #13
        Originally posted by deanjo View Post
        Well, I don't know about that. I've been doing a Java port of my company's IDE and language, working with a 4.9 GB database (real customer shipping data), and it seems here, whether on an SSD or a regular HD, that btrfs is edging out r4.
        I'm not talking about real-world benchmarks, I'm talking about these synthetic benchmarks that Phoronix used in this article, which only write a bunch of zeroes to the disk. They just aren't adequate for benchmarking compressed file systems (whether reiser4 or btrfs).

        Trust me, the one thing reiser4 is really good at is compressing zeroes.
        Originally posted by fhj52 View Post
        None of these safer* fs "blow everything else out of the water" ... ext2 would probably come the closest
        (Even though you completely missed my point.) You would think so, but most of the time it's not actually true. Modern journaling file systems are much better tuned than old unsafe file systems (ext2, UFS).

        In fact, for a random-write workload, CoW is pretty much the ideal file system layout, because it turns random writes into sequential ones.
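
        A hedged way to see this effect with the same tool the article used: -i 0 runs the initial write pass that the -i 2 random read/write pass requires, -r 4 uses 4 KB records, and the file size and path here are arbitrary placeholders.

        iozone -i 0 -i 2 -r 4 -s 1g -f /mnt/test/iozonefile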



        • #14
          " real-world benchmarks " == oxymoron



          I, and I think anyone, will agree that using zeros is not 'real-world', of course, but nevertheless it is a baseline, which I think is what the author/tester was aiming to get (despite the tests being run on an unstable kernel, unstable btrfs, and ext4 with, IMO, dubious stability).

          Maybe the compression test(s), at least, could be better. It would be, I think, more constructive to suggest how to achieve something closer to end-user (desktop & server) usage rather than waste BW discussing effectively dead or old fs that have neither journal nor CoW safety nets.

          ...



          • #15
            iozone write parameters

            I'm Chris Mason, one of the btrfs developers. Thanks for taking the time to benchmark these filesystems!

            Someone forwarded me the iozone parameters used, and it looks like they have iozone doing 1K writes, which is less than the Linux page size (4K on x86 and x86-64 systems).

            One way that btrfs is different from most other filesystems is that we never change pages while data is being written to the disk. When the application is doing 1k writes, each page is modified 4 times.

            If the kernel decides to write the page somewhere in the middle of those four writes, ext4 will just change the page while it is being written. This happens often as the kernel tries to find free pages by writing dirty pages.

            Btrfs will wait for the write to complete, and then because btrfs does copy on write, it will allocate a new block for the new write and write to the new location. This means that we are slow because we're waiting for writes and we're slow because we fragment the file more.

            On my test machine, switching from 1K writes to 4K writes increases btrfs write throughput from 72MB/s to 85MB/s.

            Numbers from another tester, all btrfs:

            iozone -r1 (1k writes) 20MB/s
            iozone -r4 (4k writes) 64MB/s
            iozone -r64 (64k writes) 84MB/s

            In practice, most people doing streaming writes like this use much larger buffer sizes (1MB or more). They often also use O_DIRECT.
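
            As a hedged illustration of the kind of large streaming write described above (dd's bs, count, and oflag=direct options are standard; the target path and sizes are placeholders):

            dd if=/dev/zero of=/mnt/btrfs/streamfile bs=1M count=8192 oflag=direct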

            -chris



            • #16
              Originally posted by sektion31 View Post
              Oh, thanks for the clarification. I read that reiser4 and btrfs are more similar to each other than to ext3/4, so I assumed they have a similar design idea.

              Just to clarify, the big thing that I took from reiserfs (actually reiserfs v3, which was the one I worked on) was the idea of key/item storage. The btrfs btree uses a very similar key structure to order the items in the btree.

              This is different from the ext* filesystems, which tend to have specialized block formats for different types of metadata. Btrfs just tosses things into a btree and lets the tree index them for searching.

              -chris



              • #17
                Hi Chris,
                I'm Ric, one of the users excited about the btrfs filesystem (as geeky as that is).
                Thank you for taking the time for development!

                I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with the 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 was created and then stress tested. IOzone was one of the tools used. An 8GB file is used to overcome the effects of the installed 4GB RAM.
                :

                /usr/lib/iozone/bin/iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[md0-RAID0]_[btrfs]_[9211-8i].xls

                (8,388,608 kB file)
                Writer Report
                421,398 kBps

                Re-Writer Report
                424,017 kBps

                Reader Report
                321,558 kBps

                Re-Reader Report
                324,612 kBps

                Others, ext3, ext4, & JFS, fared about the same, but READs were faster and, more importantly, faster than WRITEs, as would be expected.

                I was a bit short on time then, so I just now ran it with the same IOzone parameters but using the 9211's "Integrated RAID" RAID-0 on a different kernel.
                :
                File size set to 8388608 KB
                Record Size 64 KB
                Machine = Linux sm.linuXwindows.hom 2.6.31.6-desktop-1mnb #1 SMP Tue Dec 8 15:
                Excel chart generation enabled
                Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[9211-8i_RAID-0].xls
                Output is in Kbytes/sec :

                "Writer report"
                "64"
                "8388608" 412,433

                "Re-writer report"
                "64"
                "8388608" 417,586

                "Reader report"
                "64"
                "8388608" 391,542

                "Re-Reader report"
                "64"
                "8388608" 393,962

                As you can see, same thing: WRITE is faster than READ even on the IR RAID.
                Something weird is going on... Perhaps it is an IOzone & btrfs issue? If so, the IOzone tests are skewed (...the wrong way). I'd blame it on this test, but none of the other fs had faster WRITEs than READs in the results.

                I have not tried it on an Intel Nehalem platform yet, but I thought you should know something odd was occurring (that is not exhibited by the other fs).

                I don't need an explanation or anything like that, but it would be good to know you got the post, if you have the time. I do have the Excel files if needed.

                -Ric

                PS: This is not the first time I found md to be faster than an HBA's or RAID card's RAID. Distressing, but also very good for us Linux geeks. ...wish it (md) were cross-platform.



                • #18
                  Originally posted by fhj52 View Post
                  Hi Chris,
                  I'm Ric, one of the users excited about the btrfs filesystem (as geeky as that is).
                  Thank you for taking the time for development!

                  I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with the 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 was created and then stress tested. IOzone was one of the tools used. An 8GB file is used to overcome the effects of the installed 4GB RAM.
                  :

                  /usr/lib/iozone/bin/iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[md0-RAID0]_[btrfs]_[9211-8i].xls

                  (8,388,608 kB file)
                  Writer Report
                  421,398 kBps

                  Re-Writer Report
                  424,017 kBps

                  Reader Report
                  321,558 kBps

                  Re-Reader Report
                  324,612 kBps

                  Others, ext3, ext4, & JFS, fared about the same, but READs were faster and, more importantly, faster than WRITEs, as would be expected.
                  Thanks for giving btrfs a try. Usually when read results are too low, it is because there isn't enough readahead being done. The two easy ways to control readahead are to use a much larger buffer size (10MB, for example) or to tune the bdi parameters.

                  Btrfs verifies CRCs after reading, and sometimes it needs a larger readahead window to perform as well as the other filesystems. You could confirm this by turning CRCs off (mount -o nodatasum).
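
                  A hedged sketch of trying that option (the device node and mountpoint are placeholders for your own):

                  umount /mnt/btrfs
                  mount -o nodatasum /dev/sdX /mnt/btrfs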

                  Linux uses a bdi (backing dev info) to collect readahead and a few other device statistics. Btrfs creates a virtual bdi so that it can easily manage multiple devices. Sometimes it doesn't pick the right readahead values for faster raid devices.

                  In /sys/class/bdi you'll find directories named btrfs-N, where N is a number (1, 2, 3) for each btrfs mount. So /sys/class/bdi/btrfs-1 is the first btrfs filesystem. /sys/class/bdi/btrfs-1/read_ahead_kb can be used to boost the size of the kernel's internal readahead buffer. Triple whatever is in there and see if your performance changes.
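
                  For instance, a minimal sketch of that tuning (the btrfs-1 name follows the scheme above; the starting value is whatever your system reports, so check it first):

                  cat /sys/class/bdi/btrfs-1/read_ahead_kb
                  echo 12288 > /sys/class/bdi/btrfs-1/read_ahead_kb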

                  If that doesn't do it, just let me know. Most of the filesystems scale pretty well on streaming reads and writes to a single file, so we should be pretty close on this system.

                  -chris



                  • #19
                    Hi Chris,
                    Thanks for the explanation and suggestion.
                    Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x with some Fujitsu U320 disks in RAID 0. The card has 512MB of BBU cache ...no way I know of to adjust that. [...unless you meant a kernel adjustment?]
                    The results were strikingly different, as before but more so:
                    "Writer report"
                    "64"
                    "8388608" 244679

                    "Re-writer report"
                    "64"
                    "8388608" 231935

                    "Reader report"
                    "64"
                    "8388608" 51755

                    "Re-Reader report"
                    "64"
                    "8388608" 50160

                    Then I found & used your suggestion of nodatasum and changed the 4096 value to 12288 for the readahead [ *SAS600 type btrfs (rw,noatime,nodatasum) ], and that definitely improved the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no buffer/cache onboard, but the HDDs have 64MB and the adapter is set so disk cache is on.)

                    Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[9211-8i_RAID-0]-[nodatasum_12288_readahead].xls

                    "Writer report"
                    "64"
                    "8388608" 490296

                    "Re-writer report"
                    "64"
                    "8388608" 470194

                    "Reader report"
                    "64"
                    "8388608" 462138

                    "Re-Reader report"
                    "64"
                    "8388608" 458668

                    Still slower READ but not nearly as dramatic.

                    The MegaRAID mount was also changed [ PAS320RAID0 type btrfs (rw,noatime,nodatasum) ], but the results did not show improvement. WRITE is still testing as ~4x faster.
                    Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[PAS320_MEGARAID-0]-[nodatasum_12288_readahead].xls

                    "Writer report"
                    "64"
                    "8388608" 232943

                    "Re-writer report"
                    "64"
                    "8388608" 230301

                    "Reader report"
                    "64"
                    "8388608" 52251

                    "Re-Reader report"
                    "64"
                    "8388608" 51795

                    ...
                    The adapters are a lot different. The MegaRAID is a RAID card for U320 parallel SCSI with a large cache & BBU, while the 9211 is an HBA for the SAS2 (6Gbps) interface with no cache or BBU. I'd like to say it is the adapters and SAS-vs-SCSI, but the ext4 results indicate otherwise.
                    Last week's test,
                    iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/linux/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[320-2x_RAID0]_[ext4]-2.xls
                    :
                    Writer Report
                    64
                    8388608 229468
                    Re-writer Report
                    64
                    8388608 233403
                    Reader Report
                    64
                    8388608 208436
                    Re-reader Report
                    64
                    8388608 210758

                    It is a bit slower too for READ ... but no drama.
                    Like most everybody else, I won't be using PAS disks much longer, so I put those numbers up there just as information for you, in case they're needed.
                    ...

                    On the up side, man, look at those numbers. Btrfs just walloped ext4 in this test!
                    That 490,296 kBps is the fastest I've ever seen here for a WRITE. By all means, please keep up the good work!


                    I'll look at the buffering, but with the 9211 HBA there's not much to do for it. Perhaps the buffering with the disks' cache somehow got turned off between Linux and the MS OS. It should not have, as it is an adapter setting, but the LSI2008 and LSI2108 kernel module (mpt2sas) is relatively new. ...it'll take a while to get the SW running to find out.


                    -Ric



                    • #20
                      Originally posted by fhj52 View Post
                      Hi Chris,
                      Thanks for the explanation and suggestion.
                      Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x with some Fujitsu U320 disks in RAID 0. The card has 512MB of BBU cache ...no way I know of to adjust that. [...unless you meant a kernel adjustment?]
                      The results were strikingly different, as before but more so:
                      "Writer report"
                      "64"
                      "8388608" 244679

                      "Re-writer report"
                      "64"
                      "8388608" 231935

                      "Reader report"
                      "64"
                      "8388608" 51755

                      "Re-Reader report"
                      "64"
                      "8388608" 50160

                      Then I found & used your suggestion of nodatasum and changed the 4096 value to 12288 for the readahead [ *SAS600 type btrfs (rw,noatime,nodatasum) ], and that definitely improved the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no buffer/cache onboard, but the HDDs have 64MB and the adapter is set so disk cache is on.)

                      Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[9211-8i_RAID-0]-[nodatasum_12288_readahead].xls

                      "Writer report"
                      "64"
                      "8388608" 490296

                      "Re-writer report"
                      "64"
                      "8388608" 470194

                      "Reader report"
                      "64"
                      "8388608" 462138

                      "Re-Reader report"
                      "64"
                      "8388608" 458668

                      Still slower READ but not nearly as dramatic.

                      The MegaRAID mount was also changed [ PAS320RAID0 type btrfs (rw,noatime,nodatasum) ], but the results did not show improvement. WRITE is still testing as ~4x faster.
                      Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[PAS320_MEGARAID-0]-[nodatasum_12288_readahead].xls

                      "Writer report"
                      "64"
                      "8388608" 232943

                      "Re-writer report"
                      "64"
                      "8388608" 230301

                      "Reader report"
                      "64"
                      "8388608" 52251

                      "Re-Reader report"
                      "64"
                      "8388608" 51795

                      ...
                      The adapters are a lot different. The MegaRAID is a RAID card for U320 parallel SCSI with a large cache & BBU, while the 9211 is an HBA for the SAS2 (6Gbps) interface with no cache or BBU. I'd like to say it is the adapters and SAS-vs-SCSI, but the ext4 results indicate otherwise.
                      Last week's test,
                      iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/linux/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[320-2x_RAID0]_[ext4]-2.xls
                      :
                      Writer Report
                      64
                      8388608 229468
                      Re-writer Report
                      64
                      8388608 233403
                      Reader Report
                      64
                      8388608 208436
                      Re-reader Report
                      64
                      8388608 210758

                      It is a bit slower too for READ ... but no drama.
                      Like most everybody else, I won't be using PAS disks much longer, so I put those numbers up there just as information for you, in case they're needed.
                      ...

                      -Ric
                      Thanks for trying this out. nodatasum will improve both writes and reads because it isn't doing the checksum during the write.

                      On raid cards with writeback cache (and sometimes even single drives with writeback cache), the cache may allow the card to process writes faster than it can read. This is because the cache gives the drive the chance to stage the IO and perfectly order it, while reads must be done more or less immediately. Good cards have good readahead logic, but this doesn't always work out.
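
                      One hedged way to check whether a drive's write cache is actually on (hdparm's -W flag with no value queries the write-caching setting; the device node is a placeholder):

                      hdparm -W /dev/sda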

                      So, now that we have the kernel readahead tuned (btw, you can try larger numbers in the bdi read_ahead_kb field), the next step is to make sure the kernel is using the largest possible requests on the card.

                      cd to /sys/block/xxxx/queue, where xxxx is the device for your drive. You want the physical device, and if you're using MD you want to do this for each drive in the MD raid set (for example, cd /sys/block/sda/queue).

                      echo deadline > scheduler
                      echo 2048 > nr_requests
                      cat max_hw_sectors_kb > max_sectors_kb

                      Switching to deadline may or may not make a difference; the others are very likely to help.
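
                      A hedged sketch of applying those three settings across the members of an MD set (sda and sdb are placeholders for your member drives):

                      for dev in sda sdb; do
                          q=/sys/block/$dev/queue
                          echo deadline > $q/scheduler
                          echo 2048 > $q/nr_requests
                          cat $q/max_hw_sectors_kb > $q/max_sectors_kb
                      done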

                      -chris

