
Btrfs Battles EXT4 With The Linux 2.6.33 Kernel


  • #16
    Originally posted by sektion31 View Post
    Oh, thanks for the clarification. I read that reiser4 and btrfs are more similar to each other than to ext3/4, so I assumed they have a similar design idea.

    Just to clarify, the big thing that I took from reiserfs (actually reiserfs v3, which was the one I worked on) was the idea of key/item storage. The btrfs btree uses a very similar key structure to order the items in the tree.

    This is different from ext*, which tends to have specialized block formats for the different types of metadata. Btrfs just tosses things into a btree and lets the tree index them for searching.
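
    As an aside (not from the original post), you can see these keys with the dump-tree command shipped in newer btrfs-progs; every item is printed with its (objectid, type, offset) key. /dev/sdXN is a placeholder device here:
    Code:
    # dump the root tree of a btrfs device; each item line shows its key
    # as (objectid TYPE offset)
    btrfs inspect-internal dump-tree -t root /dev/sdXN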

    -chris



    • #17
      Hi Chris,
      I'm Ric, one of the users excited about btrfs (as geeky as that is).
      Thank you for taking the time to develop it!

      I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with the 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 array was created and then stress tested; IOzone was one of the tools used. An 8 GB file is used to overcome the effects of the installed 4 GB of RAM.
      :

      /usr/lib/iozone/bin/iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[md0-RAID0]_[btrfs]_[9211-8i].xls
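
      For reference, the main flags in that command line mean roughly the following (the meanings match the settings echoed back in the IOzone output later in the thread):
      Code:
      # -L64      processor cache line size (64 bytes)
      # -S1024    processor cache size (1024 KB)
      # -a        auto mode
      # -+u       also report CPU utilization
      # -i0 -i1   run test 0 (write/rewrite) and test 1 (read/reread)
      # -s8G      8 GB test file
      # -r64      64 KB record size
      # -M        include the machine's uname string in the output
      # -f PATH   location of the test file
      # -Rb FILE  generate an Excel-compatible report and write it to FILE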

      (8,388,608 KB file, 64 KB records; results in kB/s)
      Writer: 421,398
      Re-writer: 424,017
      Reader: 321,558
      Re-reader: 324,612

      The others, ext3, ext4, and JFS, fared about the same, except that their READ results were faster and, more importantly, faster than their WRITE results, as would be expected.

      I was a bit short on time then, so I have just now run it with the same IOzone parameters but using the 9211's "Integrated RAID" RAID-0 on a different kernel.
      :
      Code:
      File size set to 8388608 KB
      Record Size 64 KB
      Machine = Linux sm.linuXwindows.hom 2.6.31.6-desktop-1mnb #1 SMP Tue Dec 8 15:
      Excel chart generation enabled
      Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[9211-8i_RAID-0].xls
      Output is in Kbytes/sec:

      "Writer report"
      "64"
      "8388608" 412,433

      "Re-writer report"
      "64"
      "8388608" 417,586

      "Reader report"
      "64"
      "8388608" 391,542

      "Re-Reader report"
      "64"
      "8388608" 393,962

      As you can see, same thing: WRITE is faster than READ even on the IR RAID.
      Something weird is going on... Perhaps it is an IOzone and btrfs issue? If so, the IOzone tests are skewed (the wrong way). I'd blame it on this particular test, but none of the other filesystems had faster WRITEs than READs in the results.

      I have not tried it on an Intel Nehalem platform yet, but I thought you should know something odd was occurring (that is not exhibited by the other filesystems).

      I don't need an explanation or anything like that, but it would be good to know you got the post, if you have the time. I do have the Excel files if needed.

      -Ric

      PS: This is not the first time I have found md to be faster than an HBA or RAID card's own RAID. Distressing, but also very good for us Linux geeks. I wish md were cross-platform.



      • #18
        Originally posted by fhj52 View Post
        [...] I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300)... The others, ext3, ext4, and JFS, fared about the same, except that their READ results were faster and, more importantly, faster than their WRITE results, as would be expected.
        Thanks for giving btrfs a try. Usually when read results are too low it is because there isn't enough read ahead being done. The two easy ways to control readahead are to use a much larger buffer size (10MB for example) or to tune the bdi parameters.

        Btrfs does crcs after reading, and sometimes it needs a larger readahead window to perform as well as the other filesystems. You could confirm this by turning crcs off (mount -o nodatasum).

        Linux uses a bdi (backing dev info) to collect readahead and a few other device statistics. Btrfs creates a virtual bdi so that it can easily manage multiple devices. Sometimes it doesn't pick the right read ahead values for faster raid devices.

        In /sys/class/bdi you'll find directories named btrfs-N where N is a number (1,2,3) for each btrfs mount. So /sys/class/bdi/btrfs-1 is the first btrfs filesystem. /sys/class/bdi/btrfs-1/read_ahead_kb can be used to boost the size of the kernel's internal read ahead buffer. Triple whatever is in there and see if your performance changes.
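
        A minimal sketch of both suggestions together (the device and mount point, /dev/sda16 on /SAS600RAID, are taken from Ric's later posts; adjust to your setup):
        Code:
        # remount without data checksums for the test
        umount /SAS600RAID
        mount -o noatime,nodatasum /dev/sda16 /SAS600RAID

        # triple the kernel readahead window for the first btrfs mount
        old=$(cat /sys/class/bdi/btrfs-1/read_ahead_kb)
        echo $((old * 3)) > /sys/class/bdi/btrfs-1/read_ahead_kb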

        If that doesn't do it, just let me know. Most of the filesystems scale pretty well on streaming reads and writes to a single file, so we should be pretty close on this system.

        -chris



        • #19
          Hi Chris,
          Thanks for the explanation and suggestion.
          Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x, with some Fujitsu U320 disks in RAID 0. The card has 512 MB of BBU cache, and I know of no way to adjust that (unless you meant a kernel adjustment?).
          The results showed the same striking difference as before, only more so:
          "Writer report"
          "64"
          "8388608" 244679

          "Re-writer report"
          "64"
          "8388608" 231935

          "Reader report"
          "64"
          "8388608" 51755

          "Re-Reader report"
          "64"
          "8388608" 50160

          Then I found and used your suggestions: nodatasum, and changing the readahead value from 4096 to 12288 [ *SAS600 type btrfs (rw,noatime,nodatasum) ]. That looks like it definitely improved the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no onboard buffer/cache, but the HDDs have 64 MB each and the adapter is set so the disk cache is on.)

          Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[9211-8i_RAID-0]-[nodatasum_12288_readahead].xls

          "Writer report"
          "64"
          "8388608" 490296

          "Re-writer report"
          "64"
          "8388608" 470194

          "Reader report"
          "64"
          "8388608" 462138

          "Re-Reader report"
          "64"
          "8388608" 458668

          READ is still slower, but not nearly as dramatically.

          The MegaRAID mount was also changed [ PAS320RAID0 type btrfs (rw,noatime,nodatasum) ], but the results did not show improvement. WRITE is still testing as ~4x faster.
          Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[PAS320_MEGARAID-0]-[nodatasum_12288_readahead].xls

          "Writer report"
          "64"
          "8388608" 232943

          "Re-writer report"
          "64"
          "8388608" 230301

          "Reader report"
          "64"
          "8388608" 52251

          "Re-Reader report"
          "64"
          "8388608" 51795

          ...
          The adapters are very different. The MegaRAID is a RAID card for the U320 parallel SCSI interface with a large cache and a BBU, while the 9211 is an HBA for the SAS2 (6 Gbps) interface with no cache or BBU. I'd like to say it is the adapters and SAS vs. SCSI, but the ext4 results indicate otherwise.
          Last week's test,
          iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/linux/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[320-2x_RAID0]_[ext4]-2.xls
          :
          Writer report: 229,468
          Re-writer report: 233,403
          Reader report: 208,436
          Re-reader report: 210,758
          (8,388,608 KB file, 64 KB records; kB/s)

          READ is a bit slower there too... but no drama.
          Like most everybody else, I won't be using parallel SCSI disks much longer, so I put those numbers up just as information for you, in case they are needed.
          ...

          On the up side, man, look at those numbers. The btrfs just walloped ext4 for this test!
          That 490,296 kBps is the fastest I've ever seen here for a WRITE. By all means, please keep up the good work!


          I'll look at the buffering, but with the 9211 HBA there's not much to do about it. Perhaps the disks' cache buffering somehow got turned off between Linux and the MS OS. It should not have, since it is an adapter setting, but the LSI2008/LSI2108 kernel module (mpt2sas) is relatively new... it'll take a while to get the software running to find out.


          -Ric



          • #20
            Originally posted by fhj52 View Post
            [...] Then I found and used your suggestions: nodatasum, and changing the readahead value from 4096 to 12288, and that looks like it definitely improved the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. The MegaRAID mount was also changed, but the results did not show improvement; WRITE is still testing as ~4x faster.
            Thanks for trying this out. nodatasum will improve both writes and reads because it isn't doing the checksum during the write.

            On raid cards with writeback cache (and sometimes even single drives with writeback cache), the cache may allow the card to process writes faster than it can read. This is because the cache gives the drive the chance to stage the IO and perfectly order it, while reads must be done more or less immediately. Good cards have good readahead logic, but this doesn't always work out.

            So, now that we have the kernel readahead tuned (btw, you can try larger numbers in the bdi read_ahead_kb field), the next step is to make sure the kernel is using the largest possible requests on the card.

            cd /sys/block/xxxx/queue where xxxx is the device for your drive. You want the physical device, and if you're using MD you want to do this to each drive in the MD raid set (example cd /sys/block/sda/queue)

            echo deadline > scheduler
            echo 2048 > nr_requests
            cat max_hw_sectors_kb > max_sectors_kb

            Switching to deadline may or may not make a difference; the others are very likely to help.
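
            A small sketch that applies all three settings to every member of an md set (the drive names are placeholders, not from this thread):
            Code:
            for dev in sda sdb sdc sdd; do          # placeholder member drives
                q=/sys/block/$dev/queue
                echo deadline > $q/scheduler
                echo 2048 > $q/nr_requests
                cat $q/max_hw_sectors_kb > $q/max_sectors_kb
            done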

            -chris



            • #21
              Hi Chris,
              I have to adjust the "Wow!" statement of my previous post. I could not get to openSUSE (2.6.32 kernel) last night, but today I did. Without changing the readahead value, just using noatime and nodatasum, a new record here:

              "Writer report"
              "64"
              "8388608" 539,898 kBps

              HS! The interface is only spec'd at 586 MBps ...
              Here's the output:
              Code:
                            KB  reclen   write rewrite    read    reread                                                                             
                       8388608      64  539898  543101   463523   463367
              READ is still quite a lot slower, according to IOzone.

              There is not enough data to conclude that the default readahead value is too small for near-state-of-the-art storage, i.e., SAS2 HDDs and SSDs, but it surely looks that way.
              So I changed the default 4096 to 12288 in the /sys/devices/virtual/bdi/btrfs-*/read_ahead_kb files and ran it again... no love:

              Code:
                            KB  reclen   write rewrite    read    reread                                                                                                                            
                       8388608      64  549614  542850   462666   462772
              I am using the same IOzone parameters and getting basically the same results, so, in order not to appear too crazy,* I changed the run to drop the CPU utilization measurement (it is useless anyway), started unmounting/remounting between each test (there was some indication of a previous cache being used), and set the stride to a smaller value (the RAID uses a 64k stripe).
              I won't bore you with useless data. I tried several strides (1*64, 2*64, ... 192*64) and none of them mattered; READ is about the same.
              I had to stop using the auto unmount & mount function in IOzone, as every time it ran the read_ahead_kb was reset to the default 4096 value. I poked around a little, but my guess is that it is a kernel value I cannot change without rebuilding the kernel or module. I'll look a bit more later...
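
              (One workaround, sketched here rather than taken from the thread, is simply to re-apply the value after each mount instead of rebuilding anything; the device, mount point, and 12288 value are the ones that appear elsewhere in these posts:)
              Code:
              # the btrfs bdi, and its default read_ahead_kb, come back with each mount,
              # so re-apply the larger value right after mounting
              mount -o noatime,nodatasum /dev/sda16 /SAS600RAID
              for f in /sys/class/bdi/btrfs-*/read_ahead_kb; do
                  echo 12288 > "$f"
              done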

              I also tried increasing the read_ahead_kb to 32,768... even to 64 MB! No difference for READ that way either:
              Code:
              Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=128].xls
                            KB  reclen   write rewrite    read    reread 
                       8388608      64  535878  542170   463488   463487
              Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=65536]-[stride=64].xls                         
                            KB  reclen   write rewrite    read    reread
                       8388608      64  536160  542576   440697   445034
              ...

              While composing this I see you posted.
              You're welcome and thank you for the suggestions.

              I will try those suggestions, especially deadline, as I meant to change that and forgot about it. The current scheduler is the default, CFQ.

              I probably should not get too much into the 9211 HBA card specifics, but it is a pretty typical HBA: no cache, and no readahead or writeback of its own.
              It does allow setting the HDD cache on or off, which is a new widget. It was set to on, but I cannot verify that it still is... The LSI Linux software is not only lame but also proprietary, so I cannot fix it.
              I assume the HDD cache is being used, because the boot log indicates the kernel thinks it is enabled:
              Code:
              ... sd 0:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
              I trust Linus et al. more than LSI anyway.
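
              (For anyone who wants to check the drive cache settings directly from Linux rather than through the LSI tools, sdparm can query the SCSI caching mode page; this is an editorial suggestion, not something from the thread:)
              Code:
              # WCE = write cache enable, RCD = read cache disable
              sdparm --get=WCE /dev/sda
              sdparm --get=RCD /dev/sda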


              -Ric


              *crazy: someone who does the same exact thing, the same exact way over and over again and expects a different result each time.



              • #22
                Hey Chris,
                I tried the suggested scheduler (deadline), nr_requests, and max_sectors_kb changes.
                READ is still slower than WRITE by > 70 MBps.
                Code:
                        Auto Mode
                        File size set to 8388608 KB
                        Record Size 64 KB          
                
                        Machine = Linux * 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
                        Excel chart generation enabled                                                                                
                        Command line used: */iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=64].xls                                                                                                                  
                        Output is in Kbytes/sec                                                                                                 
                        Time Resolution = 0.000001 seconds.                                                                                     
                        Processor cache size set to 1024 Kbytes.                                                                                
                        Processor cache line size set to 64 bytes.                                                                              
                        File stride size set to 1 * record size.                                                                                
                                                                            random  random    bkwd   record   stride                                                                                                                                                            
                              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread                                                                                                                         
                         8388608      64  537033  542196   462496   462088                                                                                                                                                                                                      
                
                iozone test complete.
                Just so it is clear, I'm not complaining. I mean, who can complain about 500 MBps +/- 35 MBps?
                I'm trying to assist, so if there is some other way you want this run, just say so. (The IOzone test takes ~1m15s on this new SAS2 setup, so it is painless, especially compared to *ATA and parallel SCSI disks.)
                ...or even some other app, if you think IOzone may be fiddling with the results somehow.

                -Ric



                • #23
                  Originally posted by fhj52 View Post
                  [...] I tried the suggested scheduler (deadline), nr_requests, and max_sectors_kb changes. READ is still slower than WRITE by > 70 MBps. [...] So if there is some other way you want this run, just say so.
                  Thanks for trying this out. I think the best thing to do would be to nail down exactly how fast the device is.

                  dd if=/dev/xxx of=/dev/zero bs=20M iflag=direct count=409

                  /dev/xxx is whatever you built btrfs on top of. This should be a read only benchmark, and since we're running O_DIRECT it removes the kernel readahead from the picture.
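
                  (A small aside for anyone copying the command: of=/dev/null is the more conventional sink for a pure read test, although writing to /dev/zero also works on Linux. For example, with the device name from Ric's later posts:)
                  Code:
                  dd if=/dev/sda of=/dev/null bs=20M iflag=direct count=409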

                  -chris



                  • #24
                    Originally posted by mason View Post
                    [...] I think the best thing to do would be to nail down exactly how fast the device is: dd if=/dev/xxx of=/dev/zero bs=20M iflag=direct count=409
                    Hi, thanks for the post. I am happy to do whatever I can to assist.
                    Btrfs (which I pronounce "better FS") is, or at least has the potential to be, a truly world-class filesystem. I thank you and all the developers for doing the work, and Oracle for funding it. I know it is in Oracle's best interest to have such a filesystem, but making it GPL-licensed... gotta love 'em for at least that.

                    I was involved with other tasks but got to this today.
                    Under openSUSE 11.2 (kernel 2.6.32-3), the SAS2 IR RAID-0 is device sda.
                    Background:
                    Code:
                    #> mount
                    ...
                    /dev/sda16 on /SAS600RAID type btrfs (rw,noatime,nodatasum)
                    #> df
                    ...
                    /dev/sda16   btrfs    339G  104K  339G   1% /SAS600RAID
                    ...
                    #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
                    noop anticipatory [deadline] cfq
                    128
                    4096
                    Without changing the requests:
                    Code:
                    It's 16:47:08 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
                    You are root at { /home }
                    #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
                    409+0 records in
                    409+0 records out
                    8577351680 bytes (8.6 GB) copied, 15.7202 s, 546 MB/s
                    Running it on the partition has a different result
                    Code:
                    #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
                    409+0 records in
                    409+0 records out
                    8577351680 bytes (8.6 GB) copied, 17.7661 s, 483 MB/s
                    Then for proof that it is independent of requests:
                    Code:
                    #> echo deadline > scheduler; echo 2048 > nr_requests;cat max_hw_sectors_kb > max_sectors_kb; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
                    noop anticipatory [deadline] cfq
                    2048
                    4096
                    
                    (ain't bash great, :))
                    
                    It's 17:09:46 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
                    You are root at { /sys/block/sda/queue }
                    #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
                    409+0 records in
                    409+0 records out
                    8577351680 bytes (8.6 GB) copied, 15.7276 s, 545 MB/s
                    and
                    #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
                    409+0 records in
                    409+0 records out
                    8577351680 bytes (8.6 GB) copied, 17.7943 s, 482 MB/s
                    => The same, which is what is expected.

                    [I ran both sets of those several times and the results were about the same each time.]

                    BTW, read_ahead_kb is the default value
                    #> cat /sys/class/bdi/btrfs-1/read_ahead_kb; cat /sys/class/bdi/btrfs-2/read_ahead_kb
                    4096
                    4096
                    and changing them (to 32768) also made no diff, as expected.

                    So, 482-483MBps is the value for the partition.
                    Using everything in the suggested setup, with the increased bdi read_ahead_kb, etc., I ran IOzone again:
                    Code:
                            Auto Mode
                            File size set to 8388608 KB
                            Record Size 64 KB          
                    
                            Machine = Linux sm-opensuse 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
                            Excel chart generation enabled                                                                                
                            Command line used: iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=j2xL=128].xls                      
                            Output is in Kbytes/sec                                                                                                            
                            Time Resolution = 0.000001 seconds.                                                                                                
                            Processor cache size set to 1024 Kbytes.                                                                                           
                            Processor cache line size set to 64 bytes.                                                                                         
                            File stride size set to 2 * record size.
                                                                                
                                  KB  reclen   write rewrite    read    reread    
                             8388608      64  537542  540890   460758   460473
                    READ is a little slower than the target value, but pretty close. I have no clue what the margin of error is for IOzone results...

                    I'm not sure whether this means the IOzone WRITE results are inflated, or the Hitachi algorithm and buffer are doing that great a job for WRITEs... or something else.
                    At the least, it appears that kernel 2.6.32-3 is not helping, or that openSUSE and I have a configuration that is keeping it from helping.

                    If the next thing is to use 2.6.33, I will have to build one; the openSUSE Factory version (for openSUSE 11.3) is broken here... A build is fine, just a little extra time.


                    -Ric



                    • #25
                      Originally posted by fhj52 View Post
                      [...] Under openSUSE 11.2 (kernel 2.6.32-3), the SAS2 IR RAID-0 is device sda.
                      #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
                      8577351680 bytes (8.6 GB) copied, 15.7202 s, 546 MB/s
                      #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
                      8577351680 bytes (8.6 GB) copied, 17.7661 s, 483 MB/s
                      [...] So, 482-483 MBps is the value for the partition.
                      Great, different parts of the drive can perform differently. Or, it could be an alignment issue the write cache is hiding.

                      The easiest way to tell is to do the read test farther down the drive. Where does sda16 start?

                      Let's pretend it starts 500 GB into the drive. You can use rough numbers; we don't need it down to the KB.

                      500 * 1024 / 20 gives us the number of 20MB blocks into the drive that we need to skip to get to 500GB, which is 25600.

                      dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct

                      This will tell us if the problem with sda16 is alignment or not.
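
                      (For reference, the partition's starting offset can be read straight from sysfs, so the skip value can be computed instead of estimated; a small sketch using the sda16 name from the posts above:)
                      Code:
                      start=$(cat /sys/block/sda/sda16/start)       # start of sda16, in 512-byte sectors
                      skip=$((start * 512 / (20 * 1024 * 1024)))    # 20 MB blocks before the partition
                      dd if=/dev/sda of=/dev/null bs=20M skip=$skip count=409 iflag=direct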



                      • #26
                        Hey Chris,
                        I did manage to get 2.6.33 running on Mandriva. I ran some IOzone tests, and the results are basically the same, except both write and read are slower; the gap narrowed a bit as a result.

                        The odd thing is that the ext4 & ext3 partitions tested are now exhibiting the same thing: slower reads than writes.
                        Also, I did a cat of the max_hw_sectors values and they were smaller than the max_sectors_kb.
                        It could be the different companies, Mandriva vs. SUSE, but I'll have to get openSUSE's 2.6.33 running before I can check whether some fiddling was done.
                        I don't follow the kernel changes much anymore. Is somebody making changes to improve writes?

                        ...
                        The partition sits behind ~205 GB on the RAID, i.e., about 35% of the array is a clone of another system drive, followed by the 380+ GB that is formatted as btrfs.

                        The command
                        dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                        will therefore read into the last 10% of the formatted space of the RAID-0.

                        As soon as I can reboot... I'll post the dd results.

                        c-ya,
                        Ric



                        • #27
                          This is the 2.6.31 kernel (Mandriva 2010.0) result:
                          #> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                          409+0 records in
                          409+0 records out
                          8577351680 bytes (8.6 GB) copied, 23.5782 s, 364 MB/s
                          It's 10:29:13 CST (UTC-0600) on Sat Feb 06, week 05 in 2010.



                          • #28
                            Originally posted by mason View Post
                            [...] The easiest way to tell is to do the read test farther down the drive. Where does sda16 start? [...] dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                            #> uname -srv
                            Linux 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
                            (openSUSE 11.2)
                            #> dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                            409+0 records in
                            409+0 records out
                            8577351680 bytes (8.6 GB) copied, 23.613 s, 363 MB/s
                            .
                            #> uname -srv
                            Linux 2.6.33-desktop-0.rc6.1mnb #1 SMP Sat Jan 30 01:00:20 CET 2010
                            (Mandriva 2010.0 w. 2010.1 kernel)
                            #> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                            409+0 records in
                            409+0 records out
                            8577351680 bytes (8.6 GB) copied, 23.612 s, 363 MB/s

                            #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
                            noop anticipatory [deadline] cfq
                            128
                            4096

                            For both,
                            #> cat /sys/devices/virtual/bdi/btrfs-1/read_ahead_kb
                            4096


                            At least they are consistent on all three kernels. ...

                            -Ric



                            • #29
                              I did get the openSUSE 2.6.33 kernel up to init 3, and the dd results were the same: ~364 MBps.

                              ...and Mandriva's kernel is using different default values than openSUSE's.
                              Code:
                              #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb;cat max_sectors_kb
                              noop anticipatory [deadline] cfq
                              128
                              64
                              64
                              -Ric



                              • #30
                                Chris:

                                I won't be able to test with this partition anymore.
                                I don't suppose it matters, since the WRITE-faster-than-READ issue is exhibited on all drive types here, but if you want more tests run, it will have to be elsewhere.

                                BOL

                                -Ric

