Btrfs Battles EXT4 With The Linux 2.6.33 Kernel


  • #21
    Hi Chris,
    I have to adjust the "Wow!" statement in my previous post. I could not get to openSUSE (2.6.32 kernel) last night, but today I did. Without changing the readahead value, but using noatime and nodatasum, I got a new record here:

    "Writer report"
    "64"
    "8388608" 539,898 kBps

    HS!, the interface is only spec'd at 586MBps ...
    Here's the output:
    Code:
                  KB  reclen   write rewrite    read    reread                                                                             
             8388608      64  539898  543101   463523   463367
    READ is still quite a lot slower according to IOzone.

    Not enough data to draw the conclusion that the readahead default value is too small for near state-of-the-art storage, i.e., SAS2 HDD and SSD, but it surely looks that way.
    So I changed the default 4096 to 12288 in the /sys/devices/virtual/bdi/btrfs-*/read_ahead_kb files and ran it again ... no love:

    Code:
                  KB  reclen   write rewrite    read    reread                                                                                                                            
             8388608      64  549614  542850   462666   462772
    I am using the same IOzone parameters and getting basically the same results, so, to not appear too crazy*, I changed a few things: dropped the CPU utilization report (it is useless anyway ...), started an unmount/mount between each test (there was some indication of the previous cache being used), and set the stride to a smaller value (the RAID uses a 64k stripe).
    I won't bore you with useless data. I tried several strides (1*64, 2*64, ... 192*64) and none mattered. READ is about the same.
    I had to stop using the auto unmount & mount function in IOzone because every time it ran, read_ahead_kb was reset to the default 4096 value. I poked around a little, but my guess is that it is a kernel value I cannot change w/o rebuilding the kernel or module. I'll look a bit more later. ...
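
Since the value resets on every remount, one workaround is to re-apply it after each mount. This is a minimal sketch, assuming the sysfs path quoted above. The function name and the optional base-directory argument are made up for illustration, and writing to sysfs needs root.

```shell
# Write a read_ahead_kb value (in KB) to every btrfs bdi entry.
# $1 = value in KB
# $2 = base directory (hypothetical parameter; defaults to the sysfs path
#      mentioned in the post)
set_btrfs_readahead() {
    kb="$1"
    base="${2:-/sys/devices/virtual/bdi}"
    for f in "$base"/btrfs-*/read_ahead_kb; do
        # Skip when the glob matched nothing or the file is not writable.
        [ -w "$f" ] || continue
        echo "$kb" > "$f"
    done
}

# Example (as root, after each mount of the btrfs volume):
#   set_btrfs_readahead 12288
```

Running this after each mount, outside of IOzone, would keep the remount between tests without losing the setting.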

    I also tried increasing read_ahead_kb to 32,768 ... even 64MB! No difference for READ that way either:
    Code:
    Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=128].xls
                  KB  reclen   write rewrite    read    reread 
             8388608      64  535878  542170   463488   463487
    Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=65536]-[stride=64].xls                         
                  KB  reclen   write rewrite    read    reread
             8388608      64  536160  542576   440697   445034
    ...

    While composing this I see you posted.
    You're welcome and thank you for the suggestions.

    Will try those suggestions, esp. the deadline as I meant to change that and forgot about it. Current scheduler is the default, CFQ.

    I probably should not get too much into the 9211 HBA card specifics, but it is a pretty typical HBA: no cache, and no readahead or writeback.
    It does allow setting the HDD cache on or off, which is a new widget. It was set to on, but I cannot verify it still is. ... LSI's Linux software is not only lame but also proprietary => I cannot fix it.
    I assume the HDD cache is being used because the boot log indicates the kernel thinks it is enabled:
    Code:
    ... sd 0:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    I trust Linus et al more than LSI anyway.


    -Ric


    *crazy: someone who does the same exact thing, the same exact way over and over again and expects a different result each time.



    • #22
      Hey Chris,
      I tried the suggestions of scheduler(deadline), nr_requests and hw_sectors changes.
      READ is slower than WRITE by > 70MBps.
      Code:
              Auto Mode
              File size set to 8388608 KB
              Record Size 64 KB          
      
              Machine = Linux * 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100   Excel chart generation enabled
              Excel chart generation enabled                                                                                
              Command line used: */iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=64].xls                                                                                                                  
              Output is in Kbytes/sec                                                                                                 
              Time Resolution = 0.000001 seconds.                                                                                     
              Processor cache size set to 1024 Kbytes.                                                                                
              Processor cache line size set to 64 bytes.                                                                              
              File stride size set to 1 * record size.                                                                                
                                                                  random  random    bkwd   record   stride                                                                                                                                                            
                    KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread                                                                                                                         
               8388608      64  537033  542196   462496   462088                                                                                                                                                                                                      
      
      iozone test complete.
      Just so it is clear, I'm not complaining. I mean, who can complain about 500MBps ± 35MBps?
      I'm trying to assist, so if there is some other way you want this run, just say so. (The IOzone test takes ~1m15s on this new SAS2 setup, so it is painless, especially compared to *ATA and PAS disks.)
      ...or even some other app, if you think IOzone might be fiddling with the results somehow.

      -Ric



      • #23
        Originally posted by fhj52 in #22
        Thanks for trying this out. I think the best thing to do would be to nail down exactly how fast the device is.

        dd if=/dev/xxx of=/dev/zero bs=20M iflag=direct count=409

        /dev/xxx is whatever you built btrfs on top of. This should be a read only benchmark, and since we're running O_DIRECT it removes the kernel readahead from the picture.

        -chris
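
For reference, the arithmetic behind those dd parameters can be sketched as below. The function names are made up for illustration, and the link between count=409 and the 8G IOzone file size (-s8G) is an assumption, not something stated in the thread.

```shell
# Number of whole bs-sized blocks that fit into a target size (both in MiB).
dd_count() {
    echo $(( $1 / $2 ))
}

# Total bytes dd reads for a given block size (MiB) and block count.
dd_bytes() {
    echo $(( $1 * 1048576 * $2 ))
}

dd_count 8192 20    # 8 GiB in 20 MiB blocks: prints 409
dd_bytes 20 409     # prints 8577351680
```

So bs=20M with count=409 reads just under 8 GiB, which keeps the raw-device test close in size to the IOzone runs.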



        • #24
          Originally posted by mason in #23
          Hi, thanks for the post. I am happy to do whatever I can to assist.
          Btrfs (which I pronounce "better f s") is, or at least has the potential to be, a truly world-class fs. I thank you and all the developers for doing the work and Oracle for funding it. I know it is in Oracle's best interest to have such a thing, but making it GPL-licensed ... gotta love 'em for at least that.

          I was involved with other tasks but got to this today.
          Under openSUSE 11.2 (kernel 2.6.32-3) the SAS2 IR RAID-0 is device sda.
          Background:
          Code:
          #> mount
          ...
          /dev/sda16 on /SAS600RAID type btrfs (rw,noatime,nodatasum)
          #> df
          ...
          /dev/sda16   btrfs    339G  104K  339G   1% /SAS600RAID
          ...
          #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
          noop anticipatory [deadline] cfq
          128
          4096
          Without changing the requests:
          Code:
          It's 16:47:08 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
          You are root at { /home }
          #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
          409+0 records in
          409+0 records out
          8577351680 bytes (8.6 GB) copied, 15.7202 s, 546 MB/s
          Running it on the partition has a different result
          Code:
          #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
          409+0 records in
          409+0 records out
          8577351680 bytes (8.6 GB) copied, 17.7661 s, 483 MB/s
          Then for proof that it is independent of requests:
          Code:
          #> echo deadline > scheduler; echo 2048 > nr_requests;cat max_hw_sectors_kb > max_sectors_kb; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
          noop anticipatory [deadline] cfq
          2048
          4096
          
          (ain't bash great, :))
          
          It's 17:09:46 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
          You are root at { /sys/block/sda/queue }
          #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
          409+0 records in
          409+0 records out
          8577351680 bytes (8.6 GB) copied, 15.7276 s, 545 MB/s
          and
          #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
          409+0 records in
          409+0 records out
          8577351680 bytes (8.6 GB) copied, 17.7943 s, 482 MB/s
          => The same, which is what is expected.

          [ ran both sets of those several times and results were ~same each time]

          BTW, read_ahead_kb is the default value
          #> cat /sys/class/bdi/btrfs-1/read_ahead_kb; cat /sys/class/bdi/btrfs-2/read_ahead_kb
          4096
          4096
          and changing them (to 32768) also made no difference, as expected.

          So, 482-483MBps is the value for the partition.
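
The MB/s figures in the dd summaries can be re-derived from the byte count and elapsed time. A small sketch, with a made-up function name, assuming dd's "MB" means 10^6 bytes (consistent with 8577351680 bytes being shown as 8.6 GB):

```shell
# Decimal MB/s (10^6 bytes per second) from a byte count and elapsed
# seconds, rounded to the nearest integer.
dd_mbps() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.0f\n", bytes / secs / 1000000 }'
}

dd_mbps 8577351680 15.7202   # whole device (sda):  prints 546
dd_mbps 8577351680 17.7661   # partition (sda16):   prints 483
```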
          Using everything in the suggested setup, with increased bdi read_ahead_kb, etc., I ran IOzone again:
          Code:
                  Auto Mode
                  File size set to 8388608 KB
                  Record Size 64 KB          
          
                  Machine = Linux sm-opensuse 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100   Excel chart generation enabled
                  Excel chart generation enabled                                                                                
                  Command line used: iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=j2xL=128].xls                      
                  Output is in Kbytes/sec                                                                                                            
                  Time Resolution = 0.000001 seconds.                                                                                                
                  Processor cache size set to 1024 Kbytes.                                                                                           
                  Processor cache line size set to 64 bytes.                                                                                         
                  File stride size set to 2 * record size.
                                                                      
                        KB  reclen   write rewrite    read    reread    
                   8388608      64  537542  540890   460758   460473
          READ is a little slower than the target value but pretty close. I have no clue what the margin of error is for IOzone results ...
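
One thing to watch when comparing these numbers with the dd result is units. dd reports decimal MB/s, while IOzone reports "Kbytes/sec". Assuming IOzone's Kbyte is 1024 bytes, as the 8388608 KB figure for an 8G file suggests, a rough conversion looks like this (the function name is hypothetical):

```shell
# Convert a rate in KiB/s (1024-byte units) to decimal MB/s (10^6 bytes),
# with one decimal place.
kib_to_mb() {
    awk -v kib="$1" 'BEGIN { printf "%.1f\n", kib * 1024 / 1000000 }'
}

kib_to_mb 460758   # IOzone read:  prints 471.8
kib_to_mb 537542   # IOzone write: prints 550.4
```

On that assumption the IOzone read figure is about 472 MB/s, roughly 2% below the 482-483 MB/s that dd measured on the partition.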

          I'm not sure if this means the WRITE IOzone results are inflated or if the Hitachi algorithm and buffer are doing that great of a job for WRITEs, ...or something else.
          At the least it appears that kernel 2.6.32-3 is not helping or I and openSUSE have a config that is keeping it from helping.

          If the next thing is to use 2.6.33, I will have to build one. openSUSE factory version(for openSUSE 11.3) is broken (here) ... A build is fine; Just a little extra time.


          -Ric



          • #25
            Originally posted by fhj52 in #24
            Great, different parts of the drive can perform differently. Or, it could be an alignment issue the write cache is hiding.

            The easiest way to tell is to do the read test farther down the drive. Where does sda16 start?

            Let's pretend it starts 500GB into the drive. You can use rough numbers, we don't need it down to the KB.

            500 * 1024 / 20 gives us the number of 20MB blocks into the drive that we need to skip to get to 500GB, which is 25600.

            dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct

            This will tell us if the problem with sda16 is alignment or not.
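
The skip calculation above generalizes to any starting offset. A minimal sketch, with a made-up function name, using the same rough convention of 1024 MB per GB:

```shell
# Number of bs-sized blocks to skip to reach a given offset.
# $1 = offset into the drive in GB, $2 = dd block size in MB
skip_blocks() {
    echo $(( $1 * 1024 / $2 ))
}

skip_blocks 500 20   # prints 25600
```

The result goes straight into skip= on the dd command line, with bs= set to the same block size.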



            • #26
              Hey Chris,
              Did manage to get 2.6.33 running on Mandriva. Ran some IOzone tests and the results are basically the same, except both write and read are slower. The gap narrowed a bit as a result.

              The odd thing is that the ext4 & ext3 partitions tested are now exhibiting the same thing: slower reads than writes.
              Also, I did a cat of the max_hw_sectors and they were smaller than the max_sectors_kb.
              It could be the different companies, Mandriva -v- SUSE, but I'll have to get the 2.6.33 from openSUSE running before I can check it to see if some fiddling was done.
              I don't follow the kernel changes much anymore. Is somebody making changes to improve writes?

              ...
              The partition is behind ~ 205GB on the RAID, i.e., about 35% is a clone of another system drive, then the 380+GB that is formatted as btrfs.

              The
              dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
              will be into the last 10% of the formatted space of the RAID-0.

              As soon as I can reboot ... I'll post the dd results.

              c-ya,
              Ric



              • #27
                This is the 2.6.31 kernel(Mandriva 2010.0) result.
                #> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                409+0 records in
                409+0 records out
                8577351680 bytes (8.6 GB) copied, 23.5782 s, 364 MB/s
                It's 10:29:13 CST (UTC-0600) on Sat Feb 06, week 05 in 2010.



                • #28
                  Originally posted by mason in #25
                  #> uname -srv
                  Linux 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
                  (openSUSE 11.2)
                  #> dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                  409+0 records in
                  409+0 records out
                  8577351680 bytes (8.6 GB) copied, 23.613 s, 363 MB/s
                  .
                  #> uname -srv
                  Linux 2.6.33-desktop-0.rc6.1mnb #1 SMP Sat Jan 30 01:00:20 CET 2010
                  (Mandriva 2010.0 w. 2010.1 kernel)
                  #> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
                  409+0 records in
                  409+0 records out
                  8577351680 bytes (8.6 GB) copied, 23.612 s, 363 MB/s

                  #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
                  noop anticipatory [deadline] cfq
                  128
                  4096

                  For both,
                  #> cat /sys/devices/virtual/bdi/btrfs-1/read_ahead_kb
                  4096


                  At least they are consistent on all three kernels. ...

                  -Ric



                  • #29
                    I did get the openSUSE 2.6.33 kernel up to init3 and the dd results were the same: ~ 364MBps.

                    ...and Mandriva's kernel is using different default values than openSUSE.
                    Code:
                    #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb;cat max_sectors_kb
                    noop anticipatory [deadline] cfq
                    128
                    64
                    64
                    -Ric
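
In the scheduler line above, the name in square brackets is the active I/O scheduler. When comparing queue settings across several kernels or distros, it can help to pull that name out programmatically. A small sketch (the function name is made up for illustration):

```shell
# Extract the active scheduler (the bracketed name) from the contents
# of a /sys/block/<dev>/queue/scheduler file.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "noop anticipatory [deadline] cfq"   # prints deadline

# On a live system:
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```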



                    • #30
                      Chris:

                      I won't be able to test with this partition anymore.
                      I don't suppose it matters since the WRITE faster than READ issue is exhibited on all drive types here but if you want more tests run, it will have to be elsewhere.

                      BOL

                      -Ric

