Btrfs Battles EXT4 With The Linux 2.6.33 Kernel


  • fhj52
    replied
    Chris:

    I won't be able to test with this partition anymore.
    I don't suppose it matters, since the write-faster-than-read issue shows up on all drive types here, but if you want more tests run, they will have to be done elsewhere.

    BOL

    -Ric


  • fhj52
    replied
    I did get the openSUSE 2.6.33 kernel up to init 3, and the dd results were the same: ~364 MB/s.

    ...and Mandriva's kernel is using different default values than openSUSE's.
    Code:
    #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb;cat max_sectors_kb
    noop anticipatory [deadline] cfq
    128
    64
    64
    -Ric


  • fhj52
    replied
    Originally posted by mason
    Great, different parts of the drive can perform differently. Or, it could be an alignment issue the write cache is hiding.

    The easiest way to tell is to do the read test farther down the drive. Where does sda16 start?

    Let's pretend it starts 500GB into the drive. You can use rough numbers, we don't need it down to the KB.

    500 * 1024 / 20 gives us the number of 20MB blocks into the drive that we need to skip to get to 500GB, which is 25600.

    dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct

    This will tell us if the problem with sda16 is alignment or not.
    #> uname -srv
    Linux 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
    (openSUSE 11.2)
    #> dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 23.613 s, 363 MB/s
    .
    #> uname -srv
    Linux 2.6.33-desktop-0.rc6.1mnb #1 SMP Sat Jan 30 01:00:20 CET 2010
    (Mandriva 2010.0 w. 2010.1 kernel)
    #> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 23.612 s, 363 MB/s

    #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
    noop anticipatory [deadline] cfq
    128
    4096

    For both,
    #> cat /sys/devices/virtual/bdi/btrfs-1/read_ahead_kb
    4096


    At least they are consistent on all three kernels. ...

    -Ric


  • fhj52
    replied
    This is the 2.6.31 kernel (Mandriva 2010.0) result.
    #> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 23.5782 s, 364 MB/s
    It's 10:29:13 CST (UTC-0600) on Sat Feb 06, week 05 in 2010.


  • fhj52
    replied
    Hey Chris,
    I did manage to get 2.6.33 running on Mandriva. I ran some IOzone tests, and the results are basically the same, except that both write and read are slower. The gap narrowed a bit as a result.

    The odd thing is that the ext4 and ext3 partitions I tested now exhibit the same thing: slower reads than writes.
    Also, I did a cat of the max_hw_sectors values, and they were smaller than max_sectors_kb.
    It could be down to the different vendors, Mandriva vs. SUSE, but I'll have to get the 2.6.33 from openSUSE running before I can check whether some fiddling was done.
    I don't follow the kernel changes much anymore. Is somebody making changes to improve writes?

    ...
    The partition sits behind ~205 GB on the RAID, i.e., roughly the first 35% is a clone of another system drive, followed by the 380+ GB that is formatted as btrfs.

    The
    dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
    will be into the last 10% of the formatted space of the RAID-0.

    As soon as I can reboot, I'll post the dd results.

    c-ya,
    Ric


  • mason
    replied
    Great, different parts of the drive can perform differently. Or, it could be an alignment issue the write cache is hiding.

    The easiest way to tell is to do the read test farther down the drive. Where does sda16 start?

    Let's pretend it starts 500GB into the drive. You can use rough numbers, we don't need it down to the KB.

    500 * 1024 / 20 gives us the number of 20MB blocks into the drive that we need to skip to get to 500GB, which is 25600.

    dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct

    This will tell us if the problem with sda16 is alignment or not.
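
The skip arithmetic above can be sketched as a small shell helper, in case the real partition offset differs from the 500GB placeholder. The function name `skip_blocks` and its arguments are made up for illustration; the only assumption is that the offset is given in GB (treated as 1024 MB each, as in the calculation above) and that dd runs with bs=20M.

```shell
#!/bin/sh
# skip_blocks OFFSET_GB BS_MB
# Number of dd blocks to skip to land roughly OFFSET_GB into the device,
# using the same rough arithmetic as the post: offset * 1024 / block size.
# (Hypothetical helper, not part of dd.)
skip_blocks() {
    offset_gb=$1
    bs_mb=$2
    echo $(( offset_gb * 1024 / bs_mb ))
}

skip_blocks 500 20   # prints 25600
```

The result goes straight into the skip= argument, e.g. `dd if=/dev/sda of=/dev/zero bs=20M skip=$(skip_blocks 500 20) count=409 iflag=direct`.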


  • fhj52
    replied
    Originally posted by mason
    Thanks for trying this out, I think the best thing to do would be to nail down exactly how fast the device is.

    dd if=/dev/xxx of=/dev/zero bs=20M iflag=direct count=409

    /dev/xxx is whatever you built btrfs on top of. This should be a read only benchmark, and since we're running O_DIRECT it removes the kernel readahead from the picture.

    -chris
    Hi, thanks for the post. I am happy to do whatever I can to assist.
    Btrfs (which I pronounce "better f s") is, or at least has the potential to be, a truly world-class fs. I thank you and all the developers for doing the work, and Oracle for funding it. I know it is in Oracle's best interest to have such a filesystem, but making it GPL-licensed ... gotta love 'em for at least that.

    I was involved with other tasks but got to this today.
    Under openSUSE 11.2 (kernel 2.6.32-3) the SAS2 IR RAID-0 is device sda.
    Background:
    Code:
    #> mount
    ...
    /dev/sda16 on /SAS600RAID type btrfs (rw,noatime,nodatasum)
    #> df
    ...
    /dev/sda16   btrfs    339G  104K  339G   1% /SAS600RAID
    ...
    #> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
    noop anticipatory [deadline] cfq
    128
    4096
    Without changing the requests:
    Code:
    It's 16:47:08 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
    You are root at { /home }
    #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 15.7202 s, 546 MB/s
    Running it on the partition has a different result
    Code:
    #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 17.7661 s, 483 MB/s
    Then for proof that it is independent of requests:
    Code:
    #> echo deadline > scheduler; echo 2048 > nr_requests;cat max_hw_sectors_kb > max_sectors_kb; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
    noop anticipatory [deadline] cfq
    2048
    4096
    
    (ain't bash great, :))
    
    It's 17:09:46 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
    You are root at { /sys/block/sda/queue }
    #> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 15.7276 s, 545 MB/s
    and
    #> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
    409+0 records in
    409+0 records out
    8577351680 bytes (8.6 GB) copied, 17.7943 s, 482 MB/s
    => The same, which is what is expected.

    [ ran both sets of those several times and results were ~same each time]

    BTW, read_ahead_kb is the default value
    #> cat /sys/class/bdi/btrfs-1/read_ahead_kb; cat /sys/class/bdi/btrfs-2/read_ahead_kb
    4096
    4096
    and changing them (to 32768) also made no difference, as expected.

    So, 482-483 MB/s is the value for the partition.
    Using everything in the suggested setup, plus the increased bdi read_ahead_kb, etc., I ran IOzone again:
    Code:
            Auto Mode
            File size set to 8388608 KB
            Record Size 64 KB          
    
            Machine = Linux sm-opensuse 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100   Excel chart generation enabled
            Excel chart generation enabled                                                                                
            Command line used: iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=j2xL=128].xls                      
            Output is in Kbytes/sec                                                                                                            
            Time Resolution = 0.000001 seconds.                                                                                                
            Processor cache size set to 1024 Kbytes.                                                                                           
            Processor cache line size set to 64 bytes.                                                                                         
            File stride size set to 2 * record size.
                                                                
                  KB  reclen   write rewrite    read    reread    
             8388608      64  537542  540890   460758   460473
    READ is a little slower than the target value, but pretty close. I have no clue what the margin of error is for IOzone results ...

    I'm not sure whether this means the IOzone WRITE results are inflated, or the Hitachi algorithm and buffer are doing that great a job on WRITEs, ...or something else.
    At the least, it appears that kernel 2.6.32-3 is not helping, or that openSUSE and I have a config that is keeping it from helping.

    If the next step is to use 2.6.33, I will have to build one. The openSUSE Factory version (for openSUSE 11.3) is broken (here) ... A build is fine; it just takes a little extra time.


    -Ric
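
For anyone repeating the queue checks in this post, here is a minimal sketch that reads the same sysfs files and pulls the active scheduler out of the bracketed entry. The function names `active_scheduler` and `show_queue` are invented for this example, and the device name passed in (e.g. sda) is an assumption about your own system.

```shell
#!/bin/sh
# active_scheduler LINE
# Extract the active I/O scheduler from a sysfs "scheduler" line,
# e.g. "noop anticipatory [deadline] cfq" -> "deadline".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# show_queue DEV
# Print the queue settings discussed in this thread for one block device.
# DEV is a name such as "sda" (hypothetical here; use your own device).
show_queue() {
    q="/sys/block/$1/queue"
    [ -d "$q" ] || { echo "no such queue dir: $q" >&2; return 1; }
    printf '%-18s %s\n' "scheduler:" "$(active_scheduler "$(cat "$q/scheduler")")"
    for f in nr_requests max_hw_sectors_kb max_sectors_kb; do
        printf '%-18s %s\n' "$f:" "$(cat "$q/$f")"
    done
}

active_scheduler "noop anticipatory [deadline] cfq"   # prints deadline
```

Running `show_queue sda` before and after the echo commands makes it easy to confirm that the changes actually took effect.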


  • mason
    replied
    Thanks for trying this out, I think the best thing to do would be to nail down exactly how fast the device is.

    dd if=/dev/xxx of=/dev/zero bs=20M iflag=direct count=409

    /dev/xxx is whatever you built btrfs on top of. This should be a read only benchmark, and since we're running O_DIRECT it removes the kernel readahead from the picture.

    -chris
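
As a sanity check on the figures dd prints for this test, the rate can be recomputed from the byte count and elapsed time. This is only a sketch: `dd_rate` is a made-up helper, and it assumes dd's MB/s means 10^6 bytes per second. The byte count is 409 blocks of 20M (409 * 20 * 1024 * 1024 = 8577351680), and the timings are the ones reported for /dev/sda and /dev/sda16 elsewhere in this thread.

```shell
#!/bin/sh
# dd_rate BYTES SECONDS
# Throughput in MB/s (decimal megabytes), rounded to a whole number.
# (Hypothetical helper for checking dd's summary line.)
dd_rate() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

bytes=$(( 409 * 20 * 1024 * 1024 ))   # 8577351680, matches dd's byte count
dd_rate "$bytes" 15.7202              # whole device: prints 546
dd_rate "$bytes" 17.7661              # partition:    prints 483
```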


  • fhj52
    replied
    Hey Chris,
    I tried the suggested changes: scheduler (deadline), nr_requests, and hw_sectors.
    READ is slower than WRITE by more than 70 MB/s.
    Code:
            Auto Mode
            File size set to 8388608 KB
            Record Size 64 KB          
    
            Machine = Linux * 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100   Excel chart generation enabled
            Excel chart generation enabled                                                                                
            Command line used: */iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=64].xls                                                                                                                  
            Output is in Kbytes/sec                                                                                                 
            Time Resolution = 0.000001 seconds.                                                                                     
            Processor cache size set to 1024 Kbytes.                                                                                
            Processor cache line size set to 64 bytes.                                                                              
            File stride size set to 1 * record size.                                                                                
                                                                random  random    bkwd   record   stride                                                                                                                                                            
                  KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread                                                                                                                         
             8388608      64  537033  542196   462496   462088
    
    iozone test complete.
    Just so it is clear, I'm not complaining. I mean, who can complain about 500 MB/s +/- 35 MB/s?
    I'm trying to assist. So if there is some other way you want this run, just say so. (The IOzone test takes ~1m15s on this new SAS2 setup, so it is painless, especially compared to *ATA and PAS disks.)
    ...or even some other app, if you think IOzone might be fiddling with the results somehow.

    -Ric
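
The "> 70MBps" gap can be checked from the write and read columns of the IOzone run above. A rough sketch, assuming IOzone's "Kbytes/sec" means 1024-byte units and converting to decimal MB/s so it is comparable with dd; `iozone_gap` is a hypothetical helper name.

```shell
#!/bin/sh
# iozone_gap WRITE_KBPS READ_KBPS
# Write-minus-read difference in MB/s (10^6 bytes/s), one decimal place.
# Assumes the inputs are in units of 1024 bytes per second.
iozone_gap() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f\n", (w - r) * 1024 / 1000000 }'
}

iozone_gap 537033 462496   # prints 76.3
```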


  • fhj52
    replied
    Hi Chris,
    I have to adjust the "Wow!" statement from my previous post. I could not get to openSUSE (2.6.32 kernel) last night, but today I did. Without changing the readahead value, but using noatime and nodatasum, there is a new record here:

    "Writer report"
    "64"
    "8388608" 539,898 kBps

    HS!, the interface is only spec'd at 586MBps ...
    Here's the output:
    Code:
                  KB  reclen   write rewrite    read    reread                                                                             
             8388608      64  539898  543101   463523   463367
    READ is still quite a lot slower according to IOzone.

    There is not enough data to conclude that the default readahead value is too small for near-state-of-the-art storage, i.e., SAS2 HDD and SSD, but it surely looks that way.
    So I changed the default 4096 to 12288 in the /sys/devices/virtual/bdi/btrfs-*/read_ahead_kb files and ran it again ... no love:

    Code:
                  KB  reclen   write rewrite    read    reread                                                                                                                            
             8388608      64  549614  542850   462666   462772
    I was using the same IOzone parameters and getting basically the same results, so, to avoid appearing too crazy*, I changed three things: I dropped the CPU utilization report (it is useless anyway ...), started unmounting and remounting between tests (there was some indication that the previous cache was being used), and set the stride to a smaller value (the RAID uses a 64k stripe).
    I won't bore you with useless data. I tried several strides (1*64, 2*64, ... 192*64) and none mattered. READ is about the same.
    I had to stop using the auto unmount and mount function in IOzone, because every time it ran, read_ahead_kb was reset to the default value of 4096. I poked around a little, but my guess is that it is a kernel value I cannot change without rebuilding the kernel or module. I'll look a bit more later. ...

    I also tried increasing the read_ahead to 32,768 ...even 64MB! No diff for the READ that way either:
    Code:
    Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=128].xls
                  KB  reclen   write rewrite    read    reread 
             8388608      64  535878  542170   463488   463487
    Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=65536]-[stride=64].xls                         
                  KB  reclen   write rewrite    read    reread
             8388608      64  536160  542576   440697   445034
    ...

    While composing this I see you posted.
    You're welcome and thank you for the suggestions.

    Will try those suggestions, esp. the deadline as I meant to change that and forgot about it. Current scheduler is the default, CFQ.

    I probably should not get too far into the specifics of the 9211 HBA card, but it is a pretty typical HBA: no cache, and no readahead or writeback.
    It does allow the HDD cache to be switched on or off, which is a new widget. It was set to on, but I cannot verify that it still is. ... The LSI Linux software is not only lame but also proprietary => I cannot fix it.
    I assume the HDD cache is being used, because the boot log indicates the kernel thinks it is enabled:
    Code:
    ... sd 0:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    I trust Linus et al. more than LSI anyway.


    -Ric


    *crazy: someone who does the same exact thing, the same exact way over and over again and expects a different result each time.
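
Since read_ahead_kb falls back to 4096 on every remount, one workaround is to re-apply the value after each mount. The sketch below is only that: `set_readahead` is a made-up helper, it must run as root against the real sysfs tree, and the optional second argument (a directory to use instead of /sys/devices/virtual/bdi) exists purely so the function can be tried on a scratch directory first.

```shell
#!/bin/sh
# set_readahead KB [BDI_DIR]
# Write KB into every btrfs-*/read_ahead_kb file under BDI_DIR
# (default: /sys/devices/virtual/bdi) and echo back what was stored.
# (Hypothetical helper; not part of IOzone or btrfs tooling.)
set_readahead() {
    kb=$1
    dir=${2:-/sys/devices/virtual/bdi}
    for f in "$dir"/btrfs-*/read_ahead_kb; do
        [ -e "$f" ] || continue          # glob matched nothing
        echo "$kb" > "$f" && echo "$f = $(cat "$f")"
    done
}
```

A test loop would then remount by hand rather than through IOzone, for example `umount /SAS600RAID && mount /SAS600RAID && set_readahead 32768`, assuming the mount point has an fstab entry.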
