Chris:
I won't be able to test with this partition anymore.
I don't suppose it matters, since the WRITE-faster-than-READ issue is exhibited on all drive types here, but if you want more tests run, they will have to be run elsewhere.
BOL
-Ric
Btrfs Battles EXT4 With The Linux 2.6.33 Kernel
-
I did get the openSUSE 2.6.33 kernel up to init3 and the dd results were the same: ~ 364MBps.
...and Mandriva's kernel is using different default values than openSUSE.
Code:
#> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb;cat max_sectors_kb
noop anticipatory [deadline] cfq
128
64
64
-
Originally posted by mason:
Great, different parts of the drive can perform differently. Or, it could be an alignment issue the write cache is hiding.
Linux 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
(openSUSE 11.2)
#> dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 23.613 s, 363 MB/s
.
#> uname -srv
Linux 2.6.33-desktop-0.rc6.1mnb #1 SMP Sat Jan 30 01:00:20 CET 2010
(Mandriva 2010.0 w. 2010.1 kernel)
#> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 23.612 s, 363 MB/s
#> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
noop anticipatory [deadline] cfq
128
4096
For both,
#> cat /sys/devices/virtual/bdi/btrfs-1/read_ahead_kb
4096
At least they are consistent on all three kernels. ...
-Ric
-
This is the 2.6.31 kernel(Mandriva 2010.0) result.
#> dd if=/dev/sdf of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 23.5782 s, 364 MB/s
It's 10:29:13 CST (UTC-0600) on Sat Feb 06, week 05 in 2010.
-
Hey Chris,
I did manage to get 2.6.33 running on Mandriva. I ran some iozone tests and the results are basically the same, except that both write and read are slower. The gap narrowed a bit as a result.
The odd thing is that the ext4 & ext3 partitions tested are now exhibiting the same thing: slower reads than writes.
Also, I did a cat of max_hw_sectors_kb and the values were smaller than max_sectors_kb.
It could be the different companies, Mandriva -v- SUSE, but I'll have to get the 2.6.33 from openSUSE running before I can check it to see if some fiddling was done.
I don't follow the kernel changes much anymore. Is somebody making changes to improve writes?
...
The partition sits behind ~205GB on the RAID, i.e., about 35% of the array is a clone of another system drive, followed by the 380+GB that is formatted as btrfs.
The
dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
will be into the last 10% of the formatted space of the RAID-0.
As soon as I can reboot ... I'll post the dd results.
c-ya,
Ric
-
Originally posted by fhj52:
Hi, thanks for the post. I am happy to do whatever I can to assist.
The easiest way to tell is to do the read test farther down the drive. Where does sda16 start?
Let's pretend it starts 500GB into the drive. You can use rough numbers; we don't need it down to the KB.
500 * 1024 / 20 gives us the number of 20MB blocks into the drive that we need to skip to get to 500GB, which is 25600.
dd if=/dev/sda of=/dev/zero bs=20M skip=25600 count=409 iflag=direct
This will tell us if the problem with sda16 is alignment or not.
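The skip arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name skip_blocks is hypothetical, and it assumes, as the post does, rough offsets with 1 GB = 1024 MB.

```shell
#!/bin/sh
# skip_blocks OFFSET_GB BLOCK_MB   (hypothetical helper, not from the thread)
# Print how many BLOCK_MB-sized dd blocks fit in OFFSET_GB, i.e. the value
# to pass as skip= when dd runs with bs=${BLOCK_MB}M.
# Assumes 1 GB = 1024 MB; integer division, so the result is a rough offset.
skip_blocks() {
    offset_gb=$1
    block_mb=$2
    echo $(( offset_gb * 1024 / block_mb ))
}

# The example from the post: 500 GB into the drive with 20 MB blocks.
skip_blocks 500 20    # prints 25600
```

The printed value is what goes into skip= in the dd command line above.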
-
Originally posted by mason:
Thanks for trying this out, I think the best thing to do would be to nail down exactly how fast the device is.
Btrfs (which I pronounce "better f s") is, or at least has the potential to be, a truly world-class fs. I thank you and all the developers for doing the work and Oracle for funding it. I know it is in Oracle's best interest to have such, but making it GPL-licensed ... gotta love 'em for at least that.
I was involved with other tasks but got to this today.
Under openSUSE 11.2 (kernel 2.6.32-3) the SAS2 IR RAID-0 is device sda.
Background:
Code:
#> mount ...
/dev/sda16 on /SAS600RAID type btrfs (rw,noatime,nodatasum)
#> df ...
/dev/sda16 btrfs 339G 104K 339G 1% /SAS600RAID
...
#> cd /sys/block/sda/queue; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
noop anticipatory [deadline] cfq
128
4096
Code:
It's 16:47:08 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
You are root at { /home }
#> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 15.7202 s, 546 MB/s
Code:
#> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 17.7661 s, 483 MB/s
Code:
#> echo deadline > scheduler; echo 2048 > nr_requests;cat max_hw_sectors_kb > max_sectors_kb; cat scheduler;cat nr_requests;cat max_hw_sectors_kb
noop anticipatory [deadline] cfq
2048
4096
(ain't bash great, :))
It's 17:09:46 CST (UTC-0600) on Fri Feb 05, week 05 in 2010.
You are root at { /sys/block/sda/queue }
#> dd if=/dev/sda of=/dev/zero bs=20M iflag=direct count=409
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 15.7276 s, 545 MB/s
and
#> dd if=/dev/sda16 of=/dev/zero bs=20M iflag=direct count=409
409+0 records in
409+0 records out
8577351680 bytes (8.6 GB) copied, 17.7943 s, 482 MB/s
[ ran both sets of those several times and results were ~same each time]
BTW, read_ahead_kb is the default value
#> cat /sys/class/bdi/btrfs-1/read_ahead_kb; cat /sys/class/bdi/btrfs-2/read_ahead_kb
4096
4096
and changing them (to 32768) also made no diff, as expected.
So, 482-483MBps is the value for the partition.
Using everything in the suggested setup, with increased bdi read_ahead_kb, etc., I ran IOzone again:
Code:
Auto Mode
File size set to 8388608 KB
Record Size 64 KB
Machine = Linux sm-opensuse 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
Excel chart generation enabled
Excel chart generation enabled
Command line used: iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=j2xL=128].xls
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 64 bytes.
File stride size set to 2 * record size.
      KB  reclen   write rewrite    read  reread
 8388608      64  537542  540890  460758  460473
I'm not sure if this means the WRITE IOzone results are inflated or if the Hitachi algorithm and buffer are doing that great of a job for WRITEs, ...or something else.
At the least it appears that kernel 2.6.32-3 is not helping or I and openSUSE have a config that is keeping it from helping.
If the next thing is to use 2.6.33, I will have to build one. The openSUSE Factory version (for openSUSE 11.3) is broken (here) ... A build is fine; just a little extra time.
-Ric
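The bare cat output above is easy to misread, since the values come back without labels. A minimal sketch that prints each queue setting next to its name, assuming the same sysfs layout shown in the post. The function name show_queue and its optional second argument (an alternate sysfs root, handy for dry runs) are hypothetical.

```shell
#!/bin/sh
# show_queue DEVICE [SYSFS_BLOCK_ROOT]   (hypothetical helper)
# Print the block-queue settings discussed in the thread, one "name: value"
# line per setting. SYSFS_BLOCK_ROOT defaults to /sys/block.
# Settings that do not exist or are not readable are silently skipped.
show_queue() {
    queue_dir=${2:-/sys/block}/$1/queue
    for name in scheduler nr_requests max_hw_sectors_kb max_sectors_kb; do
        if [ -r "$queue_dir/$name" ]; then
            printf '%s: %s\n' "$name" "$(cat "$queue_dir/$name")"
        fi
    done
}

# Usage (read-only, no root needed):
#   show_queue sda
```

Running it before and after a tuning step, or on each distribution's kernel, gives output that can be compared line by line.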
-
dd if=/dev/xxx of=/dev/zero bs=20M iflag=direct count=409
/dev/xxx is whatever you built btrfs on top of. This should be a read only benchmark, and since we're running O_DIRECT it removes the kernel readahead from the picture.
-chris
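As a sanity check on the dd figures reported in this thread, the throughput can be recomputed from the byte count and elapsed time. A minimal sketch, assuming dd reports MB/s with 1 MB = 10^6 bytes and rounds to the nearest integer; the function name dd_mbps is hypothetical.

```shell
#!/bin/sh
# dd_mbps BYTES SECONDS   (hypothetical helper)
# Recompute a dd-style throughput figure: bytes / seconds, expressed in MB/s
# with 1 MB = 10^6 bytes (an assumption about how dd prints its rate),
# rounded to the nearest integer.
dd_mbps() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

# bs=20M count=409 transfers 409 blocks of 20 MiB each:
echo $(( 409 * 20 * 1024 * 1024 ))   # prints 8577351680

dd_mbps 8577351680 15.7202           # whole device /dev/sda
dd_mbps 8577351680 17.7661           # partition /dev/sda16
```

With the timings posted for /dev/sda and /dev/sda16, this gives the same 546 and 483 MB/s that dd printed.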
-
Hey Chris,
I tried the suggestions of scheduler(deadline), nr_requests and hw_sectors changes.
READ is slower than WRITE by > 70MBps.
Code:
Auto Mode
File size set to 8388608 KB
Record Size 64 KB
Machine = Linux * 2.6.32-3-default #1 SMP 2009-12-04 00:41:46 +0100
Excel chart generation enabled
Excel chart generation enabled
Command line used: */iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32_deadline_sectors=4096_nr_requests=2048]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=64].xls
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 64 bytes.
File stride size set to 1 * record size.
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
 8388608      64  537033  542196  462496  462088
iozone test complete.
I'm trying to assist. So if there is some other way you want this run just say so. (The IOzone test is ~ 1m15s on this new SAS2 setup so is painless, especially compared to *ATA and PAS disks. )
...or even some other app if you think IOzone may be fiddling with the results somehow.
-Ric
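The "> 70MBps" gap can be read straight off the IOzone result row. A minimal sketch, assuming the six-column row layout shown above (KB, reclen, write, rewrite, read, reread); the function name iozone_gap is hypothetical.

```shell
#!/bin/sh
# iozone_gap   (hypothetical helper)
# Read IOzone result rows on stdin and, for each row that starts with a
# numeric file size and has at least six fields, print write minus read
# (field 3 minus field 5) in Kbytes/sec.
iozone_gap() {
    awk 'NF >= 6 && $1 ~ /^[0-9]+$/ { print $3 - $5 }'
}

# The row from the run above:
echo "8388608 64 537033 542196 462496 462088" | iozone_gap   # prints 74537
```

74537 Kbytes/sec is a little over 70 MB/s, in line with the figure quoted in the post.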
-
Hi Chris,
I have to adjust the "Wow!" statement in my previous post. I could not get to openSUSE (2.6.32 kernel) last night, but today I did. Without changing the readahead value, but using noatime and nodatasum, there is a new record here:
"Writer report"
"64"
"8388608" 539,898 kBps
HS!, the interface is only spec'd at 586MBps ...
Here's the output:
Code:
      KB  reclen   write rewrite    read  reread
 8388608      64  539898  543101  463523  463367
Not enough data to draw the conclusion that the default readahead value is too small for near-state-of-the-art storage, i.e., SAS2 HDD and SSD, but it surely looks that way.
So I changed the default 4096 to 12288 in the /sys/devices/virtual/bdi/btrfs-*/read_ahead_kb files and ran it again ... no love:
Code:
      KB  reclen   write rewrite    read  reread
 8388608      64  549614  542850  462666  462772
I won't bore you with useless data. I tried several strides(1*64, 2*64, ... 192*64) and none mattered. READ is about the same.
I had to stop using the auto unmount & mount function in IOzone because every time it ran, read_ahead_kb was reset to the default value of 4096. I poked around a little, but my guess is that it is a kernel value I cannot change without rebuilding the kernel or module. I'll look a bit more later. ...
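Since a remount puts read_ahead_kb back to 4096, one workaround is to reapply the value after every mount. A minimal sketch, assuming the /sys/devices/virtual/bdi/btrfs-*/read_ahead_kb layout named above; the function name set_btrfs_readahead and its optional directory argument are hypothetical, and writing to the real sysfs files needs root.

```shell
#!/bin/sh
# set_btrfs_readahead VALUE_KB [BDI_DIR]   (hypothetical helper)
# Write VALUE_KB into every btrfs-*/read_ahead_kb file under BDI_DIR.
# BDI_DIR defaults to /sys/devices/virtual/bdi, the path used in the post.
set_btrfs_readahead() {
    value_kb=$1
    bdi_dir=${2:-/sys/devices/virtual/bdi}
    for f in "$bdi_dir"/btrfs-*/read_ahead_kb; do
        [ -e "$f" ] || continue     # the glob matched nothing
        echo "$value_kb" > "$f"
    done
}

# Usage, as root, after each mount of the btrfs volume:
#   set_btrfs_readahead 12288
```

This only reapplies the setting; it does not change the default the kernel uses at mount time.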
I also tried increasing the read_ahead to 32,768 ...even 64MB! No diff for the READ that way either:
Code:
Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j2 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=32768]-[stride=128].xls
      KB  reclen   write rewrite    read  reread
 8388608      64  535878  542170  463488  463487
Command line used: /usr/lib/iozone/bin/iozone -L64 -S1024 -a -j1 -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[9211-8i-RAID0]_[btrfs_noatime_nodatasum_readahead=65536]-[stride=64].xls
      KB  reclen   write rewrite    read  reread
 8388608      64  536160  542576  440697  445034
While composing this I see you posted.
You're welcome and thank you for the suggestions.
Will try those suggestions, esp. the deadline as I meant to change that and forgot about it. Current scheduler is the default, CFQ.
Probably should not get too much into the 9211 HBA card specifics, but it is a pretty typical HBA: no cache, and it does not have readahead or writeback.
It does allow setting the HDD cache as on or off for use, which is a new widget. It was set to on but I cannot verify it still is. ... LSI Linux software is not only lame but also proprietary => I cannot fix it.
I assume the HDD cache is being used because the boot log indicates the kernel thinks it is enabled:
Code:
...
sd 0:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-Ric
*crazy: someone who does the same exact thing, the same exact way over and over again and expects a different result each time.