Linux 5.18 Looks Like It Will Finally Land Btrfs Encoded I/O


  • #31
    Originally posted by Solid State Brain View Post
    I still believe most users, especially gamers and/or new users who may expect Linux performance to always be greater than, say, Windows/NTFS, will be mainly interested in performance.
    Those users will never do heavy random writes. The OS buffers and coalesces writes; only reads can be random on a non-database workload.



    • #32
      Originally posted by F.Ultra View Post
      One of my BTRFS Raid10 arrays is comprised of 24x14TB drives. A full scrub takes about 4h.
      That's only because it's not full. No 14 TB HDD can be read in 4h, as that would amount to 972 MB/s, which is far outside the range of any current HDD media transfer rate.

      So, you're only reading the blocks with filesystem data, rather than mdraid's naive approach of reading all blocks. That's all well and good, until your array starts to near capacity. Then, you're going to approach the same scrub times as with mdraid or HW raid. Perhaps even far worse, if BTRFS doesn't scrub using completely contiguous reads.

      Originally posted by F.Ultra View Post
      All 100% reads only.
      Yes, because that's what scrubbing is.



      • #33
        Originally posted by Danny3 View Post
        Ok, so if a user has full disk encryption and a program decides to use this API, it can write unencrypted data?
        No, by 'writing to disk' they mean writing to a block device, which is encrypted in your case.
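
        To make that concrete, here is a minimal sketch (device names and the illustrative output are placeholders, not from the thread) of how the filesystem sits on top of the dm-crypt mapping in a typical full-disk-encryption setup, so whatever the filesystem writes to its block device ends up encrypted:

        Code:
        # list the block device stack; the filesystem's "disk" is the dm-crypt
        # mapping, not the raw partition (illustrative output, not a real system)
        lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT /dev/nvme0n1
        # NAME            TYPE  FSTYPE       MOUNTPOINT
        # nvme0n1         disk
        # └─nvme0n1p2     part  crypto_LUKS
        #   └─cryptroot   crypt btrfs        /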



        • #34
          Originally posted by onlyLinuxLuvUBack View Post

          100% where? Everywhere, or just the 8-thread test?
          100% CPU load only on the 8-thread tests, but other tests also had high CPU overhead; possibly they were limited by single-core speed, but I haven't investigated that in detail and currently I cannot repeat them (at the moment I'm using a much older configuration with SATA SSDs and Btrfs compression, so performance is not the utmost priority).

          Originally posted by pal666 View Post
          Those users will never do heavy random writes. The OS buffers and coalesces writes; only reads can be random on a non-database workload.
          Single-thread, large-block sequential read speeds with direct I/O are also noticeably slower on Btrfs than with other filesystems, and this can easily be tested by just about anybody with a fast NVMe SSD. My point in any case is that gamers and the like will often care a lot about good benchmarks even if they don't get to use the performance in practice.

          One could use buffered read/writes, but that's not how synthetic SSD benchmarks are normally done. Buffered operations with Btrfs are of course better but still affected by high CPU overhead.
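
          For anyone who wants to try this themselves, here is a minimal fio sketch of that kind of test (the file path and size are placeholders, not from the original posts):

          Code:
          # single-threaded, large-block sequential read with direct I/O (bypasses the page cache)
          fio --name=seqread-direct --filename=/mnt/test/fio.bin --size=8G \
              --rw=read --bs=1M --direct=1 --ioengine=psync --numjobs=1
          # the same job, but buffered (goes through the page cache)
          fio --name=seqread-buffered --filename=/mnt/test/fio.bin --size=8G \
              --rw=read --bs=1M --direct=0 --ioengine=psync --numjobs=1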

          Here is a Phoronix benchmark with fio from 2020 which curiously saw Btrfs doing quite well with a Gen.4 NVMe SSD, using buffered/non-direct operations: https://www.phoronix.com/scan.php?pa...esystems&num=3



          • #35
            I don't care about RAID5/6 on Btrfs as there are already MDADM and LVM for this as FS-agnostic ways to achieve this. What would be the theoretical advantages (besides needing to use fewer maintenance tools) of having RAID within the FS layer instead of the block layer anyways? Is it currently known if old volumes can easily be upgraded to the new on-disk format with some sort of conversion tool or will a reformat followed by restoring the data from a backup be needed?



            • #36
              Originally posted by coder View Post
              That's only because it's not full. No 14 TB HDD can be read in 4h, as that would amount to 972 MB/s, which is far outside the range of any current HDD media transfer rate.

              So, you're only reading the blocks with filesystem data, rather than mdraid's naive approach of reading all blocks. That's all well and good, until your array starts to near capacity. Then, you're going to approach the same scrub times as with mdraid or HW raid. Perhaps even far worse, if BTRFS doesn't scrub using completely contiguous reads.


              Yes, because that's what scrubbing is.
              Exactly, and what you just wrote was my point about why mdraid is so hard on drives compared with btrfs raid, and why trying to recover from a failed drive in a normal RAID might nuke your drives while btrfs will not: with btrfs the recovery only reads from the old drives and all writes go to the new drive, whereas in a normal RAID the entire array has to be rewritten.

              Still, it's way faster than your raid6 (which of course is not only because this is reads only; having 24 SAS drives also helps quite a bit): you have 16TB of disk to go over in over 9h, while mine finished scrubbing 60TiB in 4:26:02.

              Code:
              root@fileserver-sto5:~# btrfs scrub status /opt
              UUID: d6cb5d55-729e-4b44-aee0-526b6fb82aed
              Scrub started: Fri Feb 11 21:54:31 2022
              Status: finished
              Duration: 4:26:02
              Total to scrub: 60.48TiB
              Rate: 3.87GiB/s
              Error summary: no errors found
              root@fileserver-sto5:~#
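
              For reference, a scrub like the one above is started and monitored with two commands (a minimal sketch; /opt is the mount point from the output above):

              Code:
              # kick off a background scrub of the filesystem mounted at /opt
              btrfs scrub start /opt
              # check progress and final statistics
              btrfs scrub status /opt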
              Last edited by F.Ultra; 12 February 2022, 02:03 PM.



              • #37
                Originally posted by kiffmet View Post
                I don't care about RAID5/6 on Btrfs as there are already MDADM and LVM for this as FS-agnostic ways to achieve this. What would be the theoretical advantages (besides needing to use fewer maintenance tools) of having RAID within the FS layer instead of the block layer anyways? Is it currently known if old volumes can easily be upgraded to the new on-disk format with some sort of conversion tool or will a reformat followed by restoring the data from a backup be needed?
                With btrfs it's not RAID inside the FS vs RAID at the block level, since btrfs raid is not like traditional RAID. I used to have a raid-5 setup for /home on my home machine; now I have switched that to a btrfs raid1 setup. One practical benefit is that every single time there is a non-clean shutdown (be it due to a power outage or a complete system hang [I'm a dev, so that happens]) I no longer have to wait for hours for the machine to boot and be usable. I also don't miss having my machine completely bork out on me the last Sunday of every month when mdraid decides to do a complete resync of the entire RAID stack.

                Another practical thing is that if a drive fails in the future I will not risk having any of the working drives nuked by mdraid forcing a complete resync when replacing the failed drive (it is a common problem among RAID users that replacing a failed drive makes other drives fail during the rebuild phase), since btrfs will only read from the working drives and not write to them.

                Another practical benefit is that btrfs raid keeps checksums of the files and not just a parity of each chunk, which means that whenever I want to perform a check or a recovery I only have to touch/read the actual amount of data stored and not the entire storage pool.
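
                As a rough illustration of the replacement path described above (device names and mount point are placeholders, not from the original posts):

                Code:
                # create a two-device btrfs raid1 (both data and metadata mirrored)
                mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
                # after /dev/sdb fails, replace it with a new device; only the surviving
                # device is read and only the new device is written to
                btrfs replace start /dev/sdb /dev/sdd /mnt
                btrfs replace status /mnt
                # if the failed device is no longer accessible, its numeric devid
                # (from 'btrfs filesystem show /mnt') can be used instead of /dev/sdb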



                • #38
                  Originally posted by kiffmet View Post
                  What would be the theoretical advantages (besides needing to use fewer maintenance tools) of having RAID within the FS layer instead of the block layer anyways?
                  Based on posts from other threads, the claim is that it makes you more resilient to hardware problems, since mdraid doesn't check parity unless you're scrubbing. I think mdraid just depends on drives to fail a read, and then it kicks the drive out of the array. This has the disadvantage that a bug in the drive's firmware or in the disk controller (if SATA) or the kernel's block layer can introduce an error that mdraid won't see. There can also be errors at the SATA or PCIe level, but those do have some amount of CRC-type protection, AFAICT.

                  The main disadvantage of using a FS in this way is that it involves reading the entire stripe from all devices. mdraid has a read optimization, which I think most hardware RAID controllers also do, that reads each stripe from only N-1 or N-2 drives, depending on whether you're using RAID-5 or RAID-6. I'm not aware of an option to force mdraid to always read from all drives and check parity.

                  I think another disadvantage of mdraid is that a disk gets completely ejected from the array when it has any errors. Now, let's say you have an 8-drive RAID-6 and one drive fails. You pull it and rebuild with a new drive. However, during the rebuild, errors are encountered on two of the other drives, each at different spots. What I think will happen is that the first error will eject that drive, and now you've lost all redundancy. Upon encountering the second error, you're faced with an array failure. Even though all the data could be reconstructed (i.e. by reading blocks from whichever drives don't have errors in them), I think mdraid won't do it.

                  Sadly, these functional gaps in mdraid aren't fundamental. It could do both things, i.e. demand-scrubbing (checking parity on all reads) and flexible array rebuilds; I guess nobody cared enough to add those features.
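
                  For context, the mdraid scrubbing mentioned above is the md 'check' action, which can also be triggered on demand; a minimal sketch (the array name md0 is a placeholder):

                  Code:
                  # request a read-only parity check (scrub) of /dev/md0
                  echo check > /sys/block/md0/md/sync_action
                  # watch progress
                  cat /proc/mdstat
                  # number of mismatched sectors found by the last check
                  cat /sys/block/md0/md/mismatch_cnt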

                  Somebody please correct me, if I'm mistaken.



                  • #39
                    Originally posted by F.Ultra View Post
                    trying to recover from a failed drive in a normal RAID might nuke your drives while btrfs will not: with btrfs the recovery only reads from the old drives and all writes go to the new drive, whereas in a normal RAID the entire array has to be rewritten.
                    Where did you get that idea? RAID rebuilds only read from the existing drives and only write to the new drive. Rewriting the existing drives would be pretty stupid, if only because it'd more than double rebuild times.



                    • #40
                      Originally posted by F.Ultra View Post
                      every single time there is a non-clean shutdown (be it due to a power outage or a complete system hang [I'm a dev, so that happens]) I no longer have to wait for hours for the machine to boot and be usable.
                      I'm pretty sure that's not an mdraid thing. I'm guessing you previously used something like ext2 on mdraid. It's probably that filesystem which wanted to do an fsck.

                      Originally posted by F.Ultra View Post
                      I also don't miss having my machine completely bork out on me the last Sunday of every month when mdraid decides to do a complete resync of the entire RAID stack.
                      You can schedule scrubbing whenever you want. And you should still be scrubbing your RAID, even though it's using BTRFS.
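
                      For example, scheduling could look like this minimal cron sketch (the mount point /opt is borrowed from the earlier scrub output; the schedule itself is arbitrary):

                      Code:
                      # /etc/cron.d/btrfs-scrub (hypothetical file): scrub /opt at 03:00 on the 1st of every month
                      0 3 1 * * root /usr/bin/btrfs scrub start -B /opt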

                      Originally posted by F.Ultra View Post
                      Another practical thing is that if a drive fails in the future I will not risk having any of the working drives nuked by mdraid forcing a complete resync when replacing the failed drive (it is a common problem among RAID users that replacing a failed drive makes other drives fail during the rebuild phase)
                      That usually only happens to people who don't do regular scrubbing. If you scrub frequently enough, and especially if you use RAID-6, then the risk of an array failure during rebuild is negligible (though it's higher for arrays with more drives).

