Bcachefs File-System Plans To Try Again To Land In Linux 6.6


  • #41
    Originally posted by Khrundel View Post
    Well, it is more of "make things as simple as possible, but not simpler". All these layered structures tend to become overengineered. And btrfs already breaks these tiers by doing raid stuff and distributing data across devices, adding cache is just goes one step further. Caching subsystem can benefit from knowing about files, especially in case of COW. Imagine some "hot" extent, like some shared library. With all access statistics gathered during last year this extent has to have highest caching priority, unless you know it was overwritten yesterday and now persist only because of some backup snapshot. So you have to either just guess by access pattern, or overcomplicate tiers API to pass data needed only for this two modules.
    I think ZFS already does this, since it has an integrated cache called the ARC (with an in-memory tier and an optional SSD tier, the L2ARC)?
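    A rough sketch of how that SSD tier gets attached, with made-up pool/device names:
    Code:
    # add an SSD partition as an L2ARC read cache to a pool named "tank"
    zpool add tank cache /dev/nvme0n1p2
    # verify the cache device now shows up under the pool
    zpool status tank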

    Comment


    • #42
      It would be nice if there were a Bcachefs insider participating here on Phoronix, as bridgman does for AMD stuff.

      Comment


      • #43
        Originally posted by EphemeralEft View Post

        All Linux filesystems (including BTRFS) have tiered storage if you put them on top of BCache (what BCacheFS is based on) or DM-Cache. Same with encryption if you put it on top of DM-Crypt. It's honestly a better solution than duplicating that work for every filesystem. And it works for non-filesystem block devices, too.
        Caching and tiered storage are different concepts; they are not the same thing.
        You can't get tiered storage with btrfs by stacking layers underneath it, because the filesystem itself has to support it.
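        To be concrete about what the stacking approach actually gives you (a cache, not tiering), here is a rough lvmcache sketch with made-up VG/LV names; exact syntax depends on the LVM version:
        Code:
        # assume a VG "vg" with a slow LV "data" and a fast SSD PV /dev/nvme0n1p1 already in the VG
        lvcreate -n fastcache -L 100G vg /dev/nvme0n1p1
        # attach the fast LV to the slow LV as a dm-cache layer managed by LVM
        lvconvert --type cache --cachevol fastcache vg/data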

        Comment


        • #44
          Originally posted by EphemeralEft View Post
          It's actually BTRFS | LVM | DM-Crypt | BCache | DM-RAID, where DM-Crypt is managed by Cryptsetup and DM-RAID is managed by LVM. The top-most LVM Layer is split into different filesystems for different purposes.
          Sorry for the off-topic, but I'd like to ask you a question...

          I'm using BTRFS on top of LUKS2 on two (certainly not weak) PCs, one running Fedora 38 and the other Arch Linux. On both, when there is a lot of I/O on the main disk (both have an SSD), the system sometimes hiccups (even for several seconds!) in the UI, and app responsiveness suffers as well (so it's not just the UI that's I/O starved)...

          I've read about some cryptsetup options that tweak performance by disabling features that were needed when they were introduced but shouldn't be necessary anymore; sadly, they didn't fix the problem 100%...

          These are the flags I've used: discards same_cpu_crypt submit_from_crypt_cpus no_read_workqueue no_write_workqueue

          Code:
          # cryptsetup status luks-b8d62e8e-...
          /dev/mapper/luks-b8d62e8e-... is active and is in use.
           type:    LUKS2
           cipher:  aes-xts-plain64
           keysize: 512 bits
           key location: keyring
           device:  /dev/sda3
           sector size:  512
           offset:  32768 sectors
           size:    465514496 sectors
           mode:    read/write
           flags:   discards same_cpu_crypt submit_from_crypt_cpus no_read_workqueue no_write_workqueue
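          For reference, applying them persistently looks roughly like this (it needs a reasonably recent cryptsetup, 2.3.x or newer I believe, for the workqueue options):
          Code:
          # re-apply the performance flags to the open mapping and store them in the LUKS2 header
          cryptsetup refresh luks-b8d62e8e-... \
              --allow-discards --perf-same_cpu_crypt --perf-submit_from_crypt_cpus \
              --perf-no_read_workqueue --perf-no_write_workqueue --persistent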
          Have you tried anything similar? Are you affected by this too?

          Thanks.

          Comment


          • #45
            The only "killer feature" that ZFS has that you can't do just as well with other solutions on Linux like LVM, BTRFS, etc is the snapshot replication features. That is the best feature and the main reason to use ZFS for NAS or Proxmox.

            Comment


            • #46
              Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post

              I think the biggest problem with the RAID10 example I gave is that nobody who is familiar with RAID10 from other systems would expect a write pattern like this to be possible.

              Code:
              | SDA | SDB | SDC | SDD |
              |-----|-----|-----|-----|
              | A1  | A2  | A1  | A2  |
              | B1  | B1  | B2  | B2  |
              | C1  | D1  | D1  | C1  |
              | D2  | C2  | C2  | D2  |
              I think for btrfs they should have actually named these profiles something else, because a lot of assumptions get made based on a name and previous familiarity / experience. I certainly wouldn't instinctively assume that writes were like "mini RAID10s" going everywhere willy-nilly and that I was totally screwed if any second disk fails. But yes, agreed on the volume management. ZFS is wonderfully simple to set up and manage compared to the unholy combined hell of layers like dm-crypt + dm-integrity + dm-raid etc.
               Thanks from me as well! Assuming you are right on this (I have no reason to doubt it, just saying), I will say I had no idea. When I first read your take on the three remaining disks (out of the original four), that for Btrfs there was a 100% chance of data loss with an additional failure, I was thinking "what is he talking about??" I then thought about the 1 in 3 chance (all else being equal) of the second mirror of the already degraded stripe pair (if I am saying that correctly) being the failed disk for a proper RAID 1+0 (10). Your visual above for Btrfs made it all very clear. Anyway, very interesting, and I appreciate you sharing.

              Comment


              • #47
                Originally posted by waxhead View Post
                I agree that the RAID terminology used in BTRFS is not very smart, and lots of people who complain about BTRFS don't get this. There was some work done a while ago on suggesting a different naming scheme, although one might argue that using the RAID name draws people in because of familiarity: they "instantly know" what it is all about.

                And no, you are not totally screwed if one disk fails, depending on how you configure metadata. The failure mode may be perfectly acceptable, and besides, if one disk fails it does not have to be as taxing for the drives to duplicate the remaining replica of the lost drive's data onto the other drives. E.g. with one drive lost, you *MAY* have a faster route to recovering the array than on a traditional RAID10, provided you have enough spare space.
                Interesting as well. More for me to look into and try to understand. I keep digging deeper and deeper into storage stuff. If I am reading this correctly, using the 2 + 2 disk example, if one disk fails, Btrfs could rebuild the missing replicas by copying chunks onto the remaining three disks, so that there would again be two copies of each chunk spread across those three disks. Of course, that depends on having the space available to do so. Good or bad probably depends on your take. But different, that is for sure.
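                If I understand it correctly, the recovery would look roughly like this; mount point and device names are placeholders, and the details depend on the profile in use and on free space:
                Code:
                # mount the array without the dead disk
                mount -o degraded /dev/sdb /mnt
                # path 1: a replacement disk is available - rebuild onto it (4 = devid of the missing disk)
                btrfs replace start -B 4 /dev/sde /mnt
                # path 2: no spare disk - convert to a profile the remaining disks can hold, then drop the missing device
                btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
                btrfs device remove missing /mnt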

                Comment


                • #48
                  Originally posted by milo_hoffman View Post
                  The only "killer feature" that ZFS has that you can't do just as well with other solutions on Linux like LVM, BTRFS, etc is the snapshot replication features. That is the best feature and the main reason to use ZFS for NAS or Proxmox.
                  The reason I switched to ZFS from my dm-integrity - mdraid - lvm - crypt - xfs setup was faster checksumming.
                  It is possible to get fast checksums with dm-integrity, but only when you store them on a different drive (which I did).

                  Another benefit of ZFS is that it is way easier to set up and maintain than my old stack.
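                  For anyone curious, the detached-metadata setup looked roughly like this; device names are made up and the exact options depend on the integritysetup version:
                  Code:
                  # keep the integrity tags/journal on a separate fast device for the data on /dev/sdb1
                  integritysetup format /dev/nvme0n1p3 --data-device /dev/sdb1 --integrity sha256
                  integritysetup open /dev/nvme0n1p3 inty0 --data-device /dev/sdb1 --integrity sha256
                  # the resulting /dev/mapper/inty0 is what went under mdraid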

                  Comment


                  • #49
                    Originally posted by ehansin View Post

                     Thanks from me as well! Assuming you are right on this (I have no reason to doubt it, just saying), I will say I had no idea. When I first read your take on the three remaining disks (out of the original four), that for Btrfs there was a 100% chance of data loss with an additional failure, I was thinking "what is he talking about??" I then thought about the 1 in 3 chance (all else being equal) of the second mirror of the already degraded stripe pair (if I am saying that correctly) being the failed disk for a proper RAID 1+0 (10). Your visual above for Btrfs made it all very clear. Anyway, very interesting, and I appreciate you sharing.
                    It's documented in the official btrfs "SysAdminGuide" on the kernel wiki. The content is all archived now, but the write strategy hasn't changed. The RAID10 example they give shows the disks being reordered for writing the 2nd chunk in the file. If the file was bigger (requiring more chunks), the device reordering can of course reoccur.

                    "Note that chunks within a RAID grouping are not necessarily always allocated to the same devices (B1-B4 are reordered in the example above). This allows Btrfs to do data duplication on block devices with varying sizes, and still use as much of the raw space as possible. "

                    Unfortunately the man page content for mkfs.btrfs has profile write layout examples but doesn't include one for RAID10. But it gives you some clues like this.

                    "Actual physical block placement on devices depends on current state of the free/allocated space and may appear random."

                    The whole thing was designed to prioritize flexibility and capacity utilization of storage (e.g. adding any random disk of any size). This bit in the documentation is very true, and is why I wish they had used something other than the traditional RAID nomenclature to describe these profiles.

                    "Btrfs's "RAID" implementation bears only passing resemblance to traditional RAID implementations. Instead, Btrfs replicates data on a per-chunk basis."

                    Btrfs will make as many copies of data or metadata as you tell it to; you may just be surprised where those chunks live on your storage devices. Again, this allows you to do all kinds of wacky stuff you can't do in most RAID setups. Whether this design is a pro or a con for you depends on your use cases.
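                    If you want to see where your chunks actually ended up, the usage commands make it fairly visible (replace /mnt with your mount point):
                    Code:
                    # per-device breakdown of how data/metadata chunks are allocated
                    btrfs device usage /mnt
                    # overall allocation per profile (Data,RAID10 / Metadata,RAID1 / ...)
                    btrfs filesystem usage /mnt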


                    Comment


                    • #50
                      Originally posted by EphemeralEft View Post

                      It's actually BTRFS | LVM | DM-Crypt | BCache | DM-RAID, where DM-Crypt is managed by Cryptsetup and DM-RAID is managed by LVM. The top-most LVM Layer is split into different filesystems for different purposes.

                      Although I'm not using it, LVM actually has the option to layer DM-Integrity over each RAID member for per-member corruption detection. Because DM-Integrity treats corruption as read errors, the other RAID members are automatically used if the data on one member is corrupt. The RAID layout is a 6x4TB "raid6_ls_6", which is a non-standard layout: left-symmetric RAID5 (distributed parity) with the last disk dedicated to Q-syndrome parity. This has the benefit that I can switch between RAID5 and "RAID6" without reshaping, at the expense of losing 1/6 of a disk's worth of read performance.
                      Thanks for sharing the info.
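                      For others reading along, if I understand it correctly, that per-member integrity option is exposed roughly like this (a sketch only; the VG name and sizes are made up, and the supported RAID types depend on the LVM version):
                      Code:
                      # create a raid6 LV with a dm-integrity layer under each RAID member
                      lvcreate --type raid6 -i 4 -L 10T --raidintegrity y -n data vg
                      # or add integrity to an existing RAID LV later
                      lvconvert --raidintegrity y vg/data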

                      Originally posted by EphemeralEft View Post
                      In theory RAID6 should also be able to tell which member is invalid in the case of a mismatch (without per-member DM-Integrity), but DM-RAID/LVM doesn't currently have that feature.
                      Theoretically, with RAID6 you have enough redundancy to check the correctness of the data and to rebuild it if needed. But this has a huge cost: you have to read from all the disks every time, even if you only want to read one sector. I don't think anybody does that.

                      The only case where it is cheap is when a read returns an error; in that case you have to rebuild the data anyway, so you have to read from all the disks regardless.

                      I have never used dm-integrity, but I have read that its performance is very bad.

                      Originally posted by EphemeralEft View Post
                      BCache is used in write-through mode, so the SSD can fail without data loss. My boot partition is a RAID1 at the beginning of all RAID members (thanks to Grub) so truly any 2 drives could fail without losing any data. I use the integrity checking of BTRFS as a sanity check of the RAID, BCache, and the SSD. It also functions as a janky method of "authenticated encryption". Besides the BTRFS RAID56 issues, at-rest encryption is important to me. So until BTRFS supports encryption, I'd need to encrypt all RAID members individually.​
                      On the btrfs mailing list there are patches under review for adding fscrypt support to btrfs. I don't think these patches will land soon, but at least the process has started.

                      Originally posted by EphemeralEft View Post
                      I honestly prefer having separate layers that I can manage myself. I can (and eventually will) switch BCache to DM-Cache. And move integrity checking from BTRFS to DM-Crypt for AEAD. A while ago I switched from MDAdm to DM-RAID. I couldn't mix and match implementations with an all-in-one solution. I also probably couldn't tweak as many settings.
                      I fully agree that it is better to have small independent blocks to combine. Switching btrfs from its internal RAID implementation to the dm one was even discussed, but it never happened. I don't know/remember whether that was for technical reasons or only because nobody worked on it. I suspect the latter.

                      Anyway, even if BTRFS doesn't handle RAID5/6 well, having the RAID integrated inside the filesystem brings some capabilities, like:
                      - re-reading / rebuilding corrupted data from a good copy,
                      - reshaping the RAID profile (e.g. switching raid1 <-> raid5),
                      - changing the number of disks (growing or *reducing*) - see the rough commands sketched below.
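                      A minimal sketch of the last two points, assuming a filesystem mounted at /mnt (device names are placeholders):
                      Code:
                      # reshape: convert data chunks to raid5 and metadata to raid1 in place
                      btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt
                      # grow: add a disk, then rebalance so existing chunks use it
                      btrfs device add /dev/sde /mnt && btrfs balance start /mnt
                      # shrink: remove a disk; its chunks are migrated to the remaining ones
                      btrfs device remove /dev/sdd /mnt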

                      Comment
