Bcachefs File-System Plans To Try Again To Land In Linux 6.6
Originally posted by EphemeralEft:
All Linux filesystems (including BTRFS) have tiered storage if you put them on top of BCache (what BCacheFS is based on) or DM-Cache. Same with encryption if you put it on top of DM-Crypt. It's honestly a better solution than duplicating that work for every filesystem. And it works for non-filesystem block devices, too.
You can't have tiered storage with Btrfs that way; the filesystem itself has to support it.
Originally posted by EphemeralEft:
It's actually BTRFS | LVM | DM-Crypt | BCache | DM-RAID, where DM-Crypt is managed by Cryptsetup and DM-RAID is managed by LVM. The top-most LVM layer is split into different filesystems for different purposes.
I'm using BTRFS on top of LUKS2 on two (surely not weak) PCs (one running Fedora 38, the other Arch Linux), but on both, when doing a lot of I/O on the main disk (both have an SSD), there are sometimes hiccups (even of several seconds!) in the UI and in app responsiveness (so it's not just the UI that's I/O-starved)...
I've read about some cryptsetup options that tweak performance by disabling features that were needed when they were introduced but shouldn't be necessary anymore; sadly, they didn't fix the problem 100%...
These are the flags I've used: discards same_cpu_crypt submit_from_crypt_cpus no_read_workqueue no_write_workqueue
Code:
# cryptsetup status luks-b8d62e8e-...
/dev/mapper/luks-b8d62e8e-... is active and is in use.
  type:    LUKS2
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: keyring
  device:  /dev/sda3
  sector size:  512
  offset:  32768 sectors
  size:    465514496 sectors
  mode:    read/write
  flags:   discards same_cpu_crypt submit_from_crypt_cpus no_read_workqueue no_write_workqueue
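For what it's worth, those flags can usually be made persistent via /etc/crypttab rather than re-applied by hand. A sketch (not taken from the poster's system — the option names are the crypttab(5) spellings of the flags above, and the truncated UUID is left as in the output; check that your cryptsetup/systemd versions support them):

```
# /etc/crypttab — the fourth field carries the performance options
luks-b8d62e8e-...  UUID=b8d62e8e-...  none  discard,same-cpu-crypt,submit-from-crypt-cpus,no-read-workqueue,no-write-workqueue
```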
Thanks.
Originally posted by pWe00Iri3e7Z9lHOX2Qx:
I think the biggest problem with the RAID10 example I gave is that nobody who is familiar with RAID10 from other systems would expect a write pattern like this to be possible.
Code:
| SDA | SDB | SDC | SDD |
|-----|-----|-----|-----|
| A1  | A2  | A1  | A2  |
| B1  | B1  | B2  | B2  |
| C1  | D1  | D1  | C1  |
| D2  | C2  | C2  | D2  |
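To make the risk concrete, here is a small Python check (my own sketch using the copy placement from the table above, not Btrfs code): it enumerates every two-disk failure and counts how many lose both copies of some stripe-half, comparing against a classic RAID10 with fixed mirror pairs.

```python
from itertools import combinations

# Copy placement from the table above: each stripe-half (e.g. "A1")
# maps to the set of disks that hold a copy of it.
btrfs_layout = {
    "A1": {"SDA", "SDC"}, "A2": {"SDB", "SDD"},
    "B1": {"SDA", "SDB"}, "B2": {"SDC", "SDD"},
    "C1": {"SDA", "SDD"}, "C2": {"SDB", "SDC"},
    "D1": {"SDB", "SDC"}, "D2": {"SDA", "SDD"},
}

# A classic RAID10: the mirror pairs (SDA|SDB) and (SDC|SDD) are fixed.
classic_layout = {
    "A1": {"SDA", "SDB"}, "A2": {"SDC", "SDD"},
    "B1": {"SDA", "SDB"}, "B2": {"SDC", "SDD"},
}

def lossy_pairs(layout):
    """Return the 2-disk failure combinations that lose both copies of some chunk."""
    disks = set().union(*layout.values())
    return [pair for pair in combinations(sorted(disks), 2)
            if any(copies <= set(pair) for copies in layout.values())]

print(len(lossy_pairs(btrfs_layout)), "of 6 pairs fatal")    # every pair loses data
print(len(lossy_pairs(classic_layout)), "of 6 pairs fatal")  # only the 2 mirror pairs
```

With the chunk-shuffled layout, all 6 possible second failures destroy some chunk (the 100% figure discussed below), while the fixed-pair layout only loses data for 2 of 6 pairs — the 1-in-3 chance given one disk already failed.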
Originally posted by waxhead:
I agree that the RAID terminology used in BTRFS is not very smart, and lots of people who complain about BTRFS do not get this. There was some work done a while ago on suggesting a different naming scheme, although one might argue that using the RAID name draws people to it because of familiarity: they "instantly know" what it is all about.
And no, you are not totally screwed if one disk fails, depending on how you configure metadata. The failure mode may be perfectly acceptable, and besides, if one disk fails it does not have to be as taxing for the drives to duplicate the remaining replica of the lost drive's data onto the other drives. E.g. with one drive lost, you *MAY* have a faster route to recovering the array, if you have free space, than on a traditional RAID10.
Originally posted by milo_hoffman:
The only "killer feature" that ZFS has that you can't do just as well with other solutions on Linux like LVM, BTRFS, etc. is the snapshot replication feature. That is the best feature and the main reason to use ZFS for NAS or Proxmox.
It is possible to get fast checksums with dm-integrity, but only when you store them on a different drive (which I did).
Another benefit of ZFS is that it is way easier to set up and maintain than my old setup.
Originally posted by ehansin:
Thanks from me as well! Assuming you are right on this (I have no reason to doubt it, just saying), I will say I had no idea. When I first read your take on the three remaining disks (out of the original four), that for Btrfs there was a 100% chance of data loss with an additional failure, I was thinking "what is he talking about??" I then thought about the 1-in-3 chance (all else being equal) of the second mirror of the already-degraded stripe pair (if I am saying that correctly) being the failed disk for a proper RAID 1+0 (10). Your visual above for Btrfs made it all very clear. Anyway, very interesting, and I appreciate you sharing.
"Note that chunks within a RAID grouping are not necessarily always allocated to the same devices (B1-B4 are reordered in the example above). This allows Btrfs to do data duplication on block devices with varying sizes, and still use as much of the raw space as possible. "
Unfortunately, the mkfs.btrfs man page has profile write-layout examples but doesn't include one for RAID10. It does give you some clues, though, like this:
"Actual physical block placement on devices depends on current state of the free/allocated space and may appear random."
The whole thing was designed to prioritize flexibility and capacity utilization of storage (e.g. adding any random disk of any size). This bit in the documentation is very true, and is why I wish they had used something other than the traditional RAID nomenclature to describe these profiles.
"Btrfs's "RAID" implementation bears only passing resemblance to traditional RAID implementations. Instead, Btrfs replicates data on a per-chunk basis."
Btrfs will make as many copies of data or metadata as you tell it to; you may just be surprised where those chunks live on your storage devices. Again, this allows you to do all kinds of wacky stuff you can't do in most RAID setups. Whether this design is a pro or a con for you depends on your use cases.
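To illustrate the per-chunk idea, here is a hypothetical Python sketch (my own simplification, not the kernel allocator): each RAID1 chunk is simply mirrored on the two devices that currently have the most free space, which is why mixed-size devices still get used close to their theoretical capacity.

```python
def allocate_chunks(free_space, chunk_size=1):
    """Greedy per-chunk RAID1 allocation sketch.

    free_space: dict of device -> free capacity (same units as chunk_size).
    Returns the list of (device, device) placements, one per chunk.
    """
    free = dict(free_space)
    placements = []
    while True:
        # Pick the two devices with the most free space for this chunk's copies.
        best = sorted(free, key=free.get, reverse=True)[:2]
        if len(best) < 2 or free[best[1]] < chunk_size:
            break  # fewer than two devices can hold another copy
        for dev in best:
            free[dev] -= chunk_size
        placements.append(tuple(best))
    return placements

# 4 TB + 2 TB + 1 TB devices, 1 TB chunks for simplicity:
plan = allocate_chunks({"sda": 4, "sdb": 2, "sdc": 1})
print(len(plan), "chunks allocated")  # reaches 3 TB of mirrored capacity
```

With this device set, the greedy scheme allocates 3 mirrored chunks — the theoretical maximum for two-copy redundancy (half the raw total, capped by the space outside the largest device), which a fixed mirror-pair layout could not reach.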
Originally posted by EphemeralEft:
It's actually BTRFS | LVM | DM-Crypt | BCache | DM-RAID, where DM-Crypt is managed by Cryptsetup and DM-RAID is managed by LVM. The top-most LVM Layer is split into different filesystems for different purposes.
Although I'm not using it, LVM actually has the option to layer DM-Integrity over each RAID member for per-member corruption detection. Because DM-Integrity reports corruption as read errors, the other RAID members are automatically used if the data on one member is corrupt. The RAID layout is a 6x4TB "raid6_ls_6", a non-standard combination: left-symmetric RAID5 (distributed parity), but with the last disk dedicated to the Q-syndrome parity. This has the benefit that I can switch between RAID5 and "RAID6" without reshaping, at the expense of losing 1/6 of the disks' worth of read performance.
Originally posted by EphemeralEft:
In theory RAID6 should also be able to tell which member is invalid in the case of a mismatch (without per-member DM-Integrity), but DM-RAID/LVM doesn't currently have that feature.
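That "in theory" claim can be demonstrated. Below is a minimal Python sketch of the standard RAID6 math (generator g = 2 over GF(2^8) with the usual polynomial 0x11d) — my own illustration, not DM-RAID code: given stored P and Q parity, a single silently corrupted member can be located from the two syndrome deltas.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the RAID6 field polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d  # reduce modulo the field polynomial
        b >>= 1
    return r

def gf_pow2(n):
    """Compute g^n for the RAID6 generator g = 2."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, 2)
    return r

def syndromes(data):
    """P = XOR of all members; Q = sum over i of g^i * data[i]."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow2(i), d)
    return p, q

def locate_bad_member(data, p_stored, q_stored):
    """If exactly one data member is corrupt, return its index; else None."""
    p, q = syndromes(data)
    dp, dq = p ^ p_stored, q ^ q_stored
    if dp == 0 and dq == 0:
        return None  # stripe is consistent
    # A single corrupt member z shifts Q by g^z times the P delta,
    # so the index is the z satisfying g^z * dp == dq.
    for z in range(len(data)):
        if gf_mul(gf_pow2(z), dp) == dq:
            return z
    return None

good = [0x11, 0x22, 0x33, 0x44]
p, q = syndromes(good)
bad = list(good)
bad[2] ^= 0x5A                       # silently corrupt member 2
print(locate_bad_member(bad, p, q))  # -> 2
```

The catch is cost: detecting this on normal reads means recomputing Q across the whole stripe, which is presumably why implementations reserve it for scrubs, if they do it at all.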
The only case where that is cheap is when a read returns an error; in that case you have to rebuild the data anyway, so you have to read from all the disks.
I have never used dm-integrity, but I have read that its performance is very bad.
Originally posted by EphemeralEft:
BCache is used in write-through mode, so the SSD can fail without data loss. My boot partition is a RAID1 at the beginning of all RAID members (thanks to Grub), so truly any 2 drives could fail without losing any data. I use the integrity checking of BTRFS as a sanity check of the RAID, BCache, and the SSD. It also functions as a janky method of "authenticated encryption". Besides the BTRFS RAID56 issues, at-rest encryption is important to me. So until BTRFS supports encryption, I'd need to encrypt all RAID members individually.
Originally posted by EphemeralEft:
I honestly prefer having separate layers that I can manage myself. I can (and eventually will) switch BCache to DM-Cache, and move integrity checking from BTRFS to DM-Crypt for AEAD. A while ago I switched from MDAdm to DM-RAID. I couldn't mix and match implementations with an all-in-one solution. I also probably couldn't tweak as many settings.
Anyway, even if BTRFS doesn't support RAID5/6 well, having the RAID integrated inside the filesystem brings some capabilities, like:
- re-reading/rebuilding corrupted data from a good copy,
- reshaping the RAID profile (e.g. switching between raid1 <-> raid5),
- changing the number of disks (growing or *reducing*).
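For reference, those three capabilities map to ordinary btrfs commands. A sketch assuming a Btrfs filesystem mounted at /mnt and hypothetical device names — try this on scratch storage first:

```shell
# Rebuild corrupted data from a good copy (checksums pick the valid one):
btrfs scrub start /mnt

# Reshape the profile in place, e.g. data to raid5, metadata to raid1:
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt

# Grow or shrink the disk count while mounted:
btrfs device add /dev/sde /mnt
btrfs device remove /dev/sdb /mnt
```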