Originally posted by starshipeleven
RAID in mdadm/dm and in BTRFS is different.
Even if they do actually share code (unlike ZFS, which seems to reimplement its own damn kitchen sink; that's one of the reasons it's a lot less liked than BTRFS, and why everyone complains about layering violations in ZFS but not in BTRFS. ZFS is its own kitchen sink; BTRFS is actually a whole stack that implements a B-tree based filesystem, but also shares code with other kernel facilities. It's more a kind of wrapper around B-trees + RAID shared with mdadm/dm + partitioning shared with dm/lvm/evm/dynvol/blah...)
That shared code is only used to read and write stripes. That's it.
Then you need a bunch of management code.
Just like mdadm adds its RAID array management code, or all clients of device mapper like LVM add their own volume management.
In the case of BTRFS that code is relatively simple. It's all about using checksumming to pick which of the present copies is the correct one.
And handling adding/removing storage to the pool.
And some balancing code to spread the copies around (that's what the whole damn B-tree technology is about).
In BTRFS this code has existed for ages. It's rock solid. It's been put into production by companies like Facebook, Suse, etc.
It has survived all the testing thrown at it.
The only missing feature is being able to parametrically set the number of copies (putting 3 instead of 2 to be able to lose 2 drives instead of 1, to match the characteristics of RAID6).
And that's currently being worked on, and will end up in production some day in the future.
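To illustrate the "use checksums to pick the correct copy" idea for the mirrored case, here is a minimal sketch. It is not BTRFS code: the function name `read_with_repair` is made up, the copies are plain byte strings, and `zlib.crc32` is only a stand-in for whatever checksum the filesystem actually stores.

```python
import zlib


def read_with_repair(copies, expected_crc):
    """Return (good_data, repaired_indices) for a mirrored block.

    `copies` is a list of byte strings, one per mirror (hypothetical layout).
    `expected_crc` is the checksum recorded when the block was written.
    Bad copies are overwritten in place with the first good copy found.
    """
    good = None
    bad = []
    for index, blob in enumerate(copies):
        if zlib.crc32(blob) == expected_crc:
            if good is None:
                good = blob
        else:
            bad.append(index)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for index in bad:
        copies[index] = good  # self-heal the corrupted mirror
    return good, bad


if __name__ == "__main__":
    original = b"hello btrfs"
    stored_crc = zlib.crc32(original)
    mirrors = [b"hellX btrfs", original]  # first mirror is silently corrupted
    print(read_with_repair(mirrors, stored_crc))
```

A plain mirror without checksums can only tell that the two copies differ, not which one is right. That is the whole difference.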
RAID5 is *completely different*. Its redundancy works by having 1 (or 2) parity blocks for every N-1 (or N-2) blocks of data.
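As a rough sketch of the single-parity case only: the parity block is the XOR of the data blocks in a stripe, and any one missing block can be recomputed from the survivors. The helper names `xor_blocks` and `rebuild_missing` are made up for illustration, and the second parity block of RAID6 (which is not a plain XOR) is not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_data, parity):
    """Recompute the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_data) + [parity])


if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC"]  # N-1 data blocks
    parity = xor_blocks(stripe)           # 1 parity block
    # The drive holding stripe[1] dies; rebuild it from the rest.
    print(rebuild_missing([stripe[0], stripe[2]], parity))
```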
First you need a component that handles an array of drives, reacts correctly when one is missing, and is able to rebuild a replacement drive.
Although mdadm had it almost from day one (that was the whole point), and DMraid correctly leverages dm to do it (even if most firmware RAIDs only support combos of RAID0 and 1), LVM still doesn't feature it in mainstream.
(And do you complain that LVM is "not production ready after a decade" just because it missed RAID5/6 for so long? Nope. You just stack mdadm RAID5/6 under it or restrict yourself to RAID0/1 in LVM).
BTRFS is in the same position as LVM here (except that the experimental RAID5/6 feature is in the vanilla kernel code instead of floating around as some patches on some obscure mailing list).
Then for BTRFS you would need code that can leverage checksums to work out which combo of data and parity is correct, and which block (data or parity) is corrupt.
That code doesn't exist yet in the kernel. That code needs to be written.
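To give an idea of what such logic has to decide, here is a toy sketch for a single-parity stripe. It is purely hypothetical and says nothing about how the kernel will eventually do it: `diagnose_stripe` is an invented name, `zlib.crc32` stands in for the real checksum, and it assumes data blocks carry a stored checksum while the parity block does not.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def diagnose_stripe(data, crcs, parity):
    """Return (verdict, replacement_block_or_None) for one stripe."""
    bad = [i for i, (blk, crc) in enumerate(zip(data, crcs))
           if zlib.crc32(blk) != crc]
    if not bad:
        # All data verifies, so any mismatch must be in the parity block.
        expected = xor_blocks(data)
        if expected == parity:
            return "ok", None
        return "parity corrupt", expected
    if len(bad) > 1:
        return "unrecoverable", None  # single parity cannot fix two blocks
    index = bad[0]
    others = [blk for j, blk in enumerate(data) if j != index]
    rebuilt = xor_blocks(others + [parity])
    # Only trust the rebuild if it matches the stored checksum.
    if zlib.crc32(rebuilt) == crcs[index]:
        return f"data block {index} corrupt", rebuilt
    return "unrecoverable", None
```

The last check is the important one: if the parity itself is stale or corrupt, the rebuilt block fails its checksum and must not be written back.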
Originally posted by gbcox
We're talking about an FS that is in production at several companies.
We're talking about something that is a great tool, but that comes with quite a few caveats that you need to pay attention to (or that you need to delegate to someone else to manage for you. You don't give a shit how FB runs their installation. And if you pick a product from Suse or Jolla, you rely on their code to handle the management for you).
What you, gbcox, are talking about is your dream filesystem that doesn't exist yet.
What you're constantly bitching about boils down, according to starshipeleven, to "I want a file system that is like ZFS, but has none of ZFS's drawbacks... and works for everything from embedded all the way to massive clusters... and BTRFS isn't that thing I dream of... Waaaaaaaa....."
Originally posted by gbcox
We're talking about a project that is very complex, and has hundreds of features.
Some of them have been rock solid for a couple of years.
Some of them are still highly experimental (RAID5/6)
Some don't even exist yet (on-line dedup, integrated crypto, integrated B-cache style multilayered caching).
Do you complain that LVM is not complete because it doesn't have a fully functional and production-ready RAID5/6?
Did you use to complain that EXT2/3/4 was not production ready because it missed an integrated crypto layer for so long, or because its compression still isn't mainline even today?
Same for BTRFS: it IS production ready for some usage patterns, still experimental for others.
Originally posted by gbcox
It's one thing to complain that BTRFS was done too hastily.
It's another thing to actually deliver all the things that he promises (snapshots for free, checksumming).
And let's see if he ever gets production ready RAID5/6.
I'm not doubting that he will manage to get it done eventually. I'm just saying, don't expect a feature-perfect, flawless BcacheFS to suddenly pop up into existence in the next few months.
Originally posted by gbcox
The one missing piece, from a reliability point of view, is that it is still vulnerable to the parity RAID "write hole", where a partial write as a result of a power failure will result in inconsistent parity data.
- Parity may be inconsistent after a crash (the "write hole")
- Parity data is not checksummed
Or said differently: "scrub", the basic functionality that you (or Suse's scripts) need to run regularly to make sure that your data is okay, could very probably eat your data.
(And apparently that's exactly what was confirmed and requires a rewrite).
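The "write hole" quoted above can be shown with a toy model: a data block gets updated, the power fails before the matching parity update lands, and a later rebuild from that stale parity silently produces garbage. Everything here (the `simulate_write_hole` name, the dict standing in for three drives) is invented for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate_write_hole():
    """Return (what d1 really held, what a rebuild of d1 yields)."""
    disks = {"d0": b"AAAA", "d1": b"BBBB"}
    disks["p"] = xor_blocks([disks["d0"], disks["d1"]])

    # Partial stripe write: new data reaches d0 ...
    disks["d0"] = b"ZZZZ"
    # ... and the power fails here, before parity is rewritten.

    # Later, the drive holding d1 dies and is rebuilt from d0 + stale parity.
    true_d1 = disks.pop("d1")
    rebuilt_d1 = xor_blocks([disks["d0"], disks["p"]])
    return true_d1, rebuilt_d1


if __name__ == "__main__":
    true_d1, rebuilt_d1 = simulate_write_hole()
    print(true_d1, rebuilt_d1, true_d1 == rebuilt_d1)
```

Note that d1 was never written to, yet it is the block that comes back wrong. Without a checksum on the parity, nothing flags the stale parity before it is used.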
On the other hand, *that* is the part rewritten after ten years (the same part that isn't even handled properly by LVM and requires stacking with mdadm).
In other news, RAID0/1 are still performing as they should, and are still giving good results in production.
The only complaint is not yet being able to set the width of RAID1 to 3 (2 copies).
I'm *also* following BTRFS closely, because I'm *also* using it personally and professionally.
For me, it was clear BTRFS RAID6 should not be relied on yet.
Currently, I stack it above a classical mdadm+lvm whenever raid6 is needed (at home).
It needs a rewrite. Major portions are still experimental.
Major server vendors are quietly moving to XFS instead. It has simply taken too long to stabilize.
- It still can only be grown and not reduced in mainstream code. That's a critical feature missing!~ XFS sucks~
- There is no copy-on-write yet, nor is it log-structured. The system relies on a simple journal!~ In 2016!~ When UDF has featured it for ages!~
- There are no snapshots for free. XFS relies on freezing the FS, doing a slow snapshot in LVM, and continuing only afterward. It has nothing better than EXT4~
And now, they are rewriting part of the file allocation code. They are replacing the B+ trees with plain B-trees (hmm... where have I heard of that one being used before?) Their code is unfinished, and BTRFS is the better solution!~
Fedora attempted several times to make it the default and wisely chose against it.
I would tend to agree however that based upon the apparent development cycle of BTRFS, that by the time it is ready, it very well could be a moot point - along with many other features they are "working" on - but not everyone expects a filesystem project to be decades in the making.