Announcement

**pal666** · 02 May 2018, 03:16 PM

Originally posted by nomadewolf

What's wrong with using Patreon to fund his project?
From what (little) I can see, he actually seems to be trying to solve a real problem in the OSS world, instead of just creating or forking yet another DE...

He claims there is some problem that he can surely fix if you pay him, but that dozens of full-time devs have no chance of fixing. So you need to pay him, or you are doomed. The wrong part is that he has a financial incentive to tell you this bullshit.
He'd better fork a new DE; at least that won't destroy your data.

**DrYak** · 03 May 2018, 10:19 AM

Originally posted by kreijack

I am not sure about what you reported. My understanding of what you wrote is that if I have, e.g., a RAID5 array composed of 6 disks of 100 GB each, a rebuild requires reading all 100 GB of each disk. What is allocated above doesn't matter.

RAID1 can be directly handled in LVM (you can ask LVM to do RAID1 when allocating a logical volume from 2 different physical volumes in a volume group).
(And I just now learned that LVM has recently been extended to handle RAID5/6 directly, too)

Originally posted by kreijack

In BTRFS, the check covers only the files stored on the disks (i.e., if the filesystem is half full, you need to read only 6x100/2 GB...)

Yup, because the part that does the RAID check can leverage knowledge of which parts of the filesystem are allocated.
(Except that RAID5/6 isn't considered production-ready in BTRFS as of today, so you might actually be *corrupting* the disks.
Patches are underway, though, and will probably get more intensive testing over the next few kernel releases.
RAID5/6 in BTRFS might become production-ready within a year if no hidden surprises are discovered.)
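As a rough illustration of kreijack's arithmetic, the sketch below (Python, with hypothetical function names) compares a block-layer check, which reads every sector of every member, with a filesystem-aware check that reads only what is allocated. It assumes allocation is spread evenly across the disks and ignores parity layout, so treat the numbers as back-of-the-envelope only.

```python
def full_device_read_gb(disks: int, disk_size_gb: float) -> float:
    """Block-layer check (e.g. MD): every sector of every member is read,
    regardless of what the filesystem above has allocated."""
    return disks * disk_size_gb


def allocation_aware_read_gb(disks: int, disk_size_gb: float, fill_ratio: float) -> float:
    """Filesystem-aware check (e.g. BTRFS): only the allocated fraction is read.
    Simplifying assumption: allocation is spread evenly across the members."""
    if not 0.0 <= fill_ratio <= 1.0:
        raise ValueError("fill_ratio must be between 0 and 1")
    return disks * disk_size_gb * fill_ratio


# kreijack's example: 6 disks of 100 GB, filesystem half full.
print(full_device_read_gb(6, 100))            # 600
print(allocation_aware_read_gb(6, 100, 0.5))  # 300.0
```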

In MD (which DM uses for the RAID implementation) there is no concept of "allocated extents". The fact that the layers above know what is allocated and where doesn't matter, because the resync is performed by the lower MD layer.

In *MD* (as in using "mdadm" to create a "/dev/md0" block device) there is indeed no such concept.

In *LVM* (which can do RAID on its own by asking the *DM* layer for it, which in turn may share some routines with MD), thin volumes and thin pools *HAVE* the concept of "allocated" extents. The fact that the layers above know what is allocated and where does matter now, because they can pass that information down to the LVM thin volume through "discard/fstrim".
So, when LVM triggers a RAID1 resync, it knows roughly which blocks from the thin pool are currently used in a thin volume and which are currently free, and thus LVM can resync only the used space, because it has access to the necessary information, same as BTRFS.
(In LVM's case, that's because the information gets propagated around through fstrim, in BTRFS's case that's because BTRFS is a full stack and is directly aware of its chunk and extents allocations).

In other words: put 2x 100 GB disks in a system.
Put the two disks in an LVM volume group.
Create a RAID1 logical volume across the two disks, and use it as a thin pool.
Create a 100 GB thin logical volume out of that thin pool.
At that point, LVM knows that the logical volume actually uses 0 GB on disk.
Write 50 GB worth of data to the filesystem on the logical volume.
At that point, LVM will auto-allocate 50 GB from the thin pool to the thin logical volume (the same way a VM's disk image "grows" automatically).
Delete 40 GB worth of that data from the filesystem.
Run fstrim on the filesystem.
At that point, LVM is informed that only 10 GB from the thin pool are actually used by the thin logical volume. The other 40 GB can be reclaimed and returned to the thin pool, free to be used if another thin volume needs to allocate them.
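The walk-through above can be modelled with a toy thin pool. This is a sketch under simplifying assumptions (1 GB allocation units, a single pool, made-up class and method names); it is not LVM's interface, and real thin pools allocate in much smaller, configurable chunks.

```python
class ThinPool:
    """Toy model of an LVM-style thin pool, tracked in 1 GB chunks.

    Space leaves the pool only when a thin volume writes to a chunk for the
    first time, and goes back to the pool when that chunk is discarded."""

    def __init__(self, size_gb: int):
        self.size_gb = size_gb
        self.virtual_size = {}  # volume name -> advertised (virtual) size in GB
        self.allocated = {}     # volume name -> set of allocated chunk indices

    def create_volume(self, name: str, virtual_gb: int) -> None:
        # A new thin volume allocates nothing, whatever size it advertises.
        self.virtual_size[name] = virtual_gb
        self.allocated[name] = set()

    def used_gb(self) -> int:
        return sum(len(chunks) for chunks in self.allocated.values())

    def write(self, name: str, start_gb: int, length_gb: int) -> None:
        if start_gb < 0 or start_gb + length_gb > self.virtual_size[name]:
            raise ValueError("write beyond the end of the thin volume")
        new_chunks = set(range(start_gb, start_gb + length_gb)) - self.allocated[name]
        if self.used_gb() + len(new_chunks) > self.size_gb:
            raise RuntimeError("thin pool exhausted")
        self.allocated[name] |= new_chunks

    def discard(self, name: str, start_gb: int, length_gb: int) -> None:
        # Models what fstrim's discard requests lead to: chunks return to the pool.
        self.allocated[name] -= set(range(start_gb, start_gb + length_gb))


# The walk-through: 100 GB pool, 100 GB thin volume.
pool = ThinPool(size_gb=100)
pool.create_volume("data", virtual_gb=100)
print(pool.used_gb())         # 0
pool.write("data", 0, 50)     # write 50 GB
print(pool.used_gb())         # 50
pool.discard("data", 10, 40)  # delete 40 GB, then fstrim
print(pool.used_gb())         # 10
```

Rewriting chunks that are already allocated costs nothing extra, which is why `write` only charges the pool for `new_chunks`.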

Whenever you read from these 10 GB, LVM reads the actual blocks on the disks and returns the data.
Whenever you attempt to read outside of these 10 GB (e.g., you make a disk image using dd instead of partclone), LVM returns zeros (NULs) without reading from the disk: it knows that the 40 GB of formerly used data aren't used anymore, since those blocks aren't currently assigned to any thin volume.
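A minimal sketch of that read path, assuming a simple virtual-to-physical block table (the class, the 4096-byte block size, and the in-memory "disk" are all hypothetical): mapped blocks are read from the device, while unmapped ones are answered with zeros and never touch it. This is also why a dd-style image of a thin volume still comes out at the volume's full virtual size.

```python
from __future__ import annotations

BLOCK_SIZE = 4096  # illustrative block size in bytes


class ThinVolumeReader:
    """Toy read path of a thin volume: a mapping table translates virtual blocks
    to physical blocks in the pool, and unmapped virtual blocks read as zeros."""

    def __init__(self, disk: bytearray, mapping: dict[int, int]):
        self.disk = disk        # stands in for the physical device
        self.mapping = mapping  # virtual block -> physical block
        self.disk_reads = 0     # how many times the "disk" was actually read

    def read_block(self, vblock: int) -> bytes:
        pblock = self.mapping.get(vblock)
        if pblock is None:
            # Never written, or discarded by fstrim: answer with zeros, no disk I/O.
            return bytes(BLOCK_SIZE)
        self.disk_reads += 1
        start = pblock * BLOCK_SIZE
        return bytes(self.disk[start:start + BLOCK_SIZE])


# A dd-style image reads every virtual block, mapped or not.
disk = bytearray(b"\xab" * (4 * BLOCK_SIZE))
volume = ThinVolumeReader(disk, mapping={0: 2})
image = b"".join(volume.read_block(v) for v in range(4))
print(volume.disk_reads)  # 1: only virtual block 0 is backed by the disk
```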

If you trigger a resync and LVM is in charge of the RAID (LVM calling into DM and the kernel's MD routines), it should be able to resync only the 10 GB that are currently used in the thin volume.
(If LVM is simply used to partition an mdadm /dev/md0 volume, and mdadm is in charge of it, mdadm will resync the whole 2x 100 GB, because it has no concept of which parts are used and which are not.)
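The difference between the two cases comes down to which ranges the resync has to copy. The sketch below assumes the upper layer can hand the RAID code a list of used extents (the helper names are hypothetical, sizes in GB); whether LVM's RAID support actually does this for a thin pool is the post's expectation ("should be able to"), not something the sketch demonstrates.

```python
def resync_ranges(device_size_gb, used_extents=None):
    """Ranges (start_gb, length_gb) a RAID1 resync would copy.

    used_extents=None models a layer with no allocation knowledge, such as plain
    mdadm: the whole device is copied. Otherwise only the used extents are
    copied, after merging overlapping or adjacent ones into larger runs."""
    if used_extents is None:
        return [(0, device_size_gb)]
    merged = []
    for start, length in sorted(used_extents):
        end = start + length
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current run
        else:
            merged.append([start, end])              # start a new run
    return [(start, end - start) for start, end in merged]


def total_gb(ranges):
    return sum(length for _, length in ranges)


print(total_gb(resync_ranges(100)))                  # 100: mdadm copies everything
print(total_gb(resync_ranges(100, [(0, 10)])))       # 10: only the used space
print(resync_ranges(100, [(5, 3), (0, 4), (4, 2)]))  # [(0, 8)]
```

Merging adjacent extents keeps the copy sequential where possible instead of issuing many small I/Os.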

Note, though, that mdadm can have a very primitive notion of "used blocks" through its write-intent bitmap. So when a disk drops out and is re-added, mdadm only needs to rewrite the blocks that changed in the meantime.
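The idea behind a write-intent bitmap can be sketched as one bit per fixed-size region: while a member is missing, writes mark their regions dirty, and on re-add only those regions are copied. The class name and the 64 MiB region size below are illustrative assumptions, not md's on-disk bitmap format.

```python
class WriteIntentBitmap:
    """Toy write-intent bitmap: one bit per region of the array. While a mirror
    member is missing, every write marks its region dirty; when the member is
    re-added, only dirty regions have to be copied to it."""

    def __init__(self, device_size: int, region_size: int):
        self.region_size = region_size
        self.regions = -(-device_size // region_size)  # ceiling division
        self.bits = bytearray(-(-self.regions // 8))   # 8 regions per byte

    def mark_write(self, offset: int, length: int) -> None:
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for region in range(first, last + 1):
            self.bits[region // 8] |= 1 << (region % 8)

    def dirty_regions(self) -> list:
        return [r for r in range(self.regions) if self.bits[r // 8] & (1 << (r % 8))]

    def clear(self) -> None:
        # After a successful resync the re-added member is in sync again.
        self.bits = bytearray(len(self.bits))


GiB, MiB = 2**30, 2**20
bitmap = WriteIntentBitmap(device_size=100 * GiB, region_size=64 * MiB)
# While one mirror member is missing, two writes happen:
bitmap.mark_write(0, 4096)
bitmap.mark_write(10 * GiB, 1 * MiB)
print(bitmap.dirty_regions())  # [0, 160]: only two 64 MiB regions need copying
```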

Someone could potentially intercept "discard/fstrim" commands and use them with a more advanced bitmap to keep a map of used vs. free regions.
But AFAIK, nothing like that is currently implemented.
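For illustration only, here is what such a discard-fed map might look like. The post notes that MD does not currently implement this, so the class below is purely hypothetical. The one subtle point is rounding: a write marks every region it touches as used, but a discard may only mark a region free if it covers the whole region, since a partially discarded region can still hold live data.

```python
class AllocationBitmap:
    """Hypothetical used/free map fed by writes and discards, one flag per region.
    A resync could skip every region that is not marked used."""

    def __init__(self, device_size: int, region_size: int):
        self.region_size = region_size
        self.used = [False] * (-(-device_size // region_size))  # ceiling division

    def on_write(self, offset: int, length: int) -> None:
        if length <= 0:
            return
        # Round outward: any region touched by the write now holds data.
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for r in range(first, last + 1):
            self.used[r] = True

    def on_discard(self, offset: int, length: int) -> None:
        # Round inward: only regions entirely covered by the discard become free.
        first = -(-offset // self.region_size)
        end = (offset + length) // self.region_size
        for r in range(first, end):
            self.used[r] = False

    def regions_to_resync(self) -> list:
        return [r for r, used in enumerate(self.used) if used]


bm = AllocationBitmap(device_size=1000, region_size=100)
bm.on_write(0, 500)      # regions 0-4 now hold data
bm.on_discard(150, 300)  # frees bytes 150-449; only regions 2 and 3 are fully covered
print(bm.regions_to_resync())  # [0, 1, 4]
```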

In LVM it is implemented and relies on the "discard" requests sent by the fstrim command, but, as pal666 mentioned, this duplicates information across layers.
In BTRFS it is implemented because the full stack is directly aware of that (and much more).

**sdack** · 03 May 2018, 12:10 PM

Originally posted by RahulSundaram

Good that you aren't denying the obvious fact anymore. That served its purpose.

Good? No. You're the one who denies that there can be friendships in commerce. It's you who thinks money rules out friendship, when really it's greed that destroys it. You just haven't connected the dots yet.

**nomadewolf** · 03 May 2018, 01:48 PM

Originally posted by pal666

He claims there is some problem that he can surely fix if you pay him, but that dozens of full-time devs have no chance of fixing. So you need to pay him, or you are doomed. The wrong part is that he has a financial incentive to tell you this bullshit.
He'd better fork a new DE; at least that won't destroy your data.

One may not agree with the financial incentive.
But no one can deny that he is 'giving' the code back, so anyone who understands it can follow the progress (or not).
I'm no expert on filesystems, but my understanding is that a new, better filesystem is needed; otherwise BTRFS and Stratis wouldn't have shown up.
I'd also argue that Red Hat's incentive for Stratis is financial too. The difference is that Red Hat has the means to invest up front and 'collect' later.
So, financial incentive can be a good driving force, IMO.

My question is: is bcachefs actually a good FS (or does it have the potential to be one)?

Learning More About Red Hat's Stratis Project To Offer Btrfs/ZFS-Like Functionality
