Bcachefs Linux File-System Benchmarks vs. Btrfs, EXT4, F2FS, XFS


  • #21
    Originally posted by PuckPoltergeist View Post

    Why should this be impossible on NOCOW filesystems?
To put it in a grossly simplistic way, writing on an XFS or Ext4 filesystem with data checksums would need to work along these lines:

    1. Add journal entry
    2. Write an extent
    3. Calculate and write the checksum
    4. Close journal entry

Now what happens if the system crashes between steps 2 and 3? Upon reboot, the extent written has no valid checksum, so the FS safety guarantees go out the window. And we also can't just recalculate it, precisely because in the absence of a valid checksum in the first place, there is no way to tell if the data have been corrupted or not.
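Here is a minimal toy sketch in Python (invented names, nothing resembling real XFS/Ext4 code) of why that crash window is fatal: the extent is overwritten in place while its checksum lives elsewhere, so a crash between the two writes leaves a mismatch that cannot be told apart from genuine corruption:

```python
import zlib

# Toy "disk": an in-place extent plus a checksum stored separately.
disk = {"extent": b"old data", "csum": zlib.crc32(b"old data")}

def nocow_write(new_data, crash_after_step2=False):
    disk["extent"] = new_data                # step 2: overwrite in place
    if crash_after_step2:
        return                               # crash between steps 2 and 3
    disk["csum"] = zlib.crc32(new_data)      # step 3: write matching checksum

nocow_write(b"new data", crash_after_step2=True)
# After "reboot" the checksum no longer matches, and nothing distinguishes
# this torn update from real on-disk corruption -- the old data is gone.
assert zlib.crc32(disk["extent"]) != disk["csum"]
```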



    • #22
      Originally posted by AndyChow View Post
By now, it's evident that BTRFS is badly designed. Bcachefs might not have all the features, but they are building them slowly, in a sane way, once the basics have been mastered and work. With BTRFS, everything was thrown together, and it doesn't work. A couple of weeks ago, a PCIe hardware failure caused my system to require a hard reboot. My RAID-1 btrfs array had the last leaf of the most recent tree corrupted. There was absolutely no way to repair it. The only thing I could do was dump the files onto another array, then destroy and re-format the btrfs array. All the recovery attempts and commands were guided by a btrfs developer.

So with BTRFS, a RAID-1 array that has the very last block of the very last write broken due to a power failure corrupts the entire filesystem, and there is no way to recover. So how is btrfs even COW? My understanding of COW is that you could just truncate the last modifications and recover everything that isn't too new. But no, it doesn't work that way, not with BTRFS.
AFAIK RAID1 has been production quality for many years and has been tested in similar scenarios countless times. Your problem probably has some deeper cause that may or may not be related to BTRFS.



      • #23
        Would you use XFS for your boot drive?



        • #24
          Originally posted by jacob View Post

          I think it's the other way around. ZFS and BTRFS have CoW as one of their main design features, and incidentally it allowed them to support data checksum.
No, it's not the other way around. Both ZFS and BTRFS chose to go with CoW for a reason, and checksumming without a write hole was one (of a long list) of those reasons.



          • #25
            Originally posted by jacob View Post

To put it in a grossly simplistic way, writing on an XFS or Ext4 filesystem with data checksums would need to work along these lines:

            1. Add journal entry
            2. Write an extent
            3. Calculate and write the checksum
            4. Close journal entry

Now what happens if the system crashes between steps 2 and 3? Upon reboot, the extent written has no valid checksum, so the FS safety guarantees go out the window. And we also can't just recalculate it, precisely because in the absence of a valid checksum in the first place, there is no way to tell if the data have been corrupted or not.
There is no difference from a COW filesystem. The commit was not done, so you have to discard it.



            • #26
              Originally posted by PuckPoltergeist View Post

There is no difference from a COW filesystem. The commit was not done, so you have to discard it.
              No. In a COW system it goes like this:

              1. write new extent, alongside the old one
              2. write new checksum alongside the old one
              3. write new metadata, with their own checksums etc., alongside old metadata
              4. commit

              Until step 4, the old data remain unchanged. If a crash occurs at any time during that period, the newly written data will be lost but we will use the old data, with a valid checksum.

              The commit itself is atomic and is basically akin to a single update of a field in the superblock. In other words it can't crash mid-way. After the commit, we use the newly written data and a valid checksum is in place.
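A toy sketch of that sequence (hypothetical names, not real btrfs/ZFS code): old blocks are never touched, and the commit is a single pointer update, so a crash at any earlier point leaves the old data with its matching checksum fully intact:

```python
import zlib

blocks = {0: b"old data"}                  # block number -> contents
csums = {0: zlib.crc32(b"old data")}       # per-block checksums
superblock = {"root": 0}                   # points at the live version

def cow_write(new_data, crash_before_commit=False):
    new_blk = max(blocks) + 1
    blocks[new_blk] = new_data             # step 1: new extent beside the old
    csums[new_blk] = zlib.crc32(new_data)  # step 2: new checksum beside the old
    if crash_before_commit:
        return                             # crash anywhere before step 4
    superblock["root"] = new_blk           # step 4: one atomic commit

cow_write(b"new data", crash_before_commit=True)
live = superblock["root"]
# The live version is still the old extent, and its checksum still matches.
assert blocks[live] == b"old data"
assert zlib.crc32(blocks[live]) == csums[live]
```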



              • #27
                Originally posted by jacob View Post

                No. In a COW system it goes like this:

                1. write new extent, alongside the old one
                2. write new checksum alongside the old one
                3. write new metadata, with their own checksums etc., alongside old metadata
                4. commit

                Until step 4, the old data remain unchanged. If a crash occurs at any time during that period, the newly written data will be lost but we will use the old data, with a valid checksum.

                The commit itself is atomic and is basically akin to a single update of a field in the superblock. In other words it can't crash mid-way. After the commit, we use the newly written data and a valid checksum is in place.
You're speaking about data consistency. That's something different. In a COW filesystem this is implicit, but you can achieve it with full data journaling on NOCOW filesystems too. Nevertheless, this is not relevant to data integrity, which is what checksums provide. Be careful not to mix up these two things.

PS: And to solve the problem you outlined above, we simply need to change the ordering a little:

1. Add journal entry, with the checksum included
                2. Write an extent
                3. write the checksum
                4. Close journal entry

Now you have the checksum in the log and can verify it on log replay.
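A toy sketch of that reordering (illustrative names only, not real journaling code): the checksum is computed from the data in RAM and logged before the extent is touched, so replay can decide whether the write landed completely:

```python
import zlib

journal = []                               # toy journal, newest entry last
disk = {"extent": b"old data"}

def logged_write(new_data, crash_after_data=False):
    journal.append({"csum": zlib.crc32(new_data), "closed": False})  # step 1
    disk["extent"] = new_data                                        # step 2
    if crash_after_data:
        return                             # crash before the entry is closed
    journal[-1]["closed"] = True                                     # step 4

def replay():
    entry = journal[-1]
    if entry["closed"]:
        return "clean shutdown, nothing to do"
    if zlib.crc32(disk["extent"]) == entry["csum"]:
        return "write completed, keep it"
    return "torn write, discard it"

logged_write(b"new data", crash_after_data=True)
print(replay())                            # -> "write completed, keep it"
```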
                Last edited by PuckPoltergeist; 02 June 2018, 05:30 PM.



                • #28
                  Originally posted by timofonic View Post

                  Would you like to add XFS to the table? Please...
Sure! Not much point, but here you go... I added EXT4 as well and differentiated between metadata checksums (checksums over the data that describes the filesystem itself) and data checksums (checksums over your stored files). I cleaned up the list a tad as well.

NOTE: Some of the features below may be achievable by utilizing other tools; this table represents only the filesystems' native support.
Feature           | Bcachefs                    | Btrfs            | XFS    | EXT4
Data checksum     | Yes, but not yet usable     | Yes              | No     | No
Metadata checksum | Yes, but not yet usable     | Yes              | Usable | Usable
Compression       | Yes, but not yet usable     | Yes              | No     | No
Scrubbing         | Not yet implemented         | Yes              | No     | No
Writeback caching | Yes                         | Not implemented* | No     | No
Replication       | Not yet implemented         | Yes              | No     | No
Encryption        | Yes, but advised not to use | Not implemented* | No     | Yes
Snapshots         | Not yet implemented         | Yes              | No     | No

                  http://www.dirtcellar.net



                  • #29
                    Originally posted by PuckPoltergeist View Post

                    You're speaking about data consistency.

                    [...]

                    Now you have the checksum in the log and can verify on log replay.
I'm speaking about consistency between data and the corresponding checksum. It basically boils down to the fact that after the data are written, you need to write the checksum using a separate disk operation. With a NOCOW filesystem there is no way to ensure that these two things are either executed consistently together, or not at all (remember ACID?).

Your solution does not work because, except in the most trivial cases, you don't know all the data in advance to be able to precalculate the checksums. It also doesn't cater for more complicated scenarios, like partially overwriting an existing extent.



                    • #30
                      Originally posted by jacob View Post

I'm speaking about consistency between data and the corresponding checksum. It basically boils down to the fact that after the data are written, you need to write the checksum using a separate disk operation. With a NOCOW filesystem there is no way to ensure that these two things are either executed consistently together, or not at all (remember ACID?).
This doesn't matter for integrity; it's about consistency. And you can achieve this with explicit full data journaling too. Without it, you may lose data, but that is pretty normal for filesystems without data journaling. It doesn't matter how the journaling is done (implicitly or explicitly), and it's totally independent of checksums.

edit: to make it clearer that this is independent of checksums, look at your example without the checksums:

                      1. Add journal entry
                      2. Write an extent
                      3. Close journal entry

If you have a crash between steps 1 and 3, your data is lost. It doesn't matter if you add any checksums: you can add them to detect data corruption, but the writeout still suffers from the same problem. That is a different problem, and one that checksums won't solve.
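In the same toy style as above (invented names, purely illustrative), the outcome of a mid-write crash is identical with or without checksums, which is the point: durability comes from journaling, not from checksums:

```python
import zlib

journal, disk = [], {"extent": b"old data"}

def journaled_write(new_data, with_csum, crash_mid_write=False):
    entry = {"closed": False}
    if with_csum:
        entry["csum"] = zlib.crc32(new_data)   # optional integrity info
    journal.append(entry)                      # step 1: open journal entry
    if crash_mid_write:
        return                                 # crash between steps 1 and 3
    disk["extent"] = new_data                  # step 2: write the extent
    journal[-1]["closed"] = True               # step 3: close the entry

for with_csum in (False, True):
    journaled_write(b"new data", with_csum, crash_mid_write=True)
    # Identical either way: the entry never closed, the new data is gone.
    assert not journal[-1]["closed"] and disk["extent"] == b"old data"
```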

Your solution does not work because, except in the most trivial cases, you don't know all the data in advance to be able to precalculate the checksums. It also doesn't cater for more complicated scenarios, like partially overwriting an existing extent.
It does. The data doesn't change during writeout, and the checksum is calculated over the data in RAM, so it doesn't matter whether it is calculated before or after writeout.
                      Last edited by PuckPoltergeist; 02 June 2018, 08:07 PM.

