Originally posted by useless
https://www.spinics.net/lists/linux-btrfs/msg94447.html does a better job of explaining why fixing this is hard. Quoting directly:
We can get strategy #1 on btrfs by making two small(ish) changes:

1.1. allocate blocks strictly on stripe-aligned boundaries.
1.2. add a new balance filter that selects only partially filled RAID5/6 stripes for relocation.

The 'ssd' mount option already does 1.1, but it only works for RAID5 arrays with 5 disks and RAID6 arrays with 6 disks because it uses a fixed allocation boundary, and it only works for metadata because...it's coded to work only on metadata. The change would be to have btrfs select an allocation boundary for each block group based on the number of disks in the block group (no new behavior for block groups that aren't raid5/6), and do aligned allocations for both data and metadata.

This creates a problem with free space fragmentation, which we solve with change 1.2. Implementing 1.2 allows balance to repack partially filled stripes into complete stripes, which you will have to do fairly often if you are allocating data strictly on RAID-stripe-aligned boundaries. "Write 4K then fsync" uses 256K of disk space; since writes to partially filled stripes would not be allowed, we have 252K of wasted space and 4K in use. Balance could later pack 64 such 4K extents into a single RAID5 stripe, recovering all the wasted space. Defrag can perform a similar function, collecting multiple 4K extents into a single 256K or larger extent that can be written in a single transaction without wasting space.

Strategy #2 requires some disk format changes:

2.1. add a new block group type for metadata that uses simple replication (raid1c3/raid1c4, already done)
2.2. record all data blocks to be written to partially filled RAID5/6 stripes in a journal before modifying any blocks in the stripe.
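To make the "write 4K then fsync" arithmetic above concrete, here is a minimal sketch (not btrfs code; the 64K stripe element per data disk and the single parity disk for a 5-disk RAID5 are illustrative assumptions) of how a per-block-group, stripe-aligned allocation boundary produces the 256K / 252K / 64-extent numbers:

/* Sketch of the stripe-waste arithmetic from the quote above.
 * Assumptions (not btrfs source): 64K stripe element per data disk,
 * one parity disk for RAID5. */
#include <stdio.h>

#define STRIPE_ELEMENT (64UL * 1024)  /* bytes of data per disk per stripe */

/* Full-stripe data width for a block group with `disks` devices. */
static unsigned long full_stripe_bytes(unsigned int disks, unsigned int parity)
{
    unsigned int data_disks = disks - parity;
    return (unsigned long)data_disks * STRIPE_ELEMENT;
}

/* Round an allocation up to a full-stripe boundary (change 1.1). */
static unsigned long stripe_aligned(unsigned long bytes, unsigned long stripe)
{
    return ((bytes + stripe - 1) / stripe) * stripe;
}

int main(void)
{
    unsigned long stripe = full_stripe_bytes(5, 1);  /* 5-disk RAID5 */
    unsigned long write  = 4UL * 1024;               /* "write 4K then fsync" */
    unsigned long used   = stripe_aligned(write, stripe);

    printf("full stripe: %luK\n", stripe / 1024);          /* 256K */
    printf("allocated:   %luK\n", used / 1024);            /* 256K */
    printf("wasted:      %luK\n", (used - write) / 1024);  /* 252K */
    printf("4K extents balance could repack per stripe: %lu\n",
           stripe / write);                                /* 64 */
    return 0;
}

In these terms, change 1.2 amounts to a balance pass that selects stripes whose used bytes are below full_stripe_bytes() and rewrites their extents into fresh, fully packed stripes, recovering the wasted space.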
There is a reason why it hasn't been fixed yet: it's bloody hard to do. If you actually care that much about your data, ZFS is still the far superior option.