Bcachefs Sees More Fixes For Linux 6.9-rc4, Reiterates Its Experimental Nature

  • #21
    Originally posted by Quackdoc View Post

    Personally, I don't think BTRFS is stable, due to the series of issues I had with it; I guess that's almost 2 years ago now. And sadly these are issues many people have experienced too. I still find btrfs just too much of a hassle anyway: even when I wasn't running into data loss, I hit annoyances where I would run out of space and deleting files wouldn't recover any. Thankfully this isn't an issue with bcachefs.
    The only data corruption I've ever had was with ext4, but that doesn't mean I'd call ext4 an unreliable file system. I have been using Linux for 15 years, the last 6 of them on btrfs, and have never had any problems. I use it mainly because I need regular snapshots, and it works well (at least in openSUSE). Btrfs suffers from a bad reputation that was created years ago; now it's a good file system. Is it perfect? No, but neither are the others. One person's opinion, even if based on personal experience, doesn't prove anything; it's the numbers that make the difference, and I am not aware of any numbers demonstrating that btrfs is in bad shape. In the absence of such numbers, the only thing that is valid is the official documentation, which says what is stable and what is not; everything else is nonsense.

    Comment


    • #22
      Originally posted by woddy View Post

      The only data corruption I've ever had was with ext4, but that doesn't mean I'd call ext4 an unreliable file system. I have been using Linux for 15 years, the last 6 of them on btrfs, and have never had any problems. I use it mainly because I need regular snapshots, and it works well (at least in openSUSE). Btrfs suffers from a bad reputation that was created years ago; now it's a good file system. Is it perfect? No, but neither are the others. One person's opinion, even if based on personal experience, doesn't prove anything; it's the numbers that make the difference, and I am not aware of any numbers demonstrating that btrfs is in bad shape. In the absence of such numbers, the only thing that is valid is the official documentation, which says what is stable and what is not; everything else is nonsense.
      It failed irrecoverably across multiple devices, a total of (I think) 7 times within a span of 6-9 months for me, all single-drive devices.

      Comment


      • #23
        Originally posted by Quackdoc View Post
        I've hit annoyances where I will run out of space and deleting files doesn't recover space.
        Is there a public discussion of this, or did it just happen on your machine?

        Comment


        • #24
          Originally posted by Mitch View Post

          Just to add something here: Facebook has an insane number of machines in its fleet. Those machines have used BTRFS as the standard filesystem for many years, I think on the boot drive. They routinely make use of its compression, send, and receive, from what I remember. Probably other features, too.

          Fedora also ships it by default. I think the BTRFS-isn't-stable arguments should have died ages ago. BTRFS is one of many super-reliable filesystems and has been for years regardless of how rough it may have been at its beginning, many, many years ago.
          I think you can make a good case there for btrfs being suitable for Facebook's use-case(s). Other people exercise the filesystem in different ways.

          Fedora is interesting: in principle it will be generating a wider set of use-cases than Facebook simply because more organisations use it. In practice, I suspect people choose to use Fedora for particular reasons, and that might well limit the number of different use-cases being exercised.

          There are people out there who do really weird stuff and so exercise the more extreme edges of file system behaviour more than 'boring corporate' high-volume high reliability stuff.

          Not least, home users who are not using 'enterprise hardware' with ECC and stable power supplies. How well does the file system cope with being run on less reliable consumer hardware? A purist might claim that people who don't run 'professional' systems deserve all they get, but in reality most vehicles on the road are private cars maintained sporadically, not fleet lorries checked daily. A filesystem suitable for professional use might be unable to perform on consumer hardware. If filesystem 'A' reports no errors and filesystem 'B' shows errors, or even worse gives up, then the consumer experience is that filesystem 'A' is 'better', because it causes them fewer obvious problems.

          None of this might apply to btrfs - but professional use-cases and consumer use cases are different, and place different demands on a filesystem. Being technically correct while preventing an end-user from watching a cat-video might not give a file system a good reputation. Do I care if a bit-error means that one pixel is off by one in one colour channel in one frame of a cat video? Probably not, but some systems will refuse to play the corrupt file, and could even mount the entire file-system read-only until things are 'sorted out'. Fast, benign, recovery from errors can be important, but very hard to implement.

          Comment


          • #25
            Originally posted by timofonic View Post

            Another Btrfs propagandist from Meta...
            Why would Meta need to make pro-btrfs propaganda?
            It isn't a product of theirs, and they aren't making any money if more users use it.

            Comment


            • #26
              Originally posted by Old Grouch View Post
              Not least, home users who are not using 'enterprise hardware' with ECC and stable power supplies. How well does the file system cope with being run on less reliable consumer hardware?
              Btrfs used to be rather fragile when used with failing RAM, but lately it has become much more reliable in such scenarios.

              You shouldn't run any filesystem (or any software in general) on such systems, anyway.

              Comment


              • #27
                Originally posted by cynic View Post

                Btrfs used to be rather fragile when used with failing RAM, but lately it has become much more reliable in such scenarios.

                You shouldn't run any filesystem (or any software in general) on such systems, anyway.
                That's almost blaming the victim. Most consumers don't know what RAM is, let alone ECC RAM.

                If you don't have ECC RAM, and your filesystem is not checksumming data (and metadata), errors can be completely missed, until something critical is affected. Unless and until consumer hardware takes data integrity seriously (which is more than ECC and RAID), we need to accept that filesystems will be run on less than perfect hardware. Obviously, I'd like that to be different, but cost-engineering works against high-quality systems.
                This means that filesystems that are fragile in the face of data corruption are rated as less good than ones that ignore corruption altogether, and keep 'working' until they can't.
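The checksumming point can be made concrete with a toy sketch in Python (purely illustrative; this is not btrfs's actual on-disk format or algorithm): a filesystem that stores a checksum alongside each block can report a silent bit flip instead of returning corrupt data as if it were valid.

```python
import zlib

def store_block(data: bytes) -> tuple[bytes, int]:
    """Store a data block together with its CRC32 checksum."""
    return data, zlib.crc32(data)

def read_block(data: bytes, checksum: int) -> bytes:
    """Return the block only if its checksum still matches."""
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: silent corruption detected")
    return data

block, csum = store_block(b"important user data")

# Simulate a single bit flip caused by bad RAM or a failing medium.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

read_block(block, csum)          # intact data passes
try:
    read_block(corrupted, csum)  # corruption is reported, not silently returned
except IOError as e:
    print(e)
```

Without the stored checksum, the corrupted block would be indistinguishable from valid data, which is exactly the "keep 'working' until they can't" failure mode.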

                If your development model assumes perfection in the lower layers of the software/hardware stack that you are reliant upon, you are in for some nasty surprises. Graceful failure is preferable to unreported failure, even though unreported failure gives a better user experience for a while, until things fail catastrophically.

                Think of a suspension bridge where cables fail at the rate of one per year, and the bridge can cope with 3 simultaneous failures.

                Year 1: A cable fails. If it is reported, maintenance and/or replacement can be scheduled.
                Year 2: A cable fails. If it is reported, maintenance and/or replacement can be scheduled. If the previous cable has not been replaced, you can increase the urgency of the maintenance.
                Year 3: A cable fails. If it is reported, maintenance and/or replacement can be scheduled. If the previous cables have not been replaced, you can increase the urgency of the maintenance. You do the maintenance.
                Year 4: A cable fails. If it is reported, maintenance and/or replacement can be scheduled. If the failures were unreported, you now unexpectedly need a new bridge.

                Error handling is important. You can do the maintenance, or unexpectedly need to rebuild the bridge every four years or so.

                Comment


                • #28
                  Originally posted by Old Grouch View Post

                  That's almost blaming the victim. Most consumers don't know what RAM is, let alone ECC RAM.

                  If you don't have ECC RAM, and your filesystem is not checksumming data (and metadata), errors can be completely missed, until something critical is affected. Unless and until consumer hardware takes data integrity seriously (which is more than ECC and RAID), we need to accept that filesystems will be run on less than perfect hardware. Obviously, I'd like that to be different, but cost-engineering works against high-quality systems.
                  This means that filesystems that are fragile in the face of data corruption are rated as less good than ones that ignore corruption altogether, and keep 'working' until they can't.

                  If your development model assumes perfection in the lower layers of the software/hardware stack that you are reliant upon, you are in for some nasty surprises. Graceful failure is preferable to unreported failure, even though unreported failure gives a better user experience for a while, until things fail catastrophically.

                  Think of a suspension bridge where cables fail at the rate of one per year, and the bridge can cope with 3 simultaneous failures.

                  Year 1: A cable fails. If it is reported, maintenance and/or replacement can be scheduled.
                  Year 2: A cable fails. If it is reported, maintenance and/or replacement can be scheduled. If the previous cable has not been replaced, you can increase the urgency of the maintenance.
                  Year 3: A cable fails. If it is reported, maintenance and/or replacement can be scheduled. If the previous cables have not been replaced, you can increase the urgency of the maintenance. You do the maintenance.
                  Year 4: A cable fails. If it is reported, maintenance and/or replacement can be scheduled. If the failures were unreported, you now unexpectedly need a new bridge.

                  Error handling is important. You can do the maintenance, or unexpectedly need to rebuild the bridge every four years or so.
                  A failing cable or a failing disk is something that software can detect and manage.
                  Defective RAM is a completely different situation: with failing RAM your software cannot run as intended and cannot give any guarantee of correctness.

                  Comment


                  • #29
                    Originally posted by cynic View Post

                    A failing cable or a failing disk is something that software can detect and manage.
                    Defective RAM is a completely different situation: with failing RAM your software cannot run as intended and cannot give any guarantee of correctness.
                    I fear we are talking past each other.

                    The problem is the assumption that RAM is perfect, Flash memory is not perfect, and neither are HDDs. Flash uses some pretty sophisticated algorithms to massage unreliable storage into something with a well-determined error-rate. HDDs do the same. As do CDs. However, the same is not done for RAM (except in some specialised areas), which leads to the insanity of bit errors either not being detectable, or being detectable but not reported, or the availability of reports being ignored by the operating system. The reason is that adding the appropriate code would slow down RAM markedly. As an industry, especially in consumer devices, we use performance as an excuse to ignore data integrity.
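The "massage unreliable storage into a well-determined error rate" step is error-correcting codes. As a purely illustrative toy, here is the classic Hamming(7,4) code in Python, which corrects any single flipped bit in a 7-bit codeword; real flash controllers use far stronger codes (BCH, LDPC), so treat this as a sketch of the principle only.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit, 0 if none
    if pos:
        c[pos - 1] ^= 1         # repair the single-bit error
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[3] ^= 1                    # flip one bit in "storage"
print(hamming74_decode(word))   # recovers [1, 0, 1, 1]
```

The decoder's syndrome points directly at the flipped position, so the raw medium can have a known bit-error rate and still present reliable data upward, which is exactly what RAM without ECC fails to do.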

                    As for failing RAM, triplicate the hardware, run the programs in lock-step, and vote on the results. That will catch a lot, and you can guarantee correctness up to the chance of the same error occurring in two hardware instances simultaneously, leading to an incorrect vote result: non-zero, but low. You will need to do this even with formally proven software, because hardware exists in the real world and has an error rate independent of your logical proofs. Proving your software 'correct' (according to its specification) tells you nothing about the hardware it runs on. A stray cosmic ray or alpha particle can ruin your day.
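The triplicate-and-vote scheme (triple modular redundancy) can be sketched in a few lines of Python; here each "hardware instance" is just a repeated function call, with a fault injected into one replica for illustration.

```python
from collections import Counter

def vote(results):
    """Majority vote over replica outputs; fails only if no majority exists."""
    winner, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: replicas disagree irrecoverably")
    return winner

def run_replicated(fn, x, fault=None):
    """Run fn on three 'instances'; optionally corrupt one replica's output."""
    results = [fn(x) for _ in range(3)]
    if fault is not None:
        results[fault] ^= 0xFF  # simulate a bit error in one instance
    return vote(results)

# A single faulty replica is outvoted by the other two.
print(run_replicated(lambda v: v * 2, 21, fault=1))  # prints 42
```

As the post notes, this only degrades gracefully for single-instance faults; two replicas failing identically at the same moment would still produce a wrong (or rejected) vote.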

                    Expecting a single instance of anything to run perfectly is 'brave'.

                    Comment


                    • #30
                      Originally posted by Quackdoc View Post

                      It failed irrecoverably across multiple devices a total of I think it was 7 times within a span of 6-9 months for me, single drive devices
                      With all due respect, if we took everything that users write on social media, forums, etc. as valid, the world would have ended long ago.

                      Comment
