Bcachefs Hopes To Remove "EXPERIMENTAL" Flag In The Next Year


  • varikonniemi
    replied
    Originally posted by Old Grouch View Post

    I think so; I don't believe they have had many data corruption issues recently. But it still feels wild that it was already 10 months ago.

    On the pool-metadata side bcachefs is clearly superior; IIRC it keeps that data in every disk's superblock, so as long as one of them survives it can recover.
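
    If you want to check that on a pool of your own, bcachefs-tools can dump what each member's superblock knows about the whole filesystem. Rough sketch from memory; the device names are placeholders and the exact fields printed may differ:

    Code:
    # Dump the superblock of one member device; every member should carry
    # the same filesystem-wide information (UUID, member list, options).
    bcachefs show-super /dev/sdb

    # Compare against another member of the same filesystem.
    bcachefs show-super /dev/sdc

    # Metadata redundancy itself is set at format time, e.g.:
    #   bcachefs format --metadata_replicas=2 /dev/sdb /dev/sdc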



  • Old Grouch
    replied
    Originally posted by varikonniemi View Post

    The same facts can be represented in many different ways, with emphasis on different things, and make the argument sound completely different. I don't think ZFS has any advantage in the checksum department. But what it has had recently is a bug that silently corrupted data as it was being written, leading to data loss.

    Where bcachefs is in a class of its own is in healing the filesystem around the data that is there. I think I read somewhere that even if you nuke the whole filesystem and only leave the data, it can rebuild the filesystem with correct file paths; only the extended attributes are lost.
    I like links to evidence: is this the bug you refer to?

    Critical OpenZFS bug causing data corruption (2006? - 2023 [v2.2.1])

    The Register: Data-destroying defect found after OpenZFS 2.2.0 release

    I think this CVE refers:

    CVE-2023-49298

    It's worth reading the cause of the issue, linked in the documentation from the above references.

    The bug can cause data corruption due to an incorrect dirty dnode check.
    Unfortunately, it turns out the "is this thing dirty" check was incomplete, so sometimes it could decide a thing wasn't dirty, when it was, and not force a flush, and give you old information about the file's contents if you hit a very narrow window between flushing one thing and not another.

    If you actually tried reading it, that would be correct, but if you skipped reading parts of it at all because you thought they were empty, well, then your program has incorrect ideas about what the contents was, and might write those ideas, or modify the empty areas and write the result, out somewhere.
    ...sync=always doesn't have anything to do with it - that causes all writes to be written to a journal, so in the event of a crash before the current transaction is committed, the changes can be replayed. The transaction commit stage still happens as normal, and that's the place this bug lives. Again, it will change the timing, but it can't be predicted. Thus, it's not an effective workaround.
    If you look at it and squint a bit, you can describe it as 'incorrect cache handling'. It indicates that this stuff is hard to get right, and bugs can last for a long time before being discovered if they only appear in infrequently used aspects of the filesystem.
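
    For anyone who wants a feel for how it showed up in practice, the reproducers that circulated were roughly this shape - my reconstruction from memory, not the actual script from the bug report. Modern cp detects holes via SEEK_HOLE/SEEK_DATA, which is exactly the path the incomplete dirty check could mislead:

    Code:
    #!/bin/sh
    # Write files and copy them while they may still be dirty, then compare.
    # On an affected system a small fraction of the copies could come back
    # with runs of zeros where cp wrongly believed there was a hole.
    for i in $(seq 1 1000); do
        dd if=/dev/urandom of=src.$i bs=1M count=1 status=none
        cp src.$i dst.$i &
    done
    wait
    for i in $(seq 1 1000); do
        cmp -s src.$i dst.$i || echo "mismatch: src.$i vs dst.$i"
    done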

    What you don't want is an instance of corrupt metadata (for whatever reason: could be a cosmic ray) cascading and trashing large parts of, or all of, a filesystem. You want it to be robust against 'minor damage' - which is why people use fsck, after all, in the belief that useful amounts of data can be rescued. ZFS doesn't have an fsck program because the checksumming and duplicated and triplicated metadata should be good enough to recover in normal operation. Still, corrupted pool metadata can trash an entire pool. If that happened often, I'm sure people would notice. But the fact that a single power outage can apparently render a complete pool unrecoverable, except from backup, makes some people wary. Aren't journalling filesystems meant to make that scenario considerably less likely?

    Then again, perhaps the writer didn't have enough knowledge to use other recovery options:
    zfs pool metadata corrupt
    I used zdb -u -l to dump a list of uberblocks, set vfs.zfs.spa.load_verify_metadata and vfs.zfs.spa.load_verify_data to 0, and used a combination of -n, -N, -R /some/Mountpoint, -o readonly=on and -T with an older uberblock's txg to at least get to where the data is present, in read-only form. From there I was able to see, with zpool status -v, which files were corrupt, then decrypt the pool, and file-level copy the data out to an external HDD.
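
    Pieced together, that procedure looks roughly like this. FreeBSD-flavoured, since those are FreeBSD sysctls; the pool name, device and txg are placeholders and the exact flags are from memory, so treat it as a sketch rather than a recipe:

    Code:
    # List labels and uberblocks to find an older, intact txg to roll back to.
    zdb -u -l /dev/da0p3

    # Skip the expensive verification passes during import.
    sysctl vfs.zfs.spa.load_verify_metadata=0
    sysctl vfs.zfs.spa.load_verify_data=0

    # Import read-only, rewound to an older uberblock's txg, under an alternate root.
    zpool import -o readonly=on -N -R /mnt/recovery -T <txg> tank

    # See which files are flagged as corrupt, then copy everything else off.
    zpool status -v tank
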
    I like the idea of filesystems that checksum data and metadata. I also like filesystems where recovery options are available - corrupting a superblock/uberblock or other metadata (which can happen with a power outage or cosmic ray glitch, or bad memory, or cheap devices without power back-up re-ordering I/O invisibly to the OS...) shouldn't render the entire filesystem unrecoverable - it's not physically possible for all files to be corrupted in an instant (short of high-explosives or nuclear explosions), so telling me my only option is to restore the entire filesystem from backup when only a small amount of metadata has been trashed is not ideal. Recovering from fault conditions is part of every administrator's job. It really ought to be easier than it currently is. If bcachefs is a step further in that direction than zfs, then that is great.



  • varikonniemi
    replied
    Originally posted by billyswong View Post
    The same facts can be represented in many different ways, with emphasis on different things, and make the argument sound completely different. I don't think ZFS has any advantage in the checksum department. But what it has had recently is a bug that silently corrupted data as it was being written, leading to data loss.

    Where bcachefs is in a class of its own is in healing the filesystem around the data that is there. I think I read somewhere that even if you nuke the whole filesystem and only leave the data, it can rebuild the filesystem with correct file paths; only the extended attributes are lost.



  • lyamc
    replied
    Originally posted by bkdwt View Post
    Imagine running 100TB+ of storage on an experimental filesystem...
    While not 100TB, I've got a decent-sized one.

    Code:
    # fdisk -l | grep TiB
    Disk /dev/sda: 3.64 TiB, 4000753476096 bytes, 7813971633 sectors
    Disk /dev/sdb: 3.64 TiB, 4000753476096 bytes, 7813971633 sectors
    Disk /dev/sdc: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
    Disk /dev/sdd: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
    Disk /dev/sde: 5.46 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdf: 5.46 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdg: 5.46 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdh: 5.46 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdi: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
    Disk /dev/sdj: 10.91 TiB, 12000138625024 bytes, 23437770752 sectors
    Disk /dev/sdk: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
    Disk /dev/sdl: 10.91 TiB, 12000138625024 bytes, 23437770752 sectors



  • billyswong
    replied
    Originally posted by varikonniemi View Post

    What are you talking about? Data written on bcachefs is probably the safest data there is, as it can be recovered even if the filesystem is completely corrupted.
    I was probably misled by popular FUD in the past.

    Now some say it is all fake and untrue.



  • mobadboy
    replied
    Looking forward to root-on-bcachefs!

    FreeBSD support when?



  • varikonniemi
    replied
    Originally posted by Old Grouch View Post

    In principle, yes.

    In practice, you need to make sure there are not multi-level caches (both for reading and writing) that can operate incorrectly - and with modern block-based devices (SSD, other types of NVRAM, spinning rust with caches) you may not have programmatic access to bypass some of those caches. Not everyone has 'enterprise-class' hardware, so you end up having to run a filesystem on 'unreliable' hardware. This is a challenging environment.

    Note that in environments that require very high data integrity, things like memory buses, internal CPU data-paths, and CPU registers will all have ECC and/or duplicated or triplicated hardware with voting to reduce the expected error rate. The problem isn't solved solely by battery backup and RAID.

    Note further that reading data after it is written assumes that the data in memory that you are comparing with the written data was correct in the first place. Unless all the hardware in the path between handover to the filesystem API and the data being written to disk is protected (by ECC or other data-integrity mechanisms), there are places data can be corrupted that are not under the filesystem's control. Getting data integrity right is a hot mess.
    After the write, you run shasum on the source and the destination; matching checksums verify that what landed on disk is correct.
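
    Something along these lines, for example - sha256sum shown, and the page cache dropped first so the re-read actually comes from the device rather than RAM; the paths are placeholders:

    Code:
    # Flush dirty pages to disk, then drop the page cache so the
    # verification read hits the device instead of RAM.
    sync
    echo 3 > /proc/sys/vm/drop_caches

    # Checksums of the source and the freshly written copy should match.
    sha256sum /srv/incoming/backup.tar /mnt/bcachefs/backup.tar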

    There is no magic that can fix memory errors etc.; those are frankly out of scope in this discussion. What matters here is that if the data was written to disk, then bcachefs provides strong guarantees that it can be recovered: even if the filesystem metadata is corrupted at any level, the data is still available.



  • varikonniemi
    replied
    Originally posted by billyswong View Post

    There are still a lot of computers with no battery backup and no ECC RAM, and there will be a lot of new computers like that in the foreseeable future. If any general-purpose filesystem is to replace ext4 as the future default FS for Linux, it ought to ensure data is NOT more at risk in such a configuration.
    What are you talking about? Data written on bcachefs is probably the safest data there is, as it can be recovered even if the filesystem is completely corrupted.
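
    To be concrete about what I mean, the repair side looks roughly like this - a sketch from memory of bcachefs-tools and the mount options, with placeholder device names:

    Code:
    # Offline check and repair with bcachefs-tools.
    bcachefs fsck /dev/sdb /dev/sdc

    # Or let the kernel run fsck and repair what it can at mount time,
    # accepting a degraded array if a member is missing.
    mount -t bcachefs -o fsck,fix_errors,degraded /dev/sdb:/dev/sdc /mnt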



  • Old Grouch
    replied
    Originally posted by varikonniemi View Post

    You get that level of security by reading the data back after it is written. Then you know it is on disk in the correct form, and most probably safe no matter what happens to the filesystem. Another way to do it without the performance hit is to have enterprise-class hardware with battery backup, certified parts, etc.
    In principle, yes.

    In practice, you need to make sure there are not multi-level caches (both for reading and writing) that can operate incorrectly - and with modern block-based devices (SSD, other types of NVRAM, spinning rust with caches) you may not have programmatic access to bypass some of those caches. Not everyone has 'enterprise-class' hardware, so you end up having to run a filesystem on 'unreliable' hardware. This is a challenging environment.

    Note that in environments that require very high data integrity, things like memory buses, internal CPU data-paths, and CPU registers will all have ECC and/or duplicated or triplicated hardware with voting to reduce the expected error rate. The problem isn't solved solely by battery backup and RAID.

    Note further that reading data after it is written assumes that the data in memory that you are comparing with the written data was correct in the first place. Unless all the hardware in the path between handover to the filesystem API and the data being written to disk is protected (by ECC or other data-integrity mechanisms), there are places data can be corrupted that are not under the filesystem's control. Getting data integrity right is a hot mess.
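
    For what it's worth, the one layer you usually can poke at from the OS is the drive's own volatile write cache. SATA example below; it won't help with any reordering the firmware does behind your back, and the device path is a placeholder:

    Code:
    # Query whether the drive's volatile write cache is enabled.
    hdparm -W /dev/sda

    # Disable it (at a real performance cost) if you don't trust the
    # device to honour flushes properly.
    hdparm -W 0 /dev/sda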



  • billyswong
    replied
    Originally posted by varikonniemi View Post

    You get that level of security by reading the data back after it is written. Then you know it is on disk in the correct form, and most probably safe no matter what happens to the filesystem. Another way to do it without the performance hit is to have enterprise-class hardware with battery backup, certified parts, etc.
    There are still a lot of computers with no battery backup and no ECC RAM, and there will be a lot of new computers like that in the foreseeable future. If any general-purpose filesystem is to replace ext4 as the future default FS for Linux, it ought to ensure data is NOT more at risk in such a configuration.

