OpenZFS 2.2.1 Released Due To A Block Cloning Bug Causing Data Corruption


  • muncrief
    replied
    Originally posted by grahamperrin View Post

    Issue 15526 may be simplest to mitigate without a patch.

    Via https://old.reddit.com/r/DataHoarder...mpr/-/kahvhzj/ (2023-11-23):

    OpenZFS Issue 15526 Mitigation | MemeticHenry

    Drive-by readers of page two here, please note: 15526 is distinct, not to be confused with block cloning, although there is (naturally) discussion of 15526 in the context of other issues.

    Please be prepared for discussion of 15526, in particular, to shift to a separate area of Phoronix, when a related article appears. Thank you.

    In the meantime, for users of FreeBSD:

    FreeBSD sysctl vfs.zfs.dmu_offset_next_sync and openzfs/zfs issue #15526
    • includes a link to FreeBSD bug report 275308.
    According to the discussion at https://github.com/openzfs/zfs/pull/15571/files, the patch suggested by robn fixes the issue under all circumstances, as numerous people have tested it and are no longer able to reproduce the hole error.

    As stated in my previous post, I'm completely lost as far as using GitHub the way I used git; however, I was able to download the master branch with the AUR zfs-dkms-git package, manually copy and paste the new "dnode_is_dirty" function into "dnode.c", and then compile and install the resulting zfs-dkms-git, along with the required zfs-utils-git, successfully.

    As of last night around 8 PM the patched zfs-dkms has been running on my 11 TB Manjaro media server and 1.5 TB Arch desktop workstation without error. I understand that running such code without more thorough testing is dangerous, but I decided that running it with the original known error is even more so.

    I wish I could offer a more professional way of implementing the proposed solution, but I'm so frustrated with GitHub and its new, forced, über-complicated Microsoft operations that I just don't care anymore.

    So if someone could take what I've done and offer the correct way of applying the patch with GitHub, it might be more helpful to others. It would also teach me how in the heck to use GitHub, because, as I said before, I was simply unable to figure it out.
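
    For anyone wanting a cleaner route than hand-pasting the function, here is a rough sketch of one conventional approach: GitHub will serve any pull request as a plain patch file if you append ".patch" to its URL, and that patch can be applied to the AUR package sources before building. This is illustrative only; the PR number is #15571 as referenced above, the "zfs-dkms" and "src/zfs-*" directory names are assumptions about the usual AUR layout, and the patch may need adjusting if it targets a different branch than the packaged release.

        # Sketch only: fetch PR #15571 as a plain patch and apply it to the AUR build.
        cd zfs-dkms                # directory containing the AUR PKGBUILD
        curl -L -o dnode-fix.patch https://github.com/openzfs/zfs/pull/15571.patch
        makepkg -o                 # download and extract the sources without building
        (cd src/zfs-* && patch -p1 < ../../dnode-fix.patch)
        makepkg -e                 # rebuild from the already-extracted, now-patched sources
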
    Last edited by muncrief; 26 November 2023, 01:52 PM.



  • grahamperrin
    replied
    Originally posted by muncrief View Post
    … patch …
    Issue 15526 may be simplest to mitigate without a patch.

    Via https://old.reddit.com/r/DataHoarder...mpr/-/kahvhzj/ (2023-11-23):

    OpenZFS Issue 15526 Mitigation | MemeticHenry

    Drive-by readers of page two here, please note: 15526 is distinct, not to be confused with block cloning, although there is (naturally) discussion of 15526 in the context of other issues.

    Please be prepared for discussion of 15526, in particular, to shift to a separate area of Phoronix, when a related article appears. Thank you.

    In the meantime, for users of FreeBSD:

    FreeBSD sysctl vfs.zfs.dmu_offset_next_sync and openzfs/zfs issue #15526
    • includes a link to FreeBSD bug report 275308.
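
    For reference, a minimal sketch of how that FreeBSD tunable is typically toggled (illustrative only; vfs.zfs.dmu_offset_next_sync is the sysctl named above, the persistence step uses the standard /etc/sysctl.conf mechanism, and the linked reports describe the trade-offs of leaving it disabled):

        # Sketch: disable dmu_offset_next_sync at runtime until a patched OpenZFS is installed.
        sysctl vfs.zfs.dmu_offset_next_sync=0

        # Verify the current value:
        sysctl vfs.zfs.dmu_offset_next_sync

        # Persist the setting across reboots:
        echo 'vfs.zfs.dmu_offset_next_sync=0' >> /etc/sysctl.conf
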
    Last edited by grahamperrin; 26 November 2023, 10:23 AM. Reason: Correction to the short URL for the comment in Reddit.



  • muncrief
    replied
    EDIT:
    As expected, this didn't work. It compiled, but when I tried to install the DKMS module I received an "implicit declaration of function ‘dmu_buf_add_user_size’; did you mean ‘dmu_buf_set_user_ie’?" error, which of course means something else has changed, and I guess I'd have to figure out how to check out the entire GitHub tree and compile it. But since I can't even figure out how to check out a patch, I'm not going to attempt it. I wish everything would just work like git used to, because GitHub is beyond old-timers like me. I have no idea what Microsoft has done, but they sure want you to install a lot of MS stuff, and they appear to discourage working in your own local directory. So while I'm sure it's pilot error on my part, I see no reason for changing everything and making it so complex.

    ORIGINAL POST:
    Thanks for this information, grahamperrin. After reviewing the link I found another link within it to a patch that hopefully fixes the issue, but wow, after almost an hour I was unable to simply check out the patch. The patch itself is shown at https://github.com/openzfs/zfs/pull/...2bc10e9f91cf5# but, like I said, for the life of me I couldn't copy it or check it out. I've checked out countless patches with git before, but it seems that GitHub has some über-complex way of getting at it.

    I scoured the internet, installed the GitHub CLI, obtained an authorization token, created some kind of workspace on GitHub that urged me to install some kind of Microsoft stuff, and so on, but no matter what I did I simply couldn't check out the patch.

    However, I finally just found the dnode.c file that appeared to be patched and copied it, ran makepkg -o in my zfs-dkms directory, overwrote the existing dnode.c with the new patched one, and then compiled the package with makepkg -e. Everything seems to have worked and I have a new zfs-dkms package, but I'm reluctant to install it because I'm really not sure what I'm doing.

    Does anyone know if what I've done is correct? I just can't believe how complicated it is to do something as simple as checking out a patch from GitHub, so I assume I must be doing something wrong. I'm using Arch, by the way.
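
    On the "how do I just check out the patch" question: GitHub exposes every pull request as an ordinary git ref, so a plain git client is enough; no GitHub CLI, token, or web workspace is required. A rough sketch, assuming the fix in question is pull request #15571 mentioned elsewhere in this thread (the local branch name is arbitrary):

        # Sketch: fetch a pull request with nothing but git.
        git clone https://github.com/openzfs/zfs.git
        cd zfs
        git fetch origin pull/15571/head:pr-15571
        git checkout pr-15571
        # The working tree now contains the proposed dnode.c change, ready to diff or build.
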
    Last edited by muncrief; 25 November 2023, 07:36 PM.



  • grahamperrin
    replied
    Originally posted by BwackNinja View Post
    found zeroes in chunks of new files that I had just checksummed, but I hadn't been able to find anyone talking about it.
    Maybe this:

    PSA: it's not block cloning, it's a data corruption bug on reads in ZFS 2.1.4+, set zfs_dmu_offset_next_sync=0 : zfs
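
    For Linux users, the mitigation described in that PSA maps to the OpenZFS kernel module parameter of the same name. A rough sketch, assuming the standard module parameter interface (the modprobe.d file name is arbitrary):

        # Sketch: set the parameter at runtime (as root).
        echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

        # Confirm the value:
        cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

        # Persist across reboots:
        echo 'options zfs zfs_dmu_offset_next_sync=0' > /etc/modprobe.d/zfs-workaround.conf
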



  • BwackNinja
    replied
    Originally posted by Developer12 View Post

    imo, the harshness for BTRFS is entirely justified.

    It's being used as the default in multiple distros simply because it was (until recently) the only in-tree CoW filesystem, yet every single week there are fresh tales of people losing data to it. It's gone on so long that people losing data to BTRFS has become constant background noise that everyone has gotten used to. Descriptions of people's experiences with BTRFS are always tempered with "I've never personally lost data", with the implied caveat being that they've heard of people who have.

    This might be the first data-eating ZFS bug to reach production in over a decade and it was found and fixed immediately after release. It's also very notable that this development cycle involved heavy modification of the block cloning subsystem (likely to support current and upcoming reflink and fast-dedup work) where this bug was found, meaning that this bug was likely born in the last few weeks. How many years have most bugs lingered in BTRFS, eating people's data, before finally being resolved?

    The difference in reliability between the two filesystems is nothing short of staggering. If people want to stop BTRFS from getting criticised, they should stop using it. It's been rancid since its inception.
    If you read the comments on the bug report and the pull request that looks like it fixes it, you'd see that the developers think the issue has been in the ZFS codebase since at least 2013. The issue has only been exacerbated recently because an update to coreutils changed the default behavior of cp. I've personally been dealing with this since before updating to 2.2.0 in Ubuntu 23.10, without block cloning enabled. The fix in 2.2.1 reduces the incidence of data corruption but doesn't eliminate it. This doesn't reduce my trust in ZFS, because it was only corrupting new files with holes, not putting existing files at risk.



  • Developer12
    replied
    Originally posted by EphemeralEft View Post
    I feel like a lot of BTRFS people are being so harsh on ZFS right now because a lot of ZFS people constantly attack BTRFS over every little (sometimes insignificant) thing. It’s amazing how “all software has bugs” only when it’s ZFS being criticized.
    imo, the harshness for BTRFS is entirely justified.

    It's being used as the default in multiple distros simply because it was (until recently) the only in-tree CoW filesystem, yet every single week there are fresh tales of people losing data to it. It's gone on so long that people losing data to BTRFS has become constant background noise that everyone has gotten used to. Descriptions of people's experiences with BTRFS are always tempered with "I've never personally lost data", with the implied caveat being that they've heard of people who have.

    This might be the first data-eating ZFS bug to reach production in over a decade and it was found and fixed immediately after release. It's also very notable that this development cycle involved heavy modification of the block cloning subsystem (likely to support current and upcoming reflink and fast-dedup work) where this bug was found, meaning that this bug was likely born in the last few weeks. How many years have most bugs lingered in BTRFS, eating people's data, before finally being resolved?

    The difference in reliability between the two filesystems is nothing short of staggering. If people want to stop BTRFS from getting criticised, they should stop using it. It's been rancid since its inception.



  • BwackNinja
    replied
    Really glad to see this, and especially to see that it isn't a completely new issue. I've been running into this for months, and my investigation found zeroes in chunks of new files that I had just checksummed, but I hadn't been able to find anyone talking about it.
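
    For anyone wanting to check whether they are seeing the same symptom, the reproducers discussed around this bug boil down to writing a file, copying it many times in quick succession, and comparing checksums. A simplified, illustrative sketch (file names, sizes, and counts are arbitrary; the actual reproducers in the issue tracker are more elaborate):

        # Sketch: copy a freshly written file many times concurrently and flag any copy
        # whose checksum differs from the source (zeroed chunks show up as mismatches).
        dd if=/dev/urandom of=source.bin bs=1M count=64 status=none
        ref=$(sha256sum source.bin | awk '{print $1}')
        for i in $(seq 1 32); do
            cp source.bin "copy.$i" &
        done
        wait
        for i in $(seq 1 32); do
            [ "$(sha256sum "copy.$i" | awk '{print $1}')" = "$ref" ] || echo "mismatch: copy.$i"
        done
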



  • woddy
    replied
    Originally posted by JanW View Post

    Fallacy of composition: Just because you (one member of the population) never lost any data using Btrfs does not mean no one in that population has. You seem to conclude from your own experience with Btrfs that there are fewer data losses with Btrfs than with OpenZFS in a given population of users.


    You didn't understand anything of what I wrote. I didn't conclude anything; I use btrfs because I use some features that ext4 doesn't have. I never wrote that data loss will never happen with btrfs; that's just my experience.
    I was only contesting the urban legend about the unreliability of btrfs. By the way, the last time I had data corruption was with the super-stable ext4.
    The moral of it all is that it can happen with any fs, but stop saying that btrfs is unreliable.



  • cynic
    replied
    Originally posted by EphemeralEft View Post
    I feel like a lot of BTRFS people are being so harsh on ZFS right now because a lot of ZFS people constantly attack BTRFS over every little (sometimes insignificant) thing. It’s amazing how “all software has bugs” only when it’s ZFS being criticized.
    I'm not wasting my harshness on ZFS: I prefer btrfs, but at the end of the day both filesystems are good.
    I'm keeping it all for when we start hearing about "The CoW filesystem for Linux that won't eat your data".




  • avis
    replied
    Originally posted by evil_core View Post

    Not entirely true. XFS has much better performance and is much better tested (it's used in the enterprise).
    Ext4 has case-folding and encryption, but those features were riddled with bugs in the past.
    ext4: 2 billion users.
    XFS: < 200K users.

    And I lost ~200 GB of data to XFS, albeit around 2005, when it probably wasn't worth using.

