Linux 5.5 SSD RAID 0/1/5/6/10 Benchmarks Of Btrfs / EXT4 / F2FS / XFS


  • DrYak
    replied
    Originally posted by profoundWHALE View Post
    Here's the story all about how my life got flipped, turned upside down....
    I'm using this thing for network storage access k?
    (Side note: some protocols like SMB can be prone to data corruption while transferring files over the network (though you can force checksumming). Use rsync when backing stuff up to the server whenever possible (it checksums as part of the core protocol). Also, whenever possible, run some SHA sums.)
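    For what it's worth, a minimal sketch of that kind of backup flow (paths and hostname are hypothetical):

        # rsync verifies each transferred block via its delta-transfer checksums;
        # -c additionally compares whole-file checksums instead of size+mtime
        rsync -avc /local/photos/ backupserver:/srv/backup/photos/

        # spot-check a file on both ends with SHA sums
        sha256sum /local/photos/wedding.mp4
        ssh backupserver sha256sum /srv/backup/photos/wedding.mp4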

    Originally posted by profoundWHALE View Post
    I have it automatically scrubbing and defragging.
    Wait, wut?

    Automatically scrubbing: yes, it's part of normal BTRFS maintenance.

    But automatically defragging: Whaaaaa?? What are you referring to?

    A - you also have "btrfs defrag" running periodically in a systemd timer / cron job.
    This is nuts. This is not part of any "best practice" recommendations. It's not in the default settings of any automatic maintenance tool.
    You should NOT run it periodically; it makes no sense.

    It also doesn't do what you probably think: it has very little to do with defragmentation as on FAT-based (and NTFS?) partitions.
    Due to their craptastic mechanics based around an allocation table (which did make sense eons ago on 8-bit computers with only a couple of KiB of RAM. Why Microsoft decided to design exFAT around the same crap is an entirely different question), those will absolutely, systematically, completely fragment the layout of files all over the partition, leading to poor performance on mechanical HDDs due to constant seeking. Defragging finds consecutive space where a file can be written linearly instead of as a giant mess of clusters all over the place.

    Any modern filesystem, including modern-day extent-based filesystems on Linux, is a lot less prone to that kind of problem thanks to much better allocation mechanics. Defragging is normally not needed that much.

    On CoW systems, "fragmentation" has a completely different meaning. It has nothing to do with the physical layout (though the physical layout will tend to fragment in the old sense too, due to the copies that are part of CoW), but with the logical representation of a file. Remember that CoW (and log-structured) filesystems never modify a file in place. Instead they write a new version of the extent and then update the pointers. In the case of large files that receive multiple random in-place overwrites (virtual disk images, databases, torrents), the file ends up being a giant maze of twisty pointers, all alike. This can slightly impact the performance of the filesystem, and in embedded scenarios (Jolla's first smartphone, Raspberry Pi 1, etc.) it can be quite resource-intensive to traverse the maze to find the data you want.
    "btrfs defrag" is a process that reads a file and rewrites it as a new contiguous extent (or at least as a series of larger extents), thus de-maze-ifying it. But while doing that, it will - by definition - completely break any shared extents that were part of a snapshot. (Snapshots - by definition - save space by sharing extents and only storing pointers to the differences between snapshots.)
    It also has a couple of other use cases, like recompressing a file (read the old raw file, write a new compressed one with Zstd and level=18).

    You can, for example, run a btrfs defrag as part of the post-processing once a torrent has finished downloading (because, due to how torrents work, the file will by then be a huge maze of twisty pointers).
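    As a sketch, such a targeted, one-off defrag could look like this (paths hypothetical; check "man btrfs-filesystem" for the options your version supports):

        # de-maze-ify one freshly downloaded torrent file
        btrfs filesystem defragment -v /srv/torrents/done/big-image.iso

        # or rewrite a file with zstd compression while defragmenting it
        btrfs filesystem defragment -czstd /srv/archive/old-raw-dump.img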
    But putting defrag in cron will cause constant rewriting of data, and will completely fuck up your snapshots (on CoW systems, having 4 timepoint backups of a 16GB file will only take 16GB plus whatever differences exist between the timepoints. On a classic EXT4+rsync+hardlinks backup system, the 4 timepoints will eat 64GB, as you'll have 4 different 16GB files that only differ slightly. By running "defrag", you are writing entirely new copies of the file, thus turning the former situation into the latter and instantly negating any benefit that CoW snapshotting brought). The constant rewriting will also kill flash and make the allocator unhappy (more on this later).

    You should not run btrfs defrag from cron unless you have a very specific use case and you know exactly what you're doing.


    B - you are using the "autodefrag" mount option.

    This basically tries to reduce the amount of fragmentation in the case of heavy random writes: multiple adjacent writes are grouped together and coalesce into a single larger write. (Basically, it is like running "btrfs defrag", but only on the region of the file that saw a sudden burst of nearby writes close to each other.) It helps against making too many twisty mazes. Depending on your workload, it might help.
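    It's just a mount option, e.g. (device and mountpoint hypothetical):

        # in /etc/fstab
        UUID=xxxxxxxx-xxxx  /data  btrfs  defaults,autodefrag  0 0

        # or toggled on a live system
        mount -o remount,autodefrag /data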

    Still, for databases and virtual disk images, the recommendation is to mark the files as nocow and, for integrity and coherence, rely on whatever internal mechanism they have. (Databases usually have their own internal journaling mechanics to survive the power-cord-yanking class of problems; virtual disk images have whatever the filesystem inside the image uses. Basically, you'd be layering both btrfs' and the software's integrity mechanics in a redundant manner, which isn't always a brilliant idea.)
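    Nocow is set with the "C" file attribute, and it only takes effect on new (empty) files, so the usual trick is to set it on the directory before the files are created (path hypothetical):

        # files created in here will inherit nocow (no CoW, and note: no checksums)
        chattr +C /var/lib/mysql
        lsattr -d /var/lib/mysql   # should show the 'C' flag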


    C - you are confusing it with another type of maintenance work (normally provided by maintenance tools such as openSUSE's "btrfs-maintenance" and Jolla's "btrfs-balancer"): balancing.
    That is something that is good to perform every now and then, but it isn't as critical as scrubbing. This is due to the fact that btrfs, ZFS and bcachefs are all also their own volume managers (similar to LVM) in addition to being filesystems (and in the case of ZFS, it implements a completely different set of volume management functions instead of sharing the work done by lvm/mdadm/dm/etc., hence the stronger criticism that ZFS has received with regard to layering violations).
    In the case of btrfs, it allocates space in block groups. Whenever it needs to write new data or metadata, it takes free space from the drive and allocates a 1GiB data block group or a 256MiB metadata block group, and then writes inside that block group. Garbage collection of old, no-longer-referenced CoW copies will leave holes in the middle of older block groups and turn them into swiss cheese. BTRFS has a slight tendency to prefer appending at the end of a recent block group rather than spreading the write across multiple tiny holes in old block groups. (More recent versions of btrfs have better tuned their allocator to balance the pros and cons of this strategy.)

    Per se, it's not that much of a problem. In fact, for media that don't like in-place overwriting (like flash and shingled magnetic recording, which need to perform expensive read-modify-write cycles), avoiding filling the holes of the swiss cheese is actually a big advantage. BCacheFS has an even stronger tendency to mostly append blocks, and Kent touts it as a big advantage for flash and shingled drives (it avoids RMW cycles) and for RAID5/6/erasure coding (which might need to perform RMW cycles to update parity if only part of a stripe is updated).

    The problem is when you don't have that much space: you might have a bunch of "swiss cheese" data block groups, all filled at ~30%. Except now the system needs to write metadata, all the metadata block groups are full, and thus it needs to allocate a new metadata block group. But if you've run out of free space on the drive, you can't allocate a new metadata block group. You're out of space *despite* having 70% free space in *data*. You're getting ENOSPC errors.
    This problem used to be even more insidious, because all the nitty-gritty details of allocation are only shown by btrfs-specific tools ("btrfs filesystem df" and "btrfs filesystem usage"), while plain "df" simply showed "70% free" (correct for the space available inside data block groups, but not for the free space available on the drive). This caused panic and incomprehension among users: you had free space (df showed "70% free") yet got ENOSPC errors in the journal / dmesg / /var/log/messages!
    Surely the BTRFS must be corrupted! I need to fix it! Let's run FSCK! (User proceeds to completely trash a perfectly functional btrfs filesystem.)
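    The btrfs-specific view looks roughly like this (mountpoint hypothetical, output illustrative):

        # per-type view: 'total' is allocated block groups, 'used' is real data
        btrfs filesystem df /mnt
        #   Data, single: total=400.00GiB, used=120.00GiB   <- swiss cheese
        #   Metadata, single: total=4.00GiB, used=3.98GiB   <- nearly full

        # overall view, including unallocated raw device space
        btrfs filesystem usage /mnt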

    Balancing as part of maintenance is a way to mitigate this problem: among the filters you can give to balance, the "dusage" and "musage" filters tell it to pick old "swiss cheese" block groups whose usage is below a given threshold. It takes the data of multiple such block groups, compacts it, and writes it into a single newly allocated block group.
    In the scenario above, a simple balance can gather all the 30%-full "swiss cheese" block groups, rewrite them as a small number of full block groups, and return the remaining 70% of free space to the allocator. No need to shoot btrfs in the head with some FSCK.
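    A minimal sketch of such a filtered balance (the thresholds are a common choice, not gospel):

        # compact data block groups that are <40% used and
        # metadata block groups that are <30% used
        btrfs balance start -dusage=40 -musage=30 /mnt

        # watch it from another terminal
        btrfs balance status /mnt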

    Nowadays the situation has become much better.

    On one hand, the allocator of btrfs has become much better and can avoid painting itself into a corner allocation-wise. It can sort of balance on its own and avoid leaving too much swiss cheese around.
    On the other hand, the single number returned by df better reflects the current allocation situation. It will correctly display "0" in the above scenario, alerting the user that (free) space is running low.

    But some reasonable amount of balancing is still worth doing (collecting and compacting swiss-cheese block groups with <40% occupancy on a weekly or monthly basis is reasonable; see the sketch below). Just remember to balance only *after* coherency has been successfully attested with "scrub".
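    The cron flavour of that could be as simple as (illustrative, not taken from any packaged tool):

        # /etc/cron.weekly/btrfs-maintenance
        #!/bin/sh
        # scrub first (-B = stay in foreground, non-zero exit on errors),
        # and only balance if the scrub came back clean
        btrfs scrub start -B /mnt && btrfs balance start -dusage=40 -musage=30 /mnt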

    Using well-made tools (like openSUSE's btrfs-maintenance) is a good idea.




  • DrYak
    replied
    Originally posted by profoundWHALE View Post
    If btrfs is as unstable as you guys say "lololol you shouldn't pick something unstable" then why should anyone trust their data with it in the first place?
    In 2013 - which is the date of the link you're giving in answer to xinorom's "Pretty sure no one you should have been listening to was calling it stable in 2013." - THE WHOLE OF BTRFS WAS NOT CONSIDERED STABLE AND PRODUCTION-READY by anyone sane of mind.

    You could use it in a testing context, or it could be semi-usable if you had an extremely good backup policy, which is what I was doing back then.

    Originally posted by profoundWHALE View Post
    And if it is "stable" then it shouldn't just eat all my data like it did.
    NOWADAYS, BTRFS is considered stable by its authors as long as you stick to the set of features that are considered stable (spoiler alert: RAID5/6 still isn't considered stable; nearly all other features are, *some* of them for quite some time).

    Originally posted by profoundWHALE View Post
    Feature additions are very VERY different from something like the fsck, something which should be working before it is ever mainlined in the kernel.
    TL;DR: The actions you think about when thinking of "fsck" are handled differently in BTRFS. They're either handled by the filesystem itself, or by the "btrfs" tool.

    With regard to fsck: for the last time, it's a CoW filesystem. CoW and log-structured filesystems work in a completely different way. FSCK makes no real sense on a CoW filesystem; it was just a small tool added by openSUSE developers to cover a few cases (plenty of filesystems don't have an fsck module: the exFAT garbage doesn't have one, for obvious reasons iso9660 doesn't have one, F2FS and UDF don't have one given that they are log-structured, etc.).

    Again: fsck is to be used when your data has been left in a partially incoherent state. Due to how they work, CoW and log-structured filesystems CANNOT, BY DEFINITION, be left in an incoherent state, because they cannot modify existing data in place. There is always *a* coherent state, which in the worst case is just the previous state (which is always accessible by design in CoW and log-structured FS, because they never overwrite in place).

    If you have a classical in-place-modifying filesystem, like EXT4, XFS, etc., or even garbage like FAT32, and you suddenly yank the power cable out of the wall, you're left with a hard disk that contains data in an indeterminate, half-written state.

    Tools like fsck and journals (and the "t" of "tFAT") are supposed to help discern something in this halfway-through state and at least recover some of the data, so that the system can at least be put back into a coherent state.

    CoW filesystems - like BTRFS, ZFS, and BCacheFS once it is definitely considered stable - and log-structured filesystems (like F2FS, UDF and a few others) never touch old copies of data; they always write NEW blocks of data (e.g., log-structured FS write a new log entry and eventually garbage-collect old entries down the line; CoW FS write a new, modified copy of the data and then update the pointers).
    If you suddenly yank(*) the power cord mid-use, at worst you're left with a coherent filesystem (basically everything up to the write you interrupted) plus some extra garbage at a new position that can be safely ignored. There is no point in FSCK. The functionality traditionally covered by FSCK upon a reboot (making the thing usable and coherent again) is now done by the mounting mechanism (find the last coherent state present and ignore any subsequent garbage).

    The old-style "mount -o recovery" (nowadays built in, though you can further nudge it with "mount -o usebackuproot" if needed) does exactly the same job as "run fsck in case of power loss" (and in the precise situation of filesystems that have a journal, like EXT4, ReiserFS, etc., fsck does the exact same thing that attempting to mount without fsck does on those systems: it first tries to replay the journal).
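    Concretely, the whole "fsck moment" on btrfs after a power loss is just (device hypothetical):

        # a normal mount already picks the last coherent tree
        mount /dev/sdb1 /mnt

        # if the newest tree root itself got damaged, fall back to an older root
        mount -o usebackuproot /dev/sdb1 /mnt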

    ----

    Next to that, there is an entirely different class of problems: bitrot, data corruption, etc. It's the proverbial cosmic ray flipping random old bits. Data that was otherwise good gets insta-corrupted in place on the disk.

    If it happens in a critical part of the filesystem (structure/metadata), it might accidentally render it incoherent. If it happens in the middle of data, your files become corrupted.

    Classical filesystems like EXT4, XFS, etc. can also attempt to use fsck to make something out of these situations (because it's not much different from the "half-written garbage on power-cord yank" situation fsck was designed to address). Fsck might manage to recover from corrupted structure/metadata; still, a lot of data loss is to be expected. You might need to run some "here be dragons" option of fsck, and you might find yourself in a situation where fsck trashes more than it recovers (remember the caveats against having a ReiserFS image on top of a ReiserFS partition if you ever need to rebuild the tree?).
    Even in cases where fsck is able to recover, it can be better to rebuild the filesystem (except in cases where fsck absolutely, perfectly guarantees that the newly recovered state is 100% clean).
    There is absolutely nothing that fsck can do for data that was destroyed *inside* files.

    ZFS, BTRFS and BCacheFS tackle that problem from a completely different angle: *everything* is checksummed. A periodic scrub will read all the data, compare it to the checksums, and detect any unexpected bit-flip.
    These filesystems also have diverse mechanisms for data redundancy (well, except RAID5/6/erasure coding on BTRFS and BCacheFS; on BTRFS it's still *not stable yet*, on BCacheFS work on it has barely started), such as multiple copies (e.g., even "dup" - on the same drive - for BTRFS).
    If any data corruption is detected, either during normal operation or during a scrub, these filesystems are supposed to be able to auto-repair it by leveraging the mirror drives (or even a "dup"(**) copy on the same drive, in the case of BTRFS).
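    The scrub itself is a one-liner (mountpoint hypothetical):

        # start a scrub in the background on a mounted filesystem
        btrfs scrub start /mnt

        # later: progress, plus counts of corrected and uncorrectable errors
        btrfs scrub status /mnt

        # affected file names also show up in dmesg / the journal
        dmesg | grep -i btrfs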

    In the best case, you should never even reach a situation where the equivalent of running FSCK is required.

    If btrfs isn't able to auto-repair:
    - for data (files): well, that's what backups are for. At least the scrub logs will point you to exactly which files got eaten by the bit-flipping daemons, and you can immediately take action. You don't have to wait multiple years down the line and only notice the corruption once you try reading the file.
    - for metadata (structure): very often, at that point, the filesystem is still mountable. If that's not the case, "btrfs restore" is still able to extract as many files as possible out of the drive (sketch below). You should back up any file for which you don't have a backup yet, and you should rebuild the filesystem. It's in a corrupt state anyway.
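    A sketch of that last-resort extraction ("btrfs restore" reads the unmounted device; paths hypothetical):

        # pull whatever is still readable off the broken filesystem
        btrfs restore -v /dev/sdb1 /mnt/rescue/

        # then rebuild from scratch and restore from backups
        mkfs.btrfs -f /dev/sdb1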

    That's about the only small caveat with BTRFS: if the checksum on a sub-tree fails, there is no simple way to just let go of the metadata that is now unusable. A "backup-then-rebuild" can be required a bit more often.

    ----

    (*): with most sane hardware. Very bad hardware with horrendously bad caps (SSDs) or flywheel mechanics (HDDs) could, if interrupted in the middle of some read-modify-write cycle (flash, shingled), lose what's currently in flight in internal RAM and accidentally destroy and corrupt data that was old and stable from the filesystem's point of view. If that's the case, then you have a problem. A *hardware* problem: your mass storage is crap and can't be trusted.
    I *have* actually had some flash die in such a way (due to a short in a smartphone). BTRFS was still able to recover to a state from which data could be backed up and the system rebuilt from backups.

    (**): sadly, the way the flash translation layer works defeats the purpose of "dup": by grouping writes together, both copies of DUP have a high chance of ending up in the same erase block in flash, and they will get killed together. You'd need multiple physical drives and "RAID1" instead.
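    For reference, the difference is chosen at mkfs time (devices hypothetical):

        # dup: two copies of metadata on the SAME device
        # (defeated by the flash translation layer, as described above)
        mkfs.btrfs -m dup -d single /dev/sdb

        # raid1: copies on DIFFERENT devices, surviving a lost erase block or drive
        mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc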



  • xinorom
    replied
    Originally posted by profoundWHALE View Post
    Haha that's a new one. I think I'll leave you and your mental gymnastics alone now
    Ok buddy. If you need any data recovery services in the next few weeks, feel free to give me a call.



  • profoundWHALE
    replied
    Haha that's a new one. I think I'll leave you and your mental gymnastics alone now



  • xinorom
    replied
    Originally posted by profoundWHALE View Post
    You're an idiot, or a troll, or perhaps both.
    2016 was the year I used it for real. You can't even keep the simplest of things straight and can't stop yourself from projecting. I don't care about your ADHD.
    You probably think you're trolling me, but you're actually trolling yourself...



  • profoundWHALE
    replied
    You're an idiot, or a troll, or perhaps both.

    2016 was the year I used it for real. You can't even keep the simplest of things straight and can't stop yourself from projecting. I don't care about your ADHD.



  • xinorom
    replied
    Originally posted by profoundWHALE View Post
    If btrfs is as unstable as you guys say "lololol you shouldn't pick something unstable" then why should anyone trust their data with it in the first place?

    And if it is "stable" then it shouldn't just eat all my data like it did.
    It wasn't stable in 2013. No one claimed it was back then. If you had half a brain you'd realize that a commit in December 2013 marking the disk format as "no longer unstable" ought to indicate that the filesystem as a whole is still pretty unstable.

    In 2020, Btrfs is now relatively stable. Shock horror, 7 years makes a difference.

    Originally posted by profoundWHALE View Post
    My criticism is that it is not stable
    Your criticism is bunk, coming from a clueless brainlet with severe ADHD. It's hard to take anything you say seriously after you've so thoroughly demonstrated how idiotic you are...



  • profoundWHALE
    replied
    Originally posted by DrYak View Post

    Finalizing the disk format just means that they pinky-swear to stop breaking the on-disk format at every release, i.e. a btrfs partition formatted with alpha 0.16.1 shouldn't necessarily break because your kernel uses a module from alpha 0.16.2.

    From that point onward, they only concentrate on fixing bugs, and working on features that do not break the on-disk format of the data.

    e.g.: once extref, a new compression method (zstd) or a new checksum (xxhash) is introduced, the driver should still be able to mount old partitions, and old drivers should detect the unsupported feature and gracefully refuse to mount, or to read/modify the files, instead of utterly crashing and/or corrupting the partition due to the feature mismatch.

    It absolutely does not signify that a filesystem is stable. In fact it's quite the contrary: it's only at this point that the bug hunting can *even start*.
    You've completely missed my point, and I can't tell if you're agreeing with me or not.

    If btrfs is as unstable as you guys say "lololol you shouldn't pick something unstable" then why should anyone trust their data with it in the first place?

    And if it is "stable" then it shouldn't just eat all my data like it did.

    Feature additions are very VERY different from something like the fsck, something which should be working before it is ever mainlined in the kernel.

    My criticism is that it is not stable and therefore shouldn't be trusted. For a system trying to replace ZFS, that's pretty bad.



  • profoundWHALE
    replied
    Originally posted by DrYak View Post
    Having backup is always a good idea no matter what


    If your scrubs are taking multiple days, then there's something wrong.
    e.g. some background task that takes way too much I/O,
    or e.g. smartctl kicking off full long self-tests (which also kill I/O on rotational media due to seeking),
    or you're stacking above a lower layer that also has its own pitfalls (stacking above an mdadm RAID5/6, which brings in a lot of read-modify-write cycles, or using shingled drives in a way that manages to increase the R-M-W cycles despite btrfs being CoW).

    Normally a scrub should take a couple of hours max, and it is something that needs to be performed on a regular basis to guarantee data safety.
    (I tend to run it weekly; monthly is about the minimum recommendation.)

    If you have I/O problems, you might consider a (stable and mature) SSD caching layer between BTRFS and the drives.



    If you get corruption all over the place:
    - you've been mistaken and have actually run one of the features not considered stable (like RAID5/6 instead of RAID0/1, or extref or skinny metadata on a too-old kernel).
    - you've got some massive hardware problem, the difference being that BTRFS' checksumming actually notices it. It needs to be very massive if the RAID1 duplication is insufficient for recovering the data.

    In my very long experience with BTRFS, I've never seen a filesystem corrupt itself "just because BTRFS". It was always either me playing with experimental options, or the medium breaking.



    *BTRFS SCRUB* is the standard check that you need to run periodically on BTRFS.
    Here's the story all about how my life got flipped, turned upside down....

    I'm using this thing for network storage access k? I have it automatically scrubbing and defragging. Then one day I try to open a file (such as a video) and notice that it's missing some frames and some audio. I didn't think too much of it.

    As I continued using the system, more and more files were showing the same problems, some even saying that the file doesn't actually exist.

    So I manually ran a scrub. When I say it takes a whole day, I mean I'd start it in the morning, and by the time I got back from work it should have been done, but it always failed at about 70%.

    I saw that there were some more things I could try with scrub, so instead of running it on all the drives and then coming back to try another one, I tried the several different commands on each drive. I found out when I got back home that they had halted due to errors - you know, the errors that scrub is supposed to fix.

    Eventually I managed to copy the files from one drive to an empty drive, and whatever was totally corrupted was skipped automatically.

    But then I found out that, even where the copy succeeded, many files were missing chunks.

    I had to use the list of corrupted files to know what it was that I needed to recover from backup. I checked around, and I still had the original SD card for things like wedding videos.

    So, like I said. For me, the person, I will never be able to trust btrfs.

    And don't give me that "well it probably was set up raid5" crap. Anyone with half a brain would test something like that (making sure that it's RAID10 and working) before using it to replace their whole data storage solution.

    -------

    I've never had any issues with corruption - yet - on bcachefs. The problems I'm referring to are stuff like a piece of the software not working quite right: a certain feature might not be functional yet, or a git update fails to build. The point was: when there's a problem with bcachefs, Kent is like "oh, I need to fix that".

    When there's a problem with btrfs, it's just sort of a "quirk" which you should know about, or else you'll lose your data or something fun like that.



  • DrYak
    replied
    Originally posted by profoundWHALE View Post
    Well now I know that you don't know what you're talking about. Maybe you should go troll somewhere else
    https://git.kernel.org/pub/scm/linux...cb5c58097b918e
    Finalizing the disk format just means that they pinky-swear to stop breaking the on-disk format at every release, i.e. a btrfs partition formatted with alpha 0.16.1 shouldn't necessarily break because your kernel uses a module from alpha 0.16.2.

    From that point onward, they only concentrate on fixing bugs, and working on features that do not break the on-disk format of the data.

    e.g.: once extref, a new compression method (zstd) or a new checksum (xxhash) is introduced, the driver should still be able to mount old partitions, and old drivers should detect the unsupported feature and gracefully refuse to mount, or to read/modify the files, instead of utterly crashing and/or corrupting the partition due to the feature mismatch.

    It absolutely does not signify that a filesystem is stable. In fact it's quite the contrary: it's only at this point that the bug hunting can *even start*.
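    You can actually look at those on-disk feature flags that a kernel checks before agreeing to mount (device hypothetical):

        # dump the superblock; incompat_flags lists the features an old
        # driver must refuse to mount if it doesn't know them
        btrfs inspect-internal dump-super /dev/sdb1 | grep -i flags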

