Linux 5.5 SSD RAID 0/1/5/6/10 Benchmarks Of Btrfs / EXT4 / F2FS / XFS

  • zyxxel
    replied
    Originally posted by DrYak View Post
    Normally scrub should take a couple of hours max, and is something that needs to be performed on a regular basis to guarantee data safety.
    (I tend to run it weekly; monthly is about the minimum recommendation.)
    This depends on the amount of data and how fast the machine can read it.

    With 12-14 TB of data on a drive and a machine that manages about 150 MB/second, you get a total runtime of around 24 hours. With the larger drives, it's really important to make sure that the drives are connected to controllers that can handle the full transfer speed the disk supports. And even when using SATA-600, most drives are limited to 200-250 MB/second.

    So in the end, for really large drives it's often meaningful to split the scrub into multiple runs using cancel/resume instead of doing one huge scrub every x days.
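
As a rough check of the runtime estimate above, here is a minimal Python sketch. The function name `scrub_hours` is made up for illustration, and it assumes decimal units (1 TB = 1,000,000 MB) and a constant sequential read rate, which a real scrub will not sustain exactly.

```python
def scrub_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Estimate the hours needed to read data_tb terabytes at throughput_mb_s MB/s."""
    data_mb = data_tb * 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB
    seconds = data_mb / throughput_mb_s
    return seconds / 3600


# The two ends of the 12-14 TB range mentioned in the post, at 150 MB/s.
for tb in (12, 14):
    print(f"{tb} TB at 150 MB/s: {scrub_hours(tb, 150):.1f} h")
```

At 150 MB/s this gives roughly 22 hours for 12 TB and 26 hours for 14 TB, in line with the "around 24 hours" figure.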



  • zyxxel
    replied
    Originally posted by profoundWHALE View Post

    Hey genius, what good is a log of a scrub that fails if the tools for fixing the failure are failing?

    I get the saying of "a bad carpenter blames his toolbox" but seriously, what the heck are logs going to do?
    The logs tell you what is broken.

    This is the required input for making good decisions about what the next step should be.

    Without access to the logs, everything will just end up as black magic.

    Originally posted by profoundWHALE View Post
    I'm using the same drives, right now. I check the harddrives about twice a year for bad sectors.
    You shouldn't check your disks for bad sectors twice a year. Your system should check this continuously and mail you instantly when something starts to smell fishy.

    Your story reminds me of way too many people having RAID-5 and forgetting about it. So one year after the first drive fails, a second drive fails and everything is lost. All because of the assumption that all is well as long as data can be read out. And then the user is angry with their RAID-5. Or blames the drives for failing at the same time, even if one drive had been broken for a very long time.


    Originally posted by profoundWHALE View Post
    You can blame me for blaming btrfs all you want. What decided to just start corrupting my files? btrfs. Was regular scrubbing and defragging in place? Yes. I followed all the instructions that the btrfs developers had for how to do exactly what I was doing.
    It is your guess that it was BTRFS that started to corrupt your files. It's just an assumption without being backed by actual proof.

    Originally posted by profoundWHALE View Post
    So either every single person who did what I did ended up losing their data due to "bad practices" or btrfs has some serious bugs that result in massive corruption.
    I have hundreds of "terabyte-years" of data on BTRFS (multiple systems, configurations, drives, controllers, ...). For some strange reason, it works very well. So BTRFS can't be as bad as you claim. Somehow, you have managed to find a "fuzz factor" that makes your experience different. If that "fuzz factor" is you, your system or your personal actions/assumptions, then it isn't relevant as a general tip for others to stay away from BTRFS.



  • profoundWHALE
    replied
    Originally posted by DrYak View Post

    Yup, you have shown disregard for multiple best practices (any logs of scrub, SMART, etc.?), but go on, blame btrfs for your failings.
    It's okay to decide BTRFS isn't for you. But blaming it when you're not even checking the logs seems misplaced.
    Hey genius, what good is a log of a scrub that fails if the tools for fixing the failure are failing?

    I get the saying of "a bad carpenter blames his toolbox" but seriously, what the heck are logs going to do?

    Originally posted by DrYak View Post
    PLEASE DO NOT TRY PLAYING AROUND WITH ZONEFS. IT'S EVEN MORE ALIEN; YOU WILL COMPLETELY TRASH VERY VALUABLE STUFF WITH IT.
    If you're talking about ZFS, I've already used it, multiple times. I wanted something with native Linux support. Now ZFS has native Linux support, which is good, but I've already been content with bcachefs in the meantime.

    Originally posted by DrYak View Post
    When they start dying, they all start dying at more or less the same time.
    I'm using the same drives, right now. I check the harddrives about twice a year for bad sectors.

    Originally posted by DrYak View Post
    You should have tried that MUCH EARLIER. Also, "btrfs restore" can try to copy files with failed checksums.
    I tried that. It failed.

    What I had to do was force mount it and force-copy everything and tell it to ignore whatever transfer failed, but also to output whatever failed into a .txt file.

    The problem is that I'm dealing with terabytes of videos and if I lose 1% of a video, it's basically garbage now.

    You can blame me for blaming btrfs all you want. What decided to just start corrupting my files? btrfs. Was regular scrubbing and defragging in place? Yes. I followed all the instructions that the btrfs developers had for how to do exactly what I was doing.

    So either every single person who did what I did ended up losing their data due to "bad practices" or btrfs has some serious bugs that result in massive corruption.

    How come you "didn't think too much of it"? You have proof of data corruption. On a system that should be able to maintain integrity. What went through your mind?
    How about you check your head before trying to jump to conclusions. Settle down. I downloaded some videos a while back for testing some encoders like the daala video codec but not every video reached 100%, meaning some would just be missing chunks. That's why I didn't think too much of it.

    After I had videos that I knew were good go bad, then I knew something was wrong and well there's no need for me to repeat the issues and the steps taken to fix the issue.

    You've been jumping to conclusions about so many things and getting all worked up. Settle down. I don't know if btrfs is your God and I insulted him or something, but you gotta settle.
    Last edited by profoundWHALE; 02-04-2020, 01:57 PM.



  • Spam
    replied
    Originally posted by DrYak View Post

    A - you also have "btrfs defrag" running periodically in a systemd timer / cron job.
    This is nuts. This is not part of any "best practice" recommendations. It's not in the default settings of any automatic maintenance tools.
    You should NOT periodically run it, it makes no sense.
    Why so? Defragmenting helps in many workloads. It can also inflate data usage by un-sharing deduplicated data (snapshots). Still, it is a safe thing to use.

    The "btrfsmaintenance" scripts developed by one of the Btrfs developers do regular scrubs, trims, balances and defragments. https://github.com/kdave/btrfsmaintenance/



  • xinorom
    replied
    Originally posted by profoundWHALE View Post
    I'm also not looking for tech support. This was 4 years ago and I've stayed away from btrfs since. Also, if the software I'm using is so bad that I need tech support just to avoid data corruption, then it's bad software.
    tl;dr: you're a brainlet and these kinds of self-inflicted problems will happen to you again in the near future. I for one will laugh heartily when it happens. Please keep us updated.



  • profoundWHALE
    replied
    If you're getting confused by terms I'm using such as autodefrag, don't be. This was 4 years ago.

    You're asking a lot of questions as if I was running this for some company. I wasn't, it was for me, and I have my own job and my own life to attend to. Automatic scrubbing should be fine. Autodefrag was set because in my testing before actually using btrfs I noticed some issues regarding dropped performance and when I investigated it turned out that there were portions that just got super fragmented.

    I had it configured RAID10 software through btrfs' own tools since I had some issues with hardware raid. Besides, I know the filesystem works better when it is aware of what it is doing in regards to things like raid.

    Now, to be clear, I've actually been appreciating your responses. They're actually very informative, but I'll continue to ignore the troll.

    ------

    In regards to the corruption, I cannot remember all the details, but btrfs would refuse to mount the drives. It refused to recover from a good drive. It failed scrubbing. It failed (or had no change with) the non-fsck recovery options. As a last resort I tried fsck and it either said it did something (but nothing changed) or it failed as well.

    I am still using those same hard drives and I have no issues with them.


    This software is supposed to replace something like ZFS and I have never had this type of corruption without hardware failure.

    I'm also not looking for tech support. This was 4 years ago and I've stayed away from btrfs since. Also, if the software I'm using is so bad that I need tech support just to avoid data corruption, then it's bad software.



  • xinorom
    replied
    Originally posted by DrYak View Post
    Had you been paying attention to logs, you would have been noticing something fishy is happening.
    But you didn't, until the point the whole situation has become unbearable.
    ...
    don't blame btrfs for your own admin incompetency
    Self-accountability overload. Must. Find. Someone. Else. To. Blame.

    I can't wait until it happens again. I wish I could laugh right in his face at the exact moment it all fails. Hopefully he at least posts a "bcachefs is broken" thread on here.



  • DrYak
    replied
    Originally posted by profoundWHALE View Post
    Then one day, I try to open a file (such as a video) and notice that it's missing some frame and some audio. I didn't think too much of it.
    Okay, wait, what?

    How come you "didn't think too much of it"? You have proof of data corruption. On a system that should be able to maintain integrity. What went through your mind?
    Also, are you sure you've been running scrub periodically? Did you even pay attention to the result of the scrub? Did you have any mechanism in place for your server to alert you if something went wrong?

    There's no way that, under normal use, the first thing you notice is dropped frames.
    - if you've been running scrubs periodically, the scrub procedure should have returned warnings long before you serendipitously discovered the corruption.
    - the checksumming on btrfs is extent based. If an extent is corrupted, the normal behaviour of btrfs is to declare the whole extent corrupt. You shouldn't just have a dropped frame, you should have a whole chunk of your video refusing to load. (Note: you can still recover your file using "btrfs restore", but at that point the damage is done)
    - it's quite possible that at this point, btrfs is still doing its work: reading the damaged extent, noticing the checksum failure, trying to reconstruct data from the other side of the RAID1. But this whole recovery procedure is slow and has a hard time keeping up with the real-time demands of video. Data reaches the player with some delay, causing a few video frames to drop. The drop you experience isn't actually the corruption. It's the latency you experience in real-time video playback while behind the scenes btrfs is running in circles trying to recover some corrupted mess - recovery is successful but doesn't keep up with the real-time video requirement.

    Originally posted by profoundWHALE View Post
    Continued use of the system and more and more files were showing the same problems, some even saying that the file doesn't actually exist.
    This is a telltale sign of hardware corruption happening. The first occurrence of problems, the dropped frames, is basically a latency problem from automatic repairs happening in the background. The "file doesn't exist" is the way some high-level software (e.g. your smbd server) reacts to unrecoverable checksum errors (no sane copies of the extent are currently accessible).

    So have you been paying attention to the logs of the scrub?
    Have you been paying attention to the SMART messages of the drive?
    Do you even run smartd?

    Running smartd is important. Running "short tests" nightly at a time of low IO, and running "long tests" periodically (weekly or monthly, at a time of low use, and at a DIFFERENT time than the "btrfs scrub", otherwise IO will suffer and both procedures will take multiple days to finish due to competing for head seeks) is critical for any serious storage business. Using smartd to monitor a few critical SMART indicators is also a good idea (Backblaze and Google have publications detailing which measurements correlate best with impending doom. Spoiler alert: since the end of the IBM Deathstar era, it isn't temperature anymore).

    Once performance started going weird, didn't you think about having a look at journal/dmesg/var-messages to see if there was something obvious that needed addressing?

    You have to realise that:
    - if your job relies on data management, you have to pay much more attention to the details
    - especially if you insist on using cool new toys that work in surprising and unusual ways (BTRFS, ZFS and BCacheFS are rather revolutionary in the way they work. Expect the principle of least astonishment to completely fly out of the window, especially with regard to old habits picked up with EXT4/3/2).

    Thus you should at least RTFM of all the tools involved and have a good idea of the idiosyncrasies involved. I'm not only speaking about btrfs, but about any other relevant tool along the way: smartctl/smartd, mdadm (if you use that for RAID5/6), LVM, etc.

    If you do not have the time/patience to pick up the above, it's okay to fall back onto ready-made tool kits.

    - opensuse has written very good scripts to help with the maintenance of btrfs (in general, the experience of new emerging tech like btrfs, systemd, etc. tends to be much smoother in opensuse, because they have engineers putting in some effort to make the experience smooth); these tools are even available in Debian.

    - it's even okay to rely on an appliance that has abstracted much of this work under the hood into a high level simpler interface, where you can easily get synthetic information ("Your HDD is going to die soon" red box in the interface, instead of digging logs).

    Originally posted by profoundWHALE View Post
    So I manually run a scrub. When I say it takes a whole day I mean I start it in the morning and by the time I got back from work it should be done but it always failed at about 70%.
    I'll say what I think has probably happened.

    (You mention using multiple drives in a RAID1+0 configuration. I suspect you followed the age-old mantra of buying them in groups from the same batch (the shop probably even sold you drives with sequential serial numbers). While sysadmins have their reasons to do that - it's easier to manage them in pods - it's not necessarily a good tip for a home server. You see, drives from the same batch not only have very close performance to each other (useful for hardware RAID0), they also follow the same bathtub curve. When they start dying, they all start dying at more or less the same time. Again, this is useful for a datacenter sysadmin (if one drive fails, replace the whole pod, because the others will follow soon). But at home it means that you're going to see drive problems more or less together, making it more complicated to replace them one after the other).

    What happened is that, had you paid attention to the journal/dmesg of your server, you would have noticed that for the past month or so it had been filled with walls of DMA_crc, sector errors and other such messages.
    This is most likely caused by a hardware problem.
    It might have been a problem in the path between the server and the drives (actually very common on SBCs such as the Raspberry Pi. Most storage failures aren't failures of the actual mSATA, but power brownouts, bad SATA-to-USB3 bridge chips, etc.)
    It might have been the hard drives starting to die of old age, all more or less simultaneously due to the aforementioned batch effect.

    SMART, if you had been paying attention to it, would have been complaining of a slight increase in CRC errors. smartd would have notified you that the reserved/unallocated sectors were being depleted as they got allocated to make up for old dead sectors.
    SMART tests, first the exhaustive long test and more recently even the quick test, would have reported read aborts. With logs full of unrecoverable CRC read errors.
    In short, your hard drives are starting to fail.

    If your server is within earshot, you would even have noticed the typical clicking sounds of the hard drives.

    Meanwhile, above that, btrfs has been dutifully trying to do its stuff. Detecting corruption through checksums, then self-repairing corrupted data during scrubs by fetching alternative copies from RAID1 and rewriting them.

    This has vaguely kept the system afloat, except for the occasional dropped frame when the latency caused by all this became too much.

    Had you been paying attention to logs, you would have been noticing something fishy is happening.
    But you didn't, until the point the whole situation has become unbearable.

    Scrub taking ages is a telltale sign.
    It probably has to fight upstream problems: hard drives failing, multiple read attempts, clicking sounds. Reading data to verify its checksum is getting terribly slow and difficult.
    At multiple points along the scrub, recovery needs to happen.
    At some point the scrub just gives up, either because the problems cause too many retries and timeouts trying to get the data, or because both RAID1 copies are on failed sectors of the drives.

    This is the point where any sane admin would realise:
    - you've been missing something huge for quite a lot of time.
    - it's time to check that you have any critical stuff backed up somewhere (secondary backup fileserver, optical media, tape, whatever you use)
    - it wouldn't be bad to reconsider your general strategy:
    RTFM is a possibility.
    Switching to an appliance where somebody else has done the work for you, like a Synology, is another valid strategy.
    Dropping the high tech toys because you're unable to use them correctly IS a valid strategy. But don't blame btrfs for your own admin incompetency.

    Originally posted by profoundWHALE View Post
    I saw that there were some more things I could try with scrub by instead of trying to do it in the drives and then come back and try another one, I tried the several different commands on each drive. I found out when I get back home hat they halted due to errors, you know the errors that it's supposed to fix.
    At that point you're probably just bashing in random commands you've read on various stackexchange forums.
    Usually it pretty quickly devolves into creative use of the experimental options. Like zeroing the checksum tree. And shooting your filesystem in the head using FSCK.

    The description you're giving: "some more thing", "tried several different commands", "*they* halted due to errors, you know the errors that *it* is supposed to fix" clearly demonstrates that you don't have a clue what the stuff you're copy-pasting from forums into your command line actually does.

    In general the procedure is simple.
    You run scrub. If the scrub says your system is sane, you might proceed further.
    If the scrub fails, that means that things are already dire and the system is failing. Refreshing your backup at that point is a good idea. If the system doesn't mount or if files aren't accessible, try extracting them manually using "btrfs restore", but pay attention to its logs with regard to CRC.
    Now is a good time to investigate WHAT caused the data loss. A funny surprise might be coming (an impending mass death of your drives).

    If the scrub says your data is clean, then your data is clean for all intents and purposes. There might be further problems that can be fixed by some careful rebalancing with the corresponding filters, e.g. an ENOSPC error can be fixed by rebalancing with usage filters to purge and compact old "swiss cheese" block groups.

    But if the scrub fails, it means the current state of the drive is very dire.
    - it's not "we have a minor problem, but we're too stupid to have a functional fsck to fix that minor problem".
    - it's "the kind of problems that are normally fixable by fsck have been getting fixed in the background by the filesystem. Now I can't even manage that anymore. Problems are too major, I'm giving up".

    That's a general problem of new filesystems that are a little bit too magic (like BTRFS, ZFS).

    They can automatically handle most problems for you and kind of "hide them from you", until it's not possible anymore at which point the whole system is fucked up.

    Remember that demo where Sun smashed a RAID-Z2 ZFS filesystem with a hammer? And the whole thing just ran perfectly despite two drives being down?
    What would you expect to happen once a third drive got hammered?
    Yes, right.
    The whole thing critically failing to the point that it's not recoverable anymore.
    A RAID-Z2 system with three drives dead just can't function. Mathematically it doesn't contain enough information to go forward.
    It's just as dead as your BTRFS system.
    You just crossed the point at which the magic stops hiding the multiple problems and (both literal and metaphorical) hammers to the drives.
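
The counting argument behind the hammer example can be sketched in a few lines of Python. This is a deliberately simplified model, not how ZFS is implemented: the hypothetical `array_survives` helper only assumes that a group with N drives' worth of parity can rebuild data as long as at most N drives are missing (N = 2 for RAID-Z2).

```python
RAIDZ2_PARITY_DRIVES = 2  # RAID-Z2 keeps two drives' worth of parity per group


def array_survives(failed_drives: int, parity_drives: int = RAIDZ2_PARITY_DRIVES) -> bool:
    """Return True while the remaining drives still hold enough information."""
    return failed_drives <= parity_drives


# Walk through the hammer demo: zero, one, two, then three dead drives.
for failed in range(4):
    state = "still readable" if array_survives(failed) else "unrecoverable"
    print(f"{failed} drive(s) hammered: {state}")
```

The same counting applies to a two-copy RAID1 profile: once both copies of an extent sit on failed sectors, there is nothing left to repair from.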

    Originally posted by profoundWHALE View Post
    Eventually I managed to copy the files from one drive to an empty drive and whatever was totally corrupted was skipped automatically.
    You should have tried that MUCH EARLIER. Also, "btrfs restore" can try to copy files with failed checksums.

    Originally posted by profoundWHALE View Post
    But then I found out that even if it copied, many files were missing chunks from them.
    Those are the extents with the failed checksums. Alternatively, by using btrfs restore, you'd have chunks of noise in the middle.

    Originally posted by profoundWHALE View Post
    I had to use the list of corrupted files to know what it is that I needed to recover from backup, I checked around and I still had the original SD card for things like wedding videos.
    If it is critical, it should go on some ROM media like optical, or on a secondary backup server.
    "1 copy is no copy" mantra.

    Originally posted by profoundWHALE View Post
    So, like I said. For me, the person, I will never be able to trust btrfs.
    Yup, you have shown disregard for multiple best practices (any logs of scrub, SMART, etc.?), but go on, blame btrfs for your failings.
    It's okay to decide BTRFS isn't for you. But blaming it when you're not even checking the logs seems misplaced.

    Originally posted by profoundWHALE View Post
    I've never had any issues with corruption -yet- on bcachefs. The problems I'm referring to is stuff like a piece of the software isn't working quite right like a certain feature might not be functional yet or a girl update fails to build. The point was when there's a problem with bcachefs Kent is like oh I need to fix that.
    When there's a problem with btrfs it's just sort of a "quirk" which you should know about or else you'll lose your data or something fun like that.
    I haven't completely understood your girl update fails feature whatever.

    Still, BCacheFS shares lots of features with other modern FS (checksumming, redundancy, self-repair) - the only main difference is its specific tiered approach to storage. That's its shtick that it has inherited from BCache.

    That also means that it will be able to cope with quite some mess underneath.
    That also means that if problems show up despite its self-healing capabilities, by the time that point is reached the situation is dire.
    The only difference is that the tool to extract whatever you can from the mess might indeed be called "fsck" in BCacheFS land.

    With regard to the quirks: well, what did you expect - btrfs is one of the new gen "quasi magic self-healing FS", and it's one of the early ones. It is going to be weird.
    Space allocation especially was not very smart in early versions and very quickly came to bite you.
    Things have improved recently (e.g. you don't need to be aware of the whole block group allocation inside btrfs anymore. Though you're still dealing with a new-gen beast that is both a filesystem layer AND a volume manager layer at the same time. Getting informed about this kind of unusual new tech before using it should be expected).

    And after this whole long rant, I might give you one more piece of advice.
    You seem to be attracted to shiny new toys that get mentioned in tech news.
    You don't spend much time documenting yourself about the intricacies of these new toys.

    PLEASE DO NOT TRY PLAYING AROUND WITH ZONEFS. IT'S EVEN MORE ALIEN; YOU WILL COMPLETELY TRASH VERY VALUABLE STUFF WITH IT.



  • DrYak
    replied
    Originally posted by profoundWHALE View Post
    Here's the story all about how my life got flipped, turned upside down....
    I'm using this thing for network storage access k?
    (Side note: some protocols like SMB can be prone to data corruption while transferring files over the network (though you can force checksumming). Use rsync when backing stuff up to the server whenever possible (it checksums as part of the core protocol). Also, whenever possible, run some sha sums.)
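
One way to "run some sha sums" is sketched below in Python, using only the standard `hashlib` module. The helper names `sha256_of` and `same_content` are made up for illustration; the idea is simply to hash the source file and the copy on the server and compare the two digests.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(src: str, dst: str) -> bool:
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(src) == sha256_of(dst)
```

Reading in chunks keeps memory use flat even for multi-gigabyte video files. Calling `same_content` on the local file and on its copy as seen through the network mount tells you whether the transfer arrived intact.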

    Originally posted by profoundWHALE View Post
    I have it automatically scrubbing and defragging.
    Wait, wut?

    Automatically scrubbing: yes, it's part of normal BTRFS maintenance.

    But automatically defragging: Whaaaaa?? What are you referring to?

    A - you also have "btrfs defrag" running periodically in a systemd timer / cron job.
    This is nuts. This is not part of any "best practice" recommendations. It's not in the default settings of any automatic maintenance tools.
    You should NOT periodically run it, it makes no sense.

    It also doesn't do what you probably think: it has very little to do with defragmentation as on FAT-based (and NTFS?) partitions.
    Due to their craptastic mechanics based around an allocation table (because that did make sense eons ago on 8-bit computers with only a couple of KiB of RAM. Why Microsoft decided to design exFAT around the same crap is an entirely different question), they will absolutely systematically fragment the layout of files all over the partition, leading to poor performance on mechanical HDDs due to constant seeking. Defragging will find consecutive space where the file can be written linearly instead of as a giant mess of clusters all over the place.

    Any modern filesystem, including modern-day extent-based filesystems on Linux, is a lot less prone to that kind of problem thanks to much better allocation mechanics. Defragging is normally not that much needed.

    On CoW systems "fragmentation" has a completely different meaning. It has nothing to do with the physical layout (though the physical layout will tend to fragment in the old sense too, due to the copies part of CoW), but with the logical representation of a file. Remember that CoW (and log-structured) filesystems will never modify a file in place. Instead they will write a new version of the extent and then update the pointers. In the case of large files that have multiple random in-place overwrites (virtual disk images, databases, torrents), the file will end up being a giant maze of twisty pointers, all alike. This can slightly impact the performance of the file system, and in embedded scenarios (Jolla's first smartphone, Raspberry Pi 1, etc.) it can be quite resource intensive to traverse the maze to find the data that you want.
    "btrfs defrag" is a process that will read a file and rewrite it as a new continuous extent (or at least as a series of larger extents), thus de-maze-ifying it. But while doing that, it will - by definition - completely break any shared extents that were part of a snapshot. (Snapshots - by definition - save space by sharing copies and only using pointers to the differences between snapshots).
    It also has a couple of other use cases, like recompressing a file (read the old raw file, write a new compressed one with Zstd and level=18).

    You can for example run a btrfs defrag as part of the post-processing once a torrent has finished downloading (because, due to how torrents work, it will be by then a huge maze of twisty pointers).
    But putting defrag in cron will cause constant rewriting of data, and will completely fuck up your snapshots (on CoW systems, having 4 timepoint backups of a 16GB file will only take 16GB + whatever differences exist between the timepoints. On a classic EXT4+rsync+hardlinks backup system, the 4 timepoints will eat 64GB - as you'll have 4 different 16GB files that only differ slightly. By running "defrag", you are writing entirely new copies of the file, thus turning the former situation into the latter and instantly negating any benefit that the CoW snapshotting did bring). The constant rewriting will also kill flash and make the allocation unhappy (more on this later).

    You should not run btrfs defrag in cron unless you have a very specific use case and you know exactly what you're doing.
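
The 16GB versus 64GB comparison above can be written out as a small Python sketch. The function names are hypothetical, and the 0.5 GB of changes between timepoints is an assumed figure for illustration only; the post just says "whatever differences exist".

```python
def cow_snapshots_gb(file_gb: float, timepoints: int, changed_gb: float) -> float:
    """Space used when snapshots share unchanged extents (CoW)."""
    # The first timepoint stores the whole file; each later one only adds its changes.
    return file_gb + (timepoints - 1) * changed_gb


def full_copies_gb(file_gb: float, timepoints: int) -> float:
    """Space used when every timepoint is a complete, unshared copy of the file."""
    return file_gb * timepoints


FILE_GB, TIMEPOINTS = 16, 4
CHANGED_GB = 0.5  # assumption: amount rewritten between timepoints (not from the post)

print("CoW snapshots:", cow_snapshots_gb(FILE_GB, TIMEPOINTS, CHANGED_GB), "GB")
print("Full copies:  ", full_copies_gb(FILE_GB, TIMEPOINTS), "GB")
```

With these assumed numbers the shared layout needs 17.5 GB while unshared copies need 64 GB, which is the direction a periodic defrag pushes a snapshotted file.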


    B - you are using the "autodefrag" mount option.

    It basically tries to reduce the amount of fragmentation in case of heavy random writes: multiple adjacent writes will be grouped together and will coalesce into a single larger write. (Basically that is like running "btrfs defrag", but only on the region of the file that saw a sudden burst of nearby writes all close to each other). It helps against making too many twisty mazes. Depending on your workload, it might help.

    Still, for databases and virtual disk images, the recommendation is to mark the files as nocow, and for integrity and coherence rely on whatever internal mechanism they have. (Databases usually have their own internal journaling mechanics to be able to survive power-cord-yanking-class problems. Virtual disk images have whatever the filesystem in the image uses. Basically you're layering both btrfs' and the software's integrity mechanics in a redundant manner, which isn't always a brilliant idea).


    C- you are confusing with another type of maintenance work (that is normally provided by maintenance tools such as opensuse's "btrfs-maintenance" and jolla's "btrfs-balancer"): balancing.
    That is something that is good to perform every now and then but isn't as critical as scrubing. This is due to the fact that btrfs, zfs and bcachefs are all also their own volume managers (similar to LVM) in addition of being file systems (and in the case of zfs, implement a completely different set of volume management functions instead of sharing part of the work done with lvm/mdadm/dm/etc. hence the stronger criticism that "zfs" has received with regards to layer violations).
    In the case of btrfs, it allocates space in block groups. Whenever it needs to write new data or metadata it takes free space from the drive and alocate a 1GiB data block group or a 256MiB metadata blockgroup. And then it writes the data inside the block group. Garbage collection of the old not used anymore copies of CoW will leave holes in the middle of older block groups and turn them into a swiss cheese. BTRFS has a slight tendency of prefering to append at the end of a recent block group, rather than spread the write across multiple tiny holes spread among old block groups. (More recent version of btrfs have better tuned their allocator to balance the pro and cons of this strategy).

    Per se, it's not that much of a problem. In fact, for media that don't like in-place overwriting (like flash and shingled magnetic, which need to perform expensive read-modify-write cycles), avoiding filling the holes of the swiss cheese is actually a big advantage. BCacheFS has an even stronger tendency to mostly append blocks, and Kent touts it as a big advantage for flash and shingled (avoids RMW cycles) and for RAID5/6/erasure coding (which might need to perform RMW cycles to update parity if only part of a stripe is updated).

    The problem arises when you don't have that much space: you might have a bunch of "swiss cheese" data block groups, all filled at ~30%. Except now the system needs to write metadata, all the metadata block groups are full, and thus it needs to allocate a new metadata block group. But if the drive has run out of unallocated space, you can't allocate a new metadata block group. You're out of space *despite* only having ~30% space usage in *data*. You're getting ENOSPC errors.
    This problem used to be even more insidious because all the nitty-gritty details of allocation are only shown by the internal btrfs tools ("btrfs filesystem df" and "btrfs filesystem usage"), and "df" simply showed "70% free" (correct for the space available inside data block groups, but not for the unallocated space left on the drive). This caused panic and incomprehension among users: you had free space (df showed "70% free") yet got ENOSPC ("No space left on device") error messages in journal / dmesg / var-messages!
    Surely the BTRFS must be corrupted! I need to fix it! Let's run FSCK! (user proceeds to completely trash a perfectly functional btrfs filesystem)
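
    If the scenario sounds abstract, here is a toy Python model of it. Only the 1GiB / 256MiB block group sizes and the "~30% full" figure come from the description above; the `ToyDrive` class, its methods and the drive size are all invented for the illustration, and the real allocator is of course far more involved.

```python
import errno

GIB = 1024             # all sizes are in MiB, so 1 GiB = 1024 MiB
DATA_BG = 1 * GIB      # data block group size mentioned above
META_BG = 256          # metadata block group size mentioned above


class ToyDrive:
    """Toy model: the drive only hands out whole block groups."""

    def __init__(self, size_mib):
        self.size = size_mib
        self.data_groups = []   # MiB really used inside each data block group
        self.meta_groups = []   # MiB really used inside each metadata block group

    def unallocated(self):
        allocated = len(self.data_groups) * DATA_BG + len(self.meta_groups) * META_BG
        return self.size - allocated

    def allocate(self, kind):
        size = DATA_BG if kind == "data" else META_BG
        if self.unallocated() < size:
            raise OSError(errno.ENOSPC, "no unallocated space for a new block group")
        (self.data_groups if kind == "data" else self.meta_groups).append(0)

    def data_usage(self):
        return sum(self.data_groups) / (len(self.data_groups) * DATA_BG)


# Hypothetical small drive: room for ten data block groups and one metadata one.
drive = ToyDrive(10 * GIB + META_BG)
drive.allocate("meta")
drive.meta_groups[0] = META_BG                  # the only metadata block group is full
for _ in range(10):
    drive.allocate("data")
drive.data_groups = [int(0.3 * DATA_BG)] * 10   # swiss cheese: ~30% used each

try:
    drive.allocate("meta")                      # the filesystem needs more metadata room
except OSError as exc:
    print(f"data usage {drive.data_usage():.0%}, "
          f"unallocated {drive.unallocated()} MiB -> {exc.strerror}")
```

    The data block groups are mostly empty, yet the new metadata block group cannot be allocated because nothing on the drive is left unallocated.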

    Balancing as part of the maintenance is a way to mitigate this problem: among the filters you can give to balance, the "musage" and "dusage" filters ask it to find old "swiss cheese" block groups. It takes the data of multiple such block groups, compacts it and allocates a single new block group to write it into.
    In the scenario above, a simple balance can gather all the 30% full "swiss cheese" block groups, rewrite them as a small number of full block groups and return the remaining 70% as free space available for allocation. No need to shoot btrfs in the head with some FSCK.
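
    As a rough sketch of what such a filtered balance achieves, assuming perfect packing (the real balance relocates extents block group by block group and won't pack this neatly), something like the following. The `balance` function and its parameters are made up for the illustration; only the idea of a usage threshold comes from the "dusage"/"musage" filters.

```python
def balance(block_groups, group_size, usage_filter):
    """Repack block groups whose fill ratio is below usage_filter (0..1).

    block_groups holds the used amount of each block group.
    Returns (new_block_groups, number_of_block_groups_freed).
    """
    keep = [used for used in block_groups if used / group_size >= usage_filter]
    sparse = [used for used in block_groups if used / group_size < usage_filter]
    # Pour the content of all sparse groups into as few new groups as possible.
    full, rest = divmod(sum(sparse), group_size)
    repacked = [group_size] * full + ([rest] if rest else [])
    return keep + repacked, len(sparse) - len(repacked)


groups = [307] * 10    # ten 1024 MiB data block groups, ~30% used each
new_groups, freed = balance(groups, 1024, 0.40)
print(new_groups, freed)  # [1024, 1024, 1022] 7
```

    Seven of the ten block groups go back to the unallocated pool, which is exactly the room a new metadata block group needs.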

    Nowadays the situation has become much better.

    On one hand, the allocator of btrfs has become much better and can avoid painting itself into a corner allocation-wise. It can sort of balance on its own and avoid leaving too much swiss cheese around.
    On the other hand, the single number returned by df better reflects the current allocation situation. It will correctly display "0" in the above scenario, alerting the user that (free) space is running low.

    But some reasonable amount of balancing is still worthwhile (e.g. collecting and compacting swiss cheese block groups with <40% occupancy on a weekly or monthly basis). Just remember to balance only *after* coherency has been successfully attested with "scrub".
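
    A minimal sketch of that "scrub first, balance only afterwards" ordering, in Python. The `btrfs scrub start -B` and `btrfs balance start -dusage=N -musage=N` invocations are the regular btrfs-progs commands; the helper names, the `/mnt/data` mount point and the assumption that a non-zero exit status from the foreground scrub means "don't touch it" are mine, so check them against your own setup.

```python
import subprocess


def maintenance_commands(mountpoint, usage=40):
    """Return the argv lists for a scrub followed by a filtered balance."""
    scrub = ["btrfs", "scrub", "start", "-B", mountpoint]   # -B: stay in the foreground
    balance = ["btrfs", "balance", "start",
               f"-dusage={usage}", f"-musage={usage}", mountpoint]
    return [scrub, balance]


def run_maintenance(mountpoint, usage=40, runner=subprocess.run):
    """Run scrub first; only balance if the scrub exited cleanly."""
    scrub, balance = maintenance_commands(mountpoint, usage)
    if runner(scrub).returncode != 0:
        return False          # scrub reported a problem: do not balance
    return runner(balance).returncode == 0


# Only prints the commands; call run_maintenance() as root to really run them.
for argv in maintenance_commands("/mnt/data"):   # hypothetical mount point
    print(" ".join(argv))
```

    The `runner` parameter is only there so the ordering can be tried out with a fake runner instead of a real filesystem.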

    Using well-made tools (like opensuse's btrfs-maintenance) is a good idea.


    Leave a comment:


  • DrYak
    replied
    Originally posted by profoundWHALE View Post
    If btrfs is as unstable as you guys say "lololol you shouldn't pick something unstable" then why should anyone trust their data with it in the first place?
    In 2013 (the date of the link you're giving to answer xinorom's "Pretty sure no one you should have been listening to was calling it stable in 2013."), THE WHOLE OF BTRFS WAS NOT CONSIDERED STABLE AND PRODUCTION READY by anyone of sound mind.

    You could use it in a testing context, or it could be semi-usable if you had an extremely good backup policy, which is what I was doing back then.

    Originally posted by profoundWHALE View Post
    And if it is "stable" then it shouldn't just eat all my data like it did.
    NOWADAYS, BTRFS is considered stable by its authors as long as you stick to the set of features that are considered stable (Spoiler alert: RAID5/6 still isn't considered stable - nearly all other features are considered stable, *SOME* of them for quite some time).

    Originally posted by profoundWHALE View Post
    Feature additions are very VERY different from something like the fsck, something which should be working before it is ever mainlined in the kernel.
    TL;DR: The actions that you think about when thinking about "fsck" are handled differently in BTRFS. It's either handled by the filesystem itself, or it's handled by the "btrfs" tool.

    With regards to fsck: for the last time, it's a CoW filesystem. CoW and log-structured filesystems work in a completely different way. FSCK makes no real sense in a CoW filesystem; it was just a small tool added by openSUSE developers to cover a few cases (plenty of filesystems don't have a FSCK module: the exFAT garbage doesn't have one. For obvious reasons iso9660 doesn't have one. F2FS and UDF don't have one given that they are log-structured, etc.)

    Again. Fsck is to be used when your data has been partially left in an incoherent state. Due to how they work, CoW and log-structured file systems CANNOT BY DEFINITION be left in an incoherent state because they cannot modify existing data in-place. There is always "a" coherent state, which in the worst case is just the previous state (which is always accessible by design in CoW and log-structured FS, because they never overwrite in place).

    If you have a classical in-place modifying file system, like EXT4, XFS, etc. or even garbage like FAT32, and you suddenly yank the power cable out of the wall, you're left with a hard disk that contains data in an indeterminate half-written state.

    Tools like fsck and journals (and the "t" of "tFAT") are supposed to help make sense of this "halfway-through" state and recover at least some of the data. At least the system can be put back into a coherent state.

    CoW filesystems - like BTRFS, ZFS and BCacheFS (once that one is definitely considered stable) - and log-structured file systems (like F2FS, UDF and a few others) never touch old copies of data; they always write NEW blocks of data (e.g.: log-structured filesystems write a new log entry, and eventually garbage-collect old log entries down the line. CoW filesystems write a new modified copy of the data and then subsequently update the pointers).
    If you suddenly yank(*) the power cord mid-use, at worst you're left with a coherent filesystem (basically everything until the write you interrupted) plus some extra garbage at a new position that can be safely ignored. There is no point in FSCK. The functionality traditionally covered by FSCK upon a reboot (make the thing usable and coherent again) is now done by the mounting mechanism (find the last coherent state present and ignore any subsequent garbage).
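
    The "write a new copy, then update the pointer" principle fits in a few lines of Python. This is a deliberately tiny toy (the `CowStore` class and the `crash_before_commit` flag are invented for the illustration, and real filesystems deal with trees, generations and superblocks), but it shows why an interrupted write leaves the previous state untouched:

```python
class CowStore:
    """Toy copy-on-write store: append-only blocks plus a single root pointer."""

    def __init__(self):
        self.blocks = []     # append-only "disk"
        self.root = None     # index of the block holding the current state

    def write(self, state, crash_before_commit=False):
        self.blocks.append(dict(state))    # step 1: write a NEW copy, never in place
        if crash_before_commit:
            return                         # cord yanked: root still points at the old state
        self.root = len(self.blocks) - 1   # step 2: flip the pointer (the commit)

    def mount(self):
        """Return the last committed state; anything written later is ignorable garbage."""
        return None if self.root is None else self.blocks[self.root]


store = CowStore()
store.write({"file": "v1"})
store.write({"file": "v2"}, crash_before_commit=True)   # interrupted write
print(store.mount())  # {'file': 'v1'}
```

    The half-finished "v2" block is still sitting on the toy disk, but nothing points at it, so mounting simply ignores it.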

    Old style "mount -o recovery" (nowadays built-in, but you can further nudge it by using "mount -o usebackuproot" if needed) is what does the exact same thing as "run fsck in case of power loss" (and in the precise situation of filesystems that have a journal, like EXT4, Reiser, etc., fsck does the exact same thing as attempting to mount without fsck does on those systems: it first tries to replay the log).

    ----

    Next to that, there is an entirely different class of problems: bitrot, data corruption, etc. It's the proverbial cosmic ray flipping random old bits. Data that was otherwise good gets insta-corrupted in place on the disk.

    If it happens in a critical part of the filesystem (structure/metadata), it might accidentally render it incoherent. If it happens in the middle of data, your files become corrupted.

    Classical filesystems like EXT4, XFS, etc. can also attempt to use fsck to make something out of these situations. (Because it's not much different from the "half-written garbage on power-cord yank" situation fsck was designed to address.) Fsck might manage to recover from corrupted structure/metadata. Still, a lot of data loss is to be expected. You might need to run some "here be dragons" option of fsck, and you might find yourself in a situation where fsck trashes more than it recovers (remember the caveats against having a ReiserFS image on top of a ReiserFS partition if you need to rebuild the tree?)
    Even in cases where FSCK is able to recover, it can be better to rebuild the filesystem (except in cases where FSCK absolutely perfectly guarantees that the new recovered state is 100% clean).
    There is absolutely nothing that fsck can do for data that was destroyed *in* files.

    ZFS, BTRFS, BCacheFS tackle that problem from a completely different angle: *everything* is checksummed. A periodic scrub will read all the data, compare it to the checksums and detect any unexpected bit-flip.
    These filesystems also have diverse mechanisms for data redundancy, such as multiple copies (e.g.: even "dup" - on the same drive - for BTRFS). (Well, except RAID5/6/erasure coding on BTRFS and BCacheFS. On BTRFS it's still *not stable yet*, on BCacheFS work on it has barely started.)
    If any data corruption is detected, either during normal operation or during a scrub, these filesystems are supposed to be able to auto-repair it by leveraging the mirror drives (or even a "dup"(**) copy on the same drive in case of BTRFS).
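
    Here is a small Python sketch of the "checksum everything, scrub, repair from the other copy" idea. It is a toy under loud assumptions: `zlib.crc32` is only a stand-in for whatever checksum the filesystem really uses, the "mirrors" are plain lists of blocks, and the `scrub` function is made up for the illustration.

```python
import zlib


def scrub(copies, checksums):
    """Verify every block of every copy; repair bad blocks from a good copy.

    copies: list of mirrors, each a list of bytearray blocks.
    checksums: the expected CRC for each block index.
    Returns (repaired, unrecoverable) lists of block indexes.
    """
    repaired, unrecoverable = [], []
    for i, expected in enumerate(checksums):
        good = next((c[i] for c in copies if zlib.crc32(c[i]) == expected), None)
        if good is None:
            unrecoverable.append(i)       # every copy is bad: time to grab the backups
            continue
        for c in copies:
            if zlib.crc32(c[i]) != expected:
                c[i] = bytearray(good)    # rewrite the bad copy from the good one
                repaired.append(i)
    return repaired, unrecoverable


blocks = [b"alpha", b"beta", b"gamma"]
checksums = [zlib.crc32(b) for b in blocks]
mirror_a = [bytearray(b) for b in blocks]
mirror_b = [bytearray(b) for b in blocks]
mirror_a[1][0] ^= 0x01                    # simulate a single bit flip on one copy
print(scrub([mirror_a, mirror_b], checksums))  # ([1], [])
```

    Block 1 gets detected and rewritten from the healthy copy. If both copies of a block were bad, its index would show up in the second list instead, which is the "that's what backups are for" case.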

    In the best case, you should not even reach a situation where the equivalent of running FSCK is required.

    If btrfs isn't able to auto-repair:
    - for data (files): well, that's what backups are for. At least scrub logs will point you to exactly which file got eaten by the bit-flipping daemons, and you can immediately take action. You don't need to wait multiple years down the line and only realize the corruption once you try reading it.
    - for metadata (structure): very often at that point the filesystem is still mountable. If that's not the case, "btrfs restore" is still able to extract as many files as possible out of the drive. You should back up any file for which you don't have a backup yet, and you should rebuild the file system. It's in a corrupt state anyway.

    That's about the only small caveat with BTRFS: if the checksum on a sub-tree fails, there is no simple way to just let go of the metadata that is now unusable. A "backup-then-rebuild" can be required a bit more often.

    ----

    (*): with most sane hardware. Very bad hardware with horrendously bad caps (SSD) or flywheel mechanics (HDD), if caught in the middle of some read-modify-write cycle (flash, shingled), could lose what's currently in flight in its internal RAM and accidentally destroy or corrupt data that was old and stable from the point of view of the filesystem. If that's the case, then you have a problem. A *hardware* problem: your mass storage is crap and can't be trusted.
    I *have* actually had some flash die in such a way (due to a short in a smartphone). BTRFS was still able to recover to a state from which data could be backed up and the system rebuilt from backups.

    (**): sadly, the way the flash translation layer works defeats the purpose of "dup": by grouping writes together, both copies of DUP have a high chance of ending up in the same erase group in flash and will get killed together. You'd need actual multiple physical drives and "RAID1" instead.

    Leave a comment:
