Automatically scrubbing: yes, it's part of normal BTRFS maintenance.
But automatically defragging: Whaaaaa?? What are you referring to?
A - you also have "btrfs defrag" running periodically from a systemd timer / cron job.
This is nuts. It is not part of any "best practice" recommendation, and it is not in the default settings of any automatic maintenance tool.
You should NOT run it periodically; it makes no sense.
It also doesn't do what you probably think: it has very little to do with defragmentation as you know it from FAT-based (and NTFS?) partitions.
Due to their craptastic mechanics built around an allocation table (which made sense eons ago on 8-bit computers with only a couple of KiB of RAM; why Microsoft decided to design exFAT around the same crap is an entirely different question), those filesystems will systematically fragment the layout of files all over the partition, leading to poor performance on mechanical HDDs due to constant seeking. Defragging finds consecutive free space where the file can be written linearly instead of as a giant mess of clusters scattered all over the place.
Any modern filesystem, including today's extent-based filesystems on Linux, is a lot less prone to that kind of problem thanks to much better allocation mechanics. Defragging is normally not needed much.
On CoW filesystems, "fragmentation" has a completely different meaning. It has nothing to do with the physical layout (though the physical layout will also tend to fragment in the old sense, due to the copies that CoW makes) and everything to do with the logical representation of a file. Remember that CoW (and log-structured) filesystems never modify a file in place. Instead they write a new version of the extent and then update the pointers. For large files that receive lots of random in-place overwrites (virtual disk images, databases, torrents), the file ends up as a giant maze of twisty pointers, all alike. This can slightly impact the performance of the filesystem, and in embedded scenarios (Jolla's first smartphone, Raspberry Pi 1, etc.) traversing the maze to find the data you want can be quite resource intensive.
"btrfs defrag" is a process that reads a file and rewrites it as a new contiguous extent (or at least as a series of larger extents), thus de-maze-ifying it. But while doing that, it will - by definition - completely break any shared extents that were part of a snapshot. (Snapshots save space precisely by sharing extents and only storing pointers to the differences between snapshots.)
It also has a couple of other use cases, like recompressing a file (read the old raw file, write a new compressed one with Zstd at level 18).
You can, for example, run a btrfs defrag as part of the post-processing once a torrent has finished downloading (because, due to how torrents write pieces out of order, the file will by then be a huge maze of twisty pointers).
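Something like this as a post-download hook, for instance (paths are placeholders, and only do this on files that aren't supposed to share extents with snapshots you care about):

    # rewrite the finished file as larger contiguous extents
    btrfs filesystem defragment -v /data/torrents/done/big-download.iso
    # or recursively over a directory, recompressing with zstd while at it
    btrfs filesystem defragment -r -v -czstd /data/torrents/done/

("btrfs defrag" above is just the usual shorthand for "btrfs filesystem defragment".)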
But putting defrag in cron will cause constant rewriting of data and will completely fuck up your snapshots. (On CoW systems, keeping 4 point-in-time backups of a 16GB file only takes 16GB plus whatever differences exist between the timepoints. On a classic EXT4+rsync+hardlinks backup system, the 4 timepoints eat 64GB, as you'll have 4 different 16GB files that only differ slightly. By running "defrag", you write an entirely new copy of the file, turning the former situation into the latter and instantly negating any benefit that CoW snapshotting brought.) The constant rewriting will also wear out flash and make the allocator unhappy (more on this later).
You should not run btrfs defrag from cron unless you have a very specific use case and you know exactly what you're doing.
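By the way, if you want to check whether a given file actually is a twisty maze before reaching for defrag, filefrag (it ships with e2fsprogs but works on btrfs too) gives a rough picture - the path here is just an example:

    # rough idea of how many extents a file is currently split into
    filefrag /var/lib/libvirt/images/vm.qcow2
    # per-extent listing; extents shared with snapshots/reflinks are flagged "shared"
    filefrag -v /var/lib/libvirt/images/vm.qcow2

A freshly written file shows a handful of extents; a long-lived VM image or database file that has never been defragged shows far more.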
B - you are using the "autodefrag" mount option.
Which basically tries to reduce the amount of fragmentation in case of heavy random writes: multiple adjacent writes will be grouped together and will coalesce into a single larger write. (Basically that is like running "btrfs defrag", but only on the region of the file that saw a sudden burst of nearby writes all close to each other). It helps against making too much twisty mazes. Depending on your workload, it might help.
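It is a plain mount option, so enabling it is just an fstab entry or a remount (UUID and mount point are placeholders):

    # /etc/fstab
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults,autodefrag  0 0

    # or try it out without rebooting
    mount -o remount,autodefrag /home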
Still, for databases and virtual disk images, the recommendation is to mark the files as nocow and, for integrity and coherence, rely on whatever internal mechanism they have. (Databases usually have their own internal journaling mechanics to survive power-cord-yanking-class problems; virtual disk images have whatever the filesystem inside the image uses. Basically you'd be layering btrfs' and the software's integrity mechanics on top of each other in a redundant manner, which isn't always a brilliant idea.)
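On btrfs the usual way to do that is the "C" file attribute, set on the directory before the files are created (the path is just an example; note that +C only takes effect on new/empty files, and nocow also means no checksumming or compression for those files):

    # new files created in this directory will be NOCOW
    chattr +C /var/lib/libvirt/images
    lsattr -d /var/lib/libvirt/images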
C - you are confusing it with another type of maintenance work (normally provided by maintenance tools such as openSUSE's "btrfs-maintenance" and Jolla's "btrfs-balancer"): balancing.
That is something that is good to perform every now and then, but it isn't as critical as scrubbing. This is due to the fact that btrfs, zfs and bcachefs are all also their own volume managers (similar to LVM) in addition to being filesystems (and in the case of zfs, it implements a completely separate set of volume-management functions instead of sharing the work done by lvm/mdadm/dm/etc., hence the stronger criticism zfs has received with regard to layering violations).
In the case of btrfs, it allocates space in block groups. Whenever it needs to write new data or metadata, it takes free space from the drive, allocates a 1GiB data block group or a 256MiB metadata block group, and then writes inside that block group. Garbage collection of the old, no-longer-referenced CoW copies will leave holes in the middle of older block groups and turn them into swiss cheese. BTRFS has a slight tendency to prefer appending at the end of a recent block group rather than spreading the write across the many tiny holes in old block groups. (More recent versions of btrfs have tuned their allocator to better balance the pros and cons of this strategy.)
Per se, that's not much of a problem. In fact, for media that don't like in-place overwriting (like flash and shingled magnetic recording drives, which need to perform expensive read-modify-write cycles), avoiding filling the holes of the swiss cheese is actually a big advantage. bcachefs has an even stronger tendency to be mostly-append, and Kent touts it as a big advantage for flash and shingled drives (avoids RMW cycles) and for RAID5/6/erasure coding (which might need to perform RMW cycles to update parity if only part of a stripe is updated).
The problem is when you don't have that much space: you might have a bunch of "swiss cheese" data block groups, all filled at ~30%. Except now the system needs to write metadata, all the metadata block groups are full, and thus it needs to allocate a new metadata block group. But if you have run out of unallocated space on the drive, you can't allocate a new metadata block group. You're out of space *despite* your *data* block groups being only ~30% used. You're getting ENOSPC errors.
This problem used to be even more insidious because all the nitty-gritty details of allocation are only shown by btrfs' own tools ("btrfs filesystem df" and "btrfs filesystem usage"), while plain "df" simply showed "70% free" (correct for the space available inside the data block groups, but not the unallocated space left on the drive). This caused panic and incomprehension among users: you had free space (df showed "70% free") yet got ENOSPC error messages in the journal / dmesg / /var/log/messages!
Surely the BTRFS must be corrupted! I need to fix it! Let's run FSCK! (user proceeds to completely trash a perfectly functional btrfs filesystem)
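Instead of panicking, the first thing to do is to look at the filesystem's own view of the space rather than only plain df (the mount point is a placeholder):

    df -h /mnt/data                     # the single, potentially misleading number
    btrfs filesystem df /mnt/data       # per-type (data/metadata/system) allocated vs used
    btrfs filesystem usage /mnt/data    # the full picture, including unallocated space per device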
Balancing as part of the maintenance is a way to mitigate this problem: among the filters you can pass to balance, the "musage" and "dusage" filters tell it to pick up old "swiss cheese" block groups below a given usage threshold. It takes the data from multiple such block groups, compacts it, and allocates a new block group to write it into.
In the scenario above, a simple balance can gather all the ~30%-full "swiss cheese" block groups, rewrite them as a small number of full block groups, and return the remaining space as unallocated. No need to shoot btrfs in the head with some fsck.
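The command for that kind of targeted balance looks something like this (mount point and thresholds are just examples):

    # compact data and metadata block groups that are less than 30% used
    btrfs balance start -dusage=30 -musage=30 /mnt/data
    # check how it went / whether it is still running
    btrfs balance status /mnt/data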
Nowadays the situation is much better.
On one hand, the btrfs allocator has become much smarter and can avoid painting itself into a corner allocation-wise. It sort of balances on its own and avoids leaving too many swiss-cheese block groups around.
On the other hand, the single number returned by df better reflects the actual allocation situation. It will correctly display "0" available in the above scenario, alerting the user that free space really is running low.
But a reasonable amount of balancing is still worth doing (collecting and compacting swiss-cheese block groups with <40% occupancy on a weekly or monthly basis is reasonable). Just remember to balance only *after* coherency has been successfully verified with "scrub".
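If you really want to hand-roll it instead of using a maintenance tool, a minimal sketch could look like this (mount point and thresholds are placeholders; a proper tool handles errors, logging and scheduling far better):

    #!/bin/sh
    # weekly btrfs maintenance sketch: scrub first, balance only if the scrub exited cleanly
    MNT=/mnt/data
    btrfs scrub start -B "$MNT" || exit 1            # -B: run in the foreground; bail out on a non-clean scrub
    btrfs balance start -dusage=40 -musage=40 "$MNT" # compact swiss-cheese block groups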
Using well-made tools (like openSUSE's btrfs-maintenance) is a good idea.