Originally posted by useless
https://www.spinics.net/lists/linux-btrfs/msg94447.html does a better job of explaining why fixing this is hard. Quoting directly:
We can get strategy #1 on btrfs by making two small(ish) changes:

1.1. allocate blocks strictly on stripe-aligned boundaries.
1.2. add a new balance filter that selects only partially filled RAID5/6 stripes for relocation.

The 'ssd' mount option already does 1.1, but it only works for RAID5 arrays with 5 disks and RAID6 arrays with 6 disks because it uses a fixed allocation boundary, and it only works for metadata because...it's coded to work only on metadata. The change would be to have btrfs select an allocation boundary for each block group based on the number of disks in the block group (no new behavior for block groups that aren't raid5/6), and do aligned allocations for both data and metadata.

This creates a problem with free space fragmentation, which we solve with change 1.2. Implementing 1.2 allows balance to repack partially filled stripes into complete stripes, which you will have to do fairly often if you are allocating data strictly on RAID-stripe-aligned boundaries. "Write 4K then fsync" uses 256K of disk space; since writes to partially filled stripes would not be allowed, we have 252K of wasted space and 4K in use. Balance could later pack 64 such 4K extents into a single RAID5 stripe, recovering all the wasted space. Defrag can perform a similar function, collecting multiple 4K extents into a single 256K or larger extent that can be written in a single transaction without wasting space.

Strategy #2 requires some disk format changes:

2.1. add a new block group type for metadata that uses simple replication (raid1c3/raid1c4, already done)
2.2. record all data blocks to be written to partially filled RAID5/6 stripes in a journal before modifying any blocks in the stripe.
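To make the "write 4K then fsync" arithmetic above concrete, here is a minimal sketch (not btrfs code; the 64K stripe element per data disk and the single parity disk for a 5-disk RAID5 are illustrative assumptions) of how a per-block-group, stripe-aligned allocation boundary produces the 256K / 252K / 64-extent numbers:

/* Sketch of the stripe-waste arithmetic from the quote above.
 * Assumptions (not btrfs source): 64K stripe element per data disk,
 * one parity disk for RAID5. */
#include <stdio.h>

#define STRIPE_ELEMENT (64UL * 1024)  /* bytes of data per disk per stripe */

/* Full-stripe data width for a block group with `disks` devices. */
static unsigned long full_stripe_bytes(unsigned int disks, unsigned int parity)
{
    unsigned int data_disks = disks - parity;
    return (unsigned long)data_disks * STRIPE_ELEMENT;
}

/* Round an allocation up to a full-stripe boundary (change 1.1). */
static unsigned long stripe_aligned(unsigned long bytes, unsigned long stripe)
{
    return ((bytes + stripe - 1) / stripe) * stripe;
}

int main(void)
{
    unsigned long stripe = full_stripe_bytes(5, 1);  /* 5-disk RAID5 */
    unsigned long write  = 4UL * 1024;               /* "write 4K then fsync" */
    unsigned long used   = stripe_aligned(write, stripe);

    printf("full stripe: %luK\n", stripe / 1024);          /* 256K */
    printf("allocated:   %luK\n", used / 1024);            /* 256K */
    printf("wasted:      %luK\n", (used - write) / 1024);  /* 252K */
    printf("4K extents balance could repack per stripe: %lu\n",
           stripe / write);                                /* 64 */
    return 0;
}

In these terms, change 1.2 amounts to a balance pass that selects stripes whose used bytes are below full_stripe_bytes() and rewrites their extents into fresh, fully packed stripes, recovering the wasted space.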
There is a reason why it hasn't been fixed yet: it's bloody hard to do. If you actually care that much about your data, ZFS is still the far superior option.