Announcement

**GreenReaper** · 28 January 2020, 12:34 AM

Nah, it's fair. I lost six hours of my user's new content to the btrfs committed transaction without writeback bug in 5.2. The only reason it wasn't more is the server's memory filled up with new data in that time and it finally froze writes. And the only reason that data wasn't all lost completely is that some of it had been served from memory to our caches - which use ext4 and mdadm - that had stored it safely.

Sure, other filesystems have bugs. But this was a doozy and it happened just a few kernel revisions ago. Then there was that poor combination of btrfs send and delayed allocation which could lead to it not sending any data for inodes it hadn't written out yet, quietly corrupting snapshots. And neither of those are new features, nor was the bug itself in new code - it existed since btrfs send was merged.

Btrfs can do a lot. Unfortunately this means it has a lot of bugs, especially when one component reacts unfavourably with another.

**oiaohm** · 28 January 2020, 02:15 AM

Originally posted by xinorom

XFS isn't a CoW filesystem. A filesystem that lacks certain features (that imply bookkeeping overheads) is necessarily going to be faster, or at the very least easier to optimize. I don't know why so many people fail to understand that comparing the performance of a CoW filesystem to a non-CoW filesystem is not an apples-to-apples comparison.

Your claim "XFS isn't a CoW filesystem" is wrong. XFS is a part-CoW file system. Look at the reflink part of XFS to see that.

https://blogs.oracle.com/linux/xfs-data-block-sharing-reflink

You do have copy-on-write behaviour inside XFS; it is just not used all the time. Being a part-CoW file system means XFS drops back to direct in-place writes, protected by the journal, when something is not shared.
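To make the block-sharing idea concrete, here is a minimal sketch of asking the kernel for a shared-block clone from Python. It assumes Linux, and that the `FICLONE` ioctl number below (taken from `<linux/fs.h>`) applies to your platform; the helper name `clone_or_copy` is made up for illustration. On a filesystem without reflink support it simply falls back to an ordinary copy.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int). Linux-only assumption.
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> str:
    """Try to make dst share src's data blocks; fall back to a full copy.

    Returns "reflink" if the kernel shared the blocks, "copy" otherwise.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Ask the filesystem to point dst at the same extents as src.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "reflink"
        except OSError:
            # No reflink support here (e.g. ext4, tmpfs): copy the bytes.
            shutil.copyfileobj(s, d)
            return "copy"
```

After a successful clone the two files share blocks only until one of them is modified; that first write to a shared block is where the copy on write happens.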

Ext4 is currently not a CoW file system in any form, but ext3cow in the past shows that someone could make ext4 another part-CoW file system.

Btrfs and ZFS are both full CoW file systems, which means they cannot skip out on the CoW overhead.

Basically you have three types of file systems.

Traditional: has no copy-on-write functionality.
Part CoW: able to selectively use copy on write when there is a preset reason to do so, for example XFS reflinks.
Full CoW: all operations are done copy on write.

The performance difference between traditional and Part CoW is very small. Feature list difference between a Part CoW and a Full CoW file-system can also get very small.

The big thing a part-CoW file system will always be missing is transparent snapshot creation. A part-CoW file system has to be directed to create snapshots, so you have to snapshot before a modification to record what the modification was. With a full CoW file system you can snapshot after the modification, depending on how much back history it keeps, because that history is created transparently. In a lot of cases, trading away this feature for speed is not going to be a problem. Snapshots taken mid-modification are more often useless than useful.

So there is a really hard question: do you really want a full CoW file system, or do you really want a well-designed part-CoW file system?

**CochainComplex** · 28 January 2020, 03:30 AM

Originally posted by Paradigm Shifter

Ah, it's using a PERC controller. I relearned an important lesson recently: want RAID? Buy a dedicated card.

I was attempting to experiment with RAID (on Linux, should be easy, right?) with a consumer X470 board. In the end I gave up. The "on board" RAID was terrible (and AMD appear to have removed their Linux drivers) so I tried software RAID... which was OK until every reboot, when the array would fall apart and need to be rebuilt.

Well, it depends on the reliability you want to achieve. As you mentioned, though in a slightly different way, consumer-grade hardware RAID controllers are not always very reliable. I also don't know whether rebuilding or accessing your data is straightforward once you have a broken hardware controller. In such a case I would always prefer software over hardware.

Concerning btrfs, I would use the filesystem's own "raid" configuration (if you want simple striping or mirroring). That way you are on the software side, and given how btrfs is structured you don't want an additional layer interfering with your configuration.

**CochainComplex** · 28 January 2020, 03:40 AM

Originally posted by GreenReaper

Nah, it's fair. I lost six hours of my user's new content to the btrfs committed transaction without writeback bug in 5.2. The only reason it wasn't more is the server's memory filled up with new data in that time and it finally froze writes. And the only reason that data wasn't all lost completely is that some of it had been served from memory to our caches - which use ext4 and mdadm - that had stored it safely.

Sure, other filesystems have bugs. But this was a doozy and it happened just a few kernel revisions ago. Then there was that poor combination of btrfs send and delayed allocation which could lead to it not sending any data for inodes it hadn't written out yet, quietly corrupting snapshots. And neither of those are new features, nor was the bug itself in new code - it existed since btrfs send was merged.

Btrfs can do a lot. Unfortunately this means it has a lot of bugs, especially when one component reacts unfavourably with another.

Maybe not totally related, but it is recommended to use the btrfs-progs version that corresponds to your kernel, to avoid strange behaviour when performing operations on the filesystem.

**CochainComplex** · 28 January 2020, 03:44 AM

Michael, what btrfs-progs version have you been using to manage the btrfs configs?

**DrYak** · 28 January 2020, 07:22 AM

Michael: Given that BCacheFS is slowly nearing upstream inclusion, it would be good to start seeing it appear in filesystem benchmarks.

Having ZFS (to have another point of comparison with modern CoW / snapshotting / checksumming filesystems) would be great.

Originally posted by oiaohm

Your claim "XFS isn't a CoW filesystem" is wrong. XFS is a part-CoW file system.

Oh, I didn't realise the feature had finally been declared stable. I was still remembering it as an experimental feature only.
Is it enabled by default, or is it still at the "must be configured" phase?

-----

Also, another detail regarding the tech behind filesystems:

In addition to being fully CoW, BTRFS, ZFS (and BCacheFS) are also fully checksumming: everything, including the data, is checksummed, whereas most other filesystems only checksum their metadata.
(This also burns cycles and slows down performance, in exchange for more reliability.)

Also, F2FS is log-structured, which shares some of the benefits that CoW provides (no in-place overwrites, the possibility to always recover by reverting to an older version, friendlier to append-mostly / overwrite-averse media such as flash, shingled magnetic drives, etc.). It is NOT checksumming its data, though.

So it's surprising that it performs that well compared to EXT4/XFS/etc.

Oh and the usual warnings:
- RAID5/6 are *still* not considered stable by BTRFS.
- CoW file systems are bad at multiple random writes inside large files (e.g. databases, virtual disks, torrents). The current tips are: mount the filesystem with "autodefrag" (= tries to group several writes into one) and mark these specific files as nocow (touch to create an empty file, chattr +C on the empty file, then optionally write any data that you need, e.g. use cat >> to copy from an older CoW version of the disk image, or truncate to reserve empty space for your torrent; enjoy).

For obvious reasons, nocow files also drop checksums. This isn't critical, because such applications tend to have their own internal integrity checks (torrents use hashes as part of their design, databases rely on advanced integrity mechanisms implemented at the file level, and virtual drives rely on whatever the filesystem inside the image has... which could actually be a FAT32 filesystem, in which case you don't get much).
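The ordering in the nocow tip matters (the flag goes on while the file is still empty), so here is a rough sketch of those steps in Python. The function names `chattr_nocow` and `create_nocow_file` are hypothetical, and the sketch assumes the standard `chattr` tool is installed and that the target lives on btrfs; on other filesystems `chattr +C` will normally just fail.

```python
import os
import subprocess
from typing import Callable


def chattr_nocow(path: str) -> None:
    """Set the No_COW attribute using the standard `chattr +C` tool."""
    subprocess.run(["chattr", "+C", path], check=True)


def create_nocow_file(path: str, size: int = 0,
                      set_nocow: Callable[[str], None] = chattr_nocow) -> None:
    """Follow the order from the post: empty file, +C, then data/space."""
    # 1. "touch": create a new empty file, refusing to reuse an existing
    #    one, since the flag is meant to be set before any data is written.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    # 2. chattr +C on the still-empty file.
    set_nocow(path)
    # 3. Optionally extend the file, like `truncate -s` (sparse: this sets
    #    the length but does not write any blocks yet).
    if size:
        os.truncate(path, size)
```

Something like `create_nocow_file("/var/lib/vm/disk.img", 20 * 2**30)` (path made up) would then give an empty 20 GiB nocow image to fill. The `set_nocow` parameter is only there so the flag-setting step can be swapped out, for instance when trying the function on a non-btrfs machine.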

**S.Pam** · 28 January 2020, 07:47 AM

I would not suggest using nocow for databases. Instead, you can turn off double-writes (MariaDB/MySQL) since double writes are not needed on COW filesystems. Also a nocow file will still be COWed if you do snapshots and the like.

With regard to RAID5, I think the write hole "worries" are exaggerated. You need two faults for it to be problematic: both a crash AND a disk failure. If you scrub after unclean shutdowns you should be safe. Further, since Btrfs supports different RAID modes (aka profiles) for metadata and data, it is recommended to run RAID1 or RAID1c3 for metadata together with RAID5 for data. The write hole exists for other RAID implementations as well (mdadm, BIOS RAID, HW RAID cards...) unless you take specific precautions.
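For anyone wondering why it takes two faults, here is a toy model of a single stripe with XOR parity. It is a deliberately simplified sketch, not how btrfs or mdadm lay out data: the three "disks" are just Python variables, and a scrub is modelled as recomputing parity from the data blocks that are still readable.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# A consistent stripe on three hypothetical disks: two data blocks + parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Fault 1: crash after d0 was rewritten but before parity was updated.
d0 = b"CCCC"              # the new data reached the disk
stale_parity = parity     # the matching parity update was lost

# Fault 2: the disk holding d1 dies before any scrub has run.
# Rebuilding d1 from d0 and the stale parity gives silently wrong data.
rebuilt_d1_bad = xor_blocks(d0, stale_parity)

# Alternative: scrub right after the unclean shutdown, while d1 is still
# readable, so parity matches the data again before any disk is lost.
fresh_parity = xor_blocks(d0, d1)
rebuilt_d1_ok = xor_blocks(d0, fresh_parity)

print(rebuilt_d1_bad)  # b'@@@@'
print(rebuilt_d1_ok)   # b'BBBB'
```

With only the crash, d1 is still on disk and nothing is lost; with only the disk failure, parity is consistent and the rebuild is correct. It is the combination, with no scrub in between, that bites.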

Though, I must admit, I almost always use RAID1. Disk space is not that often an issue, and it is usually easier and quicker to rebuild.

https://btrfs.wiki.kernel.org/index.php/Status has some stats

**starshipeleven** · 28 January 2020, 07:52 AM

Originally posted by GreenReaper

Recommended by the way it's been used by Facebook and Synology

Recommended? Where did they recommend this?

That's how they use it for their own specific usecase, with their own specific decisions on tradeoffs.

Facebook isn't using it for data storage but for easy restore and snapshot of frontline, expendable containers and VMs for webservers and other computing services that don't store data inside themselves (there are database servers and storage servers for that). Their approach to data integrity issues is "terminate the container/VM and restore a backup".

I'm not aware of them using it on top of mdadm.

Synology (and others in the NAS sector) are using it only for its snapshot capabilities, not for its data integrity ability, see below why.

as checksumming and snapshot layer over the top of the block storage

This is NOT recommended by btrfs developers, and it's useless for data integrity, as btrfs without any parity can't fix any of the problems the checksumming will detect.

Running a btrfs volume with data=dup (so that you have two copies of the data and can therefore fix data integrity issues) is effectively a RAID1, and running a RAID1 on top of a RAID5/6 is nonsense: you are wasting space for no reason.
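The detect-versus-repair distinction can be shown in a few lines. This is only an illustrative sketch: `read_verified` is a made-up helper, and `zlib.crc32` stands in for whatever checksum the filesystem really uses (btrfs defaults to crc32c, which is a different polynomial).

```python
import zlib
from typing import Optional, Sequence


def read_verified(copies: Sequence[bytes], stored_crc: int) -> Optional[bytes]:
    """Return the first copy whose checksum matches, or None if none does."""
    for data in copies:
        if zlib.crc32(data) == stored_crc:
            return data
    return None  # corruption detected, but there is nothing to repair from


good = b"original media file"
crc = zlib.crc32(good)          # checksum recorded at write time
bad = b"original medit file"    # simulated corruption from the layer below

single = read_verified([bad], crc)        # one copy: detect only -> None
dup = read_verified([bad, good], crc)     # two copies: detect and recover
```

With a single copy the best the filesystem can do is refuse to hand back bad data; with a second copy it can return the good one and, in a real filesystem, rewrite the bad one.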

In our case we expect to use it to store original media files, so the checksumming is important to us

You really should have done your homework and not have randoms on the internet correct you.

Btrfs cannot do what you ask at the level of reliability you require. Layering it on top of mdadm is a tradeoff where you agree that you don't need some of its features (self-healing data integrity).

The only filesystem that can do what you ask is ZFS.

**S.Pam** · 28 January 2020, 09:30 AM

To be more fair, checksums do help as you'd discover bad data before it migrates over to your backups. This is a perfectly valid use case.

When you do not need high availability / uptime you may as well run data=single and use hourly snapshots with frequent backups. This has the advantage of file versioning AND data integrity without the need for triple space usage (RAID mirror + backup). In other words, this suits typical home and small office users.

**starshipeleven** · 28 January 2020, 11:38 AM

Originally posted by Spam

To be more fair, checksums do help as you'd discover bad data before it migrates over to your backups. This is a perfectly valid use case.

When you do not need high availability / uptime you may as well run data=single and use hourly snapshots with frequent backups. This has the advantage of file versioning AND data integrity without the need for triple space usage (RAID mirror + backup). In other words, this suits typical home and small office users.

Yeah, this is another valid use case, given different requirements.

If you scale it a little bigger, you can use btrfs with data=single as the "lower part" of a cluster filesystem in a SAN, and do away with RAID entirely, as now you have multiple servers anyway and they are your availability. If an entire server's drives blow up, you just replace them and the data is restored from the other servers in the SAN.

The main issue with btrfs is that these use cases are what bring in the most $$$, so they are the functions that work well and are used in production, while stuff used by home users and enthusiasts, like RAID5/6, is still somewhat neglected.

Meanwhile, filesystems that were developed for the reality of a decade ago, like ZFS, can do that and are fine.

Personally, for most home-user systems (and also my own NAS if I wasn't a fucking nerd) I would go with something far simpler like SnapRaid and a scheduled scrub, as the feature set is much more aligned with their usecase than ZFS and btrfs are, and it's also much simpler to understand, set up, maintain and recover from disk failures. https://www.snapraid.it/compare

Linux 5.5 SSD RAID 0/1/5/6/10 Benchmarks Of Btrfs / EXT4 / F2FS / XFS
