Announcement

**SystemCrasher** · 21 January 2016, 10:57 AM

Originally posted by blackiwid View Post

sadly btrfs at one point dont eats your data, thats good, but its very hard to make any use of any of its advantages, its very hard to mainntain you run even with very conservative usage into such problems that your fs iis full and you have to type cracy magic fs specific commands to get space back from it.

BS. I use btrfs on quite many various things these days and in typical scenario I would have hard time noticing it wasn't EXT4 or something.

1) Hard to maintain? Not really. In typical, more or less sane use cases it takes more or less same amount of care like EXT4 or XFS. I.e. close to zero. It can be not a case in some exotic setupts like very small storages below of ~10GiB or running in DISK FULL conditions for a while. But in typical use cases it just works.
2) I use snapshots and reflinked (insta-created, CoW-backed) copies. IMHO these are cool features. I do not get why it considered hard to use. These are fairly logical and managinv VMs in advanced fashion is probably harder thing to do, yet whole legions of ppl are doing it.
3) No, you do not have to type some commands to get space back. It only needed in some few really strange use cases. Like putting btrfs on small SD card or flash stick, which isn't best idea ever, generally. Yet, those who are absolutely inclined on doing so, can use mixed block groups (mkfs.btrfs -M ... ). This avoids prob alltogether, while being less optimal choice on larger storages.

**SystemCrasher** · 21 January 2016, 11:37 AM

Originally posted by DrYak View Post

weird partition alignement (both the starting sector of the partition and the internal structure of the FAT32), so that the partition's boot sector ends up in the same not often rewritten erase block as the partition table, and then each subsequent FAT is in its own erase block, and the main directory is tweak to exactly fill yet another erase block

That's a case. Take a look on factory formatting of SD cards and flash sticks, you'll see

. Reformatting card or stick can also lead to serious loss of performance, especially if you manage to put fileystem blocks crossing NAND page boundaries in unfortunate way. Write amplification would happen and you can easily kill write speed like 2x on small files, because instead of writing one page now you have to write two, due to bad FS vs NAND blocks alignment.

Putting boot sector to separate erase block happens because if you write something to e.g. FAT block, firmware actually does Read-Modify-Write sequence and if you suddenly lose power at this point... it going to be very sad if you lose whole partition table thing, right? At this point you'll get utterly dead storage, with no even partition table left. FAT can recover from 2nd copy, but if you've lost partition table, its a bummer. So when formatting flash media it can make a lot of sense to leave enough free space past MBR boot sector unused. I've seen like 3 dead things with totally missing partition table. That's what one gets for clueless formatting of cards and sticks. It is possible to recover by recomputing partition table, but it's not something Average Joe can handle...

Then, trying to align filesystem on eraseblocks/page boundaries also makes sense. But quite hard to do, because most storages do not expose true geometry. Though something like OpenChannel seems to follow this way and its right thing to do.

**energyman** · 21 January 2016, 01:13 PM

another meaningless test.

ZFS was never meant for single disk usage.
There is no check that fuse actually let the data hit the disk.

Wouldn't be the first time a FS would report back with a successful sync while in truth the data is still on the fly.

**ryao** · 21 January 2016, 05:55 PM

Originally posted by SystemCrasher View Post

BS. I use btrfs on quite many various things these days and in typical scenario I would have hard time noticing it wasn't EXT4 or something.

1) Hard to maintain? Not really. In typical, more or less sane use cases it takes more or less same amount of care like EXT4 or XFS. I.e. close to zero. It can be not a case in some exotic setupts like very small storages below of ~10GiB or running in DISK FULL conditions for a while. But in typical use cases it just works.
2) I use snapshots and reflinked (insta-created, CoW-backed) copies. IMHO these are cool features. I do not get why it considered hard to use. These are fairly logical and managinv VMs in advanced fashion is probably harder thing to do, yet whole legions of ppl are doing it.
3) No, you do not have to type some commands to get space back. It only needed in some few really strange use cases. Like putting btrfs on small SD card or flash stick, which isn't best idea ever, generally. Yet, those who are absolutely inclined on doing so, can use mixed block groups (mkfs.btrfs -M ... ). This avoids prob alltogether, while being less optimal choice on larger storages.

What you call very small storage is the norm on virtual machine providers like Digital Ocean. I tried compiling ZFS on CoreOS on Digital Ocean about 15 months ago before CoreOS dropped btrfs. I not only had ENOSPC passed to the build system, but I had ENOSPC passed to the rebalancing tool.
Managing VMs on block devices makes more sense than putting them on actual files. btrfs does not support creating block device subvolumes (a much less confusing usage of a variation on the term volume).
Or virtual machines, old disks/SSDs, etcetera. You cannot hand wave away a bug that breaks the expected behavior of a filesystem as "some few really strange use cases" and expect to have a production filesystem driver. This bug needs to be fixed. There is a chance that it has been fixed, but it will take years before the fix filters down to everyone and that is only if the fix got all cases this time. If the problem really is that the filesystem must not be smaller than a certain size, the right thing to do would be to enforce a minimum size limit so people cannot make filesystems that small. My intuition is that this probably will not fix such problems, because size is a relative thing. If you have problems at small filesystem sizes, you are going to have problems at larger filesystem sizes. The only difference will be that the size of the problems will be bigger when it finally does happen.

**ryao** · 21 January 2016, 05:57 PM

Originally posted by Michael

For making the results reproducible and representative of the out-of-the-box experience, each file-system was mounted with its stock mount options.

These tests are not reproducible at all.

First, it is not clear which whether any partitioning was done and what the alignment was if partitioning was done, so it is not possible to reproduce them even if you had the money to buy the exact same hardware. If you are doing partitioning for other filesystems, you should not be doing it for ZFS because that causes it to use the default IO elevator (likely CFQ) in addition to its own, which is a handicap. If it is given the whole disk, it handles partitioning in a sane way and will use the noop IO elevator, which gives best performance by staying out of the way of ZFS' elevator as much as possible.

Second, since next to no one in your audience is going to buy the exact same hardware as you and no one is going to use hardware in a misconfigured configuration, you need to adjust for hardware quirks. Otherwise, it is impossible to produce a faithful comparison, even if we pretend that the benchmark numbers are actually meaningful.

Third, there is no consistency in the drives you use across published benchmarks even within the span of a few weeks. That makes it difficult to make sense of what actually changed when looking at your numbers, such as the FIO results on these benchmarks done 19 days apart:

ZFS Still Trying To Compete With EXT4 & Btrfs On Linux - Phoronix

https://www.phoronix.com/scan.php?page=article&item=zfs_linux_062&num=2

Phoronix, Linux Hardware Reviews, Linux hardware benchmarks, Linux server benchmarks, Linux benchmarking, Desktop Linux, Linux performance, Open Source graphics, Linux How To, Ubuntu benchmarks, Ubuntu hardware, Phoronix Test Suite

10-Way Linux File-System Comparison On Linux 3.10 - Phoronix

https://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=2

Phoronix, Linux Hardware Reviews, Linux hardware benchmarks, Linux server benchmarks, Linux benchmarking, Desktop Linux, Linux performance, Open Source graphics, Linux How To, Ubuntu benchmarks, Ubuntu hardware, Phoronix Test Suite

My best guess is that the quirks database that I implemented a few weeks before the second set of benchmarks is why ZFS went from doing worse than ext4 (which doesn't need a database on the drives Michael uses because it assumes 4KB by default) to being twice as capable, but it is hard to be certain. Michael has not used hardware that made use of the quirks database again since then, as we can see from the only benchmark he has done between then and now:

https://www.phoronix.com/scan.php?pa...x-41-zfs&num=1

That last point is one to which I pay particular attention because the ZoL quirks list needs to be updated as new quirky hardware is released. Anyway, none of the ZFS numbers are representative because of these issues.

I had written a nice critique of all of the benchmarks here, but my post seems to have hit a character limit and was lost, so I will need to rewrite it some other time.

**ryao** · 21 January 2016, 06:39 PM

Here is my critique of each benchmark, written a second time. Hopefully it will not be lost this time:

SQLite

I have not examined this with my analysis tools, but from what I know of it by reputation, I suspect this might be one of the few benchmarks here that are okay for Michael's purposes as used because it is embedded in plenty of places where expecting a proper recordsize is unlikely. One example is firefox, which uses it in people's home directories. The whole filesystem still needs to be configured for the drive's actual sector size though. Given how FreeBSD's ZFS driver did on SQLite when given a drive listed in its quirks database (such that it compensated for the quirk by setting alignment shift properly on creation), there is a good chance ZoL will do significantly better here when the alignment shift is set properly on creation. Knowing whether it actually does is enough to make me consider doing my own benchmark tests correctly.

FS-Mark

This tests how quickly you can fsync dirty data. This does not matter on desktop workloads. Enterprise workloads where dirty data writeout does matter are not represented by the IO pattern here at all. That being to write a bunch of big files and fsync them. ext4 does particularly well here because its ->writepages path and journal path do not block each other. ZFS does not scale particularly well because despite ZIL batching things, it still executes batches serially (no pipelining for now). Performance would easily shoot up if sync=disabled were set on the dataset. That is typically the right thing to do for this sort of workload where losing the last 5 seconds of data does not matter and the developer is not expected to fix his software. I posted detailed analysis of this last year:

http://www.phoronix.com/forums/forum...454#post811454

I do not know about the other filesystems on this one, but a database workload would be one of the few things that I would expect to care about fsync performance of dirty data. The workload here does not correspond to any real world workload known to me and consequently the numbers are useless for determining which filesystem is better for an actual workload that matters. While the SQLite benchmark probably isn't a great indicator for non-embedded database uses (intended for embedded use and is likely being exercised by a single client), the inability of both the multihreaded and single threaded versions of FS-Mark to correlate with the SQLite result could be considered to raise a flag as for the usefulness of these numbers.

DBench

I have not examined this with my analysis tools, but from what I know of it by reputation, this was designed by the Samba developers to simulate SMB workloads. I have not verified that this meets their goal, but I am inclined to expect it to do that. However, this probably needs some scrutiny (preferably by the Samba developers) to tell if Michael is using it properly.

Compile Bench

I have not examined this with my analysis tools, but from the description of what it does and what the actual numbers are, I do not feel any need to do that. This is a useless benchmark that does not have anything to do with compilaton. What it actually does do is measure average throughput when trying to stuff a large amount of dirty data into memory, which isn't a realistic workload. This is evident when one considers that the numbers produced in this benchmark often exceed the theoretical limits of the storage devices. Better numbers here could indicate potential performance problems from IO stalls when memory cannot hold anymore dirty data. ZFS is explicitly designed to throttle writes in such a scenario to maintain consistent performance, so it should not do well here unless the tunables on dirty data writeout behavior changed. I have not examined the FreeBSD ZFS driver to know why PC-BSD got such high numbers, but I hope that whatever was done there did not compromise the dirty data throttle.

An actual benchmark that does something similar in a way that would matter would be extracting a Linux kernel source tar archive.

FIO

This used DirectIO, which is a non-standard way to take advantage of in-place filesystem design for a performance boost when developers know how to use it properly and jump through the hoops necessary to do that, not a thing that should be allowed to break a workload when a filesystem cannot support it and refuses to pretend that it can. Naively adding it to a workload is always detrimental because if it were actually a good thing to enable in general, it would be on without userland needing to ask and there would be no flag to pass to open().

The semantics as implemented in XFS (where DirectIO was invented) are the closest thing to any actual standard and are incompatible with checksums, compression, snapshots, multiple disk configurations and just about anything else that differentiates CoW filesystems from XFS. This makes it incompatible with CoW filesystems like ZFS and btrfs, although btrfs manages to run here. btrfs is likely either ignoring it or pretending to be an in-place filesystem by doing something like an implicit nodatacow, which puts data integrity at risk (admittedly, data on XFS has the same risk). It would be easy to tell ZFS to ignore it, but that would confuse database administrators, which would be more detrimental than helpful.

If you want to do benchmarks where DirectIO might be relevant, run a database workload on an actual production database and turn off DirectIO on the filesystems that do not support it. That would produce a representative benchmark numbers while using DirectIO in one of the few contexts where it would be beneficial for perfomrance.

If you want to get decent numbers out of FIO, use Brendan Gregg's FIO commands:

https://gist.github.com/brendangregg...9698c70d9e7496

That said, do not use DirectIO with FIO unless you can clearly state what it is what you think the numbers actually would mean and how that meaning matters to those reading it, like what Kevin Closson did, except he used dd:

http://kevinclosson.net/2012/03/06/y...s-versus-ext4/

In his case, any workload that runs through the ->write_iter() path rather than the ->writepage/->writepages path would suffice because ext4 is serializing that path behind the inode lock. Some examples are DirectIO, vectored IO and AIO. The sort of reasoning behind that kind of benchmark is very different than the benchmarks on Phoronix using DirectIO, where no explanation of how the numbers actually matter could be provided.

**blackiwid** · 21 January 2016, 09:04 PM

Originally posted by SystemCrasher View Post

BS. I use btrfs on quite many various things these days and in typical scenario I would have hard time noticing it wasn't EXT4 or something.

1) Hard to maintain? Not really. In typical, more or less sane use cases it takes more or less same amount of care like EXT4 or XFS. I.e. close to zero. It can be not a case in some exotic setupts like very small storages below of ~10GiB or running in DISK FULL conditions for a while. But in typical use cases it just works.

120gb ssd is esoteric, running full if the normal df gives you no usable numbers and you dont even have a idea how much diskspace is free... so a simple question how much disc space you have is a very complicated with btrfs, so there are maybe 2 3 different answers that are all true in some way.

I run into that, I also had big speed issues that my system got extremly slow because I had some kind of rpm-btrfs-snapshot package installed.

Yes you maybe have months without any issues but if you run into them they take you longer to solve, instead of ahh itts full lets delete 10 big files... you have to google some cracy btrfs specific commands I never have to do that with ext4, I just delete the files and its fine.

Originally posted by SystemCrasher View Post

2) I use snapshots and reflinked (insta-created, CoW-backed) copies. IMHO these are cool features. I do not get why it considered hard to use. These are fairly logical and managinv VMs in advanced fashion is probably harder thing to do, yet whole legions of ppl are doing it.

I played with guixsd, they do same things just with harddisk completly automated save with ext4 without even systemd, so its not so much about the os but the system on top of it. of course theoreticly btrfs allows you to do some stuff maybe cheaper/faster or cleaner, but the tools for it are just not there.

Originally posted by SystemCrasher View Post

3) No, you do not have to type some commands to get space back. It only needed in some few really strange use cases. Like putting btrfs on small SD card or flash stick, which isn't best idea ever, generally. Yet, those who are absolutely inclined on doing so, can use mixed block groups (mkfs.btrfs -M ... ). This avoids prob alltogether, while being less optimal choice on larger storages.

getting a full hd happens from time to time. thats not strange, I dont use 10 harddisks in some strange pools in my laptop I use a ssd and I somethimes copy some gb of movies to it when I have to go away. I dont get whats strange there. git-annex is here a word how you can easily do such cracy strange stuff.

**SystemCrasher** · 22 January 2016, 08:57 PM

Originally posted by ryao View Post

[*]What you call very small storage is the norm on virtual machine providers like Digital Ocean. I tried compiling ZFS on CoreOS on Digital Ocean about 15 months ago before CoreOS dropped btrfs. I not only had ENOSPC passed to the build system, but I had ENOSPC passed to the rebalancing tool.

Still, CoreOS is some specific niche thing, small storage is quite specific scenario either. Digital Ocean isn't center of universe, there're thousands of hosting companies using really different solutions and providing wildly different offers and techs. And as I've told, if one really needs btrfs on small storage, there're "mixed block groups". Technically, on btrfs, free space allocation on storage happens in "chunks". Chunks are fairly large, IIRC typical allocation unit for chunk is like 0.5GiB. Normally chunks store either data OR metadata. It allows one to get different redundancy schemes for data and metadata, etc. But on small storages below ~10GiB it easily gets "disbalanced": quantum effects appear and can play poor joke. That's where mixed block groups make sense, so data and metadata are stored in same chunks. And NOSPC thing has been fixed quite a lot. I wonder if you still can face it in hard to eliminate ways on recent kernels. Realistically speaking, running CoW with low disk space isn't best idea ever. Even usual filesystems would get explosive growth in fragmentation. CoWs would do even worse. And when it comes to ZFS... once you get "fragments explosion", uhm, well, in btrfs you can at least run defragger. But on ZFS... you can just e.g. move data away manually and reassemble pool. There is no way to get performance to sane levels easily as far as I know. Sure, Sun marketing BS prefers to stay silent about things like this. That's what I dislike about Sun and their techs.

Managing VMs on block devices makes more sense than putting them on actual files.

Theoretically it is the case. As well as placing DBs on RAW partitions is most optimal in terms of performance/parsing. But managing VMs or DBs this way happens to be pain in the rear so it rarely used. At the end of day, admins can't use file-based tools - HUGE disadvantage. And when it comes down to using filesystem as relatively thin "management"/convenience layer, proper file preallocation can get rid of most fragmentation issues, and parsing filesystem+file is usually much smaller issue. At this point more convenient management usually outweighs any remaining ganis.

btrfs does not support creating block device subvolumes (a much less confusing usage of a variation on the term volume).

Sun did it because IIRC Solaris had no other facilities to do it when they created ZFS. It makes little sense on Linux though, since it got own facilities for RAIDs, volume management, etc. So it just happens to be some duplicate code to do the very same kind of things. Hardly counts as advantage on its own (in Linux context).

Btrfs subvolumes are entirely different thing - these are "hierarchies", enjoying by CoW services and sharing same storage space. Since you can have more than one version of e.g. "/", and you can have more than one version of some other parts of hierarchies, etc - it makes sense to reference it somehow beyond of usual naming. This is what they call subvolumes. Snapshots aren't much different either. These are hierarchies, usually sharing blocks with something else. There are no big differences between RO and RW snapshots except formal permission to do CoW-backed writes on this particular hierarchy.

Btrfs design is centered around idea RAID is not going to be block-level thing anymore. It rather per-file thing. So one can potentially have various RAID levels on same set of devices, which is logical, since files not necessarily have equal value or require different tradeoffs. One of reasons why it allocates space in chunks. Chunk supposed to have some allocation scheme. Technically there can be arbitrary mix of chunks on devices. If no suitable chunk found, btrfs attempts to allocate new one. And in some places it haves interesting notion of "free space". It means "free of btrfs presence". In sense this space isn't yet allocated into any chunks. It means "empty surface, not used by btrfs yet". So most things come down to being CoW-backed hierarchy. Simplest being a "reflinked copy", it not anyhow distinguished at all. You just get some hierarchy which references to old one and initially shares all blocks, that's what allows to "insta-copy" large things.

Overall this design happens to be flexible in terms of how it allocated. It can swallow pre-existing EXT4, pretending it was initial snapshot. It can easily move away from some drive, allowing to remove device from pool, as long as there was enough space on other drives.

And how about VMs like this:
cp --reflink VM1 VM2 (takes about second for 10 GiB diskfile).

Then, only differences are stored. Basically, ability to deploy VM from template in a SECOND + space savings out of the box. Needless to say that's how I deal with VMs. Some caveats apply. I.e. it is unwise to do CoW twice, and using filesystem to snapshot VM instead of VM tools takes some special care about consistency. But fundamentally there is nothing wrong in doing CoW in FS rather than in VM diskfiles, etc. Same thing, different appoach.

[*]Or virtual machines, old disks/SSDs, etcetera. You cannot hand wave away a bug that breaks the expected behavior of a filesystem as "some few really strange use cases" and expect to have a production filesystem driver.

That's why btrfs, just like zfs, got some tunables. Yet, it is really desirable if filesystem shows decent behavior on typical workloads without user's intervention. Sure, fiddling with bunch of VMs right without introducing unreasonable overhead and avoiding obviously problematic scenarios can take some expertise. But Michael does not deploys VM servers, he just takes a look on simple desktop/server use case. When someone deploys their homepage or blog on their server, they do not spend much time on tuning filesystems, could lack required expertise and it is way to expensive to hire gurus, lol. So, defaults matter. Defaults will always prevail.

This bug needs to be fixed. There is a chance that it has been fixed, but it will take years before the fix filters down to everyone and that is only if the fix got all cases this time.

Mixed block groups are available for a while. So this bug has been fixed ages ago. But who reads damn manuals, right? Well, I do not remember, but IIRC since some version mkfs.btrfs defaults to mixed-bg for small storages.

If the problem really is that the filesystem must not be smaller than a certain size, the right thing to do would be to enforce a minimum size limit so people cannot make filesystems that small. My intuition is that this probably will not fix such problems, because size is a relative thing.

This is related to "chunks" thing and there is tunable to allow different tradeoff for small storages. Once data and metadata can be stored in same chunks, there is no longer reason to get "unbalanced", not even on small storage. But storing data and metadata in explicitly separated chunks allows to assign different redundancy schemes to data and metadata. E.g. on single drive it is common to see "dup" scheme for metadata and "single" for data. It means it keeps TWO copy of metadata, even on single drive storage, so if something goes bad, metadata are hopefully going to be last thing to die. Makes some sense, isn't it?

If you have problems at small filesystem sizes, you are going to have problems at larger filesystem sizes.

No, once there're enough chunks, metadata and data chunks would normally balance on its own and rebalancing wouldn't be needed except maybe some unusual cases. Small storages are just too small, there're too few chunks, quantum effects could dominate. I.e. on 10GiB volume, 0.5GiB chunk is a big deal since it accounts 5% of space. Miss one and data vs metadata balance is seriously skewed, leading to poor space usage. On terabyte storage it is not a problem at all. Mixed-bg lacks this problem, by keeping data and metadata in same chunks. But it only makes sense on small storages. In larger installs one may want to enjoy by different redundancy schemes for data and metadata.

**SystemCrasher** · 23 January 2016, 10:10 AM

Originally posted by blackiwid View Post

120gb ssd is esoteric,

120GiB is just a very typical desktop these days. At least system drive of it. Those who still do not use SSD for at least system drive on desktop are masochistically inclined, or something. So it is very reasonable to benchmark how various filesystems perform on SSDs. Yet, to be fair, one should TRIM surface before each test, to make starting conditions more or less the same.

running full if the normal df gives you no usable numbers and you dont even have a idea how much diskspace is free...

In simple setups df shows something realistic. When it comes to more complicated setups, it comes down to the fact btrfs can technically use mix of RAID levels to store data. There is no a priori knowledge which RAID scheme is going to be requested and how it would map to existing devices and available free space on these devices. It isn't fully implemented/exposed to users right now, but it is one of core features of design, most parts are here and most code paths make no assumptions about fixed allocation of RAIDs.

In btrfs there're devices and free space on them. When it have to do e.g. mirror, it is like this: take any 2 devices, store 2 copy of blocks. These do not have to be some fixed sets of fixed size. It only matters to have ANY two devices with enough free space on them to place requested data. Next write can take another 2 devices, or whatever. However every write haves 2 copy of blocks, hence mirroring constraint stands and failure of single drive does not brings mirror down. Because all blocks on failed drive had mirror copy on other devices.

This means some fancy things. Imagine you have 2 x 1TiB drives + 2 TiB drive. Block-level raids would struggle to exceed 1TiB for "mirror" RAID type in this scenario. Btrfs would easily do 2TiB RAID. Because it does not cares of alignment and somesuch. All it needs are 2 devices with enough free space. So it would be able to do something clos to 2TiB mirror. Because it is possible to take 2 devices and write mirrored block on both, and do so for about 2TiB of data.

This implies adding devices never been so easy. Now one do not have to care much about device sizes and space loss is much less concern. Idea is: add to pool whatever you've got, get some extra space and rock-n-roll. It is also possible to make pool smaller, you can move data away and remove drive, as long as other devices have enough free space to do so. Reducing size of block-level raids is faaaar more complicated and time consuming task.

so a simple question how much disc space you have is a very complicated with btrfs, so there are maybe 2 3 different answers that are all true in some way.

Right. Because answer depends on another answer aka "which redundancy scheme you're going to use for next writes?". It is not known a priori, and design is meant to allow changing it for different writes. Unlike most designs on the Earth, this thing can potenitally store various files using various redundancy schemes. Why do you think you can convert RAID 1 to RAID 5 on the fly on btrfs? Btrfs is perfectly fine with the fact there're RAID1 and RAID5 blocks are coexisting for some time. And conversion is like this: read RAID1 blocks, write RAID5 blocks. Works fine as long as there're enough devices and space. It is really different approach to storage management. Well beyond of what ZFS dared to do. This design meant to do managing multi-drive pools logical and simple. As well as making snapshots easy, etc.

I run into that, I also had big speed issues that my system got extremly slow because I had some kind of rpm-btrfs-snapshot package installed.

One have to care of snapshots. Filesystem can't discard blocks used by some file if they are referenced by something else. Erasing file means nothing, if its blocks are held also by snapshot. Blocks just getting one less reference, but unless reference count reaches zero, there is no way to free blocks wihout damaging something. This means if you've got snapshots, reclaiming space can get problematic and snapshots can occupy a lot of space. The very same issue exists with e.g. VM snapshots in VM CoW diskfiles, etc. If you're being overzealous in making numbers of snapshots and/or grow too large "delta" from some snapshot and careless about trimming snapshots and re-taking them in less distand point in time, you can soon figure out your 10GiB VM now takes more than 100GiBs. Because of snapshots and "deltas" are taking disk space.

One really supposed to keep sane numbers of snapshots and understand the fact blocks taken by snapshots can't be freed and reclaimed.

Yes you maybe have months without any issues but if you run into them they take you longer to solve, instead of ahh itts full lets delete 10 big files... you have to google some cracy btrfs specific commands I never have to do that with ext4, I just delete the files and its fine.

It seems you want to drive the car, or even ride bus as passenger, not to fly airplane. Yes, taking best of btrfs takes some understanding how it differs from other filesystems. Then it runs smoothly. Sorry, but you can't apply your car driver experience obtained on EXT4 to flying CoW ariplanes like btrfs, VM CoW drives, ... . And good luck to make snapshots on EXT4, LVM can do it, but you better to try it, then you'll get idea why we like btrfs snapshots.

I played with guixsd, they do same things just with harddisk completly automated save with ext4 without even systemd, so its not so much about the os but the system on top of it. of course theoreticly btrfs allows you to do some stuff maybe cheaper/faster or cleaner, but the tools for it are just not there.

As I've told, feel free to try real snapshots on top of e.g. EXT4. Real snapshot means "filesystem state at certain point of time". Or, actually, "hierarchy state" in btrfs. With real snapshot I can e.g. erase /home, swear I've did something terribly wrong and roll it back to saved filesystem state ("snapshot"), getting most of my files back and only losing changes since this point of time. Of course it means some storage requirements, and these are only resembling EXT4 appetites if all versions are identical and share same set of blocks. Shoud files diverge, CoW would do its magic, unsharing changed blocks, software would not have even slightest idea what's going on. But storage demands would increas.

getting a full hd happens from time to time. thats not strange, I dont use 10 harddisks in some strange pools in my laptop I use a ssd and I somethimes copy some gb of movies to it when I have to go away. I dont get whats strange there. git-annex is here a word how you can easily do such cracy strange stuff.

Git would not work well for recovering my erased /home, nor it can do it in way which is completely transparent to all applications. Program just writes file. It haves no idea if CoW e.g. unshares blocks at this point, etc.

With btrfs I can e.g. mount 5 different versions of my /home to various points and maybe even change some of them ("writeable snapshot") or even eventualy switching to one of these states as default way to mount /home. When it comes to VCS, they are doing something similar in spirit, but in really different ways. In no way it happens to be transparent to other programs, and overall it just entirely different solution with entirely different properties, even if some core of idea could be same. If it works better for you - fine, it would be stupid to insist one size should fit all. Some ppl want to ride bus, some want to pilot airplane. Does not means we must stop using either buses or airplanes. They come with really different tradeoffs.

**ryao** · 23 January 2016, 12:07 PM

Originally posted by SystemCrasher View Post

Still, CoreOS is some specific niche thing, small storage is quite specific scenario either. Digital Ocean isn't center of universe, there're thousands of hosting companies using really different solutions and providing wildly different offers and techs. And as I've told, if one really needs btrfs on small storage, there're "mixed block groups". Technically, on btrfs, free space allocation on storage happens in "chunks". Chunks are fairly large, IIRC typical allocation unit for chunk is like 0.5GiB. Normally chunks store either data OR metadata. It allows one to get different redundancy schemes for data and metadata, etc. But on small storages below ~10GiB it easily gets "disbalanced": quantum effects appear and can play poor joke. That's where mixed block groups make sense, so data and metadata are stored in same chunks. And NOSPC thing has been fixed quite a lot. I wonder if you still can face it in hard to eliminate ways on recent kernels. Realistically speaking, running CoW with low disk space isn't best idea ever. Even usual filesystems would get explosive growth in fragmentation. CoWs would do even worse. And when it comes to ZFS... once you get "fragments explosion", uhm, well, in btrfs you can at least run defragger. But on ZFS... you can just e.g. move data away manually and reassemble pool. There is no way to get performance to sane levels easily as far as I know. Sure, Sun marketing BS prefers to stay silent about things like this. That's what I dislike about Sun and their techs.

Fragmentation should be a problem for performance, not for the implementation of expected POSIX semantics. It is fine to have fragmentation use more space than ordinary. However, if fragmentation can cause ENOSPC when you have a non-zero number in df, you have a serious bug in your free space accounting code that violates the principle of least astonishment.

Originally posted by SystemCrasher View Post

Theoretically it is the case. As well as placing DBs on RAW partitions is most optimal in terms of performance/parsing. But managing VMs or DBs this way happens to be pain in the rear so it rarely used. At the end of day, admins can't use file-based tools - HUGE disadvantage. And when it comes down to using filesystem as relatively thin "management"/convenience layer, proper file preallocation can get rid of most fragmentation issues, and parsing filesystem+file is usually much smaller issue. At this point more convenient management usually outweighs any remaining ganis.

Databases that require a filesystem to do block placement are a different matter than virtual machines that emulate block devices. Running a block device through a filesystem violates the KISS principle.

"file preallocation" does not make much sense on a CoW filesystem from a performance perspective. You can reserve space, but you don't reserve specific blocks. Reserving specific blocks makes no sense unless you are doing things in-place so that you can keep things contiguous, but that is distinctly different from CoW.

Originally posted by SystemCrasher View Post

Sun did it because IIRC Solaris had no other facilities to do it when they created ZFS. It makes little sense on Linux though, since it got own facilities for RAIDs, volume management, etc. So it just happens to be some duplicate code to do the very same kind of things. Hardly counts as advantage on its own (in Linux context).

If you mean the loop device, it adds the overhead of a top-half/bottom-half in addition to the VFS overhead. It was invented for testing purposes, not production use.

Originally posted by SystemCrasher View Post

Btrfs subvolumes are entirely different thing - these are "hierarchies", enjoying by CoW services and sharing same storage space. Since you can have more than one version of e.g. "/", and you can have more than one version of some other parts of hierarchies, etc - it makes sense to reference it somehow beyond of usual naming. This is what they call subvolumes. Snapshots aren't much different either. These are hierarchies, usually sharing blocks with something else. There are no big differences between RO and RW snapshots except formal permission to do CoW-backed writes on this particular hierarchy.

If you can have a mountable filesystem called a "subvolume" in btrfs, you can take an internal file and expose it as a block device. It takes much less code and is far easier to do. It also avoids the overhead of running through filesystem interfaces twice and senselessly emulating top-half/bottom-half handlers. This has been done in ZFS in less than 2,000 lines of code:

https://github.com/zfsonlinux/zfs/bl...ule/zfs/zvol.c

The code implementing POSIX filesystem semantics in ZFS is more than 10,000 lines of code.

Originally posted by SystemCrasher View Post

Btrfs design is centered around idea RAID is not going to be block-level thing anymore. It rather per-file thing. So one can potentially have various RAID levels on same set of devices, which is logical, since files not necessarily have equal value or require different tradeoffs. One of reasons why it allocates space in chunks. Chunk supposed to have some allocation scheme. Technically there can be arbitrary mix of chunks on devices. If no suitable chunk found, btrfs attempts to allocate new one. And in some places it haves interesting notion of "free space". It means "free of btrfs presence". In sense this space isn't yet allocated into any chunks. It means "empty surface, not used by btrfs yet". So most things come down to being CoW-backed hierarchy. Simplest being a "reflinked copy", it not anyhow distinguished at all. You just get some hierarchy which references to old one and initially shares all blocks, that's what allows to "insta-copy" large things.

Redundancy in btrfs is implemented at the block pointer level, not the file level. You can configure it at btrfs filesystem creation, reconfigure it at runtime and presumably do the same on "subvolumes", but there is no interface to override that at the file level, contrary to what you suggest. ZFS has a similar facility in its block pointers, but it is only for making additional copies so that it can withstand head drops and silent corruption, not the loss of an underlying block device. It also goes up to 3, while btrfs is limited to 2. Using the block pointer as the means of withstanding underlying device failures also means that btrfs does not support N-way mirroring because it limited its block pointers to 2 copies. I suppose that parity lets it go to 3, but not in any performant manner. Also, disk failures cause remounting things stored on btrfs during the boot process to fail unless the degraded flag is passed, which violates the principle of least astonishment.

That said, it is more accurate to discuss parity levels and mirroring discussing either btrfs or ZFS because RAID is a very well defined term that does not apply to either ZFS or btrfs, although btrfs (ab)uses the terms raid5/raid6 when describing its abilities.

Originally posted by SystemCrasher View Post

Overall this design happens to be flexible in terms of how it allocated. It can swallow pre-existing EXT4, pretending it was initial snapshot. It can easily move away from some drive, allowing to remove device from pool, as long as there was enough space on other drives.

And how about VMs like this:
cp --reflink VM1 VM2 (takes about second for 10 GiB diskfile).

Then, only differences are stored. Basically, ability to deploy VM from template in a SECOND + space savings out of the box. Needless to say that's how I deal with VMs. Some caveats apply. I.e. it is unwise to do CoW twice, and using filesystem to snapshot VM instead of VM tools takes some special care about consistency. But fundamentally there is nothing wrong in doing CoW in FS rather than in VM diskfiles, etc. Same thing, different appoach.

How do you rollback to the original state if something goes wrong? This seems non-equivalent to `zfs snapshot` and `zfs clone`.

Originally posted by SystemCrasher View Post

That's why btrfs, just like zfs, got some tunables. Yet, it is really desirable if filesystem shows decent behavior on typical workloads without user's intervention. Sure, fiddling with bunch of VMs right without introducing unreasonable overhead and avoiding obviously problematic scenarios can take some expertise. But Michael does not deploys VM servers, he just takes a look on simple desktop/server use case. When someone deploys their homepage or blog on their server, they do not spend much time on tuning filesystems, could lack required expertise and it is way to expensive to hire gurus, lol. So, defaults matter. Defaults will always prevail.

Tunables should affect performance, not whether or not the filesystem breaks expected behavior by returning ENOSPC when df reports free space. It is a serious design flaw if a tunable is designed to be broken by default.

Originally posted by SystemCrasher View Post

Mixed block groups are available for a while. So this bug has been fixed ages ago. But who reads damn manuals, right? Well, I do not remember, but IIRC since some version mkfs.btrfs defaults to mixed-bg for small storages.

This is related to "chunks" thing and there is tunable to allow different tradeoff for small storages. Once data and metadata can be stored in same chunks, there is no longer reason to get "unbalanced", not even on small storage. But storing data and metadata in explicitly separated chunks allows to assign different redundancy schemes to data and metadata. E.g. on single drive it is common to see "dup" scheme for metadata and "single" for data. It means it keeps TWO copy of metadata, even on single drive storage, so if something goes bad, metadata are hopefully going to be last thing to die. Makes some sense, isn't it?

No, once there're enough chunks, metadata and data chunks would normally balance on its own and rebalancing wouldn't be needed except maybe some unusual cases. Small storages are just too small, there're too few chunks, quantum effects could dominate. I.e. on 10GiB volume, 0.5GiB chunk is a big deal since it accounts 5% of space. Miss one and data vs metadata balance is seriously skewed, leading to poor space usage. On terabyte storage it is not a problem at all. Mixed-bg lacks this problem, by keeping data and metadata in same chunks. But it only makes sense on small storages. In larger installs one may want to enjoy by different redundancy schemes for data and metadata.

If they fixed the bug, that is great. However, you are not doing the btrfs developers any favors by making excuses for bugs. That is enough to make other filesystem developers glad that you are not a fan of their work.

By the way, the idea of duplicate metadata was called ditto blocks in ZFS before btrfs was invented. It is not only used by default, but it is so important that it cannot be fully disabled. It used to be that you could not disable it at all, but a small exception called redundant_metadata=mostly was made for database workloads in ZFS' variable height indirect block pointer trees to allow the very bottom level of pointers to be written only once.

Announcement

Some Quick Tests With ZFS, F2FS, Btrfs & Friends On Linux 4.4

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment