Originally posted by SystemCrasher
The reason why ZFS has done as well as it has in spite of that is that it had the good fortune of combining a substantial number of good ideas before anyone else had tried. Consequently, by virtue of ZFS having done those things well, new filesystems must not only do them well too, but also avoid introducing severe deficiencies while trying to do them substantially better. Going back to the design trade-offs remark, that probably is not going to happen.
Your "btrfs is as good as zfs" generalization suggests to me that your use cases are such that the differences do not matter to you. Your advocacy of reflinks as a means of doing snapshots would seem to confirm that, because reflinks are a really poor implementation of the concept of a snapshotting API. Skip past the next inline reply to the one after it for an explanation of why.
Originally posted by SystemCrasher
- Stochastic testing in userspace. The core ZFS driver is compiled into the libzpool.so library and exercised by a tool called ztest. This can help catch problems that otherwise tend to be found only in production, such as ENOSPC bugs.
- The ability to run the latest code on older kernels, with a buildbot to verify sanity. The lack of backports is a pain for enterprise deployments because fixes after non-trivial refactoring are never backported. This also means that compatibility with new features is missing, such that you cannot take your storage from a newer system to an older one.
- A Merkle tree based disk format. This is a double edged sword. It means that you cannot "reshape" easily and cannot respond to memory pressure as easily, but you gain properties such as the ability to go back to a last known good state after a crash. Without those properties, you end up needing something like xfs_repair, which btrfs put into a misnamed tool called fsck. Fixing this would require a new disk format.
- Strong checksums. The crc32c checksum that btrfs uses has weaknesses that could let odd bugs in the storage hardware go undetected, and 32 bits is on the weak side even for a decent checksum algorithm. ZFS uses the Fletcher checksum, which was designed to avoid the problems in the CRC family of checksums.
- A separate hierarchical namespace for what btrfs calls "subvolumes".
- Support for creating and managing block devices just like any other volume. This is probably where the separate namespace matters most. It also cannot be replicated with the loop device without significant overhead.
- Separate snapshot and clone functionality, rather than the awkward combined snapshot+clone operation that btrfs implements. What btrfs has done is analogous to the Windows `CreateProcess()` function versus the POSIX `fork()` + `execve()` functions. It works when you want both operations, except for the times when you really only want the first of the two logically separate operations. A snapshot should only be able to be renamed, replicated, destroyed, cloned or mounted read-only; no other operation should work on it. This is important for making sure that your snapshots contain what you expect them to contain. The fact that snapshots and clones were shoehorned into the mount namespace might explain why these two functions are not separate in btrfs.
- A disk format specification for new developers getting started. While ZFS' disk format specification is old, it is still a good starting place for new developers. btrfs does not seem to have anything like that.
- A superior page replacement algorithm. ZFS uses ARC, while btrfs still uses the LRU-based helpers that the VFS provides rather than implementing its own. Keeping cached the metadata needed to figure out where to place things on disk and reference them, so that writeout does not block on a read, tends to matter more in a CoW filesystem than in an in-place filesystem, where you can just write in place (and partial writes are the user's problem).
- A mechanism to throttle IO operations that increase dirty data, by increasing amounts as the dirty data limit approaches. Failing to throttle userspace until the limit forces you to will either block userland for a long period of time (waiting for everything to be written out) or for many short periods (writing out a tiny bit only to hit the limit again), which leads to unpredictable performance. The latter might look like it works until you hit an fsync (or an operation like it), where everything stops from userland's perspective. This is more a weakness of the generic dirty data writeout code that btrfs uses than of btrfs itself, but the VFS API is generic enough that btrfs could handle this on its own like OpenZFS does: insert a `usleep()` into the VFS operations that increase dirty data, sized by how close the system is to the dirty data limit, to keep userland from experiencing seemingly random lags.
- A way to perform writes without issuing reads on random-write intensive workloads such as databases and virtual machines, without resorting to nodatacow. Avoiding read-copy-write on CoW operations against extent-backed files is hard, but that might be a reason to adopt an indirect block tree like the one ZFS uses rather than tell users that they are on their own for data integrity. Telling a VM that it is on its own would not hurt data integrity when the guest uses a driver that can handle it, but the same cannot be said for userland applications such as databases.
- Parity-based redundancy without read-modify-write like raidz. btrfs raid 5/6 is the MD RAID code copied into the filesystem and uses a stripe cache to try to get good performance. This might seem more acceptable given that btrfs' extents ensure that userland applications are likely to suffer from read-modify-write overhead no matter what you do, but duplicating the problem in another layer is just going to make fixing it that much more difficult.
- N-way mirroring. btrfs does a weird thing where it stores two copies of data on separate disks and calls it a mirror, while ZFS supports an arbitrary number of disks, with each disk storing the same data at the same location, as you would expect of a mirror.
- Graceful handling of disk failures. Needing to pass a degraded mount flag (especially on your rootfs) when a disk stops working is definitely not graceful handling. (Ab)using mount to act as import/assembly probably made this seem like acceptable behavior, but it is a violation of Postel's law. If you want to warn the system administrator that there is a problem, have userspace handle the notification.
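To make the checksum point above concrete, here is a rough Python sketch of fletcher4 as ZFS describes it (four cascaded 64-bit accumulators fed 32-bit little-endian words). This is an illustration of the algorithm, not the production implementation:

```python
import struct
import zlib

def fletcher4(data: bytes):
    # fletcher4 walks the buffer as 32-bit little-endian words, feeding
    # four cascaded accumulators truncated to 64 bits. The result is a
    # 256-bit checksum, returned here as a tuple of four 64-bit values.
    assert len(data) % 4 == 0, "fletcher4 operates on whole 32-bit words"
    a = b = c = d = 0
    mask = 0xFFFFFFFFFFFFFFFF
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

# Both detect a single flipped bit, but fletcher4's much larger state
# space makes accidental collisions far less likely than with 32 bits.
block = bytes(4096)
corrupt = bytearray(block)
corrupt[100] ^= 0x01
print(zlib.crc32(bytes(corrupt)) != zlib.crc32(block))  # True
print(fletcher4(bytes(corrupt)) != fletcher4(block))    # True
```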
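The ARC point is easiest to see with a toy cache. The sketch below is a minimal LRU cache (my own illustration, not btrfs code): one large sequential scan evicts a frequently reused working set, which is exactly the pathology that ARC's recency-plus-frequency tracking and ghost lists are designed to resist:

```python
from collections import OrderedDict

class LRUCache:
    # Minimal LRU cache: only recency matters, so a one-pass scan of
    # cold data pushes out entries no matter how often they were used.
    def __init__(self, size: int):
        self.size = size
        self.entries = OrderedDict()

    def access(self, key) -> bool:
        hit = key in self.entries
        if hit:
            self.entries.move_to_end(key)     # refresh recency
        else:
            self.entries[key] = True
            if len(self.entries) > self.size:
                self.entries.popitem(last=False)  # evict least recent
        return hit

cache = LRUCache(size=8)
hot = [f"hot{i}" for i in range(4)]
for k in hot * 10:                  # establish a heavily reused working set
    cache.access(k)
for i in range(100):                # one big scan, e.g. a backup read
    cache.access(f"scan{i}")
print([cache.access(k) for k in hot])  # all misses: the scan flushed the hot set
```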
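The throttling idea can be sketched as follows. The constants and the quadratic curve here are made up for illustration and are not OpenZFS's actual delay math; the point is only that the injected delay grows smoothly as dirty data approaches the limit, instead of letting userland run at full speed and then stall all at once:

```python
import time

DIRTY_LIMIT = 1 << 30      # hypothetical 1 GiB dirty data limit
DELAY_START = 0.60         # begin delaying at 60% of the limit
MAX_DELAY_US = 100_000     # cap each injected delay at 100 ms

def write_delay_us(dirty_bytes: int) -> int:
    """Microseconds of delay to inject into a write, growing smoothly
    as dirty data approaches the limit so writers slow down gradually."""
    fill = dirty_bytes / DIRTY_LIMIT
    if fill < DELAY_START:
        return 0
    # Grow steeply (quadratically here) between the threshold and 100%.
    frac = (fill - DELAY_START) / (1.0 - DELAY_START)
    return min(MAX_DELAY_US, int(MAX_DELAY_US * frac * frac))

def throttled_write(buf: bytes, dirty_bytes: int) -> int:
    delay = write_delay_us(dirty_bytes)
    if delay:
        time.sleep(delay / 1_000_000)  # the usleep() mentioned above
    return dirty_bytes + len(buf)      # caller tracks the new dirty total

for pct in (50, 70, 90, 99):
    print(pct, write_delay_us(DIRTY_LIMIT * pct // 100))
```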
Originally posted by SystemCrasher
- Initial implementation discussions have suggested that doing it needs an indirection, which is not nice. Maybe btrfs can get away without it, but I am not sure.
- If you use the immutable bit to simulate a real snapshot, you have a racy situation where something else can modify it.
- Rolling back is a hack. Rolling back requires keeping two reflinks around, so that you can unlink the live one and make a fresh reflink of the snapshot whenever you want to roll back. This is a maintenance nightmare because it is up to the sysadmin to figure out what is what.
- You cannot do send/recv.
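To illustrate the rollback hack, here is a Python sketch of the bookkeeping involved. `shutil.copy` stands in for a real reflink clone (which would issue the FICLONE ioctl on a CoW filesystem); the dance is the same either way, and it has to be repeated by hand for every file, versus a single `zfs rollback` for a whole dataset:

```python
import os
import shutil
import tempfile

def take_snapshot(live: str, snap: str) -> None:
    # Reflink stand-in: "snapshot" the file. Note the result is still a
    # plain writable file, which is the racy-immutability problem above.
    shutil.copy(live, snap)

def rollback(live: str, snap: str) -> None:
    # Make a second "reflink" of the snapshot, then swap it into place,
    # so the snapshot itself is never consumed by the rollback.
    tmp = live + ".rollback"
    shutil.copy(snap, tmp)
    os.replace(tmp, live)   # atomic rename over the live file

workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "data.db")
snap = os.path.join(workdir, "data.db.snap")
with open(live, "w") as f:
    f.write("good state")
take_snapshot(live, snap)
with open(live, "w") as f:
    f.write("bad state")    # oops
rollback(live, snap)
print(open(live).read())    # back to "good state"
```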
Originally posted by SystemCrasher
As for licensing, I am not a lawyer, but the CDDL is a "F/OSS" license according to the FSF. The FSF's publication seems to be in almost exact agreement with the SFLC's opinion, with the only difference being what constitutes an exception that allows Linux LKMs to be under otherwise incompatible licenses:
https://www.softwarefreedom.org/news...x_Kernel_CDDL/
The FSF appears to want one in writing, while the SFLC appears to think that the kernel developers' actions created one. While a written exception would be nicer to have, there are plenty of situations in law where people's actions gave others rights. An example of this is an implied easement in property law.
Under the assumption that everything the FSF and SFLC claimed in common is correct, I am inclined to agree with the SFLC on the single point on which they differ. It is difficult to claim that no exception was made when the mainline kernel developers made an interface for non-GPL software to use, the project lead (Linus Torvalds) claimed that ports of non-GPL drivers such as the Andrew File System are not a violation, and the idea went unchallenged for years. By this point, it is common practice. Ubuntu also was not the first distribution to ship binary ZFS kernel modules; Gentoo and Sabayon did it on ISOs years before Ubuntu did.
Furthermore, not a single person who thinks there is a violation has claimed that a port of ZFS to Linux would not qualify as fair use, while there is a legal opinion that it does:
http://www.rtt-law.com/public/files/...te%20paper.pdf
As per the Berne convention, this is a matter of US law, so unless people who hold a majority of the Linux copyright simultaneously think that there is a violation and that there is no fair use defense under US law, there is nothing to discuss. Without a majority, you cannot dispel the idea that the majority implicitly allowed it, and without a way to counter fair use arguments, even a majority cannot claim that the law gives them the right to stop people from distributing binary ZFSOnLinux kernel modules.
Lastly, what I have said is just the understanding of a non-lawyer who tried to understand what actual lawyers wrote. If you hold copyright on a part of the code in question and want to talk to someone about this matter, I suggest getting in touch with the SFLC to speak with actual lawyers. If you are not a copyright holder, this does not concern you under the law. If you think otherwise, you can check with an attorney.