Announcement

**Developer12** · 16 February 2024, 12:09 AM

Originally posted by muncrief View Post

During the recent data corruption problems with ZFS I discovered that neither BTRFS or ZFS have overall architectural documents, and "verification" is accomplished via individual developers running random scripts over and over until other developers agree the code is okay to release. So I'd be very, very, careful incorporating such a massive change into the codebase.

Don't get me wrong, I greatly appreciate the work of the developers of both systems, and still run ZFS myself. However my recommendations that they step back and appoint a lead architect, develop architectural documents, and real verification systems incorporating targeted and fuzz testing were rebuffed, at times with baffling vitriol, so it appears that all advanced Linux filesystems are pretty much a toss up as far as reliability is concerned. I hate to say that, but until someone gets serious about organization and testing it's the unfortunate truth.

https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf

The lead architects are among the authors of that paper, though it seems the guy calling most of the shots these days is Matthew Ahrens. As for verification, unless you want to try to develop a system of formal verification for C code (something the rust people are still working towards, with a much better language) the various unit and integration tests in the ztest test suite is about the best one can hope for. Yes, that testing includes both targeted and fuzz testing, as well as torture testing for extreme conditions. After the bug that surfaced in november was patched, tests based on the reproducer were added, and there's been discussion of how best to test for similar classes of bug in the future.

**muncrief** · 16 February 2024, 01:10 AM

Originally posted by Developer12 View Post

https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf

The lead architects are among the authors of that paper, though it seems the guy calling most of the shots these days is Matthew Ahrens. As for verification, unless you want to try to develop a system of formal verification for C code (something the rust people are still working towards, with a much better language) the various unit and integration tests in the ztest test suite is about the best one can hope for. Yes, that testing includes both targeted and fuzz testing, as well as torture testing for extreme conditions. After the bug that surfaced in november was patched, tests based on the reproducer were added, and there's been discussion of how best to test for similar classes of bug in the future.

Unfortunately this is a 20+ year old overview of ZFS Developer12. It is not an up do date document representing the current ZFS implementation with details such as flowcharts, code section descriptions and function breakdowns, etc., and cannot be used as a guide for low level development. As I stated in my OP, it appears that all of this critical information instead resides in the minds of individual developers, and they must simply do their best to coordinate their code additions and modifications without any up to date documentation as a reference, and therefore no way to update such critical documentation.

And as I said, unfortunately this is not unique to ZFS as it at least applies to BTRFS as well.

This type of on-the-fly code development of something as critical and complex as a filesystem will always result in unforeseen errors, race conditions, etc. as sooner or later entropy will introduce them, no matter how skilled and conscientious the development team.

And this really did surprise me, and took me aback, when I first discovered it a few months ago. I'd always assumed that filesystems were well documented with that documentation constantly updated to reflect changes.

As without this type of design documentation there's simply no way to produce complete and reliable verification systems.

But once again, it is not my intent to chastise or disrespect the developers of these filesystems, as I truly appreciate the hard work and hours they spend producing them. I simply wanted to bring these fundamental organizational problems to their attention, and pass along the lessons I'd learned over four decades of research and development on everything from integrated circuits to firmware to high level software.

**Volta** · 16 February 2024, 03:10 AM

Originally posted by muncrief View Post

Of course there will always be bugs in software Volta. But without organization, architectural documents, and targeted and fuzz testing there will be more. But I understand that suggesting such things makes some people angry, so I'm happy to simply throw in my two cents for people to accept or discard, and leave it at that.

There's always organization and testing and there are still bugs in the software. There were no (modern) file system free of bugs including data corruption. Maybe it's unavoidable. Even Rust won't solve logic mistakes.

**pabloski** · 16 February 2024, 07:50 AM

Originally posted by Volta View Post

I hate to say that, but it appears you have no clue. What ZFS data corruption has anything common with Linux and btrfs? ZFS isn't Linux file system. It's from unreliable slowlaris and used in FreeBSD. There were data corruption bugs in NTFS, whatever crap macOS was using, so what's your point? I got it! There are no bugs in not released super zeta file system. You can use that instead.

Fanboysms apart, the real problem is the uptick in serious bugs in ZFS recently. On other post on this forum someone lamented the sad state of affairs in current day ZFS management. The old team isn't working on it anymore, while new recruits, many of them clueless, have taken of maintanance and development of this critical piece of software. The user lamented in particular their luck based approach to solve open bugs, an approach showing they have very little understanding of how filesystems work.

**skeevy420** · 16 February 2024, 09:55 AM

Originally posted by pabloski View Post

Fanboysms apart, the real problem is the uptick in serious bugs in ZFS recently. On other post on this forum someone lamented the sad state of affairs in current day ZFS management. The old team isn't working on it anymore, while new recruits, many of them clueless, have taken of maintanance and development of this critical piece of software. The user lamented in particular their luck based approach to solve open bugs, an approach showing they have very little understanding of how filesystems work.

That luck comment was me and I said that I feel lucky that I haven't hit any of these bugs because they're all features that I'd like to use and have used to say "Look at all this neat stuff that OpenZFS can do or is about to be able to do". Normally I just update software as soon as it's deemed stable and go about my business. It's a bad habit caused by 13 years of "BTW, I use Arch".

My only exception to that update rule is OpenZFS on disk format updating. I've been using OpenZFS since the 0.6.5 or 0.6.4 release, from 10 years ago, and, anecdotally, it seems like there's some sort of bug or issue with every major release of OpenZFS (9/10 it's related to features I don't use which is one of the reasons I feel lucky) so I usually hold off updating the on-disk format until the 2nd or 3rd point release when update bugs have usually squashed. In regards to updating packages that aren't ZFS, if I get a bad package update and encounter a bug, worst case scenario is an OS reinstall and restoring some ~/.config backups. If I get a bad ZFS update I lose a decade of data, lose my root, etc. It's 100x worse than any other bug that can hit me. That's why I drag ass on OpenZFS updating and treat it with an "if it ain't broke, don't fix it" mentality.

A decade ago, the 29 year old me would have updated the 2.1 on disk format while block cloning was an MR and I'd have fucked myself hard. Nowadays, the 39 year old me gets schadenfreude watching people who behave like the 29 year old me.

Simply put, I don't have a backup server so if I hose my data I'm fucked. It's just safer to sit back and watch everyone else hose their data trying out new stuff. And I do feel like there's been a little bit of luck involved in picking the right settings that don't corrupt my data. It's also me having an update strategy around my long-term data that doesn't involve updating for the sake of updating.

I have that same lucky feeling with Arch and CachyOS updates. Ironically, the only major "my system is FUBAR'd" updates I've ever taken on Arch in the past 13 years involved either BTRFS or GRUB. File systems and on disk formats are tricky. Who knew?

**pabloski** · 17 February 2024, 05:34 AM

Originally posted by skeevy420 View Post

File systems and on disk formats are tricky. Who knew?

This is why it is important to have real experts handling the code and the design. Not updating for a long time is a wait and see strategy and it mostly works. At least you don't get burned by the most obvious bugs. But it is like cybersecurity. You have the latest and greatest antivirus/EDR, so you are safe from script kiddies. However there is nothing you can do against a high level hacker. Same with bugs. Same nasty bugs can stay hidden for years or decades, just to show up at some point and ruing everything.

Concurrency is the first source of such bugs. And this is the reason behind everyone pulling Rust into their kernels and operating systems. However, for now, the De Raadt approach is the only guarantee. But it requires people that know what they are doing. Something that isn't available in the current open source filesystem arena. It is a dangerous trend and is affecting practically every filesystem under the sun, except Ext!!

**skeevy420** · 17 February 2024, 11:03 AM

Originally posted by pabloski View Post

This is why it is important to have real experts handling the code and the design. Not updating for a long time is a wait and see strategy and it mostly works. At least you don't get burned by the most obvious bugs. But it is like cybersecurity. You have the latest and greatest antivirus/EDR, so you are safe from script kiddies. However there is nothing you can do against a high level hacker. Same with bugs. Same nasty bugs can stay hidden for years or decades, just to show up at some point and ruing everything.

Concurrency is the first source of such bugs. And this is the reason behind everyone pulling Rust into their kernels and operating systems. However, for now, the De Raadt approach is the only guarantee. But it requires people that know what they are doing. Something that isn't available in the current open source filesystem arena. It is a dangerous trend and is affecting practically every filesystem under the sun, except Ext!!

There are real experts handling the code. A lot of their contributors have been working on OpenZFS for more than a decade. Technically speaking, I updated the code, I just didn't update the on-disk format. I ran "pacman -Syu" but I didn't run "zpool ugrade $POOL". Still haven't ran "zpool upgrade $POOL" because I'm happy with the 2.0 feature-set.

Concurrency. Yep. OpenZFS is like the Linux kernel in a lot of ways. There's a core team and a project lead while simultaneously there's a lot of other independent work going on and, infrequently, shit happens because of that. It's just that when shit happens everyone notices and remembers. That's part of why I take a wait-and-see approach with critical things like the kernel and file system driver. I feel it worth mentioning that if my hardware wasn't so new I'd be running Linux LTS. I really hope the Linux/Rust effort takes off.

De Raadt. Never heard the name. Granted, I've never been that big into the BSDs outside of FreeBSD and a few desktop derivatives of it, but damn do I agree with a lot of what he says:

GPL fans said the great problem we would face is that companies would take our BSD code, modify it, and not give back. Nope—the great problem we face is that people would wrap the GPL around our code, and lock us out in the same way that these supposed companies would lock us out. Just like the Linux community, we have many companies giving us code back, all the time.

But once the code is GPL'd, we cannot get it back.

I suppose it's because of people like him that so much of the Linux kernel is dual OR licensed.

Anyways, in regards to people knowing what they're doing, as far as I can tell, there are only two ways to solve that problem and only one of them is feasible long term. The first option is the corporate approach where a corporation, large business, small business, independent contractor, etc picks up a project and pays people to learn it and maintain it. That's the current status quo and we see how well that works. The 2nd option involves a country as large as America or China revamping their education system so that schools back open source projects and use them to teach students how to program. The first option assumes that every project can be monetized. Since not all projects can be monetized, there are 23,000 high schools in America and 19,000 Gentoo packages so I really hope you can see where I'm going with this.

If not, it means that there could be a person hired to work on every single open source project with a constant stream interns, students, to act as QA testers, junior devs, documentation writers, and more. Damn near every single open source project can be covered by high schools alone. The entirety of K-12 in the US consists of 98,000 schools in 13,000 school districts. There are 1576 public colleges. (I had no idea the ratio was that bad...9/10 high schools per 1 college...fuck me that's bad). Factor in colleges and middle schools and that is A LOT of eyes on open source. Factor in NATO and UN allies doing that, there would be so many schools that they'd have to make up new projects or have advanced courses that would literally be Rewrite It In Rust 101. **SIGH** I can dream of a better future, dammit

Announcement

Fast Dedup Coming To OpenZFS For Overhauling Deduplication Capability

Comment

Comment

Comment

Comment

Comment

Comment

Comment