How Three BSD Operating Systems Compare To Ten Linux Distributions

  • #31
    Originally posted by trilean View Post
    Interesting...I wonder why Clear Linux is so much ahead in some benchmarks. Especially CPU bound benchmarks like Compile Bench where the OS doesn't really have to do anything except get out of the way and let the task run. Does anyone know if they're using a different scheduler? A different timer frequency (CONFIG_HZ) maybe?


    OpenBSD 5.9 should be much better. Seems like the next release is all about SMP: http://www.openbsd.org/plus.html
    Keep wishing... SMP support won't grow from one release to the next (six months?).
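
    As for trilean's question about the scheduler and timer frequency: you can check what each distro ships yourself. Something like this, assuming the kernel config is exposed in the usual places (paths differ per distro):

      # timer frequency and preemption model baked into the running kernel
      grep -E 'CONFIG_HZ|CONFIG_PREEMPT' /boot/config-$(uname -r)
      # some distros expose the config via procfs instead
      zcat /proc/config.gz | grep -E 'CONFIG_HZ|CONFIG_PREEMPT'
      # for compile benchmarks, the distro's default compiler flags often matter more
      gcc -Q --help=optimizers | head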



    • #32
      By now (a Phoronix member for a couple of months) I am detecting a pattern.

      Have some article comparing BSD with Linux appear, and you get a crowd of "penguin friends" descending on it (30+ posts) like vultures, bashing Sun or some random BSD for all it's worth, usually starting with mr. SystemCrasher, his "cow filesystems"TM and "sun marketing bs"TM remarks...

      It's a bit like Russian web trolls under a random political article in Northern/Eastern Europe.

      Have some neutral BSD news or announcement, and none of this crowd deigns to comment on it.

      Is it caused by some sort of complicated inferiority complex? A need to prove how superior THEIR FAVORITE OS is? Honestly curious..



      • #33
        "You got it backwards. ZFS is not fragile, it is the most robust filesystem there is out there. ZFS is built with a focus on data integrity, to be robust and protect your data."a
        Originally posted by SystemCrasher View Post
        Sounds like Sun's marketing bullshit.
        Well, you can not disagree that ZFS was the first common filesystem to have checksums to detect data corruption. I've read an old interview with Chris Mason saying that only after he read some articles about ZFS data integrity did he realize that checksums should be added to BTRFS. It was an afterthought. Mason did not know the importance of detecting data corruption in very large datasets. Sun had vast experience with large datasets and the problems that can occur in large storage. Mason, a desktop developer, did not have that experience with large storage servers. Thus, ZFS was the first such filesystem out there, and still today many consider ZFS to be the safest filesystem out there. So it seems very probable that Sun's stated aim of designing a new filesystem with a focus on data integrity and reliability is true. No other filesystem is in the same league in terms of data integrity, according to several independent comp sci research groups.

        For instance, there are several research papers out there, by independent comp sci researchers, that all conclude that ZFS is safe. Those links can be found in the Wikipedia article on ZFS, in the section about data integrity - if you care to read what the researchers say. I agree with you that you should be sceptical of everything you hear; that is very sound and intelligent. But in this case, the ZFS claims about superior reliability are backed up by several independent researchers. OTOH, there is no research showing that BTRFS is safe, so I would be a bit hesitant to trust BTRFS - for exactly the same reason you are sceptical of ZFS's claims of data integrity:
        - Sun claims ZFS is very safe - you are sceptical and don't trust ZFS because it might be marketing. But there are research papers out there confirming ZFS is superior.
        - BTRFS claims to be safe (who says it, btw?) - I am sceptical and don't trust BTRFS because it might be marketing. There are no research papers out there confirming BTRFS is safe. No researchers have examined BTRFS. If you know of such research papers, please post them. I would like to learn more.

        So, if you look at what science says, it seems ZFS is safer than BTRFS.

        Research papers show that other filesystems such as XFS, NTFS, etc. are unsafe and might corrupt your data. The papers can be found in the same Wikipedia section on ZFS data integrity.


        Yet, googling a bit can yield some funny "success stories". When someone says "Enterprise" you should expect a catch, and marketing promises wildly exceeding practical performance.
        I agree that no filesystem is 100% safe, but still, there are fewer such "success stories" for ZFS than for other filesystems / RAID. For instance, if you look at the ZFS forums and someone has problems, in almost all cases the problems are solved. Very often when someone says ZFS reports data corruption, or that they can not import the zpool, it turns out to be a hardware problem: a flaky PSU, a slightly loose SATA cable, a faulty RAM stick, etc. And when they fix the hardware problem, the problems are gone and they can import the zpool again. OTOH, have you read the BTRFS forums? Full of reports of data corruption, etc. The wiki says "Q: Is BTRFS safe? A: It depends". I doubt you will find research papers claiming BTRFS is safe any time soon.


        Checksumming isn't the worst part of ZFS, sure. They used quite good tradeoffs in terms of checksumming strength vs speed. On the other hand, many other engineering solutions are much less sound. No defragger? While CoW inherently means fragments on write? And only lame mumblings about adding new drives, skipping the "good luck removing them later" catch? Staying silent about how bad sectors can bring the thing down with no tools able to fix it? That's what's called MARKETING BULLSHIT. Sure, Sun was good at it. That's what I dislike about Sun and their products.
        You are free to have an opinion about Sun, but in this case science says ZFS is safe, whereas other solutions are not. Regarding "staying silent about bad sectors" - ZFS reports all such problems and repairs them immediately. Many (all?) other filesystems can not even detect all forms of data corruption, so how could they repair those cases? ZFS detects all types of data corruption, according to researchers, and according to the same research paper, ZFS repairs all those corruption cases too. It is easy: when ZFS detects corruption, it is just a matter of retrieving a correct copy of the data block and replacing the faulty one. Which is what ZFS does. Read the research paper (linked from the Wikipedia article on ZFS data integrity).


        "For isntance, hard disks have lot of checksums on the surface to detect data corruption - and still you get data corruption on hard disks"
        XFS can't do full journalling, to begin with. It would journal metadata only, so you can get half-old, half-new data in file if crash happens. There is no data to complete or rollback DATA part. Needless to say, file can get unusable. So speaking about reliability sounds funny for such designs. In classic designs, journalling takes one write to journal and one into storage. Hence writing data twice. It kills write speed by at least 2x. Nobody in sane mind uses e.g. EXT4 in this mode. So XFS, JFS and some few others do not even implement full journalling. OTOH CoW based things avoid it in some smartass way, where whole storage area is basically a "journal". That's why most futuristic designs are centered around CoW these days. Full journalling at no-journalling speed. But there're some other catches .
        Data corruption protection is not provided by journaling. You need checksums to get full data integrity. But, you need a special kind of checksums. Ordinary checksums will not do. This is evidenced by hard disks having lot of error correcting codes to combat data corruption - and still hard disks get data corruption. No, you must use end-to-end checksums to get full data integrity.

        The problem with storage is that traditionally you have different layers: a filesystem layer, a RAID layer, a volume management layer, etc. Data might be checksummed at each layer to catch data corruption, but when data passes from one layer to another there might be corruption, because the checksum is not passed along. This means data might be corrupted when you have many layers. ZFS is very different: it is monolithic. ZFS is in charge of everything: filesystem, RAID, volume manager, etc. This means the checksums in ZFS are always known to every layer. This is one of the many reasons ZFS is safe: it is monolithic, so every layer has access to the same checksum. This means the data on ZFS is verified at both ends of the chain: data is verified on disk, all the way up to the other side (i.e. RAM, if you wish). That is, end-to-end checksums. Have you played this game as a kid? All the kids sit in a ring, and one whispers a word to the next, who whispers the same word to the next, and so on. Then you compare the word at the beginning with the word at the end: they always differ. You need to be sure that the words at the beginning and at the end agree: end-to-end.
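
        To illustrate what "monolithic" means in practice (disk and pool names below are just examples), the RAID layer, the volume manager and the filesystem all come from one tool, so the checksums follow the data through the whole stack:

          zpool create tank mirror da0 da1   # RAID + pool + mounted filesystem in one step
          zfs create tank/data               # a new filesystem inside the pool, no mkfs, no fstab
          zfs get checksum tank/data         # checksumming is on by default for every dataset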

        Funnily, the Linux kernel developers have called ZFS an abomination because it is a "rampant layering violation":
        I stumbled upon a great article (via Ars Technica) by Jeff Bonwick, responding to Andrew Morton's claim that ZFS is a "rampant layering violation" because it cuts across the traditionally separate worlds of the filesystem, volume manager, and RAID controller. I think ZFS is a brilliant invention and believe it'll go far. It removes complexity and cost, improves data reliability and generally makes people's (both sysadmins' & desktop users') lives soooo much easier. Too many times I've had to be the bearer of bad news and let a customer know their data is corrupt and they'll have to restore from backup. Most of the time we can work out where the problem has occurred and take the necessary steps to prevent or reduce the chances of it occurring again. However there are times, particularly when dealing with 3rd party storage and its associated software, that an explanation can't be found and I have to refer the customer to the 3rd party vendor. These problems are invariably due to over-complexity in the storage stack. Whilst ZFS does have its limitations (can't shrink or boot from ZFS just yet), work is in place to remove these limitations. Remember, ZFS is in its infancy - all its competitors have been in the public sector for many many years.
        One of the comments on Jeff's post caught my eye (I've copied part of it verbatim)... first, it is my understanding that sun chose a particular open license to prevent zfs from making it into the linux kernel. this is unfortunate. with zfs coming to mac osx soon, zfs is the chance for sun to basically implement and benevolently control the future direction of one file system to conquer them all. (hey, sun: how about releasing a slower version for linux that's got the extra useless layer as dummy calls, and that allows me to mount a zfs partition/disk on a sun or on a mac osx or on a linux machine?) second, is it not possible for zfs to plug into the upper linux layer for its interface, use the intermediate parts that it has itself rather than calling on linux, and rely only on the lowest linux hardware layer?
        This is typical of a lot of people in the Linux community and is quite childish - "he's got a nice shiny new toy and he won't let me play with it". Sun is not preventing Linux from doing anything, its own license is. In fact there is enough Linux-compatible code out there for some enterprising Linux developer to work on their own ZFS port. Remember, the only reason ZFS has made it into FreeBSD and OS X is because their licenses aren't so aggressive towards other licenses. I'm a big fan of Linux, but not of this kind of mentality. The operating system space is big enough for all of us - it's not a Linux vs MS world. Get over it.

        Linux kernel developers say ZFS is bad design because it is not layered, it is monolithic. Well, the reason ZFS is safe is BECAUSE it is monolithic. It seems the Linux kernel devs did not really understand the design criteria behind ZFS and why it was built in that weird way. The reason is that everything in ZFS revolves around data integrity. If ZFS did not care about data integrity, it would be layered just like all traditional filesystems on Linux. But ZFS is different because its highest priority is data integrity, not performance. In the Enterprise, you care about reliability, not performance. If a big business server goes down, you might lose $billions. Performance is secondary; reliability is the top priority. That is why mainframes sometimes have triple CPUs: the same computation is calculated on two CPUs, and if they differ, a third CPU will disconnect the faulty one. In the enterprise arena, reliability and data integrity are the top priority. Chris Mason, a desktop developer, has no clue about this.

        The weird thing is that BTRFS is also a monolithic "rampant layering violation". It is funny how the Linux kernel developers mocked ZFS and then, after a few years, made a similar clone, even though they first agreed that ZFS was badly designed (the Linux devs did not understand ZFS's weird design, which follows from its purpose: data integrity).


        I do not give a fuck about "since 1929" stuff, but I do care about technical excellence, solving difficult challenges, good performance, interesting designs, good algos, the inner workings of things, etc.
        A true nerd! Me like. I am like you: I care about technical excellence, and I back it up with science, research papers and studies. There is a reason everybody wants ZFS and it is hyped: it is safe.


        Come on, keep quoting Sun's marketing BS. Yet, if we google a bit, we'll see some ppl around who had ZFS fucked up and... no good means to recover. It does not have to be fsck, etc, but if the filesystem's metadata are damaged to a degree the driver can't mount it properly, it could be a good idea to have some "plan B": either to parse all metadata, check them carefully and recover like fsck tools do, or at least be able to read data back with relaxed demands on metadata integrity.
        You know, ZFS "scrub" checks and repairs both the metadata and the data. "fsck" is not really trustworthy. For instance, sometimes when you fsck or "chkdsk" a 10TB RAID array, it completes in a few minutes. But to really read and traverse 10 TB of data takes many hours. Say a disk reads 100MB/sec - you do the math. In fact, "fsck" and Windows "chkdsk" only check the metadata, never the data. This means that after a "fsck" the data might still be corrupt. OTOH, a ZFS "scrub" takes many hours, as it will read every block and repair any errors it finds.
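
        For example (the pool name is just a placeholder), a scrub walks every allocated block and verifies it against its checksum while the pool stays in use:

          zpool scrub tank        # read and verify every allocated block in the background
          zpool status -v tank    # shows scrub progress, repaired blocks and any damaged files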

        And the need for "fsck" has not been that great; the ZFS forums are not crowded with threads about people losing their data and requiring "fsck". I don't see a great demand for a "fsck" for ZFS.

        This makes the vague assumption that all involved data are valid and it works reasonably. But that may or may not be the case. CoW can have rather specific failure modes, and snapshots are anything but a backup replacement. E.g. if someone had 5 snapshots, they shared blocks and bad sectors appeared there, all 5 snapshots are going to be ruined at once. Going back in time may or may not succeed. It depends on how badly the metadata are damaged, and actually, if one got severe corruption, trying tricks like this can easily make things worse, because CoW can get stuck halfway while trying to rewind back, hanging in the middle of nowhere. Read-only recovery to another destination avoids this problem - it does not touch the original storage, so the failure at least can't develop any further. And one can try to select various "entry points" to parse.
        The metadata on ZFS is always duplicated all over the RAID or disk. If it is damaged, you can always get another correct copy. And there are not just 5 snapshots on ZFS. Every time you write data to the disk/array, the data is written to a new place; the old data is never touched. You can go back in time to any of these writes. You will never have only 5 writes on a disk - you might have 5 snapshots, but there will be many, many writes. You can go back to any of them, and eventually, as you go further and further back in time, you will find a valid state that you can import.
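
        And if a pool really refuses to import, there are rewind options that step back to progressively older states (the pool name is again just an example):

          zpool import -o readonly=on tank   # try a read-only import first, touching nothing
          zpool import -F -n tank            # dry run: report how far back it would have to rewind
          zpool import -F tank               # actually discard the last few transactions and import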


        Unless some pesky bad sector appears and it happens there was no redundancy. At this point ZFS is easily fucked up and really hard to repair. Because Sun architects preferred to mumble "always valid" and had no plan B for scenarios where this assumption has failed. And I've seen a few where that was the case. So ppl had a lot of very specific fun. I think it can be better than that, and I'm not really fond of being fed marketing BS instead of technical solutions.
        It is true that no filesystem is 100% safe, and something can always happen. But there is not a great demand for "fsck". I don't know many users who have requested "fsck". There is hardly any use case for "fsck"; ZFS "scrub" repairs many more problems than "fsck" does.

        At the end of the day you are free to have your opinion, but several researchers say ZFS is safe, and they say other filesystems are not safe. And it is a fact that many (all?) OSes want ZFS, for instance Linux. If ZFS were bad, interest would die in this geeky computer world. Marketing can only uphold interest for so long; eventually, the true qualities prevail. And many nerds such as me (or you) and other savvy developers can think for ourselves. And who promotes ZFS today on Linux? It is the nerds. They don't fall for the marketing that Sun spewed out 10 years ago (ZFS is 15 years old today). FreeBSD wants ZFS, Linux wants it, Apple has it, etc. There might be some merit in ZFS? Have you even tried it? If we talk about Solaris DTrace, which was also marketed by Sun, we see today which OSes have it or recently ported/copied it. I doubt Sun marketing is so powerful as to make every major OS port or copy DTrace or ZFS. These OSes have a copy of DTrace or ported it (and probably a few other OSes I don't know of have DTrace too):
        1) FreeBSD
        2) Mac OS X
        3) QNX
        4) IBM AIX has a copy called ProbeVue
        5) VMware has a copy called vProbes (they even credit DTrace explicitly)
        6) Linux has a copy called SystemTap (the developers first talked about DTrace a lot, and then deleted all references to DTrace because they did not want to give credit: https://blogs.oracle.com/ahl/entry/dtrace_knockoffs ). Now Linux also has DTrace via the Oracle Linux distro.



        • #34
          Originally posted by kebabbert View Post
          Well, you can not disagree that ZFS was the first common filesystem to have checksums to detect data corruption.
          Right, but I had quite a heated discussion with some Gentoo dev here on Phoronix. Even he admitted ZFS does checksumming in quite a specific way that brings some major tradeoffs. Somehow I do not think these tradeoffs are best for me in most use cases I care about.

          I've read an old interview with Chris Mason saying that only after he read some articles about ZFS data integrity did he realize that checksums should be added to BTRFS.
          Mason looked at existing designs (ZFS included) and attempted to take the best of them. While btrfs checksumming is somewhat weaker than ZFS's, ZFS is IMHO somewhat overengineered and this led to quite nasty tradeoffs. So, you're free to live without a defragger on a CoW FS, "enjoying" growing fragmentation, since everything CoW writes is a fragment, regardless of what zealots mumble. You're free to take performance degradation for granted, with no way to defragment. You're free to dismantle a huge storage pool to do something about frags if you're not ok with it. I like the cp --reflink thing, yet it seems zfs can't afford it. It can't afford switching off CoW for selected files either. No, I do not think that could be called a feature. I call it a bug, since I can roughly estimate the long-term evolution of free/busy space and I do not like the supposed outcome; even if a shiny benchmark on an empty filesystem shows reasonable performance, it is not going to stay like this forever (but Sun's marketing dep't always preferred to stay silent about it, since they lack a defragger anyway).
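
          On btrfs both of those are one-liners, for comparison (the paths below are made up):

            cp --reflink=always disk.img disk-clone.img   # CoW copy: shares blocks until one side is modified
            chattr +C /var/lib/images                     # opt a (still empty) directory or file out of CoW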

          I see no point trying to fool myself and I do care about convenient and pleasant ways to manage my systems without weird, impractical, unnatural or inconvenient assumptions. Somehow I got to think btrfs is quite close to being like this, but I can't tell the same about ZFS, which makes different assumptions.

          It was an afterthought. Mason did not know the importance of detecting data corruption in very large datasets. Sun had vast experience with large datasets and the problems that can occur in large storage.
          I see nothing wrong with using the experience others have obtained; only fools do otherwise. Btrfs had checksumming in its design from the earliest days. But when it comes to very large datasets, I think one may really want to do it the Google-like way, using distributed networked systems, where the health of a single server isn't a big deal. This way is easier to scale, one can use cheap machines, and very strong checksumming can be an inherent part of e.g. "content addressing", if you understand what that means.

          Mason, a desktop developer, did not have that experience with large storage servers. Thus, ZFS was the first such filesystem out there, and still today many consider ZFS to be the safest filesystem out there. So it seems very probable that Sun's stated aim of designing a new filesystem with a focus on data integrity and reliability is true. No other filesystem is in the same league in terms of data integrity, according to several independent comp sci research groups.
          While this is not so untrue, ZFS takes plenty of tradeoffs, so IMHO it only makes sense in high-end enterprise installations inclined to store a large pile of data in a centralized way regardless of the cost it takes, where high-end enterprise HW makes these ZFS design assumptions adequate. TBH I think in most cases such designs are outdated dinosaurs. One can make scaling orders of magnitude easier and cheaper. Server health (including the filesystem) need not be a big deal, and a proper design can handle ALL errors, not just storage issues.

          For instance, there are several research papers out there, by independent comp sci researchers, that all conclude that ZFS is safe. Those links can be found in the Wikipedia article on ZFS, in the section about data integrity - if you care to read what the researchers say.
          It is, as long as you live in a perfect world full of spherical cows, CPUs which never fail, error-free RAM, perfect network cards and so on. ZFS does not give a fuck if the network card corrupted a data packet and network checksums are weak? Ok, but then ZFS looks a bit overengineered: it deals with just one particular corner case of corruption. But there are many other cases it DOES NOT handle, and they are quite an issue for large-scale systems and large amounts of data. Overall, I think ZFS only makes sense in a few high-end enterprise installations using expensive HW with ECC RAM, redundant high-quality power supplies and so on, if large centralized storage is a must. Other than that I do not get the point of this design.

          I agree with you that you should be sceptical of everything you hear; that is very sound and intelligent. But in this case, the ZFS claims about superior reliability are backed up by several independent researchers. OTOH, there is no research showing that BTRFS is safe,
          There is a certain difference between you and me. I'm skeptical about many other things too, so I'm aware there is no such thing as perfect reliability. Comp sci researchers tend to make very vague assumptions like "RAM is error free", "the CPU works perfectly", "network cards are always ok", "there are no bugs", and so on. Which may or may not be true. So their conclusions aren't readily applicable to the real world - there are many failure modes they've missed which are going to backstab real-world admins here and there. Which of these comp sci guys cares that in the real world a CPU fan can just get stuck?

          so I would be a bit hesitant to trust BTRFS - for exactly the same reason you are sceptical of ZFS's claims of data integrity:
          It is more complicated than that. You have to trust the whole HW and SW stack to get it right. Granted the complexity of modern HW and SW, that is going to be an unrealistic assumption most of the time :P. Being utterly focused on filesystem integrity when running a cheap system with a shitty power supply, a CPU with a $5 cooler barely handling its TDP, RAM w/o ECC, and so on could be a rather stupid thing to do. ZFS makes certain assumptions and they only hold, to some degree, for high-end enterprise HW. On the downside, the tradeoffs taken bring management pains and inconvenient assumptions which do not hold in many practical use cases.

          - Sun claims ZFS is very safe - you are sceptical and don't trust ZFS because it might be marketing[....]BTRFS claims to be safe (who says it, btw?) - I am sceptical and don't trust BTRFS because it might be marketing.
          I wouldn't claim "btrfs is very safe". Realistically, if we assume there are no major bugs, it is going to be better than e.g. EXT4, but somewhat worse than ZFS in terms of data checksumming algos. But EXT4 is nowhere close feature-wise, and ZFS has very specific design goals; it was never meant to be "general purpose".

          So, if you look at what science says, it seems ZFS is safer than BTRFS.
          It could be a formally valid statement, but it is an incomplete, narrow view of a picture which happens to be larger. So it may or may not make sense. I can imagine the ZFS tradeoffs on hardcore checksumming making sense on expensive enterprise HW. But on cheaper HW it just looks silly. If one runs e.g. a system w/o ECC RAM, you can get data corrupted without even noticing it, before you even write it to ZFS. A cheap power supply can fry all HDDs at once. Single-drive ZFS can face a bad sector it can't correct, etc.

          Research papers show that other filesystems such as XFS, NTFS, etc. are unsafe and might corrupt your data. The papers can be found in the same Wikipedia section on ZFS data integrity.
          XFS, NTFS and suchlike do not even perform full journalling, since full journalling takes at least TWO writes, which is slow. If you crash in the middle of a write, you can end up with a half-old, half-new file. Needless to say, that hardly counts as "reliable". Such designs keep the metadata valid so no fsck is needed, but that does not imply the DATA are valid. EXT4 can do full journalling, but then it is SLOW. CoWs are more interesting in this regard, since they achieve full journalling without writing data twice.
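
          E.g. ext4's full data journalling is opt-in precisely because of that double-write cost (device and mountpoint here are placeholders):

            mount -o data=journal /dev/sdb1 /mnt/safe   # journal data as well as metadata, at roughly 2x write cost
            tune2fs -o journal_data /dev/sdb1           # or make data journalling a default mount option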

          I agree that no filesystem is 100% safe, but still, there are fewer such "success stories" for ZFS than for other filesystems / RAID. For instance, if you look at the ZFS forums and someone has problems, in almost all cases the problems are solved.
          I've seen at least some cases where ZFS was run on non-redundant storage like a single drive, and then there was a bad sector. And the answer was "ZFS isn't meant to deal with situations like this". Should never happen. Lol.

          Very often when someone says ZFS reports data corruption, or that they can not import the zpool, it turns out to be a hardware problem: a flaky PSU, a slightly loose SATA cable, a faulty RAM stick, etc. And when they fix the hardware problem, the problems are gone and they can import the zpool again.
          I've seen how it deals with uncorrectable bad sectors. "Should never happen", blah-blah, standard Sun marketing BS. I do not want it to be this way, I need my data. And it would be better if I could salvage most of the data back even under really imperfect conditions beyond the initial design assumptions.

          OTOH, have you read the BTRFS forums? Full of reports of data corruption, etc. The wiki says "Q: Is BTRFS safe? A: It depends". I doubt you will find research papers claiming BTRFS is safe any time soon.
          OTOH I've read those fancy man pages and forums and got the idea that even if my filesystem is utterly damaged, I can try to get most of the data back with "btrfs restore" in offline mode. A way better solution than parsing everything in a hex editor. It allows trying various entry points, finding relatively undamaged trees, eventually stumbling on relatively undamaged metadata, so I can read most data back in a more or less automated manner. Such tools are sometimes seen in advanced data recovery suites, but the only similar tools I've seen were commercial programs for NTFS and FAT32. I'm not aware of comparable tools for other filesystems; it is very neat that btrfs offers this kind of feature in its tools. While there was no reason to try this in the real world, I've played a bit with it just for fun and it seems I'm quite okay with this tool. I think I can handle most issues myself thanks to such tooling. Ironically, I only attempted to use btrfs restore out of curiosity - it hasn't been needed in real-world conditions so far, despite all the mumblings that btrfs is "unstable".
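
          Roughly what I played with, all offline and read-only with respect to the damaged device (device and target paths are just examples):

            btrfs restore -v /dev/sdb1 /mnt/rescue            # copy whatever is still readable onto healthy storage
            btrfs-find-root /dev/sdb1                         # if the default tree root is toast, list older roots to try
            btrfs restore -t <bytenr> /dev/sdb1 /mnt/rescue   # point restore at one of those alternative roots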

          You are free to have an opinion about Sun, but in this case science says ZFS is safe, whereas other solutions are not safe.
          My point is: if you do not have expensive high-end enterprise HW, it makes little sense to pedal ZFS reliability so much, because the rest of the system does not meet ZFS design assumptions and things can fail anyway, one way or another. So the point of having quite some pain for little gain is not clear to me. Realistically speaking, a faulty RAM module is more likely than btrfs failing to catch a read error, etc.

          Regarding "staying silent about bad sectors" - ZFS reports all such problems and repair them immediately. Many (all?) other filesystems can not even detect all forms of data corruption, so how can they repair those data corruption cases?
          Btrfs and recently some few others can try to check checksums. But in this particular case the issue is: you CAN NOT "repair" it. Question rather turns into: how do I mount it and get my data back?! If ZFS driver fails to bring it online, it seems it is bummer, with very limited means to do anything about it. Even ext4 would allow to run "offline" fsck, btrfs would allow to do the same, and if plan B failed, there is plan C and D to try with "btrfs restore", so even if you can't bring storage online.

          ZFS detects all types of data corruption, according to researchers, and according to the same research paper, ZFS repairs all those corruption cases too. It is easy: when ZFS detects corruption, it is just a matter of retrieving a correct copy of the data block and replacing the faulty one. Which is what ZFS does. Read the research paper (linked from the Wikipedia article on ZFS data integrity).
          Sometimes a filesystem MUST face corruption and deal with it. E.g. modern laptops often allow only ONE drive. It is not going to be super-reliable. But it is handy if I can recover a partially damaged filesystem and get most of the data back. And sorry, but I'm really using snapshots as a VM-like way to manage my systems. I use VMs and containers, and I do not get why the hell I should manage physical backends any other way. As a result, I can recover from virtually anything fast. Even an rm -rf / as root would take about a minute to go back into the past and try again, hopefully with a better outcome.
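
          With btrfs subvolumes that kind of rollback is roughly this (the subvolume names are just how I'd lay it out):

            btrfs subvolume snapshot -r /home /home/.snap-before         # cheap read-only snapshot before doing anything risky
            # ...the rm -rf disaster happens...
            btrfs subvolume snapshot /home/.snap-before /home/restored   # writable copy of the old state, restore from it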

          "For isntance, hard disks have lot of checksums on the surface to detect data corruption - and still you get data corruption on hard disks"
          They actually have ECC codes. So failed reads are recovered, without even letting you know about it, unless you're grinding SMART and HDD reports it. But whatever, on single drive system, if HDD can't read sector after some attempts, it returns error. Read failed. You have to live with it. And my life going to be better if there is still some plan and it does not sounds like "EPIC FAIL". So I warmly welcome tools to deal with damaged FS (or images) in hope either to correct errors or at least get most of data back.

          Data corruption protection is not provided by journaling. You need checksums to get full data integrity.
          This is true to some degree. But it does not protect against CPU or RAM or network errors.... he-he

          But you need a special kind of checksums; ordinary checksums will not do. This is evidenced by hard disks having a lot of error correcting codes to combat data corruption - and still hard disks get data corruption. No, you must use end-to-end checksums to get full data integrity.
          Sometimes the error is just too wild, so it exceeds the strength of the ECC code. At this point the ECC code would, at very best, detect the error; sometimes it could fail to detect the error at all. If the error is detected, the HDD reports UNC(orrectable read error) and the read fails. ECC codes have their own design assumptions. They aren't meant to be a panacea, and I understand that.

          The problem with storage is that traditionally you have different layers: a filesystem layer, a RAID layer, a volume management layer, etc. Data might be checksummed at each layer to catch data corruption, but when data passes from one layer to another there might be corruption, because the checksum is not passed along. This means data might be corrupted when you have many layers. ZFS is very different: it is monolithic. ZFS is in charge of everything: filesystem, RAID, volume manager, etc.
          Yes, it brings some advantages, e.g. one can try to read the RAID one way or another and see when the checksum matches; btrfs can also do this trick these days. Though it has gone a bit further: internally its design assumes RAID can be a per-file thing, so a storage pool could coexist with a fancy mix of RAID levels. There are devices, chunks and requested storage schemes. None of this is fundamentally set in stone. They have got quite some headroom for futuristic cool stuff to chew on when they're done with the simpler things :P.

          You need to be sure that the words at the beginning and at the end agree: end-to-end.
          Now you should get the idea why Google is okay with just ext4 w/o a journal on some of its servers. If you view a storage server as an atomic unit which either returns what you expect or fails for whatever reason, you do not have to care why it failed.

          Linux kernel developers say ZFS is bad design because it is not layered, it is monolithic. Well, the reason ZFS is safe is BECAUSE it is monolithic.
          Well, it's true to some degree. But btrfs has gone a bit further in this regard. It has more excuses for doing RAID its own way because... because I'm not aware of other designs which can internally go down to the file level when deciding which RAID level to use. Internally btrfs isn't a block-level or fixed-scheme RAID; that's quite a major excuse. No existing kernel facilities can handle it this way right now. This said, btrfs reuses the Linux kernel RAID algos (the computational parts of the code, which are separate modules). Still, Chris earned some uproar over some things, and had to reuse existing kernel facilities as much as possible.

          It seems the Linux kernel devs did not really understand the design criteria behind ZFS and why it was built in that weird way. The reason is that everything in ZFS revolves around data integrity.
          And to my taste it is being pedalled way too much, to the degree it turns into a stupid religion where admins blindly pray to data gods, swearing buzzwords from Sun marketing BS, without actually understanding what happens, why, and whether it reasonably fits the particular use case at all - failing to evaluate whether their actions result in a sane overall outcome. Doh.

          If ZFS did not care about data integrity, it would be layered just like all traditional filesystems on Linux.
          IIRC Sun never had cache/VM management, RAIDs, volume management and suchlike comparable to Linux in their Solaris; that's why they threw all of this into ZFS. Reasonable for Solaris, but kinda redundant on more developed OSes. Btrfs uses the usual means of the Linux kernel to do filesystem caching, and it only does its own RAID + device management to be able to do advanced RAID & space allocation things which aren't even planned for ZFS, as far as I understand.

          The weird thing is that BTRFS is also a monolithic "rampant layering violation". It is funny how the Linux kernel developers mocked ZFS and then, after a few years, made a similar clone, even though they first agreed that ZFS was badly designed (the Linux devs did not understand ZFS's weird design, which follows from its purpose: data integrity).
          Yet there are some technical differences in how btrfs does its things, and overall it is much better integrated with the rest of the Linux system; it reuses as much code as it can afford without sacrificing its advanced design goals, and so on. I like it this way.

          A true nerd! Me like. I am like you: I care about technical excellence, and I back it up with science, research papers and studies. There is a reason everybody wants ZFS and it is hyped: it is safe.
          What can you tell me about data safety on, say, a single-drive laptop computer? A fairly typical configuration these days, isn't it? ZFS zealots would gladly tell me that a bad sector with no redundancy wasn't a design assumption, so the lack of recovery options is to be expected. So I either have to use a hex editor myself or pay piles of $$ to data recovery experts if I had something valuable that wasn't backed up, right? I think it can be a bit better than that.

          You know, ZFS "scrub" checks/repairs the metadata and the data. "fsck" is not really trustworthy.
          IIRC scrub REQUIRES me to be able to bring storage online. Ability to mount storage online not to be taken as granted for damaged storage. Another stupid vague design assumption? Btrfs got both scrub, fsck-like tool to grind through metadata and even builtin hardcore data recovery suite if one got really desperate. I would prefer such palette of tools for EVERY filesystem I face.

          For instance, sometimes when you fsck or "chkdsk" a 10TB RAID array, it completes in a few minutes. But to really read and traverse 10 TB of data,
          Of course. Yet I need a tool which walks the metadata and ensures it is more or less sane, to the degree that the kernel driver can mount it without major issues and then will not face major breakage at run time. Scrub is good, but it assumes the storage is able to mount, which is not to be taken for granted for damaged storage. Therefore it is not a fsck replacement. I would rather call it a complementary tool.

          And the need for "fsck" has not been that great, the ZFS forums are not crowded with threads about people loosing their data and requiring "fsck". I dont see a great demand for "fsck" for ZFS.
          I care not if "need for "fsck" has not been that great", but I do care about right set of tools which are supposed to help me, should I face emergency conditions. Tool to walk metadata in offline mode is welcome. Requirements to bring storage online before doing so are not. It is good to have background scans if I can bring storage online, but it is really nice to have plan B if storage is damaged and fails to mount.

          The metadata on ZFS is always duplicated all over the RAID or disk. If it is damaged, you can always get another correct copy.
          This "always" assumption sounds too vague and implies it would show me what EPIC FAIL means if it turns out otherwise. Btrfs also DUPs metadata on a single drive by default, but it also provides the set of tools I would prefer to have for each and every filesystem. Good tooling is an advantage for a filesystem.

          There are not just 5 snapshots on ZFS. Every time you write data to the disk/array, the data is written to a new place; the old data is never touched.
          Except when e.g. spurious bad sector(s) appear on my laptop HDD, and it's not like I'm too happy with the options ZFS can offer me in that use case. I think it is better to acknowledge this design was never meant for use cases like this. Yet I really prefer to use snapshots, etc. And btw, there is no point in explaining to me how snapshots work; I'm well aware of it and have been using them for quite some time - maybe like 10 years on VMs, though surely less than that on bare-metal HW.

          At the end of the day you are free to have your opinion, but several researchers say ZFS is safe, and they say other filesystems are not safe.
          And I think this is a very synthetic, artificial and oversimplified point of view, which does not take a crapload of other things into account. Furthermore, no abstract researchers will help me when I face a bad sector on e.g. my laptop hard drive. I'll have to wing it on my own, and that's where I care about decent tools for dealing with cases beyond the "should never happen" crap.

          And it is a fact that many (all?) OSes want ZFS, for instance Linux.
          Linux is a kernel, and ZFS will never land in the mainline kernel, thanks to the odd license from Sun, intentionally created that way. So, where is Linux and where is Sun? Not to mention "Linux is interested in ZFS" is technically incorrect. At most some distros may or may not ship it, and even then the SFC considers that it could be a GPL violation. Whatever, it seems Sun's greed hasn't helped them too much.

          Solaris DTrace, which was also marketed by Sun, we see today which OSes have it or recently ported/copied it.
          Fine, but this lengthy list of proprietary/abandonware things isn't what I call exciting. Feel free to do advanced tracing in QNX, or AIX, etc. I'm better off doing things like this in Linux, preferably staying miles away from anything from Oracle. After all, Oracle has proven to be hostile to open source so many times: they screwed MySQL, destroyed Solaris, caused the LibreOffice fork... a very nice list of open-source achievements for a company, lol.
          Last edited by SystemCrasher; 07 March 2016, 03:49 PM.



          • #35
            Originally posted by SystemCrasher View Post
            Right, but I had quite a heated discussion with some Gentoo dev here on Phoronix. Even he admitted ZFS does checksumming in quite a specific way that brings some major tradeoffs. Somehow I do not think these tradeoffs are best for me in most use cases I care about.
            I agree with almost everything you say. You speak the truth. It is true that there will be fewer problems with ZFS because of its high reliability, but when the shit hits the fan... I understand your concern. All storage solutions need a backup - zfs, btrfs, ntfs, raid, all of them. Shit will hit the fan.

            It is true zfs will get fragmented over time, and it does not have a defragger. To decrease fragmentation, zfs only writes to disk every five seconds or so. In the worst case, if you really must defrag, you can send the data to another server and back again. I agree it is a pain, but it is doable.
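
            Something like this (pool, dataset and host names are made up) - send the dataset away and receive it back, which rewrites all blocks contiguously:

              zfs snapshot tank/data@move
              zfs send tank/data@move | ssh backuphost zfs receive backup/data
              # destroy the fragmented original, then stream it back:
              ssh backuphost zfs send backup/data@move | zfs receive tank/data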

            It is also a problem that you can not change the number of disks in a vdev; it simply can not be done. But you can add another vdev. A zpool consists of one or more vdevs, and each vdev is configured as raid5 or raid6 or a mirror. So you can always add a new vdev, but never change the number of disks in an existing one.
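
            In commands that looks roughly like this (disk names are placeholders) - you grow the pool by whole vdevs, you never reshape an existing one:

              zpool create tank raidz2 da0 da1 da2 da3 da4 da5   # one 6-disk raid6-style vdev
              zpool add tank raidz2 da6 da7 da8 da9 da10 da11    # grow by adding a second vdev
              zpool status tank                                  # the pool now stripes across both vdevs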

            If you have a single disk, you can use the "copies=2" switch, which will store all data twice, effectively halving the disk capacity. The metadata is always scattered all over the disk in multiple copies anyway, so it does not need to be duplicated separately. "copies=3" also works. No other filesystem can do this.
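
            On a single-disk machine that is just (dataset name made up):

              zfs set copies=2 tank/home      # every data block is stored twice, on the same disk
              zfs get copies,used tank/home   # check the setting and the space it costs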

            Regarding zfs only being fine on high-end hardware, that is not strictly correct. First, if you have non-ECC RAM and you get a corrupt DIMM, zfs will not corrupt your data. I can give you a technical link on this if you wish. This is a misconception that is not true (that zfs will eat your data if you have corrupt non-ECC RAM), as is the misconception that zfs needs more than 2 GB of RAM in a server to function well. No, instead I would argue the opposite: if you have cheap hardware, then you should use zfs to get extra protection in the form of software. For instance, zfs detects a flaky PSU. Wouldn't it be good to know as soon as you have the slightest problem in your hardware? Zfs detects all those problems. Immediately. With high-end stuff, the hardware is reliable in itself, so the need for extra protection such as zfs decreases. It is when you have cheap stuff that you need extra protection. Zfs makes cheap stuff very, very reliable. And on server-grade hardware such as Xeon with ECC RAM, zfs will boost data integrity into the enterprise-quality arena. So I would say: the worse the hardware you have, the greater the need for safe software that will immediately detect loose SATA cables, faulty RAM DIMMs, etc. With high-end stuff you don't need the extra protection as much.

            Have you ever tried zfs? Just curious...



            • #36
              Originally posted by aht0 View Post
              By now (a Phoronix member for a couple of months) I am detecting a pattern.

              Have some article comparing BSD with Linux appear, and you get a crowd of "penguin friends" descending on it (30+ posts) like vultures, bashing Sun or some random BSD for all it's worth, usually starting with mr. SystemCrasher, his "cow filesystems"TM and "sun marketing bs"TM remarks...

              It's a bit like Russian web trolls under a random political article in Northern/Eastern Europe.

              Have some neutral BSD news or announcement, and none of this crowd deigns to comment on it.

              Is it caused by some sort of complicated inferiority complex? A need to prove how superior THEIR FAVORITE OS is? Honestly curious..
              Well, the best thing is to use Windows Server and to stay miles away from all these Linux Trolls.
              Windows Server is relatively immune to criticism from these Linux Trolls. Yes, it is proprietary and expensive, and there is an evil corporation behind it... but technically it is not worse than Linux: it's well known, fast, powerful, and it frees us from systemd and other linuxisms.



              • #37
                Originally posted by Michael View Post

                Sans file-system differences, past tests I've done haven't shown much difference between PC-BSD and FreeBSD when using the same kernel and compiler.
                Believe me, there is. You only need to use a low-power netbook and do an installation of each to convince yourself. FreeBSD is leaner. You can use PC-BSD on one, but you need to jump through extra hoops like disabling filesystem compression (saves CPU cycles: LZ4 compression is enabled by default for ZFS), disabling a bunch of unnecessary stuff in /etc/rc.conf, and so forth.

                PC-BSD loads lots of stuff during startup, and it's no longer fully compatible with vanilla FreeBSD (try installing PC-BSD onto UFS2 via the "expert" filesystem selection in the terminal; you will run into incompatibilities very fast once you try updating the system, etc.). Beyond a first brief glance, it's a derivative of FreeBSD, not pure FreeBSD.
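
                The kind of trimming I mean, roughly (dataset and service names are examples of what you might find enabled):

                  zfs set compression=off zroot   # drop the default LZ4 compression to save CPU cycles
                  sysrc pcdm_enable=NO            # disable services you don't need; sysrc edits /etc/rc.conf safely
                  service pcdm stop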

                Thanks for a tip regarding Windows Server :P
                Last edited by aht0; 10 March 2016, 03:14 PM.

