ZFS On Linux 0.8.1 Brings Many Fixes, Linux 5.2 Compatibility Bits

  • #31
    Originally posted by numacross View Post

    You might be interested in this comparison then
    Thanks for sharing this link! I have just kind of skimmed for now, but will give it a more thorough read later. Seems like really good stuff, and even in skimming I learned some things.

    Comment


    • #32
      Originally posted by oiaohm View Post

      So is Stratis from Red Hat and EMC (https://stratis-storage.github.io/faq/) not good tooling?

      Notice you are not talking about EMC any more.

      EMC and Red Hat are both using the Linux kernel as their full stack. With ZFS on Linux you now have two stacks.

      Everything started changing in the Linux kernel at the start of last year. The big advantage of having your complete stack was reducing the odds of hitting ENOSPC errors, but it did not make them impossible, particularly once you get into virtual machine instances or iSCSI and the like that support thin provisioning.

      Instead of integrating RAID, volume management and so on like he had in the past, the XFS developer this time decided to fix the block layer so it provides the information the filesystem needs, accepts data the way the filesystem needs, and gives a safe ENOSPC. Safe ENOSPC means passing the block device a list of operations that must all happen or all fail, and that fixes the major reason ZFS uses CoW.

      The XFS choice means the option of using hardware-accelerated RAID is still there, and better than before. Yes, BTRFS and ZFS both have trouble using hardware accelerators because they implement their own unique stacks.

      Implementing the full stack on your own means you have NIH (Not Invented Here) illness: if it was not invented by you, the end users cannot use it. Both BTRFS and ZFS suffer from this. It is also easy to forget that the XFS developers attempted that path before XFS was ported to Linux; the XFS development team has a lot of experience under its belt.

      To be truthful, the Linux block layer had in reality been broken ever since thin provisioning was added, and it took the XFS lead developer to fix it.

      Yes, lots of people put ZFS up as a magic solution. As a solution, ZFS and BTRFS lack a lot of flexibility due to the single-stack problem.

      It does not pay to reinvent the wheel in most cases, and that is exactly what ZFS and BTRFS have done in implementing RAID and volume management.

      Basically the "cobbled together" argument is an excuse to ignore the defects that reinventing the wheel has caused ZFS and BTRFS.
      Ok so.. we are talking about two different things here a little bit.. (by tooling I mean mostly enterprise web apps and integration, ZFS has great command line tooling.)

      ..and some of what you say isn't wrong, but.. let's get to the fun part and talk about what is.

      Hardware RAID is a lie. It is software (firmware) baked onto a small RAID controller board. Any improvements you see from it come at great risk, because in order to do their job these controllers have to lie to the OS. They are, in my opinion, fundamentally broken. The OS really needs to know when commits are actually on disk. They work on cache, and.. oh, guess what? You can add really fast read/write cache to ZFS and use your own processor to do the transactions intelligently and order the reads and writes according to what the OS needs. It has a terrific ARC cache model that.. somehow performs much better than you'd think (I swear it's magic, I'm always impressed with it). And at the same time it does *not* lie to the OS. No need to put all your eggs in closed-source firmware code that is probably sitting in some ex-employee's home directory.
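      Roughly what that looks like, just as a sketch (the pool name "tank" and the NVMe device names are made up):

        # Add a fast read cache (L2ARC) device to an existing pool
        zpool add tank cache /dev/nvme0n1

        # Add a mirrored separate intent log (SLOG) for synchronous writes
        zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1

        # See how the pool and cache devices are being used
        zpool iostat -v tank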

      When I'm talking about "integration of the stack", what I'm referring to is just the file system itself. So from the blocks on the disk to the memory controller, ZFS is all one thing. I like XFS, but it's a very old design. It's been quite interesting to watch the hoops they have jumped through to try to improve its design, but some of those are quite ugly. XFS's CoW operations, for example, sound absolutely horrific to me compared to ZFS's method. Loopback? Are you kidding me? .. seriously? You think that is a good thing? Please.. please remove yourself from any data center immediately and never return. (j/k)

      Sometimes things really need to be redesigned, and filesystems are a good place to start. Like it or not, you have to admit that ZFS was revolutionary and changed the discussion around filesystems.
      Last edited by k1e0x; 17 June 2019, 12:48 PM.

      Comment


      • #33
        Originally posted by k1e0x View Post
        Ok so.. we are talking about two different things here a little bit.. (by tooling I mean mostly enterprise web apps and integration, ZFS has great command line tooling.)
        Stratis provides both command-line and web-based tooling. ZFS is not the only thing in the game with tooling covering the same areas; this is just a different design.
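        For example, roughly (pool name, filesystem name and device path are made up; this is the stratis-cli syntax as I understand it):

          # Create a Stratis pool from a block device
          stratis pool create mypool /dev/sdb

          # Create a filesystem (XFS on thin-provisioned dm volumes) in that pool
          stratis filesystem create mypool fs1

          # Snapshot it
          stratis filesystem snapshot mypool fs1 fs1-snap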

        Originally posted by k1e0x View Post
        Hardware RAID is a lie. It is software (firmware) baked onto a small RAID controller board. Any improvements you see from it come at great risk, because in order to do their job these controllers have to lie to the OS. They are, in my opinion, fundamentally broken.
        No, you have missed where the fundamental breakage is.

        Originally posted by k1e0x View Post
        The OS really needs to know when commits are actually on disk.
        Do HDD or SSD drives in fact tell you this? The answer is no. They tell you when the data lands in their caches, not when it is actually written to the disc/storage media.
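        You can at least see and toggle that drive-side write cache yourself (device names are just examples):

          # Query whether the drive's volatile write cache is enabled
          hdparm -W /dev/sda

          # Disable it (slower, but writes are no longer acknowledged from cache)
          hdparm -W 0 /dev/sda

          # For SCSI/SAS devices, sdparm can read the WCE bit instead
          sdparm --get=WCE /dev/sdb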

        Originally posted by k1e0x View Post
        And at the same time it does *not* lie to the OS.
        This is the problem with ZFS: its design rests on the incorrect presumption that HDDs and SSDs don't lie their heads off. This is why a well-built hardware RAID is great: it has on-board storage for its cache that it knows is not a lying bastard, with enough power to back itself up to non-lying flash storage in case of power loss. Of course I would like to see more open, auditable hardware RAID controllers out there that are properly designed to deal with the nastiness of HDDs and SSDs, as well as more truthful HDD and SSD drives.

        Originally posted by k1e0x View Post
        XFS's CoW operations, for example, sound absolutely horrific to me compared to ZFS's method. Loopback? Are you kidding me? .. seriously? You think that is a good thing? Please.
        It is not as horrible as it sounds, particularly once you see that it has block deduplication at the deeper levels. https://www.redhat.com/en/blog/look-...pression-layer

        There was a video presentation on XFS looking at using loopback to do snapshotting, and that video covers some of the defects and limitations of the current snapshot model in ZFS and Btrfs. Duplicating the file system structure on top of a block-based CoW behind it allows working around those problems. It clearly points out that the loopback usage was only a prototype.

        XFS is putting the CoW in a different position to where it sits in ZFS and BTRFS. ZFS and BTRFS do their CoW in the file system structures, but the XFS developer is putting the CoW at the data block level, so the file system structures are not treated differently from the files. This becomes important when you need to avoid UUID collisions and other problems. The XFS method is not loopback if you look closer; an XFS mount can mount a file without a loopback device.

        https://www.spinics.net/lists/linux-xfs/msg25911.html Yes, XFS is turning into a full-blown file system with CoW built in as well. Outside the RAID parts there are very limited differences left between XFS and BTRFS/ZFS. XFS puts RAID back on the host. For data checksumming they are still working out whether it belongs in the block layer or the file system layer with XFS. XFS plus the dm system from Linux is pretty much a feature match for ZFS.
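        As a rough sketch of that combination (volume group, pool and mount point names are made up; reflink needs a reasonably recent mkfs.xfs):

          # Thin-provisioned LVM volume to sit under XFS
          pvcreate /dev/sdb
          vgcreate vg0 /dev/sdb
          lvcreate -L 100G -T vg0/tpool          # thin pool
          lvcreate -V 200G -T vg0/tpool -n data  # thin volume, can exceed pool size

          # XFS with reflink (CoW file clones) enabled
          mkfs.xfs -m reflink=1 /dev/vg0/data
          mkdir -p /mnt/data && mount /dev/vg0/data /mnt/data

          # CoW copy of a file, sharing blocks until either copy is modified
          cp --reflink=always /mnt/data/vm.img /mnt/data/vm-clone.img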

        Comment


        • #34
          Originally posted by oiaohm View Post
          If I understand you right, what you are saying is that because hard disks use cache, ZFS is a broken design, while at the same time adding another layer of cache is better. Thanks, but I'll just use the much faster system RAM already in the machine and skip the unnecessary layer.

          ZFS has no use for a battery-backed controller because it cannot become inconsistent due to power failure (or any other reason), so it has no need for fsck or journal playback.
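          Instead of an fsck you just verify the pool while it stays online (pool name is made up):

            # Scrub walks every block and verifies its checksum
            zpool scrub tank

            # Report any read, write, or checksum errors it found
            zpool status -v tank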

          I'm not sure why you'd want to use XFS's bolt-on approach, but... you know.. suit yourself. Crazy people do crazy things, I guess. You can just install and use ZFS today.. you don't need to wait for XFS to hack on anything.

          Also, if it's feature compatible, where is XFS's block-wise remote filesystem clone / incremental copy of an encrypted volume feature? Because that is a really nice thing ZFS can do, given that all its layers work together. (The remote host can be untrusted, too; it never needs to have a copy of the key and the data never gets decrypted.)
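          For reference, that looks roughly like this with ZFS 0.8 raw sends (dataset, snapshot and host names are made up):

            # Snapshot an encrypted dataset
            zfs snapshot tank/secure@monday

            # Raw send: blocks stay encrypted on the wire and on the target,
            # and the target never needs the key
            zfs send --raw tank/secure@monday | ssh backup zfs receive pool/secure

            # Later, send only the blocks that changed since the last snapshot
            zfs snapshot tank/secure@tuesday
            zfs send --raw -i @monday tank/secure@tuesday | ssh backup zfs receive pool/secure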
          Last edited by k1e0x; 18 June 2019, 06:59 PM.

          Comment


          • #35
            Originally posted by k1e0x View Post
            If I understand you right, what you are saying is that because hard disks use cache, ZFS is a broken design, while at the same time adding another layer of cache is better. Thanks, but I'll just use the much faster system RAM already in the machine and skip the unnecessary layer.

            ZFS has no use for a battery-backed controller because it cannot become inconsistent due to power failure (or any other reason), so it has no need for fsck or journal playback.
            My recently created zpool's metadata was corrupted due to a brief loss of power. I have ECC ram, and the pool was made up of 2 vdevs consisting of 6 disks in raidz2. The pool didn't come up when th...


            Random ZFS users do in fact get nuked by power loss every so often, because critical bits of data that were sitting in the HDD/SSD cache get vaporised and never written. Yes, the ZFS design would work perfectly if HDDs/SSDs were not lying bastards that tell it "yes, I have that data stored on disc" while still holding it in cache. You either put power backup on the RAID controller or you put power backup on the complete machine.

            Really we do need a battery/capacitor backup on our hard drives, and some form of "we have lost mains power, hard drive, please flush your cache and suspend operations". The only way you can be sure that an HDD or SSD has actually written out its cache is to send the power-down command to the drive, and then you need to power cycle the drive to bring it back online. Some people have wondered why some high-end RAID cards are wired to the drive power connectors; this is why, so they can in fact know that the data was written.

            ZFS is designed on the assumption that HDDs and SSDs behave themselves and tell the truth about what they have written; unfortunately that is not reality. Having to send a shutdown command and then power cycle the drive to make it active again, just to be sure the data was written, is a total pain and not something the ZFS design allows for.

            Originally posted by k1e0x View Post
            Also, if it's feature compatible, where is XFS's block-wise remote filesystem clone / incremental copy of an encrypted volume feature?
            Stratis does that using the block layer features of Linux: the art of thin provisioning, so it is able to order blocks in the clone based on usage demand.

            XFS + the dm system from Linux. Note that I wrote "+ dm", as in plus the device-mapper system. Full volume encryption and relocation are done in the dm layer.

            This is the problem with comparing ZFS straight to XFS: yes, the result is that ZFS has more features. But when you compare XFS+dm to ZFS, it is XFS+dm that has more features.
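            The encryption piece of that, sketched with cryptsetup (device and mapping names are made up):

              # Full-volume encryption handled below the filesystem, in device-mapper
              cryptsetup luksFormat /dev/vg0/secure
              cryptsetup open /dev/vg0/secure secure_crypt
              mkfs.xfs /dev/mapper/secure_crypt
              mkdir -p /mnt/secure && mount /dev/mapper/secure_crypt /mnt/secure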

            Comment


            • #36
              Originally posted by oiaohm View Post
              Nice FUD. The "bug" was never acknowledged or addressed, and there is some skepticism from the devs as to whether it is even possible. He also says it emailed him before it booted/while booting? Sounds fishy. But yes, I'm sure ZFS is *very broken* and unsafe : roll eyes : those enterprise multi-petabyte arrays could vanish overnight, I'm so sure. The dev's comment is: "This is very unusual and shouldn't be possible. zpool import -F should always be able to rollback unless your disk was lying to you, or potentially there was a pool configuration change (add/remove disk) right before the power outage." By design it can't be in an inconsistent state, and if it were ever mid-transaction the last consistent state would still exist (or the one before it). This is how CoW works. ZFS developers were/are very aware that devices lie, and they do everything humanly possible to protect data from them.
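              For context, the rollback the dev is talking about looks like this (pool name is made up):

                # Rewind to the last consistent transaction group if the newest one is damaged
                zpool import -F tank

                # Dry run: report whether the rewind would succeed without importing
                zpool import -Fn tank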

              You might even say they were overly obsessed with protecting the data. I'm not sure we have any filesystem that goes to such lengths yet? APFS didn't. https://arstechnica.com/gadgets/2016...s-file-system/

              I noted it, but you don't understand what I'm talking about: block-level encryption + compression (+ dedup if you want, though it's being redesigned because it sucks) to a remote that never has the private key, modifying only the files that changed, without having to calculate the difference between them.
              Last edited by k1e0x; 18 June 2019, 08:05 PM.

              Comment


              • #37
                Originally posted by k1e0x View Post
                Nice FUD. The "bug" was never acknowledged or addressed, and there is some skepticism from the devs as to whether it is even possible. He also says it emailed him before it booted/while booting? Sounds fishy. But yes, I'm sure ZFS is *very broken* and unsafe : roll eyes : those enterprise multi-petabyte arrays could vanish overnight, I'm so sure. The dev's comment is: "This is very unusual and shouldn't be possible. zpool import -F should always be able to rollback unless your disk was lying to you, or potentially there was a pool configuration change (add/remove disk) right before the power outage." By design it can't be in an inconsistent state, and if it were ever mid-transaction the last consistent state would still exist (or the one before it). This is how CoW works. ZFS developers were/are very aware that devices lie, and they do everything humanly possible to protect data from them.
                I have quoted one bug, but many bugs in the tracker are reported like that without the developers ever finding the problem. The fun part is that the problem comes from modern SSDs and newer hard drives having larger internal blocks than the ones ZFS is controlling. So you write 4 KB, but in fact 4 MB of data has been transferred up to the cache and cleared out of storage before being rewritten. So there is a window where 4 MB of data containing only 4 KB of new data can magically disappear.

                CoW does not help you when you are losing a shotgun spread of data. RAID5 and RAID6 rebuild checksums help with this shotgun problem.

                This is the result of the block size being different from what the storage media is in fact using.

                From the "Coding for SSDs" series (Part 2 of 6): "Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB."
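                A drive will at least tell you its logical and physical sector sizes, though not its internal erase-block size; for example:

                  # Logical vs physical sector size as reported by the drive
                  lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda

                  # The same information straight from sysfs
                  cat /sys/block/sda/queue/logical_block_size
                  cat /sys/block/sda/queue/physical_block_size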
                Hardware RAIDs do have a stack of horrible tricks to work around these problems. This is the problem: losing power to a hard drive or SSD while it is mid-write, moving data around, can basically shotgun-blast your data storage, taking out a mixture of new and old data. CoW is not strong enough to resist that. RAID6 double-parity RAID was not made for no reason. RAID rebuilds taking ages after a power outage are not for no reason either. They are designed around the idea that if you have not powered down correctly, your data has been shotgun-blasted and could be missing many pieces.

                The idea that important storage can go without a UPS because you have ZFS is fatally wrong. The low level is a total ass.

                Originally posted by k1e0x View Post
                I noted it, but you don't understand what I'm talking about: block-level encryption + compression (+ dedup if you want, though it's being redesigned because it sucks) to a remote that never has the private key, modifying only the files that changed, without having to calculate the difference between them.
                The device-mapper level is able to do block-level encryption + compression + dedup: VDO from Red Hat does the dedup and compression, and the layers under it can then take that to a remote over iSCSI and other methods.
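                Roughly, with the RHEL-era vdo manager (names and sizes are made up):

                  # Dedup + compression in device-mapper via VDO
                  vdo create --name=vdo1 --device=/dev/sdc --vdoLogicalSize=10T
                  mkfs.xfs -K /dev/mapper/vdo1
                  mkdir -p /mnt/vdo1 && mount /dev/mapper/vdo1 /mnt/vdo1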

                So yes, I understood exactly what you said; you are just clueless about how much device mapper on Linux can in fact do.

                Comment


                • #38
                  Originally posted by oiaohm View Post

                  ashift=12 for 4K-sector drives and ashift=13 for 8K,
                  ashift=9 for 512-byte spinners and ashift=8 for the 256-byte oddities,
                  ...or don't set ashift and let ZFS try to detect it automatically...

                  just saying that ZFS has you covered there... and if someone doesn't account for any of that before setting up the drives then they shouldn't be doing that job.
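                  For a hypothetical set of 4K-sector disks that would be something like:

                    # Force 4K alignment at pool creation; ashift cannot be changed afterwards
                    zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf

                    # Verify what was actually used
                    zdb -C tank | grep ashift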

                  ZFS also has RAID built in, so that RAID/UPS argument is just unnecessary.

                  Random file system w/o a UPS loses data during a power outage; this and more on tonight's episode of No Shit Sherlock featuring Rick Romero.

                  Comment


                  • #39
                    Originally posted by skeevy420 View Post
                    ashift=12 for 4K-sector drives and ashift=13 for 8K,
                    ashift=9 for 512-byte spinners and ashift=8 for the 256-byte oddities,
                    ...or don't set ashift and let ZFS try to detect it automatically...
                    just saying that ZFS has you covered there... and if someone doesn't account for any of that before setting up the drives then they shouldn't be doing that job.
                    Wear levelling in SSDs and sector relocation on HDDs ruin this idea. ZFS does not have you covered at all. Alter that value all you like; it does not stop the shotgun effect. Wear levelling means sectors spread all over an SSD get collected into one block/zone inside the SSD, and ashift does not help you see this.

                    Originally posted by skeevy420 View Post
                    ZFS also has RAID built in, so that RAID/UPS argument is just unnecessary.
                    The UPS and the state backup on a RAID controller are needed for two slightly different reasons. The UPS is there to allow the drives to be shut down safely, so you avoid losing data in the first place. The state backup in a hardware RAID controller is there to detect that the last writes did not in fact go through, and to run a full RAID rebuild looking for shotgun-effect damage. One of the problems with ZFS is silent data loss: with the CoW, when the last write did not go through, the fact that it is missing is simply ignored.

                    To see what the heck is going to get nuked you need not ashift but zoned block device support (dm-zoned with ZBC- and ZAC-compliant drives), so the drive firmware will in fact answer what is stored in which zones and what will be pulled into the drive cache when you modify things. Most of our current drives don't provide this information, so you are taking a firmware leap of faith.
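                    You can check whether a drive exposes zones at all (device name is made up):

                      # "none" means a conventional drive; "host-aware" or "host-managed" means zoned (ZBC/ZAC)
                      cat /sys/block/sda/queue/zoned

                      # On a zoned drive, list the zones and their write pointers
                      blkzone report /dev/sda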

                    Having RAID does not mean you can skip having a UPS. Yes, a state-backup RAID controller disappearing for a few days after a power outage to do a rebuild is a serious pain in the ass. Basically, if you value your data and want to access it quickly, you will use a UPS.

                    Comment


                    • #40
                      Originally posted by oiaohm View Post

                      Wear levelling in SSDs and sector relocation on HDDs ruin this idea. ZFS does not have you covered at all. Alter that value all you like; it does not stop the shotgun effect. Wear levelling means sectors spread all over an SSD get collected into one block/zone inside the SSD, and ashift does not help you see this.
                      So ensuring that ZFS, or any file system for that matter, is set to use the same block size as the underlying drive doesn't help at all? Oh, come on now. You can't stop the effect, but you can ensure that the file system sends data in a manner that speeds things up and hopefully makes outage errors less likely.

                      Basically, if you value your data and want to access it quickly, you will use a UPS.
                      Which is why that RAID/UPS argument is unnecessary since that's a problem with any file system, not just ZFS. ZFS having some built-in mechanisms to help deal with outages, well, helps, but only a moron would consider that to be 100% reliable. It helps like fsck helps other file systems; nice to know it's there, but you're a moron if you think that's all you need to keep you covered and protected.

                      ZFS having its own RAID helps, but you're still a moron to think it's all you need to be safe, just like you'd be a moron to think a hardware RAID controller and multiple disks are all you need with XFS, EXT4, or BTRFS.

                      See what I mean? It's all unnecessary, since RAID with no UPS brings the same issues to ZFS that it brings to every other file system in existence. Quirks between various file systems might differ here and there, but it's essentially the same shit, different file system.

                      Comment
