Linus Torvalds Doesn't Recommend Using ZFS On Linux

  • k1e0x
    replied
    Yes, using a larger sector size and placing the checksum next to the block is bad because it can't protect against phantom reads/writes or reads/writes to the wrong sector. ZFS solves this problem because it can chain-verify itself down the entire tree of blocks.
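
    A minimal sketch (my own illustration, not OpenZFS code) of why a checksum kept in the parent block pointer catches a misdirected read, while a checksum stored next to the block it describes cannot:
    ```python
    import hashlib

    def sha256(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    # Toy disk: each sector stores its payload plus a checksum written beside it.
    disk = {
        7: (b"block A contents", sha256(b"block A contents")),
        9: (b"block B contents", sha256(b"block B contents")),
    }

    # Scheme 1: the checksum lives next to the data it describes.
    def verify_beside(sector):
        data, stored = disk[sector]
        return sha256(data) == stored        # whichever sector we got, it "verifies"

    # Scheme 2: ZFS-style, the parent block pointer remembers the expected checksum.
    def verify_from_pointer(sector, expected):
        data, _ = disk[sector]
        return sha256(data) == expected

    expected_a = sha256(b"block A contents")  # recorded in the parent block pointer

    # The drive misdirects a read: we asked for sector 7 but were handed sector 9.
    print(verify_beside(9))                   # True  -> the phantom read goes unnoticed
    print(verify_from_pointer(9, expected_a)) # False -> the wrong data is caught
    ```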

    And again, sorry oiaohm.. Linux dm can't tell whether the data on one drive is correct versus the data on another, as it has no checksum on the data. Corruption on one side of a mirror can't be compared to the other; all it can do is read. In ZFS's case it reads, compares the checksum, and then can take action if it's wrong, such as looking in the array for another copy of the data and even going back and fixing the bad block. That is what is meant by self-healing.
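
    A rough sketch of that self-healing read path, assuming a two-way mirror and a known-good checksum taken from the parent block pointer (illustrative only, not the actual ZFS code path):
    ```python
    import hashlib

    class FakeDisk:
        """Toy block device standing in for one side of the mirror."""
        def __init__(self): self.blocks = {}
        def read(self, lba): return self.blocks.get(lba, b"")
        def write(self, lba, data): self.blocks[lba] = data

    def sha256(data): return hashlib.sha256(data).digest()

    def read_self_healing(mirrors, lba, expected):
        """Return a copy matching the checksum and repair any copies that don't."""
        copies = [d.read(lba) for d in mirrors]
        for i, data in enumerate(copies):
            if sha256(data) == expected:
                for j, other in enumerate(copies):
                    if j != i and sha256(other) != expected:
                        mirrors[j].write(lba, data)   # heal the bad side
                return data
        raise IOError("no mirror copy matched the checksum")

    # One side of the mirror silently corrupts block 0.
    a, b = FakeDisk(), FakeDisk()
    a.write(0, b"good data"); b.write(0, b"bit-rotted")
    expected = sha256(b"good data")          # recorded in the parent block pointer
    assert read_self_healing([a, b], 0, expected) == b"good data"
    assert b.read(0) == b"good data"         # the bad copy was rewritten
    ```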

    Linux DM has proposed adding this, but its logic for the proposal is pretty bonkers. They write the block out, then wait 5 seconds for the compression to finish, then calculate the checksum.. whoa.. no thanks. ZFS does it in RAM before the write. (This is what I mean by shoehorned and duct-taped-on features.) In ZFS, writing compressed data takes less time than uncompressed.. because, well, you're writing less data! Simple. In Linux DM, writing less data takes 5 seconds more time?? How does that logic work? Did it double-write the data? How did they determine 5 seconds was right? It sounds like a number pulled out of thin air. Ick ick.. just no.. no.. thanks for trying.
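
    Here is a small sketch of the ordering being described: compress and checksum in RAM, then issue one smaller write. This is my own simplification, not the actual ZFS I/O pipeline:
    ```python
    import hashlib, zlib

    def prepare_record(payload: bytes):
        """Compress and checksum in RAM before anything touches the disk."""
        compressed = zlib.compress(payload, 6)
        # Keep the smaller of the two; compression is skipped when it doesn't help.
        data = compressed if len(compressed) < len(payload) else payload
        checksum = hashlib.sha256(data).digest()  # checksum of what actually hits disk
        return data, checksum

    payload = b"A" * 128 * 1024                   # a highly compressible 128 KiB record
    data, checksum = prepare_record(payload)
    print(len(payload), "->", len(data), "bytes to write, checksum computed pre-write")
    ```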

    XFS has checksums also, but only on metadata. They say it's too slow to do it on data blocks. (A problem ZFS's "terrible" design allowed them to solve.. if only XFS's design were as bad as ZFS's, they could have data block checksums.)
    Last edited by k1e0x; 15 January 2020, 09:19 PM.

  • ryao
    replied
    Originally posted by oiaohm View Post
    It's only useless if you cannot read/process it and know what it is.

    Having the controller do it on read means you are not wasting CPU time on a broken block.

    The controller-calculated ECC can be sent back when the block of data is read from the drive. Hard drives have more than one mode; there are some nice ones for data recovery/data protection that are not exposed by the normal OS block layers.

    I did not say you would not be calculating the ECC again.

    No, what you are doing is counting the money you got back while ignoring what the register said you should get. By luck the person miscounted and gave you the value you were expecting. When you come to do your taxes, your invoice is wrong. The ECC on the drive is your invoice for what data the drive was expecting to send you.

    So that covers the classic random-quality hardware/software RAID controllers. Not all hardware RAID controllers are created equal; some in fact use the hard drives' ECC values, while the cheaper ones don't.

    So a three-way mirror RAID can be more secure than RAID-Z if it is on one of the controllers that uses the ECC values stored by the hard drive controller and your OS is able to use those ECC integrity checks after the data is transferred to RAM. You can read the ECC value from one drive and a block from another, and they should match, right? Basically that is how to store your parity information without in fact costing yourself any space.

    Basically, you really do need to do way more homework, particularly on the hardware RAID controllers most resistant to failure, not the random junk.

    That is not exactly true either. Compare ZFS RAID-Z against a RAID 16 Linux kernel dm software RAID: both have exactly the same failure numbers. If a Linux file system is sitting on RAID 16 dm software RAID then, sorry, you are out of luck, you don't have the integrity advantage.

    ZFS's risk of integrity issues is low compared to a normal Linux file system alone because it has an integrated block layer. Triple-parity RAID-Z, or RAID 7, happens to have been patented by NetApp in 2009, with the patent expiring in 2030. RAID 16 is a method to create a triple-parity-like RAID without stepping on the NetApp patent. I guess this is another reason why you cannot re-license: without patent coverage you are screwed, right?

    So improve the Linux block layer and the integrity advantage of ZFS can go bye-bye for all file systems the Linux kernel supports. Do remember that the block layer copies straight into the Linux kernel page cache when you are native to Linux, not running some alien beast like ZFS does.
    Just about everything you said is wrong. The most fundamental issue is that you do not seem to understand the end-to-end principle:

    https://web.mit.edu/Saltzer/www/publ...d/endtoend.pdf

    You also don't seem to know how the hardware works, because phrases like "one of the controllers that uses the ECC values stored by the hard drive controller" make no sense. That data is never sent from the hard drive. It is possible to have extra data stored with the sector, like what NetApp does, but this fails to provide adequate protection from a write operation being done to the wrong sector.

    There is no way to improve the Linux block layer to be as good as using ZFS. That is why Chris Mason made btrfs.

    There are failure modes that RAID 5 has that RAID-Z lacks, such as rendering all data inaccessible when a RAID 5 rebuild operation encounters corruption that makes recalculating the result mathematically impossible. ZFS would simply report the damaged data while rebuilding the rest.
    Last edited by ryao; 15 January 2020, 07:42 PM.

  • k1e0x
    replied
    Originally posted by LxFx View Post

    I'm mainly interested in the self-healing and "RAID" capabilities of ZFS for my personal central storage.
    If I check the Arch wiki topic for btrfs, it says that those features are unstable, contain errors, or have significant downsides...
    I would prefer the included btrfs over the license-incompatible ZFS, but one thing I don't want in an FS is it being error-prone or unstable...
    Anything I'm missing here?
    oiaohm has a poor understanding of how file systems in general work, as ryao pointed out.

    Originally posted by oiaohm View Post

    You are missing a lot.
    https://www.jodybruchon.com/2017/03/...ot-and-raid-5/

    Let's cover some facts. Your basic hard drives and SSDs are in fact self-healing at the controller level. The horrible point is that our block layers in operating systems have not allowed us to simply access the controller-generated ECC data. Adding ZFS to an operating system does not address this weakness in the block layer; instead you end up calculating checksums basically twice. The horrible reality here is that OS block layers need a major rework to give access to information that does exist.

    Next, btrfs's own built-in RAID is marked error-prone, but that is not the only option.
    https://wiki.archlinux.org/index.php...e_RAID_and_LVM
    You still have your general operating system RAID options and other options.

    ZFS not being mainlined in the kernel does in a lot of ways increase your risk of errors, coming from the fact that the upstream kernel fixes something without considering how your ZFS file system driver will be doing things.

    This is my problem with "ZFS or nothing": normally they are not really considering the full problem at hand, and whether ZFS is really fixing the problem or just adding duplication of functionality that in fact increases the risk of data loss.
    Who am I? I am a systems engineer who has previously worked for many Fortune 500 companies as a storage engineer. I have used ZFS since its release on Solaris in 2006, on FreeBSD, and on Linux in the enterprise with real mission-critical production systems.

    The uncorrectable bit error rate is extremely low on most storage media... however, as disk sizes become larger, the likelihood that you will experience one grows, since the number of bits on the drive also becomes greater.

    For example, if you write only 10 blocks and you have a 0.0001% chance of failure per block, chances are you won't have an issue.. but if you write 4 million blocks, your chances are much higher. This is the reason Dell has deprecated RAID 5 and 6 in the enterprise: having only one copy of the data, and in 6's case two copies, wasn't good enough for enterprise. (Compounding this were the resilver times.)
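
    A quick back-of-the-envelope sketch of how those per-block odds compound (the 0.0001% per-block figure is just the illustrative number above, not a vendor spec):
    ```python
    # Probability of at least one failure across n independent block writes,
    # given a per-block failure probability p: 1 - (1 - p)**n
    p = 0.000001  # 0.0001% per block, the illustrative figure above

    for n in (10, 4_000_000):
        prob_any = 1 - (1 - p) ** n
        print(f"{n:>9} blocks -> P(at least one bad block) = {prob_any:.4%}")

    # 10 blocks        -> ~0.001%
    # 4,000,000 blocks -> ~98.2%
    ```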

    ZFS essentially expects the data from the disk to be wrong and calculates a checksum for every block on the disk. This is done in the kernel and is independent of the disk firmware. Because of this, ZFS can and does detect errors in ordinary drives that the firmware cannot (and even in expensive enterprise SAN equipment, as found by CERN's LHC).

    Why don't more file systems do this? Because making a pass over the data is extremely inefficient for performance. ZFS manages to pull this off without too much of a hit when you consider what it's doing, making it competitive and *only* slightly slower than most non-checksumming filesystems while at the same time providing much higher data integrity.

    Why isn't it done in firmware? Because the firmware does not have access to a processor as fast as your CPU, and because it would add a lot of cost to the device. Drive manufacturers have a mean failure rate they are comfortable with for the cost of their devices.. and as shown, even very expensive storage solutions get this wrong.

    Disk manufacturers publish failure rates for their devices. LLNL has also published data integrity figures for ZFS: in a 100-petabyte array consisting of 30,000 drives, ZFS managed to have zero uncorrectable disk errors in over 10 years. Zero. That makes ZFS the gold standard in data integrity.

    You can learn more about this design from this talk.
    https://www.youtube.com/watch?v=NRoUC9P1PmA
    Also, here is an analysis of APFS and its lack of checksumming. http://dtrace.org/blogs/ahl/2016/06/...rt5/#apfs-data

    So why does oiaohm say that ZFS checksumming each block for integrity before passing it to the application "in fact increases risk of data loss"? I don't know; he's just shilling. A checksum is the only way to discover this kind of corruption, and the closer in the data chain the checksum is calculated to the application that wrote the data, the better. The more layers and abstractions you go through (i.e. firmware) before calculating it, the more can go wrong.
    Last edited by k1e0x; 15 January 2020, 08:45 PM.

  • oiaohm
    replied
    Originally posted by ryao View Post
    The presence of some sort of calculation at the controller is useless for ensuring integrity when a valid calculation can be sent with the wrong data.
    It's only useless if you cannot read/process it and know what it is.

    Originally posted by ryao View Post
    The only way to handle this is to calculate a checksum as early in the stack as possible on write, store it separately and verify it in the same place on read. Expecting the controller to do something for you is letting things happen too late.
    Having the controller do it on read means you are not wasting CPU time on a broken block.

    Originally posted by ryao View Post
    Whatever the controller calculates is also not what is stored on disk, or sent back with the data.
    The controller-calculated ECC can be sent back when the block of data is read from the drive. Hard drives have more than one mode; there are some nice ones for data recovery/data protection that are not exposed by the normal OS block layers.

    Originally posted by ryao View Post
    It gets recalculated each time in a different way. This is a great way to get served the wrong data with a valid checksum/ECC calculation. The proper way to address it is to have checksums stored with pointers that are verified by the kernel.
    I did not say you would not be calculating the ECC again.

    Originally posted by ryao View Post
    It is like getting change back when making a purchase. You can rely on the other guy to count it, or you could count it yourself to be sure that you received what you were supposed to receive. The other guy counting it never means that your count of it is redundant. That would just be blind trust that is prone to abuse.
    No, what you are doing is counting the money you got back while ignoring what the register said you should get. By luck the person miscounted and gave you the value you were expecting. When you come to do your taxes, your invoice is wrong. The ECC on the drive is your invoice for what data the drive was expecting to send you.

    Originally posted by ryao View Post
    I also wrote a list of issues that hardware RAID controllers have and most of them apply to software RAID:

    http://open-zfs.org/wiki/Hardware#Ha...ID_controllers
    So that covers the classic random-quality hardware/software RAID controllers. Not all hardware RAID controllers are created equal; some in fact use the hard drives' ECC values, while the cheaper ones don't.

    So a three-way mirror RAID can be more secure than RAID-Z if it is on one of the controllers that uses the ECC values stored by the hard drive controller and your OS is able to use those ECC integrity checks after the data is transferred to RAM. You can read the ECC value from one drive and a block from another, and they should match, right? Basically that is how to store your parity information without in fact costing yourself any space.

    Basically, you really do need to do way more homework, particularly on the hardware RAID controllers most resistant to failure, not the random junk.

    Originally posted by ryao View Post
    The risk of integrity issues with ZFS is lower than with in-tree filesystems, not higher.
    That is not exactly true either. Compare ZFS RAID-Z against a RAID 16 Linux kernel dm software RAID: both have exactly the same failure numbers. If a Linux file system is sitting on RAID 16 dm software RAID then, sorry, you are out of luck, you don't have the integrity advantage.

    ZFS's risk of integrity issues is low compared to a normal Linux file system alone because it has an integrated block layer. Triple-parity RAID-Z, or RAID 7, happens to have been patented by NetApp in 2009, with the patent expiring in 2030. RAID 16 is a method to create a triple-parity-like RAID without stepping on the NetApp patent. I guess this is another reason why you cannot re-license: without patent coverage you are screwed, right?

    So improve the Linux block layer and the integrity advantage of ZFS can go bye-bye for all file systems the Linux kernel supports. Do remember that the block layer copies straight into the Linux kernel page cache when you are native to Linux, not running some alien beast like ZFS does.

  • skeevy420
    replied
    Originally posted by allquixotic View Post
    ZFS is fast enough for the people who use it. In fact, being able to take advantage of tiered storage probably results in a faster overall experience compared to having to directly write to the HDDs. Using ARC, L2ARC and ZIL when you have this kind of hardware (big HDDs + fast/small SSDs) will probably get you the highest total system performance (real-world, not microbenchmarked) for that given hardware. Obviously, for a single NVMe SSD in a laptop, a filesystem that doesn't do checksums like ext4, or one built for flash devices from the ground up like f2fs, will probably be faster.
    My 2 and 4 TB HDDs have around an 80 MB/s read speed. What does that mean in real-world use? That any file system I use will read at that speed, so I tune my tools like de/encryption, de/compression, etc. to aim for around that speed for single-disk use. ZFS gives me awesome features and it is no faster or slower than anything else on an HDD.

    It also means that a mirror and a backing SSD or two would give me damn-near SSD speeds (but I'd have to run a completely different tuning on ZFS).

    That's what makes ZFS neat and unique. I can run hardcore levels of encryption "transparently" because I'm using a slow disk and wouldn't know the difference anyway. I can't wait to be able to use Zstd:19 with my pools because it decompresses faster than my spinner's read speed.

    ZFS allows us to tune for speed or size or anywhere in between to suit one's hardware and needs. ZFS is as fast or as slow as one makes it. That's both good and bad, because it does take time to learn and it does have a lot of knobs to turn... kind of like the Linux kernel.

    All I know is that I can't tell y'all how many times ZFS has saved my game drive over the past 5 years due to power outages (and it's almost that time of year for those kinds of storms).

  • ryao
    replied
    Originally posted by oiaohm View Post

    You are missing a lot.
    https://www.jodybruchon.com/2017/03/...ot-and-raid-5/

    Let's cover some facts. Your basic hard drives and SSDs are in fact self-healing at the controller level. The horrible point is that our block layers in operating systems have not allowed us to simply access the controller-generated ECC data. Adding ZFS to an operating system does not address this weakness in the block layer; instead you end up calculating checksums basically twice. The horrible reality here is that OS block layers need a major rework to give access to information that does exist.

    Next, btrfs's own built-in RAID is marked error-prone, but that is not the only option.
    https://wiki.archlinux.org/index.php...e_RAID_and_LVM
    You still have your general operating system RAID options and other options.

    ZFS not being mainlined in the kernel does in a lot of ways increase your risk of errors, coming from the fact that the upstream kernel fixes something without considering how your ZFS file system driver will be doing things.

    This is my problem with "ZFS or nothing": normally they are not really considering the full problem at hand, and whether ZFS is really fixing the problem or just adding duplication of functionality that in fact increases the risk of data loss.
    That article is a bunch of nonsense. ZFS does not use CRCs. There are plenty of other things wrong there too, but how to ensure data integrity is the crux of things, so let’s focus on that.

    The presence of some sort of calculation at the controller is useless for ensuring integrity when a valid calculation can be sent with the wrong data. The only way to handle this is to calculate a checksum as early in the stack as possible on write, store it separately and verify it in the same place on read. Expecting the controller to do something for you is letting things happen too late.

    Whatever the controller calculates is also not what is stored on disk, or sent back with the data. It gets recalculated each time in a different way. This is a great way to get served the wrong data with a valid checksum/ECC calculation. The proper way to address it is to have checksums stored with pointers that are verified by the kernel.

    It is like getting change back when making a purchase. You can rely on the other guy to count it, or you could count it yourself to be sure that you received what you were supposed to receive. The other guy counting it never means that your count of it is redundant. That would just be blind trust that is prone to abuse.

    I also wrote a list of issues that hardware RAID controllers have and most of them apply to software RAID:

    http://open-zfs.org/wiki/Hardware#Ha...ID_controllers

    The claim that a second drive failure during a RAID 5 rebuild is statistically unlikely has been thoroughly debunked by people who did actual statistics:

    https://queue.acm.org/detail.cfm?id=1670144
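
    As a rough illustration of why rebuild-time read errors are likelier than intuition suggests (the 8 TB drive size and the 1-in-10^14 unrecoverable-read-error spec below are common datasheet-style figures used purely as an example, not numbers taken from the linked article):
    ```python
    # Chance of hitting at least one unrecoverable read error (URE) while
    # reading an entire surviving drive during a RAID 5 rebuild.
    ure_rate = 1e-14          # datasheet-style spec: 1 URE per 1e14 bits read
    drive_tb = 8              # example drive size in terabytes

    bits_read = drive_tb * 1e12 * 8          # bits that must be read back cleanly
    p_clean  = (1 - ure_rate) ** bits_read   # probability every bit reads back fine
    print(f"P(at least one URE during rebuild) ~= {1 - p_clean:.1%}")
    # ~47% for a single 8 TB drive at 1e-14; multiply the exposure by every
    # surviving drive in the array and the odds get worse still.
    ```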

    The risk of integrity issues with ZFS is lower than with in-tree filesystems, not higher.
    Last edited by ryao; 15 January 2020, 10:56 AM.

  • oiaohm
    replied
    Originally posted by LxFx View Post
    I'm mainly interested in the self-healing and "RAID" capabilities of ZFS for my personal central storage.
    If I check the Arch wiki topic for btrfs, it says that those features are unstable, contain errors, or have significant downsides...
    I would prefer the included btrfs over the license-incompatible ZFS, but one thing I don't want in an FS is it being error-prone or unstable...
    Anything I'm missing here?
    You are missing a lot.
    https://www.jodybruchon.com/2017/03/...ot-and-raid-5/

    Let's cover some facts. Your basic hard drives and SSDs are in fact self-healing at the controller level. The horrible point is that our block layers in operating systems have not allowed us to simply access the controller-generated ECC data. Adding ZFS to an operating system does not address this weakness in the block layer; instead you end up calculating checksums basically twice. The horrible reality here is that OS block layers need a major rework to give access to information that does exist.

    Next, btrfs's own built-in RAID is marked error-prone, but that is not the only option.
    https://wiki.archlinux.org/index.php...e_RAID_and_LVM
    You still have your general operating system RAID options and other options.

    ZFS not being mainlined in the kernel does in a lot of ways increase your risk of errors, coming from the fact that the upstream kernel fixes something without considering how your ZFS file system driver will be doing things.

    This is my problem with "ZFS or nothing": normally they are not really considering the full problem at hand, and whether ZFS is really fixing the problem or just adding duplication of functionality that in fact increases the risk of data loss.

  • LxFx
    replied
    Originally posted by ernstp View Post
    I've never seen the point of ZFS when we have Btrfs...
    I'm mainly interested in the self-healing and "RAID" capabilities of ZFS for my personal central storage.
    If I check the Arch wiki topic for btrfs, it says that those features are unstable, contain errors, or have significant downsides...
    I would prefer the included btrfs over the license-incompatible ZFS, but one thing I don't want in an FS is it being error-prone or unstable...
    Anything I'm missing here?

  • oiaohm
    replied
    I am not a Red Hat guy; you just don't like my answers.

    Originally posted by k1e0x View Post
    If old is good, why not just improve UFS? It has the simplest block design ever. (They actually do improve it; they added snapshot support to it recently.) The reason is that this stuff needs to be designed from the ground up, and in filesystem land that takes 10 years minimum. They can't (shouldn't) horseshoe everything else on and expect everything to be OK.
    Really, a lot of what XFS is doing is not horseshoeing more onto the file system. It is providing functions to get access to stuff hidden behind the file system, like the blocks a file is made up of, for usages other than direct I/O and so on.

    Yes, one of my questions is whether, if XFS's integration with the block layer behind it can be improved a lot and made more functional on tiered storage, something like UFS most likely can be as well. UFS under Linux has not had the backing of IBM providing servers and other things to test the file system to its limit.


    Originally posted by k1e0x View Post
    End note: there are a lot of good filesystems out there that Linux (I mean Red Hat) could use and improve on if they don't like ZFS.
    The licensing of ZFS fairly keeps it out of the mainline kernel tree; no matter whose review of the license you read, ZFS remains screwed for mainline Linux while it stays CDDL.

    Please note that Stratis was not designed to be restricted to XFS only, should a more suitable file system get into mainline Linux. A suitable file system has to be under a Linux-kernel-compatible license.


    Originally posted by k1e0x View Post
    bcachefs and HAMMER2 come to mind.. "But HAMMER is really ingrained into DragonFly." Yes, it is. So was ZFS in Solaris. They still ported it.
    bcachefs will hopefully land this year. The peer review of bcachefs for getting into the mainline Linux kernel has found many possible data-eating errors, which will be fixed before it is merged. One thing to come out of the btrfs mess was better general file system testing tools.

    Some of ZFS's issues on Linux come from the fact that it expects the Solaris block layer, which Linux does not really have. HAMMER may be a very bad fit for the same reason; porting ZFS out of Solaris is one of the things that has caused some of ZFS's performance problems.

    Originally posted by k1e0x View Post
    This is hard work and Red Hat always seems to take the easiest approach, like "fixing" the problem of the kernel OOM killer hanging the system by using a userland systemd daemon to make sure the kernel OOM killer is never called.. good job fixing that kernel, Red Hat! lol
    This problem comes about because of one particular difference between FreeBSD and Linux:
    https://www.kernel.org/doc/Documenta...mit-accounting
    Linux's overcommit is way more aggressive. One advantage of moving the OOM handling to userspace is the ability to change it on the fly without rebuilding the kernel.

    Originally posted by k1e0x View Post
    I've got to ask... do you really think you're going to end up with a good OS like this?
    Yes, because allowing this problem to exist provides the means to do something FreeBSD cannot, and that gives Linux major advantages in particular workloads.

    Originally posted by k1e0x View Post
    This is why people are using FreeBSD.. because yes, they implement things more slowly, but they take their time and make sure it's engineered right. It's intentionally designed and changes are heavily debated; Linux randomly evolves with whatever is popular at the moment and whoever gets traction first.
    This is also why FreeBSD lost the supercomputer market, has fallen behind in the web server market, and never got a foothold in the mobile market.

    Sorry, trying to make out that FreeBSD has a strong position doesn't hold up. People using FreeBSD for these workloads are a dying breed, even with the havoc systemd caused.

  • k1e0x
    replied
    Originally posted by oiaohm View Post
    (red hat guy said stuff)
    If old is good, why not just improve UFS? It has the simplest block design ever. (They actually do improve it; they added snapshot support to it recently.) The reason is that this stuff needs to be designed from the ground up, and in filesystem land that takes 10 years minimum. They can't (shouldn't) horseshoe everything else on and expect everything to be OK.

    End note: there are a lot of good filesystems out there that Linux (I mean Red Hat) could use and improve on if they don't like ZFS. bcachefs and HAMMER2 come to mind.. "But HAMMER is really ingrained into DragonFly." Yes, it is. So was ZFS in Solaris. They still ported it. This is hard work and Red Hat always seems to take the easiest approach, like "fixing" the problem of the kernel OOM killer hanging the system by using a userland systemd daemon to make sure the kernel OOM killer is never called.. good job fixing that kernel, Red Hat! lol

    I've got to ask... do you really think you're going to end up with a good OS like this? This is why people are using FreeBSD.. because yes, they implement things more slowly, but they take their time and make sure it's engineered right. It's intentionally designed and changes are heavily debated; Linux randomly evolves with whatever is popular at the moment and whoever gets traction first.
    Last edited by k1e0x; 15 January 2020, 04:27 AM.
