Linus Torvalds Doesn't Recommend Using ZFS On Linux


  • Originally posted by allquixotic View Post
    ZFS is fast enough for the people who use it. In fact, being able to take advantage of tiered storage probably results in a faster overall experience compared to having to directly write to the HDDs. Using ARC, L2ARC and ZIL when you have this kind of hardware (big HDDs + fast/small SSDs) will probably get you the highest total system performance (real-world, not microbenchmarked) for that given hardware. Obviously, for a single NVMe SSD in a laptop, a filesystem that doesn't do checksums like ext4, or one built for flash devices from the ground up like f2fs, will probably be faster.
    My 2 and 4 TB HDDs have around an 80 MB/s read speed. What does that mean in real-world use? Any file system I use will read at that speed, so I tune my tools (de/encryption, de/compression, etc.) to aim for around that speed for single-disk use. ZFS gives me awesome features, and it is no faster or slower than anything else on an HDD.

    It also means that a mirror and a backing SSD or two would give me damn-near SSD speeds (but I'd have to run a completely different tuning on ZFS).

    That's what makes ZFS neat and unique. I can run hardcore levels of encryption "transparently" because I'm using a slow disk and wouldn't know the difference anyways. I can't wait to be able to use Zstd:19 with my pools because it decompresses faster than my spinner's read speed.
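The point about decompression outrunning a slow disk can be put into rough arithmetic. The sketch below is only a back-of-the-envelope model: the function name is made up, the compression ratio and decompressor speed are placeholder assumptions rather than benchmarks, and it ignores seeks, caching and CPU contention. Only the 80 MB/s disk figure comes from the post.

```python
def effective_read_mb_s(disk_mb_s: float, compression_ratio: float,
                        decompress_mb_s: float) -> float:
    """Rough logical MB/s seen by the application when reading compressed data.

    Hypothetical helper for illustration, not part of any ZFS tooling.
    """
    # The disk delivers compressed bytes; each one expands by compression_ratio.
    from_disk = disk_mb_s * compression_ratio
    # The decompressor caps how fast uncompressed output can be produced.
    return min(from_disk, decompress_mb_s)


# 80 MB/s spinner (from the post), assumed 2.0x ratio, assumed 500 MB/s decompressor.
compressed_pool = effective_read_mb_s(80.0, 2.0, 500.0)
# Same disk with no compression: ratio 1.0, so the disk is the only limit.
uncompressed_pool = effective_read_mb_s(80.0, 1.0, 500.0)
print(compressed_pool, uncompressed_pool)
```

As long as the decompressor is faster than the disk, the disk stays the bottleneck and compression can only raise the logical read rate in this model.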

    ZFS allows us to tune for speed or size or anywhere in between to suit one's hardware and needs. ZFS is as fast or as slow as one makes it. That's both good and bad because it does take time to learn and it does have a lot of knobs to turn... kind of like the Linux kernel.

    All I know is that I can't tell y'all how many times ZFS has saved my game drive over the past 5 years due to power outages (and it's almost that time of year for those kinds of storms).

    Comment


    • Originally posted by ryao View Post
      The presence of some sort of calculation at the controller is useless for ensuring integrity when a valid calculation can be sent with the wrong data.
      It's only useless if you cannot read/process it and know what it is.

      Originally posted by ryao View Post
      The only way to handle this is to calculate a checksum as early in the stack as possible on write, store it separately and verify it in the same place on read. Expecting the controller to do something for you is letting things happen too late.
      Having the controller do it on read means you are not wasting cpu time on a broken block.

      Originally posted by ryao View Post
      Whatever the controller calculates is also not what is stored on disk, or sent back with the data.
      The controller calculated ECC can be sent back when the block of data is read from the drive. Harddrives have more than one mode. There are some nice ones for data recovery/data protection that are not exposed by the normal OS block layers.

      Originally posted by ryao View Post
      It gets recalculated each time in a different way. This is a great way to get served the wrong data with a valid checksum/ECC calculation. The proper way to address it is to have checksums stored with pointers that are verified by the kernel.
      I did not say you would not be calculating the ECC again.

      Originally posted by ryao View Post
      It is like getting change back when making a purchase. You can rely on the other guy to count it, or you could count it yourself to be sure that you received what you were supposed to receive. The other guy counting it never means that your count of it is redundant. That would just be blind trust that is prone to abuse.
      No, what you are doing is counting the money you got back while ignoring what the register said you should get. By luck the person may miscount and still give you the value you were expecting. When you come to do your taxes, your invoice is wrong. The ECC in the drive is your invoice for what data the drive was expecting to send you.

      Originally posted by ryao View Post
      I also wrote a list of issues that hardware RAID controllers have and most of them apply to software RAID:

      http://open-zfs.org/wiki/Hardware#Ha...ID_controllers
      So, the classical random-quality hardware/software RAID controllers. Not all hardware RAID controllers are created equal: some in fact use the hard drive's ECC values; the cheaper ones don't.

      So a three-way mirror RAID can be more secure than Z-RAID if it is on one of the controllers using the harddrive controller stored ECC values and your OS is able to use the integrity checks of those ECC values after the data is transferred to RAM. So you can read the ECC value from one drive and a block from another, and they should match, right? Basically, that is how to store your parity information without in fact costing yourself any space.

      Basically you really do need to do way more homework particularly on the most resistant to failure hardware raid controllers not the random junk.

      Originally posted by ryao View Post
      The risk of integrity issues with ZFS is lower than with in-tree filesystems, not higher.
      That's not exactly true either. ZFS Z-RAID vs a 16 Linux kernel dm software RAID: both have exactly the same failure numbers. If a Linux file system is sitting on 16 dm software RAID, I am sorry, you are out of luck: you don't have the integrity advantage.

      ZFS's risk of integrity issues is low compared to a normal Linux file system alone because it is integrated with the block-layer stuff. Z-RAID is triple parity, or RAID 7, which happens to have been patented by NetApp in 2009, with the patent expiring in 2030. 16 RAID is a method to create a triple-parity-like RAID without stepping on the NetApp patent. I guess this is another reason why you cannot re-license, because without patent coverage you are screwed, right?

      So improve the Linux block layer and the integrity advantage of ZFS can go bye-bye for all file systems the Linux kernel supports. Do remember the block layer copies straight into the Linux kernel page cache when you are native to Linux, not running some alien beast like ZFS does.

      Comment


      • Originally posted by LxFx View Post

        I'm mainly interested in the selfhealing and "RAID" ZFS capabilities for my personal central storage.
        If I check the Arch topic for btrfs it says that those features are unstable, contain errors or have significant downsides...
        I would prefer the included btrfs before the license incompatible ZFS but one thing I don't want in an FS is it being error prone or unstable....
        Anything I'm missing here?
        oiaohm has a poor understanding of how file systems in general work, as ryao pointed out.

        Originally posted by oiaohm View Post

        You are missing a lot.
        https://www.jodybruchon.com/2017/03/...ot-and-raid-5/

        Let's cover some facts. Your basic hard drives and SSDs are in fact self-healing at the controller level. The horrible point is that our operating-system block layers have not allowed us to simply access the controller-generated ECC data. Adding ZFS to an operating system does not address this weakness in the block layer. Instead you end up calculating a checksum basically twice. The horrible reality here is that OS block layers need a major rework to give access to information that does exist.

        Next btrfs own built in raid is marked error prone but that is not the only option.
        https://wiki.archlinux.org/index.php...e_RAID_and_LVM
        You still have your general operating system raid options and other options.

        ZFS not having mainline kernel support does in a lot of ways increase your risk of errors, because upstream kernel developers can fix something without considering how your ZFS file system driver will be doing things.

        My problem with "ZFS or nothing" is that its proponents are normally not really considering the full problem at hand, and whether ZFS is really fixing the problem or just adding duplication of functionality that in fact increases risk of data loss.
        Who am I? I am a system engineer who has previously worked for many Fortune 500 companies as a storage engineer. I have used ZFS since its release on Solaris in 2006, on FreeBSD, and on Linux in the enterprise with real mission-critical production systems.

        Uncorrectable bit errors are an extremely low factor on most storage media... however, as disk sizes become larger, the likelihood that you will experience one grows, as the number of bits on the drive also becomes greater.

        For example, if you write only 10 blocks and each has a 0.0001% chance of failure, chances are you won't have an issue. But if you write 4m blocks, your chance is larger. This is the reason Dell has deprecated RAID5 and 6 in enterprise: having only one copy of the data, and then in 6's case 2 copies, wasn't good enough for enterprise. (Compounding this were the resilver times.)
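The arithmetic behind that example can be sketched as follows. This assumes each block fails independently with the same probability, reads the "0.0001%" figure as 1e-6 per block, and reads "4m" as four million blocks; the function name is made up for illustration. Real drive error rates are neither this uniform nor independent.

```python
import math


def p_at_least_one_error(p_block: float, n_blocks: int) -> float:
    """Probability that at least one of n independent blocks is bad.

    Computes 1 - (1 - p)^n using log1p/expm1 to stay accurate for tiny p.
    """
    return -math.expm1(n_blocks * math.log1p(-p_block))


p = 1e-6  # 0.0001% per block, the illustrative figure from the post
for n in (10, 4_000_000):
    print(f"{n:>9} blocks: {p_at_least_one_error(p, n):.6f}")
```

Under these assumptions the chance of hitting at least one bad block is about one in a hundred thousand for 10 blocks, and roughly 98% for four million blocks.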

        ZFS essentially expects the data from the disk to be wrong and calculates a checksum for every block on the disk. This is done in the kernel and is independent from the disk firmware. Due to this ZFS can and does detect errors in ordinary drives that firmware can not. (and even in expensive enterprise SAN equipment as found by CERN's LHC)

        Why don't more file systems do this? Because making a pass over the data is extremely inefficient for performance. ZFS manages to pull this off without too much of a hit when you consider what it's doing, making it competitive and *only* slightly slower than most non-checksumming filesystems, while at the same time providing much higher data integrity.

        Why isn't it done in firmware? Because firmware does not have access to a processor as fast as your CPU, and because it would add a lot of cost to the device to do it. Drive manufacturers have a mean failure rate they are comfortable with for the cost of their devices. As shown, even very expensive storage solutions get this wrong.

        Disk manufacturers publish their failure rates for devices. ZFS also has published data integrity data by LLNL. In a 100 petabyte array consisting of 30,000 drives, ZFS managed to have 0 uncorrectable disk errors in over 10 years. Zero. Making ZFS the gold standard in data integrity.

        You can learn more about this design from this talk.
        https://www.youtube.com/watch?v=NRoUC9P1PmA
        Also, here is an analysis of APFS and its lack of checksumming. http://dtrace.org/blogs/ahl/2016/06/...rt5/#apfs-data

        So why does oiaohm say ZFS checksumming each block for integrity before passing it to the application "in fact increases risk of data loss"? .. idk he's just an idiot shill. A checksum is the only way to discover this, and the closer that checksum on the data is calculated to the application that wrote it in the data chain, the better. The more layers and abstractions you go through (i.e. firmware) before calculating it, the more can go wrong.
        Last edited by k1e0x; 15 January 2020, 08:45 PM.

        Comment


        • Originally posted by oiaohm View Post
          It's only useless if you cannot read/process it and know what it is.



          Having the controller do it on read means you are not wasting cpu time on a broken block.



          The controller calculated ECC can be sent back when the block of data is read from the drive. Harddrives have more than one mode. There are some nice ones for data recovery/data protection that are not exposed by the normal OS block layers.



          I did not say you would not be calculating the ECC again.



          No, what you are doing is counting the money you got back while ignoring what the register said you should get. By luck the person may miscount and still give you the value you were expecting. When you come to do your taxes, your invoice is wrong. The ECC in the drive is your invoice for what data the drive was expecting to send you.



          So, the classical random-quality hardware/software RAID controllers. Not all hardware RAID controllers are created equal: some in fact use the hard drive's ECC values; the cheaper ones don't.

          So a three-way mirror RAID can be more secure than Z-RAID if it is on one of the controllers using the harddrive controller stored ECC values and your OS is able to use the integrity checks of those ECC values after the data is transferred to RAM. So you can read the ECC value from one drive and a block from another, and they should match, right? Basically, that is how to store your parity information without in fact costing yourself any space.

          Basically you really do need to do way more homework particularly on the most resistant to failure hardware raid controllers not the random junk.



          That's not exactly true either. ZFS Z-RAID vs a 16 Linux kernel dm software RAID: both have exactly the same failure numbers. If a Linux file system is sitting on 16 dm software RAID, I am sorry, you are out of luck: you don't have the integrity advantage.

          ZFS's risk of integrity issues is low compared to a normal Linux file system alone because it is integrated with the block-layer stuff. Z-RAID is triple parity, or RAID 7, which happens to have been patented by NetApp in 2009, with the patent expiring in 2030. 16 RAID is a method to create a triple-parity-like RAID without stepping on the NetApp patent. I guess this is another reason why you cannot re-license, because without patent coverage you are screwed, right?

          So improve the Linux block layer and the integrity advantage of ZFS can go bye-bye for all file systems the Linux kernel supports. Do remember the block layer copies straight into the Linux kernel page cache when you are native to Linux, not running some alien beast like ZFS does.
          Just about everything you said is wrong. The most fundamental issue is that you do not seem to understand the end to end principle:

          https://web.mit.edu/Saltzer/www/publ...d/endtoend.pdf

          You also don’t seem to know how the hardware works, because things like “one of the controllers using the harddrive controller stored ECC values” make no sense. That data is never sent from the hard drive. It is possible to have extra data stored with the sector, like what NetApp does, but this fails to provide adequate protection from a write operation being done to the wrong sector.

          There is no way to improve the Linux block layer to be as good as using ZFS. That is why Chris Mason made btrfs.

          There are failure modes that RAID 5 has that RAID-Z lacks such as rendering all data inaccessible when a RAID 5 rebuild operation encounters corruption that makes recalculating the result mathematically impossible. ZFS would simply report the damaged data while rebuilding the rest.
          Last edited by ryao; 15 January 2020, 07:42 PM.

          Comment


          • Yes, using a larger sector size and placing the checksum next to the block is bad because it can't protect from phantom reads/writes or from writes to or reads from the wrong sector. ZFS solves this problem and can chain-verify itself down the entire tree of blocks.
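A toy model may make the difference concrete. The sketch below is not ZFS code: the `Disk` class, the `misdirect_to` knob that simulates a misdirected write, and the dictionary standing in for a block pointer are all invented for illustration, and SHA-256 is just a convenient stand-in checksum. It compares a checksum stored next to the data with a checksum held in the parent's pointer.

```python
import hashlib


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Disk:
    """Toy disk mapping sector number -> bytes (hypothetical, for illustration)."""

    def __init__(self):
        self.sectors = {}

    def write(self, sector, payload, misdirect_to=None):
        # misdirect_to simulates firmware silently writing to the wrong sector.
        target = sector if misdirect_to is None else misdirect_to
        self.sectors[target] = payload

    def read(self, sector):
        return self.sectors.get(sector, b"")


def write_inline(disk, sector, data, **kw):
    # Scheme A: the checksum lives in the same sector as the data.
    disk.write(sector, digest(data) + data, **kw)


def read_inline(disk, sector):
    raw = disk.read(sector)
    stored, data = raw[:32], raw[32:]
    return data, stored == digest(data)


def write_pointer(disk, sector, data, **kw):
    # Scheme B: the checksum is returned to the caller and kept in the
    # parent's pointer, away from the data it describes.
    disk.write(sector, data, **kw)
    return {"sector": sector, "checksum": digest(data)}


def read_pointer(disk, ptr):
    data = disk.read(ptr["sector"])
    return data, digest(data) == ptr["checksum"]


# Scheme A: the new write meant for sector 7 lands on sector 9 instead.
a = Disk()
write_inline(a, 7, b"old contents")
write_inline(a, 7, b"new contents", misdirect_to=9)
inline_data, inline_ok = read_inline(a, 7)

# Scheme B: same fault, but the parent pointer holds the expected checksum.
b = Disk()
write_pointer(b, 7, b"old contents")
ptr = write_pointer(b, 7, b"new contents", misdirect_to=9)
pointer_data, pointer_ok = read_pointer(b, ptr)

print(inline_data, inline_ok)    # stale data that still verifies
print(pointer_data, pointer_ok)  # stale data, mismatch detected
```

In scheme A the stale sector still carries a checksum that matches its own stale contents, so the read passes verification. In scheme B the parent expects the checksum of the new data, so the stale sector is flagged. Storing each pointer's checksum in its own parent, all the way up, gives the tree-wide verification described above.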

            And again sorry oiaohm.. Linux dm can't tell if data on one drive is correct from data on another as it has no checksum on the data. Corruption on one side of a mirror can't be compared to the other. All it can do is read. In ZFS's case it reads, compares the checksum and then can take action if it's wrong such as looking in the array for another copy of the data and even going back and fixing the bad block. That is what it means by self healing.
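The read, compare and repair loop described here can be sketched roughly as follows. This is a simplified model rather than how ZFS is implemented: each mirror side is a plain Python list of blocks, the expected checksum is passed in by the caller (standing in for the value held in the parent block pointer), and the function name is made up.

```python
import hashlib


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def read_self_healing(mirror, index, expected):
    """Return (data, repaired_sides) for block `index` of a toy mirror.

    mirror   -- list of sides, each a list of bytes blocks (hypothetical layout)
    expected -- checksum the caller trusts for this block
    """
    failed = []
    for side, blocks in enumerate(mirror):
        data = blocks[index]
        if digest(data) == expected:
            # Found a good copy: rewrite every side that failed verification.
            for bad in failed:
                mirror[bad][index] = data
            return data, failed
        failed.append(side)
    raise IOError(f"block {index}: no copy matches the expected checksum")


side_a = [b"block0", b"block1"]
side_b = [b"block0", b"block1"]
checksums = [digest(blk) for blk in side_a]

side_a[1] = b"garbage"  # simulate silent corruption on one side of the mirror
data, repaired = read_self_healing([side_a, side_b], 1, checksums[1])
print(data, repaired, side_a[1])
```

Without the `expected` checksum there would be nothing to say which of the two differing copies is the good one, which is the limitation of a plain mirror pointed out above.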

            Linux DM has proposed adding this, but its logic for the proposal is pretty bonkers. They write the block out, then wait 5 seconds for the compression to finish, then calculate the checksum.. whoa.. no thanks. ZFS does it in RAM before the write. (This is what I mean by shoehorning and duct-taping on features.) In ZFS, writing compressed data takes less time than uncompressed.. because well... you're writing less data! Simple. In Linux DM, writing less data takes 5 seconds more time?? How does that logic work? Did it double-write the data? How did they determine 5 seconds was right? It sounds like an ass-pull number. Ick ick.. just no.. no.. thanks for trying.

            XFS has checksums also, but only on metadata. They say it's too slow to do it on data blocks. (A problem ZFS's "terrible" design allowed them to solve.. if only XFS's design was as bad as ZFS's, they could have data block checksums.)
            Last edited by k1e0x; 15 January 2020, 09:19 PM.

            Comment


            • Originally posted by oiaohm View Post

              https://web.archive.org/web/20060816...PL_redline.pdf
              the problem in 3.1 of CDDL is.
              be distributed only under the terms of this License.

              This is straight from MPL 1.1 and is in CDDL 1.0 as is. This means CDDL cannot be over-licensed with any license, just like MPL 1.1 does not have to be GPL. Let's say at some point Oracle released a CDDL 1.1: you could not over-license CDDL 1.0 with CDDL 1.1 no matter how you worded CDDL 1.1. Yet MPL 1.1 could be over-licensed by a newer version of the license because it contained
              or a future version of this License released under Section 6.1

              This was removed from CDDL. So the CDDL issue is not simple to fix. The fix is to throw the complete code base away and rewrite from scratch. Everything tainted by CDDL is not fixable legally.

              That "distributed only under the terms of this License" blocks the normal thing under copyright where you can license your own work under as many licenses as you like. Yes, even if you ask every developer in a CDDL work for permission to re-license and they say yes, the CDDL license still directly forbids this. CDDL is many levels worse than GPL in this regard: you can re-license a GPL work if you can get all the developers to agree. This is most likely why Oracle's legal department does not answer on re-licensing or allowing it into the Linux kernel: you most likely cannot ever do so legally. Yes, CDDL is a viral license on a completely different level: once you put something under CDDL, it's under CDDL forever and nothing else.
              The CDDL still has the terms allowing Sun (now Oracle) to publish a new version, and for anyone to then choose to use that version. See section 4. There is no notice in the ZFS source specifically prohibiting use of newer versions. If Oracle says CDDL 2.0 is a copy of GPLv2, it's done.

              You're also missing my point. No-one denies that the two licenses are currently incompatible. What is controversial is whether Sun specifically intended to make it incompatible with Linux, or it just turned out that way because they weren't really considering that (remember at the time they were just open-sourcing the Solaris operating system).

              Comment


              • Originally posted by k1e0x View Post
                And again sorry oiaohm.. Linux dm can't tell if data on one drive is correct from data on another as it has no checksum on the data. Corruption on one side of a mirror can't be compared to the other. All it can do is read. In ZFS's case it reads, compares the checksum and then can take action if it's wrong such as looking in the array for another copy of the data and even going back and fixing the bad block. That is what it means by self healing.
                This aspect should actually be emphasized a lot more as it is a crucial advantage of checksumming in the filesystem. I'd say more so than bitrot even. Having a mirror doesn't help if you don't know which side has the correct data any more. On regular mirrors or RAID if you ever get out of sync on which drive had the failure your entire array is basically gone.

                E.g., I've had a case where a botched Windows install overwrote part of one of the drives in my storage zpool. If that had been a regular RAID, I'd have had to sweat bullets to make sure I kept track of exactly which drive needed to be rebuilt. With raid-z I just booted back into Linux and scrubbed the pool, and the filesystem handled everything.
                Last edited by nivedita; 16 January 2020, 06:45 PM. Reason: Added example

                Comment


                • Originally posted by nivedita View Post
                  No-one denies that the two licenses are currently incompatible.
                  I do. Other people do too. It's complicated. But you have to take into account GPLv2 clause 0 being void in the US and EU, and the ability to comply with both licenses at the same time, meaning one can follow the terms of both licenses and in doing so does not cause harm to either license. That is Canonical's opinion. (And they aren't the only ones; several university law professors have agreed with them.)

                  The FSF and Red Hat take a different opinion (however, it is rumored that privately there is some dissent in that camp, and that Stallman and his underlings have been in dispute about this).

                  As read, the licenses are incompatible. In actual effective practice.. maybe not.

                  Regardless, that is moot because it's not in tree, and OpenZFS doesn't really want to put it in tree even if Linus gave the OK. ZFS doesn't cater only to Linux; they support 5 OSes, but they do want it to work well on Linux.

                  Ars Technica picked this up btw.
                  https://arstechnica.com/gadgets/2020...straight-dope/
                  "Linus should avoid authoritative statements about projects he's unfamiliar with."

                  I think they are being generous to Linus. "Don't worry people, data integrity is a buzzword and it's not even as fast! You don't need to know your data is *right*, you'd lose 5-10% of your performance that way!"
                  Last edited by k1e0x; 16 January 2020, 07:46 PM.

                  Comment


                  • Originally posted by nivedita View Post
                    The CDDL still has the terms allowing Sun (now Oracle) to publish a new version, and for anyone to then choose to use that version. See section 4. There is no notice in the ZFS source specifically prohibiting use of newer versions. If Oracle says CDDL 2.0 is a copy of GPLv2, it's done.
                    All new OpenZFS code has a CDDL version pin to prevent Oracle from being able to change the license terms.

                    Comment


                    • Originally posted by ryao View Post

                      The GPLv2 has no patent grant, so code under it is far more vulnerable than code under the CDDL. Being under the GPLv2 does not make it more safe. If anything, it is less safe. That is why the GPLv3 was written.
                      Disagree.

                      Obviously patents are how they'd attack the GPL code, but I think it's very clear that would be a tougher argument to make than the straightforward copyright claim they'd have with the ZFS code.

                      In the end, it's just a matter of degree though. I agree they absolutely could drag the GPL code into court if they wanted to.

                      Comment
