Linus Torvalds Doesn't Recommend Using ZFS On Linux


  • Originally posted by oiaohm View Post

    LOL. It's not like it's impossible for XFS to have a 4x mirror with a block cache in front of it in the next Linux kernel releases. Sorry, you don't have the speed advantage.

    Yes, a block cache on NVMe is also possible, so that does not give you a speed advantage. It's sad to watch these ZFS fans be out of date on benchmarks because they don't want to admit their ass is kicked in raw performance. The feature advantage is not as big as they want to make out either.
    ZFS only really loses in single-disk SSD-like scenarios. Single-disk HDDs all behave around the same unless a file system like ZFS or BTRFS+LUKS is using gzip-9, very heavy encryption, or other crazy intensive stuff.

    Used in mirrors or better with the ZIL and L2ARC on fast storage, which is what ZFS is primarily designed for, it usually does win in terms of enabled features and speed.



    • Originally posted by skeevy420 View Post

      ZFS only really loses in single-disk SSD-like scenarios. Single-disk HDDs all behave around the same unless a file system like ZFS or BTRFS+LUKS is using gzip-9, very heavy encryption, or other crazy intensive stuff.

      Used in mirrors or better with the ZIL and L2ARC on fast storage, which is what ZFS is primarily designed for, it usually does win in terms of enabled features and speed.
      ZFS is really about integrity and ease of storage management. Those are its AAA features and that is why you use it. You use it to minimize the downtime around backups, whether taking them or having to restore them in the first place. ZFS send has a big advantage over rsync in that it doesn't need to spend two hours or more calculating the delta between two storage pools; it already knows which blocks have changed and what to send. I've heard of cases where people using Bacula had their backup times exceed 24 hours, making daily backups impossible. They solved that with ZFS send.

      That being said, it's flexible enough that you can design a storage layout for IOPS and speed depending on your workflow. You can get performance competitive with any comparable product if you design the layout correctly. You can put a ZIL on Optane in FreeBSD too.



      • Originally posted by skeevy420 View Post
        ZFS only really loses in single-disk SSD-like scenarios. Single-disk HDDs all behave around the same unless a file system like ZFS or BTRFS+LUKS is using gzip-9, very heavy encryption, or other crazy intensive stuff.
        That is not exactly true. ZFS loses to XFS in single-HDD setups as well.

        Originally posted by skeevy420 View Post
        Used in mirrors or better with the ZIL and L2ARC on fast storage, which is what ZFS is primarily designed for, it usually does win in terms of enabled features and speed.
        This is what I call biased benchmarking: they show ZFS with a ZIL and L2ARC but don't give XFS any cache option at all, then claim a win.

        https://www.redhat.com/en/blog/impro...mance-dm-cache

        Yes, dm-cache, bcache and other solutions like them really do speed up XFS a lot. Mirrors plus a cache option with XFS do normally beat ZFS with a ZIL and L2ARC, at least now. There has been a recent change.

        Do notice that a warm cache under XFS works out to roughly 4x faster, which is about the same boost you get by enabling an L2ARC for ZFS, except ZFS starts off slower. So ZFS with a ZIL and L2ARC does not catch up to XFS with a cache; in fact the performance difference gets wider, not narrower, in XFS's favour. The only reason ZFS with an L2ARC wins over XFS in benchmarks is that the people benchmarking are basically not giving XFS a cache.

        So yes, you complain that normal file system benchmarks are unfair because they don't allow an L2ARC, but those attempting to sell ZFS do the reverse: they don't give XFS or any other file system any of the other caching options.

        Really, ZFS without a cache is not unfair; in fact it gives ZFS a better chance than having to face off against XFS with a cache. For a fair comparison, if you have a solid-state drive for caching you should set up every file system to use it, and a file system does not need a built-in cache feature to have a block-level cache under it.

        It's surprising to a lot of people how poorly the L2ARC and ZIL actually perform when you compare them to other cache options. The data integrity features do not come free.

        Originally posted by k1e0x View Post
        ZFS is really about integrity and ease of storage management. Those are its AAA features and that is why you use it. You use it to minimize the downtime around backups, whether taking them or having to restore them in the first place. ZFS send has a big advantage over rsync in that it doesn't need to spend two hours or more calculating the delta between two storage pools; it already knows which blocks have changed and what to send. I've heard of cases where people using Bacula had their backup times exceed 24 hours, making daily backups impossible. They solved that with ZFS send.
        ZFS send is a good feature, and there is currently no good replacement for it. But not every workload needs that integrity and replication: a PostgreSQL database with its WAL doesn't need ZFS send or file system integrity, so IO performance is more important, and its own backup system provides that stuff.

        Basically, the ZFS features that affect IO make it not the most suitable choice for particular workloads.

        Originally posted by k1e0x View Post
        You can get performance competitive with any comparable product if you design the layout correctly. You can put a ZIL on Optane in FreeBSD too.
        It's about time you stopped this lie. You can put a cache under XFS on Optane as well and see insane performance boosts. If your objective is IOPS, ZFS never wins.

        Something to wake up to: the iomap change in the Linux kernel is a major one, as it allows the VFS layer to send requests straight to the block layer if the block-mapping information from the file system has already been obtained and is in the iomap.

        Why does XFS not have data block checksumming or compression? Simple: does it make any sense when you are planning to allow the VFS layer to bypass the file system layer? Basically this model change means the file system driver is only there to process the file system metadata, so in this model compression and checksumming belong either in the VFS or in the block layer.




        • Originally posted by oiaohm View Post


          You can put any cache you want on XFS. Caching is real-world testing and that is what matters.

          It's surprising to a lot of people how poorly the L2ARC and ZIL actually perform when you compare them to other cache options.
          Really..?
          https://www.usenix.org/legacy/events...do/megiddo.pdf
          (Original paper. Look at Table VIII on page 15; the ARC nearly outperforms a tuned offline cache.)

          Bryan Cantrill did a review of that paper.. if you're the more.. audio/visual type.
          https://www.youtube.com/watch?v=F8sZRBdmqc0

          And Allan Jude did an ELI5 talk on the algorithm if you are wondering what all the numbers and symbols in the other sources mean.
          https://www.youtube.com/watch?v=1Wo3i2gkAIk

          I think the true value in the ARC algorithm is that datasets do evolve over time and it's necessary for the cache to be able to adapt to those changes. When you know how it works and how small the changes are when it adapts itself, it's really very interesting that this works so well... I would assume we would need a much more complicated algorithm to accomplish this stuff... but apparently not. The ARC is relatively simple.
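
          If you want to see how simple, here is a rough Python sketch of the policy from the paper. It is only my toy reading of the published algorithm, not the Solaris or OpenZFS code: T1/T2 hold cached keys (seen once / seen again), B1/B2 are "ghost" lists that only remember recently evicted keys, and ghost hits nudge the adaptive target p between recency and frequency.

          Code:
            from collections import OrderedDict

            class ToyARC:
                """Toy Adaptive Replacement Cache, loosely following Megiddo & Modha (FAST '03)."""

                def __init__(self, c):
                    self.c = c     # cache capacity in entries
                    self.p = 0     # adaptive target size for T1
                    self.t1, self.t2 = OrderedDict(), OrderedDict()  # cached keys: seen once / seen again
                    self.b1, self.b2 = OrderedDict(), OrderedDict()  # ghost lists: evicted keys only

                def _replace(self, hit_in_b2):
                    # Evict the LRU of T1 or T2, remembering the key in the matching ghost list.
                    if self.t1 and (len(self.t1) > self.p or (hit_in_b2 and len(self.t1) == self.p)):
                        key, _ = self.t1.popitem(last=False)
                        self.b1[key] = None
                    else:
                        key, _ = self.t2.popitem(last=False)
                        self.b2[key] = None

                def access(self, key):
                    """Touch key; return True on a cache hit."""
                    t1, t2, b1, b2, c = self.t1, self.t2, self.b1, self.b2, self.c
                    if key in t1 or key in t2:        # real hit: promote to MRU of T2
                        (t1 if key in t1 else t2).pop(key)
                        t2[key] = None
                        return True
                    if key in b1:                     # ghost hit: recency side too small, grow p
                        self.p = min(c, self.p + max(len(b2) // len(b1), 1))
                        self._replace(False)
                        b1.pop(key); t2[key] = None
                        return False
                    if key in b2:                     # ghost hit: frequency side too small, shrink p
                        self.p = max(0, self.p - max(len(b1) // len(b2), 1))
                        self._replace(True)
                        b2.pop(key); t2[key] = None
                        return False
                    # complete miss: make room, then insert at MRU of T1
                    if len(t1) + len(b1) == c:
                        if len(t1) < c:
                            b1.popitem(last=False)
                            self._replace(False)
                        else:
                            t1.popitem(last=False)
                    elif len(t1) + len(t2) + len(b1) + len(b2) >= c:
                        if len(t1) + len(t2) + len(b1) + len(b2) >= 2 * c:
                            b2.popitem(last=False)
                        self._replace(False)
                    t1[key] = None
                    return False

          Feed it a workload that mixes loops and scans and you can watch p drift toward whichever side the workload favours.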

          It's open source btw.. feel free to re-implement it.
          Last edited by k1e0x; 01-29-2020, 07:33 PM.



          • Originally posted by k1e0x View Post
            (Original paper. Look at Table VIII on page 15; the ARC nearly outperforms a tuned offline cache.)
            I am not putting it head to head with a tuned cache but with an auto-tuning cache. dm-cache and bcache are both auto-tuning.

            You need to read page 2 and pay very careful attention to the first sentence.
            https://www.usenix.org/legacy/events...do/megiddo.pdf
            We consider the problem of cache management in a demand paging scenario with uniform page sizes.
            One problem: is Linux a uniform page size system?
            https://www.youtube.com/watch?v=p5u-vbwu3Fs

            The answer is no. And Linux is going to become more and more of a non-uniform page size system. Something designed around uniform page sizes will just become more and more incompatible with Linux, and that incompatibility causes performance to degrade.

            Originally posted by k1e0x View Post
            I think the true value in the ARC algorithm is that datasets do evolve over time and it's necessary for the cache to be able to adapt to those changes. When you know how it works and how small the changes are when it adapts itself, it's really very interesting that this works so well... I would assume we would need a much more complicated algorithm to accomplish this stuff... but apparently not. The ARC is relatively simple.
            This is all nice in theory. The problem is that the ARC algorithm is not designed to adapt to a non-uniform page size system, whereas dm-cache in Linux is. Yes, dm-cache also uses a relatively simple formula. We don't need a super-complex formula, but you need to be slightly smarter than the ARC solution to deal with a non-uniform page size system.

            The iomap and memory management changes in the Linux kernel are not small ones; the design is totally counter to the way the ZFS zpool and ARC actually work. Both the iomap and the memory management changes are about making non-uniform page sizes work. So you will see more and more non-uniform page sizes: workloads with a mix of huge and non-huge pages bring out the worst in the ZFS ARC cache, and that worst case is going to become the normal case. Unless you wake up, you are in trouble. The old interfaces into the Linux kernel are not going to give the cache/block systems inside ZFS the information they actually need from the Linux memory management system to know what the hell is going on with non-uniform pages.

            By the way, everything you have referenced, k1e0x, is already old and obsolete for Linux. Worse, both of those videos were by BSD guys who are clueless about how Linux is changing.

            The problem here is that BSD/Windows/OS X have not moved to a non-uniform page size system, so those developing on BSD/Windows/OS X have not seen this problem coming. Linux is ahead of the pack in making non-uniform page sizes the normal case rather than the rarity.

            This change in Linux does raise some very interesting questions for future file system design. The concept of using one block size across the complete file system could be wrong, particularly given flash's near-zero seek times.

            Linus can see these upcoming changes that ZFS is not ready for. The ZFS developers are putting their faith in stuff that will not be forward compatible.



            • Originally posted by oiaohm View Post
              Well, I'm not sure you're right. ZFS supports many architectures so its page cache size is probably set at compile time, but it doesn't really matter, because let's say Linux desperately wants to keep ZFS out of Linux.. fine.. they will only be shooting themselves in the foot then and losing customers / developers, because people will just base their storage products on FreeBSD. All the big vendors are either FreeBSD or Solaris/Illumos based now anyhow. I think only Datto is ZoL based. People do use ZoL in smaller implementations and it's popular and a lot of people want to use it but.. if that's not possible.. FreeBSD is right there waiting to gain market share.

              It isn't really that hard to change the underlying platform.. It's the path of least resistance: use FreeBSD now.. or develop something else on Linux.. which is easier?

              What can you do? We are trying to help Linux get its big boy pants on and do real storage.. but if they want to have a temper tantrum.. : shrug : ZoL's implementation is really good, that's why FreeBSD uses it now.. it's a shame they get treated this way by the core team.

              You know... in 10-20 years we will probably end up in this really weird world where Linux is up the stack running the applications and FreeBSD is running the metal, network and storage.. just a weird thought. Might be true though.. The OSes seem to be moving that way.
              Last edited by k1e0x; 01-30-2020, 06:58 PM.



              • Originally posted by k1e0x View Post
                Well, I'm not sure you're right. ZFS supports many architectures so its page cache size is probably set at compile time, but it doesn't really matter, because let's say Linux desperately wants to keep ZFS out of Linux.
                https://www.youtube.com/watch?v=p5u-vbwu3Fs

                You need to watch that YouTube video. The title is "Large Pages in Linux".

                There is a nightmare problem at the memory management level. 64 GiB of memory in 4 KiB pages is about 16.8 million pages to keep track of. Under x86_64 you can use 4 KiB, 2 MiB, 4 MiB and 1 GiB page sizes; 2 MiB pages mean 32,768 pages to keep track of for 64 GiB, and 4 MiB pages mean 16,384 pages for 64 GiB of memory. Of course a 1 GiB page size could make sense as memory in servers increases.

                So you cannot make every page 2 MiB/4 MiB, because you will waste too much memory space. But you cannot practically make every page 4 KiB either, because then you waste a hell of a lot of processing just managing memory. Basically the Linux kernel is taking the idea from slub allocation and applying it to system-wide memory.
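
                To put numbers on that (assuming the commonly cited figure of roughly 64 bytes per struct page on x86_64), a quick back-of-the-envelope in Python:

                Code:
                  GiB = 1024 ** 3
                  RAM = 64 * GiB
                  STRUCT_PAGE = 64   # rough size of struct page on x86_64, in bytes (assumption)

                  for name, size in (("4 KiB", 4 * 1024), ("2 MiB", 2 * 1024 ** 2), ("1 GiB", GiB)):
                      pages = RAM // size                               # pages needed to cover 64 GiB
                      overhead_mib = pages * STRUCT_PAGE / (1024 ** 2)  # bookkeeping cost of those pages
                      print(f"{name}: {pages:,} pages, ~{overhead_mib:,.2f} MiB of page bookkeeping")

                Tracking 64 GiB purely as 4 KiB pages costs on the order of a gigabyte of page metadata alone, which is exactly the pressure driving the large page work.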

                The result of the Linux kernel changes is that you cannot set the page cache size at compile time as a single value any more. You don't have a single pool of memory; instead you have pools of memory based on the CPU page sizes. If the system asks the file system to provide an aligned 4 MiB page, going forward it had better be able to, so there is no extra double handling converting 4 KiB pages into 4 MiB pages.

                The reality is that the current form of ZoL may work on current Linux, but without major changes ZoL is doomed, because it does not support the memory model that more modern Linux kernels will require, which is a different beast from the operating systems ZoL was designed for.

                Originally posted by k1e0x View Post
                It isn't really that hard to change the underlying platform.. It's the path of least resistance: use FreeBSD now.. or develop something else on Linux.. which is easier?
                The path to hell is paved with good intentions. The path-of-least-resistance argument lets you ignore that the current ZFS design is fatally flawed.

                The issue is that the requirements of the underlying platform have changed since ZFS was designed.

                Originally posted by k1e0x View Post
                ZoL's implementation is really good, that's why FreeBSD uses it now.. it's a shame they get treated this way by the core team.
                The reality is that the ZoL developers have had to support FreeBSD because the core developers of FreeBSD's ZFS died out. Really, it's lucky the ZoL developers were rejected by the mainline Linux kernel, or FreeBSD would not have ZFS any more either.

                Originally posted by k1e0x View Post
                You know... in 10-20 years we will probably end up in this really weird world where Linux is up the stack running the applications and FreeBSD is running the metal, network and storage.. just a weird thought. Might be true though.. The OSes seem to be moving that way.
                FreeBSD will at some point have to address the problem the Linux kernel developers have run into. So the current form of ZFS that is a problem for Linux will be a problem for FreeBSD in the long term as well.

                The problem that has caused the Linux kernel change comes from the bare metal, and these changes will affect how file systems need to operate going forward. So Linux running on the metal with FreeBSD above it, so that your hypervisor is not killing your performance with an overly complex page table, is the way it will have to be at the moment.

                k1e0x, you really need to watch that video and take it in. The license problem is not the only problem.



                • Originally posted by oiaohm View Post

                  I did watch your video. I didn't find it all that interesting; it's a problem, yeah. I'm not sure variable is correct, but you know, we'll see. After seeing what you are talking about, no, I don't think that will affect ZFS at all.. every other filesystem, sure. ZFS, no, because ZFS implements Solaris's (famous and much imitated) slab memory allocator. I don't see why that can't run on huge pages or anything else.

                  One thing to note here is that FreeBSD, I believe, doesn't even use the slab, because their own memory manager is so close to it they didn't need to change.. however that may no longer be the case in future releases because of their transition to ZoL/OpenZFS.

                  You talk a lot about the design but you don't really seem to know the design very well. So.. here we go.

                  Back in the early 2000s, Jeff Bonwick (the same guy who wrote the slab allocator), frustrated that hard disks and storage were such a pain to manage, set out with Matt Ahrens to reinvent the wheel. The basic idea was that storage could be like RAM: a resource your computer automatically manages for you. (Since leaving Sun, Jeff has become CTO of several companies and has been rather quiet, though he does make an appearance every now and then.)

                  You talk about the traditional, classical layers and their importance; the trouble is (and you rightly identified it) that those layers are blind to each other.

                  Historical model:
                  - Loopback layer (qcow2, overlays)
                  - Posix layer (VFS, read/write)
                  - File layer (ext3, xfs etc)
                  - Logical volume layer (LVM, MD Raid, LUKS or sometimes all 3 chained together! All pretending to be a block layer to each other.)
                  - Block Layer (fixed block size, usually 4k)

                  In ZFS they ripped all that out and changed it.

                  ZFS model:
                  - ZPL (Speaks a language the OS understands, usually posix.)
                  --- Optional ZVOL Layer (Pools storage for virtual machines, iSCSI, Fibre Channel, distributed storage etc., no extra layer added on top like with qcow2)
                  - DMU (re-orders the data into transaction groups)
                  - SPA (SPA works with plugins to do LVM, Raid-Z, Compression using existing or future algorithms, Encryption, other stuff not invented yet, etc. It can even load balance devices.)
                  - Block layer (variable block size)

                  ZFS rewrote how all the layers work and changed them to be aware of each other. It actually takes blocks, bundles them up as objects in transaction groups and that is what's actually written. You can find out more about that here. https://www.youtube.com/watch?v=MsY-BafQgj4

                  Some observations on where these systems are going? Well.. this is going to be a bit of a rant, but as you know I'm a systems engineer, and I recently had to quickly add 4 IPs to an interface temporarily on Ubuntu. Have you ever really used the "ip" command? It drives me nuts: every real OS for 40 years used ifconfig, Linux changed this a few years back, and the first thing they did was not keep the old syntax, and they changed the name to something meaningless. "ip"? What about network interfaces that don't speak IP? IP is a pretty common protocol but it's not every protocol. Using this command I bungled the obtuse syntax, and it had grammar mistakes (e.g. "you are an error") and also told me that I was using a deprecated syntax and had to update my scripts? I'm not using a script, I'm typing it! Does anyone in Linux type this anymore? Looking at Ubuntu's networking, it's a tangled web of Python scripts that call other scripts. It's got four(?) different methods to configure an IP: systemd, netplan, Debian net.if and NetworkManager. And this is the work of developers. They like systems like this with rich APIs, options and YAML: "Dude, got to have some yaml or json.. but not XML, never XML, that's sooo last decade bro". This is the stuff I bleed over daily. We are the ones who have to deal with it when the code doesn't work as intended. Or has very odd behavior. Or just sucks and is slow and nobody knows why (but it works fine on the developer's laptop).

                  Know how you do it on FreeBSD?
                  ifconfig, the same way as always, and one line in rc.conf. Python isn't even installed by default! Simplicity has a lot of value.

                  Linux was deployed and put into the position it is in by sysadmins, because it was simple and they needed to solve problems. It's no longer like that, however..

                  You can see the same thing in KVM and bhyve. bhyve is almost a different class of hypervisor in that it's a few hundred kilobytes in size and does no hardware emulation (qemu) at all. KVM, well, how bloated CAN it get? At least those gamers will be able to pass through Skyrim... So if you want a really, really thin, light hypervisor, bhyve is your go-to; a FreeBSD system after install probably has fewer than 10 PIDs you actually need. Ubuntu has hundreds.

                  I do believe that in 40 years both OSes will still be around.. but they may look very different by then.
                  Last edited by k1e0x; 02-01-2020, 02:19 AM.



                  • Originally posted by k1e0x View Post
                    ZFS implements Solaris's (famous and much imitated) slab memory allocator. I don't see why that can't run on huge pages or anything else.
                    This is where you are stuffed.

                    Watch the video again: https://www.youtube.com/watch?v=p5u-vbwu3Fs "Large Pages in Linux". This is not a slab memory allocator. This is not file systems having their own allocation system. To be Linux compatible, that complete slab memory allocator has to go, replaced by the large page system.

                    Originally posted by k1e0x View Post
                    One thing to note here is that FreeBSD, I believe, doesn't even use the slab, because their own memory manager is so close to it they didn't need to change.. however that may no longer be the case in future releases because of their transition to ZoL/OpenZFS.
                    That is the start of the problem.

                    Originally posted by k1e0x View Post
                    Historical model:
                    - Loopback layer (qcow2, overlays)
                    - Posix layer (VFS, read/write)
                    - File layer (ext3, xfs etc)
                    - Logical volume layer (LVM, MD Raid, LUKS or sometimes all 3 chained together! All pretending to be a block layer to each other.)
                    - Block Layer (fixed block size, usually 4k)
                    You did not watch the video with proper attention, or you missed slide 12. First line:
                    Block Layer Already supports arbitrary size pages, thanks to merging

                    Funnily enough, so does the logical volume layer, because it sits on the Linux kernel block layer. Variable block sizes have existed in Linux all the way up to just under the file system. The problem is at the file system drivers, and iomap is the plan to fix that.

                    So this historical model does not in fact match Linux. DMA means you did not have to use a fixed block size. A HDD might have 4k blocks and you have to write aligned, but nothing prevents you from using, say, 32 or 64 KB as long as it is aligned. This feature was basically in the first Linux kernel block layer. A lot of people writing file systems on Linux brought the 4k limitation into the file system layer; with huge pages that doesn't work any more.

                    The large page work is about bringing the block layer's idea of allocation and the memory management/OS page cache into agreement. This way you can have one allocation scheme from top to bottom.

                    Originally posted by k1e0x View Post
                    In ZFS they ripped all that out and changed it.

                    ZFS model:
                    - ZPL (Speaks a language the OS understands, usually posix.)
                    --- Optional ZVOL Layer (Pools storage for virtual machines, iSCSI, Fibre Channel, distributed storage etc., no extra layer added on top like with qcow2)
                    - DMU (re-orders the data into transaction groups)
                    - SPA (SPA works with plugins to do LVM, Raid-Z, Compression using existing or future algorithms, Encryption, other stuff not invented yet, etc. It can even load balance devices.)
                    - Block layer (variable block size)

                    ZFS rewrote how all the layers work and changed them to be aware of each other. It actually takes blocks, bundles them up as objects in transaction groups and that is what's actually written.
                    Was it required to rewrite all the layers to make them aware of each other? The answer is no. Why are you bundling blocks up into objects instead of improving the page cache of the OS, as the large page work does for all file systems over time?

                    The ZPL means you must be translating. You are also ignoring the host OS big time. This becomes a huge excuse for having your own internal allocator that is not in fact aligned with the host OS.

                    You also miss that the block sizes ZFS wants to use go up to 1 MiB. Huge pages on x86 are 2-4 MiB. 1 MiB made sense on a Sun SPARC CPU and 32-bit x86, but we use 64-bit x86 these days. There are a lot of things in ZFS based on hardware we no longer use that need to be redone as well.

                    Consider this: your ZFS model is wrong on Linux.

                    Loopback
                    VFS
                    ZFS
                    -ZPL
                    ---ZVOL
                    -DMU
                    -SPA
                    -Block layer ZFS.
                    Linux Block layer.

                    I suspect FreeBSD will end up an equally bad mess. The rip-out-and-replace that was ZFS's objective under Solaris has not happened on non-Solaris systems.

                    Then those layers are also sitting on another abstraction layer. You skipped the SPL (Solaris Porting Layer), which is basically "let's keep using the Solaris API forever more". Of course, as Linux behaviour becomes less Solaris-like this is going to be an increasing problem. At some point FreeBSD will change things and have trouble as well. For example, it was particularly hard to implement the TRIM command for SSDs in ZFS on Linux; ZFS was almost a decade late getting that feature compared to other Linux file systems.

                    Originally posted by k1e0x View Post
                    Have you ever really used the "ip" command? It drives me nuts: every real OS for 40 years used ifconfig,
                    Yes I have and I have been very thankful for it.

                    https://stackoverflow.com/questions/...able-connector
                    For these kinds of problems the ip monitor feature is great.

                    With the ip command you can also do things like changing or removing the default gateway without having to disconnect and reconnect for the routing change to go live, and if you are dealing with a slightly suspect managed switch where the RADIUS messaging to activate the port is a roll of the dice, it's great to be able to make changes like that with the network card up.

                    Originally posted by k1e0x View Post
                    Linux was deployed and put into the position it is in by sysadmins, because it was simple and they needed to solve problems.
                    Really, the ip command came into existence because the POSIX-standard ifconfig command cannot do a stack of different things well. What the Linux network stack allows you to do well and truly exceeds what the FreeBSD one allows today. The fact that FreeBSD has not found itself needing to replace or massively extend ifconfig is more a sign of how far behind on features the FreeBSD network stack has got.

                    Originally posted by k1e0x View Post
                    You can see the same thing in KVM and bhyve. bhyve is almost a different class of hypervisor in that it's a few hundred kilobytes in size and does no hardware emulation (qemu) at all. KVM, well, how bloated CAN it get?
                    I get sick of FreeBSD people doing this one. https://lwn.net/Articles/658511/ In reality, if you fire up Linux KVM with kvmtool there is no hardware emulation either. I call this horrible naming: the kvm command is qemu modified to take advantage of the KVM kernel feature of Linux, while kvmtool also uses the KVM kernel feature without the qemu bits, so it is insanely lightweight.

                    So KVM in the Linux kernel is not as bloated as you want to make out. For KVM userspace you have a choice between the feature-rich kvm command based on qemu with all the hardware emulation, and the feature-poor kvmtool, which brings you back to something like bhyve with the same OS support problems.




                    • Originally posted by oiaohm View Post
                      Loopback <- not used (implemented as ZVOL)
                      VFS <- not used (implemented by ZPL)
                      ZFS <- This isn't a layer, the whole thing is ZFS..?
                      -ZPL <- ZPL is optional, things like Lustre can talk to the DMU
                      ---ZVOL <- Also optional, to provide block devices that DON'T have to go through the ZPL.
                      -DMU
                      -SPA
                      -Block layer ZFS.
                      Linux Block layer <- Not used, redundant.
                      oiaohm, just a quick response here.. You're missing the design. The entire stack is different; it doesn't sit on top of / reuse things like everything else in Linux does. (And pretty much every other OS too, Windows isn't different.) That is why it's the *historical* design. It's not a shim, it's designed to provide exactly what things need without going through unnecessary layers.

                      All you need is the DMU and the SPA and a block device.. That's it. Some applications talk directly to the DMU. The SPL has nothing to do with the IO path; it implements the slab, the ARC and other things for the DMU and SPA. The SPL is also only used on Linux, due to the need to separate out the kernel modules.. and in the latest versions it's been integrated anyhow. On FreeBSD there is no SPL. (Obviously there isn't one on Illumos either.)

                      I also know you can change the block sizes; that's age-old, like you said, and that isn't the point. The point is that ZFS is variable and does it on the fly automatically. (So a 512k write is a 512k block, a 4k write is 4k, etc.; it means it efficiently manages slack.)

                      Also, the slab is really, really good.. if it wasn't, it wouldn't have been imitated or copied by every other OS (including Linux).. be very careful redesigning this.. *most* other people got it wrong before Sun. You want to put hobbyist or millennial programmers on this? Or seasoned engineers who suffered some of the most agonizing problems and pain converting BSD to SVR5? Solaris was built out of sadness and suffering.. not "I got to make this widget then hop on Instagram".

                      ZFS implemented trim in 2013. Linux was just late to the party on that feature.
                      Last edited by k1e0x; 02-02-2020, 12:40 AM.

