Linus Torvalds Doesn't Recommend Using ZFS On Linux


  • drSeehas
    replied
    Originally posted by unis_torvalds View Post
    ... FreeNAS, a fork of OpenBSD ...
    FreeNAS is a fork of FreeBSD.



  • rodne1379
    replied
    Originally posted by Securitex View Post
    Whether Oracle or Microsoft (members of the Linux Foundation, who pay the Linux Foundation and hence pay Linus's salary) like it or not, Canonical will be their competitor.
    lol nice observation



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    FreeBSD is apparently more comfortable with the CDDL in base than with the GPL (going by their total removal of GCC). I probably would be too. License options are a good thing for the world. (unless you're a GNU'er)
    The license is only one problem. FreeBSD being more CDDL-compatible does not fix the second problem with ZoL, which is its design.

    Originally posted by aht0 View Post
    Can you explain a bit more what you mean here?
    Simple question, aht0; answering it also answers k1e0x's question.

    https://github.com/zfsonlinux/zfs/bl...ule/zfs/dbuf.c

    This file is a pure example of the design problem. Back in Linux kernel 2.4.10, in 2001, Linux started getting rid of the block-layer cache.

    https://www.oreilly.com/library/view...2/ch15s02.html

    Yes, it is noted back in that old book. The iomap work is finishing the job by removing the operations from the filesystem layer that don't use the Linux page cache effectively, and the large-pages work is finishing off the page cache, so there is no reason for a block cache to exist.

    Solaris has both a page cache and a block cache, and so does Illumos.

    Yes, FreeBSD has a block cache between the disk and the filesystem as well.
    https://wiki.freebsd.org/BasicVfsConcepts
    But as noted there, the design is that the VM page cache (the FreeBSD equivalent of the Linux page cache) shares memory pages with the block cache.

    Basically, Linux and FreeBSD have taken two different routes to eliminate the memory duplication between the page cache and the block cache. Linux has gone a step further and reduced the page cache and block cache down to one single cache: the Linux page cache.

    That dbuf.c thing really has no place on FreeBSD or Linux the way it is currently designed. Yes, it is horrible: the correct design there requires platform-specific code that takes advantage of how FreeBSD reduced duplication between its block and page caches, or of the fact that Linux has only a single cache from the block devices through the filesystems to the VFS layer.

    The concept of ZFS being able to have its own unique, OS-independent stack when it comes to handling caching and memory has to end.

    There are other places I could go digging into in ZoL that really have no place on FreeBSD or Linux. Maybe they make sense on OS X, Windows and Illumos, but they don't make sense at all on Linux and FreeBSD.
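
    To make the cost concrete: an in-tree Linux filesystem normally just wires its read path into the page cache (.read_iter = generic_file_read_iter) and is done, while a dbuf.c-style private buffer layer means data gets staged once more and copied again. Below is a minimal user-space analogue of that extra staging copy, a sketch only: the file name, buffer sizes and function names are made up for illustration, and this is not ZFS or kernel code.

        /* Two ways to get data to the caller: straight in (one pass over the
         * data) versus staged through a private cache buffer first (two
         * passes, i.e. the extra memory traffic a second cache layer costs). */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        #define CHUNK (1 << 20)

        /* "Integrated" design: read directly into the destination buffer. */
        static ssize_t read_integrated(int fd, char *dst)
        {
            return pread(fd, dst, CHUNK, 0);
        }

        /* "Own cache" design: stage in a private buffer, then copy again. */
        static ssize_t read_via_private_cache(int fd, char *dst, char *cache)
        {
            ssize_t n = pread(fd, cache, CHUNK, 0);
            if (n > 0)
                memcpy(dst, cache, (size_t)n);  /* the duplicated work */
            return n;
        }

        int main(void)
        {
            int fd = open("/etc/hostname", O_RDONLY);  /* any readable file */
            char *dst = malloc(CHUNK), *cache = malloc(CHUNK);

            if (fd < 0 || !dst || !cache)
                return 1;
            printf("integrated:    %zd bytes\n", read_integrated(fd, dst));
            printf("private cache: %zd bytes\n", read_via_private_cache(fd, dst, cache));
            free(dst);
            free(cache);
            close(fd);
            return 0;
        }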


    The memory bandwidth limit that modern hardware hits is why the de-duplication between the VM page cache and the block cache that FreeBSD does, or Linux simply reducing everything down to one single cache, is there: to reduce memory operations. Reducing memory operations helps when your problem is not having enough memory bandwidth to go around. Modern hardware's PCIe has a higher transfer rate than your total memory bandwidth.

    When ZFS was designed, the problem was the other way around. In the world ZFS was designed in, the bandwidth from storage media to the MMU was lower than the MMU's transfer speed to RAM. The world we are in now is inverted compared to when ZFS was designed, so there are a lot of poor design choices that have to go. Mind you, those poor design choices could be removed without changing the ZFS on-disk format, but the one thing it should do is kill the generic OS abstraction layer as a valid idea.
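
    Some rough back-of-envelope numbers for why this bandwidth inversion matters (approximate, per-direction figures):

        PCIe 4.0 is roughly 2 GB/s per lane, so 128 lanes on a single-socket Epyc is roughly 256 GB/s of I/O bandwidth.
        8-channel DDR4-3200 is 8 x 25.6 GB/s, roughly 205 GB/s of theoretical memory bandwidth.
        Every extra buffer copy costs a read plus a write, so one duplicated cache layer roughly doubles the memory traffic per byte of I/O.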
    Last edited by oiaohm; 02-06-2020, 11:03 AM.



  • aht0
    replied
    Originally posted by oiaohm View Post
    Yes, they may use a different approach than Linux does. But the options for dealing with limited CPU and limited memory bandwidth are also limited. Any form of abstraction layer causing memory duplication cannot be tolerated, no matter how much coding or legal effort is required to fix it. Yes, this includes possibly having to rewrite large sections of ZFS under a properly GPL-compatible license for Linux, or bringing the ZFS-for-FreeBSD project back from the dead.

    The reality here is that what was the easy way out for ZoL in the past does not work going forward.
    Can you explain a bit more what you mean here?



  • k1e0x
    replied
    Originally posted by oiaohm View Post
    Yes, this includes possibly having to rewrite large sections of ZFS under a properly GPL-compatible license for Linux, or bringing the ZFS-for-FreeBSD project back from the dead.
    FreeBSD is apparently more comfortable with the CDDL in base than with the GPL (going by their total removal of GCC). I probably would be too. License options are a good thing for the world. (unless you're a GNU'er)



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    oiaohm, iomap sounds like it's a re-implementation of netmap. netmap didn't replace the TCP/IP stack. As I said before, it's going to take a lot of work to do this..
    The problem here is the way the Epyc servers and the old lightweight NAS boxes are built. Zero-copy stuff is a lot of work to implement, but it is work that has to be done no matter what, due to the requirements future hardware is throwing up.

    Originally posted by k1e0x View Post
    what makes you think FreeBSD and Illumos won't do this work and won't add it to ZFS?
    These are two different problems. Illumos runs ZFS without an abstraction layer, so Illumos has a chance of fixing this. If the ZoL developers treat FreeBSD the same way they have treated the Linux kernel, by not integrating properly, then ZFS on FreeBSD going forward will be just as dysfunctional as ZoL on Linux is now.

    Originally posted by k1e0x View Post
    Apple that wants to get back into the server game and there are rumors of macOS server becoming a thing again..
    https://en.wikipedia.org/wiki/MacOS_Server

    With macOS Server there are always rumours that it will become a thing again, right before another release where it does not.

    Originally posted by k1e0x View Post
    I somehow believe they will take a different approach than Linux does.
    Yes, they may use a different approach than Linux does. But the options for dealing with limited CPU and limited memory bandwidth are also limited. Any form of abstraction layer causing memory duplication cannot be tolerated, no matter how much coding or legal effort is required to fix it. Yes, this includes possibly having to rewrite large sections of ZFS under a properly GPL-compatible license for Linux, or bringing the ZFS-for-FreeBSD project back from the dead.

    The reality here is that what was the easy way out for ZoL in the past does not work going forward.



  • k1e0x
    replied
    oiaohm, iomap sounds like it's a re-implementation of netmap. netmap didn't replace the TCP/IP stack. As I said before, it's going to take a lot of work to do this.. what makes you think FreeBSD and Illumos won't do this work and won't add it to ZFS? ZFS is the default on Illumos and, though not the default in FreeBSD (not yet at least), it's extremely common. Saying de facto default might not be a stretch.

    I'd imagine those two Unix OSes want to solve this too, and will. There is also Apple, which wants to get back into the server game, and there are rumors of macOS Server becoming a thing again.. Whether they do or don't doesn't really matter, but they also have an open-source Unix kernel and need to solve this. I somehow believe they will take a different approach than Linux does.
    Last edited by k1e0x; 02-06-2020, 12:17 AM.



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    I don't think this has anything to do with filesystems per se. It's a kernel OS limitation.. you just can't move data through the kernel fast enough. All general-purpose OSes have this problem.
    I will give you that this is normally not a filesystem problem, up to a point; the iomap work is partly there to deal with the problem in XFS and other Linux filesystems where they were doing memory duplication.

    But an item like ZFS is a big exception to the rule. ZFS, as you stated, does not bother integrating with how the kernel is doing things. The reality is that the kernel on this modern storage hardware is unable to move the data fast enough. Heck, even if you had a system purely running from L1 cache doing DMA transfers in and out of memory, the PCIe bandwidth can still outstrip your MMU transfer speeds, so even in a theoretically perfect world you are still screwed; of course an OS cannot be perfect, because it has to be doing other things.

    Having ZFS basically be its own OS kernel inside the OS kernel, with its own memory management, absolutely does not help, particularly given the reality that you are basically out of memory operations. Think of it this way: on a storage server you are now always behind on memory operations, and the more memory operations you need to do, the further behind you get; you are never catching up. Zero-copy operations have to become the norm, not the exception. The OS kernel can only do so much for zero-copy operation, and a filesystem like ZoL doing its own thing does not play along with the kernel improvements that increase zero-copy operation.

    So the hardware design means that no matter how you design your kernel, there is now no way to have your in-memory operations be faster than your PCIe transfers. This is problem one. Problem two is next.

    Originally posted by k1e0x View Post
    and kqueue is FreeBSD's poll method.. so.. there you go again : shrug :
    https://events.static.linuxfound.org...17-final_0.pdf

    I was not thinking of FreeBSD; I was reading kqueue as a kernel queue, not as a FreeBSD feature. To work around the lost IRQ requests on the AMD Epyc, Wendell was using something that Western Digital added to the Linux kernel.

    PCI/PCIe is an ass. Using IRQs brings your CPU load down, but per the PCI/PCIe specification, in an IRQ storm event, as in devices sending you too many interrupts to process, interrupts are randomly lost. Yes, that is part of the PCI/PCIe specification. Next bit of evil: if you poll a device that is set to send an IRQ/interrupt when ready, the PCI/PCIe specification kicks in again; by specification the device is now not to send the interrupt.

    So the kqueue and epoll level does not help you; this is in your OS block layer. Epyc CPUs have so much PCIe bandwidth that in theory you could run the complete system by polling and not run out of PCIe bandwidth; of course you would then not have any CPU time to do anything else.

    Of course this is all bad enough. Now there is a final bit of evil in the PCI/PCIe specification: if a device sends an IRQ and does not get an answer in a reasonable time frame, be it because the IRQ was lost or just because the system is insanely busy, the device is to reset, so now the device has stalled out on you.

    It's good that Western Digital gave Linux a hybrid mode between IRQs and polling, otherwise these storage servers would be in way worse trouble. Yes, they are still in trouble, but less than they would have been otherwise.

    Please note I am writing PCI/PCIe: these problems have been in the specification since the first version of PCI, but we have not been running into them because we did not have a large enough imbalance between the amount of PCI/PCIe transfer and the MMU's means to process it to cause any major volume of PCI/PCIe message loss as defined in the specification. AMD has nicely given us this Epyc CPU that truly lives up to its name by being epic at pushing the PCI/PCIe specification right up to its breaking limit and then some.

    Do note that polling devices, because you cannot depend on IRQs, means you have less CPU time. This is before you attempt to run the ARC cache or any of those fancy ZFS features.

    There is a reason why I said we need to look at the storage server not running data validation, with the client over the network running data validation instead: if things keep going this way, there will be less and less usable CPU time on the storage server itself once you connect up a massive number of storage devices.

    The world that ZFS was designed in is disappearing. Yes, the PCIe system hits you twice: lack of CPU time and lack of MMU transfer speed.
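
    For reference, this is roughly what a polled completion looks like from userspace with the stock Linux interfaces, a minimal sketch only: /dev/nvme0n1 is just an example device, polling has to be enabled on its queue, and RWF_HIPRI needs O_DIRECT with an aligned buffer. This is the plain polled path, not the exact hybrid setup Wendell used.

        /* Polled read: RWF_HIPRI asks the kernel to poll the device for
         * completion instead of sleeping on the IRQ. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/uio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* example device */
            void *buf;

            if (fd < 0) {
                perror("open");
                return 1;
            }
            if (posix_memalign(&buf, 4096, 4096))  /* O_DIRECT wants alignment */
                return 1;

            struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
            ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);  /* poll, don't wait for the IRQ */
            if (n < 0)
                perror("preadv2");
            else
                printf("read %zd bytes via polled I/O\n", n);

            free(buf);
            close(fd);
            return 0;
        }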



  • k1e0x
    replied
    I don't think this has anything to do with filesystems per se. It's a kernel OS limitation.. you just can't move data through the kernel fast enough. All general-purpose OSes have this problem.

    and kqueue is FreeBSD's poll method.. so.. there you go again : shrug :



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    I've seen that, ya, that is an extreme problem; it's more so with the OS itself tho.. Netflix had the same issue with SSL at 100Gb. It will take serious OS engineering to solve this, not Linux "throw it at the wall and see if it sticks" engineering.
    Others using Linux did find a solution to the 100Gb SSL problem.

    Originally posted by k1e0x View Post
    But yes, ZFS only works if the CPU is x times faster than the disk. If it isn't, you need something else.
    That is the problem: that is not going to remain a workable assumption.

    Originally posted by k1e0x View Post
    Generally that is true and it's been true for a long time. I don't see magnetic storage going away so I think things are fine.
    Magnetic storage can cause the same nightmare. Note that all those NVMe drives in that video are PCIe-connected, so every two of those could instead become one SAS port at 12Gb/s multiplexed out to drives. So the CPU still gets stomped cleanly into the ground.

    Originally posted by k1e0x View Post
    I don't think storage at that level really is practical economically .. but some people clearly have a use case here.. and we'll have to find a solution for them.
    Storage is not exactly the problem. You get stomped into the ground because we have more PCIe lanes, with the possibility of transferring more data than the CPU can safely handle.

    Originally posted by k1e0x View Post
    I'm curious what Wendell did with polling (kqueue tends to way outperform epoll)
    Yes, kqueue is faster as long as the interrupts are not getting stomped into the ground by the massive flow of data. What Wendell did was use polling and kqueue in combination: if kqueue has not received an interrupt within X time, do a poll to see whether the interrupt has disappeared or not. Yes, this massive data-flow problem means that if you don't pick up an interrupt in time, another one from a different device comes in and overwrites the information. What Wendell did prevents having to reset drives/controllers, which would stall everything to death.

    Basically your CPU is being stomped into the ground, and you have to deal with the fact that it is being stomped into the ground, so things like interrupts backed by PCIe are no longer dependable.
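
    For reference, that hybrid mode lives in the stock Linux block layer as per-queue sysfs knobs, sketched below assuming a device named nvme0n1; see the kernel's queue-sysfs documentation for the exact semantics and units.

        /sys/block/nvme0n1/queue/io_poll         1    allow polled completions on this queue
        /sys/block/nvme0n1/queue/io_poll_delay  -1    classic busy polling
        /sys/block/nvme0n1/queue/io_poll_delay   0    hybrid: sleep about half the expected completion time, then poll
        /sys/block/nvme0n1/queue/io_poll_delay   N>0  hybrid with a fixed sleep before polling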

    Originally posted by k1e0x View Post
    Jeff Bonwick (ZFS creator) actually was trying to solve that problem.. He was working on 3d flash raid implementations.. DSSD I think was the company and they were sold to Dell. Dell dropped it so... wonder what Bonwick is up to now. The world needs him to solve storage (again). lol
    You really missed it: this problem is system-wide. The same thing can happen if you connect up a lot of accelerators. Basically, modern larger server systems have way, way too much PCIe bandwidth, and this is only going to get worse when systems move from PCIe 4 to PCIe 5/6.

    With this problem a storage specialist is basically useless; fixing it takes a memory-management specialist and a specialist in PCIe's evils. The requirement is that those implementing filesystems also don't attempt to do their own thing with memory management, as duplication in memory makes your lack of memory bandwidth even worse.

    It is bad enough dealing with the massive wave of data the PCIe lanes in these modern systems allow, without third-party filesystems like ZFS doing things that presume the CPU is X times faster than the disk.

    Remember, with AMD's Epyc CPUs, instead of what LTT attempted with a 24-core and then a step up to a 32-core, you could have a 12- or 16-core chip with the exact same number of PCIe 4.0 lanes and half the memory bandwidth again. This is your poor low-end storage server based on AMD chips, perfectly designed to be mega-stomped by the drives.

    So having the CPU X times faster than the disks is not always the case any more. It is now possible, even with old-school HDDs, to have the disks X times faster than the CPU.


    The assumption ZFS was designed around has basically been thrown upside down by AMD, and now we have to deal with it.

