Linus Torvalds Doesn't Recommend Using ZFS On Linux
Originally posted by Securitex: Whether Oracle or Microsoft (members of the Linux Foundation, who pay the Linux Foundation and hence pay Linus's salary) like it or not, Canonical will be their competitor.
Originally posted by k1e0x: FreeBSD is apparently more comfortable with the CDDL in base than the GPL (witness their total removal of GCC). I probably would be too. License options are a good thing for the world (unless you're a GNU'er).
Originally posted by aht0: Can you explain a bit more what you mean here?
https://github.com/zfsonlinux/zfs/bl...ule/zfs/dbuf.c
This is a pure example of the design problem. Back in kernel 2.4.10, in 2001, Linux started getting rid of the block-layer cache.
https://www.oreilly.com/library/view...2/ch15s02.html
Yes, it was noted back in that old book. Iomap is finishing the job by removing the operations from the file-system layer that don't effectively use the Linux page cache, and the large-pages work is finishing off the page-cache side, so there is no reason for a block cache to exist.
Solaris has a page cache and a block cache, and so does Illumos. Yes, FreeBSD has a block cache between the disk and the file system as well.
https://wiki.freebsd.org/BasicVfsConcepts
But as noted there, FreeBSD's design has the VM page cache (the FreeBSD equivalent of the Linux page cache) share memory pages with the block cache.
Basically, Linux and FreeBSD have taken two different routes to eliminate the memory duplication between the page cache and the block cache. Linux has gone a step further and reduced the page cache and block cache down to a single cache: the Linux page cache.
That dbuf.c thing really has no place on FreeBSD or Linux the way it is currently designed. Yes, it's horrible: the correct design there requires platform-specific code to take advantage of how FreeBSD reduced duplication between its block and page caches, or of the fact that Linux has only a single cache from block devices through file systems to the VFS layer.
The concept of ZFS having its own unique, OS-independent stack for handling caching and memory has to end.
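To make the duplication concrete, here is a minimal userspace sketch (my illustration, not ZFS or kernel code): a buffered read() populates the kernel page cache and also fills a filesystem-private buffer, so the same data ends up resident twice, which is roughly the cost an OS-independent cache layer like the ARC imposes on top of the Linux page cache.

```c
/* Toy illustration of double caching: after a buffered read(), the file
 * data sits in the kernel page cache AND in our private "ARC-like" buffer.
 * This is not ZFS code, just a model of the duplication being discussed. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);        /* buffered I/O: goes through the page cache */
    if (fd < 0) { perror("open"); return 1; }

    size_t cache_size = 64 * 1024 * 1024;    /* our private cache, like a tiny ARC */
    char *private_cache = malloc(cache_size);
    if (!private_cache) { perror("malloc"); return 1; }

    ssize_t n = read(fd, private_cache, cache_size);
    if (n < 0) { perror("read"); return 1; }

    /* At this point up to `n` bytes are resident twice: once as page-cache
     * pages owned by the kernel, once in private_cache owned by us.
     * A filesystem that keeps its own OS-independent buffer layer pays
     * this cost (plus the extra memcpy) on every cache miss. */
    printf("read %zd bytes; roughly %zd bytes duplicated across two caches\n", n, n);

    free(private_cache);
    close(fd);
    return 0;
}
```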
There are other places I can go digging into ZoL that really have no place on FreeBSD or Linux. Maybe they make sense on OS X, Windows and Illumos, but they don't make sense at all on Linux and FreeBSD.
The memory-bandwidth limit that modern hardware hits is exactly why the deduplication FreeBSD does between the VM page cache and the block cache, or Linux simply reducing everything to one cache, exists: to cut down memory operations. Reducing memory operations helps when your problem is not enough memory bandwidth to go around. On modern hardware, PCIe has a higher transfer rate than your total memory bandwidth.
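As a rough back-of-the-envelope check of that claim (the figures below are assumed theoretical peaks, not measurements): a single-socket Epyc exposes 128 PCIe 4.0 lanes at roughly 2 GB/s each, while eight channels of DDR4-3200 top out around 25.6 GB/s per channel, so raw PCIe ingress can indeed exceed theoretical DRAM bandwidth, before you count the double hit every extra in-memory copy adds.

```c
/* Back-of-the-envelope comparison of PCIe vs memory bandwidth.
 * All figures are rough theoretical peaks, assumed for illustration. */
#include <stdio.h>

int main(void)
{
    /* Assumption: single-socket AMD Epyc (Rome-era), PCIe 4.0. */
    double pcie_lanes       = 128.0;
    double gbps_per_lane    = 2.0;            /* ~2 GB/s per PCIe 4.0 lane after encoding */
    double pcie_total       = pcie_lanes * gbps_per_lane;

    /* Assumption: 8 channels of DDR4-3200. */
    double ddr_channels     = 8.0;
    double gbps_per_channel = 25.6;           /* 3200 MT/s * 8 bytes */
    double mem_total        = ddr_channels * gbps_per_channel;

    printf("Peak PCIe ingress:   %.1f GB/s\n", pcie_total);  /* ~256 GB/s  */
    printf("Peak DRAM bandwidth: %.1f GB/s\n", mem_total);   /* ~204.8 GB/s */

    /* Every extra copy (block cache -> page cache -> user buffer) consumes
     * memory bandwidth twice (a read plus a write), so cache duplication
     * makes the imbalance worse than these raw peaks already suggest. */
    printf("Each extra in-memory copy costs ~2x the data size in DRAM traffic\n");
    return 0;
}
```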
At the time ZFS was designed, the problem was the other way around: the bandwidth from storage media to the MMU was lower than the MMU's transfer speed to RAM. The world we are in now is the inverse of the one ZFS was designed for, so a lot of poor design choices have to go. Mind you, those poor design choices could be removed without changing the ZFS on-disk format, but the one thing that has to go is the idea that a generic OS abstraction layer is valid.
Last edited by oiaohm; 06 February 2020, 11:03 AM.
Originally posted by oiaohm: Yes, they may use a different approach than Linux does. But the options for dealing with limited CPU and limited memory bandwidth are also limited. Any form of abstraction layer that causes memory duplication cannot be tolerated, no matter how much coding or legal effort is required to fix it. Yes, this includes possibly having to rewrite large sections of ZFS under a proper GPL-compatible license for Linux, or bringing the ZFS-on-FreeBSD project back from the dead.
The reality here is that what was the easy way out for ZoL in the past does not work going forward.
Originally posted by oiaohm: Yes, this includes possibly having to rewrite large sections of ZFS under a proper GPL-compatible license for Linux, or bringing the ZFS-on-FreeBSD project back from the dead.
Originally posted by k1e0x: What makes you think FreeBSD and Illumos won't do this work and won't add it to ZFS?
Originally posted by k1e0x: ...Apple that wants to get back into the server game and there are rumors of macOS Server becoming a thing again..
With macOS Server there are always rumours that it will become a thing again, right before another release where it does not.
Originally posted by k1e0x: I somehow believe they will take a different approach than Linux does.
The reality here is that what was the easy way out for ZoL in the past does not work going forward.
oiaohm iomap sounds like it's a re-implementation of netmap. netmap didn't replace the TCP/IP stack. As I said before, it's going to take a lot of work to do this.. what makes you think FreeBSD and Illumos won't do this work and won't add it to ZFS? ZFS is the default on Illumos, and though not the default in FreeBSD (not yet at least) it's extremely common. Saying de facto default might not be a stretch.
I'd imagine those two Unix OSes want to solve this too, and will. There is also Apple, which wants to get back into the server game, and there are rumors of macOS Server becoming a thing again.. Whether they do or don't doesn't really matter, but they also have an open-source Unix kernel and need to solve this. I somehow believe they will take a different approach than Linux does.
Last edited by k1e0x; 06 February 2020, 12:17 AM.
Originally posted by k1e0x: I don't think this has anything to do with filesystems per se. It's a kernel OS limitation.. you just can't move data through the kernel fast enough. All general-purpose OSes have this problem.
But an item like ZFS is a big exception to the rule. ZFS, as you stated, does not bother integrating with how the kernel is doing things. The reality is that the kernel on this modern storage hardware is unable to move the data fast enough. Heck, even if you had a system running purely out of L1 cache while doing DMA transfers in and out of memory, the PCIe bandwidth can still outstrip your MMU transfer speeds, so even in a 100 percent theoretically perfect world you are still screwed; of course, an OS cannot be 100 percent perfect because it has to be doing other things as well.
Having ZFS basically be its own OS kernel inside the OS kernel, with its own memory management, absolutely does not help, particularly given the reality that you are basically out of memory operations. Think of it this way: on a storage server you are now always behind on memory operations, and the more memory operations you need to do, the further behind you fall; you are never catching up. Zero-copy operations have to become the normal case, not the exception. The OS kernel can only do so much to enable zero-copy operations, and a file system like ZoL doing its own thing does not play along with the OS kernel improvements that increase zero-copy operations.
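For a sense of what "zero copy as the normal case" looks like, here is a minimal sketch using Linux's sendfile(2); the syscall is real, but the surrounding helper functions are just illustrative. The zero-copy path hands pages from the page cache to the socket inside the kernel, while a private-cache design forces the read()/write() round trip through a user buffer.

```c
/* Sketch: copying a file to a socket with and without an extra userspace copy.
 * sendfile(2) moves data from the page cache to the socket in-kernel;
 * the read()/write() loop drags every byte through a user buffer first. */
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Zero-copy-ish path: the kernel moves pages from file to socket directly. */
ssize_t send_file_zero_copy(int sock_fd, int file_fd, size_t len)
{
    off_t off = 0;
    return sendfile(sock_fd, file_fd, &off, len);
}

/* Copying path: every byte crosses the kernel/user boundary twice. */
ssize_t send_file_with_copies(int sock_fd, int file_fd, size_t len)
{
    char buf[64 * 1024];
    ssize_t total = 0;

    while ((size_t)total < len) {
        ssize_t n = read(file_fd, buf, sizeof(buf));   /* copy 1: page cache -> buf */
        if (n <= 0)
            return n < 0 ? n : total;
        if (write(sock_fd, buf, n) != n)               /* copy 2: buf -> socket buffers */
            return -1;
        total += n;
    }
    return total;
}
```

The point of the comparison is the memory traffic, not the API: a filesystem that insists on its own buffer layer ends up on the copying path by construction.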
So the hardware design means that no matter how you design your kernel, there is now no way to make your in-memory operations faster than your PCIe transfers. That is problem one. Problem two is next.
Originally posted by k1e0x: and kqueue is FreeBSD's poll method.. so.. there you go again : shrug :
I was not thinking of FreeBSD; I was reading kqueue as a kernel queue, not as a FreeBSD feature.. To work around the IRQ-request loss on AMD Epyc, Wendell was using the hybrid polling mode that Western Digital added to the Linux kernel.
PCI/PCIe is an ass. Using IRQs brings your CPU load down. But in an IRQ-storm event, as in devices sending you more interrupts than you can process, the PCI/PCIe specification allows them to be randomly lost. Yes, that is part of the PCI/PCIe specification. The next bit of evil: if you poll a device that is set to send an IRQ/interrupt when ready, the PCI/PCIe specification kicks in again, and by specification the device is now not to send the interrupt.
So the kqueue and epoll level does not help you; this is in your OS block layer. Epyc CPUs have so much PCIe bandwidth that in theory you could run the complete system by polling and not run out of PCIe bandwidth; of course, you would not have any CPU time left to do anything else.
Of course, this is all bad enough. Now there is a final bit of evil in the PCI/PCIe specification: if a device sends an IRQ and does not get an answer in a reasonable time frame, be it because the IRQ was lost or the system was just insanely busy, the device is to reset, so now the device has stalled out on you.
It's good that Western Digital gave Linux a hybrid mode between IRQs and polling, otherwise these storage servers would be in far worse trouble. Yes, they are still in trouble, but less than they would have been otherwise.
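For reference, that hybrid mode is controlled through the block layer's per-queue sysfs knobs, and polled I/O is requested per read with RWF_HIPRI. A rough sketch, assuming an NVMe namespace at /dev/nvme0n1 with poll queues enabled (the device path and block size here are assumptions):

```c
/* Sketch of issuing a polled read on Linux, assuming an NVMe namespace with
 * poll queues enabled (e.g. nvme.poll_queues=N) and hybrid polling selected:
 *   /sys/block/nvme0n1/queue/io_poll        -> 1
 *   /sys/block/nvme0n1/queue/io_poll_delay  -> 0   (0 = hybrid: sleep, then poll)
 * RWF_HIPRI asks the block layer to poll for completion instead of waiting
 * for an interrupt; it needs O_DIRECT and an aligned buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 4096;                 /* assumption: 4 KiB logical block size */
    void *buf;

    if (posix_memalign(&buf, blk, blk)) {
        perror("posix_memalign");
        return 1;
    }

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* assumed device path */
    if (fd < 0) { perror("open"); return 1; }

    struct iovec iov = { .iov_base = buf, .iov_len = blk };

    /* Polled read: the submitting CPU spins (or, in hybrid mode, sleeps then
     * spins) on the completion queue rather than relying on an IRQ arriving. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
    if (n < 0)
        perror("preadv2(RWF_HIPRI)");
    else
        printf("polled read completed: %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}
```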
Please note I am writing PCI/PCIe because these problems were in the specification from the first version of PCI, but we have not been running into them, because we did not have a large enough imbalance between the amount of PCI/PCIe transfer and the MMU's means to process it to cause any major volume of PCI/PCIe message loss as defined in the specification. AMD has nicely given us the Epyc CPU, which truly lives up to its name by being epic at pushing the PCI/PCIe specification right up to its breaking limit and then some.
Do note that polling devices because you cannot depend on IRQs means you have less CPU time, and this is before you attempt to run the ARC cache or any of those fancy ZFS features.
There is a reason why I said we need to look at the storage server not running data validation, with clients over the network running the data validation: if things keep going this way, there will be less and less usable CPU time on the storage server itself once you connect up a massive number of storage devices.
The world that ZFS was designed in is disappearing. Yes, the PCIe system hits you twice: lack of CPU time and lack of MMU transfer speed.
I don't think this has anything to do with filesystems per se. It's a kernel OS limitation.. you just can't move data through the kernel fast enough. All general-purpose OSes have this problem.
and kqueue is FreeBSD's poll method.. so.. there you go again : shrug :
Originally posted by k1e0x: I've seen that, ya that is an extreme problem, it's more so with the OS itself tho.. Netflix had the same issue with SSL at 100Gb. It will take serious OS engineering to solve this, not Linux throw-it-at-the-wall-and-see-if-it-sticks engineering.
Originally posted by k1e0x: But yes, ZFS only works if the CPU is X times faster than the disk. If it isn't, you need something else.
Originally posted by k1e0x: Generally that is true and it's been true for a long time. I don't see magnetic storage going away, so I think things are fine.
Originally posted by k1e0x: I don't think storage at that level really is practical economically.. but some people clearly have a use case here.. and we'll have to find a solution for them.
Originally posted by k1e0x: I'm curious what Wendell did with polling (kqueue tends to way outperform epoll)
Basically, your CPU is being stomped into the ground, and you have to deal with the fact that it is being stomped into the ground, so things like interrupts delivered over PCIe are no longer dependable.
Originally posted by k1e0x: Jeff Bonwick (ZFS creator) actually was trying to solve that problem.. He was working on 3D flash RAID implementations.. DSSD I think was the company and they were sold to Dell. Dell dropped it so... wonder what Bonwick is up to now. The world needs him to solve storage (again). lol
Basically, for this problem a storage specialist is useless. Fixing it takes a memory-management specialist and a specialist in the evils of PCIe. A further requirement is that those implementing file systems don't attempt to do their own thing with memory management, as duplication in memory makes your lack of memory bandwidth even worse.
It is bad enough dealing with the massive wave of data the PCIe lanes in these modern systems allow without third-party file systems like ZFS doing things that presume the CPU is X times faster than the disk.
Remember, with the Epyc CPUs from AMD, instead of what LTT attempted with a 24-core and then a step up to a 32-core part, you could have a 12- or 16-core chip with the exact same number of PCIe 4.0 lanes and only half the memory bandwidth. This is your poor low-end storage server based on AMD chips, perfectly designed to be mega-stomped by the drives.
So having the CPU X times faster than the disk is no longer always the case. It is now possible, even with old-school HDDs, to have the disks X times faster than the CPU.
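As a toy illustration of "disks X times faster than the CPU" (every figure here is an assumption picked for illustration, not a measurement): 24 NVMe drives at about 7 GB/s each can deliver far more data than 16 cores can checksum at a few GB/s per core.

```c
/* Toy arithmetic: aggregate drive bandwidth vs what the CPU can validate.
 * All numbers are assumptions chosen only to illustrate the imbalance. */
#include <stdio.h>

int main(void)
{
    double drives            = 24.0;   /* assumed NVMe drive count */
    double gbps_per_drive    = 7.0;    /* assumed per-drive sequential throughput, GB/s */
    double cores             = 16.0;   /* assumed low-core-count Epyc-class part */
    double checksum_per_core = 3.0;    /* assumed per-core checksum/validation rate, GB/s */

    double drive_total = drives * gbps_per_drive;    /* 168 GB/s of incoming data */
    double cpu_total   = cores * checksum_per_core;  /* 48 GB/s of validation capacity */

    printf("Drives can deliver: %.0f GB/s\n", drive_total);
    printf("Cores can validate: %.0f GB/s\n", cpu_total);
    printf("Shortfall factor:   %.1fx\n", drive_total / cpu_total);
    return 0;
}
```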
The concept ZFS was designed around, AMD has basically turned upside down, and now we have to deal with it.