Linus Torvalds Doesn't Recommend Using ZFS On Linux


  • k1e0x
    replied
    I've seen that; yeah, that is an extreme problem, though it's more an issue with the OS itself. Netflix had the same problem with SSL at 100 Gb. It will take serious OS engineering to solve this, not Linux's throw-it-at-the-wall-and-see-if-it-sticks engineering.

    But yes, ZFS only works if the CPU is X times faster than the disk. If it isn't, you need something else. Generally that is true, and it's been true for a long time. I don't see magnetic storage going away, so I think things are fine. I don't think storage at that level is really practical economically.. but some people clearly have a use case here, and we'll have to find a solution for them.

    I'm curious what Wendell did with polling (kqueue tends to way outperform epoll).
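
    (For anyone who hasn't compared the two APIs side by side, here is a minimal sketch of the shape difference I mean; it is purely illustrative and has nothing to do with Wendell's actual setup. kqueue can register a batch of descriptors and collect events in a single kevent() call, while epoll pays one epoll_ctl() syscall per descriptor before it can wait.)

    /* Sketch of the API-shape difference only, not a benchmark. */
    #include <stdio.h>
    #include <unistd.h>
    #ifdef __linux__
    #include <sys/epoll.h>
    static int watch_all(int *fds, int n)
    {
        int ep = epoll_create1(0);
        for (int i = 0; i < n; i++) {             /* one syscall per descriptor */
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
            epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev);
        }
        return ep;
    }
    #else                                         /* FreeBSD / macOS */
    #include <sys/event.h>
    static int watch_all(int *fds, int n)
    {
        int kq = kqueue();
        struct kevent changes[64];
        int m = n < 64 ? n : 64;
        for (int i = 0; i < m; i++)
            EV_SET(&changes[i], fds[i], EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, changes, m, NULL, 0, NULL);    /* one syscall, batched */
        return kq;
    }
    #endif

    int main(void)
    {
        int p[2];
        pipe(p);                                  /* something to watch */
        int q = watch_all(&p[0], 1);
        printf("event queue fd: %d\n", q);
        close(q); close(p[0]); close(p[1]);
        return 0;
    }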

    Jeff Bonwick (ZFS creator) actually was trying to solve that problem.. He was working on 3D flash RAID implementations.. DSSD I think was the company, and they were sold to Dell. Dell dropped it so... wonder what Bonwick is up to now. The world needs him to solve storage (again). lol
    Last edited by k1e0x; 05 February 2020, 03:44 AM.



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    Well, nothing's perfect. lol
    That is so true. But time moves on.



    Here is Linus Tech Tips setting up a new server: they started with the plan of using ZFS and ended up back on XFS and dm.

    Current day systems are running into a new problem. ZFS was designed on the assumption that your RAM bandwidth is greater than your storage bandwidth, so you can afford to be wasteful with RAM bandwidth. Now we have a new nightmare: enough NVMe drives in volume add up to more bandwidth from your drives than your complete CPU RAM bus.

    Zero copy from the device into memory and back out is no longer an optional feature, it is becoming a mandatory one. So there is no time for the fancy ARC cache stunts that result in three copies in RAM, and no time for the file system to have its own allocation system and copy out to the host OS memory system.
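
    To make that concrete, here is a rough sketch of what skipping the extra copies looks like from userspace (my own illustration, assuming a Linux box, root access and a hypothetical /dev/nvme0n1; this is not ZFS or kernel code): with O_DIRECT the device DMAs straight into an aligned user buffer instead of bouncing through page-cache or ARC-style copies first.

    /* Sketch only: the device DMAs straight into this aligned user buffer,
     * skipping the page-cache copy entirely.  Build: cc -O2 direct.c */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t blk = 4096;                 /* assume 4 KiB logical blocks */
        void *buf = NULL;

        /* O_DIRECT demands aligned buffers, offsets and lengths. */
        if (posix_memalign(&buf, blk, blk) != 0)
            return 1;

        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT); /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, blk, 0);      /* DMA lands directly in buf */
        printf("read %zd bytes with no page-cache copy\n", n);

        close(fd);
        free(buf);
        return 0;
    }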

    Yes, we really do need to rethink where the checksumming of data happens as well. For true end-to-end protection, the checksum of data from a file server may in fact need to be done by the client, with the client having a means to inform the server that block X appears to have a problem. Same with compression, because the storage server can run out of resources purely from running the storage.
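
    As a toy illustration of what client-side verification could look like (my own sketch using zlib's crc32, assuming the server ships a checksum alongside each block; no real file server protocol is implied), the client just recomputes the checksum over the received bytes and reports a mismatch back:

    /* Sketch only: client recomputes a CRC over the received bytes and compares
     * it with the checksum the server sent.  Build: cc crc.c -lz */
    #include <stdio.h>
    #include <zlib.h>

    /* crc_from_server is assumed to arrive alongside the payload. */
    static int block_ok(const unsigned char *data, size_t len,
                        unsigned long crc_from_server)
    {
        unsigned long crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, data, (uInt)len);
        return crc == crc_from_server;   /* 0 => tell the server this block looks bad */
    }

    int main(void)
    {
        const unsigned char payload[] = "example block payload";
        unsigned long server_crc = crc32(crc32(0L, Z_NULL, 0),
                                         payload, sizeof(payload));
        printf("block ok: %d\n", block_ok(payload, sizeof(payload), server_crc));
        return 0;
    }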

    So yes, the video above is ZFS losing a user because the ZFS design cannot perform on modern, nasty hardware. And yes, modern hardware is nasty in more ways than one: you can be forced to poll instead of using interrupts because interrupts are getting lost under the massive pressure.

    Yes, it is kind of insane that a dual socket 128 core / 256 thread setup can basically be strangled by current high speed storage media because there is not enough memory bandwidth, as some other parties have found out. When ZFS was designed, the idea that you could be strangled by memory bandwidth was not even a possibility. The optimisation of not duplicating data in memory (zero copy operation) is not optional, it is a required feature in these new setups, and they will become more common.


    k1e0x, you might not like this, but I don't see ZFS lasting another 10 years if it does not significantly alter its path, because it is only going to become more and more incompatible with current hardware. Yes, current hardware is going to force us to reconsider how we do things as well. Being strangled by the RAM bus of the storage server is a really new problem.



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    Yeah, there are differences (the slab on Solaris and FreeBSD's version of it has also seen improvement and changes). FreeBSD is much less radical about the changes they make. The block layer uses a driver??? no way! Every OS does, the difference is HOW it uses that.
    The Linux block layer, from the drivers up, has changed as well, so ZoL no longer submits requests to the Linux block layer in the way that gets the best performance. Yes, how the block layer is used is important. It is possible to use the block layer wrongly and have performance problems.

    They would NEVER rip out ifconfig. You'd never see it happen.. they would fix its limitations (and have, it does wifi on FreeBSD, like it *should*.)

    Originally posted by k1e0x View Post
    I don't see any reason why snap can't use a ZVOL and being all Ubuntu technology you can just import it.
    Turns out that a ZVOL is slower than using the Linux loopback device on top of ZFS, so there is no point importing it. Loopback's use of the Linux kernel page cache brings some performance advantages; it is optimised for huge page usage and will be optimised for large page usage in future. It does not matter that you can do things if those things don't actually perform.

    Originally posted by k1e0x View Post
    I think I've kind of won you over btw on ZFS. Yeah.. it's a good thing in the world. Hopefully the dream of making filesystems as easy as ram will be a reality. "Storage your computer manages for you" That is the idea.. They aren't there yet, but they know that and are working towards it.
    The memory operation requirements are why I see ZFS/ZoL as doomed long term. There are already signs of this..

    Originally posted by k1e0x View Post
    How does ZFS do in the enterprise? People don't really tell us who is *really* using it, business tend to be very private about their critical core infrastructure.. but anonymously who uses it..? Well if you look at Oracle's stats.

    ZFS was able to top SAP storage benchmarks for 2017.
    All this line does is throw out out-of-date information to try to win a point.
    https://www.intel.com/content/dam/ww...tion-guide.pdf

    The current recommended install in 2020 for SAP to get the best performance is XFS with DAX and HANA. Both use the Linux kernel block device drivers' ability to place something like a 2/4 MiB huge page into Linux memory in a single operation from the storage media. No middle crap like the ARC cache. They use application-intelligent checksumming of data, which can in fact reduce the amount of checksum processing you need to do while keeping the same level of data protection.
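
    For what the DAX path looks like from an application's point of view, here is a minimal sketch of my own (assuming a file on an XFS filesystem mounted with the dax option on a reasonably recent kernel; the path is made up, and this is not how SAP HANA is actually implemented): the file is mapped straight from persistent memory, so loads and stores reach the media without any page-cache or ARC-style copy in between.

    /* Sketch: map a file on a DAX-mounted XFS filesystem directly into the
     * address space.  MAP_SYNC means stores reach persistent media without an
     * intermediate page-cache copy (fails with EOPNOTSUPP on non-DAX mounts). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MAP_SYNC
    #define MAP_SYNC 0x80000
    #endif
    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x03
    #endif

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;               /* one 2 MiB extent */
        int fd = open("/mnt/dax/volume.dat", O_RDWR); /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        const char *msg = "written straight towards persistent media";
        memcpy(p, msg, strlen(msg) + 1);  /* plain stores, no read()/write() path */

        munmap(p, len);
        close(fd);
        return 0;
    }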

    So for SAP, ZFS is currently classed as under-performing junk. It's not like the SAP developers did not look at ZFS and take some ideas. Basically the SAP developers looked at ZFS, took some ideas, and worked out they could do it faster and better. A lot of this is caused by ZoL deciding to use its own memory manager and not being adapted to the Linux VFS layer, where a request for a page of the required size is passed through the file system so the block device reads into that page correctly aligned and ready to be used with no later modification required.

    Originally posted by k1e0x View Post
    2 / 3 Top SaaS companies use it.
    7 / 10 Top financial companies use it.
    9 / 10 Top telecommunication companies use it.
    5 / 5 Top semiconductor companies use it.
    3 / 5 Top gas and oil companies use it.
    2 / 3 Top media companies use it.
    I guess these are all parties using SAP, and if they are following the current SAP recommendations they are no longer using ZFS but are in fact using XFS and HANA.

    It has a good data center footprint too: 9 PB per rack.

    Originally posted by k1e0x View Post
    But who would care about that market.. small peanuts. Disney, Intel, AT&T. JP Morgan, or ARCO meh.. not important. No need to put this in Linux. Linus is probably right. We should make gaming better on Linux. lol

    The XFS developers' focus on performance, which requires system-wide integration more than data security does, has turned out to be important. There are many ways to achieve data integrity without having checksums in the file system. There are very limited ways to achieve performance.

    ZFS's market position is not stable. The ZFS developers' idea that they can ignore the way the host operating system works is leading to ZFS losing market share to XFS due to the poor performance of the ZFS option.



  • gilboa
    replied
    Originally posted by skeevy420 View Post

    But that's true of anything in an enterprise environment.
    Not really. I've seen huge enterprises use CentOS + XFS / ext4 without enterprise grade support.
    It's extremely stable and the free support can be sufficient for a skilled customer.
    ... However, try getting *any* type of support for ZOL running on CentOS (or RHEL); you'll have a blast.

    - Gilboa



  • skeevy420
    replied
    Originally posted by gilboa View Post
    Using ZOL in an enterprise environment without enterprise grade support is *foolish*. If it breaks, you get to keep all the pieces.
    But that's true of anything in an enterprise environment.



  • gilboa
    replied
    Originally posted by k1e0x View Post

    1. I have never personally seen an Oracle ZFS storage appliance used in the enterprise; I know they exist, I've just never seen them. Only Sun Microsystems' ones from before Solaris version 10 (pool version 28~). Generally it's used on FreeBSD storage clusters that are secondary storage to NetApp, EMC or DDN (around 32-64 spindles). You can get paid commercial ZFS (and ZoL) support from both FreeBSD third parties and Canonical. Ubuntu 19.10 has ZFS on root in the installer and so will 20.04 LTS.
    A. I was talking about Oracle Solaris and Oracle Unbreakable Linux.
    B. I'm not sure whether Canonical's enterprise offering includes full support for ZFS. Last I heard, it was marked as "experimental" (read: unsupported).

    2. All I got to say about ext4 is hope you like fsck.
    I've got a couple of PB worth of storage, be it glusterfs clusters, ovirt clusters or our own proprietary application, that seems to suggest otherwise.
    But feel free to think otherwise.

    3. You don't know what you're talking about. COW alone has nothing to do with bit-rot or uncorrectable errors. You're thinking of block checksums, and yes, they are good. COW provides other features such as snapshots, cloning and boot environments. Boot environments are pretty cool.. maybe Linux should get on that... oh wait.. ZFS is the only Linux file system that does it and we can't have *that*.
    A. You assume I don't understand the difference between checksums and COW. No idea why.
    B. .... What makes you think I need it (not the former, the latter)?

    I believe you're completely missing my point.
    Using ZOL in an enterprise environment without enterprise grade support is *foolish*. If it breaks, you get to keep all the pieces.

    - Gilboa
    Last edited by gilboa; 03 February 2020, 11:39 AM.



  • k1e0x
    replied
    Originally posted by oiaohm View Post

    Loopback is used by snapd and other things. Snapd is not smart enough to use a ZVOL. So maybe it is important to improve loopback for legacy applications rather than just saying it is not used.

    VFS is used by Linux file system namespaces (systemd uses these like they are going out of fashion), so the Linux VFS layer is not implemented by the ZPL, because the ZPL does not contain the namespace features. So the Linux kernel VFS layer with the page cache is always there on Linux, because ZFS is a round peg in a square hole without being redesigned from the ground up.

    Your SATA/SAS controller drivers and so on under Linux force you back to the Linux block layer to interface with them. So it is not redundant as claimed: without the Linux block layer, ZFS on Linux cannot write to local discs. And if you are going over the network, to keep zero copy where possible you still need to use the same allocation system as the Linux kernel.

    So all your so-called corrections were wrong, k1e0x.

    That is the problem: ZFS is breaking the possibility of zero copy operations from the block device (SATA/SAS/... controller) to the page cache.

    I am not disputing just that; my point is how those blocks are placed in RAM, so that you end up with zero copy.

    The horrible point here is that slab in the Linux kernel is superseded technology. Yes, the Linux kernel took in slab, but the Linux kernel developers reworked how it functions in a big way. When the Linux kernel was made to support hugepages, slab got redesigned into a new beast that was not renamed. The new system uses a different solution to work better with multiple different page sizes. The large pages work redesigns sections of the memory system again.

    So basically the slab technology ZFS is using is 1 to 2 generations behind what the Linux kernel has. So ZFS's slab usage under Linux is pointless duplication with a less effective method. One problem: the Linux replacement for slab is under GPLv2, which happens to be incompatible with the CDDL.
    Well, nothing's perfect. lol

    Yeah, there are differences (the slab on Solaris and FreeBSD's version of it has also seen improvements and changes). FreeBSD is much less radical about the changes they make. The block layer uses a driver??? No way! Every OS does; the difference is HOW it uses it. They would NEVER rip out ifconfig. You'd never see it happen.. they would fix its limitations (and have, it does wifi on FreeBSD, like it *should*.) It's developed by people who *like* Unix and are not in a power struggle with the other distro or the whims of whoever controls X project. It's a team (core team) and they make sound engineering decisions for what all of their future is going to be.. A lot of them seemed to be more seasoned programmers too. Kirk McKusick still maintains and improves UFS, which he created as a student at Berkeley in the 70's. (Guess what, the grandfather of filesystems likes ZFS too.)

    On ZFS it would be really nice to see Linux do the same thing FreeBSD did and not need some of these pieces imported from the Illumos branch. I don't see any reason why snap can't use a ZVOL, and it being all Ubuntu technology you can just import it. You can dd anything to a ZVOL and use it like a file. ZVOLs are really nice btw.. have you played with them? You can do a lot of interesting things with them. That layer really needs to get adopted on other systems.

    I think in the end Linux fundamentally wants to be something VERY different from what sysadmins want. I haven't really seen anything that great come out of Linux in about 10 years.. They are catering to a home user market that just doesn't exist, and they no longer seem to give a shit about being Unix-like anymore, so.. yeah.. not much of a loss.. I personally think Linux gaming is the dumbest thing in the world; it's fine if it works, but Linux is a server OS. :shrug: Isn't it? And if it's not.. I don't care about it. What a user runs on his workstation doesn't interest me.. could be Android for all I care.. it's just a browser anyhow.

    I think I've kind of won you over btw on ZFS. Yeah.. it's a good thing in the world. Hopefully the dream of making filesystems as easy as ram will be a reality. "Storage your computer manages for you" That is the idea.. They aren't there yet, but they know that and are working towards it.

    How does ZFS do in the enterprise? People don't really tell us who is *really* using it; businesses tend to be very private about their critical core infrastructure.. but anonymously, who uses it? Well, look at Oracle's stats.

    ZFS was able to top SAP storage benchmarks for 2017.
    2 / 3 Top SaaS companies use it.
    7 / 10 Top financial companies use it.
    9 / 10 Top telecommunication companies use it.
    5 / 5 Top semiconductor companies use it.
    3 / 5 Top gas and oil companies use it.
    2 / 3 Top media companies use it.
    It has a good data center footprint too: 9 PB per rack.

    But who would care about that market.. small peanuts. Disney, Intel, AT&T, JP Morgan, or ARCO? Meh.. not important. No need to put this in Linux. Linus is probably right. We should make gaming better on Linux. lol
    Last edited by k1e0x; 03 February 2020, 04:34 AM.



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    Loopback <- not used (implemented as ZVOL)
    VFS <- not used (implemented by ZPL)
    Loopback is used by snapd and other things. Snapd is not smart enough to use a ZVOL. So maybe it is important to improve loopback for legacy applications rather than just saying it is not used.

    VFS is used by Linux file system namespaces (systemd uses these like they are going out of fashion), so the Linux VFS layer is not implemented by the ZPL, because the ZPL does not contain the namespace features. So the Linux kernel VFS layer with the page cache is always there on Linux, because ZFS is a round peg in a square hole without being redesigned from the ground up.

    Originally posted by k1e0x View Post
    Linux Block layer <- Not used, redundant.
    Your SATA/SAS controller drivers and so on under Linux force you back to the Linux block layer to interface with them. So it is not redundant as claimed: without the Linux block layer, ZFS on Linux cannot write to local discs. And if you are going over the network, to keep zero copy where possible you still need to use the same allocation system as the Linux kernel.

    So all your so-called corrections were wrong, k1e0x.

    Originally posted by k1e0x View Post
    The entire stack is different, it doesn't lay on top / reuse like everything else in Linux.
    That is the problem: ZFS is breaking the possibility of zero copy operations from the block device (SATA/SAS/... controller) to the page cache.

    Originally posted by k1e0x View Post
    I also know you can change the block sizes, thats age old like you said and that isn't the point. The point is ZFS is variable and does it on the fly automatically. (So 512k write is 512 block, 4k write is 4k. etc, it means it efficiency manages slack.).
    I am not disputing just that; my point is how those blocks are placed in RAM, so that you end up with zero copy.

    Originally posted by k1e0x View Post
    Also the slab is really really good.. if it wasn't it wouldn't have been imitated or copied by every other OS (including linux).. be very careful redesigning this..
    The horrible point here is that slab in the Linux kernel is superseded technology. Yes, the Linux kernel took in slab, but the Linux kernel developers reworked how it functions in a big way. When the Linux kernel was made to support hugepages, slab got redesigned into a new beast that was not renamed. The new system uses a different solution to work better with multiple different page sizes. The large pages work redesigns sections of the memory system again.

    So basically the slab technology ZFS is using is 1 to 2 generations behind what the Linux kernel has. So ZFS's slab usage under Linux is pointless duplication with a less effective method. One problem: the Linux replacement for slab is under GPLv2, which happens to be incompatible with the CDDL.



  • k1e0x
    replied
    Originally posted by oiaohm View Post
    Loopback <- not used (implemented as ZVOL)
    VFS <- not used (implemented by ZPL)
    ZFS <- This isn't a layer the whole thing is ZFS..?
    -ZPL <- ZPL is optional, things like Luster can talk to the DMU
    ---ZVOL <- Also optional to provide block devices that DONT have to go through the ZPL.
    -DMU
    -SPA
    -Block layer ZFS.
    Linux Block layer <- Not used, redundant.
    oiaohm, just a quick response here.. You're missing the design. The entire stack is different; it doesn't layer on top of and reuse things like everything else in Linux does (and pretty much every other OS too, Windows isn't different). That is why I call the other one the *historical* design. It's not a shim, it's designed to provide exactly what things need without going through unnecessary layers.

    All you need is the DMU and the SPA and a block device.. That's it. Some applications talk directly to the DMU. The SPL has nothing to do with the I/O path; it implements the slab, the ARC and other things for the DMU and SPA. The SPL is also only used on Linux, due to the need to separate out the kernel modules.. and in the latest versions it's been integrated anyhow. On FreeBSD there is no SPL. (Obviously there isn't one on Illumos.)

    I also know you can change the block sizes; that's age-old, like you said, and that isn't the point. The point is that ZFS is variable and does it on the fly automatically. (So a 512k write is a 512 block, a 4k write is 4k, etc.; it means it efficiently manages slack.)

    Also, the slab is really, really good.. if it wasn't, it wouldn't have been imitated or copied by every other OS (including Linux).. be very careful redesigning this.. *most* other people got it wrong before Sun. You want to put hobbyist or millennial programmers on this? Or seasoned engineers who suffered some of the most agonizing problems and pain converting BSD to SVR5? Solaris was built out of sadness and suffering.. not "I got to make this widget then hop on Instagram".

    ZFS implemented trim in 2013. Linux was just late to the party on that feature.
    Last edited by k1e0x; 02 February 2020, 12:40 AM.



  • oiaohm
    replied
    Originally posted by k1e0x View Post
    ZFS implements Solaris (famous and much imitated) slab memory allocator. I don't see why that can't run in a huge page or anything else.
    This is where you are stuffed.

    Watch the video again: https://www.youtube.com/watch?v=p5u-vbwu3Fs "Large Pages in Linux". This is not a slab memory allocator. This is not file systems having their own allocation system. That whole slab memory allocator, to be Linux compatible, has to be replaced by the large page system.

    Originally posted by k1e0x View Post
    One thing to note here is that FreeBSD I believe doesn't even use the slab because their own memory manager is so close to it, they didn't need to change.. however that may no longer be the case in future releases because of their transition to ZoL/OpenZFS.
    That is the start of the problem.

    Originally posted by k1e0x View Post
    Historical model:
    - Loopback layer (qcow2, overlays)
    - Posix layer (VFS, read/write)
    - File layer (ext3, xfs etc)
    - Logical volume layer (LVM, MD Raid, LUKS or sometimes all 3 chained together! All pretending to be a block layer to each other.)
    - Block Layer (fixed block size, usually 4k)
    You did not watch the video with proper attention, or you missed slide 12. First line:
    "Block layer: already supports arbitrary size pages, thanks to merging."

    Funnily enough, the same goes for the logical volume layer, because it lives in the Linux kernel block layer: variable block size has existed in Linux all the way up to just under the file system. There is a problem at the file system drivers that iomap is the plan to fix.

    So this historic model does not in fact match Linux. DMA means you did not have to use a fixed block size. An HDD might have 4k blocks and you have to write aligned, but nothing prevents you using 32 or 64 kB.... as long as it is aligned. This feature was basically in the first Linux kernel block layer. A lot of people writing file systems on Linux brought the 4k limitation crap in at the file system layer, and with huge pages that doesn't work any more.

    The large page work is about bringing the block layer's idea of allocation and the memory management / OS page cache into agreement. This way you can have one allocation top to bottom.
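
    A rough sketch of what "one allocation top to bottom" can already look like from userspace (my own illustration, assuming huge pages have been reserved via vm.nr_hugepages and using a hypothetical /dev/nvme0n1): reserve a single 2 MiB huge page and let the block layer DMA one large, aligned request straight into it, with no re-copy afterwards.

    /* Sketch: one 2 MiB huge page filled by a single aligned O_DIRECT read.
     * The page the block layer DMAs into is the page the application keeps
     * using; nothing gets re-copied afterwards. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;          /* 2 MiB huge page */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap hugepage"); return 1; }

        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* hypothetical */
        if (fd < 0) { perror("open"); return 1; }

        /* The huge page is naturally aligned and 2 MiB is a multiple of the
         * logical block size, so the whole request can go down in one piece. */
        ssize_t n = pread(fd, buf, len, 0);
        printf("read %zd bytes into a single huge page\n", n);

        close(fd);
        munmap(buf, len);
        return 0;
    }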

    Originally posted by k1e0x View Post
    In ZFS they ripped all that out and changed it.

    ZFS model:
    - ZPL (Speaks a language the OS understands, usually posix.)
    --- Optional ZVOL Layer (Pools storage for Virtual Machines, iSCSI, Fiber Channel and distributed storage etc, no extra layer added on top like with qcow2)
    - DMU (re-orders the data into transaction groups)
    - SPA (SPA works with plugins to do LVM, Raid-Z, Compression using existing or future algorithms, Encryption, other stuff not invented yet, etc. It can even load balance devices.)
    - Block layer (variable block size)

    ZFS rewrote how all the layers work and changed them to be aware of each other. It actually takes blocks, bundles them up as objects in transaction groups and that is what's actually written.
    Was it required to rewrite all the layers to make them aware of each other? The answer is no. Why are you bundling them up into objects instead of improving the page cache of the OS, as the large pages work does for all file systems over time?

    ZPL means you must be translating. You are also ignoring the host OS big time. This becomes a huge excuse for making your own internal allocator that is not in fact aligned with the host OS.

    You also miss that the block sizes ZFS wants to use go up to 1 MiB. Huge pages on x86 are 2-4 MiB. 1 MiB made sense on a Sun SPARC CPU and 32-bit x86, but we use 64-bit x86 these days. There are a lot of things in ZFS that are based on hardware we no longer use and need to be redone as well.

    Consider this: your ZFS model is wrong on Linux. On Linux it really looks like this:

    Loopback
    VFS
    ZFS
    -ZPL
    ---ZVOL
    -DMU
    -SPA
    -Block layer ZFS.
    Linux Block layer.

    I suspect FreeBSD will end up in an equally bad mess. The rip-out-and-replace that was ZFS's objective under Solaris has not happened on non-Solaris systems.

    Then those layers are also sitting on another abstraction layer. You skipped the SPL (Solaris Porting Layer), which is basically "let's keep on using the Solaris API for ever more". Of course, as Linux behaviours become less Solaris-like this is going to be an increasing problem. At some point FreeBSD will change things and have trouble as well. For example, it was particularly hard to implement the TRIM command for SSDs in ZFS for Linux; ZFS was almost a decade late getting that feature compared to other Linux file systems.

    Originally posted by k1e0x View Post
    Have you ever really used the "ip" command? It drives me nuts every real OS for 40 years used ifconfig,
    Yes I have and I have been very thankful for it.

    (Linked question: detecting the physical connected or disconnected state of an RJ45 port in a Linux environment, preferably using Bash scripting only.)

    For these kinds of problems the ip monitor feature is great.

    With the ip command you can also do things like changing or removing the default gateway without having to disconnect and reconnect for the routing change to take effect, and if you are dealing with a slightly suspect managed switch, where the RADIUS messaging to activate the port is a roll of the dice, it is great to be able to make changes like that with the network card still up.

    Originally posted by k1e0x View Post
    Linux was deployed and put into the position it was by sysadmins because it was simple and they needed to solve problems.
    Really, the ip command came into existence because the POSIX-standard ifconfig command cannot do a stack of different things well. What the Linux network stack allows you to do well and truly exceeds what the FreeBSD one allows today. The fact that FreeBSD has not found itself needing to replace or massively extend ifconfig is more a sign of how far behind in features the FreeBSD network stack has got.

    Originally posted by k1e0x View Post
    Same thing you can see in KVM and bhyve. bhyve is almost a different class of hypervisor in that it's a few hundred kilobytes in size and does no hardware emulation (qemu) at all. KVM well, how bloated CAN it get?
    I get sick of FreeBSD people doing this one. https://lwn.net/Articles/658511/ In reality, if you fire up Linux KVM with kvmtool there is no hardware emulation either. I call this horrible naming: the kvm command is qemu modified to take advantage of the KVM kernel feature of Linux, while kvmtool also uses the KVM kernel feature of Linux without the qemu bits, so it is insanely lightweight..

    So KVM in the Linux kernel is not as bloated as you want to make out. For KVM userspace you have the feature-rich kvm command, which is based on qemu with all the hardware emulation, and the feature-poor kvmtool, which brings you back to something like bhyve, with the same OS support problems.
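
    To show how small the kernel side of KVM actually is, here is a minimal sketch against the public /dev/kvm ioctl interface (my own illustration; it creates a bare VM and vCPU and nothing else, so it boots nothing): everything that looks "bloated", the device models, firmware and so on, lives in whichever userspace you pick, qemu or kvmtool.

    /* Sketch only, using the public /dev/kvm ioctl interface: the kernel side
     * is a small ioctl surface; device emulation, firmware etc. are entirely
     * userspace's job (qemu, kvmtool, ...). */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm < 0) { perror("open /dev/kvm"); return 1; }

        printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

        int vm = ioctl(kvm, KVM_CREATE_VM, 0);    /* a bare VM: no devices, no firmware */
        if (vm < 0) { perror("KVM_CREATE_VM"); return 1; }

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0); /* one empty vCPU */
        printf("vm fd %d, vcpu fd %d - everything else comes from userspace\n", vm, vcpu);

        close(vcpu); close(vm); close(kvm);
        return 0;
    }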


