Linus Torvalds Doesn't Recommend Using ZFS On Linux


  • Originally posted by gilboa View Post
    Using ZOL in an enterprise environment without enterprise grade support is *foolish*. If it breaks, you get to keep all the pieces.
    But that's true of anything in an enterprise environment.



    • Originally posted by skeevy420 View Post

      But that's true of anything in an enterprise environment.
      Not really. I've seen huge enterprises use CentOS + XFS / ext4 without enterprise grade support.
      It's extremely stable and the free support can be sufficient for a skilled customer.
      ... However, try getting *any* type of support for ZOL running on CentOS (or RHEL), you'll have a blast.

      - Gilboa
      oVirt-HV1: Intel S2600C0, 2xE5-2658V2, 128GB, 8x2TB, 4x480GB SSD, GTX1080 (to-VM), Dell U3219Q, U2415, U2412M.
      oVirt-HV2: Intel S2400GP2, 2xE5-2448L, 120GB, 8x2TB, 4x480GB SSD, GTX730 (to-VM).
      oVirt-HV3: Gigabyte B85M-HD3, E3-1245V3, 32GB, 4x1TB, 2x480GB SSD, GTX980 (to-VM).
      Devel-2: Asus H110M-K, i5-6500, 16GB, 3x1TB + 128GB-SSD, F33.



      • Originally posted by k1e0x View Post
        Yeah, there are differences (the slab on Solaris and FreeBSD's version of it has also seen improvement and changes). FreeBSD is much less radical about the changes they make. The block layer uses a driver??? no way! Every OS does, the difference is HOW it uses that.
        The Linux block layer and its driver interface have changed as well, so ZoL no longer submits requests to the Linux block layer in the way that gets the best performance. Yes, how the block layer is used is important: it is possible to use the block layer wrongly and end up with performance problems.

        They would NEVER rip out ifconfig. You'd never see it happen.. they would fix its limitations (and have: it does wifi on FreeBSD, like it *should*).

        Originally posted by k1e0x View Post
        I don't see any reason why snap can't use a ZVOL and being all Ubuntu technology you can just import it.
        It turns out that a ZVOL is slower than using a Linux loopback device on top of a ZFS filesystem, so there is no point in importing it. Going through the loopback uses the Linux kernel page cache, which brings performance advantages: it is optimised for huge pages and will be optimised for larger pages in future. It does not matter that you can do something if it does not actually perform.

        Originally posted by k1e0x View Post
        I think I've kind of won you over btw on ZFS. Yeah.. it's a good thing in the world. Hopefully the dream of making filesystems as easy as ram will be a reality. "Storage your computer manages for you" That is the idea.. They aren't there yet, but they know that and are working towards it.
        The memory operation requirements are why I see ZFS/ZoL as doomed long term. There are already signs of this.

        Originally posted by k1e0x View Post
        How does ZFS do in the enterprise? People don't really tell us who is *really* using it, businesses tend to be very private about their critical core infrastructure.. but anonymously, who uses it..? Well, if you look at Oracle's stats.

        ZFS was able to top SAP storage benchmarks for 2017.
        All that line does is throw out out-of-date information to try to score a point.
        https://www.intel.com/content/dam/ww...tion-guide.pdf

        The currently recommended install for SAP in 2020, for best performance, is XFS with DAX and HANA. Both use the Linux kernel block device drivers to place something like a 2 MB/4 MB huge page into Linux memory in a single operation from the storage media, with no middle layer like the ARC cache, and they use application-aware checksumming of the data, which can reduce the amount of checksum processing you need to do while giving the same level of data protection.

        So for SAP, ZFS is currently classed as under-performing. It is not as if the SAP developers did not look at ZFS and take some ideas; they did, and they worked out they could do it faster and better. A lot of this is caused by ZoL deciding to use its own memory manager instead of adapting to the Linux VFS layer, where a request comes in for a page of a given size and is passed through the filesystem to the block device, which reads into that page correctly aligned and ready to be used with no later modification required.

        Originally posted by k1e0x View Post
        2 / 3 Top SaaS companies use it.
        7 / 10 Top financial companies use it.
        9 / 10 Top telecommunication companies use it.
        5 / 5 Top semiconductor companies use it.
        3 / 5 Top gas and oil companies use it.
        2 / 3 Top media companies use it.
        I guess that any of those parties using SAP, if they follow current-day SAP recommendations, are no longer using ZFS but are in fact using XFS and HANA.

        It has a good data center footprint too: 9 PB per rack.

        Originally posted by k1e0x View Post
        But who would care about that market.. small peanuts. Disney, Intel, AT&T. JP Morgan, or ARCO meh.. not important. No need to put this in Linux. Linus is probably right. We should make gaming better on Linux. lol

        The XFS developers being focused on performance, which requires system-wide integration, more than on data security has turned out to be important. There are many ways to achieve data integrity without having checksums in the filesystem; there are very limited ways to achieve performance.

        ZFS's market position is not stable. The ZFS developers' idea that they can ignore the way the host operating system works is leading to ZFS losing market share to XFS, due to the poor performance of the ZFS option.



        • Originally posted by k1e0x View Post
          Well, nothing's perfect. lol
          That is so true. But time moves on.

          [embedded video: Linus Tech Tips server build]


          Here is Linus Tech Tips setting up a new server: he started with the plan of using ZFS and ended up back on XFS and dm.

          Current-day systems are running into a new problem. ZFS was designed on the assumption that your RAM bandwidth is greater than your storage bandwidth, so you can afford to be wasteful with RAM bandwidth. Now we have a new nightmare: enough of the new NVMe drives add up to more bandwidth from your drives than your complete CPU RAM bus can carry.
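
          As a rough back-of-the-envelope illustration of that claim (the drive count, per-drive throughput and memory-channel figures below are assumptions picked to make the arithmetic concrete, not measurements):

# Rough bandwidth arithmetic: aggregate NVMe throughput vs. the memory bus.
# All numbers are illustrative assumptions, not benchmarks.

drives = 24                      # assumed NVMe drive count in one server
per_drive_gbs = 7.0              # assumed sequential GB/s per PCIe 4.0 x4 drive
mem_channels = 8                 # assumed DDR4 channels per socket
per_channel_gbs = 25.6           # DDR4-3200: 25.6 GB/s theoretical per channel

storage_gbs = drives * per_drive_gbs           # ~168 GB/s from the drives
memory_gbs = mem_channels * per_channel_gbs    # ~205 GB/s per socket, theoretical

# If the filesystem touches each byte three times in RAM (device -> cache,
# cache -> cache, cache -> application), the memory bus has to carry a
# multiple of the storage stream.
copies = 3
needed_gbs = storage_gbs * copies

print(f"storage stream    : {storage_gbs:6.1f} GB/s")
print(f"memory bus        : {memory_gbs:6.1f} GB/s (theoretical)")
print(f"needed at {copies} copies: {needed_gbs:6.1f} GB/s -> "
      f"{'over' if needed_gbs > memory_gbs else 'under'} the memory bus")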

          Zero copy from the device into memory and back out is not becoming an optional feature, it is becoming a mandatory one. So there is no time for fancy ARC cache stunts that end up with three copies in RAM, and no time for the filesystem to have its own allocator and copy out to the host OS memory system.
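
          A minimal sketch of what zero copy means in practice on Linux, using Python's os.sendfile wrapper around sendfile(2) so the kernel moves file data straight to a socket instead of bouncing it through a userspace buffer; the path and port are placeholders for illustration:

import os
import socket

# Zero-copy sketch: stream a file to a TCP peer with sendfile(2), so the data
# moves inside the kernel instead of being read into a Python buffer first.
# '/srv/data/blob.bin' and port 9000 are placeholder values for illustration.

def serve_file_once(path: str, port: int = 9000) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, _ = srv.accept()
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # os.sendfile copies file pages to the socket inside the kernel;
            # a classic read()/write() loop would add a userspace copy.
            sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)
    conn.close()
    srv.close()

if __name__ == "__main__":
    serve_file_once("/srv/data/blob.bin")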

          Yes, we really do need to rethink where data checksumming happens as well. End to end, maybe the checksum of data from a file server in fact needs to be done by the client, with the client having a means to inform the server that a particular piece of data appears to have a problem. The same goes for compression, because the storage server can be out of resources purely from running the storage.
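
          A small sketch of that "client does the verification" idea: the server only ships the bytes plus the digest it has on record, and the client recomputes the checksum and reports back on a mismatch. The fetch_block/report_bad_block calls and the SHA-256 choice are hypothetical, not anything ZFS or SAP actually implements:

import hashlib

# Client-side verification sketch: the storage server only ships the data and
# the digest it has on record; the client spends its own CPU recomputing the
# checksum and tells the server when something looks corrupt.
# fetch_block() / report_bad_block() are hypothetical client API calls.

def verify_block(data: bytes, expected_digest: str) -> bool:
    return hashlib.sha256(data).hexdigest() == expected_digest

def read_with_client_side_check(client, block_id: int) -> bytes:
    data, expected_digest = client.fetch_block(block_id)      # hypothetical
    if not verify_block(data, expected_digest):
        client.report_bad_block(block_id)                     # hypothetical
        raise IOError(f"block {block_id} failed client-side checksum")
    return data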

          So yes, the video above is ZFS losing a user, because the ZFS design cannot perform on modern, nasty hardware. And yes, modern hardware is nasty in more ways than one, given that you can be forced to poll instead of using interrupts because interrupts are getting lost under the massive pressure.

          Yes, it is kind of insane that a dual-socket, 128-core/256-thread setup can basically be strangled by current-day high-speed storage media through not having enough memory bandwidth, as some other parties have found out. When ZFS was designed, the idea that you could be strangled by memory bandwidth was not even a possibility. The optimisation of not duplicating data in memory (zero-copy operation) is not an optional feature, it is a required feature in these new setups, and they will become more common.


          k1e0x, you might not like this, but I don't see ZFS lasting out 10 years if it does not majorly alter its path, because it is just going to become more and more incompatible with current-day hardware. Yes, current-day hardware is going to force us to reconsider how we do things as well. Being strangled by the RAM bus of the storage server is a really new problem.



          • I've seen that, yeah, that is an extreme problem; it's more so with the OS itself though.. Netflix had the same issue with SSL at 100 Gb. It will take serious OS engineering to solve this. Not Linux "throw it at the wall and see if it sticks" engineering.

            But yes, ZFS only works if the CPU is x times faster than the disk. If it isn't, you need something else. Generally that is true and it's been true for a long time. I don't see magnetic storage going away so I think things are fine. I don't think storage at that level really is practical economically.. but some people clearly have a use case here.. and we'll have to find a solution for them.

            I'm curious what Wendell did with polling (kqueue tends to way outperform epoll)

            Jeff Bonwick (ZFS creator) actually was trying to solve that problem.. He was working on 3d flash raid implementations.. DSSD I think was the company and they were sold to Dell. Dell dropped it so... wonder what Bonwick is up to now. The world needs him to solve storage (again). lol
            Last edited by k1e0x; 05 February 2020, 03:44 AM.



            • Originally posted by k1e0x View Post
              I've seen that, yeah, that is an extreme problem; it's more so with the OS itself though.. Netflix had the same issue with SSL at 100 Gb. It will take serious OS engineering to solve this. Not Linux "throw it at the wall and see if it sticks" engineering.
              Others using Linux did find a solution to the 100 Gb SSL problem.

              Originally posted by k1e0x View Post
              But yes, ZFS only works if the CPU is x times faster than the disk. If it isn't, you need something else.
              That is the problem: this is not going to remain a workable option.

              Originally posted by k1e0x View Post
              Generally that is true and it's been true for a long time. I don't see magnetic storage going away so I think things are fine.
              Magnetic storage can cause the same nightmare. Note that all those NVMe drives in that video are PCIe connected, so every two of them could instead be one SAS port at 12 Gb/s multiplexed out to drives. So you still get stomped cleanly into the ground.

              Originally posted by k1e0x View Post
              I don't think storage at that level really is practical economically .. but some people clearly have a use case here.. and we'll have to find a solution for them.
              Storage itself is not exactly the problem. You get stomped into the ground because we now have more PCIe lanes, with the possibility of transferring more data, than the CPU can safely handle.

              Originally posted by k1e0x View Post
              I'm curious what Wendell did with polling (kqueue tends to way out preform epoll)
              Yes, kqueue is faster, as long as the interrupts are not getting stomped into the ground by the massive flow of data. What Wendell did is use both polling and kqueue in combination: if kqueue has not received an interrupt in X time, do a poll to see whether the interrupt has disappeared or not. Yes, this massive data flow problem means that if you don't pick up an interrupt in time, another one from a different device can come in and overwrite the information. What Wendell did prevents having to reset drives/controllers, which would stall everything to death.
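
              The shape of that hybrid scheme, as described above, is roughly "wait for the completion signal, and if it does not arrive within a deadline, go poll the hardware yourself". A toy sketch of that logic, where the Event stands in for an interrupt and poll_device_status is a made-up placeholder (the real implementation lives in the kernel NVMe/block layer, not in Python):

import threading
import time

# Toy model of hybrid interrupt + polling completion handling:
# wait briefly for the "interrupt", and if it never arrives, fall back to
# polling the device so a lost interrupt does not force a controller reset.
# poll_device_status is a placeholder for reading a completion queue.

def wait_for_completion(irq_event: threading.Event,
                        poll_device_status,
                        irq_timeout_s: float = 0.001,
                        poll_interval_s: float = 0.0001,
                        max_polls: int = 1000) -> bool:
    if irq_event.wait(timeout=irq_timeout_s):
        return True                      # interrupt arrived in time
    for _ in range(max_polls):
        if poll_device_status():         # interrupt lost? ask the hardware directly
            return True
        time.sleep(poll_interval_s)
    return False                         # genuinely stalled: time to escalate/reset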

              Basically, your CPU is being stomped into the ground, and you have to deal with the fact that it is being stomped into the ground, so things like interrupts delivered over PCIe are no longer dependable.

              Originally posted by k1e0x View Post
              Jeff Bonwick (ZFS creator) actually was trying to solve that problem.. He was working on 3d flash raid implementations.. DSSD I think was the company and they were sold to Dell. Dell dropped it so... wonder what Bonwick is up to now. The world needs him to solve storage (again). lol
              You really missed it: this problem is system wide. The same thing can happen if you connect up a lot of accelerators. Basically, modern larger server systems have way, way too much PCIe bandwidth, and this is only going to get worse when systems move from PCIe 4 to PCIe 5/6.

              Basically, with this problem a storage specialist is useless; fixing it takes a memory management specialist and a PCIe specialist. The requirement is that those implementing filesystems also do not attempt to do their own thing with memory management, as duplication in memory makes the lack of memory bandwidth even worse.

              It is bad enough dealing with the massive wave of data the PCIe lanes in these modern systems allow, without third-party filesystems like ZFS doing things on the presumption that the CPU is X times faster than the disk.

              Remember, with AMD EPYC CPUs, instead of what LTT attempted with a 24-core and then a step up to a 32-core part, you could have a 12- or 16-core chip with the exact same number of PCIe 4.0 lanes and half the memory bandwidth again. That is your poor low-end storage server based on AMD chips, perfectly set up to be mega-stomped by the drives.

              So the CPU being X times faster than the disk is no longer the case all of the time. It is now possible, even with old-school HDDs, to have the disks X times faster than the CPU.


              AMD has basically turned the assumption ZFS was designed around upside down, and now we have to deal with it.



              • I don't think this has anything to do with filesystems per se. It's a kernel OS limitation.. you just can't move data through the kernel fast enough. All general-purpose OSes have this problem.

                and kqueue is FreeBSD's poll method.. so.. there you go again : shrug :



                • Originally posted by k1e0x View Post
                  I don't think this has anything to do with filesystems per se. It's a kernel OS limitation.. you just can't move data through the kernel fast enough. All general-purpose OSes have this problem.
                  I will give you that this is normally not a filesystem problem, up to a point: the iomap work is partly there to deal with this problem in XFS and other Linux filesystems that were doing memory duplication.

                  But something like ZFS is a big exception to the rule. ZFS, as you stated, does not bother integrating with how the kernel OS does things. The reality is that the kernel OS on this modern storage hardware is unable to move the data fast enough. Heck, even if you had a system running purely out of L1 cache while doing DMA transfers in and out of memory, the PCIe bandwidth could still outstrip your MMU transfer speeds, so even in a 100 percent perfect theoretical world you are still screwed; of course, an OS cannot be 100 percent perfect, because it has to be doing other things as well.

                  Having ZFS basically be its own OS kernel inside the OS kernel, with its own memory management, absolutely does not help, particularly given the reality that you are basically out of memory operations. Think of it this way: on a storage server you are now always behind on memory operations, and the more memory operations you need to do, the further behind you fall; you never catch up. Zero-copy operations have to become the norm, not the exception. The OS kernel can only do so much to enable zero-copy operations, and a filesystem like ZoL doing its own thing does not play along with the OS kernel improvements that increase zero-copy operation.

                  So the hardware design means that no matter how you design your kernel OS, there is now no way to make your in-memory operations faster than your PCIe transfers. That is problem one. Problem two is next.

                  Originally posted by k1e0x View Post
                  and kqueue is freebsds poll method.. so.. there you go again : shrug :


                  I was not thinking of FreeBSD; I was reading kqueue as a kernel queue, not as a FreeBSD feature. To work around the lost IRQ requests on AMD EPYC, Wendell was using the hybrid mode that Western Digital added to the Linux kernel.

                  PCI/PCIe is a pain. Using IRQs brings your CPU load down, but the PCI/PCIe specification allows interrupts to be randomly lost in an IRQ storm, i.e. when devices send you more interrupts than you can process. Yes, that is part of the PCI/PCIe specification. The next bit of evil: if you poll a device that is set to send an IRQ/interrupt when ready, the PCI/PCIe specification kicks in again, and by specification the device is now not supposed to send the interrupt at all.

                  So the kqueue and epoll level does not help you; this is in your OS block layer. EPYC CPUs have so much PCIe bandwidth that in theory you could run the complete system by polling and not run out of PCIe bandwidth; of course, you would not have any CPU time left to do anything.

                  All of this is bad enough. Now there is a final bit of evil in the PCI/PCIe specification: if a device sends an IRQ and does not get an answer in a reasonable time frame, whether the IRQ was lost or the system was just insanely busy, the device is to reset, so now the device has stalled out on you.

                  It is good that Western Digital gave Linux a hybrid mode between IRQs and polling, otherwise these storage servers would be in far worse trouble. Yes, they are still in trouble, but less than they would have been otherwise.

                  Please note I am writing PCI/PCIe because these problems have been in the specification since the first version of PCI, but we have not been running into them, because we did not have a large enough mismatch between the amount of PCI/PCIe transfer and the MMU's means to process it to cause any major volume of PCI/PCIe message loss as defined in the specification. AMD has nicely given us the EPYC CPU, which truly lives up to its name by being epic at pushing the PCI/PCIe specification right up to its breaking limit and then some.

                  Do note that polling devices, because you cannot depend on IRQs, means you have less CPU time. That is before you attempt to run the ARC cache or any of those fancy ZFS features.

                  There is a reason why I said we need to look at the storage server not running data validation and the client over the network running it instead: if things keep going this way, there will be less and less usable CPU time on the storage server itself once you connect up a massive number of storage devices.

                  The world that ZFS was designed in is disappearing. Yes, the PCIe system hits you twice: lack of CPU time and lack of MMU transfer speed.



                  • oiaohm iomap sounds like it's a re-implementation of netmap. netmap didn't replace the TCP/IP stack. As I said before it's going to take a lot of work to do this.. what makes you think FreeBSD and Illumos won't do this work and won't add it to ZFS? ZFS is the default on Illumos and though not the default in FreeBSD (not yet at least) it's extremely common. Saying de facto default might not be a stretch.

                    I'd imagine those two Unix OSes want to solve this too and will. There is also Apple, which wants to get back into the server game, and there are rumors of macOS Server becoming a thing again.. Whether they do or don't doesn't really matter, but they also have an open source Unix kernel and need to solve this. I somehow believe they will take a different approach than Linux does.
                    Last edited by k1e0x; 06 February 2020, 12:17 AM.



                    • Originally posted by k1e0x View Post
                      oiaohm iomap sounds like it's a re-implementation of netmap. netmap didn't replace the TCP/IP stack. As I said before it's going to take a lot of work to do this..
                      The problem here is the way the EPYC servers and the old lightweight NAS boxes are built. The zero-copy stuff is a lot of work to implement, but it is work that has to be done no matter what, because of the requirements the future hardware is throwing up.

                      Originally posted by k1e0x View Post
                      what makes you think FreeBSD and Illumos won't do this work and won't add it to ZFS?
                      These are two different problems. Illumos runs ZFS without an abstraction layer, so Illumos has a possibility of fixing this. If the ZoL developers treat FreeBSD the same way they have treated the Linux kernel, by not integrating properly, then ZFS on FreeBSD going forwards will end up just as non-functional as ZoL on Linux is now.

                      Originally posted by k1e0x View Post
                      Apple that wants to get back into the server game and there are rumors of macOS server becoming a thing again..


                      With macOS Server there are always rumours that it will become a thing again, right before another release where it does not.

                      Originally posted by k1e0x View Post
                      I somehow believe they will take a different approach than Linux does.
                      Yes, they may take a different approach than Linux does. But the options for dealing with limited CPU and limited memory bandwidth are also limited. Any form of abstraction layer causing memory duplication cannot be tolerated, no matter how much coding or legal effort is required to fix it. Yes, this includes possibly having to rewrite large sections of ZFS under a proper GPL-compatible license for Linux, or bringing the ZFS for FreeBSD project back from the dead.

                      The reality here is that what was the easy way out for ZoL in the past does not work going forwards.

