Linus Torvalds Doesn't Recommend Using ZFS On Linux


  • Originally posted by oiaohm View Post

That is not exactly true. ZFS also loses to XFS in single-HDD setups.



This is what I call biased benchmarking: ZFS gets a ZIL and L2ARC, XFS is given no cache at all, and then a win is declared.

    https://www.redhat.com/en/blog/impro...mance-dm-cache

Yes, dm-cache, bcache and similar solutions really do speed up XFS a lot. Mirrors plus a cache under XFS normally beat ZFS with ZIL and L2ARC, at least now; there has been a recent change.

Notice that a warm cache under XFS works out to roughly 4x faster, which is about the same boost you get from enabling L2ARC on ZFS, except that ZFS starts from a slower baseline. So ZFS with ZIL and L2ARC does not catch up to XFS with a cache; the gap actually gets wider in XFS's favour, not narrower. The only reason ZFS with L2ARC wins benchmarks against XFS is that the people benchmarking are not giving XFS a cache.

So yes, you complain that normal filesystem benchmarks are unfair because they don't allow L2ARC, but those trying to sell ZFS do the reverse: they don't give XFS or any other filesystem any of the other caching options.

Really, benchmarking ZFS without its cache is not unfair; in fact it gives ZFS a better chance than having to face XFS with a cache. Fair competition here means that if you have a solid-state drive for caching, you compare every filesystem set up to use it, and a filesystem does not need a built-in cache feature to have a block-level cache underneath it.

It surprises a lot of people how poorly L2ARC and ZIL actually perform compared to other cache options. The data-integrity features do not come free.



ZFS send is a good feature, and there is currently no good replacement for it. But not every workload needs that integrity and replication. A PostgreSQL database with its WAL, for example, doesn't need ZFS send or filesystem-level integrity, so I/O performance matters more, and its own backup system provides that functionality.

Basically, the ZFS features that affect I/O make it less than ideal for particular workloads.



It's about time you stopped this lie. You can put a cache under XFS on Optane as well and see insane performance boosts. If your objective is IOPS, ZFS never wins.

Something to wake up to: the iomap change in the Linux kernel is a major one, because it allows the VFS layer to send requests straight to the block layer when the block-mapping information from the filesystem has already been obtained and is held in the iomap.

Why doesn't XFS have data-block checksumming or compression? Simple: it makes no sense when you are planning to let the VFS layer bypass the filesystem layer. This model change means the filesystem driver is only there to process filesystem metadata, so in this model compression and checksumming belong either in the VFS or in the block layer.
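    To illustrate the idea only (a conceptual sketch, not the kernel's iomap API; the extent numbers and names are invented): once the filesystem has handed over a file's extent map, reads can be routed straight to the block device without re-entering the filesystem driver.

```python
import os
from bisect import bisect_right

# Hypothetical extent map for one file: (file_offset, device_offset, length).
# This is the kind of answer an iomap-style lookup hands back once, so later
# reads never have to call back into the filesystem driver.
EXTENTS = [
    (0,         10 * 2**20, 1 * 2**20),   # first 1 MiB lives at +10 MiB on disk
    (1 * 2**20, 64 * 2**20, 3 * 2**20),   # next 3 MiB lives at +64 MiB on disk
]

def read_mapped(dev_fd, extents, file_off, length):
    """Serve a file-relative read straight from the device using the map."""
    starts = [e[0] for e in extents]
    out = bytearray()
    while length > 0:
        i = bisect_right(starts, file_off) - 1          # extent covering file_off
        f_start, d_start, e_len = extents[i]
        skip = file_off - f_start
        chunk = min(length, e_len - skip)
        out += os.pread(dev_fd, chunk, d_start + skip)  # read from block device
        file_off += chunk
        length -= chunk
    return bytes(out)
```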

You can put any cache you want under XFS. Caching is how real-world systems are tested and deployed, and that is what matters.

    It surprises a lot of people how poorly L2ARC and ZIL actually perform compared to other cache options.
    Really..?
    https://www.usenix.org/legacy/events...do/megiddo.pdf
    (Original paper. Look at table VIII on page 15: the ARC nearly outperforms a tuned offline cache.)

    Bryan Cantrill did a review of that paper, if you're more the audio/visual type.
    Bryan Cantrill on "ARC: A Self-Tuning, Low Overhead Replacement Cache" by Nimrod Megiddo and Dharmendra Modha ( https://www.usenix.org/legacy/event/fast03/te...


    And Allan Jude did an ELI5 talk on the algorithm, if you are wondering what all the numbers and symbols in the other sources mean.
    Allan Jude at FOSDEM 2019: https://video.fosdem.org/2019/K.1.105/zfs_caching.webm (an in-depth look at how caching works in ZFS, specifically the Adaptive Replacement Cache).


    I think the true value of the ARC algorithm is that datasets evolve over time, and the cache needs to be able to adapt to those changes. When you know how it works and how small the adjustments are as it adapts, it's really interesting that it works so well... I would have assumed a much more complicated algorithm was needed to accomplish this... but apparently not. The ARC is relatively simple.

    It's open source btw.. feel free to re-implement it.
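    Since you mention re-implementing it: here is a compact Python sketch of the policy as the Megiddo/Modha paper linked above describes it (two LRU lists plus two "ghost" lists and a single adaptive target p). It's a toy for illustration only, tracking bare keys; it is not the ZFS ARC code.

```python
from collections import OrderedDict

class ToyARC:
    """Sketch of the ARC policy from the Megiddo/Modha FAST '03 paper.

    T1 holds pages seen once recently, T2 pages seen at least twice;
    B1/B2 are "ghost" lists that only remember evicted keys. The target
    size p of T1 adapts whenever a ghost list gets a hit. Payloads are
    omitted: this tracks keys only, purely for illustration."""

    def __init__(self, c):
        self.c, self.p = c, 0
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def _replace(self, key):
        # Evict the LRU of T1 or T2 into the matching ghost list.
        if self.t1 and (len(self.t1) > self.p or
                        (key in self.b2 and len(self.t1) == self.p)):
            self.b1[self.t1.popitem(last=False)[0]] = None
        elif self.t2:
            self.b2[self.t2.popitem(last=False)[0]] = None
        elif self.t1:  # defensive fallback so the toy never crashes
            self.b1[self.t1.popitem(last=False)[0]] = None

    def access(self, key):
        """Return True on a cache hit, False on a miss."""
        if key in self.t1:                       # hit: promote to T2
            del self.t1[key]; self.t2[key] = None
            return True
        if key in self.t2:                       # hit: refresh recency
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                       # ghost hit: favour recency
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(key); del self.b1[key]; self.t2[key] = None
            return False
        if key in self.b2:                       # ghost hit: favour frequency
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(key); del self.b2[key]; self.t2[key] = None
            return False
        # Complete miss: make room, then insert into T1.
        l1, l2 = len(self.t1) + len(self.b1), len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False); self._replace(key)
            else:
                self.t1.popitem(last=False)
        elif l1 < self.c and l1 + l2 >= self.c:
            if l1 + l2 == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(key)
        self.t1[key] = None
        return False
```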
    Last edited by k1e0x; 29 January 2020, 07:33 PM.



    • Originally posted by k1e0x View Post
      (Original paper. Look at table VIII on page 15: the ARC nearly outperforms a tuned offline cache.)
      I am not putting it head to head with a tuned cache but with auto-tuning caches. dm-cache and bcache are both auto-tuning.

      You need to read page 2 and pay very careful attention to the first sentence.

      We consider the problem of cache management in a demand paging scenario with uniform page sizes.
      One problem: is Linux a uniform page size system?
      Matthew Wilcox, "Large Pages in Linux": https://lca2020.linux.org.au/schedule/presentation/45/ (Since 2002, Linux has used huge pages to improve CPU performance. Originally, huge pages...)


      The answer is no. And Linux is going to become more and more of a non-uniform page size system. Something designed around uniform page sizes will become more and more incompatible with Linux, and that incompatibility causes performance degradation.

      Originally posted by k1e0x View Post
      I think the true value of the ARC algorithm is that datasets evolve over time, and the cache needs to be able to adapt to those changes. When you know how it works and how small the adjustments are as it adapts, it's really interesting that it works so well... I would have assumed a much more complicated algorithm was needed to accomplish this... but apparently not. The ARC is relatively simple.
      This is all nice in theory. The problem is that the ARC algorithm is not designed to adapt to a non-uniform page size system, whereas dm-cache in Linux is. Yes, dm-cache also uses a relatively simple formula. We don't need a super-complex formula, but you do need to be slightly smarter than the ARC solution to deal with a non-uniform page size system.

      The iomap and memory-management changes in the Linux kernel are not small ones; the design runs totally counter to the way the ZFS zpool and ARC actually work. Both the iomap and the memory-management changes are about making non-uniform page sizes work. So you will see more and more workloads with a mix of huge and non-huge pages, which bring out the worst in the ZFS ARC cache, and that worst case is going to become the normal case. Unless you wake up, you are in trouble. The old interfaces into the Linux kernel are not going to give the cache/block systems inside ZFS the information they actually need from the Linux memory-management system to know what is going on with non-uniform pages.

      By the way, everything you have referenced, k1e0x, is already old and obsolete for Linux. Worse, both of those videos were by BSD guys who are unaware of how Linux is changing.

      The problem here is that BSD/Windows/OS X have not moved to a non-uniform page size system, so those developing on BSD/Windows/OS X have not seen this problem coming. Linux is ahead of the pack in treating non-uniform page sizes as the normal case rather than the rarity.

      This change in Linux does raise some very interesting questions for future filesystem design. The concept of using one block size across the complete filesystem could be wrong, particularly given flash's insanely fast seek times.

      Linus can see these upcoming changes that ZFS is not ready for. The ZFS developers are putting their faith in things that will not be future-compatible.



      • Originally posted by oiaohm View Post
        Well, I'm not sure you're right. ZFS supports many architectures, so its page cache size is probably set at compile time, but it doesn't really matter. Let's say Linux desperately wants to keep ZFS out of Linux.. fine.. they will only be shooting themselves in the foot and losing customers and developers, because people will just base their storage products on FreeBSD. All the big vendors are either FreeBSD or Solaris/Illumos based now anyhow; I think only Datto is ZoL based. People do use ZoL in smaller deployments, it's popular and a lot of people want to use it, but.. if that's not possible.. FreeBSD is right there waiting to gain market share.

        It isn't really that hard to change the underlying platform.. it's the path of least resistance. Use FreeBSD now.. or develop something else on Linux.. which is easier?

        What can you do? We are trying to help Linux get its big-boy pants on and do real storage.. but if they want to have a temper tantrum.. : shrug : ZoL's implementation is really good, that's why FreeBSD uses it now.. it's a shame they get treated this way by the core team.

        You know... in 10-20 years we will probably end up in this really weird world where Linux is up the stack running the applications and FreeBSD is running the metal, network and storage.. just a weird thought. Might be true though.. the OSes seem to be moving that way.
        Last edited by k1e0x; 30 January 2020, 06:58 PM.



        • Originally posted by k1e0x View Post
          Well, I'm not sure you're right. ZFS supports many architectures, so its page cache size is probably set at compile time, but it doesn't really matter. Let's say Linux desperately wants to keep ZFS out of Linux.
          Matthew Wilcox, "Large Pages in Linux": https://lca2020.linux.org.au/schedule/presentation/45/ (Since 2002, Linux has used huge pages to improve CPU performance. Originally, huge pages...)


          You need to watch the YouTube video. The title is "Large Pages in Linux".

          There is a nightmare problem at the memory-management level. 64 GiB of memory in 4 KiB pages is roughly 16.8 million pages to keep track of. On x86_64 you can use 4 KiB, 2 MiB and 1 GiB page sizes (32-bit x86 also had 4 MiB pages): with 2 MiB pages, 64 GiB is 32,768 pages to keep track of, and with 1 GiB pages it is only 64. Of course the 1 GiB page size could make more sense as server memory keeps increasing.

          So you cannot make every page 2 MiB, because you would waste too much memory to internal fragmentation. But you cannot practically make every page 4 KiB either, because then you waste an enormous amount of processing on memory management. Basically, the Linux kernel is taking the idea from the SLUB allocator and applying it to system-wide memory.
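          To put numbers on that trade-off (plain arithmetic, nothing kernel-specific):

```python
GiB = 1 << 30
ram = 64 * GiB

# Bookkeeping entries needed to track 64 GiB at different page sizes.
for name, page in [("4 KiB", 4 << 10), ("2 MiB", 2 << 20), ("1 GiB", 1 << 30)]:
    print(f"{name} pages: {ram // page:>10,} to track")

# 4 KiB pages: 16,777,216 to track
# 2 MiB pages:     32,768 to track
# 1 GiB pages:         64 to track
```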

          The result of the Linux kernel changes is that you cannot set the page cache size at compile time as a single value any more. You don't have a single pool of memory; instead you have pools of memory based on CPU page sizes. If the system asks the filesystem to provide an aligned huge page, going forward it had better be able to, so there is no extra double handling converting 4 KiB pages into huge pages.

          The reality is that the current form of ZoL may work on current Linux, but without major changes ZoL is doomed, because it does not support the memory model that more modern Linux kernels will require, and that is a different beast from the operating systems ZoL was designed for.

          Originally posted by k1e0x View Post
          It isn't really that hard to change the underlying platform.. it's the path of least resistance. Use FreeBSD now.. or develop something else on Linux.. which is easier?
          The path to hell is paved with good intentions. The path-of-least-resistance argument lets you ignore that the current ZFS design is fatally flawed.

          The issue is that the requirements of the underlying platform have changed since ZFS was designed.

          Originally posted by k1e0x View Post
          ZoL's implementation is really good, that's why FreeBSD uses it now.. it's a shame they get treated this way by the core team.
          The reality is that the ZoL developers have had to take on FreeBSD support because the core developers of the FreeBSD ZFS port dwindled away. Really, it's lucky the ZoL developers were rejected by the mainline Linux kernel, or FreeBSD would not have ZFS any more either.

          Originally posted by k1e0x View Post
          You know... in 10-20 years we will probably end up in this really weird world where Linux is up the stack running the applications and FreeBSD is running the metal, network and storage.. just a weird thought. Might be true though.. the OSes seem to be moving that way.
          FreeBSD will at some point have to address the problem the Linux kernel developers have run into. So the current form of ZFS, which is a problem for Linux now, will be a problem for FreeBSD in the long term as well.

          The problem that has driven the Linux kernel change comes from the bare metal, and these changes will affect how filesystems need to operate going forward. So Linux running on the metal with FreeBSD above it, so that your hypervisor is not killing your performance with overly complex page tables, is the way it would have to be at the moment.

          k1e0x, you really need to watch that video and take it in. The licence problem is not the only problem.



          • Originally posted by oiaohm View Post

            I did watch your video. I didn't find it all that interesting; it's a problem, yeah. I'm not sure "variable" is the right word, but we'll see. After seeing what you are talking about: no, I don't think it will affect ZFS at all.. every other filesystem, sure. ZFS, no, because ZFS implements Solaris's (famous and much-imitated) slab memory allocator. I don't see why that can't run in a huge page or anything else.

            One thing to note here is that FreeBSD, I believe, doesn't even use the slab, because their own memory manager is so close to it that they didn't need to change.. however, that may no longer be the case in future releases because of their transition to ZoL/OpenZFS.

            You talk a lot about the design but you don't really seem to know the design very well. So.. here we go.

            Back in the early 2000s, Jeff Bonwick (the same guy who wrote the slab allocator), frustrated that hard disks and storage management were such a pain to manage, set out with Matt Ahrens to reinvent the wheel. The basic idea was that storage could be like RAM: a resource your computer automatically manages for you. (Since leaving Sun, Jeff has been CTO for several companies and has been rather quiet, though he does make an appearance every now and then.)

            You talk about the traditional, classical layers and their importance; the trouble (and you rightly identified it) is that those layers are blind to each other.

            Historical model:
            - Loopback layer (qcow2, overlays)
            - Posix layer (VFS, read/write)
            - File layer (ext3, xfs etc)
            - Logical volume layer (LVM, MD Raid, LUKS or sometimes all 3 chained together! All pretending to be a block layer to each other.)
            - Block Layer (fixed block size, usually 4k)

            In ZFS they ripped all that out and changed it.

            ZFS model:
            - ZPL (Speaks a language the OS understands, usually posix.)
            --- Optional ZVOL layer (pools storage for virtual machines, iSCSI, Fibre Channel, distributed storage etc., with no extra layer added on top like with qcow2)
            - DMU (re-orders the data into transaction groups)
            - SPA (SPA works with plugins to do LVM, Raid-Z, Compression using existing or future algorithms, Encryption, other stuff not invented yet, etc. It can even load balance devices.)
            - Block layer (variable block size)

            ZFS rewrote how all the layers work and made them aware of each other. It takes blocks, bundles them up as objects in transaction groups, and that is what actually gets written. You can find out more about that here: https://www.youtube.com/watch?v=MsY-BafQgj4

            Some observations on where these systems are going? Well.. this is going to be a bit of a rant, but as you know I'm a systems engineer, and I recently had to quickly add 4 IPs to an interface temporarily on Ubuntu. Have you ever really used the "ip" command? It drives me nuts. Every real OS for 40 years used ifconfig; Linux changed this a few years back, and the first thing they did was break backward syntax and rename the tool to something meaningless. "ip"? What about network interfaces that don't speak IP? IP is a pretty common protocol, but it's not every protocol. Using this command I bungled the obtuse syntax, it had grammar mistakes (e.g. "you are an error"), and it told me I was using a deprecated syntax and had to update my scripts. I'm not using a script, I'm typing it! Does anyone on Linux type this anymore? Looking at Ubuntu's networking, it's a tangled web of Python scripts that call other scripts. It's got four different ways to configure an IP: systemd, netplan, Debian net.if and NetworkManager. And this is the work of developers. They like systems like this with rich APIs, options and YAML: "Dude, got to have some yaml or json.. but not XML, never XML, that's sooo last decade bro". This is the stuff I bleed over daily. We are the ones who have to deal with it when the code doesn't work as intended. Or has very odd behavior. Or just sucks and is slow and nobody knows why (but it works fine on the developer's laptop).

            Know how you do it on FreeBSD?
            ifconfig, the same way as always, and one line in rc.conf. Python isn't even installed by default! Simplicity has a lot of value.

            Linux was deployed and put into the position it holds by sysadmins, because it was simple and they needed to solve problems. It's no longer like that, however..

            You can see the same thing in KVM and bhyve. bhyve is almost a different class of hypervisor in that it's a few hundred kilobytes in size and does no hardware emulation (qemu) at all. KVM, well, how bloated CAN it get? At least those gamers will be able to pass through Skyrim... So if you want a really thin, light hypervisor, bhyve is your go-to; a FreeBSD system after install probably has fewer than 10 PIDs you actually need. Ubuntu has hundreds.

            I do believe that in 40 years both OSes will still be around.. but they may look very different by then.
            Last edited by k1e0x; 01 February 2020, 02:19 AM.



            • Originally posted by k1e0x View Post
              ZFS implements Solaris's (famous and much-imitated) slab memory allocator. I don't see why that can't run in a huge page or anything else.
              This is where you are stuffed.

              Watch the video again: https://www.youtube.com/watch?v=p5u-vbwu3Fs, "Large Pages in Linux". This is not a slab memory allocator. This is not filesystems having their own allocation system. To be Linux compatible, that whole slab memory allocator has to go, replaced by the large-page system.

              Originally posted by k1e0x View Post
              One thing to note here is that FreeBSD, I believe, doesn't even use the slab, because their own memory manager is so close to it that they didn't need to change.. however, that may no longer be the case in future releases because of their transition to ZoL/OpenZFS.
              That is the start of the problem.

              Originally posted by k1e0x View Post
              Historical model:
              - Loopback layer (qcow2, overlays)
              - Posix layer (VFS, read/write)
              - File layer (ext3, xfs etc)
              - Logical volume layer (LVM, MD Raid, LUKS or sometimes all 3 chained together! All pretending to be a block layer to each other.)
              - Block Layer (fixed block size, usually 4k)
              You did not watch the video with proper attention, or you missed slide 12. First line:
              "Block layer: already supports arbitrary size pages, thanks to merging."

              Funnily enough, so does the logical volume layer, because it lives inside the Linux kernel block layer. Variable block sizes have existed in Linux all the way up to just below the filesystem. The problem is at the filesystem drivers, and iomap is the plan to fix that.

              So this historical model does not in fact match Linux. DMA means you never had to use a fixed block size: an HDD might have 4k sectors and require aligned writes, but nothing prevents you from using 32 KiB or 64 KiB ... as long as it is aligned. This was basically in the Linux kernel's block layer from the start. A lot of people writing filesystems on Linux brought the 4k limitation in at the filesystem layer, and with huge pages that no longer works.

              The large-page work is about bringing the block layer's idea of allocation and the memory-management/OS page cache into agreement. That way you can have one allocation scheme from top to bottom.

              Originally posted by k1e0x View Post
              In ZFS they ripped all that out and changed it.

              ZFS model:
              - ZPL (Speaks a language the OS understands, usually posix.)
              --- Optional ZVOL layer (pools storage for virtual machines, iSCSI, Fibre Channel, distributed storage etc., with no extra layer added on top like with qcow2)
              - DMU (re-orders the data into transaction groups)
              - SPA (SPA works with plugins to do LVM, Raid-Z, Compression using existing or future algorithms, Encryption, other stuff not invented yet, etc. It can even load balance devices.)
              - Block layer (variable block size)

              ZFS rewrote how all the layers work and changed them to be aware of each other. It actually takes blocks, bundles them up as objects in transaction groups and that is what's actually written.
              Was it required to rewrite all the layers to make them aware of each other? The answer is no. Why bundle everything up into objects instead of improving the OS page cache, as the large-pages work does for all filesystems over time?

              The ZPL means you must be translating. You are also ignoring the host OS big time. This becomes a huge excuse for keeping your own internal allocator that is not in fact aligned with the host OS.

              You also miss that the block sizes ZFS wants to use go up to 1 MiB, while huge pages on x86 are 2 or 4 MiB. 1 MiB made sense on a Sun SPARC CPU and on 32-bit x86, but we use 64-bit x86 these days. There are a lot of design decisions in ZFS based on hardware we no longer use that need to be redone as well.

              Consider this: your ZFS model is wrong on Linux.

              Loopback
              VFS
              ZFS
              -ZPL
              ---ZVOL
              -DMU
              -SPA
              -Block layer ZFS.
              Linux Block layer.

              I suspect FreeBSD will end up in an equally bad mess. The rip-out-and-replace that was ZFS's objective under Solaris has not happened on non-Solaris systems.

              Those layers are also sitting on another abstraction layer: you skipped the SPL (Solaris Porting Layer), which is basically "let's keep using the Solaris APIs forever". Of course, as Linux behaviour becomes less Solaris-like, this becomes an increasing problem. At some point FreeBSD will change things and have trouble as well. For example, it was particularly hard to implement the SSD TRIM command in ZFS for Linux; ZFS was almost a decade late getting that feature compared to other Linux filesystems.

              Originally posted by k1e0x View Post
              Have you ever really used the "ip" command? It drives me nuts. Every real OS for 40 years used ifconfig,
              Yes I have and I have been very thankful for it.

              "In a Linux environment, I need to detect the physical connected or disconnected state of an RJ45 connector to its socket. Preferably using BASH scripting only. The following solutions which have ..."

              For these kinds of problems the ip monitor feature is great.

              With the ip command you can also do things like changing or removing the default gateway without having to disconnect and reconnect for the routing change to take effect. And if you are dealing with a slightly suspect managed switch, where the RADIUS messaging to activate a port is a roll of the dice, it's great to be able to make changes like that with the network card up.
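              For the link-detection case above, a minimal sketch of the sort of check being described: it reads the kernel's sysfs carrier state rather than parsing tool output, and "eth0" is only a placeholder interface name.

```python
from pathlib import Path

def link_state(ifname: str) -> str:
    """Return 'up', 'down' or 'unknown' for a Linux network interface.

    /sys/class/net/<ifname>/carrier tracks the physical link state,
    independent of whatever addresses are configured on the interface.
    """
    try:
        raw = Path(f"/sys/class/net/{ifname}/carrier").read_text().strip()
    except OSError:               # interface absent or administratively down
        return "unknown"
    return "up" if raw == "1" else "down"

print(link_state("eth0"))         # placeholder interface name
```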

              Originally posted by k1e0x View Post
              Linux was deployed and put into the position it was by sysadmins because it was simple and they needed to solve problems.
              Really, the ip command came into existence because the POSIX-standard ifconfig command cannot do a stack of different things well. What the Linux network stack allows well and truly exceeds what the FreeBSD one allows today. The fact that FreeBSD has not found itself needing to replace or massively extend ifconfig is more a sign of how far behind the FreeBSD network stack has fallen feature-wise.

              Originally posted by k1e0x View Post
              Same thing you can see in KVM and bhyve. bhyve is almost a different class of hypervisor in that it's a few hundred kilobytes in size and does no hardware emulation (qemu) at all. KVM well, how bloated CAN it get?
              I get sick of FreeBSD people pulling this one. https://lwn.net/Articles/658511/ In reality, if you fire up Linux KVM with kvmtool, there is no hardware emulation either. I call this horrible naming: the kvm command is qemu modified to take advantage of the KVM kernel feature of Linux, while kvmtool uses the same KVM kernel feature without the qemu bits and so is insanely lightweight.

              So KVM in the Linux kernel is not as bloated as you want to make out. For KVM userspace you have the feature-rich kvm command, based on qemu with all its hardware emulation, and the feature-poor kvmtool, which brings you back to something like bhyve with the same OS-support limitations.




              • Originally posted by oiaohm View Post
                Loopback <- not used (implemented as ZVOL)
                VFS <- not used (implemented by ZPL)
                ZFS <- This isn't a layer the whole thing is ZFS..?
                -ZPL <- ZPL is optional, things like Lustre can talk to the DMU
                ---ZVOL <- Also optional, to provide block devices that don't have to go through the ZPL.
                -DMU
                -SPA
                -Block layer ZFS.
                Linux Block layer <- Not used, redundant.
                oiaohm, just a quick response here.. you're missing the design. The entire stack is different; it doesn't layer on top of and reuse things like everything else in Linux does (and pretty much every other OS too, Windows included). That is why I call the other one the *historical* design. ZFS is not a shim; it's designed to provide exactly what things need without going through unnecessary layers.

                All you need is the DMU, the SPA and a block device.. that's it. Some applications talk directly to the DMU. The SPL has nothing to do with the I/O path; it implements the slab, the ARC and other things for the DMU and SPA. The SPL is also only used on Linux, due to the need to separate out the kernel modules.. and in the latest versions it's been integrated anyhow. On FreeBSD there is no SPL. (Obviously there isn't one on Illumos either.)

                I also know you can change block sizes; that's age-old, like you said, and it isn't the point. The point is that ZFS block size is variable and chosen on the fly automatically (so a 512-byte write gets a 512-byte block, a 4k write gets a 4k block, etc.), which means it efficiently manages slack.
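                Roughly the idea, as a simplified sketch: pick the smallest power-of-two block that fits the write, capped at the recordsize. This is an illustration of variable block sizing, not ZFS's actual allocation code, and the parameter names are just placeholders.

```python
def pick_block_size(write_len: int, recordsize: int = 128 * 1024,
                    min_block: int = 512) -> int:
    """Smallest power-of-two block that fits the write, clamped to
    [min_block, recordsize]. Larger writes get split into recordsize
    blocks; small writes don't pay for a full record of slack."""
    size = min_block
    while size < min(write_len, recordsize):
        size *= 2
    return min(size, recordsize)

# pick_block_size(512)      -> 512
# pick_block_size(4096)     -> 4096
# pick_block_size(1 << 20)  -> 131072  (capped at the 128 KiB recordsize)
```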

                Also, the slab is really, really good.. if it weren't, it wouldn't have been imitated or copied by every other OS (including Linux).. be very careful redesigning this.. *most* other people got it wrong before Sun. Do you want to put hobbyist or millennial programmers on this, or seasoned engineers who suffered some of the most agonizing problems and pain converting BSD to SVR4? Solaris was built out of sadness and suffering.. not "I've got to make this widget then hop on Instagram".

                ZFS implemented TRIM in 2013; the Linux port was just late to the party on that feature.
                Last edited by k1e0x; 02 February 2020, 12:40 AM.



                • Originally posted by k1e0x View Post
                  Loopback <- not used (implemented as ZVOL)
                  VFS <- not used (implemented by ZPL)
                  Loopback is used by snapd and other things, and snapd is not smart enough to use a ZVOL. So maybe it is important to improve loopback for legacy applications rather than just saying "not used".

                  The VFS is used for Linux filesystem namespaces (systemd uses these like they are going out of fashion), so the Linux VFS layer is not implemented by the ZPL, because the ZPL does not contain the namespace features. The Linux kernel VFS layer with its page cache is therefore always there on Linux, which makes ZFS a round peg in a square hole unless it is redesigned from the ground up.

                  Originally posted by k1e0x View Post
                  Linux Block layer <- Not used, redundant.
                  Your SATA/SAS controller drivers and so on under Linux force you back into the Linux block layer to interface with them. So it is not redundant as claimed: without the Linux block layer, ZFS on Linux cannot write to local discs. And if you go out over the network, you still need to use the same allocation system as the Linux kernel to keep zero-copy possible.

                  So all your so-called corrections were wrong, k1e0x.

                  Originally posted by k1e0x View Post
                  The entire stack is different; it doesn't layer on top of and reuse things like everything else in Linux does.
                  That is the problem: ZFS breaks the possibility of zero-copy operations from the block device (SATA/SAS... controller) to the page cache.

                  Originally posted by k1e0x View Post
                  I also know you can change block sizes; that's age-old, like you said, and it isn't the point. The point is that ZFS block size is variable and chosen on the fly automatically (so a 512-byte write gets a 512-byte block, a 4k write gets a 4k block, etc.), which means it efficiently manages slack.
                  My point is not just the block sizes; it is how those blocks are placed in RAM, so that you end up with zero-copy.

                  Originally posted by k1e0x View Post
                  Also, the slab is really, really good.. if it weren't, it wouldn't have been imitated or copied by every other OS (including Linux).. be very careful redesigning this..
                  The horrible point here is that slab in the Linux kernel is superseded technology. Yes, the Linux kernel took in slab, but the kernel developers reworked how it functions in a big way. When the Linux kernel was made to support huge pages, slab got redesigned into a new beast; the new system uses a different approach to work better with multiple different page sizes. The large-pages work redesigns sections of the memory system yet again.

                  So basically the slab technology ZFS is using is one to two generations behind what the Linux kernel has, and ZFS's slab usage under Linux is pointless duplication using a less effective method. One problem: the Linux replacement for SLAB is under the GPLv2, which happens to be incompatible with the CDDL.



                  • Originally posted by oiaohm View Post

                    Well, nothing's perfect. lol

                    Yeah, there are differences (the slab on Solaris, and FreeBSD's version of it, have also seen improvements and changes). FreeBSD is much less radical about the changes it makes. The block layer uses a driver??? No way! Every OS does; the difference is HOW it uses it. They would NEVER rip out ifconfig. You'd never see it happen.. they would fix its limitations (and have: it does wifi on FreeBSD, like it *should*). It's developed by people who *like* Unix and are not in a power struggle with another distro or at the whims of whoever controls X project. It's a team (the core team) and they make sound engineering decisions for what their whole future is going to be.. a lot of them seem to be more seasoned programmers too. Kirk McKusick still maintains and improves UFS, which he created as a student at Berkeley in the 70s. (Guess what: the grandfather of filesystems likes ZFS too.)

                    On ZFS, it would be really nice to see Linux do the same thing FreeBSD did and not need some of these pieces imported from the Illumos branch. I don't see any reason why snap can't use a ZVOL; it's all Ubuntu technology, so they could just adopt it. You can dd anything to a ZVOL and use it like a file. ZVOLs are really nice, by the way.. have you played with them? You can do a lot of interesting things with them. That layer really needs to get adopted on other systems.

                    I think in the end Linux fundamentally wants to be something VERY different from what sysadmins want. I haven't really seen anything that great come out of Linux in about 10 years.. They are catering to a home-user market that just doesn't exist, and they no longer seem to give a shit about being Unix-like anymore, so.. yeah.. not much of a loss.. I personally think Linux gaming is the dumbest thing in the world; it's fine if it works, but Linux is a server OS. : shrug : Isn't it? And if it's not.. I don't care about it. What a user runs on his workstation doesn't interest me.. could be Android for all I care.. it's just a browser anyhow.

                    I think I've kind of won you over on ZFS, by the way. Yeah.. it's a good thing in the world. Hopefully the dream of making filesystems as easy as RAM will become a reality. "Storage your computer manages for you": that is the idea.. they aren't there yet, but they know that and are working towards it.

                    How does ZFS do in the enterprise? People don't really tell us who is *really* using it; businesses tend to be very private about their critical core infrastructure.. but anonymously, who uses it? Well, look at Oracle's stats:

                    ZFS was able to top SAP storage benchmarks for 2017.
                    2 / 3 Top SaaS companies use it.
                    7 / 10 Top financial companies use it.
                    9 / 10 Top telecommunication companies use it.
                    5 / 5 Top semiconductor companies use it.
                    3 / 5 Top gas and oil companies use it.
                    2 / 3 Top media companies use it.
                    It has a good data-center footprint too: 9 PB per rack.

                    But who would care about that market.. small peanuts. Disney, Intel, AT&T, JP Morgan, or ARCO, meh.. not important. No need to put this in Linux. Linus is probably right; we should make gaming better on Linux. lol
                    Last edited by k1e0x; 03 February 2020, 04:34 AM.



                    • Originally posted by k1e0x View Post

                      1. I have never personally seen an Oracle ZFS storage appliance used in the enterprise; I know they exist, I've just never seen them. Only Sun Microsystems gear from before Solaris 10 (pool version ~28). Generally it's used on FreeBSD storage clusters that act as secondary storage behind NetApp, EMC or DDN (around 32-64 spindles). You can get paid commercial ZFS (and ZoL) support from both FreeBSD third parties and Canonical. Ubuntu 19.10 has ZFS-on-root in the installer and so will 20.04 LTS.
                      A. I was talking about Oracle Solaris and Oracle Unbreakable Linux.
                      B. I'm not sure whether Canonical's enterprise offering includes full support for ZFS. Last I heard, it was marked as "experimental" (read: unsupported).

                      2. All I've got to say about ext4 is: hope you like fsck.
                      I've got a couple of PB worth of storage, be it GlusterFS clusters, oVirt clusters or our own proprietary application, that seems to suggest otherwise.
                      But feel free to think otherwise.

                      3. You don't know what you're talking about. COW alone has nothing to do with bit-rot or uncorrectable errors. You're thinking of block checksums, and yes, they are good. COW provides other features such as snapshots, cloning and boot environments. Boot environments are pretty cool.. maybe Linux should get on that... oh wait.. ZFS is the only Linux file system that does it and we can't have *that*.
                      A. You assume I don't understand the difference between checksums and COW. No idea why.
                      B. .... What makes you think I need it (not the former, the latter)?

                      I believe you're completely missing my point.
                      Using ZoL in an enterprise environment without enterprise-grade support is *foolish*. If it breaks, you get to keep all the pieces.

                      - Gilboa
                      Last edited by gilboa; 03 February 2020, 11:39 AM.
                      oVirt-HV1: Intel S2600C0, 2xE5-2658V2, 128GB, 8x2TB, 4x480GB SSD, GTX1080 (to-VM), Dell U3219Q, U2415, U2412M.
                      oVirt-HV2: Intel S2400GP2, 2xE5-2448L, 120GB, 8x2TB, 4x480GB SSD, GTX730 (to-VM).
                      oVirt-HV3: Gigabyte B85M-HD3, E3-1245V3, 32GB, 4x1TB, 2x480GB SSD, GTX980 (to-VM).
                      Devel-2: Asus H110M-K, i5-6500, 16GB, 3x1TB + 128GB-SSD, F33.

