Announcement

**kloczek** · 07 June 2020, 10:07 AM

Originally posted by Volta View Post

Of course:

https://www.kernel.org/doc/html/latest/x86/x86_64/mm.html

https://access.redhat.com/articles/rhel-limits

Thank you for sending me to /dev/tree with generic RH documentation where are described theoretical limits.
On those pages there is no even single word about that hardware which you've mention.

Do you know that Intel CPUs limit is 2TB of RAM per CPU socket? How many hHW did you saw with 16 Intel CPUs with HT crossbars?
I'm not even asking for that hypothetical box with 4k cores which I know that it is single machine for many partitioned systems with extearmly low latency interconnects.
With 2TB limit you can easily calculate how many CPUs will be necessary to have to have +32TB RAM.
Look on biggest Intel Xeon CPU spec https://en.wikichip.org/wiki/intel/xeon_platinum/9282
On that page you can find "Max Mem 2 TiB"

So again: could you please point on any documentation about that HW with 4096 CPU cores?

**Volta** · 07 June 2020, 10:46 AM

Originally posted by kloczek View Post

Thank you for sending me to /dev/tree with generic RH documentation where are described theoretical limits.
On those pages there is no even single word about that hardware which you've mention.

Do you know that Intel CPUs limit is 2TB of RAM per CPU socket? How many hHW did you saw with 16 Intel CPUs with HT crossbars?
I'm not even asking for that hypothetical box with 4k cores which I know that it is single machine for many partitioned systems with extearmly low latency interconnects.
With 2TB limit you can easily calculate how many CPUs will be necessary to have to have +32TB RAM.
Look on biggest Intel Xeon CPU spec https://en.wikichip.org/wiki/intel/xeon_platinum/9282
On that page you can find "Max Mem 2 TiB"

So again: could you please point on any documentation about that HW with 4096 CPU cores?

No problem, but are we talking about hardware limits or limits Operating System can handle? You still didn't provide information how much memory does Solaris support. :> For example Xeon® Platinum 8280L supports 4.5 TB RAM.

https://www.sap.com/dmc/benchmark/2020/Cert20008.pdf

Linux scale-up server running 896 threads and 12 TB RAM. It's a single system FYI.

**Firnefex** · 07 June 2020, 01:57 PM

Aside from some features ZFS should have that BtrFS doesn't (which in practice close to noone cares about), every performance comparison I've seen so far showed that ZFS is way slower. Mostly around 60% speed of BtrFS, but with its own encryption sometimes even below 30%.
BtrFS may have its flaws, but its GPL and competes very well compared to that bloated multi billion dollar FS, that as NTFS can't be dropped because they already invested so much.

I am very curious about bcacheFS though, its relatively new and should be feature rich like ZFS, but faster than BtrFS. There isn't much info yet other than facing problems with mount times.

**kloczek** · 07 June 2020, 06:47 PM

Originally posted by Volta View Post

No problem, but are we talking about hardware limits or limits Operating System can handle? You still didn't provide information how much memory does Solaris support. :> For example Xeon® Platinum 8280L supports 4.5 TB RAM.

https://www.sap.com/dmc/benchmark/2020/Cert20008.pdf

Linux scale-up server running 896 threads and 12 TB RAM. It's a single system FYI.

So between which one lines in this PDF is something about Linux or using it as SIS system?
On https://support.hpe.com/hpesc/public...a00050296en_us is that HW partitioning documentation which clearly suggests that this hardware usually is used to host multiple OSess.

To just let you know.
On Solaris Sun and then Oracle spent a lot of time only to redesign kernel locking infrastructure to allow work under single kernel on HW with hundreds or thousands CPUs. On Linux nothing like this never has been done and that is indirect prove that Linux still cannot be uses to handle HW on number of CPUs scale like it if practically ATM possible with Solaris.

Linux still has problem when is used without swap because it is not possible to disable dirty pages scanner. In worse case scenario such scanner eats single CPU to constantly scan TLB generating on that a lot of locks on kernel side.

Other thing. Try to think about scenario like rebooting system with tenths of terabytes of RAM. After reboot all cached data will gone and HW will literally burn all IO lines to feel RAM with cached data. On Solaris it is possible to reboot system on new kernel with preserve in memory state of all HW components and kernel or applications memory regions. With that reboot may take few seconds and restarted in kernel apce ZFS ARC or application will instantly able to start reconnection all those preserved data.
On Linux no one even have been thinking what needs to be done on kernel side to allow implement something like this.

**Volta** · 08 June 2020, 04:18 AM

Originally posted by kloczek View Post

So between which one lines in this PDF is something about Linux or using it as SIS system?
On https://support.hpe.com/hpesc/public...a00050296en_us is that HW partitioning documentation which clearly suggests that this hardware usually is used to host multiple OSess.

There's Operating System line which says it's SUSE Linux. In this case it's scale-up bare metal configuration. Here's another one with 18 TB RAM:

https://www.sap.com/dmc/benchmark/2019/Cert19054.pdf

To just let you know.
On Solaris Sun and then Oracle spent a lot of time only to redesign kernel locking infrastructure to allow work under single kernel on HW with hundreds or thousands CPUs. On Linux nothing like this never has been done and that is indirect prove that Linux still cannot be uses to handle HW on number of CPUs scale like it if practically ATM possible with Solaris.

It's an old fairy tell. Linux probably has the best SMP implementation built around RCU which existed before Linux. Furthermore, all of the scale-out supercomputers that are seen by the kernel as a single unit demand scalability abilities beyond Solaris ever dreamed about. There's no direct comparison though, so it's hard to tell after all. Here comes mentioned earlier super high memory levels, 4096 CPUs etc. SGI Ultraviolet allowed direct access to 4PB RAM and up to 4096 threads in a single x86-64 system (as what OS kernel saw). What I feel sad about is SPARC. It seems it's great CPU, but once too tied to SUN and currently not enough promoted by Oracle. I hope they won't ruin it.

Linux still has problem when is used without swap because it is not possible to disable dirty pages scanner. In worse case scenario such scanner eats single CPU to constantly scan TLB generating on that a lot of locks on kernel side.

I'm not sure if that's still the case. However, Linux now has io_uring that beats everything else by a large margin.

Other thing. Try to think about scenario like rebooting system with tenths of terabytes of RAM. After reboot all cached data will gone and HW will literally burn all IO lines to feel RAM with cached data. On Solaris it is possible to reboot system on new kernel with preserve in memory state of all HW components and kernel or applications memory regions. With that reboot may take few seconds and restarted in kernel apce ZFS ARC or application will instantly able to start reconnection all those preserved data.
On Linux no one even have been thinking what needs to be done on kernel side to allow implement something like this.

I'm not sure about this one, but Linux has live kernel patching which may sometimes prevent from reboots. Solaris probably doesn't have this. When comes to caching I think there's something similar on Linux:

Rapid restartRapid restart is also called the warm cache effect. For example, a file server has none of the file contents in memory after starting. As clients connect and read or write data, that data is cached in the page cache. Eventually, the cache contains mostly hot data. After a reboot, the system must start the process again on traditional storage.

NVDIMM enables an application to keep the warm cache across reboots if the application is designed properly. In this example, there would be no page cache involved: the application would cache data directly in the persistent memory.

Chapter 7. Using NVDIMM persistent memory storage Red Hat Enterprise Linux 8 | Red Hat Customer Portal

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/using-nvdimm-persistent-memory-storage_managing-storage-devices

Access Red Hat’s knowledge, guidance, and support through your subscription.

**kloczek** · 08 June 2020, 06:41 AM

It's an old fairy tell. Linux probably has the best SMP implementation built around RCU which existed before Linux. Furthermore, all of the scale-out supercomputers that are seen by the kernel as a single unit demand scalability abilities beyond Solaris ever dreamed about. There's no direct comparison though, so it's hard to tell after all. Here comes mentioned earlier super high memory levels, 4096 CPUs etc. SGI Ultraviolet allowed direct access to 4PB RAM and up to 4096 threads in a single x86-64 system (as what OS kernel saw). What I feel sad about is SPARC. It seems it's great CPU, but once too tied to SUN and currently not enough promoted by Oracle. I hope they won't ruin it.

SGI Ultraviolet it is partitioned HW with exactly that extremely low latency interconnect between partitions.
So again you are pointing on non-SIS example.

I'm not sure if that's still the case. However, Linux now has io_uring that beats everything else by a large margin.

What has asynchronous IO with paged memory management? Nothing ..

I'm not sure about this one, but Linux has live kernel patching which may sometimes prevent from reboots. Solaris probably doesn't have this. When comes to caching I think there's something similar on Linux:

Live kernel live patching it is not the same.
BTW: Solaris had this functionality long time before it was possible to have that on Linux.

https://access.redhat.com/documentat...torage-devices

nvdimm it is completely different story. I'm talking about regular RAM. To preserve RAM content you need to carefully shutdown system and preserve RAM content before new kernel will be downloaded and booted.

**S.Pam** · 08 June 2020, 08:47 AM

kexec(8): directly boot into new kernel - Linux man page

https://linux.die.net/man/8/kexec

kexec is a system call that enables you to load and boot into another kernel from the currently running kernel. kexec performs the function of the boot ...

**kloczek** · 08 June 2020, 04:41 PM

Originally posted by Spam View Post

https://linux.die.net/man/8/kexec

Did you ever try to use kexec?
Probably not because kexec leaves allocated all data and code segments used by old kernel.
In other words each restart over kexec gives you less and less memory.
Did you notice that none of the current Linux distribution is OOTB kexec ready?

Linux simple works on to small HW boxes and most of the Linux kernel developers do not bother about reboot time .. that is that main cause (not to mention that on developent most of the kernel developers are withing with VMs.
If you will have one full rack box like M8 you will find that only checking all HW components on POST stage takes +30 min.
Try to reboot that 8 CPU sockets x86 HW and you will find that reboot because POST takes not to much less ..
Try to think that 17 years ago when Sun Fire 10K was introduced that box had up to 32 CPU sockets. This is why on development of the Solaris 10 Sun added fast reboot.

Solaris in case of reboot is completely different because it maintains all data with HW state in separated region and it can pass that to new kernel as parameter.
On top of that Solaris "reboot -f" (-f -> fast .. which is default) if system is rebooted on the same kernel it shutdown only all processes and restart everything with restarted init which is b*dy fast because it does not go over POST/boot loader.
Solaris has fast reboot more than 10 years and as because it is default reboot command behaviour you need to specify "reboot -p" to force on reboot to go over POST/bootloader.
IIRC the same functionality has now FreeBSD as well. Linux still has no something like this because it will require rewrite many parts of the kernel code to keep some data in continuous region which could be preserved between loading different kernels.

.. +10-17 year and only in that small are. That is actual time window length which Linux kernel still stays behind Solaris kernel.
There are many other technologies like FMA (Fault Administration Management), PSH (Predictive Self Healing) or RBAC and many more like real trusted system (you can boot in trusted mode and perform all binaries signs versification). On Solaris all distribution binaries are signed (regular Solaris and all OpenSolaris like OmniOS as well) and those signs are stored in ELF sections. If on exec()/execwe() some binary will have sign which key is not in in kernel key ring that binary will be not allowed to start. On Linux you can have something like this but signing is stored on extended attributes which means that you need to enable ext attr on your rootfs and than before reboot you must sign all binaries in those ext attrs because NONE of the Linux distributions can even store in packages like deb or rpm ext attrs. Simple Linux chose wrong method of storing those signs and now only on Android OOTB you have full trusted boot.

**onlyLinuxLuvUBack** · 08 June 2020, 07:33 PM

Originally posted by kloczek View Post

Did you ever try to use kexec?
Probably not because kexec leaves allocated all data and code segments used by old kernel.
In other words each restart over kexec gives you less and less memory.
Did you notice that none of the current Linux distribution is OOTB kexec ready?

Linux simple works on to small HW boxes and most of the Linux kernel developers do not bother about reboot time .. that is that main cause (not to mention that on developent most of the kernel developers are withing with VMs.
If you will have one full rack box like M8 you will find that only checking all HW components on POST stage takes +30 min.
Try to reboot that 8 CPU sockets x86 HW and you will find that reboot because POST takes not to much less ..
Try to think that 17 years ago when Sun Fire 10K was introduced that box had up to 32 CPU sockets. This is why on development of the Solaris 10 Sun added fast reboot.

Solaris in case of reboot is completely different because it maintains all data with HW state in separated region and it can pass that to new kernel as parameter.
On top of that Solaris "reboot -f" (-f -> fast .. which is default) if system is rebooted on the same kernel it shutdown only all processes and restart everything with restarted init which is b*dy fast because it does not go over POST/boot loader.
Solaris has fast reboot more than 10 years and as because it is default reboot command behaviour you need to specify "reboot -p" to force on reboot to go over POST/bootloader.
IIRC the same functionality has now FreeBSD as well. Linux still has no something like this because it will require rewrite many parts of the kernel code to keep some data in continuous region which could be preserved between loading different kernels.

.. +10-17 year and only in that small are. That is actual time window length which Linux kernel still stays behind Solaris kernel.
There are many other technologies like FMA (Fault Administration Management), PSH (Predictive Self Healing) or RBAC and many more like real trusted system (you can boot in trusted mode and perform all binaries signs versification). On Solaris all distribution binaries are signed (regular Solaris and all OpenSolaris like OmniOS as well) and those signs are stored in ELF sections. If on exec()/execwe() some binary will have sign which key is not in in kernel key ring that binary will be not allowed to start. On Linux you can have something like this but signing is stored on extended attributes which means that you need to enable ext attr on your rootfs and than before reboot you must sign all binaries in those ext attrs because NONE of the Linux distributions can even store in packages like deb or rpm ext attrs. Simple Linux chose wrong method of storing those signs and now only on Android OOTB you have full trusted boot.

Hello kloczek,

do you have a test procedure for how to verify zfs is better than ext4 ?

I would like to try it out.
Thanks !

**kloczek** · 09 June 2020, 03:09 AM

Originally posted by onlyLinuxLuvUBack View Post

Hello kloczek,

do you have a test procedure for how to verify zfs is better than ext4 ?

I would like to try it out.
Thanks !

Just uese any avalaible benchmarks which is possible to use on Linux.

Announcement

EXT4 Gets A Nice Batch Of Fixes For Linux 5.8

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment