Announcement

**FireBurn** · 13 September 2017, 05:15 AM

I still don't get why distros don't create separate packages for each CPU generation and let the package manager fetch the best one for your hardware
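To make the idea concrete, here is a minimal sketch of how a package manager might choose among per-generation builds. Everything in it is an assumption for illustration: the variant names, the `VARIANTS` table and the `pick_variant` helper are hypothetical, and no existing package manager is claimed to work this way. The flag names follow the spelling Linux uses in `/proc/cpuinfo`.

```python
# Hypothetical variant table: (variant name, CPU flags that build requires).
# Ordered from most to least optimized; the names are illustrative only.
VARIANTS = [
    ("avx2-fma", {"avx2", "fma"}),
    ("sse4", {"sse4_1", "sse4_2"}),
    ("generic", set()),  # baseline x86_64 build, installable everywhere
]


def pick_variant(cpu_flags, variants=VARIANTS):
    """Return the first variant whose required flags are all present."""
    flags = set(cpu_flags)
    for name, required in variants:
        if required <= flags:  # subset test: every required flag is available
            return name
    raise LookupError("no installable variant")


if __name__ == "__main__":
    print(pick_variant({"sse2", "sse4_1", "sse4_2", "avx", "avx2", "fma"}))
    print(pick_variant({"sse2", "sse4_1", "sse4_2"}))
    print(pick_variant({"sse2"}))
```

The selection logic itself is trivial; the cost is in building, testing and mirroring one binary per row of the table.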

**b8e5n** · 13 September 2017, 05:25 AM

Originally posted by FireBurn

I still don't get why distros don't create separate packages for each CPU generation and let the package manager fetch the best one for your hardware

I agree, though it would make the storage requirements explode... I would be more in favor of client-side compilation to optimise for the user's CPU generation and usable flags.

**sdack** · 13 September 2017, 05:26 AM

I'd like to have seen how Debian would have compared here, seeing how well it did last time around.

The distros need to wake up and pick up the pace, seeing how Intel with its Clear Linux is dominating in performance.

**chithanh** · 13 September 2017, 06:18 AM

Originally posted by geearf

Like I don't think the kernel would change a lot based on grayski's benchmarks, but maybe glibc as you mentioned or something else...

The average case will see a very small but measurable benefit. But if your code starts spending major time in kernel space (e.g. with the recently mainlined kernel-based TLS), this may be different. Hence I wrote that the benefit of each optimization depends squarely on the individual use case.

Originally posted by FireBurn

I still don't get why distros don't create separate packages for each CPU generation and let the package manager fetch the best one for your hardware

Distros are very reluctant to provide multiple packages for the same thing. The number of combinations where something can potentially go wrong will explode. If multiple code paths, function multi-versioning, etc. can provide most of the benefits, then you will see that instead.
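The "multiple code paths" approach can be sketched as runtime dispatch: one package ships several implementations and picks one when it starts. The sketch below is written in Python only to show the shape of the idea. Real function multi-versioning happens in the compiler and dynamic linker, and the names here (`dot_generic`, `dot_tuned`, `CANDIDATES`, `resolve`) are hypothetical.

```python
def dot_generic(a, b):
    """Baseline code path that works on every CPU."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_tuned(a, b):
    """Stand-in for a code path tuned for newer CPUs."""
    return sum(x * y for x, y in zip(a, b))


# (required CPU flags, implementation), most specific first.
CANDIDATES = [
    ({"avx2", "fma"}, dot_tuned),
    (set(), dot_generic),
]


def resolve(cpu_flags, candidates=CANDIDATES):
    """Pick the first implementation whose required flags are all present."""
    for required, impl in candidates:
        if required <= cpu_flags:
            return impl
    raise LookupError("no usable implementation")


if __name__ == "__main__":
    # Resolved once, then called like any ordinary function.
    dot = resolve({"sse2", "avx", "avx2", "fma"})
    print(dot.__name__, dot([1.0, 2.0], [3.0, 4.0]))
```

Only one package exists, so the combinations that have to be tested stay inside the library instead of multiplying across the archive.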

Originally posted by sdack

The distros need to wake up and pick up the pace, seeing how Intel with its Clear Linux is dominating in performance.

My guess is that most of Clear Linux's performance advantage comes from dropping support for older hardware, plus a couple of performance-related patches and customizations, e.g. in glibc. The former is not really an option for most distros (dropping support for all hardware prior to Haswell/Westmere? The most recent Debian release just got rid of everything prior to Pentium Pro...). The newer glibc version will find its way into distros in due time.

**sykobee** · 13 September 2017, 08:20 AM

Apple's iOS App Store stores an intermediate representation of applications and compiles it to native code (presumably with caching) for the target architecture upon download. That's a lot of applications, although a more limited set of target architectures. Possibly one day, when LLVM can compile everything in the Linux userspace, a Linux distro will provide the same sort of functionality for users.

It seems to me that an enthusiast Linux user with a recent CPU should definitely be looking at Clear Linux, or even Gentoo, to get the best performance, and there can be significant gains.

**sdack** · 13 September 2017, 09:06 AM

Originally posted by chithanh

My guess is that most of Clear Linux's performance advantage comes from dropping support for older hardware, plus a couple of performance-related patches and customizations, e.g. in glibc. The former is not really an option for most distros (dropping support for all hardware prior to Haswell/Westmere? The most recent Debian release just got rid of everything prior to Pentium Pro...). The newer glibc version will find its way into distros in due time.

I can understand the reasons why this is, but it doesn't explain why it has to stay like this.

Take Debian for example. It did actually perform quite well in the last comparison and was in second place behind Clear Linux. Then take a look at the work that's been done by the Debian project. They maintain a stable, a testing and an unstable distribution, provide ports for i386 (32-bit), amd64 (64-bit), 3 flavours of ARM (arm64, armel, armhf), 3 flavours of MIPS (mips, mipsel, mips64el), PowerPC and System Z. Not to mention further ports that are in progress such as Sparc and Motorola 68k.

I don't think it's impossible at all. It's more about picking up the pace and getting started than having to surpass limitations.

**jrch2k8** · 13 September 2017, 09:09 AM

Originally posted by chithanh

My guess is that most of Clear Linux's performance advantage comes from dropping support for older hardware, plus a couple of performance-related patches and customizations, e.g. in glibc. The former is not really an option for most distros (dropping support for all hardware prior to Haswell/Westmere? The most recent Debian release just got rid of everything prior to Pentium Pro...). The newer glibc version will find its way into distros in due time.

Sure, the kernel may help, but the biggest chunk of performance probably comes from:

1.) Latest GCC, latest glibc.
2.) Careful fine-tuning of the compiler flags (and patches in some cases) to use the newer features in GCC like LTO, Graphite, auto-vectorization, loop optimizations (like -fsplit-loops), etc.
3.) AutoFDO for profiling <-- big perf gains, but a lot harder to build with for mere mortals, which is why it is not that popular on most distros.

**arjan_intel** · 13 September 2017, 09:13 AM

Originally posted by geearf

Do you guys have an idea of which packages matter the most in terms of compilation optimizations?
Maybe it's the whole OS, but I'm wondering if one could not recompile certain key packages to get about the same performance gain without having to go all source like gentoo.

It's mostly the math-heavy things (libm from glibc, the BLAS library of your choice, etc.) where there is a real split in performance between generations. The AVX2+FMA line is a split where performance fundamentally changes.
(On, say, Skylake CPUs like this Core i9, a floating-point add takes 4 cycles, a floating-point mul takes 4 cycles, but an FMA (fused multiply-add) ALSO takes 4 cycles. This means that code that does lots of multiplies and adds on FP can get a significant benefit.)
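A back-of-the-envelope version of that arithmetic, using only the latencies quoted above. The model is an assumption: it counts a chain of dependent multiply-add steps (as in Horner's rule for evaluating a polynomial) and ignores throughput, pipelining across independent chains and memory access, so it illustrates the latency argument and is not a benchmark. The function name `horner_chain_cycles` is made up for this sketch.

```python
# Latencies in cycles, as quoted in the post above for Skylake.
ADD_LATENCY = 4
MUL_LATENCY = 4
FMA_LATENCY = 4


def horner_chain_cycles(degree, use_fma):
    """Latency of `degree` dependent multiply-add steps, one after another."""
    per_step = FMA_LATENCY if use_fma else MUL_LATENCY + ADD_LATENCY
    return degree * per_step


if __name__ == "__main__":
    for degree in (4, 8, 16):
        separate = horner_chain_cycles(degree, use_fma=False)
        fused = horner_chain_cycles(degree, use_fma=True)
        print(f"degree {degree}: mul+add {separate} cycles, FMA {fused} cycles")
```

Under this model the separate mul+add chain always costs twice as many cycles as the FMA chain, which is the split described above.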

**jrch2k8** · 13 September 2017, 09:25 AM

Originally posted by sdack

I can understand the reasons for why this is, but it doesn't explain why it has to stay like this.

Take Debian for example. It did actually perform quite well in the last comparison and was in second place behind Clear Linux. Then take a look at the work that's been done by the Debian project. They maintain a stable, a testing and an unstable distribution, provide ports for i386 (32-bit), amd64 (64-bit), 3 flavours of ARM (arm64, armel, armhf), 3 flavours of MIPS (mips, mipsel, mips64el), PowerPC and System Z. Not to mention further ports that are in progress such as Sparc and Motorola 68k.

I don't think it's impossible at all. It's more about picking up the pace and getting started than having to surpass limitations.

This has nothing to do with hardware support, since x86_64 defaults to SSE2-class hardware as a bare minimum and the kernel can switch algorithm implementations at runtime when needed (if your CPU supports AVX2, the kernel will load the modules that use AVX2-optimized SIMD; kernels are smarter than user space code).
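User space can inspect the same feature flags the kernel reports. A minimal sketch, assuming an x86 Linux system where `/proc/cpuinfo` contains a `flags` line (other architectures label it differently, in which case this returns an empty set). The helper names `parse_cpu_flags` and `read_cpu_flags` are hypothetical.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read flags from the running system; empty set if the file is unavailable."""
    try:
        with open(path) as handle:
            return parse_cpu_flags(handle.read())
    except OSError:
        return set()


if __name__ == "__main__":
    sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
    flags = parse_cpu_flags(sample)
    print("avx2" in flags, "avx512f" in flags)
```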

The main difference is the compilation process and how much runtime debug apparatus the kernel has enabled (frame pointers, kprobes, etc.). I think Fedora enables lots of debug features in its default kernel, but I'm not sure.

The cases where you see those massive spikes between distros are mostly due to loops that are too complex for a given compiler version to optimize, especially at default settings. If a newer version of the compiler with certain extra flags finds a way to break those loops up and optimize them (vectorize them, parallelize them, etc.), the gain will be huge compared to the unoptimized one, which is basically the worst-case result.

**jrch2k8** · 13 September 2017, 09:32 AM

Originally posted by arjan_intel

It's mostly the math-heavy things (libm from glibc, the BLAS library of your choice, etc.) where there is a real split in performance between generations. The AVX2+FMA line is a split where performance fundamentally changes.
(On, say, Skylake CPUs like this Core i9, a floating-point add takes 4 cycles, a floating-point mul takes 4 cycles, but an FMA (fused multiply-add) ALSO takes 4 cycles. This means that code that does lots of multiplies and adds on FP can get a significant benefit.)

Agreed, but I have also seen some beastly improvements on Sandy Bridge and Haswell. I guess it is a combination of all the factors at the end of the day (better use of modern hardware features, compilation optimizations, leaner kernel build, patches, etc.).

Btw, have you guys tested how close the latest GCC (with your current compilation optimizations) gets to ICC (where possible)?

Core i9 7900X vs. Threadripper 1950X On Ubuntu 17.10, Antergos, Clear Linux
