Core i9 7900X vs. Threadripper 1950X On Ubuntu 17.10, Antergos, Clear Linux


  • #31
    Originally posted by arjan_intel View Post

    It's mostly the math-heavy things (libm from glibc, the BLAS library of your choice, etc.) where there is a real split in performance between generations. The AVX2+FMA line is a split where performance fundamentally changes.
    (On, say, Skylake CPUs like this Core i9, a floating-point add takes 4 cycles, a floating-point mul takes 4 cycles, but an FMA (multiply and add) ALSO takes 4 cycles. This means that code that does lots of multiplies and adds on FP can get a significant benefit.)
    Also, does Clear Linux have a forum or any place for technical questions outside the mailing list? I was thinking of adding support for ZFS, but I'm still new to it and have some questions.
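
    For anyone wondering what "lots of multiplies and adds" looks like in practice, here is a minimal C sketch (the function name and compiler flags are my own illustration, not anything from arjan's post): a plain dot-product loop whose multiply/add pairs GCC can typically contract into FMA instructions when built for an AVX2+FMA target, e.g. gcc -O3 -march=haswell.

    ```c
    #include <stddef.h>

    /* Each iteration does one FP multiply and one FP add. On FMA-capable
     * parts the compiler can contract the pair into a single vfmadd
     * instruction, so the fused operation costs about the same latency
     * as either individual operation alone. */
    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }
    ```

    (Without -ffast-math the summation order cannot be reassociated for vectorization, but the multiply-add pairs can still be fused; that contraction is what gets discussed a few posts down.)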



    • #32
      Originally posted by jrch2k8 View Post

      Also, does Clear Linux have a forum or any place for technical questions outside the mailing list? I was thinking of adding support for ZFS, but I'm still new to it and have some questions.
      The mailing list is the preferred place; no need to be shy, we're all friendly peeps.



      • #33
        Originally posted by jrch2k8 View Post

        Agreed, but I've also seen some big improvements on Sandy Bridge and Haswell; I guess it's a combination of all the factors at the end of the day (better use of modern hardware features, compilation optimizations, leaner kernel build, patches, etc.).

        Btw, have you guys tested how close the latest GCC (with your current compilation optimizations) gets to ICC (where possible)?
        Haswell is the first CPU to have FMA, so at least some of your observation matches.

        I make it a habit not to compare to ICC; it makes for a much easier conversation at the coffee machine at work.



        • #34
          Originally posted by arjan_intel View Post

          The mailing list is the preferred place; no need to be shy, we're all friendly peeps.
          All right, I'll try it later once I have all the questions.



          • #35
            Originally posted by jrch2k8 View Post

            All right, I'll try it later once I have all the questions.
            We're also usually on Freenode IRC (mostly US timezones, though) in #clearlinux.



            • #36
              Given that Debian is very stability-oriented, I don't think they can follow all of those.

              Using contractions like FMA can affect the result of computations, so they are allowed by IEEE 754 and ISO C/C++ only within very strict limits. If you turn them on in GCC (via -ffp-contract=fast), the results of floating-point math effectively become implementation-defined.

              In order to be effective, autovectorization often involves algebraic optimizations (reordering of operations) which run afoul of language standards.
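
              As a concrete illustration of why contraction changes results (a minimal sketch of my own, not something from this thread): fma(a, b, c) rounds once, while a * b + c rounds the product and then the sum, so the two expressions can disagree in the low bits.

              ```c
              #include <math.h>
              #include <stdio.h>

              int main(void)
              {
                  /* Chosen so that a*b is exactly 1 - 2^-60, which is not
                   * representable in double and rounds up to exactly 1.0. */
                  double a = 1.0 + 0x1p-30;
                  double b = 1.0 - 0x1p-30;
                  double c = -1.0;

                  double separate = a * b + c;    /* product rounded, then sum rounded */
                  double fused    = fma(a, b, c); /* one rounding of the exact a*b + c */

                  /* separate prints 0x0p+0, fused prints -0x1p-60 (the exact answer). */
                  printf("separate = %a\nfused    = %a\n", separate, fused);
                  return 0;
              }
              ```

              Build with something like gcc -O2 -ffp-contract=off demo.c -lm to keep the "separate" expression unfused; with -ffp-contract=fast the compiler is free to turn it into an FMA as well, which is exactly the "results depend on the implementation" effect described above.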



              • #37
                Originally posted by chithanh View Post
                Given that Debian is very stability-oriented, I don't think they can follow all of those.

                Using contractions like FMA can affect the result of computations, so they are allowed by IEEE 754 and ISO C/C++ only within very strict limits. If you turn them on in GCC (via -ffp-contract=fast), the results of floating-point math effectively become implementation-defined.

                In order to be effective, autovectorization often involves algebraic optimizations (reordering of operations) which run afoul of language standards.
                This is not actually correct. FMA is still allowed even at -O2; the result is not less accurate than without FMA... it's a little more accurate.



                • #38
                  Originally posted by arjan_intel View Post

                  It's mostly the math-heavy things (libm from glibc, the BLAS library of your choice, etc.) where there is a real split in performance between generations. The AVX2+FMA line is a split where performance fundamentally changes.
                  (On, say, Skylake CPUs like this Core i9, a floating-point add takes 4 cycles, a floating-point mul takes 4 cycles, but an FMA (multiply and add) ALSO takes 4 cycles. This means that code that does lots of multiplies and adds on FP can get a significant benefit.)
                  Interesting, thank you!



                  • #39
                    Originally posted by jrch2k8 View Post
                    This has nothing to do with hardware support, since x86_64 defaults to SSE2+ class hardware at a bare minimum and the kernel can switch implementations of algorithms (when needed) at runtime (if your CPU supports AVX2, the kernel will load all the modules that use SIMD optimized for AVX2; kernels are smarter than user-space code).

                    The main difference is the compilation process and how much runtime debug apparatus the kernel has enabled (frame pointers, kprobes, etc.). I think Fedora enables lots of debug features in its default kernel, but I'm not sure.

                    The cases where you see those huge spikes between distros are mostly due to loops that are too complex to be optimized by a certain compiler version, especially at default settings. This means that if a newer compiler version, with certain extra flags, finds a way to break those loops apart and optimize them (vectorize them, parallelize them, etc.), the gain will be huge compared to the unoptimized build, which is basically the worst-case result.
                    Again, these are explanations of why it is, but not why it should stay this way.
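
                    For what it's worth, the runtime dispatch described in the quote is not kernel-only; user space can do the same trick. Here is a minimal sketch (the function name and target list are just illustrative) using GCC's function multiversioning, where the loader picks the best clone for the running CPU:

                    ```c
                    #include <stdio.h>

                    /* GCC emits one clone per listed target plus a resolver; at load
                     * time an ifunc selects the clone matching the CPU's features.
                     * This is the same "detect once, dispatch everywhere" idea the
                     * kernel uses for its SIMD-optimized routines. */
                    __attribute__((target_clones("avx2", "default")))
                    double vec_sum(const double *v, int n)
                    {
                        double s = 0.0;
                        for (int i = 0; i < n; i++)
                            s += v[i];
                        return s;
                    }

                    int main(void)
                    {
                        double v[] = { 1.0, 2.0, 3.0, 4.0 };
                        printf("sum = %f\n", vec_sum(v, 4));

                        /* Explicit feature checks are also available if you prefer
                         * to branch by hand instead of relying on multiversioning. */
                        printf("AVX2 supported: %d\n", __builtin_cpu_supports("avx2"));
                        return 0;
                    }
                    ```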



                    • #40
                      Originally posted by tiwake View Post
                      (have yet to run Gentoo... I'm not man enough... maybe when I get a Threadripper desktop sometime next year or something)
                      Tried it for a while myself, but I found it cumbersome in practice. It's not unusual to experience breakage due to unenforced dependencies, etc. and then you're left perusing the forums to find a solution. Meanwhile your machine is totally borked because the systemd you just installed is incompatible with the glibc you built yesterday. Ain't nobody got time fo' dat.

                      But at least machines these days are fast enough that compile times are not a major issue. I was running Gentoo on a 667 MHz Alpha 21264 for a while, and sometimes it took *days* to compile all the updates, and by the time it was finished, there was a stack of new updates that needed compiling. The machine spent more time compiling its own updates than doing any productive work, lol.

