Arch Linux CachyOS Benchmarks Of x86-64-v3 & x86-64-v4 Repositories

  • #21
    Originally posted by sophisticles View Post
    Many years ago see any performance fain, in
    New print on demand coffee cup:

    Performance Fain -sophisticles //phoronix.com


    Also, a part 2 article could be the desktop tests people asked for, plus an unoptimized Python pandas and DuckDB data load/query/save-out run, and a shufflecake filesystem benchmark too.




    Comment


    • #22
      Originally posted by sophisticles View Post
      Many years ago, on another forum, I floated the idea that Linux distros should be compiled with SSE optimizations enabled, or even better, have the whole thing coded from top to bottom to use SSE.
      SIMD is only good at some tasks, same as with GPUs. However, there are a lot of algorithms that don't fit into this category.
      Furthermore, if you "code from top to bottom to use SSE", you limit yourself to x86.

      There are a limited number of SIMD units on a processor ... you quickly saturate the SIMD units and kill any performance gains that you would otherwise gain.
      SIMD use is only beneficial when it is used sparingly
      This is only true for rather special architectures like AMD's Bulldozer cores or ARM power/area-saving cores, where two integer units share a single SIMD-capable FP core (or, at best, can each use half of the vector width). However, even on those architectures, your top-to-bottom-coded-SSE application won't consist of 100% SSE instructions - there are simply many things where SIMD doesn't make sense. So except for workloads with really high-density SIMD code and no memory stalls (like multicore image processing), SIMD-unit over-utilization is not a big problem for most workloads.

      Comment


      • #23
        Originally posted by avis View Post
        v3/v4 advantage over plain x86-64 is not so clear cut. There are far too many regressions and it looks like it would be best if it were per application/library, not for everything.
        I think so too; apart from some serious gains there are also some heavy regressions, and the average looks like 1 or 2% going from the baseline to v4. And that is on a Threadripper with a big cache and high memory bandwidth.
        You would have to do a case-by-case test of all versions on your own system, for each package, to really know whether you have gained something or lost performance.

        Will users be able to select std/v3/v4 per package with an easy pacman command? The reason I don't use Gentoo is exactly that compiling everything yourself is far too much work for those few percent of improvement, which you would have to benchmark at least once anyway.

        If it were, say, a 15% improvement with no regressions, I would instantly switch.

        Comment


        • #24
          Originally posted by avis View Post
          v3/v4 advantage over plain x86-64 is not so clear cut. There are far too many regressions and it looks like it would be best if it were per application/library, not for everything.
          IIRC, CachyOS also builds packages with -O3 optimizations. Changing their default -O3 to -O2 is one of my first changes when using CachyOS. -O2 vs. -O3 could just as well be the culprit for any differences as v3 vs. v4.

          Comment


          • #25
            It is so funny how AMD fanboys claimed for years that AVX512 offers no benefit, when it is clear from the benchmarks that it does....

            Comment


            • #26
              Originally posted by TemplarGR View Post
              It is so funny how AMD fanboys claimed for years that AVX512 offers no benefit, when it is clear from the benchmarks that it does....
              You're thinking of Linus Torvalds fans.

              Comment


              • #27
                Very interesting article. It is worth noting that while some results diverged slightly, most were about the same regardless of microarchitecture level. It was only PHP that had that huge uplift at v3 and then a modest further gain at v4.

                Comment


                • #28
                  Originally posted by TemplarGR View Post
                  It is so funny how AMD fanboys claimed for years that AVX512 offers no benefit, when it is clear from the benchmarks that it does....
                  It was more in the context of Intel AVX512 on N10 CPUs where it caused reduced boost frequencies or even throttling in some cases, so benefits were drastically reduced. I don't remember anyone on this forum arguing that AVX512 has no benefits in general.

                  Comment


                  • #29
                    Originally posted by TemplarGR View Post
                    It is so funny how AMD fanboys claimed for years that AVX512 offers no benefit, when it is clear from the benchmarks that it does....
                    It was rather that AVX-512 offers significant improvements, but only in very limited use cases (and this is exactly what Linus said).
                    The question is: how expensive is it to implement AVX-512? How much die space does it take?
                    The differences are rather minor, save for GNU Radio and PHP Bench, and surprisingly v3 scored some victories over v4.
                    So maybe it's better to use the silicon to fit in more cores instead? Or to enlarge the L2 cache?
                    That was the point, and looking at the results it is still very valid. Silicon area is limited, so it's better to optimize a chip for what it does 99% of the time, not 1% of the time.

                    But if AVX-512 only takes a minimal amount of die area, there's no reason not to have it, even if just for rare use cases.
                    Even if benchmarks show it usually offers rather negligible differences vs. SSE2 from the year 2000.
                    Do not underestimate SSE2. Together with SSE it provides over 200 SIMD instructions, of course limited to 128-bit vector length.
                    Last edited by sobrus; 01 March 2024, 04:02 AM.

                    Comment


                    • #30
                      Originally posted by sobrus View Post

                      It was rather that AVX-512 offers significant improvements, but only in very limited use cases (and this is exactly what Linus said).
                      The question is: how expensive is it to implement AVX-512? How much die space does it take?
                      The differences are rather minor, save for GNU Radio and PHP Bench, and surprisingly v3 scored some victories over v4.
                      So maybe it's better to use the silicon to fit in more cores instead? Or to enlarge the L2 cache?
                      That was the point, and looking at the results it is still very valid. Silicon area is limited, so it's better to optimize a chip for what it does 99% of the time, not 1% of the time.

                      But if AVX-512 only takes a minimal amount of die area, there's no reason not to have it, even if just for rare use cases.
                      Even if benchmarks show it usually offers rather negligible differences vs. SSE2 from the year 2000.
                      Do not underestimate SSE2. Together with SSE it provides over 200 SIMD instructions, of course limited to 128-bit vector length.
                      For a proper comparison you have to use the same hardware (incl. CPU) with AVX-512 enabled and then disabled (via BIOS settings, kernel parameters, virtual machine settings, or some other way).
                      Some software gains a big advantage from AVX-512.
                      AMD Zen 4 is the first desktop architecture with a rather good AVX-512 implementation, and it benefits from it.
                      In the discussed review the compiler uses the Skylake-X subset of AVX-512, but newer CPUs have a much broader instruction set: https://en.wikipedia.org/wiki/Advanc...s_with_AVX-512
                      After SSE2, Intel implemented SSE3, SSSE3, SSE4.1, SSE4.2, and then AVX, FMA, and BMI.

                      Comment
