**lucasbekker** · 17 May 2018, 10:32 AM

I would really like to see the same benchmarks on a dual-socket Skylake system!

**pegasus** · 17 May 2018, 10:57 AM

Is it still true with recent gcc versions that autovectorization is only applied at -O3 and not at -O2? I know this is true for gcc 6, for example. And it produces interesting results on avx512... some applications show major improvements, others major slowdowns (due to thermal throttling). There seems to be a sweet spot in how far apart (time-wise) you can issue avx512 instructions to get the most out of them, but since this timing depends on the cooling system's performance, it is not really deterministic.
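For reference, a minimal sketch of the kind of loop in question; `saxpy` is just an illustrative name, not code from the article. Assuming GCC 8, `gcc -O2 -fopt-info-vec -c saxpy.c` should report no vectorized loops, while `-O3` (or `-O2 -ftree-vectorize`) should report this one, and adding `-march=native` on an AVX-512 Skylake lets GCC use wider vectors. The exact diagnostics depend on the GCC version and target.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]: a textbook candidate for GCC's loop vectorizer.
 * Assumption: GCC 8, where plain -O2 leaves this loop scalar and -O3
 * (or -O2 -ftree-vectorize) allows SIMD code. `restrict` tells the
 * compiler x and y do not overlap, so no runtime alias check is needed. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```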

**lucasbekker** · 17 May 2018, 11:02 AM

Originally posted by pegasus:

Is it still true with recent gcc versions that autovectorization is only applied at -O3 and not at -O2? I know this is true for gcc 6, for example. And it produces interesting results on avx512... some applications show major improvements, others major slowdowns (due to thermal throttling). There seems to be a sweet spot in how far apart (time-wise) you can issue avx512 instructions to get the most out of them, but since this timing depends on the cooling system's performance, it is not really deterministic.

I don't understand why 1 avx512 instruction should cost more energy than 2 avx2 instructions...

**willmore** · 17 May 2018, 12:20 PM

Nice article, Michael. In the previous gcc article there was an odd regression on the Coffee Lake part; any chance you could look into that like you did with this chip? Thank you!

**tillschaefer** · 17 May 2018, 12:25 PM

JavaScript is needed to view these results.
... that worked before.

**Michael** · 17 May 2018, 12:42 PM

Originally posted by tillschaefer:

JavaScript is needed to view these results.
... that worked before.

Moving forward, graph viewing will most likely require JavaScript for non-Premium members, or there will just be an ASCII/text-based graph for non-JS users.

**carewolf** · 17 May 2018, 12:58 PM

Originally posted by pegasus:

Is it still true with recent gcc versions that autovectorization is only applied at -O3 and not at -O2? I know this is true for gcc 6, for example. And it produces interesting results on avx512... some applications show major improvements, others major slowdowns (due to thermal throttling). There seems to be a sweet spot in how far apart (time-wise) you can issue avx512 instructions to get the most out of them, but since this timing depends on the cooling system's performance, it is not really deterministic.

That is still the case. Even basic-block vectorization is not done at -O2, even though it produces smaller binaries.
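Basic-block (SLP) vectorization targets straight-line code rather than loops. A minimal sketch, with hypothetical names `vec4` and `add4`: assuming GCC 8, compiling it with `gcc -O2 -ftree-slp-vectorize -fopt-info-vec -c` versus plain `-O2` should show whether the four scalar adds are merged into vector adds. The binary-size claim is not checked here; it will depend on the target and the surrounding code.

```c
/* Four independent, adjacent double additions with no loop around them.
 * This is the pattern GCC's basic-block (SLP) vectorizer looks for. It is
 * controlled by -ftree-slp-vectorize, which GCC 8 enables at -O3 but not
 * at -O2. `vec4` and `add4` are illustrative names only. */
typedef struct {
    double v[4];
} vec4;

vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    r.v[0] = a.v[0] + b.v[0];
    r.v[1] = a.v[1] + b.v[1];
    r.v[2] = a.v[2] + b.v[2];
    r.v[3] = a.v[3] + b.v[3];
    return r;
}
```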

**sdack** · 17 May 2018, 02:05 PM

The timed PHP compilation looks wrong. The value for gcc 8.1 with -O3 -march=native is quite a bit off compared to plain -O3. The -march switch should not change overall compiler performance by nearly 4%.

**pegasus** · 17 May 2018, 04:26 PM

Originally posted by lucasbekker:

I don't understand why 1 avx512 instruction should cost more energy than 2 avx2 instructions...

It's not the computation itself; data movement is what costs the most energy. Compared to it, computation is essentially free. With avx512 you move more data in a shorter timespan through a physically smaller space, so you generate more heat in a smaller area and therefore heat up much more.
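A back-of-the-envelope version of this argument, assuming (as the post does) that energy is dominated by a fixed cost $E_b$ per byte moved, and that one 512-bit instruction completes in roughly the same time $t$ as one 256-bit instruction. Both symbols are illustrative, not measured. Per operand, one 512-bit instruction touches as many bytes as two 256-bit ones, so the energy is about the same, but it is spent in half the time:

$$
1 \times 64\,\text{B} = 2 \times 32\,\text{B}
\quad\Rightarrow\quad
E_{\text{AVX-512}} \approx E_{2\times\text{AVX2}} \approx 64\,E_b
$$

$$
P_{\text{AVX-512}} \approx \frac{64\,E_b}{t} = 2 \times \frac{64\,E_b}{2t} \approx 2\,P_{2\times\text{AVX2}}
$$

Under these assumptions the single wide instruction does not cost more energy; it roughly doubles the power drawn over that interval, which is what drives the temperature up.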

A Closer Look At The GCC 8 Compiler Performance On Intel Skylake