Fedora's Firefox To Stick With GCC Over Clang, Beefed Up By LTO/PGO Optimizations


  • #31
    Originally posted by skeevy420 View Post

    Not really... it's essentially the AVX division line (which, oddly, isn't listed in their requirements). I made/suggested the same division line when I said I'd stop Gen 1 x86_64 at Westmere/AES and would start Gen 2 with Sandy/AVX --
    I'd start Gen 2 at Haswell (AVX2/FMA3, BMI1/2, ...). Intel CPUs haven't changed much since then (apart from AVX-512, which has only niche uses at the moment).
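    For context, this is the kind of runtime check a dispatcher would use to split along those generation lines -- a minimal sketch using GCC's real __builtin_cpu_supports builtin (the tier labels are just the ones from this thread):

    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();                   /* initialize GCC's CPU feature model */
        if (__builtin_cpu_supports("avx2"))
            puts("Gen 2+ path: Haswell-class (AVX2/FMA3/BMI2)");
        else if (__builtin_cpu_supports("avx"))
            puts("Gen 2 path: Sandy Bridge-class (AVX)");
        else
            puts("Gen 1 path: pre-AVX (Westmere/AES and older)");
        return 0;
    }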

    Comment


    • #32
      Originally posted by hubicka View Post
      Note that http://hubicka.blogspot.com/2018/12/...lding-and.html has some additional benchmarks beyond Speedometer. One remaining issue is the fact that Skia (a graphics library used to render some content) needs to be ported to GCC. Currently it has hand-optimized vector rendering code only for Clang. I plan to look into that after finishing some GCC work - Firefox is a very interesting real-world LTO benchmark, and there were a number of things to fix/improve for GCC 9 which I noticed while looking into its performance.
      Thanks a lot for your optimization work! As you can see from all the feedback, it has had quite an impact! In case you need another test case for further LTO/PGO tuning, could you have a look at Chromium for some optimization work, too? Or is it already a well-tested target internally at SUSE?
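      For reference, the basic LTO+PGO workflow being discussed, as a minimal sketch (the flags are standard GCC options; the toy program and file names are placeholders):

      /* pgo_demo.c -- placeholder workload for a GCC LTO+PGO build.
       *
       *   Step 1: gcc -O2 -flto -fprofile-generate pgo_demo.c -o pgo_demo
       *   Step 2: ./pgo_demo     (run a representative workload; writes *.gcda)
       *   Step 3: gcc -O2 -flto -fprofile-use pgo_demo.c -o pgo_demo
       */
      #include <stdio.h>

      static long work(long n)
      {
          long sum = 0;
          for (long i = 0; i < n; i++)   /* the hot loop the profile should capture */
              sum += i % 7;
          return sum;
      }

      int main(void)
      {
          printf("%ld\n", work(100000000L));
          return 0;
      }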

      Comment


      • #33
        Originally posted by skeevy420 View Post
        Being able to get more life and performance out of older hardware means we're not being wasteful consumers by tossing it aside and upgrading as fast as possible.
        That's exactly what I was saying. Breaking free from this constant consumer upgrade cycle is "liberating".
        Things like Clear Linux (and macOS) make this extremely hard.

        Comment


        • #34
          Originally posted by mlau View Post

          I'd start Gen 2 at Haswell (AVX2/FMA3, BMI1/2, ...). Intel CPUs haven't changed much since then (apart from AVX-512, which has only niche uses at the moment).
          My line of thinking was three tiers: up to Westmere, Sandy Bridge to Broadwell, and Skylake to current. Grouping Sandy Bridge with the earlier tier would add AVX code that only Sandy Bridge in that group supports -- it makes more sense to use it as the baseline for all AVX/AVX2 builds, IMHO.
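          To illustrate what such tiered baselines mean at build time, a small sketch (the feature macros are the standard ones GCC defines for the given -march; the tier names are just the ones from this discussion):

          #include <stdio.h>

          /* Which tier a distro build lands in, depending on its -march:
           *   gcc -march=westmere    -> __AES__, but no __AVX__
           *   gcc -march=sandybridge -> __AVX__, but no __AVX2__
           *   gcc -march=haswell     -> __AVX2__, __FMA__, __BMI2__
           */
          #if defined(__AVX2__)
          #  define TIER "Gen 2+ (Haswell AVX2/FMA baseline)"
          #elif defined(__AVX__)
          #  define TIER "Gen 2 (Sandy Bridge AVX baseline)"
          #else
          #  define TIER "Gen 1 (pre-AVX baseline)"
          #endif

          int main(void)
          {
              puts("Built for " TIER);
              return 0;
          }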

          Originally posted by kpedersen View Post

          That's exactly what I was saying. Breaking free from this constant consumer upgrade cycle is "liberating".
          Things like Clear Linux (and macOS) make this extremely hard.
          That's where I disagree about Clear Linux. I doubt there are many performance gains to be had on pre-2011, pre-AVX processors that LTO/-O3/PGO can't already deliver, and for that I'd suggest working with the Gentoo LTO people -- I figure once that work matures, other distributions might start using it.

          It makes sense for Intel to work on what they're doing -- targeted binaries and AVX-and-newer optimizations. That work can then be used by Arch, Fedora, or Debian for a multi-generation x86_64 model, or expanded to cover more micro-architectures/feature sets the way Solus does.

          We're now dealing with the "i486-i586-i686 64-Bit Edition" problem all over again. It'll be interesting to see what approach the various distributions go with.

          Comment


          • #35
            Originally posted by kpedersen View Post

            That's exactly what I was saying. Breaking free from this constant consumer upgrade cycle is "liberating".
            Things like Clear Linux (and macOS) make this extremely hard.
            While I am all for unlocking all the performance that is already available in older hardware, I'd argue that there comes a time when a newer baseline of CPUs would benefit the whole ecosystem by serving as a new generic target. Just remember how 64-bit x86 CPUs also happened to support SSE2, which programmers could assume to be supported on all 64-bit platforms. I know there is a newer technique called Function Multi-Versioning to get around this exact limitation, but it has some disadvantages, e.g. increased binary size. And consider the use of APUs + dGPUs for GPGPU computing nowadays (which is still in its infancy, but this is going to change sooner or later with cache-coherent interconnects, new memory technologies and packaging).
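            As an aside, a minimal Function Multi-Versioning sketch (GCC's target_clones attribute is real; the function and its body are made up for illustration). GCC emits one clone per listed target plus a resolver that picks the best one at load time -- which is exactly where the binary-size cost comes from:

            /* Hypothetical example of GCC Function Multi-Versioning. */
            __attribute__((target_clones("avx2", "avx", "default")))
            void scale(float *v, float s, int n)
            {
                for (int i = 0; i < n; i++)
                    v[i] *= s;   /* auto-vectorized differently per clone */
            }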

            There were other factors that made the adoption of these newer CPU instructions very slow (e.g. Intel's own product segmentation, or the design of these vector extensions, which made their use dependent on the software developers - unlike the new ARM SVE, which scales automatically to wider vector sizes). All of these mistakes are quite a pity, considering vectorization and parallelization were the major sources of CPU innovation during the last decade.

            I want to see all of this goodness used more effectively instead of the brute force approach of higher IPC and frequency!

            Comment


            • #36
              Originally posted by ms178 View Post

              Thanks a lot for your optimization work! As you can see from all the feedback, it has had quite an impact! In case you need another test case for further LTO/PGO tuning, could you have a look at Chromium for some optimization work, too? Or is it already a well-tested target internally at SUSE?
              I am currently looking into HHVM and the Clang binary for a bit more testing. Chromium builds with GCC in SUSE's RPM package, so I can try to look at it, too. The last time I did was about two or three years ago, since I am not that familiar with its build machinery. Since LTO support has been added in the meantime, I guess it is time to try again.

              Comment


              • #37
                Originally posted by hubicka View Post

                I am currently looking into HHVM and the Clang binary for a bit more testing. Chromium builds with GCC in SUSE's RPM package, so I can try to look at it, too. The last time I did was about two or three years ago, since I am not that familiar with its build machinery. Since LTO support has been added in the meantime, I guess it is time to try again.
                Great to hear that! I guess this could have quite an impact for us users, too. By the way, I've found a recent ticket in the Chromium bug tracker which fits perfectly into the picture, as they recently looked into a Clang LTO + PGO build as well [UPDATE: Turns out that specific bug is not about what I thought it was, but the Chromium devs are eager to see the numbers regardless]. I've mentioned your work to them today and hope that they are as curious as I am to see what the results will look like.

                See: https://bugs.chromium.org/p/chromium...tail?id=906037
                Last edited by ms178; 10 January 2019, 06:49 PM. Reason: wrong interpretation of what the bug was about

                Comment


                • #38
                  Originally posted by ms178 View Post
                  I want to see all of this goodness used more effectively instead of the brute force approach of higher IPC and frequency!
                  I don't think you know what brute force means. Increasing IPC is much more complicated than adding vector instructions or widening the vectors.

                  Comment


                  • #39
                    Originally posted by Weasel View Post
                    I don't think you know what brute force means. Increasing IPC is much more complicated than adding vector instructions or widening the vectors.
                    You're mixing separate things I said there, and there are certainly more aspects relevant to IPC than adding vector instructions and widening vectors. Also, I am not a CPU engineer, and I probably shouldn't have mentioned IPC at all to make my point. To rephrase my thoughts: to me as a layman, the Intel approach with Skylake +, ++ and +++ was just that: brute force instead of real innovation (adding two to four more of the same core building blocks, refining the process for higher frequencies at the cost of higher thermal density and TDP). Of course, they had little choice, because they made themselves dependent on new process nodes for their newer IP (a mistake they are going to fix in the future, as they said at their Architecture Day back in December).

                    Just for comparison, the HSA approach with APUs is, in my (layman's) eyes, far more innovative, albeit not yet fully where it needs to be hardware- and software-wise. AMD also has the underdog's disadvantage in swaying the rest of the market to adopt its framework.

                    Comment


                    • #40
                      Originally posted by ms178 View Post
                      You're mixing separate things I said there, and there are certainly more aspects relevant to IPC than adding vector instructions and widening vectors.
                      No, I'm talking about increasing IPC without widening the vectors. That's a very complicated process: out-of-order logic is extremely complex.

                      Increasing vector sizes is trivial in comparison; that's why they even go this route, to avoid complexity on the chip. Out-of-order logic scales badly compared with vectors. So increasing vector sizes is much closer to "brute force" in this case, because it's the simplest, linear-scaling solution. Brute force generally means the most straightforward approach to something, nothing clever or complex -- which is exactly what widening the vectors is.

                      Okay, increasing frequency (without doing anything else, a.k.a. overclocking) is technically easier, but...
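                      To make the scaling point concrete, a hedged sketch (the intrinsics are the standard x86 ones; the loop itself is made up). The same reduction written for SSE2, AVX or AVX-512 differs almost only in register width -- 4, 8 or 16 floats per add -- while widening an out-of-order core means reworking schedulers, ports and retire logic:

                      #include <immintrin.h>

                      /* Sums n floats, 8 per __m256 add; compile with -mavx.
                         Assumes n % 8 == 0 to keep the sketch short. */
                      float sum_avx(const float *a, int n)
                      {
                          __m256 acc = _mm256_setzero_ps();
                          for (int i = 0; i < n; i += 8)
                              acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));

                          float lane[8];
                          _mm256_storeu_ps(lane, acc);   /* spill for horizontal sum */
                          float s = 0.0f;
                          for (int i = 0; i < 8; i++)
                              s += lane[i];
                          return s;
                      }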

                      Comment
