Solus Linux Experimenting With Automated Profiling/Optimizations

  • #11
    Originally posted by Yndoendo View Post
    I applaud them, but having multiple systems on the same distro just means performance will be latched to the designated set of testing hardware. Clear Linux can get away with it since Intel is trying to make it work faster on their chips, but insert an alternative architecture or manufacturer and it all falls apart.
    This is very incorrect. The performance gains quoted aren't latched to the test hardware, and the improvements have no requirement for specific hardware (they run the same paths on any 64-bit processor). The binaries are simply more efficient but still use -mtune=generic -march=x86-64: no FMV (or avx2 libs), no patched source. I can boost performance further by going down those other routes, and have done so in other performance areas.

    Originally posted by Yndoendo View Post
    Gentoo could come out on top if they created automated application profiling to find the optimal compiler settings per-application, since their solution would be architecture-independent.
    I would love to see more people going down the optimization path (other than the Clear Linux folks, who already do). We could do a lot more by working together (but with different implementations) and bring in some fresh ideas.

    • #12
      Originally posted by sunnyflunk View Post

      This is pretty much what I'm thinking and where I'm going: build packages optimized with proven build flags, PGO where possible, and plug in advanced instructions where (and only where) it makes sense. That's pretty much the purpose behind automating my processes as much as possible, so that I can apply it to as many packages as possible. I can set up a test, run it overnight and know what the flags do and whether CPU instructions are valuable (and to what degree). Then, by adding a couple of lines, I can test the impact of a PGO implementation. In future, I will utilise clang and linker variations to test their performance.

      FMV is very doable, but it can take a long time (and the patches will likely need maintenance with future releases). I have ideas on how to implement it better in my testing, but nothing is coded yet.

      The biggest hurdle is having a benchmark to test each package.
      That is a fine way to do it, if you want to miss a lot of optimization.

      • #13
        Originally posted by Yndoendo View Post
        I applaud them, but having multiple systems on the same distro just means performance will be latched to the designated set of testing hardware. Clear Linux can get away with it since Intel is trying to make it work faster on their chips, but insert an alternative architecture or manufacturer and it all falls apart.
        Just checking... were you referring to function multiversioning (FMV), as the posts before yours did?

        • #14
          Originally posted by sunnyflunk View Post
          FMV is very doable, but it can take a long time (and the patches will likely need maintenance with future releases). I have ideas on how to implement it better in my testing, but nothing is coded yet.
          I'd see FMV more as part of working on source code optimizations than as part of working on build/packaging optimizations. However, one doesn't exclude the other.

          EDIT: Clear Linux probably added the FMV versions as patches, so perhaps you can get them from the Clear Linux source repos. AFAIK Intel welcomes that, as they say they would like to get the changes upstream (a process which doesn't seem to be meeting much active response).
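
          For reference, a minimal sketch of what an FMV change looks like at the source level, using GCC's target_clones attribute (GCC 6+); the function below is made up for illustration, not taken from an actual Clear Linux patch:

          /* fmv_sketch.c - illustration of GCC function multi-versioning (FMV).
           * GCC emits one clone per listed target plus a resolver that picks the
           * best clone at load time based on the CPU's capabilities (IFUNC).
           * Build: gcc -O3 -march=x86-64 -mtune=generic fmv_sketch.c -o fmv_sketch
           */
          #include <stdio.h>
          #include <stdlib.h>

          #define N 4096

          __attribute__((target_clones("avx2", "default")))
          double dot(const double *a, const double *b, size_t n)
          {
              double sum = 0.0;
              for (size_t i = 0; i < n; i++)
                  sum += a[i] * b[i];   /* the avx2 clone can vectorize this with AVX2 */
              return sum;
          }

          int main(void)
          {
              double *a = malloc(N * sizeof *a);
              double *b = malloc(N * sizeof *b);
              for (size_t i = 0; i < N; i++) {
                  a[i] = (double)i;
                  b[i] = (double)(N - i);
              }
              printf("dot = %f\n", dot(a, b, N));
              free(a);
              free(b);
              return 0;
          }

          The resolver picks a clone per function based on CPU capabilities, so the same binary runs everywhere while still taking the avx2 path where it is available; deciding which functions deserve this treatment is the part that reads like source work rather than packaging work.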
          Last edited by indepe; 28 January 2017, 05:55 PM.

          • #15
            Originally posted by sunnyflunk View Post
            The biggest hurdle is having a benchmark to test each package.
            Obviously your approach works best with having a benchmark for each package. However, when you don't, how about categorizing packages into groups, and then using general benchmarks (like those used by Michael to compare distros) to test the summary effect of optimizations for each group? Or are your optimizations (PGO and so on) too specific in nature?

            Another question, if I may: you seem like a good person to ask which optimizations to use, in general, when compiling/building an application with performance in mind.

            • #16
              Originally posted by indepe View Post

              I'd see FMV more as part of working on source code optimizations than as part of working on build/packaging optimizations. However, one doesn't exclude the other.

              EDIT: Clear Linux probably added the FMV versions as patches, so perhaps you can get them from the Clear Linux source repos. AFAIK Intel welcomes that, as they say they would like to get the changes upstream (a process which doesn't seem to be meeting much active response).
              Often packaging optimizations are better than compiling for avx2. Taking my ogg numbers quoted in the article, my automated testing revealed three ways to address the slow decoding speed:

              1. -funroll-loops: This impacted encoding speed badly though, so not a great solution (and never something you enable without a comprehensive test).
              2. -mavx2: If the loops are inefficient, then being able to power through them saves a lot of time!
              3. PGO: This was even better than -mavx2 and has the benefit of running on all hardware. It doesn't preclude using CPU instructions as well, but the gain from avx2 was not as impressive once PGO was applied (see the sketch below).
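
              To make point 3 concrete, here is a minimal sketch of the two-step GCC PGO build; the file name and workload are hypothetical, not my actual ogg setup:

              /* pgo_sketch.c - minimal sketch of a GCC PGO build (hypothetical workload).
               *
               * Step 1: build instrumented and run a representative workload:
               *   gcc -O3 -fprofile-generate pgo_sketch.c -o pgo_sketch && ./pgo_sketch
               * Step 2: rebuild using the profile data written in step 1:
               *   gcc -O3 -fprofile-use pgo_sketch.c -o pgo_sketch
               *
               * The profile steers inlining, unrolling and branch layout toward the
               * paths the training run actually exercised.
               */
              #include <stdio.h>
              #include <stdlib.h>

              /* Stand-in for a hot decode loop with a data-dependent branch. */
              static long transform(const unsigned char *buf, size_t n)
              {
                  long acc = 0;
                  for (size_t i = 0; i < n; i++) {
                      if (buf[i] & 0x80)        /* rare path in the training data */
                          acc -= buf[i];
                      else                      /* hot path */
                          acc += buf[i] * 3;
                  }
                  return acc;
              }

              int main(void)
              {
                  size_t n = 1 << 24;
                  unsigned char *buf = malloc(n);
                  for (size_t i = 0; i < n; i++)
                      buf[i] = (i % 997 == 0) ? 0xFF : (unsigned char)(i % 100);
                  printf("%ld\n", transform(buf, n));
                  free(buf);
                  return 0;
              }

              The whole benefit hinges on the training run being representative, which is where the per-package benchmark comes in.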

              Looking at Clear Linux packaging files, I see FMV applied in 2 packages. From what I've seen, they have specific machines that they target (hence the avx2 libraries), whereas for me to properly implement and fully test FMV would take many hours at a minimum for just one package. Enabling avx2 libs takes no time at all.

              But I'm definitely going to investigate automating and testing FMV better; I'll just have to improve my process. I've seen applying avx2 to individual functions increase the speed beyond a full -mavx2 binary, and I've also seen it decrease performance.

              • #17
                Originally posted by indepe View Post

                Obviously your approach works best with having a benchmark for each package. However, when you don't, how about categorizing packages into groups, and then using general benchmarks (like those used by Michael to compare distros) to test the summary effect of optimizations for each group? Or are your optimizations (PGO and so on) too specific in nature?

                Another question, if I may: you seem like a good person to ask which optimizations to use, in general, when compiling/building an application with performance in mind.
                The benchmark doesn't have to be super specific to the package; it just needs to show performance changes when the speed of the package changes. A general benchmark often isn't helpful: sometimes you are looking at a 3% gain, and if the benchmark isn't specific enough, it won't show up beyond the margin of error. I do use the same benchmark for multiple tests, but you have to be very careful, because if you optimize to a non-representative benchmark, bad things happen. For example, if I PGO a picture library against editing metadata, it does so at the expense of the other functions (i.e. time-consuming conversions and decoding). And usually you want two tests/benchmarks: one for the PGO, and a separate one to test the results.
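
                As a rough sketch of what "specific enough" means in practice (the workload below is just a placeholder; the numbers that matter are the median and the run-to-run spread):

                /* bench_sketch.c - minimal timing harness sketch (hypothetical workload).
                 * Build: gcc -O2 bench_sketch.c -o bench_sketch
                 */
                #include <stdio.h>
                #include <stdlib.h>
                #include <time.h>

                #define RUNS 15

                /* Placeholder for the operation being optimized (e.g. one decode pass). */
                static volatile unsigned long sink;
                static void workload(void)
                {
                    unsigned long acc = 0;
                    for (unsigned long i = 0; i < 50 * 1000 * 1000; i++)
                        acc += i ^ (acc << 1);
                    sink = acc;
                }

                static int cmp_double(const void *a, const void *b)
                {
                    double d = *(const double *)a - *(const double *)b;
                    return (d > 0) - (d < 0);
                }

                int main(void)
                {
                    double secs[RUNS];
                    for (int r = 0; r < RUNS; r++) {
                        struct timespec t0, t1;
                        clock_gettime(CLOCK_MONOTONIC, &t0);
                        workload();
                        clock_gettime(CLOCK_MONOTONIC, &t1);
                        secs[r] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
                    }
                    qsort(secs, RUNS, sizeof secs[0], cmp_double);
                    double median = secs[RUNS / 2];
                    double spread = (secs[RUNS - 1] - secs[0]) / median * 100.0;
                    printf("median %.4fs, min %.4fs, max %.4fs, spread %.1f%%\n",
                           median, secs[0], secs[RUNS - 1], spread);
                    return 0;
                }

                If the spread printed here is larger than the gain you're chasing, the benchmark can't tell you whether the optimization worked.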

                Without testing the optimizations per package, I'd probably stick to plain -O3 -march=native. Honestly, trying to over-optimize without being able to validate the results can do more harm than good. But you can boost performance quite a bit when you can pick the right flags that work on the specific code.

                • #18
                  Originally posted by sunnyflunk View Post
                  Without testing the optimizations per package, I'd probably stick to plain -O3 -march=native. Honestly, trying to over-optimize without being able to validate the results can do more harm than good. But you can boost performance quite a bit when you can pick the right flags that work on the specific code.
                  Same experience with source code optimizations: for functions that are crucial to performance, it is always necessary to measure the actual effect; guessing won't do. That is also why it is great to have a site like Phoronix.

                  • #19
                    Originally posted by sunnyflunk View Post
                    ... Sometimes you are looking at a 3% gain, and if the benchmark isn't specific enough, it won't show up beyond the margin of error. ...
                    By the way, I recently found that using the cpufreq governor "userspace" instead of "powersave" or "performance", with a non-turbo constant frequency (without the +1 MHz), results in much more repeatable execution times on my machine. It reduces the number of interrupts (excessively long execution times, as measured by RDTSC at the sub-millisecond level) by a factor of 4, but, more importantly, it greatly reduces fluctuations even when I have already filtered out those longer execution times. With the constant CPU frequency I need to run tests only 2-3 times (though with large loops in each test), instead of 10-20 times as with "performance", in order to get a reliable comparison between two alternative implementations. You probably knew that already, but it was new to me (as is most everything Linux...).
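
                    Roughly the kind of measurement loop I mean, as a minimal sketch (the measured operation is a placeholder, and the governor/frequency is assumed to have been pinned beforehand):

                    /* rdtsc_sketch.c - minimal sketch of RDTSC-based micro-timing with
                     * outlier filtering. Assumes the cpufreq governor has been set to a
                     * fixed, non-turbo frequency beforehand.
                     * Build: gcc -O2 rdtsc_sketch.c -o rdtsc_sketch
                     */
                    #include <stdio.h>
                    #include <stdint.h>
                    #include <x86intrin.h>   /* __rdtsc() */

                    #define SAMPLES 100000

                    static volatile uint64_t sink;
                    static void operation(void)   /* placeholder for the code under test */
                    {
                        uint64_t acc = 0;
                        for (int i = 0; i < 2000; i++)
                            acc += (acc << 3) ^ i;
                        sink = acc;
                    }

                    int main(void)
                    {
                        static uint64_t cycles[SAMPLES];
                        for (int i = 0; i < SAMPLES; i++) {
                            uint64_t start = __rdtsc();
                            operation();
                            cycles[i] = __rdtsc() - start;
                        }

                        /* Filter out samples inflated by interrupts/preemption: anything
                         * far above the minimum is noise, not the operation itself. */
                        uint64_t min = cycles[0];
                        for (int i = 1; i < SAMPLES; i++)
                            if (cycles[i] < min)
                                min = cycles[i];

                        uint64_t kept = 0, sum = 0, dropped = 0;
                        for (int i = 0; i < SAMPLES; i++) {
                            if (cycles[i] > 2 * min) { dropped++; continue; }
                            sum += cycles[i];
                            kept++;
                        }
                        printf("min %llu cycles, filtered mean %.1f cycles, dropped %llu/%d\n",
                               (unsigned long long)min, (double)sum / kept,
                               (unsigned long long)dropped, SAMPLES);
                        return 0;
                    }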
                    Last edited by indepe; 30 January 2017, 05:42 PM.

                    • #20
                      Originally posted by indepe View Post

                      By the way, I recently found that using the cpufreq governor "userspace" instead of "powersave" or "performance", with a non-turbo constant frequency (without the +1 MHz), results in much more repeatable execution times on my machine. It reduces the number of interrupts (excessively long execution times, as measured by RDTSC at the sub-millisecond level) by a factor of 4, but, more importantly, it greatly reduces fluctuations even when I have already filtered out those longer execution times. With the constant CPU frequency I need to run tests only 2-3 times (though with large loops in each test), instead of 10-20 times as with "performance", in order to get a reliable comparison between two alternative implementations. You probably knew that already, but it was new to me (as is most everything Linux...).
                      I haven't come across such issues as yet. Given my goal is optimizing Solus, I'm making sure that I'm testing what someone would get from the repo (and then pushing the optimizations straight to the repo). So the tests must use system-provided libraries/binaries (directly or indirectly). A game doesn't have to be built against local libs to test mesa/xorg-server, for example, but it does if I want to test SDL optimizations.

                      Then I can go back and run all the tests with a different kernel configuration/scheduler and see the impact of that.
