Benchmarking OpenMandriva 4.0 Alpha - The First Linux OS With An AMD Zen Optimized Build


  • F.Ultra
    replied
    Originally posted by stormcrow View Post

    That's a very good example, thank you for sharing. Have you tried the code on different compilers to see if they all produce slower results for the second version, or on different architectures (ARMv6/7 vs. x86, for example), merely for curiosity's sake? Conventional wisdom would suggest that removing the conditionals should speed up the execution; at least that's what one prof used to harp on back in the day. It seems that's not strictly the case these days with some modern processors and compilers.
    I don't have access to ARM hardware, so I haven't tested on anything other than amd64. I agree with your prof, since that is a conclusion I've drawn myself from 37 years of programming, and it _should_ still apply today considering that lessening the impact of conditionals is why Intel et al. went with the branch speculation that led us to the Spectre vulnerabilities.

    In the end (this was some years ago, but the strangeness of it made me remember it to this day) I shrugged it off as either a strange anomaly or a flawed benchmark and decided to keep the optimized version, since it made no sense that it would be slower. And the slowdown was very tiny; we are talking some microseconds after processing some millions of messages. The thing that mattered most was that, all in all, my code was 1600% faster than our previous solution (which was itself 145% faster than the fastest Java implementation available on the open market).

    I read that the SQLite devs no longer run any benchmarks regularly, since the complexity of today's systems taints the picture (which is why I also cringe when I see these benchmarks on Phoronix where the chart shows "number of nanoseconds", since the difference shown there between the systems is probably below the standard deviation anyway), and instead rely on the cycle counts done by Valgrind.
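    For reference, that Valgrind-based approach looks roughly like this (a sketch; ./decoder_bench is a placeholder binary name, and cachegrind is the Valgrind tool that produces deterministic instruction and cache-event counts):
    Code:
        # Deterministic instruction/cache counts, independent of system noise.
        valgrind --tool=cachegrind ./decoder_bench
        # Summarize the counts written to cachegrind.out.<pid>:
        cg_annotate cachegrind.out.<pid>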



  • CochainComplex
    replied
    Originally posted by jojo7887 View Post
    Recently I got into building the kernel myself. I know how to remove/add drivers and features but have never tried these build parameters. Let's say I downloaded the source and got the .config file set up correctly, and I'm ready to hit make: how do I add the "Znver1" optimization? Would appreciate any input :-)
    I have used this patch https://github.com/graysky2/kernel_gcc_patch ... running on Debian 9.6 / GCC 6.3 / kernel 4.20 with no issues at all.

    git apply the patch, then use make oldconfig or make menuconfig - it should ask you which arch you want to build against, as sketched below.
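    Roughly, the steps look like this (the patch file name is illustrative - the repo ships one patch per kernel/GCC version range, so pick the one matching your setup):
    Code:
        # From inside the kernel source tree:
        git apply ../kernel_gcc_patch/enable_additional_cpu_optimizations_for_gcc_v4.9+_kernel_v4.13+.patch
        make oldconfig        # "Processor family" now offers e.g. "AMD Zen"
        make -j"$(nproc)"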
    Last edited by CochainComplex; 28 December 2018, 12:07 PM.



  • r08z
    replied
    All these tests do is show that, for the time being, the Clear Linux® folks know what they're doing. I've never seen a Phoronix benchmark disappoint when distros were tested against Clear Linux.



  • berolinux
    replied
    This is an alpha release -- obviously it isn't as fine-tuned as it could be. (Hint: Volunteers wanted -- join us in #openmandriva-cooker on freenode or on http://forum.openmandriva.org/)

    For the initial build, we're essentially relying on clang and gcc to do their job, along with making sure some instructions that aren't in generic x86_64 (SSE*, 3dnow, AVX2, ...) are used in applications that have asm implementations of the relevant code. (And also making sure Intel-specific drivers don't get built into the znver1 kernel, but that's more about saving space than about performance.)
    There's obviously more that can be done (and will be done in the future).
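    As an illustration of the runtime-dispatch pattern such applications use, here is a minimal sketch with GCC/Clang extensions (the function names are made up for the example, not OpenMandriva code):
    Code:
        /* Pick an AVX2 code path only when the CPU actually supports it. */
        #include <stdio.h>
        #include <stddef.h>
        
        static float sum_generic(const float *a, size_t n) {
            float s = 0.0f;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        
        /* AVX2 enabled for this function only; real projects would
           typically use intrinsics or hand-written asm here. */
        __attribute__((target("avx2")))
        static float sum_avx2(const float *a, size_t n) {
            float s = 0.0f;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        
        int main(void) {
            float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            float s = __builtin_cpu_supports("avx2") ? sum_avx2(data, 8)
                                                     : sum_generic(data, 8);
            printf("sum = %f\n", s);
            return 0;
        }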



  • Spooktra
    replied
    Well, that was a letdown.



  • stormcrow
    replied
    Originally posted by F.Ultra View Post

    I sometimes long for the old times when I knew the exact number of CPU cycles a piece of code would consume just by looking at the mnemonics (M68K and MOS6510). These days the inner workings of the CPU are so complex that even other programs running at the same time on the same caches make your code behave differently.

    Also, it sometimes feels like either the CPU or the compilers today optimize better for "worse" code, and that obvious optimizations are not always optimizations at all.

    For instance, I once created a source code generator for a particular protocol used in my industry (the layout of the protocol is defined using XML, so it's known beforehand exactly what the on-the-wire protocol will be for each specific implementation), and one of the features of this protocol is that it has a cache (called a dictionary in the protocol lingo) to avoid sending unchanged data on the wire. I noticed that with specific combinations in the specification certain circumstances could not happen in a well-behaved implementation, so I made an obvious optimization for those cases:

    The original generated code:
    Code:
        if ((*pmap & 32) == 32) {
            msg->MsgSeqNum = 0;
        
            do {
                msg->MsgSeqNum = (msg->MsgSeqNum << 7) | (*message & 127);
            } while ((*message++ & 128) == 0 && message < endptr);
        
            ctx->dictionary[0].type = DICT_ASSIGNED;
            ctx->dictionary[0].u_integer = msg->MsgSeqNum;
        } else {
            if (ctx->dictionary[0].type == DICT_ASSIGNED) {
                msg->MsgSeqNum = ++ctx->dictionary[0].u_integer;
            } else if (ctx->dictionary[0].type == DICT_UNDEFINED) {
                msg->MsgSeqNum = ctx->dictionary[0].u_integer = 0;
                ctx->dictionary[0].type = DICT_ASSIGNED;
            }
        }
    The obvious optimization (since in this case the operator used in the XML template specifies that the cache/dictionary cannot be in an undefined state when the bit in pmap is not set):
    Code:
        if ((*pmap & 32) == 32) {
            msg->MsgSeqNum = 0;
        
            do {
                msg->MsgSeqNum = (msg->MsgSeqNum << 7) | (*message & 127);
            } while ((*message++ & 128) == 0 && message < endptr);
        
            ctx->decoder_dict[0].u_integer = msg->MsgSeqNum;
        } else {
            msg->MsgSeqNum = ++ctx->decoder_dict[0].u_integer;
        }
    As one can see, the optimized version removes a lot of conditionals in the common code path (the common path here being that the bit in pmap is not set). Yet the optimized version is a tiny, tiny bit slower for some reason in every benchmark I've run on it.
    That's a very good example, thank you for sharing. Have you tried the code on different compilers to see if they all produce slower results for the second version, or on different architectures (ARMv6/7 vs. x86, for example), merely for curiosity's sake? Conventional wisdom would suggest that removing the conditionals should speed up the execution; at least that's what one prof used to harp on back in the day. It seems that's not strictly the case these days with some modern processors and compilers.
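    One way to probe that empirically (a sketch; the two binary names are placeholders for builds of the original and optimized variants) is to compare hardware branch counters with perf:
    Code:
        # If the branch predictor already eats the removed conditionals,
        # branch-misses should be near-identical for both variants.
        perf stat -e cycles,instructions,branches,branch-misses ./decode_original
        perf stat -e cycles,instructions,branches,branch-misses ./decode_optimized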



  • wizard69
    replied
    Originally posted by Anty View Post
    Results are inconsistent... One would expect the same performance as the worst-case scenario, but here it is hit or miss - on average the znver1 opts nullify the potential gains. Is this again a GCC quirk?
    Inconsistent, yes - a surprise, no.

    First, this is release one. It will take a while to implement all optimizations.

    Second, GCC might not be the best compiler to build an optimized distro with. It would be interesting to see more testing here.

    This is all very interesting, but I suspect that performance testing will become far more interesting when software starts to take C++17 features into account, especially with Intel's latest software donation.



  • F.Ultra
    replied
    Originally posted by stormcrow View Post


    They are surprising results, but this set of benchmarks does underline why simply blanket-applying optimizations (e.g. -march=native and the like) can actually create performance regressions rather than improvements, or no change either way.

    Without deep dives into what's going on in the code paths, all we can do is speculate about what's causing the regressions. But for some real examples of such things in the past, you can look at the AVX-512 history in Intel CPUs. The generation in which AVX-512 was introduced only ran those instructions at half processor speed (for example, a 3 GHz CPU could only run AVX-512 instructions at 1.5 GHz), so any time that code path was taken there was the potential you could be slowing things down versus taking other potential instruction paths instead. There are other gotchas that can be hit with overly aggressive optimizations, like unwanted code mangling and the like.

    For practical matters, though: don't just assume aggressive optimizations are faster even if they seem so. Test, test, test using empirical results.
    I sometimes long for the old times when I knew the exact number of CPU cycles a piece of code would consume just by looking at the mnemonics (M68K and MOS6510). These days the inner workings of the CPU are so complex that even other programs running at the same time on the same caches make your code behave differently.

    Also, it sometimes feels like either the CPU or the compilers today optimize better for "worse" code, and that obvious optimizations are not always optimizations at all.

    For instance, I once created a source code generator for a particular protocol used in my industry (the layout of the protocol is defined using XML, so it's known beforehand exactly what the on-the-wire protocol will be for each specific implementation), and one of the features of this protocol is that it has a cache (called a dictionary in the protocol lingo) to avoid sending unchanged data on the wire. I noticed that with specific combinations in the specification certain circumstances could not happen in a well-behaved implementation, so I made an obvious optimization for those cases:

    The original generated code:
    Code:
                if ((*pmap & 32) == 32) {
                    msg->MsgSeqNum = 0;
    
                    do {
                        msg->MsgSeqNum = (msg->MsgSeqNum << 7) | (*message & 127);
                    } while ((*message++ & 128) == 0 && message < endptr);
    
                    ctx->dictionary[0].type = DICT_ASSIGNED;
                    ctx->dictionary[0].u_integer = msg->MsgSeqNum;
                } else {
                    if (ctx->dictionary[0].type == DICT_ASSIGNED) {
                        msg->MsgSeqNum = ++ctx->dictionary[0].u_integer;
                    } else if (ctx->dictionary[0].type == DICT_UNDEFINED) {
                        msg->MsgSeqNum = ctx->dictionary[0].u_integer = 0;
                        ctx->dictionary[0].type = DICT_ASSIGNED;
                    }
    
                }
    The obvious optimization (since in this case the operator used in the XML template specifies that the cache/dictionary cannot be in an undefined state when the bit in pmap is not set):
    Code:
                if ((*pmap & 32) == 32) {
                    msg->MsgSeqNum = 0;
    
                    do {
                        msg->MsgSeqNum = (msg->MsgSeqNum << 7) | (*message & 127);
                    } while ((*message++ & 128) == 0 && message < endptr);
    
                    ctx->decoder_dict[0].u_integer = msg->MsgSeqNum;
                } else {
                    msg->MsgSeqNum = ++ctx->decoder_dict[0].u_integer;
                }
    As one can see, the optimized version removes a lot of conditionals in the common code path (the common path here being that the bit in pmap is not set). Yet the optimized version is a tiny, tiny bit slower for some reason in every benchmark I've run on it.
    Last edited by F.Ultra; 26 December 2018, 12:10 PM.



  • ms178
    replied
    Originally posted by jojo7887 View Post
    Recently I got into building the kernel myself. I know how to remove/add drivers and features but have never tried these build parameters. Let's say I downloaded the source and got the .config file set up correctly, and I'm ready to hit make: how do I add the "Znver1" optimization? Would appreciate any input :-)
    I recommend using a ramdisk to speed up the process:
    Code:
        sudo mount -t tmpfs -o size=2048m tmpfs /mnt
    (then copy your kernel source and config file to /mnt/linux-***)

    Search for HOSTCFLAGS / HOSTCXXFLAGS / HOSTLDFLAGS in the top-level Makefile and adjust them as follows (apart from the optimization flags, I recommend keeping the original flags):
    Code:
        HOSTCFLAGS = -mtune=native -march=native -O3 -fasynchronous-unwind-tables -feliminate-unused-debug-types -ftree-loop-distribution -floop-nest-optimize -fgraphite-identity -floop-parallelize-all -std=gnu89
        HOSTCXXFLAGS = -mtune=native -march=native -O3 -fasynchronous-unwind-tables -feliminate-unused-debug-types -ftree-loop-distribution -floop-nest-optimize -fgraphite-identity -floop-parallelize-all
        HOSTLDFLAGS = -O3 -mtune=native -march=native -fgraphite-identity -floop-nest-optimize -Wl,--as-needed -fopenmp
    (if you are brave you could try -std=gnu11 instead of -std=gnu89; this usually works nowadays)

    Then start the compilation by typing, from within the kernel directory:
    Code:
        sudo make -j12 rpm-pkg
    (-j12 suits a 6-core machine; use deb-pkg instead of rpm-pkg if you are on Debian/Ubuntu. This will get you three packages - their location depends on your distro: on SUSE Tumbleweed they are in /usr/src/packages/x86_64/, with Ubuntu they are in the same directory as, or one above, the top-level directory.)

    Make sure that you have all the build dependencies installed.
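    If the build aborts early, it is usually down to missing build dependencies; on Debian/Ubuntu something like this covers the usual suspects (package names as in stock Debian, adjust for your distro):
    Code:
        sudo apt-get build-dep linux    # needs deb-src entries enabled in sources.list
        sudo apt-get install build-essential libncurses-dev libssl-dev \
            libelf-dev flex bison bc rsync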
    Last edited by ms178; 26 December 2018, 11:03 AM.



  • stormcrow
    replied
    Originally posted by Anty View Post
    Results are inconsistent... One would expect the same performance as the worst-case scenario, but here it is hit or miss - on average the znver1 opts nullify the potential gains. Is this again a GCC quirk?

    They are surprising results, but this set of benchmarks does underline why simply blanket-applying optimizations (e.g. -march=native and the like) can actually create performance regressions rather than improvements, or no change either way.

    Without deep dives into what's going on in the code paths, all we can do is speculate about what's causing the regressions. But for some real examples of such things in the past, you can look at the AVX-512 history in Intel CPUs. The generation in which AVX-512 was introduced only ran those instructions at half processor speed (for example, a 3 GHz CPU could only run AVX-512 instructions at 1.5 GHz), so any time that code path was taken there was the potential you could be slowing things down versus taking other potential instruction paths instead. There are other gotchas that can be hit with overly aggressive optimizations, like unwanted code mangling and the like.

    For practical matters, though: don't just assume aggressive optimizations are faster even if they seem so. Test, test, test using empirical results.
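    In that spirit, a minimal sketch of what "test, test, test" looks like in practice (file and binary names are examples; the flags are standard GCC options):
    Code:
        # Build the same code generic and natively tuned, then measure both
        # instead of assuming the tuned build wins.
        gcc -O2 -o bench_generic bench.c
        gcc -O2 -march=native -o bench_native bench.c
        time ./bench_generic
        time ./bench_native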

