AMD Compiler Optimization Benchmarks With GCC 4.10 (GCC 5.0)

Adarion replied

19 August 2014, 05:00 AM
And Michael still refuses to use Gentoo for benchmarks.

But now let's take these results into consideration for EVERY kind of benchmark out there. Intel (R) (TM) optimized benchmarks get run on AMD CPUs/APUs, and then people blare about AMD being slower than Intel. Well, maybe in some areas Intel's raw performance is better. But it is not as much better as all these hardware testers make you think, especially when they are running their binary-blob benchmark software.
sdack replied

17 August 2014, 01:33 PM
Originally posted by sdack

Just in case you are not aware of it: btver1 and bdver1 are not the same type of CPU. btver1 stands for the Bobcat APU, which is probably the CPU inside your E-350. bdver1 refers to the Bulldozer CPU, like the FX-8150. The latter is also the one with SSE4.1, SSE4.2 and AVX instructions, which are missing from the Bobcat. So while the article shows results for bdver1, these have no meaning for your E-350. You should expect results similar to the Phenom (aka K10, amdfam10 or Barcelona) with your E-350, though.

Since you are trying to squeeze some more performance out of your E-350s... take a look at ZSWAP for the Linux kernel if you have not done so already. It is said to work miracles. It compresses memory pages just before they get swapped out. You get less swapping, the swapping that remains is faster, and it feels like having more memory. You will need to enable it in the kernel configuration first if that has not been done yet (it is marked as experimental), and then add or change the following in /etc/default/grub (or wherever your distro keeps the GRUB configuration):

GRUB_CMDLINE_LINUX_DEFAULT="zswap.max_pool_percent=25 zswap.compressor=lz4 zswap.enabled=1 quiet"

The ZSWAP feature needs to be enabled at boot, and it is easiest to pass the arguments to the kernel through GRUB. Even if you have a desktop or server and think you do not need it, you should at least take a look at it. With LZ4 as the compressor/decompressor it achieves compression ratios in the range from 1:2 to 1:4, at speeds of 200MB/s-500MB/s for compression and 1GB/s-2GB/s for decompression. Even if you use a very fast SSD for swapping, it will reduce the I/O and extend the SSD's life.
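
For anyone who wants to check after a reboot whether those boot arguments actually took effect, here is a minimal Python sketch. It assumes the kernel exposes the zswap module parameters under /sys/module/zswap/parameters, which is the usual sysfs location when zswap is built in; the function name read_zswap_params is made up for this example.

```python
from pathlib import Path

# Usual sysfs location of the zswap module parameters (assumption: the
# kernel was built with zswap support; otherwise the directory is absent).
ZSWAP_PARAMS = Path("/sys/module/zswap/parameters")


def read_zswap_params(base=ZSWAP_PARAMS):
    """Return {parameter_name: value} for every readable file under base."""
    base = Path(base)
    params = {}
    if not base.is_dir():
        return params  # zswap is not available on this kernel
    for entry in sorted(base.iterdir()):
        if not entry.is_file():
            continue
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Skip parameters that cannot be read by the current user.
            continue
    return params


if __name__ == "__main__":
    params = read_zswap_params()
    if not params:
        print("zswap parameters not found; is zswap enabled in the kernel?")
    for name, value in params.items():
        print(f"zswap.{name}={value}")
```

If the GRUB line above was picked up, the printed values for enabled, compressor and max_pool_percent should mirror what was passed on the kernel command line.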
sdack replied

17 August 2014, 12:49 PM
Originally posted by rudregues

oleid, I have an E-350 too. I tried many benchmarks, even comparing Gentoo and Ubuntu, and came to the following conclusion: there's little to no difference. And sometimes the x86_64 Ubuntu binaries were a little faster than the btver1-optimized binaries.

Just in case you are not aware of it: btver1 and bdver1 are not the same type of CPU. btver1 stands for the Bobcat APU, which is probably the CPU inside your E-350. bdver1 refers to the Bulldozer CPU, like the FX-8150. The latter is also the one with SSE4.1, SSE4.2 and AVX instructions, which are missing from the Bobcat. So while the article shows results for bdver1, these have no meaning for your E-350. You should expect results similar to the Phenom (aka K10, amdfam10 or Barcelona) with your E-350, though.

Last edited by sdack; 17 August 2014, 12:52 PM.
rudregues replied

17 August 2014, 11:51 AM
Originally posted by oleid

Yes, it's called generic optimization. The compiler will generate multiple versions of the very same code and decide at runtime which version to use.

The article wants to present the influence of different architecture optimizations on performance. But this has nothing to do with runtime CPU dispatching.

And that's exactly what I benchmarked maybe a year ago: mtune=generic vs. march=native on my E-350 for C-Ray and GraphicsMagick. With the current compiler of that time (I guess it was gcc 4.7.x) there was no difference. I'm redoing the benchmark to check whether that still holds for gcc 4.9. Of course these results only apply to this very CPU and the compiler of the day, like every scientific result.

Obviously, if you are doing numeric simulations, you will compile using march=native, but for most distribution packages this won't make a difference. When I got my E-350, I compiled a lot of packages with my own CFLAGS to get the most out of this CPU; now, however, I simply use the distribution-provided packages.

My point is only to include mtune=generic in these benchmarks to get a glimpse of whether generic tuning does a good job for this CPU (could be interesting for the compiler people).

oleid, I have an E-350 too. I tried many benchmarks, even comparing Gentoo and Ubuntu, and came to the following conclusion: there's little to no difference. And sometimes the x86_64 Ubuntu binaries were a little faster than the btver1-optimized binaries.
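
For readers who want to repeat this kind of generic-versus-native comparison on their own machine, a rough Python sketch follows. It is only an illustration, not the setup used by anyone in this thread: the flag sets are borrowed from the Arch Linux defaults quoted further down and from march=native, it assumes gcc is on the PATH, and the helper names (compile_command, time_command, compare) are invented for the example.

```python
import statistics
import subprocess
import time

# Two flag sets to compare (assumption: taken from the flags discussed in
# this thread; adjust them to whatever you want to measure).
FLAG_SETS = {
    "generic": ["-O2", "-march=x86-64", "-mtune=generic"],
    "native": ["-O2", "-march=native"],
}


def compile_command(source, output, flags, cc="gcc"):
    """Build the compiler command line for one flag set."""
    return [cc, *flags, source, "-o", output]


def time_command(cmd, runs=5):
    """Run cmd several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(source, runs=5):
    """Compile source once per flag set and time each resulting binary."""
    results = {}
    for name, flags in FLAG_SETS.items():
        binary = f"./bench-{name}"
        subprocess.run(compile_command(source, binary, flags), check=True)
        results[name] = time_command([binary], runs=runs)
    return results
```

Calling compare("benchmark.c") with a single-file C benchmark of your own (the file name is a placeholder) returns the median run time per flag set. The median is used so that a single slow run does not skew the comparison.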
sdack replied

17 August 2014, 10:28 AM
Originally posted by oleid

My bad. The Intel compiler has this CPU dispatcher, not the GNU compiler. That is why AMD often has a disadvantage in certain benchmarks.

Agner's CPU blog - Intel's "cripple AMD" function

http://www.agner.org/optimize/blog/read.php?i=49#49

It seems as if you are a bit cranky... I suggest you should visit those strippers

Michael reads the comments to his articles, that's why this is the proper place for suggestions.

No, I was not being cranky. Maybe I was a bit cocky, but why does this bother you? It is best to concern yourself with the topic and not with people's feelings. Maybe then you will make fewer mistakes. Who knows?!

Of course you can request it, but it is pointless. 4.10 is at an early stage. It still has missing code, bugs and regressions. So whatever you could get from it is meaningless. Even the article itself may give a false picture, because the gains shown could be the result of bugs or incomplete code and may shrink once 4.10 is stable. But to remain optimistic... chances are these gains are real.

Benchmarking with generic would then cost additional time. But let us assume it showed generic as being faster than the other options. All it would tell you is that there is a regression at this time. This would be no news; it is already known that the compiler is still under development. If it turned out to be slower, then it would only confirm what is to be expected, and the news would be in the gains coming from the other options. As it happens, this is exactly what the article focuses on, and it delivers without wasting any time.

So you can keep suggesting ideas. I would rather stick to hope, and hope to get continuous news updates like these, which are brief and informative without being bloated and taking too much time to produce. The less time gets spent on it, the more time becomes available for other news. Don't you love quickies, too?
oleid replied

17 August 2014, 09:37 AM
Originally posted by sdack

It does not matter what it is called, it is not being done here.

Some applications do have code to detect the CPU at run-time and can switch to using different functions or plugins, which then make use of a particular instruction set. However, such features need to be put into the code by the programmer and do not come automatically by using gcc.

My bad. The Intel compiler has this CPU dispatcher, not the GNU compiler. That is why AMD often has a disadvantage in certain benchmarks.

Agner's CPU blog - Intel's "cripple AMD" function

http://www.agner.org/optimize/blog/read.php?i=49#49

Originally posted by sdack

Yeah, and I would like to see some strippers, but this would also not be quite on the topic of the article.

It seems as if you are a bit cranky... I suggest you should visit those strippers

Michael reads the comments to his articles, that's why this is the proper place for suggestions.
sdack replied

17 August 2014, 07:58 AM
Originally posted by carewolf

I believe on x64 that generic == k8

No. generic means the standard instruction set including the 64-bit extensions. It does not include MMX, SSE or AVX. k8 includes MMX, SSE and SSE2. It would be rather pointless to try to optimize audio/video applications with only generic.

Tuning with generic does not match any of the processors (neither AMD nor Intel CPUs, from what I have seen). But it is somewhat closer to k8 than it is to amdfam10 and bdver1.

You can find the full details about it in $TOPDIR/gcc/config/i386/i386.c of gcc.
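
Besides reading i386.c, the compiler itself can be asked which target options a given -march/-mtune combination turns on: gcc -Q --help=target prints them. The Python sketch below wraps that call and diffs two flag combinations. It assumes gcc is on the PATH and that each option line is laid out as "option, whitespace, value" (the exact layout may vary between gcc versions); the helper names are invented for this example.

```python
import subprocess


def parse_target_help(text):
    """Parse `gcc -Q --help=target` output into {option: value}."""
    options = {}
    for line in text.splitlines():
        parts = line.split()
        # Option lines are assumed to look like "-msse2  [enabled]"
        # or "-march=  x86-64"; everything else is ignored.
        if len(parts) >= 2 and parts[0].startswith("-m"):
            options[parts[0]] = " ".join(parts[1:])
    return options


def query_gcc(flags, gcc="gcc"):
    """Ask gcc which target options are active for the given flags."""
    result = subprocess.run(
        [gcc, *flags, "-Q", "--help=target"],
        capture_output=True, text=True, check=True,
    )
    return parse_target_help(result.stdout)


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} where the two differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    try:
        generic = query_gcc(["-march=x86-64", "-mtune=generic"])
        k8 = query_gcc(["-march=k8"])
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not query gcc: {exc}")
    else:
        for option, (left, right) in diff_options(generic, k8).items():
            print(f"{option}: generic={left} k8={right}")
```

The output depends on the installed gcc version, so it is best treated as a way to check claims about generic and k8 on your own system rather than as a fixed answer.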

Last edited by sdack; 17 August 2014, 08:02 AM.
sdack replied

17 August 2014, 07:30 AM
Originally posted by oleid

Yes, it's called generic optimization. The compiler will generate multiple versions of the very same code and decide at runtime which version to use.

It does not matter what it is called, it is not being done here.

Some applications do have code to detect the CPU at run-time and can switch to using different functions or plugins, which then make use of a particular instruction set. However, such features need to be put into the code by the programmer and do not come automatically by using gcc.

I suggest you read the documentation.
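
To illustrate the kind of hand-written run-time detection described above, here is a rough Python sketch of the pattern: detect the CPU features once, then pick a matching function. It assumes an x86 Linux system where /proc/cpuinfo has a "flags" line; all function names are hypothetical, and the "AVX" variant is only a placeholder that computes the same result as the baseline.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def sum_squares_baseline(values):
    """Portable version that works on any CPU."""
    return sum(v * v for v in values)


def sum_squares_avx(values):
    """Placeholder: a real program would call an AVX-specific routine here."""
    return sum(v * v for v in values)


# Ordered from most to least demanding; the empty requirement always matches.
IMPLEMENTATIONS = [
    ({"avx"}, sum_squares_avx),
    (set(), sum_squares_baseline),
]


def select_implementation(flags):
    """Pick the first implementation whose required flags are all present."""
    for required, func in IMPLEMENTATIONS:
        if required <= flags:
            return func
    return sum_squares_baseline


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            flags = parse_cpu_flags(handle.read())
    except OSError:
        flags = set()  # not Linux, or /proc is unavailable
    chosen = select_implementation(flags)
    print(f"using {chosen.__name__}: {chosen([1, 2, 3])}")
```

The point of the pattern is that the selection table and the detection code are written by the programmer; nothing in it is generated by the compiler flags discussed in the article.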

Originally posted by oleid

My point is only to include mtune=generic in these benchmarks to get a glimpse of whether generic tuning does a good job for this CPU (could be interesting for the compiler people).

Yeah, and I would like to see some strippers, but this would also not be quite on the topic of the article.
oleid replied

17 August 2014, 06:58 AM
Originally posted by carewolf

I believe on x64 that generic == k8

ArchLinux uses these:

CPPFLAGS="-D_FORTIFY_SOURCE=2"
CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro"

Other binary distributions should use something similar, maybe with special flags for certain packages such as lame.
carewolf replied

17 August 2014, 06:25 AM
Originally posted by oleid

I once thought that, too. Then I benchmarked my Bulldozer: mtune=generic vs. march=native. And guess what? There was no difference! That's why I'd like to see mtune=generic in these benchmarks, too. After all, this is roughly what you'll get from the distributions.

I believe on x64 that generic == k8