Clang Compiling Against GCC On Ubuntu ARM Linux


  • Clang Compiling Against GCC On Ubuntu ARM Linux

    Phoronix: Clang Compiling Against GCC On Ubuntu ARM Linux

    Here's an update on the LLVM/Clang vs. GCC compiler benchmarking on ARM hardware under Linux...

  • #2
    Quoting Evan Cheng, "Request for Help: Teach ARM target to auto-detect cpu / subtarget features", Thu May 10 22:11:23 CDT 2012:

    I believe one of the reason the benchmark numbers are totally bogus is that the compilation are done on ARM hosts.

    Given the benchmarks are apparently compiled without -mcpu=cortex-a9, I suspect LLVM ended up generating code for "generic" ARMv4 cpu.

    This article makes me sick in my stomach.

    Quoting Michael Larabel on June 11, 2012: "The benchmarking was still being done from a PandaBoard ES with a Texas Instruments OMAP4460 dual-core ARM Cortex-A9 development board. Via the CFLAGS/CXXFLAGS, -march=armv7-a was passed to each compiler."

    On the other hand, once you sort out your flags war and reach consensus, it might be interesting to see this test run on a Calxeda sample box: quad-core ARM Cortex-A9 SoCs optimized for server use over a 10 Gbit/s internal fabric on each card, with 2 or more cards installed for 32 Cortex-A9 cores (8 SoCs) and up. And you really should go and get the latest Linaro GCC too.


    Hmm, I can't seem to get the in-post video link working, odd.
    Last edited by popper; 11 June 2012, 11:12 AM.


    • #3
      Originally posted by popper View Post
      On the other hand once you sort out your flags war
      The flags used in this article were just the normal ones; a compiler flag/tuning comparison on ARM is forthcoming in a future multi-page article.
      Michael Larabel


      • #4
        armv7 is what e.g. Ubuntu will target in their upcoming ARM releases, so it seems very relevant how that performs. Compiling all software with hardware specific CFLAGS is typically only done by Gentoo or other source based distros.


        • #5
          Originally posted by chithanh View Post
          armv7 is what e.g. Ubuntu will target in their upcoming ARM releases, so it seems very relevant how that performs. Compiling all software with hardware specific CFLAGS is typically only done by Gentoo or other source based distros.
          Here are a few relevant flag sets I'd like to see tested:

          1. Ubuntu standard: armv7 + hard-float
          2. Android standard: armv7 + softfp (note: this is not soft-float; it still uses hardware FP instructions, just with a calling convention compatible with soft-float).
          3. Android standard: armv5 + soft-float

          These three configurations cover most software built for Linux and Android.
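          As a rough sketch, those three configurations map onto GCC flags something like this (the exact -mfpu values are assumptions; distros and NDKs pin their own defaults):

          ```shell
          # 1. Ubuntu-style armv7 + hard-float (the armhf ABI)
          CFLAGS_UBUNTU_ARMHF="-march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard"

          # 2. Android-style armv7 + softfp: hardware FP instructions, but a
          #    calling convention compatible with soft-float
          CFLAGS_ANDROID_V7="-march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp"

          # 3. Android-style armv5 + soft-float: FP emulated entirely in software
          CFLAGS_ANDROID_V5="-march=armv5te -mfloat-abi=soft"
          ```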


          • #6
            Possibly Ubuntu and other distros will use -march=armv7-a -mtune=cortex-a9 (same idea as -march=i486 -mtune=i686 for x86) so that would be another interesting data point.
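            A sketch of what such a distro-style invocation would look like (hypothetical file name; requires an ARM-targeting gcc, so this is illustrative only):

            ```shell
            # Baseline ISA stays armv7-a so the binary runs on any ARMv7 core,
            # while instruction scheduling is tuned for the Cortex-A9 --
            # the same idea as -march=i486 -mtune=i686 on x86.
            gcc -O2 -march=armv7-a -mtune=cortex-a9 -c foo.c -o foo.o
            ```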


            • #7
              Again, what is the point of running the 7-zip benchmark with no -On optimization setting? This means that at least GCC will default to -O0, which is no optimization at all. Just add -O2 or preferably -O3 so that this benchmark ends up being in any way relevant; NO ONE will use a 7-zip compiled without optimizations. You are benchmarking compiler optimization here, so what possible point is there in NOT enabling optimizations?
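              The -O0 default is easy to demonstrate: GCC (and Clang) predefine the __OPTIMIZE__ macro only when some -O level above -O0 is active, so a tiny program can report how it was built (a minimal sketch):

              ```c
              #include <stdio.h>

              /* GCC and Clang predefine __OPTIMIZE__ for -O1 and above,
               * but not at the default -O0. */
              static int compiled_with_optimization(void)
              {
              #ifdef __OPTIMIZE__
                  return 1;
              #else
                  return 0;
              #endif
              }

              int main(void)
              {
                  if (compiled_with_optimization())
                      puts("built with -O1 or higher");
                  else
                      puts("built at -O0 (the default when no -O flag is given)");
                  return 0;
              }
              ```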


              • #8
                Yeah, while the phoronix test suite framework itself is fine, the choice of benchmarks is very questionable at best.

                Let's have a look at the "popular" C-Ray 1.1 benchmark. It can be downloaded from
                It is typically run as "./c-ray-mt -t 32 -s 1600x1200 -r 8 -i sphfract -o output.ppm", but changing 1600x1200 to 160x120 lets it run for seconds instead of hundreds of seconds on ARM. Profiling of gcc-4.7.0 compiled code shows the following:
                ./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm
                samples  %        image name               symbol name
                28459    51.8672  c-ray-mt                 shade
                17869    32.5667  c-ray-mt                 ray_sphere
                4110      7.4906  c-ray-mt                 trace
                3185      5.8047  c-ray-mt                 render_scanline
                319       0.5814             __ieee754_pow
                194       0.3536             powl
                136       0.2479             __exp1
                108       0.1968             memcpy
                78        0.1422  c-ray-mt                 get_primary_ray
                73        0.1330  c-ray-mt                 get_sample_pos
                59        0.1075             isnanl
                42        0.0765  vmlinux                  __do_softirq
                36        0.0656  vmlinux                  __schedule
                35        0.0638             checkint
                31        0.0565             fputc
                18        0.0328             __mul
                4         0.0073  c-ray-mt                 main
                And this reveals a major performance problem: the function call overhead is insane. Just making sure that the ray_sphere function gets inlined improves performance significantly. As a workaround, the -finline-limit=100000 option can be added for more aggressive inlining. The results of "./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm" on ARM Cortex-A9 1.2GHz compiled with gcc 4.7.0:
                Rendering took: 6 seconds (6685 milliseconds) for CFLAGS="-O3 -ffast-math"
                Rendering took: 5 seconds (5436 milliseconds) for CFLAGS="-O3 -ffast-math -finline-limit=100000"

                But the real fix is to use "static inline" for the performance-critical functions. Whoever developed this C-Ray application apparently has no clue about performance optimization. Or maybe it was done on purpose to make the job harder for the compilers. Compilers which are configured to use aggressive inlining by default are going to win by a huge margin on this test (trading it for larger binary sizes, because there are no free cookies).
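                As a sketch of that fix, here is a simplified, hypothetical version of such a ray/sphere test with the small hot helpers marked static inline, which lets the compiler inline them at every call site without any -finline-limit tweaking (illustrative only, not C-Ray's actual code):

                ```c
                #include <stdio.h>

                struct vec3 { double x, y, z; };

                /* "static inline" on small hot helpers gives the compiler
                 * internal linkage plus an inlining hint, so calls like this
                 * dot product disappear into the caller. */
                static inline double dot(struct vec3 a, struct vec3 b)
                {
                    return a.x * b.x + a.y * b.y + a.z * b.z;
                }

                /* Returns 1 if a ray from `orig` along unit direction `dir`
                 * hits a sphere at `center` with radius `r`, using the
                 * standard quadratic discriminant test. */
                static inline int ray_sphere(struct vec3 orig, struct vec3 dir,
                                             struct vec3 center, double r)
                {
                    struct vec3 oc = { orig.x - center.x,
                                       orig.y - center.y,
                                       orig.z - center.z };
                    double b = 2.0 * dot(oc, dir);
                    double c = dot(oc, oc) - r * r;
                    return b * b - 4.0 * c >= 0.0;
                }

                int main(void)
                {
                    struct vec3 orig = { 0, 0, 0 };
                    struct vec3 dir  = { 0, 0, 1 };         /* unit +z */
                    struct vec3 hit_center  = { 0, 0, 5 };  /* directly ahead */
                    struct vec3 miss_center = { 10, 0, 5 }; /* off to the side */

                    printf("hit=%d miss=%d\n",
                           ray_sphere(orig, dir, hit_center, 1.0),
                           ray_sphere(orig, dir, miss_center, 1.0));
                    return 0;
                }
                ```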

                Generally, I get the impression that this selection of Phoronix benchmarks has been made on purpose. Surely, when compiler optimizations are disabled or poorly written code such as C-Ray is benchmarked, the difference between the results from different compilers may be quite significant (and mostly random). Benchmarking properly written code with properly selected optimization options is surely boring, because it is less likely to show surprising wins or sensations.
                Last edited by ssvb; 12 June 2012, 04:42 AM.


                • #9
                  Depending on how GCC was configured (you can see by passing -v), this might be a non-issue, but passing only -march=armv7-a without an -mtune= or -mcpu= option might have resulted in GCC tuning for the Cortex-A8.
                  You might want to re-check to be sure...
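                  Two quick ways to check (a command sketch; the configure options shown depend on how your gcc was built):

                  ```shell
                  # Print how this gcc was configured; look for --with-tune=
                  # or --with-cpu= in the "Configured with:" line.
                  gcc -v

                  # Print the effective values of the target options for the
                  # current command line; on an ARM gcc you can add
                  # -march=armv7-a to see which -mtune= it implies.
                  gcc -Q --help=target | grep -E '^[[:space:]]*-m(arch|tune)='
                  ```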


                  • #10
                    Tuning for Cortex-A8 works well for Cortex-A9 too. They are reasonably similar, and scheduling instructions for the in-order dual-issue processor does not usually do any harm on its out-of-order dual-issue twin. Moreover, there are cases when -mcpu=cortex-a9 is bad for performance: (just filed this enhancement request)