**codestr0m** · 09 February 2013, 01:18 PM

Thanks for posting this!
----------------------------
It seems clear we have some action items to work on, but I'm honestly not sure we have the time to fix everything before EKOPath 5 Final. I'd bet that most of this could be cleared up by 5.5 (or certainly by 6.0).
----------------------------
I'm curious whether anyone else reading the forums can post benchmarks using their own codes, along with processor/system details.

Quick links to some downloads
----------------------------
Linux

http://c591116.r16.cf2.rackcdn.com/enzo/nightly/Linux/enzo-2013-02-08-installer.run

Solaris

http://c591116.r16.cf2.rackcdn.com/enzo/nightly/Solaris/enzo-2013-02-08-installer.run

FreeBSD

http://c591116.r16.cf2.rackcdn.com/enzo/nightly/FreeBSD/enzo-2013-02-08-installer.run

# Quick install line
chmod +x ekopath-2013-02-08-installer.run ; ./ekopath-2013-02-08-installer.run --mode unattended --prefix /opt/ekopath-02-08
----------------------------

EKOPath 5 is really a *BIG* difference behind the scenes. For example, if you run pathcc -show hello.c you'll notice that we're using a modified clang as part of the process. This is almost certainly what allowed us to build those additional benchmarks. (To clarify a bit: we're not using the LLVM backend or any LLVM IR. In the past we were using a modified GNU cc1, but that and all other GNU code has been removed. EKOPath 5.5 will have a new backend we've been working on, but pushing out both of those big changes at the same time just wasn't possible.)

----------------------------
Small selfish side note about ENZO (sister compiler to EKOPath that has added GPGPU support and additional features for multicore programming)
----------------------------
While it's not possible to speed up every code on the GPU, we have put a huge amount of work into the programming models available for ENZO and its backend performance. Personally, I don't get as excited (or worried) about 5-10% CPU performance when we can offer gains from 30% up to 10x with the GPU. I can't make promises, but we may try to drop a few OpenACC pragmas around those benchmarks and post numbers on a Tesla 2050.

**Cyborg16** · 09 February 2013, 02:11 PM

Originally posted by codestr0m

----------------------------
Small selfish side note about ENZO (sister compiler to EKOPath that has added GPGPU support and additional features for multicore programming)
----------------------------
While it's not possible to speed up every code on the GPU, we have put a huge amount of work into the programming models available for ENZO and its backend performance. Personally, I don't get as excited (or worried) about 5-10% CPU performance when we can offer gains from 30% up to 10x with the GPU. I can't make promises, but we may try to drop a few OpenACC pragmas around those benchmarks and post numbers on a Tesla 2050.

So the real reason to be excited about EKOPath is automatic GPGPU usage? I am involved in some "scientific computing" but so far haven't had a reason to use anything other than GCC and clang.

**codestr0m** · 09 February 2013, 02:38 PM

Originally posted by Cyborg16

So the real reason to be excited about EKOPath is automatic GPGPU usage? I am involved in some "scientific computing" but so far haven't had a reason to use anything other than GCC and clang.

s/EKOPath/ENZO/g
---------
I'm biased, but I'd certainly recommend you test EKOPath and the Intel compilers if you don't have a GPU. If you can get access to a system with a GPU (Tesla 2050, 2070 or 2090) *and* you're willing to add some pragmas or directives to your code, ENZO may be interesting. (The performance gains can be well worth the effort.) We're working on support for -autogpu which, like autovectorization or other automatic optimizations, requires zero code changes. This isn't ready for production and is just "noteworthy" at this point. (Honestly, give us a couple more months.)

**ChrisXY** · 09 February 2013, 03:04 PM

Well, a noticeable improvement is that it no longer returns an error with -march=native.

But "-march=native" is not recognized and, like all unrecognized -march values, it activates the generic profile:

Code:

/usr/lib/5.0.0/x8664/ipl -VHO:rotate -LIST:source=off:notes=off -PHASE:p:i -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -show -LANG:=ansi_c -TARG:abi=n64 -TARG:processor=generic -TARG:sse=on -TARG:sse2=on -TARG:sse3=off -TARG:ssse3=off -TARG:sse4a=off -TARG:sse4_1=off -TARG:sse4_2=off -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=off -TARG:pclmul=off -TARG:3dnow=off -fB,/tmp/pathcc-B-1934caf9.B -fp,hello.o hello.c -cmds pathcc -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -TARG:abi=n64 -TARG:processor=generic -TARG:sse=on -TARG:sse2=on -TARG:sse3=off -TARG:ssse3=off -TARG:sse4a=off -TARG:sse4_1=off -TARG:sse4_2=off -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=off -TARG:pclmul=off -TARG:3dnow=off

The correct way to autochoose the cpu is -march=auto:
"-march=auto -Ofast"

Code:

/usr/lib/5.0.0/x8664/ipl -VHO:rotate -LIST:source=off:notes=off -PHASE:p:i -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -show -LANG:=ansi_c -TARG:abi=n64 -TARG:processor=pentium4 -TARG:sse=on -TARG:sse2=on -TARG:sse3=on -TARG:ssse3=on -TARG:sse4a=off -TARG:sse4_1=on -TARG:sse4_2=on -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=on -TARG:pclmul=off -TARG:3dnow=off -fB,/tmp/pathcc-B-19683af2.B -fp,hello.o hello.c -cmds pathcc -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -TARG:abi=n64 -TARG:processor=pentium4 -TARG:sse=on -TARG:sse2=on -TARG:sse3=on -TARG:ssse3=on -TARG:sse4a=off -TARG:sse4_1=on -TARG:sse4_2=on -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=on -TARG:pclmul=off -TARG:3dnow=off

Slightly better: it enables SSE3 through SSE4.2, but for a Pentium 4?! This is an Ivy Bridge mobile CPU, an i7-3632QM! If you could just copy & paste the CPU recognition from another compiler, that would be great.
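For anyone comparing long -show lines like the two above, a small script makes the differences easier to see. This is only a sketch: the function names and the NATIVE/AUTO variables are made up for illustration, and the strings are the -TARG portions of the two pasted lines trimmed to one copy each.

```python
import re


def targ_settings(show_output):
    """Collect every -TARG:name=value pair from a pasted `pathcc -show` line."""
    return dict(re.findall(r"-TARG:(\w+)=(\w+)", show_output))


def diff_profiles(before, after):
    """Map each setting whose value differs to a (before, after) tuple."""
    names = sorted(set(before) | set(after))
    return {n: (before.get(n), after.get(n))
            for n in names if before.get(n) != after.get(n)}


# -TARG options seen with -march=native (falls back to the generic profile).
NATIVE = (
    "-TARG:abi=n64 -TARG:processor=generic -TARG:sse=on -TARG:sse2=on "
    "-TARG:sse3=off -TARG:ssse3=off -TARG:sse4a=off -TARG:sse4_1=off "
    "-TARG:sse4_2=off -TARG:avx=off -TARG:fma=off -TARG:xop=off "
    "-TARG:aes=off -TARG:pclmul=off -TARG:3dnow=off"
)

# -TARG options seen with -march=auto on the same machine.
AUTO = (
    "-TARG:abi=n64 -TARG:processor=pentium4 -TARG:sse=on -TARG:sse2=on "
    "-TARG:sse3=on -TARG:ssse3=on -TARG:sse4a=off -TARG:sse4_1=on "
    "-TARG:sse4_2=on -TARG:avx=off -TARG:fma=off -TARG:xop=off "
    "-TARG:aes=on -TARG:pclmul=off -TARG:3dnow=off"
)

changes = diff_profiles(targ_settings(NATIVE), targ_settings(AUTO))
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

Only the processor name and the sse3, ssse3, sse4_1, sse4_2 and aes switches differ between the two runs; avx and pclmul stay off in both.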

The installer puts the manpages in /usr/docs/man/man1/, which is not in the man search path on Arch Linux. I don't know about other systems, but it seems nonstandard to me. Use "man -l /file" to open a file directly with man.

Code:

       -march=<cpu-type>
              (For x86) Compiler will optimize code for the selected cpu type: opteron, opteron-sse3, xeon, em64t, nocona, prescott, core, core2, wolfdale, harpertown, nehalem, barcelona, shanghai, istanbul, sandy, bdver1, auto. auto means to optimize for the host platform that the compiler is running on. Core refers to the
              Intel Core Microarchitecture, used by 64-bit CPUs such as Woodcrest. The default is auto.

It seems none of the CPU profiles, not even bdver1, enable AVX by default. In fact it says:

Code:

pathcc -o hello_pathcc hello.c -march=bdver1 -O3 -mavx -show
pathcc ERROR: Target processor does not support AVX.

I'm not well versed in exactly what is supported on which CPUs, but I thought Bulldozer supported AVX right from the beginning?

So the closest for me would probably be using -march=sandy -Ofast and perhaps -mavx and -mpclmul.
Unfortunately Sandy Bridge did not support FMA or XOP, so I can't activate those directly. Are there any real CPU-specific optimizations, or is -march just for choosing which instructions to use (i.e. would generic with all the supported features enabled one by one be equally good)?

Intel's CPUs don't support 3dnow, but I saw that the parameter to activate 3dnow is not documented in the manpage (it's pretty clear that it's -m3dnow, though). It says it's not supported for bdver1, by the way; I'm not sure if this is right.

The benchmark is not that good, I think, because it very probably uses the generic CPU build profile (Michael is using an Ivy Bridge CPU too). It may be fair in that GCC is set to the generic build profile too, but that's not really where EKOPath is supposed to shine, right?

**mattst88** · 09 February 2013, 03:13 PM

There isn't much to see out of Parallel BZIP2 Compression.

Are you building only pbzip2 with the various compilers? pbzip2 is sort of just a front end for libbzip2, which is where the work actually happens.

**codestr0m** · 09 February 2013, 03:19 PM

Originally posted by ChrisXY

Well, a noticeable improvement is that it no longer returns an error with -march=native.

But "-march=native" is not recognized and, like all unrecognized -march values, it activates the generic profile:

Code:

/usr/lib/5.0.0/x8664/ipl -VHO:rotate -LIST:source=off:notes=off -PHASE:p:i -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -show -LANG:=ansi_c -TARG:abi=n64 -TARG:processor=generic -TARG:sse=on -TARG:sse2=on -TARG:sse3=off -TARG:ssse3=off -TARG:sse4a=off -TARG:sse4_1=off -TARG:sse4_2=off -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=off -TARG:pclmul=off -TARG:3dnow=off -fB,/tmp/pathcc-B-1934caf9.B -fp,hello.o hello.c -cmds pathcc -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -TARG:abi=n64 -TARG:processor=generic -TARG:sse=on -TARG:sse2=on -TARG:sse3=off -TARG:ssse3=off -TARG:sse4a=off -TARG:sse4_1=off -TARG:sse4_2=off -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=off -TARG:pclmul=off -TARG:3dnow=off

The correct way to autochoose the cpu is -march=auto:
"-march=auto -Ofast"

Code:

/usr/lib/5.0.0/x8664/ipl -VHO:rotate -LIST:source=off:notes=off -PHASE:p:i -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -show -LANG:=ansi_c -TARG:abi=n64 -TARG:processor=pentium4 -TARG:sse=on -TARG:sse2=on -TARG:sse3=on -TARG:ssse3=on -TARG:sse4a=off -TARG:sse4_1=on -TARG:sse4_2=on -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=on -TARG:pclmul=off -TARG:3dnow=off -fB,/tmp/pathcc-B-19683af2.B -fp,hello.o hello.c -cmds pathcc -O3 -LANG:math_errno=off -OPT:ffast_math=ON -OPT:Ofast= -TARG:abi=n64 -TARG:processor=pentium4 -TARG:sse=on -TARG:sse2=on -TARG:sse3=on -TARG:ssse3=on -TARG:sse4a=off -TARG:sse4_1=on -TARG:sse4_2=on -TARG:avx=off -TARG:fma=off -TARG:xop=off -TARG:aes=on -TARG:pclmul=off -TARG:3dnow=off

Slightly better: it enables SSE3 through SSE4.2, but for a Pentium 4?! This is an Ivy Bridge mobile CPU, an i7-3632QM! If you could just copy & paste the CPU recognition from another compiler, that would be great.

The installer puts the manpages in /usr/docs/man/man1/, which is not in the man search path on Arch Linux. I don't know about other systems, but it seems nonstandard to me. Use "man -l /file" to open a file directly with man.

Code:

       -march=<cpu-type>
              (For x86) Compiler will optimize code for the selected cpu type: opteron, opteron-sse3, xeon, em64t, nocona, prescott, core, core2, wolfdale, harpertown, nehalem, barcelona, shanghai, istanbul, sandy, bdver1, auto. auto means to optimize for the host platform that the compiler is running on. Core refers to the
              Intel Core Microarchitecture, used by 64-bit CPUs such as Woodcrest. The default is auto.

It seems none of the CPU profiles, not even bdver1, enable AVX by default. In fact it says:

Code:

pathcc -o hello_pathcc hello.c -march=bdver1 -O3 -mavx -show
pathcc ERROR: Target processor does not support AVX.

I'm not well versed in exactly what is supported on which CPUs, but I thought Bulldozer supported AVX right from the beginning?

So the closest for me would probably be using -march=sandy -Ofast and perhaps -mavx, -mxop, -maes, -mpclmul.
Unfortunately Sandy Bridge did not support FMA or XOP, so I can't activate those directly.

Intel's CPUs don't support 3dnow, but I saw that the parameter to activate 3dnow is not documented in the manpage (it's pretty clear that it's -m3dnow, though). It says it's not supported for bdver1, by the way; I'm not sure if this is right.

The correct way to get -march=auto or -march=native behavior is to not set -march at all. EKOPath/ENZO, unlike other compilers, automatically pick the best CPU profile for the current host system. You should only set -march when you need to target a different system. (I think we may have a bug here and I'll double check.)

We switched over to CPUID instead of parsing /proc/cpuinfo. Can you post the output of your /proc/cpuinfo and the processor info? /* This is one of those areas where I'd like the most feedback. */
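For readers unfamiliar with what CPUID-based detection has to decode, here is a rough sketch of splitting the leaf 1 EAX signature into family, model and stepping, following the rule Intel documents for combining the base and extended fields. It is not EKOPath driver code. The function name is made up, and the sample value 0x000306A9 is written by hand to represent an Ivy Bridge part like the i7-3632QM mentioned above rather than read from hardware.

```python
def decode_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family field is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # The extended model supplies the high nibble for base families 6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# Hand-written sample signature (assumption, see the note above).
SIGNATURE = 0x000306A9
family, model, stepping = decode_signature(SIGNATURE)
print(f"family {family}, model {model}, stepping {stepping}")
```

Note that the model only comes out as 58 when the extended-model bits are included; the low nibble alone would read as model 10. Whether anything like that is involved in the misdetection discussed here is not known from this thread.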

About AVX: with the exception of a corner case on AMD, please tell me where it would be a performance win compared to SSE4.1/4.2. With that in mind we've disabled it by default; in fact, AVX can cause performance *degradation* if not used properly. For more information, see Agner Fog's work on CPU instruction timing data.

bdver1 doesn't support 3DNOW - that was dropped.

The closest CPU recognition we may be able to get "inspiration" from would be libav and its CPU check code.

Lastly, sorry about the manpage location. Most of our users use --prefix when installing and never use a "default".

**ChrisXY** · 09 February 2013, 03:28 PM

Originally posted by codestr0m

The correct way to get -march=auto or -march=native behavior is to not set -march at all. EKOPath/ENZO, unlike other compilers, automatically pick the best CPU profile for the current host system. You should only set -march when you need to target a different system. (I think we may have a bug here and I'll double check.)

We switched over to CPUID instead of parsing /proc/cpuinfo. Can you post the output of your /proc/cpuinfo and the processor info? /* This is one of those areas where I'd like the most feedback. */

Well, if it worked I wouldn't want to set it manually.

I have 8 of these:

Code:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping        : 9
microcode       : 0x13
cpu MHz         : 1200.000
cache size      : 6144 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 4391.75
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual

Edit: by processor info, do you mean this? http://pastebin.com/4TbSia7Q
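The flags line above is enough to check which of the -m options named in this thread the hardware itself could run. The sketch below is illustrative only: the mapping from option names to Linux flag names is an assumption, not taken from the pathcc documentation, and SAMPLE is an abbreviated copy of the output above. Whether pathcc accepts an option for a given -march profile is a separate question, as the bdver1 error shows.

```python
# Options mentioned in this thread mapped to /proc/cpuinfo flag names.
# This mapping is an assumption made for illustration.
OPTION_TO_FLAG = {
    "-mavx": "avx",
    "-mxop": "xop",
    "-maes": "aes",
    "-mpclmul": "pclmulqdq",
    "-m3dnow": "3dnow",
}

# Abbreviated copy of the /proc/cpuinfo entry posted above.
SAMPLE = """\
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping        : 9
flags           : fpu sse sse2 pni pclmulqdq ssse3 sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand
"""


def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first processor entry as a set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def usable_options(flags):
    """List the options from OPTION_TO_FLAG whose CPU flag is present."""
    return [opt for opt, flag in OPTION_TO_FLAG.items() if flag in flags]


print(usable_options(cpu_flags(SAMPLE)))
```

On a Linux machine the same functions can be fed the real file contents, for example cpu_flags(open("/proc/cpuinfo").read()).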

Originally posted by codestr0m

About AVX: with the exception of a corner case on AMD, please tell me where it would be a performance win compared to SSE4.1/4.2. With that in mind we've disabled it by default; in fact, AVX can cause performance *degradation* if not used properly. For more information, see Agner Fog's work on CPU instruction timing data.

Thanks for the explanation. I wasn't aware of that.

Originally posted by codestr0m

bdver1 doesn't support 3DNOW - that was dropped.

Then all is good.

Originally posted by codestr0m

Lastly, sorry about the manpage location. Most of our users use --prefix when installing and never use a "default".

Actually I used --prefix=/usr.
Manpages usually live somewhere under /usr/share/man/.

**codestr0m** · 09 February 2013, 03:38 PM

Originally posted by ChrisXY

Well, if it worked I wouldn't want to set it manually.

I'll use the CPU info you provided and see if we can get both sets of bugs fixed in the driver. Give us a couple of days, and hopefully I'll remember to reply to this thread once it's fixed. Alternatively, pull another nightly in a few days or a week and yell if it's not. (Squeaky wheel.)

**XorEaxEax** · 10 February 2013, 03:33 AM

Happy to see some info on EKOPath; it's been quiet since the open source announcement. As for the results, I seem to recall that EKOPath was optimized for AMD CPUs, or am I mistaken?

As for the tests, again, why are there so many with no optimization level declared, SCIMARK for example? It's impossible to draw any worthwhile conclusions from those tests; for all we know they could have been run at -O0.

Looking at the tests where we do have an optimization setting (hence the tests that are of any interest), EKOPath seems to do quite well, with the exception of the BLAKEv2 test, where it does horribly, and to a lesser extent the Himeno benchmark.

Again, can Michael please fix the benchmarks so that they declare optimization level for all tests, else they are of little interest as we don't know what level is being compared. Using -O3 across the board would be the obvious choice if only one optimization level is used per benchmark (as is the case here).

PathScale EKOPath 5.0 Beta Compiler Performance
