We have done some ad-hoc benchmarking on the Pandaboard here at the U of Arizona. We have noticed considerable improvement on some of our benchmarks running scientific apps if we mount the file system from a SATA drive through a SATA-to-USB adapter. Have you tried this in any of your benchmarks at Phoronix?
We have done a lot of benchmarks that are more HPC-centric than the Phoronix suite; HPC Challenge and HPEC Challenge mostly... see http://meegs.mit.edu/HPEC11.pdf for some results...
This is pretty intriguing. We have ported and tested most of the scientific apps associated with the SC Student Cluster Competition, and we are using primarily an NFS mounted filesystem with netbooting across OTG (that paper should be at the CHiMIT site somewhere). I will see if I can recreate your setup and see if it gives us a performance bump.
Is the compiler only using the ARM cores?
Some of the numbers don't make any sense. For example, the OMAP4460 has hardware H.264 capable of 1080/30P. The GPU doesn't seem to be used in the graphics routines. NEON or VFP doesn't appear to be used in some of the routines involving floating point.
Haven't looked at the sources yet...
but the performance would lead me to believe NEON is in use... these are extraordinarily good numbers...
I just reran the suite ... the biggest change is that I used the latest benchmarks... PTS 3.8 .... most of the individual tests had the same version numbers as the previous PTS test so I will look at that as well... I am still segfaulting on 4 of the tests (hence the lack of data on the graphic-centric tests)...
I will try this one more time with gcc 4.7 and then I think we will have squeezed as much performance out of the ES as we can... I preset my clock with "cpufreq-set -g performance" ... not sure if it stayed there; I had a lot of omap overheating warnings during the ogg encode before it crashed....
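For what it's worth, one way to check whether the governor stayed on "performance" is to read it back from sysfs right after a run. A minimal sketch, assuming the standard Linux cpufreq sysfs layout under /sys/devices/system/cpu; the function name and the `base` parameter are my own, just for illustration:

```python
from pathlib import Path

def read_governors(base="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every core exposing cpufreq in sysfs."""
    governors = {}
    # cpu[0-9]* matches cpu0, cpu1, ... but skips directories like cpufreq/cpuidle
    for gov_file in sorted(Path(base).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors

if __name__ == "__main__":
    for cpu, gov in read_governors().items():
        flag = "" if gov == "performance" else "  <-- not 'performance'"
        print(f"{cpu}: {gov}{flag}")
```

Running it before and after the ogg encode would show whether something (thermal throttling, a daemon) put the governor back. It only reports the governor, not the actual clock the core ended up at.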
That's the odd thing
Some of those numbers need NEON running to be possible, but on other tests it doesn't seem to be in use. It just seemed odd that some of the results were better than others that should have been close. I'll have to go back over the article to pick up on which items. I'm busy tonight, but I think I can get at it tomorrow.
It looks like some externals have to be loaded into Linux to get the GPU to work. I found some PowerVR SDK stuff for the OMAP4 on
(you have to sign up for it)
I haven't tried either, yet. I don't have my Pandaboard ES(s) yet. I have some VAR-SOM-OM44 but have only played with them and not compiled anything for them. They have the 1.5 GHz OMAP4460. It'll be interesting to see if the benchmarks increase by 25% (1.5/1.2) since the GPU does not increase (that I know of).
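To spell out the arithmetic: 1.5/1.2 is the ceiling for code that is purely CPU-clock-bound, and anything that waits on the GPU, memory or I/O will scale less. A rough Amdahl-style sketch; the `cpu_bound_fraction` parameter is an illustrative assumption of mine, not a measured number:

```python
def expected_speedup(old_ghz, new_ghz, cpu_bound_fraction=1.0):
    """Estimate speedup from a CPU clock bump.

    cpu_bound_fraction is the share of run time that scales with the CPU
    clock; the rest (GPU, memory, I/O) is assumed not to speed up at all.
    """
    clock_ratio = new_ghz / old_ghz
    new_time = cpu_bound_fraction / clock_ratio + (1.0 - cpu_bound_fraction)
    return 1.0 / new_time

print(f"{expected_speedup(1.2, 1.5):.2f}x")       # fully CPU-bound: 1.25x
print(f"{expected_speedup(1.2, 1.5, 0.5):.2f}x")  # half the time elsewhere: 1.11x
```

So a test that comes in well under 25% on the 1.5 GHz part is probably limited by something other than the ARM cores.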
On that NEON lib, it looks like some of its code sacrifices accuracy (like in the floorf example mentioned there). That may make it produce wrong results in apps that care about accuracy, i.e. all those not already compiled with -ffast-math.
I haven't had a chance to go through the code. What caught my eye was the author's statement that the GCC compiler didn't fully implement NEON. I think that the ratios of the SciMark Composite times to the Jacobi times between the compilers should be closer to the same. Looking at them again, they aren't that far off: old/new ~65% vs ~60%.
I thought that you had used the older compiler on the Dec OMAP4 vs Intel data, but the SciMark Composite is the same as the 12.04 1/26 number
Yup... time to take a deep look at how these codes are implemented...
Every time I recompile I get a 10% performance bump on most benches...
My EA3, with its 4430, beats or ties the best ES score (allegedly unoptimized) on 7 of 26 benchies and it is really close on another 5 or so..
Items of concern: the standard deviations are quite large; I will see if I have anything running on the Panda that may explain the cycle stealing between runs... some codes don't use both cores, and while OpenMPI was a dependency during the install it wasn't called on many codes. Also, I installed omap4-extras, which is a binary blob provided by ImgTec to unlock some functionality of the SGX540, but I don't think I am getting any math performance from it. ARM Cortex-A9 is fully IEEE-754 compliant but NEON, PVR, and Ducati aren't, so even if we can unlock some performance there, it may cause accuracy problems for us.
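One way to put a number on "quite large" is the coefficient of variation (standard deviation over mean) per test. A small sketch; the sample timings and the 5% threshold are made up for illustration and are not results from the Panda:

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean, as a fraction."""
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    return stdev(samples) / mean(samples)

def noisy_tests(results, threshold=0.05):
    """Return names of tests whose run-to-run variation exceeds threshold."""
    return [name for name, runs in results.items()
            if coefficient_of_variation(runs) > threshold]

# Hypothetical run times in seconds, three runs per test.
example = {
    "steady": [10.0, 10.1, 9.9],
    "jittery": [10.0, 13.0, 8.0],
}
print(noisy_tests(example))  # ['jittery']
```

Tests that get flagged are the ones worth rerunning with background services stopped before reading anything into the averages.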
Couldn't install SciMark for some reason, but we wouldn't have won that competition running at 1 GHz, I shouldn't think... will try this once more before Precise ships in a week or so; (there are indications these will run even faster as a diskless node; lack of swap may be in play there). Sigh, looks like I have my summer project defined for me...
OMAP4 vs. Tegra 3
Dual core vs. Quad core... it is interesting to see what the Panda excels at...
Note that I segfaulted on 15 of the codes, mostly the GPU-centric ones... maybe we need a few more ifdefs in there. We will probably hold off further testing until we get our arms around this (and anyway, Ubuntu Precise "ships" on April 24th, so we may go back and try to do a headless install with that revision).
Hm, wasn't vpxenc multithreaded? Or is the test using it with a single thread only?