The Most Downloaded Benchmarks With The Phoronix Test Suite / OpenBenchmarking.org

    Phoronix: The Most Downloaded Benchmarks With The Phoronix Test Suite / OpenBenchmarking.org

    With OpenBenchmarking.org crossing 22 million test/suite downloads last week, I decided to dig in and see which test profiles have been the most popular downloads from OpenBenchmarking.org for execution by the Phoronix Test Suite...

    http://www.phoronix.com/scan.php?pag...ded-Benchmarks

  • #2
    CacheBench is an odd benchmark: it is very unclear what it actually measures, and its results look strange. See for example http://openbenchmarking.org/result/1...TA-I76700GTX95

    ----

    After examining CacheBench for a while, my conclusions are:
    • "cachebench -d" (cache read bandwidth) is compiled with "-O" optimizations, and the generated assembly code for its inner loop is:

      Code:
      loop:
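      	movsd  0x8(%rsp),%xmm2  // read memory: reload the running sum from the stack
      	addsd  (%rax),%xmm2  // read memory: add the next array element
      	movsd  %xmm2,0x8(%rsp)  // write memory: store the running sum back to the stack
      	add    $0x8,%rax  // advance the pointer by one double (8 bytes)
      	cmp    %rdx,%rax  // reached the end of the array?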
      	jne    loop
      • The "movsd %xmm2,0x8(%rsp)" instruction is a cache/memory write. As a consequence, "cachebench -d" is actually measuring combined cache read+write bandwidth, not pure read bandwidth.
      • CPUs (both AMD and Intel) are not optimized for reading and writing 0x8(%rsp) every third instruction of the loop: each iteration's load of the running sum has to wait for the previous iteration's store to it, so the iterations execute one after another instead of overlapping. CPU designers assume that compilers will keep such a variable in a register, but because CacheBench is compiled with just "-O", that did not happen.
    • For vectorizable code, "-mavx -mtune=bdver3 -O3" generates code that runs 2x slower than "-mavx -mtune=generic -O3". As a result, I have removed -mtune=bdver3 from /etc/portage/make.conf on my A10-7850K machine.
    • Actual maximum cache read bandwidth with 256-bit AVX:
      • A10-7850K @ 4GHz:
        • CacheBench in its current version achieves 1.7 GiB/s read bandwidth
        • I was able to get 25.6 GiB/s read bandwidth with "-mavx -O3 -mtune=generic" for an optimized version of CacheBench's C code
      • E5-2676 v3 @ 2.40GHz (virtualized CPU via http://aws.amazon.com):
        • CacheBench in its current version achieves 2.2 GiB/s read bandwidth
        • I was able to get 27.8 GiB/s read bandwidth with "-mavx -O3 -mtune=generic" for an optimized version of CacheBench's C code
      • i7-6700 @ 4GHz:
        • PTS result: CacheBench in its current version achieves 2.5 GiB/s read bandwidth
        • Extrapolated result for an optimized version, scaling the E5-2676 v3 figure by clock speed: 27.8*4/2.4 = 46 GiB/s
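    To make the difference between the two loop shapes concrete, here is a minimal C sketch. It is not CacheBench's source: the names sum_via_memory() and sum_in_register() are hypothetical, and volatile is used here only as a way to force the compiler to keep the running sum in memory, which should produce roughly the same load/add/store pattern as the "-O" listing above.

```c
#include <stddef.h>

/* Hypothetical helper, not CacheBench's source.
 * The running sum is volatile, so the compiler must load it from memory,
 * add a[i], and store it back on every iteration (like 0x8(%rsp) in the
 * listing). Each load depends on the previous iteration's store. */
double sum_via_memory(const double *a, size_t n)
{
    volatile double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];            /* load sum, add a[i], store sum */
    return sum;
}

/* Hypothetical helper: an ordinary local accumulator that the compiler
 * is free to keep in an XMM register, so the only memory traffic left
 * in the loop is the read of a[i]. */
double sum_in_register(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

    Both functions return the same value; compiling with "gcc -O2 -S" and comparing the two loops should show the store/reload of the sum only in sum_via_memory().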
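    A read-bandwidth probe in the spirit of the optimized numbers above might look like the following sketch. It is an illustration under stated assumptions, not the poster's optimized CacheBench code: measure_read_gibps() and its helpers are hypothetical names, four independent accumulators replace the single memory-resident sum, and the buffer size (n doubles) decides which cache level is exercised. Whether the compiler emits 256-bit AVX loads for the inner loop depends on the compiler and flags such as "-mavx -O3".

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* One pass over the buffer with four independent accumulators, so each
 * add does not have to wait for the previous one to finish. */
static double read_pass(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical probe: reads a buffer of n doubles reps times and returns
 * the read bandwidth in GiB/s (-1.0 if n is 0 or allocation fails, 0.0 if
 * the timer did not advance). *checksum receives the sum of all passes so
 * the reads cannot be optimized away. */
double measure_read_gibps(size_t n, int reps, double *checksum)
{
    if (n == 0)
        return -1.0;
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    double total = 0.0;
    double t0 = now_seconds();
    for (int r = 0; r < reps; r++) {
        a[0] = (double)r;       /* change the data so passes cannot be merged */
        total += read_pass(a, n);
    }
    double elapsed = now_seconds() - t0;
    free(a);

    *checksum = total;
    if (elapsed <= 0.0)
        return 0.0;
    double bytes = (double)n * sizeof(double) * (double)reps;
    return bytes / elapsed / (1024.0 * 1024.0 * 1024.0);
}
```

    For example, measure_read_gibps(1024, 100000, &checksum) repeatedly reads an 8 KiB buffer, which is small enough for the L1 data cache of each CPU listed above; larger n values would move the measurement to L2, L3, or main memory.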
