
Intel MKL-DNN Deep Neural Network Library Benchmarks On Xeon & EPYC


  • milkylainen
    replied
    Originally posted by coder View Post
    Assuming it wasn't intentionally rigged to make AMD look bad, my guess is it's just using instructions (probably AVX-512, at that) for which AMD has no equivalent. Then, AMD has to fall back on some scalar code path included for the sake of compatibility.

    If you look at specifically which tests are extremely Intel-biased, they're:
    • deconvolution
    • u8s8f32 (meaning: f32 += unsigned 8-bit * signed 8-bit ?)
    Lacking a key instruction used in the optimized deconvolution code path could break AMD in those benchmarks, and getting good performance on the u8s8f32 tests surely depends on having the right instructions for it.

    But, the bottom line is that this benchmark really doesn't tell us how the CPUs compare, unless you happen to be running a workload that's dependent on this specific library. So, I would suggest such strongly-biased tests not be included in PTS.
    I don't think it was rigged. I am just questioning the inclusion as a CPU "performance" metric.
    AVX512 or not, I have a hard time accepting the 40x performance difference, never mind the tests that AMD actually did win.
    I came to much the same conclusion as you.

    Leave a comment:


  • coder
    replied
    Originally posted by Royi View Post
    Also, in DNN accuracy isn't important, so why not use the Ofast flag, which is basically the O3 flag with non-precise FP?
    I don't know that you can say accuracy is absolutely unimportant. People are willing to make measured tradeoffs, though. Without more domain expertise, I think he should compile it according to the optimized flags in the project's own buildsystem. That way, he'd just be taking the project maintainers' recommendations, with regard to speed/accuracy tradeoffs.

    Originally posted by Royi View Post
    Also, why not use AVX2 for compilation?
    Does --enable-multiarch get you that?

    Originally posted by Royi View Post
    Moreover, while MKL is known for discriminating against non-Intel CPUs, this library doesn't, as it chooses its code path based only on CPU features.
    Yeah, but if certain cases are built around using specialized datatypes that are currently supported only on Intel CPUs, then the net effect is the same.

    Leave a comment:


  • Royi
    replied
    Originally posted by coder View Post
    Assuming it wasn't intentionally rigged to make AMD look bad, my guess is it's just using instructions (probably AVX-512, at that) for which AMD has no equivalent. Then, AMD has to fall back on some scalar code path included for the sake of compatibility.

    If you look at specifically which tests are extremely Intel-biased, they're:
    • deconvolution
    • u8s8f32 (meaning: f32 += unsigned 8-bit * signed 8-bit ?)
    Lacking a key instruction used in the optimized deconvolution code path could break AMD in those benchmarks, and getting good performance on the u8s8f32 tests surely depends on having the right instructions for it.
    MKL-DNN chooses the same code path for Intel and AMD CPUs, given that they have the same features.
    Since the test used the -msse4.1 flag, the generated code relied only on features that both CPUs have.

    Leave a comment:


  • coder
    replied
    Originally posted by milkylainen View Post
    Anything that produces something like a 40x performance difference between contemporary performance equals (yes, more or less) is either:
    Seriously flawed or extremely biased.

    And no, one more shiny instruction set does not account for the difference. So I vote for the former.
    Assuming it wasn't intentionally rigged to make AMD look bad, my guess is it's just using instructions (probably AVX-512, at that) for which AMD has no equivalent. Then, AMD has to fall back on some scalar code path included for the sake of compatibility.

    If you look at specifically which tests are extremely Intel-biased, they're:
    • deconvolution
    • u8s8f32 (meaning: f32 += unsigned 8-bit * signed 8-bit ?)
    Lacking a key instruction used in the optimized deconvolution code path could break AMD in those benchmarks, and getting good performance on the u8s8f32 tests surely depends on having the right instructions for it.

    But, the bottom line is that this benchmark really doesn't tell us how the CPUs compare, unless you happen to be running a workload that's dependent on this specific library. So, I would suggest such strongly-biased tests not be included in PTS.

    Leave a comment:


  • Royi
    replied
    I am not an experienced Linux user, but I don't get the flags used.
    First, if it uses OpenMP, why the need for the thread lib?
    Also, in DNN accuracy isn't important, so why not use the Ofast flag, which is basically the O3 flag with non-precise FP?

    Also, why not use AVX2 for compilation?

    Really strange choices.

    Regarding the library:
    While this is an Intel library, it is open source, unlike the classic MKL library.
    Moreover, while MKL is known for discriminating against non-Intel CPUs, this library doesn't, as it chooses its code path based only on CPU features.

    Hopefully this policy will be applied to MKL as well.
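    For illustration, here is a rough Python sketch of that feature-based dispatch policy: pick the widest kernel the CPU advertises, never looking at the vendor string. This is not MKL-DNN's actual dispatcher; the kernel names and the fallback order are hypothetical.

    ```python
    # Hypothetical sketch of CPU-feature-based kernel dispatch (the policy
    # described above): choose by advertised features, not by vendor.
    def pick_kernel(flags):
        # Preference order from widest vector ISA to narrowest; names are
        # made up for illustration.
        for feature, kernel in [("avx512f", "gemm_avx512"),
                                ("avx2", "gemm_avx2"),
                                ("sse4_1", "gemm_sse41")]:
            if feature in flags:
                return kernel
        return "gemm_scalar"  # compatibility fallback

    # An AMD CPU reporting AVX2 gets the same kernel an Intel CPU would.
    print(pick_kernel({"sse4_1", "avx2"}))  # -> gemm_avx2
    ```

    On Linux, the flag set could be read from the "flags" line of /proc/cpuinfo.
    
    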

    Leave a comment:


  • atomsymbol
    replied
    Originally posted by milkylainen View Post

    This test has serious issues.
    You can't use it as a test comparing AMD to Intel stating it's an "Intel library".
    Anything that produces something like a 40x performance difference between contemporary performance equals (yes, more or less) is either:
    Seriously flawed or extremely biased.

    And no, one more shiny instruction set does not account for the difference. So I vote for the former.
    I doubt you are right. By what method exactly did you determine that Intel is cheating? The outlier results might actually be the two where 2*Xeon 8280 is slower than 2*EPYC 7742.

    In the case of the u8s8f32 datatype, Intel might be using the DL Boost instructions (https://en.wikichip.org/wiki/x86/avx512vnni), which could significantly contribute to the 40x speedup over AVX2 (irrespective of whether the AVX2 instructions are executed on an AMD or Intel CPU).
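    For reference, a rough Python emulation of what a single 32-bit lane of the VNNI VPDPBUSD instruction computes (unsigned 8-bit times signed 8-bit, four products summed into an int32 accumulator, ignoring saturation/overflow details). The point is that one instruction replaces a multiply-widen-add sequence per lane.

    ```python
    import numpy as np

    def vpdpbusd_lane(acc, u8x4, s8x4):
        # Zero-extend the unsigned operands, sign-extend the signed ones,
        # multiply pairwise, sum the four products, add to the accumulator --
        # roughly what one VPDPBUSD lane does in a single instruction.
        return acc + int(np.sum(u8x4.astype(np.int32) * s8x4.astype(np.int32)))

    u8 = np.array([200, 10, 255, 3], dtype=np.uint8)
    s8 = np.array([-5, 7, -128, 100], dtype=np.int8)
    print(vpdpbusd_lane(0, u8, s8))  # -> -33270
    ```
    
    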

    Xeon 8280 might be executing two FMA AVX-512 instructions per clock per core (2*512=1024 bits per clock), compared to AMD's 256 bits per clock per core (Ryzen 1000/2000) and 2*256=512 bits per clock per core (Ryzen 3000). Combined with Intel's higher operating frequency this alone could account for about 3x higher per-core performance over Ryzen 3000 (6x in case of Ryzen 1000/2000).
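    The width arithmetic above can be sketched as back-of-envelope numbers (the pipe widths and counts are the post's assumptions, not measurements):

    ```python
    # Per-core FMA vector throughput in bits per clock, per the assumptions
    # in the post above.
    xeon_8280  = 2 * 512   # two AVX-512 FMA pipes
    ryzen_3000 = 2 * 256   # two 256-bit FMA pipes (Zen 2)
    ryzen_1000 = 1 * 256   # effective 256 bits per clock (Zen/Zen+)

    print(xeon_8280 / ryzen_3000)  # -> 2.0x from vector width alone
    print(xeon_8280 / ryzen_1000)  # -> 4.0x
    # A clock-frequency edge on top of this is what pushes the estimate
    # toward ~3x (vs. Ryzen 3000) and ~6x (vs. Ryzen 1000/2000).
    ```
    
    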

    https://ark.intel.com/content/www/us...-2-70-ghz.html
    https://www.agner.org/optimize/microarchitecture.pdf (section 20.11)
    https://en.wikichip.org/wiki/amd/mic...tectures/zen_2

    Leave a comment:


  • milkylainen
    replied
    Originally posted by Michael View Post

    The article explains quite clearly that MKL-DNN / DNNL is an Intel library.
    This test has serious issues.
    You can't use it as a test comparing AMD to Intel stating it's an "Intel library".
    Anything that produces something like a 40x performance difference between contemporary performance equals (yes, more or less) is either:
    Seriously flawed or extremely biased.

    And no, one more shiny instruction set does not account for the difference. So I vote for the former.

    Leave a comment:


  • coder
    replied
    Originally posted by Rigaldo View Post
    I had seen a patched version of MKL which performed pretty well on AMD.
    https://github.com/fo40225/Anaconda-Windows-AMD
    Now this probably won't easily match these benchmarks, but I can attest that much better results can be reproduced with whatever patch is there!
    I wonder what the chances are of those patches ever making it into Intel's repo...

    Anyway, it'd be interesting to compare something like TensorFlow, compiled according to Intel & AMD's recommendations, on their respective CPUs. Most people aren't using MKL-DNN, directly - if they use it at all (because GPUs and the growing number of AI ASICs are really the way to do this stuff), they're using it as a backend for another framework.

    Leave a comment:


  • marty1885
    replied
    I mean, MKL is an Intel library, so... heck. But the performance difference is ridiculous; it looks like two totally different libraries running.

    Leave a comment:


  • Rigaldo
    replied
    I had seen a patched version of MKL which performed pretty well on AMD.
    https://github.com/fo40225/Anaconda-Windows-AMD
    Now this probably won't easily match these benchmarks, but I can attest that much better results can be reproduced with whatever patch is there!

    Leave a comment:
