
Benchmarking AMD Zen 3 With Predictive Store Forwarding Disabled


  • smitty3268
    replied
Found one comment online suggesting that the SPEC benchmarks might be impacted by this feature.

    No idea if that's true or not.



  • Peter Fodrek
    replied
    Originally posted by coder View Post
    Do you have any sources on this (other than the ones in your post, which are unrelated to PSF)?
    Zen 3's Predictive Store Forwarding aims to enhance performance by trying to predict dependencies between loads and stores. PSF can speculatively execute instructions based on what it thinks the result of the load will be and while the predictions should be largely accurate, there is the possibility of incorrect CPU speculation.
    https://www.phoronix.com/scan.php?pa...urity-Analysis

PSF acts as an alternative to a link from the FPU part's exit to the FPU part's entry in the core,


which is missing in K8 and Zen as well


    from
    https://www.anandtech.com/show/1998/4



  • coder
    replied
    Originally posted by Peter Fodrek View Post
This feature mainly helps to speed up multimedia and matrix operations/instructions
    Do you have any sources on this (other than the ones in your post, which are unrelated to PSF)?



  • coder
    replied
    Originally posted by BingoNightly View Post
    Marketing?
    If it was, I'd say that was a really bad call. Who had even heard of this feature, before now? I read a fair amount of the Zen3 launch coverage and don't remember even a mention of it!

    I really doubt AMD is wasting time and money on low-level CPU features only for the sake of marketing. That's more like something Intel would do. Although, I'd say both companies do a certain amount of that, at the chipset & Windows driver-level.



  • Peter Fodrek
    replied

    Originally posted by muncrief View Post
    Well, if PSF is that ineffective it's currently a waste of silicon space, and AMD should either improve it or remove it. I can't help but wonder if something wasn't set up correctly for these tests, or if there's some other unintended anomaly, but if it's really this bad PSF doesn't seem to add any real value to a processor.

This feature mainly helps to speed up multimedia and matrix operations/instructions

    AMD Developers Looking At GNU C Library Platform Optimizations For Zen
    on 25 March 2020
    Stemming from Glibc semantics that effectively "cripple AMD" in just checking for Intel CPUs while AMD CPUs with Glibc are not even taking advantage of Haswell era CPU features,

    Under a "request for comments" flag, patches tentatively posted add AMD Zen and AVX/AVX2 platform support and refactor the platform support within the CPU features detection. This would at run-time allow CPU features like AVX2, FMA, BMI2, POPCNT, and other instructions to be enabled when detected to be running on an AMD Zen based processor.
    https://www.phoronix.com/scan.php?pa...m-Optimize-Zen


It "just" disables the speedup for Zen 3, as in the MATLAB case:

    Nov 18th, 2019 10:53
MATLAB is a popular math computing environment in use by engineering firms, universities, and other research institutes. Some of its operations can be made to leverage Intel MKL (Math Kernel Library), which is poorly optimized for, and notoriously slow on AMD Ryzen processors. Reddit user Nedflanders1976 devised a way to restore anywhere between 20 to 300 percent performance on Ryzen and Ryzen Threadripper processors, by forcing MATLAB to use advanced instruction-sets such as AVX2. By default, MKL queries your processor's vendor ID string, and if it sees anything other than "GenuineIntel...," it falls back to SSE, posing a significant performance disadvantage to "AuthenticAMD" Ryzen processors that have a full IA SSE4, AVX, and AVX2 implementation.
    https://www.techpowerup.com/261241/m...ificantly?cp=3

As for glibc 2.33 and 2.34: 2.34 will add AMD-specific optimized code.

    GNU C Library 2.33 Released With HWCAPS To Load Optimized Libraries For Modern CPUs
    on 1 February 2021
    https://www.phoronix.com/scan.php?pa...C-Library-2.33



  • halo9en
    replied
    Originally posted by milkylainen View Post
    Phew. Dodged that one by a hair.
    Nice to see that performance wasn't shot to bits.
    Heh, just like Intel. Uh wait...



  • BingoNightly
    replied
    Originally posted by zeb_ View Post
One can reverse the question: what is the point of enabling PSF if it does not provide an advantage?
    Marketing?



  • coder
    replied
    Originally posted by muncrief View Post
I can't help but wonder if something wasn't set up correctly for these tests, or if there's some other unintended anomaly, but if it's really this bad PSF doesn't seem to add any real value to a processor.
    A way to know for sure would be to hand-code a test in asm and benchmark it with/without the feature enabled.



  • coder
    replied
    Originally posted by smitty3268 View Post
    It seems like it's likely a feature that becomes more effective in longer running processes. A short benchmark might not be affected nearly as much as a long-running server process.
    It's not a bad question, but the second of the quotes you included confirms what I'd assumed -- that the profiling data is actually very short-lived!

All sorts of internal CPU state like this gets implicitly or explicitly replaced once a context switch happens (most notably things like branch prediction, which is very consequential). Context switches can occur anywhere from a few times per second (per core) to hundreds of thousands (see https://openbenchmarking.org/test/pts/stress-ng-1.3.1 and note that it's measuring context switches on all cores), with the latter happening if the thread is actively doing things like blocking I/O or synchronization.

However, the limiting factor on its window of applicability is more typically going to be the relatively small number of addresses it has the storage to cache (and the ability to look up). It could be as few as a couple dozen, but it's almost certainly not enough to help outside of a small-to-medium-sized loop.
    Last edited by coder; 05 April 2021, 04:54 AM.



  • coder
    replied
    Originally posted by zeb_ View Post
One can reverse the question: what is the point of enabling PSF if it does not provide an advantage?
Yeah. Michael, which benchmarks showed the greatest benefit, and by how much?

I also wonder whether it's a scenario more likely to occur in poorly-optimized (or unoptimized) code, since it seems to describe a situation where you store something and then read it back before it even hits L1 cache. Normally, an optimizing compiler would cache such values in registers if they'll be needed again so soon. Maybe it would also help with optimized code, if you're spilling registers and the core's write buffers are full.

    All of this has me wondering something else. When you get a cache miss, a cacheline has to be evicted before the new one can be fetched. So, I wonder if modern CPUs use idle cycles on the memory bus to pre-emptively write back the cachelines most likely to be victimized. That would at least lower the penalty of a cache miss, somewhat.
    Last edited by coder; 05 April 2021, 04:50 AM.

