AMD Zen 3 Performance With The Initial "znver3" GCC Compiler Support


  • #11
    It would be interesting to compare these results with -march=x86-64-v2 and -march=x86-64-v3 to see if using the default scheduler with SSE4.2 or AVX2/FMA makes a noticeable enough difference to justify providing CPU-specific binaries.



    • #12
      Originally posted by S.Pam
      You can always use a source-based distribution =)
      then you'll spend all your speed gains and more on rebuilds
      Originally posted by S.Pam
      I think code can also be compiled with multiple targets so that they detect at run-time what code-path to use.
      yes and it's about time normal distros started doing it for time-critical packages
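A minimal sketch of the run-time dispatch idea discussed in this post, assuming GCC or Clang on x86-64. The names `dot_generic`, `dot_avx2`, `select_dot` and `dot` are hypothetical and only for illustration; `__builtin_cpu_supports` and the `target` attribute are compiler extensions, and on other compilers or architectures the sketch simply falls back to the generic path.

```c
#include <assert.h>
#include <stddef.h>

/* Portable baseline: plain C that any CPU can run. */
static double dot_generic(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
/* Same loop, but the compiler may use AVX2 and FMA inside this one function. */
__attribute__((target("avx2,fma")))
static double dot_avx2(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
#endif

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Pick an implementation based on what the running CPU reports. */
static dot_fn select_dot(void) {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return dot_avx2;
#endif
    return dot_generic;
}

/* Public entry point: resolves the code path on first use.
   Not thread-safe as written; a real library would guard the first call. */
double dot(const double *a, const double *b, size_t n) {
    static dot_fn impl = NULL;
    assert(a != NULL && b != NULL);
    if (impl == NULL)
        impl = select_dot();
    return impl(a, b, n);
}
```

Callers only ever see `dot()`, so one binary can ship both code paths. GCC also provides a `target_clones` function attribute that generates the variants and the dispatcher automatically; the manual version here just makes the mechanism visible.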



      • #13
        Originally posted by pal666
        then you'll spend all your speed gains and more on rebuilds
        That's nonsense. You spend all of your speed gains at the time when you're using the application in question.

        I swear, it's waaaay quicker to build a Gentoo system than it is to strip down Windows... And I guarantee Gentoo built by the handbook guidelines is far better optimized than Windows...



        • #14
          I wonder what a more detailed analysis of the SciMark 2 improvements would show. The other tests and benchmarks show what I was expecting: about 1%-2% improvement compared to using znver2. For some reason the SciMark 2 gain is way more than that. Interesting.

          I also remember that when Phoronix compared znver2 against znver1 on the Ryzen 3000 series, SciMark 2 also gained significantly, while the other benchmarks gained maybe 2%-3%, as expected.

          Something in SciMark 2 seems very sensitive to critical timing in some tight loop, or to instruction ordering. Maybe decompose SciMark 2 into its individual benchmarks? Also, what does it mean that SciMark is compiled with specific options? Is the entire Java stack recompiled with different options? Does the benchmark run long enough to offset any JIT time differences?
          Last edited by baryluk; 10 December 2020, 03:19 AM.



          • #15
            Originally posted by baryluk
            I wonder what a more detailed analysis of the SciMark 2 improvements would show. The other tests and benchmarks show what I was expecting: about 1%-2% improvement compared to using znver2. For some reason the SciMark 2 gain is way more than that. Interesting.

            I also remember that when Phoronix compared znver2 against znver1 on the Ryzen 3000 series, SciMark 2 also gained significantly, while the other benchmarks gained maybe 2%-3%, as expected.

            Something in SciMark 2 seems very sensitive to critical timing in some tight loop, or to instruction ordering. Maybe decompose SciMark 2 into its individual benchmarks? Also, what does it mean that SciMark is compiled with specific options? Is the entire Java stack recompiled with different options? Does the benchmark run long enough to offset any JIT time differences?
            Yeah, seems sensitive to something alright. It could be many things though. Perhaps one of the tested loops fits better in cachelines with the new configuration.

            I am more curious about the GraphicsMagick result, where the Haswell configuration rules. It could be an accident, but perhaps something from that configuration is good for znver3?



            • #16
              Originally posted by carewolf
              Yeah, seems sensitive to something alright.
              Two new instructions come to mind: a vector AES instruction, and a new carry-less quad word multiplication. When these get used then I would expect to see some difference.

              Everything else in the benchmarks looks like ordinary deviations, which will be why we saw znver2 outperforming znver3 a few times. Three runs for each test isn't much and doesn't allow for very precise comparisons. It's barely enough to even calculate a first deviation for a test. But don't tell Michael I've said that ...
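To make the "three runs isn't much" point concrete, here is a rough sketch of how one might check whether two sets of three runs differ by more than their run-to-run noise. The function names are hypothetical, and the sample values in the usage note are made up, not taken from the article. It assumes roughly equal variances in both sets, so a two-sample t-test with 4 degrees of freedom applies (two-sided 5% critical value of about 2.776). The statistic is kept squared to avoid needing the math library.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n benchmark results. */
double run_mean(const double *x, size_t n) {
    assert(n >= 1);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1), so it needs n >= 2. */
double run_variance(const double *x, size_t n) {
    assert(n >= 2);
    double m = run_mean(x, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; i++)
        ss += (x[i] - m) * (x[i] - m);
    return ss / (double)(n - 1);
}

/* Squared two-sample t statistic: (difference of means)^2 divided by
   the squared standard error of that difference.
   Undefined if both sets have zero variance. */
double t_squared(const double *a, size_t na, const double *b, size_t nb) {
    double diff = run_mean(a, na) - run_mean(b, nb);
    double se2 = run_variance(a, na) / (double)na
               + run_variance(b, nb) / (double)nb;
    return diff * diff / se2;
}

/* Returns 1 if two sets of exactly three runs differ at the two-sided
   5% level, assuming equal variances (4 degrees of freedom). */
int looks_significant_3v3(const double *a, const double *b) {
    const double t_crit = 2.776; /* Student's t, 4 d.o.f., two-sided 5% */
    return t_squared(a, 3, b, 3) > t_crit * t_crit;
}
```

With made-up results of {100, 102, 101} and {102, 101, 103}, the means differ by 1% but the check does not report a significant difference, which is the situation described above for the znver2 versus znver3 swings.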



              • #17
                I hope you aren't running that memory at 4266MHz.



                • #18
                  Originally posted by sdack
                  Two new instructions come to mind: a vector AES instruction, and a new carry-less quad word multiplication. When these get used then I would expect to see some difference.
                  Those wouldn't be used unless specifically coded for, and both are for encryption/hashing, not scientific calculations. I doubt a single test in the suite would use them; even the specific encryption tests are too old to take advantage of them.



                  • #19
                    Originally posted by sdack
                    Two new instructions come to mind: a vector AES instruction, and a new carry-less quad word multiplication. When these get used then I would expect to see some difference.

                    Everything else in the benchmarks looks like ordinary deviations, which will be why we saw znver2 outperforming znver3 a few times. Three runs for each test isn't much and doesn't allow for very precise comparisons. It's barely enough to even calculate a first deviation for a test. But don't tell Michael I've said that ...

                    AES and carry-less quad word multiplication are not used by SciMark.

                    It is mostly FP64 ("double"), vectorization and cache usage that are a factor here. There aren't even many divisions (or integer divisions) done in SciMark 2, so the improved latency of integer division is not a factor either.

                    I think the deciding factor, however, is the "L2" cache. GCC's l2 cache size param actually refers here to the LLC (last-level cache), i.e. L3 in the case of Zen. Setting znver3, I think, sets a higher value than using znver2, so some algorithms in SciMark can use better blocking (decomposing a two-dimensional array traversal into chunks / blocks, where the optimal size of these blocks is highly dependent on the L2 and L3 cache sizes).

                    I don't have access to znver3 hardware to confirm, but my guess is that the SOR (Successive Over-Relaxation) sub-benchmark from SciMark 2 is the "culprit" here. It has a doubly nested loop, which is exactly the type that would benefit from better blocking.
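For readers unfamiliar with the term, a small sketch of what "blocking" a two-dimensional traversal means. This is not SciMark's SOR code: SOR has dependencies between neighbouring updates, so a matrix transpose is used here as a simpler stand-in where reordering is clearly safe. The function names and the `block` parameter are hypothetical; in practice the block size would be tuned to the cache sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose: src is rows x cols, dst is cols x rows, both row-major.
   Writes to dst jump by `rows` elements each step, which is hard on the
   cache once the matrix is large. */
void transpose_naive(const double *src, double *dst,
                     size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * rows + i] = src[i * cols + j];
}

/* Blocked transpose: same result, but the traversal is split into
   block x block tiles so each tile of src and dst stays in cache
   while it is being worked on. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols, size_t block) {
    assert(block > 0);
    for (size_t ii = 0; ii < rows; ii += block) {
        size_t i_end = (ii + block < rows) ? ii + block : rows;
        for (size_t jj = 0; jj < cols; jj += block) {
            size_t j_end = (jj + block < cols) ? jj + block : cols;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Both functions produce identical output; only the order of memory accesses differs. Which `block` value is fastest depends on the cache hierarchy, which is why a compiler's idea of the cache sizes could matter for code like this.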



                    • #20
                      Originally posted by sdack
                      Two new instructions come to mind: a vector AES instruction, and a new carry-less quad word multiplication. When these get used then I would expect to see some difference.

                      Everything else in the benchmarks looks like ordinary deviations, which will be why we saw znver2 outperforming znver3 a few times. Three runs for each test isn't much and doesn't allow for very precise comparisons. It's barely enough to even calculate a first deviation for a test. But don't tell Michael I've said that ...
                      Unfortunately, Michael doesn't care about the quality of the numbers produced by benchmarks, just that they are produced quickly so he can post more articles. His statistical knowledge (and frankly that of most other people who do benchmarks) is very limited.
