Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks

  • Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks

    Phoronix: Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks

    Continuing with this month's benchmarks of the newly released GCC 9 compiler, here are some additional numbers for the AVX-512-enabled Intel Core i9 7980XE processor on Ubuntu Linux when tuning for various AVX vector widths...


  • #2
    Great! I've been wanting to see benchmarks comparing different `-mprefer-vector-width=` options for a while.
    It's disappointing to see `=512` do so poorly. I think compilers do a bad job of taking advantage of it; e.g., I've never seen one emit masked operations on their own.
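
    For anyone who wants to try this locally, here is a minimal loop of the kind these options steer; the file name and compile lines below are just an illustration, not taken from the article:

    ```c
    /* saxpy.c -- a simple loop whose vector width GCC picks based on
     * -mprefer-vector-width (illustrative example, not from the article). */
    #include <stddef.h>

    void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Compare the generated assembly, e.g.:
     *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -S saxpy.c
     *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S saxpy.c
     * With =512 the hot loop should use zmm registers; with =256, ymm. */
    ```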



    • #3
      This may be due to the use of -fvect-cost-model=dynamic with -O3: either the overhead of the runtime checks or lower rates of vectorization due to the profitability model slows things down. -O2 uses -fvect-cost-model=cheap, but I've found -fvect-cost-model=unlimited to be the best for a small set of benchmarks on Skylake-X. Michael, would it be possible to rerun these benchmarks with -fvect-cost-model varied as well?

      -fvect-cost-model=model
      Alter the cost model used for vectorization. The model argument should be one of ‘unlimited’, ‘dynamic’ or ‘cheap’. With the ‘unlimited’ model the vectorized code-path is assumed to be profitable while with the ‘dynamic’ model a runtime check guards the vectorized code-path to enable it only for iteration counts that will likely execute faster than when executing the original scalar loop. The ‘cheap’ model disables vectorization of loops where doing so would be cost prohibitive for example due to required runtime checks for data dependence or alignment but otherwise is equal to the ‘dynamic’ model.
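
      For concreteness, here is the sort of minimal comparison I have in mind; the file name and exact flags are just an illustration:

      ```c
      /* bench.c -- a loop where the cost models diverge: without 'restrict',
       * the vectorized path needs a runtime overlap check, which 'dynamic'
       * pays for and 'cheap' may refuse outright (illustrative example). */
      #include <stddef.h>

      void scale_add(double *y, const double *x, double a, size_t n)
      {
          for (size_t i = 0; i < n; i++)
              y[i] += a * x[i];
      }

      /* Possible comparison builds:
       *   gcc -O3 -march=skylake-avx512 -fvect-cost-model=cheap     -S bench.c
       *   gcc -O3 -march=skylake-avx512 -fvect-cost-model=dynamic   -S bench.c
       *   gcc -O3 -march=skylake-avx512 -fvect-cost-model=unlimited -S bench.c
       * Adding -fopt-info-vec-missed reports the loops each model declines to vectorize. */
      ```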
      Last edited by wolfwood; 27 May 2019, 10:35 PM.



      • #4
        Very interesting indeed.

        Are these tests run at a fixed CPU frequency? If not, can the frequency be monitored and displayed alongside the results?
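
        In case it helps, one low-tech way to log the clock during a run is sketched below; it assumes the cpufreq sysfs interface is available and is not what the Phoronix Test Suite actually does:

        ```c
        /* freq_log.c -- sample the current frequency of core 0 once per second.
         * Hypothetical helper; reads the cpufreq sysfs file, which reports kHz. */
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            const char *path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
            for (;;) {
                FILE *f = fopen(path, "r");
                if (!f)
                    return 1;
                long khz = 0;
                if (fscanf(f, "%ld", &khz) == 1)
                    printf("%.2f GHz\n", khz / 1e6);
                fclose(f);
                sleep(1);
            }
        }
        ```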



        • #5
          Originally posted by celrod View Post
          Great! I've been wanting to see benchmarks comparing different `-mprefer-vector-width=` options for a while.
          It's disappointing to see `=512` do so poorly. I think compilers do a bad job of taking advantage of it; e.g., I've never seen one emit masked operations on their own.
          But note that with AVX512VL support the compiler can also emit masked operations at 128- and 256-bit widths.

          Edit: Basically, the AVX512VL extension makes it possible for the compiler to replace SSE and AVX instructions with more powerful AVX-512 instructions. I've never checked whether it actually does so, but I assume it does, since SSE instructions are replaced with AVX ones when using -mavx -mprefer-vector-width=128.
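
          One way to check is to look at the assembly yourself; the file name below is made up and this is only a sketch:

          ```c
          /* check_vl.c -- see whether GCC uses AVX-512VL at narrow widths
           * (illustrative test file, not from this thread). */
          #include <stddef.h>

          void add_arrays(float *restrict dst, const float *restrict src, size_t n)
          {
              for (size_t i = 0; i < n; i++)
                  dst[i] += src[i];
          }

          /* For example:
           *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -S check_vl.c
           * AVX-512VL shows up as xmm/ymm instructions that use the k mask registers
           * or xmm16-xmm31; plain VEX-encoded AVX code uses neither. */
          ```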
          Last edited by carewolf; 28 May 2019, 03:46 AM.



          • #6
            Originally posted by pegasus View Post
            Very interesting indeed.

            Are these tests run at a fixed CPU frequency? If not, can the frequency be monitored and displayed alongside the results?
            The frequency is lower when using AVX instead of scalar instructions, and even lower when using AVX-512. I tried it on a Xeon Gold 6128, and if I remember correctly the frequency was 1.9 GHz when using AVX-512.



            • #7
              Additional charts normalized to clock speed would indeed be very interesting, especially since that should correlate directly with power draw.



              • #8
                Originally posted by carewolf View Post

                But note that with AVX512VL support the compiler can also emit masked operations at 128- and 256-bit widths.

                Edit: Basically, the AVX512VL extension makes it possible for the compiler to replace SSE and AVX instructions with more powerful AVX-512 instructions. I've never checked whether it actually does so, but I assume it does, since SSE instructions are replaced with AVX ones when using -mavx -mprefer-vector-width=128.
                Two simple cases where I think auto-vectorizers really ought to be able to take advantage of masking:
                1. Loops: the remainder of a loop (i.e., loop iterations mod vector width) ought to be done in a single masked iteration, not serially (see the sketch after this list).
                2. SLP vectorization. There are lots of opportunities where repeated calculations show up in multiples that aren't a power of 2. These don't get vectorized, or are handled inefficiently. E.g., if there are 7 elements, auto-vectorizers will often break that into vectors of length 4, 2, and 1. That means they'll use 3 loads/stores, consume 3 registers, and issue 3 instructions per operation they want to perform on the 7 numbers. If they instead just used masked loads/stores, all of those "3"s become "1". For a simple column-major matrix multiplication of a 7x13 by a 13x10 matrix, this makes the difference between 23 ns and 130 ns (requiring 3 registers per result leads to spilling).
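
                Here is roughly what I mean for point 1, written with intrinsics; this is an illustrative sketch, not compiler output:

                ```c
                /* Masked loop remainder: the final n % 8 doubles are handled in one
                 * masked iteration instead of a scalar tail loop (illustrative sketch). */
                #include <immintrin.h>
                #include <stddef.h>

                void scale(double *x, double a, size_t n)
                {
                    __m512d va = _mm512_set1_pd(a);
                    size_t i = 0;
                    for (; i + 8 <= n; i += 8)
                        _mm512_storeu_pd(x + i, _mm512_mul_pd(_mm512_loadu_pd(x + i), va));
                    if (i < n) {                              /* 1..7 elements left */
                        __mmask8 m = (__mmask8)((1u << (n - i)) - 1u);
                        __m512d v = _mm512_maskz_loadu_pd(m, x + i);
                        _mm512_mask_storeu_pd(x + i, m, _mm512_mul_pd(v, va));
                    }
                }
                ```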

                Like you point out, Skylake-X and later can even use masks with the earlier instruction sets. That sort of trick makes it much easier to explicitly vectorize AVX-512 code than AVX2 code -- meaning even when the optimal width is 4 or less, it should often get a boost. But compilers do not seem to take advantage of it.

                Originally posted by Virtus View Post
                The frequency is lower when using AVX instead of scalar instructions, and even lower when using AVX-512. I tried it on a Xeon Gold 6128, and if I remember correctly the frequency was 1.9 GHz when using AVX-512.
                Yeah. You can mess with these settings in the BIOS / when overclocking.
                I have water cooling with a nice 360mm radiator, but AVX-512-heavy loads generate a lot of heat, and they will crash the computer if you set their clocks too high and run them on all cores.
                The crashes were practically instantaneous when doing linear algebra, which is more or less pure AVX-512, and took about a minute when running statistical software I wrote. Obviously, time to crash depends on how high your overclocks are.
                My point here is that -- since it wouldn't crash under AVX2 loads -- the need to downclock on AVX-512 is real.

                But, on the other hand, all these test cases ran much faster thanks to AVX-512, because the wider vectors more than made up for the lower clocks.
                Unfortunately, it seems that to see this benefit, you have to deliberately write the software with AVX-512 vectorization in mind...



                In many of my workloads, AVX-512 provides a huge benefit, and I don't know how to get them working on a GPU (the standard AVX-512 counterargument).
                But seeing benchmarks like these (and ones I ran on the Polyhedron test suite in an earlier Phoronix discussion) suggests I'm in the minority; that to most people AVX-512 is wasted silicon. =(
                Selfishly, I hope Intel doesn't decide to abandon AVX-512, that AMD decides to at least support it the way first-gen Ryzen supported AVX2, and that compilers improve.

                Processor manufacturers obviously can't afford to give too many of them away, but I can't help but think they'd be better off if the folks working on LLVM's and GCC's optimizers had chips with the latest instructions. Same for libraries like OpenBLAS and FFTW. In the case of OpenBLAS, its primary maintainer (martin-frbg) said [regarding avx-512](https://github.com/xianyi/OpenBLAS/p...nt-427099496):
                Fixed the naming for you, will leave the algorithm discussion to the adults with SkylakeX access.
                OpenBLAS was very slow to get AVX-512 support, and AFAIK it is still pretty bad. Odds are, if he had had access to such a chip, AVX-512 would've started looking good on OpenBLAS benchmarks much sooner. Julia's official binaries ship with OpenBLAS, as does R on most Linux distros.



                • #9
                  Originally posted by celrod View Post
                  In many of my workloads, AVX-512 provides a huge benefit, and I don't know how to get them working on a GPU (the standard AVX-512 counterargument).
                  But seeing benchmarks like these (and ones I ran on the Polyhedron test suite in an earlier Phoronix discussion) suggests I'm in the minority; that to most people AVX-512 is wasted silicon. =(
                  You are indeed. I've seen most math-heavy workloads migrate to GPUs in the past few years, and nobody cares about AVX-512 these days.
                  If you can't get your code ported to OpenMP/OpenCL/CUDA to eventually run on GPUs, take a look at NEC Aurora Tsubasa cards. They're super-wide vector engines built with NEC's decades of vector-supercomputer know-how.



                  • #10
                    I think Bayesian statistics would be a great niche for AVX-512, but statisticians in general (Bayesian or otherwise) are computer- and optimization-illiterate; the most popular BLAS library is the reference BLAS you get when downloading R with RStudio, which is about 30x slower than OpenBLAS & co.
                    Even software written by computer scientists for doing Bayesian inference, like Stan, has its most basic data structure specified in a way that is incompatible with vectorization: like elements are not stored contiguously in arrays (see the sketch below).
                    In my experience, it isn't too difficult to get >4x, and sometimes >100x, better performance.
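
                    To make the contiguity point concrete, here is an illustrative sketch; these are not Stan's actual types:

                    ```c
                    /* Array-of-structs vs struct-of-arrays: only the latter gives the
                     * auto-vectorizer unit-stride loads over "like elements". */
                    #include <stddef.h>

                    struct param_aos  { double mu; double sigma; };             /* mu values are strided     */
                    struct params_soa { double *mu; double *sigma; size_t n; }; /* mu values are contiguous  */

                    void scale_mu(struct params_soa *p, double a)
                    {
                        for (size_t i = 0; i < p->n; i++)   /* unit-stride loads/stores over p->mu */
                            p->mu[i] *= a;
                    }
                    ```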

                    Of course, as long as reference BLAS implementations and Stan are popular, no one will benefit from any vectorization at all.
                    But programs running for days or weeks are fairly routine; there is a need, just no awareness that better is possible. It is a field that has a lot to gain from better-optimized software, if only its practitioners were as computer-literate as the machine-learning folks.

                    Double precision is a must. I'm still a graduate student (I will finish in a couple of months), so for now good double-precision performance on a GPU is out of my budget (maybe I'll get a Radeon VII, which has quarter-rate double precision -- or a Navi successor if any are similar).
                    And I do intend to play around eventually. Typical Bayesian models are nowhere near as clean to vectorize as machine learning.
                    A neural network is just a bunch of large matrix multiplications and non-linear transforms applied to large vectors.
                    With a Bayesian model, you have a lot of smaller-grained stuff. 256-wide vectors will be overkill for much of it. And I'm worried about memory problems if you try to run too many chains on simulated data sets at once.
                    But I have very little experience with GPGPU, so it still seems alien/intimidating/hard to reason about.

