
Skylake AVX-512 Benchmarks With GCC 8.0


  • Skylake AVX-512 Benchmarks With GCC 8.0

    Phoronix: Skylake AVX-512 Benchmarks With GCC 8.0

    For those curious about the current benefits of AVX-512, here are some benchmarks using a recent snapshot of the GCC 8 compiler and comparing the performance of the generated binaries for the skylake and skylake-avx512 targets...


  • #2
    Getting a compiler that can handle AVX-512 is a necessary step to using AVX-512, but unfortunately it doesn't rewrite your code so that it's actually using AVX-512 properly.
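    A rough illustration (not from the article; the loops are made up): with -O3 -march=skylake-avx512 gcc will happily vectorize the first loop below, but the loop-carried dependence in the second blocks it no matter which -march you pass - that is the part the compiler can't rewrite for you.

    /* toy example: which loops gcc can turn into AVX-512 on its own */
    #include <stddef.h>

    void scale(float *y, const float *x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i];      /* independent iterations: auto-vectorizable */
    }

    void prefix_sum(float *y, size_t n)
    {
        for (size_t i = 1; i < n; i++)
            y[i] += y[i - 1];     /* each step needs the previous result: not vectorized */
    }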



    • #3
      Clang tests using AVX-512 would be interesting as well.



      • #4
        Try the latest x264 - it looks like they added AVX-512 support.



        • #5
          Originally posted by chuckula View Post
          Getting a compiler that can handle AVX-512 is a necessary step to using AVX-512, but unfortunately it doesn't rewrite your code so that it's actually using AVX-512 properly.
          Sure, but generally you want to write C code without targeting a specific platform. It's the compiler's job to optimize per platform.

          I say "generally" because in some cases it does make sense to write platform-specific optimized code, even using assembly. Examples include vector math libraries, media codecs, and compression. You can then select the best code during compile time, or better yet provide a single library that loads the most optimal implementation at runtime.

          gcc does have some basic function multiversioning support to automatically handle this at runtime, but libraries that want to be more portable often do their own loading.
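
          A minimal sketch of that multiversioning support, assuming gcc 6+ on x86-64 Linux (the function name and kernel are made up): the target_clones attribute makes gcc emit one clone per listed target plus an IFUNC resolver that picks the best clone when the program is loaded.

          /* build with e.g. gcc -O3 -c saxpy.c */
          #include <stddef.h>

          __attribute__((target_clones("avx512f", "avx2", "default")))
          void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
          {
              /* the same loop is auto-vectorized separately for each clone */
              for (size_t i = 0; i < n; i++)
                  y[i] = a * x[i] + y[i];
          }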



          • #6
            Originally posted by emblemparade View Post

            Sure, but generally you want to write C code without targeting a specific platform. It's the compiler's job to optimize per platform.

            I say "generally" because in some cases it does make sense to write platform-specific optimized code, even using assembly. Examples include vector math libraries, media codecs, and compression. You can then select the best code during compile time, or better yet provide a single library that loads the most optimal implementation at runtime.

            gcc does have some basic function multiversioning support to automatically handle this at runtime, but libraries that want to be more portable often do their own loading.
            Function multiversioning is a very useful feature because it makes it much easier for binary packages to include multiple code paths for chips that vary greatly in age, all running the same basic program at the appropriate level of optimization. That's like having a single package that lets an old Athlon use SSE and a brand new Xeon use AVX-512 to run the same program optimally. However, as your own example shows, the programmers still have to make sure each version of the function is written to properly use each architecture, so it's a solution that needs the compiler and the coder together.
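
            A rough sketch of that compiler + coder split, assuming gcc on x86-64 (the function names are hypothetical, and the per-ISA bodies are scalar placeholders for what would really be hand-tuned code): the coder supplies one version per ISA level and __builtin_cpu_supports() picks one at runtime - essentially what libraries that do their own loading end up writing.

            #include <stddef.h>

            /* in a real library these bodies would differ (intrinsics or
               per-target pragmas); scalar stand-ins keep the sketch short */
            static float sum_scalar(const float *x, size_t n)
            {
                float s = 0.0f;
                for (size_t i = 0; i < n; i++)
                    s += x[i];
                return s;
            }
            static float sum_avx2(const float *x, size_t n)   { return sum_scalar(x, n); }
            static float sum_avx512(const float *x, size_t n) { return sum_scalar(x, n); }

            float sum(const float *x, size_t n)
            {
                __builtin_cpu_init();  /* fill gcc's CPU feature cache */
                if (__builtin_cpu_supports("avx512f"))
                    return sum_avx512(x, n);
                if (__builtin_cpu_supports("avx2"))
                    return sum_avx2(x, n);
                return sum_scalar(x, n);
            }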



            • #7
              AVX-512 just seems too complicated and takes up far too much die space to be a useful instruction set. Is it really worth dedicating 25.7% of your die to it? (I just did the maths on it and was surprised it was that high; my eyeball guess was 1/5th.) Also, AVX-512 data is so large it takes 2 cycles to load into a Skylake CPU, and when AVX-512 is used the CPU downclocks to 1.8-2.4GHz, so you have to be using AVX-512 quite a bit for it to be worth using at all.



              • #8
                Originally posted by Spazturtle View Post
                AVX-512 just seems too complicated and takes up far too much die space to be a useful instruction set. Is it really worth dedicating 25.7% of your die to it? (I just did the maths on it and was surprised it was that high; my eyeball guess was 1/5th.) Also, AVX-512 data is so large it takes 2 cycles to load into a Skylake CPU, and when AVX-512 is used the CPU downclocks to 1.8-2.4GHz, so you have to be using AVX-512 quite a bit for it to be worth using at all.
                What would be interesting is whether that part is still useful in AVX-512 256-bit mode. That mode still has better instructions than AVX1/2, but does it still need to slow down the processor? I think gcc added a -mprefer-avx256 flag similar to the old -mprefer-avx128 flag, so it should be testable - for instance on those tests that regressed?
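
                One way to check that, assuming the flag recent gcc 8 snapshots actually grew is -mprefer-vector-width=256 (the kernel below is made up): build the same loop with and without the 256-bit preference and compare both the generated code and the benchmark numbers.

                /* gcc -O3 -march=skylake-avx512 -S axpy.c
                   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -S axpy.c
                   then compare zmm vs ymm register use in the two outputs */
                #include <stddef.h>

                void axpy(float *restrict y, const float *restrict x, float a, size_t n)
                {
                    for (size_t i = 0; i < n; i++)
                        y[i] += a * x[i];
                }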



                • #9
                  That's kind of a minor improvement for so much die space. Are there any benchmarks showing much better improvements?



                  • #10
                    Originally posted by carewolf View Post

                    What would be interesting is whether that part is still useful in AVX-512 256-bit mode. That mode still has better instructions than AVX1/2, but does it still need to slow down the processor? I think gcc added a -mprefer-avx256 flag similar to the old -mprefer-avx128 flag, so it should be testable - for instance on those tests that regressed?
                    Intel CPUs will downclock when using AVX-512 in 256-bit mode or AVX2, but those instructions only take a single CPU cycle since all new Intel CPUs are 256 bits wide; they still downclock a little with AVX1.

                    AMD Ryzen CPUs don't downclock at all with AVX1 or AVX2, but they take 2 CPU cycles for both since they are only 128 bits wide.

                    Ryzen cores are significantly smaller than Skylake cores despite being on a node that is 18% larger. AMD chose to make Ryzen very lean by not including AVX-512 or making the CPU 256 bits wide.

                    Maybe Zen 2 or 3 will be 256 bits wide and able to perform an AVX1/2 instruction in a single cycle, but I don't think we will see AMD adopting AVX-512 any time soon, or ever. AMD will more likely have a renewed push for HSA, since GPUs can do everything AVX-512 is used for.

                    AVX-512 is very expensive and to me doesn't seem worth it: dedicating 25% of your CPU core to just AVX-512's baseline instruction sets is a big cost, and supporting more of the extended AVX-512 instruction sets would take up even more space. On a 6-core Skylake CPU, if you remove the AVX-512 part of the core you would have space on the die for 2 additional cores (4 more threads). Does AVX-512 give you a better performance boost than 2 more cores / 4 more threads? And more programs can use additional cores/threads than can use AVX-512.
