
Skylake AVX-512 Benchmarks With GCC 8.0

  • #11
    Originally posted by Spazturtle View Post

    Intel CPUs will downclock when using AVX-512 in 256-bit mode or AVX2, but those only take a single CPU cycle as all new Intel CPUs are 256 bits wide; they still downclock a little with AVX1.
    Yes, but the downclocking for AVX1/2 is significantly smaller than for AVX-512 (~10% vs ~30%), right?

    AVX-512 is very expensive and to me doesn't seem worth it. Dedicating 25% of your CPU core to just AVX-512's baseline instruction sets doesn't seem worth it, and supporting more of the extended AVX-512 instruction sets would take up even more space. On a 6-core Skylake CPU, if you remove the AVX-512 part of the cores you would have space on the die for 2 additional cores (4 more threads). Does AVX-512 give you a better performance boost than 2 more cores / 4 more threads? And more programs can use additional cores/threads than can use AVX-512.
    I do not disagree. But I would like the extra instructions, at least in 128- or 256-bit mode.



    • #12
      Originally posted by chuckula View Post
      Getting a compiler that can handle AVX-512 is a necessary step to using AVX-512, but unfortunately it doesn't rewrite your code so that it's actually using AVX-512 properly.
      This.

      First, GCC needs pragmas (or attributes, etc.) so you can give it hints to aid in loop unrolling/vectorization and software pipelining. Then, people need to actually use them. Finally, you might get some decent compiler-generated SIMD.



      • #13
        Originally posted by Spazturtle View Post
        AVX-512 just seems too complicated and takes up far too much die space to be a useful instruction set.
        Agreed. IMO, it's mostly about Intel having GPU-envy*.

        The SSE family of extensions was great. AVX1/2 was even better. But Intel doesn't know when to stop adding more of a good thing. They seem desperate to find ways to breathe more life into their moribund x86 ISA and are running out of ideas.

        * Note: meanwhile, their HD Graphics GPUs are limited to 128-bit SIMD. Go figure.



        • #14
          Intel is pushing stuff into CPUs that is much better off running on a GPU. But it is understandable, as they are trying to cut into the compute market, which is moving more and more to GPUs. Finally something like AMD's HSA is starting to make sense. That was essentially also the reason why Bulldozer sucked: they were trying to offload all the compute the Bulldozer cores were bad at to the GPU part, but somehow there was no push for that at the time, and thus they failed miserably with the CPU alone.



          • #15
            I also think Intel have made a mistake with AVX-512. The massive down-clocking it causes means most servers and workstations perform much slower, because pretty much no one runs only software that benefits from AVX-512 (typically just encryption in the average server stack, IMHO). Yes, in theory the modern Intel CPU is supposed to be able to ramp its clock up and down pretty quickly (and run different clock rates per core under load), but reality shows pretty big hits on machines with encrypted front ends, because 95% of the time is spent in other parts that don't use AVX.

            Even if those parts did, I doubt the normal workload of a web server application (be it PHP, Python or Go), which tends to be mostly about storing, sorting, aggregating and generating formatted text, would ever really benefit from AVX. I think AMD understands this and made design choices based on reality. The real issue I see with AVX-512 right now is that any of your libraries could start using it in an update, and suddenly you are left confused as to why your overall performance has plummeted.

            This makes me think we need to see some multi-tasking/app-stack tests on Phoronix. I'd be interested in a Nextcloud test composed of Nginx/Apache + OpenSSL + Redis + PHP-FPM + MariaDB, benchmarked on Intel Xeon versus AMD Epyc. Based on Cloudflare's experiments, it would be interesting to see how simply adjusting OpenSSL to use AVX-512 affects the results on Intel chips.



            • #16
              Just to make AVX-512 seem even worse: do power consumption and efficiency measurements too.

              AVX-512 isn't very great, IMHO.



              • #17
                In defence of AVX-512: Intel has ever so slowly evolved their SIMD extensions towards a "real vector ISA", which is a significantly better target for vectorizing compilers than packed SIMD. With AVX2 we got gather; now with AVX-512 we finally have scatter and predication. The major things still lacking, AFAICS, are a vector length register (so the compiler doesn't have to generate a separate tail loop) and strided loads/stores.

                Though I guess it will take a while for compilers to utilize these features, so it'll be some time before we see their full performance benefit.

                It would be nice if Intel would design a vector ISA with a variable vector width, similar to ARM SVE or the RISC-V vector extension. That way the ISA could at least be available everywhere, and how much die area to dedicate to the vector engine could then be decided separately per SKU, depending on the target market.

                As for clocking down when using AVX-512: yeah, that can be a problem. IIRC, glibc had to revert patches that made memcpy/memset use AVX-512 for exactly this reason, as it turned out that in code which wasn't otherwise using AVX-512 this was a slowdown compared to the previous AVX2 routines.
                Last edited by jabl; 11-29-2017, 10:34 AM.



                • #18
                  Originally posted by jabl View Post
                  In defence of AVX-512: Intel has ever so slowly evolved their SIMD extensions towards a "real vector ISA", which is a significantly better target for vectorizing compilers than packed SIMD. With AVX2 we got gather; now with AVX-512 we finally have scatter and predication. The major things still lacking, AFAICS, are a vector length register (so the compiler doesn't have to generate a separate tail loop) and strided loads/stores.
                  I managed to do the tails efficiently with AVX2 already.

                  The key is that _mm256_maskstore_epi32 and _mm256_maskload_epi32 are now fast, while they were hideously slow in SSE. You can then build a mask with _mm256_add_epi32(offsetMask, _mm256_set1_epi32(x - length)) (a constant load and three instructions; offsetMask is {0, 1, 2, 3, 4, 5, 6, 7}). Any elements past length will then be masked out of the load or store.
                  Last edited by carewolf; 11-29-2017, 10:57 AM.



                  • #19
                    Originally posted by partizann View Post
                    Intel is pushing stuff into CPUs that is much better off running on a GPU. But it is understandable, as they are trying to cut into the compute market, which is moving more and more to GPUs. Finally something like AMD's HSA is starting to make sense.
                    On the heterogeneous computing side, I'm curious about the proposed merger of Vulkan and OpenCL...



                    • #20
                      Originally posted by boboviz View Post
                      On the heterogeneous computing side, I'm curious about the proposed merger of Vulkan and OpenCL...
                      HSA has all the architectural support needed to target your GPU with standard C++ code. I think integrating OpenCL's compute capabilities into Vulkan would be a worthy improvement, but I don't see it accomplishing quite what SYCL can offer:

                      https://www.khronos.org/sycl

