AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake


  • #11
    Originally posted by ddriver View Post
    I don't think it is that. The thing is, so far SIMD units have been fairly general purpose. Intel is cramming a bunch of highly purpose-specific stuff into AVX-512. It is not a problem with the width of execution or power efficiency, but with the support hell of continually introducing new niche-use instructions and having no instruction set and feature uniformity between platforms.
    I disagree. Linus probably doesn't give a crap about the ugliness of the instruction set, etc. He is concerned with what the processor context looks like and how easy/hard it is to keep track of and save/restore on context switches. Just as with FPU state in the early days, a lot of work has to go into tracking whether the FPU context needs to be saved/restored on a context switch. Adding even more *huge* registers to this for the AVX-512 extension doesn't sit well with OS people. Context switches are already horribly expensive.
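
    To put rough numbers on "huge", here is a back-of-envelope sketch (mine, not from the thread) of the architectural vector register state a kernel has to juggle per thread; real XSAVE save areas are somewhat larger because of headers and other state components:

    # Approximate per-thread vector register state, in bytes
    sse    = 16 * 128 // 8            # 16 x 128-bit XMM registers ->  256
    avx2   = 16 * 256 // 8            # 16 x 256-bit YMM registers ->  512
    avx512 = 32 * 512 // 8 + 8 * 8    # 32 x 512-bit ZMM + 8 x 64-bit opmasks -> 2112

    print(sse, avx2, avx512)          # AVX-512 state is roughly 4x the AVX2 state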



    • #12
      Originally posted by ddriver View Post

      Intel appears to have given up on improving general-purpose performance and is doing a lot of work to boost purpose-specific tasks, both in hardware and on the software front. This also explains why their CPUs show disproportionately big gains in some corner cases even while being more or less stuck in general.
      Intel's AVX-512 extensions come from the server chip world. They've targeted networking and AI processing applications with the projection that both will become ubiquitous. They have big customers that want this, for example Facebook wanting AVX-512 bfloat16 operations for training in its Zion platform.

      I read recently that the Rocket Lake implementation combines two AVX2 FMA units when running AVX-512 operations. If so, does this explain the small gains between the AVX2 and AVX-512 configurations?

      The Ice Lake Server chips are documented as having two AVX-512 FMA units, even on the 8-core products. It would be interesting to compare those versus Rocket Lake on the AI tests to see which benchmarks can benefit.
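
      A quick sketch of why the fused design would cap the gains: if two 256-bit FMA pipes are combined into a single 512-bit operation, peak multiply-add throughput per cycle is unchanged. The unit counts below are the commonly reported port configurations, not something I have verified on silicon:

      # Peak FP64 FLOP/cycle = FMA_units * (vector_bits / 64) * 2 (mul + add)
      def peak_fp64_flops_per_cycle(fma_units, vector_bits):
          return fma_units * (vector_bits // 64) * 2

      print(peak_fp64_flops_per_cycle(2, 256))  # AVX2, two 256-bit pipes      -> 16
      print(peak_fp64_flops_per_cycle(1, 512))  # AVX-512, pipes fused         -> 16
      print(peak_fp64_flops_per_cycle(2, 512))  # Ice Lake Server, two 512-bit -> 32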



      • #13
        Originally posted by lucasbekker View Post
        AVX-512 is primarily aimed at software that has to perform a LOT of similar mathematical operations on large amounts of data. These kinds of programs mostly fall into two categories:
        - Math libraries like OpenBLAS or MKL, which use intrinsics
        - Custom math kernels, mostly written in CUDA or other SPMD compilers like Intel® Implicit SPMD Program Compiler (ispc.github.io)

        These benchmarks only show that the software being compiled does not fall into either of these categories (except for some of the mining stuff, probably hacky custom kernels...)
        This is the oft-cited answer for what AVX-512 is useful for, yet nobody has any examples of this usage benefiting from AVX-512. Do you have any real-world examples? Actual software products that use it, and specific workloads where there is a demonstrated benefit? These always seem to be missing from these AVX-512 discussions. IMO the benefit of AVX-512 seems far more theoretical than practical at this point.

        Originally posted by lucasbekker View Post
        It is unfortunate that AVX-512 is getting a bad reputation because of these kinds of benchmarks, because if you are making use of AVX-512 in the intended way, the performance benefits can be HUGE.
        What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?

        Edit: AVX-512 feels like the CPU instruction equivalent of an herbal supplement, with promises of increased vitality, improved clarity, and stronger constitution. Not FDA approved. Not intended to treat or cure any disease. Consult your doctor before taking. Results not guaranteed. Advertisement contains paid actors. Batteries not included. Void where prohibited. Not for sale in ME, TX, CA, NY, or NJ.
        Last edited by torsionbar28; 07 April 2021, 02:02 PM.



        • #14
          Originally posted by torsionbar28 View Post
          IMO the benefit of AVX-512 seems far more theoretical than practical at this point. [...] What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?
          There are a couple of highly specialized applications where AVX-512 is already used: we have seen Garlicoin here, and AnandTech has some other benchmarks, such as NAMD. So there are benefits when it is properly used. I would partially agree that at this point home users aren't affected that much. But developers need more time to make use of these instructions, and adoption of AVX-512 is still slow because next to no processor with any meaningful market penetration supports it yet. AVX2 also took a long time to become important; history might repeat itself with AVX-512. With the new x86-64 feature levels, AVX-512 support is a requirement for v4, so you will need it sooner or later if you want to be compatible with the latest feature level.

          As far as I have read, AVX-512 is also designed to allow for more general-purpose usage (scatter and gather instructions) than former vector extensions. I suggest this blog post from Matt Pharr (who wrote ISPC) on the benefits: https://pharr.org/matt/blog/2018/04/...s-and-ooo.html (and the follow-ups)
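
          A minimal sketch (my own, Linux-only) for checking whether a CPU reports the AVX-512 flags that the x86-64-v4 feature level requires, read straight from /proc/cpuinfo:

          # AVX-512 subset required by the x86-64-v4 feature level
          V4_AVX512 = {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}

          # The "flags" line of /proc/cpuinfo lists every feature the CPU reports
          cpu_flags = set()
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("flags"):
                      cpu_flags = set(line.split(":", 1)[1].split())
                      break

          missing = V4_AVX512 - cpu_flags
          print("v4 AVX-512 subset present" if not missing else "missing: %s" % sorted(missing))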
          Last edited by ms178; 07 April 2021, 02:23 PM.



          • #15
            Seems the benefit of AVX in general is too small. Really, really small. Unless the software is specifically written for it.



            • #16
              Except in some minor cases, AVX-512 appears to be better in almost all of the comparisons. In some cases AVX-512 reaches significantly higher rates.



              • #17
                Originally posted by Azrael5 View Post
                Except in some minor cases, AVX-512 appears to be better in almost all of the comparisons. In some cases AVX-512 reaches significantly higher rates.
                Until you look at efficiency, and then it's horrible.



                • #18
                  Originally posted by TemplarGR View Post
                  Seems the benefit of AVX in general is too small. Really, really small. Unless the software is specifically written for it.
                  I get 3-4x speedups using AVX2-enabled MKL versus scalar BLAS on matrix tasks such as eigenvector calculation or matrix multiplication. These calculations are increasingly common across consumer workloads, especially for creators, which is basically where the desktop is going.

                  My Python 2 numpy is linked against a scalar BLAS, whereas my Python 3 numpy uses an AVX2-enabled BLAS:

                  $ python2
                  Python 2.7.18 (default, Mar 8 2021, 13:02:45)
                  [GCC 9.3.0] on linux2
                  Type "help", "copyright", "credits" or "license" for more information.
                  >>> import datetime as dt
                  >>> import numpy as np
                  >>> def calc():
                  ...     nn = dt.datetime.utcnow()
                  ...     xx = np.random.rand(1000, 1000)
                  ...     yy = np.linalg.eig(xx)
                  ...     print(dt.datetime.utcnow() - nn)
                  ...
                  >>> calc()
                  0:00:04.335998

                  $ python3
                  Python 3.8.5 (default, Jan 27 2021, 15:41:15)
                  [GCC 9.3.0] on linux
                  Type "help", "copyright", "credits" or "license" for more information.
                  >>> import datetime as dt
                  >>> import numpy as np
                  >>> def calc():
                  ...     nn = dt.datetime.utcnow()
                  ...     xx = np.random.rand(1000, 1000)
                  ...     yy = np.linalg.eig(xx)
                  ...     print(dt.datetime.utcnow() - nn)
                  ...
                  >>> calc()
                  0:00:01.299623


                  This is on a Ryzen 2700X. Intel AVX is even faster.
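
                  To verify which BLAS backend a given numpy build actually links against, numpy can print its build configuration (the exact output fields vary across numpy versions):

                  import numpy as np

                  # Lists the BLAS/LAPACK libraries numpy was built with (e.g. openblas, mkl)
                  np.show_config()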
                  Last edited by vegabook; 07 April 2021, 04:01 PM.



                  • #19
                    Originally posted by schmidtbag View Post
                    I'm beginning to understand more clearly why Linus really doesn't like AVX-512.
                    Originally posted by ddriver View Post

                    I don't think it is that. The thing is, so far SIMD units have been fairly general purpose. Intel is cramming a bunch of highly purpose-specific stuff into AVX-512. It is not a problem with the width of execution or power efficiency, but with the support hell of continually introducing new niche-use instructions and having no instruction set and feature uniformity between platforms.

                    Note that it is just coincidental that Intel is adding all those custom instructions to AVX-512; the same could be introduced on 256-bit units, which I think is how AMD will initially support some of the more useful and general-purpose AVX instructions in its upcoming platforms, similar to how it used 2x128 for 256-bit AVX initially.
                    LLVM was initially intended to make it easier to use dynamic recompilation technologies. Unfortunately, LLVM's C/C++ compiler clang does not support recompilation if a binary statically compiled for the worst case (that is: the base AMD64 instruction set) is run on an AVX-512 capable CPU. Linus is against dynamic recompilation - I have never seen any kind of support for dynamic recompilation in his posts, and he has been critical of such technologies. It is therefore not surprising that Linus is highly critical of AVX-512, or of any kind of non-standard special-purpose instruction set that suggests JIT code generation for C/C++/etc. code.

                    Originally posted by ddriver View Post
                    Intel appears to have given up on improving general-purpose performance and is doing a lot of work to boost purpose-specific tasks, both in hardware and on the software front. This also explains why their CPUs show disproportionately big gains in some corner cases even while being more or less stuck in general.
                    It is hard to design a CPU that can predict multiple branches per cycle, but once a CPU can predict 2 branches per clock it is relatively easy (although still quite hard) to extend the design to predict 3-4 branches per clock. The number of branches per clock is related to the number of loads and stores per clock - it is pointless to combine a branch prediction unit capable of 3-4 branches per cycle with a backend that can do only 1 store per cycle. I wonder how many loads and stores per clock Zen 4 will be able to perform, assuming it is going to be 20% faster per clock than Zen 3, which is 20% faster than Zen 2. If AMD adds AVX-512 to Zen 4, then the probability of Zen 4 having more load/store units than Zen 3 is lower, because AVX-512 support requires a large number of transistors even if implemented as 2x256.



                    • #20
                      Originally posted by vegabook View Post
                      This is on a Ryzen 2700X. Intel AVX is even faster.
                      Intel AVX is not faster. It's really not. Intel MKL by default does not use AVX on AMD CPUs; it falls back to something like SSE2 or SSE4, which is quite slow.

                      You need to binary-patch the Intel MKL binaries to bypass this rather arbitrary limitation. See:
                      - https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
                      - https://www.extremetech.com/computin...eadripper-cpus
                      - http://www.swallowtail.org/naughty-intel.shtml
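
                      For older MKL releases there was also a simpler workaround via an undocumented environment variable (reportedly removed from MKL 2020 update 1 onwards, so treat this as a sketch that may not work with your build):

                      import os

                      # Undocumented override: make MKL take the AVX2 code path regardless of
                      # the CPU vendor check. Must be set before MKL is loaded, i.e. before
                      # importing numpy. Removed in newer MKL releases.
                      os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

                      import numpy as np
                      np.show_config()  # confirm MKL is the BLAS backend, then benchmark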

