GCC Lands AVX-512 Fully-Masked Vectorization

  • #11
    From the commit message, I think the most fun part is that the 128-bit implementation is the fastest one when the iteration count is a power of 2. Additionally, the masked-epilog version is also faster than the fully masked version. The table does highlight the bug: the 512-bit code was not very performant, often losing to the 256- and 128-bit versions.

    One of the motivating testcases is from PR108410 which in turn
    is extracted from x264 where large size vectorization shows
    issues with small trip loops. Execution time there improves
    compared to classic AVX512 with AVX2 epilogues for the cases
    of less than 32 iterations.

    size  scalar    128    256    512   512e   512f
       1    9.42  11.32   9.35  11.17  15.13  16.89
       2    5.72   6.53   6.66   6.66   7.62   8.56
       3    4.49   5.10   5.10   5.74   5.08   5.73
       4    4.10   4.33   4.29   5.21   3.79   4.25
       6    3.78   3.85   3.86   4.76   2.54   2.85
       8    3.64   1.89   3.76   4.50   1.92   2.16
      12    3.56   2.21   3.75   4.26   1.26   1.42
      16    3.36   0.83   1.06   4.16   0.95   1.07
      20    3.39   1.42   1.33   4.07   0.75   0.85
      24    3.23   0.66   1.72   4.22   0.62   0.70
      28    3.18   1.09   2.04   4.20   0.54   0.61
      32    3.16   0.47   0.41   0.41   0.47   0.53
      34    3.16   0.67   0.61   0.56   0.44   0.50
      38    3.19   0.95   0.95   0.82   0.40   0.45
      42    3.09   0.58   1.21   1.13   0.36   0.40


    'size' specifies the number of actual iterations, 512e is for
    a masked epilog and 512f for the fully masked loop. From
    4 scalar iterations on the AVX512 masked epilog code is clearly
    the winner, the fully masked variant is clearly worse and
    its size benefit is also tiny.
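
    To make the two strategies concrete, here is a rough, hypothetical C sketch of what "fully masked" versus "masked epilog" tail handling means. The lane mask is simulated with a plain inner loop and a bounds check; LANES and both function names are made up for illustration and are not GCC's actual generated code:

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* stand-in for one 512-bit vector of doubles */

/* Fully masked (the "512f" column): every vector iteration computes a
 * lane mask from the remaining trip count, so no scalar tail is needed,
 * but the mask work happens on every iteration. */
static void add_fully_masked(const double *a, const double *b,
                             double *out, size_t n)
{
    for (size_t i = 0; i < n; i += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            if (i + lane < n)               /* the simulated mask bit */
                out[i + lane] = a[i + lane] + b[i + lane];
}

/* Masked epilog (the "512e" column): the main loop runs unmasked on
 * full vectors; only the final partial vector is masked. */
static void add_masked_epilog(const double *a, const double *b,
                              double *out, size_t n)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)      /* unmasked main loop */
        for (size_t lane = 0; lane < LANES; ++lane)
            out[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i)                      /* one masked tail "vector" */
        out[i] = a[i] + b[i];
}
```

    The epilog variant pays the masking cost only once, at the tail, which is consistent with 512e beating 512f in most rows of the table above.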



    • #12
      Originally posted by phoronix View Post
      Phoronix: GCC Lands AVX-512 Fully-Masked Vectorization

      Stemming from looking at the generated x264 video encode binary and some performance inefficiencies, SUSE engineers have worked out AVX-512 fully masked vectorization support for the GCC 14 development code...

      https://www.phoronix.com/news/GCC-AV...-Masked-Vector
      Somehow, trying to apply AVX-512 to short vectors and complaining about that makes big news and strong opinions.

      Applying AVX-512 to long vectors to speed up matrix multiplications works exceptionally well; even Linus regrets complaining about AVX-512 and keeps quiet.



      • #13
        Originally posted by Anux View Post
        You left out one important fact, by using AVX their whole CPU was throttled and everything worked slower, which ruled their implementation useless for mixed workloads.

        The problem was not node specific, it was just a bad implementation.
        This was even an issue for AVX2 on desktop Haswell; people just didn't complain much because voltage/power was increased instead of throttling like on the early AVX-512 Xeons. That whole clocking/power scheme was (in hindsight) a mistake.



        • #14
          probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?



          • #15
            Originally posted by marlock View Post
            probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
            that's right... AVX-512 was Intel's attempt to stay relevant in the GPU age... this was before Intel started the Arc GPUs...

            compared to CUDA or AMD ROCm/HIP, the AVX-512 way has no real usage in the world.

            all modern supercomputers are like the all-AMD Frontier, with EPYC CPUs and CDNA compute GPUs...

            also, all the modern CPUs are APUs... the Apple M2 Ultra SoC is an APU... the AMD Ryzen 7000 parts are all APUs, and so on and so on.

            if you listen carefully in the forum, people successfully run Blender and AI workloads on the Vega 64, and the result is 5-6 times faster than their 8-core Ryzen 2000...

            this means my 5-6 year old Vega 64 with HIP is faster than the newest AMD Ryzen 7950X3D for Blender and AI workloads





            • #16
              Unfortunately not all CPUs are APUs, not really...

              e.g. only a Ryzen 5 5600G is an APU, while a Ryzen 5 5600X has no iGPU; because of that, it uses the extra room in the silicon mainly to double the L3 cache and offers up to PCIe 4.0 instead of PCIe 3.0 to external slots (presumably leveraging the extra lanes that are not used inside the chip to connect the iGPU segment to the CPU)

              So as much as "just use the iGPU" sounds great, it can't be done immediately, even if it makes perfect sense

              And sure, dedicated GPUs can demolish and put to shame CPU matrices in tasks where a GPU is an adequate solution, but that's a thoroughly unfair max horsepower comparison...

              ... whereas I suspect there is more to it, so I'm trying to figure out if there is any task where a CPU matrix is relevant but a GPU isn't... e.g. "a GPU can't do this and a matrix can" or "a matrix inside the CPU itself is better at this specific task because it shares the L3 cache with the usual CPU instructions, so bandwidth bla bla"... or something along those lines

              That's why I specifically mention iGPUs vs. CPU matrices, not dGPUs... an iGPU is on the same chip, shares the same RAM, maaaaybe has the same level of access to L3/L2/L1 as a CPU matrix (but afaik not)... it's a bunch fewer corner cases to consider and unfair comparisons to avoid... but still a very different beast from a CPU matrix instruction afaik

              PS: Intel does do a LOT of CPUs with iGPUs too, so matrices being "Intel's attempt to stay relevant in the GPU age" doesn't really make any sense when they ship both in the same system instead

              Oh, and also there might be processing loads where the power efficiency of running them through a CPU matrix is better, even if the max speed to get through the entire task is slower... in this, again, a dGPU has a separate thermal budget, whereas an iGPU is at least partially tied to the same thermal budget and power-draw limits as the CPU, so if a matrix exists in the same chip as an iGPU, the sensible choice isn't automatically throwing everything at the GPU whenever possible...
              Last edited by marlock; 19 June 2023, 11:33 PM.



              • #17
                Originally posted by marlock View Post
                probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
                Mix instructions. CPUs are worse at big blocks of data that all need to be processed the same way, but better when the work is on small blocks and takes branches. If you just need to do the same work on a few hundred or few thousand data elements, the cost of setting up a GPU processing chain often isn't worth it. At several million elements of data, the GPUs start winning big.
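
                A tiny, hypothetical C sketch of the kind of per-element "branch" a CPU vectorizer turns into a compare-plus-blend (a masked select) rather than real control flow; the function name is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Each element takes its own "branch", but written this way the
 * compiler can emit a vector compare plus a masked blend: every lane
 * evaluates the condition and a mask selects the result, so there is
 * no divergent control flow at all. */
static void clamp_negatives(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (in[i] < 0.0f) ? 0.0f : in[i];  /* compare + blend */
}
```

                For a few hundred or few thousand elements, this runs straight out of cache with zero API setup, which is exactly the regime where staying on the CPU wins.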



                • #18
                  Originally posted by marlock View Post
                  probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
                  Deal with arbitrary control flow, execute out of order, and find vectorization opportunities in normal system code.



                  • #19
                    Originally posted by marlock View Post
                    probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
                    - IGPs are very inflexible. There is lots of initialization overhead. They don't support all the formats a CPU does.

                    - There is no guarantee of IGP availability or a properly functioning driver, while AVX2 (and even AVX-512) is easy to check for and pretty much universally supported.


                    But the *real* kicker is that you need some kind of compute/graphics API with zero-copy support to do *anything* but completely standalone functions on an IGP. OpenCL supports this... but I have never even seen it used out in the wild, even where it would massively help. Apparently Vulkan supports this, and it's implemented in the Tencent NCNN framework, but I have never seen it used by an actual application either. CUDA may support this on Tegra boards, but I know precisely nothing about Tegra land.
                    Last edited by brucethemoose; 20 June 2023, 02:21 PM.
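
                    The "easy to check" part deserves one concrete line: on x86-64 with GCC or clang, the runtime check is a single builtin that reads CPUID, with no driver or API involved. A minimal sketch, assuming a GCC/clang toolchain (the helper names are made up):

```c
#include <assert.h>

/* __builtin_cpu_supports reads CPUID-derived data initialized at
 * startup and returns nonzero if the feature is present. This compiles
 * and runs on any x86-64 CPU, simply reporting 0 when the feature is
 * absent; "!= 0" normalizes the result to 0/1. */
static int have_avx2(void)    { return __builtin_cpu_supports("avx2") != 0; }
static int have_avx512f(void) { return __builtin_cpu_supports("avx512f") != 0; }
```

                    A dispatcher would take the AVX-512 path only when have_avx512f() returns 1, falling back to AVX2 or scalar code otherwise.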



                    • #20
                      Originally posted by Anux View Post
                      You left out one important fact, by using AVX their whole CPU was throttled and everything worked slower, which ruled their implementation useless for mixed workloads.

                      The problem was not node specific, it was just a bad implementation.
                      No, that is precisely what I was referring to in the part of my post that you didn't quote.

                      "Yes AMD's first implementation was better than Intel's first implementation, but it damn sure better be half a decade after their competitor did it and on a TSMC 5nm node vs Intel 14nm."

                      Intel's first implementation six years ago left a lot to be desired. The shortcomings in clock speeds / power consumption / heat have been improved over the generations since Skylake, to the point where it's a non-issue on Sapphire Rapids.
