GCC Lands AVX-512 Fully-Masked Vectorization


  • coder
    replied
    Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post
    I did find the disparity in the community's reaction pretty silly.
    It makes a lot of sense, once you dig into the details of how problematic AVX-512 was on Intel's 14 nm products.

    Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post
    Intel introduces AVX-512 in 2017: Boo! Hiss! Stop with the magic function garbage!

    AMD adds AVX-512 in 2022: Yay! Amaze balls!

    Yes AMD's first implementation was better than Intel's first implementation, but it damn sure better be half a decade after their competitor did it and on a TSMC 5nm node vs Intel 14nm.
    That's exactly it. Intel simply jumped onto AVX-512 before the underlying process technology was ready. The result was often deleterious clock-throttling on the server cores, and previously unseen TDPs on the 14 nm desktop CPUs that had it.

    Not to mention issues like the need for vzeroupper when mixing VEX-encoded AVX and legacy SSE code.
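
    A minimal sketch of that pitfall with hand-written intrinsics (my own illustration, not tied to any particular codebase):

        #include <immintrin.h>
        #include <stddef.h>

        /* After 256-bit AVX code has run, legacy-SSE code can hit large transition
         * penalties on older Intel cores unless the upper halves of the YMM
         * registers are cleared first. Compilers normally emit vzeroupper for you;
         * with intrinsics or inline asm it is easy to forget. */
        void scale(float *dst, const float *src, size_t n, float k)
        {
            __m256 vk = _mm256_set1_ps(k);
            size_t i = 0;
            for (; i + 8 <= n; i += 8)
                _mm256_storeu_ps(dst + i, _mm256_mul_ps(_mm256_loadu_ps(src + i), vk));

            _mm256_zeroupper();   /* emits vzeroupper before any SSE-only code runs */

            for (; i < n; i++)    /* scalar tail, free of SSE/AVX transition stalls */
                dst[i] = src[i] * k;
        }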

    AVX2 got some bad press when it launched in Haswell CPUs, due to clock-throttling issues. AVX-512 was like that, only much worse. There's nothing inconsistent here.

    Another legit complaint about AVX-512 was the degree of fragmentation between Intel's different product lines. By the time Zen 4 implemented it, AMD was able to support virtually all of its facets.

    BTW, I think Linus way overestimated how much die space AVX-512 uses.

    It did noticeably bloat the context, with the vector registers occupying 4x the footprint relative to AVX2 (Intel doubled both the number and size of the vector ISA registers, going from 16 * 32 B = 512 B to 32 * 64 B = 2 KiB). As cache sizes and memory bandwidth both increase, that's less of an issue.

    I think he had a legit point about people using it for simple things like string ops and triggering clock-throttling that made it a net loss for application-level performance. That's not true of Sapphire Rapids or Zen 4, but it was certainly a legit concern on 14 nm Xeons, as demonstrated here:

    "the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less then 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Irregardless the CPU throttles down, because that what it does when it sees AVX-512 running on all cores."

    https://blog.cloudflare.com/on-the-d...uency-scaling/


    Using just a tiny bit of AVX-512 was enough to trigger heavy clock-throttling that slowed down the rest of the application code by much more than the AVX-512 was able to accelerate the portion where it was used.
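
    To put rough, illustrative numbers on it (mine, not Cloudflare's): suppose the AVX-512 kernel runs its 2.5% slice of the work 1.3x faster, but all-core AVX-512 drops the clock ~10% for everything. Relative runtime goes from 1.0 to 0.975/0.9 + 0.025/(0.9 * 1.3) ≈ 1.10, i.e. the whole application ends up ~10% slower despite the "accelerated" crypto.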

    So, yeah. If you introduce a feature before its time, and the implementation comes with so many caveats and pitfalls, it's only natural that you catch some blowback!
    Last edited by coder; 21 June 2023, 06:55 PM.



  • max0x7ba
    replied
    Originally posted by qarium View Post
    That's right... AVX-512 was Intel's attempt to stay relevant in the GPU age... this was before Intel started the Arc GPUs...

    Compared to CUDA or AMD ROCm/HIP, the AVX-512 way has no real usage in the world.
    Intel MKL utilizes AVX-512. Intel MKL is used directly or through BLAS/LAPACK/OpenMP APIs by virtually all scientific software, including PyTorch, TensorFlow, NumPy, SciPy, R, Matlab, etc.
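
    A sketch of what that looks like at the bottom of those stacks (illustrative only; assumes a CBLAS header and linking against MKL): a plain dgemm call like this is what NumPy, SciPy, R, etc. ultimately issue, and the library picks AVX-512 kernels at runtime when the CPU supports them.

        #include <cblas.h>

        /* C = A * B for n x n row-major matrices; which SIMD path runs is
         * entirely up to the BLAS implementation dispatching underneath. */
        void matmul(const double *A, const double *B, double *C, int n)
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n,
                        1.0, A, n,
                        B, n,
                        0.0, C, n);
        }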

    AMD ROCm/HIP is something few people want to waste time having to deal with.
    Last edited by max0x7ba; 20 June 2023, 09:47 PM.



  • pWe00Iri3e7Z9lHOX2Qx
    replied
    Originally posted by Anux View Post
    You left out one important fact: by using AVX their whole CPU was throttled and everything ran slower, which rendered their implementation useless for mixed workloads.

    The problem was not node-specific; it was just a bad implementation.
    No, that is precisely what I was referring to in the part of my post that you didn't quote.

    "Yes AMD's first implementation was better than Intel's first implementation, but it damn sure better be half a decade after their competitor did it and on a TSMC 5nm node vs Intel 14nm."

    Intel's first implementation 6 years ago left a lot to be desired. The shortcomings in clock speeds / power consumption / heat have improved over the generations since Skylake, to the point where it's a non-issue on Sapphire Rapids.



  • brucethemoose
    replied
    Originally posted by marlock View Post
    probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
    - IGPs are very inflexible. There is a lot of initialization overhead. They don't support all the formats a CPU does.

    - There is no guarantee of IGP availability or a properly functioning driver, while AVX2 (and even 512) is easy to check and pretty much universally supported (see the sketch at the end of this post).


    But the *real* kicker is that you need some kind of compute/graphics API with zero-copy support to do *anything* but completely standalone functions on an IGP. OpenCL supports this... but I've never even seen it used out in the wild, even where it would massively help. Apparently Vulkan supports this, and it's implemented in the Tencent NCNN framework, but I have never seen it used by an actual application either. CUDA may support this on Tegra boards, but I know precisely nothing about Tegra land.
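
    On the "easy to check" point, a minimal sketch (my example) of the kind of runtime dispatch that needs no driver, API or context setup at all:

        #include <stdio.h>

        int main(void)
        {
            /* GCC/Clang built-ins; no iGPU probing, no driver dependency */
            if (__builtin_cpu_supports("avx512f"))
                puts("using AVX-512 path");
            else if (__builtin_cpu_supports("avx2"))
                puts("using AVX2 path");
            else
                puts("using scalar path");
            return 0;
        }
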
    Last edited by brucethemoose; 20 June 2023, 02:21 PM.



  • WorBlux
    replied
    Originally posted by marlock View Post
    probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
    Deal with arbitrary control flow, execute out of order, and find vectorization opportunities in normal system code.
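
    To tie that back to the article: a loop like the sketch below (my example) is "normal system code" with a data-dependent branch, and with the new fully-masked vectorization GCC can keep it entirely in AVX-512, using mask registers both for the branch and for the leftover iterations instead of emitting a scalar tail (e.g. with -O3 and -march=znver4 or -march=sapphirerapids; exact codegen depends on the GCC version and tuning).

        #include <stddef.h>

        /* The branch inside the loop gets predicated with a mask rather than
         * branching per element; fully-masked vectorization also covers the
         * final partial vector when n isn't a multiple of the vector width. */
        void clamp_add(float *dst, const float *src, size_t n, float limit)
        {
            for (size_t i = 0; i < n; i++) {
                if (src[i] > limit)
                    dst[i] += limit;
                else
                    dst[i] += src[i];
            }
        }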



  • carewolf
    replied
    Originally posted by marlock View Post
    probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
    Mix instructions. CPUs are worse at big blocks of data that all need to be processed the same way, but better when the work comes in small blocks and takes branches. If you just need to do the same work on a few hundred or a few thousand data elements, the cost of setting up a GPU processing chain often isn't worth it. With several million elements of data, the GPUs start winning big.
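
    For a rough sense of scale (illustrative numbers, not measurements): if a GPU dispatch costs on the order of tens of microseconds in launch and transfer overhead, while a vectorized CPU loop chews through a few thousand floats in about a microsecond, the GPU can't win at that size no matter how fast its ALUs are; push the problem to millions of elements and that fixed overhead amortizes away.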



  • marlock
    replied
    Unfortunately not all CPUs are APUs, not really...

    e.g. only a Ryzen 5 5600G is an APU, while a Ryzen 5 5600X has no iGPU, and because of that it uses the extra room in the silicon mainly to double the L3 cache and offers up to PCIe 4.0 instead of PCIe 3.0 to external slots (presumably leveraging the lanes that aren't needed inside the chip to connect an iGPU segment to the CPU)

    So as much as "just use the iGPU" sounds great, it can't always be done even when it makes perfect sense

    And sure, dedicated GPUs can demolish and put to shame CPU matrices in tasks where a GPU is an adequate solution, but that's a thoroughly unfair max horsepower comparison...

    ... whereas I suspect there is more to it, so I'm trying to figure out if there is any task where a CPU matrix is relevant but a GPU isn't... e.g. "a GPU can't do this and a matrix can" or "a matrix inside the CPU itself is better at this specific task because it shares the L3 cache with the usual CPU instructions, so bandwidth bla bla"... or something along those lines

    That's why I specifically mention iGPUs vs. CPU matrices, not dGPUs... an iGPU is on the same chip, shares the same RAM, maybe has the same level of access to L3/L2/L1 as a CPU matrix (but AFAIK not)... it's a bunch fewer corner cases to consider and unfair comparisons to avoid... but still a very different beast from a CPU matrix instruction AFAIK

    PS: Intel does ship a LOT of CPUs with iGPUs too, so matrices being "Intel's attempt to stay relevant in the GPU age" doesn't really make sense when they ship both in the same system instead

    Oh, and also there might be processing loads where the power efficiency of running them through a CPU matrix is better, even if the maximum speed to get through the entire task is lower... here again a dGPU has a separate thermal budget, whereas an iGPU is at least partially tied to the same thermal budget and power-draw limits as the CPU, so if a matrix exists in the same chip as an iGPU the sensible choice isn't automatically throwing everything at the GPU whenever possible...
    Last edited by marlock; 19 June 2023, 11:33 PM.



  • qarium
    replied
    Originally posted by marlock View Post
    probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?
    That's right... AVX-512 was Intel's attempt to stay relevant in the GPU age... this was before Intel started the Arc GPUs...

    Compared to CUDA or AMD ROCm/HIP, the AVX-512 way has no real usage in the world.

    All modern supercomputers are like the all-AMD Frontier, with EPYC CPUs and CDNA compute GPUs...

    Also, all the modern CPUs are APUs... the Apple M2 Ultra SoC is an APU... the AMD Ryzen 7000 are all APUs, and so on and so on.

    If you listen carefully in the forum, people successfully run Blender and AI workloads on the Vega 64 and the result is 5-6 times faster than their 8-core Ryzen 2000...

    This means my 5-6 year old Vega 64 with HIP is faster than the newest "AMD Ryzen 7950X3D" for Blender and AI workloads.




  • marlock
    replied
    probably a silly question but since we're talking parallel processing, what can cpu matrixes like AVX512 and AVX2 do that an iGPU can't?



  • brucethemoose
    replied
    Originally posted by Anux View Post
    You left out one important fact: by using AVX their whole CPU was throttled and everything ran slower, which rendered their implementation useless for mixed workloads.

    The problem was not node-specific; it was just a bad implementation.
    This was even an issue for AVX2 in desktop Haswell; people just didn't complain much because voltage/power was increased instead of clocks being throttled like on the early AVX-512 Xeons. That whole clocking/power scheme was (in hindsight) a mistake.

