GCC Lands AVX-512 Fully-Masked Vectorization


  • #21
    Originally posted by qarium View Post
    that's right... AVX-512 was Intel's attempt to stay relevant in the GPU age... this was before Intel started the Arc GPUs...

    compared to CUDA or AMD ROCm/HIP, the AVX-512 way has no real usage in the world.
    Intel MKL utilizes AVX-512. Intel MKL is used directly or through BLAS/LAPACK/OpenMP APIs by virtually all scientific software, including PyTorch, TensorFlow, NumPy, SciPy, R, Matlab, etc. (A minimal sketch of that call path is shown below.)

    AMD ROCm/HIP is something few people want to waste time having to deal with.
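    As a minimal sketch of that MKL/BLAS path (assuming an MKL or other CBLAS installation; the exact header and link line vary by vendor): a plain cblas_dgemm call is all that NumPy, SciPy & co. ultimately issue, and MKL dispatches it at runtime to AVX-512 kernels on CPUs that support them.

      /* Build (MKL example):      gcc dgemm_demo.c -lmkl_rt
         With OpenBLAS instead:    gcc dgemm_demo.c -lopenblas */
      #include <stdio.h>
      #include <cblas.h>   /* mkl_cblas.h when using MKL's own headers */

      int main(void) {
          enum { N = 512 };
          static double a[N * N], b[N * N], c[N * N];
          for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; }

          /* C = A * B; the same GEMM call every NumPy matmul ends up in. */
          cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                      N, N, N, 1.0, a, N, b, N, 0.0, c, N);

          printf("c[0] = %.1f\n", c[0]);   /* 512 * (1.0 * 2.0) = 1024.0 */
          return 0;
      }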
    Last edited by max0x7ba; 20 June 2023, 09:47 PM.



    • #22
      Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post
      I did find the disparity in the community's reaction pretty silly.
      It makes a lot of sense, once you dig into the details of how problematic AVX-512 was on Intel's 14 nm products.

      Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post
      Intel introduces AVX-512 in 2017: Boo! Hiss! Stop with the magic function garbage!

      AMD adds AVX-512 in 2022: Yay! Amaze balls!

      Yes, AMD's first implementation was better than Intel's first implementation, but it damn sure better be, half a decade after their competitor did it and on a TSMC 5 nm node vs Intel 14 nm.
      That's exactly it. Intel simply jumped onto AVX-512 before the underlying process technology was ready. The result was often deleterious clock-throttling on server cores and previously unseen power draw on the 14 nm desktop CPUs that had it.

      Not to mention issues like the need for vzeroupper:
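      A minimal sketch of the AVX/SSE transition issue behind vzeroupper (AVX2 intrinsics chosen just for illustration): compilers normally emit the instruction automatically at function boundaries, but hand-written kernels, or builds with -mno-vzeroupper, have to do it themselves.

        /* Compile with: gcc -O2 -mavx2 */
        #include <immintrin.h>

        void add8(const float *a, const float *b, float *out) {
            __m256 va = _mm256_loadu_ps(a);
            __m256 vb = _mm256_loadu_ps(b);
            _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
            /* Clear the dirty upper halves of the YMM registers before any
               legacy-SSE code runs, avoiding the costly transition penalty. */
            _mm256_zeroupper();
        }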

      AVX2 got some bad press when it launched in Haswell CPUs, due to clock-throttling issues. AVX-512 was like that, only much worse. There's nothing inconsistent here.

      Another legit complaint about AVX-512 was the degree of fragmentation between Intel's different product lines. By the time Zen 4 implemented it, AMD was able to support virtually all of its facets.

      BTW, I think Linus way overestimated how much die space AVX-512 uses.

      It did noticeably bloat the context, with the vector registers occupying 4x the footprint of AVX2 (Intel doubled both the number and the size of the architectural vector registers, resulting in 32 x 64 B = 2 KiB). As cache sizes and memory bandwidth both increase, that's less of an issue.
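      A quick way to see that context growth on an actual machine, assuming GCC/Clang's <cpuid.h>: CPUID leaf 0xD reports how large an XSAVE area the currently enabled state components need.

        /* Prints the XSAVE area size for the state components enabled in XCR0. */
        #include <stdio.h>
        #include <cpuid.h>

        int main(void) {
            unsigned eax, ebx, ecx, edx;
            if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx)) {
                puts("CPUID leaf 0xD not supported");
                return 1;
            }
            /* EBX = bytes required for the currently enabled components.
               Vector registers alone: 16 x 16 B (SSE), 16 x 32 B (AVX2),
               32 x 64 B + 8 x 8 B mask registers (AVX-512). */
            printf("XSAVE area currently required: %u bytes\n", ebx);
            return 0;
        }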

      I think he had a legit point about people using it for simple things like string ops: triggering clock-throttling there makes it a net loss for application-level performance. That's not true of Sapphire Rapids or Zen 4, but it was certainly a legit concern on 14 nm Xeons, as demonstrated here:

      "the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less then 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Irregardless the CPU throttles down, because that what it does when it sees AVX-512 running on all cores."

      https://blog.cloudflare.com/on-the-d...uency-scaling/


      Using just a tiny bit of AVX-512 was enough to trigger heavy clock-throttling that slowed down the rest of the application code by much more than the AVX-512 was able to accelerate the portion where it was used.

      So, yeah. If you introduce a feature before its time, and the implementation comes with so many caveats and pitfalls, it's only natural that you catch some blowback!
      Last edited by coder; 21 June 2023, 06:55 PM.



      • #23
        Originally posted by user556 View Post
        Intel's lacklustre efforts to remedy the power consumption problem probably have more than a little to do with them betting on the E-cores. If those didn't exist, then the P-cores would be better for it, IMHO.
        The P-cores might be more efficient, but that would certainly come at the cost of some performance.

        However, I think your analysis is flawed. I believe Intel came to the idea of a hybrid CPU somewhat late, after Golden Cove was very far along and they could see how far short of its efficiency targets the Intel 7 (10 nm ESF) node had fallen. It was at that point they accepted the need to go hybrid if they were to have any chance of competing on heavily-threaded workloads. This late decision-making explains their muddled messaging and approach on AVX-512 support in Alder Lake.



        • #24
          Originally posted by raystriker View Post
          We don't know if it's a node thing though?
          Yes, it was a node thing. Intel's AVX-512 implementation hasn't markedly changed (aside from adding more instructions), yet their more recent implementations don't exhibit the same power problems seen in their 14 nm CPUs.

          Furthermore, Rocket Lake virtually proves it was node and not microarchitecture, because Rocket Lake was a 14 nm backport of their Ice Lake core. AVX-512 works a lot better on Ice Lake, but Rocket Lake exhibited all the high power utilization problems seen on other 14 nm CPUs that supported it.

          Originally posted by brucethemoose View Post
          This was even an issue for AVX2 in desktop Haswell; people just didn't complain much because voltage/power was increased instead of throttling like on the early AVX-512 Xeons. That whole clocking/power scheme was (in hindsight) a mistake.
          They had no choice. Server CPUs can't just stray outside their power envelope the way desktop CPUs can get away with doing. About the only thing they could've done was to substantially increase the TDP rating of their Xeons, but that would've meant servers with overbuilt power & cooling for non-AVX-512 workloads. And don't forget: when Skylake-SP first launched, AVX-512 workloads were almost nonexistent. So, you'd be overbuilding servers (i.e. making them more expensive) for virtually no benefit.

          The other thing they could've done is not implement dual FMA units, but that would've significantly limited the performance gains of AVX-512 in key workloads.
          Last edited by coder; 21 June 2023, 09:56 PM.



          • #25
            Originally posted by coder View Post
            Yes, it was a node thing. Intel's AVX-512 implementation hasn't markedly changed (aside from adding more instructions), yet their more recent implementations don't exhibit the same power problems seen in their 14 nm CPUs.
            At some level, virtually everything is a node thing. Throw enough transistors at whatever issue you're having and it tends to go away.

            Smart chip manufacturers know what their node limitations are and design a chip with those in mind, rather than jumping the gun and adding features that can't be implemented very well. It's tough to change the first impression you give to people.



            • #26
              Originally posted by smitty3268 View Post
              At some level, virtually everything is a node thing. Throw enough transistors at whatever issue you're having and it tends to go away.
              It's not about transistor count, but circuit complexity vs. frequency. Smaller nodes tend to come with not only density but also efficiency benefits. That's primarily where 14 nm was lacking -- on the efficiency front.



              • #27
                Originally posted by coder View Post
                It's not about transistor count, but circuit complexity vs. frequency. Smaller nodes tend to come with not only density but also efficiency benefits. That's primarily where 14 nm was lacking -- on the efficiency front.
                I'm aware, but my comment stands. And there are generally ways to work around such things if you have an unlimited budget. You just have to design things differently, and the downside is it costs a lot more.



                • #28
                  Originally posted by smitty3268 View Post
                  I'm aware, but my comment stands. And there are generally ways to work around such things if you have an unlimited budget. You just have to design things differently, and the downside is it costs a lot more.
                  That hasn't been true for a couple decades, at least. You used to be able to plow a larger transistor budget into hard-wiring things that had previously been microcoded. However, modern (vector) FPUs are fully hardwired, perhaps with the exception of operations like sqrt. These days, the main tradeoff is how much latency the instructions have. That's really about critical path, which is a matter of clock speed vs. density, but not outright transistor budget.

                  What Intel does with greater transistor budgets is things like increasing cache sizes, adding more shadow registers, enlarging reorder buffers, adding more pipelines, adding more instructions... not to mention adding more cores! I think the individual pipeline stages don't change much.
                  Last edited by coder; 23 June 2023, 04:20 AM.



                  • #29
                    Originally posted by coder View Post
                    That hasn't been true for a couple decades, at least. You used to be able to plow a larger transistor budget into hard-wiring things that had previously been microcoded. However, modern (vector) FPUs are fully hardwired, perhaps with the exception of operations like sqrt. These days, the main tradeoff is how much latency the instructions have. That's really about critical path, which is a matter of clock speed vs. density, but not outright transistor budget.

                    What Intel does with greater transistor budgets is things like increasing cache sizes, adding more shadow registers, enlarging reorder buffers, adding more pipelines, adding more instructions... not to mention adding more cores! I think the individual pipeline stages don't change much.
                    Double the # of AVX pipelines each core has and run each at half speed, and you get the same performance while using less power.

                    Yes, I'm well aware that would make things more complicated/etc. It's all just tradeoffs.

                    The whole point of AVX-512 was to increase the throughput a CPU could achieve. If you only cared about latency for limited amounts of data, stick with AVX2, or even SSE. Nobody would have complained nearly as much about that design as they did about what Intel actually shipped.
                    Last edited by smitty3268; 24 June 2023, 09:18 PM.



                    • #30
                      Originally posted by smitty3268 View Post
                      Double the # of AVX pipelines each core has and run each at half speed, and you get the same performance while using less power.
                      That can achieve the same theoretical throughput, but increasing latency can also be a performance killer. Case in point: look at the changes between AMD's GCN and RDNA. They kept the same throughput per CU, but slashed latency to 1/4th, resulting in a substantial performance improvement.
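                      Rough numbers (made up, not any specific CPU) to put that tradeoff in figures: doubling the pipes while halving the clock keeps peak throughput the same, but a chain of dependent FMAs takes twice as long.

                        /* peak GFLOP/s = pipes * lanes * 2 (FMA) * GHz;
                           a chain of dependent FMAs pays lat cycles each. */
                        #include <stdio.h>

                        static void show(double pipes, double ghz) {
                            const double lanes = 16;  /* 512-bit / fp32   */
                            const double lat   = 4;   /* FMA latency, cyc */
                            const double chain = 1e6; /* dependent FMAs   */
                            printf("%g pipes @ %.1f GHz: %g GFLOP/s, "
                                   "%.0f us for the chain\n",
                                   pipes, ghz,
                                   pipes * lanes * 2 * ghz,
                                   chain * lat / (ghz * 1e3));
                        }

                        int main(void) {
                            show(2, 3.0);  /* 192 GFLOP/s, ~1333 us */
                            show(4, 1.5);  /* 192 GFLOP/s, ~2667 us */
                            return 0;
                        }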

                      Getting back to AVX-512, it also creates more work for the scheduler and increases register pressure. There's no free lunch, here.

                      Xeon Phi (KNL) implemented AVX-512 on 14 nm, but ran at just 1.5 GHz (1.7 GHz turbo). I think that's very telling, and shows where the proper design point was for the technology.
