Targeted Intel oneAPI DPC++ Compiler Optimization Rules Out 2k+ SPEC CPU Submissions


  • #21
    Originally posted by Developer12 View Post
    See Torvalds' thoughts on the matter:
    I think it's risky to view Linus' comments out-of-context.

    One clue is this point he made, in that post:

    "I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!)"

    A significant downside of AVX-512 on Intel's 14nm CPUs (Skylake-X & Cascade Lake) was the sometimes severe clock-throttling it could trigger, even from modest use. The impact on clockspeeds was shown to be large enough that you were often better off avoiding AVX-512 entirely, unless you were using it heavily.

    When you also consider the Balkanization of AVX-512 subsets, it seems clear that Intel jumped the gun on implementing it. Deploying it on 14nm technology just came with too many caveats: power/clockspeed and die space.

    However, Michael's tests of Zen 4 and Sapphire Rapids have shown virtually no clock-throttling or excessive power utilization, and Zen 4 even managed to implement it in a way that seems quite area-efficient. Not only that, but Zen 4 came right out of the gate with virtually all subsets implemented (except for a couple of the most recent ones).

    That said, Zen 5 is rumored to have gone for native 512-bit execution paths. So, we'll have to see how that pans out, but I think the current generation of AVX-512 is relatively harmless. We can perhaps turn our criticism to other die space hogs, like Intel's AMX, which also bloats thread context by a whopping 8 kB, and so far isn't useful for anything but AI inferencing.
    Last edited by coder; 12 February 2024, 07:50 AM.


    • #22
      Originally posted by s_j_newbury View Post
      Other more generally useful improvements could have potentially been made instead of optimizing design for AVX512. There's no free lunch.
      Yes, but also no. Skylake client cores didn't have AVX-512, so the scope for improvements made in lieu of AVX-512 was somewhat limited.

      Originally posted by s_j_newbury View Post
      There is a generally held view amongst many "programmers" that FP is how maths is done, which is wrong. FP is how maths is approximated.
      For most engineering and scientific applications, you're already dealing with approximate data, anyhow. fp64 is usually good enough for those purposes. I'll bet even a surprising amount of financial software just uses fp64, especially if it's just for modelling purposes and not actual transaction-processing.


      • #23
        Originally posted by sophisticles View Post

        This is not disputable, it is a fact that on Intel CPUs the FP registers are integrated into the SIMD unit and if you ever wrote any assembler you would know this.
        You're such a bonehead that you didn't even bother to read your own references! If you check pages 20-21, that one even says:

        "x87 is still around!
        Scientific applications that benefit from 80 bits of FP
        precision sometimes still use it"

        Not only does x87 support up to 80-bit precision, it also handles denormals in hardware, which SSE and AVX don't. Leaving denormal support enabled in SSE comes at a huge performance cost whenever an operation actually involves one.

        Originally posted by Developer12 View Post
        Are you joking? You're basing all of this on a Wikipedia article on computer components from 3 decades ago?
        sophisticles has such a high opinion of himself that he believes his stale knowledge + a couple hasty Google searches are superior to most people on this forum. I think he's possibly suffering from severe narcissistic personality disorder, but perhaps there's something else wrong with him.
        Last edited by coder; 12 February 2024, 10:09 AM.


        • #24
          Originally posted by Developer12 View Post
          The whole reason intel brought in AVX512 is because they want to market their CPUs (having really no GPUs to speak of yet) as being suitable for AI workloads in some way, even if it's just inference.
          No, the roots of AVX-512 date back to their Xeon Phi product line, which was more about HPC and really only started to embrace deep learning when it was already on its way to the graveyard of discontinued Intel product lines. I'm pretty sure even the decision to add AVX-512 to Skylake-X wasn't primarily motivated by AI.

          AVX-512 has a few ISA-level improvements, beyond mere vector width, which make it more desirable than AVX/AVX2 even when working with 128-bit or 256-bit operands.
          • opmask registers
          • separate source(s) and destination operands
          • twice the number of vector registers
          • new scatter and permutation operations

          These are among the reasons Intel is bothering with AVX10/256. If vector width were the only real benefit AVX-512 offered over AVX2, then AVX10/256 would make no sense.
          Last edited by coder; 12 February 2024, 10:05 AM.


          • #25
            I started working in Silicon Valley in the late 1970s and Intel was rotten then, and is rotten now. They consistently crushed other companies and engineers with bullying tactics and frivolous lawsuits, instead of fairly competing with superior products.

            So anyone who is surprised by this shouldn't be, as Intel will never change. And the only thing that saved consumers from them is AMD.

            And the only way to defeat Intel, and perhaps one day scrape out the rot, is to never purchase any Intel products again, and build systems exclusively with AMD products.


            • #26
              Can someone get us some actual details on what exactly the Intel compiler did wrong here?

              Has it:

              1. Detected a common code pattern and applied common optimization for said code pattern?

              OR it:

              2. Detected a specific benchmark code pattern and applied special "only for this benchmark code pattern" optimization?

              Inquiring minds want to know, and apparently the press nowadays can't be arsed with such "irrelevant" details when it's much easier to post a sensationalist headline.