Linus Torvalds Comes Out Against "Completely Broken" x86_64 Feature Levels

  • coder
    Senior Member
    • Nov 2014
    • 8922

    #61
    Originally posted by hwertz View Post
    I would have assumed (given processor-intensive loads can move onto P-Cores) that they could just implement the AVX-512 instructions "double pumped" on a 256-bit data path on the E-Cores like AMD did on some models, rather than removing the instruction entirely.
LOL, they couldn't do that because they had already implemented 256-bit AVX instructions as "double-pumped" via 128-bit pipelines. I guess breaking AVX-512 into 4 parts would've been too complex and/or required too many extra physical vector registers. Plus, not all AVX-512 instructions can work that way. In Zen 4, AMD had to dedicate a special port to instructions that can't be split.

    Originally posted by hwertz View Post
​After all, if you're running anything intensive (which you probably are if you're using AVX-512 instructions) the scheduler should move it over to a P-Core anyway.
    What they did in Skymont was to drastically narrow the performance gap between P-cores and E-cores, thus reducing that thread scheduling hazard quite considerably.

    Comment

    • coder
      Senior Member
      • Nov 2014
      • 8922

      #62
      Originally posted by geerge View Post
      Once AMD started doing avx512 better than intel, intel tried to take their ball home.
They didn't. Alder Lake removed AVX-512 almost a year before Zen 4 launched, and Zen 4's AVX-512 still performed worse than Intel's, as you can see in AVX-512 benchmarks between Genoa and Sapphire Rapids CPUs with similar core counts.

      Originally posted by geerge View Post
      ​ AVX10 is a stall tactic and a middle finger to anyone paying attention.
      I have to agree on this point. I think they didn't have to make it an incompatible instruction encoding. They could've just said "we're adding an option to restrict AVX-512 to smaller operand sizes" and then gone ahead and added the ability to query the max size as they did in AVX10. That way, legacy AVX-512 code could continue to work on AVX10 CPUs, so long as it didn't use operands greater than 256-bit.
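The "query the max size" mechanism mentioned above maps to how AVX10 enumerates itself: a version number plus per-width support flags in CPUID. Here's a minimal decoder sketch, assuming the leaf-0x24 EBX layout from Intel's original AVX10 spec (the exact bit positions are my reading of that document, so verify them before relying on this; `decode_avx10_ebx` is a made-up helper name):

```c
#include <stdint.h>

/* Decode the AVX10 feature word that CPUID leaf 0x24 (sub-leaf 0)
   returns in EBX. Bit positions follow the original AVX10 spec as I
   read it (version in EBX[7:0]; 128/256/512-bit vector support flags
   in bits 16/17/18) -- treat them as an assumption, not gospel. */
struct avx10_info {
    unsigned version;              /* 0 means AVX10 not supported */
    int has_128, has_256, has_512;
};

static struct avx10_info decode_avx10_ebx(uint32_t ebx)
{
    struct avx10_info info;
    info.version = ebx & 0xffu;
    info.has_128 = (int)((ebx >> 16) & 1u);
    info.has_256 = (int)((ebx >> 17) & 1u);
    info.has_512 = (int)((ebx >> 18) & 1u);
    return info;
}

/* e.g. decode_avx10_ebx(0x00030001u) gives version 1 with 128/256-bit
   support only -- the hypothetical E-core profile discussed here. */
```

On real hardware you'd first confirm AVX10 support (the spec puts that in CPUID.(EAX=07H,ECX=01H):EDX) before reading leaf 0x24 at all.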

      Comment

      • carewolf
        Senior Member
        • Nov 2012
        • 2269

        #63
        Originally posted by NateHubbard View Post

        If you have to verify the whole thing anyway, that would imply that it isn't useful and you can't really assume anything.
It means you only have two versions instead of three or four.

        Comment

        • carewolf
          Senior Member
          • Nov 2012
          • 2269

          #64
          Originally posted by coder View Post
          No, that's not how it works. First, you're thinking of SVE, not NEON. Second, you have to actually write your code to support runtime-based vector length. An ARM CPU with 128-bit SVE has not only 128-bit pipelines, but also 128-bit vector registers. It doesn't just magically somehow run code that's written to assume 512-bit vectors. The code has to adapt to the CPU, not the other way around.

          SVE is still better than the AVX family of instructions, in that the vector lengths aren't hardcoded in the instruction opcodes, like how x86 and just about everyone else does it. So, you can write SVE code that automatically runs faster on a 512-bit implementation, but is still compatible with a 128-bit one.
Yeah, but it does worse at certain things, like exploding values into wider lanes. If you know you have fixed bit-widths, you can do specific operations more easily. You can read a 64-bit value, interpret it as 4x16-bit, expand it into a 4x32-bit SIMD register, do a long run of operations on it in SIMD, and then write it back as 64-bit 4x16. This gets more complicated with variable-length registers, where you have to read an unknown number of 4x16-bit values and expand them into two registers of 4x32, keep operating on _two_ registers of data throughout the SIMD code, before writing back as a single register. That means twice as many instructions in the inner loop, since I have yet to see a "load half of my bit-width" instruction, or a "half my bit-width" intrinsic type, for variable bit-width instruction sets.
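The fixed-width round trip described above can be sketched with SSE2 intrinsics (x86-only; the "operation" is just doubling each lane, purely for illustration, and `double_4x16` is a made-up helper name):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Fixed-width round trip: load 64 bits as 4x16-bit lanes, sign-extend
   to 4x32-bit, do the SIMD work (here just doubling each lane), then
   saturate-pack back down to 4x16-bit and return the 64-bit result. */
static uint64_t double_4x16(uint64_t in)
{
    __m128i v16  = _mm_loadl_epi64((const __m128i *)&in);      /* 4x16 in low half */
    __m128i sign = _mm_cmpgt_epi16(_mm_setzero_si128(), v16);  /* 0xFFFF for negative lanes */
    __m128i v32  = _mm_unpacklo_epi16(v16, sign);              /* sign-extend to 4x32 */
    v32 = _mm_add_epi32(v32, v32);                             /* stand-in for the real work */
    __m128i out16 = _mm_packs_epi32(v32, v32);                 /* saturate back to 4x16 */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, out16);
    return out;
}
```

The whole value stays in one register the entire time, which is exactly the property that a runtime-variable vector length complicates.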

          Comment

          • hansdegoede
            Phoronix Member
            • Feb 2008
            • 65

            #65
            Originally posted by coder View Post
            This says the mitigation includes Skylake:
            Reading this part of the comments here made me curious, so I've done a little digging.

            The 20230808 microcode update:


Updates the 06-55-0* microcode for Skylake-X, the Xeon-derived HEDT processors, which also include AVX-512 support.

            The microcode for the regular 06-5e-0* and low-power 06-4e-0* Skylake models was last updated with the 20220510 microcode release.

            For mapping of these family-model numbers to code names see:


As for Linux behavior on regular and mobile Skylake: I checked the kernel code and also booted a Skylake machine. By default, Linux leaves AVX enabled when there is no microcode update, and "cat /sys/devices/system/cpu/vulnerabilities/gather_data_sampling" prints: "Vulnerable: No microcode". So hardfalcon is correct that AVX / AVX2 are vulnerable to the GDS vulnerability on Skylake machines.

Note that by default Linux leaves AVX / AVX2 enabled, because the performance penalty and the amount of non-working software blindly assuming AVX availability would be quite bad.

            Comment

            • coder
              Senior Member
              • Nov 2014
              • 8922

              #66
              Originally posted by carewolf View Post
Yeah, but it does worse at certain things, like exploding values into wider lanes. If you know you have fixed bit-widths, you can do specific operations more easily. You can read a 64-bit value, interpret it as 4x16-bit, expand it into a 4x32-bit SIMD register, do a long run of operations on it in SIMD, and then write it back as 64-bit 4x16. This gets more complicated with variable-length registers,
              I'm not familiar with SVE, but x86 has instructions to load the lower half of a vector register. You could conceivably do that, then promote the values from 16-bit to 32-bit, without regard to how many of them there are. That's only something affecting how much your pointer increments by, between each pair of load & store operations.
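The low-half-load-plus-promote idiom described above can be sketched with SSE2 intrinsics (x86-only; `widen_16_to_32` is a made-up helper name). The element count per step shows up only in the pointer arithmetic, which is the point being made:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Widen int16_t -> int32_t four elements at a time: a low-half load
   (_mm_loadl_epi64) followed by sign extension via unpack. Only the
   pointer increments encode how many elements each iteration handles. */
void widen_16_to_32(const int16_t *src, int32_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i h = _mm_loadl_epi64((const __m128i *)(src + i));   /* 4x16 lanes */
        __m128i s = _mm_cmpgt_epi16(_mm_setzero_si128(), h);       /* sign bits */
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_unpacklo_epi16(h, s));                /* 4x32 lanes */
    }
    for (; i < n; i++)   /* scalar tail for leftover elements */
        dst[i] = src[i];
}
```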

              Comment

              • jayN
                Phoronix Member
                • Apr 2020
                • 99

                #67
What will AVX512 mean after AVX10.2 is released in Diamond Rapids? I believe the e-cores in Clearwater Forest will also support some width of AVX SIMD ... maybe AVX2, but Intel's comments indicate that they want all the AVX10.2 operations supported, with only the 128/256/512 length support being different. Has AMD said they will go along with that? It does seem to simplify the SIMD options.

                Comment

                • coder
                  Senior Member
                  • Nov 2014
                  • 8922

                  #68
                  Originally posted by jayN View Post
                  What will AVX512 mean after AVX10.2 is released in Diamond Rapids?
                  The usual sort of ISA splintering, at least until we get some clear direction from AMD about their stance on AVX10. Perhaps the biggest impact it's had is slowing the rate of code being optimized for AVX-512. However, because it's so easy to port from AVX-512 to AVX10, I think the effect might not be as bad as if it were another 3DNow vs. SSE situation.

                  Originally posted by jayN View Post
                  I believe the e-cores in Clearwater Forest will also support some length of AVX simd ... maybe avx2, but Intel's comments indicate that they want all the AVX10.2 operations supported, with only the 128/256/512 length support being different.
                  AVX2 is a given. The only question is whether Clearwater Forest will support AVX10.2. If it does, it's almost certain to be limited to 256-bit operands. Intel's original docs refer to AVX10/512 as "legacy" and talk about supporting it only on P-core processors. Obviously, they could change direction on this, but their messaging has so far indicated that AVX10/256 will be the recommended option for everything besides HPC stuff that's meant to run only on high-end server & workstation CPUs.

                  Originally posted by jayN View Post
                  ​​Has AMD said they will go along with that? It does seem to simplify the SIMD options.
                  I've not seen or heard anything from AMD about AVX10.

                  If I were them, I wouldn't say anything until the final days of the merge window, when they need to merge in compiler support for their first CPU to support AVX10. Until then, they wouldn't want to give anyone a reason to hold off on further AVX-512 optimizations.

                  I'm pretty sure AVX10 support from them is a given. It will be interesting to see if they carry forward AVX-512 support, or withdraw it from at least their APUs. I wonder if it would make any sense for them to have a core that supports AVX-512, but only AVX10/256, like if they do their half-width pipeline thing.

                  One reason I'm fairly confident AMD will join the AVX10 bandwagon is that the x86 Ecosystem Advisory Group seems like a response to a rebuke from big hyperscalers for Intel and AMD to stop playing games with the ISA. I think their biggest customers told them to get on the same page, or risk losing out on future business. So, AMD will come along on this, and the real questions are just about the 256/512-bit thing and on which products, plus whether to support AVX-512 on APUs and C-core server CPUs.
                  Last edited by coder; 03 January 2025, 04:48 PM.

                  Comment
