GCC 10 Lands Support For Emulating MMX With SSE Instructions


  • #11
    Originally posted by Tomin View Post
    Can't Linux emulate MMX if the processor doesn't support it? At least in the early days, Linux could emulate x87 instructions if the user didn't have an FPU. If that can be implemented for MMX too, then in theory it really could be dropped. I hope that no performance-critical application is using these instructions anymore. Really old software should run fast enough with emulation - fast enough meaning faster than on the hardware it was designed for. Or at least that's what I would expect, but correct me if you know better.
    Yes, the MMX instructions could conceivably be trapped and emulated by the OS on a future processor. However, trapping an instruction and running an interrupt handler is very inefficient compared to having the compiler emulate it "inline", so the various techniques will likely complement each other:
    • Trapping or microcode emulation for closed-source legacy applications
    • Compiler emulation for applications where recompiling is feasible but rewriting code is not
    • Code rewrites for "live" code bases
    However, because trapping is so expensive, it will likely be many years before Intel and AMD dare to remove the instructions from microcode.
    Veto
    Senior Member
    Last edited by Veto; 17 May 2019, 08:54 PM.



    • #12
      Originally posted by carewolf View Post

      MMX isn't the same speed as SSE. It is much slower instruction for instruction (2x slower in the best of cases), so doing 8-byte vectors with SSE instead of MMX will give a significant performance boost - and even more so if the code mixes 8-byte vectors with 16-byte vectors, which code originally written for NEON might.
      When you have a chunk of MMX intrinsics that map one-to-one to MMX instructions, presumably the SSE instructions you'll get are the MMX-equivalent SSE variants. Meaning the code will either use non-packed variables or packed variables up to 64 bits (the hardware might do calculations on bits 64-127 as well, but the results will be discarded since no values will be loaded there).

      I don't think converting MMX intrinsics to SSE will give any speed boosts.

      Rewriting the code to pack variables into 128 bits and then process them as packed 128-bit operations - yes, that would double the performance on modern CPUs with native 128-bit vectors (some old non-x64 CPUs emulated 128-bit vectors). But that would require rewriting the code. Imagine the scenario where an intrinsic writes a 64-bit chunk and that is "converted" to a 128-bit store - the write then goes out of bounds and segfaults.



      • #13
        Originally posted by Veto View Post
        Yes, the MMX instructions could conceivably be trapped and emulated by the OS on a future processor. However, trapping an instruction and running an interrupt handler is very inefficient compared to having the compiler emulate it "inline", so the various techniques will likely complement each other:
        Yes, we definitely need the other techniques as well. Emulation only solves the problem for software that we can't possibly rewrite - software whose source isn't available or doesn't even exist anymore. This is also a problem if the user is on another architecture, e.g. ARM or POWER, where everything has to be emulated - but you really don't want that if you have source you can compile.



        • #14
          Originally posted by _Alex_ View Post

          When you have a chunk of MMX intrinsics that map one-to-one to MMX instructions, presumably the SSE instructions you'll get are the MMX-equivalent SSE variants. Meaning the code will either use non-packed variables or packed variables up to 64 bits (the hardware might do calculations on bits 64-127 as well, but the results will be discarded since no values will be loaded there).

          I don't think converting MMX intrinsics to SSE will give any speed boosts.
          I have written SSE code that operates on 64 bits at a time, and it was much faster than using MMX intrinsics, which I foolishly did (only) once. SSE has 64-bit load and store operations; you just operate on the whole 128 bits and ignore the parts you don't need. Since most modern CPUs can execute two SSE instructions for every MMX instruction, operating on 128 bits and throwing away 64 is faster than operating on only 64 bits with MMX.

          Of course, operating on 128 bits at a time would be even faster, but this particular algorithm just unpacked 4x8-bit color values into 4x16-bit values, operated on them, and packed them back. It was a 3x speedup over letting the compiler try, though you could get another 2x by doing two pixels at a time. One pixel at a time was just an easier plugin for most algorithms, where it didn't make sense to rewrite the loop to do more at once.
          carewolf
          Senior Member
          Last edited by carewolf; 18 May 2019, 04:14 AM.



          • #15
            The real deal will be when SSE gets emulated with AVX, and SSE then gets dropped from CPUs. That will save die space and allow for moar cores.



            • #16
              Uh-oh. As I wrote, I don't think retiring MMX is a good idea, nor will it be easy.

              1. No big deal for FOSS. Rewrite / recompile and tadaa! there you go.
              2. But all blob software will be doomed. Not every program checks CPU capabilities during startup and chooses different code paths. Maybe in the very early days, when 486 boxes were still common, software would check for MMX presence and enable it or not. But later - and that might well be the last 20 years - virtually every x86-compatible CPU had MMX: VIA chips, embedded AMD Geode LX parts, every Pentium, Duron, Athlon, and Transmeta CPU. So it was a given that people had it, and you could brainlessly enable it.
              You will need either good hardware emulation or something in the OS kernel that catches these instructions and either re-forms them (code-morphing engine, anyone?) or runs them through some sort of emulator.

              And I hardly expect game/program developers will dig out their 15-year-old code (if they can find it), re-check it for hardwired MMX, rewrite, recompile, re-test on real hardware, and then put the binary on the net for free (as in beer).
              Adarion
              Senior Member
              Last edited by Adarion; 18 May 2019, 07:54 AM.
              Stop TCPA, stupid software patents and corrupt politicians!



              • #17
                Originally posted by Adarion View Post
                Uh-oh. As I wrote, I don't think retiring MMX is a good idea, nor will it be easy.
                I think that by "retiring" they mean "move to lower-performance microcode emulation", as that's where all the old crap instructions went once the CPU component that actually executed them was retired.

                The CPU will do some shenanigans and remap MMX to other instructions that it can execute in hardware.

                Is this slower than keeping the hardware to execute the instruction? Yes, yes it is, it's some form of low-level software emulation.

                Will this matter? Eh probably not. Software requiring ancient crap instructions is likely going to be designed for ancient crap hardware, so it won't need very high performance by modern standards. It's likely to be perfectly fine.



                • #18
                  Isn't calculating 128 bits and then discarding 64 of them rather power-inefficient?



                  • #19
                    Originally posted by mifritscher View Post
                    Isn't calculating 128 bits and then discarding 64 of them rather power-inefficient?
                    That's true - assuming MMX is implemented efficiently in the CPU in the first place.



                    • #20
                      A small nitpick: SSE2 is the one that introduced integer operations on par with MMX's. MMX has no floating-point operations.

