GCC 10 Lands Support For Emulating MMX With SSE Instructions


  • carewolf
    replied
    Originally posted by gnufreex View Post
    The real deal is when SSE gets emulated by AVX, and then SSE gets dropped from CPUs. That will save die space and allow for moar cores.
    That has been the case ever since AVX was introduced: if you specify -mavx, all SSE instructions get replaced by their AVX-128 counterparts. I doubt SSE will ever be removed from CPUs, though. Intel still makes CPUs without AVX, and the underlying micro-ops are the same; only the encoding is different.
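
    A quick way to observe this, assuming a recent GCC on x86-64 (the file and function names are just for illustration, and the exact assembly may vary):

        /* add.c - one SSE2 intrinsic, two possible encodings:
         *   gcc -O2 -S add.c        ->  paddd  %xmm1, %xmm0
         *   gcc -O2 -S -mavx add.c  ->  vpaddd %xmm1, %xmm0, %xmm0
         * Same operation either way; -mavx only switches to the
         * VEX-encoded AVX-128 form. */
        #include <emmintrin.h>

        __m128i add_epi32(__m128i a, __m128i b)
        {
            return _mm_add_epi32(a, b);
        }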



  • iive
    replied
    Small nitpick: SSE2 is the one that introduced integer operations on a par with the MMX ones. MMX has no floating-point operations.



  • Weasel
    replied
    Originally posted by mifritscher View Post
    Isn't calculating 128 bits and discarding 64 of them rather power inefficient?
    That is true, but only if MMX is implemented efficiently in CPUs in the first place.



  • mifritscher
    replied
    Isn't calculating 128 bits and discarding 64 of them rather power inefficient?



  • starshipeleven
    replied
    Originally posted by Adarion View Post
    Uh-oh. As I wrote, I don't think retiring MMX is a good idea, or that it will be easy.
    I think that by "retiring" they mean "move to lower-performance microcode emulation", as that's where all the old crap instructions went once the CPU component that actually executed them was retired.

    The CPU will do some shenanigans and remap MMX to other instructions that it can execute in hardware.

    Is this slower than keeping dedicated hardware to execute the instruction? Yes, yes it is; it's a form of low-level software emulation.

    Will this matter? Eh, probably not. Software requiring ancient crap instructions is likely designed for ancient crap hardware, so it won't need very high performance by modern standards. It's likely to be perfectly fine.



  • Adarion
    replied
    Uh-oh. As I wrote, I don't think retiring MMX is a good idea, or that it will be easy.

    1. No big deal for FOSS. Rewrite / recompile and tadaa! there you go.
    2. But all blob software will be doomed. Not every program checks for CPU capabilities at startup and chooses different code paths (see the sketch below). Maybe in the very early days, when 486 boxes were still common, software would check for MMX presence and enable it or not. But later, and that might well be the last 20 years, virtually every x86-compatible CPU had MMX: even VIA chips, embedded AMD Geode LX parts, every Pentium, Duron, Athlon and Transmeta CPU. So it was a given that people had it, and you could brainlessly enable it.
    You will need either good hardware emulation or something in the OS kernel that catches these instructions and either re-forms them (code morphing engine, anyone?) or runs them through some sort of emulator.

    And I hardly expect game/program developers will dig out their 15-year-old code (if they can find it), re-check it for hardwired MMX, rewrite, recompile, re-test on real hardware and then put the binary on the net for free (as in beer).
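
    A minimal sketch of that kind of startup check, assuming GCC/Clang's __builtin_cpu_supports(); the two code-path functions are hypothetical stand-ins:

        #include <stdio.h>

        static void filter_mmx(void)    { puts("MMX code path"); }
        static void filter_scalar(void) { puts("scalar fallback"); }

        int main(void)
        {
            __builtin_cpu_init();  /* must run before the feature checks */
            if (__builtin_cpu_supports("mmx"))
                filter_mmx();
            else
                filter_scalar();
            return 0;
        }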
    Last edited by Adarion; 18 May 2019, 07:54 AM.



  • gnufreex
    replied
    The real deal is when SSE gets emulated by AVX, and then SSE gets dropped from CPUs. That will save die space and allow for moar cores.



  • carewolf
    replied
    Originally posted by _Alex_ View Post

    When you have a chunk of MMX intrinsics that map one-to-one to MMX instructions, presumably the SSE instructions you'll get are the MMX-equivalent SSE variants, meaning the code will either use non-packed variables or packed variables up to 64 bits (the hardware might do calculations on bits 64-127 as well, but the results will be discarded since no values will be loaded there - or if the values are loaded

    I don't think converting MMX intrinsics to SSE will give any speed boost.
    I have written SSE code that operates on 64 bits at a time, and it was much faster than using MMX intrinsics (which I foolishly did, only once). SSE has 64-bit load and store operations, and then you just operate on the whole 128 bits and ignore the parts you don't need. Since most modern CPUs can execute two SSE instructions for every MMX one, operating on 128 bits and throwing away 64 is faster than operating on only 64 bits with MMX.

    Of course, operating on a full 128 bits at a time would be even faster, but this particular algorithm was just unpacking 4x8-bit colour values out to 4x16-bit, operating on them, and packing them back. It was a 3x speedup over letting the compiler try, though you could get another 2x by doing two pixels at a time. One pixel at a time was simply easier to plug into most algorithms, where it didn't make sense to rewrite the loop to do more at once.
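
    A minimal sketch of that one-pixel-at-a-time pattern, assuming SSE2 intrinsics; the per-channel operation (halving each value) is just a stand-in for the real algorithm:

        #include <emmintrin.h>  /* SSE2 */
        #include <stdint.h>

        /* Process one RGBA pixel (4 x 8-bit channels) at a time. */
        uint32_t darken_pixel(uint32_t pixel)
        {
            __m128i zero = _mm_setzero_si128();
            /* Load 32 bits into the low lanes; the upper lanes hold
             * zeroes that get computed on and thrown away. */
            __m128i v = _mm_cvtsi32_si128((int)pixel);
            /* Unpack 8-bit channels to 16-bit for headroom. */
            v = _mm_unpacklo_epi8(v, zero);
            /* The per-channel operation: halve each value. */
            v = _mm_srli_epi16(v, 1);
            /* Pack back down to 8-bit with unsigned saturation. */
            v = _mm_packus_epi16(v, zero);
            return (uint32_t)_mm_cvtsi128_si32(v);
        }

    Doing two pixels at a time would use a 64-bit load (_mm_loadl_epi64) instead, filling all eight 16-bit lanes after the unpack.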
    Last edited by carewolf; 18 May 2019, 04:14 AM.



  • Tomin
    replied
    Originally posted by Veto View Post
    Yes, the MMX instructions could conceivably be trapped and emulated by the OS on a future processor. However, trapping an instruction and running an interrupt handler is very inefficient compared to having the compiler emulate it "inline", so the various techniques will likely complement each other:
    Yes, we definitely need the other techniques as well. Emulation only solves it for software that we can't possibly rewrite: software whose source isn't available, or doesn't even exist anymore. This is also a problem when the user is on another architecture, e.g. ARM or POWER, and has to emulate everything; you really don't want to do that if you have source you can compile.
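
    To illustrate why trapping is so much slower, here is a minimal userspace sketch of the trap-and-emulate idea, assuming x86-64 Linux and glibc's ucontext layout; the decoder is a stub. Every emulated instruction pays for a full signal delivery and return:

        #define _GNU_SOURCE
        #include <signal.h>
        #include <stdint.h>
        #include <ucontext.h>

        /* Stub: a real emulator would decode the instruction at `insn`,
         * apply its effect to `uc`, and return its length in bytes. */
        static int emulate_legacy_insn(const uint8_t *insn, ucontext_t *uc)
        {
            (void)insn; (void)uc;
            return 0;
        }

        static void sigill_handler(int sig, siginfo_t *info, void *ctx)
        {
            ucontext_t *uc = (ucontext_t *)ctx;
            uint8_t *rip = (uint8_t *)uc->uc_mcontext.gregs[REG_RIP];
            /* Emulate the faulting instruction, then skip past it. */
            uc->uc_mcontext.gregs[REG_RIP] += emulate_legacy_insn(rip, uc);
            (void)sig; (void)info;
        }

        int main(void)
        {
            struct sigaction sa = {0};
            sa.sa_sigaction = sigill_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGILL, &sa, NULL);
            /* ... run legacy code that may hit removed instructions ... */
            return 0;
        }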



  • _Alex_
    replied
    Originally posted by carewolf View Post

    MMX isn't the same speed as SSE. It is much slower instruction for instruction (2x slower in the best of cases), so doing 8-byte vectors with SSE instead of MMX will give a significant performance boost. And even better if the code mixes 8-byte vectors with 16-byte vectors, which code originally written for NEON might.
    When you have a chunk of MMX intrinsics that map one-to-one to MMX instructions, presumably the SSE instructions you'll get are the MMX-equivalent SSE variants, meaning the code will either use non-packed variables or packed variables up to 64 bits (the hardware might do calculations on bits 64-127 as well, but the results will be discarded since no values will be loaded there - or if the values are loaded

    I don't think converting MMX intrinsics to SSE will give any speed boost.

    Rewriting the code to pack variables into 128 bits and then process them with packed 128-bit operations, yes, that would double the performance on modern CPUs with native 128-bit vector units (some old non-x64 CPUs only emulated 128-bit vectors, splitting them into two 64-bit halves). But that would require rewriting the code, because imagine the scenario where an intrinsic writes a 64-bit chunk and that is "converted" to a 128-bit store: the store goes out of bounds and segfaults. A sketch of that failure mode is below.
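
    A minimal sketch of that failure mode, assuming SSE2 intrinsics; the buffer size and offset are just for illustration:

        #include <emmintrin.h>  /* SSE2 */
        #include <stdint.h>

        /* Store a result into the last 8 bytes of a 24-byte buffer. */
        void store_tail(uint8_t buf[24], __m128i v)
        {
            /* Faithful translation: a 64-bit store of the low half. */
            _mm_storel_epi64((__m128i *)(buf + 16), v);

            /* Naively widening it to a 128-bit store would write bytes
             * 16..31 of a 24-byte buffer - out of bounds, and a
             * potential segfault:
             *
             *   _mm_storeu_si128((__m128i *)(buf + 16), v);
             */
        }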

