Fedora 29 Proposal "i686 Is For x86-64" Would Allow More Optimizations, Require SSE2

carewolf replied

07 June 2018, 06:40 PM
Originally posted by Weasel

Horizontal ops are lame because they are micro-coded and don't offer any performance benefit, just smaller code (compared to doing it "manually"). I think he's referring to *integer* SIMD operations, especially the zero and sign-extension stuff is super useful in many cases (e.g. pmovzx). SSE is not just about floating point.

I was referring to pshufb (byte shuffling), which is an essential and extremely versatile operation for integer vector code; pmulld (32-bit integer multiplication), again a central operation missing from earlier SSE versions but essential to autovectorizing any C code that multiplies arbitrary ints; and the sign-extending conversions, which are also very important for autovectorizing C code with mixed signed integer types.
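To make that concrete, here is a minimal sketch (not from the original post; the function names are made up for illustration) of the two loop shapes being described. Plain SSE2 has no single packed 32-bit low-multiply or packed sign-extension instruction for these, while SSE4.1 (e.g. built with `-msse4.1`) adds `pmulld` and `pmovsxwd`. Whether a given compiler actually vectorizes them depends on its version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: element-wise 32-bit multiply.
 * Inputs are assumed small enough that the product does not overflow. */
void mul_i32(int32_t *restrict out, const int32_t *restrict a,
             const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

/* Hypothetical example: accumulate signed 16-bit samples into 32-bit sums.
 * Each int16_t is sign-extended to int before the addition. */
void widen_add_i16(int32_t *restrict acc, const int16_t *restrict s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += s[i];
}
```

The `restrict` qualifiers are there only to tell the compiler the arrays do not overlap, which is usually a precondition for vectorizing loops like these.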
Weasel replied

06 June 2018, 06:12 PM
I've no idea what you mean by "new register space" when they all use the exact same registers (xmm0-7 for 32-bit, xmm0-15 for x64). Most people's reason/obsession over SSE2 compared to SSE1 is not even about SIMD. It's that they now have "double scalar fp math" with SSE instead of having to use the x87 FPU, because it's fashionable to hate on the x87 FPU, which to me is nonsense, but OK.

Obviously you can still use 32-bit floating point math (single-precision) with SSE1 instructions, the double-precision ones are just extra available if you need it.
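One concrete difference behind the x87 complaints, sketched here as an illustration rather than anything from the original post: C reports how intermediate floating-point results are evaluated through `FLT_EVAL_METHOD` in `<float.h>`. On typical GCC i686 builds using x87 math it is 2 (intermediates kept in `long double` precision), while with SSE2 scalar math (e.g. `-mfpmath=sse -msse2`, or an x86-64 build) it is normally 0. The helper name below is made up.

```c
#include <float.h>

/* Hypothetical helper: describe how the compiler evaluates
 * floating-point expressions for this build. */
const char *fp_eval_mode(void)
{
#if !defined(FLT_EVAL_METHOD)
    return "unknown";
#elif FLT_EVAL_METHOD == 0
    return "each type evaluated in its own precision (typical for SSE2 math)";
#elif FLT_EVAL_METHOD == 1
    return "float and double evaluated as double";
#elif FLT_EVAL_METHOD == 2
    return "everything evaluated as long double (typical for x87 math)";
#else
    return "indeterminate";
#endif
}
```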
caligula replied

06 June 2018, 05:59 PM
Originally posted by Weasel

SSE4 is not a replacement, it's an extension. You get new useful instructions that you can use, and it's not even about integer operations directly, but shuffling stuff into places (which is needed for the "normal" integer operations after, not SSE4).

I was just wondering about this claim: "Actually they are way more significant than SSE2."

SSE2 (especially the AMD flavor) already provides lots of instructions, lots of new register space. It's not obvious to me that the update to, say, SSE3/SSSE3 is a bigger improvement in any sense. It's definitely an improvement, but does any data prove that they're actually a way more significant update?
Weasel replied

06 June 2018, 05:54 PM
Originally posted by caligula

Agreed, but are you saying that SSE2 doesn't include integer operations or that the operations are pretty useless due to carry/sign issues?

SSE4 is not a replacement, it's an extension. You get new useful instructions that you can use, and it's not even about integer operations directly, but shuffling stuff into places (which is needed for the "normal" integer operations after, not SSE4).

Of course, with all this said, I don't think compilers can make particularly good use of this, seeing as they tend to suck at automatic SIMD and vectorization... and since the "default" flag is -O2, that's even less (vectorization tends to increase code size a lot, so GCC only enables a little of it at -O2; you'd need -O3 or to enable the flags manually).
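A rough way to check this yourself (a sketch, not something from the original post): compile a simple loop like the hypothetical `saxpy` below at different optimization levels and ask GCC to report what it vectorized. `-ftree-vectorize` and `-fopt-info-vec` are existing GCC options; which loops actually get vectorized at which level depends on the GCC version.

```c
#include <stddef.h>

/*
 * Hypothetical test loop. Example GCC invocations to compare
 * (the file name saxpy.c is an assumption):
 *   gcc -O2 -fopt-info-vec -c saxpy.c
 *   gcc -O2 -ftree-vectorize -fopt-info-vec -c saxpy.c
 *   gcc -O3 -fopt-info-vec -c saxpy.c
 */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```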

Last edited by Weasel; 06 June 2018, 05:56 PM.
caligula replied

06 June 2018, 01:31 PM
Originally posted by Weasel

Horizontal ops are lame because they are micro-coded and don't offer any performance benefit, just smaller code (compared to doing it "manually"). I think he's referring to *integer* SIMD operations, especially the zero and sign-extension stuff is super useful in many cases (e.g. pmovzx). SSE is not just about floating point.

Agreed, but are you saying that SSE2 doesn't include integer operations or that the operations are pretty useless due to carry/sign issues?
chithanh replied

06 June 2018, 11:08 AM
Originally posted by Hugh

I don't know of any distro that provides x32 libraries.

Gentoo has an x32 port. Also, with Gentoo you have control over what your libraries are compiled for (x86, x32, amd64). So you can run a pure x32 system if you want, and need only the amd64 toolchain for building your kernel.

Of course, the usual problems apply. Software that assumes __x86_64__ == __LP64__ or depends on hand-written x86/amd64 assembly will fail, and in general you won't be able to run binaries downloaded from somewhere.
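The first of those pitfalls can be sketched like this (an illustration, not from the original post; the helper name is made up). On x32, GCC and Clang define `__x86_64__` together with `__ILP32__` and do not define `__LP64__`, so code that treats `__x86_64__` as meaning "64-bit pointers" picks the wrong branch.

```c
#include <stddef.h>

/* Hypothetical helper: name the x86 ABI this file is being built for. */
const char *abi_name(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";    /* 64-bit instruction set, 32-bit long and pointers */
#elif defined(__x86_64__)
    return "amd64";  /* 64-bit instruction set, 64-bit pointers */
#elif defined(__i386__)
    return "x86";    /* 32-bit instruction set, 32-bit long and pointers */
#else
    return "other";
#endif
}
```

Checking `__ILP32__` or `__LP64__` (or simply `sizeof(void *)`) rather than the architecture macro is the safer way to decide on pointer width.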
Weasel replied

06 June 2018, 10:01 AM
Originally posted by caligula

Ok, I'm not qualified enough to argue about this. AFAIK SSE2 has both integer/float operations with 8-64 bit numbers and can use 8 x 128 bits of storage for packed stuff. The SSE2 set also contains instructions for bypassing cache which is sometimes useful. If you compare this to MMX systems, SSE2 can provide 4-16x performance increase in integer math in a tight loop and maybe more due to other savings.

In SSE3-SSE4 the main addition is horizontal ops, some special ops like dot product acceleration. These are useful in some domains, but not in general. I can assume some programs like lame/flac/ffmpeg can use them, but many can't. The problem with SSE 4 is that SSE 4.2 is only available in 2-3 latest AMD generations. The full power of SSE2 is also only available on x86 AMD and x64 Intel (twice the register count). The AVX instructions also show mixed results in tests, apparently due to some throttling and lots of heat generation.

Horizontal ops are lame because they are micro-coded and don't offer any performance benefit, just smaller code (compared to doing it "manually"). I think he's referring to *integer* SIMD operations, especially the zero and sign-extension stuff is super useful in many cases (e.g. pmovzx). SSE is not just about floating point.
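For reference, a minimal sketch of what doing it "manually" might look like for four 32-bit integers, using only SSE2 intrinsics from `<emmintrin.h>` (shuffle plus add, twice) instead of a horizontal-add instruction. This is an illustration, not code from the post; the function name is made up, and a scalar fallback is included for builds without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: sum the four 32-bit lanes of v. */
int hsum_i32(const int v[4])
{
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    /* Swap the upper and lower 64-bit halves, then add:
     * lanes become [a0+a2, a1+a3, a2+a0, a3+a1]. */
    __m128i s = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent 32-bit lanes, then add: every lane holds the total. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
#else
    /* Scalar fallback for targets without SSE2. */
    return v[0] + v[1] + v[2] + v[3];
#endif
}
```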
caligula replied

06 June 2018, 09:28 AM
Originally posted by carewolf

Actually they are way more significant than SSE2. SSSE3 has the all-important shuffle instruction and SSE4.1 has integer multiply; it more than 4 doubles the places a compiler can autovectorize in most codebases I have worked on.

Edit: Though that might be because I do more integer math than FP, it is not nearly as important for FP performance.

Ok, I'm not qualified enough to argue about this. AFAIK SSE2 has both integer/float operations with 8-64 bit numbers and can use 8 x 128 bits of storage for packed stuff. The SSE2 set also contains instructions for bypassing cache which is sometimes useful. If you compare this to MMX systems, SSE2 can provide 4-16x performance increase in integer math in a tight loop and maybe more due to other savings.

In SSE3-SSE4 the main addition is horizontal ops, some special ops like dot product acceleration. These are useful in some domains, but not in general. I can assume some programs like lame/flac/ffmpeg can use them, but many can't. The problem with SSE 4 is that SSE 4.2 is only available in 2-3 latest AMD generations. The full power of SSE2 is also only available on x86 AMD and x64 Intel (twice the register count). The AVX instructions also show mixed results in tests, apparently due to some throttling and lots of heat generation.
grok replied

05 June 2018, 09:55 PM
I forgot to be more clear in saying that you need a pure x32 userland (at least) to realize the RAM savings, although I didn't realize x32 was done with an x86-64 kernel. (or I forgot, having never run anything x32 anyway)

If you can run a desktop with an x86-64 kernel and i686 userland (or a server, if only that is available), I think it's still good for some old computers that have a 64-bit CPU but only 1GB RAM. I've seen it: two slots of DDR1 (DIMM or SO-DIMM). It can still be fast enough on a single core to be good for something; in fact the last one I used like that (some Athlon 64 at 2 GHz with some HDD and a low-profile graphics card) was embarrassingly fast at booting and running a MATE desktop. Single-thread performance and HDD speed weren't even really out of place almost 15 years later.

Now, there are a few modern computers where this would still be useful: a mini laptop with a quad-core 14nm CPU, fixed 2GB RAM and 32GB eMMC. A 64-bit ISO is easier to install since it supports UEFI (I don't know if some specific 32-bit Linux ISO supports UEFI; the distro I used says its 32-bit version doesn't).
It works fine with a 64-bit OS, but if it were mine I'd like to run a 64-bit kernel and i686 userland.
Hugh replied

05 June 2018, 08:39 PM
Originally posted by grok

x32 has to be a pure x32 system, so no proprietary software like Skype or drivers, no flash player (ok this is not important anymore but used to be even 5+ years ago), no Wine to play some silly Warcraft III. Biggest benefit should be lower CPU load from SSL/TLS.
And so, the niche for x32 was to be server VMs. But as we talk of dropping even i686, this should explain the lack of interest.

Apparently x32 is purely a userland thing and an x86-64 kernel supports it (if the feature is enabled). But you'd need versions of all the libraries too (kind of like you need x86-32 libraries on your x86-64 machine if you wish to run x86-32 userland binaries). So you could have some programs compiled for x32 and some for x86-64 and some for x86-32.

I don't know of any distro that provides x32 libraries.

Most UNIX utilities used to fit in 64k of program space and 64k of data space (the PDP-11's limitations, the almost-original host of UNIX). In the i386 days, I considered trying to compile some of those utilities in i286 mode. But I never got around to it.

It is interesting to me that almost all SPARC and Power userland code is 32-bit on 64-bit hardware. It seems to be something to do with bad code density costs on those RISC architectures.

Last edited by Hugh; 05 June 2018, 08:42 PM.