Gallium3D's Gallivm Gets Basic AVX2 Support


  • #11
    Some of those optimizations (AVX2 but also SSE2) seem to be disabled on LLVM 3.9.

    if (HAVE_LLVM < 0x0309 && ...

    Unless I'm missing something, which is possible....



    • #12
      Originally posted by carewolf:
      Hmm, using the AVX2 gather instruction... I have really been looking forward to using this instruction, but it turns out it is much, much slower than doing all the fetches using normal load instructions, at least on Haswell; on Broadwell and Skylake it is on par, but nowhere is it any faster.
      Yes, it's not particularly fast. That said, I've done some tests on Haswell with some real code, and things didn't seem to get slower with gather. There are benefits beyond pure maximum instruction throughput, one being that the generated code is _much_ smaller (on a pure synthetic test, that probably won't matter).
      According to the published throughput numbers (as found in Agner Fog's guides), gather should do better relative to ordinary multiple fetches when used with 8x32-bit values rather than 4x32-bit values (because it's not quite twice as slow, whereas individual fetches surely are), and also if you use dword indices rather than qword ones. Luckily llvmpipe only needs dword indices. The throughput on Haswell for a gather fetching 8 dwords is 1 every 12 clocks. If you used individual fetches instead, it would be a sequence (7 times) of extract-offset plus pinsrd (with a memory operand); the first value can use movd instead. The chip can actually handle 2 loads per clock, but both the extracts and the inserts use the shuffle unit on port 5, so the whole sequence might require 14 clocks anyway and therefore be slower even on Haswell.
      So long story short, we found no performance penalty with llvmpipe even on Haswell when using gather, hence it's enabled even there.
      On Broadwell using gather should really be faster, and on Skylake it should rock (but I didn't have the hardware to test). (On Haswell, not only does that 8-wide gather require 12 clocks, the chip internally generates 34 micro-ops for it too - which sounds pretty much like it's implemented almost the same as the manual extract/insert sequence. On Broadwell, those numbers improve to 7 clocks and 14 micro-ops, and Skylake improves this to just 5 clocks and 4 micro-ops.)
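      To make the comparison concrete, here is a rough sketch of the two fetch variants in C intrinsics (the function and parameter names are made up; it assumes an AVX2 target and dword offsets into a 32-bit table):

      #include <immintrin.h>

      /* One instruction: vpgatherdd ymm - 8 dwords, dword indices. */
      static __m256i fetch_gather(const int *table, __m256i offsets)
      {
          return _mm256_i32gather_epi32(table, offsets, 4);  /* scale = 4 bytes */
      }

      /* Manual sequence: extract each offset, insert each loaded dword.
         The extracts and inserts both occupy the shuffle unit on port 5. */
      static __m256i fetch_manual(const int *table, __m256i offsets)
      {
          return _mm256_setr_epi32(table[_mm256_extract_epi32(offsets, 0)],
                                   table[_mm256_extract_epi32(offsets, 1)],
                                   table[_mm256_extract_epi32(offsets, 2)],
                                   table[_mm256_extract_epi32(offsets, 3)],
                                   table[_mm256_extract_epi32(offsets, 4)],
                                   table[_mm256_extract_epi32(offsets, 5)],
                                   table[_mm256_extract_epi32(offsets, 6)],
                                   table[_mm256_extract_epi32(offsets, 7)]);
      }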

      I'm not sure why you got slower results. Maybe by doing things manually you were able to hide the fetches better behind other code. Or you had scalar indices in the first place - if you have to move the offsets into xmm regs just so you can use gather, you cannot win. It is also possible llvm didn't quite generate the best code for the scalar fetches (in particular, I'm not sure scalar extraction of the offsets is the fastest way to do it in general) for llvmpipe without avx2, but in any case we didn't really see a performance difference on Haswell.



      • #13
        Originally posted by smitty3268:
        Some of those optimizations (AVX2 but also SSE2) seem to be disabled on LLVM 3.9.

        if (HAVE_LLVM < 0x0309 && ...

        Unless I'm missing something, which is possible....
        That's for the pmin/pmax instructions. llvm 3.9 ditched support for those intrinsics completely - you're supposed to just use cmp/select, and llvm will magically generate optimal assembly (using pmin/pmax instructions) in the end.
        (And indeed it does - however, on older llvm versions codegen wasn't as clever and the generated code was quite suboptimal, actually doing a cmp/select, hence we still use the intrinsics for older versions.)
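        For reference, a minimal sketch of the two paths with SSE intrinsics in C (gallivm does the equivalent at the LLVM IR level with icmp + select; the function names here are made up):

        #include <immintrin.h>

        /* Intrinsic path, kept for llvm < 3.9: emits pminsd directly. */
        static __m128i min_direct(__m128i a, __m128i b)
        {
            return _mm_min_epi32(a, b);          /* SSE4.1 pminsd */
        }

        /* The cmp/select idiom llvm >= 3.9 expects; newer codegen
           pattern-matches it back into a single pminsd. */
        static __m128i min_cmp_select(__m128i a, __m128i b)
        {
            __m128i lt = _mm_cmplt_epi32(a, b);  /* per lane: a < b ? ~0 : 0 */
            return _mm_blendv_epi8(b, a, lt);    /* pick a where a < b, else b */
        }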



        • #14
          Originally posted by mczak:
          Yes, it's not particularly fast. That said, I've done some tests on Haswell with some real code, and things didn't seem to get slower with gather. There are benefits beyond pure maximum instruction throughput, one being that the generated code is _much_ smaller (on a pure synthetic test, that probably won't matter).
          According to the published throughput numbers (as found in Agner Fog's guides), gather should do better relative to ordinary multiple fetches when used with 8x32-bit values rather than 4x32-bit values (because it's not quite twice as slow, whereas individual fetches surely are), and also if you use dword indices rather than qword ones. Luckily llvmpipe only needs dword indices. The throughput on Haswell for a gather fetching 8 dwords is 1 every 12 clocks. If you used individual fetches instead, it would be a sequence (7 times) of extract-offset plus pinsrd (with a memory operand); the first value can use movd instead. The chip can actually handle 2 loads per clock, but both the extracts and the inserts use the shuffle unit on port 5, so the whole sequence might require 14 clocks anyway and therefore be slower even on Haswell.
          So long story short, we found no performance penalty with llvmpipe even on Haswell when using gather, hence it's enabled even there.
          On Broadwell using gather should really be faster, and on Skylake it should rock (but I didn't have the hardware to test). (On Haswell, not only does that 8-wide gather require 12 clocks, the chip internally generates 34 micro-ops for it too - which sounds pretty much like it's implemented almost the same as the manual extract/insert sequence. On Broadwell, those numbers improve to 7 clocks and 14 micro-ops, and Skylake improves this to just 5 clocks and 4 micro-ops.)

          I'm not sure why you got slower results. Maybe by doing things manually you were able to hide the fetches better behind other code. Or you had scalar indices in the first place - if you have to move the offsets into xmm regs just so you can use gather, you cannot win. It is also possible llvm didn't quite generate the best code for the scalar fetches (in particular, I'm not sure scalar extraction of the offsets is the fastest way to do it in general) for llvmpipe without avx2, but in any case we didn't really see a performance difference on Haswell.
          Ah, cool. I tried using it in a few places where we have table lookups and extracted the offsets using some shift and convert-to-si32 instructions. It was a perfect place to try gather, but when used with the same 128-bit stepping it was much slower on Haswell. If the whole thing was rewritten to use 256-bit stepping it was equally fast, but the version not using the gather instruction, while not faster (because of even bigger overhead from extracting and recomposing), was no slower either. It was kind of a letdown from a perfect instruction that filled an exact hole in our SIMD code.
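          (For reference, a rough sketch of the two steppings with C intrinsics - the table and index names are made up, and an AVX2 target is assumed:)

          #include <immintrin.h>

          /* 128-bit stepping: vpgatherdd xmm, four dwords per step. */
          static __m128i lookup4(const int *table, __m128i idx4)
          {
              return _mm_i32gather_epi32(table, idx4, 4);
          }

          /* 256-bit stepping: vpgatherdd ymm, eight dwords per step -
             the width where gather stops losing to individual loads. */
          static __m256i lookup8(const int *table, __m256i idx8)
          {
              return _mm256_i32gather_epi32(table, idx8, 4);
          }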



          • #15
            Originally posted by carewolf:

            Ah, cool. I tried using it in a few places where we have table lookups and extracted the offsets using some shift and convert-to-si32 instructions. It was a perfect place to try gather, but when used with the same 128-bit stepping it was much slower on Haswell. If the whole thing was rewritten to use 256-bit stepping it was equally fast, but the version not using the gather instruction, while not faster (because of even bigger overhead from extracting and recomposing), was no slower either. It was kind of a letdown from a perfect instruction that filled an exact hole in our SIMD code.
            Using shift+movd might indeed be faster for offset extraction, at least in some cases, than using pextrd (as it won't use the shuffle unit).
            And yes, gather was really what was missing for perfect vectorization of quite some code. It's too bad that on Haswell it was more of a convenience function and didn't really do much. IIRC intel pretty much admitted that even before Haswell was released, while promising it would get faster with future cpus.
            (FWIW I've no idea how fast gather is on AMD Carrizo / Zen, albeit at least for Carrizo I would be surprised if it isn't decomposed into a huge number of micro-ops just like on Haswell.)
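            (A minimal sketch of the two extraction variants for lane 1 of an xmm register, in C intrinsics - illustrative only:)

            #include <immintrin.h>

            /* pextrd: one instruction, but it occupies the shuffle unit
               (port 5 on Haswell). */
            static int extract1_pextrd(__m128i v)
            {
                return _mm_extract_epi32(v, 1);
            }

            /* shift + movd: psrlq runs on a shift port instead,
               leaving port 5 free. */
            static int extract1_shift_movd(__m128i v)
            {
                return _mm_cvtsi128_si32(_mm_srli_epi64(v, 32));
            }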



            • #16
              Originally posted by mczak:

              Using shift+movd might indeed be faster for offset extraction, at least in some cases, than using pextrd (as it won't use the shuffle unit).
              And yes, gather was really what was missing for perfect vectorization of quite some code. It's too bad that on Haswell it was more of a convenience function and didn't really do much. IIRC intel pretty much admitted that even before Haswell was released, while promising it would get faster with future cpus.
              (FWIW I've no idea how fast gather is on AMD Carrizo / Zen, albeit at least for Carrizo I would be surprised if it isn't decomposed into a huge number of micro-ops just like on Haswell.)
              If I understand the issue correctly, the problem is not all the micro-ops it decomposed into. The problem was that those micro-ops didn't get the same optimization that normal loads have, so two loads that fall in the same cache line wouldn't be joined. It was basically just a bad implementation where a crucial optimization was missing. If you gathered from places that were far apart and never on the same cache line, it would perform like the individual loads, but on any compact source the individual loads would be faster thanks to the better-optimized joined loads. So unless AMD made the same mistake, I see no reason they would be as slow as Haswell.

