GCC Lands AVX-512 Fully-Masked Vectorization

Written by Michael Larabel in GNU on 19 June 2023 at 06:30 AM EDT. 31 Comments

After examining the binary GCC generates for the x264 video encoder and finding some performance inefficiencies, SUSE engineers have worked out AVX-512 fully-masked vectorization support for the GCC 14 development code.

Back in January, SUSE compiler engineer Jan Hubicka opened this bug about the x264 benchmark, where the averaging loop was not being well optimized for AVX-512.

"x264 benchmark has a loop averaging two unsigned char arrays that is executed with relatively low trip counts that does not play well with our vectorized code. For AVX512 most time is spent in unvectorized variant since the average number of iterations is too small to reach the vector code.
...
For sizes 12-16 128bit vectorization wins, 20-28 behaves funily. However avx512 vectorization is a huge loss for all sizes up to 31 bytes. aocc seems to win for 16 bytes.
...
One issue is that we at most perform one epilogue loop vectorization, so with AVX512 we vectorize the epilogue with AVX2 but its epilogue remains unvectorized. With AVX512 we'd want to use a fully masked epilogue using AVX512 instead.

I started working on fully masked vectorization support for AVX512 but got distracted."
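For readers who want to picture the loop in question, here is a minimal sketch of a byte-averaging loop of the kind the bug report describes. The function name avg_u8 and the round-up averaging are assumptions for illustration, not code taken from x264 or the bug report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: element-wise rounded average of two byte arrays.
   The name and the "+ 1" rounding are illustrative assumptions. */
void avg_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The operands promote to int, so the sum cannot overflow a byte. */
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
}
```

A 512-bit vector holds 64 bytes, so a call with n = 20 never completes a single full AVX-512 iteration and all of the work falls to the epilogue code, which is the situation the report complains about.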

Fast forward nearly six months, and SUSE compiler engineer Richard Biener has landed an initial implementation of AVX-512 fully-masked vectorization in the GNU Compiler Collection codebase to help the x264 test case and other loops that do not fill a full vector.

"This implements fully masked vectorization or a masked epilog for avx512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like gcn).

avx512 is also special in that it doesn't have any instruction to compute the mask from a scalar iv like sve has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar iv (mainly to avoid dependences on mask generation, a suitable mask test instruction is available).

like rvv code generation prefers a decrementing iv though ivopts messes things up in some cases removing that iv to eliminate it with an incrementing one used for address generation.

one of the motivating testcases is from pr108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic avx512 with avx2 epilogues for the cases of less than 32 iterations."
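The idea in the commit message can be approximated in plain C. The sketch below is a scalar emulation under stated assumptions, not GCC's generated code: LANES, lane_mask, and avg_u8_masked are hypothetical names, the averaging body is an assumed stand-in for the x264 loop, and real AVX-512 code would use mask registers with masked loads and stores rather than an inner loop. It shows the two points the message highlights: the mask has one bit per lane and comes from comparing lane indices against the elements left, while loop control stays on a decrementing scalar counter.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 64 /* byte lanes in one 512-bit vector */

/* Hypothetical helper: one mask bit per lane, set when that lane still
   has an element to process. Stands in for a vector compare. */
uint64_t lane_mask(size_t remaining)
{
    uint64_t mask = 0;
    for (unsigned lane = 0; lane < LANES; lane++) {
        if (lane < remaining)
            mask |= (uint64_t)1 << lane;
    }
    return mask;
}

/* Hypothetical fully-masked loop: every iteration, including the last
   partial one, runs the same "vector" body under a mask. */
void avg_u8_masked(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t remaining = n; /* decrementing scalar induction variable */
    while (remaining > 0) {
        uint64_t mask = lane_mask(remaining);
        for (unsigned lane = 0; lane < LANES; lane++) {
            /* Inactive lanes are neither loaded nor stored. */
            if ((mask >> lane) & 1)
                dst[lane] = (uint8_t)((a[lane] + b[lane] + 1) >> 1);
        }
        size_t step = remaining < LANES ? remaining : LANES;
        dst += step;
        a += step;
        b += step;
        remaining -= step;
    }
}
```

With this shape there is no separate scalar or AVX2 epilogue: a 20-element call runs one masked iteration with the low 20 mask bits set, which is the short-trip-count case the commit message says improves.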

The AVX-512 fully masked vectorization support landed this morning in GCC 14 Git via this commit.
