Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts

Written by Michael Larabel in Intel on 15 February 2023 at 04:00 PM EST. 51 Comments

Intel recently published an open-source C++ header file library for high performance SIMD-based sorting, which initially is focused on providing a lightning fast AVX-512 quicksort implementation. As of today that code has been merged to Numpy and is providing some 10~17x speed-ups.

Toward the end of last year Intel quietly made available x86-simd-sort via their GitHub account. It's a C++ header file library for high performance SIMD sorting, though in its current form it is focused solely on an AVX-512 quicksort implementation.

There hasn't been much coverage of this x86-simd-sort project and the GitHub page itself doesn't do much to talk up the crazy fast performance potential of AVX-512 for sorting... But now the widely-used Numpy open-source project is making prominent use of it and achieving some staggering results.

Merged today into Numpy was PR 22315 to vectorize the quicksort for 16-bit and 64-bit data types using AVX-512. On an Intel Tigerlake system this sped up 16-bit int sorting by 17x and 64-bit float sorting by nearly 10x for random arrays, while sorts of 32-bit data types were 12~13x faster. This Numpy change was made by Intel engineer Raghuveer Devulapalli and leverages the x86-simd-sort code.
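
For those wanting to see how sorting performs on their own hardware, below is a minimal timing sketch in Python. The helper name time_sort, the array size, and the repeat count are illustrative assumptions and are not taken from the Numpy PR or its benchmarks. Whether the AVX-512 code path is actually used depends on the Numpy build and on the CPU supporting AVX-512, so the script alone cannot reproduce the quoted speed-ups; comparing runs across an older and a newer Numpy release on the same machine would be one way to observe a difference.

```python
import time

import numpy as np


def time_sort(dtype, n=1_000_000, repeats=5, seed=0):
    """Return the best wall-clock time in seconds to sort a random array.

    Hypothetical helper for illustration only; not part of Numpy.
    """
    rng = np.random.default_rng(seed)
    if np.issubdtype(dtype, np.integer):
        # Draw integers across the full range of the requested type.
        info = np.iinfo(dtype)
        data = rng.integers(info.min, info.max, size=n, dtype=dtype, endpoint=True)
    else:
        # Uniform floats in [0, 1), converted to the requested type.
        data = rng.random(n).astype(dtype)

    best = float("inf")
    for _ in range(repeats):
        arr = data.copy()  # sort a fresh unsorted copy each time
        start = time.perf_counter()
        arr.sort(kind="quicksort")  # in-place sort via Numpy's quicksort entry point
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    for dtype in (np.int16, np.float64):
        seconds = time_sort(dtype)
        print(f"{np.dtype(dtype).name}: best of 5 runs = {seconds * 1e3:.2f} ms")
```

The script reports the best of several runs rather than the mean to reduce noise from other activity on the system.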

A speed-up worth celebrating... From multi-vendor support to more efficient AVX-512 implementations on newer processors to more robust software use, there is a lot to enjoy around AVX-512 these days.

A 10~17x speed-up for sorting with AVX-512 is pretty astonishing, especially when factoring in the better AVX-512 efficiencies with recent generations of Intel CPUs. With the latest Xeon Scalable processors, the thermal and power impact of AVX-512 is no longer so great, nor does it cause the significant CPU down-clocking it was panned for in the past; it is now in rather good shape. See my recent Intel Xeon "Sapphire Rapids" AVX-512 benchmarks, which include the power efficiency. It's too bad though that the latest Intel Core client processors no longer offer AVX-512. Meanwhile over on the AMD side, their Zen 4 processors from the Ryzen 7000 series through the 4th Gen EPYC server processors (finally) offer AVX-512 support.

It will be interesting to see what other software projects decide to make use of x86-simd-sort for speedy AVX-512 sorting. It's another notable win for Advanced Vector Extensions 512, similar to how simdjson tapped AVX-512 last year for very fast JSON parsing, something one would not normally think of as a great use-case for AVX-512.
