Intel Releases x86-simd-sort 2.0 With Faster AVX-512 Sorting, New Algorithms

Written by Michael Larabel in Intel on 23 June 2023 at 05:00 PM EDT. 9 Comments
Earlier this year Intel software engineers published a blazing fast AVX-512 sorting library that was initially picked up by Numpy where it netted them 10~17x faster sorts. Today marks the release of x86-simd-sort 2.0 with even more AVX-512 features in place and additional sorting algorithms added.

In March x86-simd-sort 1.0 was published as the first stable release for this C++ header file library that supported SIMD-based 16 / 32 / 64-bit data-type sorting. Since the v1.0 release I haven't heard much from the x86-simd-sort team while today saw the release of version 2.0.

x86-simd-sort 2.0

With x86-simd-sort is a lot more goodies for these interested in high performance sorting and continuing to maximize the performance potential of AVX-512. The release highlights of x86-simd-sort 2.0 include:
- AVX-512 based argsort algorithms using O(1) space for 32-bit and 64-bit data types. It returns the indices that would sort an array. These are up to 6x faster when compared to a scalar solution that uses std::sort. This new feature is leveraged by NumPy in np.argsort in its latest release v1.25.

- AVX-512 based quick select algorithm for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::nthelement but performs a lot faster. The performance depends on the ratio K/N (where K is the index of the element the array is partitioned around and N is the array size). For smaller values of K, it is up to 6x faster for 64-bit data, up to 15x faster for 32-bit data and up to 7x faster for 16-bit data. The performance gets better as K gets larger.

- AVX-512 based partial sort algorithm for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::partial_sort. As with quick select, its performance depends on the ratio K/N (where K is the size of partial sorted array and N is the array size) and it tends to perform a an order of magnitude faster for larger values. It is about 1.05x faster for tiny partial array sort and up to 20x faster for slightly larger partial arrays.

- AVX-512 based key-value sort. This sorts a pair of key-value arrays and is currently supported only for 64-bit data types. A version of this has been contributed to Oceanbase, an open source distributed relational database.

- AVX-512 sort for _Float16 data type using AVX-512 FP16 ISA. In NumPy, these are nearly 3x faster than AVX-512 based sort that emulates float16 data type.

Very impressive work while sadly AVX-512 isn't found with the latest Intel client processors, but at least will be quite exciting for Xeon Scalable servers as well as all AMD Zen 4 customers.

AVX-512 processors

Downloads and more details on x86-simd-sort 2.0 via Intel's GitHub.
Related News
About The Author
Michael Larabel

Michael Larabel is the principal author of and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via

Popular News This Week