Intel Releases x86-simd-sort 2.0 With Faster AVX-512 Sorting, New Algorithms
Earlier this year Intel software engineers published a blazing fast AVX-512 sorting library that was initially picked up by Numpy where it netted them 10~17x faster sorts. Today marks the release of x86-simd-sort 2.0 with even more AVX-512 features in place and additional sorting algorithms added.
In March x86-simd-sort 1.0 was published as the first stable release for this C++ header file library that supported SIMD-based 16 / 32 / 64-bit data-type sorting. Since the v1.0 release I haven't heard much from the x86-simd-sort team while today saw the release of version 2.0.
With x86-simd-sort is a lot more goodies for these interested in high performance sorting and continuing to maximize the performance potential of AVX-512. The release highlights of x86-simd-sort 2.0 include:
Very impressive work while sadly AVX-512 isn't found with the latest Intel client processors, but at least will be quite exciting for Xeon Scalable servers as well as all AMD Zen 4 customers.
Downloads and more details on x86-simd-sort 2.0 via Intel's GitHub.
In March x86-simd-sort 1.0 was published as the first stable release for this C++ header file library that supported SIMD-based 16 / 32 / 64-bit data-type sorting. Since the v1.0 release I haven't heard much from the x86-simd-sort team while today saw the release of version 2.0.
With x86-simd-sort is a lot more goodies for these interested in high performance sorting and continuing to maximize the performance potential of AVX-512. The release highlights of x86-simd-sort 2.0 include:
- AVX-512 based argsort algorithms using O(1) space for 32-bit and 64-bit data types. It returns the indices that would sort an array. These are up to 6x faster when compared to a scalar solution that uses std::sort. This new feature is leveraged by NumPy in np.argsort in its latest release v1.25.
- AVX-512 based quick select algorithm for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::nthelement but performs a lot faster. The performance depends on the ratio K/N (where K is the index of the element the array is partitioned around and N is the array size). For smaller values of K, it is up to 6x faster for 64-bit data, up to 15x faster for 32-bit data and up to 7x faster for 16-bit data. The performance gets better as K gets larger.
- AVX-512 based partial sort algorithm for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::partial_sort. As with quick select, its performance depends on the ratio K/N (where K is the size of partial sorted array and N is the array size) and it tends to perform a an order of magnitude faster for larger values. It is about 1.05x faster for tiny partial array sort and up to 20x faster for slightly larger partial arrays.
- AVX-512 based key-value sort. This sorts a pair of key-value arrays and is currently supported only for 64-bit data types. A version of this has been contributed to Oceanbase, an open source distributed relational database.
- AVX-512 sort for _Float16 data type using AVX-512 FP16 ISA. In NumPy, these are nearly 3x faster than AVX-512 based sort that emulates float16 data type.
Very impressive work while sadly AVX-512 isn't found with the latest Intel client processors, but at least will be quite exciting for Xeon Scalable servers as well as all AMD Zen 4 customers.
Downloads and more details on x86-simd-sort 2.0 via Intel's GitHub.
9 Comments