
Intel Releases x86-simd-sort 5.0 With 4~5x Faster C++ Object Sorting Using AVX-512


  • Intel Releases x86-simd-sort 5.0 With 4~5x Faster C++ Object Sorting Using AVX-512

    Phoronix: Intel Releases x86-simd-sort 5.0 With 4~5x Faster C++ Object Sorting Using AVX-512

    It's been nearly one year to the day since Intel first outlined its AVX-512-powered sorting library and its blazing-fast sort speeds. The past year brought the 1.0 release, new algorithms in v2.0, and AVX2 support plus more AVX-512 optimizations in v4.0. Now Intel is out with x86-simd-sort 5.0 with yet more performance improvements...


  • #2
    What can this be useful for? What can benefit from it?



    • #3
      Originally posted by timofonic View Post
      What can this be useful for? What can benefit from it?
      TFA mentions NumPy, and any NumPy speedup is a pretty big deal. I personally don't deal with it much, since I work at the data storage level and mainly maintain raw data availability via databases, with only minor massaging.

      But if you deal with a lot of numerical data, say data analysis, then this might be huge.



      • #4
        I have the impression that since its introduction years ago, people have been trying to find real-life applications for AVX512 that aren't already covered by GPUs.
        With earlier SIMD extensions it was relatively simple - multimedia and games benefited a lot - but this is no longer the case with AVX512.
        So we have JSON parsing, PS3 emulation, AI inferencing, molecular dynamics simulation, and ray tracing.
        AVX512 is also a must nowadays for winning processor comparisons on Phoronix.

        And now we see Intel working on another specialized use case (sorting), with another specialized library.
        I won't argue that it's better to have AVX512 than not, but it still doesn't convince me that Linus was wrong.
        This is nowhere near as useful as SSE/SSE2 was years ago.

        Also, it would be nice to know what AVX512 is being compared to, because from what I can see, AVX2 code was added to this library much later than the AVX512 code (in 4.0, a few months ago), and it isn't getting anywhere near as much love as the AVX512 path, which is being optimized again and again. So it's 5x faster than what? MMX?

        I'm waiting for benchmarks comparing x86-64-v3 and v4 optimized distro builds. At least AVX512 has some clever instructions that AVX2 doesn't have, and more registers.
        Last edited by sobrus; 13 February 2024, 04:08 AM.



        • #5
          Originally posted by sobrus View Post
          With earlier SIMD extensions it was relatively simple - multimedia and games benefited a lot - but this is no longer the case with AVX512.
          So we have JSON parsing, PS3 emulation, AI inferencing, molecular dynamics simulation, and ray tracing.
          Everything that benefited from SSE most likely also benefits from AVX.

          This is nowhere near as useful as SSE/SSE2 was years ago.
          Give it time; as soon as it is as widespread as SSE, it will see more adoption. Up until now, only a small percentage of users have had capable hardware.

          Also, reread what Linus said about AVX; it was much more specific than "it's useless".

          Originally posted by timofonic View Post
          What can benefit from it?
          AMD CPUs.



          • #6
            I wonder how good compiler code generation is at taking advantage of AVX512. Not so much because I care directly, but because if, say, LLVM can detect relevant use cases and optimize the generated code to use those features when it sees IR they apply to, then one would think you could build an analyzer tool on top of that compiler-level analysis: something that looks for relevant spots and simply emits a lint warning like "hey, this code area could possibly be sped up", so people could profile and optimize those areas specifically if something can be done beyond what the compiler's optimization already achieves.

            I guess the worst case would be "hot" / performance-critical code that just isn't phrased in a way the compiler can recognize as a candidate for AVX512 in that function or block. If a person doesn't catch it in a manual optimization pass, they'd never know.


            Originally posted by sobrus View Post
            I have the impression that since its introduction years ago, people have been trying to find real-life applications for AVX512 that aren't already covered by GPUs.
            With earlier SIMD extensions it was relatively simple - multimedia and games benefited a lot - but this is no longer the case with AVX512.
            So we have JSON parsing, PS3 emulation, AI inferencing, molecular dynamics simulation, and ray tracing.
            AVX512 is also a must nowadays for winning processor comparisons on Phoronix.

            And now we see Intel working on another specialized use case (sorting), with another specialized library.
            I won't argue that it's better to have AVX512 than not, but it still doesn't convince me that Linus was wrong.
            This is nowhere near as useful as SSE/SSE2 was years ago.

            Also, it would be nice to know what AVX512 is being compared to, because from what I can see, AVX2 code was added to this library much later than the AVX512 code (in 4.0, a few months ago), and it isn't getting anywhere near as much love as the AVX512 path, which is being optimized again and again. So it's 5x faster than what? MMX?

            I'm waiting for benchmarks comparing x86-64-v3 and v4 optimized distro builds. At least AVX512 has some clever instructions that AVX2 doesn't have, and more registers.



            • #7
              Originally posted by pong View Post
              I wonder how good the code optimization is for taking advantage of AVX512
              From what I see, most software isn't even optimized for AVX2 yet, despite widespread hardware support for years. Even the exceptions tend to be already highly tuned libraries like libvpx, which got a lot of AVX2 work in the recent 1.13 release (it runs noticeably better now).
              I bought an AVX2-capable machine three years ago; it was already a bit "outdated" back then, and not much has changed since. It's still a bit "outdated", while still waiting for x86-64-v3 software adoption at the same time.



              • #8
                I think most people here have a warped perception of how long SSE2 took to become widely supported. It was introduced in 2000, but it wasn't until 2004 that all new CPUs had the instructions, and then at least another five years until SSE2 CPUs saturated the user base.
                Microsoft Visual C++ only made SSE2 code generation the default with Visual Studio 2012.

                AVX512 was first introduced in 2016, and most new CPUs today still come without it. So there is no way we are at the same saturation that we had with SSE2 in 2010.



                • #9
                  Sounds great in theory, having the potential to speed up basically every program ever written, but in my initial test on a Ryzen 7950X the SIMD sort took 111 ms, while std::sort took 104 ms.

                  I was using the "sort points by distance to origin" example from the GitHub page with 1M elements.

                  I'll have to look into why that is later.
                  Last edited by david-nk; 14 February 2024, 11:03 AM.



                  • #10
                    Originally posted by david-nk View Post
                    Sounds great in theory, having the potential to speed up basically every program ever written, but in my initial test on a Ryzen 7950X the SIMD sort took 111 ms, while std::sort took 104 ms.
                    This will vary heavily between CPUs.
                    It might also depend on the compiler you're using and the options you pass it (generic -O2 vs. native -O3, for example).
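                    To make the "generic vs. native" contrast concrete, the two builds would look roughly like this (bench.cpp is a hypothetical file name; the flags are standard GCC/Clang options):

```shell
# Baseline: generic x86-64 code, moderate optimization
g++ -O2 bench.cpp -o bench_generic

# Tuned: full optimization plus every instruction set the host CPU
# supports (may emit AVX2 or AVX-512 code depending on the machine)
g++ -O3 -march=native bench.cpp -o bench_native
```

                    A binary built with -march=native will generally not run on older CPUs, which is why distros ship the generic build.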

