AVX/AVX-512 Tuning Doesn't Payoff For LibreOffice's Calc Spreadsheets

  • AVX/AVX-512 Tuning Doesn't Payoff For LibreOffice's Calc Spreadsheets

    Phoronix: AVX/AVX-512 Tuning Doesn't Payoff For LibreOffice's Calc Spreadsheets

    While Advanced Vector Extensions (AVX) can provide some big performance boosts when software is properly tuned for it and most often we are writing about projects adding support for it, in the case of LibreOffice they are now going ahead and removing their AVX and AVX-512 tuning...

  • #2
    To be fair, Ryzen CPUs don't have AVX implementations as powerful as Intel's, so it may not be that surprising that it does not perform as expected.

    • #3
      I wonder whether what he described happens across different compilers (e.g., GCC, Intel, Microsoft's) or only with Clang. He did say compilers, but if memory serves they're using Clang on Windows too.

      • #4
        The code must use array/vector operations heavily to get the full advantage of AVX2+. You don't always get a performance boost from AVX2.
        Here is my experience from developing some computational code for my graduation project.
        What I observed is that when I tuned my code for AVX2 (with -mavx2), the CPU frequency decreased from 2.5 GHz to 2.2 GHz on the first CPU and from 3.6 GHz to 3.0 GHz on the second, roughly a 12-17% drop. If you have array/vector operations that took, for example, 40% (4s) of the runtime without AVX2, that time drops to 2s after tuning for AVX2, saving ~20% of the total runtime.
        With some simple math, taking the original runtime as 10s, the runtime after AVX2 = 10s * 1.15 * 0.8 = 9.2s, only ~8% faster.
        So unless you have code whose array/vector operations take >35% of the total runtime, you will get nothing from AVX2.
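        The arithmetic above can be put into a tiny model. This is only a sketch of the commenter's own assumptions (a ~15% frequency throttle applied to everything, AVX2 halving the vectorised portion); the function names are mine:

```python
# Back-of-the-envelope model of the trade-off described in the post above.
# All numbers are the commenter's assumptions, not measurements.

def avx2_runtime(base_s: float, throttle: float, vector_fraction: float) -> float:
    """Runtime after enabling AVX2: everything slows down by `throttle`,
    while the vectorised fraction of the work runs twice as fast."""
    return base_s * throttle * (1.0 - vector_fraction / 2.0)

def break_even_fraction(throttle: float) -> float:
    """Vector fraction at which AVX2 neither helps nor hurts:
    throttle * (1 - f/2) == 1  =>  f = 2 * (1 - 1/throttle)."""
    return 2.0 * (1.0 - 1.0 / throttle)

print(avx2_runtime(10.0, 1.15, 0.40))   # ~9.2 s, matching the post
print(break_even_fraction(1.15))        # ~0.26
```

        Under these exact assumptions the break-even point comes out near 26% of runtime in vector work; the ">35%" figure in the post presumably assumes a smaller vector speed-up, and the real threshold depends heavily on the CPU's actual throttling behaviour.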

        • #5
          So Linus Torvalds was right about AVX-512.

          Originally posted by Linus Torvalds
          I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.

          • #6
            Originally posted by -MacNuke- View Post
            To be fair, Ryzen CPUs don't have AVX implementations as powerful as Intel's, so it may not be that surprising that it does not perform as expected.
            Testing AVX performance on Zen 1 (a Ryzen 5 2500U) clearly demonstrates that the developer has no real understanding of the hardware and is very likely incapable of maintaining the AVX2 code paths either. So it's good to remove them altogether.

            • #7
              Originally posted by zxy_thf View Post
              Testing AVX performance on Zen 1 (a Ryzen 5 2500U) clearly demonstrates that the developer has no real understanding of the hardware and is very likely incapable of maintaining the AVX2 code paths either. So it's good to remove them altogether.
              Ahh no. He weighed the probable benefits against the added maintenance burden and decided that the outcome was against the AVX code, simple as that. The AVX code needs to be improved a lot (better separation from the rest of the LibreOffice code, a maintainer, ...), and then it will be added back.

              • #8
                This isn't surprising. Getting performance out of AVX is tricky when it's mixed with code that only uses SSE. The switch between the two (at least on older CPUs) results in a painful state-transition penalty, which means that "randomly" enabling AVX can lower performance. So to really gain anything from AVX, you need compute-intensive work that is highly vectorised. From my POV, a computation that only takes a few ms is laughable in this context; there's no way it's worth the pain at such a small scale.
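                One common way to handle that "not worth it at small scale" problem is to gate the vector path on problem size, as libraries shipping multiple code paths often do. A minimal sketch, where the threshold and all names are hypothetical and the "vector" kernel is just a stand-in:

```python
# Hypothetical size-gated dispatch: only enter the wide-vector path when
# the batch is large enough to amortise the transition / throttling
# overhead described above. The threshold is illustrative; in practice
# it would be tuned per machine.
AVX_MIN_ELEMENTS = 4096

def sum_scalar(values):
    # Plain scalar loop; cheap to enter, no SIMD state changes.
    total = 0.0
    for v in values:
        total += v
    return total

def sum_vectorised(values):
    # Stand-in for an AVX kernel; in real code this would be a SIMD loop.
    return float(sum(values))

def fast_sum(values):
    """Dispatch to the 'vector' path only for large inputs."""
    if len(values) >= AVX_MIN_ELEMENTS:
        return sum_vectorised(values)
    return sum_scalar(values)
```

                The design point is that the dispatch check itself is trivially cheap, so small inputs never pay the wide-vector entry cost.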

                • #9
                  Originally posted by uid313 View Post
                  So Linus Torvalds was right about AVX-512.
                  To be fair, AVX-512 does have its use cases. Ian Cutress from AnandTech is a researcher who is quite fond of it, since he used it in his own work. On the other hand, that also exemplifies how rare and specific the actual use cases of AVX-512 are, showing that it really doesn't have much use in consumer-grade CPUs.

                  • #10
                    Originally posted by M@GOid View Post

                    To be fair, AVX-512 does have its use cases. Ian Cutress from AnandTech is a researcher who is quite fond of it, since he used it in his own work. On the other hand, that also exemplifies how rare and specific the actual use cases of AVX-512 are, showing that it really doesn't have much use in consumer-grade CPUs.
                    I would seriously take what Ian Cutress says about AVX-512 with a truckload of salt. I've confronted him about his AVX-512 tests and the code he's using (3D particle movements from his own PhD thesis), but he's been very evasive in his explanations. The code is closed source, its credibility resting purely on Ian's own word, and the results usually don't make sense (i.e. some of the speed-up cannot be explained by the doubling of the vector size alone, which implies he's actually benchmarking specific instructions that do not exist in AVX2, or worse, comparing apples to oranges by proxy of completely different algorithms).

                    From my own professional experience, I would offer a few judgements about AVX-512:
                    • AVX-512 is borderline useless. It's nowhere near as ubiquitous as AVX2 and still incurs heavy frequency throttling, especially when you consider that the other CPU cores are likely running other jobs / containers that will also pay the price for another process using AVX-512 on the same piece of silicon.
                    • AVX-512 is thus only useful if the application is already heavily vectorised and runs multithreaded on all the CPU cores (which also makes judging the performance impact of the frequency throttling easier, since it's self-contained to just one application).
                    • If the application is already (or wants to be) heavily vectorised and multithreaded, then AVX-512 is actually the worse option. GPUs (and CUDA specifically) are better alternatives (empirically, on comparably priced hardware at comparable power consumption, GPUs are usually ~7x faster than equivalent CPUs). From 3D offline rendering to AI, nobody of sound judgement uses AVX-512 instead of GPUs unless they work in one of the many Intel marketing departments.
                    • If you're still going to do AVX-512 programming, for the love of all past and future deities, please use ISPC (https://ispc.github.io/) and do not manually implement a separate code path for each instruction set (bonus: ISPC works on ARM NEON too).
