GNU C Library Tuning For AArch64 Helps Memset Performance By ~24%

  • GNU C Library Tuning For AArch64 Helps Memset Performance By ~24%

    Phoronix: GNU C Library Tuning For AArch64 Helps Memset Performance By ~24%

    A patch merged yesterday into the GNU C Library (glibc) codebase can improve the memset() function's performance by around 24% as measured on an Arm Neoverse-N1 core...
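
    For anyone wanting to sanity-check a figure like that locally, here is a rough, self-contained microbenchmark sketch. It is not the glibc benchtests harness; the buffer sizes, iteration counts and GB/s arithmetic are purely illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t sizes[] = { 128, 4096, 65536, 1 << 20 };   /* arbitrary test sizes */
        const int iters = 100000;
        char *buf = malloc(1 << 20);
        if (!buf)
            return 1;

        for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < iters; i++) {
                memset(buf, i & 0xff, sizes[s]);
                __asm__ __volatile__("" ::: "memory");   /* keep the compiler from eliding "dead" memsets */
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("%8zu bytes: %.2f GB/s\n", sizes[s], (double)sizes[s] * iters / ns);
        }
        free(buf);
        return 0;
    }

    Running the same binary against an unpatched and a patched glibc build would show whether the reported gain holds on other cores.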


  • #2
    Who said software was not as important as hardware?


    • #3
      Originally posted by Phoronos View Post
      Who said software was not as important as hardware?
      Don't know. Who said that? Countless CPU cycles and hardware improvements have been wasted due to crappy software.


      • #4
        Originally posted by Phoronos View Post
        Who said software was not as important as hardware?
        It is way more so.


        • #5
          I would love to see some benchmarks.


          • #6
            Originally posted by Veto View Post
            Countless CPU cycles and hardware improvements have been wasted due to crappy software.
            A lot of the "IPC" gains in recent generations of modern CPUs are probably due to them getting ever better at executing such suboptimal code.

            A corollary of that should be that such code optimizations are less rewarding, on newer CPUs. It would be interesting to see if that holds true. The Neoverse N1 uses cores derived from the A76, which started shipping in phones almost 6 years ago. So, it's not a very modern core and certainly more area-optimized than the latest X-series cores.

            Originally posted by monty11ez View Post
            I would love to see some benchmarks.
            Please also try it on Nvidia's Grace or Amazon's Graviton 4, at the very least! I'm less interested in AmpereOne, because its cores are more area-optimized than the Neoverse V-series cores used in the former two I mentioned.
            Last edited by coder; 12 September 2024, 03:04 AM.


            • #7
              Originally posted by coder View Post
              A lot of the "IPC" gains in recent generations of modern CPUs are probably due to them getting ever better at executing such suboptimal code.

              A corollary of that should be that such code optimizations are less rewarding, on newer CPUs. It would be interesting to see if that holds true. The Neoverse N1 uses cores derived from the A76, which started shipping in phones almost 6 years ago. So, it's not a very modern core and certainly more area-optimized than the latest X-series cores.
              It's the reverse: as CPUs get wider, deeper and larger in every way, they also get better at executing suboptimal code. However, that doesn't mean we will ever get decent performance from -O0! So software optimization will always matter, as does using the latest compilers and options that produce the fastest possible code (e.g. -march=native -Ofast).
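
              As a hedged illustration of how much those flags can matter (the file and function names here are made up for the example):

              /* saxpy.c -- the same C source produces very different machine code:
               *   gcc -O0 -c saxpy.c                    straightforward scalar code, lots of stack traffic
               *   gcc -Ofast -march=native -c saxpy.c   typically unrolled SIMD FMAs on a modern CPU
               * Note that -Ofast implies -ffast-math, so it may change floating-point results. */
              void saxpy(float *restrict y, const float *restrict x, float a, int n)
              {
                  for (int i = 0; i < n; i++)
                      y[i] = a * x[i] + y[i];
              }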

              Note that IPC gains are often measured on SPEC, i.e. a large body of software that remains fixed.


              • #8
                Originally posted by coder View Post
                A lot of the "IPC" gains in recent generations of modern CPUs are probably due to them getting ever better at executing such suboptimal code.

                A corollary of that should be that such code optimizations are less rewarding, on newer CPUs. It would be interesting to see if that holds true.
                This is a great example of the sentiments that lead to slow and wasteful software: "compilers are so good at optimizing that it is not worth it to optimize one's code", "CPUs are getting faster, so it is not worth optimizing one's code", "memory is cheap, so it is not worth optimizing one's code", "it is easier to code this in Python/Electron/Java/...".

                While the trade-offs involved may justify not prioritizing performance and resource use, often it is just an excuse for being lazy/ignorant.

                And no, the IPC gains are mainly due to executing more things in parallel and masking memory access delays. These gains are even greater when software is optimized to utilize them.


                • #9
                  Originally posted by Veto View Post
                  This is a great example of the sentiments that lead to slow and wasteful software: "compilers are so good at optimizing that it is not worth it to optimize one's code", "CPUs are getting faster, so it is not worth optimizing one's code", "memory is cheap, so it is not worth optimizing one's code", "it is easier to code this in Python/Electron/Java/...".
                  I accept your pivot, as long as you don't pretend that I'm arguing for "slow and wasteful software". All I did was to ask the question of how much newer CPUs benefit from the types of tweaks made to this memset code, because I like to make decisions on the basis of data and not rhetoric. Even if the amount were less, whether or not the optimization still made sense would certainly depend on how much less.

                  Originally posted by Veto View Post
                  While the trade-offs involved may justify not prioritizing performance and resource use, often it is just an excuse for being lazy/ignorant.
                  Yeah, it depends. I write shell scripts and Python, but also C/C++. I try to use the right tool for the job. For some things, it's just a waste of time and a maintenance liability to use C/C++, but for other things it can be a performance killer if you use Python. And even if you use C, I think we all know that using bad algorithms and data structures can be a far greater performance killer than what you get by coding closer to the metal.
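
                  To make that concrete, here is a small hedged example (the names are made up): the same membership test written two ways, where the second wins by algorithm rather than by low-level cleverness.

                  #include <stdlib.h>

                  /* O(n) per lookup: no amount of low-level tuning makes this scale */
                  int contains_linear(const int *table, size_t n, int key)
                  {
                      for (size_t i = 0; i < n; i++)
                          if (table[i] == key)
                              return 1;
                      return 0;
                  }

                  /* O(log n) per lookup, assuming the table is kept sorted */
                  static int cmp_int(const void *a, const void *b)
                  {
                      int x = *(const int *)a, y = *(const int *)b;
                      return (x > y) - (x < y);
                  }

                  int contains_sorted(const int *table, size_t n, int key)
                  {
                      return bsearch(&key, table, n, sizeof *table, cmp_int) != NULL;
                  }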

                  Originally posted by Veto View Post
                  And no, the IPC gains are mainly due to executing more things in parallel and masking memory access delays.
                  LOL, you give the simplistic version. IPC gains don't just come from wider CPUs and deeper out-of-order machinery. They also come from things like memory renaming and better branch-prediction, both of which can enable greater occupancy of existing parallel execution resources, particularly if some loop hasn't been unrolled or has nested branches. For instance:

                  When I recently interviewed Mike Clark, he told me, “…you’ll see the actual foundational lift play out in the future on Zen 6, even though it was really Zen 5 that set the table for that.” And at that same Zen 5 architecture event, AMD’s Chief Technology Officer Mark Papermaster said, “Zen 5 is a ground-up redesign of the Zen architecture,” which has brought numerous and impactful changes to the design of the core.



                  Originally posted by Veto View Post
                  These gains are even greater when software is optimized to utilize them.
                  Sometimes. The Zen cores from gen 1 to gen 4 all had 6-wide dispatch and 4 ALUs, yet the IPC increased quite a lot over that range. Some of those IPC improvements were certainly about overcoming barriers that limited concurrency available to prior generations. Again, I can point to the specific example of memory-renaming as a capability Zen 2 gained to help it overcome the performance penalties stemming from register spills, which people generally try to avoid in well-optimized code. Another example is the way Zen 3 added renaming of the flags register, which can break false dependencies between independent branches.
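
                  A hedged sketch of the register-pressure situation being described (the function name and the number of accumulators are arbitrary): with more values live at once than x86-64 has general-purpose registers, the compiler can be forced to spill some of them to the stack, and Zen 2's memory renaming targets exactly those spill/reload round-trips.

                  /* 18 accumulators plus the pointer, bound and index are more live values
                   * than the 16 general-purpose registers of x86-64, so (absent
                   * auto-vectorization) some of them typically end up spilled. */
                  long sum_interleaved(const long *v, long n)
                  {
                      long a0 = 0, a1 = 0, a2 = 0, a3 = 0, a4 = 0, a5 = 0, a6 = 0, a7 = 0, a8 = 0;
                      long b0 = 0, b1 = 0, b2 = 0, b3 = 0, b4 = 0, b5 = 0, b6 = 0, b7 = 0, b8 = 0;

                      for (long i = 0; i + 18 <= n; i += 18) {
                          a0 += v[i];      a1 += v[i + 1];  a2 += v[i + 2];  a3 += v[i + 3];
                          a4 += v[i + 4];  a5 += v[i + 5];  a6 += v[i + 6];  a7 += v[i + 7];
                          a8 += v[i + 8];  b0 += v[i + 9];  b1 += v[i + 10]; b2 += v[i + 11];
                          b3 += v[i + 12]; b4 += v[i + 13]; b5 += v[i + 14]; b6 += v[i + 15];
                          b7 += v[i + 16]; b8 += v[i + 17];
                      }
                      return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8
                           + b0 + b1 + b2 + b3 + b4 + b5 + b6 + b7 + b8;
                  }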

                  So... yeah, beat that drum. I've been around long enough to see lots of instances where people trashed code with "optimizations", without really understanding what they're doing. In either case, ignorance is the real problem. Once you know what the performance pitfalls are for compilers and CPUs and you've profiled the code to know where the hotspots actually are, then go ahead and optimize up to that point of diminishing returns.

                  Also, knowing what you're doing enables you to optimize code with a lighter touch, so that it's not so incomprehensible to the next person who touches it that they trash your magnum opus and rewrite it with something that might be even worse than what you started with.


                  • #10
                    Originally posted by coder View Post
                    I accept your pivot, as long as you don't pretend that I'm arguing for "slow and wasteful software". All I did was to ask the question of how much newer CPUs benefit from the types of tweaks made to this memset code, because I like to make decisions on the basis of data and not rhetoric.
                    Thanks for accommodating my rant - it was a relief 😂 It was not aimed at anyone in particular, and I completely agree that decisions (and optimizations) should be based on data (e.g. profiling) and not opinions. Based on your lengthy post, you are clearly very knowledgeable about optimizations and CPU architectures.

                    Originally posted by coder View Post
                    LOL, you give the simplistic version. IPC gains don't just come from wider CPUs and deeper out-of-order machinery. They also come from things like memory renaming and better branch-prediction, both of which can enable greater occupancy of existing parallel execution resources, particularly if some loop hasn't been unrolled or has nested branches.
                    True, it was quite simplistic. Although, I could argue that branch prediction, register renaming and whatnot are also just elaborate schemes for enabling more parallel execution and hiding memory delays. But that is mincing words.

                    Originally posted by coder View Post
                    So... yeah, beat that drum. I've been around long enough to see lots of instances where people trashed code with "optimizations", without really understanding what they're doing. In either case, ignorance is the real problem. Once you know what the performance pitfalls are for compilers and CPUs and you've profiled the code to know where the hotspots actually are, then go ahead and optimize up to that point of diminishing returns.
                    Agreed. Ignorance and indifference often lead to bad architectural choices that are not easily fixed by optimization.

                    More on-topic: the memset patch in the article is, on the other hand, quite clever and a great example of how modern CPU architectures not only increase IPC for old code, but also enable even more intricate optimizations.
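
                    For readers curious what such hand-tuned routines look like in broad strokes, here is a purely illustrative C sketch (toy_memset is a made-up name; the actual glibc routine is hand-written AArch64 assembly and handles more cases): replicate the fill byte into a wide value, cover small sizes with a pair of possibly overlapping stores, and cover large sizes with a bulk loop plus an overlapping tail.

                    #include <stdint.h>
                    #include <string.h>

                    void *toy_memset(void *dst, int c, size_t n)
                    {
                        uint64_t v = 0x0101010101010101ULL * (uint8_t)c;   /* replicate the byte into 8 lanes */
                        unsigned char *d = dst;

                        if (n <= 16) {                      /* small sizes: two possibly overlapping stores */
                            if (n >= 8) {
                                memcpy(d, &v, 8);
                                memcpy(d + n - 8, &v, 8);
                            } else if (n >= 4) {
                                uint32_t w = (uint32_t)v;
                                memcpy(d, &w, 4);
                                memcpy(d + n - 4, &w, 4);
                            } else {
                                for (size_t i = 0; i < n; i++)
                                    d[i] = (unsigned char)c;
                            }
                            return dst;
                        }

                        /* bulk path: 16 bytes per iteration, then an overlapping 16-byte tail */
                        for (size_t i = 0; i + 16 <= n; i += 16) {
                            memcpy(d + i, &v, 8);
                            memcpy(d + i + 8, &v, 8);
                        }
                        memcpy(d + n - 16, &v, 8);
                        memcpy(d + n - 8, &v, 8);
                        return dst;
                    }

                    Again, this only gestures at the structure; the interesting tuning in the actual patch happens at the microarchitectural level.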
