Intel Lands A Nice Memset Performance Optimization In Glibc

  • Intel Lands A Nice Memset Performance Optimization In Glibc

    Phoronix: Intel Lands A Nice Memset Performance Optimization In Glibc

    Intel engineer Noah Goldstein has landed another nice performance optimization in the GNU C Library "glibc" for benefiting newer Intel processors...


  • #2
    It would be nice to see a benchmark of this with musl and other libcs.



    • #3
      On the flip size,
      Heh... in such cases, I have to wonder whether the author made a typo or they really think it's that and not "on the flip side".

      non-temporal writes can avoid data in their RFO requests saving memory bandwidth.
      RFO (Read For Ownership) is indeed a nasty penalty and a classical disadvantage of copy-back caches, but I sure hope non-temporal stores don't entirely bypass the cache hierarchy. When I first played around with them, back in the Pentium 4 days, they seemed to be implemented by being constrained to a single cache set.

      Depending on how much of the cache hierarchy you bypass and how large the transfer is, it might indeed be suboptimal to use non-temporal stores, especially if you consider that a big use case for memset() is to zero newly-allocated memory pretty much immediately before using it. In such cases you want the address range to be hot in cache!

      I'm guessing there's not a full cache bypass going on, or else the benchmark results wouldn't show such an improvement.
      Last edited by coder; 31 May 2024, 10:06 AM.
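For readers who have not used non-temporal stores, here is a minimal sketch in C of what a streaming fill looks like with SSE2 intrinsics. It is not glibc's implementation (glibc's x86-64 memset is hand-written assembly using wider vectors); the function name fill_nontemporal is hypothetical, and the code falls back to plain memset() when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper, NOT glibc's memset: fill n bytes at dst with the
   byte value c, using non-temporal stores for the 16-byte-aligned middle. */
static void *fill_nontemporal(void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__SSE2__)
    unsigned char *p = dst;

    /* Head: ordinary stores until p is 16-byte aligned. */
    size_t head = (size_t)(-(uintptr_t)p & 15u);
    if (head > n)
        head = n;
    memset(p, c, head);
    p += head;
    n -= head;

    /* Body: one 16-byte streaming store per iteration. */
    const __m128i v = _mm_set1_epi8((char)c);
    for (; n >= 16; p += 16, n -= 16)
        _mm_stream_si128((__m128i *)p, v);

    /* Streaming stores are weakly ordered, so fence before returning. */
    _mm_sfence();

    /* Tail: the remaining 0..15 bytes with ordinary stores. */
    memset(p, c, n);
    return dst;
#else
    /* Assumption: without SSE2, just use the regular cached path. */
    return memset(dst, c, n);
#endif
}
```

How much of the cache hierarchy such stores actually bypass is up to the microarchitecture, which is exactly the open question in this post; the sketch only shows the programming model.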



      • #4
        Running /lib64/ld-linux-x86-64.so.2 --list-tunables | grep temporal shows glibc.cpu.x86_non_temporal_threshold = 4MiB here, which is half of the L3 size (see getconf -a | grep CACHE), so I think the point is that a fill this large is better off bypassing the cache than throwing out most of it.
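The same L3 figure that getconf reports can also be read from a program. The sketch below assumes glibc, where sysconf() accepts the _SC_LEVEL3_CACHE_SIZE extension; the helper names l3_cache_bytes and half_l3_bytes are hypothetical, and both return -1 when the value is not available (for example on another libc or inside some virtual machines).

```c
#include <unistd.h>

/* Returns the L3 cache size in bytes as reported by sysconf(),
   or -1 if the libc or the platform does not expose it. */
static long l3_cache_bytes(void)
{
#ifdef _SC_LEVEL3_CACHE_SIZE
    long size = sysconf(_SC_LEVEL3_CACHE_SIZE);
    return size > 0 ? size : -1;
#else
    return -1;
#endif
}

/* Hypothetical helper: the "half of L3" figure mentioned in the post.
   This only mirrors the observation above, not glibc's own formula. */
static long half_l3_bytes(void)
{
    long l3 = l3_cache_bytes();
    return l3 > 0 ? l3 / 2 : -1;
}
```

To experiment with a different cutoff without rebuilding anything, the tunable named above can be set through the GLIBC_TUNABLES environment variable, e.g. GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=<bytes>.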



        • #5
          Originally posted by coder
          Depending on how much of the cache hierarchy you bypass and how large the transfer is, it might indeed be suboptimal to use non-temporal stores, especially if you consider that a big use case for memset() is to zero newly-allocated memory pretty much immediately before using it. In such cases you want the address range to be hot in cache!
          On the other hand, if you write out a few megabytes, you probably don't want to evict nearly everything from the cache, given that the start of the memset area (where you might want to start accesses afterwards) is probably going to be already evicted again anyway.



          • #6
            Originally posted by Tobu
            Running /lib64/ld-linux-x86-64.so.2 --list-tunables | grep temporal shows glibc.cpu.x86_non_temporal_threshold = 4MiB here, which is half of the L3 size (see getconf -a | grep CACHE), so I think the point is that a fill this large is better off bypassing the cache than throwing out most of it.
            Yeah, then it's a more limited optimization and not a proper substitute for the one lost due to the mitigation. I think limiting it to such large sizes is just about not trashing other L3 contents and really not about avoiding RFO.

            One thing that puzzles me about the RFO aspect is that I assumed you could avoid RFO with an aligned, 512-bit write. That matches the 64 byte cacheline size, so I hoped caches would be smart enough not to generate a read in that case. In the AVX-512 path, maybe RFO is already out of the picture and this is really just about not trashing other cache contents (which is the main point of non-temporal stores).
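The size cutoff discussed in this post boils down to a simple policy: small fills go through the cache so the data is hot for the caller, and fills at or above a threshold use non-temporal stores so they do not evict everything else. A minimal sketch of that policy follows; the enum and the function choose_fill_path are hypothetical names, and glibc's real dispatch logic has more cases than this.

```c
#include <assert.h>
#include <stddef.h>

/* Which store strategy a fill of a given size should take. */
enum fill_path { FILL_CACHED, FILL_NONTEMPORAL };

/* Hypothetical policy, NOT glibc's actual dispatch: use non-temporal
   stores only once the fill reaches the threshold (in bytes). */
static enum fill_path choose_fill_path(size_t n, size_t threshold)
{
    assert(threshold > 0);
    return n >= threshold ? FILL_NONTEMPORAL : FILL_CACHED;
}
```

With the 4 MiB threshold quoted above, a 64-byte fill stays on the cached path and an 8 MiB fill takes the non-temporal one.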



            • #7
              Originally posted by coder
              Heh... in such cases, I have to wonder whether the author made a typo or they really think it's that and not "on the flip side".


              RFO (Read For Ownership) is indeed a nasty penalty and a classical disadvantage of copy-back caches, but I sure hope non-temporal stores don't entirely bypass the cache hierarchy. When I first played around with them, back in the Pentium 4 days, they seemed to be implemented by being constrained to a single cache set.

              Depending on how much of the cache hierarchy you bypass and how large the transfer is, it might indeed be suboptimal to use non-temporal stores, especially if you consider that a big use case for memset() is to zero newly-allocated memory pretty much immediately before using it. In such cases you want the address range to be hot in cache!

              I'm guessing there's not a full cache bypass going on, or else the benchmark results wouldn't show such an improvement.
              Several MBytes of cache isn't going to stay that hot. So this will still be an improvement. Note the conditions under which the optimization is used.
