Intel's Glibc Non-Temporal Stores Memset Optimization Extended To AMD CPUs

  • Intel's Glibc Non-Temporal Stores Memset Optimization Extended To AMD CPUs

    Phoronix: Intel's Glibc Non-Temporal Stores Memset Optimization Extended To AMD CPUs

    A new tunable for non-temporal stores in memset was merged into the GNU C Library (glibc) Git tree last month. This memset optimization was initially limited to Intel processors because it had only been tested and benchmarked on Intel CPUs, but it has now proven useful on AMD processors too...

  • #2
    It would be interesting to see how this changes STREAM numbers on AMD CPUs... Ideally it should improve them. I remember that compiling STREAM on Intel with icc and support for non-temporal stores improved the results drastically.


    • #3
      Originally posted by pegasus
      It would be interesting to see how this changes STREAM numbers on AMD CPUs... Ideally it should improve them. I remember that compiling STREAM on Intel with icc and support for non-temporal stores improved the results drastically.
      I didn't know what STREAM is. For anyone in my situation: it's a memory bandwidth benchmark.


      • #4
        I'm not a kernel guy, but

        + if (cpu_features->basic.kind == arch_kind_intel
        + || cpu_features->basic.kind == arch_kind_amd)
        ..

        isn't calling two times "cpu_features->basic.kind" slower than calling it once and then comparing the result?

        + if (cpu_features->basic.kind == (arch_kind_intel || arch_kind_amd))
        ..


        • #5
          Originally posted by aerospace
          I'm not a kernel guy, but

          + if (cpu_features->basic.kind == arch_kind_intel
          + || cpu_features->basic.kind == arch_kind_amd)
          ..

          isn't calling two times "cpu_features->basic.kind" slower than calling it once and then comparing the result?

          + if (cpu_features->basic.kind == (arch_kind_intel || arch_kind_amd))
          ..
          Yes, but I would expect a C compiler to be able to optimise simple cases like that where everything is known (i.e. no function calls involved). So I would be surprised if they didn't compile to the same assembly.

          Also this is glibc not the kernel, so it doesn't matter that neither of us are kernel guys (though I technically have one patch in the kernel, in an obscure ACPI driver).


          EDIT: Also, we are both assuming this is a bitmask and not an enumeration. I haven't checked whether that is the case. Either way, GCC would just do a single load (unless a volatile is involved, and that seems extremely unlikely).
          Last edited by Vorpal; 11 June 2024, 12:47 PM.


          • #6
            Originally posted by aerospace
            I'm not a kernel guy, but

            + if (cpu_features->basic.kind == arch_kind_intel
            + || cpu_features->basic.kind == arch_kind_amd)
            ..

            isn't calling two times "cpu_features->basic.kind" slower than calling it once and then comparing the result?

            + if (cpu_features->basic.kind == (arch_kind_intel || arch_kind_amd))
            ..
            Nope. That won't be correct in any normal world.
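
To spell out why, here is a minimal sketch. The enum below is a stand-in invented for illustration; its names and values are assumptions, not glibc's actual definitions. In C, `a || b` evaluates to the int 1 whenever either operand is nonzero, so the proposed shortcut really compares the kind against 1 rather than against "either vendor".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU kind enum; names and values are assumptions. */
enum cpu_kind {
    KIND_UNKNOWN = 0,
    KIND_INTEL = 1,
    KIND_OTHER_VENDOR = 2,
    KIND_AMD = 3
};

/* Logical OR of two nonzero constants is always the int 1. */
static_assert((KIND_INTEL || KIND_AMD) == 1,
              "|| yields 0 or 1, not a set of values");

/* The proposed shortcut: in effect this tests kind == 1. */
static bool is_intel_or_amd_shortcut(enum cpu_kind kind)
{
    return kind == (KIND_INTEL || KIND_AMD);
}

/* The form used in the patch: two explicit comparisons. */
static bool is_intel_or_amd(enum cpu_kind kind)
{
    return kind == KIND_INTEL || kind == KIND_AMD;
}
```

With these stand-in values the shortcut accepts `KIND_INTEL` only because that constant happens to equal 1, and it rejects `KIND_AMD`. The two-comparison form accepts both.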


            • #7
              Originally posted by pegasus
              It would be interesting to see how this changes STREAM numbers on AMD CPUs... Ideally it should improve them. I remember that compiling STREAM on Intel with icc and support for non-temporal stores improved the results drastically.
              STREAM doesn't use memset, it uses FPU ops.
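
For context, a STREAM-style kernel looks roughly like the sketch below. This is an illustration of the well-known "triad" operation, not the benchmark's actual source, and the function name and signature are made up here. The loop reads two arrays and writes a third with floating-point arithmetic in between, so it never goes through memset.

```c
#include <stddef.h>

/* STREAM-style "triad" kernel (illustrative sketch): a[i] = b[i] + q * c[i]. */
static void triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```

Any non-temporal stores in a loop like this would have to come from the compiler's code generation for the loop itself, which would be consistent with the icc result pegasus remembers.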


              • #8
                From my limited understanding of this, I don't see what would prevent this from working on other architectures.


                • #9
                  Originally posted by aerospace
                  I'm not a kernel guy, but

                  + if (cpu_features->basic.kind == arch_kind_intel
                  + || cpu_features->basic.kind == arch_kind_amd)
                  ..

                  isn't calling two times "cpu_features->basic.kind" slower than calling it once and then comparing the result?

                  + if (cpu_features->basic.kind == (arch_kind_intel || arch_kind_amd))
                  ..
                  To do both you'd need to use a bitflags representation which of course has a limited number of bits, but could look like:

                  Code:
                  if (cpu_features->basic.kind & (arch_kind_intel | arch_kind_amd) != 0)
                  A switch statement wouldn't have the bits limitation, but I don't know enough about C++ to know if it compiles to a perfect hash table when enough options are present for it to be worth it for constant time.
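
A minimal sketch of both ideas, under the assumption that each vendor gets its own bit. The flag and enum names are hypothetical, not glibc's. One C detail matters for a mask test like this: `!=` binds tighter than `&`, so the `&` expression needs its own parentheses.

```c
#include <stdbool.h>

/* Hypothetical bit-flag encoding, one bit per vendor (not glibc's definitions). */
enum vendor_flag {
    VENDOR_INTEL   = 1 << 0,
    VENDOR_ZHAOXIN = 1 << 1,
    VENDOR_AMD     = 1 << 2
};

/* Hypothetical plain enumeration, used by the switch variant. */
enum cpu_kind { KIND_UNKNOWN, KIND_INTEL, KIND_ZHAOXIN, KIND_AMD };

/* Mask test with explicit parentheses around the '&' expression. */
static bool is_intel_or_amd_flags(unsigned kind)
{
    return (kind & (VENDOR_INTEL | VENDOR_AMD)) != 0;
}

/* How `kind & (mask) != 0` is parsed without those parentheses:
   the comparison runs first and yields 1, so only bit 0 of kind is tested. */
static bool as_parsed_without_parens(unsigned kind)
{
    return (kind & ((VENDOR_INTEL | VENDOR_AMD) != 0)) != 0;
}

/* Switch variant: works on a plain enumeration, no bit budget needed. */
static bool is_intel_or_amd_switch(enum cpu_kind kind)
{
    switch (kind) {
    case KIND_INTEL:
    case KIND_AMD:
        return true;
    default:
        return false;
    }
}
```

The flag version only works if the values really are distinct bits. Applied to a plain enumeration, where values such as 1, 2 and 3 share bits, the same mask test can match kinds it should not.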


                  • #10
                    Quaternions

                    Your code may give false positives. It may also give one false negative if basic.kind == 0.
                    Last edited by Volta; 12 June 2024, 01:16 PM.
