An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

  • An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

    Phoronix: An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

    Borislav Petkov has set out to improve the Linux kernel's memset function, an area previously criticized by Linus Torvalds and other prominent developers...


  • #2
    Typo:

    Originally posted by phoronix View Post
    Phoronix: An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

    Borislav Petkov has taken to improve the Linux kernel's memset function with it being an area previously criticzed by Linus Torvalds and other prominent developers...

    http://www.phoronix.com/scan.php?pag...-Better-MEMSET

    Comment


    • #3
      Perhaps it's premature, but I wonder if they could use the new movdiri or movdir64b instructions, introduced in Intel's Tremont core and Tiger Lake CPUs.

      https://www.phoronix.com/scan.php?pa...t-Instructions
      https://fuse.wikichip.org/news/1158/...lus-successor/

      It's hard to find good information on them, but I think they should avoid the write-miss penalty inherent in copy-back caches.
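      A minimal detection sketch for these instructions, assuming the CPUID feature bits documented in Intel's SDM (MOVDIRI is leaf 7/subleaf 0, ECX bit 27; MOVDIR64B is ECX bit 28); compiles with gcc/clang on x86:

      ```c
      /* Sketch: detect MOVDIRI/MOVDIR64B support via CPUID leaf 7.
       * Bit positions are taken from the Intel SDM; x86 only. */
      #include <stdio.h>
      #include <cpuid.h>

      int main(void)
      {
          unsigned int eax, ebx, ecx, edx;

          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
              puts("CPUID leaf 7 not supported");
              return 0;
          }
          printf("MOVDIRI:   %s\n", (ecx & (1u << 27)) ? "yes" : "no");
          printf("MOVDIR64B: %s\n", (ecx & (1u << 28)) ? "yes" : "no");
          return 0;
      }
      ```

      On anything older than Tremont/Tiger Lake both lines should report "no".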

      Comment


      • #4
        BTW, if anyone is wondering about ERMS, this seems like a pretty good discussion of the subject:

        I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and ...


        You can see if your CPU has it by grepping for erms in /proc/cpuinfo. It was introduced in Ivy Bridge and is documented here (search for "Enhanced REP MOVSB and STOSB Operation"):
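        The same check can also be done directly with CPUID instead of grepping /proc/cpuinfo; a small sketch assuming the bit position from the SDM (ERMS is CPUID leaf 7, subleaf 0, EBX bit 9):

        ```c
        /* Sketch: the CPUID equivalent of `grep erms /proc/cpuinfo`.
         * ERMS is CPUID.(EAX=7,ECX=0):EBX bit 9 per the Intel SDM; x86 only. */
        #include <stdio.h>
        #include <cpuid.h>

        int main(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 9)))
                puts("ERMS: supported");
            else
                puts("ERMS: not supported");
            return 0;
        }
        ```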

        Comment


        • #5
          I think this is something glibc optimizes more than the kernel, so how does it compare to their implementation?

          Comment


          • #6
            As much as I love optimizations, I get mixed, icky feelings when beautiful kernel code (arch-specific or not) is tainted with way too much hardware knowledge.
            There is a reason we have a good compiler, and a good compiler plus good userspace libraries, to abstract horrid and fiddly, sensitive stuff like this.

            Comment


            • #7
              Originally posted by milkylainen View Post
              As much as I love optimizations, I get mixed, icky feelings when beautiful kernel code (arch-specific or not) is tainted with way too much hardware knowledge.
              There is a reason we have a good compiler, and a good compiler plus good userspace libraries, to abstract horrid and fiddly, sensitive stuff like this.
              I see your point, but I think this is just a step closer to JIT compilation for the native CPU. That would probably be the best of both worlds - keep your nice code and make good & efficient use of the native hardware.

              I'm not saying Linux will ever get there, but perhaps some alternative kernel will.

              Comment


              • #8
                Originally posted by carewolf View Post
                I think this is something glibc optimizes more than the kernel, so how does it compare to their implementation?
                glibc has ERMS-optimized string and memory copy routines.


                I ran the test program in [1], and what's interesting is that my 6-year-old Haswell is over twice as fast as a Zen+ 2700X
                (e.g. for 512 bytes: 63/75/121 for Haswell vs. 154/174/188 for Zen+). Clang produces slightly faster code than GCC for the 2700X.

                Anyone with a Ryzen 3000 interested in the code at [1]? I'm curious whether Zen 2 has improved there.


                [1] https://lkml.org/lkml/2019/9/13/807

                Comment


                • #9
                  Originally posted by mlau View Post
                  I ran the test program in [1]
                  Meh, he's sampling rdtsc twice per iteration (with mfence) and not measuring how much overhead that adds. Sure, he starts at size 0, but that still calls the code under test, even if it has no work to do.

                  My preferred way to benchmark small operations is to exponentially increase the number of iterations until the loop time just fits within an OS timeslice, and then repeat that number several times and take the minimum. I also subtract off the overhead of sampling the timer. In this way, I can quickly get cycle-accurate timings without too much concern about the system being completely idle.
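                  As a rough illustration of that methodology (this is not code from the thread; clock_gettime() stands in for rdtsc for portability, and the 2 ms budget and 5 repeats are arbitrary choices of mine):

                  ```c
                  /* Sketch: grow the iteration count until one batch fills a
                   * time budget, repeat the batch several times keeping the
                   * minimum, and subtract the cost of reading the clock. */
                  #include <stdio.h>
                  #include <string.h>
                  #include <time.h>

                  static long long now_ns(void)
                  {
                      struct timespec ts;
                      clock_gettime(CLOCK_MONOTONIC, &ts);
                      return ts.tv_sec * 1000000000LL + ts.tv_nsec;
                  }

                  static char buf[512];

                  int main(void)
                  {
                      const long long budget = 2000000;   /* ~2 ms per batch */
                      long long iters = 1;

                      /* 1. Exponentially grow the batch until it fills the budget. */
                      for (;;) {
                          long long t0 = now_ns();
                          for (long long i = 0; i < iters; i++)
                              memset(buf, (int)i, sizeof(buf));
                          if (now_ns() - t0 >= budget)
                              break;
                          iters *= 2;
                      }

                      /* 2. Measure the overhead of sampling the timer itself. */
                      long long t0 = now_ns();
                      long long overhead = now_ns() - t0;

                      /* 3. Repeat the batch several times and keep the minimum. */
                      long long best = -1;
                      for (int r = 0; r < 5; r++) {
                          long long t = now_ns();
                          for (long long i = 0; i < iters; i++)
                              memset(buf, (int)i, sizeof(buf));
                          t = now_ns() - t - overhead;
                          if (best < 0 || t < best)
                              best = t;
                      }

                      printf("memset(512): %.2f ns/op over %lld iterations\n",
                             (double)best / (double)iters, iters);
                      return 0;
                  }
                  ```

                  The minimum (rather than the mean) filters out timeslice preemption and other one-off noise, which is the point of the approach described above.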

                  Anyway, what are your timings for size 0? We ought to know how much rdtsc_ordered() might be contributing to that discrepancy.


                  Originally posted by mlau View Post
                  what's interesting is that my 6-year-old Haswell is over twice as fast as a Zen+ 2700X
                  I wonder if ERMS is patented. Check /proc/cpuinfo - does your 2700X list the erms flag?

                  Originally posted by mlau View Post
                  for 512 bytes: 63/75/121 for Haswell vs. 154/174/188 for Zen+
                  BTW, for those who didn't bother to read the linked benchmark, the numbers correspond to:
                  __builtin_memset()/rep_stosb()/memset_rep()

                  Comment


                  • #10
                    The fence is there to ensure that the TSC count is taken only after previous stores have retired to memory; the benchmark tests how many cycles it takes to store X bytes into the memory hierarchy.
                    Count 0 illustrates the setup costs of the various methods. IIRC, ERMS has a relatively high startup cost, so it only makes sense for larger amounts.

                    I don't think ERMS is patented; from my understanding it's a microcode optimization of the common "rep; stosb" (repeat store byte/word/long) patterns to take
                    advantage of the wide internal datapaths. I.e. you can write "mov ecx, 1000h; rep; stosb" and the microcode will optimize that into 128 256-bit stores almost for free.
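                    A sketch of that pattern as GNU C inline assembly (x86-64 only; the register bindings follow the stosb convention: RDI = destination, AL = fill byte, RCX = count):

                    ```c
                    /* Sketch: memset via "rep stosb", the pattern ERMS-capable
                     * microcode turns into wide internal stores.  x86-64 only. */
                    #include <assert.h>
                    #include <stddef.h>
                    #include <stdio.h>

                    static void memset_rep_stosb(void *dst, int c, size_t n)
                    {
                        asm volatile("rep stosb"
                                     : "+D"(dst), "+c"(n)   /* RDI = dst, RCX = count */
                                     : "a"(c)               /* AL  = fill byte */
                                     : "memory");
                    }

                    int main(void)
                    {
                        static char buf[0x1000];            /* the 1000h from the example above */

                        memset_rep_stosb(buf, 0xAA, sizeof(buf));
                        for (size_t i = 0; i < sizeof(buf); i++)
                            assert((unsigned char)buf[i] == 0xAA);
                        puts("rep stosb: 0x1000 bytes filled");
                        return 0;
                    }
                    ```

                    Whether the microcode fast path actually kicks in depends on the CPU (hence the erms flag), but the instruction sequence itself works on any x86-64 machine.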

                    Comment
