An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance


  • #11
    Originally posted by mlau View Post
    The fence is to ensure that the TSC count is taken after previous stores have retired to memory. The benchmark tests how many cycles it takes to store X bytes into the memory hierarchy.
    I know why it's there. The point was that reading tsc and executing mfence presumably have some overhead.

    Originally posted by mlau View Post
    Count 0 illustrates the setup costs of the various methods.
    Yeah, so he should also measure just the timing overhead, which might vary on different CPUs.
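
    To illustrate, here is a minimal sketch (my own, not from the benchmark in the thread) of measuring just the bracketing overhead: time an empty region with the same mfence+rdtsc pair and subtract the result from real measurements. x86-only; non-x86 builds fall back to returning 0.

    ```c
    #include <stdint.h>

    #if defined(__x86_64__) || defined(__i386__)
    #include <x86intrin.h>   /* __rdtsc(), _mm_mfence() */
    #endif

    /* Time an empty region with the same mfence+rdtsc bracketing the
       benchmark uses; the result approximates the fixed harness overhead. */
    static uint64_t timing_overhead(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        _mm_mfence();
        uint64_t t0 = __rdtsc();
        /* empty measured region */
        _mm_mfence();
        uint64_t t1 = __rdtsc();
        return t1 - t0;
    #else
        return 0;   /* fallback on non-x86 targets */
    #endif
    }
    ```

    Averaging this over many iterations would give a per-CPU baseline, which is exactly the quantity that "might vary on different CPUs".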

    Originally posted by mlau View Post
    I don't think ERMS is patented; from my understanding it's a microcode optimization of the common "rep; stosb" (repeat store byte/word/long) patterns to take advantage of the wide internal datapaths. I.e. you can write "mov ecx, 1000h; rep; stosb" and microcode will optimize that into 128 256-bit stores almost for free.
    I get what it does. The question was whether CPUID on your Ryzen 2700X lists it or not. If not, then we can speculate about why it's missing.
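
    For reference, the flag is directly checkable: CPUID leaf 7, subleaf 0, reports ERMS in EBX bit 9. A minimal sketch using GCC/Clang's <cpuid.h> (x86-only; other targets simply report 0):

    ```c
    #if defined(__x86_64__) || defined(__i386__)
    #include <cpuid.h>
    #endif

    /* Returns 1 if CPUID advertises ERMS (leaf 07H, subleaf 0, EBX bit 9),
       0 otherwise or on non-x86 targets. */
    int has_erms(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        unsigned eax, ebx, ecx, edx;
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return (ebx >> 9) & 1;
    #endif
        return 0;
    }
    ```

    On Linux the same bit shows up as the `erms` flag in /proc/cpuinfo.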



    • #12
      Originally posted by mlau View Post

      glibc has ERMS-optimized string and memory copy routines.


      I ran the test program in [1] and, interestingly, my 6-year-old Haswell is over twice as fast as a Zen 2700X
      (e.g. for 512 bytes: 63/75/121 cycles on Haswell vs. 154/174/188 on Zen+). Clang produces slightly faster code than GCC for the 2700X.

      Anyone with a Ryzen 3000 interested in running the code at [1]? I'd like to know whether Zen 2 has improved there.


      [1] https://lkml.org/lkml/2019/9/13/807
      It will be memory-bandwidth limited in all real cases. What do you think you are even measuring? Since even a naive implementation hits the memory-bandwidth limit, there are only two things worth measuring: power consumption and binary size.



      • #13
        Originally posted by carewolf View Post
        It will be memory bandwidth limited in all real cases.
        I don't agree with that. CPU caches exist because they provide benefits (which means they must have a decent hit-rate). If CPUs had no caches, then you'd have a point.

        In fact, you can find tons of code that uses memset() to zero-initialize structures, many of which are even stack-based (which is almost always resident in L1).
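
    A trivial sketch of that pattern (struct request is a hypothetical type; any stack-resident struct works the same way):

    ```c
    #include <string.h>

    /* Hypothetical example struct. */
    struct request {
        int    id;
        char   name[32];
        double weight;
    };

    /* Zero-initialize a stack struct with memset: the buffer lives on the
       stack, which is almost always hot in L1, so this memset never touches
       main memory at all. Returns 1 if the struct is fully zeroed. */
    int zeroed_ok(void)
    {
        struct request req;
        memset(&req, 0, sizeof req);
        return req.id == 0 && req.name[0] == '\0' && req.weight == 0.0;
    }
    ```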



        • #14
          Originally posted by carewolf View Post

          It will be memory-bandwidth limited in all real cases. What do you think you are even measuring? Since even a naive implementation hits the memory-bandwidth limit, there are only two things worth measuring: power consumption and binary size.
          It measures the number of clock cycles needed to move X bytes into the memory hierarchy using method Y. That matters for Linux's copy_to/from_user()
          (e.g. ioctl() data, ...) among other things.
          And as coder already said, not everything will hit main memory right away, caches exist for a reason.

          ERMS has a size benefit: the main loop is just a 2-byte opcode, plus the setup of the counter. And it has been in use since the 8086 days, so
          theoretically 30-year-old binaries can get a string-op/mem-op speedup for free on Intel CPUs.
          Last edited by mlau; 09-17-2019, 05:33 AM.
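
          The 2-byte loop mentioned above can be sketched with GCC/Clang inline asm. This is an illustration of the instruction pattern, not the kernel's actual implementation; non-x86-64 targets fall back to plain memset().

          ```c
          #include <stddef.h>
          #include <string.h>

          /* memset via "rep stosb": the entire loop body is the 2-byte opcode
             F3 AA; on ERMS-capable CPUs microcode widens the byte stores to
             the full internal datapath. x86-64 GCC/Clang only. */
          static void *memset_stosb(void *dst, int c, size_t n)
          {
          #if defined(__x86_64__)
              void *d = dst;
              asm volatile("rep stosb"
                           : "+D"(d), "+c"(n)   /* RDI = dest, RCX = count */
                           : "a"(c)             /* AL  = fill byte         */
                           : "memory");
          #else
              memset(dst, c, n);                /* fallback on other targets */
          #endif
              return dst;
          }
          ```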



          • #15
            Originally posted by coder View Post
            I don't agree with that. CPU caches exist because they provide benefits (which means they must have a decent hit-rate). If CPUs had no caches, then you'd have a point.

            In fact, you can find tons of code that uses memset() to zero-initialize structures, many of which are even stack-based (which is almost always resident in L1).
            Right. I'm too used to optimizing image routines; they rarely fit in cache and end up being limited that way. This is a rather compact representation too, though. How CISC.



            • #16
              Originally posted by carewolf View Post
              Right. I'm too used to optimizing image routines; they rarely fit in cache and end up being limited that way.
              Depending on what you're doing, tile-based processing can be a lot more efficient.

              This, and the heavy use of interleaved formats, are my main beefs with OpenCV. But, it's a good library for rapid prototyping or working at low resolutions.

              Anyway, even within the scope of a single operation, things like transposes work a lot better with the cache hierarchy if you decompose them into a series of tile transposes. That's probably true of just about any kind of spatial transform, but especially transposes.
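
              A sketch of that decomposition in C (BLOCK is an assumed tile size; tune it so a source tile and a destination tile together fit in L1):

              ```c
              #include <stddef.h>

              #define BLOCK 32   /* assumed tile size, to be tuned per cache */

              /* Cache-friendly transpose of a rows x cols matrix: walk the
                 matrix in BLOCK x BLOCK tiles so each tile's source and
                 destination cache lines stay resident while it is processed,
                 instead of strided accesses thrashing the whole cache. */
              static void transpose_blocked(const float *src, float *dst,
                                            size_t rows, size_t cols)
              {
                  for (size_t i = 0; i < rows; i += BLOCK)
                      for (size_t j = 0; j < cols; j += BLOCK)
                          for (size_t ii = i; ii < i + BLOCK && ii < rows; ii++)
                              for (size_t jj = j; jj < j + BLOCK && jj < cols; jj++)
                                  dst[jj * rows + ii] = src[ii * cols + jj];
              }
              ```

              The naive double loop touches one new destination cache line per element; the tiled version amortizes each loaded line across a whole tile row.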

              Originally posted by carewolf View Post
              This is a rather compact representation too, though. How CISC.
              That's a good point. By having a higher-level, more semantic representation of what you're trying to do, I'm sure it's a lot easier for the CPU microcode to substitute a more efficient implementation.

