Optional L1d Flushing On Context Switching Looks Like It Will Try Again For Linux 5.10


  • Optional L1d Flushing On Context Switching Looks Like It Will Try Again For Linux 5.10

    Phoronix: Optional L1d Flushing On Context Switching Looks Like It Will Try Again For Linux 5.10

    The feature to provide opt-in flushing of the L1 data cache on each context switch looks like it will be coming with the Linux 5.10 cycle for this functionality providing security benefits but at the cost of further performance degradation...

    http://www.phoronix.com/scan.php?pag...tch-Linux-5.10

  • #2
    Let's talk about the SMT-disabled case.

    When an app does a system call, I don't see why L1 flushing is a big deal on Linux. First, it's a monolithic kernel, so the call is going to take forever anyway, and an L1 refill isn't going to make things much worse. Also, I don't see L1 being of much help in the case of Linux. I mean, L1 is virtually addressed, so on a syscall it's going to miss on everything, since a) the address space has changed and b) kernel code won't be in the L1 instruction cache.

    As for the app slowing others down, the CPU time spent on a system call should be billed to the app anyway (otherwise CPU time billing is badly designed, and that's a separate issue), so no, it doesn't slow others down; it only slows itself down.
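The billing point can be checked from user space. Below is a minimal sketch, assuming a Unix-like system where Python's `resource` module is available; the helper name `syscall_time_billed` and the choice of `os.stat` as a cheap syscall are just for illustration. It only shows that kernel time from a task's own syscalls lands in that task's system-time counter; it says nothing about how an L1d flush itself would be accounted.

```python
import os
import resource


def syscall_time_billed(iterations=20000):
    """Return (user_delta, system_delta) in seconds for a loop of cheap syscalls."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    for _ in range(iterations):
        os.stat(".")  # each call enters the kernel
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


user_s, sys_s = syscall_time_billed()
print(f"user: {user_s:.4f}s  system: {sys_s:.4f}s")
```

The system-time delta is what the kernel charged to this process for the work done on its behalf.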

    I would certainly like this functionality to be opt-in or opt-out, plus a way for the OS to fake it (to tell the app that it is flushing L1 when in reality it is not).
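For reference, here is a hedged sketch of what a per-task opt-in can look like. It assumes the `prctl` speculation-control interface as I understand it from later mainline kernels (`PR_SET_SPECULATION_CTRL` with `PR_SPEC_L1D_FLUSH`); the patches discussed for 5.10 may have used a different interface, and the numeric constants below are copied from my reading of `<linux/prctl.h>`, so treat them as assumptions. The helper names are made up for this example.

```python
import ctypes
import errno

# Constants as I understand them from <linux/prctl.h>; treat as assumptions.
PR_GET_SPECULATION_CTRL = 52
PR_SET_SPECULATION_CTRL = 53
PR_SPEC_L1D_FLUSH = 2
PR_SPEC_ENABLE = 1 << 1


def _prctl(option, arg2, arg3):
    """Call prctl(2) through libc; return (rc, errno_value)."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        func = libc.prctl
    except (OSError, AttributeError, TypeError):
        return -1, errno.ENOSYS  # no libc prctl here (e.g. not Linux)
    func.restype = ctypes.c_int
    ctypes.set_errno(0)
    rc = func(ctypes.c_int(option), ctypes.c_ulong(arg2),
              ctypes.c_ulong(arg3), ctypes.c_ulong(0), ctypes.c_ulong(0))
    return rc, ctypes.get_errno()


def query_l1d_flush():
    """Describe the current task's L1d-flush state without changing it."""
    rc, err = _prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0)
    if rc < 0:
        return f"not available ({errno.errorcode.get(err, err)})"
    state = "enabled" if rc & PR_SPEC_ENABLE else "not enabled"
    return f"{state} (raw status {rc:#x})"


def request_l1d_flush():
    """Opt this task in; return (ok, errno_value)."""
    rc, err = _prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_ENABLE)
    return rc == 0, err


status = query_l1d_flush()
print("L1d flush for this task:", status)
```

Only the read-only query runs here. Calling `request_l1d_flush()` is the actual opt-in; as far as I know the kernel refuses it unless the feature was enabled at boot, and a task that opted in may be killed if it ends up on an SMT-enabled core, so try it with care.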

    So, where am I wrong and what is Linus thinking? I don't get him at all.



    • #3
      Originally posted by xfcemint View Post
      Let's talk about the SMT-disabled case.

      When an app does a system call, I don't see why L1 flushing is a big deal on Linux. First, it's a monolithic kernel, so the call is going to take forever anyway, and an L1 refill isn't going to make things much worse. Also, I don't see L1 being of much help in the case of Linux. I mean, L1 is virtually addressed, so on a syscall it's going to miss on everything, since a) the address space has changed and b) kernel code won't be in the L1 instruction cache.

      As for the app slowing others down, the CPU time spent on a system call should be billed to the app anyway (otherwise CPU time billing is badly designed, and that's a separate issue), so no, it doesn't slow others down; it only slows itself down.

      I would certainly like this functionality to be opt-in or opt-out, plus a way for the OS to fake it (to tell the app that it is flushing L1 when in reality it is not).

      So, where am I wrong and what is Linus thinking? I don't get him at all.
      Let me first start with saying that I had the exact same thoughts before I actually thought a bit harder on the problem.
      I think you are over-simplifying the cache loading mechanisms of modern CPUs.
      [Image: Zen prefetcher slide]

      Flushing it will matter. Most CPUs nowadays are pretty good at predicting the code path and preloading data,
      so I think it's a false assumption to say that there is no predicted data in the cache when you take the call.
      Most likely you are shitting on work the CPU has already done to get ahead.
      Flushing causes a highly variable side effect that will vary with almost any parameter.
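To put a rough shape on "flushing will matter", here is a toy model, not a measurement. It assumes a fully associative LRU cache of 512 lines (about 32 KiB at 64 bytes per line, a common L1d size), no prefetcher, and a hot working set of 64 lines that is reused in a loop; `count_misses` and all the numbers are invented for the example. Real hardware with set associativity and prefetchers will behave differently, which is exactly the variability described above.

```python
from collections import OrderedDict


def count_misses(accesses, capacity, flush_every=None):
    """Count misses in a toy LRU cache, optionally emptying it every N accesses."""
    cache = OrderedDict()
    misses = 0
    for i, line in enumerate(accesses):
        if flush_every and i > 0 and i % flush_every == 0:
            cache.clear()  # models a full L1d flush
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses


working_set = list(range(64))   # 64 hot cache lines
trace = working_set * 100       # 6400 accesses in total
no_flush = count_misses(trace, capacity=512)
with_flush = count_misses(trace, capacity=512, flush_every=1000)
print("misses without flush:", no_flush)
print("misses with a flush every 1000 accesses:", with_flush)
```

In this model every flush costs one full reload of the hot set, so the penalty scales with how much useful data was resident and how often the flush happens.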

      x86 CPUs are notoriously brilliant at prediction (coming from large code size sets).
      Take an older PowerPC for example. A T4240. PPC e6500 cores, 24 cores, absolutely no hardware prefetch.
      Actually, they scratched that from the design, it was supposed to have it.
      So single thread read performance is horrible. Misses everything unless you manually prefetch.
      The CPU scales nicely to 24 cores since there are no aggressive prefetchers, but single thread read perf. is absolute crap.
      On a typical modern x86, otoh, manual prefetching won't do much; it only wastes op slots most of the time.
      On a T4240, L1d flushing would probably work somewhat, on a modern x86, it's probably going to be a pain.

      Regarding the virtually addressed part, some high frequency calls are vDSOs today.
      So syscalls do not have to live beyond the context switch (although most do).
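A small sketch to make the vDSO point concrete, assuming Linux and a readable `/proc/self/maps`; the helper name `find_vdso` is made up here. The vDSO is a small kernel-provided shared object mapped into each process, so calls served from it (typically things like time queries) run in user space without entering the kernel.

```python
def find_vdso(maps_path="/proc/self/maps"):
    """Return the address range of the [vdso] mapping, or None if not found."""
    try:
        with open(maps_path) as maps:
            for line in maps:
                if line.rstrip().endswith("[vdso]"):
                    return line.split()[0]  # e.g. "start-end" in hex
    except OSError:
        return None  # not Linux, or /proc not mounted
    return None


vdso_range = find_vdso()
print("vDSO mapping:", vdso_range or "not found")
```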

      So my guess is you're crapping all over the work that has been done already.
      Performance is going to be highly variable.
      Personally, I think it is a stupid way to "fix" broken hardware.
      Last edited by milkylainen; 09-17-2020, 02:37 AM. Reason: Language.



      • #4
        Originally posted by milkylainen View Post

        Let me first start with saying that I had the exact same thoughts before I actually thought a bit harder on the problem.
        I think you are over-simplifying the cache loading mechanisms of modern CPUs.

        Flushing it will matter. Most CPUs nowadays are pretty good at predicting the code path and preloading data,
        so I think it's a false assumption to say that there is no predicted data in the cache when you take the call.
        Well I'd say:

        1. As long as the OS has the option to fake the L1 flush, everything you and Linus are saying doesn't matter a bit.

        2. As long as the OS has the ability to bill the wasted cycles correctly to the app that requested it, everything you and Linus are saying doesn't matter a bit. This, of course, won't work in the SMT case, but that's a separate issue.

        3. I don't see that CPUs are able to perform such advanced prediction as you are implying. On a context switch, when the address space changes, all the predictions will be garbage. Of course, a tagged L1 cache would be of significant help there and some new CPUs support that, but, as far as I know, the Linux kernel doesn't support this advanced feature yet, so again, L1 is useless on a syscall.
