Optional L1d Flushing On Context Switching Looks Like It Will Try Again For Linux 5.10

    Phoronix: Optional L1d Flushing On Context Switching Looks Like It Will Try Again For Linux 5.10

    The feature to provide opt-in flushing of the L1 data cache on each context switch looks like it will be coming in the Linux 5.10 cycle, with this functionality providing security benefits at the cost of further performance degradation...
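
    For reference, the opt-in being discussed is a per-task control rather than a global switch. Below is a minimal sketch of how a process might request it, assuming the PR_SPEC_L1D_FLUSH speculation control carried by this patch series; the constant values and the required kernel support are assumptions, and older headers won't have them:

    /* Hedged sketch: opt in to L1d flushing on context switch.
     * Assumes the PR_SPEC_L1D_FLUSH control from this patch series;
     * fails with EINVAL/ENXIO on kernels without the feature. */
    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_SPECULATION_CTRL
    #define PR_SET_SPECULATION_CTRL 53
    #endif
    #ifndef PR_SPEC_ENABLE
    #define PR_SPEC_ENABLE (1UL << 1)
    #endif
    #ifndef PR_SPEC_L1D_FLUSH
    #define PR_SPEC_L1D_FLUSH 2          /* assumed uapi value */
    #endif

    int main(void)
    {
        /* Ask the kernel to flush L1d whenever this task is scheduled out. */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
                  PR_SPEC_ENABLE, 0, 0) != 0) {
            perror("prctl(PR_SPEC_L1D_FLUSH)");
            return 1;
        }
        puts("L1d flush on context switch enabled for this task");
        return 0;
    }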


  • #2
    Originally posted by xfcemint:
    Let's talk about SMT disabled case.

    When an app does a system call, I don't see why L1 flushing is a big deal on Linux. First, it's a monolithic kernel, so the call is going to take forever anyway, and an L1 refill isn't going to make things much worse. Also, I don't see L1 being of much help in the case of Linux. I mean, L1 is virtually addressed, so on a syscall it's going to miss everywhere anyway, since a) the address space has changed and b) kernel code won't be in the L1 instruction cache.

    As for the app slowing others down: the CPU time spent in a system call should be billed to the app anyway (otherwise CPU time accounting is badly designed, but that's a separate issue), so no, it doesn't slow others down, it only slows down itself.

    I would certainly like this functionality to be opt-in or opt-out, plus a way for the OS to fake it (to tell the app that it is flushing L1 when in reality it is not).

    So, where am I wrong and what is Linus thinking? I don't get him at all.
    Let me start by saying that I had the exact same thoughts before I actually thought a bit harder about the problem.
    I think you are over-simplifying the cache loading mechanisms of modern CPUs.
    [Image: Zen prefetcher slide]

    Flushing it will matter. Most CPUs nowadays are pretty good at predicting the code path and preloading data,
    so I think it's a false assumption to say that there is no predicted data in the cache when you take the call.
    Most likely you are shitting on work the CPU has already done to get ahead.
    Flushing causes a side effect whose cost is highly variable, shifting with almost any parameter.
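
    To make that concrete, here is a rough microbenchmark sketch (mine, not from the thread) that exposes the preloading: the same buffer is read once sequentially, where the hardware prefetcher can run ahead, and once via a random pointer chase, where it cannot. The buffer size and timing method are arbitrary choices:

    /* Sketch: sequential read vs. random pointer chase over one buffer.
     * The gap between the two timings is largely hardware prefetch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N ((size_t)1 << 23)          /* 8M ints, ~32 MiB: past most caches */

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        int *a = malloc(N * sizeof *a);
        size_t *next = malloc(N * sizeof *next);
        if (!a || !next) return 1;
        for (size_t i = 0; i < N; i++) { a[i] = (int)i; next[i] = i; }

        /* Sattolo's algorithm builds a single full cycle, so the chase
         * visits every element exactly once (crude rand() is fine here). */
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        volatile long sink = 0;
        double t0 = now();
        for (size_t i = 0; i < N; i++) sink += a[i];       /* prefetch-friendly */
        double t1 = now();
        for (size_t i = 0, p = 0; i < N; i++) { p = next[p]; sink += a[p]; } /* prefetch-hostile */
        double t2 = now();

        printf("sequential: %.3fs  chased: %.3fs\n", t1 - t0, t2 - t1);
        free(a); free(next);
        return 0;
    }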

    x86 CPUs are notoriously brilliant at prediction (a consequence of targeting large code footprints).
    Take an older PowerPC for example: a T4240, with 12 e6500 cores (24 threads) and absolutely no hardware prefetch.
    Actually, they scratched that from the design; it was supposed to have it.
    So single-thread read performance is horrible: it misses everything unless you manually prefetch.
    The CPU scales nicely across all its threads since there are no aggressive prefetchers, but single-thread read performance is absolute crap.
    A typical modern x86, OTOH, gains little from manual prefetching; most of the time it only wastes issue slots.
    On a T4240, L1d flushing would probably work somewhat; on a modern x86, it's probably going to be a pain.
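
    For flavor, "manually prefetch" on a core like that means issuing the touches yourself a fixed distance ahead of the loop. A sketch using the GCC/Clang builtin (the distance is a made-up tuning knob; on a modern x86 the hardware usually makes this redundant):

    /* Sketch: software prefetch in a streaming loop. On a core with no
     * hardware prefetcher this hides the miss latency; on a modern x86
     * it mostly just burns issue slots. */
    #include <stddef.h>

    #define PF_DIST 64   /* elements ahead to prefetch; arbitrary tuning knob */

    long sum_with_prefetch(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0, 0); /* read, low locality */
            s += a[i];
        }
        return s;
    }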

    Regarding the virtually addressed part: some high-frequency calls are served by the vDSO today,
    so a call does not have to cross into the kernel's address space at all (although most syscalls still do).
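
    A concrete case, as far as I know: on typical x86-64 Linux, clock_gettime() is answered from the vDSO, so the "call" below completes entirely in user space; running it under strace should show no clock_gettime syscall at all:

    /* clock_gettime() is normally serviced by the vDSO on x86-64 Linux:
     * no kernel entry, no address-space transition, nothing to flush. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);   /* vDSO fast path */
        printf("%ld.%09ld\n", (long)ts.tv_sec, (long)ts.tv_nsec);
        return 0;
    }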

    So my guess is you're crapping all over work that has already been done.
    Performance is going to be highly variable.
    Personally, I think it is a stupid way to "fix" broken hardware.
    Last edited by milkylainen; 17 September 2020, 02:37 AM. Reason: Language.
