Originally posted by dragorth
Then there's also the system memory TLB caches, which map process virtual addresses to physical memory. These need to be synced whenever the OS modifies the page-table structures in system memory that back them. They're coarser-grained, i.e. 4 KB or 2 MB pages. I've never heard of needing to flush these TLBs wholesale, other than when bringing a core online from an offline state.
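For concreteness, that "sync" looks roughly like the sketch below on x86-64. This is a minimal ring-0 illustration, not any real kernel's code; the PTE bit layout and the set_page_readonly helper are made up for the example. invlpg is the fine-grained path (one entry), while reloading CR3 is the blunt instrument that drops all non-global entries:

Code:
/* Minimal sketch: how an OS keeps a TLB in sync after editing a
 * page-table entry. x86-64, must run in ring 0. The PTE layout and
 * helper below are illustrative only. */
#include <stdint.h>

static inline void invlpg(void *vaddr)
{
    /* Drop the single TLB entry covering this virtual address;
     * the next access forces a fresh page-table walk. */
    __asm__ volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
}

static inline void flush_tlb_all(void)
{
    /* Reloading CR3 flushes all non-global TLB entries -- the
     * heavyweight path used, e.g., when bringing a core online. */
    uint64_t cr3;
    __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
    __asm__ volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

void set_page_readonly(uint64_t *pte, void *vaddr)
{
    *pte &= ~(uint64_t)0x2; /* clear the writable bit (illustrative) */
    invlpg(vaddr);          /* sync the TLB with the modified PTE */
}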
Outside the uarch, there are multiple OS-managed page-table tiers in the Linux kernel (the structures the TLBs cache). These implement protected memory (i.e. read-only code pages), demand-paged shared memory (i.e. ELF libraries where only the in-use portions are read from disk), virtual memory (swap, oversimplifying it), and even virtual memory on virtual memory (hypervisors). These work at the same granularity as the hardware TLBs (on some OSes they can work at larger granularities, and there's work underway to make that true under Linux as well).

AMD-V and Intel VT-x implement hardware-accelerated walkers for these page-table structures, which are stored in system memory and cached in the L1-L3 data cache hierarchy as well as in the (smaller) TLB caches. For a hypervisor, flushing them requires flushing the CPU data caches, which is extremely expensive because it creates load hazards for all the queued operations in the CPU's ROB, schedulers, and pipelines, which can be over 200 instructions deep. Those instructions then end up taking 100+ clocks per memory operand instead of the more typical 2 clocks per operand to load. So your superscalar CPU with 6 cores/12 threads, each of which can decode 3 instructions and retire 6 per cycle, now has the performance of a single-threaded in-order arch that takes 50 clocks per instruction, for the next 100,000 cycles. Ouch. And that's just the warmup, not the true settling time.
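To put rough numbers on that 2-clock vs. 100+-clock gap, here's a throwaway user-space sketch you can actually run (x86-64 only; clflush works from user space). Treat the output as illustrative, not a calibrated benchmark -- exact cycle counts are part-dependent and rdtscp timing of a single load is noisy:

Code:
/* Rough microbenchmark: a load that hits L1 vs. a load whose cache
 * line was just flushed to DRAM. On typical parts the flushed load
 * is tens of times slower. Build: gcc -O2 bench.c */
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <x86intrin.h>  /* __rdtscp */

static uint64_t timed_load(volatile uint64_t *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                   /* the load being measured */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static uint64_t data = 42;

    (void)timed_load(&data);        /* warm the line into L1 */
    uint64_t hot = timed_load(&data);

    _mm_clflush((void *)&data);     /* evict the line from all cache levels */
    _mm_mfence();                   /* ensure the flush completes first */
    uint64_t cold = timed_load(&data);

    printf("L1 hit: ~%llu cycles, post-flush: ~%llu cycles\n",
           (unsigned long long)hot, (unsigned long long)cold);
    return 0;
}

Now imagine every queued load in a 200-deep ROB paying the "cold" number at once, and the warmup scenario above stops sounding hyperbolic.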