
New x86/x86_64 KVM Patches Would Help Reduce Excess TLB Flushing


  • New x86/x86_64 KVM Patches Would Help Reduce Excess TLB Flushing

    Phoronix: New x86/x86_64 KVM Patches Would Help Reduce Excess TLB Flushing

    A set of more than two dozen patches by Google engineer Sean Christopherson overhauls KVM's x86/x86_64 TDP MMU zapping and flushing code...

    https://www.phoronix.com/scan.php?pa...duce-TLB-Flush

  • #2
    "A TLB is a cache of translation from memory virtual address to physical address."

    https://events19.linuxfoundation.org...Hypervisor.pdf

    Is that what TLB flushing is? If not, can someone please explain "TLB Flushing" to me in layman's terms?

    The article seems to target readers who have expertise in KVM development. I mean, I use KVM as a hypervisor on my Debian server and virt-manager on my desktop. How does this help me when running pfSense and Debian servers under a KVM hypervisor?
    Last edited by GraysonPeddie; 21 November 2021, 09:47 AM.



    • #3
      Originally posted by GraysonPeddie View Post
      "A TLB is a cache of translation from memory virtual address to physical address."

      https://events19.linuxfoundation.org...Hypervisor.pdf

      Is that what TLB flushing is? If not, can someone please explain "TLB Flushing" to me in layman's terms?

      The article seems to target readers who have expertise in KVM development. I mean, I use KVM as a hypervisor on my Debian server and virt-manager on my desktop. How does this help me when running pfSense and Debian servers under a KVM hypervisor?
      A cache is a buffer that keeps the table closer to the CPU, which is faster. A flush empties the cache so it starts fresh; every access then misses until the cache refills, which is a major speed drop. (I don't know enough about the specific implementation to go deeper, but this is enough to give you an idea of how this affects the VM's speed.)



      • #4
        Thank you dragorth. That makes sense to me. I know about CPU caches, but this "TLB flush" is new to me in the context of a KVM hypervisor. As I am not a C/C++ developer and have no experience working with KVM under the hood, there is a lot of technical jargon that I do not understand. As a 10+-year Linux user who runs a KVM hypervisor, I know what a hypervisor is (it runs virtual machines, either Type 1 like KVM or Type 2 like VirtualBox); but how am I going to understand how it works under the hood?
        Last edited by GraysonPeddie; 21 November 2021, 01:14 PM.



        • #5
          Originally posted by GraysonPeddie View Post
          "A TLB is a cache of translation from memory virtual address to physical address."

          Is that what TLB flushing is? If not, can someone please explain "TLB Flushing" to me in layman's terms?
          You have it right. Stepping back a bit, without caching every memory access would look something like:

          - CPU issues virtual address to MMU
          - MMU walks a multi-level (typically 4-level) set of page tables to translate the virtual address to a physical address, going to main memory for each level
          - CPU uses the translated (physical) address and goes to main memory again for the actual data load or store
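
The walk in the second step can be sketched for x86_64's classic 4-level layout: a 48-bit virtual address splits into four 9-bit table indices plus a 12-bit page offset, and each index costs one memory read. (A toy sketch only; the dict-of-dicts stands in for the real in-memory tables.)

```python
def split_virtual_address(va):
    """Return (pml4, pdpt, pd, pt, offset) indices for a 48-bit VA."""
    offset = va & 0xFFF            # bits 0-11: offset within a 4 KB page
    pt     = (va >> 12) & 0x1FF    # bits 12-20: page table index
    pd     = (va >> 21) & 0x1FF    # bits 21-29: page directory index
    pdpt   = (va >> 30) & 0x1FF    # bits 30-38
    pml4   = (va >> 39) & 0x1FF    # bits 39-47: top-level index
    return pml4, pdpt, pd, pt, offset

def walk(page_tables, va):
    """Walk a toy 4-level page table (nested dicts) to a physical address.
    Each of the four dict lookups stands for one main-memory read."""
    pml4, pdpt, pd, pt, offset = split_virtual_address(va)
    frame = page_tables[pml4][pdpt][pd][pt]   # four "memory accesses"
    return (frame << 12) | offset

# Map one virtual page to physical frame 0x1234:
tables = {255: {508: {0: {0: 0x1234}}}}
print(hex(walk(tables, 0x7FFF00000000)))      # -> 0x1234000
```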

          Most of the HW spec discussion you see is about instruction & data caches, which avoid the last step above on a cache hit. Some caches are virtually addressed (ie the tag for each cache line contains a virtual address) which allows a cache hit to also avoid the second step above.

          In the same way that a cache line temporarily stores the results of a recent data access, a TLB (Translation Lookaside Buffer) temporarily stores the results of the page table walk required for each virtual-to-physical translation. If you think about avoiding 4 "invisible" memory reads for every load/store operation you can see how important TLBs are to CPU performance.
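
That caching idea, and what "flushing" does to it, can be shown with a toy model (nothing like KVM's real code; all names here are made up): cache the walk result keyed by virtual page number, and a flush simply empties the cache so every page must be re-walked.

```python
class ToyTLB:
    def __init__(self):
        self.entries = {}   # virtual page number -> physical frame
        self.hits = 0
        self.misses = 0

    def translate(self, va, do_walk):
        vpn, offset = va >> 12, va & 0xFFF
        if vpn in self.entries:          # TLB hit: page walk avoided
            self.hits += 1
        else:                            # TLB miss: pay for the full walk
            self.misses += 1
            self.entries[vpn] = do_walk(vpn)
        return (self.entries[vpn] << 12) | offset

    def flush(self):
        """A 'TLB flush': every cached translation is dropped, so the
        next access to each page has to walk the tables again."""
        self.entries.clear()

tlb = ToyTLB()
fake_walk = lambda vpn: vpn + 100        # stand-in for the real page walk
for _ in range(3):
    tlb.translate(0x5000, fake_walk)     # same page three times
print(tlb.hits, tlb.misses)              # -> 2 1
tlb.flush()
tlb.translate(0x5000, fake_walk)
print(tlb.misses)                        # -> 2 (the flush forced a re-walk)
```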

          My recollection is that the "lookaside" part of the name came from the fact that if L1 cache was virtually addressed then checking the L1 cache tags for a match could happen in parallel with checking the TLB tags, so that the translation would often be ready if the L1 cache missed.

          In the same way that a data cache needs to be flushed before a DMA device alters memory, TLBs need to be flushed when virtual-to-physical mappings change. Just like a data cache there are sometimes different granularities of flushing available.

          When running under a hypervisor there are actually two levels of translation - the first goes from "guest virtual address" to "guest physical address" (this is normal MMU operation) but then there is a second translation going from "guest physical address" to "system physical address". That second translation is required because the guest does not own all of physical memory, just a collection of pages allocated by the hypervisor and made to appear as a physically contiguous chunk of memory to the guest OS.
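
The two stages can be sketched like this (the dicts and numbers are invented for illustration; the real second-stage tables are EPT on Intel and NPT on AMD):

```python
guest_page_table = {0x10: 0x40}  # guest virtual page -> guest physical frame
second_stage     = {0x40: 0x90}  # guest physical frame -> host physical frame

def guest_to_host(gva):
    gpa_frame  = guest_page_table[gva >> 12]   # stage 1: guest's own walk
    host_frame = second_stage[gpa_frame]       # stage 2: hypervisor's walk
    return (host_frame << 12) | (gva & 0xFFF)  # offset passes through

print(hex(guest_to_host(0x10ABC)))             # -> 0x90abc
```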

          I remember reading that running under a hypervisor takes the MMU traffic up to 16 potential reads per load/store (4 guest levels x 4 host lookups per guest-level read, worst case), so TLB management becomes even more important for maintaining system performance.

          Using 2MB pages rather than 4KB pages means that roughly 1/512 as many TLB entries are required for any given working set of memory, so you can also see how large-page support and usage affects whether or not you are thrashing the TLBs.
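
That 1/512 figure is just the ratio of the two page sizes; a quick worked example for a 1 GiB working set:

```python
working_set  = 1 << 30               # 1 GiB working set
small, large = 4 << 10, 2 << 20      # 4 KB and 2 MB page sizes

entries_4k = working_set // small    # TLB entries to cover it with 4 KB pages
entries_2m = working_set // large    # ... and with 2 MB pages
print(entries_4k, entries_2m, entries_4k // entries_2m)   # -> 262144 512 512
```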

          And yes, GPUs have page tables as well so similar TLB management issues apply there as well.
          Last edited by bridgman; 21 November 2021, 02:52 PM.



          • #6
            Originally posted by dragorth View Post

            A cache is a buffer that keeps the table closer to the CPU, which is faster. A flush empties the cache so it starts fresh; every access then misses until the cache refills, which is a major speed drop. (I don't know enough about the specific implementation to go deeper, but this is enough to give you an idea of how this affects the VM's speed.)
            TLBs exist as a mechanism to map addresses from one address space to another. At the CPU uarch level, TLBs cache virtual-to-physical translations for the running process. These work at page granularity (typically 4KB or 2MB), which is much coarser than a cache line (typically 64 bytes). When we talk about TLB flushing for Spectre/Meltdown mitigation, these per-core structures are generally what we're talking about.

            Then there are the page tables themselves, which live in system memory and map process virtual addresses to physical memory. The TLB entries caching the results of walking those tables need to be invalidated when the OS modifies the tables. I've never heard of needing to flush everything at once, other than when bringing a core online from an offline state.

            Outside the uarch, there are multiple OS-managed page-table tiers in the Linux kernel. These implement protected memory (ie read-only code pages), demand-paged shared memory (ie ELF libraries where only the in-use portions are read from disk), virtual memory (swap - oversimplifying it), and even virtual memory on top of virtual memory (hypervisors). These work at the same granularity as the hardware TLBs [... on some OSes they can work at higher granularity and there's work underway to make that true under Linux as well.]

            AMD-V and Intel VT-x implement hardware-accelerated walkers for these structures, which are stored in system memory and cached in the L3-L1 data cache hierarchy as well as in the (smaller) TLBs. For a hypervisor, flushing them requires flushing the CPU data caches, which is extremely expensive because it leads to load hazards for all the queued operations in the CPU's ROB, schedulers, and pipelines, which can be over 200 instructions deep. Those instructions then end up taking 100+ clocks per memory operand instead of the more typical 2 clocks. So your superscalar CPU with 6 cores/12 threads, each of which can decode 3 instructions and retire 6 per cycle, now has the performance of a single-threaded in-order arch that takes 50 clocks per instruction, for the next 100,000 cycles. Ouch. And that's just the warmup, not the true settling time.
            Last edited by linuxgeex; 22 November 2021, 03:11 AM.

