Google's Working Set Reporting Feature Aims To Better Deal With Over-Committed VMs

Written by Michael Larabel in Google on 22 May 2023 at 08:09 AM EDT.

Google engineers this month have begun posting new patches for the Linux memory management subsystem and related components for a feature dubbed Working Set Reporting.

The Working Set Reporting functionality builds on MGLRU and aims to better handle over-committed VMs or containers. The recent RFC patch series sums things up as:

Background
==========
For both clients and servers, workloads can be containerized with virtual machines, kubernetes containers, or memcgs. The workloads differ between servers and clients.

Server jobs have more predictable memory footprints, and are concerned about stability and performance. One technique is proactive reclaim, which reclaims memory ahead of memory pressure, and makes apparent the amount of actually free memory on a machine.

Client applications are more bursty and unpredictable since they react to user interactions. The system needs to respond quickly to interesting events, and be aware of energy usage.

An overcommitted machine can scale the containers' footprint through memory.max/high, virtio-balloon, etc.
The balloon device is a typical mechanism for sharing memory between a guest VM and host. It is particularly useful in multi-VM scenarios where memory is overcommitted and dynamic changes to VM memory size are required as workloads change on the system. The balloon device now has a number of features to assist in judiciously sharing memory resources amongst the guests and host (e.g free page hinting, stats, free page reporting). For a host controller program tasked with optimizing memory resources in a multi-VM environment, it must use these tools to answer two concrete questions:

1. When is the right time to modify the balloon?
2. How much should the balloon be changed by?

An early project to develop such an "auto-balloon" capability was done in 2013. More recently, additional VIRTIO devices have been created (virtio-mem, virtio-pmem) that offer more tools for a number of use cases, each with advantages and disadvantages. A previous proposal to extend MGLRU with working set interfaces focuses on the server use cases but does not work for clients.

Proposal
==========
A unified Working Set reporting structure that works for both servers and clients. It involves per-node histograms on the host, per-memcg histograms, and a virtio-balloon driver extension.

There are two ways of working with Working Set reporting: event-driven and querying. The host controller can receive notifications from reclaim, which produces a report, or the controller can query for the histogram directly.
Patch 1 introduces the Working Set reporting mechanism and the host interfaces. See the Details section for
Patch 2 extends the virtio-balloon driver with Working Set reporting.
The initial RFC builds on MGLRU and is intended to be a Proof of Concept for discussion and refinements. T.J. and I aim to support the active/inactive LRU and working set estimation from the userspace. We are working on demo scripts and getting some numbers as well.
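
To make the histogram idea and the two access modes described above more concrete, here is a minimal Python sketch. It is not the kernel interface from the patches: the bin boundaries, the `WorkingSetReporter` class, and all method names are hypothetical, and page idle ages are simply passed in as a list of seconds.

```python
from bisect import bisect_left

# Hypothetical idle-age bin upper bounds in seconds; a final open-ended
# bin collects everything older than the last bound.
BIN_UPPER_BOUNDS_S = [1, 10, 60, 600]


def build_histogram(idle_ages_s, bounds=BIN_UPPER_BOUNDS_S):
    """Count pages per idle-age bin.

    counts[i] holds pages with bounds[i-1] < age <= bounds[i];
    counts[-1] holds pages older than the last bound.
    """
    counts = [0] * (len(bounds) + 1)
    for age in idle_ages_s:
        counts[bisect_left(bounds, age)] += 1
    return counts


class WorkingSetReporter:
    """Toy stand-in for a reporting source (an assumption, not a real API)."""

    def __init__(self, bounds=BIN_UPPER_BOUNDS_S):
        self.bounds = list(bounds)
        self.idle_ages_s = []
        self.listeners = []

    def update(self, idle_ages_s):
        # Replace the tracked page ages with a fresh snapshot.
        self.idle_ages_s = list(idle_ages_s)

    def query(self):
        # Querying mode: the controller asks for the histogram directly.
        return build_histogram(self.idle_ages_s, self.bounds)

    def subscribe(self, callback):
        self.listeners.append(callback)

    def on_reclaim(self):
        # Event-driven mode: a reclaim pass produces a report and
        # pushes it to every subscribed controller.
        report = self.query()
        for callback in self.listeners:
            callback(report)
```

A controller could either call `query()` on its own schedule or register a callback with `subscribe()` and wait for `on_reclaim()` to deliver a report.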

In addition to the Linux kernel memory management changes, there are QEMU patches for VirtIO Balloon to add the Working Set Reporting feature.

"The use case is a host with overcommitted memory and 1 or more VMs. The goal is to get both timely and accurate information on overall memory utilization in order to drive appropriate reclaim activities, since in some client device use cases a VM might need a significant fraction of the overall memory for a period of time, but then enter a quiet period that results in a large number of cold pages in the guest.

The balloon device now has a number of features to assist in sharing memory resources amongst the guests and host (e.g free page hinting, stats, free page reporting). As mentioned in slide 12 in [1], the balloon doesn't have a good mechanism to drive the reclaim of guest cache. Our use case includes both typical page cache as well as "application caches" with memory that should be discarded in times of system-wide memory pressure. In some cases, virtio-pmem can be a method for host control of guest cache but there are undesirable security implications."
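
As an illustration of how a host controller might use a working set histogram to decide when to modify the balloon and by how much, consider the following Python sketch. It is purely hypothetical and not taken from the kernel or QEMU patches: the function names, the 60-second coldness threshold, the minimum step, and the reclaim fraction are all assumptions chosen for the example.

```python
def cold_pages(counts, bounds, cold_after_s):
    """Sum pages in bins whose whole age range is at least cold_after_s.

    counts has len(bounds) + 1 entries; bin i covers ages above
    bounds[i-1] (or 0 for the first bin) up to bounds[i].
    """
    total = 0
    for i, count in enumerate(counts):
        lower = 0 if i == 0 else bounds[i - 1]
        if lower >= cold_after_s:
            total += count
    return total


def balloon_inflate_pages(counts, bounds, cold_after_s=60,
                          min_pages=1024, fraction=0.5):
    """Return how many pages to inflate the balloon by, or 0 to do nothing.

    "When": only act if the proposed step reaches min_pages.
    "How much": a conservative fraction of the cold pages.
    All thresholds here are illustrative assumptions.
    """
    target = int(cold_pages(counts, bounds, cold_after_s) * fraction)
    return target if target >= min_pages else 0


# Example: a guest that has gone quiet and holds many cold pages.
bounds = [1, 10, 60, 600]
quiet_guest = [5000, 3000, 1000, 4000, 2000]
print(balloon_inflate_pages(quiet_guest, bounds))
```

Taking only a fraction of the cold pages is one way to avoid over-reclaiming from a guest that may become active again; a real controller would presumably also weigh host-side pressure.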

And then there is also the proposed VirtIO spec update that was sent out last week for discussion.

For an easier overview of Working Set Reporting, the Google engineers involved shared with us the slides from their LSF/MM/BPF 2023 talk on the proposed feature:

Google slide deck on Working Set Reporting

The notification mechanism for Working Set reports remains under development. It will be interesting to see what comes of this Working Set Reporting initiative.
