Facebook/Meta Tackling Transparent Page Placement For Tiered-Memory Linux Systems
Prior to the Linux 5.15 kernel, the memory reclaim process simply tossed out cold pages when system RAM was under memory pressure. Linux 5.15 added the ability to instead shift those cold pages to a slower memory tier, which is of particular interest for modern and forthcoming servers with Optane DC persistent memory, CXL-enabled memory, and the like. The pages remain accessible if needed without occupying precious system DRAM while they aren't being used, and they avoid being flushed out or swapped to disk.
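For those wanting to check whether that demotion behavior is switched on for a given system, here is a minimal Python sketch. It assumes the sysfs knob lives at /sys/kernel/mm/numa/demotion_enabled, as documented for kernels with the 5.15 demotion support; the helper name read_demotion_enabled is purely illustrative and not part of any kernel tooling.

```python
from pathlib import Path

# Assumed location of the demotion knob on kernels with 5.15-style demotion.
DEMOTION_KNOB = Path("/sys/kernel/mm/numa/demotion_enabled")


def read_demotion_enabled(path=DEMOTION_KNOB):
    """Return True/False for the knob's state, or None if it cannot be read.

    Hypothetical helper for illustration only. None covers both a missing
    file (older kernel, non-Linux system) and an unrecognized value.
    """
    try:
        raw = Path(path).read_text().strip().lower()
    except OSError:
        return None
    # Accept both the "true"/"false" and the "1"/"0" spellings.
    if raw in ("1", "true"):
        return True
    if raw in ("0", "false"):
        return False
    return None


if __name__ == "__main__":
    state = read_demotion_enabled()
    if state is None:
        print("demotion knob not available on this system")
    else:
        print("page demotion enabled:", state)
```

Writing to the knob requires root, so the sketch only reads it.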
As noted in that article from September, however, there wasn't a means of promoting pages back into DRAM when capacity is available or the pages become hot. Facebook (now Meta) has been working on that promotion handling and this past week sent out their latest patches.
Kernel developers have been working to better handle hot and cold pages across multi-tiered memory systems. This patch series allows pages that were previously demoted to return to the top tier once they become hot.
Transparent Page Placement for Tiered-Memory provides that support by leveraging AutoNUMA to promote pages from slow-tier nodes to top-tier nodes.
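To make the promote/demote idea concrete, here is a toy Python model of a single rebalancing pass over two tiers. This is emphatically not the kernel's algorithm: the real patches work through NUMA hint faults and the reclaim path, whereas this sketch assumes a made-up per-page access counter, a fixed top-tier capacity, and a hypothetical rebalance function with an arbitrary hot_threshold parameter.

```python
def rebalance(toptier, slowtier, access_counts, capacity, hot_threshold):
    """Run one toy rebalancing pass and return the new (toptier, slowtier) sets.

    Illustrative model only; every name and parameter here is hypothetical.
    Pages are plain hashable, sortable ids (e.g. strings).
    """
    toptier, slowtier = set(toptier), set(slowtier)

    # Promotion: slow-tier pages that look hot move up to the top tier.
    hot = {p for p in slowtier if access_counts.get(p, 0) >= hot_threshold}
    toptier |= hot
    slowtier -= hot

    # Demotion: if the top tier is now over capacity, push its coldest
    # pages down (ties broken by page id to keep the result deterministic).
    overflow = len(toptier) - capacity
    if overflow > 0:
        by_coldness = sorted(toptier, key=lambda p: (access_counts.get(p, 0), p))
        coldest = set(by_coldness[:overflow])
        toptier -= coldest
        slowtier |= coldest

    return toptier, slowtier


if __name__ == "__main__":
    counts = {"a": 9, "b": 0, "c": 7, "d": 1}
    top, slow = rebalance({"a", "b"}, {"c", "d"}, counts,
                          capacity=2, hot_threshold=5)
    print(sorted(top), sorted(slow))  # prints ['a', 'c'] ['b', 'd']
```

In the example, the hot page "c" is promoted and the untouched page "b" is demoted to make room, which is the general outcome the patch series is after.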
From the patch series:
We tested this patchset on systems with CXL-enabled DRAM and PMEM tiers. We find this patchset can bring hotter pages to the toptier node while moving the colder pages to the slow-tier nodes for a good range of Meta production workloads with live traffic. As a result, toptier nodes serve more hot pages and the application performance improves.
...
With default page placement policy, file caches fills up the toptier node and anons get trapped in the slowtier node. Only 14% of the total anons reside in toptier node. Remote NUMA read bandwidth is 80%. Throughput regression is 18% compared to all memory being served from toptier node.
This patchset brings 80% of the anons to the toptier node. Anons on the slowtier memory is mostly cold anons. As the toptier node can not host all the hot memory, some hot files still remain on the slowtier node. Even though, remote NUMA read bandwidth reduces from 80% to 40%. With this patchset, throughput regression is only 5% compared to the baseline of toptier node serving the whole working set.
With tiered-memory servers set to become more prevalent thanks to CXL, it's great that this tiered-memory handling is being sorted out now, and it should work its way into the mainline kernel soon enough.