Linux Developers Discuss Improvements To Memory Tiering
The Linux kernel already has initial support for tiered memory servers, such as platforms with Intel Optane DC Persistent Memory, where pages can be demoted to slower classes of memory when the speedy system RAM is under pressure and promoted back when they become hot again. But with more tiered memory servers coming to market, especially those adding HBM classes of memory, Google and other vendors are discussing how to better handle Linux's tiered memory interface.
The past several kernel releases have offered basic promotion/demotion of active/inactive memory pages between the respective tiers of memory. Google's Wei Xu has now summed up some of the existing shortcomings of that interface along with possible improvements, with this code only becoming more critical as Compute Express Link (CXL) and other technologies come to market.
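For those wanting to poke at the existing functionality, the current reclaim-based demotion behavior is controlled through a simple sysfs knob. Below is a minimal Python sketch (illustrative only, not from the mailing list discussion) that checks and optionally enables it; availability of the knob depends upon the kernel version and configuration.

```python
#!/usr/bin/env python3
# Minimal sketch: check (and optionally enable) reclaim-based demotion on
# kernels that carry the existing tiered-memory/demotion code. Whether this
# sysfs knob is present depends on the kernel version and configuration.
from pathlib import Path
import sys

DEMOTION_KNOB = Path("/sys/kernel/mm/numa/demotion_enabled")

def main() -> None:
    if not DEMOTION_KNOB.exists():
        sys.exit("demotion knob not present; this kernel lacks the tiering/demotion support")
    state = DEMOTION_KNOB.read_text().strip()
    print(f"reclaim-based demotion currently: {state}")
    if "--enable" in sys.argv:
        # Writing requires root; enabling lets cold pages be demoted to a
        # slower memory node during reclaim instead of being freed outright.
        DEMOTION_KNOB.write_text("1\n")
        print("demotion enabled")

if __name__ == "__main__":
    main()
```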
Wei Xu summed up the current situation as:
* The current tiering initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM device attached via CXL.mem or a DRAM-backed memory-only node on a virtual machine) and should be put into the top tier.
* The current tiering hierarchy always puts CPU nodes into the top tier. But on a system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier.
* Also because the current tiering hierarchy always puts CPU nodes into the top tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from CPU-less into a CPU node (or vice versa), the memory tiering hierarchy gets changed, even though no memory node is added or removed. This can make the tiering hierarchy much less stable.
* A higher tier node can only be demoted to selected nodes on the next lower tier, not any other node from the next lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), and has resulted in the feature request for an interface to override the system-wide, per-node demotion order from the userspace.
* There are no interfaces for the userspace to learn about the memory tiering hierarchy in order to optimize its memory allocations.
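To make the first two points more concrete, here is a short Python sketch (again illustrative only, not part of any proposed interface) that walks the standard sysfs NUMA node directory and flags CPU-less nodes, the very nodes the current code unconditionally treats as a lower tier even when they may actually be CXL-attached DRAM or HBM.

```python
#!/usr/bin/env python3
# Sketch: enumerate NUMA nodes via the standard sysfs node directory and flag
# which ones are CPU-less. The current kernel heuristic places every such
# memory-only node into a lower tier, which is exactly what Wei Xu's summary
# calls out as too coarse.
from pathlib import Path
import re

NODE_ROOT = Path("/sys/devices/system/node")

for node_dir in sorted(NODE_ROOT.glob("node[0-9]*"),
                       key=lambda p: int(re.search(r"\d+", p.name).group())):
    cpulist = (node_dir / "cpulist").read_text().strip()
    kind = "memory-only (CPU-less)" if not cpulist else f"CPUs {cpulist}"
    # Node distances hint at relative latency but say nothing definitive about
    # bandwidth or media type, which is part of the problem being discussed.
    distances = (node_dir / "distance").read_text().split()
    print(f"{node_dir.name}: {kind}, distances={distances}")
```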
Other upstream kernel developers have agreed that there is room for improvement, including the Intel engineers who worked on much of the original tiered memory handling code for the kernel.
Those interested in some of the suggested improvements and discussion around Linux's tiered memory handling can see this kernel mailing list thread for all the activity.