Linux Patches Allow Sharing PTEs Between Processes - Can Mean Significant RAM Savings

Written by Michael Larabel in Linux Kernel on 26 January 2025 at 07:10 AM EST.
A set of patches being worked on by Oracle engineers allows for optionally sharing page table entries (PTEs) between processes. For some workloads this can equate to very significant memory savings.

The patches introduce a new in-memory file-system (MSHAREFS) and let processes opt into sharing page table entries with one another. With certain classes of workloads this can mean very significant memory savings. The patch cover letter notes that Oracle ran into this page table memory overhead in actual use on a database server.

The patch cover letter from Anthony Yznaga explains:
"Memory pages shared between processes require page table entries (PTEs) for each process. Each of these PTEs consume some of the memory and as long as the number of mappings being maintained is small enough, this space consumed by page tables is not objectionable. When very few memory pages are shared between processes, the number of PTEs to maintain is mostly constrained by the number of pages of memory on the system. As the number of shared pages and the number of times pages are shared goes up, amount of memory consumed by page tables starts to become significant. This issue does not apply to threads. Any number of threads can share the same pages inside a process while sharing the same PTEs. Extending this same model to sharing pages across processes can eliminate this issue for sharing across processes as well.

Some of the field deployments commonly see memory pages shared across 1000s of processes. On x86_64, each page requires a PTE that is 8 bytes long which is very small compared to the 4K page size. When 2000 processes map the same page in their address space, each one of them requires 8 bytes for its PTE and together that adds up to 8K of memory just to hold the PTEs for one 4K page. On a database server with 300GB SGA, a system crash was seen with out-of-memory condition when 1500+ clients tried to share this SGA even though the system had 512GB of memory. On this server, in the worst case scenario of all 1500 processes mapping every page from SGA would have required 878GB+ for just the PTEs. If these PTEs could be shared, a substantial amount of memory would be saved.

This patch series implements a mechanism that allows userspace processes to opt into sharing PTEs. It adds a new in-memory filesystem - msharefs. A file created on msharefs represents a shared region where all processes mapping that region will map objects within it with shared PTEs. When the file is created, a new host mm struct is created to hold the shared page tables and vmas for objects later mapped into the shared region. This host mm struct is associated with the file and not with a task. When a process mmap's the shared region, a vm flag VM_MSHARE is added to the vma. On page fault the vma is checked for the presence of the VM_MSHARE flag. If found, the host mm is searched for a vma that covers the fault address. Fault handling then continues using that host vma which establishes PTEs in the host mm. Fault handling in a shared region also links the shared page table to the process page table if the shared page table already exists."
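As a rough mental model of the fault path described in the cover letter, here is a toy Python simulation. It is not kernel code and not taken from the patches: the `Mm` and `Vma` classes, the `handle_fault` function, the flat dictionary standing in for page tables, and the `VM_MSHARE` bit value are all made up for illustration, and it assumes the shared region sits at the same addresses in every process.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

PAGE_SIZE = 4096
VM_MSHARE = 0x1  # placeholder bit for this sketch, not the kernel's actual value


@dataclass(eq=False)
class Mm:
    """Toy mm struct: a list of vmas plus a flat page -> frame 'page table'."""
    vmas: list = field(default_factory=list)
    ptes: dict = field(default_factory=dict)
    linked: list = field(default_factory=list)  # host mms linked into this mm

    def find_vma(self, addr):
        return next((v for v in self.vmas if v.start <= addr < v.end), None)


@dataclass(eq=False)
class Vma:
    start: int
    end: int
    flags: int = 0
    host_mm: Optional[Mm] = None  # only set on VM_MSHARE vmas


_frames = count()  # hands out fake physical frame numbers


def handle_fault(mm, addr):
    """Resolve a fault at addr, redirecting to the host mm for VM_MSHARE vmas."""
    vma = mm.find_vma(addr)
    if vma is None:
        raise LookupError(f"no vma covers {addr:#x}")
    target = mm
    if vma.flags & VM_MSHARE:
        host = vma.host_mm
        if host.find_vma(addr) is None:
            raise LookupError(f"no host vma covers {addr:#x}")
        if host not in mm.linked:
            mm.linked.append(host)  # stands in for linking the shared page table
        target = host  # the PTE is established in the host mm, not the process
    page = addr // PAGE_SIZE
    if page not in target.ptes:
        target.ptes[page] = next(_frames)
    return target.ptes[page]


# One host mm (tied to the msharefs file) and two processes mapping the region.
host = Mm(vmas=[Vma(0x10000, 0x20000)])
proc_a = Mm(vmas=[Vma(0x10000, 0x20000, VM_MSHARE, host)])
proc_b = Mm(vmas=[Vma(0x10000, 0x20000, VM_MSHARE, host)])

frame_a = handle_fault(proc_a, 0x10000)
frame_b = handle_fault(proc_b, 0x10000)
print(frame_a == frame_b, len(host.ptes), len(proc_a.ptes))  # True 1 0
```

Both processes end up at the same frame through a single entry held by the host mm, while neither process carries a PTE of its own for that page. That is the saving the patches are after.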

The patches were previously posted as a "request for comments" (RFC) and have now graduated to a "v1" proposal. Those interested in this work for optional sharing of PTEs between Linux processes can see this patch series for all of the new code under review.
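For a sense of scale, the database server figure in the cover letter can be reproduced with some quick arithmetic. The Python sketch below is only a back-of-the-envelope check: the `pte_bytes` helper is a made-up name, it counts only the last-level 8-byte PTEs for 4K pages (ignoring upper-level page tables), and it treats the quoted "GB" as GiB.

```python
PTE_SIZE = 8          # bytes per PTE on x86_64, per the cover letter
PAGE_SIZE = 4 * 1024  # 4K base pages
GIB = 1024 ** 3


def pte_bytes(region_bytes, processes, page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Bytes of PTEs needed when every process maps every page of the region."""
    pages = region_bytes // page_size
    return pages * pte_size * processes


sga = 300 * GIB
unshared = pte_bytes(sga, 1500)  # each of the 1500 processes has its own PTEs
shared = pte_bytes(sga, 1)       # one set of PTEs shared by everyone

print(f"unshared: {unshared / GIB:.1f} GiB")  # unshared: 878.9 GiB
print(f"shared:   {shared / GIB:.2f} GiB")    # shared:   0.59 GiB
print(pte_bytes(PAGE_SIZE, 2000))             # 16000
```

The worst case works out to roughly 878.9 GiB, matching the "878GB+" in the cover letter, versus about 0.59 GiB if a single set of PTEs were shared. The same arithmetic puts the single-page example at 16,000 bytes of PTEs for 2000 processes, or roughly 16K rather than the 8K quoted.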
About The Author
Michael Larabel

Michael Larabel is the principal author of Phoronix.com and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and OpenBenchmarking.org automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via MichaelLarabel.com.
