MGLRU Continues To Look Very Promising For Linux Kernel Performance
One of many promising kernel patch series at the moment for enhancing Linux kernel performance is the multi-gen LRU framework (MGLRU) devised by Google engineers. They found that the current Linux kernel page reclaim code costs too much CPU time and can make poor eviction choices, and MGLRU aims to do better on both counts. The results so far are quite tantalizing, and MGLRU is now up to its ninth revision.
The MGLRU v8 patches were sent out yesterday to continue tidying up this multi-gen LRU framework for improving the Linux kernel's page reclaim behavior.
While Linus Torvalds is not opposed to MGLRU, he again raised objections to the "TIERS_PER_GEN" Kconfig option proposed as part of this patch series. The option allows a configurable number of tiers per generation in MGLRU, between 2 and 4. Torvalds believes it is too confusing, especially as a wrong value can lead to a build error and most users likely won't know how best to set it.
The default value is sane, though, so following the guidance from Torvalds, MGLRU v9 was spun to drop that confusing and unnecessary option.
MGLRU is looking good from Google's own tests:
5. Apache Cassandra achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% and [4.11, 7.50]% more operations per second (OPS), respectively, for exponential (distribution) access, random access and Zipfian (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%, [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for exponential access, random access and Zipfian access, when swap was on.
6. Apache Hadoop took 95% CIs [5.31, 9.69]% and [2.02, 7.86]% less average wall time to finish twelve parallel TeraSort jobs, respectively, under the medium- and the high-concurrency conditions, when swap was on. There were no statistically significant changes in average wall time for the rest of the benchmark matrix.
7. PostgreSQL achieved 95% CI [1.75, 6.42]% more transactions per minute (TPM) under the high-concurrency condition, when swap was off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM, respectively, under the medium- and the high-concurrency conditions, when swap was on. There were no statistically significant changes in TPM for the rest of the benchmark matrix.
8. Redis achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and [11.47, 19.36]% more total operations per second (OPS), respectively, for sequential access, random access and Gaussian (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%, [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively, for sequential access, random access and Gaussian access, when THP=never.
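For readers less used to results reported as confidence intervals, here is a minimal Python sketch of one way to read them. It uses the PostgreSQL intervals quoted above. The `summarize` helper is hypothetical, and treating the midpoint as the point estimate assumes a symmetric interval, which the patch notes do not state. An interval that excludes zero is what corresponds to a statistically significant change at the 95% level.

```python
# 95% confidence intervals (percent more TPM) quoted above for PostgreSQL.
postgres_tpm_cis = {
    "swap off, high concurrency": (1.75, 6.42),
    "swap on, medium concurrency": (12.82, 18.69),
    "swap on, high concurrency": (22.70, 46.86),
}


def summarize(ci):
    """Return (midpoint, half_width, excludes_zero) for a (low, high) interval.

    Assumption: the interval is symmetric, so the midpoint approximates
    the point estimate. The source does not confirm this.
    """
    low, high = ci
    midpoint = (low + high) / 2
    half_width = (high - low) / 2
    # An interval entirely above or below zero indicates a significant change.
    excludes_zero = low > 0 or high < 0
    return midpoint, half_width, excludes_zero


for label, ci in postgres_tpm_cis.items():
    mid, half, significant = summarize(ci)
    print(f"{label}: ~{mid:.2f}% +/- {half:.2f} (excludes zero: {significant})")
```

Read this way, the swap-on, high-concurrency PostgreSQL result is both the largest gain and the least precisely measured, since its interval is by far the widest.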
There were also independent results included with the patch series:
I have Archlinux with 8G RAM + zswap + swap. While developing, I have lots of apps opened such as multiple LSP-servers for different langs, chats, two browsers, etc... Usually, my system gets quickly to a point of SWAP-storms, where I have to kill LSP-servers, restart browsers to free memory, etc, otherwise the system lags heavily and is barely usable.
1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU patchset, and I started up by opening lots of apps to create memory pressure, and worked for a day like this. Till now I had not a single SWAP-storm, and mind you I got 3.4G in SWAP. I was never getting to the point of 3G in SWAP before without a single SWAP-storm.
Vaibhav from IBM reported:
In a synthetic MongoDB Benchmark, seeing an average of ~19% throughput improvement on POWER10 (Radix MMU + 64K Page Size) with MGLRU patches on top of v5.16 kernel for MongoDB + YCSB across three different request distributions, namely, Exponential, Uniform and Zipfian.
Shuang from U of Rochester reported:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput, respectively, for random access, Zipfian (distribution) access and Gaussian (distribution) access, when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when the average number of jobs per CPU is 2.
Daniel from Michigan Tech reported:
With Memcached allocating ~100GB of byte-addressable Optane, performance improvement in terms of throughput (measured as queries per second) was about 10% for a series of workloads.
MGLRU is looking good and, with a bit of luck, will be mainlined into the Linux kernel soon.