Patches Updated To Tackle vmap/vmalloc Lock Contention That Can Yield ~12x Throughput
An important Linux kernel patch series has been updated for the new year. In synthetic tests it has yielded a roughly 12x throughput improvement on an AMD Ryzen Threadripper system.
Uladzislau Rezki of Sony has been working for months to eliminate lock contention within the Linux kernel's vmap/vmalloc code. This contention, caused by a single spinlock protecting the global vmap space, is leading to serious issues on today's increasingly high core count systems.
The patch series, now up to its third iteration, aims to make the vmap/vmalloc code more scalable:
"We introduce an effective vmap node logic. A node behaves as independent entity to serve an allocation request directly(if possible) from its pool. That way it bypasses a global vmap space that is protected by its own lock.
An access to pools are serialized by CPUs. Number of nodes are equal to number of CPUs in a system. Please note the high threshold is bound to 128 nodes.
Pools are size segregated and populated based on system demand. The maximum alloc request that can be stored into a segregated storage is 256 pages. The lazily drain path decays a pool by 25% as a first step and as second populates it by fresh freed VAs for reuse instead of returning them into a global space."
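The node logic described in the quote can be sketched as a small user-space simulation. This is only an illustration of the idea, not the kernel code: the class and method names (`GlobalSpace`, `VmapNode`, `NodeAllocator`), the choice of node by `cpu % number_of_nodes`, the 4096-byte page size, and the bump-style global allocator are all assumptions made for the example. Only the 128-node cap, the 256-page pool limit, and the 25% decay come from the quoted description.

```python
import threading
from collections import defaultdict

MAX_NODES = 128        # upper bound on node count, per the quoted description
MAX_POOL_PAGES = 256   # largest request a pool may hold, per the quoted description
PAGE_SIZE = 4096       # assumed page size, for illustration only


class GlobalSpace:
    """Stand-in for the global vmap space: a single lock guards everything."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_va = 0x10000000   # arbitrary starting address
        self.free_vas = []          # (va, pages) ranges handed back to the global space
        self.hits = 0               # number of allocations that took the global lock

    def alloc(self, pages):
        with self.lock:
            self.hits += 1
            va = self.next_va       # simplification: bump allocation, no reuse
            self.next_va += pages * PAGE_SIZE
            return va

    def release(self, va, pages):
        with self.lock:
            self.free_vas.append((va, pages))


class VmapNode:
    """One node: size-segregated pools plus a list of lazily freed ranges."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pools = defaultdict(list)  # pages -> list of free VAs of that size
        self.lazy = []                  # (va, pages) waiting for the next drain


class NodeAllocator:
    def __init__(self, nr_cpus):
        # One node per CPU, capped at MAX_NODES.
        self.nodes = [VmapNode() for _ in range(min(nr_cpus, MAX_NODES))]
        self.space = GlobalSpace()

    def _node(self, cpu):
        # Assumption: the node is picked from the CPU id.
        return self.nodes[cpu % len(self.nodes)]

    def alloc(self, cpu, pages):
        node = self._node(cpu)
        if pages <= MAX_POOL_PAGES:
            with node.lock:
                pool = node.pools[pages]
                if pool:
                    return pool.pop()      # fast path: global lock never taken
        return self.space.alloc(pages)     # slow path: fall back to global space

    def free(self, cpu, va, pages):
        node = self._node(cpu)
        with node.lock:
            node.lazy.append((va, pages))  # freed lazily, processed in drain()

    def drain(self, cpu):
        node = self._node(cpu)
        to_global = []
        with node.lock:
            # Step 1: decay every pool by 25%.
            for pages, pool in node.pools.items():
                for _ in range(len(pool) // 4):
                    to_global.append((pool.pop(), pages))
            # Step 2: repopulate pools with freshly freed VAs that fit.
            lazy, node.lazy = node.lazy, []
            for va, pages in lazy:
                if pages <= MAX_POOL_PAGES:
                    node.pools[pages].append(va)
                else:
                    to_global.append((va, pages))
        for va, pages in to_global:
            self.space.release(va, pages)
```

In this sketch, a range that is allocated, freed, and drained on the same node is served from that node's pool on the next request of the same size, so the global lock is not touched. That mirrors the contention-avoidance idea in the quote, under the assumptions listed above.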
In a synthetic test stressing the vmalloc path, the Sony engineer found throughput to be around 12x higher on an AMD Ryzen Threadripper 3970X test system.
The v3 patches for dealing with this vmap/vmalloc lock contention are out for review on the Linux kernel mailing list. Hopefully this is just the tip of the iceberg for Linux performance optimizations in 2024.