New TTM Code Can Yield 3~5x Faster Page Allocation For AMDGPU, Other Benefits

Written by Michael Larabel in Linux Kernel on 25 October 2020 at 12:00 PM EDT.

The Linux kernel's TTM memory management code, most notably used by the Radeon / AMDGPU kernel drivers but also by Nouveau, QXL, VMWGFX, and others, is seeing a new back-end allocation pool that can yield 3~5x faster page allocation performance for video memory.

Longtime AMD Linux driver developer Christian König has been working on this new TT back-end allocation pool, which he posted today. The patch series makes the new pool the default for TTM and updates all existing TTM-based drivers to use this new allocation code for pages.

Christian talks of the promising results with this new code:

This replaces the spaghetti code in the two existing page pools.

First of all depending on the allocation size it is between 3 (1GiB) and 5 (1MiB) times faster than the old implementation.

It makes better use of buddy pages to allow for larger physical contiguous allocations which should result in better TLB utilization at least for amdgpu.

Instead of a completely braindead approach of filling the pool with one CPU while another one is trying to shrink it we only give back freed pages.

This also results in much less locking contention and a trylock free MM shrinker callback, so we can guarantee that pages are given back to the system when needed.

Downside of this is that it takes longer for many small allocations until the pool is filled up. We could address this, but I couldn't find an use case where this actually matters. And we don't bother freeing large chunks of pages any more.

The sysfs files are replaced with a single module parameter, allowing users to override how many pages should be globally pooled in TTM. This unfortunately breaks the UAPI slightly, but as far as we know nobody ever depended on this.

Zeroing memory coming from the pool was handled inconsistently. The alloc_pages() based pool was zeroing it, the dma_alloc_attr() based one wasn't. The new implementation isn't zeroing pages from the pool either and only sets the __GFP_ZERO flag when necessary.

The implementation has only 753 lines of code compared to the over 2600 of the old one, and also allows for saving quite a bunch of code in the drivers since we don't need specialized handling there any more based on kernel config.

Additional to all of that there was a neat bug with IOMMU, coherent DMA mappings and huge pages which is now fixed in the new code as well.
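
To make two of the ideas in Christian's notes more concrete (serving a request in the largest possible power-of-two chunks, and a pool that is only ever filled with pages that were freed), here is a small user-space toy model in Python. It is only a sketch and not the kernel implementation: the names `split_into_orders`, `ToyPagePool`, and `limit_pages`, the `MAX_ORDER` value, and the policy of simply dropping chunks once the cap is reached are all assumptions made for illustration.

```python
from collections import defaultdict

MAX_ORDER = 10  # assumption for this sketch: largest chunk is 2**10 pages


def split_into_orders(num_pages, max_order=MAX_ORDER):
    """Greedily split a request into power-of-two chunks, largest first."""
    orders = []
    remaining = num_pages
    while remaining > 0:
        # Largest order whose chunk (2**order pages) still fits.
        order = min(max_order, remaining.bit_length() - 1)
        orders.append(order)
        remaining -= 1 << order
    return orders


class ToyPagePool:
    """Per-order free lists that are only ever filled by freed chunks."""

    def __init__(self, limit_pages):
        self.limit_pages = limit_pages       # hypothetical global cap in pages
        self.free_lists = defaultdict(list)  # order -> list of chunk ids
        self.pooled_pages = 0
        self.fresh_allocations = 0           # chunks taken from the "system"
        self._next_id = 0

    def _alloc_chunk(self, order):
        if self.free_lists[order]:
            # Reuse a previously freed chunk of the same order.
            self.pooled_pages -= 1 << order
            return (order, self.free_lists[order].pop())
        # Pool has nothing of this order: fall back to a fresh allocation.
        self.fresh_allocations += 1
        self._next_id += 1
        return (order, self._next_id)

    def alloc(self, num_pages):
        return [self._alloc_chunk(o) for o in split_into_orders(num_pages)]

    def free(self, chunks):
        for order, chunk_id in chunks:
            size = 1 << order
            if self.pooled_pages + size <= self.limit_pages:
                # Freed chunks are the only way the pool grows.
                self.free_lists[order].append(chunk_id)
                self.pooled_pages += size
            # Otherwise the chunk is handed back to the system (dropped here).


pool = ToyPagePool(limit_pages=1024)
first = pool.alloc(300)   # pool is empty, so every chunk is a fresh allocation
pool.free(first)          # only now does the pool contain anything
second = pool.alloc(300)  # served entirely from the pool
print(split_into_orders(300), pool.fresh_allocations)  # [8, 5, 3, 2] 4
```

In this model the first round of allocations always pays the full cost, which mirrors the downside Christian mentions of the pool taking longer to fill up, while repeated allocate/free cycles of similar sizes are served from the per-order free lists.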

All in all, this new TTM code appears to be very promising, and it will be interesting to see how well it works out, primarily for AMDGPU.

The patches are now out for review, and we'll see if there is enough interest to get them reviewed in a timely manner and potentially merged for Linux 5.11.
