64-bit ARM Linux Kernel Against CPU-Specific Optimizations: "Pretty Unmaintainable"
While micro-architecture-specific optimizations are rather commonplace in the Linux x86_64 kernel, with performance tricks for various Intel and AMD CPU families, the ARM64 Linux kernel maintainers are against introducing new micro-architecture-specific optimizations for new ARM processors.
Ampere Computing sent out a set of four patches providing an optimization for their new AmpereOne server processors. Ampere Computing found that these new high-core-count ARM server processors could benefit from aggressive prefetches when using the 4K page size. In sequential read performance tests with HugeTLB or tmpfs, the reported benefit was "up to 1.3x ~ 1.4x."
"Test result:
In hugetlb or tmpfs, We can get big seqential read performance improvement up to 1.3x ~ 1.4x."
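For readers unfamiliar with the technique, software prefetching in a copy loop can be sketched roughly as follows. This is an illustrative user-space C sketch, not the code from the Ampere patches: the function name `copy_with_prefetch`, the 64-byte cache-line size, and the four-line prefetch distance are all assumptions made for the example. `__builtin_prefetch` is the GCC/Clang compiler builtin.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64      /* assumed cache-line size in bytes */
#define PREFETCH_AHEAD 4   /* assumed prefetch distance, in cache lines */

/* Hypothetical helper: copy n bytes one cache line at a time, hinting
 * the CPU to start fetching source data a few lines ahead of the copy. */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (n - off >= CACHE_LINE) {
        size_t ahead = off + PREFETCH_AHEAD * CACHE_LINE;

        /* Only hint addresses that are still inside the source buffer. */
        if (ahead < n)
            __builtin_prefetch(s + ahead, 0, 0); /* read access, low temporal locality */

        memcpy(d + off, s + off, CACHE_LINE);
        off += CACHE_LINE;
    }

    /* Copy whatever is left after the last full cache line. */
    memcpy(d + off, s + off, n - off);
}
```

The best prefetch distance depends heavily on the specific core and memory subsystem, which is why tuning of this kind tends to end up tied to one micro-architecture.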
While those gains are exciting for AmpereOne performance on Linux, it looks like that work won't be upstreamed into the mainline Linux kernel.
Prominent ARM Linux kernel developer Will Deacon commented on the performance-enhancing patches specific to AmpereOne CPUs:
"We tend to shy away from micro-architecture specific optimisations in the arm64 kernel as they're pretty unmaintainable, hard to test properly, generally lead to bloat and add additional obstacles to updating our library routines.
Admittedly, we have something for Thunder-X1 in copy_page() (disguised as ARM64_HAS_NO_HW_PREFETCH) but, frankly, that machine needed all the help it could get and given where it is today I suspect we could drop that code without any material consequences.
So I'd really prefer not to merge this; modern CPUs should do better at copying data. It's copy_to_user(), not rocket science."
ARM's Mark Rutland chimed in to agree with Deacon's statement and also endorsed the removal of the Thunder-X1-targeted optimization. Kernel developer Marc Zyngier also agreed and has already been working on a patch to drop that Thunder-X1-specific code.
So in the interest of code maintainability and avoiding over-complicating the ARM64 Linux kernel code, the maintainers aren't after CPU/micro-architecture-specific optimizations. We'll see if this leads to any ARM-focused Linux distributions carrying such patches themselves, or to any AmpereOne-optimized Linux distributions moving forward. Ampere is focused on high-performance, power-efficient ARM Linux servers and likely won't want to leave any optimizations untapped, especially given their aim of competing with AMD EPYC and Intel Xeon servers.