64K Kernel Page Size Performance Benefits For HPC Shown With NVIDIA's GH200 Grace CPU


  • 64K Kernel Page Size Performance Benefits For HPC Shown With NVIDIA's GH200 Grace CPU

    Phoronix: 64K Kernel Page Size Performance Benefits For HPC Shown With NVIDIA's GH200 Grace CPU

    By default, the AArch64 kernels on Ubuntu and other Linux distributions use a standard 4K page size, but for newer AArch64 hardware, especially in the server/HPC space, there can be great benefits to using a 64K page size. As it's been a while since I last ran any 64-bit ARM 4K vs. 64K kernel page size benchmarks, while having remote access to the NVIDIA GH200 I ran a fresh comparison looking at the performance advantages of switching over to a 64K page size kernel. These new 64K kernel numbers are shown alongside the recent AMD EPYC and Intel Xeon CPU reference benchmark results for a look at how the 4K vs. 64K page size choice affects the overall computing landscape.
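
    For reference, the page size a running kernel was built with can be queried at runtime from userspace; a minimal sketch using only the standard library:

    ```python
    import os

    # Ask the kernel for its page size. On a 64K-page AArch64 kernel
    # this returns 65536; on a default 4K kernel it returns 4096.
    page_size = os.sysconf("SC_PAGE_SIZE")
    print(f"Kernel page size: {page_size} bytes ({page_size // 1024}K)")
    ```

    The same value is what `getconf PAGESIZE` reports from the shell.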


  • #2
    Appreciate this test, but it would be a lot easier to compare the benchmarks if the 64K and regular 4K configs had different colors.
    Last edited by Kjell; 27 February 2024, 01:42 PM.

    Comment


    • #3
      Would be interesting to include the recent Threadripper workstation benchmarks for comparison.

      Comment


      • #4
        I can only imagine how a Grace-Grace Superchip system would perform. I expect it to scale well. We will hopefully know soon.

        Comment


        • #5
          Speaking way out of my area of expertise here, but if I recall, 16K was already a big jump in performance from 4K, and 4K as the default/presumed page size might actually be sunset in Linux consumer devices of all kinds.

          I think even Android (15?) is looking at supporting 16K, and early examinations show clear benefits.

          Sticking with Android for a second, I had read that Android apps may need developers to rebuild them to support >4K page sizes. If so, it may be a few years or releases after Android 15 before we see mandatory support for >4K there. I don't know if Android is, or will soon be, going to 64K, even though ARM generally supports it.
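
          The rebuild requirement comes from code that hardcodes the historical 4096-byte page size instead of asking the OS at runtime. A hedged sketch of the difference (the offset value is just an illustration):

          ```python
          import mmap

          # Fragile: assuming the historical 4K page size. An offset that is
          # 4K-aligned but not 16K-aligned breaks on a 16K-page kernel.
          HARDCODED = 4096

          def round_up(n: int, align: int) -> int:
              """Round n up to the next multiple of align."""
              return (n + align - 1) // align * align

          # Robust: use the page size the running kernel actually reports.
          runtime = mmap.PAGESIZE

          offset = 10_000
          print(round_up(offset, HARDCODED))  # 12288 -- only valid on 4K pages
          print(round_up(offset, runtime))    # correct on any page size
          ```

          On a 16K-page kernel the hardcoded version produces offsets that are not page-aligned, which is exactly the class of breakage that forces app rebuilds.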

          Comment


          • #6
            Lol .. someone discovered the wheel

            Comment


            • #7
              I presume the reason it makes an improvement is all about the TLBs and CPU caches, not page swapping for these solo bench tests, right? I'm quite vague on any details of the working. Anyone got a description of the typical sequence that might be happening?

              Comment


              • #8
                Originally posted by user556 View Post
                I presume the reason it makes an improvement is all about the TLBs and CPU caches, not page swapping for these solo bench tests, right? I'm quite vague on any details of the working. Anyone got a description of the typical sequence that might be happening?
                In most cases you are correct (one can always find an exception to any rule). While on most x86_64 CPUs the TLBs are rather large (and may even have their own levels), on ARM devices the TLBs tend to be much smaller, and every miss can have a large impact on performance (and in some cases, TLB thrashing can make it even worse) due to the page walks required for a miss.
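
                The "TLB reach" framing can be made concrete with simple arithmetic. A sketch, assuming a 48-entry L1 DTLB purely for illustration:

                ```python
                # TLB reach: how much memory can be addressed without a page
                # walk. The 48-entry L1 DTLB figure is an illustrative
                # assumption, not a measured value.
                ENTRIES = 48

                for page_kb in (4, 64):
                    reach_kb = ENTRIES * page_kb
                    print(f"{page_kb}K pages: L1 DTLB covers {reach_kb} KB "
                          f"({reach_kb / 1024:.1f} MB)")

                # 4K pages  -> 192 KB of reach; working sets beyond that miss
                #              in the L1 DTLB and pay page-walk latency.
                # 64K pages -> 3072 KB (3 MB): 16x the reach with the same
                #              number of TLB entries.
                ```

                The same multiplier applies at every TLB level, which is why larger pages help most when the working set far exceeds the 4K reach.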

                Comment


                • #9
                  Originally posted by CommunityMember View Post

                  In most cases you are correct (one can always find an exception to any rule). While on most x86_64 CPUs the TLBs are rather large (and may even have their own levels), on ARM devices the TLBs tend to be much smaller, and every miss can have a large impact on performance (and in some cases, TLB thrashing can make it even worse) due to the page walks required for a miss.
                  ARM Neoverse TLBs are smaller, but I wouldn't say drastically so. Neoverse V2 has a slightly smaller L1 DTLB than Zen3 (48 entries vs 64) and the same L2 DTLB size (2k entries.)

                  AMD bulked up TLB structures in Zen4 (to 72 + 3072 iirc) but ARM did in the V3 as well (L1DTLB has 96 entries, while L2DTLB stays at 2048.)

                  Comment


                  • #10
                    AMD processors also have page coalescing: if 8 consecutive 4kB pages are aligned to a 32kB boundary, consecutive in physical memory, and share the same attributes, then a single TLB entry covers the whole 32kB. This is totally transparent and does not need any setup.

                    Of course, for that to happen, the application should allocate from the OS in chunks >= 32kB, but I assume most do. That depends on the behaviour of the allocator, but in this day and age it's unwise not to.

                    Also, the OS can help if it aligns the allocations on 32kB boundaries (provided it finds 32kB of consecutive physical memory to service the request), but it need not configure anything on the CPU.
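
                    To put numbers on it, one coalesced entry replaces eight. A small sketch of the bookkeeping, assuming the 8-page / 32kB grouping described above and an ideally aligned, physically contiguous buffer:

                    ```python
                    PAGE_KB = 4   # base page size
                    GROUP = 8     # pages per coalesced entry (8 x 4kB = 32kB)

                    def tlb_entries(size_kb: int, coalesced: bool) -> int:
                        """TLB entries needed to map size_kb of 4kB pages,
                        assuming (when coalesced) the region is 32kB-aligned
                        and physically contiguous throughout."""
                        pages = size_kb // PAGE_KB
                        return pages // GROUP if coalesced else pages

                    buf_kb = 1024  # a 1 MB buffer
                    print(tlb_entries(buf_kb, coalesced=False))  # 256 entries
                    print(tlb_entries(buf_kb, coalesced=True))   # 32 entries
                    ```

                    The 8x reduction is best-case; any page in the group with different attributes or a non-contiguous physical mapping falls back to one entry per page.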

                    Comment
