LLVM Merges Machine Function Splitter For ~32% Reduction In TLB Misses

  • LLVM Merges Machine Function Splitter For ~32% Reduction In TLB Misses

    Phoronix: LLVM Merges Machine Function Splitter For ~32% Reduction In TLB Misses

    At the beginning of August we reported on Google engineers proposing the Machine Function Splitter for LLVM, a code-generation optimization pass that splits functions into hot and cold portions and can make binaries up to a few percent faster. That work has now been merged into LLVM 12.0 with very promising results...
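
    As a rough illustration of the kind of code the pass targets - a hedged sketch, not from the article; the function is made up, and the clang invocation assumes a PGO profile is already available (the pass needs profile data to decide which blocks are cold):

    /* Built roughly like:
     *   clang -O2 -fprofile-use=code.profdata -fsplit-machine-functions demo.c
     * With profile data, the rarely-taken error path below can be moved into a
     * separate cold section, so the hot loop packs more densely into the pages
     * and cache lines that the iTLB and icache actually touch. */
    #include <stdio.h>
    #include <stdlib.h>

    int sum_positive(const int *v, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {    /* hot: runs on every call */
            if (v[i] < 0) {              /* cold: practically never taken */
                fprintf(stderr, "negative input at index %d\n", i);
                abort();
            }
            sum += v[i];
        }
        return (int)sum;
    }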


  • #2
    TLB misses are still largely driven by the fact that page sizes are from ~40 years ago - 4 KiB - yet memory sizes and consumption have gone up significantly.

    Running a ~128 GiB server with regular pages is absurd - you end up with ~33 million small pages, most of which need translation. Yet we do it all the time, because Linux is really behind when it comes to huge pages.

    Yes, Linux has THP, but it has non-trivial overhead and doesn't work that well. Oftentimes it makes performance worse (e.g., https://www.percona.com/blog/2019/03...for-databases/ ). The real solution is to allow apps to allocate huge pages where needed, without having to jump through configuration hoops (a hugetlbfs mount, with permissions? On Windows, all you need is a privilege and an API call), and without putting huge pages in a special, restrictive pool.
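
    For comparison, a minimal sketch of the Windows path being described - assuming the account has been granted (and the process token has enabled) the "Lock pages in memory" privilege; error handling trimmed:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T large = GetLargePageMinimum();        /* typically 2 MiB on x86-64 */
        SIZE_T size  = 64 * large;                   /* request 64 large pages */
        void *p = VirtualAlloc(NULL, size,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p == NULL) {
            printf("VirtualAlloc failed: %lu\n", GetLastError());
            return 1;
        }
        /* memory is now backed by large pages - no pool or mount to configure */
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }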



    • #3
      Originally posted by vladpetric View Post
      TLB misses are still largely driven by the fact that page sizes are from ~40 years ago - 4 KiB - yet memory sizes and consumption have gone up significantly.

      Running a ~128 GiB server with regular pages is absurd - you end up with ~33 million small pages, most of which need translation. Yet we do it all the time, because Linux is really behind when it comes to huge pages.

      Yes, Linux has THP, but it has non-trivial overhead and doesn't work that well. Oftentimes it makes performance worse (e.g., https://www.percona.com/blog/2019/03...for-databases/ ). The real solution is to allow apps to allocate huge pages where needed, without having to jump through configuration hoops (a hugetlbfs mount, with permissions? On Windows, all you need is a privilege and an API call), and without putting huge pages in a special, restrictive pool.
      Agreed. (T)HP is a mess. I remember the first years with THP: totally erratic performance at times.
      It's better nowadays, but I still don't trust it.
      I don't know how many times I've attributed poor performance after long uptimes to page reclaim gone awry.
      And, as you say, it's hardly transparent.



      • #4
        Does it make sense to look at how the compiler splits functions into hot and cold parts and refactor the source code accordingly?
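
        For reference, the manual version of that refactoring usually looks something like the sketch below (GCC/Clang attributes; the function names are made up). Marking the rarely-executed path cold and noinline lets the compiler place it away from the hot code, which is roughly what the pass does automatically using profile data:

        #include <stdio.h>
        #include <stdlib.h>

        /* cold + noinline: compilers typically emit this out of line, away from
         * the hot path (e.g. in a .text.unlikely section) */
        __attribute__((cold, noinline))
        static void report_and_die(int i) {
            fprintf(stderr, "negative input at index %d\n", i);
            abort();
        }

        int sum_positive(const int *v, int n) {
            long sum = 0;
            for (int i = 0; i < n; i++) {
                if (__builtin_expect(v[i] < 0, 0))   /* hint: branch is unlikely */
                    report_and_die(i);
                sum += v[i];
            }
            return (int)sum;
        }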



        • #5
          Originally posted by vladpetric View Post
          TLB misses are still largely driven by the fact that page sizes are from ~40 years ago - 4 KiB - yet memory sizes and consumption have gone up significantly.
          Memory sizes have increased dramatically, yes, but most allocations are still quite small. It's the number of allocations that has gone up hugely, not the size of individual allocations (with a few exceptions, e.g. in HPC). If you increased the base page size in proportion to the memory size increase over the past three decades, you'd end up with a lot of internal fragmentation. Linus Torvalds, for example, has frequently ranted on this topic.

          Now, if you were starting from scratch today, it would probably make sense to make the base page size somewhat larger than 4 KiB. Maybe 16 or even 32 KiB would be good for general-purpose workloads today. But that's still not dramatically larger than the current 4 KiB.

          The real solution is to allow apps to allocate huge pages where needed, without having to jump through configuration hoops (a hugetlbfs mount, with permissions? On Windows, all you need is a privilege and an API call), and without putting huge pages in a special, restrictive pool.
          Isn't that what you get with thp=madvise and using madvise(...,MADV_HUGEPAGE)/mmap(...,MAP_HUGETLB, ...)/posix_memalign?
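
          For concreteness, that madvise route looks roughly like this (a sketch; assumes x86-64's 2 MiB huge page size and THP set to "madvise" or "always" - the hint is best-effort and the kernel may still back the range with 4 KiB pages):

          #include <stdlib.h>
          #include <sys/mman.h>

          #define HUGE_2M (2UL * 1024 * 1024)

          /* allocate `size` bytes aligned to 2 MiB and ask for transparent huge
           * pages; silently falls back to normal 4 KiB pages if THP declines */
          static void *alloc_thp(size_t size) {
              void *p = NULL;
              size = (size + HUGE_2M - 1) & ~(HUGE_2M - 1);   /* round up to 2 MiB */
              if (posix_memalign(&p, HUGE_2M, size) != 0)
                  return NULL;
              madvise(p, size, MADV_HUGEPAGE);                /* just a hint */
              return p;
          }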



          • #6
            Originally posted by jabl View Post

            Memory sizes have increased dramatically, yes, but most allocations are still quite small. It's the number of allocations that has gone up hugely, not the size of individual allocations (with a few exceptions, e.g. in HPC). If you increased the base page size in proportion to the memory size increase over the past three decades, you'd end up with a lot of internal fragmentation. Linus Torvalds, for example, has frequently ranted on this topic.

            Now, if you were starting from scratch today, it would probably make sense to make the base page size somewhat larger than 4 KiB. Maybe 16 or even 32 KiB would be good for general-purpose workloads today. But that's still not dramatically larger than the current 4 KiB.



            Isn't that what you get with thp=madvise and using madvise(...,MADV_HUGEPAGE)/mmap(...,MAP_HUGETLB, ...)/posix_memalign?
            When you say the allocations are small, do you mean individual malloc calls? Because a good memory allocator** will grab big chunks of memory from the OS in one shot anyway. So maybe you get memory from the OS in chunks of, say, 16 MiB (8 x 2 MiB pages), while malloc still hands out small bits and pieces. There isn't really any additional fragmentation from huge pages here. If you mmap an actual file, then sure, 2 MiB pages would cause some fragmentation, but that use case is not nearly as common. Most memory, TTBOMK, is malloc'ed rather than mmap'ed from an actual file (anonymous mmap done by malloc counts as malloc here). And it's a trade-off ...

            While Linus Torvalds generally makes good calls for kernel development, I think he's very much wrong about huge pages in user space. Honestly, I think he's the main reason modern x86 (Intel and AMD) processors have a dedicated L2 TLB cache of thousands of entries.

            Also, what makes a lot of sense for a 2GiB phone doesn't for a 128GiB server. The trade-offs can be quite different.

            With respect to madvise: it's just a hint. I also don't think mmap with MAP_HUGETLB succeeds unless you've also set up a huge-page pool with hugetlbfs and the appropriate permissions on the hugetlbfs mount (a major nuisance). Of course, correct me if I'm wrong.

            ** almost anything but gnu malloc, but that's a different story (different reasons).
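
            For what it's worth, a sketch of the explicit MAP_HUGETLB path under discussion (assuming 2 MiB huge pages on x86-64; the first mmap fails with ENOMEM when no huge pages have been reserved, e.g. via vm.nr_hugepages - which is exactly the pool/configuration hassle being complained about):

            #include <stddef.h>
            #include <sys/mman.h>

            #define HUGE_2M (2UL * 1024 * 1024)

            /* try an explicit huge-page mapping first; if the reserved pool is
             * empty (or missing), fall back to normal pages plus a THP hint */
            static void *alloc_huge(size_t size) {
                size = (size + HUGE_2M - 1) & ~(HUGE_2M - 1);
                void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                if (p != MAP_FAILED)
                    return p;
                p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                    return NULL;
                madvise(p, size, MADV_HUGEPAGE);    /* best-effort THP */
                return p;
            }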



            • #7
              Originally posted by vladpetric View Post

              When you say the allocations are small, do you mean individual malloc calls? Because a good memory allocator** will grab big chunks of memory from the OS in one shot anyway. So maybe you get memory from the OS in chunks of, say, 16 MiB (8 x 2 MiB pages), while malloc still hands out small bits and pieces. There isn't really any additional fragmentation from huge pages here. If you mmap an actual file, then sure, 2 MiB pages would cause some fragmentation, but that use case is not nearly as common. Most memory, TTBOMK, is malloc'ed rather than mmap'ed from an actual file (anonymous mmap done by malloc counts as malloc here). And it's a trade-off ...

              While Linus Torvalds generally makes good calls for kernel development, I think he's very much wrong about huge pages in user space. Honestly, I think he's the main reason modern x86 (Intel and AMD) processors have a dedicated L2 TLB cache of thousands of entries.

              Also, what makes a lot of sense for a 2GiB phone doesn't for a 128GiB server. The trade-offs can be quite different.

              With respect to madvise: it's just a hint. I also don't think mmap with MAP_HUGETLB succeeds unless you've also set up a huge-page pool with hugetlbfs and the appropriate permissions on the hugetlbfs mount (a major nuisance). Of course, correct me if I'm wrong.

              ** almost anything but gnu malloc, but that's a different story (different reasons).
              As you say, malloc can to an extent abstract away the chunkiness of memory mappings. I'm mostly referring to file mappings. Consider, for instance, the Linux kernel source tree, or any other decently large source tree. Most files are much closer in size to 4 KiB than to 2 MiB (the next available page size on x86-64). If the kernel used 2 MiB pages for file mappings, most of the space in the page cache would be wasted. I suspect the same holds for many other workloads, unless you're working with multimedia or other large datasets consisting of a few large files. For instance, /proc/pid/maps for firefox on my system has over 2400 entries; if each of those were backed by 2 MiB pages, the memory consumption of firefox alone would be huge (though presumably with larger pages some of those mappings could be coalesced).

              And yes, with thp=madvise you can allocate huge pages via THP without forcing every allocation to use them, and without messing with hugetlbfs. You might be right about mmap(..., MAP_HUGETLB, ...), though; I'm not sure.



              • #8
                Originally posted by jabl View Post

                As you say, malloc can to an extent abstract away the chunkiness of memory mappings. I'm mostly referring to file mappings. Consider, for instance, the Linux kernel source tree, or any other decently large source tree. Most files are much closer in size to 4 KiB than to 2 MiB (the next available page size on x86-64). If the kernel used 2 MiB pages for file mappings, most of the space in the page cache would be wasted. I suspect the same holds for many other workloads, unless you're working with multimedia or other large datasets consisting of a few large files. For instance, /proc/pid/maps for firefox on my system has over 2400 entries; if each of those were backed by 2 MiB pages, the memory consumption of firefox alone would be huge (though presumably with larger pages some of those mappings could be coalesced).

                And yes, with thp=madvise you can allocate huge pages via THP without forcing every allocation to use them, and without messing with hugetlbfs. You might be right about mmap(..., MAP_HUGETLB, ...), though; I'm not sure.
                Understood - are those mostly .so's? I did a simple test (I'm not a heavy Firefox user) and got the following:

                ~1521 entries in /proc/pid/maps for firefox
                759 are anonymous mappings
                762 are backed by a file

                Of the file-backed ones, 672 are .so mappings, but there are only 124 unique .so's. So 124 from .so's plus 90 from other files (can't really say what they are), i.e. 214 distinct files. Worst case, that's 214 * 2 MiB = 428 MiB wasted. Which sounds somewhat bad ... but in practice it's OK.

                Let me rephrase: maybe it's really wasteful on a desktop (4-8 GiB of RAM). But on a >128 GiB server, it's peanuts.

                Anyway, the dynamic loader and the kernel could be smart about the mapping - i.e., not map small .so's to huge pages.
                Last edited by vladpetric; 01 September 2020, 01:53 PM.



                • #9
                  Originally posted by vladpetric View Post
                  While Linus Torvalds generally makes good calls for kernel development, I think he's very much wrong about huge pages in user space. Honestly, I think he's the main reason modern x86 (Intel and AMD) processors have a dedicated L2 TLB cache of thousands of entries.
                  I'm pretty certain the reason for not changing the default page size is to maintain compatibility with existing software that assumes the page size will always be 4 KiB.
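
                  A concrete instance of that assumption - a small sketch; anything that hard-codes 4096 for alignment or mmap arithmetic instead of asking the kernel is what breaks when the base page size changes:

                  #include <stdio.h>
                  #include <unistd.h>

                  int main(void) {
                      /* 4096 on x86-64 Linux; other platforms already differ
                       * (e.g. 16 KiB pages on Apple silicon) */
                      long page = sysconf(_SC_PAGESIZE);
                      printf("base page size: %ld bytes\n", page);
                      return 0;
                  }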



                  • #10
                    Originally posted by Space Heater View Post

                    I'm pretty certain the reason for not changing the default page size is to maintain compatibility with existing software that assumes the page size will always be 4 KiB.
                    Such as?

                    Anyway, free software typically gets fixed to address these issues.

                    When you run a 128 GiB server with 4 KiB pages (effectively, 33 million of them), it's a bit like running a Tesla off an AA battery pack ...

