Patches Updated To Tackle vmap/vmalloc Lock Contention That Can Yield ~12x Throughput

  • Phoronix: Patches Updated To Tackle vmap/vmalloc Lock Contention That Can Yield ~12x Throughput

    An important Linux kernel patch series has been updated for the new year that in synthetic tests has yielded a ~12x throughput improvement on an AMD Ryzen Threadripper system...


  • #2
    This is amazing



    • #3
      What advantages does this provide?

      I guess it's more tailored to big iron with extremely beefy CPUs and extremely big core counts, but not much beyond that.

      Can anyone explain this better, instead of merely regurgitating the information in this news article? Please...

      Originally posted by Kjell View Post
      This is amazing
      Amazing, yet very abstract...



      • #4
        Originally posted by timofonic View Post
        What advantages does this provide?

        I guess it's more tailored to big iron with extremely beefy CPUs and extremely big core counts, but not much beyond that.

        Can anyone explain this better, instead of merely regurgitating the information in this news article? Please...



        Amazing, yet very abstract...
        Moving some amount of global (all-core) locking into per-core (probably per-thread) locking; in other words, splitting the single lock out into N <= 128 independent locks. The benefits will mostly be seen on high-core-count systems, the more cores the better up to 128, but any reduction in contention helps on any system with more than one core.



        • #5
          Impressive for sure, and it was tested on the exact CPU I have. I need to test this out!



          • #6
            Originally posted by timofonic View Post
            What advantages does this provide?
            Improves efficiency of systems doing heavily multi-threaded or multi-process memory allocation/deallocation. I wonder if parallel compilation jobs might be one such use case.

            Don't expect anywhere near 12x in any real-world scenario, however. That was a microbenchmark specifically designed to exercise the problem area. If this were a common bottleneck, someone probably would've fixed it already.



            • #7
              Maybe above 128 it becomes increasingly unlikely that they all use these functions at the same time.



              • #8
                Originally posted by indepe View Post
                Maybe above 128 it becomes increasingly unlikely that they all use these functions at the same time.
                On the contrary: the more "threads", the more stress there would be on the vmalloc code, because more code is potentially accessing DRAM and thus more pages get mapped at the "same" time.

                To understand what is going on, one must know a bit about how virtual memory works in modern systems: when a process allocates memory, it gets a pointer to a virtual address in the process address space, but no physical memory is actually given to the process yet. Physical memory pages get assigned on demand, right when the process first uses them. Physical pages are usually small on x86/ARM systems (4 kilobytes); the exact size depends on the hardware (look up how the TLB works), but the kernel can be configured to use bigger pages if the hardware allows it (up to 2 MB).



                • #9
                  Originally posted by blackshard View Post
                  Physical pages are usually small on x86/ARM systems (4 kilobytes); the exact size depends on the hardware (look up how the TLB works), but the kernel can be configured to use bigger pages if the hardware allows it (up to 2 MB).
                  Apple's M-series processors support 4k and 16k pages; Asahi Linux uses 16k exclusively. The kernel works, but some software breaks: https://github.com/AsahiLinux/docs/wiki/Broken-Software . It's a fun world
                  Talos workstations have the same problem (ppc64le with 64k pages, AFAIK)



                  • #10
                    Originally posted by blackshard View Post

                    On the contrary: the more "threads", the more stress there would be on the vmalloc code, because more code is potentially accessing DRAM and thus more pages get mapped at the "same" time.
                    Of course. What I meant is that the spikes in usage might decrease in relative size (not absolute), since things start to average out. But that might be just a small factor, if a factor at all.

                    EDIT:
                    I'll give an example, and to keep it simple, let's say any given thread spends on average 50% of its time in vmalloc. With two threads, both will likely be in vmalloc 25% of the time. That's a lot, so one might want to allow 2 parallel accesses, which is 100% of the thread count. But with 192 threads, maybe 128 parallel accesses would be enough.
                    Last edited by indepe; 03 January 2024, 07:11 AM.

