Patches Updated To Tackle vmap/vmalloc Lock Contention That Can Yield ~12x Throughput

  • Phoronix: Patches Updated To Tackle vmap/vmalloc Lock Contention That Can Yield ~12x Throughput

    An important Linux kernel patch series has been updated for the new year that in synthetic tests has yielded a ~12x throughput improvement on an AMD Ryzen Threadripper system...


  • #2
    This is amazing



    • #3
      What advantages does this provide?

      I guess it's more tailored to big iron with extremely beefy CPUs and extremely big core counts, but not much beyond that.

      Can anyone explain this better, instead of merely regurgitating the information in this news article? Please...

      Originally posted by Kjell View Post
      This is amazing
      Amazing, yet very abstract...



      • #4
        Originally posted by timofonic View Post
        What advantages does this provide?

        I guess it's more tailored to big iron with extremely beefy CPUs and extremely big core counts, but not much beyond that.

        Can anyone explain this better, instead of merely regurgitating the information in this news article? Please...



        Amazing, yet very abstract...
        Moving some amount of global (all-core) locking into per-core (probably per-thread) locking; in other words, splitting the single lock out into N <= 128 independent locks. The benefits will mostly be seen on high-core-count systems, the more cores the better up to 128, but any reduction in contention helps on any system with more than one core.



        • #5
          Impressive for sure, and it was tested on the exact CPU I have. I need to test this out!



          • #6
            Originally posted by timofonic View Post
            What advantages does this provide?
            Improves efficiency of systems doing heavily multi-threaded or multi-process memory allocation/deallocation. I wonder if parallel compilation jobs might be one such use case.

            Don't expect anywhere near 12x in any real-world scenario, however. That was a microbenchmark specifically designed to exercise the problem area. If this were a common bottleneck, someone probably would've fixed it already.



            • #7
              Maybe above 128 it becomes increasingly unlikely that they all use these functions at the same time.



              • #8
                Originally posted by indepe View Post
                Maybe above 128 it becomes increasingly unlikely that they all use these functions at the same time.
                On the contrary: the more "threads", the more stress there would be on the vmalloc code, because more code is potentially accessing DRAM and thus more pages get mapped at the "same" time.

                To understand what is going on, one must know a bit about how virtual memory works in modern systems: when a process allocates memory, it gets a pointer to a virtual address in the process address space, but no physical memory is actually given to the process yet. Physical memory pages get assigned on demand, right when the process first uses them. Physical pages are usually small on x86/ARM systems (4 kilobytes); the exact size depends on the hardware (look up how the TLB works), but the kernel can be configured to use bigger pages if the hardware allows it (up to 2 MB).



                • #9
                  Originally posted by blackshard View Post
                  Physical pages are usually small on x86/ARM systems (4 kilobytes); the exact size depends on the hardware (look up how the TLB works), but the kernel can be configured to use bigger pages if the hardware allows it (up to 2 MB).
                  Apple's M-series processors support 4k and 16k pages; Asahi Linux uses 16k exclusively. The kernel works, but some software breaks: https://github.com/AsahiLinux/docs/wiki/Broken-Software . It's a fun world
                  Talos workstations have the same problem (ppc64le with 64k pages, AFAIK)



                  • #10
                    Originally posted by blackshard View Post

                    On the contrary: the more "threads", the more stress there would be on the vmalloc code, because more code is potentially accessing DRAM and thus more pages get mapped at the "same" time.
                    Of course. What I meant is that the spikes in usage might decrease in relative size (not absolute), since things start to average out. But that might be just a small factor, if a factor at all.

                    EDIT:
                    I'll give an example, and to keep it simple, let's say any given thread spends on average 50% of its time in vmalloc. With two threads, both will likely be in vmalloc 25% of the time. That's a lot, so one might want to allow 2 parallel accesses, which is 100% of the thread count. But with 192 threads, maybe 128 parallel accesses would be enough.
                    Last edited by indepe; 03 January 2024, 07:11 AM.

