64K Kernel Page Size Performance Benefits For HPC Shown With NVIDIA's GH200 Grace CPU


  • 64K Kernel Page Size Performance Benefits For HPC Shown With NVIDIA's GH200 Grace CPU

    Phoronix: 64K Kernel Page Size Performance Benefits For HPC Shown With NVIDIA's GH200 Grace CPU

    By default, the AArch64 kernels on Ubuntu and other Linux distributions use a standard 4K page size, but for newer AArch64 hardware, especially in the server/HPC space, there can be great benefits to using a 64K page size. As it's been a while since I last ran any 64-bit ARM 4K vs. 64K kernel page size benchmarks, while having remote access to the NVIDIA GH200 I ran a fresh comparison looking at the performance advantages of switching over to a 64K page size kernel. These new 64K kernel numbers are shown alongside the recent AMD EPYC and Intel Xeon CPU reference benchmark results for a look at how the 4K vs. 64K page size choice affects the overall computing landscape.
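
    For reference, the page size a running kernel was built with can be queried at runtime from userspace; a minimal sketch using only the standard library:

    ```python
    import os

    # Ask the kernel for its page size. On a 64K-page AArch64 kernel
    # this returns 65536; on a default 4K kernel it returns 4096.
    page_size = os.sysconf("SC_PAGE_SIZE")
    print(f"Kernel page size: {page_size} bytes ({page_size // 1024}K)")
    ```

    The same value is what `getconf PAGESIZE` reports from the shell.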


  • #2
    Appreciate this test, but it would be a lot easier to compare the benchmarks if the 64K and regular 4K configs had different colors.
    Last edited by Kjell; 27 February 2024, 01:42 PM.

    Comment


    • #3
      Would be interesting to include the recent Threadripper workstation benchmarks for comparison.

      Comment


      • #4
        I can only imagine how a Grace-Grace Superchip system would perform. I expect it to scale well. We will hopefully know soon.

        Comment


        • #5
          Speaking way out of my area of expertise here, but if I recall, 16K was already a big jump in performance from 4K, and 4K as the default/presumed page size might actually be sunset in Linux consumer devices of all kinds.

          I think even Android (15?) is looking at supporting 16K, and early examinations show clear benefits.

          Sticking with Android for a second, I had read that Android apps may need developers to rebuild them to support >4K page sizes. If so, it may be a few years or releases after Android 15 before we see mandatory support for >4K there. I don't know if Android is, or will soon be, going to 64K, even though ARM generally supports it.
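
          The rebuild requirement comes from code that hardcodes the historical 4096-byte page size instead of asking the OS at runtime. A hedged sketch of the difference (the offset value is just an illustration):

          ```python
          import mmap

          # Fragile: assuming the historical 4K page size. An offset that is
          # 4K-aligned but not 16K-aligned breaks on a 16K-page kernel.
          HARDCODED = 4096

          def round_up(n: int, align: int) -> int:
              """Round n up to the next multiple of align."""
              return (n + align - 1) // align * align

          # Robust: use the page size the running kernel actually reports.
          runtime = mmap.PAGESIZE

          offset = 10_000
          print(round_up(offset, HARDCODED))  # 12288 -- only valid on 4K pages
          print(round_up(offset, runtime))    # correct on any page size
          ```

          On a 16K-page kernel the hardcoded version produces offsets that are not page-aligned, which is exactly the class of breakage that forces app rebuilds.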

          Comment


          • #6
            Lol .. someone discovered the wheel

            Comment


            • #7
              I presume the reason it makes an improvement is all about the TLBs and CPU caches, not page swapping for these solo bench tests, right? I'm quite vague on any details of the working. Anyone got a description of the typical sequence that might be happening?

              Comment


              • #8
                Originally posted by user556 View Post
                I presume the reason it makes an improvement is all about the TLBs and CPU caches, not page swapping for these solo bench tests, right? I'm quite vague on any details of the working. Anyone got a description of the typical sequence that might be happening?
                In most cases you are correct (one can always find an exception to any rule). While on most x86_64 CPUs the TLBs are rather large (and may even have their own levels), on ARM devices the TLBs tend to be much smaller, and every miss can have a large impact on performance (and in some cases, TLB thrashing can make it even worse) due to the page walks required for a miss.
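
                The "TLB reach" framing can be made concrete with simple arithmetic. A sketch, assuming a 48-entry L1 DTLB purely for illustration:

                ```python
                # TLB reach: how much memory can be addressed without a page
                # walk. The 48-entry L1 DTLB figure is an illustrative
                # assumption, not a measured value.
                ENTRIES = 48

                for page_kb in (4, 64):
                    reach_kb = ENTRIES * page_kb
                    print(f"{page_kb}K pages: L1 DTLB covers {reach_kb} KB "
                          f"({reach_kb / 1024:.1f} MB)")

                # 4K pages  -> 192 KB of reach; working sets beyond that miss
                #              in the L1 DTLB and pay page-walk latency.
                # 64K pages -> 3072 KB (3 MB): 16x the reach with the same
                #              number of TLB entries.
                ```

                The same multiplier applies at every TLB level, which is why larger pages help most when the working set far exceeds the 4K reach.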

                Comment


                • #9
                  Originally posted by CommunityMember View Post

                  In most cases you are correct (one can always find an exception to any rule). While on most x86_64 CPUs the TLBs are rather large (and may even have their own levels), on ARM devices the TLBs tend to be much smaller, and every miss can have a large impact on performance (and in some cases, TLB thrashing can make it even worse) due to the page walks required for a miss.
                  ARM Neoverse TLBs are smaller, but I wouldn't say drastically so. Neoverse V2 has a slightly smaller L1 DTLB than Zen3 (48 entries vs 64) and the same L2 DTLB size (2k entries.)

                  AMD bulked up TLB structures in Zen4 (to 72 + 3072 iirc) but ARM did in the V3 as well (L1DTLB has 96 entries, while L2DTLB stays at 2048.)

                  Comment


                  • #10
                    AMD processors also have page coalescing: if 8 consecutive 4kB pages are aligned to a 32kB boundary, consecutive in physical memory, and share the same attributes, then a single TLB entry covers the whole 32kB. This is totally transparent and does not need any setup.

                    Of course, for that to happen, the application should allocate from the OS in chunks >= 32kB, but I assume most do. That depends on the behaviour of the allocator, but in this day and age it's unwise not to.

                    Also, the OS can help if it aligns the allocations on 32kB boundaries (provided it finds 32kB of consecutive physical memory to service the request), but it need not configure anything on the CPU.
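
                    To put numbers on it, one coalesced entry replaces eight. A small sketch of the bookkeeping, assuming the 8-page / 32kB grouping described above and an ideally aligned, physically contiguous buffer:

                    ```python
                    PAGE_KB = 4   # base page size
                    GROUP = 8     # pages per coalesced entry (8 x 4kB = 32kB)

                    def tlb_entries(size_kb: int, coalesced: bool) -> int:
                        """TLB entries needed to map size_kb of 4kB pages,
                        assuming (when coalesced) the region is 32kB-aligned
                        and physically contiguous throughout."""
                        pages = size_kb // PAGE_KB
                        return pages // GROUP if coalesced else pages

                    buf_kb = 1024  # a 1 MB buffer
                    print(tlb_entries(buf_kb, coalesced=False))  # 256 entries
                    print(tlb_entries(buf_kb, coalesced=True))   # 32 entries
                    ```

                    The 8x reduction is best-case; any page in the group with different attributes or a non-contiguous physical mapping falls back to one entry per page.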

                    Comment
