Ampere Altra Max Continues To Deliver Competitive Power Efficiency To AMD EPYC & Intel Xeon


  • Ampere Altra Max Continues To Deliver Competitive Power Efficiency To AMD EPYC & Intel Xeon

    Phoronix: Ampere Altra Max Continues To Deliver Competitive Power Efficiency To AMD EPYC & Intel Xeon

    While it's been three years now since Ampere Altra Q80 was first introduced and two years since first testing the 128-core Ampere Altra Max, this ARM server platform has aged rather well: more robust hardware platforms have come to market with better firmware, the AArch64 Linux/open-source software ecosystem as a whole has improved a lot during this time, and more open-source projects have received ARM optimizations. While we're eagerly waiting to see AmpereOne hardware, here is a look at how the Ampere Altra Max M128-30 is standing up against current AMD EPYC Genoa(X) and Bergamo server CPUs along with Intel Xeon Scalable Sapphire Rapids processors in raw performance and power efficiency.


  • #2
    This just displays how much better ARM is compared to x86. Imagine if this Ampere was refreshed to be manufactured in the same factory as the EPYCs, what efficiency benefit it would add to the already superior show.



    • #3
      Originally posted by varikonniemi View Post
      This just displays how much better ARM is compared to x86. Imagine if this Ampere was refreshed to be manufactured in the same factory as the EPYCs, what efficiency benefit it would add to the already superior show.
      It has very little to do with ARMv8/v9 vs x86_64. The CPU's ISA is just one piece of a very large system design.
      These direct ISA statements are worth next to nothing unless you can prove instruction-set or design efficiency on paper, which, I assure you, you won't.
      And even if you could, it's still just bits in a very big system design.

      Let's say you have a more efficient design; then you still can't say that it depends on the instruction set.
      It's easier to say that the design teams are better and are providing more efficient designs using their fab capabilities, but claiming that the ISA is the be-all and end-all is just stupid.



      • #4
        These aarch64-based platforms really seem ideal for all those non-compute-intensive websites. Some requests may take a bit longer, but it doesn't really matter to anyone. A good trade-off indeed.



        • #5
          Originally posted by varikonniemi View Post
          This just displays how much better ARM is compared to x86.
          Did you bother to look through the results? There are only a handful of cases where it manages to win on efficiency.

          Originally posted by varikonniemi View Post
          Imagine if this Ampere was refreshed to be manufactured in the same factory as the EPYCs,
          It's not hard to work out. TSMC publishes the efficiency gains of each process node. According to this, I think it should use about 30% less power at the same performance (i.e. on N5 vs N7).
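
          As a back-of-the-envelope sketch of what that estimate implies for performance per watt, assuming the roughly 30% power reduction at equal performance quoted above (the function name is illustrative and the figure is the post's estimate, not a measurement):

```python
def perf_per_watt_gain(power_reduction: float) -> float:
    """Relative perf-per-watt gain when power drops by `power_reduction`
    (a fraction between 0 and 1) while performance stays the same."""
    if not 0.0 <= power_reduction < 1.0:
        raise ValueError("power_reduction must be in [0, 1)")
    return 1.0 / (1.0 - power_reduction)


# Assumption taken from the post: about 30% less power at the same performance.
gain = perf_per_watt_gain(0.30)
print(f"perf-per-watt gain: {gain:.2f}x")
```

          Under that assumption alone, a node shrink would improve performance per watt by a bit over 40%, before any design changes.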



          • #6
            Originally posted by varikonniemi View Post
            This just displays how much better ARM is compared to x86. Imagine if this Ampere was refreshed to be manufactured in the same factory as the EPYCs, what efficiency benefit it would add to the already superior show.
            Performance is no less important than efficiency.

            Extensive out-of-order structures, large BTBs, accurate predictors, etc. These all consume power.

            That's why in phones the weakest (but most efficient) cores don't even have out-of-order execution, and architecturally they are closer to an Intel Pentium than a Core Duo.

            Neoverse N1 in the Ampere Altra has all structures much smaller and more primitive than Zen 4 or Raptor Cove - hence its higher efficiency. There is no magic here.

            That's why Intel, and probably slowly AMD too, will go into poor-core based processors - where multiprocessing proves itself. Single-threaded performance is too expensive in transistors and energy.

            E.g. Raptor Cove is 4 times bigger (needs 4 times more transistors) than Gracemont, and has (from what I remember) only ~40% higher IPC.
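
            A minimal sketch of that trade-off, using only the from-memory figures above (about 4x the area for about 40% more IPC). The function name and the normalization to the small core are illustrative assumptions, and clock speed, SMT and uncore costs are ignored:

```python
def throughput_per_area(ipc: float, area: float) -> float:
    """IPC delivered per unit of die area (arbitrary, normalized units)."""
    if area <= 0:
        raise ValueError("area must be positive")
    return ipc / area


# Assumed figures from the post, normalized to the small core:
# the big core is ~4x the size and has ~40% higher IPC.
small = throughput_per_area(ipc=1.0, area=1.0)
big = throughput_per_area(ipc=1.4, area=4.0)

# For perfectly parallel work at equal clocks, small cores filling the
# same area would deliver this many times the big core's throughput.
advantage = small / big
print(f"small-core throughput-per-area advantage: {advantage:.2f}x")
```

            This only holds when the workload scales across cores; for single-threaded work the big core still wins.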
            Last edited by HEL88; 12 December 2023, 12:45 AM.



            • #7
              An interesting metric would be to throttle the AMD64 products' clock speeds down to the point where they match the Ampere part's geomean performance, and then measure the performance per watt. This is because Intel/AMD server CPUs tend to operate radically more efficiently when you clock them down, i.e. around 2.4GHz instead of 3.6GHz. The difference, especially for the Intel parts, is huge. And downclocking them by 33% doesn't mean a 33% reduction in performance either, as the memory subsystem keeps operating at full speed, so the relative performance per clock rises.
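
              A toy model of that effect, to show why performance per watt can rise sharply when clocking down. Everything here is an assumption for illustration: dynamic power is taken to scale with the cube of frequency (voltage assumed to track frequency), static and platform power are ignored, and 30% of baseline runtime is assumed to be memory-bound and unaffected by core clock:

```python
def relative_perf(freq_ratio: float, mem_fraction: float) -> float:
    """Performance relative to baseline when the core clock is scaled by
    freq_ratio. mem_fraction is the share of baseline runtime spent waiting
    on memory, assumed not to change with core clock."""
    runtime = (1.0 - mem_fraction) / freq_ratio + mem_fraction
    return 1.0 / runtime


def relative_power(freq_ratio: float, exponent: float = 3.0) -> float:
    """Power relative to baseline, assuming P ~ f**exponent (hypothetical)."""
    return freq_ratio ** exponent


ratio = 2.4 / 3.6                                # downclock from 3.6GHz to 2.4GHz
perf = relative_perf(ratio, mem_fraction=0.3)    # assumed 30% memory-bound
power = relative_power(ratio)
print(f"perf: {perf:.2f}x, power: {power:.2f}x, perf/W: {perf / power:.2f}x")
```

              In this model the 33% downclock costs about a quarter of the performance while cutting power to under a third, which is the kind of gap that real measurements would need to confirm or refute.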



              • #8
                We have an Ampere Altra Max M128-30 and my observations are similar to the results of this test. It can't match AMD/Intel, or even Graviton3/3E, in performance, but it is very power efficient. It performs well for highly parallel workloads. It would be nice if it had more memory channels. For me personally, this was (and still is, given the pricing of Nvidia Grace) the best system for the role we acquired it for.



                • #9
                  Originally posted by HEL88 View Post
                  Performance is no less important than efficiency.

                  Extensive out-of-order structures, large BTBs, accurate predictors, etc. These all consume power.

                  That's why in phones the weakest (but most efficient) cores don't even have out-of-order execution, and architecturally they are closer to an Intel Pentium than a Core Duo.
                  Apple's E-cores are OoO, and have been for a really long time. They're incredibly efficient, too. There are some energy-saving optimizations you can do, if you have all that nice out-of-order machinery. Also, stalls are a higher power-state than idling the core. So, if you can reduce stalls to complete the work sooner and get back to idle, it can provide energy-savings (though not if you burn too much power in the race to idle).
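
                  A small sketch of the race-to-idle arithmetic, with made-up wattages and times chosen only to show both outcomes (none of these numbers come from real cores):

```python
def task_energy(active_power_w: float, active_time_s: float,
                idle_power_w: float, window_s: float) -> float:
    """Energy in joules over a fixed time window: the core runs the task at
    active_power_w for active_time_s, then idles for the rest of the window."""
    if active_time_s > window_s:
        raise ValueError("task does not finish within the window")
    idle_time_s = window_s - active_time_s
    return active_power_w * active_time_s + idle_power_w * idle_time_s


# Hypothetical cores running the same task inside a 20-second window.
slow = task_energy(1.0, 10.0, idle_power_w=0.1, window_s=20.0)
fast = task_energy(1.5, 6.0, idle_power_w=0.1, window_s=20.0)
too_hot = task_energy(3.0, 5.0, idle_power_w=0.1, window_s=20.0)

print(f"slow: {slow:.1f} J, fast: {fast:.1f} J, too hot: {too_hot:.1f} J")
```

                  The faster core wins when its extra power is modest relative to the time it saves, and loses once the active power grows faster than the runtime shrinks.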

                  The thing to keep in mind about ARM's E-cores is they're not only optimizing for energy-efficiency, but also area-efficiency. The two are related, but not synonymous. Apple is willing to build bigger E-cores, in order to make them as efficient as possible (as well as reducing how often you need to wake up the P-cores).

                  Originally posted by HEL88 View Post
                  Neoverse N1 in the Ampere Altra has all structures much smaller and more primitive than Zen 4 or Raptor Cove - hence its higher efficiency. There is no magic here.
                  Well... it did come up that Altra's caches can detect an overwrite and avoid the typical write-miss penalty. Best exemplified in their standout Stream Triad performance, in spite of using the exact same speed, number, and type of DIMMs as EPYC.

                  Those cache fetches Altra isn't doing also translate into energy it's not wasting!



                  • #10
                    Originally posted by coder View Post
                    Apple's E-cores are OoO, and have been for a really long time.
                    And they represent only a fraction of what is in the performance core. After all, that is the point: the bigger the OoO window, the more accurate the predictor, and the bigger the BTB, the higher the energy consumption.

                    That's why energy-efficient cores ALWAYS have these structures smaller than performance cores.

                    Tell us how deep the OoO window is for Apple's E-cores. The performance core is 640 deep.
                    Well... it did come up that Altra's caches can detect an overwrite and avoid the typical write-miss penalty.
                    nice

                    This does not change the fact that Neoverse N1 is a significantly weaker core, for example:

                    OoO window: 128 entries vs 320 in Zen 4
                    8 execution ports vs 14 in Zen 4
                    etc.

                    Benchmarking 4 cores performs similarly to the old i5-6600K at 3GHz - very weak as of today.

                    [Charts not preserved: Compilation, File Compression, Render time]

                    Deep Diving Neoverse N1 – Chips and Cheese

