Ampere Altra Performance Shows It Can Compete With - Or Even Outperform - AMD EPYC & Intel Xeon


  • Originally posted by PerformanceExpert
    Actually AMD uses many more NUMA domains due to 4 cores sharing a L3 slice in Zen 2 and chiplets.
    It's not NUMA in a meaningful sense. Yes, there's a slight latency hit when accessing data in another CCX's L3 slice, but it's not like reading/writing memory via a memory controller that's multiple hops away. Cache is dynamic and data tends not to sit in it for long, whereas if the data one needs resides in DRAM accessed via a far-away memory controller, that's where it's going to stay.

    Originally posted by PerformanceExpert
    Either way there is no "nuance" here - there isn't a magic setting that could significantly improve scores across all benchmarks.
    Not across all benchmarks, but NUMA favors specific ones. When extrapolating from the limited data we have here, such details are very relevant.


    • Originally posted by Dukenukemx
      None of these are desktop ARM chips.
      Technically true, but now it just feels like you're trying to move the goalposts. So, what is it you're really after? I thought you just wanted a way to code for ARM other than on a cellphone or Pi.


      • Originally posted by ldesnogu
        Are you just trolling? Or unable to admit you were plain wrong?
        Yes, he's one of the site's biggest trolls, and arguing with him isn't much more effective than one would expect. That doesn't mean he's always wrong, though. And at least his posts are short.


        • Originally posted by pal666
          x86_64 also didn't retain full binary compatibility, they are even on this metric
          What do you mean by that?

          Originally posted by pal666
          and arm had few thousand more several years later(i.e. it was smaller than contemporary x86 - it was its main point)
          Why are you still arguing that point, when ARMv8-A is a completely different ISA?

          The reason why I brought up the 8088 is that x86-64 is just a bolt-on to the legacy x86 ISA, thus such complaints about its ISA are very relevant to modern x86.

          Originally posted by pal666
          amd64 is also a new isa, compatibility is done by supporting two isas in one cpu(just like arms do)
          In one sense, yes. But if you look at the structure and content of x86-64, it's just a scaled-up version of the original x86, not unlike when Intel went to 32 bits. So, it's not accurate to call it a new ISA.
          Last edited by coder; 17 December 2020, 11:41 AM.


          • Originally posted by juanrga
            You are comparing the efficiency of a CPU-only system with the efficiency of CPU+GPU systems. GPU accelerators improve the efficiency a lot because GPUs are optimized for throughput. If you eliminate the GPUs in those systems, the efficiency drops massively.
            Yeah, but also no.

            Fujitsu made a deliberate decision to build a CPU that would fill both roles. In my opinion, it's fair to judge its efficiency on standard benchmarks, because efficiency has real world consequences, and that's what you're measuring.


            • Originally posted by juanrga
              The ARM64 advantage means a 20-30% smaller die area than a significant x86 design.

              https://www.hpcwire.com/2020/03/17/m...erver-roadmap/
              Yeah it's smaller because it's slower on the same process node.


              • Originally posted by Weasel
                Yeah it's smaller because it's slower on the same process node.
                It's not just smaller but also faster. The Neoverse N1 core is about a third of the size of a Zen 2 core and total silicon area of EPYC 7742 is ~1100mm^2 vs ~350-400mm^2 for Altra (so also about a third). Yet the Altra beats the 7742 on raw performance.
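As a quick sanity check on the "about a third" figure, here is a minimal sketch that only divides the approximate die areas quoted in the post. The constant and function names are made up for illustration, and the inputs are the poster's rough estimates, not measured values.

```python
# Approximate total silicon areas quoted in the post above (mm^2).
# These are the poster's estimates, not official figures.
EPYC_7742_AREA_MM2 = 1100
ALTRA_AREA_RANGE_MM2 = (350, 400)


def area_ratio(smaller_mm2: float, larger_mm2: float) -> float:
    """Return the smaller die area as a fraction of the larger one."""
    return smaller_mm2 / larger_mm2


# Ratio at the low and high end of the quoted Altra range.
low, high = (area_ratio(a, EPYC_7742_AREA_MM2) for a in ALTRA_AREA_RANGE_MM2)

if __name__ == "__main__":
    print(f"Altra / EPYC 7742 area ratio: {low:.2f} to {high:.2f}")
```

Both ends of the range land between 0.3 and 0.4, which is consistent with "about a third" given how rough the inputs are.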


                • Originally posted by coder
                  It's not NUMA in a meaningful sense. Yes, there's a slight latency hit when accessing data in another CCX's L3 slice, but it's not like reading/writing memory via a memory controller that's multiple hops away. Cache is dynamic and data tends not to sit in it for long, whereas if the data one needs resides in DRAM accessed via a far-away memory controller, that's where it's going to stay.
                  Sure, but we're talking about the NUMA differences. Both EPYC and Altra support 2P systems and NUMA is the same in that case. The difference between different DRAM controllers on the same die is also minor (precisely because they are on the same die - remember Zen 1?). So the key NUMA difference between Altra and Zen 2 is in core-to-core latencies - the latencies between different chiplets are higher than within a chiplet, and that extra latency doesn't exist on Altra.

                  Not across all benchmarks, but NUMA favors specific ones. When extrapolating from the limited data we have here, such details are very relevant.
                  Yes, different NUMA settings can give a few percent on some applications. But it's just one of a million optimizations you can apply. These results were using a 4KB page size, while using 64KB pages gives ~10% overall performance gain on Arm (based on actual measurements I did). Is that not relevant?

                  These results were measured using stock installs and settings, thus relevant to what most people would see. You can always tweak and tune further but that was not the goal of this comparison.
                  Last edited by PerformanceExpert; 17 December 2020, 02:30 PM.
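For anyone wanting to check which page size their own system is running before comparing numbers, a minimal sketch in Python follows. It assumes a Unix-like OS where `os.sysconf` accepts `"SC_PAGE_SIZE"`; the helper names are hypothetical, and it only reports the base page size without saying anything about huge pages or the performance effect.

```python
import os


def page_size_bytes() -> int:
    """Return the base memory page size of the running kernel, in bytes.

    Assumes a Unix-like system where SC_PAGE_SIZE is a valid sysconf name.
    """
    return os.sysconf("SC_PAGE_SIZE")


def describe_page_size(size_bytes: int) -> str:
    """Format a page size in KB, e.g. 4096 -> '4 KB'."""
    return f"{size_bytes // 1024} KB"


if __name__ == "__main__":
    print("Base page size:", describe_page_size(page_size_bytes()))
```

On Linux, the base page size is fixed when the kernel is built, so moving an Arm system from 4KB to 64KB pages means booting a kernel configured for it rather than changing a runtime setting.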


                  • Originally posted by coder
                    What do you mean by that?
                    what word is difficult for you?
                    Originally posted by coder
                    Why are you still arguing that point, when ARMv8-A is a completely different ISA?
                    just as amd64, but op was comparing arm to x86
                    Originally posted by coder
                    The reason why I brought up the 8088 is that x86-64 is just a bolt-on to the legacy x86 ISA, thus such complaints about its ISA are very relevant to modern x86.
                    the reason you brought it up is that you have no clue what you are talking about
                    Originally posted by coder
                    In one sense, yes. But, if you look at the structure and content of x86-64, it's just a scaled-up version of the original x86, not unlike when Intel went to 32-bits. So, it's not accurate to call it a new ISA.
                    can x86 code be successfully run in x86-64 mode? no. when i look at armv8-a i see scaled up version of original arm


                    • Originally posted by coder
                      Fujitsu made a deliberate decision to build a CPU that would fill both roles. In my opinion, it's fair to judge its efficiency on standard benchmarks, because efficiency has real world consequences, and that's what you're measuring.
                      However, the standard benchmark is too simplistic and overstates the benefit of GPUs. Fujitsu's goal was to make programming and optimization easier and enable much higher efficiencies on the many HPC codes that don't work well on GPUs.
