Apple M1 ARM Performance With A 2020 Mac Mini

  • Originally posted by PerformanceExpert View Post

    On modern CPUs the vast majority of instructions are a single micro-op. Only complex instructions are split into multiple micro-ops. It varies, but e.g. for Cortex-A72: "On average, Filippo said, each ARMv8 instruction translates into 1.08 micro-ops."

    So micro-ops are just a different encoding of the original ISA.
    ARM is considered RISC, so this shows that even a traditional RISC went for a hybrid approach.
    I would be interested in the ratio for x86, which was traditionally CISC. In the Zen 3 architecture debate between Ian Cutress from AnandTech and Wendell from L1Techs, they mentioned how some x87 instructions were sped up by a lot. So there is still multi-micro-op stuff even for something that old.

    Originally posted by PerformanceExpert View Post
    You have RISC and CISC swapped here. Initial RISCs didn't have any complex instructions, and every instruction was directly executed in a single cycle. CISCs used a micro-code engine to execute every instruction, which took many cycles and was extremely slow. Those days are gone now. RISC ISAs became more complex, while CISCs stopped using the most complex instructions and sped up the commonly used operations by using more transistors.
    You are right. Gonna edit it so it doesn't confuse people. Thanks.



    • Originally posted by ldesnogu View Post
      You can't run DOS programs on Win 10 64-bit, you have to rely on an emulator (which could be run on an ARM machine).

      And as far as Wine is concerned: https://www.macrumors.com/2020/11/18...oftware-on-m1/

      Anyway, as far as long-term HW support goes, nothing beats a self-assembled machine with carefully chosen components.

      But that's an OS issue, not a hardware one. The hardware still supports 16-bit execution just fine. Unlike ARM, which can't even run the last major revision of the ISA.

      Hence why Apple has been asking for the LLVM IR for quite some time, and developers would be smart to stash a copy of the IR somewhere if they are distributing outside the Apple App Store. However, that doesn't help existing legacy applications. Rosetta 2 looks good, but because it's not full hardware emulation you're going to be finding corner cases for years to come, and the fact that everything GPU has to be translated to Metal, a bespoke custom API, doesn't help matters any. Comparing it to DOSBox, which does cycle-accurate emulation of a whole system, is certainly on the hopeful side.



      • Originally posted by pixo View Post

        ARM is considered RISC, so this shows that even a traditional RISC went for a hybrid approach.
        I would be interested in the ratio for x86, which was traditionally CISC. In the Zen 3 architecture debate between Ian Cutress from AnandTech and Wendell from L1Techs, they mentioned how some x87 instructions were sped up by a lot. So there is still multi-micro-op stuff even for something that old.


        You are right. Gonna edit it so it doesn't confuse people. Thanks.
        ARM also added an instruction (FJCVTZS) just to accelerate JavaScript. Maybe not so RISC anymore. The reason you do micro-ops is to reduce the complexity of the execution side in favor of some complexity at the front end. What exactly is cut and what isn't is highly design-specific. And then there's also instruction fusion: you take common idioms in instruction streams and mash them together into one fused op (say, add 1 and compare). The idea being that not only do you save a cycle, you also avoid the power overhead of moving the data around so much.

        There is indeed a lot going on in the micro-arch that can add to the strengths or mitigate the weaknesses of the top-level ISA.
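
        To make the fusion idea concrete, here is a minimal sketch in C (hypothetical; whether a given core actually fuses the pair is microarchitecture-specific). The loop below typically compiles to a compare (or decrement) immediately followed by a conditional branch, the classic pair that many x86 and ARM decoders merge into one op:

            /* The i < n test plus the branch back to the loop top usually
             * become a cmp + conditional-branch pair in the compiled code;
             * many modern decoders fuse that pair into a single op, saving
             * an issue slot and the energy of tracking two ops in flight. */
            long sum_first_n(const long *a, long n)
            {
                long s = 0;
                for (long i = 0; i < n; i++)
                    s += a[i];
                return s;
            }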



        • Originally posted by WorBlux View Post

          ARM also added an instruction (FJCVTZS) just to accelerate JavaScript. Maybe not so RISC anymore. The reason you do micro-ops is to reduce the complexity of the execution side in favor of some complexity at the front end. What exactly is cut and what isn't is highly design-specific. And then there's also instruction fusion: you take common idioms in instruction streams and mash them together into one fused op (say, add 1 and compare). The idea being that not only do you save a cycle, you also avoid the power overhead of moving the data around so much.

          There is indeed a lot going on in the micro-arch that can add to the strengths or mitigate the weaknesses of the top-level ISA.
          That is true, but for that you don't need micro-ops. Those are needed to split an instruction into a sequence of micro-ops and execute them over multiple cycles. These sequences are defined in microcode.
          Pure RISC CPUs did not have micro-ops because all instructions were implemented in HW (hardwired) and there was no need to split them into sequences of micro-ops. That's why each instruction took one cycle to execute. Instruction fusion can still be done, but it comes down to terminology whether you call the results fused ops or fused instructions. There is still no need to split an instruction into a sequence of micro-ops. And in OoO execution you reorder and execute several instructions, where possible, instead of micro-ops as in CISC.

          In the end, both ARM and x86 are hybrids. Some ARM instructions are done via micro-ops to save space, and some x86 instructions are hardwired for speed.
          How much is done via micro-ops, with how many of them, and how much is hardwired is one aspect of the micro-arch.
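
          As a schematic illustration of that split (hypothetical register choices, and uop counts that vary by core), here is how the same C read-modify-write maps onto the two ISAs:

              /* One line of C: */
              void rmw_add(int *p, int x)
              {
                  *p += x;
              }
              /* On x86-64 this can be a single CISC instruction that the
               * decoder splits into micro-ops (roughly load, add, store):
               *     add dword ptr [rdi], esi
               * On AArch64 it is already three RISC-shaped instructions,
               * each mapping onto roughly one micro-op:
               *     ldr w8, [x0]
               *     add w8, w8, w1
               *     str w8, [x0]
               */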



          • Originally posted by pixo View Post

            That is true, but for that you don't need micro-ops. Those are needed to split an instruction into a sequence of micro-ops and execute them over multiple cycles. These sequences are defined in microcode.
            Pure RISC CPUs did not have micro-ops because all instructions were implemented in HW (hardwired) and there was no need to split them into sequences of micro-ops. That's why each instruction took one cycle to execute. Instruction fusion can still be done, but it comes down to terminology whether you call the results fused ops or fused instructions. There is still no need to split an instruction into a sequence of micro-ops. And in OoO execution you reorder and execute several instructions, where possible, instead of micro-ops as in CISC.

            In the end, both ARM and x86 are hybrids. Some ARM instructions are done via micro-ops to save space, and some x86 instructions are hardwired for speed.
            How much is done via micro-ops, with how many of them, and how much is hardwired is one aspect of the micro-arch.
            Yes, fused ops is probably the better term.

            Yes, the first RISCs did it that way to simplify pipelining. Most instructions were one cycle, but not all: multiply and load come to mind. Those are conceptually simple tasks, though, and you'd raise a stall until it was safe to proceed with issue again.

            Yet OoO and superscalar execution have changed the constraints and bottlenecks since then, and the µarchs have somewhat converged. For kicks and giggles, compare the A78 to Skylake, and the A76 to Tremont.



            • Originally posted by WorBlux View Post
              But that's an OS issue, not a hardware one. The hardware still supports 16-bit execution just fine. Unlike ARM, which can't even run the last major revision of the ISA.
              I was answering the claim that you can still run DOS programs on a Windows x86 machine. You're facing the same issue as on an M1 machine: you have to run DOSBox on Win 10 64-bit.

              Hence why Apple has been asking for the LLVM IR for quite some time, and developers would be smart to stash a copy of the IR somewhere if they are distributing outside the Apple App Store. However, that doesn't help existing legacy applications. Rosetta 2 looks good, but because it's not full hardware emulation you're going to be finding corner cases for years to come, and the fact that everything GPU has to be translated to Metal, a bespoke custom API, doesn't help matters any. Comparing it to DOSBox, which does cycle-accurate emulation of a whole system, is certainly on the hopeful side.
              Why would a user program need accurate hardware simulation? I mean, beyond old programs, no one directly accesses HW anymore, I hope. And for cross-graphics-API translation, I already posted a link which shows Wine working on an M1.

              Anyway, for Apple, emulation is just a stopgap until most applications are ported. And it seems to be doing a very good job at that.



              • Originally posted by ldesnogu View Post
                I was answering the claim that you can still run DOS programs on a Windows x86 machine. You're facing the same issue as on an M1 machine: you have to run DOSBox on Win 10 64-bit.


                Why would a user program need accurate hardware simulation? I mean, beyond old programs, no one directly accesses HW anymore, I hope. And for cross-graphics-API translation, I already posted a link which shows Wine working on an M1.

                Anyway, for Apple, emulation is just a stopgap until most applications are ported. And it seems to be doing a very good job at that.
                Win7 32-bit will run on modern hardware, and supports DOS and a huge swath of the x86 Windows back catalogue. Yes, DOS is largely a solved problem, but there is a large catalogue of applications between the DOS era and the modern app store.

                And porting only helps if you, or someone who still cares about the application, has the source code. There are still functional programs out there with unique features whose source is lost to time. And yes, some of these touch hardware more directly for whatever reason: maybe they control an external device, are quite sensitive to timing, or are self-modifying in some way. Experience has shown there are always corner cases, and the more obscure the corner, the harder it is to fix.

                And I did look more into the CrossOver-on-Rosetta claim. Yes, some applications work well, but support is spotty at best. You are always going to be translating through two layers of API: DirectX or GL -> Vulkan -> Metal, and nobody in the FOSS world is really that interested in doing DX -> Metal. Parallels looks like it has basic support for a paravirtualized DX11 driver on top of Metal, but there are still reports of compatibility problems there, and how you'd best leverage Rosetta in that situation is an unanswered question.



                • A lot of people are surely surprised by the results, but one thing people are not aware of is...
                  128-bit memory bus.

                  The M1 literally runs 8x 16-bit channels of LPDDR4X-4266-class memory. A lot of the impressive results you see do not come from the superiority of ARM silicon, but more from the fact that it supports more memory channels than normal PCs (outside of HEDT), with very fast LPDDR4X RAM by default that sits close to the chip itself. Also, a single core can utilize the whole memory bus, not just some group of cores. A lot of this chip's speed comes from the fact that one of its cores can access as much data (at the level of ~60 GB/s) as a Threadripper 1950X, which is a 16-core/32-thread CPU.

                  This produces impressive results in some benchmarks, like ZSTD compression or SQL stuff, but it doesn't show the superiority of the ARM architecture at all; rather, it shows that normal desktop CPUs should start moving to quad channels instead of dual.
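
                  For reference, the peak figure behind this argument is just bus width times transfer rate. A small sketch of the arithmetic (assuming the 128-bit bus and 4266 MT/s rate described above):

                      #include <stdio.h>

                      int main(void)
                      {
                          /* Peak DRAM bandwidth = transfers/s * bytes per transfer. */
                          const double transfers_per_sec = 4266e6;       /* LPDDR4X-4266 */
                          const double bytes_per_transfer = 128.0 / 8.0; /* 128-bit bus */
                          printf("peak: %.1f GB/s\n",
                                 transfers_per_sec * bytes_per_transfer / 1e9);
                          /* Prints ~68.3 GB/s. Dual-channel DDR4-3200 by the same
                           * math: 3200e6 * 16 B = 51.2 GB/s. */
                          return 0;
                      }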



                  • Originally posted by piotrj3 View Post
                    A lot of people are surely surprised by the results, but one thing people are not aware of is...
                    128-bit memory bus.

                    The M1 literally runs 8x 16-bit channels of LPDDR4X-4266-class memory. A lot of the impressive results you see do not come from the superiority of ARM silicon, but more from the fact that it supports more memory channels than normal PCs (outside of HEDT), with very fast LPDDR4X RAM by default that sits close to the chip itself.
                    At the risk of asking a dumb question, doesn't a typical PC CPU also have 128-bit memory (2 channels x 64 bits/channel)?

                    Agreed that having more/smaller channels opens the door for more efficient memory usage (we do the same with GPUs), but I think they are both 128-bit, at least if you ignore OEMs who configure a dual-channel CPU with single-channel RAM and no expansion capability.



                    • Originally posted by bridgman View Post

                      At the risk of asking a dumb question, doesn't a typical PC CPU also have 128-bit memory (2 channels x 64 bits/channel)?

                      Agreed that having more/smaller channels opens the door for more efficient memory usage (we do the same with GPUs), but I think they are both 128-bit, at least if you ignore OEMs who configure a dual-channel CPU with single-channel RAM and no expansion capability.
                      It is a good question. Yes, it is true that normal CPUs theoretically have 2x 64 bits, but real performance with wide channels is clearly worse than with many small channels; it also means that every time you read a value smaller than 64 bits, you theoretically waste memory bus (which, considering that the most typical type of data is an integer, is quite wasteful).

                      I mean, the theoretical max performance of a 128-bit memory bus with 4266 MT/s memory would be around ~68 GB/s. In practice, in benchmarks not even a Ryzen 5950X achieves beyond ~36 GB/s (the Intel 10900K is even worse here). Meanwhile, the AnandTech review of the M1 had this quote:

                      One aspect we’ve never really had the opportunity to test is exactly how good Apple’s cores are in terms of memory bandwidth. Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before.
                      Last edited by piotrj3; 27 November 2020, 07:33 AM.
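
                      For anyone who wants to reproduce the single-core copy numbers, here is a rough bandwidth probe, a minimal sketch only: the buffer size and repetition count are arbitrary choices, and serious tools (STREAM and friends) control for far more.

                          #include <stdio.h>
                          #include <stdlib.h>
                          #include <string.h>
                          #include <time.h>

                          int main(void)
                          {
                              /* Buffers far larger than any last-level cache, so the
                               * copy mostly exercises DRAM rather than SRAM. */
                              const size_t n = 512u * 1024 * 1024;  /* 512 MiB each */
                              const int reps = 8;
                              char *src = malloc(n), *dst = malloc(n);
                              if (!src || !dst)
                                  return 1;
                              memset(src, 1, n);  /* touch pages so they are mapped */
                              memset(dst, 2, n);

                              struct timespec t0, t1;
                              clock_gettime(CLOCK_MONOTONIC, &t0);
                              for (int i = 0; i < reps; i++)
                                  memcpy(dst, src, n);
                              clock_gettime(CLOCK_MONOTONIC, &t1);

                              double secs = (t1.tv_sec - t0.tv_sec)
                                          + (t1.tv_nsec - t0.tv_nsec) / 1e9;
                              /* A copy reads and writes every byte: factor of 2. */
                              printf("copy: ~%.1f GB/s\n", 2.0 * n * reps / secs / 1e9);
                              free(src);
                              free(dst);
                              return 0;
                          }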

