How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs

  • Does anyone have a 4 GB RPi on hand to test? I am curious whether any of these tests are RAM-limited. Also, I assume the ondemand governor was used, although I don't think that would change anything significantly.

    Comment


    • Originally posted by ldesnogu View Post

      The Cortex-A77 has about the performance of an Apple A11. The iPhone with the A11 was released in September 2017, while the Qualcomm 865 was released in early 2020. So that's more than the 1 to 1.5 years I previously claimed; it's about 2 years.

      Well, that gap has narrowed considerably: it's now one generation, as the upcoming Cortex-X1 pretty much matches the A13 and the fastest Zen 2. Note the much higher power efficiency of the Cortex cores (showing that high performance does not mean inefficiency).



      Comment


      • Originally posted by JimmyZ View Post
        The horrifying situation now is that high-performance ARM processors are not for sale, like the Amazon Graviton2 and Apple Ax, which is worse than the situation with x86; yes, there are only 2 competitors, but at least they sell bare processors.
        You can order high-end Arm servers and workstations online, e.g. https://store.avantek.co.uk/arm-servers.html. Arm is expanding in servers, laptops and desktops, so there will be more choice in the future.

        Comment


        • Originally posted by atomsymbol View Post

          Just a quick note:

          Given a memory address X, store(X) executes in 1 clock because L1D cache stores are pipelined, i.e. the L1D can sustain one store every cycle even though each store takes 4 cycles to complete. Similarly, load(Y), where Y is different from X, takes 1 cycle to execute because the L1D can sustain one load every cycle even though each load takes 4 cycles to complete. So in fact (assuming one L1D load port and one L1D store port), load(Y) followed by load(Z) takes 1+4=5 cycles to finish. From this we can derive that, assuming the load pipeline is kept 100% busy, any L1D load effectively takes just 1 cycle to execute, without any memory bypassing, as long as the load address differs from the store addresses of the previous 4 store instructions.

          So I really think that for memory bypassing to have a real performance impact (not a 1% impact like in the paper, but a 20+% impact), you actually need to target/find the situations in which the store(X) load(X) pair is seen by the instruction window at the same time (aka "in the same clock").



          I am not so sure that the paper is about a negation, at least not in the sense that it is a fundamental negation. Given the technological constraints (max number of transistors in a single chip) and algorithmic constraints (what the state of the art is in a given year), the paper is right. The paper essentially says that "we didn't find the situation/circumstances in which memory bypassing is required in order for performance to improve by 20+% and we do not intend to search further".
          I'm afraid that, in practice, the utilization of the load pipeline/path is way way lower than that (so you end up waiting). The utilization of the ALU pipelines is definitely better, but even there you rarely ever get close to 100% utilization.

          I suggest the following experiment:

          Start with a CPU-intensive benchmark that lasts roughly 10-20 seconds (more is not a problem, it'll just make you wait a bit longer). The easiest option is compressing a larger file, which I also copied to /dev/shm. But feel free to pick your own CPU-intensive benchmark (this is not meaningful for an I/O benchmark that spends most of its time waiting...).

          Then run the following:

          perf stat -e cycles,instructions,cache-references << your actual command >>

          If your processor is an Intel one, the following should probably work as well:

          perf stat -e cycles,uops_retired.all,mem_uops_retired.all_loads << your actual command >>

          Generally, it's better to report uops rather than instructions, and the uops_retired.all and mem_uops_retired.all_loads counters are precise on my processor.

          Then see what the IPC is (instructions per cycle, or uops per cycle). Also see how frequent the loads are.

          Please try the previous, as I'm curious what numbers you get. The first command should also work on an RPi2/3/4 actually.
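
          For example, a minimal run could look like the following (the file choice, the /dev/shm path and the redirect to /dev/null are just an illustration; exact event names vary between CPUs and kernel versions):
          Code:
          # Copy a reasonably large file to tmpfs so the run is CPU-bound rather than I/O-bound.
          cp /usr/bin/Xorg /dev/shm/testfile

          # Generic events; this form should also work on an RPi2/3/4.
          perf stat -e cycles,instructions,cache-references -- xz -9c /dev/shm/testfile > /dev/null

          # IPC is the "insn per cycle" figure perf prints; loads per cycle is
          # L1-dcache-loads divided by cycles, where that event is available.
          perf stat -e cycles,L1-dcache-loads -- xz -9c /dev/shm/testfile > /dev/null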

          Comment


          • Originally posted by atomsymbol View Post

            I am not so sure that the paper is about a negation, at least not in the sense that it is a fundamental negation. Given the technological constraints (max number of transistors in a single chip) and algorithmic constraints (what the state of the art is in a given year), the paper is right. The paper essentially says that "we didn't find the situation/circumstances in which memory bypassing is required in order for performance to improve by 20+% and we do not intend to search further".
            Thanks

            It is indeed really difficult to do a negative proof in engineering: you can maybe show that something doesn't work in some circumstances, but the proof may be turned upside down when you change the setup.

            Anyway, RENO does store-load bypassing in a way that I believe is worth doing. Also, see the name of the first author.

            Comment


            • Originally posted by vladpetric View Post

              I'm afraid that, in practice, the utilization of the load pipeline/path is way way lower than that (so you end up waiting). The utilization of the ALU pipelines is definitely better, but even there you rarely ever get close to 100% utilization.

              I suggest the following experiment:

              Start with a CPU-intensive benchmark that lasts roughly 10-20 seconds (more is not a problem, it'll just make you wait a bit longer). The easiest option is compressing a larger file, which I also copied to /dev/shm. But feel free to pick your own CPU-intensive benchmark (this is not meaningful for an I/O benchmark that spends most of its time waiting...).

              Then run the following:

              perf stat -e cycles,instructions,cache-references << your actual command >>

              If your processor is an Intel one, the following should probably work as well:

              perf stat -e cycles,uops_retired.all,mem_uops_retired.all_loads << your actual command >>

              Generally, it's better to report uops rather than instructions, and the uops_retired.all and mem_uops_retired.all_loads counters are precise on my processor.

              Then see what the IPC is (instructions per cycle, or uops per cycle). Also see how frequent the loads are.

              Please try the previous, as I'm curious what numbers you get. The first command should also work on an RPi2/3/4 actually.
              Except none of this has anything to do with loads reading the store output, or the reverse.

              https://fuse.wikichip.org/news/2339/...performance/2/

              The A77 doubles the IPC of the A72 by increasing the number of instructions decoded and processed as uops. Even though the A77 doubles the instructions processed per clock, it only increases the uops by half.

              The A57, A72 and A76 in fact have the same number of micro-ops. The A57 (the RPi 3) handles 2 full Arm instructions per cycle, the A72 (the RPi 4) handles 3, and the A76 handles 4. So no, the RPi 4's A72 is not filling its uops, because it is not renaming enough instructions per cycle to use all of them. It is impossible for an RPi 4 with an A72, even with ideal instructions, to reach 100 percent uop utilisation.

              Yes, in IPC the A72 is under half of an A77, and the clock speed of an A77 is also double that of the A72. So to work out how an A77 compares to an Intel chip, you need to at least quadruple the performance you are seeing out of the RPi 4. The Intel chips are not winning by enough.

              I worked out your big mistake.

              ARM cores are not hyperthreaded, so you only have 1 decode engine and 1 thread, and this makes a big difference. You don't have the problem that forces you to read the store queue when doing a load, and you don't need the complex load/store structures when you don't have hyper-threading. It's for a funny reason: high-speed register storage can only be made so big. Be it a hyperthreading x86 core or a single-threaded Arm core, your maximum register storage is about the same. Notice something: Arm has more free shadow registers, so when something is sent to be stored and an upcoming operation is going to use it, Arm can leave the value in a shadow register.

              "store(X) load(X) pair are seen by the instruction window at the same time (aka "in the same clock")"
              That is atomsymbol's requirement for getting a gain from a complex load/store setup. It does not happen on Arm, because the pair is rewritten from store(X) + load(X) to store(X) + "reuse the register that was stored to X", which skips the load(X) completely (see the sketch just below). Of course, you cannot do that if you don't have the register storage space.
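
              A minimal sketch of that store(X)/load(X) pattern (the file name, loop bound and use of volatile are made up purely for illustration; this shows the pattern at the source level, not how the hardware renames it away):
              Code:
              # Hypothetical microbenchmark: every store(X) is immediately followed by a load(X).
              # A core that keeps the stored value live in a (shadow) register can satisfy the
              # load without going through the store queue or the L1D at all.
              cat > stl.c << 'EOF'
              #include <stdint.h>
              #include <stdio.h>

              int main(void)
              {
                  static volatile uint64_t x;   /* X: volatile keeps the store/load pair in the binary */
                  uint64_t sum = 0;
                  for (uint64_t i = 0; i < 200000000ULL; i++) {
                      x = i;                    /* store(X) */
                      sum += x;                 /* load(X), issued right behind the store */
                  }
                  printf("%llu\n", (unsigned long long)sum);
                  return 0;
              }
              EOF
              cc -O2 stl.c -o stl
              # Compare against a build with "sum += i" instead of "sum += x" to approximate
              # perfect load elimination.
              perf stat -e cycles,instructions ./stl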

              Hyperthreading, which started with the Alpha chips, brings a level of complexity and a lot more register space usage. This leads to needing more complex load/store structures to cover for the fact that you run short on register space. Yes, having your loads read the outgoing store buffer (or the reverse) slows down your load/store speeds.

              The A72's issues are:
              1) it is not processing enough instructions per cycle, because its rename system is too small;
              2) it needs a few more uops;
              3) its clock speed is too low.

              The load/store design of the A77 and A78 is basically the same as the A72's; the buffers have just been made bigger to match the extra uops. That design does not cause any issues when you don't have hyperthreading. The ability for a load to read the outgoing store buffer is only required when you hyperthread, which forces you to keep more state in registers and so effectively runs you out of register storage.

              Arm designs the load/store buffer sizes to align with the number of uops the core has. How effectively those uops are used depends on how much register renaming is possible. Register renaming directly controls the IPC value.

              Interesting point: the A77/A78 require a more expensive process node, but the power usage of an A72 at 1.5 GHz is about the same as an A77/A78 at 3 GHz. So in theory, if you could get an A77/A78 produced in the same SoC design as the one on the RPi 4, you could drop it in and keep the complete board the same; the outside mm² of the silicon also stays the same between an A77/A78 and an A72. So it is possible for someone to make an insanely fast RPi-class board; it's just not cost-effective to make one with A77/A78 cores yet.

              There are many Arm server-grade chips that have expanded in the same way the A77 and A78 have; of course, they are not Arm A77 or A78 cores.

              The RPi 4 is also still under-performing for where it should be.

              I can totally understand Apple going Arm:
              1) Arm cores that give the same performance as Intel x86 chips are not power hungry.
              2) Apple's cooler designs are commonly designed by a moron. Yes, a laptop with a fan that cools nothing: Apple in fact made that. They need cores that can be passively cooled to get past their hardware designers, who are complete idiots at times.

              Comment


              • Originally posted by oiaohm View Post

                Except none of this has anything to do with loads reading the store output, or the reverse. ...
                If I'm making a mistake, it is not completely ignoring you.

                Hyperthreading generally makes IPC worse, because you have two threads competing for resources (including the instruction window and the load-store queues).

                Comment


                • Originally posted by vladpetric View Post
                  Hyperthreading generally makes IPC worse, because you have two threads competing for resources (including the instruction window and the load-store queues).
                  No, it's not the load-store queues that are the problem. The biggest harm of hyperthreading is that it runs you out of register storage, and that has knock-on effects.

                  Originally posted by vladpetric View Post
                  RENO does store-load bypassing in a way that I believe is worth doing.
                  https://iscaconf.org/isca2005/papers/02B-03.PDF

                  Go read the original MIPS-based RENO paper. The original RENO does not do store-load bypassing. The original MIPS RENO does load elimination, the same way Arm cores do.

                  RENO stands for RENaming Optimizer; "modified MIPS-R10000 register renaming mechanism" is a direct quote out of that paper.

                  https://fuse.wikichip.org/news/2339/...performance/2/

                  The block called Rename in the Arm cores is the RENO-equivalent part. The interesting point is that the Rename part of the Arm design is older than the MIPS RENO and does the same things.

                  You see RENO mentioned when people are talking about Alpha and x86, except there they are talking about this horrible thing done in the load/store buffers as store-load bypassing, instead of in registers. Why do the Alpha and x86 do RENO in the load/store buffers? Simple: they are hyper-threading, so they have run out of high-speed register space.

                  Like it or not, there are two basic designs for RENO.
                  1) The MIPS/Arm design of RENO, which is register based. This does load elimination by keeping the value in a shadow register as well as sending it out to be stored. The one in Arm cores is technically not RENO; it is an older mechanism that does the same things 99.999% of the time, and there are a handful of corner cases showing the Arm implementation is not a proper RENO but the older beast.
                  2) On hyperthreading cores (Alpha/x86/PowerPC) you see something called RENO that is load/store-buffer based. This is RENO done by store-load bypassing in the load/store buffer. It has to be done this way because you are out of register space to use as an effective scratch pad.

                  Can you see the knock-on effect of hyper-threading now? Something that, for maximum effectiveness, should be done in registers (the RENO) is forced out into the load/store buffer area because of hyper-threading. It is not the only optimisation that no longer ends up in the ideal location once implementing hyper-threading runs your design low on register space.

                  Yes, hyperthreading lowers IPC. Designs that do RENO the way hyperthreading cores do also take an IPC hit, because you are no longer getting the full advantage of RENO implemented in registers.

                  If you are the Vlad Petric from the original RENO paper, you should have known it has absolutely nothing to do with load-store buffers or store-load bypassing, as it is technically neither. The original RENO is an optimisation that eliminates as many uops as it can; the result behaves as if you had store-load bypassing, by instead eliminating the loads out of existence. The original RENO also makes adds and other instructions pull the same disappearing act.

                  The RENO you find on hyperthreaded CPUs, in the load/store buffers, is restricted to only affecting load/store operations. On hyperthreaded CPUs you start seeing RENO implemented essentially twice: once for general instructions and once for the loads/stores in the load/store buffer.

                  The paper saying memory bypassing is not worth it is more than true. RENO is worth it, but it is not memory bypassing, it is instruction elimination. Instruction elimination is always going to help performance, and Arm cores are already doing it.

                  Comment


                  • Originally posted by oiaohm View Post
                    Except none of this ...
                    Your posts are messy (the grammar of sentences; the thread of reasoning in general) and very hard to read. Can you please do something about it?

                    Comment


                    • Originally posted by vladpetric View Post

                      I'm afraid that, in practice, the utilization of the load pipeline/path is way way lower than that (so you end up waiting). The utilization of the ALU pipelines is definitely better, but even there you rarely ever get close to 100% utilization.

                      I suggest the following experiment:

                      Start with a CPU-intensive benchmark that lasts roughly 10-20 seconds (more is not a problem, it'll just make you wait a bit longer). The easiest option is compressing a larger file, which I also copied to /dev/shm. But feel free to pick your own CPU-intensive benchmark (this is not meaningful for an I/O benchmark that spends most of its time waiting...).

                      Then run the following:

                      perf stat -e cycles,instructions,cache-references << your actual command >>

                      If your processor is an Intel one, the following should probably work as well:

                      perf stat -e cycles,uops_retired.all,mem_uops_retired.all_loads << your actual command >>

                      Generally, it's better to report uops rather than instructions, and the uops_retired.all and mem_uops_retired.all_loads counters are precise on my processor.

                      Then see what the IPC is (instructions per cycle, or uops per cycle). Also see how frequent the loads are.

                      Please try the previous, as I'm curious what numbers you get. The first command should also work on an RPi2/3/4 actually.
                      The performance counter cache-references can also mean LLC-references or "L2 cache references", so I passed L1-dcache-loads to /usr/bin/perf.

                      Summary of the code snippets below:
                      • A10-7850K, app=xz: 0.41 L1D loads per cycle (41% L1D load pipeline utilization (not normalized to the number of load ports))
                      • Ryzen 3700X, app=xz: 0.58 L1D loads per cycle (58% L1D load pipeline utilization (not normalized to the number of load ports))
                      • Ryzen 3700X, app=g++: 0.67 L1D loads per cycle (67% L1D load pipeline utilization (not normalized to the number of load ports))
                      • Raspberry Pi 2, app=xz: not meaningful because of very low IPC
                      I suppose that with 0.67 L1D loads per cycle, the number of matching store(X)-load(X) pairs occurring within 0-3 cycles is just a small fraction of 0.67 (for example, less than 0.1), so for memory bypassing to be required to improve performance, the IPC would have to be larger than 10 instructions per clock.

                      If IPC keeps increasing over time, then L1D pipeline utilization will increase as well, and thus the probability of a store(X)-load(X) pair occurring within 0-3 cycles will over time be amplified into a performance bottleneck. However, it will take several decades (or more) for single-threaded IPC to reach 10.
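
                      For reference, the loads-per-cycle figures above are simply L1-dcache-loads divided by cycles from the counter dumps below, e.g.:
                      Code:
                      $ echo 'scale=4; 2134968624 / 5147093251' | bc    # A10-7850K, xz
                      .4147
                      $ echo 'scale=4; 2083427179 / 3611880161' | bc    # Ryzen 3700X, xz
                      .5768
                      $ echo 'scale=4; 11028752404 / 16519230778' | bc  # Ryzen 3700X, g++
                      .6676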

                      Code:
                      A10-7850K
                      $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
                      
                       Performance counter stats for 'xz -9c /usr/bin/Xorg':
                      
                           5,147,093,251      cycles                                                      
                           4,694,812,581      instructions              #    0.91  insn per cycle        
                              52,196,930      cache-references                                            
                           2,134,968,624      L1-dcache-loads                                            
                              49,383,148      L1-dcache-prefetches                                        
                              44,112,814      L1-dcache-load-misses     #    2.07% of all L1-dcache hits  
                      
                             1.314936065 seconds time elapsed
                      
                             1.253729000 seconds user
                             0.059701000 seconds sys
                      Code:
                      Ryzen 3700X (a slightly different /usr/bin/Xorg file)
                      $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
                      
                       Performance counter stats for 'xz -9c /usr/bin/Xorg':
                      
                           3,611,880,161      cycles                                                      
                           5,175,382,384      instructions              #    1.43  insn per cycle        
                              85,128,735      cache-references                                            
                           2,083,427,179      L1-dcache-loads                                            
                              24,899,168      L1-dcache-prefetches                                        
                              55,135,959      L1-dcache-load-misses     #    2.65% of all L1-dcache hits  
                      
                             0.831343249 seconds time elapsed
                      
                             0.813425000 seconds user
                             0.019290000 seconds sys
                      Code:
                      Ryzen 3700X
                      $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- g++ -O2 ...
                      
                       Performance counter stats for 'g++ -O2 ...':
                      
                          16,519,230,778      cycles                                                      
                          24,517,551,053      instructions              #    1.48  insn per cycle        
                           1,619,398,618      cache-references                                            
                          11,028,752,404      L1-dcache-loads                                            
                             392,157,539      L1-dcache-prefetches                                        
                             584,586,070      L1-dcache-load-misses     #    5.30% of all L1-dcache hits  
                      
                             3.814482470 seconds time elapsed
                      
                             3.741325000 seconds user
                             0.070113000 seconds sys
                      Code:
                      RPi2
                      $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
                      
                       Performance counter stats for 'xz -3c Xorg':
                      
                           3,389,885,517      cycles:u                                                    
                             906,475,610      instructions:u            #    0.27  insn per cycle        
                             350,135,938      cache-references:u                                          
                             350,135,938      L1-dcache-loads:u
                      Last edited by atomsymbol; 17 August 2020, 12:34 PM.

                      Comment
