How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • Originally posted by vladpetric View Post
    Nonetheless, the average is closer to 2:1, which is why for modern (not toy) processor designs the load queue has twice the size of the store queue.
    It is not true that the load queue is twice the size of the store queue. The A77/A78 and the server-class Arm cores that do match x86 performance have load queues slightly smaller than their store queues.
    https://en.wikichip.org/wiki/arm_hol...ndividual_Core
    The A77, for example, has 85 load entries vs 90 store entries. Why it is this way will become clear when I answer the next bit.

    Originally posted by vladpetric View Post
    2. From a performance standpoint, stores most of the times don't matter. The main reason is that stores don't produce, for the most part, data that is immediately needed.
    It turns out stores do matter. If you cannot store fast enough, you cannot drain data from registers to the cache fast enough; you end up filling the CPU's usable register space, and that is an instant processing stall. To be safe you normally want something closer to a 1:1 load/store buffer ratio, because when you copy memory from one location to another with the CPU you end up with a 1:1 load/store ratio, and you don't want the CPU stalling every time that happens because it ran out of store entries and backed up into the registers.
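
    As a minimal illustration of that 1:1 case (plain C, not from the original post): a word-by-word copy loop issues one load and one store per element, so store-queue capacity matters just as much as load-queue capacity here.

    Code:
    #include <stddef.h>
    #include <stdint.h>
    
    /* Word-by-word copy: each iteration performs exactly one load (from src)
     * and one store (to dst), i.e. a 1:1 dynamic load/store ratio. */
    void copy_words(uint64_t *dst, const uint64_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }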

    Originally posted by vladpetric View Post
    Yes, you could have a load that needs to read from the same address, and is alive (in the instruction window) at the same time as the store, but those cases are rare. The Load/Store scheduler in modern processors figures out store-load pairs (based on PC) which are likely to collide, and prevents the load from issuing ahead of the store (as the Loh paper discusses).
    It does not have to be a load/store scheduler that nukes the load. On Arm it is the register renaming process that sees the load is going to hit the same address as a store. Register renaming also handles cases like: add 1 to register 1, copy register 1 to register 2, add 1 to register 2, store register 2; at run time register 2 effectively does not get used and all the processing stays in register 1. So a modern processor can have either a load/store scheduler or a smart register rename process; both end up with the same result: a store to an address followed by a load from that address never actually happens, because the load ceases to exist and is replaced by reusing the register.
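
    A small C sketch (hypothetical, not from the post) of the store-then-load pattern being discussed: the load of *p right after the store can be satisfied from the register still holding x, whether by store-to-load forwarding or by rename-time elimination as described above, so no separate read of memory is needed.

    Code:
    /* Store to an address immediately followed by a load from the same
     * address. The value of y can come straight from the register holding x;
     * the "load" never has to read the L1 cache. (A compiler at -O2 will
     * usually eliminate the reload itself; the hardware mechanisms discussed
     * here matter when it cannot.) */
    int store_then_load(int *p, int x)
    {
        *p = x;        /* store to address p           */
        int y = *p;    /* load from the same address p */
        return y + 1;  /* consumer of the loaded value */
    }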

    One of the quirks you see on Arm cores that rely on this register renaming is a load from an address and a store to the same address issuing in the same clock cycle, because a load issuing ahead of a store is possible; a load issuing after a store simply disappears. This is why on Arm you don't want smarts in the load/store buffers.

    For a load ahead of a store you want the value in L1 that the load is asking for, not the value the store has pushed into the store buffer. For a load after a store you want the value the store is sending out, so in that case Arm redirects to the register that held the value instead of performing the load, and the load disappears. If you added smarts to the load/store buffers so that a load checks whether a newer value for the address is sitting in the store buffer, it would break how the Arm design works.

    This also explains why you might want to keep the load buffer slightly shorter than the store buffer, so it is slightly faster to get from one end of the queue to the other.

    Originally posted by vladpetric View Post
    The other situation where a store could slow things down significantly is if the store queue gets full (again, that rarely happens).
    Sorry, store queues getting close to full, or overfull, happens a lot: memory-copy code and some advanced maths code, for example. Some games are horrible for having a low load ratio with high store usage, because they are optimised so that a lot of the core work is calculated from values already in registers.

    Originally posted by vladpetric View Post
    Loads do matter, because at least another instruction needs to wait for the load to finish and produce a result (ok, you could have a load that is wasted - nobody using the output - but again, that's super rare).
    That depends on the chip you are talking about. Something aggressive like Ryzen/Zen 2 will issue a lot of speculative loads, fetched on the guess that execution might go in a particular direction; if it does not, that load is a complete waste. This leads to Ryzen/Zen 2's insanely high load numbers. It increases IPC by less than 10 percent, and that gain comes from roughly 1 in 10 of the speculatively guessed loads being right. So on Ryzen/Zen 2 a load whose output nobody uses is insanely common. How wasteful you are on loads comes down to how extreme you are willing to be in chasing IPC.

    Originally posted by vladpetric View Post
    So load to total instruction count is more useful than (load + store) to total instruction count, because stores don't really have a latency (ok, they do have a latency, it's just that in the vast majority of cases it's not on the critical path, so it just doesn't matter at all).
    This can be a useful diagnostic metric, but there are cases where stores matter.

    The reality here is that making your load/store buffers larger doesn't help much unless you have the micro-ops to use those larger buffers. Even if you have the micro-ops to use larger load/store buffers, you don't need them if you don't have the instruction dispatch and register renaming to fill those micro-ops.

    Yes, the size of the load/store buffers tells you a little about the CPU, but without understanding how the design is going to use those buffers the size does not tell you much. For a Ryzen/Zen 2 that is going to speculatively fill the load buffer, a much larger load buffer than store buffer makes sense; then you have Arm core designs where, because of the register renaming design, a slightly smaller load buffer than store buffer makes sense. Both designs can put out close to the same IPC when they have close to the same number of micro-ops in flight and the means to fill and use them.

    Yes, a lot of general programs have a 2:1 load/store ratio, but you don't want your CPU stalling out when you hit a 1:1 load/store ratio (copying memory) or a 0:1 load/store ratio (such as dumping out data from an in-CPU random number generator micro-op).
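
    For the 0:1 case, a minimal sketch (assuming an x86-64 compiler with <immintrin.h>, built with -mrdrnd; not from the original post): each iteration produces a value in a register and stores it, with no loads from the buffer, so the loop is almost pure store traffic.

    Code:
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    
    /* Fill buf with hardware random numbers: one store per element and no
     * loads from buf, so the dynamic load/store ratio is close to 0:1. */
    size_t fill_random(uint64_t *buf, size_t n)
    {
        size_t filled = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned long long v;
            if (!_rdrand64_step(&v))  /* stop if the RNG is not ready */
                break;
            buf[i] = v;               /* store only */
            filled++;
        }
        return filled;
    }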



    • Originally posted by oiaohm View Post
      Yes, a lot of general programs have a 2:1 load/store ratio, but you don't want your CPU stalling out when you hit a 1:1 load/store ratio (copying memory) or a 0:1 load/store ratio (such as dumping out data from an in-CPU random number generator micro-op).
      Just some notes:
      • x86 CPUs have memory copy instructions (REP MOVS)
        • No x86 CPU features highly optimized memory-copy logic, i.e. implementing the copying logic inside the L1D/L2/L3 caches themselves
        • Copying a 4 KiB page in just 10 CPU cycles (about 1.6 TB/s) is doable (in theory) if all of the 4 KiB data is already in L1D/L2/L3 caches - without utilizing any traditional load/store port during the process of copying the 4 KiB page
          • The question is whether it is desirable for an actual non-theoretical x86 CPU to feature such a high-speed memcpy implementation
      • AVX memcpy on a CPU with 1 store port: 4 GHz * 32 bytes = 128 GB/s
        • 128 GB/s is a relatively high number, i.e. even if the CPU has just 1 256-bit store port then memcpy() is unlikely to be a bottleneck in real-world code
        • The 2nd store port in IceLake-derived CPUs speeds up memcpy() by up to 100%, but it is probable that memcpy isn't the primary reason for the existence of a 2nd store port in a CPU
      • Dumping data from a random number generator is just a synthetic benchmark



      • Originally posted by vladpetric View Post
        Unfortunately our numbers don't fully align. Even if we were running the same xz version (I have 5.2.5, latest stable), we could still have differences in compiler, or some specialized routines using different assembly (ifunc).

        3. Could you kindly add the perf data for xz and virtio-win-0.1.185.iso?
        perf stat results (combined from 2 perf stat runs):

        Code:
        $ perf stat -e ... -- xz -3c -T1 virtio-win-0.1.185.iso | wc -c
        
         Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
        
           130,998,797,769      cycles                                                      
           113,408,081,318      instructions              #    0.87  insn per cycle        
             6,207,884,764      cache-references                                            
            52,070,531,511      L1-dcache-loads                                            
             1,932,673,658      L1-dcache-prefetches                                        
             4,147,646,284      L1-dcache-load-misses     #    7.97% of all L1-dcache hits  
            39,159,406,747      ls_dispatch.ld_dispatch                                    
            17,481,464,420      ls_dispatch.store_dispatch                                  
             1,473,011,913      ls_dispatch.ld_st_dispatch
        
              30.104931172 seconds time elapsed
        Last edited by atomsymbol; 08-21-2020, 10:39 AM.



        • Originally posted by atomsymbol View Post

          perf stat results (combined from 2 perf stat runs):

          Code:
          $ perf stat -e ... -- xz -3c -T1 virtio-win-0.1.185.iso | wc -c
          
          Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
          
          130,998,797,769 cycles
          113,408,081,318 instructions # 0.87 insn per cycle
          6,207,884,764 cache-references
          52,070,531,511 L1-dcache-loads
          1,932,673,658 L1-dcache-prefetches
          4,147,646,284 L1-dcache-load-misses # 7.97% of all L1-dcache hits
          39,159,406,747 ls_dispatch.ld_dispatch
          17,481,464,420 ls_dispatch.store_dispatch
          1,473,011,913 ls_dispatch.ld_st_dispatch
          
          30.104931172 seconds time elapsed
          The interesting performance counter is "ls_dispatch.ld_st_dispatch" described as:
          • Dispatch of a single op that performs a load from and store to the same memory address. Number of single ops that do load/store to an address.

          Based on the above perf stat output, 1.3% (1.473 / 113.4) of all instructions are store(X)-load(X) pairs from/to the same address X within a single clock. Based on this number (if it is correct), and assuming that a store(X)-load(X) pair stalls execution by at most 3 cycles, it can be estimated that adding memory bypassing to the CPU would lead to less than a 3.4% speedup in the case of "xz -3c -T1": 1.473 * 3 / 130.999 = 3.4%. The problem with this estimate is that the average stall due to store(X)-load(X) is smaller than 3 cycles. As CPUs become wider (gain higher single-threaded IPC), this number will increase over time, but the CPU would need to be able to sustain an IPC of 5-10 when running "xz -3c -T1" for memory bypassing to improve performance by a significant margin.
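
          For reference, the arithmetic above can be reproduced with a few lines of C (the constants are copied from the perf output; this snippet is not part of the original post):

          Code:
          #include <stdio.h>
          
          int main(void)
          {
              double ld_st  = 1.473e9;    /* ls_dispatch.ld_st_dispatch */
              double insns  = 113.408e9;  /* instructions               */
              double cycles = 130.999e9;  /* cycles                     */
          
              printf("ld_st share of instructions: %.1f%%\n", 100.0 * ld_st / insns);
              /* upper bound, assuming each such op stalls for 3 cycles */
              printf("max speedup from bypassing:  %.1f%%\n", 100.0 * ld_st * 3.0 / cycles);
              return 0;
          }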
          This is great, thanks! Will take a look at all this over the weekend.



          • Originally posted by vladpetric View Post
            This is great, thanks! Will take a look at all this over the weekend.
            I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.



            • Originally posted by atomsymbol View Post

              I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.
              I think we need to check the AMD perf counter spec, as the perf description is ambiguous enough.

              Somewhere around here ... :

              https://www.amd.com/en/support/tech-docs

              Edit:

              crap, it doesn't say anything extra:

              PMCx029 [LS Dispatch] (Core::X86::Pmc::Core::LsDispatch). Read-write. Reset: 00h. Counts the number of operations dispatched to the LS unit. Unit Masks ADDed.
              Bits 7:3: Reserved.
              Bit 2: LdStDispatch (Load-op-Store Dispatch). Read-write. Reset: 0. Dispatch of a single op that performs a load from and store to the same memory address.
              Bit 1: StoreDispatch. Read-write. Reset: 0. Dispatch of a single op that performs a memory store.
              Bit 0: LdDispatch. Read-write. Reset: 0. Dispatch of a single op that performs a memory load.
              Last edited by vladpetric; 08-21-2020, 10:59 AM.



              • Originally posted by atomsymbol View Post

                Just some notes:
                • x86 CPUs have memory copy instructions (REP MOVS)
                  • No x86 CPU features highly optimized memory-copy logic, i.e. implementing the copying logic inside the L1D/L2/L3 caches themselves
                  • Copying a 4 KiB page in just 10 CPU cycles (about 1.6 TB/s) is doable (in theory) if all of the 4 KiB data is already in L1D/L2/L3 caches - without utilizing any traditional load/store port during the process of copying the 4 KiB page
                    • The question is whether it is desirable for an actual non-theoretical x86 CPU to feature such a high-speed memcpy implementation
                .
                Not quite
                https://news.ycombinator.com/item?id=12048651
                REP MOVS can in fact bypass the L1D/L2/L3 caches and issue commands straight to the MMU for block-to-block copies. Once you get to around 4 KiB it does raise the question of whether the data should enter the caches at all or whether it should just be a direct MMU operation.
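
                For reference, a minimal sketch of issuing REP MOVSB directly (GCC/Clang inline assembly, x86-64 only; not from the original post). This is the instruction whose microcoded fast paths the linked discussion is about; whether a given implementation streams around the caches is up to the CPU:

                Code:
                #include <stddef.h>
                
                /* Copy n bytes with REP MOVSB; RDI = destination, RSI = source,
                 * RCX = byte count. On CPUs with the "enhanced REP MOVSB" feature
                 * the microcode can pick wide, cache-friendly or streaming copies
                 * by itself. */
                static inline void rep_movsb_copy(void *dst, const void *src, size_t n)
                {
                    asm volatile("rep movsb"
                                 : "+D"(dst), "+S"(src), "+c"(n)
                                 :
                                 : "memory");
                }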

                Originally posted by atomsymbol View Post
                • AVX memcpy on a CPU with 1 store port: 4 GHz * 32 bytes = 128 GB/s
                  • 128 GB/s is a relatively high number, i.e. even if the CPU has just 1 256-bit store port then memcpy() is unlikely to be a bottleneck in real-world code
                  • The 2nd store port in IceLake-derived CPUs speeds up memcpy() by up to 100%, but it is probable that memcpy isn't the primary reason for the existence a 2nd store port in a CPU
                • Dumping data from a random number generator is just a synthetic benchmark
                The hardware random number generator case seems synthetic, but a website serving SSL can hit the random number generator insanely hard, and you do strike sections of code that stream hardware random numbers to memory in volume. So it's not just a synthetic benchmark; it's a synthetic benchmark that replicates something that happens quite a bit in different server loads.

                Originally posted by atomsymbol View Post
                I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.
                LdStDispatch needs to exist in some x86 designs. Store(X)-load(X) is handled in the store/load buffers in the AMD design, so LdStDispatch has to exist to bypass the buffer processing so that a load can get its value from L1 instead of the value that was just pushed into the store buffer. On Arm you don't need this bypass because there is no processing in the buffers, but that affects the store and load buffer sizes.
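
                A simple C-level illustration (hypothetical, not from the post) of the kind of op LdStDispatch counts: a memory-destination read-modify-write, which x86 compilers commonly emit as a single add-to-memory instruction that both loads from and stores to the same address.

                Code:
                /* GCC/Clang at -O2 typically compile this to something like
                 *     addl $1, (%rdi)
                 * a single op that loads from and stores to the same address,
                 * which is what PMCx029 umask bit 2 (LdStDispatch) counts. */
                void bump(int *counter)
                {
                    *counter += 1;
                }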



                • Originally posted by oiaohm View Post
                  Not quite
                  https://news.ycombinator.com/item?id=12048651
                  REP MOVS can in fact bypass the L1D/L2/L3 caches and issue commands straight to the MMU for block-to-block copies. Once you get to around 4 KiB it does raise the question of whether the data should enter the caches at all or whether it should just be a direct MMU operation.
                  No, I do not mean that. Optimizing "REP MOVS" when designing a high-performance x86 CPU with the goal/target "make REP MOVS be 10% faster than hand-coded assembly" is pointless. I mean a 1000% speedup on large blocks. ... When copying 32 MiB of aligned non-overlapping data then it could first write overlapping modified data from CPU caches to memory and then send a few commands to the memory modules to perform the copy operation while the CPU is concurrently running other tasks that do not interfere with the ongoing copy operation within the memory modules. Similarly, when copying a large file on an SSD the operating system could send commands to the SSD device (note: SSD is internally a high-bandwidth memory device, while the PCI-Express x4 interface to which the SSD is connected is a serial interface).
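
                  On the software side, Linux already exposes an interface in this spirit for the file-copy case: copy_file_range(2) asks the kernel (and, on filesystems or protocols that support it, the storage target itself) to move the bytes, so the data need not flow through user-space buffers. A minimal sketch, assuming Linux with glibc 2.27 or newer:

                  Code:
                  #define _GNU_SOURCE
                  #include <fcntl.h>
                  #include <stdio.h>
                  #include <unistd.h>
                  
                  /* Copy src to dst by asking the kernel to move the bytes directly;
                   * on e.g. Btrfs/XFS (reflink) or NFS 4.2 (server-side copy) the
                   * data may never pass through the CPU's load/store ports at all. */
                  int copy_file(const char *src, const char *dst)
                  {
                      int in = open(src, O_RDONLY);
                      int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                      if (in < 0 || out < 0) { perror("open"); return -1; }
                  
                      ssize_t n;
                      do {
                          n = copy_file_range(in, NULL, out, NULL, 1 << 20, 0);
                      } while (n > 0);
                      if (n < 0) perror("copy_file_range");
                  
                      close(in);
                      close(out);
                      return n < 0 ? -1 : 0;
                  }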



                  • Originally posted by atomsymbol View Post

                    No, I do not mean that. Optimizing "REP MOVS" when designing a high-performance x86 CPU with the goal/target "make REP MOVS be 10% faster than hand-coded assembly" is pointless. I mean a 1000% speedup on large blocks. ... When copying 32 MiB of aligned non-overlapping data then it could first write overlapping modified data from CPU caches to memory and then send a few commands to the memory modules to perform the copy operation while the CPU is concurrently running other tasks that do not interfere with the ongoing copy operation within the memory modules. Similarly, when copying a large file on an SSD the operating system could send commands to the SSD device (note: SSD is internally a high-bandwidth memory device, while the PCI-Express x4 interface to which the SSD is connected is a serial interface).
                    That 1000% speedup exists in some x86 chips with REP MOVS, but it is not implemented in a way where the data enters the caches. REP MOVS is automatically optimised to use the largest copy the CPU core has access to. There is a reason the MMU optimisation is not common: there is a race condition between cores to overcome, and a cost in determining whether a copy is going to be large enough to be worth using the MMU.

                    When the MMU copy is done, it is not done by writing the overlapping modified data first; it is done by duplicating the existing data via page-table directions to the MMU and then applying the modified data.

                    The MMU really does not have a fine-grained map of memory; the most detailed information it has is the page tables, and that includes DMA. Being limited to the MMU's granularity restricts the kinds of operations you can do, and that granularity causes another problem: 4 KiB is the smallest page size, and the largest common page size on x86 is 2 MiB. A REP MOVS that optimises down to the MMU can work while the page entries are 4 KiB; in current x86 implementations it does not happen with 2 MiB pages, because it takes quite a lot of optimisation work to decide that an operation is going to be large enough to need a 2 MiB page copy.

                    There are some MMUs for Arm that do support sending a copy-page command to the MMU yourself. This is mostly unused unless the developer of a program goes out of their way to code for it. The granularity of the MMU is a real limiting factor.

                    Yes I do agree that it could be useful to get that 1000% speedup on large blocks.

