How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • oiaohm
    replied
    Originally posted by atomsymbol View Post

    No, I do not mean that. Optimizing "REP MOVS" when designing a high-performance x86 CPU with the goal/target "make REP MOVS be 10% faster than hand-coded assembly" is pointless. I mean a 1000% speedup on large blocks. ... When copying 32 MiB of aligned non-overlapping data then it could first write overlapping modified data from CPU caches to memory and then send a few commands to the memory modules to perform the copy operation while the CPU is concurrently running other tasks that do not interfere with the ongoing copy operation within the memory modules. Similarly, when copying a large file on an SSD the operating system could send commands to the SSD device (note: SSD is internally a high-bandwidth memory device, while the PCI-Express x4 interface to which the SSD is connected is a serial interface).
    That 1000% speedup exists in some x86 chips' REP MOVS implementations, but it is not implemented in a way where the data enters the caches. REP MOVS is automatically optimised to use the widest copy the CPU core has access to. There is a reason MMU-assisted optimisation is not common: there is a race condition between cores to overcome, plus the cost of working out whether a copy is going to be large enough to be worth handing to the MMU.

    When the MMU copy is done, it is not done as "write overlapping modified data first"; it is done by duplicating the existing data via page-table directions to the MMU and then applying the modified data.

    The MMU really does not have a fine-grained map of memory; the most detailed information it has is the page tables, and that includes DMA. Being limited to the MMU's granularity restricts what kinds of operations you can do, and that granularity causes another problem: 4 KiB is the smallest page size, and huge pages on x86 are 2 MiB. A REP MOVS that is optimised down to the MMU can work while page entries are 4 KiB; with 2 MiB pages it does not happen in current x86 implementations, because it takes quite a lot of optimisation processing to decide that an operation is going to be large enough to need a 2 MiB page copy.

    There are some ARM MMUs that do support sending the copy-page instruction to the MMU yourself. This is mostly unused unless the developer of the program goes out of their way to code it in. The granularity of the MMU is a real limiting factor.

    Yes, I do agree that it could be useful to get that 1000% speedup on large blocks.
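    For concreteness, a hand-issued REP MOVSB copy looks roughly like this (a minimal GCC/Clang extended-asm sketch for x86-64, purely illustrative and assuming non-overlapping buffers, just like memcpy()):

    Code:
    #include <stddef.h>

    /* Minimal sketch: let the CPU's REP MOVSB microcode pick the widest
     * internal copy it can.  Assumes non-overlapping buffers. */
    static void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        const void *s = src;
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
    }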



  • atomsymbol
    replied
    Originally posted by oiaohm View Post
    Not quite
    https://news.ycombinator.com/item?id=12048651
    REP MOVS can in fact bypass L1D/L2/L3 and issue a command straight to the MMU for a block-to-block copy. Once you get to something like 4 KiB in size, the question becomes whether the data should enter the caches at all or whether it should just be a direct MMU operation.
    No, I do not mean that. Optimizing "REP MOVS" when designing a high-performance x86 CPU with the goal/target "make REP MOVS be 10% faster than hand-coded assembly" is pointless. I mean a 1000% speedup on large blocks. ... When copying 32 MiB of aligned non-overlapping data then it could first write overlapping modified data from CPU caches to memory and then send a few commands to the memory modules to perform the copy operation while the CPU is concurrently running other tasks that do not interfere with the ongoing copy operation within the memory modules. Similarly, when copying a large file on an SSD the operating system could send commands to the SSD device (note: SSD is internally a high-bandwidth memory device, while the PCI-Express x4 interface to which the SSD is connected is a serial interface).



  • oiaohm
    replied
    Originally posted by atomsymbol View Post

    Just some notes:
    • x86 CPUs have memory copy instructions (REP MOVS)
      • No x86 CPU features highly optimized memory-copy logic, i.e. memory-copy logic implemented inside the L1D/L2/L3 caches themselves
      • Copying a 4 KiB page in just 10 CPU cycles (about 1.6 TB/s) is doable (in theory) if all of the 4 KiB data is already in L1D/L2/L3 caches - without utilizing any traditional load/store port during the process of copying the 4 KiB page
        • The question is whether it is desirable for an actual non-theoretical x86 CPU to feature such a high-speed memcpy implementation
    Not quite
    https://news.ycombinator.com/item?id=12048651
    REP MOVS can in fact bypass L1D/L2/L3 and issue a command straight to the MMU for a block-to-block copy. Once you get to something like 4 KiB in size, the question becomes whether the data should enter the caches at all or whether it should just be a direct MMU operation.
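    (As an aside, the usual way software itself keeps a large copy from displacing the caches is with non-temporal stores rather than anything MMU-assisted; a minimal AVX sketch of mine, not from either post, assuming 32-byte-aligned, non-overlapping buffers, a size that is a multiple of 32, and a build with -mavx:)

    Code:
    #include <immintrin.h>
    #include <stddef.h>

    /* Illustrative streaming copy: the stores bypass the cache hierarchy. */
    static void stream_copy(void *dst, const void *src, size_t n)
    {
        char *d = (char *)dst;
        const char *s = (const char *)src;
        for (size_t i = 0; i < n; i += 32) {
            __m256i v = _mm256_load_si256((const __m256i *)(s + i));
            _mm256_stream_si256((__m256i *)(d + i), v);
        }
        _mm_sfence();   /* make the streamed data globally visible */
    }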

    Originally posted by atomsymbol View Post
    • AVX memcpy on a CPU with 1 store port: 4 GHz * 32 bytes = 128 GB/s
      • 128 GB/s is a relatively high number, i.e. even if the CPU has just one 256-bit store port, memcpy() is unlikely to be a bottleneck in real-world code
      • The 2nd store port in IceLake-derived CPUs speeds up memcpy() by up to 100%, but it is probable that memcpy isn't the primary reason for the existence of a 2nd store port in a CPU
    • Dumping data from a random number generator is just a synthetic benchmark
    The hardware random number generator seems synthetic, but a web server terminating SSL can hit the random number generator insanely hard; you do strike sections of code that pull from the hardware generator and store the results in volume. So it is not just a synthetic benchmark; it is a synthetic benchmark that replicates something that happens quite a bit in different server loads.
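    (To make the scenario concrete, this is roughly what such code looks like; a hedged sketch in C using the RDRAND intrinsic, assuming an x86-64 CPU with the RDRAND feature and a compiler invoked with -mrdrnd. The function name is mine, not from any post.)

    Code:
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a buffer from the hardware RNG: the loop body is essentially
     * store-only from the core's point of view.  Retry handling omitted. */
    static int fill_from_hwrng(uint64_t *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            unsigned long long r;
            if (!_rdrand64_step(&r))
                return -1;      /* hardware did not return a value */
            buf[i] = r;         /* store */
        }
        return 0;
    }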

    Originally posted by atomsymbol View Post
    I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.
    LdStDispatch needs to exist in some x86 designs. Store(X)-load(X) is handled in the store/load buffers in the AMD design, so LdStDispatch has to exist to bypass that buffer processing, so a load can get its value from L1 instead of the value that was just pushed into the store buffer. On ARM you don't need this bypass because the forwarding isn't done in the buffers, but this affects store and load buffer sizes.
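    (For what it's worth, the kind of op that counter description seems to cover is an x86 read-modify-write on memory; a small illustration of mine, with no claim about AMD's exact counting rules:)

    Code:
    #include <stddef.h>

    /* On x86-64 this can compile to a single "add qword ptr [rdi+rsi*8], 1",
     * i.e. one op that both loads from and stores to the same address. */
    void bump(long *counter, size_t i)
    {
        counter[i] += 1;
    }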



  • vladpetric
    replied
    Originally posted by atomsymbol View Post

    I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.
    I think we need to check the AMD perf counter spec, as the perf description is ambiguous enough.

    Somewhere around here ... :

    https://www.amd.com/en/support/tech-docs

    Edit:

    crap, it doesn't say anything extra:

    PMCx029 [LS Dispatch] (Core::X86::Pmc::Core::LsDispatch). Read-write. Reset: 00h. Counts the number of operations dispatched to the LS unit. Unit masks are ADDed.
    • Bits 7:3: Reserved.
    • Bit 2: LdStDispatch (Load-op-Store Dispatch). Read-write. Reset: 0. Dispatch of a single op that performs a load from and store to the same memory address.
    • Bit 1: StoreDispatch. Read-write. Reset: 0. Dispatch of a single op that performs a memory store.
    • Bit 0: LdDispatch. Read-write. Reset: 0. Dispatch of a single op that performs a memory load.
    Last edited by vladpetric; 08-21-2020, 10:59 AM.



  • atomsymbol
    replied
    Originally posted by vladpetric View Post
    This is great, thanks! Will take a look at all this over the weekend.
    I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.



  • vladpetric
    replied
    Originally posted by atomsymbol View Post

    perf stat results (combined from 2 perf stat runs):

    Code:
    $ perf stat -e ... -- xz -3c -T1 virtio-win-0.1.185.iso | wc -c
    
    Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
    
    130,998,797,769 cycles
    113,408,081,318 instructions # 0.87 insn per cycle
    6,207,884,764 cache-references
    52,070,531,511 L1-dcache-loads
    1,932,673,658 L1-dcache-prefetches
    4,147,646,284 L1-dcache-load-misses # 7.97% of all L1-dcache hits
    39,159,406,747 ls_dispatch.ld_dispatch
    17,481,464,420 ls_dispatch.store_dispatch
    1,473,011,913 ls_dispatch.ld_st_dispatch
    
    30.104931172 seconds time elapsed
    The interesting performance counter is "ls_dispatch.ld_st_dispatch" described as:
    • Dispatch of a single op that performs a load from and store to the same memory address. Number of single ops that do load/store to an address.

    Based on the above perf stat output, 1.3% (1.473 / 113.4) of all instructions are store(X)-load(X) pairs from/to the same address X within a single clock. Based on this number (if it is correct), assuming that a store(X)-load(X) pair stalls execution by at most 3 cycles, it can be estimated that adding memory bypassing to the CPU would lead to less than 3.4% speedup in case of "xz -c3 -T1": 1.473 * 3 / 130.999 = 3.4%. The problem with this estimate is that the average stall due to store(X)-load(X) is smaller than 3 cycles. As CPUs become wider (will have higher single-threaded IPC), this number will increase over time, but the CPU would need to be able to sustain an IPC of 5-10 when running "xz -c3 -T1" for memory bypassing to improve performance by a significant margin.
    This is great, thanks! Will take a look at all this over the weekend.



  • atomsymbol
    replied
    Originally posted by vladpetric View Post
    Unfortunately our numbers don't fully align. Even if we were running the same xz version (I have 5.2.5, latest stable), we could still have differences in compiler, or some specialized routines using different assembly (ifunc).

    3. Could you kindly add the perf data for xz and virtio-win-0.1.185.iso?
    perf stat results (combined from 2 perf stat runs):

    Code:
    $ perf stat -e ... -- xz -3c -T1 virtio-win-0.1.185.iso | wc -c
    
     Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
    
       130,998,797,769      cycles                                                      
       113,408,081,318      instructions              #    0.87  insn per cycle        
         6,207,884,764      cache-references                                            
        52,070,531,511      L1-dcache-loads                                            
         1,932,673,658      L1-dcache-prefetches                                        
         4,147,646,284      L1-dcache-load-misses     #    7.97% of all L1-dcache hits  
        39,159,406,747      ls_dispatch.ld_dispatch                                    
        17,481,464,420      ls_dispatch.store_dispatch                                  
         1,473,011,913      ls_dispatch.ld_st_dispatch
    
          30.104931172 seconds time elapsed
    Last edited by atomsymbol; 08-21-2020, 10:39 AM.



  • atomsymbol
    replied
    Originally posted by oiaohm View Post
    Yes, a lot of general programs have a 2:1 load/store ratio, but you don't want your CPU stalling out when you hit a 1:1 load/store ratio (copying memory) or a 0:1 load/store ratio (like dumping out random data from the in-CPU random number generator micro-op).
    Just some notes:
    • x86 CPUs have memory copy instructions (REP MOVS)
      • No x86 CPU features highly optimized memory-copy logic, i.e. memory-copy logic implemented inside the L1D/L2/L3 caches themselves
      • Copying a 4 KiB page in just 10 CPU cycles (about 1.6 TB/s) is doable (in theory) if all of the 4 KiB data is already in L1D/L2/L3 caches - without utilizing any traditional load/store port during the process of copying the 4 KiB page
        • The question is whether it is desirable for an actual non-theoretical x86 CPU to feature such a high-speed memcpy implementation
    • AVX memcpy on a CPU with 1 store port: 4 GHz * 32 bytes = 128 GB/s
      • 128 GB/s is a relatively high number, i.e. even if the CPU has just one 256-bit store port, memcpy() is unlikely to be a bottleneck in real-world code
      • The 2nd store port in IceLake-derived CPUs speeds up memcpy() by up to 100%, but it is probable that memcpy isn't the primary reason for the existence of a 2nd store port in a CPU
    • Dumping data from a random number generator is just a synthetic benchmark



  • oiaohm
    replied
    Originally posted by vladpetric View Post
    Nonetheless, the average is closer to 2:1, which is why for modern (not toy) processor designs the load queue has twice the size of the store queue.
    That is not true about the load queue being twice the size of the store queue. The A77/A78 and the server-oriented ARM cores that do match up against x86 performance have load queues slightly smaller than their store queues.
    https://en.wikichip.org/wiki/arm_hol...ndividual_Core
    The A77 is 85 load vs 90 store, for example. Why it is this way will become clear when I answer the next bit.

    Originally posted by vladpetric View Post
    2. From a performance standpoint, stores most of the times don't matter. The main reason is that stores don't produce, for the most part, data that is immediately needed.
    It turns out stores do matter. If you are not able to store fast enough, you cannot clear data out of registers into the cache fast enough, you end up filling the CPU's usable register space, and that is an instant processing stall. To be safe you normally need closer to a 1:1 load/store buffer ratio, so that when you do hit a 1:1 load/store ratio, for example copying memory from one location to another with the CPU, the CPU does not stall out every time because it has run out of store entries while the registers fill back up.

    Originally posted by vladpetric View Post
    Yes, you could have a load that needs to read from the same address, and is alive (in the instruction window) at the same time as the store, but those cases are rare. The Load/Store scheduler in modern processors figures out store-load pairs (based on PC) which are likely to collide, and prevents the load from issuing ahead of the store (as the Loh paper discusses).
    It does not have to be a load/store scheduler that nukes the load. On ARM it is the register-renaming process that sees that a load is going to hit the same address as a store. The register-renaming process also handles cases like: add 1 to register 1, copy register 1 to register 2, add 1 to register 2, store register 2; at run time register 2 basically does not get used and all the processing stays in register 1. So modern processors can have either a load/store scheduler or a smart register-rename process; both end up with the result that a store to an address followed by a load from that address does not happen, because the load ceases to exist and the register is used instead.

    One of the quirks you see on ARM cores using this register renaming is a load from an address and then a store to the same address in the same clock cycle, because a load issuing ahead of a store is possible; a load issuing after a store disappears. This is why on ARM you don't want the smarts in the load/store buffers.

    With a load ahead of a store you are after the value in L1 that the load is asking for, not the value the store has pushed into the store buffer. With a load after a store you want the value the store is sending out, so in this case ARM redirects to the register that held the value instead of performing the load, and the load disappears. If you add smarts to the load/store buffers so that a load checks whether a changed value for its address is sitting in the store buffer, that breaks how the ARM design works.
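    (A tiny C illustration of the pattern being discussed; consume() is a hypothetical stand-in, not from the thread. A store-to-load-forwarding design satisfies the second access out of the store buffer; the rename-based approach described above can drop the load entirely and reuse the register that still holds v. At -O2 a compiler will usually do that elimination itself unless the pointer is volatile.)

    Code:
    extern void consume(long);

    void store_then_load(long *slot, long v)
    {
        *slot = v;          /* store(X) */
        long t = *slot;     /* load(X): same address, right after the store */
        consume(t);
    }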

    This also explains why you might want to keep the load buffer slightly shorter than the store buffer, so it is slightly faster to get from one end of the queue to the other.

    Originally posted by vladpetric View Post
    The other situation where a store could slow things down significantly is if the store queue gets full (again, that rarely happens).
    Sorry, store queues getting close to full, or overfull, happens a lot: memory-copy work and some of your advanced maths work. Some games are horrible for having a low load ratio with high store usage, because they are optimised so that a lot of the core work is calculated from values already in registers.

    Originally posted by vladpetric View Post
    Loads do matter, because at least another instruction needs to wait for the load to finish and produce a result (ok, you could have a load that is wasted - nobody using the output - but again, that's super rare).
    That depends on the chip you are talking about. Something aggressive like Ryzen/Zen 2 will have a lot of speculative loads that are issued on the guess that execution might go in direction X; if it does not, that load is a complete waste. This is what leads to Ryzen/Zen 2's insanely high load numbers. It increases IPC by less than 10 percent, but that IPC gain comes from basically 1 in 10 speculatively guessed loads being right, so on Ryzen/Zen 2 a load whose output nobody uses is insanely common. How wasteful you are on loads comes down to how extreme you are willing to be in chasing IPC.

    Originally posted by vladpetric View Post
    So load to total instruction count is more useful than (load + store) to total instruction count, because stores don't really have a latency (ok, they do have a latency, it's just that in the vast majority of cases it's not on the critical path, so it just doesn't matter at all).
    It can be a useful diagnostic metric, but there are cases where stores matter.

    The reality here is that making your load/store buffers larger doesn't help much unless you have the micro-ops to use those larger buffers. Even if you have the micro-ops to use larger load/store buffers, you don't need them if you don't have the instruction dispatch and register renaming to fill those micro-ops.

    Yes, the size of the load/store buffers tells you a little about the CPU, but without understanding how the design is going to use those buffers, the size does not tell you much. For a Ryzen/Zen 2 that is going to speculatively fill the load buffer, a much larger load buffer than store buffer makes sense; then you have ARM core designs where, due to the register-renaming design, a slightly smaller load buffer than store buffer makes sense. Both designs can put out close to the same IPC when they have close to the same number of micro-ops and the means to fill and use them.

    Yes, a lot of general programs have a 2:1 load/store ratio, but you don't want your CPU stalling out when you hit a 1:1 load/store ratio (copying memory) or a 0:1 load/store ratio (like dumping out random data from the in-CPU random number generator micro-op).



  • vladpetric
    replied
    Originally posted by atomsymbol View Post

    Precise results obtained by simulating cache accesses in callgrind:

    36% of all instructions are load/store instructions. This is valid for "xz -c3 -T1 virtio-win-0.1.185.iso" - other applications would show a different percentage of load/store instructions, but in summary expecting 1-in-10 instructions to be a load/store instruction (10%) is unrealistic, i.e: most apps are well above 10%.

    Curiously, the data below shows that load:store ratio is almost exactly 2:1 (27.1 : 13.9). I wonder whether this is just a coincidence or a general rule that most apps follow (not considering toy benchmarks).
    Here's my data:

    Code:
    $ valgrind --tool=callgrind --cache-sim=yes -- xz -3c -T1 ./virtio-win-0.1.185.iso > virtio-win-0.1.185.iso.xz
    ==2537234== Callgrind, a call-graph generating cache profiler
    ==2537234== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
    ==2537234== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
    ==2537234==
    ==2537234== Events : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
    ==2537234== Collected : 96109168347 23293054178 11744076794 2535 1942602462 270595441 2437 65479019 23737948
    ==2537234==
    ==2537234== I refs: 96,109,168,347
    ==2537234==
    ==2537234== D refs: 35,037,130,972 (23,293,054,178 rd + 11,744,076,794 wr)
    I cut out a lot of useless info from the above.

    Unfortunately our numbers don't fully align. Even if we were running the same xz version (I have 5.2.5, latest stable), we could still have differences in compiler, or some specialized routines using different assembly (ifunc). But the proportion of load/store instructions is super close, so I think it's good enough for a comparison.

    Alright, some thoughts:

    1. The 2:1 ratio is simply an average across multiple benchmarks (distribution mode if you wish). But I do know benchmarks with lower and higher ratios. Specifically, if you have a lot of register spills, you'll have more stores, relatively speaking (spills are typically 1store/1load).

    But if you have a file checksum-like benchmark, then the ratio is highly skewed to the loads.

    For instance:

    Code:
    perf stat -e cycles,instructions,mem_uops_retired.all_loads,mem_uops_retired.all_stores sha256sum virtio-win-0.1.185.iso
    e20a645df49607e7c9cebcc9840d3634b25b32832e45be1f11a73123590fa9fb virtio-win-0.1.185.iso
    Performance counter stats for 'sha256sum virtio-win-0.1.185.iso':
    
    5,945,494,604 cycles:u
    20,507,306,371 instructions:u # 3.45 insn per cycle
    2,514,696,183 mem_uops_retired.all_loads:u
    485,107,654 mem_uops_retired.all_stores:u
    
    2.168431630 seconds time elapsed
    
    2.110880000 seconds user
    0.053926000 seconds sys
    So more than 5-1 loads to stores

    Why? Because sha256sum can do most of the checksumming work in registers ... Well, I'm guessing that sha256sum has sufficiently complex machinery that you need to store some of those values (they don't all fit in registers).

    Nonetheless, the average is closer to 2:1, which is why for modern (not toy) processor designs the load queue has twice the size of the store queue.

    That ratio was slightly lower when x86-32 was more popular (more like 1.5:1), because with x86-32 you have fewer registers (7 gprs), and a shitton more spills.

    2. From a performance standpoint, stores most of the times don't matter. The main reason is that stores don't produce, for the most part, data that is immediately needed.

    Yes, you could have a load that needs to read from the same address, and is alive (in the instruction window) at the same time as the store, but those cases are rare. The Load/Store scheduler in modern processors figures out store-load pairs (based on PC) which are likely to collide, and prevents the load from issuing ahead of the store (as the Loh paper discusses).

    The other situation where a store could slow things down significantly is if the store queue gets full (again, that rarely happens). But otherwise, stores are more-or-less fire-and-forget.

    Loads do matter, because at least another instruction needs to wait for the load to finish and produce a result (ok, you could have a load that is wasted - nobody using the output - but again, that's super rare).

    So load to total instruction count is more useful than (load + store) to total instruction count, because stores don't really have a latency (ok, they do have a latency, it's just that in the vast majority of cases it's not on the critical path, so it just doesn't matter at all).
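    (A small C example of why that is; illustrative only. The multiply depends on the load, so the load's latency sits on the critical path; nothing downstream waits on the store, so its latency is normally hidden unless the store queue fills.)

    Code:
    #include <stddef.h>

    long scale_element(const long *a, long *out, long k, size_t i)
    {
        long x = a[i];      /* load: the next instruction must wait for it */
        long y = x * k;     /* consumer of the load */
        out[i] = y;         /* store: effectively fire-and-forget */
        return y;
    }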

    3. Could you kindly add the perf data for xz and virtio-win-0.1.185.iso?

    4. As I said earlier, I was wrong about the 10-1 ratio, please ignore that.

