How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • atomsymbol
    replied
    Originally posted by vladpetric View Post

    1. I really appreciate you doing this!

    2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.

The thing is, loads being roughly 1-in-10 instructions is what we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1-in-2 instructions being loads ... To me, cache-references is more believable here.

    3. Could I kindly ask you to do a consistent test? If I may propose https://fedorapeople.org/groups/virt...in-0.1.185.iso

    This way we could compare apples to apples.

Why? Well, why not. It's a bit large, yes (393 MiB).

    4. The IPC on RPi2 is ... big facepalm
    Precise results obtained by simulating cache accesses in callgrind:

36% of all instructions are load/store instructions. This is valid for "xz -3c -T1 virtio-win-0.1.185.iso" - other applications would show a different percentage of load/store instructions, but in short, expecting only 1-in-10 instructions (10%) to be a load/store is unrealistic; most apps are well above 10%.

    Curiously, the data below shows that load:store ratio is almost exactly 2:1 (27.1 : 13.9). I wonder whether this is just a coincidence or a general rule that most apps follow (not considering toy benchmarks).

It took about 25 minutes for "callgrind xz" to compress the ISO file, which means callgrind was about 50 times slower than a normal xz run.

    Code:
$ valgrind --tool=callgrind --cache-sim=yes -- xz -3c -T1 ./virtio-win-0.1.185.iso | wc -c
    
    desc: I1 cache: 32768 B, 64 B, 8-way associative
    desc: D1 cache: 32768 B, 64 B, 8-way associative
    desc: LL cache: 33554432 B, 64 B, direct-mapped
    
    Events : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
    Collected : 114279135645 27105874827 13902443594 2533 1931774728 273271284 2328 2686967 949904
    
    I refs: 114,279,135,645
    I1 misses: 2,533
    LLi misses: 2,328
    I1 miss rate: 0.00%
    LLi miss rate: 0.00%
    
    D refs: 41,008,318,421 (27,105,874,827 rd + 13,902,443,594 wr)
    D1 misses: 2,205,046,012 ( 1,931,774,728 rd + 273,271,284 wr)
    LLd misses: 3,636,871 ( 2,686,967 rd + 949,904 wr)
    D1 miss rate: 5.4% ( 7.1% + 2.0% )
    LLd miss rate: 0.0% ( 0.0% + 0.0% )
    
    LL refs: 2,205,048,545 ( 1,931,777,261 rd + 273,271,284 wr)
    LL misses: 3,639,199 ( 2,689,295 rd + 949,904 wr)
    LL miss rate: 0.0% ( 0.0% + 0.0% )
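For reference, the 36% figure and the near-2:1 load:store ratio fall straight out of the Ir/Dr/Dw totals above:

Code:
# load/store share of all instructions: (Dr + Dw) / Ir
$ echo 'scale=3; (27105874827 + 13902443594) / 114279135645' | bc
.358
# load:store ratio: Dr / Dw
$ echo 'scale=3; 27105874827 / 13902443594' | bc
1.949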



  • vladpetric
    replied
    Originally posted by atomsymbol View Post



On the RPi2, more than one in three ARM instructions appears to be a load: 350 / 906 = 0.39.

    To obtain the precise number of load instructions in user-code (i.e: without speculative loads) we would need to annotate the assembly code.

    ----

    Code:
    $ pi cat /proc/cpuinfo | tail
    Hardware : BCM2835
    Revision : a01041
    Model : Raspberry Pi 2 Model B Rev 1.1
Fair point - I was misreading the data (I was also comparing against a gzip run, which is not apples to apples).

    When profiling the compression of the .iso file that I mentioned earlier, I get:

    Code:
/dev/shm $ perf stat -e cycles,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c virtio-win-0.1.185.iso > virtio-win.iso.xz
    
    Performance counter stats for 'xz -3c virtio-win-0.1.185.iso':
    
    155,634,110,621 cycles:u
    95,291,274,136 inst_retired.any:u
    104,104,748,746 uops_retired.all:u
    22,259,028,509 mem_uops_retired.all_loads:u
    
    7.819202754 seconds time elapsed
    
    53.806171000 seconds user
    0.177891000 seconds sys
Now, inst_retired.any, uops_retired.all and mem_uops_retired.all_loads are all precise counters (they are collected at retirement). And I do trust the Intel precise counters, primarily because I validated them against simulation runs.

    So, somewhere between 1-in-4 and 1-in-5 instructions are loads, for xz. That's what I'm seeing. Intel IPC with xz is not that great, either, I have to admit this. A useful load gets dispatched roughly once every 7 cycles.

    So, I still think that you will very much see the latency of the loads ...
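Side note: those two ratios can be pulled out mechanically from perf's machine-readable output. A minimal sketch, assuming the same Intel-specific counter names as above and perf's usual CSV field layout (value,unit,event,...):

Code:
# rerun with -x, (CSV) output so the ratios can be computed automatically
perf stat -x, -o counters.csv -e cycles,inst_retired.any,mem_uops_retired.all_loads \
    -- xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
# strip any :u suffix from the event name, then index the values by event
awk -F, '{ sub(/:.*$/, "", $3); v[$3] = $1 }
         END { printf "loads per instruction: %.3f\n", v["mem_uops_retired.all_loads"] / v["inst_retired.any"]
               printf "loads per cycle:       %.3f\n", v["mem_uops_retired.all_loads"] / v["cycles"] }' counters.csv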

    Actually, I think we should also use -T1 (single threaded), for a better comparison of microarchitectural effects. My numbers are quite similar, though the command itself does take longer:

    Code:
/dev/shm $ perf stat -e cycles,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
    
    Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
    
    118,360,577,701 cycles:u
    94,952,136,061 inst_retired.any:u
    103,532,304,227 uops_retired.all:u
    22,811,172,257 mem_uops_retired.all_loads:u
    
    42.374730049 seconds time elapsed
    
    42.180461000 seconds user
    0.117740000 seconds sys
    The IPC also increased by avoiding the inter-thread communication.

Without -T1, a different number of threads could be used on each run. If we want to run a multithreaded test, I think we should agree on a number of threads (e.g., 4).
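If we do go multithreaded, here is a sketch of what I would consider a reproducible setup - fix both the xz thread count and the CPU set explicitly (the core numbers are just an example):

Code:
taskset -c 0-3 perf stat -e cycles,instructions \
    -- xz -3c -T4 virtio-win-0.1.185.iso > virtio-win.iso.xz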

Oh, here are some numbers with branches included:

    Code:
$ perf stat -e cycles,instructions,branches,branch-misses,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
    
    Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
    
    118,389,607,680 cycles:u (57.14%)
    94,944,328,786 instructions:u # 0.80 insn per cycle (71.43%)
    14,754,365,523 branches:u (85.71%)
    906,875,767 branch-misses:u # 6.15% of all branches (57.14%)
    94,896,558,766 inst_retired.any:u (71.43%)
    103,512,664,941 uops_retired.all:u (85.71%)
    22,839,152,686 mem_uops_retired.all_loads:u (42.86%)
    
    42.722314322 seconds time elapsed
    
    42.225622000 seconds user
    0.114992000 seconds sys
    The difference between instructions (a speculative counter) and inst_retired.any (a precise counter) is small. Obviously, I don't have numbers for your systems. But for my system, branch misspeculation doesn't play a big role.
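For concreteness, the gap between those two counters above works out to about 0.05% - with the caveat that perf was multiplexing here, so these are scaled estimates:

Code:
$ echo 'scale=4; (94944328786 - 94896558766) * 100 / 94944328786' | bc
.0503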
    Last edited by vladpetric; 19 August 2020, 01:12 PM.



  • atomsymbol
    replied
    Originally posted by atomsymbol View Post
    Code:
    RPi2
    $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
    
    Performance counter stats for 'xz -3c Xorg':
    
    3,389,885,517 cycles:u
    906,475,610 instructions:u # 0.27 insn per cycle
    350,135,938 cache-references:u
    350,135,938 L1-dcache-loads:u
    Originally posted by vladpetric View Post
The thing is, loads being roughly 1-in-10 instructions is what we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1-in-2 instructions being loads ... To me, cache-references is more believable here.
On the RPi2, more than one in three ARM instructions appears to be a load: 350 / 906 = 0.39.

    To obtain the precise number of load instructions in user-code (i.e: without speculative loads) we would need to annotate the assembly code.

    ----

    Code:
    $ pi cat /proc/cpuinfo | tail
    Hardware : BCM2835
    Revision : a01041
    Model : Raspberry Pi 2 Model B Rev 1.1



  • oiaohm
    replied
    Originally posted by vladpetric View Post
    2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.
Those Ryzen figures are about right. Ryzen chips are more RAM-speed sensitive than Intel; if you trace that back, it is because they have a much more aggressive speculative-load system than Intel's. That results in somewhere between 9 and 12 loads per cache-reference, so yes, an order of magnitude higher is right. Does this make looking at the Ryzen load figures mostly pointless? Yes.

    Originally posted by vladpetric View Post
The thing is, loads being roughly 1-in-10 instructions is what we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1-in-2 instructions being loads ... To me, cache-references is more believable here.
Yes, aggressive preloading results in 1 in 2 instructions being loads. Most of these loads are not generated by the program but by the aggressive speculation/preloading, and that is what gives Epyc and Ryzen their increased IPC. Yes, it is a warped design, intentionally doing far more loads to increase IPC. And yes, it does increase the risk of load/store clashes.
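One rough way to probe this on Zen is to put the prefetch counter next to the load counter in the same run (these are the same event names that show up in the Ryzen results quoted later in this thread; whether L1-dcache-loads also counts prefetcher-issued accesses is exactly the open question here):

Code:
perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches \
    -- xz -3c -T1 virtio-win-0.1.185.iso > /dev/null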

    Originally posted by vladpetric View Post
    4. The IPC on RPi2 is ... big facepalm
Which RPi2? There are two of them:
https://www.raspberrypi.org/document...e/raspberrypi/
The first RPi2 boards use the BCM2836, a quad Cortex-A7, 32-bit-only CPU. Currently produced RPi2 B rev 1.2 boards use the BCM2837 - only one digit of difference, but it is a quad Cortex-A53 that can run 64-bit code. The IPC between those two is chalk and cheese; I would suspect an A7-based RPi2 here. The BCM2837 is the same chip used in the RPi3, which leads to some confusion: people who have only seen the 1.2 or newer revision of the RPi2 assume all RPi2 boards have close to RPi3 performance, which is not true of the original.

You need to be a little more exact when benchmarking an RPi2, because the two revisions use totally different SoCs; the quickest check is below.
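The fastest way to see which board you actually have is to read the model/revision strings the kernel exposes (the same /proc/cpuinfo fields quoted elsewhere in this thread):

Code:
$ grep -E 'Hardware|Revision|Model' /proc/cpuinfo
Hardware : BCM2835
Revision : a01041
Model : Raspberry Pi 2 Model B Rev 1.1
# the Revision code identifies the exact board/SoC (see the documentation link above)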

https://www.itproportal.com/2012/10/...ance-analysis/ There is a huge jump in performance going from the A7 to the A53/A57 generation - more than a doubling.



  • vladpetric
    replied
    Originally posted by atomsymbol View Post

    The performance counter cache-references can also mean LLC-references or "L2 cache references", so I passed L1-dcache-loads to /usr/bin/perf.

    Summary of the code snippets below:
• A10-7850K, app=xz: 0.41 L1D loads per cycle (41% L1D load pipeline utilization, not normalized to the number of load ports)
• Ryzen 3700X, app=xz: 0.58 L1D loads per cycle (58% L1D load pipeline utilization, not normalized to the number of load ports)
• Ryzen 3700X, app=g++: 0.67 L1D loads per cycle (67% L1D load pipeline utilization, not normalized to the number of load ports)
    • Raspberry Pi 2, app=xz: not meaningful because of very low IPC
I suppose that with 0.67 L1D loads per cycle, the number of matching store(X)-load(X) pairs occurring within 0-3 cycles is just a small fraction of 0.67 - for example, less than 0.1 - so for memory bypassing to be required to improve performance, the IPC would have to be larger than 10 instructions per clock.

If IPC keeps increasing over time, then L1D pipeline utilization will increase as well, and thus the probability of a store(X)-load(X) pair occurring within 0-3 cycles will over time be amplified into a performance bottleneck. However, it will take several decades (or more) for single-threaded IPC to reach 10.

    Code:
    A10-7850K
    $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
    
    Performance counter stats for 'xz -9c /usr/bin/Xorg':
    
    5,147,093,251 cycles
    4,694,812,581 instructions # 0.91 insn per cycle
    52,196,930 cache-references
    2,134,968,624 L1-dcache-loads
    49,383,148 L1-dcache-prefetches
    44,112,814 L1-dcache-load-misses # 2.07% of all L1-dcache hits
    
    1.314936065 seconds time elapsed
    
    1.253729000 seconds user
    0.059701000 seconds sys
    Code:
    Ryzen 3700X (a slightly different /usr/bin/Xorg file)
    $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
    
    Performance counter stats for 'xz -9c /usr/bin/Xorg':
    
    3,611,880,161 cycles
    5,175,382,384 instructions # 1.43 insn per cycle
    85,128,735 cache-references
    2,083,427,179 L1-dcache-loads
    24,899,168 L1-dcache-prefetches
    55,135,959 L1-dcache-load-misses # 2.65% of all L1-dcache hits
    
    0.831343249 seconds time elapsed
    
    0.813425000 seconds user
    0.019290000 seconds sys
    Code:
    Ryzen 3700X
    $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- g++ -O2 ...
    
    Performance counter stats for 'g++ -O2 ...':
    
    16,519,230,778 cycles
    24,517,551,053 instructions # 1.48 insn per cycle
    1,619,398,618 cache-references
    11,028,752,404 L1-dcache-loads
    392,157,539 L1-dcache-prefetches
    584,586,070 L1-dcache-load-misses # 5.30% of all L1-dcache hits
    
    3.814482470 seconds time elapsed
    
    3.741325000 seconds user
    0.070113000 seconds sys
    Code:
    RPi2
    $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
    
    Performance counter stats for 'xz -3c Xorg':
    
    3,389,885,517 cycles:u
    906,475,610 instructions:u # 0.27 insn per cycle
    350,135,938 cache-references:u
    350,135,938 L1-dcache-loads:u
    1. I really appreciate you doing this!

    2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.

The thing is, loads being roughly 1-in-10 instructions is what we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1-in-2 instructions being loads ... To me, cache-references is more believable here.

    3. Could I kindly ask you to do a consistent test? If I may propose https://fedorapeople.org/groups/virt...in-0.1.185.iso

    This way we could compare apples to apples.

Why? Well, why not. It's a bit large, yes (393 MiB). (A sketch of the full command follows this list.)

    4. The IPC on RPi2 is ... big facepalm
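Concretely, the consistent test I'm proposing is just this - a sketch, assuming the ISO has already been downloaded into the current directory; the generic counters should work on the RPi as well as on x86:

Code:
perf stat -e cycles,instructions,cache-references \
    -- xz -3c -T1 ./virtio-win-0.1.185.iso | wc -c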



  • CooliPi
    replied
    System Information


    PROCESSOR: ARMv8 Cortex-A72 @ 2.20GHz
    Core Count: 4
    Scaling Driver: cpufreq-dt performance

    GRAPHICS:

    MOTHERBOARD: BCM2835 Raspberry Pi 4 Model B Rev 1.4

    MEMORY: 8GB

    DISK: 32GB SM32G
    File-System: ext4
    Mount Options: relatime rw

    OPERATING SYSTEM: Ubuntu 20.04
    Kernel: 5.4.0-1015-raspi (aarch64)
    Display Server: X Server 1.20.8
    Compiler: GCC 9.3.0
    Security: itlb_multihit: Not affected
    + l1tf: Not affected
    + mds: Not affected
    + meltdown: Not affected
    + spec_store_bypass: Vulnerable
    + spectre_v1: Mitigation of __user pointer sanitization
    + spectre_v2: Vulnerable
    + srbds: Not affected
    + tsx_async_abort: Not affected


    Current Test Identifiers:
    - Raspberry Pi 4
    - Core i3 10100
    - Pentium Gold G6400
    - Celeron G5900

    Enter a unique name to describe this test run / configuration: Raspberry Pi 4 8GB + CooliPi 4B + Noctua 60mm [email protected] @2147MHz

    If desired, enter a new description below to better describe this result set / system configuration under test.
    Press ENTER to proceed without changes.

    Current Description: Benchmarks for a future article.

    New Description: Using CooliPi 4B heatsink, Noctua 60mm 5V fan, MX-2 paste from Arctic Cooling all in a Red Bull small fridge at 1-10degC

    Any better idea how to name it? It's insane...



  • atomsymbol
    replied
    Originally posted by vladpetric View Post
    He's somewhere between a bullshitter and a troll.

    He now claims he understands my own freakin' research paper better (RENO) than I do. He doesn't; he reads a few keywords here and there and creates a bullshitty narrative in his mind.

    Well, I have a personal policy not to feed the trolls ... I recommend that you do the same
In such cases I lean more towards an inclusive policy rather than an exclusive one, i.e. trying to explain what is wrong.



  • vladpetric
    replied
    Originally posted by atomsymbol View Post

    Your posts are messy (the grammar of sentences; the thread of reasoning in general) and very hard to read. Can you please do something about it?
    He's somewhere between a bullshitter and a troll.

    He now claims he understands my own freakin' research paper better (RENO) than I do. He doesn't; he reads a few keywords here and there and creates a bullshitty narrative in his mind.

    Well, I have a personal policy not to feed the trolls ... I recommend that you do the same



  • CooliPi
    replied
    Originally posted by DihydrogenOxide View Post
Does anyone have a 4 GB RPi on hand to test? I am curious whether any of these tests are RAM limited. Also, I assume the ondemand governor was used, although I don't think that will change anything significantly.
I have three 4GB units and also the new 8GB version of the RPi. I don't think any of the tests are RAM-size limited (maybe throughput limited).

Last time I used PTS as a benchmark (and to stress RPis when overclocking them), the answer regarding the governor was: yes, it plays a role, because it somehow shortens some of the lags between parts of the benchmark. My wild guess is that it is related to new process creation. Some tests were a few percent faster with the performance governor, and the variance was definitely more predictable - less spread between runs.

The RPi has an RTOS running under the hood; they've removed lots of unnecessary code from it, but some code remains to manage frequency scaling, and this may also add latency. As I vaguely remember, when I was running my realtime AD-converter code on an RPi3 with the ondemand governor active, I sometimes saw insanely long latencies. After I fixed the governor to performance, everything stayed under 123 us (so the real spread of nanosleep latency was between 67-123 us). I haven't tried it on an RPi4 yet. My bet: the ondemand governor may add some latency on top of Linux's own, and maybe even some cache thrashing.
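For reference, pinning the governor for a benchmark run is just a sysfs write (a sketch; the cpufreq paths are the standard ones, and cpupower, if installed, does the same thing):

Code:
# switch every core to the performance governor for the duration of a benchmark
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# verify what is currently in effect
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor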

Right now, I'm testing an RPi 4 8GB in a small Red Bull fridge using a CooliPi ( http://www.coolipi.com ) with the 60mm Noctua fan; the real temperatures vary between 1-10°C. I use MX-2 thermal paste from Arctic Cooling.

Two hours ago, I was testing it in the deep-freeze compartment of our main fridge, between peas and carrots. Aside from the fact that vcgencmd still has the temperature-reporting bug (see https://www.coolipi.com/Liquid_Nitrogen.html ) a year after we tested it with liquid nitrogen, it overclocks well at 2147MHz. I bet the new 8GB versions are going to be good overclockers, because with the RPi 4 4GB, several of my units rebooted when all of their cores were loaded simultaneously. The evidence that the PMIC is to blame is that some were stable at 1860MHz with over_voltage=2, but not at over_voltage=3 (reboot).

Single-core overclockability was relatively good, but when all cores were loaded (by PTS, for example), the integrated PMIC switcher couldn't supply enough current; the 1.2V rail dropped, which causes a reset/reboot.

The 8GB version has at least one different inductor around the PMIC, but it looks like a change for the better. It's even smaller than the previous one. Maybe they increased the switching frequency? I have yet to look at it with an oscilloscope.

To really use more RAM, it's necessary to go 64-bit. I've installed Raspberry Pi OS 64-bit (experimental), but PTS couldn't find some libraries, so the testing was somewhat incomplete. Now I'm trying Ubuntu 20.04 64-bit (it also ships the buggy vcgencmd, but kernel temperature reporting seems OK) at 2147MHz, using the performance governor, all of it in that small fridge. To keep temperatures as low as possible, I use the CooliPi 4B with the aforementioned 60mm Noctua fan on top of it (when idle, it sits about 2.8°C above the ambient air temperature).
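To make sure an overclocked run isn't quietly throttling, I log the firmware's view of temperature, ARM clock and throttle flags while the benchmark runs - a small sketch with the stock vcgencmd (modulo the temperature-reporting bug mentioned above):

Code:
# sample SoC temperature, ARM clock and throttle flags every 5 seconds
while true; do
    echo "$(date +%T) $(vcgencmd measure_temp) $(vcgencmd measure_clock arm) $(vcgencmd get_throttled)"
    sleep 5
done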

    To wrap it up - I had three RPi4 4GB overclocked to 1750, 1850 and 2000 MHz, respectively, and now a single new 8GB version overclocks directly to 2147MHz, albeit needing appropriate cooling.

My question to you more knowledgeable users is this: how can I pull runs of identically clocked RPis out of the openbenchmarking.org database? And second, how (the hell) do I name my overclocked runs so that the naming stays sane? I'm always confused when PTS asks for six lines of descriptions at the beginning. Very confusing - I'm not familiar with it yet, so sorry for the inconvenience.

I was a bit worried that finishing the PTS suite with an RPi overclocked to 2147MHz would require liquid nitrogen (again - see my video; I don't have enough of it to pour over the RPi for six hours again), but it seems to be stable at mild temperatures around 10°C. over_voltage=6, of course...
    Last edited by CooliPi; 17 August 2020, 12:53 PM.



  • atomsymbol
    replied
    Originally posted by vladpetric View Post

    I'm afraid that, in practice, the utilization of the load pipeline/path is way way lower than that (so you end up waiting). The utilization of the ALU pipelines is definitely better, but even there you rarely ever get close to 100% utilization.

    I suggest the following experiment:

Start with a CPU-intensive benchmark that lasts roughly 10-20 seconds (more is not a problem, it'll just make you wait a bit longer). The easiest thing is the compression of a larger file, which I also copied to /dev/shm. But feel free to pick your own CPU-intensive benchmark (this is not meaningful for an I/O benchmark that spends most of its time waiting ...)

    Then run the following:

    perf stat -e cycles,instructions,cache-references << your actual command >>

    If your processor is an Intel one, the following should probably work as well:

    perf stat -e cycles,uops_retired.all,mem_uops_retired.all_loads << your actual command >>

    Generally, it's better to report uops vs instructions, and the uops_retired.all, mem_uops_retired.all_loads counters are precise on my processor.

    Then see what the IPC is (instructions per cycle, or uops per cycle). Also see how frequent the loads are.

    Please try the previous, as I'm curious what numbers you get. The first command should also work on an RPi2/3/4 actually.
    The performance counter cache-references can also mean LLC-references or "L2 cache references", so I passed L1-dcache-loads to /usr/bin/perf.

    Summary of the code snippets below:
• A10-7850K, app=xz: 0.41 L1D loads per cycle (41% L1D load pipeline utilization, not normalized to the number of load ports)
• Ryzen 3700X, app=xz: 0.58 L1D loads per cycle (58% L1D load pipeline utilization, not normalized to the number of load ports)
• Ryzen 3700X, app=g++: 0.67 L1D loads per cycle (67% L1D load pipeline utilization, not normalized to the number of load ports)
    • Raspberry Pi 2, app=xz: not meaningful because of very low IPC
I suppose that with 0.67 L1D loads per cycle, the number of matching store(X)-load(X) pairs occurring within 0-3 cycles is just a small fraction of 0.67 - for example, less than 0.1 - so for memory bypassing to be required to improve performance, the IPC would have to be larger than 10 instructions per clock.

If IPC keeps increasing over time, then L1D pipeline utilization will increase as well, and thus the probability of a store(X)-load(X) pair occurring within 0-3 cycles will over time be amplified into a performance bottleneck. However, it will take several decades (or more) for single-threaded IPC to reach 10.

    Code:
    A10-7850K
    $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
    
     Performance counter stats for 'xz -9c /usr/bin/Xorg':
    
         5,147,093,251      cycles                                                      
         4,694,812,581      instructions              #    0.91  insn per cycle        
            52,196,930      cache-references                                            
         2,134,968,624      L1-dcache-loads                                            
            49,383,148      L1-dcache-prefetches                                        
            44,112,814      L1-dcache-load-misses     #    2.07% of all L1-dcache hits  
    
           1.314936065 seconds time elapsed
    
           1.253729000 seconds user
           0.059701000 seconds sys
    Code:
    Ryzen 3700X (a slightly different /usr/bin/Xorg file)
    $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
    
     Performance counter stats for 'xz -9c /usr/bin/Xorg':
    
         3,611,880,161      cycles                                                      
         5,175,382,384      instructions              #    1.43  insn per cycle        
            85,128,735      cache-references                                            
         2,083,427,179      L1-dcache-loads                                            
            24,899,168      L1-dcache-prefetches                                        
            55,135,959      L1-dcache-load-misses     #    2.65% of all L1-dcache hits  
    
           0.831343249 seconds time elapsed
    
           0.813425000 seconds user
           0.019290000 seconds sys
    Code:
    Ryzen 3700X
    $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- g++ -O2 ...
    
     Performance counter stats for 'g++ -O2 ...':
    
        16,519,230,778      cycles                                                      
        24,517,551,053      instructions              #    1.48  insn per cycle        
         1,619,398,618      cache-references                                            
        11,028,752,404      L1-dcache-loads                                            
           392,157,539      L1-dcache-prefetches                                        
           584,586,070      L1-dcache-load-misses     #    5.30% of all L1-dcache hits  
    
           3.814482470 seconds time elapsed
    
           3.741325000 seconds user
           0.070113000 seconds sys
    Code:
    RPi2
    $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
    
     Performance counter stats for 'xz -3c Xorg':
    
         3,389,885,517      cycles:u                                                    
           906,475,610      instructions:u            #    0.27  insn per cycle        
           350,135,938      cache-references:u                                          
           350,135,938      L1-dcache-loads:u
    Last edited by atomsymbol; 17 August 2020, 12:34 PM.

