How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • Originally posted by DihydrogenOxide View Post
    Does anyone have a 4GB RPi on hand to test? I am curious whether any of these tests are RAM-limited. Also, I assume the ondemand governor was used, although I don't think that will change anything significantly.
    I have three 4GB units and also the new 8GB version of the RPi. I don't think any of the tests are limited by RAM size (maybe by memory throughput).
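
    One way to check, while a test runs, whether it is capacity-limited (a minimal sketch using standard procps tools, nothing RPi-specific - sustained swap traffic would be the tell):

    Code:
    free -m        # total vs. available RAM before/after a run
    vmstat 5       # watch the si/so (swap-in/swap-out) columns during a run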

    The last time I used PTS as a benchmark (and to stress RPis while overclocking them), the answer regarding the governor was: yes, it plays a role, because it shortens some of the lags between parts of the benchmark. My wild guess is that this is related to new process creation. Some tests were a few percent faster with the performance governor, and the variance was definitely more predictable - less spread between runs.

    The RPi has an RTOS running under the hood; lots of unnecessary code has been removed from it, but some code remains to manage frequency scaling/changing, and this may also add latency. As I vaguely remember, when I ran my real-time A/D converter code on an RPi 3 with the ondemand governor active, I sometimes saw insanely long latencies. After I pinned the governor to performance, everything stayed under 123 us (so the real spread of nanosleep latency was 67-123 us). I haven't tried it on an RPi 4 yet. My bet: running the ondemand governor may add some latency on top of Linux's own latencies, and maybe even some cache thrashing.
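
    Pinning the governor is a one-liner per core (a minimal sketch; the sysfs path is the standard cpufreq one on Raspberry Pi OS and Ubuntu):

    Code:
    # Pin all cores to the performance governor; echo "ondemand" to revert
    for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        echo performance | sudo tee "$g"
    done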

    Right now, I'm testing an RPi 4 8GB in a small Red Bull fridge using CooliPi ( http://www.coolipi.com ) with the 60mm Noctua fan; actual temperatures vary between 1-10°C. I use MX-2 thermal paste from Arctic Cooling.

    Two hours ago, I was testing it in the deep-freeze compartment of our main fridge, between the peas and the carrots. Aside from the fact that vcgencmd still has the temperature-reporting bug (see https://www.coolipi.com/Liquid_Nitrogen.html ), a year after we found it while testing with liquid nitrogen, it overclocks well at 2147MHz. I bet the new 8GB versions are going to be good overclockers, because with the RPi 4 4GB, more of my units rebooted when all of their cores were loaded simultaneously. The evidence pointing at the PMIC is that some were stable at 1860MHz with over_voltage=2, but rebooted at over_voltage=3.

    Single-core overclockability was relatively good, but under full load (from PTS, for example) the integrated PMIC switcher couldn't supply enough current, so the 1.2V rail voltage dropped, which causes a reset/reboot.

    The 8GB version has at least one different inductor around the PMIC, but it looks like a change for the better. It's even smaller than the previous one. Maybe they increased the switching frequency? I have yet to look at it with an oscilloscope.

    To really use more RAM, it's necessary to go 64-bit. I installed Raspberry Pi OS 64-bit (experimental), but PTS couldn't find some libraries, so the testing was somewhat incomplete. Now I'm trying Ubuntu 20.04 64-bit (it also contains the buggy vcgencmd, but kernel temperature reporting seems OK) at 2147MHz, using the performance governor, all of it in that small fridge. To keep temperatures as low as possible, I use the CooliPi 4B with the aforementioned 60mm Noctua fan on top of it (when idle, it sits about 2.8°C above ambient air temperature).
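
    If you want to compare the two temperature paths yourself, a quick sketch (thermal_zone0 is the standard kernel sensor node on the RPi and reports millidegrees):

    Code:
    vcgencmd measure_temp                        # firmware reading (the buggy path)
    cat /sys/class/thermal/thermal_zone0/temp    # kernel reading, in millidegrees C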

    To wrap it up: I had three RPi 4 4GB units overclocked to 1750, 1850 and 2000 MHz, respectively, and now a single new 8GB version overclocks straight to 2147MHz, albeit needing appropriate cooling.

    My question to you more knowledgeable users is this: how can I pull runs of identically clocked RPis out of the openbenchmarking.org database? And second, how (the hell) should I name my overclocked runs so that the naming stays sane? I'm always confused when PTS asks for six lines of descriptions at the beginning. Very confusing - I'm not familiar with it yet, so sorry for the inconvenience.
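
    From what I can tell from the docs, the prompts can at least be pre-seeded through environment variables, so every overclocked run gets named the same way (a sketch; the names and the test profile here are just examples):

    Code:
    TEST_RESULTS_NAME="rpi4-8gb-2147mhz" \
    TEST_RESULTS_IDENTIFIER="RPi4-8GB-OC2147" \
    TEST_RESULTS_DESCRIPTION="CooliPi 4B + Noctua 60mm 5V fan, fridge at 1-10degC" \
    phoronix-test-suite benchmark pts/compress-7zip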

    I was a bit worried that finishing the PTS suite with an RPi overclocked to 2147MHz would require liquid nitrogen (again - see my video; I don't have enough of it to keep pouring it over the RPi for six hours), but it seems to be stable at mild temperatures around 10°C. Over_voltage=6, of course...
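
    For completeness, the relevant /boot/config.txt entries behind that run (just the values quoted above; needs serious cooling):

    Code:
    # /boot/config.txt
    over_voltage=6
    arm_freq=2147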
    Last edited by CooliPi; 17 August 2020, 12:53 PM.



    • Originally posted by atomsymbol View Post

      Your posts are messy (the grammar of sentences; the thread of reasoning in general) and very hard to read. Can you please do something about it?
      He's somewhere between a bullshitter and a troll.

      He now claims he understands my own freakin' research paper (RENO) better than I do. He doesn't; he reads a few keywords here and there and creates a bullshitty narrative in his mind.

      Well, I have a personal policy not to feed the trolls ... I recommend that you do the same.



      • Originally posted by vladpetric View Post
        He's somewhere between a bullshitter and a troll.

        He now claims he understands my own freakin' research paper (RENO) better than I do. He doesn't; he reads a few keywords here and there and creates a bullshitty narrative in his mind.

        Well, I have a personal policy not to feed the trolls ... I recommend that you do the same.
        I lean more towards an inclusive policy rather than an exclusive policy in such cases, i.e., trying to explain what is wrong.



        • System Information


          PROCESSOR: ARMv8 Cortex-A72 @ 2.20GHz
          Core Count: 4
          Scaling Driver: cpufreq-dt performance

          GRAPHICS:

          MOTHERBOARD: BCM2835 Raspberry Pi 4 Model B Rev 1.4

          MEMORY: 8GB

          DISK: 32GB SM32G
          File-System: ext4
          Mount Options: relatime rw

          OPERATING SYSTEM: Ubuntu 20.04
          Kernel: 5.4.0-1015-raspi (aarch64)
          Display Server: X Server 1.20.8
          Compiler: GCC 9.3.0
          Security: itlb_multihit: Not affected
          + l1tf: Not affected
          + mds: Not affected
          + meltdown: Not affected
          + spec_store_bypass: Vulnerable
          + spectre_v1: Mitigation of __user pointer sanitization
          + spectre_v2: Vulnerable
          + srbds: Not affected
          + tsx_async_abort: Not affected


          Current Test Identifiers:
          - Raspberry Pi 4
          - Core i3 10100
          - Pentium Gold G6400
          - Celeron G5900

          Enter a unique name to describe this test run / configuration: Raspberry Pi 4 8GB + CooliPi 4B + Noctua 60mm [email protected] @2147MHz

          If desired, enter a new description below to better describe this result set / system configuration under test.
          Press ENTER to proceed without changes.

          Current Description: Benchmarks for a future article.

          New Description: Using CooliPi 4B heatsink, Noctua 60mm 5V fan, MX-2 paste from Arctic Cooling all in a Red Bull small fridge at 1-10degC

          Any better idea how to name it? It's insane...



          • Originally posted by atomsymbol View Post

            The performance counter cache-references can also mean LLC-references or "L2 cache references", so I passed L1-dcache-loads to /usr/bin/perf.

            Summary of the code snippets below:
            • A10-7850K, app=xz: 0.41 L1D loads per cycle (41% L1D load pipeline utilization, not normalized to the number of load ports)
            • Ryzen 3700X, app=xz: 0.58 L1D loads per cycle (58% L1D load pipeline utilization, not normalized to the number of load ports)
            • Ryzen 3700X, app=g++: 0.67 L1D loads per cycle (67% L1D load pipeline utilization, not normalized to the number of load ports)
            • Raspberry Pi 2, app=xz: not meaningful because of very low IPC
            I suppose that with 0.67 L1D loads per cycle, the number of matching store(X)-load(X) pairs occurring within 0-3 cycles of each other is just a small fraction of that 0.67 - for example, less than 0.1 - so for memory bypassing to be required to improve performance, the IPC would have to be larger than 10 instructions per clock.

            If IPC keeps increasing over time, then L1D pipeline utilization will increase as well, and thus the probability of a store(X)-load(X) pair occurring within 0-3 cycles will over time be amplified into a performance bottleneck. However, it will take several decades (or more) for single-threaded IPC to reach 10.
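
            The per-cycle figures above follow directly from the raw counters in the snippets below; a quick sanity check with bc:

            Code:
            $ echo "scale=3; 2134968624/5147093251" | bc    # A10-7850K, xz  -> .414
            $ echo "scale=3; 2083427179/3611880161" | bc    # 3700X, xz      -> .576
            $ echo "scale=3; 11028752404/16519230778" | bc  # 3700X, g++     -> .667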

            Code:
            A10-7850K
            $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
            
            Performance counter stats for 'xz -9c /usr/bin/Xorg':
            
            5,147,093,251 cycles
            4,694,812,581 instructions # 0.91 insn per cycle
            52,196,930 cache-references
            2,134,968,624 L1-dcache-loads
            49,383,148 L1-dcache-prefetches
            44,112,814 L1-dcache-load-misses # 2.07% of all L1-dcache hits
            
            1.314936065 seconds time elapsed
            
            1.253729000 seconds user
            0.059701000 seconds sys
            Code:
            Ryzen 3700X (a slightly different /usr/bin/Xorg file)
            $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
            
            Performance counter stats for 'xz -9c /usr/bin/Xorg':
            
            3,611,880,161 cycles
            5,175,382,384 instructions # 1.43 insn per cycle
            85,128,735 cache-references
            2,083,427,179 L1-dcache-loads
            24,899,168 L1-dcache-prefetches
            55,135,959 L1-dcache-load-misses # 2.65% of all L1-dcache hits
            
            0.831343249 seconds time elapsed
            
            0.813425000 seconds user
            0.019290000 seconds sys
            Code:
            Ryzen 3700X
            $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- g++ -O2 ...
            
            Performance counter stats for 'g++ -O2 ...':
            
            16,519,230,778 cycles
            24,517,551,053 instructions # 1.48 insn per cycle
            1,619,398,618 cache-references
            11,028,752,404 L1-dcache-loads
            392,157,539 L1-dcache-prefetches
            584,586,070 L1-dcache-load-misses # 5.30% of all L1-dcache hits
            
            3.814482470 seconds time elapsed
            
            3.741325000 seconds user
            0.070113000 seconds sys
            Code:
            RPi2
            $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
            
            Performance counter stats for 'xz -3c Xorg':
            
            3,389,885,517 cycles:u
            906,475,610 instructions:u # 0.27 insn per cycle
            350,135,938 cache-references:u
            350,135,938 L1-dcache-loads:u
            1. I really appreciate you doing this!

            2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.

            Thing is, loads being roughly 1 in 10 instructions is something that we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1 in 2 instructions being loads ... To me, cache-references is more believable here.

            3. Could I kindly ask you to do a consistent test? If I may propose https://fedorapeople.org/groups/virt...in-0.1.185.iso

            This way we could compare apples to apples.

            Why? Well, why not. It's a bit large, yes (393 MiB).

            4. The IPC on RPi2 is ... big facepalm
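
            Regarding point 2: one way to see how perf resolves these generic event names on a given core (a sketch; output differs by CPU and perf version):

            Code:
            $ perf list | grep -iE 'cache-references|L1-dcache'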



            • Originally posted by vladpetric View Post
              2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.
              Those Ryzen figures are about right. Ryzen chips are more RAM-speed sensitive than Intel's; if you trace that to its source, you find a much more aggressive speculative-load system than Intel's. That results in somewhere between 9 and 12 loads per cache-reference. So yes, an order of magnitude higher is right. Does this make looking at Ryzen load figures mostly pointless? Yes.

              Originally posted by vladpetric View Post
              Thing is, loads being roughly 1 in 10 instructions is something that we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1 in 2 instructions being loads ... To me, cache-references is more believable here.
              Yes, aggressive preloading results in 1 in 2 instructions being loads; most of these loads are not generated by the program but by the aggressive speculation/preloading, and that is what gives Epyc and Ryzen their increased IPC. Yes, it's a warped metric - intentionally doing far more loads to increase IPC. And yes, this does increase the risk of load/store clashes.

              Originally posted by vladpetric View Post
              4. The IPC on RPi2 is ... big facepalm
              Which RPi 2? There are two of them.
              https://www.raspberrypi.org/document...e/raspberrypi/
              The first RPi 2 is the BCM2836, a quad-core Cortex-A7, 32-bit-only CPU. Currently made RPi 2 B 1.2 boards are BCM2837 - yes, a one-digit difference, but it's a quad-core Cortex-A53 that can run 64-bit code. The IPC between those two is chalk and cheese. I would suspect an A7-based RPi 2 here. The BCM2837 is the same chip as in the RPi 3, which leads to some confusion: some people think all RPi 2s have close-to-RPi 3 performance because they have only seen the 1.2 or newer versions of the RPi 2, not the original.

              You need to be a little more exact when benchmarking an RPi 2, because there are two of them with totally different SoCs; a quick check is sketched below.
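
              A quick check (the revision codes are from the published Raspberry Pi revision tables):

              Code:
              cat /proc/device-tree/model
              grep Revision /proc/cpuinfo   # a01040/a01041 = BCM2836 (A7); a02042/a22042 = BCM2837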

              https://www.itproportal.com/2012/10/...ance-analysis/ Yes, there is a huge jump in performance going from the A7 to the A57 - an over-double performance change.



              • Originally posted by atomsymbol View Post
                Code:
                RPi2
                $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
                
                Performance counter stats for 'xz -3c Xorg':
                
                3,389,885,517 cycles:u
                906,475,610 instructions:u # 0.27 insn per cycle
                350,135,938 cache-references:u
                350,135,938 L1-dcache-loads:u
                Originally posted by vladpetric View Post
                Thing is, loads being roughly 1 in 10 instructions is something that we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1 in 2 instructions being loads ... To me, cache-references is more believable here.
                On the RPi2, approximately every 3rd ARM instruction appears to be a load: 350 / 906 = 0.39.

                To obtain the precise number of load instructions in user code (i.e., without speculative loads), we would need to annotate the assembly code.
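
                A crude static starting point (not the dynamic count - that needs instrumentation or annotation - and it misses some load forms, so treat it as a rough lower bound):

                Code:
                $ objdump -d "$(command -v xz)" | awk '{print $3}' | grep -cE '^(ldr|ldm|vldr|pop)'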

                ----

                Code:
                $ pi cat /proc/cpuinfo | tail
                Hardware : BCM2835
                Revision : a01041
                Model : Raspberry Pi 2 Model B Rev 1.1



                • Originally posted by atomsymbol View Post



                  On the RPi2, approximately every 3rd ARM instruction appears to be a load: 350 / 906 = 0.39.

                  To obtain the precise number of load instructions in user code (i.e., without speculative loads), we would need to annotate the assembly code.

                  ----

                  Code:
                  $ pi cat /proc/cpuinfo | tail
                  Hardware : BCM2835
                  Revision : a01041
                  Model : Raspberry Pi 2 Model B Rev 1.1
                  Fair point, I was misreading the data (I was also comparing against a gzip run, which is not apples to apples).

                  When profiling the compression of the .iso file that I mentioned earlier, I get:

                  Code:
                  /dev/shm $ perf stat -e cycles,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c virtio-win-0.1.185.iso > virtio-win.iso.xz
                  
                  Performance counter stats for 'xz -3c virtio-win-0.1.185.iso':
                  
                  155,634,110,621 cycles:u
                  95,291,274,136 inst_retired.any:u
                  104,104,748,746 uops_retired.all:u
                  22,259,028,509 mem_uops_retired.all_loads:u
                  
                  7.819202754 seconds time elapsed
                  
                  53.806171000 seconds user
                  0.177891000 seconds sys
                  Now, inst_retired.any, uops_retired.all and mem_uops_retired.all_loads are all precise counters (things collected at retirement). And I do trust the Intel precise counters, primarily because I have validated them against simulation runs.

                  So, somewhere between 1 in 4 and 1 in 5 instructions are loads, for xz. That's what I'm seeing. Intel IPC with xz is not that great either, I have to admit. A useful load gets dispatched roughly once every 7 cycles.
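
                  Both ratios fall straight out of the counters above:

                  Code:
                  $ echo "scale=2; 95291274136/22259028509" | bc   # ~4.28 instructions per load
                  $ echo "scale=2; 155634110621/22259028509" | bc  # ~6.99 cycles per load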

                  So, I still think that you will very much see the latency of the loads ...

                  Actually, I think we should also use -T1 (single-threaded) for a better comparison of microarchitectural effects. My numbers are quite similar, though the command itself does take longer:

                  Code:
                  /dev/shm $ perf stat -e cycles,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
                  
                  Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
                  
                  118,360,577,701 cycles:u
                  94,952,136,061 inst_retired.any:u
                  103,532,304,227 uops_retired.all:u
                  22,811,172,257 mem_uops_retired.all_loads:u
                  
                  42.374730049 seconds time elapsed
                  
                  42.180461000 seconds user
                  0.117740000 seconds sys
                  The IPC also increased by avoiding the inter-thread communication.

                  Without -T1, a different number of threads could be used on each run. If we want to run a multithreaded test, I think we should agree on a number of threads (e.g., 4).

                  Oh, here are some numbers with branches included:

                  Code:
                  $ perf stat -e cycles,instructions,branches,branch-misses,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
                  
                  Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
                  
                  118,389,607,680 cycles:u (57.14%)
                  94,944,328,786 instructions:u # 0.80 insn per cycle (71.43%)
                  14,754,365,523 branches:u (85.71%)
                  906,875,767 branch-misses:u # 6.15% of all branches (57.14%)
                  94,896,558,766 inst_retired.any:u (71.43%)
                  103,512,664,941 uops_retired.all:u (85.71%)
                  22,839,152,686 mem_uops_retired.all_loads:u (42.86%)
                  
                  42.722314322 seconds time elapsed
                  
                  42.225622000 seconds user
                  0.114992000 seconds sys
                  The difference between instructions (a speculative counter) and inst_retired.any (a precise counter) is small. Obviously, I don't have numbers for your systems, but on my system, branch misspeculation doesn't play a big role.
                  Last edited by vladpetric; 19 August 2020, 01:12 PM.



                  • Originally posted by vladpetric View Post

                    1. I really appreciate you doing this!

                    2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.

                    Thing is, loads being roughly 1 in 10 instructions is something that we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1 in 2 instructions being loads ... To me, cache-references is more believable here.

                    3. Could I kindly ask you to do a consistent test? If I may propose https://fedorapeople.org/groups/virt...in-0.1.185.iso

                    This way we could compare apples to apples.

                    Why? Well, why not. It's a bit large, yes (393 MiB).

                    4. The IPC on RPi2 is ... big facepalm
                    Precise results obtained by simulating cache accesses in callgrind:

                    36% of all instructions are load/store instructions. This is valid for "xz -c3 -T1 virtio-win-0.1.185.iso" - other applications would show a different percentage of load/store instructions - but in summary, expecting 1 in 10 instructions (10%) to be a load/store instruction is unrealistic, i.e., most apps are well above 10%.

                    Curiously, the data below shows that the load:store ratio is almost exactly 2:1 (27.1 : 13.9). I wonder whether this is just a coincidence or a general rule that most apps follow (not counting toy benchmarks).

                    It took about 25 minutes for "callgrind xz" to compress the ISO file, which means callgrind was about 50 times slower than normal xz execution.

                    Code:
                    $ valgrind --tool=callgrind --cache-sim=yes -- xz -3c -T1 ./virtio-win-0.1.185.iso | wc -c
                    
                    desc: I1 cache: 32768 B, 64 B, 8-way associative
                    desc: D1 cache: 32768 B, 64 B, 8-way associative
                    desc: LL cache: 33554432 B, 64 B, direct-mapped
                    
                    Events : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
                    Collected : 114279135645 27105874827 13902443594 2533 1931774728 273271284 2328 2686967 949904
                    
                    I refs: 114,279,135,645
                    I1 misses: 2,533
                    LLi misses: 2,328
                    I1 miss rate: 0.00%
                    LLi miss rate: 0.00%
                    
                    D refs: 41,008,318,421 (27,105,874,827 rd + 13,902,443,594 wr)
                    D1 misses: 2,205,046,012 ( 1,931,774,728 rd + 273,271,284 wr)
                    LLd misses: 3,636,871 ( 2,686,967 rd + 949,904 wr)
                    D1 miss rate: 5.4% ( 7.1% + 2.0% )
                    LLd miss rate: 0.0% ( 0.0% + 0.0% )
                    
                    LL refs: 2,205,048,545 ( 1,931,777,261 rd + 273,271,284 wr)
                    LL misses: 3,639,199 ( 2,689,295 rd + 949,904 wr)
                    LL miss rate: 0.0% ( 0.0% + 0.0% )



                    • Originally posted by atomsymbol View Post

                      Precise results obtained by simulating cache accesses in callgrind:

                      36% of all instructions are load/store instructions. This is valid for "xz -c3 -T1 virtio-win-0.1.185.iso" - other applications would show a different percentage of load/store instructions - but in summary, expecting 1 in 10 instructions (10%) to be a load/store instruction is unrealistic, i.e., most apps are well above 10%.

                      Curiously, the data below shows that the load:store ratio is almost exactly 2:1 (27.1 : 13.9). I wonder whether this is just a coincidence or a general rule that most apps follow (not counting toy benchmarks).
                      Here's my data:

                      Code:
                      $ valgrind --tool=callgrind --cache-sim=yes -- xz -3c -T1 ./virtio-win-0.1.185.iso > virtio-win-0.1.185.iso.xz
                      ==2537234== Callgrind, a call-graph generating cache profiler
                      ==2537234== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
                      ==2537234== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
                      ==2537234==
                      ==2537234== Events : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
                      ==2537234== Collected : 96109168347 23293054178 11744076794 2535 1942602462 270595441 2437 65479019 23737948
                      ==2537234==
                      ==2537234== I refs: 96,109,168,347
                      ==2537234==
                      ==2537234== D refs: 35,037,130,972 (23,293,054,178 rd + 11,744,076,794 wr)
                      I cut out a lot of useless info from the above.

                      Unfortunately, our numbers don't fully align. Even if we were running the same xz version (I have 5.2.5, the latest stable), we could still have differences in compiler or in specialized routines using different assembly (ifunc). But the proportion of load/store instructions is super close, so I think it's good enough for a comparison.

                      Alright, some thoughts:

                      1. The 2:1 ratio is simply an average across multiple benchmarks (the mode of the distribution, if you wish). But I do know benchmarks with lower and higher ratios. Specifically, if you have a lot of register spills, you'll have relatively more stores (spills are typically 1 store / 1 load).

                      But if you have a file-checksum-like benchmark, then the ratio is highly skewed toward loads.

                      For instance:

                      Code:
                      perf stat -e cycles,instructions,mem_uops_retired.all_loads,mem_uops_retired.all_stores sha256sum virtio-win-0.1.185.iso
                      e20a645df49607e7c9cebcc9840d3634b25b32832e45be1f11a73123590fa9fb virtio-win-0.1.185.iso
                      Performance counter stats for 'sha256sum virtio-win-0.1.185.iso':
                      
                      5,945,494,604 cycles:u
                      20,507,306,371 instructions:u # 3.45 insn per cycle
                      2,514,696,183 mem_uops_retired.all_loads:u
                      485,107,654 mem_uops_retired.all_stores:u
                      
                      2.168431630 seconds time elapsed
                      
                      2.110880000 seconds user
                      0.053926000 seconds sys
                      So, more than 5:1 loads to stores.

                      Why? Because sha256sum can do most of the checksumming work in registers ... though I'm guessing sha256sum has sufficiently complex machinery that you still need to store some of those values (they don't all fit in registers).

                      Nonetheless, the average is closer to 2:1, which is why for modern (non-toy) processor designs the load queue is twice the size of the store queue.

                      That ratio was slightly lower when x86-32 was more popular (more like 1.5:1), because x86-32 has fewer registers (7 GPRs) and a shitton more spills.

                      2. From a performance standpoint, stores most of the time don't matter. The main reason is that, for the most part, stores don't produce data that is immediately needed.

                      Yes, you could have a load that needs to read from the same address and is alive (in the instruction window) at the same time as the store, but those cases are rare. The load/store scheduler in modern processors figures out store-load pairs (based on PC) that are likely to collide, and prevents the load from issuing ahead of the store (as the Loh paper discusses).

                      The other situation where a store could slow things down significantly is if the store queue fills up (again, that rarely happens). But otherwise, stores are more or less fire-and-forget.

                      Loads do matter, because at least one other instruction needs to wait for the load to finish and produce a result (OK, you could have a wasted load - nobody uses the output - but again, that's super rare).

                      So the ratio of loads to total instruction count is more useful than (loads + stores) to total instruction count, because stores don't really have a latency (OK, they do have a latency; it's just that in the vast majority of cases it's not on the critical path, so it doesn't matter at all).

                      3. Could you kindly add the perf data for xz and virtio-win-0.1.185.iso?

                      4. As I said earlier, I was wrong about the 10:1 ratio; please ignore that.

