
How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • Originally posted by atomsymbol

    Your posts are messy (the grammar of sentences; the thread of reasoning in general) and very hard to read. Can you please do something about it?
    He's somewhere between a bullshitter and a troll.

    He now claims he understands my own freakin' research paper better (RENO) than I do. He doesn't; he reads a few keywords here and there and creates a bullshitty narrative in his mind.

    Well, I have a personal policy not to feed the trolls ... I recommend that you do the same.



    • System Information


      PROCESSOR: ARMv8 Cortex-A72 @ 2.20GHz
      Core Count: 4
      Scaling Driver: cpufreq-dt performance

      GRAPHICS:

      MOTHERBOARD: BCM2835 Raspberry Pi 4 Model B Rev 1.4

      MEMORY: 8GB

      DISK: 32GB SM32G
      File-System: ext4
      Mount Options: relatime rw

      OPERATING SYSTEM: Ubuntu 20.04
      Kernel: 5.4.0-1015-raspi (aarch64)
      Display Server: X Server 1.20.8
      Compiler: GCC 9.3.0
      Security: itlb_multihit: Not affected
      + l1tf: Not affected
      + mds: Not affected
      + meltdown: Not affected
      + spec_store_bypass: Vulnerable
      + spectre_v1: Mitigation of __user pointer sanitization
      + spectre_v2: Vulnerable
      + srbds: Not affected
      + tsx_async_abort: Not affected


      Current Test Identifiers:
      - Raspberry Pi 4
      - Core i3 10100
      - Pentium Gold G6400
      - Celeron G5900

      Enter a unique name to describe this test run / configuration: Raspberry Pi 4 8GB + CooliPi 4B + Noctua 60mm fan@5V @2147MHz

      If desired, enter a new description below to better describe this result set / system configuration under test.
      Press ENTER to proceed without changes.

      Current Description: Benchmarks for a future article.

      New Description: Using CooliPi 4B heatsink, Noctua 60mm 5V fan, MX-2 paste from Arctic Cooling all in a Red Bull small fridge at 1-10degC

      Any better idea how to name it? It's insane...



      • Originally posted by atomsymbol

        The performance counter cache-references can also mean LLC-references or "L2 cache references", so I passed L1-dcache-loads to /usr/bin/perf.

        Summary of the code snippets below:
        • A10-7850K, app=xz: 0.41 L1D loads per cycle (41% L1D load pipeline utilization, not normalized to the number of load ports)
        • Ryzen 3700X, app=xz: 0.58 L1D loads per cycle (58% L1D load pipeline utilization, not normalized to the number of load ports)
        • Ryzen 3700X, app=g++: 0.67 L1D loads per cycle (67% L1D load pipeline utilization, not normalized to the number of load ports)
        • Raspberry Pi 2, app=xz: not meaningful because of very low IPC
        I suppose that with 0.67 L1D loads per cycle, the number of matching store(X)-load(X) pairs occurring within 0-3 cycles is just a small fraction of 0.67 (for example, less than 0.1), so for memory bypassing to be required to improve performance, the IPC would have to be larger than 10 instructions per clock.

        If IPC keeps increasing over time then L1D pipeline utilization will increase as well, and thus the probability of a store(X)-load(X) pair occurring within 0-3 cycles will over time be amplified into a performance bottleneck. However, it will take several decades (or more) for single-threaded IPC to reach the value 10.

        Code:
        A10-7850K
        $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
        
        Performance counter stats for 'xz -9c /usr/bin/Xorg':
        
        5,147,093,251 cycles
        4,694,812,581 instructions # 0.91 insn per cycle
        52,196,930 cache-references
        2,134,968,624 L1-dcache-loads
        49,383,148 L1-dcache-prefetches
        44,112,814 L1-dcache-load-misses # 2.07% of all L1-dcache hits
        
        1.314936065 seconds time elapsed
        
        1.253729000 seconds user
        0.059701000 seconds sys
        Code:
        Ryzen 3700X (a slightly different /usr/bin/Xorg file)
        $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- xz -9c /usr/bin/Xorg | wc -c
        
        Performance counter stats for 'xz -9c /usr/bin/Xorg':
        
        3,611,880,161 cycles
        5,175,382,384 instructions # 1.43 insn per cycle
        85,128,735 cache-references
        2,083,427,179 L1-dcache-loads
        24,899,168 L1-dcache-prefetches
        55,135,959 L1-dcache-load-misses # 2.65% of all L1-dcache hits
        
        0.831343249 seconds time elapsed
        
        0.813425000 seconds user
        0.019290000 seconds sys
        Code:
        Ryzen 3700X
        $ perf stat -e cycles,instructions,cache-references,L1-dcache-loads,L1-dcache-prefetches,L1-dcache-load-misses -- g++ -O2 ...
        
        Performance counter stats for 'g++ -O2 ...':
        
        16,519,230,778 cycles
        24,517,551,053 instructions # 1.48 insn per cycle
        1,619,398,618 cache-references
        11,028,752,404 L1-dcache-loads
        392,157,539 L1-dcache-prefetches
        584,586,070 L1-dcache-load-misses # 5.30% of all L1-dcache hits
        
        3.814482470 seconds time elapsed
        
        3.741325000 seconds user
        0.070113000 seconds sys
        Code:
        RPi2
        $ perf_4.9 stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c Xorg | wc -c
        
        Performance counter stats for 'xz -3c Xorg':
        
        3,389,885,517 cycles:u
        906,475,610 instructions:u # 0.27 insn per cycle
        350,135,938 cache-references:u
        350,135,938 L1-dcache-loads:u
        1. I really appreciate you doing this!

        2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.

        The thing is, loads being roughly 1-in-10 instructions is what we observe on the RPi2; it is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1-in-2 instructions being loads ... To me, cache-references is the more believable counter here.

        3. Could I kindly ask you to do a consistent test? If I may propose https://fedorapeople.org/groups/virt...in-0.1.185.iso

        This way we could compare apples to apples.

        Why? Well, why not. It's a bit large, yes (393 MiB). A sketch of the run I have in mind is just below.
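
        For concreteness, the run would look roughly like this (a sketch only; adjust the event list to whatever events your perf build actually exposes, and assume the .iso has already been downloaded somewhere fast like /dev/shm):

        Code:
        cd /dev/shm
        # same style of run as above, just on the agreed-upon input file
        perf stat -e cycles,instructions,cache-references,L1-dcache-loads -- xz -3c virtio-win-0.1.185.iso | wc -c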

        4. The IPC on RPi2 is ... big facepalm



        • Originally posted by vladpetric View Post
          2. I think something is wrong with the L1-dcache-loads on (your?) Ryzen systems. Cache-references and L1-dcache-loads should not be that discrepant (they should be equal or close to equal). In particular, cache-references is roughly an order of magnitude lower than L1-dcache-loads, and that is simply not sensible.
          Those Ryzen figures are about right. Ryzen chips are more RAM-speed sensitive than Intel; if you trace that back to its source, they have a much more aggressive speculative load system than Intel's. That does result in more loads, somewhere between 9 and 12 per cache-reference. So yes, an order of magnitude higher is right. Does this make looking at Ryzen load figures mostly pointless? Yes.

          Originally posted by vladpetric View Post
          Thing is, loads being roughly 1-in-10 instructions is something that we observe on RPi2, is pretty close to my measurements on the Intel system, and it is a bit more believable to me than 1-in-2 instructions being loads ... To me, cache-references is more believable here.
          Yes, aggressive preloading results in 1 in 2 instructions being loads. Most of these loads are not generated by the program but by the aggressive speculation/preloads, and that is what gives Epyc and Ryzen their increased IPC. Yes, it is a warped kind of IPC, achieved by intentionally doing far more loads. And yes, this does increase the risk of load/store clashes.

          Originally posted by vladpetric View Post
          4. The IPC on RPi2 is ... big facepalm
          Which RPi2? There are two of them.

          The first RPi2 boards are BCM2836, which is a quad-core A7, 32-bit-only CPU. Currently made RPi2 B 1.2 boards are BCM2837; yes, only one digit of difference, but it is a quad-core A57 able to run 64-bit code. The IPC between those is chalk and cheese. I would suspect an A7-based RPi2. The BCM2837 is the same chip as in the RPi3, which leads to some confusion where people think all RPi2s have close to RPi3 performance, because they only have the 1.2 or newer versions of the RPi2, not the original.

          You need to be a little more exact when benchmarking the RPi2, because there are two of them with totally different SoCs.

          https://www.itproportal.com/2012/10/...ance-analysis/ Yes, there is a huge jump in performance going from A7 to A57; that is an over-double performance change.



          • Originally posted by atomsymbol



            On the RPi2, approximately every 3rd ARM instruction appears to be a load, 350 / 906 = 0.39.

            To obtain the precise number of load instructions in user code (i.e., without speculative loads) we would need to annotate the assembly code.
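
            For a rough idea, one could either do a crude static count of load mnemonics in the binary or get an execution-weighted view from perf annotate. A sketch (the mnemonic list is approximate, and the static count ignores how often each instruction actually executes):

            Code:
            # crude static count of ARM load instructions in the xz binary
            objdump -d "$(command -v xz)" | grep -cwE 'ldr|ldrb|ldrh|ldrd|ldm|pop|vldr'
            # execution-weighted view: record a run, then annotate the hot functions
            perf_4.9 record -- xz -3c Xorg > /dev/null
            perf_4.9 annotate --stdio | less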

            ----

            Code:
            $ pi cat /proc/cpuinfo | tail
            Hardware : BCM2835
            Revision : a01041
            Model : Raspberry Pi 2 Model B Rev 1.1
            Fair point, I was misreading the data (I was also comparing against a gzip run, which is not apples to apples).

            When profiling the compression of the .iso file that I mentioned earlier, I get:

            Code:
            /dev/shm $ perf stat -e cycles,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c virtio-win-0.1.185.iso > virtio-win.iso.xz
            
            Performance counter stats for 'xz -3c virtio-win-0.1.185.iso':
            
            155,634,110,621 cycles:u
            95,291,274,136 inst_retired.any:u
            104,104,748,746 uops_retired.all:u
            22,259,028,509 mem_uops_retired.all_loads:u
            
            7.819202754 seconds time elapsed
            
            53.806171000 seconds user
            0.177891000 seconds sys
            Now inst_retired.any, uops_retired.all, mem_uops_retired.all_loads - these are all precise counters (things collected at retirement). And I do trust the Intel precise counters, primarily because I validated them against simulation runs.

            So, somewhere between 1-in-4 and 1-in-5 instructions are loads, for xz. That's what I'm seeing. Intel IPC with xz is not that great, either, I have to admit this. A useful load gets dispatched roughly once every 7 cycles.

            So, I still think that you will very much see the latency of the loads ...

            Actually, I think we should also use -T1 (single threaded), for a better comparison of microarchitectural effects. My numbers are quite similar, though the command itself does take longer:

            Code:
            /dev/shm $ perf stat -e cycles,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
            
            Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
            
            118,360,577,701 cycles:u
            94,952,136,061 inst_retired.any:u
            103,532,304,227 uops_retired.all:u
            22,811,172,257 mem_uops_retired.all_loads:u
            
            42.374730049 seconds time elapsed
            
            42.180461000 seconds user
            0.117740000 seconds sys
            The IPC also increased by avoiding the inter-thread communication.

            Without -T1, a different number of threads could be used on each run. If we want to run a multithreaded test, I think we should agree on a number of threads (e.g., 4); a sketch of such a run is below.
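
            Something like this is what I mean (sketch only; 4 threads picked arbitrarily, and the event list can be extended as above):

            Code:
            perf stat -e cycles,instructions -- xz -3c -T4 virtio-win-0.1.185.iso > /dev/null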

            Oh, here's some numbers with branches included:

            Code:
            $ perf stat -e cycles,instructions,branches,branch-misses,inst_retired.any,uops_retired.all,mem_uops_retired.all_loads xz -3c -T1 virtio-win-0.1.185.iso > virtio-win.iso.xz
            
            Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
            
            118,389,607,680 cycles:u (57.14%)
            94,944,328,786 instructions:u # 0.80 insn per cycle (71.43%)
            14,754,365,523 branches:u (85.71%)
            906,875,767 branch-misses:u # 6.15% of all branches (57.14%)
            94,896,558,766 inst_retired.any:u (71.43%)
            103,512,664,941 uops_retired.all:u (85.71%)
            22,839,152,686 mem_uops_retired.all_loads:u (42.86%)
            
            42.722314322 seconds time elapsed
            
            42.225622000 seconds user
            0.114992000 seconds sys
            The difference between instructions (a speculative counter) and inst_retired.any (a precise counter) is small. Obviously, I don't have numbers for your systems. But for my system, branch misspeculation doesn't play a big role.
            Last edited by vladpetric; 19 August 2020, 01:12 PM.



            • Originally posted by atomsymbol

              Precise results obtained by simulating cache accesses in callgrind:

              36% of all instructions are load/store instructions. This is valid for "xz -c3 -T1 virtio-win-0.1.185.iso" - other applications would show a different percentage of load/store instructions, but in summary, expecting 1-in-10 instructions to be a load/store instruction (10%) is unrealistic, i.e., most apps are well above 10%.

              Curiously, the data below shows that load:store ratio is almost exactly 2:1 (27.1 : 13.9). I wonder whether this is just a coincidence or a general rule that most apps follow (not considering toy benchmarks).
              Here's my data:

              Code:
              $ valgrind --tool=callgrind --cache-sim=yes -- xz -3c -T1 ./virtio-win-0.1.185.iso > virtio-win-0.1.185.iso.xz
              ==2537234== Callgrind, a call-graph generating cache profiler
              ==2537234== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
              ==2537234== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
              ==2537234==
              ==2537234== Events : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
              ==2537234== Collected : 96109168347 23293054178 11744076794 2535 1942602462 270595441 2437 65479019 23737948
              ==2537234==
              ==2537234== I refs: 96,109,168,347
              ==2537234==
              ==2537234== D refs: 35,037,130,972 (23,293,054,178 rd + 11,744,076,794 wr)
              I cut out a lot of useless info from the above.

              Unfortunately our numbers don't fully align. Even if we were running the same xz version (I have 5.2.5, latest stable), we could still have differences in compiler, or some specialized routines using different assembly (ifunc). But the proportion of load/store instructions is super close, so I think it's good enough for a comparison.
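
              To rule out the most obvious sources of divergence, it's probably worth us both comparing at least these (quick sketch):

              Code:
              xz --version                  # xz / liblzma versions
              gcc --version | head -n1      # compiler used for locally built binaries
              ldd "$(command -v xz)"        # which liblzma the xz binary actually links against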

              Alright, some thoughts:

              1. The 2:1 ratio is simply an average across multiple benchmarks (the distribution mode, if you wish). But I do know benchmarks with lower and higher ratios. Specifically, if you have a lot of register spills, you'll have more stores, relatively speaking (spills are typically 1 store / 1 load).

              But if you have a file checksum-like benchmark, then the ratio is highly skewed to the loads.

              For instance:

              Code:
              perf stat -e cycles,instructions,mem_uops_retired.all_loads,mem_uops_retired.all_stores sha256sum virtio-win-0.1.185.iso
              e20a645df49607e7c9cebcc9840d3634b25b32832e45be1f11a73123590fa9fb virtio-win-0.1.185.iso
              
              Performance counter stats for 'sha256sum virtio-win-0.1.185.iso':
              
              5,945,494,604 cycles:u
              20,507,306,371 instructions:u # 3.45 insn per cycle
              2,514,696,183 mem_uops_retired.all_loads:u
              485,107,654 mem_uops_retired.all_stores:u
              
              2.168431630 seconds time elapsed
              
              2.110880000 seconds user
              0.053926000 seconds sys
              So more than 5:1 loads to stores.

              Why? Because sha256sum can do most of the checksumming work in registers ... Well, I'm guessing that sha256sum has sufficiently complex machinery that you need to store some of those values (they don't all fit in registers).

              Nonetheless, the average is closer to 2:1, which is why in modern (not toy) processor designs the load queue is twice the size of the store queue.

              That ratio was slightly lower when x86-32 was more popular (more like 1.5:1), because with x86-32 you have fewer registers (7 GPRs), and a shitton more spills. A rough way to see this is sketched below.
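
              A quick way to eyeball the spill difference is to compile the same hot function for both targets and count stack accesses in the generated assembly. This is only a rough proxy (stack-passed arguments get counted too), and spill_demo.c here is just a stand-in for whatever hot code you care about:

              Code:
              # needs gcc-multilib for -m32; stack-relative accesses are a crude spill proxy
              gcc -O2 -m32 -S -o - spill_demo.c | grep -cE '\((%esp|%ebp)\)'
              gcc -O2 -m64 -S -o - spill_demo.c | grep -cE '\((%rsp|%rbp)\)'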

              2. From a performance standpoint, stores most of the time don't matter. The main reason is that stores don't produce, for the most part, data that is immediately needed.

              Yes, you could have a load that needs to read from the same address, and is alive (in the instruction window) at the same time as the store, but those cases are rare. The Load/Store scheduler in modern processors figures out store-load pairs (based on PC) which are likely to collide, and prevents the load from issuing ahead of the store (as the Loh paper discusses).

              The other situation where a store could slow things down significantly is if the store queue gets full (again, that rarely happens). But otherwise, stores are more-or-less fire-and-forget.

              Loads do matter, because at least another instruction needs to wait for the load to finish and produce a result (ok, you could have a load that is wasted - nobody using the output - but again, that's super rare).

              So load to total instruction count is more useful than (load + store) to total instruction count, because stores don't really have a latency (ok, they do have a latency, it's just that in the vast majority of cases it's not on the critical path, so it just doesn't matter at all).

              3. Could you kindly add the perf data for xz and virtio-win-0.1.185.iso?

              4. As I said earlier, I was wrong about the 10-1 ratio, please ignore that.



              • Originally posted by vladpetric View Post
                Nonetheless, the average is closer to 2:1, which is why for modern (not toy) processor designs the load queue has twice the size of the store queue.
                That is not true about the load queue being twice the size of the store queue. The A77/A78 and the server-oriented Arm cores that do match up against x86 performance have load queues slightly smaller than their store queues.

                For example, the A77: 85 load entries vs 90 store entries. Why it is this way will become clear when I answer the next bit.

                Originally posted by vladpetric View Post
                2. From a performance standpoint, stores most of the times don't matter. The main reason is that stores don't produce, for the most part, data that is immediately needed.
                Turns out stores matter. If you are not able to store fast enough, you cannot clear data out of registers into the cache fast enough; you end up filling the CPU's usable register space, and that is an instant processing stall. To be safe you normally need closer to a 1:1 load/store buffer ratio, so that when you are, for example, copying memory from one location to another with the CPU (which is a 1:1 load/store ratio), the CPU does not stall out every time that happens because it ran out of store entries.

                Originally posted by vladpetric View Post
                Yes, you could have a load that needs to read from the same address, and is alive (in the instruction window) at the same time as the store, but those cases are rare. The Load/Store scheduler in modern processors figures out store-load pairs (based on PC) which are likely to collide, and prevents the load from issuing ahead of the store (as the Loh paper discusses).
                It does not have to be a load/store scheduler that nukes the load. On Arm it is the register-renaming process that sees the load is going to be to the same address as a store. The register-renaming process also handles cases like: add 1 to register 1, copy register 1 to register 2, add 1 to register 2, store register 2; at run time register 2 basically does not get used and all the processing stays in register 1. So modern processors can have either a load/store scheduler or a smart register-rename process; both end up with the result that a store to an address followed by a load from that address does not actually happen, because the load ceases to exist and is replaced by a register use.

                One of the quirks you see on Arm cores using this register renaming is a load from an address and a store to the same address in the same clock cycle, because a load issuing ahead of a store is possible; a load issuing after a store simply disappears. This is why on Arm you don't want smarts in the load/store buffers.

                With a load ahead of a store, you are after the value in L1 that the load is asking for, not the value the store has pushed into the store buffer. With a load after a store, you want the value the store is sending out, so in this case Arm redirects to the register that held the value instead of performing the load, and the load disappears. If you add smarts to the load/store buffers so that a load checks whether a changed value for that address is sitting in the store buffer, it will break how the Arm design works.

                This also explains why you might want to keep the load buffer slightly shorter than the store buffer, so it is slightly faster to get from one end of the queue to the other.

                Originally posted by vladpetric View Post
                The other situation where a store could slow things down significantly is if the store queue gets full (again, that rarely happens).
                Sorry, store queues getting close to full, or overfull, happens a lot: memory-copy workloads and some of your advanced maths code. Some games are horrible for having a low load ratio with high store usage, because they are optimised so that a lot of the core work is calculated from values already in registers.

                Originally posted by vladpetric View Post
                Loads do matter, because at least another instruction needs to wait for the load to finish and produce a result (ok, you could have a load that is wasted - nobody using the output - but again, that's super rare).
                That depends on the chip you are talking about. Something aggressive like Ryzen/Zen 2 will have a lot of speculative loads that are issued on the guess that execution might go in a particular direction; if it does not, that load is a complete waste. This leads to Ryzen/Zen 2's insanely high load numbers. It does increase IPC, by less than 10 percent, but that 10 percent IPC gain comes from basically 1 in 10 speculatively guessed loads being right. So on Ryzen/Zen 2, a load happening whose output nobody uses is insanely common. How wasteful you are with loads comes down to how extreme you are willing to be in chasing IPC.

                Originally posted by vladpetric View Post
                So load to total instruction count is more useful than (load + store) to total instruction count, because stores don't really have a latency (ok, they do have a latency, it's just that in the vast majority of cases it's not on the critical path, so it just doesn't matter at all).
                This can be a useful diagnostic metric. But there are cases where stores matter.

                The reality here is that making your load/store buffers larger doesn't help much unless you have the micro-ops to use those larger buffers. And even if you have the micro-ops to use larger load/store buffers, you don't need them if you don't have the instruction dispatch and register renaming to keep those micro-ops fed.

                Yes, the size of the load/store buffers tells you a little about the CPU, but without understanding how the design is going to use those buffers, the size does not tell you much. For a Ryzen/Zen 2 that is going to speculatively fill the load buffer, having a much larger load buffer than store buffer makes sense; then you have Arm core designs where, due to the register-renaming design, a slightly smaller load buffer than store buffer makes sense. Both designs can put out close to the same IPC when they have close to the same number of micro-ops in flight and the means to fill and use them.

                Yes, a lot of general programs have a 2:1 load/store ratio, but you don't want your CPU stalling out when you have a 1:1 load/store ratio (copying memory) or a 0:1 load/store ratio (like dumping out data from an in-CPU random-number-generator micro-op).



                • Originally posted by atomsymbol

                  perf stat results (combined from 2 perf stat runs):

                  Code:
                  $ perf stat -e ... -- xz -3c -T1 virtio-win-0.1.185.iso | wc -c
                  
                  Performance counter stats for 'xz -3c -T1 virtio-win-0.1.185.iso':
                  
                  130,998,797,769 cycles
                  113,408,081,318 instructions # 0.87 insn per cycle
                  6,207,884,764 cache-references
                  52,070,531,511 L1-dcache-loads
                  1,932,673,658 L1-dcache-prefetches
                  4,147,646,284 L1-dcache-load-misses # 7.97% of all L1-dcache hits
                  39,159,406,747 ls_dispatch.ld_dispatch
                  17,481,464,420 ls_dispatch.store_dispatch
                  1,473,011,913 ls_dispatch.ld_st_dispatch
                  
                  30.104931172 seconds time elapsed
                  The interesting performance counter is "ls_dispatch.ld_st_dispatch" described as:
                  • Dispatch of a single op that performs a load from and store to the same memory address. Number of single ops that do load/store to an address.

                  Based on the above perf stat output, 1.3% (1.473 / 113.4) of all instructions are store(X)-load(X) pairs from/to the same address X within a single clock. Based on this number (if it is correct), and assuming that a store(X)-load(X) pair stalls execution by at most 3 cycles, it can be estimated that adding memory bypassing to the CPU would lead to less than a 3.4% speedup in the case of "xz -c3 -T1": 1.473 * 3 / 130.999 = 3.4%. The actual gain would be smaller, because the average stall due to a store(X)-load(X) pair is less than 3 cycles. As CPUs become wider (reach higher single-threaded IPC), this number will increase over time, but the CPU would need to sustain an IPC of 5-10 when running "xz -c3 -T1" for memory bypassing to improve performance by a significant margin.
                  This is great, thanks! Will take a look at all this over the weekend.



                  • Originally posted by atomsymbol

                    I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.
                    I think we need to check the AMD perf counter spec, as the perf description is ambiguous enough.

                    Somewhere around here ... :



                    Edit:

                    crap, it doesn't say anything extra:

                    PMCx029 [LS Dispatch] (Core::X86::Pmc::Core::LsDispatch)
                    Read-write. Reset: 00h. Counts the number of operations dispatched to the LS unit. Unit Masks ADDed.
                    Bits 7:3 - Reserved.
                    Bit 2 - LdStDispatch: Load-op-Store Dispatch. Read-write. Reset: 0. Dispatch of a single op that performs a load from and store to the same memory address.
                    Bit 1 - StoreDispatch. Read-write. Reset: 0. Dispatch of a single op that performs a memory store.
                    Bit 0 - LdDispatch. Read-write. Reset: 0. Dispatch of a single op that performs a memory load.
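
                    One way to remove any doubt about what the symbolic event maps to is to request the counter by its raw encoding as well and check that both report the same count. Assuming the usual encoding of the PPR text above (event 0x29, unit mask bit 2, i.e. umask 0x04), that would look something like:

                    Code:
                    # symbolic name vs raw encoding - both should report the same count
                    perf stat -e ls_dispatch.ld_st_dispatch,cpu/event=0x29,umask=0x04/ -- xz -3c -T1 virtio-win-0.1.185.iso > /dev/null
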
                    Last edited by vladpetric; 21 August 2020, 10:59 AM.



                    • Originally posted by atomsymbol

                      Just some notes:
                      • x86 CPUs have memory copy instructions (REP MOVS)
                        • No x86 CPU features highly optimized memory-copy logic, i.e. implementing the copying logic inside the L1D/L2/L3 caches themselves
                        • Copying a 4 KiB page in just 10 CPU cycles (about 1.6 TB/s) is doable (in theory) if all of the 4 KiB data is already in L1D/L2/L3 caches - without utilizing any traditional load/store port during the process of copying the 4 KiB page
                          • The question is whether it is desirable for an actual non-theoretical x86 CPU to feature such a high-speed memcpy implementation
                      Not quite.

                      REP MOVS can in fact bypass L1D/L2/L3 and issue commands straight to the MMU for a block-to-block copy. Once you get to something like 4 KiB in size, it does raise the question whether the data should enter the caches at all or whether it should just be a direct MMU operation.

                      Originally posted by atomsymbol
                      • AVX memcpy on a CPU with 1 store port: 4 GHz * 32 bytes = 128 GB/s
                        • 128 GB/s is a relatively high number, i.e. even if the CPU has just 1 256-bit store port then memcpy() is unlikely to be a bottleneck in real-world code
                        • The 2nd store port in IceLake-derived CPUs speeds up memcpy() by up to 100%, but it is probable that memcpy isn't the primary reason for the existence of a 2nd store port in a CPU
                      • Dumping data from a random number generator is just a synthetic benchmark
                      The hardware random generator seems synthetic, but a website running SSL can hit the random number generator insanely hard; you do strike sections of code that stream from the hardware random generator to stores in volume. So it's not just a synthetic benchmark; it's a synthetic benchmark that replicates something that happens quite a bit in different server loads.

                      Originally posted by atomsymbol
                      I think I misunderstood the perf counter description. ls_dispatch.ld_st_dispatch most likely means the load(X)-store(X) pair, not the store(X)-load(X) pair.
                      LdStDispatch needs to exist in some x86 designs. Store(X)-load(X) is handled in the store/load buffers in the AMD design, so LdStDispatch has to exist to bypass the buffer processing, so that you can get a load value from L1 instead of the value that was just pushed into the store buffer. On Arm you don't need this bypass, because that processing is not done in the buffers, but it does affect store and load buffer sizes.

