Announcement

**oiaohm** · 12 March 2019, 02:37 AM

Originally posted by Weasel

You don't even know what a "partial load" is, holy christ. If anything, if x86_64 didn't zero extend to 64 bits, it would suffer from this all the time anytime it used a 32-bit register. They had to do it because 32-bit values are used a lot more than 64-bit values. So it was saved by a design decision that x86 (32-bit) doesn't suffer from in the first place. Both suffer if you load from a different offset (i.e. store in [rax] and read from [rax+1] when it's a word or dword or qword) but again that has nothing to do with your example which doesn't load ANYTHING partially.

No, learn to read. This has nothing to do with a "partial load".
The high penalty for a correctly-predicted dependent load is a bit of a concern though, as it is now more than 3× worse than for the older Yorkfield.
Every time store-to-load forwarding works there is a high penalty. If you are hitting store-to-load forwarding slowly enough, it will be masked by throughput. Basically, every time a store-to-load forward works you have something like a 5-cycle stall to mask over. That is still faster than going out to L1 and back, but if you had kept the value in registers you would have zero stall to mask over. Yes, if you get store-to-load forward stalls in the right order you can have the CPU twiddling its thumbs doing nothing for 20-40 percent of the time, and this can even stall out speculative execution.
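A minimal C sketch of the dependent store-then-load chain being argued about here. The function names are hypothetical, and `volatile` is used only to imitate a spilled value that has to round-trip through memory on every iteration. Whether the CPU forwards those loads, and what that costs, depends on the hardware; no timings are claimed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulator is an ordinary local: an optimizing compiler is free to keep
   it in a register for the whole loop. */
uint64_t sum_in_register(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

/* Accumulator is volatile: every iteration must load it from memory and
   store it back, so each load depends on the store just before it.
   This only imitates a register spill; it is not compiler output. */
uint64_t sum_through_memory(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    volatile uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}
```

Both functions return the same sum. The difference is only in how many memory operations sit on the critical path of the loop.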

Store-to-load forwarding works well in out-of-order processors as long as you are not overloading it. To avoid overloading it, you need enough operations that use only registers to mask the cost of the forwards. Store-to-load forwarding is not a magic bullet; it has its limitations. An 8-register ISA, which results in lots of extra memory operations, pressures the store-to-load forwarding feature to breaking point.
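The masking argument can be put as rough arithmetic. This is a toy model under loud assumptions, not a description of any real pipeline: one instruction issues per cycle, each forwarded load leaves `forward_latency - 1` idle cycles behind it, and each independent register-only operation fills exactly one idle cycle. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Toy model: how many idle cycles are left over after register-only work
   has been used to cover the latency of forwarded loads.
   Assumes single issue and perfectly schedulable register ops. */
unsigned exposed_stall_cycles(unsigned forwards,
                              unsigned forward_latency,
                              unsigned register_ops) {
    assert(forward_latency >= 1);
    unsigned idle = forwards * (forward_latency - 1);
    return idle > register_ops ? idle - register_ops : 0;
}
```

With the roughly 5-cycle figure mentioned earlier, 10 forwards need about 40 register-only operations to be fully hidden in this model; with only 8 available, 32 idle cycles remain exposed.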

I was asking about reducing memory operations to L1 mostly because it makes only a minor difference to performance whether a memory operation goes to L1 or to store-to-load forwarding. Both are slower than staying in registers, and both can stall you.

Basically, if you make an ISA and it cannot perform well with just L1 and no store-to-load forwarding, then adding a store-to-load forwarding feature will still stall out.

Also, store-to-load forwarding does not always work, for another reason: the CPU only looks so far ahead. You have value X stored, but the load back is far enough in the future that the CPU does not see it. If you had more registers, that value might have stayed in a register. Real-world code gets messy, and being short on registers hits you in many different ways:

1) Overloading the store-to-load forwarding system, so you end up stalling the CPU waiting for it because you cannot execute the next instructions until you get some value back out of it.
2) Excess operations to L1, because the store-to-load forwarding system missed that they could have been forwarded.

Both of these nightmares are reduced by having enough registers in the ISA so that everything is better balanced between memory and register operations.
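The second problem, a load that comes too late to be forwarded, can be sketched with a toy model in C. This is an illustration under stated assumptions, not how any particular CPU is built: `WINDOW`, `store_window`, `window_store` and `window_load` are hypothetical names, and the capacity of 4 is far smaller than a real store buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 4  /* hypothetical capacity, chosen small for illustration */

typedef struct {
    uintptr_t addr[WINDOW];
    uint64_t  value[WINDOW];
    size_t    count;  /* valid entries, oldest at index 0 */
} store_window;

/* Record a store. When the window is full the oldest entry is dropped,
   modelling a store that has drained to L1 and can no longer be forwarded. */
void window_store(store_window *w, uintptr_t addr, uint64_t value) {
    assert(w != NULL);
    if (w->count == WINDOW) {
        for (size_t i = 1; i < WINDOW; i++) {
            w->addr[i - 1]  = w->addr[i];
            w->value[i - 1] = w->value[i];
        }
        w->count--;
    }
    w->addr[w->count]  = addr;
    w->value[w->count] = value;
    w->count++;
}

/* Try to satisfy a load from the window, newest matching store first.
   Returns true if forwarded; false means the load would have to go to L1. */
bool window_load(const store_window *w, uintptr_t addr, uint64_t *out) {
    assert(w != NULL && out != NULL);
    for (size_t i = w->count; i-- > 0; ) {
        if (w->addr[i] == addr) {
            *out = w->value[i];
            return true;
        }
    }
    return false;
}
```

In this model a store followed by two other stores and then a load of the same address is forwarded, while the same load after four intervening stores misses the window. More spills mean more intervening stores, which is the pressure described above.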

Yes, if you are not pushing the store-to-load forwarding system hard, it can do more speculative store recording, which reduces problem 2. An overloaded store-to-load forwarding system ends up using most of its capacity on short-range operations.

This is the domino effect.

1) L1 has excess operations because store-to-load forwarding is not working effectively.
2) Store-to-load forwarding is not working effectively due to excess memory operations.
3) The excess memory operations are there because the ISA does not have enough registers, so the compiler went nuts doing memory operations to make up for the lack of registers, sometimes to the point of not using registers at all for some areas of code.

MSVC and GCC will both do this in x86 32-bit mode: they generate code that does only memory operations where, in 64-bit mode, they would have done everything in registers. So the MSVC/GCC generated code in 32-bit mode nicely tanks the store-to-load forwarding system.

My point about excess L1 operations was not wrong. I was just skipping over the store-to-load forwarding problem in the middle and its limitations, but the first sign of this is either low performance or excess L1 operations.

**Weasel** · 12 March 2019, 01:02 PM

Yeah, not going to waste more of my time. You're mixing compilers into what is basically purely runtime, and using random words like "stall" a million times when you don't even understand what a stall is. The CPU is designed for fast access to memory, because it needs it, not just because of register spills, it's just how real world code works. That's why it speculates it in the first place.

Latency is higher, but throughput is rarely a problem and usually the same. And that's what matters in most workloads.

**oiaohm** · 20 April 2019, 09:36 PM

Originally posted by Weasel

Yeah, not going to waste more of my time. You're mixing compilers into what is basically purely runtime, and using random words like "stall" a million times when you don't even understand what a stall is. The CPU is designed for fast access to memory, because it needs it, not just because of register spills, it's just how real world code works. That's why it speculates it in the first place.

Latency is higher, but throughput is rarely a problem and usually the same. And that's what matters in most workloads.

I see you did not get it. Hitting the caches this hard gives both high latency and bad throughput. You cannot have throughput if the data cannot make it up through the caches because other items are filling the space.

By the way, it is the Intel developers of the CPU who define it as a stall, not me.

Wine Developers Release Hangover Alpha To Run Windows x86_64 Programs On 64-Bit ARM
