L1 Terminal Fault - The Latest Speculative Execution Side Channel Attack


  • #61
    @oiaohm: Yeah, again you're the master of hardware, and Intel/AMD and all the others clearly don't know what they're doing, with the branch predictor hindering their performance for 3 decades (while even Wikipedia claims otherwise, with a reference). I don't CARE about what YOU consider a single-threaded workload or not, or what RISC-V considers it. Not everything can be parallelized (at least not trivially).

    Here's a simple situation for you: you have, literally, one thread in your application. You programmed it and it has one thread literally.

    Find ways to speed it up with branch prediction. I don't give a FUCK about what other workload you run on your PC, or how "efficient overall" it is. I want this one single-threaded app to finish fast (which means less time!). Get it?

    Every piece of bullshit you said specifically requires programming for more threads, but there's only 1 thread in this situation. Deal with the facts.



    Real example: you use a super compression algorithm which has zero parallelization potential for decompressing (any parallelization decreases compression ratio and we want the maximum possible so it must be single-threaded!). For example, LZMA with only 1 block for maximum compression ratio. I want this decoded as fast as possible. Tell me again how you do it without a branch predictor.

    The decoder has only 1 thread since literally every bit/byte depends on the previous data (so it can't be done in parallel). I don't give a flying fuck about what other workload there is on my PC at the same time as the decoder. All I care about is the amount of time between the moment I start the decoder and the moment it finishes.
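
    To make the data dependency concrete, here is a minimal sketch in C. It is not real LZMA, just a toy adaptive bit decoder written for illustration: every iteration reads decoder state (range, code, prob) that the previous iteration wrote, so there is nothing independent to hand to a second thread.

    #include <stdint.h>
    #include <stddef.h>

    /* Toy adaptive bit decoder (NOT real LZMA; illustration only).
     * The probability model and the range state are updated by every
     * decoded bit, so bit N cannot start before bit N-1 has finished. */
    size_t toy_decode(const uint8_t *in, size_t in_len,
                      uint8_t *out, size_t out_len)
    {
        uint32_t range = 0xFFFFFFFFu, code = 0;
        uint32_t prob = 1024;                        /* 11-bit probability, starts at 1/2  */
        size_t ip = 0, op = 0;

        while (op < out_len) {
            uint32_t bound = (range >> 11) * prob;   /* split range by current probability */
            int bit;
            if (code < bound) {                      /* data-dependent branch: which side  */
                bit = 0;                             /* we take depends on the compressed  */
                range = bound;                       /* stream itself                      */
                prob += (2048 - prob) >> 5;          /* adapt model toward 0               */
            } else {
                bit = 1;
                code -= bound;
                range -= bound;
                prob -= prob >> 5;                   /* adapt model toward 1               */
            }
            if (range < (1u << 24) && ip < in_len) { /* renormalise: pull in the next byte */
                range <<= 8;
                code = (code << 8) | in[ip++];
            }
            out[op++] = (uint8_t)bit;                /* one decoded bit per iteration      */
        }
        return op;
    }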



    • #62
      Originally posted by audir8 View Post

      L1TF is in the same category of bugs as Meltdown and Spectre, so to believe what you're saying, this whole class of bugs had to have been planted by the NSA a decade ago with some cooperation from multiple vendors or their employees, only to be found recently and in this case be self-disclosed now. Except for certain variants of Spectre, all the other flaws only show up in Intel chips.

      I guess you can believe what you want. I think it's much more likely that Intel did have 12 year olds designing processors.

      carewolf SGX isn't only for DRM; it can just be considered an extension of TPM, which would make it good security if done well.
      It is 'security' that prevents the owners of computers (us) from reading what signed applications put in memory. It could be used to enhance security on hosted servers, but that is not what it is designed for, and it isn't good at it. It is good at DRM, because that is what it was designed for.



      • #63
        Originally posted by Weasel View Post
        Here's a simple situation for you: you have, literally, one thread in your application. You programmed it and it has one thread literally.
        Except this is not exactly true.

        Originally posted by Weasel View Post
        Real example: you use a super compression algorithm which has zero parallelization potential for decompressing (any parallelization decreases compression ratio and we want the maximum possible so it must be single-threaded!). For example, LZMA with only 1 block for maximum compression ratio. I want this decoded as fast as possible. Tell me again how you do it without a branch predictor.

        The decoder has only 1 thread since literally every bit/byte depends on the previous data (so it can't be done in parallel). I don't give a flying fuck of what other workload there is on my PC at the same time as the decoder. All I care is the amount of time it takes from the moment I start the decoder and the moment it finishes.
        [Reference: "FPGA hardware implementation of the LZMA compression algorithm" (ResearchGate) — software-based LZMA (Lempel-Ziv-Markov chain algorithm) lossless compression is very slow and consumes too much CPU.]

        For particular algorithms with zero parallelization, the fastest way to implement them on an x86 system is not to use the CPU at all but an FPGA chip; this is why the 4.19 Linux kernel is getting a generic FPGA subsystem. Please note the FPGA version of LZMA does not have any branch prediction in it. The LZMA algorithm was designed before we had branch prediction.

        LZMA is one of the things that causes the branch predictor in x86 to screw up, so instead of improving performance it hinders it. LZMA is one of those cases where you could gain performance by disabling the branch predictor and so having more memory bandwidth; LZMA commonly bottlenecks on memory transfers.

        I think it's funny that you picked one of the algorithms that shows the problem with the idea of speculative execution while attempting to argue for speculative execution.

        Originally posted by Weasel View Post
        Find ways to speed it up with branch prediction.
        This is the tunnel vision problem. Not every algorithm can be sped up with branch prediction; for some, like LZMA, all you do by using branch prediction is slow them down, wasting the memory bandwidth the algorithm needs on failed predictions.

        The branch predictor is a double-edged sword. It helps and hinders, and in fact it hinders more often than most people know. A branch predictor helps performance when it gets its guesses right; on something like LZMA, the CPU branch predictor is wrong about 70 percent of the time. There is no real way to tweak the branch predictor to fix this, because no two data streams LZMA is compressing/decompressing produce the same pattern of code paths through the algorithm. So there are some algorithms you can never make a branch predictor work for, and for those algorithms the only thing branch prediction does is hinder performance.
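
        A minimal sketch of the difference between a predictable and a data-dependent branch (toy C code, nothing to do with LZMA itself; exact timings depend on the CPU): the same branchy loop runs over random data, where the branch is close to a coin flip, and over sorted data, where the predictor almost always hits.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (1 << 20)

        /* Same branchy loop, two inputs: random data makes the branch nearly
         * unpredictable, sorted data makes it almost perfectly predictable. */
        static long sum_over_threshold(const unsigned char *a, size_t n)
        {
            long sum = 0;
            for (size_t i = 0; i < n; i++)
                if (a[i] >= 128)                  /* the branch under test */
                    sum += a[i];
            return sum;
        }

        static int cmp(const void *x, const void *y)
        {
            return *(const unsigned char *)x - *(const unsigned char *)y;
        }

        int main(void)
        {
            unsigned char *a = malloc(N);
            for (size_t i = 0; i < N; i++)
                a[i] = (unsigned char)(rand() & 0xFF);

            clock_t t0 = clock();
            long s1 = sum_over_threshold(a, N);   /* mostly mispredicted */
            clock_t t1 = clock();

            qsort(a, N, 1, cmp);
            clock_t t2 = clock();
            long s2 = sum_over_threshold(a, N);   /* mostly predicted    */
            clock_t t3 = clock();

            printf("random: %ld (%ld ticks)  sorted: %ld (%ld ticks)\n",
                   s1, (long)(t1 - t0), s2, (long)(t3 - t2));
            free(a);
            return 0;
        }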

        A thread management engine in the CPU only helps. So yes, a thread management engine on an in-order core can make your LZMA processing complete faster, by reducing the context-switching overhead of reading or writing data from storage (you know, those kernel threads) and removing the interruptions from the OS scheduler. In fact, even on a CPU with speculative execution this in-core engine would help. There is no downside other than the silicon consumed, and the operating system has to be designed to support it. So it does not matter what algorithm you throw at a thread management engine; there are no algorithms that will make it perform worse.

        There is quite a long list of algorithms commonly used in compression that are not speculative-execution-friendly, yet we attempt to run them on speculative execution CPUs.

        As the number of cores in an x86 chip has increased, the amount of memory bandwidth per core has decreased, and this also makes speculative execution failures more costly. Core counts are increasing faster than memory bandwidth. This is the problem: as hardware gets more multi-threaded in design and memory bandwidth gets more limited, the advantages of speculation really do start disappearing, until you get to the point where there is no advantage at all to speculative execution, even for algorithms that suit it. You have to remember there are quite a few algorithms that don't suit speculative execution at all.

        An ideal CPU would let you control which code uses branch prediction and which doesn't, so when you have single-threaded code that is incompatible with branch prediction you could turn it off and not waste the memory bandwidth that the single-threaded code needs to run as fast as possible.

        We really do need to take a step back and have a close look at the general processing workload; we could be in a position where the majority of what we are doing uses algorithms that don't suit speculative execution. It's a shock to a lot of people who argue for speculative execution that there are single-threaded algorithms with no performance gain from speculative execution, and worse, that quite a number of those algorithms suffer a performance loss because of it. Surprising, right? Commonly used algorithms that are anti-speculative-execution.

        One of the issues with the Pentium 4 was that when it made a failed branch prediction it took way too long to correct: it was very fast when the prediction was right and very slow when it was wrong. The problem is that the algorithms that struggled on the Pentium 4 only had their problem hidden by improving the misprediction recovery time; that does not fix the wasted memory bandwidth. So something Intel only half fixed after the Pentium 4 has been hitting us all this time. And as the number of x86 cores goes up, the problems of the Pentium 4 are returning, because per-core memory bandwidth is going down.



        • #64
          Originally posted by oiaohm View Post
          Except this is not exactly true.
          But it is, and you can do it, and a lot of applications do it (obviously it may have more than 1 thread, but those are dormant or doing insignificant stuff, so it's irrelevant). In fact, SMT even works *because* this is a thing.

          Obviously, most of what you said is just bullshit because you really have no idea how smart and complex branch predictors are these days. Nobody gives a crap about the Pentium 4; that's the failed CPU that even used branch hint prefixes, which is a retarded idea to begin with. RISC-like garbage. What matters is the predictability of branches, not their static prediction. It cannot be done at compile time; it needs to be done at runtime, depending on which branches or paths were taken. Yes, it takes stuff like that into account, it's nowhere near as simple as you make it out to be.

          And why does it matter if it fails the prediction? It's not any slower than not predicting at all, speaking of modern CPUs, not your P4. Even a 70% failure rate would not be detrimental, because 30% is still better than zero predictions. wtf is wrong with you. 30% of the time, it's not waiting where an in-order CPU would wait for the result.

          Do you even realize just how LONG the pipelines in modern CPUs are? Imagine reaching a branch with an in-order CPU, with only 5% of your pipeline filled, and then you have to wait for the result. What a waste of time. If anything, branch predictors become more and more important the longer our pipelines get, and they do get longer because everyone wants "single threaded performance" out of a CPU, not "massively threaded performance", because then they'd be using a GPU.
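
          Rough arithmetic to make the "30% is still better than zero" point concrete. The 14-cycle penalty and the accuracy figures below are assumed, illustrative numbers, not measurements of any particular core, and the toy model deliberately ignores the memory-bandwidth side of the argument:

          #include <stdio.h>

          /* Expected extra cycles per conditional branch under two toy models:
           *  - predicting CPU: pays a flush penalty only on a mispredict
           *  - non-predicting CPU: always waits for the branch to resolve
           * Penalty and resolve latency are assumed, illustrative numbers. */
          int main(void)
          {
              const double penalty = 14.0;   /* assumed mispredict flush cost (cycles) */
              const double resolve = 14.0;   /* assumed wait if we never predict       */
              const double accuracy[] = { 0.30, 0.50, 0.70, 0.95 };

              for (int i = 0; i < 4; i++) {
                  double p = accuracy[i];
                  double with_prediction = (1.0 - p) * penalty;  /* expected cost */
                  printf("accuracy %2.0f%%: ~%4.1f cycles/branch predicted vs %4.1f stalling\n",
                         p * 100.0, with_prediction, resolve);
              }
              return 0;
          }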
          Last edited by Weasel; 17 August 2018, 04:54 PM.



          • #65
            Originally posted by Weasel View Post
            And why does it matter if it fails the prediction? It's not any slower than not predicting at all, speaking of modern CPUs not your P4. Even 70% failure rate would not be detrimental, because 30% is still better than zero predictions. wtf is wrong with you. 30% of the time, it's not waiting where an in-order CPU would wait for the result.
            The 70% of failures have still caused reads from cache and memory. Being right 30% of the time does not help you when you are out of memory bandwidth because of the 70% of junk and are stalled waiting on cache lines. Xeon Phi used only in-order cores because once you get low on memory bandwidth, speculative execution hurts.

            Originally posted by Weasel View Post
            Do you even realize just how LONG the pipelines in modern CPUs are? Imagine reaching a branch with an in-order CPU, with only 5% of your pipeline filled, and then you have to wait for the result. What a waste of time. If anything branch predictors are more and more important the longer our pipelines become, and they do because everyone wants "single threaded performance" out of a CPU, not "massively threaded performance" because then they'd be using a GPU.
            RISC-V in-order designs have at most a 5-stage pipeline, and the RISC-V out-of-order design is also at most 5 stages. Wait, you are talking about x86 out-of-order cores being about 14 stages deep. Beyond 14 stages the branch predictor doesn't help you any more. CPU pipelines reached their maximum useful length many years ago.

            So yes, one of the things you need to keep down in an in-order core is pipeline length. The Pentium 4 failure was caused by thinking you could keep on growing the pipeline: Intel got to 31 stages before they worked out there was no way to fix this and brought pipeline length back to 14. Yes, 2004/2006 was peak pipeline length; by 2007 Intel was back to a 14-stage pipeline. There is absolutely no point to a pipeline longer than 14 stages in a general-purpose processor; any longer than 14 and branch prediction ceases to help performance.

            When you have a pipeline as short as the BOOM RISC-V out-of-order processor, branch prediction starts showing very little gain. Yes, you gain a lot by running more instructions at a time between branch points; that is a performance gain whose worst case is equal to not doing anything at all. But speculative execution of branches, where you don't know whether you will need the results or not, is looking more and more like a way to waste power without gaining much.

            It turns out there is quite a narrow window where branch prediction works at all. Branch prediction only really works with a pipeline between 8 and 14 stages. With pipelines shorter than 8, you might as well pack as many instructions as you can into each pipeline cycle between branches and take the stall on the branches; and pipelines longer than 14 are worthless no matter what you do, because the stalls are too huge just from the pipeline length.

            From current test results, it is looking like branch prediction only masks the fact that you have made the CPU pipeline too long, and it is only able to mask that problem while you have enough memory bandwidth and your pipeline length stays under 14. Algorithms like compression that are memory-bandwidth-sensitive are harmed by speculative branch prediction.

            This is the problem: the whole idea of branch prediction for speed may be wrong. Processing multiple instructions per pipeline cycle is absolutely right; attempting to fill those slots by predicting which way branches go is looking absolutely wrong, so we need to look at other methods of filling the CPU at branch points, like processing other threads. This is why those developing the next generation of CPUs are becoming more worried that there is nowhere that speculative branch prediction works right.

            Weasel, you missed that pipeline length stopped growing and x86 has stalled at a particular length. You also miss that POWER, RISC-V and ARM commonly have shorter pipelines. As you said, branch prediction came into use because of pipeline length, as an attempt to fix that problem. If you are making a CPU with shorter pipelines, does branch prediction make any sense any more? It seems not.



            • #66
              Originally posted by oiaohm View Post
              The 70% of failures have still caused reads from cache and memory. Being right 30% of the time does not help you when you are out of memory bandwidth because of the 70% of junk and are stalled waiting on cache lines. Xeon Phi used only in-order cores because once you get low on memory bandwidth, speculative execution hurts.
              What? You make no sense. So you're saying it's better to wait rather than utilize that memory bandwidth? Because no matter how you slice it, even if the bandwidth is saturated, it's still no worse than waiting in the first place. I mean, you'd be waiting anyway so what difference does it make.

              Xeon Phi was in-order because out-of-order requires a massive number of transistors compared to SIMD, so they wanted to drop it and spend the resources elsewhere, since it takes such a huge chunk of the die. I don't think you understand basic physics at this point.

              Why the fuck do you think we have SIMD at all if out-of-order could just as well parallelize scalar operations huh? It takes much more resources to do two scalar operations in parallel via out-of-order processing, than to do them via SIMD. It scales quadratically while SIMD scales linearly in number of transistors/resources used.

              In fact by far the largest components of a CPU die are the OOO engine and the caches, both of which are needed for single-threaded performance.

              Xeon Phi was not designed for single-threaded workloads, so yeah. Guess why it's in-order?

              Originally posted by oiaohm View Post
              RISC-V in-order designs have at most a 5-stage pipeline, and the RISC-V out-of-order design is also at most 5 stages. Wait, you are talking about x86 out-of-order cores being about 14 stages deep. Beyond 14 stages the branch predictor doesn't help you any more. CPU pipelines reached their maximum useful length many years ago.

              So yes, one of the things you need to keep down in an in-order core is pipeline length. The Pentium 4 failure was caused by thinking you could keep on growing the pipeline: Intel got to 31 stages before they worked out there was no way to fix this and brought pipeline length back to 14. Yes, 2004/2006 was peak pipeline length; by 2007 Intel was back to a 14-stage pipeline. There is absolutely no point to a pipeline longer than 14 stages in a general-purpose processor; any longer than 14 and branch prediction ceases to help performance.
              There's nothing wrong with a longer pipeline, except the fact it wastes tons of space and resources, as I said above.

              It doesn't decrease performance by itself, it just does it indirectly, because you could've used that space for other things (SIMD or more cores or whatever), or simply lower power. Note that Skylake has a more-than-200-entry queue for execution of OOO instructions.

              And ARM, AFAIK, has an even longer pipeline. I don't care about RISC-V at this point; you keep talking about lab experiments and prototypes, and I don't care about any of that stuff. Reality is different.
              Last edited by Weasel; 17 August 2018, 08:25 PM.



              • #67
                Originally posted by Weasel View Post
                What? You make no sense. So you're saying it's better to wait rather than utilize that memory bandwidth? Because no matter how you slice it, even if the bandwidth is saturated, it's still no worse than waiting in the first place. I mean, you'd be waiting anyway so what difference does it make.
                It makes quite a bit of difference, and that is the problem, particularly when your memory bandwidth is saturated. The problem is you will be waiting on the memory system, whether this is an in-order or out-of-order core of any form. Waiting at the branches, and so keeping memory requests to only what is required, can increase your performance by 200 times on certain single-threaded workloads, because you are not stalled as long, because you do not need as much memory bandwidth.

                You are right that you will have to wait when you are at max memory bandwidth. But having failed speculative execution put more pressure on the memory system makes you wait more than the performance you gain from speculative execution. You are not thinking about it right.

                If you get 30 percent right, you have about 2/3 more memory traffic caused by the wrong paths. If 1/3 would have saturated the memory system anyhow, you are well and truly in the weeds performance-wise, as the complete memory system stalls your processing; this is why you see insane performance boosts when you avoid this. Speedups of 200 to 2000 times on particular algorithms have been seen on CPUs that have instructions to say "do not use speculative execution here, because we know it is not going to work well".

                While you have surplus memory bandwidth to spare, speculative execution can be a good idea in some cases. Once you are short on memory bandwidth, speculative execution becomes really bad, and makes being stalled at branch points in the code look like not a big problem by comparison. This is why you need the ability to turn speculative execution off.
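
                A rough sketch of this bandwidth argument in C. All of the numbers (per-core bandwidth, demand, wrong-path share) are assumed, illustrative inputs; real wrong-path traffic depends entirely on the workload and the core:

                #include <stdio.h>

                /* Toy model of the bandwidth argument: if a fraction of the
                 * cache-line fetches issued by the core belong to wrong-path
                 * (mis-speculated) work, the useful bandwidth left over shrinks.
                 * All numbers are assumed, illustrative inputs. */
                int main(void)
                {
                    const double bus_gbps     = 20.0;  /* assumed memory bandwidth per core */
                    const double demand_gbps  = 15.0;  /* bandwidth the real work needs     */
                    const double wrong_path[] = { 0.0, 0.3, 0.7 }; /* share of junk fetches */

                    for (int i = 0; i < 3; i++) {
                        double junk   = wrong_path[i];
                        double useful = bus_gbps * (1.0 - junk);  /* left for real work */
                        printf("wrong-path share %.0f%%: %.1f GB/s useful vs %.1f GB/s needed -> %s\n",
                               junk * 100.0, useful, demand_gbps,
                               useful >= demand_gbps ? "fine" : "stalled on memory");
                    }
                    return 0;
                }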


                Originally posted by Weasel View Post
                Xeon Phi was in-order because out-of-order requires massive amount of transistors compared to SIMD, so they wanted to drop it and increase resources elsewhere since it takes such a huge chunk of the die. I don't think you understand basic physics at this point.
                Not exactly true. The RISC-V BOOM shows that correctly done out-of-order designs are about the same size as their in-order relatives. There is a historic way of designing out-of-order that is quite silicon-expensive, but that is not the only way you can design it.

                Originally posted by Weasel View Post
                Why the fuck do you think we have SIMD at all if out-of-order could just as well parallelize scalar operations huh? It takes much more resources to do two scalar operations in parallel via out-of-order processing, than to do them via SIMD. It scales quadratically while SIMD scales linearly in number of transistors/resources used.
                SIMD turns out to take more transistors than doing vector the RISC-V way, and the vector hardware in RISC-V gets recycled for out-of-order. If Intel were not using a horrible old-school design they would not have that problem.

                Originally posted by Weasel View Post
                And ARM, AFAIK, has even longer pipeline. I don't care about RISC-V at this point, you keep talking about lab experiments and prototypes, and I don't care about any of that stuff. Reality is different.

                ARM has reduced its pipeline length because they found the same thing Intel did: beyond 14 stages, pipeline length in fact hinders performance. A lot of third-party ARM designs attempt to stay under a 10-stage pipeline to get higher performance than the reference design.

                Both Intel and ARM tried making pipelines longer, only to find it does not work.

                Originally posted by Weasel View Post
                There's nothing wrong with a longer pipeline, except the fact it wastes tons of space and resources, as I said above.
                Be it x86, ARM, POWER... a pipeline longer than 14 stages has been documented over and over again as not giving any more performance and in fact hindering it. Yes, with a pipeline longer than 14 you are wasting silicon on something that does not give you any more performance, even with all the speculative execution tricks.

                The A73 runs faster than the A72. The A72 has a 15-stage pipeline; the A73 has 11 stages, and 1/5 of the power is saved. Everyone making CPUs these days is avoiding pipelines longer than 14; even one stage over, as in the A72, in fact hurts.

                Originally posted by Weasel View Post
                It doesn't decrease performance by itself, it just does it indirectly
                Pipeline length, like it or not, is documented over and over again as directly hindering performance once it gets past a particular length. Issues like pipeline bubbles and other things that cause stalls have bigger and bigger effects the longer the pipeline gets. So a pipeline has a maximum useful size; any bigger and you gain nothing, and in fact you lose power efficiency as well as performance to the natural issues of the pipeline design.

                Each stage in a pipeline is clock-locked. This means the longer the pipeline gets, the more clock cycles a flush consumes, and so the larger the stall before the CPU gets back to normal processing. So it is not indirect; this is direct harm.

                It's fine to grow a pipeline if the extra stages in fact increase performance. The problem is that 14 stages is enough space for every known performance-enhancing trick you can place in a pipeline.

                Originally posted by Weasel View Post
                because you could've used that space for other things (SIMD or more cores or whatever), or simply lower power. Note that Skylake has more than 200 entry queue for execution of OOO instructions.
                Out-of-order execution does not mean you have to do speculative execution. Think about it: this is only a 200-entry queue. Let's say you fill it up so that 70 percent of the instructions are ones whose results you will be throwing away because of speculative execution; you could be really hurting things. So not only does speculative execution hurt when you are out of memory bandwidth, it can in fact hurt overall performance, particularly if there was another thread you could have been processing whose results you would have used 100 percent.

                If you focus on making a CPU good for pure single-threaded workloads, you make it quite bad at multi-threaded workloads, and if you are not careful you make a CPU that eats itself out of memory resources.

                An OOO engine of some form is needed, so the CPU can do stuff when there is memory latency and other delays.

                Originally posted by Weasel View Post
                caches, both of which are needed for single-threaded performance.
                This is a question of how many and how big.

                High-performance RISC-V has L1 and L3 with no L2. The L3 is also used to perform atomic operations between CPU cores, which improves multi-threading. Yet it has not shown any major performance loss with cores doing single-threaded workloads, mostly because the other change is allowing single bytes to be requested from the L3 instead of complete cache lines, which reduces memory bandwidth consumption. So it is possible that caches are so huge because we have completely screwed up how they should be done.

                L2 was added because L3 could not provide enough bandwidth. If you go back to an old 8086 XT, all it really has is an L3 and an L1. Maybe we have taken a complete wrong turn with the idea of pairs of cores sharing an L2 and then the L2 accessing the L3.



                • #68
                  Originally posted by oiaohm View Post
                  It makes quite a bit of difference, and that is the problem, particularly when your memory bandwidth is saturated. The problem is you will be waiting on the memory system, whether this is an in-order or out-of-order core of any form. Waiting at the branches, and so keeping memory requests to only what is required, can increase your performance by 200 times on certain single-threaded workloads, because you are not stalled as long, because you do not need as much memory bandwidth.

                  You are right that you will have to wait when you are at max memory bandwidth. But having failed speculative execution put more pressure on the memory system makes you wait more than the performance you gain from speculative execution. You are not thinking about it right.

                  If you get 30 percent right, you have about 2/3 more memory traffic caused by the wrong paths. If 1/3 would have saturated the memory system anyhow, you are well and truly in the weeds performance-wise, as the complete memory system stalls your processing; this is why you see insane performance boosts when you avoid this. Speedups of 200 to 2000 times on particular algorithms have been seen on CPUs that have instructions to say "do not use speculative execution here, because we know it is not going to work well".

                  While you have surplus memory bandwidth to spare, speculative execution can be a good idea in some cases. Once you are short on memory bandwidth, speculative execution becomes really bad, and makes being stalled at branch points in the code look like not a big problem by comparison. This is why you need the ability to turn speculative execution off.
                  Lol man, you can't even saturate the entire memory bandwidth with just 1 core (thread).

                  Originally posted by oiaohm View Post
                  Not exactly true. The RISC-V BOOM shows that correctly done out-of-order designs are about the same size as their in-order relatives. There is a historic way of designing out-of-order that is quite silicon-expensive, but that is not the only way you can design it.
                  Alright, since you don't seem to get it.

                  Show me RISC-V that can compete with whatever current x86 CPU (at performance and same class), Xeon Phi or not, doesn't matter.

                  If you don't, then it's meaningless. I don't give a shit about "theory" or claims or what they claim or babbling or "lab" or "prototypes". In "theory", we'd have 50 GHz by now, or so people said in the 90s. ffs dude.

                  So, put up or shut up about RISC-V. Seriously. (and energy efficiency is completely *useless* in our current discussion, don't even bring it up and then "extrapolating" based on that, that's not how reality works)

                  Originally posted by oiaohm View Post
                  SIMD turns out to take more transistors than doing vector the RISC-V way, and the vector hardware in RISC-V gets recycled for out-of-order.
                  SIMD is vector so you're just using random buzzwords right now to refer to the same thing and appear as if you have a point.

                  Originally posted by oiaohm View Post
                  Out-of-order execution does not mean you have to do speculative execution. Think about it: this is only a 200-entry queue. Let's say you fill it up so that 70 percent of the instructions are ones whose results you will be throwing away because of speculative execution; you could be really hurting things.
                  Waiting isn't any better.

                  Throwing 70% of the time still means you execute 30% of the time.
                  Waiting 100% of the time means you execute 0% of the time.

                  I wonder which one is better, indeed.

                  There's absolutely no way you will fill a 200-entry queue without speculative execution. Have you ever looked at 99.9% of HLL code? There are ifs and whiles and other loops all over the place. Branches are probably every 20 instructions at most, if not sooner.

                  This is called reality. General-purpose CPUs are not for FORTRAN; that's what GPUs or GPU-like supercomputer CPUs are for. Reality is that most software for normal users is FULL of branches (and yes, a loop is a conditional branch!!!).

                  Oh by the way, here's a post from bridgman (AMD): https://www.phoronix.com/forums/foru...53#post1041653

                  Here's an excerpt for you: As cores get a bit wider each year the number of workloads where SMT can help goes up a bit as well.

                  Read the entire thing and you'll realize why that happens: without speculative execution you'd need about 4x logical cores per physical core to even attempt to use all those idle execution units. Which is useless when you only have a single thread (see below).

                  Think about it until it sinks in.

                  Originally posted by oiaohm View Post
                  So not only does speculative execution hurt when you are out of memory bandwidth, it can in fact hurt overall performance, particularly if there was another thread you could have been processing whose results you would have used 100 percent.
                  I told you already, I don't care about another thread, I'm talking about single-threaded performance.
                  Last edited by Weasel; 18 August 2018, 08:21 AM.



                  • #69
                    Originally posted by Weasel View Post
                    Lol man, you can't even saturate the entire memory bandwidth with just 1 core (thread).
                    On an Intel x86 you can, because the caches are designed for a 90 percent hit rate. So if you are doing something that has a 70 percent cache miss rate, the CPU's miss rate multiplied by the size of Intel's cache lines (the size of the transfers done between memory and the caches) is greater than your memory speed.

                    Originally posted by Weasel View Post
                    Show me RISC-V that can compete with whatever current x86 CPU (at performance and same class), Xeon Phi or not, doesn't matter.
                    [Video: presentation by Tony Brewer of Micron Technology at the RISC-V Workshop in Barcelona, hosted by the Barcelona Supercomputing Center, May 9, 2018.]

                    Here you see, for big data, RISC-V beating your Xeon, Xeon Phi and GPU processors, and not by a small margin. Entirely by focusing on making sure the memory system is effective: using a cache that has a lower cost when you have a high miss rate, with cache lines reduced down to 8 to 16 bytes. All Intel x86 parts use 64-byte cache lines.

                    You should not be able to saturate the entire memory bandwidth with 1 core, but with Intel's design you can, and it is all because of how the cache is designed.
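
                    A rough sketch of the cache-line arithmetic being argued here. The miss rate, the number of accesses and the bytes actually used per miss are assumed, illustrative inputs, not figures for any real chip:

                    #include <stdio.h>

                    /* Toy arithmetic for the cache-line argument: on a miss, the memory
                     * system transfers a whole line even if the program only wanted a few
                     * bytes, so traffic = misses * line size.  Inputs are assumed numbers. */
                    int main(void)
                    {
                        const double accesses  = 1e9;   /* memory accesses issued                   */
                        const double miss_rate = 0.70;  /* the 70% miss rate under debate           */
                        const double useful    = 8.0;   /* bytes the program actually used per miss */
                        const int    lines[]   = { 64, 16, 8 };   /* bytes per cache line           */

                        for (int i = 0; i < 3; i++) {
                            double traffic = accesses * miss_rate * lines[i];   /* bytes moved       */
                            double needed  = accesses * miss_rate * useful;     /* bytes really used */
                            printf("%2d-byte lines: %.1f GB moved for %.1f GB of useful data (%.0fx)\n",
                                   lines[i], traffic / 1e9, needed / 1e9, traffic / needed);
                        }
                        return 0;
                    }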

                    Originally posted by Weasel View Post
                    SIMD is vector so you're just using random buzzwords right now to refer to the same thing and appear as if you have a point.
                    No. RISC-V has Cray-style vector, and one of the prototypes is a SIMD implementation of roughly the kind Intel did. So they are two different things when talking about RISC chips. This is just you not knowing the terminology.


                    Originally posted by Weasel View Post
                    Throwing 70% of the time still means you execute 30% of the time.
                    Waiting 100% of the time means you execute 0% of the time.
                    That is straight-up ignoring that you have a CPU with caches designed for a 90% hit rate, so not waiting and attempting to execute anyway results in getting into cache hell.

                    Also, it is not waiting 100% of the time when you don't do speculative execution.

                    Originally posted by Weasel View Post
                    There's absolutely no way you will fill a 200 entry queue without speculative execution. Have you ever looked at 99.9% of HLL code? There's ifs and while and other loops all over the place. Branches are probably every 20 instructions at most if not sooner.
                    That is wrong. Cray-style barrel multi-threading, where the CPU is not just processing 1 thread at a time, can fill a 200-entry queue without speculative execution, but this is not the only way.

                    There are also ways of filling a 200-entry queue with a different optimisation. Something you have not considered, because you have not looked at in-order cores: does every branch cause a stall? The answer is no, it does not. When you build a pure out-of-order core without speculative execution, once the information required for a branch is resolved you can proceed down that branch even if everything from before the branch has not been processed yet.

                    Originally posted by Weasel View Post
                    Reality is most software for normal users is FULL of branches (and yes a loop is a conditional branch!!!).
                    The question you have not asked is how much overhead that causes in a CPU with out-of-order execution but lacking speculative execution. Please note pipeline length is also a factor here: 20 instructions between branches and a 14-stage pipeline, hmm, problem. This is where the pipeline is the problem.

                    Remember, a 14-stage pipeline means there can be a 14-clock-cycle delay between when an instruction enters and when its result comes out. So take your basic for loop:
                    for (c = 0; c < 10; c++) {
                    }
                    There could be a 14-clock-cycle delay between when c++ gets processed and when you can compare whether c is less than 10. So quite a large stall. 20 instructions with multi-instruction processing, say roughly 2 instructions at a time, is 10 clock cycles long. With a 14-clock-cycle delay from a pipeline that long, you are pretty much forced to use speculative execution, because you cannot get the result back out of the pipeline quickly enough for the branch. If the c++ is at the start of the loop and the c<10 is at the end of the loop, so in two pipeline processing groups, this becomes not 14 but 27 cycles. Yes, even if it is not 20 instructions but 20 clocks between branches, you are in trouble; pipeline length has forced you into speculative execution.

                    If the CPU has a 5-stage pipeline, that is 9 clock cycles of overhead at worst. At 2 instructions at a time, 20 instructions is 10 clock cycles. If the c++ is placed at one end of the loop and the compare c<10 at the other, by the time you get to the branch you already know the result, so you don't need to perform speculative execution.

                    Pipeline length is quite a factor in whether you need speculative execution or not. The shorter the pipeline, the less benefit speculative execution gives, until there is absolutely no benefit at all.
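
                    A toy C restatement of the "compute the condition early" idea above (real compilers and cores complicate this considerably; it only shows the ordering being described):

                    /* Toy restatement: the value the branch needs is produced early in the
                     * iteration, and the branch itself sits at the end, so on a short
                     * pipeline the result has time to come out before the branch is
                     * reached and no guess is required.  Purely illustrative. */
                    void fill(int *buf)
                    {
                        int c = 0;
                        int done = (c >= 10);      /* exit condition computed up front     */
                        while (!done) {
                            c++;                   /* the update the branch depends on...  */
                            done = (c >= 10);      /* ...and the condition, resolved early */

                            buf[c - 1] = c;        /* independent body work fills the gap  */
                        }
                    }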

                    Originally posted by Weasel View Post
                    Read the entire thing and you'll realize why that happens: without speculative execution you'd need about 4x logical cores per physical core to even attempt to use all those idle execution units. Which is useless when you only have a single thread (see below).
                    No, this is wrong. A big reason for needing speculative execution is the overhead of too long a pipeline.

                    Option 1) If you have a shorter-pipeline system processing branches and filling an out-of-order engine, it is no problem to fill the execution units without speculative execution most of the time. You can have as high as a 95% hit rate on branches in a CPU with a 5-stage pipeline (hit rate meaning no stall: when you got to the branch, everything needed to resolve it was ready). Yet a speculative execution solution is more often than not under 50%. Of course, if your CPU has a 14-stage pipeline you cannot do this.

                    Option 2) An out-of-order system biased towards processing whatever resolves branches (barrel multi-threading), like the way compilers do loop unrolling. There is some extra complexity to get performance without speculative execution, but the extra-complexity method does not have the downside problem. This kind of system normally gets you to an 8-10 stage pipeline; once you get longer than that, the delays in the pipeline become too much to mask with this method.

                    Barrel processing groups your instructions so that you can prioritise the instructions that resolve the next branch, delay the instructions not needed for the next branch, and use them to fill the gaps. With barrel multi-threading, your single-threaded code becomes multiple threads, so it is more suitable for SMT processing and can be biased even better.

                    Option 3) Speculative execution, the worst option: this gets you to 14-stage pipelines and is wasteful, wasting processing units on results you will throw away and, worse, wasting cache requests on information that will never be used. Pipelines longer than 14 are not effective for anyone. The only way pipelines of 15 stages and longer could be effective is if someone designs a new method for accelerating branches.

                    A mix of options 1 and 2 is insanely effective, particularly when 2 can be done as a thread management engine to accelerate SMT workloads and reduce their overhead.

                    Cray was making highly multi-threaded hardware, but people were giving it lots of single-threaded work to process. Barrel multi-threading is a different way to achieve a lot of the same things as speculative execution, except there is no speculation in barrel multi-threading; it is more about biasing, so you resolve branches fast, then process all the instructions you know you have to process, and use that to fill all the slots.
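
                    A toy model of round-robin (barrel) issue, only to illustrate the mechanism being described; the thread count, the branch-resolve latency and the one-instruction-per-cycle policy are all assumptions of the sketch:

                    #include <stdio.h>

                    /* Toy model of barrel (round-robin) issue: with T threads interleaved
                     * one instruction per cycle, a given thread is only re-issued every T
                     * cycles, so a branch that takes D cycles to resolve costs that thread
                     * max(0, D - T) stall cycles.  T and D are assumed numbers. */
                    static int stall_per_branch(int resolve_cycles, int threads)
                    {
                        int gap = resolve_cycles - threads;   /* cycles still owed when the thread's turn returns */
                        return gap > 0 ? gap : 0;
                    }

                    int main(void)
                    {
                        const int resolve = 5;                /* assumed cycles to resolve a branch */
                        for (int threads = 1; threads <= 8; threads *= 2)
                            printf("%d thread(s): %d stall cycle(s) per branch\n",
                                   threads, stall_per_branch(resolve, threads));
                        return 0;
                    }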

                    Weasel, there is more than one way to solve the problem of filling the processing units. But first you have to see the bottlenecks. Pipeline length is a major bottleneck: the longer the pipeline, the more restricted your branch-resolving optimisation methods become. The way the cache system is optimised is another major bottleneck.



                    • #70
                      Originally posted by oiaohm View Post
                      On an Intel x86 you can, because the caches are designed for a 90 percent hit rate. So if you are doing something that has a 70 percent cache miss rate, the CPU's miss rate multiplied by the size of Intel's cache lines (the size of the transfers done between memory and the caches) is greater than your memory speed.
                      A single-threaded memcpy won't saturate your memory bandwidth, and cache has nothing to do with it, since it bypasses the cache anyway. The cache doesn't load faster than a memcpy can.

                      Originally posted by oiaohm View Post
                      https://www.youtube.com/watch?v=QdTwLs_8RZE
                      Here you see, for big data, RISC-V beating your Xeon, Xeon Phi and GPU processors, and not by a small margin. Entirely by focusing on making sure the memory system is effective: using a cache that has a lower cost when you have a high miss rate, with cache lines reduced down to 8 to 16 bytes. All Intel x86 parts use 64-byte cache lines.
                      Where in the video is the actual measured benchmarks? It's too long and I don't have time to watch it all, so at what minute/sec is it?

                      And I hope you are talking about single-threaded workload, right? I keep having to repeat this for some stupid reason since you keep dodging it.

                      Since, you know, that's the... whole point of speculative execution...

                      Originally posted by oiaohm View Post
                      That is straight-up ignoring that you have a CPU with caches designed for a 90% hit rate, so not waiting and attempting to execute anyway results in getting into cache hell.

                      Also, it is not waiting 100% of the time when you don't do speculative execution.
                      When encountering a branch, you are waiting literally 100% of the time without speculative execution. Obviously we're talking about the point where you encounter a branch.

                      Originally posted by oiaohm View Post
                      That is wrong. Cray-style barrel multi-threading, where the CPU is not just processing 1 thread at a time, can fill a 200-entry queue without speculative execution, but this is not the only way.
                      So does any CPU with SMT or Hyper-Threading, but I don't care about filling the entire queue with DIFFERENT THREADS.

                      I'm talking about SINGLE THREADED PERFORMANCE FFS. That's the **whole** god damn point of speculative execution: to fill the pipeline with ONE THREAD's execution as much as possible and thus, to increase SINGLE THREADED PERFORMANCE not to fill it with other threads because that's what SMT is for, not speculative execution.

                      Speculative execution is **ONLY** about single threaded performance. **ONLY**. Why the fuck do you keep ignoring this on purpose.

                      I don't give a fuck if any of your RISC examples are for multiple threads: that completely invalidates them. It is **ONLY** about single-threaded performance. That's the entire point behind speculative execution. Single-threaded performance.

                      Repeat that 1000 times until you get it.

                      Originally posted by oiaohm View Post
                      Option 1) If you have a shorter-pipeline system processing branches and filling an out-of-order engine, it is no problem to fill the execution units without speculative execution most of the time. You can have as high as a 95% hit rate on branches in a CPU with a 5-stage pipeline (hit rate meaning no stall: when you got to the branch, everything needed to resolve it was ready). Yet a speculative execution solution is more often than not under 50%. Of course, if your CPU has a 14-stage pipeline you cannot do this.

                      Option 2) An out-of-order system biased towards processing whatever resolves branches (barrel multi-threading), like the way compilers do loop unrolling. There is some extra complexity to get performance without speculative execution, but the extra-complexity method does not have the downside problem. This kind of system normally gets you to an 8-10 stage pipeline; once you get longer than that, the delays in the pipeline become too much to mask with this method.
                      If a branch is consistently predicted correctly less than 50% of the time, what makes you think the CPU won't simply fall back to a blind guess, which is still a 50% success rate while being completely blind (i.e. a coin flip)? You literally have no idea what you're talking about when it comes to branch prediction.
                      Last edited by Weasel; 19 August 2018, 07:48 AM.

