Linux 5.15 Adds Another Knob To Harden Against Side Channel Attacks


  • #11
    Originally posted by xfcemint View Post

    Nope, you are wrong there.

    Your statement is too general to be applied to all possible stall-prevention mechanisms available. So, in general, pipeline stalls do cause a slowdown, but most pipeline stalls can be avoided. It boils down to statistics: in a particular design, how many stalls are expected per clock tick (let's say 0.1 stalls/tick), and how many stalls are expected per one executed instruction.

    According to my estimates, the expected performance loss in my design is less than 15%. The low level of performance loss is due to the use of speculative execution. The loss of performance is indeed caused mostly by pipeline stalls, and there is also some performance loss due to increased cache latency.

    Speculative execution can negate most of the performance loss caused by stalls.

    Even if a design doesn't make use of speculative execution, the expected performance loss would be around 1 cycle per branch, and a branch happens approximately every 5 instructions. If a CPU executes 5 instructions in 2 ticks on average, then the expected performance loss due to stalls would be 33% (5 instructions in 3 ticks instead of 5 instructions in 2 ticks).

    You can't make a blanket statement like yours without a deeper analysis.
    First of all, 15% is a huge performance drop just for some paranoia.

    Secondly, your theory and analysis are just made up bullshit of assumptions. IBM did make a CPU without speculative execution (it was still OoO), and it failed miserably; its performance was abysmal in real-world software (not number crunching, where a GPU is better anyway).



    • #12
      Originally posted by xfcemint View Post
      That is just your opinion, or your point of view. Maybe your point of view is somehow more appropriate than mine?

      But, there are many people who value additional security much, much more than a 15% performance increase.
      Well, obviously not enough of them to make CPU designers adopt a lower-performance design to satisfy them.

      Originally posted by xfcemint View Post
      The person who asked about "high-performance and very secure CPU design" specifically stated that he is not a high-performance enthusiast. So, both of them likely wanted explanations that value security much more than you are suggesting here (you used the word "paranoia").
      The point is that you can't have a high-performance CPU that is "very secure" (i.e. immune to Spectre or speculative execution exploits), so what he was asking for is impossible. Much of the performance comes from things that can unfortunately be exploited.

      Originally posted by xfcemint View Post
      What you are doing here is a very bad way to argue something. In order to argue fairly, you should try to find flaws in my statements (either factual or logical flaws), and offer counter-arguments and facts supporting your side.

      Your phrase "bullshit of assumptions" is your viewpoint; it is not a fact, and it does not point out any logical flaws. Which particular statement that I made is "bullshit", incorrect or flawed?

      About "IBM CPU that failed" - a very weak argument. Which CPU? How can you be certain that it failed because it lacked performance due to missing speculative execution? Maybe it failed because IBM didn't put enough money into development? Or, maybe the designers didn't produce a good design? Or, maybe there was some other reason why it failed.

      Generally, it is a weak argument because one example product from the past doesn't prove anything, especially if it is not accompanied by in-depth analysis. You are drawing some far-fetched conclusions from a fact that one IBM product once failed.

      Taken altogether, I would call your post very dishonest. It contains blanket statements, far-fetched conclusions, hearsay, and not even a tiny bit of effort to argue fairly.
      Yeah sorry for that, I could've worded it differently. The point is that speculative execution (not just OoO) is crucial to performance-per-thread on CPUs these days. It has been ever since we couldn't scale frequencies up anymore, more than 20 years ago.

      Where do you think performance comes from, given that frequencies are about the same as 20 years ago? We have higher core counts, and much deeper and wider pipelines, to compute things in parallel on the same thread. But if the pipeline has to wait or stall, you are throwing all of that away and wasting it (the pipeline on current CPUs is incredibly long).

      This actually matters a lot because loops are also branches: very predictable branches, but still branches. So you can fit several iterations of a loop into the pipeline at once, since the loop branch almost always gets predicted correctly. That is the main reason current CPUs are faster than old CPUs at the same frequency, per thread.
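
      To make that concrete, here's a trivial sketch (plain C, nothing vendor-specific; just an illustration of the point, not anyone's actual design):

        #include <stddef.h>

        /* Sum an array. The backward branch at the bottom of the loop is
         * taken n-1 times and falls through once, so the predictor gets it
         * right almost every time. The front end keeps fetching past the
         * branch, so several iterations are in flight in the pipeline at
         * once instead of the core waiting at every branch. */
        long sum(const int *a, size_t n)
        {
            long total = 0;
            for (size_t i = 0; i < n; i++)
                total += a[i];
            return total;
        }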

      And not everything can be parallelized. In fact, if you have a highly parallelizable workload, just use a GPU, as it's much better suited to that kind of processing. CPUs should excel at single-core performance; that's their main job.



      • #13
        Originally posted by xfcemint View Post
        Where does my design fail to execute a loop as quickly as a high-performance design? The prediction accuracy is going to be very high, so no additional stalls are to be expected just because it is executing a loop.
        I don't think you understand how CPUs work that well, tbh. If you have any kind of speculative execution, you'll be vulnerable to side-channel attacks unless you add mitigations (which themselves cost performance).

        Besides, your calculations were way off track. A CPU doesn't execute "5 instructions in 2 clock cycles on average" by picking 5 adjacent instructions and running them side by side. It can only run instructions together if they're independent of each other, and the ones that end up executing together aren't necessarily next to each other at all: they can be extremely far apart, or sit in 10 different iterations of the same loop (the independence across iterations comes easily thanks to register renaming).

        For that to work you of course need to speculate past branches (including loop branches). The CPU then doesn't "see" the loop as a loop; it sees it as if it were unrolled, and can execute further iterations in parallel. This is why benchmarks are usually bottlenecked by throughput rather than latency (see Agner Fog's optimization manuals).
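
        As a rough sketch of what "independent across iterations" means (again just an illustration, not any specific CPU's behaviour):

          #include <stddef.h>

          /* Each iteration only reads src[i] and writes dst[i] (restrict
           * promises the arrays don't overlap), so iteration i+1 never has
           * to wait for iteration i. Once the loop branch is predicted, an
           * out-of-order core renames the registers used by each iteration
           * and keeps loads and multiplies from several iterations in
           * flight at the same time. */
          void scale(float *restrict dst, const float *restrict src,
                     float k, size_t n)
          {
              for (size_t i = 0; i < n; i++)
                  dst[i] = src[i] * k;
          }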

        tl;dr you can't sustain anything like 5 instructions per 2 clock cycles without speculative execution in real-world applications. At best you get maybe 2 adjacent instructions per cycle, like the old Pentiums.



        • #14
          Originally posted by xfcemint View Post
          It goes like this: an expert, a hobbyist and an idiot are introduced to each other in a coffee bar. The hobbyist thinks that he was introduced to two idiots: they are both saying things which are obviously incorrect.

          You are "the hobbyist". Admittedly, I'm primarily a programmer and a hobbyist CPU designer myself. So I'm not an absolute expert in CPU design, but you are much worse than me.
          You literally said you're the hobbyist??? So that's you. Enjoy your fantasy-land CPU though I'm sure it will work great in your dreams.

          Originally posted by xfcemint View Post
          OoO can be implemented independently of SE. SE just removes many stalls (mostly on branches).
          What is the point of OoO if it stalls? What is the point of a long pipeline if it's always half empty?

          Exactly.

          And that means you can't improve CPU per-thread performance anymore. You want per-thread performance to improve with each CPU generation, not stay the same or even regress.

          Originally posted by xfcemint View Post
          I said, many times already: my design has speculative execution, but it stalls a bit more often than current CPUs.
          Your design is pure fantasy hobbyist BS, and after what you said you're not worth my time anymore; thankfully you're not worth actual CPU designers' time either. Keep living in your delusions.



          • #15
            Originally posted by xfcemint View Post
            I'm a hobbyist CPU designer. You are "the hobbyist" from the story, who can't figure out what an expert is saying.
            You're proving the contrary.

            Originally posted by xfcemint View Post
            The real question is: how often does it stall? If it loses only 10% of CPU cycles to stalls, then it's not a big deal, at least for me.
            Clock cycles have nothing to do with it, yet again you show you have no idea what you're talking about, lmao.

            When it "stalls", it's waiting for the results of previous operations, which are still being worked on each clock cycle. The problem is that the rest of the pipeline sits unused (well, excluding SMT/Hyper-Threading). At that point you can't boost performance without increasing the frequency, and we all know how that works out.
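
            For contrast, here's the kind of code where the machine really does sit waiting (a sketch, nothing more):

              #include <stddef.h>

              struct node { struct node *next; };

              /* Linked-list walk: the address of the next node is unknown
               * until the previous load comes back, so no matter how wide
               * the pipeline is, most of it sits idle waiting on load
               * latency. That's what a stall costs you. */
              size_t list_length(const struct node *p)
              {
                  size_t len = 0;
                  while (p) {
                      p = p->next;
                      len++;
                  }
                  return len;
              }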

            Maybe you're content with a crappy-performing CPU, but others not only want a fast one today, they want a faster one tomorrow. How do you even increase your CPU's performance if it's always stuck waiting? By doing what? What's your plan for increasing your CPU's single-threaded performance in 10 years?

            Originally posted by xfcemint View Post
            On the other hand, what's the point of designing a CPU which doesn't produce results as expected (what's the point of designing a flawed CPU?).
            But it does? Did it do 1+1 and not compute 2? All results are exact and deterministic, and it's correct unless there's a bug, which is a serious thing mind you.

            So it's absolutely correct and produces the results it was designed for. You spying on it via some side channel crap has nothing to do with CORRECTNESS because your stupid spying does NOT affect its output at all.

            Originally posted by xfcemint View Post
            I can't believe how stupid some people are, after all the explanations that I have given you still don't get it.
            Maybe it's time to look in the mirror and realize why nobody else uses your magical CPU design. If the whole world is full of idiots then... you know what they say.



            • #16
              Originally posted by xfcemint View Post
              The processing continues when the pipeline stalls, but a CPU can stall only for an integral number of clock cycles. Perhaps it can be said that a CPU stalls for "0.7 clock cycles" on an average stall, but that metaphor is a bit hard to understand.
              It stalls until the instructions produce the output.

              Originally posted by xfcemint View Post
              Whether the results are "correct" depends on the way you define "correctness". In my view, the results are incorrect, and the existence of SPECTRE proves it.
              Actually no, this is not an opinion, spare me the SJW terms. It's designed to produce correct results, and it does.

              Side channels are literally external spying on it. That's like saying a CPU got too much voltage and now produces wrong results, so it must be a design fault. Yeah, right.

              Or like using an electron microscope to take snapshots of the CPU and then figuring out what it's doing from those. Does that make it incorrect because it has no protection against this? LMAO.
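
              Just to be clear what the "spying" actually is, a cache side channel boils down to a measurement roughly like this (x86, GCC/Clang intrinsics, heavily simplified: no serialization or noise handling, purely illustrative):

                #include <stdint.h>
                #include <x86intrin.h>   /* __rdtsc() */

                /* Time a single load. A short time means the line was already
                 * in the cache, a long time means it wasn't. Nothing here
                 * touches the victim's architectural results; the "leak" is
                 * only in how long this probe takes. */
                static uint64_t probe(const volatile uint8_t *addr)
                {
                    uint64_t t0 = __rdtsc();
                    (void)*addr;
                    return __rdtsc() - t0;
                }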

              Originally posted by xfcemint View Post
              Nobody uses my magical design because:
              a) it is magical; I'm a wizard
              b) most honest people cannot tell the difference between a skilled narcissist and an expert. They have been misled by people like you.
              Yeah too bad Intel didn't hire a world-class expert such as yourself to fix their CPUs.



              • #17
                Maybe because you have no idea what you're talking about. Your "speculative" cache is an absurd notion by itself, considering how scare L1 cache already is and how much of the CPU's die area caches already occupy, and you want to make that even worse. Laughable.



                • #18
                  Originally posted by xfcemint View Post

                  At this moment I am forced to conclude that tildearrow doesn't care a tiny bit about security in CPUs. If he cared, he would have replied. He has some other, more important things to do, obviously.

                  Also, given the non-reaction of other forum members to my post, the inevitable conclusion is that nobody cares.

                  Likely, I just wasted my time explaining a secure CPU design.
                  Excuse me but I did read your post. I merely did not know what else to add to the conversation (plus I totally forgot about the thread D: )

                  Let me ask though, how would you implement this "speculation-aware cache"? Switch the cache out on context switch or something?

                  And a 15% performance drop? Why? Isn't there a more optimal way?
                  Last edited by tildearrow; 08 September 2021, 02:01 PM.



                  • #19
                    Originally posted by xfcemint View Post
                    I'm quite surprised that you have figured out that caches do present a problem. I'm going to provide an answer to this objection of yours since it is a valid concern, despite the impolite style of your question.
                    Maybe it would do you good to be less condescending. Think about why nobody uses your design, rather than considering everyone else an idiot who doesn't know what they're talking about.

                    Anyway, I understand your speculative cache logic. I'm not saying it's hard to implement. The problem is that it's an additional cache, which is even more waste when cache capacity is already scarce. 20% is a lot. If more die area goes to the speculative cache, less can go to the L1 cache or to other structures, which indirectly costs performance.

                    Think of it like this: the massive number of transistors you spend on the speculative cache buffer (caches eat a lot of them) could instead go into a larger L1 cache, improving performance, or into a deeper/wider pipeline, again improving performance. Or heck, even another core or two (plausible if the CPU has ~30 cores, given design constraints). Apple's M1 has a very large cache, and that's a big part of why it's so performant, despite what ARM fanboys will claim.

                    However, there's also a direct cost: you have to discard the speculatively cached data whenever the prediction was wrong. That's a further performance penalty, albeit not as large as the former.
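
                    For what it's worth, the generic way such a buffer is usually described (which may or may not be what you have in mind) looks something like this sketch; all names and sizes here are made up:

                      #include <stdbool.h>
                      #include <stddef.h>
                      #include <stdint.h>

                      /* Hypothetical "speculative fill buffer": cache fills made
                       * under speculation land here instead of in the real L1;
                       * they are only merged into L1 if the speculation commits,
                       * and are thrown away on a squash so they leave no trace. */
                      #define SPEC_ENTRIES 16

                      struct spec_entry { uintptr_t tag; bool valid; /* + data line */ };
                      struct spec_buf   { struct spec_entry e[SPEC_ENTRIES]; };

                      /* misprediction path: drop every speculative fill */
                      static void spec_squash(struct spec_buf *b)
                      {
                          for (size_t i = 0; i < SPEC_ENTRIES; i++)
                              b->e[i].valid = false;
                      }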

                    tl;dr it's not hard to implement; it's just terrible for the performance/transistor budget.



                    • #20
                      Originally posted by xfcemint View Post
                      Oh, now I get it, he meant "scarce", not scare. And he thinks that cache size is scarce, not only L1 but also L2 and L3.

                      Nope. So many transistors are available with current technology that it's almost the opposite: caches are the last priority for the die area. Everything else on the CPU is more important than the cache size (after some minimum size which is close to a quarter of current x86 cache sizes).
                      https://electronics.stackexchange.co...icroprocessors

                      Yeah, you're right, there's barely any cache in that picture; it's not like half the entire CPU is just L3 cache. Yep. Last priority.

                      L1 cache has to be extremely close to the core, while L2 is also per-core but much bigger. The thing is that L1 is limited in size because it has to be physically close (speed-of-light and wire-delay constraints at such low latencies).
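
                      Back-of-envelope, with purely illustrative figures (assume a ~5 GHz clock and a ~4-cycle L1 hit; real parts differ):

                        1 cycle at 5 GHz  =  0.2 ns
                        4-cycle L1 hit    =  0.8 ns total budget
                        light in vacuum   ≈  30 cm per ns, so ~24 cm even at c

                      and on-chip wires are RC-limited, far slower than c, with most of that budget going to the SRAM array lookup itself, so in practice the data has to sit within a few millimetres of the load/store units. That's why L1 stays small.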

                      Adding a speculative L1 cache would take capacity away from the actual L1 cache, or make it slower. L1 cache is limited for a reason. But nah, CPU designers are dumb and you're super smart, so you can add more L1 cache for "free". Obviously they keep L1 small out of spite, not because of technical issues, but you can do better!

                      https://stackoverflow.com/questions/...49736#38549736

                      Originally posted by xfcemint View Post
                      Doubling the cache size gives LESS than 6% additional performance in current designs.

                      The die power budget is the primary constraint on current CPUs, not the die area.

                      Only a complete novice is not familiar with this information.
                      Ah yes, stay a pisslow in your dreamworld.

                      Originally posted by xfcemint View Post
                      The issue is the priority of power budget allocation and die area allocation. Since the budget priority for cache size is low, it is not a big problem to add a relatively small new unit to a CPU core (the die area of speculative cache buffer should be less than 25% the size of L1).

                      Generally, this last post of yours is completely misleading, with weak arguments, and arguments with distorted priority. So I'm certainly not debunking it all one by one. It's just a pile of "bullshit" (that seems to be Weasel's favorite word).
                      Actually, it is a problem, because if it were "not a problem" then everyone would just increase the L1 cache instead of adding another L1-like cache for speculative execution. It's capped for a reason.

                      Like I said, do yourself a favor, save yourself the embarrassment, and just assume that others know better than you why current CPUs are designed this way. Or continue being a clown.

                      Or maybe you've found a way to fit the speculative cache into the same physical space using alien inter-dimensional tech. Even then, you'd be better off using that amazing tech to increase the actual L1 cache and get even better performance, rather than wasting it like that on paranoia.

