**uid313** · 28 June 2021, 07:32 AM

This feels more like a workaround than a true fix.
Is there any real way to fix the underlying issue?
Is it possible to have an SMT/HT implementation without security issues?
Or is the real solution to design CPUs without SMT?

I know that SMT can increase performance, but is it really needed?
Or is it only needed on x86 because of its poor design?

**hotaru** · 28 June 2021, 08:24 AM

Originally posted by uid313:

Or is the real solution to design CPUs without SMT?

Yes, that is the real solution. Unfortunately, people are reluctant to give up the performance increase that SMT brings: up to about 30% on Intel or 40% on AMD.

**numacross** · 28 June 2021, 08:40 AM

Originally posted by hotaru:

Yes, that is the real solution. Unfortunately, people are reluctant to give up the performance increase that SMT brings: up to about 30% on Intel or 40% on AMD.

The gain depends on the workload. For example, 7-Zip is able to scale to almost 100% on AMD SMT.

**Developer12** · 28 June 2021, 10:16 AM

Originally posted by uid313:

I know that SMT can increase performance, but is it really needed?
Or is it only needed on x86 because of its poor design?

Quick primer: You want to decode as many instructions as possible as fast as possible. This allows the CPU core to search through them looking for opportunities to reorder (or speculate) instructions and better keep all of the core's execution units running all the time.

Historically, the reason HT exists is because of x86's shitty instruction format. Back in the '70s, variable-length instructions seemed like a great idea to reduce time.

x86 features lots of variable-length instructions and they are a real pain to decode.* (Hell, they were a real pain to adapt to a pipeline, necessitating complex decode into uOPs in the first place.) This in turn made parallel decoding damn near impossible, because you couldn't tell ahead of time where the next instruction would start. The early solution, hyperthreading, introduces a second thread into the core, bringing with it a second stream of instructions to decode that has no dependence on the first. Thus, both streams can be decoded in parallel and you can use more of the core's execution units.

In more recent chips (Zen), AMD has managed 4-way decoding by attempting a decode at every single byte and discarding the decodes that turn out to be wrong. This means all the parallel decode units in each core are tightly coupled, greatly increasing complexity. (Each must signal to the others whether an attempted instruction decode was correct or incorrect.)
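
To make the decoding problem concrete, here is a minimal Python sketch. The toy encoding (instruction length stored in the low two bits of the first byte) and the function names are assumptions for illustration only; real x86 length decoding is far more involved, and this is not a description of how any actual decoder is built. It only contrasts a serial boundary scan with the decode-everywhere-then-discard idea described above.

```python
def length_at(code, offset):
    """Toy rule (an assumption, not real x86): low 2 bits of the byte, plus 1."""
    return (code[offset] & 0b11) + 1


def sequential_boundaries(code):
    """Serial scan: each start offset depends on the previous instruction's length."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += length_at(code, pc)
    return starts


def speculative_boundaries(code):
    """Decode a candidate at every byte offset, then discard the wrong ones."""
    # Step 1: every offset is decoded independently of the others.
    # In hardware this is the part that can happen in parallel.
    candidates = [length_at(code, i) for i in range(len(code))]

    # Step 2: follow the chain from offset 0 to mark which candidates were real.
    # This selection step is where the decode units have to coordinate.
    valid = [False] * len(code)
    pc = 0
    while pc < len(code):
        valid[pc] = True
        pc += candidates[pc]
    return [i for i, ok in enumerate(valid) if ok]
```

For the ten-byte string `bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF, 0x03, 0x01, 0x02, 0x03])`, both functions find instruction starts at offsets 0, 3, 4 and 6, so six of the ten speculative decodes are thrown away.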

You might hear micro-architects from Intel/AMD claim to have "solved" the issue with tons of fancy instruction prediction and uOP caches. That's only a half-truth. Has it eased the problem? Sure, at a huge cost in chip area and power. And still AMD's Zen is stuck at 4-way parallel decoding per thread while Intel moved mountains (and bugs) to achieve 5-way. The ARM-based M1 does 8-way without breaking a sweat. There's a damn good reason hyperthreading has never appeared in any CPU with a (simpler) RISC-style ISA, except for IBM's high-throughput POWER. It simply isn't needed.**

*theoretically, it's possible to have INFINITE length x86 instructions, though in modern processors it's capped.

**Yes, there were classical RISC architectures that survived well into the era of Intel hyperthreading. Sun's SPARC, for one. Note that late Fujitsu SPARC's "hyperthreading" is different from the usual kind. The core doesn't actually run two threads together, instead swapping the whole thing back and forth between them to provide the illusion of hyperthreading.

**coder** · 28 June 2021, 10:46 AM

Originally posted by uid313:

This feels more like a workaround than a true fix.

No, it is the true fix. Otherwise, you're stuck in a game of whack-a-mole, and there's always the chance that one thread can spy on the other just by looking at how much throughput it gets out of certain instructions.

Originally posted by uid313:

Is there any real way to fix the underlying issue?

It boils down to a tradeoff between efficient sharing of resources vs. isolation.

The solution is to throw away any pretense of true isolation, and only schedule threads on the same core if they already share a memory space, or if you're otherwise unconcerned about them spying on each other.
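
As a rough illustration of that policy, the Python sketch below places tasks on two-thread cores so that SMT siblings always come from the same trust group, leaving a sibling idle rather than mixing groups. The task names, the group labels and the `place_tasks` function are all hypothetical, and this one-shot static placement is far simpler than what a real scheduler has to do at runtime.

```python
from collections import defaultdict


def place_tasks(tasks, threads_per_core=2):
    """Assign tasks to cores so that siblings on a core share a trust group.

    tasks: list of (task_name, group) pairs.
    Returns a list of cores; each core is a list of task names, padded with
    None where a hardware thread is deliberately left idle.
    """
    by_group = defaultdict(list)
    for name, group in tasks:
        by_group[group].append(name)

    cores = []
    for names in by_group.values():
        # Fill cores only with members of one group at a time.
        for i in range(0, len(names), threads_per_core):
            core = names[i:i + threads_per_core]
            # Rather than pull in a task from another group, idle the sibling.
            core += [None] * (threads_per_core - len(core))
            cores.append(core)
    return cores


if __name__ == "__main__":
    demo = [("vm1-vcpu0", "vm1"), ("vm1-vcpu1", "vm1"),
            ("vm2-vcpu0", "vm2"), ("vm1-vcpu2", "vm1")]
    for core in place_tasks(demo):
        print(core)
```

The idle slots are the price of the policy: the fewer tasks each trust group has, the more hardware threads sit unused.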

Originally posted by uid313:

Is it possible to have an SMT/HT implementation without security issues?

It's probably more a question of how much efficiency one sacrifices in doing so. Plus, you can never be 100% sure there's true isolation.

Originally posted by uid313:

I know that SMT can increase performance, but is it really needed?

It's certainly worthwhile. For a few % more area, it can offer substantial benefits by helping to hide memory latency and by compensating for poor instruction-level parallelism in some code.
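
A back-of-the-envelope model of the latency-hiding part, as a Python sketch. It assumes each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles, and that one thread's compute perfectly fills another thread's stalls. Real cores are nowhere near this tidy, and the function name and numbers are made up for illustration.

```python
def core_utilization(compute_cycles, stall_cycles, threads):
    """Fraction of cycles the execution units are busy in an idealized model.

    Each thread is busy for compute_cycles out of every
    (compute_cycles + stall_cycles); extra threads fill the stalls
    until the core saturates at 100%.
    """
    busy_fraction_per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * busy_fraction_per_thread)


if __name__ == "__main__":
    # A memory-bound thread that computes 30 cycles, then waits 70.
    for n in (1, 2, 4):
        print(n, "thread(s):", core_utilization(30, 70, n))
```

In this model a memory-bound thread leaves plenty of room for a sibling, while a thread that rarely stalls gains little from one, which fits the observation that the benefit depends heavily on the workload.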

Originally posted by uid313:

Or is it only needed on x86 because of its poor design?

Many CPUs had it before x86. POWER has SMT4 and some SPARC CPUs had up to 8-way SMT.

GPUs use SMT in an even bigger way, with Intel's Gen9 implementing 7-way SMT and AMD's GCN implementing 12-way (I'm not sure if RDNA departs from that).

**coder** · 28 June 2021, 11:44 AM

Originally posted by Developer12:

Quick primer:

This is not entirely accurate.

Originally posted by Developer12:

Historically, the reason HT exists is because of x86's shitty instruction format.

Hyperthreading was first added to x86 in the comparatively narrow Pentium 4, so I'm not sure it was really added just to get around decoding bottlenecks.

Even in modern x86 CPUs, Hyperthreading tends to be a net loss for intensive floating-point code. The decoders and prefetchers can cope well enough.

Originally posted by Developer12:

x86 features lots of variable-length instructions and they are a real pain to decode.* (Hell, they were a real pain to adapt to a pipeline, necessitating complex decode into uOPs in the first place.)

I think many CPUs have some variation on uOPs. It's not only for those with variable-length instructions. It also enables CPUs to split and merge ISA instructions and generally decouple the implementation from the ISA.

Originally posted by Developer12:

This in turn made parallel decoding damn near impossible because you couldn't tell ahead of time where the next instruction would start.

FWIW, Jim Keller recently said:

For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. You basically predict where all the instructions are in tables, and once you have good predictors, you can predict that stuff well enough. So fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, to predict or to figure out where all the instructions are, it isn't dominating the die. So it doesn't matter that much.

Source: https://www.anandtech.com/show/16762...erson-at-tesla

I saw that you seem to have preempted this, but I trust Jim more than you.

Originally posted by Developer12:

still AMD's Zen is stuck at 4-way parallel decoding per thread while Intel moved mountains (and bugs) to achieve 5-way.

They have op caches, which act like an L0 instruction cache. For loops, this basically gives them as much dispatch as they want. Yeah, it's not perfect, but it seems to work pretty well.

Intel did something interesting in Tremont, where they have 2x 3-wide decoders that can work in parallel (i.e. on different branch targets).

Originally posted by Developer12:

There's a damn good reason hyperthreading has never appeared in any CPU with a (simpler) RISC-style ISA,

That's not correct. There are even SMT ARM cores. You missed that because:

You never bothered to check.
You don't seem to appreciate the full benefits of SMT.

BTW, even ARM CPUs support variable-length instructions, to a degree. See Thumb-2.

**microcode** · 28 June 2021, 05:10 PM

Originally posted by uid313:

This feels more like a workaround than a true fix.
Is there any real way to fix the underlying issue?
Is it possible to have an SMT/HT implementation without security issues?
Or is the real solution to design CPUs without SMT?

I know that SMT can increase performance, but is it really needed?
Or is it only needed on x86 because of its poor design?

Of course it's a workaround, but it's a workaround that makes a lot of sense. The point of SMT is to share resources, and caches are a huge part of each core. Fundamentally, to fix timing side channels arising from SMT, you will need to separate the caches, which may almost completely defeat the purpose of SMT.

**microcode** · 28 June 2021, 05:12 PM

Originally posted by Developer12:

Historically, the reason HT exists is because of x86's shitty instruction format. Back in the '70s, variable-length instructions seemed like a great idea to reduce time.

There are POWER chips with 8-way SMT, and yes, it works really well on many workloads.

**coder** · 28 June 2021, 09:55 PM

Originally posted by microcode:

The point of SMT is to share resources, and caches are a huge part of each core.

It's not only caches, but also lots of hidden state inside the core, like branch-predictors and prefetchers.

Originally posted by microcode:

Fundamentally, to fix timing side channels arising from SMT, you will need to separate the caches, which may almost completely defeat the purpose of SMT.

You need to duplicate some structures, but not caches. For caches, all that's needed is a logical partitioning. I think some CPUs already feature QoS support to ensure that one thread doesn't hog all the cache. This can simply be extended to create a stricter logical partition between the threads. Essentially, you just need an extra bit in the cache's tag RAM (for SMT-2) saying which thread "owns" the line, and a policy not to evict the other thread's lines. The benefit is that the entire cache could be made available to a thread using the core exclusively.
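
Here is a minimal Python sketch of that idea for a single cache set, under some loud assumptions: the class and method names are made up, replacement is plain LRU, a thread only ever hits on lines it owns, and a thread that finds the set full of the other thread's lines simply bypasses the cache. A real design would also need some quota so that one thread can't claim every free way first; that is left out here.

```python
class PartitionedCacheSet:
    """One set of a set-associative cache shared by two SMT threads.

    Every line carries an owner bit next to its tag (hypothetical layout).
    """

    def __init__(self, ways):
        self.ways = ways
        self.lines = []  # (tag, owner) pairs, least recently used first

    def access(self, tag, thread, exclusive=False):
        """Return "hit", "miss" (line allocated) or "bypass" (not allocated).

        exclusive=True models a thread that has the core to itself and may
        therefore evict any line, regardless of owner.
        """
        line = (tag, thread)
        if line in self.lines:
            # Hit on a line this thread owns: move it to the MRU position.
            self.lines.remove(line)
            self.lines.append(line)
            return "hit"

        if len(self.lines) < self.ways:
            # A free way is available, so no eviction is needed.
            self.lines.append(line)
            return "miss"

        # Set is full: pick the least recently used line we may evict.
        for victim in self.lines:
            if exclusive or victim[1] == thread:
                self.lines.remove(victim)
                self.lines.append(line)
                return "miss"

        # Every line belongs to the other thread: leave them untouched.
        return "bypass"
```

With a two-way set, if thread 0 loads one line and thread 1 then streams through many others, thread 1 only ever evicts its own line, so thread 0's later access still hits. Whether thread 0's line is present no longer depends on what thread 1 did, which is the property the partitioning is after.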

Core-Scheduling For Linux 5.14 To Reduce SMT/HT Information Leak Risks, Side Channels
