Initial Benchmarks Of The Intel Downfall Mitigation Performance Impact


  • #21
    Originally posted by coder View Post
    No, the reason AVX-512 was disabled is blindingly obvious: E-cores. Intel didn't want a hybrid-ISA CPU and I don't blame them.


    Except you're the one reaching for bizarre theories based on incomplete/incorrect information. Why is it so hard to accept the obvious explanations for things?
    There is Intel Thread Director to direct AVX-512 instructions to the P-cores. Instead, they disabled the capabilities of existing hardware. Unlike you, I don't see a problem with a hybrid ISA on a "hybrid" CPU. It was possible to enable AVX-512 in the earlier Alder Lake CPUs. After they found people enabling it via UEFI and "using" AVX-512, they fused it off in later production. These are just facts that you may not know; everybody is free to come up with their own conclusions. You sound like an apologist for Intel who is trying to paint me as an enemy. But ... whatever. I don't care; my CPUs are all 12th and 13th gen. Yes, I chose Intel CPUs.

    Just adding a bizarre link
    Last edited by mrg666; 10 August 2023, 02:15 PM.

    Comment


    • #22
      Originally posted by mrg666 View Post
      There is Intel Thread Director to direct AVX-512 instructions to the P-cores. Instead, they disabled the capabilities of existing hardware.
      LOL. Thread Director summarizes a core's recent execution history. The first AVX-512 instruction an E-core sees will trigger a SIGILL (illegal instruction), long before Thread Director can ever enter the picture.
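
      To make that concrete, here's roughly how a library normally guards its AVX-512 paths at startup. This is just a minimal sketch of the standard CPUID/XCR0 check using GCC/Clang's cpuid.h helpers, nothing specific to hybrid chips. The catch: on a hybrid-ISA CPU, the answer would depend on which core happens to execute the check, and Thread Director has no say in where that first check runs.

        #include <stdbool.h>
        #include <stdint.h>
        #include <cpuid.h>  /* GCC/Clang: __get_cpuid, __get_cpuid_count, bit_* flags */

        /* Read XCR0 to learn which register state the OS saves/restores. */
        static uint64_t read_xcr0(void)
        {
            uint32_t lo, hi;
            __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
            return ((uint64_t)hi << 32) | lo;
        }

        /* The usual one-time dispatch check. On a hybrid-ISA chip, this
         * would return true on a P-core and false on an E-core. */
        static bool avx512f_usable(void)
        {
            unsigned eax, ebx, ecx, edx;

            /* Leaf 1: OSXSAVE means we're allowed to execute xgetbv at all. */
            if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & bit_OSXSAVE))
                return false;

            /* XCR0: SSE, AVX, opmask, ZMM_Hi256 and Hi16_ZMM state (0xE6)
             * must all be OS-enabled. */
            if ((read_xcr0() & 0xE6) != 0xE6)
                return false;

            /* Leaf 7, subleaf 0: the AVX-512 Foundation bit. */
            if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return false;
            return (ebx & bit_AVX512F) != 0;
        }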

      Originally posted by mrg666 View Post
      Unlike you, I don't see a problem with a hybrid ISA on a "hybrid" CPU.
      It's obvious you don't see the problem, so let me walk you through it. An app sees the hardware supports 32 threads, because there are 8 P-cores, each supporting 2 threads, and 16 E-cores. Unbeknownst to the app, it linked in a library which uses AVX-512. The library either doesn't know about hybrid CPUs or isn't spawning its own worker threads, so it can't restrict affinities or reduce the thread count to just the P-cores.

      So, now you have 32 threads trying to run, but half of them immediately hit a SIGILL on the E-cores where they were sent. The kernel sees they were trying to run an AVX-512 instruction and automatically updates their affinity so they only run on P-cores. Due to the contention for P-cores, the same fate eventually befalls the rest of the threads.

      Now we have a situation where 32 threads are contending for SMT slots on 8 of the cores. That contention leads to continual context switches, more L2 cache thrashing on those cores, and higher thread communication/synchronization latency. The result is probably measurably worse than if the app had only spawned 16 worker threads in the first place. And you cannot know in advance which libraries might use even one AVX-512 instruction.
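
      To show how the naive pattern arises, here's the "one worker per CPU context" idiom sketched in C (illustrative only, not any particular app's code):

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        /* One worker per hardware thread: n is 32 on an 8P(+HT)+16E part.
         * If the workers touch an AVX-512 library path and get herded onto
         * the P-cores, all 32 end up contending for 16 SMT slots. */
        static void *worker(void *arg)
        {
            /* ... per-thread work, possibly calling the AVX-512 library ... */
            return arg;
        }

        int main(void)
        {
            long n = sysconf(_SC_NPROCESSORS_ONLN);
            if (n < 1)
                n = 1;
            if (n > 256)
                n = 256;
            pthread_t tid[256];
            for (long i = 0; i < n; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
            for (long i = 0; i < n; i++)
                pthread_join(tid[i], NULL);
            printf("spawned %ld workers\n", n);
            return 0;
        }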

      Furthermore, because they're P-cores, they're burning more power in the process. Meanwhile, you have 16 nice E-cores sitting idle. Those idle E-cores pack the majority of the horsepower on the chip, and they're decidedly more efficient than the P-cores, which means you have a good chance of actually using that horsepower without throttling.

      If we look at the upside of using AVX-512 on the P-cores, recent benchmarks[1] show it's typically less than 50%, and that's on specifically-optimized workloads (which we don't know that this one is). If we look at how much performance the E-cores can contribute, Alder Lake showed 8x E-cores being about 51.5% as fast as 8x dual-threaded P-cores at floating-point workloads[2]. Raptor Lake doubles the E-core count to 16, so in the best case that contribution scales to over 100%.

      So now you're turning your back on anywhere from just over 100% (Alder Lake) to about 200% (Raptor Lake) of the performance you hope AVX-512 can provide. Instead of tripping over itself trying to use some extra power of the P-cores, the app would be much better off restricting itself to AVX2 and just using all of the cores.

      Intel can run simulations and they can do math. Their decision to include the E-cores, even at the expense of AVX-512, wasn't dumb. The same can't be said for a great many people on the internet who haven't done the analysis and are merely posting out of angst at the idea of losing something.

      References:
      1. https://www.phoronix.com/review/amd-ryzen7040-avx512
      2. https://www.anandtech.com/show/17047...d-complexity/9


      From [1]: [benchmark charts omitted; see the linked review]
      Originally posted by mrg666 View Post
      It was possible to enable AVX-512 in the earlier Alder Lake CPUs.
      Just because you can, doesn't mean you should.

      Originally posted by mrg666 View Post
      These are just facts that you may not know,
      Oh, I most certainly know. I've been through this many times.

      Originally posted by mrg666 View Post
      everybody is free to come up with their own conclusions. You sound like an apologist for Intel who is trying to paint me as an enemy. But ... whatever.
      Adopting a victim complex doesn't make you any more right or wrong. I suggest you focus more on the facts/analysis and less on your internal self-narrative.

      I have no affiliation or love for Intel. I just do the analysis and call 'em like I see 'em. E-cores were a good idea and I'm not afraid to say it.

      Against Intel, what I'll say is that I hate how they walked back on AVX-512 and this new rebranding exercise they're doing with AVX10, which is just creating churn and not really solving any real problems. I wish they could've squeezed an implementation like Zen 4's into their E-cores, but even that probably would've made them too big (don't forget that Zen 4 is made on a smaller node). AVX-512 was messy to begin with, and everything they've done with it, since the launch of Alder Lake, has just made matters worse.
      Last edited by coder; 10 August 2023, 06:37 PM.

      Comment


      • #23
        Originally posted by coder View Post
        An app sees the hardware supports 32 threads, because there are 8 P-cores, each supporting 2 threads, and 16 E-cores. Unbeknownst to the app, it linked in a library which uses AVX-512. The library either doesn't know about hybrid CPUs or isn't spawning its own worker threads, so it can't restrict affinities or reduce the thread count to just the P-cores.
        So, now you have 32 threads trying to run, but half of them immediately hit a SIGILL on the E-cores where they were sent. [...] Now we have a situation where 32 threads are contending for SMT slots on 8 of the cores
        You're right, except this is a temporary condition. Code which is naive to the situation would have the problem. Bugs would be reported, code would be fixed, all would be well. For instance, the problematic library might expose a 'number of usable threads' API call which the app would then use to determine how many threads to spawn.
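
        Sketching what such a call might look like (every name here is invented for illustration; no standard API like this exists):

          #define _GNU_SOURCE
          #include <sched.h>   /* cpu_set_t, CPU_COUNT (GNU extensions) */

          /* Hypothetical library-side call: report how many hardware contexts
           * can actually run this library's code paths. The library fills
           * lib_capable_cpus at init with the cores that pass its ISA check,
           * and the app sizes its thread pool from this value instead of
           * sysconf(_SC_NPROCESSORS_ONLN). */
          static cpu_set_t lib_capable_cpus;

          int lib_usable_threads(void)
          {
              return CPU_COUNT(&lib_capable_cpus);
          }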

        In any case, it's poor app behavior to sniff the number of CPU contexts and immediately grab that many threads! That's excessively entitled, baking in an assumption that This App is more important than everything else put together. Real-world general-use apps offer user control over how many threads to launch!

        I'm sure some users would still gripe about having to configure such apps to only demand (#P-cores * 2) threads (or fewer). But at that point it's a minor usability gripe, not a showstopper!

        Comment


        • #24
          Originally posted by coder View Post
           LOL. Thread Director summarizes a core's recent execution history. The first AVX-512 instruction an E-core sees will trigger a SIGILL (illegal instruction), long before Thread Director can ever enter the picture. [...] AVX-512 was messy to begin with, and everything they've done with it, since the launch of Alder Lake, has just made matters worse.
           Oh wow, thank you for taking the time; I'm glad my bizarre theories gave you so much inspiration to write in the end. It is still nonsense what Intel did with AVX-512: why do those unused transistors occupy valuable silicon real estate? It's either incompetence or deception. Whatever.

          Comment


          • #25
            Originally posted by filbo View Post
            You're right, except this is a temporary condition. Code which is naive to the situation would have the problem. Bugs would be reported, code would be fixed, all would be well. For instance, the problematic library might expose a 'number of usable threads' API call which the app would then use to determine how many threads to spawn.
            This idea that we're going to rearchitect all of the code that potentially uses AVX-512 seems unhinged. Software likes abstraction, and requiring every component to expose whether it uses AVX-512 defies that. It's not going to happen and it's not worth it.

            More importantly, you seem to have stopped reading my post about halfway through, because you didn't reach the punchline. If you did such a thing, you'd gain less performance than you're giving up by turning your back on the E-cores.

            Furthermore, Intel has already told us that hybrid CPUs will implement AVX10/256 on both P-cores and E-cores, making this a moot argument. This naturally leaves us to infer that if/when they ever have hybrid CPUs with AVX10/512 on the P-cores, it will also be on the E-cores.

            Originally posted by filbo View Post
            In any case, it's poor app behavior to sniff the number of CPU contexts and immediately grab that many threads!
            I agree, but you need OS support in order to do anything better. I wish Linux had a work-stealing API, where you could allocate work queues and populate them with entries, sort of like io_uring. Then, the kernel could decide how many threads to spin up and change this number on the fly. I think macOS does something like that (Grand Central Dispatch).

            Originally posted by filbo View Post
            That's excessively entitled, baking in an assumption that This App is more important than everything else put together.
            Worse than that, I've had the experience of writing a program which spun up 3x as many worker threads as I had hardware threads, because each of 3 different libraries internally started up its own worker threads.

            What would be better is to have a common work-stealing API, so that there'd only be one set of worker threads global to the process. That could happen in userspace, if glibc implemented something like that, but it'd be much more compelling for it to do so if the kernel offered an underlying mechanism or support of some kind.
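
            As a very rough userspace sketch of that idea (entirely hypothetical; glibc offers nothing like this today, and a real version would want kernel input to resize the pool on the fly):

              #include <pthread.h>
              #include <stdlib.h>
              #include <unistd.h>

              /* One process-global queue that every library submits to,
               * instead of each spawning its own worker threads. LIFO order
               * and error handling kept minimal for brevity. */
              typedef struct work {
                  void (*fn)(void *);
                  void *arg;
                  struct work *next;
              } work_t;

              static work_t *queue_head;
              static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
              static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
              static int pool_started;

              static void *pool_worker(void *unused)
              {
                  (void)unused;
                  for (;;) {
                      pthread_mutex_lock(&lock);
                      while (!queue_head)
                          pthread_cond_wait(&cond, &lock);
                      work_t *w = queue_head;
                      queue_head = w->next;
                      pthread_mutex_unlock(&lock);
                      w->fn(w->arg);   /* run the item outside the lock */
                      free(w);
                  }
                  return NULL;
              }

              /* wq_submit() is the hypothetical entry point: every library in
               * the process enqueues here, so there is exactly one pool no
               * matter how many libraries want parallelism. */
              void wq_submit(void (*fn)(void *), void *arg)
              {
                  pthread_mutex_lock(&lock);
                  if (!pool_started) {   /* lazily size ONE pool per process */
                      long n = sysconf(_SC_NPROCESSORS_ONLN);
                      for (long i = 0; i < n; i++) {
                          pthread_t t;
                          if (pthread_create(&t, NULL, pool_worker, NULL) == 0)
                              pthread_detach(t);
                      }
                      pool_started = 1;
                  }
                  work_t *w = malloc(sizeof *w);
                  w->fn = fn;
                  w->arg = arg;
                  w->next = queue_head;
                  queue_head = w;
                  pthread_cond_signal(&cond);
                  pthread_mutex_unlock(&lock);
              }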

            Originally posted by filbo View Post
            Real world general-use apps offer user control over how many threads to launch!
            No, don't put it on the user. Leave it to the OS to mediate, like it does other details of work scheduling. It bugs me when I have to manually specify the number of threads or jobs for a program to use.

            Perhaps there could be an override, like how you can use nice to override job priority, but the default behavior should be smart enough to work in most cases.
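
            For instance (names invented here), the override could be as simple as an environment variable on top of a sensible default:

              #include <stdlib.h>
              #include <unistd.h>

              /* Hypothetical convention, analogous to nice(1): use a default
               * unless the user explicitly overrides it. A truly smart default
               * would come from the OS, not from a bare CPU count. */
              static long worker_count(void)
              {
                  const char *env = getenv("WORKERS_OVERRIDE");  /* invented name */
                  if (env) {
                      long n = atol(env);
                      if (n > 0)
                          return n;
                  }
                  return sysconf(_SC_NPROCESSORS_ONLN);
              }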

            Originally posted by filbo View Post
            I'm sure some users would still gripe about having to bother to configure such apps to only demand (#P-cores * 2) threads (or fewer). But at that point it's a minor usability gripe, not a show stopper!
            It doesn't scale well to all the apps that currently or potentially have worker threads. You really don't want to have to configure the number of threads for everything from a media player or video game to a web browser or spreadsheet. At some point, even you would have to conclude that "computers should be smart enough to figure this out for me."
            Last edited by coder; 11 August 2023, 06:40 AM.

            Comment


            • #26
              Originally posted by mrg666 View Post
              Oh wow, thank you for taking the time; I'm glad my bizarre theories gave you so much inspiration to write in the end.
              Thanks for the kind words. I hope my tone wasn't too harsh. Your questions and ideas are worthy of a thoughtful response, but I sometimes get a bit weary.

              Originally posted by mrg666 View Post
              It is still nonsense what Intel did with AVX-512: why do those unused transistors occupy valuable silicon real estate? It's either incompetence or deception. Whatever.
              I'm with you 100%. I think the original sin was Intel's implementation of AVX-512 in 14 nm, or at least implementing it at full width in 14 nm. Had they not taken the plunge so aggressively, they wouldn't have been in the position of having to walk it back.

              That got us where we are, today: it's a mess and it makes me sad. Furthermore, I can't help but feel Intel is doing something a bit anti-competitive with AVX10.

              Comment


              • #27
                Originally posted by coder View Post
                Given that this doesn't affect any current model Intel CPUs, I think they don't mind just leaving well-enough alone and telling customers that they can recover lost performance simply by buying a shiny, new Intel CPU.
                In which case, the right answer seems to be to make third-party fixes and have those apply to unaffected CPUs as well. If for nothing else, then out of spite.

                Comment


                • #28
                  Thank you, Michael, for that pioneering work! Now that the first fallout is visible, I can imagine the impact is on the same scale as the last security issues. Definitely worse than expected; let's see how this momentum affects the industry.

                  Comment


                  • #29
                    Great data, appreciated as always. Easily the best source on the internet for the impact of these mitigations.

                    Comment


                    • #30
                      Originally posted by coder View Post
                       LOL. Thread Director summarizes a core's recent execution history. The first AVX-512 instruction an E-core sees will trigger a SIGILL (illegal instruction), long before Thread Director can ever enter the picture. [...] AVX-512 was messy to begin with, and everything they've done with it, since the launch of Alder Lake, has just made matters worse.
                       Some good points. I think the ideal solution would be no AVX-512 out of the box (not much consumer software uses it, and this is a consumer chip), but also a BIOS option that enables AVX-512 at the cost of disabling the E-cores. Then everyone would be happy, I guess.

                      Comment
