Tachyum Gets FreeBSD Running On Their Prodigy ISA Emulation Platform For AI / HPC


  • #31
    Originally posted by coder View Post
    Well, Intel seems to think so. That's why Sapphire Rapids is getting AMX (8192-bit matrix extensions).
    It can definitely make running ML models more efficient on CPU.

    While it is true that GPUs are used for training, inference can also be run on the CPU, which is much cheaper than renting cloud machines with GPUs.

    Originally posted by coder View Post
    Meanwhile, AMD is spending its die area on lots more cores.
    IMHO, having more cores is more important than having special instructions to accelerate certain workflows, since most ML models deployed in production are quite large, complex and data-hungry, so more cores will help more than special instructions.

    It will be even better if we can have both.
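
    As a rough way to see how much plain core scaling buys you before any special matrix instructions come into play, here is a minimal Python sketch (it assumes NumPy backed by a multi-threaded BLAS; the matrix size, dtype and thread counts are just illustrative):

    Code:
    # Hypothetical sketch: measure dense GEMM throughput on the CPU, then rerun
    # with e.g. OMP_NUM_THREADS=1 vs. all cores to see how much core count helps.
    import time
    import numpy as np

    def gemm_gflops(n=4096, repeats=3):
        a = np.random.rand(n, n).astype(np.float32)
        b = np.random.rand(n, n).astype(np.float32)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            a @ b
            best = min(best, time.perf_counter() - t0)
        # A dense n x n matrix multiply does roughly 2*n^3 floating-point ops.
        return 2 * n ** 3 / best / 1e9

    if __name__ == "__main__":
        print(f"~{gemm_gflops():.1f} GFLOP/s")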

    Originally posted by coder View Post
    I think most cloud users with AI workloads are already moving to special-purpose accelerators that can be scaled up far beyond the number of CPUs in a system.
    Tech giants and big companies/banks have definitely moved to special-purpose accelerators.

    But not everybody has enough financial resources for that; special accelerators are still quite expensive, so I believe there are still a lot of cases where inference is run on the CPU.



    • #32
      Originally posted by paulpach View Post
      they actually have hundreds of physical registers. Each one of them can feed the execution units. The routing of data between hundreds of registers to the execution units is a huge power drain.
      VLIW has the same problem. So much so, in fact, that I was seeing papers in the late 1990's about how interconnect was going to become the new scaling bottleneck and people were already talking about moving beyond VLIW to Transport-Triggered Architectures.

      Originally posted by paulpach View Post
      Rather the instruction scheduling is done by the compiler. They have an order of magnitude fewer physical registers.
      Not from what I've read. IA64, which is getting on towards 25 years old, had 128 64-bit registers. And I've programmed a classical VLIW CPU from the late 1990's that had 64.

      I've seen stats on the number of shadow registers in x86 CPUs, but they're proving elusive to track down. However, it's safe to say the shadow register file is no bigger than the instruction reorder window, which puts it in the range of a few hundred (i.e. the same order of magnitude as VLIW).

      Anyway, this is such a weird point. I don't see why you'd think compile-time scheduling needs drastically fewer registers than runtime OoO. Whether your scheduling is at compile-time or runtime, if you're scheduling roughly the same number of execution units you probably need comparable amounts of registers. The main difference being that you have to actually save/restore all your ISA registers - not so, for shadow registers.

      Originally posted by paulpach View Post
      VLIW architectures, typically found in DSPs, are 100x more power-efficient than x86 and ARM,
      That's a highly-suspect figure. I think a lot of it depends on data movement and things like using directly addressable on-die SRAM. I don't disagree that OoO burns a lot of energy, mind you. But it's nowhere near 100x.

      Originally posted by paulpach View Post
      Itanium failed because it stalled a lot in general-purpose code. If someone figures out how to avoid stalling in general-purpose code with a VLIW architecture, they would be sitting on a gold mine.
      It's not true VLIW - it's EPIC, which has some amount of runtime scheduling. They actually could've implemented branch prediction and limited OoO, but Intel had already lost interest by the second or third generation.



      • #33
        Originally posted by NobodyXu View Post
        It can definitely make running ML models more efficient on CPU.

        While it is true that GPUs are used for training, inference can also be run on the CPU, which is much cheaper than renting cloud machines with GPUs.
        I'll say this, yet again: bandwidth, bandwidth, bandwidth! That's what GPUs and purpose-built AI chips have that CPUs lack!

        You guys should really check out Hot Chips presentations for the past few years. All of the AI chips are at least as focused on memory bandwidth as they are compute and interconnects. There's one company using 900 MB of on-die SRAM, providing 62 TB/sec of aggregate bandwidth. It's such a big problem they burned half their die space on SRAM, and their die is the biggest supported by TSMC's N7 node.
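
        To put rough numbers on that, here's a back-of-envelope Python sketch (the model size and the DDR5/HBM bandwidth figures are illustrative assumptions; only the 62 TB/sec number comes from the presentation mentioned above). If every weight has to be streamed from memory once per inference, bandwidth alone caps throughput:

        Code:
        # Upper bound on inferences/sec if the whole model is re-read from memory
        # for each inference. All figures except the 62 TB/s are rough assumptions.
        model_bytes = 350e6        # hypothetical ~350 MB of weights
        memories = {
            "dual-channel DDR5": 80e9,     # ~80 GB/s, assumed
            "HBM2e stack(s)":    1.6e12,   # ~1.6 TB/s, assumed
            "on-die SRAM":       62e12,    # the 62 TB/s aggregate figure above
        }
        for name, bw in memories.items():
            print(f"{name:18s} bound: {bw / model_bytes:12,.0f} inferences/sec")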

        Originally posted by NobodyXu View Post
        Tech giants and big companies/banks have definitely moved to special-purpose accelerators.
        You mention banks and since they're big mainframe customers, I will concede that the mainframe market is weird. They pay top $$$ for ultra-stable and certified hardware. So, it makes sense to me that IBM built its own AI hardware and I think most of those customers will simply take whatever IBM gives them.

        Originally posted by NobodyXu View Post
        But not everybody has enough financial resources for that; special accelerators are still quite expensive, so I believe there are still a lot of cases where inference is run on the CPU.
        Intel thinks so, but I expect this trend will taper off. IMO, it's currently being prolonged by the shortage of silicon production capacity hampering the ability to get AI accelerators scaled up and into broader usage.



        • #34
          Originally posted by coder View Post
          I'll say this, yet again: bandwidth, bandwidth, bandwidth! That's what GPUs and purpose-built AI chips have that CPUs lack!

          You guys should really check out Hot Chips presentations for the past few years. All of the AI chips are at least as focused on memory bandwidth as they are compute and interconnects. There's one company using 900 MB of on-die SRAM, providing 62 TB/sec of aggregate bandwidth. It's such a big problem they burned half their die space on SRAM, and their die is the biggest supported by TSMC's N7 node.
          Yeah, because the CPU is purposely optimized for latency; the latest DDR5 also focuses on improving latency, not so much bandwidth.

          ML tasks need a lot of bandwidth because of the large data sets and the large, complex models.

          But still, it is not impossible to run inference on the CPU, and I think for some production environments it might be cheaper.

          Originally posted by coder View Post
          You mention banks and since they're big mainframe customers, I will concede that the mainframe market is weird. They pay top $$$ for ultra-stable and certified hardware. So, it makes sense to me that IBM built its own AI hardware and I think most of those customers will simply take whatever IBM gives them.
          Check out this video https://www.youtube.com/watch?v=ZDtaanCENbc from LTT; they paid IBM Canada a visit and played with their latest mainframe, the z16.

          It mentions that the z16 has AI accelerators so that banks can run ML algorithms on transactions to pick out fraud.

          And yeah, these banks are definitely going to buy them once they are available; it is a zero-risk investment for IBM to develop a new mainframe.

          Originally posted by coder View Post
          Intel thinks so, but I expect this trend will taper off. IMO, it's currently being prolonged by the shortage of silicon production capacity hampering the ability to get AI accelerators scaled up and into broader usage.
          I definitely agree that these accelerators are the future; something like Google's TPU is going to be very helpful.



          • #35
            Originally posted by NobodyXu View Post

            But not everybody has enough financial resources for that; special accelerators are still quite expensive, so I believe there are still a lot of cases where inference is run on the CPU.
            Sure, but let me rephrase the problem to illustrate my point better. The reason AI is so bandwidth-hungry is that the field is so dominant right now that we are naturally pushing it to the limits. But this problem isn't specific to AI; it's just that AI is where the market currently is.



            • #36
              Originally posted by NobodyXu View Post
              Yeah, because the CPU is purposely optimized for latency; the latest DDR5 also focuses on improving latency, not so much bandwidth.
              On-die SRAM and in-package memory would also be much lower-latency.

              No, what CPUs are really optimized for is memory capacity. Being able to swap out faulty DIMMs is another key point.

              Intel started to re-balance, with Knights Landing (which had 16 GB of in-package HMC DRAM) and soon select Sapphire Rapids CPUs, which will feature HBM2. AFAIK, DDR5 will continue to be supported even with these HBM2 versions. If the trend takes hold, maybe we'll see a rapid transition towards CXL.mem, for expandability.

              Originally posted by NobodyXu View Post
              ML tasks need a lot of bandwidth because of the large data sets and the large, complex models.
              Mostly the model, in fact. The models are much larger than the data run through them. That's why on-chip storage is such a win. If you can keep much or all of your model on-chip, then you can afford to use cheap DDR4 DIMMs for capacity.
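
              A rough sketch of that point in Python (all sizes are made-up, illustrative numbers): the per-sample activations are tiny next to the weights, so once the weights stay on-chip, almost no traffic is left for the external DRAM.

              Code:
              # Memory traffic per inference, with and without the weights resident
              # on-chip. The sizes are hypothetical, just to show the ratio.
              weights_bytes     = 500e6   # ~500 MB model, assumed
              activations_bytes = 2e6     # ~2 MB of per-sample activations, assumed

              dram_traffic_no_cache = weights_bytes + activations_bytes
              dram_traffic_on_chip  = activations_bytes   # weights held in SRAM
              print(f"DRAM traffic: {dram_traffic_no_cache / 1e6:.0f} MB/inference vs "
                    f"{dram_traffic_on_chip / 1e6:.0f} MB/inference with weights on-chip")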

              Originally posted by NobodyXu View Post
              But still, it is not impossible to run inference on the CPU, and I think for some production environments it might be cheaper.
              Depends on how much you're doing and how much the provider can milk customers for access to fancy AI accelerators. As AI accelerators become more commodity, there's no way CPU-based inference will continue to be cheaper. It'd be like saying that CPU-based game rendering is cheaper than using a GPU.

              That said, an interesting corner case seems to be emerging. Just as mainstream CPUs have integrated iGPUs, Sapphire Rapids' AMX could blur the lines, slightly. However, as I've mentioned, Intel can't sweep away the performance differential enabled by the massive amounts of on-die or stacked SRAM that custom-built AI chips are using. So, there will remain a threshold beyond which it doesn't make sense to use their CPUs for inferencing.

              Originally posted by NobodyXu View Post
              It mentions that the z16 has AI accelerators so that banks can run ML algorithms on transactions to pick out fraud.
              Yes. These were detailed (to the extent IBM will disclose) in their Hot Chips talk. Just because they're faster than using the CPU cores doesn't make them better than using a purpose-built AI chip. But they're likely good enough to run the models IBM's key customers need. For now.

              Originally posted by NobodyXu View Post
              And yeah, these banks are definitely going to buy them once they are available; it is a zero-risk investment for IBM to develop a new mainframe.
              I do wonder how long the mainframe market will continue. I think a thorough reliability analysis of cloud computing technologies, employing software-based fault tolerance, would likely show the price premium of mainframes to be unjustifiable. At that point, it's mainly a question of when existing customers want to rewrite their software.



              • #37
                Originally posted by mdedetrich View Post
                The reason AI is so bandwidth-hungry is that the field is so dominant right now that we are naturally pushing it to the limits
                No, it's due to the simple fact that AI models are huge.

                If you're speculating about something you don't know, don't try to sound so authoritative. Speculation is fine. We all do it. It's casting speculation as statement of fact that's a problem, because that's misleading to people who are similarly clueless.

                Originally posted by mdedetrich View Post
                but this problem isn't specific to AI; it's just that AI is where the market currently is.
                Not all problems are so bandwidth-hungry. Otherwise, AI processors wouldn't be so exceptional in their memory architecture.

                It used to be that graphics was the most bandwidth-intensive of common problem domains, but it's now been completely eclipsed by AI.



                • #38
                  Originally posted by coder View Post
                  VLIW has the same problem. So much so, in fact, that I was seeing papers in the late 1990's about how interconnect was going to become the new scaling bottleneck and people were already talking about moving beyond VLIW to Transport-Triggered Architectures.
                  It really doesn't. Out-of-order processors use reorder buffers with hundreds of entries; the Apple M1 has a reorder window of 500 or so. They also have a bunch of reservation stations, each of which holds a register. When the programmer uses an instruction such as ADD r2,r1,r3, those registers get mapped onto the hundreds of physical registers during the issue phase.

                  VLIW architectures don't do that. The physical registers and logical registers are the same, so they just have a few dozen registers and that is it. There are a lot fewer places to route data through.
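
                  A toy Python sketch of that difference (illustrative only, not a model of any real core): an OoO front end renames every destination to a fresh physical register out of a large pool, while a VLIW machine simply writes the architectural register the compiler named.

                  Code:
                  # Toy register renaming, roughly what happens at issue time in an
                  # OoO core. A VLIW core skips all of this: logical == physical.
                  class RenameTable:
                      def __init__(self, arch_regs=32, phys_regs=300):
                          # Initially each architectural register maps to itself.
                          self.mapping = {f"r{i}": i for i in range(arch_regs)}
                          self.free = list(range(arch_regs, phys_regs))

                      def rename(self, dst, srcs):
                          # Sources read whichever physical register currently
                          # holds the architectural value.
                          phys_srcs = [self.mapping[s] for s in srcs]
                          # The destination gets a brand-new physical register, so
                          # a later reuse of the same name causes no false hazard.
                          phys_dst = self.free.pop(0)
                          self.mapping[dst] = phys_dst
                          return phys_dst, phys_srcs

                  rt = RenameTable()
                  print(rt.rename("r2", ["r1", "r3"]))  # ADD r2,r1,r3 -> e.g. (32, [1, 3])
                  print(rt.rename("r5", ["r2", "r2"]))  # reads the renamed r2 (phys 32)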

                  Originally posted by coder View Post
                  Not from what I've read. IA64, which is getting on towards 25 years old, had 128 64-bit registers. And I've programmed a classical VLIW CPU from the late 1990's that had 64.
                  Yes, and that is a lot less than the reorder window and reservation stations you find in an out-of-order CPU.

                  But regardless, VLIW is not required to use that many registers at all. The ones you are familiar with may have that many, but a VLIW CPU can use the same number of logical registers as your everyday x86 or ARM CPU. For example, the modern Texas Instruments C6000 series only has 32 general-purpose registers per register file.

                  Originally posted by coder View Post
                  Anyway, this is such a weird point. I don't see why you'd think compile-time scheduling needs drastically fewer registers than runtime OoO. Whether your scheduling is at compile-time or runtime, if you're scheduling roughly the same number of execution units you probably need comparable amounts of registers. The main difference being that you have to actually save/restore all your ISA registers - not so, for shadow registers.
                  You don't need the Tomasulo algorithm, no need for reservation stations, no need for a reorder window. Those are used to increase ILP while resolving data hazards. In a VLIW there are no runtime data hazards at all; they are all resolved by the compiler. So the hardware can just implement the logical registers as physical registers and call it a day.
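
                  Here is a deliberately naive Python sketch of that compile-time bundling, i.e. the work an OoO scheduler would otherwise do at runtime (a real VLIW compiler also models latencies, functional-unit types and software pipelining):

                  Code:
                  # Greedy bundling: pack instructions into a bundle until a RAW/WAW
                  # hazard or the slot limit forces the start of a new bundle.
                  def bundle(instrs, slots=4):
                      """instrs: list of (dst, srcs) tuples in program order."""
                      bundles, current, written = [], [], set()
                      for dst, srcs in instrs:
                          hazard = dst in written or any(s in written for s in srcs)
                          if hazard or len(current) == slots:
                              bundles.append(current)
                              current, written = [], set()
                          current.append((dst, srcs))
                          written.add(dst)
                      if current:
                          bundles.append(current)
                      return bundles

                  prog = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r1", "r2"]), ("r4", ["r0"])]
                  for cycle, b in enumerate(bundle(prog)):
                      print(f"cycle {cycle}: {b}")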

                  Originally posted by coder View Post
                  That's a highly-suspect figure. I think a lot of it depends on data movement and things like using directly addressable on-die SRAM. I don't disagree that OoO burns a lot of energy, mind you. But it's nowhere near 100x.
                  The TMS320C6474 uses 4-8 watts and can do 28,800 million instructions per second.
                  The i9-12900K uses 50-358 watts and can do 97,671 million instructions per second.

                  28800 / 8 = 3600 MIPS per watt
                  97671 / 358 = 272 MIPS per watt

                  So you are right, not 100x the performance per watt, more like 13x. The comparison is not exactly fair, since the i9 uses much better node technology, but even with that node disadvantage, the DSP comes out ahead with a massive win.
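
                  The same arithmetic in Python, taking the quoted figures at face value (peak-MIPS comparisons across such different chips are rough at best):

                  Code:
                  # MIPS-per-watt comparison using the worst-case power figures quoted above.
                  dsp_mips, dsp_watts = 28_800, 8      # TMS320C6474
                  x86_mips, x86_watts = 97_671, 358    # i9-12900K

                  dsp_eff = dsp_mips / dsp_watts       # 3600 MIPS/W
                  x86_eff = x86_mips / x86_watts       # ~273 MIPS/W
                  print(f"DSP: {dsp_eff:.0f} MIPS/W, x86: {x86_eff:.0f} MIPS/W, "
                        f"ratio ~{dsp_eff / x86_eff:.1f}x")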

                  Unfortunately, the TMS320C6474 completely sucks for general-purpose code, which is full of branches. It requires carefully crafted assembly that takes advantage of software pipelining and whatnot.

                  In an OoO CPU, a good 90% of the power usage goes to figuring out exactly which instruction to send to the execution units; actually executing the instructions is a tiny part of the power budget. In a VLIW, instructions are decoded and fed directly to the execution units, so power usage is much smaller.

                  If someone managed to crack the nut and get DSP-like performance with general-purpose code, that would be huge. Did Tachyum do it? I don't know, but I do think this is possible.

                  Originally posted by coder View Post
                  It's not true VLIW - it's EPIC, which has some amount of runtime scheduling. They actually could've implemented branch prediction and limited OoO, but Intel had already lost interest by the second or third generation.
                  The Tachyum Prodigy is supposedly not strictly VLIW either, but I can't find any details.
                  Last edited by paulpach; 07 April 2022, 04:55 PM.



                  • #39
                    It's not a VLIW processor - this has been said over and over. https://youtu.be/lQ1wUnsh5Qk?t=5174



                    • #40
                      Originally posted by ldesnogu View Post
                      When something sounds too good to be true, it usually isn't.
                      Not always with tech and mathematics - this looks good (one may prefer something else, but that's another matter) and it is good.

