AMD Announces Radeon RX 7900 XTX / RX 7900 XT Graphics Cards - Linux Driver Support Expectations


  • Originally posted by StillStuckOnSI View Post
    Unless these accelerators are operating at extremely low precision (think 1, 2, 4 or 8 bits), there is no substantive difference between "inference" and "training". For example, to make use of tensor cores in many CUDA operations, your inputs need to be fp16 or tf32 (wider than fp16, but still lower precision than fp32) anyhow.
    But they might indeed be biased towards int8, or lower.

    You definitely need at least fp16 (bf16 is much better), for training. The other thing training tends to like is large memory. For the larger models, you also want multiple GPUs and fast interconnects. Those are the things which distinguish training-oriented vs. inference-oriented GPUs.
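    To make that concrete, here is a minimal PyTorch-style sketch of a bf16 mixed-precision training step; the tiny model, the random data, and the device handling are placeholder assumptions for illustration, not anything from the announcement:

    import torch
    import torch.nn as nn

    # Placeholder model and data; a real workload would substitute its own.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(64, 256, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    # Autocast runs the matmuls in bf16 (the format the matrix/tensor units
    # want) while keeping master weights and the loss in fp32 for stability.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()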

    Comment


    • Originally posted by WannaBeOCer View Post

      DSC definitely works on Linux with an Nvidia GPU. If it didn't, I wouldn't be able to run my 27GN950 at 4K/160Hz on Linux.

      https://github.com/NVIDIA/open-gpu-k...omment-2781922

      Ah, seems like higher refresh rates aren't supported yet. Hopefully they'll add support soon.
      https://forums.developer.nvidia.com/...ivers/198363/5 According to this thread, Linux doesn't support DSC; maybe it's down to your chroma subsampling?

      Comment


      • Originally posted by coder View Post
        But they might indeed be biased towards int8, or lower.

        You definitely need at least fp16 (bf16 is much better), for training. The other thing training tends to like is large memory. For the larger models, you also want multiple GPUs and fast interconnects. Those are the things which distinguish training-oriented vs. inference-oriented GPUs.
        It's quite the coincidence, but a search for "RDNA3 bf16" turns up a leak from today which seems to indicate they finally have support for the format: https://videocardz.com/newz/alleged-...gram-leaks-out. For the uninitiated, part of the reason Google's TPUs are so competitive for ML workloads is that their preferred input format is bf16. That doesn't mean any hardware which supports bf16 will automatically be faster, and it's not clear what rate it will run at on the new AMD cards (compared to fp32), but at least it'll make code more portable now that Intel/Nvidia/AMD/Google all support it in consumer-accessible hardware.
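        As a rough illustration of the portability angle, a sketch like the following (assuming a PyTorch-style stack; the matrix sizes are arbitrary) can stay the same across vendors and just fall back to fp32 where bf16 isn't advertised:

        import torch

        # Pick bf16 where the backend advertises support, otherwise fall back
        # to fp32; the rest of the code doesn't care which vendor is underneath.
        use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
        dtype = torch.bfloat16 if use_bf16 else torch.float32
        device = "cuda" if torch.cuda.is_available() else "cpu"

        a = torch.randn(1024, 1024, dtype=dtype, device=device)
        b = torch.randn(1024, 1024, dtype=dtype, device=device)
        c = a @ b  # dispatched to the bf16 matrix units when available
        print(c.dtype)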

        On interconnects, there (again) really isn't much of a difference between "training-oriented" and "inference-oriented". Very large models do need fast interconnects, but at that scale you're dealing with more than just NVLink because of cross-node communication. For smaller models that could conceivably fit on one machine, I think most people would consider something like a 4090 more "training" than "inference" despite it not having NVLink at all! Even in prior generations, you could hook up a couple of consumer cards with an NVLink bridge. That won't scale to the large models big companies are developing now, but it lets you train something like BERT. What does seem to distinguish GPUs explicitly sold for "inference", like the T4, is that they cut out all of the unnecessary display-related hardware and run at a much lower TDP (e.g. 75W). That's a very different niche than what a flagship compute part like an A100 or a high-end gaming GPU is targeting.

        Comment


        • Originally posted by verude View Post

          https://forums.developer.nvidia.com/...ivers/198363/5 According to this thread, Linux doesn't support DSC; maybe it's down to your chroma subsampling?
          Definitely not chroma subsampling; I think I'd trust an Nvidia driver developer over a forum moderator.

          Comment


          • Originally posted by coder View Post
            Intel was trying, with its Iris models that featured 2x or 3x the normal number of EUs and up to 128 MB of eDRAM.


            Because, even then, it wasn't terribly good. There were bottlenecks in the architecture that kept the GPU from scaling well. So, performance was good, but probably not enough to justify the added price or steer power users away from a dGPU option.


            But that was just weird. And the value-add compared with having a truly separate dGPU was tenuous, at best.


            According to whom? Didn't Valve contract with AMD specifically to make it for them? In those sorts of arrangements, Valve would retain ownership of the IP. At least, that's supposedly how it is with MS and Sony.
            No, Van Gogh was on leaked AMD roadmaps before Valve would have presumably ordered it for the Deck: https://videocardz.com/newz/amd-mobi...g-with-vangogh


            The less corroborated rumor is that OEMs other than Valve simply rejected it.
            Last edited by brucethemoose; 08 November 2022, 02:35 AM.

            Comment


            • Originally posted by StillStuckOnSI View Post
              It's quite the coincidence, but a search for "RDNA3 bf16" turns up a leak from today which seems to indicate they finally have support for the format: https://videocardz.com/newz/alleged-...gram-leaks-out. For the uninitiated, part of the reason Google's TPUs are so competitive for ML workloads is that their preferred input format is bf16. That doesn't mean any hardware which supports bf16 will automatically be faster,
              It has better numerical stability and converts trivially to/from fp32. Furthermore, the silicon footprint of an fp multiplier scales roughly as the square of the mantissa width. That gives bf16 an advantage not only in density, but also in energy efficiency.

              The downside is that it's not much good for a whole lot else, due to having so little precision.
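              Both points are easy to sanity-check with a few lines of Python; the helper names below are made up for illustration, but they show that bf16 is essentially the top 16 bits of fp32, plus how a mantissa-squared area estimate plays out:

              import struct

              def fp32_to_bf16_bits(x: float) -> int:
                  # Keep the top 16 bits of the fp32 encoding (real hardware
                  # usually also rounds to nearest even; truncation is enough
                  # to show why the conversion is cheap).
                  bits = struct.unpack(">I", struct.pack(">f", x))[0]
                  return bits >> 16

              def bf16_bits_to_fp32(b: int) -> float:
                  # Going back is just padding the low 16 bits with zeros.
                  return struct.unpack(">f", struct.pack(">I", b << 16))[0]

              print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # 3.140625

              # Rough multiplier-area comparison: area grows roughly with the
              # square of the significand width (including the implicit bit).
              for name, mant in [("fp32", 24), ("fp16", 11), ("bf16", 8)]:
                  print(f"{name}: ~{(mant / 24) ** 2:.2f}x the fp32 multiplier area")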

              Originally posted by StillStuckOnSI View Post
              and it's not clear what rate it will run at on the new AMD cards (compared to fp32), but at least it'll make code more portable now that Intel/Nvidia/AMD/Google all support it in consumer-accessible hardware.
              Well, my guess is their AI units don't even support fp32, in which case it's probably a moot point.

              Originally posted by StillStuckOnSI View Post
              On interconnects, there (again) really isn't much of a difference between "training-oriented" and "inference-oriented".
              There is, because training typically involves a lot more data. Even the models are bigger, because they haven't yet been optimized.

              Originally posted by StillStuckOnSI View Post
              What does seem to distinguish GPUs explicitly sold for "inference" like the T4 is that they cut out all of the unnecessary display-related hardware and run at a much lower TDP (e.g. 75W). That's a very different niche than what a flagship compute part like an A100 or a high-end gaming GPU is targeting.
              The training GPUs don't have display hardware, either. And the lower clock speed has not so much to do with training vs. inference, and everything to do with things like power-efficiency, durability, and density -- all things you want in server-oriented GPUs.

              Comment


              • The question is whether people (especially ML practitioners) use "training oriented" and "inference oriented" to describe particular models or product lines of GPUs. Outside of the Tesla T4/Jetson lineage, I have not seen anything vaguely resembling this terminology being thrown around and I've certainly not seen the exact wording. Moreover, I've definitely not seen it being used to distinguish between something like a 3090 vs an A100. Using the former just for inference would be a waste, and when somebody buys one for ML they usually plan on training on it. On the other hand, for stuff like large language models you often need the latter for inference, because they won't fit on one V100/A100/H100. So separating them into "training" and "inference" categories is a false dichotomy.

                Comment


                • Originally posted by StillStuckOnSI View Post
                  The question is whether people (especially ML practitioners) use "training oriented" and "inference oriented" to describe particular models or product lines of GPUs. Outside of the Tesla T4/Jetson lineage, I have not seen anything vaguely resembling this terminology being thrown around and I've certainly not seen the exact wording.
                  It's a distinction commonly used to describe deep learning ASICs.

                  For instance: https://www.anandtech.com/show/14187...ators-for-2020

                  It’s not being called an AI training accelerator, it’s not being called a GPU, etc. It’s only being pitched for AI inference – efficiently executing pre-trained neural networks.

                  However, that is not an isolated example.

                  Originally posted by StillStuckOnSI View Post
                  I've definitely not seen it being used to distinguish between something like a 3090 vs an A100. Using the former just for inference would be a waste,
                  First off, Nvidia doesn't permit gaming cards to be used in data centers. So, they wouldn't even market the RTX 3090 for deep learning.

                  Second, you should be looking at whether it's more cost-effective to use the A40 or the A100 for inference, and then tell me using the A40 for inference is a waste.

                  Originally posted by StillStuckOnSI View Post
                  and when somebody buys one for ML they usually plan on training on it.
                  Because you're probably a student or hobbyist, and that's the best thing you can afford to train on. Moreover, a researcher is primarily focused on model development, not deployment at scale. When a model has been developed for commercial purposes, it needs to be deployed to achieve a return on the investment of developing it. That means putting a lot more data through it than would typically be used to train it. And that means you want hardware that's not overkill for the purpose, since you're probably using many instances and tying them up for long periods of time.

                  Originally posted by StillStuckOnSI View Post
                  So separating them into "training" and "inference" categories is a false dichotomy.
                  The word "oriented" is key. Nobody is saying you couldn't use an A100 for inference, just that it's generally overkill for that task.

                  Comment


                  • Talking about APUs, does anyone know if they (well, the GPU part) work in tandem when a dGPU (AMD, of course) is also present?

                    Comment


                    • Originally posted by NeoMorpheus View Post
                      Talking about APUs, does anyone know if they (well, the GPU part) work in tandem when a dGPU (AMD, of course) is also present?
                      I have no idea if AMD has announced anything like that. Intel was talking about it a couple of years ago -- I think particularly with their DG1, which was very much like the larger Xe iGPUs in their notebook CPUs.

                      My own opinion on this is that it doesn't make much sense, unless you're pairing a low-powered dGPU with an APU having a relatively high-powered iGPU (and this is exactly the situation with a "G7" Tigerlake U + DG1, which are nearly twins). If they're too asymmetric in performance, then the iGPU isn't contributing enough to be worth the trouble and added overhead.

                      The way I'd use an iGPU -- or perhaps other secondary GPUs, were I a game developer -- would be to find other compute tasks to dispatch to it. Perhaps physics, audio, or AI. This could unburden the faster GPU and CPU from handling such tasks. Furthermore, the secondary GPU shouldn't even need to be the same make as the primary.
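                      Just to sketch that idea (not anything AMD or Intel has actually shipped), something like the following PyTorch-style snippet dispatches a toy physics step to a second compute device and only copies the result back when the primary device needs it; the device indices and the workload are assumptions:

                      import torch

                      # Assumed topology: device 0 is the fast primary GPU, device 1 is the
                      # iGPU/secondary GPU exposed as another compute device.
                      n_gpus = torch.cuda.device_count()
                      primary = torch.device("cuda:0" if n_gpus > 0 else "cpu")
                      secondary = torch.device("cuda:1") if n_gpus > 1 else primary

                      positions = torch.rand(10_000, 3, device=secondary)
                      velocities = torch.rand(10_000, 3, device=secondary)

                      def physics_step(pos, vel, dt=0.016):
                          # Toy integration step standing in for offloadable work
                          # (physics, audio, AI inference, etc.).
                          return pos + vel * dt

                      # Kernels are queued asynchronously, so the secondary device can chew
                      # on this while the primary GPU and CPU handle rendering-style work.
                      new_positions = physics_step(positions, velocities)

                      # Copy results to the primary device only when it actually needs them.
                      result_on_primary = new_positions.to(primary, non_blocking=True)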
                      Last edited by coder; 11 November 2022, 12:31 AM.

                      Comment
