Announcement

**Weasel** · 15 April 2019, 11:37 AM

Originally posted by wizard69 View Post

is this true on a modern SoC with an integrated GPU with access to all memory?

I don't know. I've yet to see the contrary, so why don't you show an example?

**coder** · 16 April 2019, 02:17 AM

Originally posted by sdack View Post

That's just what you think. Intel believes there is room for more and so do I. There is always room for more and it doesn't have to be big, but it also needs simple methods to give it a wider spectrum and to promote AI software. When you don't get that then that's too bad, but it's coming. Whether you want it, need it, like it or not. You're just having an opinion. Others do have a need for it.

I'm struggling to find any logical consistency, here. You're saying more horsepower is needed, but not a proper solution? It's like saying that the Pentium should've included instructions to make floating point arithmetic more efficient to emulate in software, rather than just including a proper FPU, to use your analogy.

And the market segmentation argument falls apart, because their low-end CPUs all have iGPUs, which are far more powerful and efficient than giving each CPU core a BFloat16-augmented AVX-512 unit (which they haven't announced), while the server solution (which will have the BFloat16 instructions) exists in the cloud that already has GPUs.

Then, you're saying that mine is an opinion, except that my opinion is wrong and yours is right. By definition, an opinion cannot be wrong.

Originally posted by sdack View Post

I suspect that some of you have forgotten about the bigger picture and what the true purposes of a CPU and GPU are. Just because AI software currently finds use in graphics doesn't mean such units should only exist in GPUs.

Yes, it does. The reason being that it's an incredibly parallel (and bandwidth-intensive) problem, like graphics. GPUs are optimized for such problems, which is why they've become the mainstay of AI, to date (it's no accident!). A general-purpose CPU will never be as competent at AI in exactly the same way that it'll never be as competent at graphics.

Originally posted by sdack View Post

So does the next generation of compression algorithms such as cmix use AI to achieve higher compression ratios than ever. When you then in 10-20 years from now compress files do you not want to require a GPU for it. You want the CPU, as in "Central Processing Unit", to do it.

What they'll do is exactly what they did for graphics, which is to integrate an AI unit on chip.

Originally posted by sdack View Post

The reason why we see AI units in GPUs first is because of applications such as image recognition, which deal with graphics.

Thanks for removing all doubt that you're completely out of your depth.

Originally posted by sdack View Post

It doesn't mean AI units should only exist in GPUs.

Actually, the only thing better than GPUs are specialized ASICs - or special-function units, as we're seeing crop up in many cell phone SoCs.

I honestly don't know if you're just playing devil's advocate or assuming that it must make sense because Intel is doing it. Either way, you lack the foundation on which to build the case you're trying to make. If you want to take it on faith that this is a good move by Intel, that's your opinion and you're entitled to it.

**coder** · 16 April 2019, 02:38 AM

Originally posted by Weasel View Post

Who said it's just for training? Can be used for executing the AI network as well and applying it on data.

One can do a lot of things. Whether they're justifiable is another matter.

In practice, people use integer arithmetic for inferencing, because it's good enough. VNNI already has integer support. Where people use BFloat16, in actual practice, is for training. Any floating point you really need to use in inferencing isn't enough to justify BFloat16 over just using float32.

And I forgot to mention that the funny thing about your example of using it for audio processing is that IEEE 754 half-precision is actually much better suited to that application.

Originally posted by Weasel View Post

Throughput is irrelevant when you want real-time response.

Please don't be so obtuse. If you think about it for even a moment, it should be clear that I'm saying the time saved in processing will tend to outweigh the communications and context-switching overhead. Because convolutional neural networks are really expensive and the AVX-512 performance of a single core can never touch what a GPU could deliver. And once you go multi-core, you might as well use a GPU.

So, either your network is so trivial that you'd be fine with the existing float32 instructions, or you'd find that a GPU is fast enough to make up for the latency of using it.

Originally posted by Weasel View Post

Legacy Software - LiquidSonics

https://www.liquidsonics.com/software/reverberate-le/

Check the specs:

Stereo convolution reverb
Zero latency CPU edition (or up to 8192 samples for lower CPU usage)
Low latency GPU edition (512 – 8192 samples delay)
Wav, Aiff and Flac support
ADSHR envelopes
50%-150% stretch
Double oversampled IR-EQ
Pre-delay up to 500ms
GPU Edition utilises the power of NVIDIA CUDA

Keep in mind this is just one plugin in the chain.

Thanks for illustrating my earlier point, beautifully! That page references a GPU from 2007!

You're clearly stretching to make your case. Try dropping the confirmation bias and running the actual numbers to see whether there's a case, here. I'll bet you have no clue what types of networks people use for audio processing, how big they are, or what types of layers they actually use.

**coder** · 16 April 2019, 02:46 AM

Originally posted by Weasel View Post

Originally posted by wizard69 View Post

is this true on a modern SoC with an integrated GPU with access to all memory?

I don't know. I've yet to see the contrary, so why don't you show an example?

The Shared Virtual Memory feature of OpenCL 2.0 allows coherent sharing of user-allocated memory with the GPU. AMD designed Vega to support using HBM2 as a cache of main memory. Nouveau managed to implement SVM on recent Nvidia GPUs, and we know Intel supports OpenCL 2.0+ on Gen8 iGPUs (i.e. Broadwell and later).

So, it seems like all current GPUs should be able to do this.

**sdack** · 16 April 2019, 07:23 AM

Originally posted by coder View Post

I'm struggling ...

You do this through out your entire commentary. And the one part where you don't have any issues with is where you believe that all the Intel engineers have it wrong and you'd know better. I get that you wouldn't want to take it from a stranger on the Internet, but when Intel implements it then you should at least have some doubt on your own position ...

Know this, AI problems aren't just massively parallel problems. They can also be massively serial and be both. So is deep learning all about feeding and training a neural network a lot of data. You cannot load the data all in one into the neural network, but it needs to be processed in sequence, because the sequence is the important part of the learning. So when a neural network detects shapes in an image is it really tracing the outline as a path and learns the shape as a sequence of directions, which guides it around the shape. As such can it then detect the shape again and again, even when the shape gets rotated, stretched or slightly distorted.
But when a neural network is used to detect shapes in a million images then it doesn't need to learn about the sequence of the images, but the detection can run in parallel. I hope you can see how this is both a problem of serialisation as well as parallelization. If not then I suggest you find a better source and learn something about AIs. Or perhaps think about how one teacher can easily teach a class of 100 students all at the same time (aka "in parallel"), but that one student cannot learn from 100 teachers simultaneously without doing some serious multi-tasking.

**Weasel** · 16 April 2019, 10:16 AM

Originally posted by coder View Post

Please don't be so obtuse. If you think about it for even a moment, it should be clear that I'm saying the time saved in processing will tend to outweigh the communications and context-switching overhead. Because convolutional neural networks are really expensive and the AVX-512 performance of a single core can never touch what a GPU could deliver. And once you go multi-core, you might as well use a GPU.

You can't outweigh something that is literally the limiting factor, except by making it negative, and that's impossible.

Let's put it simply like this: you have X amount of time before you can finish processing. This X is very low for real time workloads, especially per plugin (because you can chain them up, which adds to it, sequentially). Speeding up computation within X will enable you to do more processing in the same time frame.

Doubling X (more latency) to do 10 times more computation will not help whatsoever because X is the target. Couldn't care less if you do 10 or 1 million times more work done if you increase X, it's a no go. I'm not sure what's so hard to understand?

The network is trivial, sure, but this is exactly why this addition can almost double the efficiency of the network while keeping the same X? Hence the whole point it is needed in the first place? So you can make it "less trivial" while keeping the same latency?

Originally posted by coder View Post

Thanks for illustrating my earlier point, beautifully! That page references a GPU from 2007!

It's the best I could find. Sure, maybe I'm wrong, so feel free to find a newer one? It would be very appreciated.

If you can't, then well, I rest my case. Perhaps because it just doesn't work, people don't really retry failed experiments. If you do find one, however, then it will be interesting to look it up, so please share.

**coder** · 20 April 2019, 01:52 AM

Originally posted by sdack View Post

the one part where you don't have any issues with is where you believe that all the Intel engineers have it wrong and you'd know better.

Not Intel engineers - Intel marketing! Or their product management, to be more precise. Those are the folks who determine their business strategy and specify their product portfolio, etc. Those are the geniuses who thought it'd be a good idea to build a x86 chip (Xeon Phi) to compete with GPUs.

Originally posted by sdack View Post

but when Intel implements it then you should at least have some doubt on your own position ...

You mean like Itanium, Pentium 4, Xeon Phi, and their cell phone SoCs? You really think Intel is infallible?

I think even without BFloat16, AVX-512 has proven to be a bad move. The market segmentation they've had to do (i.e. with all the different subsets) has made it even worse! I agree with AMD - AVX2 is probably as far as it makes sense to go, in a CPU. For anything wider, just use a GPU.

Originally posted by sdack View Post

I suggest you find a better source and learn something about AIs.

Seriously, you should take an online class on deep learning. Or just find a tutorial and play around with a popular framework. So far, it sounds like you just watched a short youtube video, which I'd say has probably done you more harm than good.

**coder** · 20 April 2019, 02:53 AM

Originally posted by Weasel View Post

You can't outweigh something that is literally the limiting factor, except by making it negative, and that's impossible.

This is the problem with trying to have an abstract argument about something concrete. My point is that a network that current-gen AVX-512 can inference in 10 ms and somehow gets a perfect speedup to 5 ms with BFloat16, will still be worse off than a GPU that can do it in 0.1 ms with 1 ms of communication overhead.

Originally posted by Weasel View Post

I'm not sure what's so hard to understand?

The concept isn't hard to understand, but it doesn't fit your example.

It boils down to this: If the work is trivial, then obviously keep it on the CPU. If the work is expensive, then obviously send it to the GPU. The issue is that there's really a very narrow range of tasks where making the CPU a bit faster lets it win vs. the GPU. Ideally, you'd keep data (and processing) in one place or the other, which is why frameworks like GStreamer have features to avoid data movement when multiple GPU-based elements are chained in sequence.

For your example, you picked an expensive task, where a single CPU core would typically take longer than the latency + compute time of the GPU. That was your mistake. That, and citing audio processing as a reason to use BFloat16 for inference. Typical errors that happen when you don't really know what you're talking about.

Originally posted by Weasel View Post

It's the best I could find. Sure, maybe I'm wrong, so feel free to find a newer one? It would be very appreciated.

Better yet, let's just cut to the heart of the matter and see what GPU latencies are really like, these days.

In recent generations, AMD and Nvidia have both added preemption support. From the NVIDIA Tesla P100 whitepaper:

Compute Preemption is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architectures. Compute Preemption prevents long-running applications from either monopolizing the system (preventing other applications from running) or timing out.

And their other Pascal-generation GPUs also have it. See: The NVIDIA GeForce GTX 1080 & GTX 1070 Founders Editions Review: Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks

AMD has similar. But, let's stick with Nvidia, since it concerns your example. So, what's a best-case scenario? Back in 2014 (so, pre-Pascal), this paper found a purpose-built GPU-exclusive processing framework could deliver latencies as low as 8 microseconds: https://www.ncbi.nlm.nih.gov/pubmed/24784666

Obviously, that's a bit hard-core. So, let's consider more typical CUDA usage scenarios. These guys have some special sauce for achieving < 10 microsecond dispatch latencies, but cite "hundreds of microseconds" as their baseline for comparison, which is a lot better than my example of 1 ms (above): https://www.concurrent-rt.com/products/gpu-workbench/

What you're forgetting is that people are using GPUs as the de facto platform for vision processing in countless robotics and even self-driving applications - all realtime. And on the PC, a good VR requires well under 20 ms of latency from sensors to eyeballs, including all of the processing and transport of the rendered image to the HMD. Realtime applications and latency reductions have received a lot of focus, in recent generations of GPUs. So, that's why you can't just take some product specs from 10+ years ago as gospel, as you seem so willing to do.

**Weasel** · 20 April 2019, 07:39 AM

Originally posted by coder View Post

This is the problem with trying to have an abstract argument about something concrete. My point is that a network that current-gen AVX-512 can inference in 10 ms and somehow gets a perfect speedup to 5 ms with BFloat16, will still be worse off than a GPU that can do it in 0.1 ms with 1 ms of communication overhead.

It's not necessarily that it will do it in 5ms, but the fact it can process twice as much (more complex effects) within the same time slot (10ms in your example, though it's a bit on the high side).

As for the rest, I really don't claim to understand much of how GPUs work, that's why I wanted examples (as in, real-time audio plugins using it). Not saying you're wrong, but tens or hundreds of microseconds is kinda high, you know that? That's more expensive than a context switch and those are already a huge limiting factor for this purpose, which is one of JACK's shortcomings. So it's not gonna cut it.

Ardour's dev mentioned such in the comments of some blog post, but I completely forgot where it was so I'm not going to look for it, sorry. Which is the reason he revamped Ardour to be able to chain plugins normally instead of having each of them using JACK. The latency, when chained, was just too great, and it didn't scale.

**coder** · 20 April 2019, 01:55 PM

Originally posted by Weasel View Post

It's not necessarily that it will do it in 5ms, but the fact it can process twice as much (more complex effects) within the same time slot (10ms in your example, though it's a bit on the high side).

The problem is that you look at that and say "w00t! Twice the effects!!!1111", while even if they all did use BFloat16 (which, again, is useless for audio of better than phone quality), you'd find that most processing elements in the pipeline are already quite fast and probably dominated by overhead. So, the overall pipeline-level speedup would tend to be far less.

Originally posted by Weasel View Post

tens or hundreds of microseconds is kinda high, you know that?

Again, you picked deep inferencing as the use case. It's not high, if we're talking about that.

Originally posted by Weasel View Post

That's more expensive than a context switch and those are already a huge limiting factor for this purpose, which is one of JACK's shortcomings.

GStreamer runs most adjacent elements in the same thread, but they foolishly decided that 'queue' elements should break the threading domain, yielding more threads in many GStreamer pipelines than you'd really need or want.

Originally posted by Weasel View Post

Which is the reason he revamped Ardour to be able to chain plugins normally instead of having each of them using JACK. The latency, when chained, was just too great, and it didn't scale.

It also tends to blow out your cache coherency. One nice thing about running the call stack down the pipeline is that the processing result from the previous stage will usually be sitting right in the cache hierarchy. If you can then do your processing in-place, then you can get some crazy throughput from these architectures, if using relatively simple filters.

Announcement

Intel Prepares GCC Compiler Support For BFloat16

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment