Intel Prepares GCC Compiler Support For BFloat16
-
Originally posted by sdack View Post
That's just what you think. Intel believes there is room for more, and so do I. There is always room for more; it doesn't have to be big, but it also needs simple methods to give it a wider spectrum and to promote AI software. If you don't get that, that's too bad, but it's coming, whether you want it, need it, like it or not. You're just stating an opinion. Others do have a need for it.
And the market segmentation argument falls apart: their low-end CPUs all have iGPUs, which are far more powerful and efficient than a BFloat16-augmented AVX-512 unit in each CPU core (which they haven't announced anyway) would be, while the server parts (which will get the BFloat16 instructions) live in clouds that already have GPUs.
Then, you're saying that mine is an opinion, except that my opinion is wrong and yours is right. By definition, an opinion cannot be wrong.
Originally posted by sdack View Post
I suspect that some of you have forgotten about the bigger picture and what the true purposes of a CPU and GPU are. Just because AI software currently finds use in graphics doesn't mean such units should only exist in GPUs.
Originally posted by sdack View Post
The next generation of compression algorithms, such as cmix, likewise uses AI to achieve higher compression ratios than ever. When you compress files 10-20 years from now, you don't want to need a GPU for it. You want the CPU, as in "Central Processing Unit", to do it.
Originally posted by sdack View Post
The reason why we see AI units in GPUs first is because of applications such as image recognition, which deal with graphics.
Originally posted by sdack View Post
It doesn't mean AI units should only exist in GPUs.
I honestly don't know if you're just playing devil's advocate or assuming that it must make sense because Intel is doing it. Either way, you lack the foundation on which to build the case you're trying to make. If you want to take it on faith that this is a good move by Intel, that's your opinion and you're entitled to it.
Comment
-
Originally posted by Weasel View Post
Who said it's just for training? Can be used for executing the AI network as well and applying it on data.
In practice, people use integer arithmetic for inferencing, because it's good enough, and VNNI already has integer support. Where people use BFloat16, in actual practice, is for training. And on the rare occasions you really do need floating point for inferencing, that need isn't enough to justify BFloat16 over just using float32.
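To make the integer-inference point concrete, here is a minimal sketch of the int8-quantized dot product that instructions like VNNI accelerate. All scales and weights below are invented for illustration; real frameworks calibrate them from data.

```python
import numpy as np

def quantize(x, scale):
    # Map float values onto the signed 8-bit grid used for inference.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

weights = np.array([0.5, -0.25, 0.75], dtype=np.float32)
acts = np.array([0.1, 0.2, -0.3], dtype=np.float32)

w_scale, a_scale = 0.01, 0.01          # illustrative quantization scales
w_q = quantize(weights, w_scale)
a_q = quantize(acts, a_scale)

# Multiply in int8, accumulate in int32 (as VNNI-style hardware does),
# then apply the combined scale once at the end.
dot_q = int(np.sum(w_q.astype(np.int32) * a_q.astype(np.int32))) * (w_scale * a_scale)
dot_f = float(np.dot(weights, acts))
print(dot_q, dot_f)
```

For well-conditioned weights and activations, the quantized result lands very close to the float32 one, which is why integer inference is "good enough" in practice.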
And I forgot to mention: the funny thing about your example of using it for audio processing is that IEEE 754 half-precision is actually much better suited to that application.
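The precision gap is easy to demonstrate: bfloat16 keeps float32's 8-bit exponent but only 7 mantissa bits, while IEEE 754 half-precision has 10 mantissa bits, and normalized audio samples sit comfortably inside half-precision's range. A rough sketch, emulating bfloat16 by truncating a float32 (the sample value is arbitrary):

```python
import numpy as np

def to_bfloat16(x):
    # Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding
    # (keeps the 8-bit exponent, truncates the mantissa to 7 bits).
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def to_float16(x):
    # Round-trip through IEEE 754 half-precision (10 mantissa bits).
    return np.array(x, dtype=np.float16).astype(np.float32)

sample = np.float32(0.123456)   # a typical normalized audio sample
err_bf16 = abs(float(sample) - float(to_bfloat16(sample)))
err_fp16 = abs(float(sample) - float(to_float16(sample)))
print(err_bf16, err_fp16)
```

On values in the [-1, 1] audio range, the half-precision error comes out roughly an order of magnitude smaller than the bfloat16 error, which is the point being made above.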
Originally posted by Weasel View Post
Throughput is irrelevant when you want real-time response.
So, either your network is so trivial that you'd be fine with the existing float32 instructions, or you'd find that a GPU is fast enough to make up for the latency of using it.
Originally posted by Weasel View Post
Check the specs:
- Stereo convolution reverb
- Zero latency CPU edition (or up to 8192 samples for lower CPU usage)
- Low latency GPU edition (512 – 8192 samples delay)
- Wav, Aiff and Flac support
- ADSHR envelopes
- 50%-150% stretch
- Double oversampled IR-EQ
- Pre-delay up to 500ms
- GPU Edition utilises the power of NVIDIA CUDA
You're clearly stretching to make your case. Try dropping the confirmation bias and running the actual numbers to see whether there's a case, here. I'll bet you have no clue what types of networks people use for audio processing, how big they are, or what types of layers they actually use.
Comment
-
So, it seems like all current GPUs should be able to do this.
Comment
-
Originally posted by coder View PostI'm struggling ...
Know this: AI problems aren't just massively parallel problems. They can also be massively serial, or both at once. Deep learning is all about feeding a neural network a lot of training data. You cannot load all the data into the neural network at once; it needs to be processed in sequence, because the sequence is the important part of the learning. When a neural network detects shapes in an image, it is effectively tracing the outline as a path and learning the shape as a sequence of directions that guides it around the shape. It can then detect the shape again and again, even when the shape gets rotated, stretched or slightly distorted.
But when a neural network is used to detect shapes in a million images, it doesn't need to learn anything about the sequence of the images, so the detection can run in parallel. I hope you can see how this is a problem of serialisation as well as parallelisation. If not, I suggest you find a better source and learn something about AIs. Or think about how one teacher can easily teach a class of 100 students all at the same time (aka "in parallel"), while one student cannot learn from 100 teachers simultaneously without some serious multi-tasking.
Last edited by sdack; 16 April 2019, 02:18 PM.
Comment
-
Originally posted by coder View Post
Please don't be so obtuse. If you think about it for even a moment, it should be clear that I'm saying the time saved in processing will tend to outweigh the communications and context-switching overhead, because convolutional neural networks are really expensive and the AVX-512 performance of a single core can never touch what a GPU can deliver. And once you go multi-core, you might as well use a GPU.
Let's put it simply: you have X amount of time in which you must finish processing. This X is very low for real-time workloads, especially per plugin (because plugins can be chained, and their costs add up sequentially). Speeding up computation within X lets you do more processing in the same time frame.
Doubling X (more latency) to do 10 times more computation will not help whatsoever, because X is the target. I couldn't care less if you get 10 or 1 million times more work done by increasing X; it's a no-go. I'm not sure what's so hard to understand?
The network is trivial, sure, but that is exactly why this addition matters: it can almost double the throughput of the network while keeping the same X. Hence the whole point of needing it in the first place: you can make the network "less trivial" while keeping the same latency.
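The budget X described above falls directly out of the audio buffer size: the whole plugin chain must finish before the next buffer is due. A quick back-of-the-envelope, assuming a 48 kHz sample rate (the rate is an assumption for illustration; the specs quoted later in the thread use 512-8192 samples):

```python
def buffer_latency_ms(frames, sample_rate=48000):
    # Time budget X: the plugin chain must complete within this window,
    # or the audio stream underruns.
    return frames / sample_rate * 1000.0

for frames in (128, 512, 8192):
    print(f"{frames:5d} frames -> {buffer_latency_ms(frames):7.2f} ms budget")
```

At 512 frames the budget is about 10.7 ms, which is where the "10 ms" figure in this exchange comes from; at low-latency settings like 128 frames it shrinks to under 3 ms, and every plugin in the chain shares it.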
Originally posted by coder View PostThanks for illustrating my earlier point, beautifully! That page references a GPU from 2007!
If you can't, then well, I rest my case. Perhaps because it just doesn't work, people don't really retry failed experiments. If you do find one, however, then it will be interesting to look it up, so please share.
Comment
-
Originally posted by sdack View Post
The one part where you don't have any issues is where you believe that all the Intel engineers have it wrong and you'd know better.
Originally posted by sdack View Post
But when Intel implements it, then you should at least have some doubt about your own position ...
I think even without BFloat16, AVX-512 has proven to be a bad move. The market segmentation they've had to do (i.e. with all the different subsets) has made it even worse! I agree with AMD - AVX2 is probably as far as it makes sense to go, in a CPU. For anything wider, just use a GPU.
Originally posted by sdack View Post
I suggest you find a better source and learn something about AIs.
Comment
-
Originally posted by Weasel View Post
You can't outweigh something that is literally the limiting factor, except by making it negative, and that's impossible.
Originally posted by Weasel View Post
I'm not sure what's so hard to understand?
It boils down to this: If the work is trivial, then obviously keep it on the CPU. If the work is expensive, then obviously send it to the GPU. The issue is that there's really a very narrow range of tasks where making the CPU a bit faster lets it win vs. the GPU. Ideally, you'd keep data (and processing) in one place or the other, which is why frameworks like GStreamer have features to avoid data movement when multiple GPU-based elements are chained in sequence.
For your example, you picked an expensive task, where a single CPU core would typically take longer than the latency + compute time of the GPU. That was your mistake. That, and citing audio processing as a reason to use BFloat16 for inference. Typical errors that happen when you don't really know what you're talking about.
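The "narrow range" argument can be framed as a toy cost model. Every constant below is invented for illustration, not a measurement: the CPU has no dispatch overhead but modest throughput, while the GPU pays a fixed dispatch latency and then chews through work far faster.

```python
def cpu_time_us(work, us_per_unit=1.0):
    # Hypothetical CPU cost: starts immediately, processes slowly.
    return work * us_per_unit

def gpu_time_us(work, dispatch_us=200.0, us_per_unit=0.02):
    # Hypothetical GPU cost: fixed dispatch overhead, then much faster.
    return dispatch_us + work * us_per_unit

# Crossover: the work size at which the GPU's throughput advantage
# has amortized its dispatch cost (solve work*1.0 = 200 + work*0.02).
crossover = 200.0 / (1.0 - 0.02)
print(f"GPU wins for workloads above ~{crossover:.0f} units")
```

Below the crossover the CPU wins outright; well above it the GPU wins by orders of magnitude. Making the CPU somewhat faster (e.g. with BFloat16) only shifts the crossover point, which is why the band of tasks it actually rescues is narrow.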
Originally posted by Weasel View Post
It's the best I could find. Sure, maybe I'm wrong, so feel free to find a newer one? It would be very appreciated.
In recent generations, AMD and Nvidia have both added preemption support. From the NVIDIA Tesla P100 whitepaper:
Compute Preemption is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architectures. Compute Preemption prevents long-running applications from either monopolizing the system (preventing other applications from running) or timing out.
AMD has similar. But, let's stick with Nvidia, since it concerns your example. So, what's a best-case scenario? Back in 2014 (so, pre-Pascal), this paper found a purpose-built GPU-exclusive processing framework could deliver latencies as low as 8 microseconds: https://www.ncbi.nlm.nih.gov/pubmed/24784666
Obviously, that's a bit hard-core. So, let's consider more typical CUDA usage scenarios. These guys have some special sauce for achieving < 10 microsecond dispatch latencies, but cite "hundreds of microseconds" as their baseline for comparison, which is a lot better than my example of 1 ms (above): https://www.concurrent-rt.com/products/gpu-workbench/
What you're forgetting is that people use GPUs as the de facto platform for vision processing in countless robotics and even self-driving applications - all realtime. And on the PC, good VR requires well under 20 ms of latency from sensors to eyeballs, including all of the processing and transport of the rendered image to the HMD. Realtime applications and latency reductions have received a lot of focus in recent generations of GPUs. So that's why you can't just take some product specs from 10+ years ago as gospel, as you seem so willing to do.
Comment
-
Originally posted by coder View Post
This is the problem with trying to have an abstract argument about something concrete. My point is that a network that current-gen AVX-512 can inference in 10 ms, and that somehow gets a perfect speedup to 5 ms with BFloat16, will still be worse off than a GPU that can do it in 0.1 ms with 1 ms of communication overhead.
As for the rest, I really don't claim to understand much of how GPUs work; that's why I wanted examples (as in, real-time audio plugins using it). I'm not saying you're wrong, but tens or hundreds of microseconds is kinda high, you know that? That's more expensive than a context switch, and context switches are already a huge limiting factor for this purpose - it's one of JACK's shortcomings. So it's not gonna cut it.
Ardour's dev mentioned as much in the comments of some blog post, but I completely forgot where it was, so I'm not going to look for it, sorry. That's the reason he revamped Ardour to chain plugins internally instead of having each of them use JACK. The latency, when chained, was just too great, and it didn't scale.
Last edited by Weasel; 20 April 2019, 07:42 AM.
Comment
-
Originally posted by Weasel View Post
It's not necessarily that it will do it in 5 ms; the point is that it can process twice as much (more complex effects) within the same time slot (10 ms in your example, though that's a bit on the high side).
Originally posted by Weasel View Post
Tens or hundreds of microseconds is kinda high, you know that?
Originally posted by Weasel View Post
That's more expensive than a context switch and those are already a huge limiting factor for this purpose, which is one of JACK's shortcomings.
Originally posted by Weasel View Post
Which is the reason he revamped Ardour to be able to chain plugins normally instead of having each of them using JACK. The latency, when chained, was just too great, and it didn't scale.
Comment