Intel Prepares GCC Compiler Support For BFloat16


  • #11
    Originally posted by coder View Post
    Yeah, this is a weird move. Intel is making dGPUs, enlarging their iGPUs, and spent $Billions on Nervana and Movidius. Why they feel they also need to beef up their AVX is mystifying, since it still won't touch GPUs or dedicated ASICs in perf/Watt and certainly won't beat them on ops/$. The only way this makes any business sense is as a stop-gap measure, for the < 1 year between when this ships and when they can really ramp up those products.

    You could look at the conversion instructions as improving interoperability with their GPUs and Nervana ASICs, but it's a little silly given that one of the main benefits of BFloat16 is how trivially it can be converted via existing instructions.
    One word: latency. Dedicated / specialized execution units will always incur latency.

    Suppose you want to apply some real-time effect based on an AI algorithm to some audio stream. People tried this with GPUs; the latency was horrible, especially when you add it up (chain multiple of them) and want a real-time response.



    • #12
      Actually, the fact that I don’t expect it to happen overnight is why it makes sense to me to have a separate execution unit that can evolve as needed. I’m really expecting this to take a while; hell, I don’t even know if the industry is on the right track. Currently the hardware in most processors is focused on “Machine Learning”, which is a branch of AI research.

      The other problem I see is that making an FP or vector unit more complex isn’t an avenue to better thermal performance. Intel used to lead in this respect, but it seems like they have given up on being power competitive.
      Originally posted by sdack View Post
      They're actually being smart. Not everyone interested in AI programming has specialized hardware readily available to them. Making the standard PCs of tomorrow capable of handling such code more efficiently allows more developers to gain first experience with AI software without additional cost. The opposite would be to do nothing about it. It then gives more relevance to the next generation of PCs, and as such means more money in Intel's pocket, but it also benefits AI programming itself. Every bit helps.

      And before AIs become as advanced as you think they will, mankind first needs to work hard to get there. It will take many little steps and won't happen overnight with a revolution or a sudden uprising of AIs.



      • #13
        Originally posted by sdack View Post
        But it is. All the suggestions you're making are more complex and only focus on achieving maximum power, but it is complexity which hampers development.
        Complexity also hampers hardware design and kills thermal performance. I really don’t see how using existing GPUs makes anything more complex. Adding a dedicated execution unit also makes life easier for everybody: easier hardware design, and you would know precisely where your ML code will run.
        Adding a simple instruction will work as a stepping stone into AI programming.
        The industry is well past that point.
        You can then already get far more efficient solutions, and Intel would only be competing with those. Not every problem requires a big AI solution; some will only need a little one. With a single CPU instruction one can then solve many little AI problems and leave the bigger ones to dedicated hardware units.
        That makes no sense at all.



        • #14
          Originally posted by Weasel View Post
          One word: latency. Dedicated / specialized execution units will always incur latency.
          I’m not sure where that idea comes from. In a true heterogeneous environment there is little latency, and any that does exist gets masked by the far higher performance of the dedicated hardware.
          Suppose you want to apply some real-time effect based on an AI algorithm to some audio stream. People tried this with GPUs; the latency was horrible, especially when you add it up (chain multiple of them) and want a real-time response.
          Is this true on a modern SoC with an integrated GPU that has access to all memory? I don’t really think so. Of course anybody can find an exception to the rule, but iGPUs have come a long way in the last few years, with operating system support firming up. You might be correct looking at the past, but today’s hardware and operating systems are different beasts. The ability to leverage iGPUs for all sorts of tasks outside of graphics has improved significantly.



          • #15
            Originally posted by sdack View Post
            Adding a simple instruction will work as a stepping stone into AI programming.
            To you, it might be a couple "simple" instructions, but you're not seeing the hardware or die area behind it. If you consider such things as "free", then sure, why not?

            Originally posted by sdack View Post
            one can then solve many little AI problems and leave the bigger ones to dedicated hardware units.
            They already have plenty of hardware for handling "little" problems. Instead of BFloat16, you always have the option of running with IEEE 754 32-bit floats.
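            And converting between the two formats is about as cheap as it gets: a bfloat16 is just the top 16 bits of an IEEE 754 binary32 (same sign and exponent, truncated mantissa), so a round trip needs little more than a shift. A rough C sketch of what I mean, with made-up helper names and NaN handling glossed over:

                #include <stdint.h>
                #include <string.h>

                /* bfloat16 reuses the top 16 bits of an IEEE 754 binary32:
                   1 sign bit, 8 exponent bits, 7 mantissa bits. */
                static uint16_t f32_to_bf16(float f)
                {
                    uint32_t bits;
                    memcpy(&bits, &f, sizeof bits);        /* type-pun without undefined behavior */
                    bits += 0x7FFFu + ((bits >> 16) & 1u); /* round to nearest even; NaNs would need a special case */
                    return (uint16_t)(bits >> 16);
                }

                static float bf16_to_f32(uint16_t h)
                {
                    uint32_t bits = (uint32_t)h << 16;     /* widen by zero-filling the low mantissa bits */
                    float f;
                    memcpy(&f, &bits, sizeof f);
                    return f;
                }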

            What really blows a hole in the "little AI problems" argument is that BFloat16 is aimed squarely at training, where there's no such thing as a "little AI problem".

            I'll tell you my biggest issue with it: I see it as just another way to break compatibility with AMD. They're probably betting that AMD won't add BFloat16 to their CPUs, and if their code generation tools & libraries have a CPU fast path that depends on that feature, then they can show an artificially big performance gap vs. AMD without doing the sort of CPUID model check that got them in trouble previously. All for something that doesn't really belong on a CPU in the first place, something of which Intel is clearly all too aware.
            Last edited by coder; 15 April 2019, 03:29 AM.



            • #16
              Originally posted by Weasel View Post
              One word: latency. Dedicated / specialized execution units will always incur latency.
              WTF? You clearly have no idea how slow training is. It's not something where latency is normally a concern. BFloat16 is primarily aimed at training workloads.

              Originally posted by Weasel View Post
              Suppose you want to apply some real-time effect based on an AI algorithm to some audio stream.
              That's inferencing - where the industry is moving away from floats and towards integer arithmetic. If you did want to keep it on the CPU and wanted to use floats, you could use their existing IEEE 754 32-bit or 16-bit support. More importantly, the throughput advantage of running it on a GPU or ASIC would probably more than outweigh the additional communication latency.
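              To make "moving towards integer arithmetic" concrete: for inference, weights and activations typically get quantized to int8 with a scale factor, the hot loops run entirely in integer units, and the result is rescaled once at the end. A toy C sketch (symmetric quantization; the function names and the way the scales get picked are made up for illustration):

                  #include <stdint.h>
                  #include <math.h>

                  /* Toy symmetric int8 quantization: q = round(x / scale), clamped to [-127, 127].
                     Real frameworks calibrate the scale per tensor or per channel. */
                  static int8_t quantize_int8(float x, float scale)
                  {
                      long q = lroundf(x / scale);
                      if (q >  127) q =  127;
                      if (q < -127) q = -127;
                      return (int8_t)q;
                  }

                  /* The inner loop is pure integer math, accumulated in int32,
                     converted back to float only once at the end. */
                  static float dot_int8(const int8_t *a, const int8_t *b, int n,
                                        float scale_a, float scale_b)
                  {
                      int32_t acc = 0;
                      for (int i = 0; i < n; i++)
                          acc += (int32_t)a[i] * (int32_t)b[i];
                      return (float)acc * scale_a * scale_b;
                  }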

              Originally posted by Weasel View Post
              People tried this with GPUs, the latency was horrible, especially when you add it up (chain multiple of them) and want a real-time response.
              Source? I'm guessing this was many years ago, when software support for GPU compute was vastly more primitive.



              • #17
                Originally posted by wizard69 View Post
                The other problem I see is that making an FP or vector unit more complex isn’t an avenue to better thermal performance. Intel used to lead in this respect, but it seems like they have given up on being power competitive.
                Yes. But they didn't "give up". It's more like the syndrome where "when all you've got is a hammer, everything starts to look like a nail." AVX-512 is one big hammer. They seem not to have learned their lesson with Xeon Phi, probably blaming the small, Atom-derived cores for its commercial woes.



                • #18
                  Originally posted by wizard69 View Post
                  iGPUs have come a long way in the last few years, with operating system support firming up. You might be correct looking at the past, but today’s hardware and operating systems are different beasts. The ability to leverage iGPUs for all sorts of tasks outside of graphics has improved significantly.
                  For a long time, GPUs could not be preempted. That is no longer true. GPU preemption resolves the primary issue with latency and QoS.



                  • #19
                    Originally posted by coder View Post
                    They already have plenty of hardware for handling "little" problems.
                    That's just what you think. Intel believes there is room for more, and so do I. There is always room for more, and it doesn't have to be big, but it also needs simple methods to give it a wider reach and to promote AI software. If you don't get that, then that's too bad, but it's coming, whether you want it, need it, or like it or not. You're just stating an opinion. Others do have a need for it.

                    I suspect that some of you have forgotten about the bigger picture and what the true purposes of a CPU and a GPU are. Just because AI software currently finds use in graphics doesn't mean such units should only exist in GPUs. The next generation of compression algorithms, such as cmix, also uses AI to achieve higher compression ratios than ever before. When you compress files 10-20 years from now, you won't want to require a GPU for it. You want the CPU, as in "Central Processing Unit", to do it.

                    What a CPU should and shouldn't do isn't cast in stone; it's a fluid definition, which adapts to new tasks with every generation. This is such a step, and we will see more of it. To think this doesn't belong in a CPU is the same as to think we won't need a central processing unit in the future and that everything should be done by dedicated units. The fact is that a CPU is already composed of many dedicated units. Floating-point arithmetic was once a separate chip, if anyone can remember. Back then people did not think we would need it for every task. But here we are, and integer as well as floating-point arithmetic has become part of almost every piece of software.

                    The reason we see AI units in GPUs first is applications such as image recognition, which deal with graphics. There it of course makes sense to place the units directly in the GPU and to have a lot of them working in parallel. It doesn't mean AI units should only exist in GPUs. It is the type of application which defines where these units need to be placed.
                    Last edited by sdack; 15 April 2019, 07:56 AM.



                    • #20
                      Originally posted by coder View Post
                      WTF? You clearly have no idea how slow training is. It's not something where latency is normally a concern. BFloat16 is primarily aimed at training workloads.
                      Who said it's just for training? It can be used for executing the AI network as well, applying it to data.

                      Originally posted by coder View Post
                      That's inferencing - where the industry is moving away from floats and towards integer arithmetic. If you did want to keep it on the CPU and wanted to use floats, you could use their existing IEEE 754 32-bit or 16-bit support. More importantly, the throughput advantage of running it on a GPU or ASIC would probably more than outweigh the additional communication latency.
                      Throughput is irrelevant when you want a real-time response.

                      Originally posted by coder View Post
                      Source? I'm guessing this was many years ago, when software support for GPU compute was vastly more primitive.
                      https://www.liquidsonics.com/software/reverberate-le/

                      Check the specs:
                      • Stereo convolution reverb
                      • Zero latency CPU edition (or up to 8192 samples for lower CPU usage)
                      • Low latency GPU edition (512 – 8192 samples delay)
                      • Wav, Aiff and Flac support
                      • ADSHR envelopes
                      • 50%-150% stretch
                      • Double oversampled IR-EQ
                      • Pre-delay up to 500ms
                      • GPU Edition utilises the power of NVIDIA CUDA
                      Keep in mind this is just one plugin in the chain.
