Radeon "GFX90A" Added To LLVM As Next-Gen CDNA With Full-Rate FP64


  • #21
    Originally posted by coder View Post
    The problem with this scenario is that the workstation market just isn't big enough to support custom silicon. It has to piggyback off one of their main markets:
    • Consumer gaming
    • AI inferencing
    • AI training
    • Cloud compute
    • HPC
    • Crypto
Now AMD and Nvidia each have basically 2 product families that they use to target those market segments. So far, Nvidia has kept display and 3D hardware in all their ASICs, but their new crypto chips potentially represent a 3rd product family that might grow legs and take on CDNA in the HPC space (just wild speculation, here).

What AMD could be imagining is that a workstation user needing large amounts of fp64 or AI-training horsepower would pair a conventional workstation 3D card with an Instinct MI100 or whatever this new chip ends up in. You don't need a graphics/compute accelerator chip that does it all. You can have 2, in the same way that 3D graphics evolved into its own dedicated processor outside of the CPU.
I'd take your economies-of-scale argument as a point in favor of a Vega-like strategy (at least for the present), as two distinct architectures need much more investment. Nvidia historically reserved the highest-end die of each generation for professionals, but it did not differ too much from their gaming cards at the architecture level (they "only" added FP64 and more execution units, and used a different memory interface and capacity). Volta was somewhat special in that they brought that architecture to professionals only - with Ampere we see the gaming chips first and GA100 later on, which falls back into the traditional pattern.

But your two-cards scenario reminds me of their chiplet approach and brings me to an idea (which was brought up years ago in one of AdoredTV's famous speculation videos) of a combined APU + CDNA/RDNA solution, possibly with HBM on-package (the dGPU would come with its own for more capacity). Of course, having one or several GPU chiplets on the CPU would also enable more flexibility, even separate RDNA- and CDNA-based APUs which you could couple with either RDNA- or CDNA-based dGPUs. I can envision a low-level interconnect and protocols like CXL enabling this. Maybe an RDNA APU could even work together with a CDNA dGPU (or vice versa) and be used for gaming, as one of them would have the display and fixed-function hardware and could make use of the compute power of the other. It would also be more economical for AMD, as they would only need to design one CPU chiplet and two GPU chiplets to conquer all of these markets with mixed, matched and scaled-up products based on them. That would indeed be very clever, if they can get it to work. The software support could be a big problem though, at least on the consumer side, as games would need to know how to optimally balance their workloads between all involved devices, hopefully in a transparent way without explicit support from developers. But maybe games would need to make more use of compute instead of graphics? What do you think of these thoughts?

P.S.: Just as a side note, as far as I understand it Nvidia's new crypto line uses salvaged desktop chips (from two separate generations, even?). That seems to me like a separation on the marketing level only (with the usual driver/vBIOS/chip-level cuts), nothing as dramatic as a distinct architecture.
    Last edited by ms178; 20 February 2021, 07:58 PM.



    • #22
      Originally posted by ms178 View Post
I'd take your economies-of-scale argument as a point in favor of a Vega-like strategy (at least for the present), as two distinct architectures need much more investment. Nvidia historically reserved the highest-end die of each generation for professionals, but it did not differ too much from their gaming cards at the architecture level (they "only" added FP64 and more execution units, and used a different memory interface and capacity). Volta was somewhat special in that they brought that architecture to professionals only - with Ampere we see the gaming chips first and GA100 later on, which falls back into the traditional pattern.
      Your talk of "professionals" is far too broad. I mapped out various markets (with workstations being implicit) where virtually everything outside of consumer/gaming could be seen as "professional". Yet, these different markets have different needs that require different hardware features and emphasis.

As for history, their GP100 was only sold as a Tesla and Quadro product -- they didn't even offer it as a Titan. GV100 was sold as all three. Neither was marketed to consumers, unless you consider a $3k graphics card a consumer product. And, from what I'm reading, the GA100 seems to lack a display controller, meaning we shouldn't expect it to show up in any form of graphics card. That said, I found a block diagram of its SM that shows texture units, though that could just be for use as a cloud GPU or even for imaging-based AI or HPC use cases.

I think you're also reading too much into the generation names, whereas the real distinction lies in the numeric portion of the names. For instance, the GP100 is a very different beast than any of the other Pascal chips, and the same is even more true of the GA100 vs. the rest of the Ampere lineup. This 100-series seems clearly aimed at HPC and AI training workloads, which explains the HBM2, all the fp64, and the tensor cores.

      Originally posted by ms178 View Post
      a combined APU + CDNA/RDNA solution, possibly with HBM on-package (the dGPU would come with its own for more capacity).
Power and cooling are some issues that spring to mind. The only hope of having enough memory bandwidth would be using HBM2, and then the package would be huge as well. A big socket requiring a beefy VRM is going to make motherboards more expensive, even for those not intending to use these features. And now you have a super-expensive component that you have to replace as a single unit if any one portion breaks or you simply want to upgrade just one portion.

      I just don't see what problem it's solving, but it has some very real downsides. Gaming benchmarks have already shown that consumer GPUs aren't even close to maxing PCIe 4.0, so why do you need even faster connectivity? Tighter integration is a cost savings only for bespoke solutions, like game consoles.

I think we already have a good thing going with PCs, as they are. If you need more GPU horsepower than an APU can deliver, just plug whichever card you like into a fast PCIe 4.0 x16 slot, and there you go. Workstation CPUs and motherboards offer more than enough lanes for this.

      Originally posted by ms178 View Post
      P.S: Just as a sidenote, as far as I understand it Nvidia's new Crypto line uses salvaged desktop chips (from two seperate generations even?), that seems to me as a seperation on the marketing level only (with the usual driver/vbios/chip level cuts), but nothing as dramatic as a distinct architecture.
      I just thought I'd mention those, since I'd read they lack a display controller block. However, I'll refrain from further discussion of them until they get released and all the facts become known.
      Last edited by coder; 20 February 2021, 09:55 PM.



      • #23
        If AMD could keep building GPUs capable of targeting all workloads, I think they would. However, they're facing real competitive pressure from Nvidia, Intel, and at least half a dozen AI chip makers. And when you add up all the fixed function hardware needed for 3D graphics and display (especially accounting for the new ray tracing and mesh shaders), that's a lot of "dead" silicon that doesn't benefit their AI and compute customers. So, it only makes sense to get rid of that stuff in a compute-focused solution that's going to be much too expensive for nearly all consumers, anyhow.

As for another Vega-style card, I don't see why they'd do it. RDNA is better for graphics, so the market for such a thing would be just those folks who need both a fast (but not the fastest!) graphics card and more compute power/capabilities than the RDNA cards can offer (which is still a lot). And that's going to be a relatively small market, many of whom can simply afford to plug in one of each card, anyhow.

        Now, I get why people want both. That's why I bought a Radeon VII, in fact. If you missed the boat on those, you can still buy a Radeon Pro VII. They're out of stock (like everything else), but still in production. They cost more than 2x as much, but at least you get full-rate fp64 and PCIe 4.0.



        • #24
Also, I really do hope they can find a way to make hardware with BFloat16 and their cool Matrix Cores accessible to hobbyists and open-source developers. Whether it's by selling partly-disabled MI100 cards at a steep discount or making a smaller CDNA chip, they should appreciate how important a vibrant open-source ecosystem is for their strategy, and right now it's on life support.

          Taking CDNA out of consumers' hands is going to make it as irrelevant as POWER CPUs, or even more esoteric and niche tech.



          • #25
            Originally posted by coder View Post
Also, I really do hope they can find a way to make hardware with BFloat16 and their cool Matrix Cores accessible to hobbyists and open-source developers. Whether it's by selling partly-disabled MI100 cards at a steep discount or making a smaller CDNA chip, they should appreciate how important a vibrant open-source ecosystem is for their strategy, and right now it's on life support.
            Taking CDNA out of consumers' hands is going to make it as irrelevant as POWER CPUs, or even more esoteric and niche tech.
I think this news is actually about hardware for private customers.
The reason I think they will not sell it only to supercomputers and big server companies is the blockchain mining boom.

Right now Nvidia sells a lot of 3070/3080/3090 cards to miners, and AMD's 6800/6900 are not a good fit for miners.
CDNA "Arcturus" "GFX90A" is in fact AMD's mining hardware...

AMD would be very stupid not to sell this to miners on the open market.

All your fear about CDNA becoming the new niche hardware like IBM POWER is, from my point of view, misplaced.

There is only one reason you cannot buy it yet: right now AMD mostly produces PlayStation 5/Xbox chips (around 80% of production, I think).

In 2-3 months this PlayStation 5 hype will be over and then you can buy your CDNA "Arcturus" "GFX90A" hardware.

But I think you can expect a price of 2000€ or more...



            • #26
              Originally posted by coder View Post
              Your talk of "professionals" is far too broad. I mapped out various markets (with workstations being implicit) where virtually everything outside of consumer/gaming could be seen as "professional". Yet, these different markets have different needs that require different hardware features and emphasis.
Right, but my point was that Nvidia managed to serve all these markets with one architecture and a specialized chip at the top which was still based on the same architectural foundation as their gaming lineup and could build on top of that work in terms of software and driver support. AMD now tries to tackle these markets with two distinct architectures which might diverge even more over time, and I fear that we consumers with some compute workload needs are worse off with that decision as you'd need to pay extra for such a CDNA card. If there were a low-cost CDNA alternative as you outlined, that would address my fears though. Only time will tell if we get such an alternative; I can only think of an APU with a low CU-count based on CDNA.

              Originally posted by coder View Post
Power and cooling are some issues that spring to mind. The only hope of having enough memory bandwidth would be using HBM2, and then the package would be huge as well. A big socket requiring a beefy VRM is going to make motherboards more expensive, even for those not intending to use these features. And now you have a super-expensive component that you have to replace as a single unit if any one portion breaks or you simply want to upgrade just one portion.

              I just don't see what problem it's solving, but it has some very real downsides. Gaming benchmarks have already shown that consumer GPUs aren't even close to maxing PCIe 4.0, so why do you need even faster connectivity? Tighter integration is a cost savings only for bespoke solutions, like game consoles.
We have seen the benefits with Fujitsu's A64FX, and that is not a future product but has been in action since last year. Granted, it leans heavily on SVE and has no GPU on the package, but it uses HBM alone as its memory subsystem and is highly scalable and efficient for the workloads it was designed for. They also solved the cooling problem. For a broader set of workloads you'd probably still need a DDR memory interface for higher capacity, but HBM on the package could act as a new memory tier or cache to better feed the CPU and GPU portions. Also, PCIe 4.0 and 5.0 are still a limiting factor for many GPGPU tasks, not so much from the bandwidth perspective but, more importantly, because of the latency penalty. Today, offloading to the GPU via PCIe only pays off for workloads that run long enough to amortize the round trip over the PCIe bus to the GPU and back to the CPU. I see cache-coherent interconnects and CXL on the horizon helping to make offloading to the GPU (or any other attached device) worthwhile for a broader set of workloads. Single-source programming with SYCL 2020 and oneAPI will certainly leverage all of these technologies moving forward.
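
As a rough illustration of that amortization argument, here is a back-of-envelope sketch of my own (every constant is a made-up assumption, not a measured value): offload only wins when transfer time plus dispatch latency plus kernel time beats the CPU-only time.

```cpp
// Back-of-envelope model of the PCIe offload trade-off described above.
// All constants are illustrative assumptions, not measurements.
#include <cstdio>

int main() {
    const double bytes          = 64e6;   // payload shipped each way (64 MB)
    const double link_bw        = 25e9;   // assumed effective PCIe 4.0 x16 bytes/s
    const double launch_latency = 20e-6;  // assumed per-dispatch round-trip cost
    const double gpu_time       = 2e-3;   // kernel execution time on the GPU
    const double cpu_time       = 10e-3;  // the same work done on the CPU

    const double offload_time = 2.0 * bytes / link_bw + launch_latency + gpu_time;
    std::printf("CPU: %.2f ms, offload: %.2f ms -> %s\n",
                cpu_time * 1e3, offload_time * 1e3,
                offload_time < cpu_time ? "offload wins" : "stay on the CPU");
    return 0;
}
```

With these made-up numbers the offload still wins (about 7 ms vs. 10 ms), but halve the CPU-side time and the PCIe round trip no longer pays off - exactly the class of shorter-running workload that lower-latency, cache-coherent links would open up.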

I hope this made it clearer why I think that a low-latency and high-bandwidth memory interface would be beneficial even to CPUs. There are ideas to solve this issue differently, with a new attachable memory interface which would make it possible to upgrade memory independently from the CPU. IBM was much in favor of this alternative approach, and if you could attach GDDR6X on a stick to a motherboard and had an APU which could make use of it, that would serve the same purpose as HBM in the end, as a new tier between CPU caches and main memory.



              • #27
                Originally posted by cb88 View Post

                There is absolutely no point in that, when RDNA can run Vulkan and OpenCL compute, and HIP on Linux.

The sole purpose of CDNA is to go after HPC density and TCO. And unlike the article implies, CDNA != GCN any more than RDNA is GCN.

Anything you can run on CDNA is going to also run on RDNA... just not quite as fast, as long as it is portable code. Nobody should be writing assembly for GPUs at this point... unless you really need an HPC application to scale, in which case you already have access to that.
That's not really true.
The sad reality: ROCm has hand-optimized kernels written in assembly for every architecture (like the different GCN and CDNA variants) and for the different tasks, like a Winograd kernel for AI. Yes, LLVM does support RDNA as an ISA, but the code generated by LLVM will be orders of magnitude slower on RDNA than a hand-optimized kernel on CDNA.
Sadly, as of now RDNA/RDNA2 is not supported in ROCm; building it for RDNA manually leads to kernels which return bad results / don't work.
                This is going to change in the near future though.

All this leads to a very high entry barrier to HPC with AMD hardware, as you currently have to buy a GCN or CDNA card to even use their toolstack. With CUDA you can use a cheap (ok, currently not) consumer card from 5 years ago and it just works and accelerates your workload pretty significantly.
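
To make that entry barrier concrete, here is a minimal HIP sketch of my own (not taken from ROCm's docs): the kernel source is portable, but you build it for a specific GPU ISA, and the --offload-arch values below are assumptions you would adjust to your card and ROCm version (older releases used --amdgpu-target instead).

```cpp
// Minimal HIP SAXPY: one portable kernel source, compiled per GPU ISA, e.g.
//   hipcc --offload-arch=gfx908  saxpy.cpp   # CDNA (MI100)
//   hipcc --offload-arch=gfx1030 saxpy.cpp   # RDNA2, if your ROCm build supports it
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    float *dx = nullptr, *dy = nullptr;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256), dim3(256), 0, 0,
                       n, 2.0f, dx, dy);

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("y[0] = %f\n", hy[0]);  // expect 4.0
    hipFree(dx);
    hipFree(dy);
    return 0;
}
```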

FPGA Accelerators: They might come to consumer hardware sooner or later, maybe embedded in a CPU or a GPU to accelerate AI workloads in games or image processing.
I guess we will see them as a replacement for HPC accelerators first, like what Xilinx is doing with their Alveo Accelerators.
Toolstack / Software Support will be key here.
But we might see some C -> RTL compilers then, so you can write against a library like ROCm or CUDA and have your code accelerated by an FPGA. They will probably contain some hand-optimized algorithms as well, like they do today!



                • #28
                  Originally posted by Qaridarium View Post
                  CDNA "Arcturus" "GFX90A" in in fact AMDs mining hardware...
                  If they just wanted to build a fast mining chip, then there'd be no need for fp64 or their Matrix Cores. Leaving that stuff out could make it substantially cheaper. In fact, I wonder if you even need any floating-point for mining.
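
For what it's worth, the inner mixing step of Ethash (the algorithm the current GPU mining boom revolves around) looks like pure 32-bit integer math to me - a multiply-xor "FNV" variant plus Keccak hashing - so a sketch like this needs no floating point at all (written from my reading of the public Ethash spec, so treat the details as an assumption):

```cpp
// Ethash-style FNV mixing step, as I read the public Ethash spec:
// fnv(x, y) = (x * FNV_PRIME) ^ y on 32-bit words -- integer ops only.
#include <cstdint>
#include <cstdio>

constexpr std::uint32_t FNV_PRIME = 0x01000193u;

constexpr std::uint32_t fnv(std::uint32_t x, std::uint32_t y) {
    return (x * FNV_PRIME) ^ y;  // unsigned overflow wraps mod 2^32, as intended
}

int main() {
    std::printf("fnv(1, 2) = 0x%08x\n", static_cast<unsigned>(fnv(1u, 2u)));
    return 0;
}
```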

                  Originally posted by Qaridarium View Post
AMD would be very stupid not to sell this to miners on the open market.
Well, you can buy an MI100 today, but it's priced to compete with Nvidia's A100, which is to say a lot:

                  https://www.dell.com/en-us/work/shop...ic-video-cards

                  Originally posted by Qaridarium View Post
But I think you can expect a price of 2000€ or more...
Yeah, it's a very big chip (750 mm²?), meaning that the potential for large price drops is probably limited to a small number of partially defective dies they might be able to salvage as low-spec products for academia and hobbyists.



                  • #29
                    Originally posted by ms178 View Post
Right, but my point was that Nvidia managed to serve all these markets with one architecture and a specialized chip at the top which was still based on the same architectural foundation as their gaming lineup and could build on top of that work in terms of software and driver support.
Not really. Take the GP100 vs. the rest of the Pascal series, for instance. They have a different SIMD width, for crying out loud, just like CDNA vs. RDNA. And the GP100 supported packed fp16, while the rest of the Pascal family supported packed int8 and int16 dot products. They also have different numbers of special function units and different amounts of shared memory per SM. And Pascal wasn't a one-off in this regard -- they did it again, in Ampere.

                    You're really glossing over a lot, here. The fact is that while the difference between CDNA and RDNA is bigger than Nvidia's 100-series vs. the rest of each generation, both are playing the same sort of game. The 100-series is different right down to the structure and composition of its SMs. It's not as if they just dialed up or down the numbers of high-level structures, swapped the GDDR memory controllers for HBM2, and called it a day.

                    Originally posted by ms178 View Post
                    I fear that we consumers with some compute workload needs are worse off with that decision as you'd need to pay extra for such a CDNA card.
You always had to pay a lot for large amounts of fp64. And the cheapest card Nvidia ever sold with HBM was the $3000 Titan V. So, the idea of having to pay a lot for big compute capabilities is nothing new. Plus, AMD did continue to expand Rapid Packed Math in RDNA, and offers on the order of 1 TFLOPS of fp64 performance in the RX 6800/6900. So, unless you have serious need for AI training or big compute workloads, their consumer offerings don't entirely leave you out in the cold.

                    Originally posted by ms178 View Post
                    I can only think of an APU with a low CU-count based on CDNA.
                    Except they're GCN -- not CDNA. They don't have the features that make CDNA interesting, and are so small that even the next larger RDNA dGPU will give you better performance on pure-compute workloads.

                    Originally posted by ms178 View Post
                    We have seen the benefits with Fujitsu's A64FX, and that is not a future product but in action since last year.
                    That falls into the category of a bespoke solution, since they're used on custom boards, in a custom chassis. You can't use that to meaningfully generalize about the practicality of super-sized PC APUs.

                    Originally posted by ms178 View Post
Today, offloading to the GPU via PCIe only pays off for workloads that run long enough to amortize the round trip over the PCIe bus to the GPU and back to the CPU.
                    AMD beat the drum of heterogeneous compute for a long, long time. Yet, even in APUs, it seems to be fairly uncommon. I'd be all for seeing it become more widespread, but I think we need to see more software embrace it, before chip makers are going to take the sort of bold steps that increase platform costs to further optimize it.

                    Originally posted by ms178 View Post
I see cache-coherent interconnects and CXL on the horizon helping to make offloading to the GPU (or any other attached device) worthwhile for a broader set of workloads.
                    That's system-level and exactly the sort of thing that reduces the need for the kind of tight integration you described. Also, I'm aware of only server processors having announced support for it. I'm not sure it's headed for mainstream computing.

                    Originally posted by ms178 View Post
if you could attach GDDR6X on a stick to a motherboard and had an APU which could make use of it, that would serve the same purpose as HBM in the end, as a new tier between CPU caches and main memory.
                    You can't put GDDR memory on a DIMM. The timing and electrical specs are way too tight for that, which is how they've managed to squeeze out the extra performance from it. I doubt you can even use it from a socketed CPU.



                    • #30
                      Originally posted by Spacefish View Post
LLVM does support RDNA as an ISA, but the code generated by LLVM will be orders of magnitude slower on RDNA than a hand-optimized kernel on CDNA.
                      So, what do graphics shader compilers use? I thought the two backends in use were LLVM and ACO?

                      Originally posted by Spacefish View Post
With CUDA you can use a cheap (ok, currently not) consumer card from 5 years ago and it just works and accelerates your workload pretty significantly.
                      Yes and no. To my point about AMD making Matrix Cores and BFloat16 hardware accessible to the unwashed masses, the analogous Nvidia features would be their tensor cores, which they only added to consumer products a couple years ago. But Nvidia is better about this point. They enable nearly all of their GPUs with some amount of compute capabilities, so students can buy one GPU for both gaming and course work.

                      Originally posted by Spacefish View Post
FPGA Accelerators: They might come to consumer hardware sooner or later, maybe embedded in a CPU or a GPU to accelerate AI workloads in games or image processing.
                      I haven't seen benchmarks where they can beat Nvidia's tensor cores. So, it would have to be for something GPUs do poorly, like spiking neural networks. Those still aren't very common, and are likely to be addressed by purpose-built AI engines, by the time they are.

                      Originally posted by Spacefish View Post
I guess we will see them as a replacement for HPC accelerators first, like what Xilinx is doing with their Alveo Accelerators.
                      I'm not familiar with those, but where I see FPGAs making sense in general-purpose computing is in cases where you need the flexibility of a programmable solution with the low-latency of hard-wired logic. So, maybe for things like software-defined networking or high-frequency trading.

                      Originally posted by Spacefish View Post
Toolstack / Software Support will be key here.
But we might see some C -> RTL compilers then, so you can write against a library like ROCm or CUDA and have your code accelerated by an FPGA.
                      They've had OpenCL support for like a decade. You can buy an Intel FPGA card and use it with oneAPI for AI acceleration, today.
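
As a rough idea of what that looks like in practice, here is a SYCL 2020 sketch of my own (not Intel sample code); whether an "accelerator" device actually shows up depends entirely on the installed oneAPI/OpenCL runtime and board support, so the fallback below is just defensive:

```cpp
// SYCL 2020 sketch: prefer an OpenCL "accelerator" device (e.g. an FPGA card),
// fall back to whatever the default device is, then run a trivial kernel.
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    sycl::queue q;
    try {
        q = sycl::queue{sycl::accelerator_selector_v};
    } catch (const sycl::exception&) {
        q = sycl::queue{sycl::default_selector_v};  // no accelerator present
    }
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    std::vector<int> data(1024, 1);
    {
        sycl::buffer<int> buf{data};
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc{buf, h, sycl::read_write};
            h.parallel_for(sycl::range<1>{data.size()},
                           [=](sycl::id<1> i) { acc[i] *= 2; });
        });
    }  // buffer destructor waits and copies the results back into `data`
    std::cout << "data[0] = " << data[0] << "\n";  // expect 2
    return 0;
}
```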

