AMD CDX Bus Landing For Linux 6.4 To Interface Between APUs & FPGAs

  • AMD CDX Bus Landing For Linux 6.4 To Interface Between APUs & FPGAs

    Phoronix: AMD CDX Bus Landing For Linux 6.4 To Interface Between APUs & FPGAs

    Since last year AMD-Xilinx has been posting Linux patches for enabling CDX as a new bus between application processors (APUs) and FPGAs. The AMD CDX bus is now poised for introduction in the upcoming Linux 6.4 cycle...

  • #2
    Very interesting! This is something that was talked about more than 10 years ago with AMD's "Fusion" initiative, when they rolled out their first APUs and were touting HSA (Heterogeneous System Architecture).

    • #3
      Also curious. Is CDX an offshoot or extension of Infinity Fabric?

      • #4
        This probably has nothing to do with the x86 chips. The documentation describes how the bus works, and this is intended for Zynq chips with RPUs (intended for real time) and APUs (general purpose, running Linux). The RPU runs firmware and interacts with the FPGA (or rather the stuff you have implemented on the FPGA). The APU runs Linux and interacts with the RPU to enumerate and communicate with the peripherals. I guess the main point is being able to reprogram the FPGA and have Linux detect and utilize the changes. You can still do something similar with existing chips by reprogramming the FPGA with a different bitstream and manually loading the modules, but this bus sounds like it'll make that easier (rough sketch of the usual driver-binding pattern at the end of this post).

        The new CPUs do indeed seem to have the AI accelerators, but I still can't figure out where people got the idea that they have FPGAs. The whole XDNA accelerator is an IP (a hardware design in the form of HDL); you can implement it on an FPGA or put it in a chip. Xilinx sells some Alveo accelerator cards and they'll probably put some scaled-down ones in the processors, but putting an FPGA in a processor just so you can implement a piece of fixed-function hardware is a horrible idea, not to mention an engineering nightmare.
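
        For anyone curious what "detect and utilize the changes" means on the Linux side: a driver registers against a bus, and the bus core calls its probe() whenever a matching device is enumerated. I haven't dug into the actual CDX driver API, so the sketch below uses the ordinary platform bus as a stand-in, and the "xlnx,my-fpga-peripheral" compatible string is made up for illustration.

        /*
         * Minimal sketch of the Linux probe pattern, using the platform bus as
         * a stand-in for the new CDX bus (I haven't checked the real CDX API).
         * The compatible string and driver name are made up for illustration.
         */
        #include <linux/module.h>
        #include <linux/of.h>
        #include <linux/platform_device.h>

        static int my_periph_probe(struct platform_device *pdev)
        {
                /* The bus core calls this when a matching device is enumerated. */
                dev_info(&pdev->dev, "FPGA peripheral found, binding driver\n");
                /* map registers, request IRQs, expose the device to userspace... */
                return 0;
        }

        static const struct of_device_id my_periph_of_match[] = {
                { .compatible = "xlnx,my-fpga-peripheral" }, /* hypothetical */
                { /* sentinel */ }
        };
        MODULE_DEVICE_TABLE(of, my_periph_of_match);

        static struct platform_driver my_periph_driver = {
                .probe  = my_periph_probe,
                .driver = {
                        .name           = "my-fpga-peripheral",
                        .of_match_table = my_periph_of_match,
                },
        };
        module_platform_driver(my_periph_driver);

        MODULE_DESCRIPTION("Illustrative probe skeleton, not a real CDX driver");
        MODULE_LICENSE("GPL");

        With something like that registered for CDX devices, reloading a bitstream would just re-trigger enumeration and the right drivers would bind automatically, instead of you having to insmod things by hand.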

        • #5
          Originally posted by osw89 View Post
          I still can't figure out where people got the idea that they have FPGAs. The whole XDNA accelerator is an IP
          I think the assumption was that since XDNA came from Xilinx, it probably involves some amount of FPGA. However, that wasn't stated in anything I can find. This slide even goes so far as to compare & contrast it to an FPGA:
          The way AMD talks about it supporting a fixed number of concurrent realtime streams also summons notions of gate budgets and the like:
          Originally posted by osw89 View Post
          putting an FPGA in a processor just so you can implement a piece of fixed function hardware is a horrible idea not to mention an engineering nightmare.
          We know that FPGAs are great for energy-efficiency. And since AMD is targeting this at thin & light laptops, that also aligns with the notion that they're leveraging FPGA functionality, in some form or fashion, somewhere in XDNA. Not that you'd build up all the multipliers and adders from scratch -- there'd still be some special-purpose cores or fixed-function units for most of that -- but just if you had some spare gates to use for implementing custom layers, rather than relying on a slower software implementation running either on programmable DSP-like cores or the main CPU cores.

          • #6
            Just so we're clear, the advantage of using a dataflow architecture for AI is mainly so you can keep the weights on-chip rather than having to fetch them from DRAM for each inference.



            The obvious limitation is the size of the network: with most networks ranging from multiple megabytes to many gigabytes of weights, this is normally an approach you see in much larger-scale systems than a little IP block in an SoC (rough numbers at the end of this post).

            Dataflow processing also very much aligns with the most natural approach for utilizing FPGAs, so it's an area Xilinx will know well.
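
            To put rough numbers on that size limit: the parameter counts below are the commonly cited ballpark figures for a couple of well-known vision networks, and the SRAM budget is just my guess at what a small IP block in a laptop SoC might get, not anything AMD has published.

            #include <stdio.h>

            /* Back-of-the-envelope weight storage for keeping a model entirely
             * on-chip. Parameter counts are the usual ballpark figures; the SRAM
             * budget is an assumed number for a small SoC IP block. */
            int main(void)
            {
                const double mobilenet_v2_params = 3.4e6;  /* ~3.4 M parameters   */
                const double resnet50_params     = 25.6e6; /* ~25.6 M parameters  */
                const double bytes_per_weight    = 1.0;    /* INT8 quantized      */
                const double assumed_sram_mib    = 4.0;    /* guessed SRAM budget */

                printf("MobileNetV2 weights:  %.1f MiB\n",
                       mobilenet_v2_params * bytes_per_weight / (1024 * 1024));
                printf("ResNet-50 weights:    %.1f MiB\n",
                       resnet50_params * bytes_per_weight / (1024 * 1024));
                printf("Assumed on-chip SRAM: %.1f MiB\n", assumed_sram_mib);
                return 0;
            }

            So a MobileNet-class model might just about fit, but anything ResNet-sized or larger ends up streaming weights from DRAM anyway, which is why the fully-on-chip dataflow approach normally shows up in much bigger accelerators.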

            • #7
              Originally posted by coder View Post
              I think the assumption was that since XDNA came from Xilinx, that it probably involves some amount of FPGA. However, that wasn't stated in anything I can find. This slide even goes so far as to compare & contrast it to a FPGA:

              The way AMD talks about it supporting a fixed number of concurrent realtime streams also summons notions of gate budgets and the like:
              Yes, I have seen those slides as well. They are trying to say that it can be used in FPGAs as well as implemented in silicon (writing RTL for an FPGA and for an ASIC isn't entirely the same thing, after all), and their processors seem to have fixed-function accelerators. Wishful thinking, I suppose.


              Originally posted by coder View Post
              We know that FPGAs are great for energy-efficiency
              They can be more efficient than regular processors depending on what you are implementing, but they are never better than ASICs/dedicated, fixed-function silicon. An ASIC consumes less power, is faster, and will probably be cheaper, since an FPGA implementing the same function needs a larger area. I mean, anything you can get to work on an FPGA can also run much better on/as an ASIC. That's the cost of their flexibility: you need a clock grid instead of an optimized clock tree, you can't remove logic you don't need, and most importantly the wiring between logic elements needs to be a configurable grid instead of only what you need, which means spreading the logic out to make routing possible and is a big part of the whole area penalty. Funnily enough, the general purpose vs. fixed function argument you use when comparing CPUs and FPGAs also applies, to an extent (obviously not in the same ways, but there are some similarities), to FPGAs vs. ASICs.

              My point is they have no reason to put an FPGA in a CPU just to program it to act like a piece of fixed-function silicon. It'll consume more power, not run as fast, and need more area, increasing the per-chip cost and lowering the yield. Sure, from the point of view of an EE working with digital stuff it would be great; I could experiment with it easily and even program it with newer versions of the AI accelerator. The problem is AMD doesn't really gain much from that apart from pleasing a couple of people like me, so it's not financially worthwhile.

              BTW you're right that the whole layer thing is suitable for an FPGA, but it can also be done better on an ASIC. It's hard to draw conclusions from a marketing slide, but their AI engine architecture seems to consist of a network of VLIW vector processors that they call "AI engines": https://www.xilinx.com/products/tech...ai-engine.html Looking at the slide again, it looks like the AI engines in the accelerator can be assigned to individual layers to exploit the advantages you listed. Such an ability makes a lot of sense, because having your layers fixed in hardware would be limiting compared to being able to assign engines to a layer. With such an architecture you could easily squeeze these AI engines into the empty area in your chip, and it would be more efficient than putting an FPGA in there because it would utilize the area better. Also, if your architecture is flexible and lets you dedicate engines to layers, you don't really need the flexibility of an FPGA. They already use this in their Versal chips (SoCs with very high-end accelerators like the AI engines plus an FPGA portion), and the AI engines there are fixed-function. (There's a toy sketch of the engine-to-layer mapping at the end of this post.)

              Sorry for the long post; I have developed somewhat of a habit of going on a rant in FPGA-related articles here.
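
              One more thing, to make the "assign engines to a layer" idea concrete: below is a toy sketch of how a compiler/runtime might statically split a fixed pool of engines across layers in proportion to their compute. The layer workloads and engine count are invented numbers; this isn't based on any actual Xilinx tool flow.

              #include <stdio.h>

              /* Toy static partitioning of a pool of AI engines across network
               * layers, proportional to each layer's MAC count. Purely
               * illustrative; not based on any real Xilinx/AMD tooling. */
              #define NUM_LAYERS  4
              #define NUM_ENGINES 32

              int main(void)
              {
                  /* Hypothetical per-layer workload in millions of MACs. */
                  const double layer_macs[NUM_LAYERS] = { 120.0, 480.0, 240.0, 60.0 };
                  int assigned[NUM_LAYERS];
                  double total = 0.0;
                  int used = 0, busiest = 0;

                  for (int i = 0; i < NUM_LAYERS; i++)
                      total += layer_macs[i];

                  /* Give every layer at least one engine, split the rest by load. */
                  for (int i = 0; i < NUM_LAYERS; i++) {
                      assigned[i] = 1 + (int)((NUM_ENGINES - NUM_LAYERS) *
                                              layer_macs[i] / total);
                      used += assigned[i];
                      if (layer_macs[i] > layer_macs[busiest])
                          busiest = i;
                  }
                  /* Leftover engines from rounding down go to the busiest layer. */
                  assigned[busiest] += NUM_ENGINES - used;

                  for (int i = 0; i < NUM_LAYERS; i++)
                      printf("layer %d: %d engines\n", i, assigned[i]);
                  return 0;
              }

              Once the engines themselves are fixed-function-ish but the mapping is done in software like this, you get most of the flexibility people want from an FPGA without paying the FPGA area and power tax.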

              • #8
                Yikes, you'd think people here would understand an FPGA solution to fixed-function hardware is a no-brainer and a software developer's dream for encoding/decoding nearly any intense/realtime application. No more guessing if it has some stupid codec, or what hardware, or how good it is, or if it's new enough, or if it's bugged, or if it's right. Instead, you just compile the thing and upload it to your FFU. Efficiency? Doesn't matter at all, you're not going for efficiency, it's about flexibility. You could support thousands of codecs, and improve/change them at any time. ASICs can't do that, and saying that isn't a worthwhile endeavor is completely moronic at best.

                • #9
                  Originally posted by osw89 View Post
                  They can be more efficient than regular processors depending on what you are implementing but they are never better than ASICs/dedicated, fixed-function silicon.
                  Yes, I'm with you. I should've used a few more qualifiers. I was comparing logic simple enough to be entirely synthesized on an FPGA against code running on the sort of DSP cores that are featured in many AI accelerators. For convolution layers, these AI accelerators tend to have hard-wired tensor products, which indeed can bring programmable cores nearly on par with fixed-function ASICs. However, the sort of custom layers that need to be implemented in software are probably limited to using vector operations in practice (there's a toy sketch of that split at the end of this post).

                  Originally posted by osw89 View Post
                  My point is they have no reason to put an FPGA in a CPU to program it to basically act like a piece of fixed-function silicon. It'll consume more power, not run as fast, and need more area, increasing the per chip cost and lowering the yield.
                  It definitely makes sense for certain networking use-cases. AI could be another one, but only in conjunction with special-purpose accelerators, so that the FPGA is only taking the place of programmable cores, and only where you demand good energy-efficiency.

                  Originally posted by osw89 View Post
                  their AI engine architecture seems to consist of a network of VLIW vector processors that they call "AI engine" https://www.xilinx.com/products/tech...ai-engine.html Looking at the slide again it looks like certain AI engines in the accelerator are able to be partitioned into individual layers to exploit the advantages you listed.
                  TBH, I don't really see the benefit of using VLIW DSP cores over GPU cores, other than that GPUs are traditionally weak at integer arithmetic. More recent AMD GPUs seem to have largely addressed that weakness, and now even feature tensor-product acceleration.

                  I guess the main downside of using GPU cores is that scaling them up carries more baggage (i.e. all the fp32 ALUs + potentially having to balance them with more ROPs, TMUs, RT cores, etc.) that simple AI cores don't need. Still, I wonder how much this XDNA exercise is a matter of AMD trying to justify its Xilinx acquisition vs. actually offering more value than devoting the same die area to scaling up its iGPU.

                  Originally posted by osw89 View Post
                  Sorry for the long post, I have developed somewhat of a habit of going on a rant in FPGA related articles here.
                  Your thoughtful and detailed remarks are definitely appreciated.
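
                  Anyway, to illustrate the split I mean between the hard-wired tensor path and the software fallback: a convolution gets handed to the fixed MAC block as one fat operation, while a custom layer runs as ordinary code on the programmable cores. The tensor_mac_tile() routine below is a made-up stand-in for whatever the hard-wired unit actually exposes, emulated in plain C so the sketch runs anywhere.

                  #include <stdio.h>

                  #define TILE 4 /* toy tile size; real tensor units use bigger tiles */

                  /* Stand-in for a hard-wired tensor/MAC unit: one "instruction"
                   * doing a whole TILE x TILE multiply-accumulate. On real hardware
                   * this is fixed-function; here it's emulated so the sketch runs.
                   * Entirely hypothetical, not a real intrinsic. */
                  static void tensor_mac_tile(int a[TILE][TILE], int b[TILE][TILE],
                                              int acc[TILE][TILE])
                  {
                      for (int i = 0; i < TILE; i++)
                          for (int j = 0; j < TILE; j++)
                              for (int k = 0; k < TILE; k++)
                                  acc[i][j] += a[i][k] * b[k][j];
                  }

                  /* A "custom" layer the hard-wired unit knows nothing about: it has
                   * to run as ordinary vector/scalar code (here, a simple clamp). */
                  static void custom_clamp_layer(int x[TILE][TILE], int lo, int hi)
                  {
                      for (int i = 0; i < TILE; i++)
                          for (int j = 0; j < TILE; j++) {
                              if (x[i][j] < lo) x[i][j] = lo;
                              if (x[i][j] > hi) x[i][j] = hi;
                          }
                  }

                  int main(void)
                  {
                      int a[TILE][TILE], b[TILE][TILE], acc[TILE][TILE] = { { 0 } };

                      for (int i = 0; i < TILE; i++)
                          for (int j = 0; j < TILE; j++) {
                              a[i][j] = i + j;
                              b[i][j] = (i == j); /* identity, so acc ends up equal to a */
                          }

                      tensor_mac_tile(a, b, acc);        /* "conv" work: one hard-wired op */
                      custom_clamp_layer(acc, 0, 4);     /* custom layer: plain software   */

                      for (int i = 0; i < TILE; i++)
                          printf("%d %d %d %d\n", acc[i][0], acc[i][1], acc[i][2], acc[i][3]);
                      return 0;
                  }

                  The efficiency gap we're discussing is basically the gap between those two functions: one is a single hard-wired operation per tile, the other is just code.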

                  • #10
                    Originally posted by abott View Post
                    Yikes, you'd think people here would understand an FPGA solution to fixed-function hardware is a no-brainer and a software developer's dream for encoding/decoding nearly any intense/realtime application. No more guessing if it has some stupid codec, or what hardware, or how good it is, or if it's new enough, or if it's bugged, or if it's right. Instead, you just compile the thing and upload it to your FFU.
                    Could be. It depends on the size of what you're trying to accelerate vs. the size of the FPGA you're using. If the FPGA isn't big enough, its performance could suck. You need a lot more die area to implement a given piece of logic on an FPGA than on an ASIC.

                    Originally posted by abott View Post
                    You could support thousands of codecs, and improve/change them at any time. ASIC's can't do that, and saying that isn't a completely worthwhile endeavor is completely moronic at best.
                    Most ASICs feature more programmability than you might think. Tensilica does very good business licensing their Xtensa DSP cores for integration into various ASICs. A lot of ASICs are merely application-specific processors, with key operations hardwired for better efficiency. That limits the opportunities for hardware bugs, and it means you can still work around many of them in the firmware.
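
                    As a toy example of what "work around in the firmware" looks like: the firmware checks the silicon revision (or an errata list) and routes around a known-bad hardwired path. The revision numbers and function names below are invented purely for illustration.

                    #include <stdint.h>
                    #include <stdio.h>

                    /* Toy firmware-style dispatch: use the hardwired block unless this
                     * silicon revision is on the errata list, in which case fall back
                     * to a slower software path. Everything here is made up. */
                    #define BUGGY_REV 0x02u /* pretend rev 2 has a broken hardwired op */

                    static uint32_t read_chip_rev(void)
                    {
                        return 0x02u; /* would come from a chip ID register on real HW */
                    }

                    static uint32_t hw_transform(uint32_t x)
                    {
                        return x * 3u + 1u; /* stands in for the hardwired operation */
                    }

                    static uint32_t sw_transform(uint32_t x)
                    {
                        uint32_t r = 1u; /* slower firmware reimplementation of the same math */
                        for (int i = 0; i < 3; i++)
                            r += x;
                        return r;
                    }

                    int main(void)
                    {
                        uint32_t rev = read_chip_rev();
                        uint32_t (*transform)(uint32_t) =
                            (rev == BUGGY_REV) ? sw_transform : hw_transform;

                        printf("rev 0x%02x -> transform(7) = %u\n",
                               (unsigned)rev, (unsigned)transform(7));
                        return 0;
                    }

                    That escape hatch is exactly what you lose with a truly fixed pipeline, and exactly what you keep when the "ASIC" is really a specialized processor running firmware.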
