ARMv8.6-A Brings BFloat16, GEMM & Other Enhancements


  • ARMv8.6-A Brings BFloat16, GEMM & Other Enhancements

    Phoronix: ARMv8.6-A Brings BFloat16, GEMM & Other Enhancements

    Arm has outlined their architecture enhancements being introduced in ARMv8.6-A as their 2019 ARMv8 architecture update...

    http://www.phoronix.com/scan.php?pag...8.6-A-Detailed

  • #2
    This is one area where RISC-V will crucify ARM - early adoption of new ISA features. Not because ARM is necessarily doing anything wrong, but because the RISC-V implementers are early adopters in general. ARM's slower movement suits enterprise customers, who make purchasing decisions based on proven benefits and stable standards; those same customers don't want to pay for unproven features, and that slows adoption. If ARM wants to compete on delivery dates, they're going to need to discount licensing for early adopters.



    • #3
      Originally posted by linuxgeex View Post
      This is one area where RISC-V will crucify ARM - early adoption of new ISA features. Not because ARM is necessarily doing anything wrong, but because the RISC-V implementers are early adopters in general. ARM's slower movement suits enterprise customers, who make purchasing decisions based on proven benefits and stable standards; those same customers don't want to pay for unproven features, and that slows adoption. If ARM wants to compete on delivery dates, they're going to need to discount licensing for early adopters.
      I hope that there is going to be good competition between arm/risc-v/x86 - it is certainly going to be interesting



      • #4
        This is a sideshow. I don't see people doing substantial amounts of deep learning or inferencing on their ARM cores, even today. And, by the time these extensions are out in the field, using general-purpose cores for AI will be like using software-based renderers for interactive graphics in the age of GPUs.



        • #5
          Originally posted by coder View Post
          This is a sideshow. I don't see people doing substantial amounts of deep learning or inferencing on their ARM cores, even today. And, by the time these extensions are out in the field, using general-purpose cores for AI will be like using software-based renderers for interactive graphics in the age of GPUs.
          It will be for low-power SOCs where they figure they can get away with just adding the ISA features instead of a dedicated AI core, in circumstances where AI is a relatively small part of the workload. It also has multimedia applications for RAW image processing and high-quality FloatRGB compositing, which could make this an interesting feature for low-end digital camera cores, and maybe even display panel / TV cores 5-10 years from now. It'll also speed up HTML rendering for machines with no GPU but with HDR displays, so maybe digital signage and really low-end tablets, when HDR10 displays get to be as cheap as current 6-bit TN displays. Maybe even HDR10 watches :-)
          Last edited by linuxgeex; 09-29-2019, 02:34 AM.



          • #6
            Originally posted by linuxgeex View Post
            It will be for low-power SOCs where they figure they can get away with just adding the ISA features instead of a dedicated AI core, in circumstances where AI is a relatively small part of the workload. It also has multimedia applications for RAW image processing and high-quality FloatRGB compositing, which could make this an interesting feature for low-end digital camera cores, and maybe even display panel / TV cores 5-10 years from now. It'll also speed up HTML rendering for machines with no GPU but with HDR displays, so maybe digital signage and really low-end tablets, when HDR10 displays get to be as cheap as current 6-bit TN displays. Maybe even HDR10 watches :-)
            Wow, that sounds so impressive I had to go find more detail, to see what I might've missed.

            According to https://community.arm.com/developer/...ents-armv8-6-a the following numeric additions were made:
            • Matrix multiply instructions for BFloat16 and signed or unsigned 8-bit integers are added to both SVE and Neon. SVE additionally supports single- and double-precision floating-point matrix multiplies. (A scalar model of the 8-bit dot-product primitive behind these follows the list.)
            • Armv8.6-A adds instructions to accelerate certain computations using the BF16 floating-point number format.
            • A data gathering hint, to express situations where write merging is expected not to be performance optimal.
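
            For reference, the integer side of those matrix-multiply additions boils down to widening 8-bit dot products accumulated into 32-bit lanes, with the architected instructions repeating that primitive across a small tile of SIMD accumulators. A rough scalar model of just the primitive (illustrative only; the real lane and tile layout is fixed by the instructions):

              #include <stdint.h>
              #include <stdio.h>

              /* Scalar model of a widening signed 8-bit dot product that
               * accumulates into a 32-bit result -- the building block the
               * new int8 matrix instructions repeat across a tile of SIMD
               * accumulators.  Only the arithmetic is modelled here. */
              static int32_t dot_s8(int32_t acc, const int8_t a[8], const int8_t b[8])
              {
                  for (int i = 0; i < 8; i++)
                      acc += (int32_t)a[i] * (int32_t)b[i];
                  return acc;
              }

              int main(void)
              {
                  const int8_t a[8] = { 1, -2, 3, -4, 5, -6, 7, -8 };
                  const int8_t b[8] = { 8,  7, 6,  5, 4,  3, 2,  1 };
                  printf("%d\n", (int)dot_s8(0, a, b));   /* prints 0 for this input */
                  return 0;
              }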

            And https://community.arm.com/developer/...n-armv8_2d00_a goes into more detail on the specific BFloat16 additions, which consist of the following instructions (sketched in code after the list):
            • BFDOT, a [1×2] × [2×1] dot product of BF16 elements, accumulating into each IEEE-FP32 element within a SIMD result.
            • BFMMLA, effectively comprising two BFDOT operations, which perform a [2×4] × [4×2] matrix multiplication of BF16 elements, accumulating into each [2×2] matrix of IEEE-FP32 elements within a SIMD result.
            • BFMLAL, a simple product of the even or odd BF16 elements, accumulating into each IEEE-FP32 element within a SIMD result.
            • BFCVT, converts IEEE-FP32 elements or scalar values to BF16 format.
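
            To make the semantics concrete, here's a scalar C model of the BF16 building blocks: a narrowing conversion that keeps the top 16 bits of an IEEE-FP32 value with round-to-nearest (the architected rounding and denormal rules differ in the details, so treat this purely as a sketch of the arithmetic), the widening back to FP32, and the two-element dot product that BFDOT performs per FP32 lane. BFMMLA arranges the same primitive into the [2×4] × [4×2] multiply described above.

              #include <stdint.h>
              #include <stdio.h>
              #include <string.h>

              typedef uint16_t bf16;   /* the top 16 bits of an IEEE-754 binary32 */

              /* Narrow FP32 to BF16, round to nearest even (NaNs and the exact
               * architected rounding behaviour are not modelled). */
              static bf16 fp32_to_bf16(float f)
              {
                  uint32_t u;
                  memcpy(&u, &f, sizeof u);
                  u += 0x7FFFu + ((u >> 16) & 1);
                  return (bf16)(u >> 16);
              }

              /* Widening BF16 -> FP32 just places the bits in the high half. */
              static float bf16_to_fp32(bf16 h)
              {
                  uint32_t u = (uint32_t)h << 16;
                  float f;
                  memcpy(&f, &u, sizeof f);
                  return f;
              }

              /* One BFDOT lane: acc += a[0]*b[0] + a[1]*b[1], accumulated in FP32.
               * The hardware's internal rounding details differ; this is only the
               * shape of the operation. */
              static float bfdot(float acc, const bf16 a[2], const bf16 b[2])
              {
                  return acc + bf16_to_fp32(a[0]) * bf16_to_fp32(b[0])
                             + bf16_to_fp32(a[1]) * bf16_to_fp32(b[1]);
              }

              int main(void)
              {
                  bf16 a[2] = { fp32_to_bf16(1.5f), fp32_to_bf16(-2.0f) };
                  bf16 b[2] = { fp32_to_bf16(4.0f), fp32_to_bf16(0.25f) };
                  printf("dot = %f\n", bfdot(0.0f, a, b));   /* 1.5*4 + (-2)*0.25 = 5.5 */
                  return 0;
              }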

            So, it seems the BFloat16 functionality is really limited, and not a fully generalized extension of NEON & SVE to include BFloat16. It's just as well, because BFloat16 lacks the precision to do most of the things that integers can't. Specifically, regarding 10-bit HDR, I'd remind you that BFloat16 has only an 8-bit mantissa, meaning it can't represent a 10-bit range at full precision. Map it to any range you like, and your epsilon will be too large, for part of it.
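
            That epsilon problem is easy to demonstrate: with 8 bits of significand, the representable BF16 values between 512 and 1024 are spaced 4 apart, so neighbouring 10-bit codes collapse onto the same value. A quick self-contained check:

              #include <stdint.h>
              #include <stdio.h>
              #include <string.h>

              /* Round an FP32 value to BF16 precision (keep the top 16 bits,
               * round to nearest even) and widen it back again. */
              static float bf16_round(float f)
              {
                  uint32_t u;
                  memcpy(&u, &f, sizeof u);
                  u += 0x7FFFu + ((u >> 16) & 1);
                  u &= 0xFFFF0000u;
                  memcpy(&f, &u, sizeof f);
                  return f;
              }

              int main(void)
              {
                  int distinct = 0;
                  float prev = -1.0f;
                  for (int code = 0; code < 1024; code++) {   /* all 10-bit codes */
                      float r = bf16_round((float)code);
                      if (r != prev) { distinct++; prev = r; }
                  }
                  /* Only about half of the 1024 codes map to distinct BF16 values;
                   * above 512 the spacing is 4, e.g. 1021 rounds to 1020. */
                  printf("distinct BF16 values: %d of 1024\n", distinct);
                  printf("1021 -> %.1f\n", bf16_round(1021.0f));
                  return 0;
              }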

            And, if we specifically take the IoT use case, even that is a little weird. Where BFloat16 tends to see the most use is in training - not inferencing. That's why they added the 8-bit equivalents. But, you're not going to train on an IoT device, or even a cell phone. To me, BFloat16 was really aimed at cloud/server-based training - there, it's too little, too late. By a lot.

            P.S. I'm all for hints to optimize performance of cache memory subsystems. So, that's certainly welcome, even though it's probably aimed at a similarly narrow use case.



            • #7
              Originally posted by coder View Post
              It's just as well, because BFloat16 lacks the precision to do most of the things that integers can't. Specifically, regarding 10-bit HDR, I'd remind you that BFloat16 has only an 8-bit mantissa, meaning it can't represent a 10-bit range at full precision. Map it to any range you like, and your epsilon will be too large, for part of it.
              You're correct that BFloat16 can't hold UINT10 with full accuracy, but it can still do a tremendously better job than a UInt8 filter chain can.

              For example, when compositing 12-bit-per-channel RGBA you will be left holding the 8 most significant bits when you're done, even if the inputs never used the top 4 bits of their 12-bit range. When converting down to 8 bits while preserving highlights, you'd have on average 3.5 bits left using UInt8 math, and on average 7.5 bits left with BFloat16. That's a pretty significant difference in image quality. Also, you can perform tonemapping on the BFloat16 results; you can't on UInt8 results.
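
              To put a number on it, here's a toy version of that pipeline: scale a dark 12-bit code down to leave headroom for highlights, store it, then scale it back up for display. Plain float stands in for BFloat16 here (BF16 would round a little more coarsely than this), and the names are made up for the example; the point is only how the two storage formats behave:

                #include <stdint.h>
                #include <stdio.h>
                #include <math.h>

                int main(void)
                {
                    /* Neighbouring dark 12-bit codes, scaled down 16x to leave
                     * headroom, stored, then scaled back up.  The UInt8 path
                     * quantises at the storage step and collapses them all;
                     * the float path keeps them distinct. */
                    for (int code = 400; code <= 404; code++) {
                        double in = code / 4095.0;

                        float  fstore = (float)(in / 16.0);                   /* float pipeline */
                        double fback  = fstore * 16.0;

                        uint8_t ustore = (uint8_t)lround(in / 16.0 * 255.0);  /* UInt8 pipeline */
                        double  uback  = ustore / 255.0 * 16.0;

                        printf("code %d  float: %.5f   uint8: %.5f\n", code, fback, uback);
                    }
                    return 0;
                }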

              If the SOC is smart about how it uses BFloat16 it can manipulate it to produce the FRC-patterned framebuffers for the display, effectively providing true HDR10 with an 8-bit-mantissa data type in conjunction with an 8-bit hardware display, which is what all the HDR10 displays I've seen so far actually are. Those kinds of tricks have been going on since Gingerbread with the Samsung phone displays, which created a 6-bit gamut (spatially dithered to 8-bit after the fact) with a 4-bit display... you can be sure they haven't stopped now.
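
              For anyone who hasn't seen it spelled out, FRC is just dithering over time (and usually space as well): the two extra bits of a 10-bit level decide how many of every four frames show the next 8-bit step up. A bare-bones sketch of the temporal half, ignoring gamma and the rotating spatial patterns a real panel controller uses to hide flicker:

                #include <stdint.h>
                #include <stdio.h>

                /* Split a 10-bit level into an 8-bit base plus a 2-bit remainder;
                 * the remainder picks how many of every 4 frames show base+1. */
                static uint8_t frc_frame(uint16_t level10, int frame)
                {
                    uint8_t base = (uint8_t)(level10 >> 2);
                    int     rem  = level10 & 3;
                    static const uint8_t bump[4][4] = {
                        {0,0,0,0}, {1,0,0,0}, {1,0,1,0}, {1,1,1,0}
                    };
                    int up = bump[rem][frame & 3];
                    if (base == 255) up = 0;            /* don't overflow full white */
                    return (uint8_t)(base + up);
                }

                int main(void)
                {
                    uint16_t level = 513;               /* an arbitrary 10-bit grey */
                    int sum = 0;
                    for (int f = 0; f < 4; f++) {
                        int v = frc_frame(level, f);
                        sum += v;
                        printf("frame %d: %d\n", f, v);
                    }
                    printf("4-frame sum = %d (target %d)\n", sum, (int)level);
                    return 0;
                }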

              If you find a true 10-bit display, chances are it will be advertised as 12-bit HDR. That one will require something wider than BFloat16 to do compositing. IEEE 754-2008 half precision (binary16) has an 11-bit significand, so it would be up to that task with FRC.

              PS: a quick way to test the quality of a display's FRC implementation is to adjust the display gamma (on the monitor, not on the driving device). This pushes the FRC outside its designed ideal operating values, causing its flicker to cross the human Critical Flicker Fusion threshold at a higher sync rate than the manufacturer intended, and it may give users ocular migraines, headaches, dizziness, nausea, and even trigger seizures in those prone to epilepsy.

              So always adjust display gamma at your PC instead of your monitor. Or, if your monitor flickers, try adjusting its gamma to see if you can make the flicker go away, then adjust the gamma on the PC to get back a normal response curve. Sometimes those factory calibrations are far from ideal for reducing flicker. And of course you can reduce the ambient lighting so you can turn the monitor's brightness level down a bit too.

              At 20 lux and 72 Hz you shouldn't have annoying flicker even with a poorly calibrated IPS/PVA display. TN gamma changes with vertical deflection, so there's no ideal gamma setting to compensate for FRC and those panels will always flicker at a higher sync rate.

              See Critical Flicker Fusion frequency response vs ambient lux
              Last edited by linuxgeex; 10-18-2019, 04:48 PM.

