ARMv8.6-A Brings BFloat16, GEMM & Other Enhancements

  • ARMv8.6-A Brings BFloat16, GEMM & Other Enhancements

    Phoronix: ARMv8.6-A Brings BFloat16, GEMM & Other Enhancements

    Arm has outlined their architecture enhancements being introduced in ARMv8.6-A as their 2019 ARMv8 architecture update...


  • #2
    This is one area where RISC-V will crucify ARM - early adoption of new ISA features. Not because ARM is necessarily doing anything wrong, but because the RISC-V implementers are early adopters in general. ARM's slower movement is suited to enterprise customers, who make purchasing decisions based on proven benefits and stable standards; those same customers don't want to pay for unproven features, and that slows adoption. If ARM wants to compete on delivery dates, they're going to need to discount licensing for early adopters.



    • #3
      Originally posted by linuxgeex View Post
      This is one area where RISC-V will crucify ARM - early adoption of new ISA features. Not because ARM is necessarily doing anything wrong, but because the RISC-V implementers are early adopters in general. ARM's slower movement is suited to enterprise customers, who make purchasing decisions based on proven benefits and stable standards; those same customers don't want to pay for unproven features, and that slows adoption. If ARM wants to compete on delivery dates, they're going to need to discount licensing for early adopters.
      I hope that there is going to be good competition between arm/risc-v/x86 - it is certainly going to be interesting



      • #4
        This is a sideshow. I don't see people doing substantial amounts of deep learning or inferencing on their ARM cores, even today. And, by the time these extensions are out in the field, using general-purpose cores for AI will be like using software-based renderers for interactive graphics in the age of GPUs.



        • #5
          Originally posted by coder View Post
          This is a sideshow. I don't see people doing substantial amounts of deep learning or inferencing on their ARM cores, even today. And, by the time these extensions are out in the field, using general-purpose cores for AI will be like using software-based renderers for interactive graphics in the age of GPUs.
          It will be for low-power SoCs where they figure they can get away with just adding the ISA features instead of a dedicated AI core, in circumstances where AI is a relatively small part of the workload. It also has multimedia applications for RAW image processing and high-quality FloatRGB compositing, which could make this an interesting feature for low-end digital camera cores, and maybe even display panel / TV cores 5-10 years from now. It'll also speed up HTML rendering for machines with no GPU but with HDR displays, so maybe digital signage and really low-end tablets, once HDR10 displays get to be as cheap as current 6-bit TN displays. Maybe even HDR10 watches :-)
          Last edited by linuxgeex; 29 September 2019, 02:34 AM.



          • #6
            Originally posted by linuxgeex View Post
            It will be for low-power SoCs where they figure they can get away with just adding the ISA features instead of a dedicated AI core, in circumstances where AI is a relatively small part of the workload. It also has multimedia applications for RAW image processing and high-quality FloatRGB compositing, which could make this an interesting feature for low-end digital camera cores, and maybe even display panel / TV cores 5-10 years from now. It'll also speed up HTML rendering for machines with no GPU but with HDR displays, so maybe digital signage and really low-end tablets, once HDR10 displays get to be as cheap as current 6-bit TN displays. Maybe even HDR10 watches :-)
            Wow, that sounds so impressive I had to go find more detail, to see what I might've missed.

            According to https://community.arm.com/developer/...ents-armv8-6-a the following numeric additions were made:
            • Matrix multiply instructions for BFloat16 and signed or unsigned 8-bit integers are added to both SVE and Neon. SVE additionally supports single- and double-precision floating-point matrix multiplies.
            • Armv8.6-A adds instructions to accelerate certain computations using the BF16 floating-point number format.
            • A data gathering hint, to express situations where write merging is not expected to be beneficial for performance.

            And https://community.arm.com/developer/...n-armv8_2d00_a goes into more detail on the specific BFloat16 additions, which consist of the following instructions (a plain-C reference sketch follows the list):
            • BFDOT, a [1×2] × [2×1] dot product of BF16 elements, accumulating into each IEEE-FP32 element within a SIMD result.
            • BFMMLA, effectively two BFDOT operations, performing a [2×4] × [4×2] matrix multiplication of BF16 elements and accumulating into each [2×2] matrix of IEEE-FP32 elements within a SIMD result.
            • BFMLAL, a simple product of the even or odd BF16 elements, accumulating into each IEEE-FP32 element within a SIMD result.
            • BFCVT, converts IEEE-FP32 elements or scalar values to BF16 format.
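
            To make those shapes concrete, here's a minimal scalar reference sketch in plain C of what a single BFDOT and BFMMLA step computes. This is my own model, not ARM's code or the actual Neon/SVE intrinsics; it treats BF16 as simply the top 16 bits of an IEEE-FP32 value, and the helper names are made up for illustration.

            ```c
            #include <stdint.h>
            #include <string.h>

            /* Widen a BF16 value (stored in a uint16_t) to IEEE-FP32 by placing it
             * in the top 16 bits of the float's bit pattern. */
            static float bf16_to_f32(uint16_t b)
            {
                uint32_t bits = (uint32_t)b << 16;
                float f;
                memcpy(&f, &bits, sizeof f);
                return f;
            }

            /* BFDOT-style step: [1x2] x [2x1] dot product of BF16 elements,
             * accumulated into one FP32 lane of the result. */
            static float bfdot_lane(float acc, const uint16_t a[2], const uint16_t b[2])
            {
                return acc + bf16_to_f32(a[0]) * bf16_to_f32(b[0])
                           + bf16_to_f32(a[1]) * bf16_to_f32(b[1]);
            }

            /* BFMMLA-style step: [2x4] x [4x2] matrix multiply of BF16 elements,
             * accumulated into a 2x2 block of FP32 lanes, built out of BFDOT steps. */
            static void bfmmla_block(float acc[2][2],
                                     const uint16_t a[2][4], const uint16_t b[4][2])
            {
                for (int i = 0; i < 2; i++)
                    for (int j = 0; j < 2; j++)
                        for (int k = 0; k < 4; k += 2) {
                            const uint16_t row[2] = { a[i][k], a[i][k + 1] };
                            const uint16_t col[2] = { b[k][j], b[k + 1][j] };
                            acc[i][j] = bfdot_lane(acc[i][j], row, col);
                        }
            }
            ```

            The real instructions do this for every group of lanes in the vector at once, always accumulating into FP32 lanes, which is why the destinations above are FP32, with BFCVT as the only way back down to BF16.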

            So, it seems the BFloat16 functionality is really limited, and not a fully generalized extension of NEON & SVE to include BFloat16. It's just as well, because BFloat16 lacks the precision to do most of the things that integers can't. Specifically, regarding 10-bit HDR, I'd remind you that BFloat16 has only an 8-bit mantissa, meaning it can't represent a 10-bit range at full precision. Map it to any range you like, and your epsilon will be too large, for part of it.
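
            To put a rough number on that epsilon point, here's a quick self-contained check (my own sketch, not from the article): it rounds every 10-bit code 0..1023 from FP32 to BF16 with round-to-nearest-even and counts how many codes fail to survive the round trip - on a plain linear mapping it's about half of them.

            ```c
            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            /* Round an FP32 value to BF16 (round-to-nearest-even on the discarded
             * 16 bits), then widen it back to FP32. NaNs are ignored for this toy. */
            static float f32_to_bf16_to_f32(float f)
            {
                uint32_t bits;
                memcpy(&bits, &f, sizeof bits);
                uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u); /* RNE bias */
                bits = (bits + rounding) & 0xFFFF0000u;
                memcpy(&f, &bits, sizeof f);
                return f;
            }

            int main(void)
            {
                int lost = 0;
                for (int code = 0; code < 1024; code++) {
                    float v = (float)code;      /* map the 10-bit code linearly */
                    if (f32_to_bf16_to_f32(v) != v)
                        lost++;
                }
                printf("10-bit codes that don't survive BF16: %d of 1024\n", lost);
                return 0;
            }
            ```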

            And, if we specifically take the IoT use case, even that is a little weird. Where BFloat16 tends to see the most use is in training - not inferencing. That's why they added the 8-bit equivalents. But, you're not going to train on an IoT device, or even a cell phone. To me, BFloat16 was really aimed at cloud/server-based training - there, it's too little, too late. By a lot.

            P.S. I'm all for hints to optimize performance of cache memory subsystems. So, that's certainly welcome, even though it's probably aimed at a similarly narrow use case.



            • #7
              Originally posted by coder View Post
              It's just as well, because BFloat16 lacks the precision to do most of the things that integers can't. Specifically, regarding 10-bit HDR, I'd remind you that BFloat16 has only an 8-bit mantissa, meaning it can't represent a 10-bit range at full precision. Map it to any range you like, and your epsilon will be too large, for part of it.
              You're correct that BFloat16 can't hold UINT10 with full accuracy, but it can still do a tremendously better job than a UInt8 filter chain can.

              For example, when compositing 12-bit-per-channel RGBA with UInt8 math, you're left holding only the 8 most significant bits when you're done, even if the inputs never used the top 4 bits of their 12-bit range. When converting down to 8 bits while preserving highlights, you'd have on average 3.5 bits of precision left using UInt8 math, versus on average 7.5 bits with BFloat16. That's a pretty significant difference in image quality. Also, you can perform tonemapping on the BFloat16 results; you can't on UInt8 results.
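
              As a toy illustration of that difference (my own made-up numbers, not any real SoC pipeline): add a strong glow layer to a base value in a clamped UInt8 chain versus carrying the sum in float and only tonemapping at the end.

              ```c
              #include <stdint.h>
              #include <stdio.h>

              /* UInt8 filter chain: every step must clamp and quantize immediately. */
              static uint8_t add_u8(uint8_t a, uint8_t b)
              {
                  int s = a + b;
                  return (uint8_t)(s > 255 ? 255 : s);
              }

              /* Float chain: keep the composite in linear float and only map to 8 bits
               * at the end (simple Reinhard-style curve x / (1 + x) on a 0..1 scale). */
              static uint8_t tonemap_to_u8(float linear)
              {
                  float x = linear / 255.0f;
                  float y = x / (1.0f + x);
                  return (uint8_t)(y * 255.0f + 0.5f);
              }

              int main(void)
              {
                  /* Two different highlight pixels: base plus two glow strengths. */
                  uint8_t base = 200, glow_a = 180, glow_b = 240;

                  printf("UInt8 chain:  %3u vs %3u  (both clip, detail gone)\n",
                         add_u8(base, glow_a), add_u8(base, glow_b));
                  printf("Float chain:  %3u vs %3u  (compressed, still distinct)\n",
                         tonemap_to_u8(200.0f + 180.0f), tonemap_to_u8(200.0f + 240.0f));
                  return 0;
              }
              ```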

              If the SoC is smart about how it uses BFloat16, it can use it to produce the FRC-patterned framebuffers for the display, effectively providing true HDR10 with an 8-bit-mantissa data type in conjunction with an 8-bit hardware display, which is what all the HDR10 displays I've seen so far actually are. Those kinds of tricks have been going on since Gingerbread, when Samsung phone displays created a 6-bit gamut (spatially dithered to 8-bit after the fact) with a 4-bit display... you can be sure they haven't stopped now.

              If you find a true 10-bit display, chances are it will be advertised as 12-bit HDR. That one will require something wider than BFloat16 to do compositing. IEEE 754-2008 half precision has an 11-bit mantissa, so it would be up to that task with FRC.

              P.S. A quick way to test the quality of a display's FRC implementation is to adjust the display gamma (on the monitor, not on the driving device). This pushes the FRC outside its designed ideal operating values, causing it to exceed the limits of human Critical Flicker Fusion at a higher sync rate than the manufacturer intended, and it may give users ocular migraines, headaches, dizziness, nausea, or even trigger seizures in those prone to epilepsy. So always adjust display gamma on your PC instead of your monitor. Or, if your monitor flickers, try adjusting its gamma to see if you can make the flicker go away, then adjust the gamma on the PC to get back a normal response curve; sometimes those factory calibrations are far from ideal for reducing flicker. And of course you can reduce the ambient lighting so you can turn the monitor's brightness level down a bit too. At 20 lux and 72 Hz you shouldn't have annoying flicker even with a poorly calibrated IPS/PVA display. TN gamma changes with vertical deflection, so there's no ideal gamma setting to compensate for FRC, and TN panels will always flicker at a higher sync rate.

              See Critical Flicker Fusion frequency response vs ambient lux
              Last edited by linuxgeex; 18 October 2019, 04:48 PM.



              • #8
                Originally posted by linuxgeex View Post
                You're correct that BFloat16 can't hold UINT10 with full accuracy, but it can still do a tremendously better job than a UInt8 filter chain can.
                Okay, I guess you're done talking about ARMv8.6-A and are now into some hypothetical realm.

                Typically, you'd use some higher intermediate representation, since you don't want to incur successive round-off errors. So, even if your output is 8-bit, intermediate precision is usually higher. Note how the BFloat16 instructions in ARMv8.6-A all have fp32 outputs, except for the one that exists specifically for converting fp32 to BFloat16. That seems like an implicit acknowledgement that BFloat16 -> BFloat16 isn't generally what you want.

                Originally posted by linuxgeex View Post
                If the SoC is smart about how it uses BFloat16, it can use it to produce the FRC-patterned framebuffers for the display, effectively providing true HDR10 with an 8-bit-mantissa data type in conjunction with an 8-bit hardware display, which is what all the HDR10 displays I've seen so far actually are.
                Um, no. Not unless you're inventing a chip with specialized instructions that implement dithering instead of standard rounding behavior. Otherwise, you lose your intermediate precision before you can dither.

                Of course, on ARMv8.6-A, you have fp32 as your intermediate precision, so it's all good (assuming your input is only 8-bit). However, if you don't actually need the additional range afforded by BFloat16, then it'd probably be more energy-efficient - if not also faster - to use integer SIMD instructions to implement fixed-point arithmetic.
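
                For what it's worth, here's the sort of fixed-point path I mean, as a plain scalar sketch of my own (a real implementation would spread this across integer SIMD lanes with widening multiplies and narrowing shifts): an alpha blend done entirely in integer math, using the usual exact divide-by-255 trick.

                ```c
                #include <stdint.h>

                /* dst' = (src*alpha + dst*(255-alpha)) / 255, with rounding, no floats.
                 * The +128 plus the (t + (t >> 8)) >> 8 step is the standard trick for
                 * an exact, cheap divide-by-255 on values in this range. */
                static uint8_t blend_u8(uint8_t src, uint8_t dst, uint8_t alpha)
                {
                    uint32_t t = (uint32_t)src * alpha + (uint32_t)dst * (255u - alpha) + 128u;
                    return (uint8_t)((t + (t >> 8)) >> 8);
                }
                ```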

                Originally posted by linuxgeex View Post
                IEEE 754-2008 half precision has an 11-bit mantissa, so it would be up to that task
                Yes! I get why BFloat16 is preferable for deep learning, but I think IEEE half-precision is generally a better trade-off and more widely applicable. As such, I'm mildly annoyed by the ascendance of BFloat16. The GPU sector had some good momentum going with half-precision - I hate to see it now potentially getting derailed.
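
                For contrast, here are the two layouts and conversions side by side, as I understand them (my own sketch; the fp16 path only handles normal, in-range values to keep it short). BFloat16 keeps fp32's 8-bit exponent, so the conversion is basically "keep the top 16 bits", while binary16 has to re-bias the exponent and handle overflow/underflow, in exchange for three extra mantissa bits.

                ```c
                #include <stdint.h>
                #include <string.h>

                /* Bit layouts (sign / exponent / stored mantissa):
                 *   FP32     : 1 / 8 / 23
                 *   BFloat16 : 1 / 8 /  7   -- same exponent range as FP32
                 *   FP16     : 1 / 5 / 10   -- more precision, much smaller range */

                /* FP32 -> BF16 is just "keep the top 16 bits" (truncation here;
                 * real hardware usually rounds to nearest even). */
                static uint16_t f32_to_bf16_trunc(float f)
                {
                    uint32_t bits;
                    memcpy(&bits, &f, sizeof bits);
                    return (uint16_t)(bits >> 16);
                }

                /* FP32 -> FP16 has to re-bias the exponent (127 -> 15) and would also
                 * need overflow/underflow handling; this covers only the normal,
                 * in-range case to show the extra work involved. */
                static uint16_t f32_to_f16_normal_only(float f)
                {
                    uint32_t bits;
                    memcpy(&bits, &f, sizeof bits);
                    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
                    int32_t  exp  = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15;
                    uint16_t mant = (uint16_t)((bits >> 13) & 0x3FFu); /* 23 -> 10 bits */
                    return (uint16_t)(sign | ((uint16_t)exp << 10) | mant);
                }
                ```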
                Last edited by coder; 22 October 2019, 07:57 AM.



                • #9
                  Originally posted by coder View Post
                  Um, no. Not unless you're inventing a chip with specialized instructions that implement dithering instead of standard rounding behavior. Otherwise, you lose your intermediate precision before you can dither.
                  It's not difficult to bias math to cause the rounding to go the direction you want it to, resulting in any ordered dithering pattern you want. I'll leave it as an exercise for you to learn how you do that... hint: for FP it works best with log scaling, and HDR illumination just so happens to be done in log scale. ;-)
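
                  For anyone who doesn't want the exercise: the standard way to get an ordered dither out of ordinary rounding is to add a position-dependent bias from a Bayer matrix just before quantizing, so the round/truncate step lands high or low in a fixed spatial pattern. A minimal sketch of my own, quantizing a float value on a 0..255 scale:

                  ```c
                  #include <stdint.h>

                  /* 4x4 Bayer matrix, scaled to thresholds in [0,1): adding this bias
                   * before truncation turns plain rounding into an ordered dither. */
                  static const float bayer4[4][4] = {
                      {  0/16.0f,  8/16.0f,  2/16.0f, 10/16.0f },
                      { 12/16.0f,  4/16.0f, 14/16.0f,  6/16.0f },
                      {  3/16.0f, 11/16.0f,  1/16.0f,  9/16.0f },
                      { 15/16.0f,  7/16.0f, 13/16.0f,  5/16.0f },
                  };

                  /* Quantize a linear float value (0..255 scale, e.g. a tonemapped
                   * composite) to 8 bits with an ordered dither. */
                  static uint8_t dither_to_u8(float value, int x, int y)
                  {
                      float biased = value + bayer4[y & 3][x & 3]; /* position-dependent bias */
                      int   q = (int)biased;                       /* truncation completes the dither */
                      if (q < 0)   q = 0;
                      if (q > 255) q = 255;
                      return (uint8_t)q;
                  }
                  ```

                  The temporal (FRC) variant is the same idea with the frame counter folded into the matrix lookup.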



                  • #10
                    Originally posted by linuxgeex View Post
                    It's not difficult to bias math to cause the rounding to go the direction you want it to, resulting in any ordered dithering pattern you want.
                    Sure, if you're adding/subtracting. But, again, the problem here is that you want to dither at the end of a series of computations, retaining higher intermediate precision until that point. Otherwise, your output will look like garbage.

                    Originally posted by linuxgeex View Post
                    for FP it works best with log scaling, and HDR illumination just so happens to be done in log scale. ;-)
                    Actually, doing computations in log-scale makes sense if you've got a limited-range datatype with extra precision, like fixed-point. With BFloat16, we have the exact opposite - high-range, low-precision. Using log-scale, you're going to burn through BFloat16's limited precision faster than with linear, not to mention the performance hit you'd take for doing simple addition and subtraction.

