Originally posted by coder
I haven't seen the Habana Gaudi training code. I'd guess a controlling thread initiates a batch of Ethernet DMA transfers using the RoCE feature, while the AI-operation threads just wait for their inputs to arrive. The weights probably get stored off in HBM blocks.
NVDA probably gets its FP64 by fusing or pipelining FP16 or FP32 operations. I believe Intel's Xe-HPC has dedicated FP64 units.
The HPC people are using AI now, but they still want their 64-bit operations. I saw a good presentation on a project from CERN; there's a write-up here:
https://www.intel.com/content/www/us...mer-story.html