Also, on the subject of cache coherency, consider how much floating-point performance IBM's Cell managed. With just 8+1 cores, it delivered over 100 GFLOPS, more than a decade ago. The 8 SPE cores had only 128-bit vector engines, like your design, and were in-order; the 2-way SMT was on the single PPE core.
The secret? The Cell used scratch-pad memory - not cache. This made it notoriously difficult to program, but then they weren't using OpenCL, which would've significantly eased the burden on programmers of managing data movement, among other things.
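To make "managing data movement" concrete, here's a rough sketch of what scratch-pad-style code looks like. This is plain C, not the Cell SDK: the function name, the `TILE` size, and the use of `memcpy` as a stand-in for DMA transfers are all my own assumptions for illustration. The point is only that the programmer, not a cache, decides what is staged into local memory and when.

```c
#include <stddef.h>
#include <string.h>

#define TILE 1024  /* hypothetical tile size, in floats */

/* Scale n floats in main memory by k, staging each tile through a small
   local buffer. memcpy stands in for the DMA transfers a real scratch-pad
   design would issue; nothing here is an actual Cell SDK call. */
void scale_via_scratchpad(float *data, size_t n, float k)
{
    static float local[TILE];  /* stand-in for the scratch-pad memory */

    for (size_t base = 0; base < n; base += TILE) {
        size_t count = (n - base < TILE) ? (n - base) : TILE;

        /* Explicit copy in: main memory -> local buffer */
        memcpy(local, data + base, count * sizeof(float));

        /* Compute touches only the local buffer */
        for (size_t i = 0; i < count; i++)
            local[i] *= k;

        /* Explicit copy out: local buffer -> main memory */
        memcpy(data + base, local, count * sizeof(float));
    }
}
```

On real hardware you'd typically also double-buffer, so the transfer for the next tile overlaps with compute on the current one. That bookkeeping is a big part of what made this style painful to write by hand.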