Originally posted by fairydreaming
The linked document shows a peak single-socket STREAM TRIAD score of 398 GB/s. First, I'd like to point out that it was run in NPS4 mode, which should help. However, that doesn't explain the discrepancy you noted between the measured result and the theoretical performance once RFO (read-for-ownership) is accounted for.
Maybe AVX-512 optimizations are in effect, which could avoid the RFO penalty. That could work because a full AVX-512 register is 64 bytes, which matches the cacheline size. If you do a cacheline-aligned write of an entire cacheline, then there's no point in fetching the old contents first. The read half of an RFO only makes sense for partial writes.
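To put rough numbers on the RFO effect, here is a minimal sketch of the byte accounting for TRIAD (a[i] = b[i] + q * c[i]). It assumes 8-byte doubles and that STREAM credits three arrays' worth of traffic per element, while a write-allocate store adds a fourth transfer (the read of the destination line). The function names are my own, and the only measured figure is the 398 GB/s from the linked document; whether RFO was actually in effect on that run is exactly the open question.

```python
# Sketch of per-element memory traffic for STREAM TRIAD: a[i] = b[i] + q * c[i]
# Assumption: 8-byte doubles, and an RFO costs one extra full read of a's line.

ELEM_BYTES = 8  # size of one double


def triad_bytes_counted() -> int:
    """Bytes STREAM credits per element: read b, read c, write a."""
    return 3 * ELEM_BYTES


def triad_bytes_moved(write_allocate: bool) -> int:
    """Bytes that actually cross the memory bus per element.

    With write-allocate (RFO), the destination line is read before it is
    overwritten, so there are four transfers instead of three.
    """
    transfers = 4 if write_allocate else 3
    return transfers * ELEM_BYTES


def reported_from_actual(actual_gbs: float, write_allocate: bool) -> float:
    """Score STREAM would report for a given real bus bandwidth."""
    return actual_gbs * triad_bytes_counted() / triad_bytes_moved(write_allocate)


def actual_from_reported(reported_gbs: float, write_allocate: bool) -> float:
    """Real bus bandwidth implied by a reported STREAM score."""
    return reported_gbs * triad_bytes_moved(write_allocate) / triad_bytes_counted()


if __name__ == "__main__":
    score = 398.0  # GB/s, the TRIAD score from the linked document
    print(f"implied traffic with RFO:    {actual_from_reported(score, True):.1f} GB/s")
    print(f"implied traffic without RFO: {actual_from_reported(score, False):.1f} GB/s")
```

In other words, if RFO were in effect, the reported score would be only 3/4 of the traffic the memory system really carried, so a score that sits close to the theoretical peak is a hint that the stores are bypassing the read.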
That reopens the question around Michael's performance measurements, since he compiled STREAM TRIAD with -march=native, which should mean he also gets the benefits of AVX-512. So I have to wonder whether his NPS mode setting and other configuration details might help explain it. Also, they ran with SMT disabled.
It would be interesting to dig deeper into this.