Originally posted by AdrianBc
- 17.5% faster @ SPEC2017int
- 2.1% faster @ SPEC2017fp
So, it's a little worse than you say for int, and much worse for fp. FWIW, the numbers I quoted previously were based on the single-thread aggregate scores comparing 1P1T vs. 1E.
Too bad they didn't test 8P1T + 8E, but we can at least see how much 8E adds to 8P2T:
- 25.9% faster @ SPEC2017int
- 7.9% faster @ SPEC2017fp
Working backwards, that implies the uplift of 8P1T + 8E over 8P1T + 0E should be at least:
- 30.4% faster @ SPEC2017int
- 8.1% faster @ SPEC2017fp
Still, that suggests the 8 E-cores deliver, in aggregate, 30.4% (int) and 8.1% (fp) of the throughput of the 8 P-cores running one thread each. Compared to 8P2T, the 8 E-cores add 25.9% and 7.9%. Obviously, those numbers reveal some scaling problems.
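For anyone who wants to check the arithmetic, here's the back-of-envelope calculation as a small Python sketch, using the aggregate scores quoted in this thread (the configuration labels are just my shorthand):

```python
# Back out the implied E-core uplift over 8P1T from the measured
# aggregate SPEC2017 scores quoted above.
scores = {
    "int": {"8P1T": 53.45, "8P2T": 62.82, "8P2T+8E": 79.06},
    "fp":  {"8P1T": 72.38, "8P2T": 73.92, "8P2T+8E": 79.76},
}

for suite, s in scores.items():
    e_gain = s["8P2T+8E"] - s["8P2T"]   # points the 8 E-cores add
    vs_2t = e_gain / s["8P2T"]          # measured uplift over 8P2T
    vs_1t = e_gain / s["8P1T"]          # implied uplift over 8P1T
    print(f"{suite}: +{vs_2t:.1%} over 8P2T, +{vs_1t:.1%} over 8P1T")
```

This reproduces the 25.9%/30.4% (int) and 7.9%/8.1% (fp) figures above.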
Now, an interesting fact about enabling the E-cores is that it creates a bottleneck in the ring bus, due to the E-core clusters' ring stops being down-clocked. So, the multithreaded tests with E-cores show them adding less than what they'd ideally be capable of contributing.
Not only that, but there's doubtless some clock throttling going on, as the CPU bumps into its various power limits. And then there's DDR5, which is clearly much less of a bottleneck, but feeding so many threads from just 2 channel pairs is still going to starve them relative to their single-thread performance.
Originally posted by AdrianBc
However, essentially what you're saying is that 16 E-cores = 8 P-cores with SMT (8P2T). I think your error is in assuming the E-cores scale that well. Having 4 of them share a cache slice and a single ring-bus stop is an impediment to this. I'm not saying my extrapolation is valid, but I think you overestimate them (or at least this implementation of them).
Originally posted by AdrianBc
Let's look at it this way. We'll divide out the points per thread, in the different permutations of P-core loading and E-core loading.
           | 0P + 8E | 8P1T + 0E | 8P1T + 8E | 8P2T + 0E | 8P2T + 8E
int        |   29.81 |     53.45 |     69.69 |     62.82 |     79.06
fp         |   38.07 |     72.38 |     78.22 |     73.92 |     79.76
int/thread |    3.73 |      6.68 |      4.36 |      3.93 |      3.29
fp/thread  |    4.76 |      9.05 |      4.89 |      4.62 |      3.32
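The per-thread rows are just the aggregate scores divided by the number of hardware threads active in each configuration (8, 8, 16, 16, and 24 respectively). As a quick sanity check in Python:

```python
# Per-thread throughput: aggregate score divided by the number of
# hardware threads active in each configuration.
threads = {"0P+8E": 8, "8P1T+0E": 8, "8P1T+8E": 16, "8P2T+0E": 16, "8P2T+8E": 24}
agg = {
    "int": {"0P+8E": 29.81, "8P1T+0E": 53.45, "8P1T+8E": 69.69,
            "8P2T+0E": 62.82, "8P2T+8E": 79.06},
    "fp":  {"0P+8E": 38.07, "8P1T+0E": 72.38, "8P1T+8E": 78.22,
            "8P2T+0E": 73.92, "8P2T+8E": 79.76},
}
for suite, row in agg.items():
    per_thread = {cfg: round(score / threads[cfg], 2) for cfg, score in row.items()}
    print(suite, per_thread)
```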
Again, the 8P1T + 8E column is merely an estimate. However, this confirms that you really do want to load them the way Intel recommends.
The other thing that's interesting: per-thread, an int-heavy thread does better in a 8P2T configuration (3.93) than moved to an E-core (3.73), yet the aggregate numbers still argue for going 8P1T + 8E and just balancing the threads' execution time between P-cores and E-cores.
For fp-heavy threads, it's always better to put the thread on an E-core than to double up a P-core, both in aggregate and in terms of its own performance potential.
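Those trade-offs fall straight out of the per-thread columns; here's the comparison as a quick Python check (values from the table above, labels mine):

```python
# Where does a marginal thread run best, per the per-thread figures:
# an E-core alone (0P+8E column) or sharing a P-core via SMT (8P2T+0E column)?
per_thread = {
    "int": {"E-core": 29.81 / 8, "P-core SMT": 62.82 / 16},
    "fp":  {"E-core": 38.07 / 8, "P-core SMT": 73.92 / 16},
}
for suite, v in per_thread.items():
    better = max(v, key=v.get)
    print(f"{suite}: E-core {v['E-core']:.2f} vs P-core SMT {v['P-core SMT']:.2f}"
          f" -> {better} wins")
```

For int, the SMT sibling comes out ahead; for fp, the E-core does, matching the conclusion above.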
Originally posted by AdrianBc
I think we agree. It was just the notion of going to 512-bit vectors at 32 nm (on a general-purpose CPU) which I thought was ludicrous.