Intel Begins Teasing Their Discrete Graphics Card


  • The thing to remember is that Gen-Z lets a CPU/GPU socket connect to on-board RAM with half the number of pins compared to DDR5, while offering 4 to 8 times the memory bandwidth and almost none of the capacity limits.

    So a 4000-pin chip could drop to 2000 pins and still have more bandwidth to the outside world.

    HBM might be faster, but the capacity you can fit inside a single package has a physical limit.
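
    To put a rough number on that claim using only the figures above, here is a quick back-of-the-envelope sketch in Python; the pin counts and bandwidth multipliers are the post's own illustrative numbers (and the variable names are mine), not datasheet values for any real part:

[CODE]
# Quick arithmetic on the claim above: half the pins, 4-8x the total bandwidth.
# The figures are the post's own illustrative numbers, not measurements.

pins_parallel = 4000   # hypothetical socket wired with a parallel DDR5-style bus
pins_serial   = 2000   # the same socket using Gen-Z-style serial links

for bw_multiplier in (4, 8):
    per_pin_gain = bw_multiplier * (pins_parallel / pins_serial)
    print(f"{bw_multiplier}x total bandwidth on half the pins "
          f"-> {per_pin_gain:.0f}x the bandwidth per pin")
[/CODE]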

    Comment


    • Originally posted by coder View Post
      Yes, but GPUs have orders of magnitude fewer cores. What Nvidia calls "cores" aren't really cores; their SMs are the equivalent of a CPU core. Why does it matter? Because you can batch an entire 32-lane SIMD's worth of memory accesses, thus vastly improving coherency over what you'd have if a comparable number of MIMD cores were all running. The other efficiency you get - and this is big - is that you're only fetching and decoding one instruction stream for all of them.

      For problems that map well to SIMD, I think you can't beat GPUs. Sure, there are going to be cases where 4096 MIMD cores outperform comparably sized GPUs, but those are probably going to be a minority of cases in HPC.

      Anyway, we'll see. I think none of us are privy to enough information to conclusively say otherwise. Personally, I'm ready to be surprised by a large-scale MIMD, like PEZY-SC2. The ultimate test will be whether they sell (in this case, outside of Japan).
      I know that "core" in GPU parlance doesn't mean "core" in CPU parlance. I wrote a basic CPU-GPU dictionary here.

      The dumb GPU approach is more efficient when there are no divergences in the code. When there are divergences, the MIMD approach is more efficient. GPU inefficiency can be 50% or more for a 32-wide warp.
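
      To make that 50% figure concrete, here is a minimal sketch in plain Python (not tied to any real GPU API; the helper name, lane count, and per-branch costs are assumptions made up for the illustration) of how lane utilization drops when a 32-wide warp hits a divergent if/else:

[CODE]
WARP_SIZE = 32

def divergent_utilization(lane_predicates, cost_if=10, cost_else=10):
    """Fraction of useful lane-slots when a 32-wide warp hits an if/else."""
    n_if = sum(lane_predicates)      # lanes taking the 'if' side
    n_else = WARP_SIZE - n_if        # lanes taking the 'else' side

    # Useful work: each lane only needs its own side of the branch.
    useful = n_if * cost_if + n_else * cost_else

    # A SIMT warp issues the 'if' side across all 32 slots (inactive lanes
    # masked off), then the 'else' side, skipping a side only if no lane needs it.
    issued = (WARP_SIZE * cost_if if n_if else 0) + \
             (WARP_SIZE * cost_else if n_else else 0)
    return useful / issued

# Half the lanes take each side: utilization drops to 50%.
print(divergent_utilization([i % 2 == 0 for i in range(WARP_SIZE)]))  # 0.5

# No divergence: every lane takes the same side, utilization stays at 100%.
print(divergent_utilization([True] * WARP_SIZE))                      # 1.0
[/CODE]

      A MIMD core per lane would only ever execute its own side of the branch, which is the efficiency gap being described above.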

      Comment


      • Originally posted by juanrga View Post

        I know that "core" in GPU parlance doesn't mean "core" in CPU parlance. I wrote a basic CPU-GPU dictionary here.

        The dumb GPU approach is more efficient when there are no divergences in the code. When there are divergences, the MIMD approach is more efficient. GPU inefficiency can be 50% or more for a 32-wide warp.
        And then there are RISC-V cores that have the same functionality as a GPU's warp-processing hardware plus a full general-purpose instruction set. That gives you a core that can handle divergence quite well while keeping the straight-line performance of a GPU. Processing the extra instructions doesn't add that much to core size, particularly once you work out that for each core/warp scheduler to be able to do its own thing you need to drop the SM part of the GPU and use a NoC to connect the cores. In practice, adding the general instructions, using a NoC, and removing the SM works out to a silicon size saving.

        This makes me question whether the current GPU approach is the right one. It seems like GPUs have become over-specialised without enough benefit, so a slightly more generic processing design could kick their tail in every test. It's the same mistake Intel made with Xeon Phi: once you have a NoC you need to rethink things. A NoC is quite a major change in silicon design, particularly if you want to make the most of the silicon.

        Comment


        • Originally posted by juanrga View Post
          I know that "core" in GPU parlance doesn't mean "core" in CPU parlance. I wrote a basic CPU-GPU dictionary here.
          See https://www.phoronix.com/forums/foru...26#post1041626

          Originally posted by juanrga View Post
          The dumb GPU approach is more efficient when there are no divergences in the code. When there are divergences, the MIMD approach is more efficient. GPU inefficiency can be 50% or more for a 32-wide warp.
          Yes, it's a given that we're talking about workloads which vectorize well. This is true of most HPC workloads, but there will always be exceptions.

          I'm still not convinced a 4096-core MIMD makes sense, given how much power will be burned in communication. I think the sweet spot for MIMD is probably fewer, somewhat more powerful cores, especially if we're talking about a memory-to-memory architecture (i.e. where memory or cache is used for inter-core communication). It gets a bit more interesting if the cores have local memory accessible to their peers, like the Cell processor.
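
          To give a feel for why communication cost grows with core count, here is a small Python sketch (the helper names and topologies are a made-up comparison, not a model of any real interconnect) counting the links needed for a 2D mesh versus a direct all-to-all fabric:

[CODE]
import math

def mesh_links(n):
    """Bidirectional links in a sqrt(n) x sqrt(n) 2D mesh: grows roughly O(n)."""
    side = math.isqrt(n)
    assert side * side == n, "use a square core count for this sketch"
    return 2 * side * (side - 1)

def all_to_all_links(n):
    """One direct link between every pair of cores: grows as O(n^2)."""
    return n * (n - 1) // 2

for n in (64, 1024, 4096):
    print(f"{n:5d} cores: mesh ~{mesh_links(n):7d} links, "
          f"all-to-all {all_to_all_links(n):9d} links")
[/CODE]

          Real NoCs, rings, and fat trees sit somewhere in between, but the wiring (and the power spent driving it) is part of what any core-count argument has to pay for.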

          Comment


          • Originally posted by oiaohm View Post

            And then there are RISC-V cores that have the same functionality as a GPU's warp-processing hardware plus a full general-purpose instruction set. That gives you a core that can handle divergence quite well while keeping the straight-line performance of a GPU. Processing the extra instructions doesn't add that much to core size, particularly once you work out that for each core/warp scheduler to be able to do its own thing you need to drop the SM part of the GPU and use a NoC to connect the cores. In practice, adding the general instructions, using a NoC, and removing the SM works out to a silicon size saving.

            This makes me question whether the current GPU approach is the right one. It seems like GPUs have become over-specialised without enough benefit, so a slightly more generic processing design could kick their tail in every test. It's the same mistake Intel made with Xeon Phi: once you have a NoC you need to rethink things. A NoC is quite a major change in silicon design, particularly if you want to make the most of the silicon.
            I think GPUs are overhyped. They are good for some very specific cases and bad at everything else. A manycore CPU looks much better.

            Comment


            • Originally posted by coder View Post
              See https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/intel-linux/1041074-intel-begins-teasing-their-discrete-graphics-card?p=1041626#post1041626

              Yes, it's a given that we're talking about workloads which vectorize well. This is true of most HPC workloads, but there will always be exceptions.
              It isn't a given. That is why CPU-based supercomputers continue to be built today. There are lots of HPC applications that run better on CPUs.

              Originally posted by coder View Post
              I'm still not convinced 4096-core MIMD makes sense, given how much power will be burned in communication. I think the sweet spot for MIMD is probably to make the cores fewer and a bit more powerful, especially if we're talking about a memory-to-memory architecture (i.e. where memory or cache is used for inter-core communication). It gets a bit more interesting, if the cores have local memory accessible to their peers, like the Cell processor.
              A mesh interconnect or similar topology scales as O(N), whereas the scaling of more powerful cores depends on how the core is improved: increasing the superscalar width (e.g. 2x128-bit --> 4x128-bit) scales as O(N²), whereas increasing the SIMD width (e.g. 2x128-bit --> 2x256-bit) scales as O(N).

              It could seem that reducing the number of cores and increasing the SIMD width of each core is the optimal solution in terms of efficiency, but this is only true when the code can use all the available SIMD width and there are no divergences.
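
              As a sketch of that last point, here is a small Python comparison (purely hypothetical chips and a made-up 'vector_fraction' parameter, chosen only to illustrate the argument) of many narrow cores versus few wide-SIMD cores with the same total lane count:

[CODE]
def effective_lanes(n_cores, simd_width, vector_fraction):
    """Average useful lanes per cycle when only part of the code vectorizes."""
    # Vectorized code fills the whole SIMD unit; the rest uses one lane per core.
    per_core = vector_fraction * simd_width + (1 - vector_fraction) * 1
    return n_cores * per_core

for vector_fraction in (1.0, 0.8, 0.5):
    many_narrow = effective_lanes(4096, 1, vector_fraction)   # 4096 scalar cores
    few_wide = effective_lanes(128, 32, vector_fraction)      # 128 cores, 32-wide SIMD
    print(f"vectorizable {vector_fraction:4.0%}: "
          f"4096x1-wide -> {many_narrow:6.0f} lanes, "
          f"128x32-wide -> {few_wide:6.0f} lanes")
[/CODE]

              Once the vectorizable fraction drops, the wide-SIMD design leaves lanes idle, while the many-core design keeps all of its (scalar) cores busy; that is the trade-off being argued here.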

              Comment
