Intel Begins Teasing Their Discrete Graphics Card


  • Originally posted by oiaohm View Post
    Both of those come from having access to the real cards, a real Xeon Phi, a real Silvermont Atom and a real 54C, and then seeing that single-threaded samples on that hardware were slower on the Xeon Phi. Then you start going through the spec very carefully to see what they cut out.
    What spec?

    Originally posted by oiaohm View Post
    Problem is, there is more than one benchmark example of these RISC-V systems keeping up. Yes, you are right, it is a lot like the AMD GCN method of doing cache.
    I meant the round-robin dispatch and having one instruction in flight per thread, avoiding the need to detect data dependencies. I don't know about the cache layout offhand, but wouldn't be surprised if that were also similar.
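
    A minimal sketch of that dispatch idea as I understand it (hypothetical code, not the prototype's actual pipeline): with one instruction in flight per thread and threads issued round-robin, a thread's previous instruction has already retired by the time its turn comes around again, so no scoreboard or dependency check between its own instructions is needed.

    ```cpp
    // Toy model of barrel-processor dispatch: N hardware threads, one
    // instruction in flight per thread, issued round-robin. Names and
    // structure here are illustrative only, not the prototype's design.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct HwThread {
        int pc = 0;           // program counter of this hardware thread
        bool blocked = false; // e.g. waiting on a memory reply
    };

    int main() {
        std::vector<HwThread> threads(4);
        for (int cycle = 0; cycle < 8; ++cycle) {
            std::size_t id = cycle % threads.size();
            HwThread& t = threads[id];
            if (t.blocked) continue; // a stalled thread just loses its slot
            std::printf("cycle %d: issue thread %zu, pc=%d\n", cycle, id, t.pc);
            ++t.pc; // this instruction retires before the thread's next turn,
                    // so no register-dependency detection is required
        }
    }
    ```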

    Originally posted by oiaohm View Post
    There is really no particular reason why a general CPU based around a compact instruction set like RISC-V could not have a GPU-style architecture layout, and for a lot of compute processing this makes more sense than using a GPU. A GPU instruction set for compute processing is very much a round peg in a square hole.
    No, GPUs are performance-first architectures. They do away with many niceties that make life easier for software developers, like memory ordering & consistency guarantees and instruction-level traps. That's one reason they're so efficient.
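
    To make those niceties concrete, here is a hedged illustration in plain standard C++ (nothing GPU-specific) of the kind of guarantee at stake: acquire/release ordering lets one thread publish data through a flag, something CPU memory models support cheaply while GPU programming models historically left weak or scope-limited.

    ```cpp
    // One guarantee CPUs make cheap and GPU ISAs traditionally did not:
    // publish data, then set a flag, relying on acquire/release ordering.
    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;                                    // plain store
        ready.store(true, std::memory_order_release); // publish the flag
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {} // spin until set
        assert(data == 42); // guaranteed visible: acquire pairs with release
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }
    ```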

    Originally posted by oiaohm View Post
    Interestingly enough, RISC-V has higher instruction density than the AMD GCN instruction sets. The same is true against Nvidia's instruction sets. There are a lot of issues with GPUs and compute workloads.
    If that were a bottleneck, they'd have optimized it. The fact that they didn't only means it's a non-issue for them.

    Anyway, this is pointless. You're reading way too much into that one case of a machine designed expressly to do sparse memory transactions. You can believe RISC V is the apotheosis of CPU (and even GPU) ISAs, if you like. I think it's not, but I really don't care whether you agree.



    • Originally posted by oiaohm View Post
      Current x86 chips are not that well optimised for massive numbers of threads; there is way too much focus on single-thread speed.
      Yeah because that's the point of a general purpose CPU (a supercomputer CPU is not that "general purpose" btw). For other tasks you have GPUs, ASICs, etc.

      Different tools for different jobs.



      • Originally posted by coder View Post
        What spec?
        The programming spec for the chips from Intel.

        Originally posted by coder View Post
        I meant the round-robin dispatch and having one instruction in flight per-thread, avoiding the need to detect data dependencies. I don't know about the cache layout, off hand, but wouldn't be surprised if that were also similar.
        Except you missed something: the RISC-V prototype I pointed you to has not lost the method to detect data dependencies; it moved the placement of it.

        RV64G is the 64-bit RISC-V general instruction set, so it has all your general instruction set features. It's mentioned on page 3.

        Originally posted by coder View Post
        No, GPUs are performance-first architectures. They do away with many niceties that make life easier for software developers, like memory ordering & consistency guarantees and instruction-level traps. That's one reason they're so efficient.
        In these RISC-V prototypes they moved the "memory ordering & consistency" stuff into the L3, out of the CPU cores. Just take a closer look at slide 6, where the atomic operations are diagrammed for placement; that is your memory ordering and consistency stuff. Barrel processor traps become events and messages.

        So you can have something that looks very close to a GPU performance-first architecture, with the extra features in the L3 cache shared between all cores and in the communication system between cores, while still being able to perform all the classes of instruction that you would expect any general purpose CPU to do.
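
        A rough sketch of that relocation idea (purely illustrative; this is not the protocol from the slides): instead of each core performing a read-modify-write locally and bouncing cache-line ownership around, cores post the operation as a message to a single agent at the shared cache, and serial application at that agent is what supplies the ordering.

        ```cpp
        // Toy "far atomic": cores post fetch-add requests to a single agent
        // that owns the value (think: at the shared L3) instead of doing the
        // read-modify-write in-core. Names and structure are illustrative.
        #include <cstdio>
        #include <queue>

        struct L3Agent {
            long value = 0;
            std::queue<long> inbox; // add requests arriving from cores

            void drain() {
                while (!inbox.empty()) {    // applied one at a time, so there
                    value += inbox.front(); // is no core-side locking and no
                    inbox.pop();            // cache-line ownership ping-pong
                }
            }
        };

        int main() {
            L3Agent l3;
            for (int core = 0; core < 4; ++core)
                l3.inbox.push(1); // each "core" sends an atomic-add message
            l3.drain();
            std::printf("value = %ld\n", l3.value); // prints: value = 4
        }
        ```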

        Originally posted by coder View Post
        Anyway, this is pointless. You're reading way too much into that one case of a machine designed expressly to do sparse memory transactions. You can believe RISC V is the apotheosis of CPU (and even GPU) ISAs, if you like. I think it's not, but I really don't care whether you agree.
        Also read page 3 again. Take close note: it reads from memory in 8, 16, 32 and 64 byte blocks. 64-byte blocks are your normal non-sparse cache line that you see in all x86 CPUs. They have designed a chip that is really good when you have a sparse memory workload and takes no penalty when it does not, because you have the same cache line size on offer that you would normally use when you don't have a sparse memory problem.
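
        The bandwidth argument here is easy to put numbers on (my arithmetic, not from the slides): if an access actually consumes one 8-byte word, a fixed 64-byte line moves eight times the data used, while selectable block sizes keep the useful fraction high for sparse access and still offer full lines for dense access.

        ```cpp
        // Fraction of fetched bytes actually used when each access consumes
        // one 8-byte word, for the four block sizes mentioned on page 3.
        #include <cstdio>

        int main() {
            const int word = 8; // bytes actually consumed per access
            for (int block : {8, 16, 32, 64})
                std::printf("%2d-byte block: %5.1f%% of fetched bytes used\n",
                            block, 100.0 * word / block);
            // prints 100.0%, 50.0%, 25.0%, 12.5% respectively
        }
        ```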

        I did not say it was the apotheosis, but the RISC-V work with networks on chip is interesting because you are seeing a general instruction set, without the limitations of normal GPU instruction sets, achieve equal or better performance. Of course it means things like atomic operations, which you would expect inside cores, are not inside cores.

        The serious question with the RISC-V prototypes is whether all the features GPU instruction sets drop for performance, the ones that make life easy for software developers, had to be dropped at all, or whether those features should just have been relocated in the design. From the different network-on-chip RISC-V prototypes demoed, it is looking like all the features GPU instruction sets drop could be kept, just relocated.

        So far, silicon layouts for CPUs and GPUs have been very different. One of the questions raised by the RISC-V network-on-chip work is whether that is in fact correct at all, or whether general CPUs should have an internal layout very much like a GPU's.

        There are other RISC-V prototypes out there with networks on chip that don't target sparse data.




        • Originally posted by Weasel View Post
          Yeah because that's the point of a general purpose CPU (a supercomputer CPU is not that "general purpose" btw). For other tasks you have GPUs, ASICs, etc.

          Different tools for different jobs.
          General purpose does mean it should have decent all-round performance. Being too focused on single-threaded performance means it is really not general purpose any more.



          • You cannot compare an ancient computer, with chips made on a 45nm node, with modern computers using 16nm, 14nm, 12nm,...

            You have to compare it with the state of the art of its epoch: "For comparison, the average power consumption of a TOP 10 system in 2011 was 4.3 MW, and the average efficiency was 463.7 GFlop/kW." So the K computer was 78% more efficient than the average, despite not being designed to be the most efficient computer; it was designed to be the fastest in the world.
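
            For reference, the 78% figure is just the ratio of the two Green500 numbers. The K computer figure in the snippet below is my recollection of its 2011 listing, not a number quoted in this thread.

            ```cpp
            // Ratio behind the "78% more efficient" claim.
            #include <cstdio>

            int main() {
                const double k_computer = 824.6; // GFlop/kW (assumed, 2011 Green500)
                const double top10_avg  = 463.7; // GFlop/kW (quoted above)
                std::printf("K computer / average = %.2f\n", k_computer / top10_avg);
                // prints roughly 1.78, i.e. ~78% more efficient
            }
            ```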

            Originally posted by coder View Post
            Annoyingly, Green 500 doesn't seem to specify what GPUs or other accelerators are used.
            The list specifies what is used. The first three top computers use PEZY-SC2. The first GPU-based system is at #4, a system with Nvidia Tesla V100.



            • Originally posted by oiaohm View Post
              The programming spec for the chips from Intel.
              Link & page numbers, please.

              Originally posted by oiaohm View Post
              Except you missed something: the RISC-V prototype I pointed you to has not lost the method to detect data dependencies; it moved the placement of it.
              No, they were talking about register dependencies, and the video says they removed it.

              Originally posted by oiaohm View Post
              In these RISC-V prototypes they moved the "memory ordering & consistency" stuff into the L3, out of the CPU cores. Just take a closer look at slide 6, where the atomic operations are diagrammed for placement; that is your memory ordering and consistency stuff.
              No. Atomics are a somewhat different matter. You're reaching to win points. Don't do that.

              Originally posted by oiaohm View Post
              So you can have something that looks very close to a GPU performance-first architecture, with the extra features in the L3 cache shared between all cores and in the communication system between cores, while still being able to perform all the classes of instruction that you would expect any general purpose CPU to do.

              Also read page 3 again. Take close note: it reads from memory in 8, 16, 32 and 64 byte blocks. 64-byte blocks are your normal non-sparse cache line that you see in all x86 CPUs. They have designed a chip that is really good when you have a sparse memory workload and takes no penalty when it does not, because you have the same cache line size on offer that you would normally use when you don't have a sparse memory problem.
              No, the performance of this thing looks way worse on classical CPU and GPU workloads. Its ratio of memory bandwidth to compute is far different from that of CPUs and even GPUs.

              It's almost like they built a crypto-mining processor.

              Originally posted by oiaohm View Post
              The serious question with the RISC-V prototypes
              It's not a serious question, because you're over-generalizing. The reason it beat CPUs and GPUs is because its target problem domain is very different from what is normally encountered. That was the whole point of the exercise, in fact. I even think the actual CPU ISA had little to do with the end result - only that RISC V was easier for them to work with than alternatives.
              Last edited by coder; 19 August 2018, 02:59 PM.



              • Originally posted by oiaohm View Post
                General purpose does mean it should have decent all-round performance. Being too focused on single-threaded performance means it is really not general purpose any more.
                No, because that's the pain point of most existing software. This is priority #1 for anything that's truly general purpose.

                Again, the reason why they focus on single-threaded performance is no accident. They're continually profiling applications that people actually use, and targeting the bottlenecks. If the vast majority of software weren't so sensitive to single-thread performance, they surely wouldn't burn so much power, silicon, and engineering time on the problem.



                • Originally posted by juanrga View Post
                  The list specifies what is used. The first three top computers use PEZY-SC2.
                  Sorry, I didn't recognize the name.



                  I find it refreshing to see them use terms like "city" and "prefecture". However, those chip-level power and performance numbers are still worse than GPUs.

                  That said, this is probably more balanced than GPUs. At 256 PEZY cores/processor, it should be better at dense control flow than GPUs.



                  • Originally posted by coder View Post
                    Sorry, I didn't recognize the name.



                    I find it refreshing to see them use terms like "city" and "prefecture". However, those chip-level power and performance numbers are still worse than GPUs.

                    That said, this is probably more balanced than GPUs. At 256 PEZY cores/processor, it should be better at dense control flow than GPUs.
                    PEZY-SC2 does 15 GFLOPS/watt, which is at the same level as the best 16nm GPUs from Nvidia. Only the newest 12nm GPUs beat those numbers.



                    • Originally posted by juanrga View Post
                      PEZY-SC2 does 15 GFLOPS/watt, which is at the same level as the best 16nm GPUs from Nvidia. Only the newest 12nm GPUs beat those numbers.
                      Sure, but that doesn't really help them. A 12nm GPU is what it's going up against. Nvidia's V100 is all the rage in HPC.

                      Also, we don't know how it compares on real-world benchmarks. For instance, AMD GPUs traditionally have higher raw bandwidth and compute specs than Nvidia's, but somehow manage to underperform. While impressive, the raw numbers don't tell the whole story.

                      Comment
