More OpenACC 2.5 Code Lands In GCC

  • #11
    Originally posted by L_A_G View Post
    If something is possible with one CPU architecture, it's possible on any other equally advanced architecture.
    What’s possible is that the overall performance can be improved by spending your dollar and power budget on a more advanced interconnect, at the expense of some raw CPU grunt. This is a threat to x86, which has been dominating the raw CPU power stakes for quite a while now.

    As for the TaihuLight, you should take the claims about its performance with more than just a grain of salt, as the performance figures on it, and its precursors, have never been independently verified.
    It was good enough to get onto the Top500 twice in a row so far. Are you saying that list can’t be trusted? Yet it could be trusted before, when there weren’t Chinese machines dominating the stakes?

    The Xeon Phi, while housed like a GPU, is not a GPU. It's essentially a bunch of ARM-class x86 cores with big vector instruction units, 8-way SMT and a fast ring bus to connect them all. This is very similar to the high-performance ARM packages like Cavium's ThunderX offerings and the chips used in the TaihuLight.

    GPUs may not be perfect for all compute uses, but general purpose CPUs are actually pretty badly suited for the job ...
    And yet the Xeon Phi seems to do this “job” quite well--otherwise, why did you mention it? (Remember, you were the one who brought it up.)



    • #12
      Originally posted by ldo17 View Post
      What’s possible is that the overall performance can be improved by spending your dollar and power budget on a more advanced interconnect, at the expense of some raw CPU grunt. This is a threat to x86, which has been dominating the raw CPU power stakes for quite a while now.
      Interconnects really don't eat that much silicon space or power. On-die interconnect capability isn't even that big of an issue when most CPU-to-CPU interconnect limitations happen off-die.

      It was good enough to get onto the Top500 twice in a row so far. Are you saying that list can’t be trusted? Yet it could be trusted before, when there weren’t Chinese machines dominating the stakes?
      As I said, nobody's ever been able to verify the actual performance of it or its precursors, as access to them is very strictly limited and most of what they're actually used for is related to research projects run by the Chinese army. The people who maintain the supercomputer Top500 just don't want any conflicts, so they take the reported figures at face value.

      I personally haven't heard of any westerner being able to run any serious jobs on the TaihuLight, the Tianhe-1/1A or the Tianhe-2. By comparison, I personally know somebody who's been able to run jobs on the Titan, which is #3 on the last Top500 list, with the whole machine to himself, and that machine belongs to the U.S. Department of Energy, which also happens to be the custodian of the U.S. nuclear arsenal.

      Besides, as I already pointed out, the figures that make up the Top500 list are based on the theoretical performance of the machines and say nothing about the bandwidth or latency of the CPU-to-CPU interconnect. The fact that one machine ranks higher in the Top500 than another doesn't mean it's actually faster once CPU-to-CPU communication bandwidth and latency bottlenecks come into play.
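      To make it concrete, here is a rough sketch of how that kind of theoretical peak figure is arrived at (the machine parameters below are made up purely for illustration, not taken from any real Top500 entry):

          /* Back-of-the-envelope theoretical peak ("Rpeak"-style) calculation.
           * All machine parameters are hypothetical, for illustration only. */
          #include <stdio.h>

          int main(void)
          {
              double nodes           = 10000.0; /* compute nodes (assumed)                */
              double cores_per_node  = 64.0;    /* cores per node (assumed)               */
              double clock_ghz       = 2.0;     /* core clock in GHz (assumed)            */
              double flops_per_cycle = 16.0;    /* double-precision FLOPs/cycle (assumed) */

              /* Theoretical peak assumes every core retires its maximum number of
               * FLOPs every cycle; it says nothing about interconnect bandwidth,
               * latency, or how real applications actually scale. */
              double rpeak_pflops = nodes * cores_per_node * clock_ghz * 1e9
                                    * flops_per_cycle / 1e15;

              printf("Theoretical peak: %.1f PFLOPS\n", rpeak_pflops);
              return 0;
          }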

      And yet the Xeon Phi seems to do this “job” quite well--otherwise, why did you mention it? (Remember, you were the one who brought it up.)
      I brought it up because it's something of a halfway point between a traditional general purpose CPU and a more dedicated piece of hardware like a GPU or a DSP, and it's pretty similar to the way in which all high-performance compute ARM CPUs have worked so far. You could almost call Cavium's ThunderX a knockoff, except that it's a proper CPU rather than an accelerator card.

      Seeing how you didn't really get the actual point I was trying to make: compute units based on loads of small general purpose CPU cores have been tried before and just weren't successful. They don't run applications that are badly written or badly suited for HPC use as well as less specialized general purpose CPUs do, and they don't run applications that are well written or well suited for HPC use as well as more specialized hardware like GPUs does.

      Simply put: They're pretty much useless!



      • #13
        Originally posted by L_A_G View Post

        Interconnects really don't eat that much silicon space or power.
        They certainly seem to make a helluva difference. Even the Europeans are going to build a multi-million-core super, using ARM chips this time.


        As I said, nobody's ever been able to verify the actual performance of it ...
        Seems like they have:

        The authors of the TOP500 reserve the right to independently verify submitted LINPACK results, and exclude systems from the list which are not valid or not general purpose in nature. By general purpose system we mean that the computer system must be able to be used to solve a range of scientific problems. Any system designed specifically to solve the LINPACK benchmark problem or have as its major purpose the goal of a high TOP500 ranking will be disqualified.

        That doesn’t sound like “we have to take their word for it” to me...



        • #14
          Originally posted by ldo17 View Post
          They certainly seem to make a helluva difference. Even the Europeans are going to build a multi-million-core super, using ARM chips this time.
          Seem to make? Once again, the Top500 rankings list theoretical performance, not actual performance in real-world supercomputing applications.
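          As a rough illustration of what that leaves out, a simple latency/bandwidth cost model (all figures below are hypothetical) shows how CPU-to-CPU communication eats into the theoretical number:

              /* Simple "alpha-beta" style cost model:
               *   time = compute + messages*latency + bytes/bandwidth
               * All figures are hypothetical, purely to illustrate the bottleneck. */
              #include <stdio.h>

              int main(void)
              {
                  double compute_s = 1.0;   /* time in pure computation per step (assumed) */
                  double messages  = 1e5;   /* messages exchanged per step (assumed)       */
                  double latency_s = 1e-6;  /* per-message interconnect latency (assumed)  */
                  double bytes     = 5e9;   /* data exchanged per step, in bytes (assumed) */
                  double bw        = 1e10;  /* interconnect bandwidth, bytes/s (assumed)   */

                  double comm_s  = messages * latency_s + bytes / bw;
                  double total_s = compute_s + comm_s;

                  /* Fraction of the theoretical throughput actually delivered once
                   * communication costs are accounted for. */
                  printf("Delivered fraction of peak: %.0f%%\n",
                         100.0 * compute_s / total_s);
                  return 0;
              }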

          Also, the fact that somebody builds a particular type of supercomputer doesn't mean that it's actually going to work and provide the performance its designers thought it would. Fujitsu was originally supposed to have an ARM supercomputer in operation next year, but it's been pushed back to 2020 and beyond. A few years ago they built an FPGA supercomputer that was also supposed to revolutionize supercomputing, but it never lived up to the hype and ended up being the only one of its kind.

          That doesn’t sound like “we have to take their word for it” to me...
          If you read that properly you'd have noticed that they only reserve the right to verify reported performance figures. Not only that, they only talk about supercomputing problems with respect to what the machines can do, not how well they do it. They don't actually test the systems, and the only reason they reserve the right to do so is so that they can disqualify systems that aren't proper supercomputers, or that are built just to run supercomputing benchmarks really well and nothing else.



          • #15
            Originally posted by ldo17 View Post

            But I don’t think such an application area could ever account for anything approaching 300 million new machine sales per year.

            In other words, it will always be a niche market: low-volume, high-margin. Something that sells to corporates rather than end-users.

            (Kind of like how the whole computer market worked before PCs came along...)
            That's because you're defining the space of interest so narrowly...
            Apple sells around 300 million iOS devices a year. Those devices come with GPU compute performance that matches Intel's low-end and mid-range GPUs, and that compute performance is indeed of interest to, and utilized by, Apple. So far this has been done through a framework called MPS (Metal Performance Shaders), used primarily for image and video manipulation, but even in iOS 10 we had image recognition in there. With iOS 11 we get general machine learning (using a variety of models and techniques for a variety of purposes, including things like translation or sentiment recognition) and, along a different dimension, AR.

            Personally, on the Apple side I think the way this plays out is the replacement of the Imagination GPU not by an Apple GPU so much as an Apple throughput engine, something that uses the AArch64 ISA on a large number of simple cores (think Larrabee or Cell as earlier examples of this concept). IBM failed (IMHO) because Cell insisted on a crazy memory model; Intel failed (IMHO) because the x86 overhead was too high given the necessary simplicity of the desired cores.
            PowerVR IS a top-of-the-line GPU, no question about it, and even better with Furian; it doesn't make sense for Apple to abandon them unless they have something very different in mind, and a generic throughput engine is my guess.
            Why do this? Because using a GPU to perform throughput computations made sense ten years ago, but nowadays, when so many non-GPU computations run on the device, it's silly to start with a graphics-optimized device and twiddle it into something general purpose. It makes more sense to start with a general purpose device and add a few tweaks (like texture support) to allow it to also do graphics well.

            The point is, when we switch from a world of CPU+GPU (with high costs to move between the two) to a world of latency cores plus throughput cores, where moving between them costs somewhere between migrating from one CPU to another and migrating from a big core to a LITTLE core, the world of tools like OpenCL and CUDA changes substantially. It becomes more interesting, and more relevant, to have the compiler and runtime do more of the work (assisted by a language that makes pointer abuse more difficult and discourages mutating data structures, so that the compiler can be a lot more powerful).
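            Directive-based models like the OpenACC support this article is about are one step in that direction: you annotate the loop, and the compiler plus runtime decide how to map it onto whatever throughput hardware is available and handle the data movement. A minimal sketch (buildable with -fopenacc in GCC; the loop and data clauses are purely illustrative):

                /* Minimal OpenACC sketch: the compiler and runtime take care of
                 * offloading the annotated loop and copying data to and from the
                 * accelerator (or fall back to the host if none is present). */
                #include <stdio.h>

                #define N 1000000

                int main(void)
                {
                    static float x[N], y[N];

                    for (int i = 0; i < N; i++) {
                        x[i] = (float)i;
                        y[i] = 2.0f;
                    }

                    /* The directive requests parallel execution; the data clauses
                     * describe the copies needed, so the programmer never writes
                     * explicit transfers. */
                    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
                    for (int i = 0; i < N; i++)
                        y[i] = 3.0f * x[i] + y[i];

                    printf("y[10] = %f\n", y[10]);
                    return 0;
                }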

