Announcement

**pal666** · 04 September 2019, 07:13 PM

Originally posted by coder

Cell proved what's possible when hardware stops trying to bend over backward to cater to poor code. IMO, that was the last generation of consoles that actually leapfrogged PCs in any meaningful way.

um, last generation of consoles are mediocre pcs hardware wise

**pal666** · 04 September 2019, 07:15 PM

Originally posted by torsionbar28

This has been the trend lately in the Linux kernel world. Linux used to be a choice OS for re-purposing old hardware. Not so much any more. Probably NetBSD is the only OS left that targets all kinds of old and obscure hardware.

and what magic spu compiler netbsd uses?

Originally posted by torsionbar28

Personally, I think the kernel and toolchain folks should keep support for old hardware

i'm sure kernel and toolchain folks also have some thoughts about what you should do

**curaga** · 05 September 2019, 02:57 AM

Originally posted by coder

So, are any indie devs still making any PS3 games or demos? I don't know if the PS3 will ever reach this sort of vintage, but devs are still working with far more obscure hardware:

There is little interest in that. Not only are the SPEs a massive pain to code for, as others have mentioned, but the PS3 is also one of the newer generations where getting graphics and an experience comparable to commercial games takes a ton of manpower. You are not going to compete with FF13 coding alone in your garage. Compare that with older consoles: on the 8-bit era and older you can match any commercial release alone, and on 16-bit and early 32-bit you can do so with two to three people.

**rmoog** · 05 September 2019, 06:11 AM

Wtf gcc?

**audir8** · 05 September 2019, 03:50 PM

It's OK if newer versions of GCC don't have Cell support as long as there's a place to collect any patches made for pre-9.x compilers. If you want to do any development on the PS3, you're going to be stuck on something circa Ubuntu 10.04 anyway. Maybe there could be a "historical" Linux distro that focuses on things like this. If you throw plain-text patches and a Gentoo ebuild on GitHub, anybody will be able to rebuild it with a single command and add more patches. Is it any easier with other distros?

In a naive way, you could say both the Cell and Itanic asked a bit too much from compiler writers, who nowadays have no problems writing all the compilers which make up TensorFlow. The Cell was the first HSA platform which was a moderate mainstream success because Sony stuck with it, and the amount of development ultimately done for it is nothing to sneeze at. Really, I'm looking forward to more HSA architectures with things like the aforementioned SYCL.

**coder** · 06 September 2019, 01:08 AM

Originally posted by audir8

In a naive way, you could say both the Cell and Itanic asked a bit too much from compiler writers, who nowadays have no problems writing all the compilers which make up TensorFlow.

Really? The SPEs are just 2-way, in-order, with 256 KB of local RAM that should have fairly low, consistent latency. That's a pretty far cry from anything VLIW-like. And, at the time, VLIW chips & their compilers had already been around for decades. I believe AMD (if not also Nvidia) GPUs were even VLIW-based around then.

https://en.wikipedia.org/wiki/TeraScale_(microarchitecture)#TeraScale_1

Originally posted by audir8

The Cell was the first HSA platform which was a moderate mainstream success because Sony stuck with it, and the amount of development ultimately done for it is nothing to sneeze at. Really, I'm looking forward to more HSA architectures with things like the aforementioned SYCL.

Very interesting. I had no idea.

But, the thing is that I don't even know how meaningful that is, when the SPEs have to use DMA to even touch main memory. Other than avoiding the need to lock memory pages, I don't know how much you even gain by it.

**audir8** · 06 September 2019, 10:49 PM

Originally posted by coder

Really? The SPEs are just 2-way, in-order, with 256 KB of local RAM that should have fairly low, consistent latency. That's a pretty far cry from anything VLIW-like. And, at the time, VLIW chips & their compilers had already been around for decades. I believe AMD (if not also Nvidia) GPUs were even VLIW-based around then.

https://en.wikipedia.org/wiki/TeraSc...e)#TeraScale_1

I meant in terms of taking advantage of all the performance on offer. Making a VLIW compiler and making one (or two) that offers good multi-threading primitives/libraries in an HSA environment are two different problems. I think the PS3 and IBM figured out the latter eventually (better than Itanic did the former for perf/$ in their respective markets), but still not well enough to stick with it for another generation. If things like SYCL/OpenACC/OpenMP offload had existed for SPEs and PowerPC in 2006 with good compiler support, maybe we would have seen a Cell 2 keep up with GPUs.

Very interesting. I had no idea.

But, the thing is that I don't even know how meaningful that is, when the SPEs have to use DMA to even touch main memory. Other than avoiding the need to lock memory pages, I don't know how much you even gain by it.

I didn't mean to sound so authoritative, but Cell seems like it was the first chip with two distinct architectures sharing main memory that ended up shipping in 85M+ PS3s and a supercomputer. I think it makes sense to have the HSA APUs we have today instead of the Cell, though we haven't seen higher-end APUs with shared memory unless you count the Iris chips from Intel or Kaby Lake-G with Vega (which had HBM2). Seeing HSA support in Linux and being able to run an OpenCL kernel almost anywhere is progress, though it could all still be better supported by everyone and be more widespread.

I think better abstractions at every level matter a lot: they eventually lead to more speed, more correct programs, more optimizations, and lower development time.

**LoveRPi** · 07 September 2019, 08:35 PM

Originally posted by coder

But, the thing is that I don't even know how meaningful that is, when the SPEs have to use DMA to even touch main memory. Other than avoiding the need to lock memory pages, I don't know how much you even gain by it.

Locking is one of the areas that create huge bottlenecks in large-scale systems. DMA and explicit synchronization were by design. Think about the cache coherency problems that come from locking on 64+ core systems. On these large-scale systems, you have to explicitly synchronize your workload to get the best performance out of your application, so Cell just enforced this from a design perspective.
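To make the "explicit DMA" model a bit more concrete, here is a minimal sketch in plain Java. It is only an analogy, not SPU code: `LocalStoreSketch`, `LOCAL_ELEMS`, and `scaleInChunks` are hypothetical names, and `System.arraycopy` stands in for the DMA transfers a real SPE program would issue. The point it tries to show is that the worker never touches main memory directly; it copies a chunk into a small private buffer, computes there, and copies the result back.

```java
import java.util.Arrays;

// Plain-Java analogy only: a real SPE program would issue DMA commands
// instead of calling System.arraycopy. All names here are hypothetical.
class LocalStoreSketch {
    // Assumed size of the private buffer, in elements (chosen for illustration).
    static final int LOCAL_ELEMS = 1024;

    // Scales every element of mainMemory, touching it only via bulk copies.
    static void scaleInChunks(float[] mainMemory, float factor) {
        float[] local = new float[LOCAL_ELEMS]; // stand-in for the local store
        for (int base = 0; base < mainMemory.length; base += LOCAL_ELEMS) {
            int n = Math.min(LOCAL_ELEMS, mainMemory.length - base);
            System.arraycopy(mainMemory, base, local, 0, n); // "DMA in"
            for (int i = 0; i < n; i++) {
                local[i] *= factor;                          // compute locally
            }
            System.arraycopy(local, 0, mainMemory, base, n); // "DMA out"
        }
    }

    public static void main(String[] args) {
        float[] data = new float[5000];
        Arrays.fill(data, 2.0f);
        scaleInChunks(data, 3.0f);
        System.out.println(data[0] + " " + data[data.length - 1]);
    }
}
```

Because each chunk is owned by exactly one worker between the copy in and the copy out, no lock is needed on the data itself; the synchronization is the transfer boundary.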

**audir8** · 09 September 2019, 06:48 PM

Originally posted by LoveRPi

Locking is one of the areas that create huge bottlenecks in large-scale systems. DMA and explicit synchronization were by design. Think about the cache coherency problems that come from locking on 64+ core systems. On these large-scale systems, you have to explicitly synchronize your workload to get the best performance out of your application, so Cell just enforced this from a design perspective.

I think ARM disagrees with you: https://community.arm.com/developer/...eneous-compute

I also found this in the Cell IBM redbook on page 73: http://www.redbooks.ibm.com/redbooks/pdfs/sg247575.pdf

3.7.4 Multi-SPE software cache

We want to define a large software cache that gathers LS [256KB SPE Local Storage] space from multiple
participating SPEs.

Forces
We want to push the software cache a bit further by allowing data to be cached
not necessarily in the SPE that encounters a “miss” but also in the LS of another
SPE. The idea is to exploit the high EIB bandwidth.

Solution
We do not have a solution for this yet. The first step is to look at the cache
coherency protocols (MESI, MOESI, and MESIF) that are in use today on
multiprocessor systems and try to adapt them to the Cell/B.E. system.

This basically applies to any data sharing between SPEs over the EIB... and IBM is saying to implement your own cache coherency protocol in software. The number of software engineers who would get this right is basically zero. This would be much easier for a hardware engineer with a simulator, or even for a compiler writer with good enough documentation on the Cell. Once you're above the compiler, getting memory semantics right without compiler-provided intrinsics is going to be impossible no matter how "good" you are.

I actually do remember reading about this when the Cell came out, and people treating the SPEs as individual processors was common because doing any synchronization was so hard. CS has come a long way since the Cell, and having hardware cache coherency probably is necessary in a few workloads, but you are still free to do coarse or fine-grained locking, or use lock-free algorithms/data structures built with atomics as needed.
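As a small illustration of the "lock-free with atomics" option on a cache-coherent machine, here is a sketch in Java that tracks a running maximum with a compare-and-set retry loop. `CasMaxSketch` and its methods are hypothetical names made up for this example; only `AtomicLong` and its `get`/`compareAndSet` calls come from the standard library.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class: a shared maximum updated without any lock.
class CasMaxSketch {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Publishes candidate if it beats the current maximum.
    void offer(long candidate) {
        long seen = max.get();
        // Retry only while we would still improve the value and the CAS lost a race.
        while (candidate > seen && !max.compareAndSet(seen, candidate)) {
            seen = max.get();
        }
    }

    long get() {
        return max.get();
    }

    // Runs several threads that each offer a distinct range of values.
    static long maxAcrossThreads(int threads, int perThread) {
        CasMaxSketch shared = new CasMaxSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    shared.offer((long) id * perThread + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.get();
    }

    public static void main(String[] args) {
        System.out.println(maxAcrossThreads(4, 1000));
    }
}
```

The hardware's coherency protocol is what makes the single `compareAndSet` safe here; on the Cell's SPEs the equivalent guarantee would have to be built by hand.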

Java's LongAdder reduces contention by making several copies of a variable, and processors providing hardware cache coherency between cores are no different than a distributed system providing Consistency and Availability from the CAP theorem. If hardware cache coherency is too slow, you can move towards a more lock-free solution, but at least you'll have something working. The more work that the hardware and compiler can do, the better IMO.
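For anyone who hasn't used it, a minimal sketch of the `LongAdder` pattern looks like this. `LongAdderSketch` and `countEvents` are hypothetical names for the example; `LongAdder`, `increment()`, and `sum()` are the real `java.util.concurrent.atomic` API. Each thread bumps the adder without taking a lock, and the total is read once all writers have finished.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical example class showing a contended counter built on LongAdder.
class LongAdderSketch {
    // Starts `threads` workers that each record `perThread` events.
    static long countEvents(int threads, int perThread) {
        LongAdder hits = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    hits.increment(); // no lock taken by the caller
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        // All writers have been joined, so the sum is exact at this point.
        return hits.sum();
    }

    public static void main(String[] args) {
        System.out.println(countEvents(8, 10_000));
    }
}
```

The trade-off is that `sum()` is only a snapshot while writers are still running, which is fine for statistics but not for something like a sequence number.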

GCC 10 Compiler Drops IBM Cell Broadband Engine SPU Support
