Qualcomm Sampling 10nm 48-Core Server SoC
-
Originally posted by name99 View Post
First the whole POINT of the instructions (which are NOT usefully thought of as an extension of neon) is that they can be implemented in arbitrary width. You can implement the instructions on hardware that's, say, 128 bits wide then TRANSPARENTLY upgrade next year to 256 bit hardware, and the year after that to 512 bit hardware. That's precisely why they are not "Neon-like" (or SSE/AVX-like).
Second, don't be sure you know what Apple wants from vector instructions. Apple ships not two but THREE Neon hardware pipelines in their chips (starting, I think, with the A9; the A8 and A7, I believe, had two pipelines). Clearly they think this is hardware worth spending transistors on, which means they believe there are a number of workloads that can benefit from substantial vector performance but which are ALSO too small to be worth shipping over to the GPU.
and yes, that is the maximum, not the minimum
I realise that the new Wisconsin spec is designed to scale, as I said, up to 2048 bits, but I had a few things in mind: 1) if 128 bits is enough (and, you're right, we don't really know if it is) it's cheaper to stick with Neon; 2) we don't know how much silicon one of these units will require, but unless you are able to use them in place of a Neon unit you start looking at a pretty large increase in your silicon budget (don't forget this thing might need its own scheduler, and it will certainly need a huge issue buffer).
I said they might be interested, but that cost would be a factor. Keep in mind that SiPs/PoPs intended for mobile networking often include DSPs, which can be multi-purposed.
Btw, it appears that it was Cyclone/A7 where Apple moved to the three-unit Neon (http://www.anandtech.com/show/7910/a...cture-detailed).
-
Originally posted by name99 View Post...
I'm not saying it's a bad design that can't compete with other chips in the HPC market; what I am saying is that it's basically doing something that has more or less been done already, so it's not really anything to get super excited about.
-
Originally posted by name99 View Post
Seriously? Are you living in the 90s? Are you unaware of Apple's cores, which have been running on top-of-the-line nodes for years and are anything but simple? Hell, even if you hate Apple in your bones, the high-end Android cores are likewise built on top-of-the-line nodes and are hardly trivial. ARM designs more than just M0s, you know.
Originally posted by name99 View Post...
-
10nm FinFET, so is it going to be produced by Samsung (like other Qualcomm SoCs)?
-
Originally posted by L_A_G View Post
Sure, we don't have the exact details on this, but Xeon Phi didn't just rely on loads of cores with beefy vector units; it also had really heavy SMT (or Hyper-Threading, as Intel likes to call it) to maximize the utilization of those vector units. I've never heard of anyone creating an ARM core with SMT, so if they've done that this could basically be a Xeon Phi knockoff with ARM cores rather than Atom cores. If that's the case then this thing may have a point for compute loads, but I'm not so sure it's all that great a thing, seeing how the Xeon Phi failed to sell all that well despite how hard Intel tried to push it.
But more generally, ARM as a corporation is against SMT and will not support it in its designs. SMT is not a sign of strength, it is a sign of weakness: it says that your core is so expensive in area that you need to add even more complexity to maximize its value. ARM believes that their cores are small enough that if you want more throughput you just add more of them. SMT has never delivered the performance naive users expect, largely because the single most constrained resource on a CPU is the L1 caches, and SMT halves their effective size. So Intel's SMT gives you about the equivalent of a 25% speed boost. ARM's answer would be: if you want 5 CPUs' worth of performance, stick 5 CPUs on the die rather than 4 with SMT.
If you look at who supports SMT (Intel, Oracle, IBM), it's hard not to conclude that it's there mainly as a workaround for stupid SW licensing rules. Since ARM (at least right now...) isn't running software of that sort, it doesn't need the workaround.
So why did Broadcom add it to Vulcan? No idea. Vulcan came out of a networking team, and there may be something about network processors (packet classification and deep inspection, that sort of thing) that makes SMT valuable in that context and its flaws (giving each thread a much smaller effective L1) much less problematic, because caches aren't too useful anyway with network processing?
-
Originally posted by L_A_G View Post
It's not just the CPU bus that can be a problem, it's also what sits behind the bus. Can RAM and disk provide access times fast enough not to cause severe stalls when all 48 cores are in use? As for compute loads, embarrassingly parallel (i.e. number crunching) jobs are obviously going to be better served by hardware specifically designed for that (like GPUs and Xeon Phi boards) rather than by just sticking loads of general purpose cores on a single die.
The reason why ARM provides such good power-to-price for low power envelope solutions is that they're relatively simple and small designs that can be cranked out reliably in very large volumes on nodes that are tried and tested, with good yields and low cost. This thing, on the other hand, is being cranked out on a top-of-the-line node, and the chip itself is neither small nor simple due to the large number of cores; because of this it is going to draw comparable amounts of power to something produced by Intel. It would be great for compute loads if it weren't for the fact that it's going up against things like GPUs and Intel's Xeon Phi boards, which are obviously considerably better for highly parallel workloads. I've personally used GPUs to do general purpose compute, and boy do they provide a lot of compute power when utilized properly.
-
Originally posted by liam View Post
Apple might be interested in the new Neon instructions, but I think we'll need to see how much they are going to cost. Supporting a 2048-bit wide vector is pretty insane (and yes, that is the maximum, not the minimum), and the article specified that the instructions are targeting scientific workloads, not DSP media acceleration.
IOW, I'm not sure these will be finding their way on to our phones anytime soon.
Second, don't be sure you know what Apple wants from vector instructions. Apple ships not two but THREE Neon hardware pipelines in their chips (starting, I think, with the A9; the A8 and A7, I believe, had two pipelines). Clearly they think this is hardware worth spending transistors on, which means they believe there are a number of workloads that can benefit from substantial vector performance but which are ALSO too small to be worth shipping over to the GPU.
-
Originally posted by darkblu View Post
x86's higher density is valid until you go x86-64, where the extra instruction prefixes practically negate that advantage. Actually, SIMD-heavy x86-64 code is definitely lower-density than similar A64 (AArch64) code (where the ops continue to be 4 bytes).
ARMv8 has T2 in aarch32.
Already some server chips don't even bother shipping AArch32, and for all I know QC's is one of them.
-
Originally posted by vadix View Post
Firstly, Xeons are designed to be more power efficient, even if it's not the same as an ARM; and secondly, while an Intel core may have a more complicated pipeline, x86 programs are definitely smaller than ARM (even with Thumb) on average. Smaller program size means fewer cache misses with the same size cache. Don't mix truth and BS together.
First, no-one using ARM for servers cares about ARMv7 and Thumb; they care about ARMv8.
Secondly, ARMv8 is a DENSER ISA than x86-64. This is not a matter of opinion, it has been measured, e.g.
(look at page 62)
Third (not that it matters anymore), you're also wrong historically: ARMv7 using Thumb is ALSO denser than x86-32.