Qualcomm Sampling 10nm 48-Core Server SoC
-
Originally posted by name99 View Post
First the whole POINT of the instructions (which are NOT usefully thought of as an extension of neon) is that they can be implemented in arbitrary width. You can implement the instructions on hardware that's, say, 128 bits wide then TRANSPARENTLY upgrade next year to 256 bit hardware, and the year after that to 512 bit hardware. That's precisely why they are not "Neon-like" (or SSE/AVX-like).
Second, don't be sure you know what Apple wants from vector instructions. Apple ships not two but THREE Neon hardware pipelines in their chips (starting, I think, with the A9; the A8 and A7, I believe, had two pipelines). Clearly they think this is hardware worth spending transistors on, which means they believe there are a number of workloads that can benefit from substantial vector performance but which are ALSO too small to be worth shipping over to the GPU.
and yes, that is the maximum, not the minimum
I realise that the new Wisconsin spec is designed to scale, as I said, up to 2048 bits, but I had a few things in mind: 1) if 128 bits is enough (and, you're right, we don't really know if it is) it's cheaper to stick with Neon; 2) we don't know how much silicon one of these units will require, but unless you are able to use them in place of a Neon unit you start looking at a pretty large increase in your silicon budget (don't forget this thing might need its own scheduler, and it will certainly need a huge issue buffer).
I said they might be interested, but that cost would be a factor. Keep in mind that SiPs/PoPs intended for mobile networking often include DSPs, which can be multi-purposed.
Btw, it appears that it was Cyclone/A7 where Apple moved to the three-unit Neon (http://www.anandtech.com/show/7910/a...cture-detailed).
-
Originally posted by name99 View Post...
I'm not saying it's a bad design that can't compete with other chips in the HPC market; what I am saying is that it's basically doing something that has more or less been done already, so it's not really anything to get super excited about.
-
Originally posted by name99 View Post
Seriously? Are you living in the 90s? Are you unaware of Apple's cores, which have been running on top-of-the-line nodes for years and are anything but simple? Hell, even if you hate Apple in your bones, the high-end Android cores are likewise built on top-of-the-line nodes and are hardly trivial. ARM designs more than just M0s, you know.
Originally posted by name99 View Post...
-
10nm FinFET, so is it going to be produced by Samsung (like other Qualcomm SoCs)?
-
Originally posted by L_A_G View Post
Sure, we don't have the exact details on this, but Xeon Phi didn't just rely on loads of cores with beefy vector units; it also had really heavy SMT (or Hyper-Threading, as Intel likes to call it) to maximize the utilization of those vector units. I've never heard of anyone creating an ARM core with SMT, so if they've done that this could basically be a Xeon Phi knockoff with ARM cores rather than Atom cores. If that's the case then this thing may have a point for compute loads, but I'm not so sure it's all that great a thing, seeing how the Xeon Phi failed to sell all that well despite how hard Intel tried to push it.
But more generally, ARM as a corporation is against SMT and will not support it in its designs. SMT is not a sign of strength, it is a sign of weakness: it says that your core is so expensive in area that you need to add even more complexity to maximize its value. ARM believes that their cores are small enough that if you want more throughput you just add more of them. SMT has never delivered the performance naive users expect, largely because the single most constrained resource on a CPU is the L1 caches, and SMT halves their effective size. So Intel's SMT gives you about the equivalent of a 25% speed boost. ARM's answer would be: if you want 5 CPUs' worth of performance, stick 5 CPUs on the die rather than 4 with SMT.
If you look at who supports SMT (Intel, Oracle, IBM), it's hard not to conclude that it's there mainly as a workaround for stupid SW licensing rules. Since ARM (at least right now...) isn't running software of that sort, it doesn't need the workaround.
So why did Broadcom add it to Vulcan? No idea. Vulcan came out of a networking team, and there may be something about network processors (packet classification and deep inspection, that sort of thing) that makes SMT valuable in that context and its flaws (giving each thread a much smaller effective L1) much less problematic, because caches aren't too useful anyway with network processing?
-
Originally posted by L_A_G View Post
It's not just the CPU bus that can be a problem, it's also what sits behind the bus. Can RAM and disk provide access times fast enough not to cause severe stalls when all 48 cores are in use? As for compute loads, embarrassingly parallel (i.e. number crunching) jobs are obviously going to be better served by hardware specifically designed for that (like GPUs and Xeon Phi boards) rather than by just sticking loads of general purpose cores on a single die.
The reason why ARM provides such good power-to-price for low power envelope solutions is that they're relatively simple and small designs that can be cranked out reliably in very large volumes on nodes that are tried and tested, with good yields and low cost. This thing, on the other hand, is being cranked out on a top-of-the-line node, and the chip itself is neither small nor simple due to the large number of cores; because of this it is going to draw comparable amounts of power to something produced by Intel. It would be great for compute loads if it weren't for the fact that it's going up against things like GPUs and Intel's Xeon Phi boards, which are obviously considerably better for highly parallel workloads. I've personally used GPUs to do general purpose compute, and boy do they provide a lot of compute power when utilized properly.
-
Originally posted by liam View Post
Apple might be interested in the new Neon instructions, but I think we'll need to see how much they are going to cost. Supporting a 2048-bit wide vector is pretty insane (and yes, that is the maximum, not the minimum), and the article specified that the instructions are targeting scientific workloads, not DSP media acceleration.
IOW, I'm not sure these will be finding their way on to our phones anytime soon.
Second, don't be sure you know what Apple wants from vector instructions. Apple ships not two but THREE Neon hardware pipelines in their chips (starting, I think, with the A9; the A8 and A7, I believe, had two pipelines). Clearly they think this is hardware worth spending transistors on, which means they believe there are a number of workloads that can benefit from substantial vector performance but which are ALSO too small to be worth shipping over to the GPU.
-
Originally posted by darkblu View Post
x86's higher density is valid until you go x86-64, where the extra instruction prefixes practically negate that advantage. Actually, SIMD-heavy x86-64 code is definitely lower-density than similar A64 (AArch64) code (where the ops continue to be 4 bytes).
ARMv8 has T2 in aarch32.
Already some server chips don't even bother shipping AArch32, and for all I know QC's is one of them.
-
Originally posted by vadix View Post
Firstly, Xeons are designed to be more power efficient, even if it's not the same as an ARM; and secondly, while an Intel core may have a more complicated pipeline, x86 programs are definitely smaller than ARM (even with Thumb) on average. Smaller program size means fewer cache misses with the same size cache. Don't mix truth and BS together.
First, no-one using ARM for servers cares about ARMv7 and Thumb; they care about ARMv8.
Secondly, ARMv8 is a DENSER ISA than x86-64. This is not a matter of opinion, it has been measured, e.g.
(look at page 62)
Third (not that it matters anymore), you're also wrong historically: ARMv7 using Thumb is ALSO denser than x86-32.