Qualcomm Sampling 10nm 48-Core Server SoC


  • name99
    replied
    Originally posted by droidhacker View Post

    That is, assuming that the cores actually *are* particularly small. This, of course, depends on how radical Qualcomm's new core is relative to previously known power-efficient ARM cores. If Intel can make an x86 core suck as badly as the x3-C3230, then maybe Qualcomm can make a core that is impressive when put up against more mainstream x86s. Of course, I'm not suggesting that they can suddenly be right up there with the highest-end server cores, but maybe with 48 of them....

    And for that matter, MOST servers run many small jobs that are very highly parallelizable.
    Apple can certainly make a core that gives any Intel core below about 3.5GHz a run for its money. The Hurricane core (in TSMC 16nmFF+) is about 4.2mm^2, so 50 of those on a die only takes you to 210 mm^2. Of course a useful server has a whole lot of other stuff on the die (memory controllers, lotsa L3, routing fabric, etc.), but the point is that we have an existence proof TODAY that people other than Intel can make a compelling core that's small enough to put on a die in large numbers.
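    To make that arithmetic concrete, here's a minimal C sketch of the die-area estimate; the core area and count come from this post, and the reticle-class figure in the comment is just an assumption for scale, not a spec:

        #include <stdio.h>

        int main(void) {
            /* Figures from the post above: Hurricane core on TSMC 16nmFF+. */
            const double core_area_mm2 = 4.2;
            const int    cores         = 50;
            double core_total = core_area_mm2 * cores;   /* ~210 mm^2 */
            printf("%d cores: %.0f mm^2 of core area\n", cores, core_total);
            /* A real server die adds L3, memory controllers, fabric, and I/O;
               even doubling this budget stays well under a ~600 mm^2
               reticle-class die (assumed for scale). */
            return 0;
        }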
    I don't expect QC's core to be as good in single-threaded performance as the A10, but I see no reason that it can't be "good enough". Especially since, as I said, the role of the 2017 gen of these server chips is not YET to deploy in volume and make money; it is to perform the last round of optimization and learning before the serious server chips ship.



  • name99
    replied
    Originally posted by L_A_G View Post
    Sorry, but I don't really see the point in a 48-core ARM chip.

    The main point of ARM is good performance at low wattage, but with this many cores it's not going to be low wattage, which puts it squarely in the territory of Intel's Xeon and AMD's upcoming Zen-based Opteron chips. Additionally, this number of cores really isn't all that useful for anything except compute workloads, which would put it in the line of fire of Intel's Xeon Phi accelerators along with Nvidia's and AMD's GPGPU products. I'd go as far as to call this thing just a flat-out solution in search of a problem.
    Uhh, the main point of ARM is to make money for ARM shareholders. So far they have done this best by selling very small cores and mobile-optimized cores. That doesn't mean that's the only thing they CAN or SHOULD do. Your argument is no different from saying in 2007, "The main point of Apple is to ship Macs; why are they wasting time trying to work on a phone?"

    So for the technical points:
    - many companies have indicated that they have workloads that are extremely parallelizable and would like a core that's a good match for those workloads.
    - Xeons do the job adequately but cost a damn fortune (as you move up to the higher core counts), and in the midrange Intel likes to cripple their memory capabilities, so that if you want very high bandwidth and/or large RAM you have to go to the expensive cores
    - there are multiple different "compute" workloads. If you're in the HPC (double-precision floating point) business, Phi may be a good match for your tasks. For many other purposes, not so much. These other purposes might include a lot of neural net/AI stuff (GPU works better because you don't need that double-precision performance but you are paying for it in area/power/dollars), a lot of integer/pointer stuff (graph workloads), and a lot of memory-intensive stuff (memcached and some NoSQL-style workloads).

    It IS true that some companies have released weak ARM server chips over the past few years with very little real-world relevance. But it is ALSO true that ARM (the company; check out their annual reports) have only ever said that they expect to make a (small) mark in servers starting in 2017, and that they don't expect substantial market share until 2020. The primary reason for those early weak server chips was to create testbeds for bringing up the ARM ecosystem, and they have done that adequately and at about the speed expected. The 2017 generation of chips will receive some commercial deployments but, honestly, once again their job is primarily as learning vehicles, this time optimization learning: on the SW side as memcached, MySQL, nginx, and the other usual suspects learn how to optimize their code for the characteristics of the ARMv8 ISA, memory model, synchronization model, and common fabrics, and on the HW side as each chip vendor learns the most important real-world weaknesses of its first serious server chip attempt.

    I'd say, realistically, that ARM are very much on track with the schedule as they envisaged it around 2012 or 2013 and that so far there's no reason to assume they won't continue on that schedule.



  • liam
    replied
    Originally posted by L_A_G View Post

    Sure, we don't have the exact details on this, but Xeon Phi didn't just rely on loads of cores with beefy vector units; it also had really heavy SMT (or Hyper-Threading, as Intel likes to call it) to maximize the utilization of those vector units. I've never heard of anyone creating an ARM core with SMT, so if they've done that, this could basically be a Xeon Phi knockoff with ARM cores rather than Atom cores. If that's the case then this thing may have a point for compute loads, but I'm not so sure it's all that great a thing, seeing how the Xeon Phi failed to sell all that well despite how hard Intel tried to push it.
    Hey, I'm not saying that it's a great idea, just mentioning it as a possibility.
    I've already said what I hope Qualcomm has in mind (but I'm doubtful).



  • L_A_G
    replied
    Originally posted by liam View Post

    Yes, that's true, but that's a much easier problem to deal with (especially disk access). Building a bus that can handle this kind of traffic for arbitrary loads would be very hard.
    For one thing, we don't know what kind of cores these are using. The Phi cores are, or were, just Atoms with a massively beefed-up vector unit (and instructions). The other thing is that we don't know what kind of acceleration they are offering on these boards (how many PCIe lanes and which gen? Is it even using PCIe as opposed to something like CAPI?). Knowing the answers to those questions will give us more insight into the kinds of cores Qualcomm is building.
    My hope is that they are building very wide decode and a big reorder buffer, so we can see an ARM version of a Xeon.
    Sure, we don't have the exact details on this, but Xeon Phi didn't just rely on loads of cores with beefy vector units; it also had really heavy SMT (or Hyper-Threading, as Intel likes to call it) to maximize the utilization of those vector units. I've never heard of anyone creating an ARM core with SMT, so if they've done that, this could basically be a Xeon Phi knockoff with ARM cores rather than Atom cores. If that's the case then this thing may have a point for compute loads, but I'm not so sure it's all that great a thing, seeing how the Xeon Phi failed to sell all that well despite how hard Intel tried to push it.



  • liam
    replied
    Originally posted by L_A_G View Post

    It's not just the CPU bus that can be a problem; it's also what sits behind the bus. Can RAM and disk provide access times fast enough not to cause severe stalls when all 48 cores are in use? As for compute loads, embarrassingly parallel compute (i.e. number-crunching) jobs are obviously going to be better served by hardware specifically designed for that (like GPUs and Xeon Phi boards) rather than by just sticking loads of general-purpose cores on a single die.
    Yes, that's true, but that's a much easier problem to deal with (especially disk access). Building a bus that can handle this kind of traffic for arbitrary loads would be very hard.
    For one thing, we don't know what kind of cores these are using. The Phi cores are, or were, just Atoms with a massively beefed-up vector unit (and instructions). The other thing is that we don't know what kind of acceleration they are offering on these boards (how many PCIe lanes and which gen? Is it even using PCIe as opposed to something like CAPI?). Knowing the answers to those questions will give us more insight into the kinds of cores Qualcomm is building.
    My hope is that they are building very wide decode and a big reorder buffer, so we can see an ARM version of a Xeon.



  • L_A_G
    replied
    Originally posted by liam View Post
    Assuming the bus isn't terribly designed, this lets you pay for the DRAM, NIC(s), and accelerators ONCE per 48 cores. In a best-case scenario all 48 cores will be able to interleave their responses and each be responsible for only 1/48 of the power budget. The worst case is only 1 core active (HOPEFULLY the others are either hotplugged or in a very low C-state), occasionally servicing requests and paying for all the other hardware that would otherwise be amortized.

    If you want a specific application, Qualcomm mentioned Hadoop and Spark. To me, that suggests rather low IPC (so, relying on stupidly parallel workloads and the new ARM vector instructions: http://www.eetimes.com/document.asp?doc_id=1330339).
    It's not just the CPU bus that can be a problem; it's also what sits behind the bus. Can RAM and disk provide access times fast enough not to cause severe stalls when all 48 cores are in use? As for compute loads, embarrassingly parallel compute (i.e. number-crunching) jobs are obviously going to be better served by hardware specifically designed for that (like GPUs and Xeon Phi boards) rather than by just sticking loads of general-purpose cores on a single die.
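    To put rough numbers on the stall question, here's a back-of-the-envelope C sketch; every figure in it (per-core bandwidth demand, channel count, DDR4 speed) is an illustrative assumption, since nothing about this chip's memory subsystem has been published:

        #include <stdio.h>

        int main(void) {
            /* Assumed values for illustration only, not Qualcomm specs. */
            const int    cores           = 48;
            const double per_core_gbs    = 2.0;   /* sustained GB/s demanded per core */
            const int    channels        = 6;     /* hypothetical DDR4 channel count  */
            const double gbs_per_channel = 19.2;  /* DDR4-2400: 8 bytes x 2400 MT/s   */

            double demand = cores * per_core_gbs;       /* 96 GB/s   */
            double supply = channels * gbs_per_channel; /* ~115 GB/s */
            printf("demand %.0f GB/s vs. supply %.1f GB/s\n", demand, supply);
            /* Plausible for throughput workloads, but a handful of
               bandwidth-hungry cores could still starve the rest. */
            return 0;
        }

    The takeaway being that feasibility hinges entirely on the channel count and on how bursty the per-core demand is.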

    Originally posted by BillBroadley View Post
    ARM's deal is best price/perf at phone-friendly power. If they can manage best price/perf at server power levels, all the better. Many embarrassingly parallel workloads at large companies like Google or Facebook couldn't care less about node performance. They want best performance/(total cost of ownership). That includes things like power, cooling, purchase cost, maintenance cost, error rate, etc.

    If a rack + two 30-amp 208V 3-phase PDUs + ARM ends up delivering more performance per $, then I can see it being very popular. Intel mostly specializes in maximum performance per core.
    The reason ARM provides such good price-to-performance in low-power-envelope solutions is that they're relatively simple, small designs that can be cranked out reliably in very large volumes on nodes that are tried and tested, with good yields and low cost. This thing, on the other hand, is being produced on a top-of-the-line node, and the chip itself is neither small nor simple due to the large number of cores; because of that it's going to draw comparable amounts of power to something produced by Intel. It would be great for compute loads if it weren't for the fact that it's going up against things like GPUs and Intel's Xeon Phi boards, which are obviously considerably better for highly parallel workloads. I've personally used GPUs to do general-purpose compute, and boy do they provide a lot of compute power when utilized properly.



  • liam
    replied
    Originally posted by wizard69 View Post
    As far as the new vector instructions go, I have to think that Apple and probably Qualcomm are also on board. Apple was heavily involved in AltiVec development and would be very interested in bringing such performance to the iOS lineup. Qualcomm of course is already going after the server market and might take a stab at the PC market; both businesses could leverage a high-performance vector capability.

    In any event for us old guys what amazes me is that we basically have a Cray on a chip many times over. Enhanced vector capability just means even more software will run smoothly on these chips.

    As for the limit on cores, that is an interesting discussion because in the end "it depends". I remember some reported work by Intel that indicated that their architecture had problems going past 32 cores. I can't remember the specifics of the workload, but the point is you can optimize a processor for the type of workload you expect to run on it. Beyond that, "cores" aren't really the issue; it is the cache memory and RAM interfaces that bottleneck and get extremely hot (burn power). This is where innovation can still happen; the nice thing with ARM is that there is more free space per core on the die to allocate to cache and other support circuitry.
    Apple might be interested in the new vector (SVE) instructions but I think we'll need to see how much they are going to cost. Supporting a 2048-bit-wide vector is pretty insane (and yes, that is the maximum, not the minimum), and the article specified that the instructions are targeting scientific workloads, not DSP/media acceleration.
    IOW, I'm not sure these will be finding their way onto our phones anytime soon.
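    That said, the width worry is softened by SVE being vector-length agnostic: the same binary runs whether the hardware implements 128-bit or 2048-bit vectors. A minimal C sketch using the ACLE SVE intrinsics (assuming an SVE-capable compiler and target; illustrative only):

        #include <arm_sve.h>

        /* dst[i] = a[i] + b[i], without ever hard-coding the vector width. */
        void vadd_f32(float *dst, const float *a, const float *b, long n) {
            for (long i = 0; i < n; i += svcntw()) {   /* svcntw(): 32-bit lanes per vector */
                svbool_t pg = svwhilelt_b32(i, n);     /* predicate masks the loop tail     */
                svfloat32_t va = svld1_f32(pg, a + i);
                svfloat32_t vb = svld1_f32(pg, b + i);
                svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));
            }
        }

    So a phone core could implement the narrow end and still run the same code; whether Apple or Qualcomm bother is another matter.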



  • Brane215
    replied
    Originally posted by BillBroadley View Post

    These days the x86 ISA is mostly just a compatible binary format. It's NOT directly executed, or even cached. The decoder breaks it into micro-ops which are RISC-like (fixed width, simple, etc.). x86s on the inside are basically out-of-order RISC cores. The micro-ops are cached, speculatively executed, retired, etc. Sure, there's a bit of extra complexity, but it's very minor. If you look at the transistor budget, the slightly bigger decoder is a very minor issue. That's why no other architecture does significantly better than x86 on complexity.
    Look at the _power_ budget for that. Who cares about transistors? They are cheap, at least in theory. In practice, the ones used in the decoder stage are under far greater pressure than some cell in the L3 cache. Every switch costs some area, power, propagation delay and heating load, and emits EMI into the environment that then has to be dealt with.

    It's not the same when you have a nice 32-bit instruction format as when you have friggin' 8-bit prefixes, swamps of obsolete instructions, etc. Yes, you can translate around that, but it's gonna cost you: in TDP, in area used, in man-years spent on design, etc.




  • darkblu
    replied
    Originally posted by vadix View Post

    Firstly, Xeons are designed to be more power efficient, even if it's not the same as an ARM, and secondly, while an Intel core may have a more complicated pipeline, x86 programs are definitely smaller than ARM (even with Thumb) on average. Smaller program size means fewer cache misses with the same size cache. Don't mix truth and BS together.
    x86's higher density is valid until you go x86-64, where the extra instruction prefixes practically negate that advantage. Actually, SIMD-heavy x86-64 code is definitely lower-density than similar A64 (aarch64) code (where the ops continue to be 4 bytes).

    Originally posted by renox View Post

    Half of what you wrote is false: CISC ISAs tend to have higher code density than RISC ISAs.
    Which is why ARM has Thumb/Thumb-2 and MIPS has the MIPS16 extension, to be able to get nearly the same code density as x86.
    Surprisingly, ARMv8 doesn't have a 16-bit extension...
    ARMv8 has Thumb-2 in AArch32.



  • wizard69
    replied
    Originally posted by BillBroadley View Post

    ARM's deal is best price/perf at phone-friendly power. If they can manage best price/perf at server power levels, all the better. Many embarrassingly parallel workloads at large companies like Google or Facebook couldn't care less about node performance. They want best performance/(total cost of ownership). That includes things like power, cooling, purchase cost, maintenance cost, error rate, etc.
    If you look at Apple's latest A-series processors you will see that they get very good performance while maintaining very good thermals. In fact, I'd have to say the cores are already good enough to implement in a many-core chip and use that chip in servers or even desktops, especially if the cores can have their clock rates increased.
    If a rack + two 30-amp 208V 3-phase PDUs + ARM ends up delivering more performance per $, then I can see it being very popular. Intel mostly specializes in maximum performance per core.
    Intel seems to be all over the place with respect to performance per watt. Usually the very low-power chips are also very low performers. I think what is really telling about Intel is the space wasted on die for each one of their cores. Their big cores put them at a pretty huge disadvantage relative to ARM. I suspect that this is part of the rush to ARM: more space on die to implement cores or to add support circuitry.

