Qualcomm Sampling 10nm 48-Core Server SoC


  • #31
    Originally posted by wizard69
    As far as the new vector instructions go, I have to think that Apple and probably Qualcomm are also on board. Apple was heavily involved in AltiVec development and would be very interested in bringing such performance to the iOS lineup. Qualcomm of course is already going after the server market and might take a stab at the PC market; both businesses could leverage a high performance vector capability.

    In any event for us old guys what amazes me is that we basically have a Cray on a chip many times over. Enhanced vector capability just means even more software will run smoothly on these chips.

    As for the limit on cores, that is an interesting discussion because in the end "It Depends". I remember some reported work by Intel that indicated that their architecture had problems going past 32 cores. I can't remember the specifics about the workload, but the point is you can optimize a processor for the type of workload you expect to run on it. Beyond that, "cores" aren't really the issue; it is the cache memory and RAM interfaces that bottleneck and get extremely hot (burn power). This is where innovation can still happen. The nice thing with ARM is that there is more free space per core on the die to allocate to cache and other support circuitry.
    Apple might be interested in the new NEON instructions, but I think we'll need to see how much they are going to cost. Supporting a 2048-bit wide vector is pretty insane (and yes, that is the maximum, not the minimum), and the article specified that the instructions are targeting scientific workloads, not DSP media acceleration.
    IOW, I'm not sure these will be finding their way onto our phones anytime soon.



    • #32
      Originally posted by liam
      Assuming the bus isn't terribly designed, this lets you pay for the DRAM, NIC(s), and accelerators ONCE per 48 cores. In a best case scenario all 48 cores will be able to interleave their responses and each only be responsible for 1/48 of the power budget. The worst case is that only 1 core is active (HOPEFULLY the others are either hotplugged or in a very low C-state), occasionally servicing requests and paying for all the other hardware that would otherwise be amortized.

      If you want a specific application, Qualcomm mentioned Hadoop and Spark. To me, that suggests rather low IPC (so, relying on stupidly parallel workloads and the new ARM NEON instructions: http://www.eetimes.com/document.asp?doc_id=1330339)
      It's not just the CPU bus that can be a problem, it's also what sits behind the bus. Can RAM and disc provide access times fast enough not to cause severe stalls when all 48 cores are in use? As for compute loads, embarrassingly parallel compute (i.e. number crunching) jobs are obviously going to be better served by hardware specifically designed for that (like GPUs and Xeon Phi boards) rather than just sticking loads of general purpose cores on a single die.
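To put rough numbers on the amortization argument quoted above, here is a minimal sketch. The 48 W figure for the shared hardware (DRAM, NICs, accelerators) is purely a made-up placeholder, not something Qualcomm has published; only the 48-core count comes from the thread.

```python
def shared_power_per_active_core(shared_watts: float, active_cores: int) -> float:
    """Share of the fixed (DRAM/NIC/accelerator) power attributed to each active core."""
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    return shared_watts / active_cores


# Hypothetical placeholder: total power of the shared hardware, in watts.
SHARED_WATTS = 48.0
TOTAL_CORES = 48  # core count from the thread title

best_case = shared_power_per_active_core(SHARED_WATTS, TOTAL_CORES)  # all cores busy
worst_case = shared_power_per_active_core(SHARED_WATTS, 1)           # one core busy

print(f"best case:  {best_case:.2f} W of shared power per active core")
print(f"worst case: {worst_case:.2f} W of shared power per active core")
```

Whatever the real shared-power number is, the ratio between the two cases is the core count, which is why utilization matters so much for this kind of chip.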

      Originally posted by BillBroadley
      ARM's deal is best price/perf at phone-friendly power. If they can manage best price/perf at server power levels, all the better. Many embarrassingly parallel workloads at large companies like Google or Facebook couldn't care less about node performance. They want best performance/(total cost of ownership). That includes things like power, cooling, purchase cost, maintenance cost, error rate, etc.

      If a rack + two 30-amp 208V 3-phase PDUs + ARM ends up delivering more performance per $, then I can see it being very popular. Intel mostly specializes in maximum performance per core.
      The reason why ARM provides such good power-to-price for low power envelope solutions is that they're relatively simple and small designs that can be cranked out reliably in very large volumes using nodes that are tried and tested, with good yields and low cost. This thing, on the other hand, is being cranked out on a top-of-the-line node, and the chip itself is neither small nor simple due to the large number of cores. Because of this it is going to draw a comparable amount of power to something produced by Intel. It would be great for compute loads if it wasn't for the fact that it's going up against things like GPUs and Intel's Xeon Phi boards, which are obviously considerably better for highly parallel workloads. I've personally used GPUs to do general purpose compute, and boy do they provide a lot of compute power when utilized properly.



      • #33
        Originally posted by L_A_G

        It's not just the CPU bus that can be a problem, it's also what sits behind the bus. Can RAM and disc provide access times fast enough not to cause severe stalls when all 48 cores are in use? As for compute loads, embarrassingly parallel compute (i.e. number crunching) jobs are obviously going to be better served by hardware specifically designed for that (like GPUs and Xeon Phi boards) rather than just sticking loads of general purpose cores on a single die.
        Yes, that's true, but that's a much easier problem to deal with (especially disk access). Building a bus that can handle this kind of traffic for arbitrary loads would be very hard.
        For one thing, we don't know what kind of cores these are using. The Phi cores are, or were, just Atoms with a massively beefed up vector unit (and instructions). The other thing is that we don't know what kind of acceleration they are offering on these boards (how many PCIe lanes and which gen, or whether it is even using PCIe as opposed to something like CAPI). Knowing the answers to those questions will give us more insight into the kinds of cores Qualcomm is building.
        My hope is that they are building a very wide decode and a big reorder buffer so we can see an ARM version of a Xeon.



        • #34
          Originally posted by liam

          Yes, that's true, but that's a much easier problem to deal with (especially disk access). Building a bus that can handle this kind of traffic for arbitrary loads would be very hard.
          For one thing, we don't know what kind of cores these are using. The Phi cores are, or were, just Atoms with a massively beefed up vector unit (and instructions). The other thing is that we don't know what kind of acceleration they are offering on these boards (how many PCIe lanes and which gen, or whether it is even using PCIe as opposed to something like CAPI). Knowing the answers to those questions will give us more insight into the kinds of cores Qualcomm is building.
          My hope is that they are building a very wide decode and a big reorder buffer so we can see an ARM version of a Xeon.
          Sure, we don't have the exact details on this, but Xeon Phi didn't just rely on loads of cores with beefy vector instruction units; they also had really heavy SMT (or Hyper-Threading, as Intel likes to call it) to maximize the utilization of those vector instruction units. I've never heard of anyone creating an ARM core with SMT, so if they've done that this could basically be a Xeon Phi knockoff with ARM cores rather than Atom cores. If this is the case then this thing may have a point for compute loads, but I'm not so sure it's all that great of a thing, seeing how the Xeon Phi failed to sell all that well despite how hard Intel tried to push it.



          • #35
            Originally posted by L_A_G

            Sure, we don't have the exact details on this, but Xeon Phi didn't just rely on loads of cores with beefy vector instruction units; they also had really heavy SMT (or Hyper-Threading, as Intel likes to call it) to maximize the utilization of those vector instruction units. I've never heard of anyone creating an ARM core with SMT, so if they've done that this could basically be a Xeon Phi knockoff with ARM cores rather than Atom cores. If this is the case then this thing may have a point for compute loads, but I'm not so sure it's all that great of a thing, seeing how the Xeon Phi failed to sell all that well despite how hard Intel tried to push it.
            Hey, I'm not saying that it's a great idea, just mentioning it as a possibility.
            I've already said what I hope Qualcomm has in mind (but I'm doubtful).



            • #36
              Originally posted by L_A_G
              Sorry, but I don't really see the point in a 48-core ARM chip.

              The main point of ARM is good performance at low wattage, but with this many cores it's not going to be low wattage, which puts it squarely in the territory of Intel's Xeon and AMD's upcoming Zen-based Opteron chips. Additionally, this number of cores really isn't all that useful for anything except compute workloads, which would put it in the line of fire of Intel's Xeon Phi accelerators along with Nvidia and AMD's GPGPU products. I'd go as far as to call this thing just a flat-out solution in search of a problem.
              Uhh, the main point of ARM is to make money for ARM shareholders. So far they have done this best by selling very small cores and mobile optimized cores. That doesn't mean that's the only thing they CAN or SHOULD do. Your argument is no different from saying in 2007 "The main point of Apple is to ship Macs; why are they wasting time trying to work on a phone".

              So for the technical points:
              - many companies have indicated that they have workloads that are extremely parallelizable and would like a core that's a good match for those workloads.
              - Xeons do the job adequately but cost a damn fortune (as you move up to the higher core counts), and in the midrange Intel likes to cripple their memory capabilities so that if you want very high bandwidth and/or large RAM you have to go to the expensive cores.
              - there are multiple different "compute" workloads. If you're in the HPC (double-precision floating point) business, Phi may be a good match for your tasks. For many other purposes, not so much. These other purposes might include a lot of neural net/AI stuff (GPU works better because you don't need that double precision performance but you are paying for it in area/power/dollars), a lot of integer/pointer stuff (graph workloads), and a lot of memory intensive stuff (memcached and some NoSQL style workloads).

              It IS true that some companies have released weak ARM server chips over the past few years with very little real-world relevance. But it is ALSO true that ARM (the company, check out their annual reports) have only ever said that they expect to make a (small) mark in servers starting in 2017, and that they don't expect substantial market share until 2020. The primary reason for those early weak server chips was to create testbeds for bringing up the ARM ecosystem, and they have done that adequately and at about the speed expected. The 2017 generation of chips will receive some commercial deployments but, honestly, once again their job is primarily as learning vehicles, this time for optimization learning: on the SW side, as memcached, MySQL, nginx, and the other usual suspects learn how to optimize their code to the characteristics of the ARMv8 ISA, memory model, synchronization model, and common fabrics; and on the HW side, as each chip vendor learns the most important real-world weaknesses of its first serious server chip attempt.

              I'd say, realistically, that ARM are very much on track with the schedule as they envisaged it around 2012 or 2013 and that so far there's no reason to assume they won't continue on that schedule.



              • #37
                Originally posted by droidhacker

                That is, assuming that they actually *are* particularly small cores. This, of course, depends on how radically different Qualcomm has made this new core relative to previously known power efficient ARM cores. If Intel can make an x86 core suck as bad as an x3-C3230, then maybe Qualcomm can make a core that is impressive when put up against more mainstream x86s. Of course, I'm not suggesting that they can suddenly be right up there with the highest end server cores, but maybe with 48 of them....

                And for that matter, MOST servers run many small jobs that are very highly parallelizable.
                Apple can certainly make a core that gives any Intel core below about 3.5GHz a run for its money. The Hurricane core (in TSMC 16nmFF+) is about 4.2mm^2. 50 of those on a die only takes you to 210 mm^2. Of course a useful server has a whole lot of other stuff on the die (memory controllers, lotsa L3, routing fabric, etc.), but the point is that we have an existence proof TODAY that people other than Intel can make a compelling core that's small enough to put on a die in large numbers.
                I don't expect QC's core to be as good in single-threaded performance as the A10, but I see no reason that it can't be "good enough". Especially since, as I said, the role of the 2017 gen of these server chips is not YET to deploy in volume and make money; it is to perform the last round of optimization and learning before the serious server chips ship.
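The die-area arithmetic in this post is easy to redo for other core counts. A minimal sketch, using only the 4.2 mm^2 per-core figure quoted above; it deliberately leaves out memory controllers, L3, and fabric, since no numbers for those appear in the thread.

```python
# Per-core area quoted in the post for Apple's Hurricane core (TSMC 16nmFF+).
HURRICANE_CORE_MM2 = 4.2


def core_area_mm2(core_count: int, area_per_core_mm2: float = HURRICANE_CORE_MM2) -> float:
    """Die area taken by the cores alone (no caches, fabric, or memory controllers)."""
    if core_count < 0:
        raise ValueError("core_count must be non-negative")
    return core_count * area_per_core_mm2


for n in (48, 50):
    print(f"{n} cores: {core_area_mm2(n):.1f} mm^2 of core logic")
```

This is only the lower bound on die size that the post is arguing from; the uncore would add to it.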



                • #38
                  Originally posted by vadix

                  Firstly, Xeons are designed to be more power efficient, even if it's not the same as an ARM, and secondly, while an Intel core may have a more complicated pipeline, x86 programs are definitely smaller than ARM (even with Thumb) on average. Smaller program size means fewer cache misses with the same size cache. Don't mix truth and BS together.
                  I don't know where you're getting your information from, but you're stuck somewhere in the 90s.

                  First, no one using ARM for servers cares about ARMv7 and Thumb; they care about ARMv8.

                  Secondly ARMv8 is a DENSER ISA than x86-64. This is not a matter of opinion, it has been measured. eg

                  (look at page 62)

                  Third (not that it matters anymore) you're also wrong historically. ARMv7 using Thumb is ALSO more dense than x86-32.



                  • #39
                    Originally posted by darkblu
                    x86's higher density is valid until you go x86-64, where the extra instruction prefixes practically negate that advantage. Actually, SIMD-heavy x86-64 code is definitely lower-density than similar A64 (aarch64) code (where the ops continue to be 4 bytes).


                    ARMv8 has T2 in aarch32.
                    This may be technically true but it's irrelevant. The future is AArch64. Apple has been sending strong messages for years now to developers that shipping 32-bit iOS code is unacceptable. (iOS 10 now puts up a scary warning when you use a 32-bit app that doesn't exactly say "this app is a PoS and probably malware" but strongly implies it.) I would not be at all surprised if the A11 is the core where Apple drops AArch32 support.
                    Already some server chips don't even bother shipping AArch32, and for all I know, QC is one of them.



                    • #40
                      Originally posted by liam

                      Apple might be interested in the new NEON instructions, but I think we'll need to see how much they are going to cost. Supporting a 2048-bit wide vector is pretty insane (and yes, that is the maximum, not the minimum), and the article specified that the instructions are targeting scientific workloads, not DSP media acceleration.
                      IOW, I'm not sure these will be finding their way onto our phones anytime soon.
                      First, the whole POINT of the instructions (which are NOT usefully thought of as an extension of NEON) is that they can be implemented in arbitrary width. You can implement the instructions on hardware that's, say, 128 bits wide, then TRANSPARENTLY upgrade next year to 256-bit hardware, and the year after that to 512-bit hardware. That's precisely why they are not "NEON-like" (or SSE/AVX-like).

                      Second, don't be sure you know what Apple wants from vector instructions. Apple ships not two but THREE NEON hardware pipelines in their chips (starting, I think, with the A9; I believe the A8 and A7 had two pipelines). Clearly they think this is hardware worth spending transistors on, which means they believe they support a number of workloads that can benefit from substantial vector performance but which ALSO are too small to be shipped over to the GPU.
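To illustrate what "implemented in arbitrary width" means in practice, here is a toy model in Python, not real ARM code or intrinsics. The loop never hard-codes the vector width; it asks for the lane count at run time and masks off the tail, so the same "program" gives the same answer on 128-, 256-, 512-, or 2048-bit "hardware". The function name, the 32-bit element size, and the multiple-of-128 check are assumptions made for the sketch.

```python
ELEMENT_BITS = 32  # assumption for this sketch: 32-bit elements


def vla_add(a, b, vector_bits):
    """Element-wise a + b using a strip-mined loop that adapts to the vector width."""
    # Assumed constraint for the model: widths are multiples of 128 up to the
    # 2048-bit maximum mentioned in the thread.
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector_bits must be a multiple of 128 between 128 and 2048")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    lanes = vector_bits // ELEMENT_BITS  # discovered at run time, never hard-coded
    n = len(a)
    out = [0.0] * n
    i = 0
    while i < n:
        # "Predicate": only the lanes that still map to real elements are active,
        # so no separate scalar clean-up loop is needed for the tail.
        active = min(lanes, n - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


a = [float(i) for i in range(10)]
b = [1.0] * 10
results = {bits: vla_add(a, b, bits) for bits in (128, 256, 512, 2048)}
print(all(r == results[128] for r in results.values()))
```

With a fixed-width ISA in the SSE/AVX style, the lane count is baked into the instructions, so wider hardware needs recompiled (or rewritten) code; in this model only the number of loop iterations changes.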
