Qualcomm Sampling 10nm 48-Core Server SoC


  • #21
    Originally posted by Brane215:
    x86 is power hungry and has a horrific ISA format, which means a higher code footprint, a hungrier decoder unit and caches, and extra complications with instruction translation.
    And even when all this is solved on a technical level, you still end up with legal and licensing limitations. Intel's only alternative is AMD, and that's it.

    The ARM scene is wide open to new players, and by its nature it doesn't even insist on ARM. Whoever decided to recompile his/her code for ARM knows that there isn't much to stop him from doing it again for something completely different.

    Also, now that applications are using multithreading more and more, single-thread performance is not that essential any more, which means operating in an area where ARM is much more comfortable: with a higher count of more power-efficient cores.

    Also, Samsung, Qualcomm and the like aren't that far behind Intel WRT pure CPU muscle or uncore material.

    If a nice, speedy 32- or 64-core ARM/MIPS/POWER were available on an xATX board, I wouldn't lose a nanosecond contemplating Zen.

    These days the x86 ISA is mostly just a compatible binary format. It's NOT directly executed, or even cached. The decoder breaks it into micro-ops, which are RISC-like (fixed width, simple, etc.). x86s on the inside are basically out-of-order RISC cores. The micro-ops are cached, speculatively executed, retired, etc. Sure, there's a bit of extra complexity, but it's very minor. If you look at the transistor budget, the slightly bigger decoder is a very minor issue. That's why no other architecture does significantly better than x86.
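[Editor's note] The decode step described above can be sketched with a toy model: a variable-length "CISC-like" byte stream is translated into fixed-width micro-ops before execution. The opcodes and encodings below are invented purely for illustration; they are not real x86.

```python
# Toy decoder: variable-length instructions in, fixed-width micro-ops out.
# The ISA table and encodings are invented for illustration only.

# invented ISA: opcode byte -> (mnemonic, number of operand bytes)
ISA = {
    0x01: ("add", 2),   # add reg, reg
    0x02: ("load", 3),  # load reg, [addr16]
    0x03: ("nop", 0),
}

def decode(byte_stream):
    """Yield fixed-width micro-ops (4-slot tuples) from variable-length input."""
    i = 0
    uops = []
    while i < len(byte_stream):
        op = byte_stream[i]
        mnemonic, n_operands = ISA[op]
        operands = list(byte_stream[i + 1 : i + 1 + n_operands])
        # pad every micro-op to the same fixed 4-slot format
        uops.append((mnemonic, *operands, *[0] * (3 - n_operands)))
        i += 1 + n_operands
    return uops

program = bytes([0x02, 0x05, 0x10, 0x00,  # load r5, [0x0010]
                 0x01, 0x05, 0x06,        # add r5, r6
                 0x03])                   # nop
for uop in decode(program):
    print(uop)
```

Once decoded, the back end only ever sees the uniform micro-op format, which is the sense in which the x86 encoding is "just a compatible binary format".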



  • #22
    Originally posted by Brane215:
    x86 is power hungry and has a horrific ISA format, which means a higher code footprint, a hungrier decoder unit and caches, and extra complications with instruction translation.
    And even when all this is solved on a technical level, you still end up with legal and licensing limitations. Intel's only alternative is AMD, and that's it.

    The ARM scene is wide open to new players, and by its nature it doesn't even insist on ARM. Whoever decided to recompile his/her code for ARM knows that there isn't much to stop him from doing it again for something completely different.

    Also, now that applications are using multithreading more and more, single-thread performance is not that essential any more, which means operating in an area where ARM is much more comfortable: with a higher count of more power-efficient cores.

    Also, Samsung, Qualcomm and the like aren't that far behind Intel WRT pure CPU muscle or uncore material.

    If a nice, speedy 32- or 64-core ARM/MIPS/POWER were available on an xATX board, I wouldn't lose a nanosecond contemplating Zen.

    Firstly, Xeons are designed to be more power efficient, even if it's not the same as an ARM, and secondly, while Intel's cores may have a more complicated pipeline, x86 programs are definitely smaller than ARM (even with Thumb) on average. Smaller program size means fewer cache misses with the same size cache. Don't mix truth and BS together.
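[Editor's note] The footprint-vs-cache-misses argument above can be put in back-of-envelope form: with a fixed-size instruction cache, a denser encoding leaves a larger fraction of the hot code resident. All sizes below are invented for illustration.

```python
# Naive model: fraction of a program's hot code that fits in the i-cache.
# Cache and footprint sizes are illustrative, not measurements.

ICACHE_BYTES = 32 * 1024  # a typical L1 instruction cache size

def resident_fraction(code_footprint_bytes, icache_bytes=ICACHE_BYTES):
    """Fraction of the hot code resident in the i-cache (naive model)."""
    return min(1.0, icache_bytes / code_footprint_bytes)

dense_isa = 40 * 1024    # hypothetical denser encoding: 40 KiB of hot code
sparse_isa = 52 * 1024   # same program, less dense encoding: 52 KiB

print(f"dense:  {resident_fraction(dense_isa):.0%} resident")
print(f"sparse: {resident_fraction(sparse_isa):.0%} resident")
```

The model ignores associativity and access patterns, but it captures the direction of the claim: same cache, smaller footprint, fewer capacity misses.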



  • #23
    Originally posted by Brane215:
    x86 is power hungry and has a horrific ISA format, which means a higher code footprint, a hungrier decoder unit and caches, and extra complications with instruction translation.
    Half of what you wrote is false: CISC ISAs tend to have higher code density than RISC ISAs.
    That's why ARM has the Thumb/Thumb2 extensions and MIPS has the MIPS16 extension: to get nearly the same code density as x86.
    Surprisingly, ARMv8 doesn't have a 16-bit extension.



  • #24
    Originally posted by DMJC:
    Who cares about x86 if your application is entirely written in HTML and being run on Linux?
    HTML in the kernel? Torvalds won't like that.



  • #25
    Originally posted by liam:

    Assuming the bus isn't terribly designed, this lets you pay for the DRAM, NIC(s) and accelerators ONCE per 48 cores. In the best case, all 48 cores will be able to interleave their responses and each be responsible for only 1/48 of the power budget. In the worst case, only 1 core is active (HOPEFULLY the others are either hotplugged or in a very low C-state) while occasionally servicing requests, and it pays for all the other hardware that would otherwise be amortized.
    If you want a specific application, Qualcomm mentioned Hadoop and Spark. To me, that suggests rather low IPC (so, relying on stupidly parallel workloads and the new ARM NEON instructions: http://www.eetimes.com/document.asp?doc_id=1330339).
    As far as the new vector instructions go, I have to think that Apple and probably Qualcomm are also on board. Apple was heavily involved in AltiVec development and would be very interested in bringing such performance to the iOS lineup. Qualcomm of course is already going after the server market and might take a stab at the PC market; both businesses could leverage a high-performance vector capability.

    In any event, for us old guys what amazes me is that we basically have a Cray on a chip, many times over. Enhanced vector capability just means even more software will run smoothly on these chips.

    As for the limit on cores, that is an interesting discussion, because in the end "it depends". I remember some reported work by Intel indicating that their architecture had problems going past 32 cores. I can't remember the specifics of the workload, but the point is that you can optimize a processor for the type of workload you expect to run on it. Beyond that, "cores" aren't really the issue; it is the cache memory and RAM interfaces that bottleneck and get extremely hot (burn power). This is where innovation can still happen. The nice thing with ARM is that there is more free space per core on the die to allocate to cache and other support circuitry.
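[Editor's note] The point above, that memory interfaces rather than core count become the bottleneck, can be sketched with a simple roofline-style model: throughput is the lesser of compute capacity and memory bandwidth. All figures below are invented for illustration.

```python
# Roofline-style sketch: adding cores stops helping once the memory
# interface saturates. Per-core rate, bandwidth and intensity are invented.

PER_CORE_GFLOPS = 8.0    # hypothetical per-core compute rate
MEM_BW_GBYTES = 60.0     # hypothetical memory bandwidth (GB/s)
BYTES_PER_FLOP = 0.25    # workload's memory traffic per flop

def throughput_gflops(cores):
    """Deliverable throughput: min of compute limit and bandwidth limit."""
    compute_limit = cores * PER_CORE_GFLOPS
    memory_limit = MEM_BW_GBYTES / BYTES_PER_FLOP  # 240 GFLOP/s ceiling
    return min(compute_limit, memory_limit)

for n in (8, 16, 32, 48):
    print(f"{n:2d} cores -> {throughput_gflops(n):6.1f} GFLOP/s")
```

In this toy model the chip goes bandwidth-bound at 30 cores, so the 48-core configuration delivers no more than the 32-core one; that is the "problems going past 32 cores" shape of the curve.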



  • #26
    Originally posted by gnufreex:
    HTML in the kernel? Torvalds won't like that.
    Application, not operating system... (practice literacy, thx bai)



  • #27
    Qualcomm will have designed this processor in tandem with potential customers - Google, Amazon, Baidu, etc. - so it will be targeting their needs.

    If you want in on the massively profitable server market, you work with what you have. Qualcomm have no x86 license, so that's out the door. So you create a presumably high-IPC ARM core design (Falkor), and you make use of other core competencies (of which Qualcomm has many).

    Note that one rumoured aspect is on-die or on-package FPGA (Xilinx) as an option with this design. There may also be on-package HBM2 to deal with the memory bandwidth issue (at least for cached assets).

    Falkor (the core) is likely to be used in future Windows products, now that MS has announced it's trying again, and doing it properly this time round.



  • #28
    Originally posted by BillBroadley:

    ARM's deal is best price/perf at phone-friendly power. If they can manage best price/perf at server power levels, all the better. Many embarrassingly parallel workloads at large companies like Google or Facebook couldn't care less about per-node performance. They want best performance/(total cost of ownership). That includes things like power, cooling, purchase cost, maintenance cost, error rate, etc.
    If you look at Apple's latest A-series processors, you will see that they get very good performance while maintaining very good thermals. In fact, I'd have to say the cores are already good enough to implement in a many-core chip and use that chip in servers or even desktops, especially if the cores can have their clock rates increased.
    If a rack + two 30 amp 208 V 3-phase PDUs + ARM ends up delivering more performance per $ than Intel, then I can see it being very popular. Intel mostly specializes in maximum performance per core.
    Intel seems to be all over the place with respect to performance per watt. Usually the very low power chips are also very low performers. I think what is really telling about Intel is the space wasted on die for each one of their cores. Their big cores put them at a pretty huge disadvantage relative to ARM. I suspect that this is part of the rush to ARM; that is, more space on die to implement cores or to add support circuitry.
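[Editor's note] The performance/(total cost of ownership) metric mentioned above can be made concrete with a tiny model that folds purchase price and powered-on energy cost into one denominator. Every number below is invented purely for illustration.

```python
# Naive perf-per-TCO model: performance divided by purchase price plus
# energy cost over the service life. All figures are illustrative.

def perf_per_tco(node_perf, purchase_usd, watts, years=3,
                 usd_per_kwh=0.10, cooling_overhead=1.5):
    """Performance per dollar of total cost of ownership (naive model)."""
    kwh = watts / 1000 * 24 * 365 * years
    energy_cost = kwh * usd_per_kwh * cooling_overhead  # power + cooling
    return node_perf / (purchase_usd + energy_cost)

# hypothetical nodes: a fast big-core box vs a cheaper many-core box
big_core_node = perf_per_tco(node_perf=100, purchase_usd=8000, watts=400)
many_core_node = perf_per_tco(node_perf=80, purchase_usd=4000, watts=250)
print(f"big-core:  {big_core_node:.4f} perf/$")
print(f"many-core: {many_core_node:.4f} perf/$")
```

With these made-up numbers the slower-but-cheaper node wins on perf/TCO even though it loses on raw node performance, which is the trade-off the post describes.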



  • #29
    Originally posted by vadix:

    Firstly, Xeons are designed to be more power efficient, even if it's not the same as an ARM, and secondly, while Intel's cores may have a more complicated pipeline, x86 programs are definitely smaller than ARM (even with Thumb) on average. Smaller program size means fewer cache misses with the same size cache. Don't mix truth and BS together.
    x86's higher density holds until you go x86-64, where the extra instruction prefixes practically negate that advantage. In fact, SIMD-heavy x86-64 code is definitely lower density than the equivalent A64 (AArch64) code, where instructions remain 4 bytes.
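[Editor's note] A small sketch of the density point above: A64 instructions are a fixed 4 bytes, while x86-64 lengths vary, and REX/VEX/EVEX prefixes push SIMD-heavy code toward longer encodings. The byte counts below are representative example lengths, not measurements of any particular binary, and the 1:1 instruction mapping is an assumption.

```python
# Compare encoded size of a hypothetical hot loop on x86-64 vs A64.
# x86-64 lengths are representative examples; A64 is always 4 bytes.

A64_BYTES = 4  # every A64 instruction is 32 bits

# hypothetical hot loop: (description, x86-64 encoded length in bytes)
x86_loop = [
    ("scalar add", 3),   # REX.W prefix + opcode + modrm
    ("load",       4),
    ("avx512 fma", 6),   # EVEX-prefixed SIMD op
    ("avx512 fma", 6),
    ("branch",     2),
]

x86_total = sum(size for _, size in x86_loop)
a64_total = len(x86_loop) * A64_BYTES  # assumes a 1:1 instruction mapping

print(f"x86-64: {x86_total} bytes, A64: {a64_total} bytes")
```

Scalar-heavy code tilts the other way (many 2-3 byte x86 ops against fixed 4-byte A64 ops), which is why the density comparison depends so much on the instruction mix.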

    Originally posted by renox:

    Half of what you wrote is false: CISC ISAs tend to have higher code density than RISC ISAs.
    That's why ARM has the Thumb/Thumb2 extensions and MIPS has the MIPS16 extension: to get nearly the same code density as x86.
    Surprisingly, ARMv8 doesn't have a 16-bit extension.
    ARMv8 does have Thumb2, in AArch32.



  • #30
    Originally posted by BillBroadley:

    These days the x86 ISA is mostly just a compatible binary format. It's NOT directly executed, or even cached. The decoder breaks it into micro-ops, which are RISC-like (fixed width, simple, etc.). x86s on the inside are basically out-of-order RISC cores. The micro-ops are cached, speculatively executed, retired, etc. Sure, there's a bit of extra complexity, but it's very minor. If you look at the transistor budget, the slightly bigger decoder is a very minor issue. That's why no other architecture does significantly better than x86.
    Look at the _power_ budget for that. Who cares about transistors? They are cheap, at least in theory. In practice, the ones used in the decode stage are under far greater pressure than some cell in the L3 cache. Every switch costs some area, power, propagation delay and heating load, and emits EMI into the environment that then has to be dealt with.

    It's not the same when you have a nice 32-bit instruction format as when you have friggin' 8-bit prefixes, swamps of obsolete instructions, etc. Yes, you can translate around that, but it's gonna cost you. In TDP, in area used, in man-years spent on design, etc.


