Centaur Tech Announces Eight-Core x86 SoC With AI Coprocessor


  • #21
    Originally posted by coder View Post
    I guess you missed Phoronix' coverage of Zhaoxin, then?
    Please don't confuse VIA CPUs with Zhaoxin CPUs. They are not the same thing. Zhaoxin processors are based on VIA CPUs just like Hygon processors are based on AMD CPUs. However, current development is totally independent.

    VIA CPUs
    • CNA (2008) => 6F2 & 6F3 : CentaurHauls Family 6, Model 15, Stepping 2 & 3 AKA VIA Nano 1000/2000 Series, e.g. Nano L2100 - x86-64, SSSE3, 1 MB L2 cache per core, single core, 65nm
    • CNB (2009) => 6F8 : CentaurHauls Family 6, Model 15, Stepping 8 AKA VIA Nano 3000 Series, e.g. Nano L3025 - SSE4.1, VIA VT (compatible with Intel VT-X)
    • CNC (2011) => 6FA : CentaurHauls Family 6, Model 15, Stepping 10 AKA VIA Nano X2, e.g. Nano X2 L4050 - 2 cores, basically two Nano 3000 in the same die, 40nm
    • CNQ (2011) => 6FC & 6FD : CentaurHauls Family 6, Model 15, Stepping 12 & 13 AKA VIA Nano QuadCore, e.g. Nano QuadCore L4650E - 4 cores, basically two Nano X2 in a multi-chip module
    • CNR (2015, samples available in 2014) => 6FE (CentaurHauls Family 6, Model 15, Stepping 14) AKA VIA Nano QuadCore, Isaiah II, e.g. Nano QuadCore C4650 - SSE4.2, AVX, AVX2, AES-NI, 2 MB shared L2 cache, 4 cores on a single die, 28nm
    • CNS (2019) - AVX-512, NCORE AI coprocessor, 16 MB shared L3 cache, 8 cores, 16nm TSMC

    Zhaoxin CPUs
    • ZX-A (2014) => CentaurHauls Family 6, Model 15, Stepping 13 (KaiXian ZX-A), e.g. C4350AL - based on VIA CNQ (VIA Nano X2 C4350AL), 40nm TSMC; compatible with VX11(H) (Chrome 640/645 GPU - 3a01)
    • ZX-B (2014?) => CentaurHauls Family 6, Model 15, Stepping 13 (KaiXian ZX-B) - the same microarchitecture as ZX-A, 40nm HLMC (different fabrication plant, produced in mainland China)
    • ZX-C (2015) => CentaurHauls Family 6, Model 15, Stepping 14 AKA ZhangJiang (KaiXian ZX-C), e.g. C4610 - based on VIA CNR, 2 MB shared L2 cache, 4 cores, 28nm TSMC; compatible with VX11(PH) (Chrome 640/645 GPU - 3a01)
    • ZX-C+ (2016) => CentaurHauls Family 6, Model 15, Stepping 14 AKA ZhangJiang (KaiXian ZX-C+ & KaisHeng ZX-C+), e.g. C4701, FC-1081 - SM3, SM4, up to 4 MB shared L2 cache (2 x 2 MB), up to 8 cores (basically two quad-core CPUs in a multi-chip module, similar solution to that used in CNQ); compatible with ZX-100(S) (Chrome 320 GPU - 3a02)
    • ZX-D (2017) => CentaurHauls Family 6, Model 31, Stepping 12 AKA WuDaoKou (KaiXian KX-5000 & KaisHeng KH-20000), e.g. KX-U5680, KH-26800 - major redesign, full SoC, new uncore with northbridge moved on-die, new P2P high-speed interconnect crossbar that replaces FSB, PCIe 3.0, DDR4, integrated GPU (Chrome 860 GPU - 3a03), 4 MB L2 cache per cluster (up to 8 MB in total), up to 8 cores (2 quad-core clusters on a single die), 28nm HLMC & SMIC; compatible with ZX-200 (IOE chip for IO extensibility)
    • ZX-E (2018) => CentaurHauls Family 7, Model 11, Stepping 0 & 1 & 15 AKA LuJiaZui (KaiXian KX-6000 & KaisHeng KH-30000), e.g. KX-U6880A, KH-37800D - new integrated GPU (Chrome 960 GPU - 3a04), 16nm FF (FinFET) TSMC
    • ZX-F (2019) => CentaurHauls Family 6, Model 71 (KaiXian KX-7000 & KaisHeng KH-40000), Stepping 1 - PCIe 4.0 (planned), DDR5 (planned), 16 MB shared L3 cache (confirmed), 7nm TSMC (planned)
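Those Family/Model/Stepping codes map directly onto the fields of the CPUID leaf-1 EAX signature. A minimal decoding sketch (the extended-model rule is what makes a "Model 31" like WuDaoKou possible; the example signatures below are derived from the listed values, not vendor-confirmed dumps):

```python
def decode_signature(sig):
    """Decode an x86 CPUID leaf-1 EAX signature into
    (family, model, stepping), applying the extended-field rules."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if family == 0xF:
        family += ext_family       # extended family only applies to family 0xF
    if family in (0x6, 0xF):
        model |= ext_model << 4    # extended model applies to families 6 and 0xF
    return family, model, stepping

# 0x6F2 is the "6F2" above: Family 6, Model 15, Stepping 2 (VIA CNA)
print(decode_signature(0x6F2))    # (6, 15, 2)
```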

    I haven't mentioned the performance boost, but you should know that it is significant.



    • #22
      Originally posted by microcode View Post
      Wonder where they got that IP.
      VIA is the third company that owns an x86 license, so why couldn't they produce it?

      In fact, I saw this news on another site which mentioned only Centaur, with no mention of VIA, and I wondered how they could get a license to produce x86 CPUs; I had to google to confirm it's actually VIA.

      Originally posted by the_scx View Post
      It has 8 cores. Why should it support SMT if we already know that it is a big source of hardware bugs?
      SMT is the way to go. It trades just a tiny die area for a much bigger performance increase. PowerPC and Sparc have used 8-thread cores for a long time. Many ARM cores for server also support SMT. The fact that Intel's implementation is buggy doesn't mean that SMT is buggy. There's nothing wrong with SMT in AMD CPUs



      • #23
        Originally posted by phuclv View Post
        In fact, I saw this news on another site which mentioned only Centaur, with no mention of VIA, and I wondered how they could get a license to produce x86 CPUs
        Centaur is wholly owned by VIA.

        Originally posted by phuclv View Post
        There's nothing wrong with SMT in AMD CPUs
        Really? I know some of the side-channel attacks have affected AMD, but you're saying none of them were SMT-related?

        Anyway, I get the paranoia around SMT. Unless you have rigorous QoS for it, which will significantly reduce its efficiency benefits, you can't safely run SMT threads from different processes on the same core. However, the solution is quite simple - just restrict SMT to threads from the same process (or VM, perhaps)! That should 100% mitigate the security concerns around it.
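For what it's worth, Linux later grew a mechanism along these lines: core scheduling (kernel 5.14+), where only tasks sharing a "cookie" may run on SMT siblings of the same core. A minimal sketch via prctl(2) through ctypes; the constants come from the prctl man page, and the call simply fails on older kernels or restricted environments:

```python
import ctypes
import os

# Constants from prctl(2); PR_SCHED_CORE needs Linux >= 5.14.
PR_SCHED_CORE = 62
PR_SCHED_CORE_CREATE = 1
PIDTYPE_TGID = 1  # scope: all threads in this thread group (process)

_libc = ctypes.CDLL(None, use_errno=True)

def enable_core_scheduling():
    """Give the calling process its own core-scheduling cookie, so the
    kernel only co-schedules its own threads on SMT siblings of a core.
    Returns True on success, False if unsupported or not permitted."""
    ret = _libc.prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
                      os.getpid(), PIDTYPE_TGID, 0)
    return ret == 0

if __name__ == "__main__":
    print("core scheduling enabled:", enable_core_scheduling())
```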



        • #24
          Originally posted by phuclv View Post
          VIA is the third company that owns an x86 license, so why couldn't they produce it?
          I meant the microarchitecture, obviously VIA has an IA32/AMD64 license. I'm insinuating semi-humorously that if their microarchitecture is as competitive as they claim it is, I'm not so sure they acquired it primarily through hard engineering work and competency.

          Originally posted by phuclv View Post
          SMT is the way to go. It trades just a tiny die area for a much bigger performance increase. PowerPC and Sparc have used 8-thread cores for a long time. Many ARM cores for server also support SMT. The fact that Intel's implementation is buggy doesn't mean that SMT is buggy. There's nothing wrong with SMT in AMD CPUs
          Side-channel vulnerabilities other than Meltdown, some ZombieLoad variations, and SWAPGS will tend to have equivalents on non-Intel microarchitectures with OoO pipelines and speculative execution, and that includes the additional side channels introduced by SMT. AMD's microarchitectures have happened to be less vulnerable to some of the MDS vulnerabilities that have gained a lot of press recently, but they're not immune to all of them by a long shot.

          Now, that's not to say there isn't a way to use SMT well. My suggestion has been to allow programs, either automatically or by request, to be scheduled on all SMT threads of each core they use at the same time. In the case of Chromium, and possibly Firefox now (since E10S), only pages of the same origin share a process, so you could just make it so that processes get exclusive time on a core, but can use more than one of that core's threads.
          It's not perfect, since some of the benefit of SMT comes from exploiting how some processes will tend to use more of a specific resource than the others, producing less contention on a given resource (e.g. the FPU, hardware cryptographic operators, or random number generators), and therefore more throughput. I suspect enough of that will exist in today's throughput-critical processes that it's not a problem.
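A user-space sketch of that gang placement: read a core's SMT sibling list from sysfs and pin the process to exactly those logical CPUs. The sysfs path is the standard Linux one; the helper falls back to just the given CPU if topology info is missing:

```python
import os

def smt_siblings(cpu=0):
    """Return the set of logical CPUs sharing a physical core with
    `cpu`, parsed from sysfs ("0,4" or "0-1" style lists)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    try:
        with open(path) as f:
            text = f.read().strip()
    except OSError:
        return {cpu}  # no topology info: fall back to the CPU itself
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def pin_to_physical_core(cpu=0):
    """Restrict the calling process to one core's SMT siblings."""
    siblings = smt_siblings(cpu)
    os.sched_setaffinity(0, siblings)
    return siblings
```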
          Last edited by microcode; 22 November 2019, 06:20 PM.



          • #25
            Originally posted by microcode View Post
            It's not perfect, since some of the benefit of SMT comes from exploiting how some processes will tend to use more of a specific resource than the others, producing less contention on a given resource (e.g. the FPU, hardware cryptographic operators, or random number generators), and therefore more throughput. I suspect enough of that will exist in today's throughput-critical processes that it's not a problem.
            It'd be cool if the kernel could monitor some performance counters and try to work out optimal pairings of threads. However, recent security concerns around SMT have probably now made that less likely - just in case someone manages to find a way to siphon that data out of the scheduler.
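As a toy illustration of that idea (entirely hypothetical, not any kernel's actual policy): given one per-thread counter, say the fraction of issue slots going to the FPU, a scheduler could greedily pair the most FPU-hungry thread with the least hungry one, minimizing contention on that unit:

```python
def pair_by_contention(fpu_share):
    """fpu_share: dict mapping thread id -> fraction of issued uops
    that target the FPU (a made-up metric for this sketch).
    Greedily pairs the lightest FPU user with the heaviest, so each
    SMT pair contends as little as possible on the FPU."""
    order = sorted(fpu_share, key=fpu_share.get)
    pairs = []
    while len(order) >= 2:
        pairs.append((order.pop(0), order.pop(-1)))
    leftover = order[0] if order else None  # an odd thread runs alone
    return pairs, leftover

# The heavy FPU user "a" gets paired with the light one "b":
print(pair_by_contention({"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.2}))
# ([('b', 'a'), ('d', 'c')], None)
```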



            • #26
              Originally posted by phuclv View Post

              SMT is the way to go. It trades just a tiny die area for a much bigger performance increase. PowerPC and Sparc have used 8-thread cores for a long time. Many ARM cores for server also support SMT. The fact that Intel's implementation is buggy doesn't mean that SMT is buggy. There's nothing wrong with SMT in AMD CPUs
              A tiny die area, but it's still a huge bump in the complexity of the design. POWER9 was affected by Spectre. SPARC's T-series cores have a very different kind of SMT: instructions are issued round-robin across threads, there is no rename stage (instead there are rotating register windows), and the reorder buffer is quite small. AMD was likewise hit by Spectre, just not Meltdown, as it isolated the kernel page table by default. And because AMD had been less aggressive about speculation and prediction, performance did not take a hit from the fixes.

              Originally posted by coder View Post
              However, the solution is quite simple - just restrict SMT to threads from the same process (or VM, perhaps)! That should 100% mitigate the security concerns around it.
              Process isn't something the CPU knows about. You'd be relying on the OS scheduler to behave correctly. And the security context isn't always the same for every thread. You don't necessarily want every thread being able to see anything in the process memory space.
              Last edited by WorBlux; 04 February 2020, 02:12 PM.



              • #27
                Originally posted by WorBlux View Post
                A tiny die area, but it's still a huge bump in the complexity of the design.
                I'm not so sure about that. If you take a modern Intel or AMD CPU, what stages can be removed by eliminating SMT, without also hurting single-thread performance?

                Originally posted by WorBlux View Post
                Process isn't something the CPU knows about. You'd be relying on the OS scheduler to behave correctly.
                Yes, that was my point: software fixes for SMT vulnerabilities need only involve the OS's thread scheduler.

                Originally posted by WorBlux View Post
                And the security context isn't always the same for every thread. You don't necessarily want every thread being able to see anything in the process memory space.
                Uh, that's the most fundamental distinction between a thread and a process. If you don't want your threads to have access to each other's data, then you put them in separate processes.

                And yes, I know what LWPs are, and that you can choose which parts to share or not when spawning one. I suggest that whether they share the same memory space be the determining factor in deciding whether they can be paired via SMT.
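The memory-sharing distinction is easy to demonstrate: a thread's write to shared state is visible to its spawner, while a forked child only touches its own copy-on-write copy. A small Linux-only sketch:

```python
import os
import threading

state = {"x": 0}

def shared_after_thread():
    """Threads share the address space: the write is visible here."""
    state["x"] = 0
    t = threading.Thread(target=lambda: state.update(x=1))
    t.start()
    t.join()
    return state["x"]  # 1

def shared_after_fork():
    """fork() gives the child a copy-on-write copy of the address
    space: the parent's value is untouched by the child's write."""
    state["x"] = 0
    pid = os.fork()
    if pid == 0:
        state["x"] = 1  # modifies only the child's copy
        os._exit(0)
    os.waitpid(pid, 0)
    return state["x"]  # still 0

print(shared_after_thread())  # 1
print(shared_after_fork())    # 0
```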



                • #28
                  Originally posted by coder View Post
                  I'm not so sure about that. If you take a modern Intel or AMD CPU, what stages can be removed by eliminating SMT, without also hurting single-thread performance?
                  I'm talking logical complexity, not necessarily physical stages. Most of what you need for SMT falls out of what you have to do for out-of-order, though you've added new classes of hazards and cross-interference into the design.



                  • #29
                    Originally posted by WorBlux View Post
                    I'm talking logical complexity, not necessarily physical stages. Most of what you need for SMT falls out of what you have to do for out-of-order,
                    There are many in-order SMT CPUs out there, like Intel's original Atom and first-gen Xeon Phi. Most GPUs are basically in-order SMT.



                    • #30
                      Originally posted by coder View Post
                      There are many in-order SMT CPUs out there, like Intel's original Atom and first-gen Xeon Phi. Most GPUs are basically in-order SMT.
                      I'm not saying you can't do SMT with in-order. However, it doesn't just fall out: you need to add some sort of logical or physical separation of the architectural state (register values) between threads.

                      Looking at the Bonnell microarchitecture, the prefetch buffers and instruction queues had to be duplicated to support SMT. They also needed two sets of register files (floating point and integer), one for each thread.

                      Compare that to Skylake, which does have some accommodations for SMT in the prefetch and fetch stages. However, the instruction queue is split rather than duplicated, and the rename stage allows multiple threads to share the same physical register file.

                      The first-gen Atom brings back nostalgic memories of Xubuntu running on a netbook. What a dog of an architecture: just barely on the edge of useful, even with a lightweight OS.

                      Early SMT SPARCs did round-robin dispatch. SPARC already had the concept of register windows that could automatically spill to RAM if you went too deep or context-switched. They were able to add a huge number of logical windows while only increasing the physical windows by a half or a third of that. It did surprisingly well on I/O (memory) bound tasks.

                      Anyway... I can't find much info on Larrabee or the Knights series. Some sources say it was a lot like Bonnell, but I keep finding a slide on Knights Mill that says 72 uops in flight, 4-way SMT, and 2-way OoO. Very weird, but it makes some sense:


                      https://www.anandtech.com/show/12172...-qfma-and-vnni

                      I can see here what look like two schedulers/retire buffers, and there is definitely renaming going on. What I can't tell is whether they split the register file as well. Given that there is a more-than-linear increase in power/area as register files get bigger, especially when adding ports, my bet is that they do. I would guess that threads are pegged to a specific scheduler, but that's not necessarily true, as the RAT may be able to create instructions that force values to migrate between registers (but that would allow one thread to potentially block both schedulers). Each scheduler has a port into each pipeline, so no coordination is needed directly between the two. Granted, 36 uops isn't that deep, but it would still hide some L1 misses and give you some flexibility with instruction scheduling.

                      Yes, there's a lot of speculation in the prior paragraph, as microarchitectural data on these processors is fairly limited. The point is: to do SMT, you need a logically separate set of registers. The OoO rename stage gives you this nearly for free, without consuming additional area, but it does significantly increase the logical complexity of the register alias table and scheduler. If you want to do SMT in-order, the simplest way to get a logically separate set of registers is to add more register files, which does add a significant area/power impact on the critical path (even though it may be worth it in the end).

                      As for GPUs, I'm not nearly as familiar with how they work, and they are strange beasts. Initial research shows that they take the multiple-register-file approach. They are additionally restricted to dispatching from only one thread at a time: you can mix threads vertically within the pipeline, but not across it, i.e. any given cycle can only dispatch from one thread. There is also the restriction that all active threads must be within the same warp (threads in a warp share a memory space), so it's not completely analogous to Skylake or Zen SMT.
                      Last edited by WorBlux; 31 May 2020, 02:07 PM.

