Tachyum Gets FreeBSD Running On Their Prodigy ISA Emulation Platform For AI / HPC


  • #21
    If you remember Transmeta/Efficeon (Linus Torvalds worked on it, too), it was also a "morphing" CPU which offered the same performance as Intel, but at a fraction of the power consumption and with a smaller/cheaper chip. It used profiling and JIT re-compilation of code pages in RAM, so the final stage of optimization effectively fed the in-order VLIW architecture. The problem was that the CPU was slow at ordinary tasks. It required a single task running "in a cycle" to optimize it, e.g. playing a video (software-decoded at the time), playing a game, ... But in random tasks (opening a .doc file in Word, running the spell checker, loading a page in the web browser - each page having different scripts, most running only on page load, ...) it was slow (unoptimized code using only a fraction of the VLIW's units). The next Intel CPU had the same low power consumption for notebooks, so Transmeta was not needed anymore.

    PS: An example they showed was running Quake 3 with two CPU ISAs used to render a single frame - one x86 and one some "Java CPU" (e.g. Sun had CPUs with hardware-accelerated Java) - and it switched ISAs during the rendering of each frame.
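
    For anyone who hasn't seen how that profile-then-retranslate loop works, here's a deliberately tiny Python sketch of the general idea (this is not Transmeta's actual Code Morphing Software; the threshold and the "translation" step are made up for illustration): blocks run interpreted until a counter marks them hot, then a cached translated version is used from that point on.

    Code:
    HOT_THRESHOLD = 3                       # "recompile" a block after it has run this often

    def interpret(block):
        # Slow path: step through the block one simple op at a time.
        acc = 0
        for op, val in block:
            acc = acc + val if op == "add" else acc * val
        return acc

    def translate(block):
        # "Translation": bake the block into a single closure up front,
        # standing in for emitting one wide VLIW bundle for the hot path.
        ops = tuple(block)
        def run():
            acc = 0
            for op, val in ops:
                acc = acc + val if op == "add" else acc * val
            return acc
        return run

    profile = {}       # block id -> execution count
    translated = {}    # block id -> cached translated version

    def execute(block_id, block):
        if block_id in translated:
            return translated[block_id]()              # fast path: reuse the translation
        profile[block_id] = profile.get(block_id, 0) + 1
        if profile[block_id] >= HOT_THRESHOLD:
            translated[block_id] = translate(block)    # block went hot: translate it once
        return interpret(block)                        # cold path: plain interpretation

    code_page = [("add", 2), ("mul", 3), ("add", 1)]
    for _ in range(5):
        print(execute("page_0x1000", code_page))       # prints 7 every time

    The random-task problem described above maps straight onto this: a block that never crosses the threshold stays on the slow path forever.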
    Last edited by Ladis; 06 April 2022, 07:18 PM.

    • #22
      Originally posted by Ladis View Post
      it was also a "morphing" CPU which offered the same performance as Intel, but at a fraction of the power consumption and with a smaller/cheaper chip.
      I hate to go off-topic, but that's hardly an accomplishment.
      I don't know of any modern CPU as slow and power-hungry as Intel's chips. The mad dash to replace them with ARM is similar to how we're desperate to replace the dilapidated Xorg stack. We simply can't go on like this; if ARM hadn't shown up when it did, computing would have been set back another couple of decades (after the setbacks Intel already put us through - thank god AMD at least picked up the slack).

      • #23
        Originally posted by Ironmask View Post

        I hate to go off-topic, but that's hardly an accomplishment.
        I don't know of any modern CPU as slow and power-hungry as Intel's chips. The mad dash to replace them with ARM is similar to how we're desperate to replace the dilapidated Xorg stack. We simply can't go on like this; if ARM hadn't shown up when it did, computing would have been set back another couple of decades (after the setbacks Intel already put us through - thank god AMD at least picked up the slack).
        At the time all software was closed source (and nobody knew about Linux), so it was a huge achievement, because you could run the same x86 Windows OS and software. What use was ARM and the like to common people back then, when you couldn't run the OS and the programs and games everybody used?

        PS: Intel (and x86) didn't become more efficient by improving their architecture, but just by using a newer, smaller node. They were leading in chip fabrication at the time (and that lengthened the life of x86).

        • #24
          One size fits all usually fits none.

          • #25
            Originally posted by atomsymbol View Post
            I think/believe the main points of the Prodigy CPU are:
            • Matrix multiplication acceleration instructions:
            On this point, the T864 datasheet claims "8 Tflops High Performance Computing" and "131 Tflops AI training and inference". The T16128, with 2x the cores, simply doubles those figures.

            In comparison:
            • AMD's MI250X (shipping since late last year) offers: 45.3 fp64 vector TFLOPS, 90.5 matrix TFLOPS, and 362 AI TFLOPS (fp16 or BF16).
            • Nvidia's A100 (launched 2020) offers: 9.7 fp64 vector TFLOPS and 312 AI TFLOPS (fp16).
            • Nvidia's H100 (coming in Q3) offers: 30 fp64 vector TFLOPS, 60 fp64 matrix TFLOPS, and 1000 AI TFLOPS (fp16).

            So it can't touch GPUs on any workload suitable for them. Even if you ran the GPUs at lower clocks to rein in their power dissipation, they'd still run circles around it.
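
            To put those gaps in one place, here's a throwaway Python snippet that just tabulates the vendor-claimed peak figures quoted above (the T16128 numbers are simply 2x the T864 datasheet values) and the ratio of each against the T16128; nothing here is measured.

            Code:
            claimed = {
                # name: (fp64 vector TFLOPS, AI TFLOPS) -- all vendor peak claims quoted above
                "Tachyum T16128": (16.0,  262.0),   # 2x the T864's 8 / 131
                "AMD MI250X":     (45.3,  362.0),
                "Nvidia A100":    ( 9.7,  312.0),
                "Nvidia H100":    (30.0, 1000.0),
            }

            base_fp64, base_ai = claimed["Tachyum T16128"]
            for name, (fp64, ai) in claimed.items():
                print(f"{name:16s}  fp64 {fp64:5.1f} TF ({fp64 / base_fp64:3.1f}x)"
                      f"   AI {ai:6.0f} TF ({ai / base_ai:3.1f}x)")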

            Originally posted by atomsymbol View Post
            • VLIW instruction set architecture: up to 8 micro-ops per cycle to ALUs/load/store/branch/compare
            Huh. The datasheets for T864 and T16128 are both claiming "4 instructions per clock up to 4GHz", which hardly impresses.

            This is the typical problem you see with chip startups. It's like they look at the spec sheets of current-gen GPUs and CPUs, go off and design a chip which can beat them, but forget that they'll be competing against chips two generations better by the time they get anything to market.

            The only recent HPC processor that's really impressed me (though the A64FX deserves a special mention) is from Preferred Networks (PFN):



            PEZY-SC2 was pretty cool too, including its mad TCI wireless in-package DRAM interface that outmatched any HBM2 in its day. Sadly, they don't seem to have done much since getting busted for financial fraud.

            • #26
              Originally posted by sophisticles View Post
              One size fits all usually fits none.
              I remember when APUs got popular and my friend was like "wow it's two in one!" and I was like "it's the worst of both worlds, a bad CPU combined with a bad GPU" and later they were like "yeah you were right".
              Nobody in this thread seems to be able to find out anything about what this is even supposed to be. I'm guessing they found out what an FPGA is and are marketing it as a CPU that can optimize itself to do anything? So, snake oil.
              Now, if you want real innovation in AI chips, some startup (whose name I forget) is using SSD technology to encode artificial neural networks into solid-state chips. That one is actually going to be revolutionary.

              • #27
                Originally posted by Ironmask View Post
                I remember when APUs got popular and my friend was like "wow it's two in one!" and I was like "it's the worst of both worlds, a bad CPU combined with a bad GPU" and later they were like "yeah you were right".
                I disagree. APUs had potential, but the software support simply wasn't there. If you had some floating-point heavy workload, even the small iGPUs traditionally had a lot more compute power than the CPU cores. Lately, CPU cores have done a lot to catch up, but iGPUs still pack considerably more FLOPS/W.

                The biggest limitation APUs face is memory bandwidth. DDR5 raises the bar on that, but still not enough. You'd have to go to in-package memory, like Apple, to scale them up to anything meaningful. For laptops, I guess you could use soldered-down GDDR6, since many already use soldered RAM anyhow.
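
                To give a feel for the bandwidth gap, a quick Python back-of-envelope with ballpark peak numbers (illustrative configurations, not specific products; exact figures depend on the parts and clocks chosen):

                Code:
                # Rough peak bandwidth in GB/s for a few illustrative memory setups.
                configs = {
                    "dual-channel DDR5-4800 (2 ch x 8 B x 4.8 GT/s)": 2 * 8 * 4.8,
                    "256-bit GDDR6 @ 16 Gbps (32 B x 16 GT/s)":       32 * 16.0,
                    "Apple M1 Max in-package LPDDR5 (claimed)":        400.0,
                }
                for name, gbs in configs.items():
                    print(f"{name:50s} ~{gbs:6.1f} GB/s")

                Even a mid-range discrete-GPU memory setup sits several times above what a socketed dual-channel APU can pull, which is the scaling wall described above.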

                Originally posted by Ironmask View Post
                I'm guessing they found out what an FPGA is and are marketing it as a CPU that can optimize itself to do anything? So, snake oil.
                No, for two reasons. First, you couldn't hit their quoted performance numbers with a CPU built on a normal FPGA. Second, they provide an FPGA-based development platform, which implies the real product clearly isn't FPGA-based.

                • #28
                  Originally posted by Ironmask View Post
                  Nobody in this thread seems to be able to find out anything about what this is even supposed to be.
                  It is supposed to be a CPU for HPC/cloud/supercomputers. They have a (preliminary?) contract with a local government in Europe to put a lot of those CPUs in a supercomputer.

                  It isn't supposed to be a gaming CPU, nor a desktop CPU, nor a notebook CPU.

                  Originally posted by Ironmask View Post
                  I'm guessing they found out what an FPGA is and are marketing it as a CPU that can optimize itself to do anything? So, snake oil.
                  The probability of that being true is very close to zero. Or do you believe that a person with 88 patents doesn't know technology? https://patents.google.com/?inventor=Radoslav+Danilak


                  Originally posted by coder View Post
                  The datasheets for T864 and T16128 are both claiming "4 instructions per clock up to 4GHz", which hardly impresses.
                  It is likely that you are misinterpreting the datasheets. It actually might mean: up to 4*8=32 µops per clock.
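
                  Spelling that arithmetic out (the 8 µops per bundle figure is atomsymbol's reading of the quote above, not something verified against the datasheet):

                  Code:
                  clock_ghz = 4.0            # "up to 4GHz" from the datasheet quote
                  bundles_per_clock = 4      # "4 instructions per clock" read literally
                  uops_per_bundle = 8        # atomsymbol's reading: each bundle carries up to 8 micro-ops

                  literal = bundles_per_clock                      # 4 ops/clock
                  bundled = bundles_per_clock * uops_per_bundle    # 32 µops/clock

                  print(f"literal reading: {literal} per clock  -> {literal * clock_ghz:.0f} G ops/s peak")
                  print(f"bundle reading:  {bundled} per clock -> {bundled * clock_ghz:.0f} G µops/s peak")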

                  Originally posted by Ladis View Post
                  If you remember Transmeta/Efficeon (Linus Torvalds worked on it, too), it was also a "morphing" CPU which offered the same performance as Intel, but at a fraction of the power consumption and with a smaller/cheaper chip.
                  Some notes:
                  • Torvalds implied some time ago that, had Transmeta started with servers instead of notebooks, it might have achieved better results and been less likely to go out of business.
                  • The fact is that the first Intel Pentium (1993) is internally a dynamic-VLIW CPU, so it was problematic from the very beginning for Transmeta (2000, seven years after the Pentium) to outperform AMD/Intel x86 CPUs.
                  • It is probable that in 20 years x86 CPUs will be doing what Transmeta was doing: x86 CPUs will have large µop caches (megabytes in size).
                    • The primary reason for that is _not_ that "x86 is outdated or inferior to ARM". The primary reason is staying competitive in terms of performance.
                  Originally posted by Ladis View Post
                  It [Transmeta] used profiling and JIT re-compilation of code pages in RAM, so the final stage of optimization effectively fed the in-order VLIW architecture.
                  The µop cacheline and µop execution in AMD/ARM/Intel CPUs are related to the concept of an in-order VLIW instruction.


                  Originally posted by Ironmask View Post
                  I don't know of any modern CPU as slow and power-hungry as Intel's chips. The mad dash to replace them with ARM is similar .... If ARM hadn't shown up when it did, computing would have been set back another couple of decades ....
                  Some notes:
                  • Intel CPUs are the most power-hungry because Intel wanted Alder Lake to be the fastest gaming CPU when it was released
                  • The i486 (1989) was among the first CPUs to require an active cooler. Before that point, CPUs consumed only relatively small amounts of energy. In other words: CPU power consumption is closely tied to how the CPU is cooled (and to power delivery).
                  • If an x86 CPU has a µop cache, the advantage of ARM over x86 is close to zero
                    • The wider the CPU (more ALUs and load-store units), the greater the need for a µop cache - irrespective of whether it is ARM or x86

                  • #29
                    Originally posted by coder View Post
                    This sounds better than it is. You talk about these things as if they're comparable, but unless you're using a really small deep learning model, the overhead of shipping your data over PCIe or CXL is negligible, by comparison with the time that inferencing takes. We're talking about microseconds vs. milliseconds, at least.
                    It sounds better than it is because currently, unless you can afford highly specialized systems, you have to work around this limitation and people accept it. PCIe is definitely a bottleneck for many HPC cases. I mean, look at the latest z-series mainframes from IBM: they put their AI accelerator directly onto the chip exactly to remove that overhead, and they got something like a 20x improvement (also because, sitting on the chip, the AI processor now has extremely fast access to IBM's new CPU cache architecture).

                    For this reason, having some "general" ISA that pulls what would traditionally be accelerators/separate chips onto the same die has merit. I can't really comment on how viable/realistic this is, but it's definitely a real problem in HPC.
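
                    Both readings of this really come down to how much data moves per pass; a rough Python sketch (PCIe 4.0 x16 peak rate and an assumed 5 ms of on-accelerator compute, both placeholder numbers) shows where the transfer cost starts to dominate:

                    Code:
                    PCIE4_X16_GBPS = 32.0      # ~32 GB/s one direction, theoretical peak (assumption)
                    COMPUTE_MS = 5.0           # hypothetical time spent on the accelerator per request

                    for payload_mb in (1, 64, 1024):
                        transfer_ms = payload_mb / 1024 / PCIE4_X16_GBPS * 1000
                        share = transfer_ms / (transfer_ms + COMPUTE_MS) * 100
                        print(f"{payload_mb:5d} MB payload: {transfer_ms:7.3f} ms over PCIe "
                              f"-> {share:4.1f}% of the round trip")

                    For a small request the link cost is noise (coder's point), but once large working sets move on every pass the transfer eats the budget, which is where the on-die argument comes in.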

                    • #30
                      Originally posted by atomsymbol View Post
                      The probability of that being true is very close to zero. Or do you believe that a person with 88 patents doesn't know technology? https://patents.google.com/?inventor=Radoslav+Danilak
                      Not saying this is the case, but considering how much of a problem bullshit patents are, I wouldn't read too much into the number alone.
