Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • Originally posted by xfcemint View Post

    4) Complicates the design significantly. In particular, the switch-over operation may have a lot of dependencies.
    yes. i mean, there's a reason why "scalar" (normal, SMP) CPUs have scratch registers for context-switching (MIPS, RISC-V in particular) it's to get fast context-switches. where you start to include bank-switching of full register sets, including SPRs, it starts to make me twitchy about implementing something like that in a hybrid context. pure (dedicated) GPU, no problem.

    Comment


    • Originally posted by lkcl View Post

      interesting. this sounds very much like an optimised barrel processor: i say "optimised", where barrel processors are normally fair-scheduling, you're talking about instant swapping between regfiles.

      whereas, a hybrid CPU, being effectively "a standard SMP arrangement with extra opcode bells and whistles" would need a linux kernel OS context-switch.
      Very similar; most barrel processors have some way to accelerate the switch. The SPARC T3 was all about memory access in the context of a database. But I don't think GPUs even need that, probably just a couple of architectural registers for an offset and bounds. No clue on the exact details of the scheduler, and I suspect it's a guarded secret.

      Originally posted by lkcl View Post
      the summary takes 30 seconds. a full debrief takes 7 hours.

      this is why we're extending the PowerISA regfile to 128 FP and 128 INT regs.
      Your register and cache video helped explain some of it, but I still don't think I fully understand the 6600 overall. 7 hours seems optimistic just for that aspect. I do occasionally find myself on comp.arch, just trying to glean interesting tidbits when I can.

      Originally posted by lkcl View Post

      Jeff Bush's Nyuzi paper, nyuzipass2015, already made this abundantly clear, hence why 128 FP and 128 INT regs. you absolutely cannot have the LOAD-processing-STORE loop interrupted by register spill.

      (edit: well.. you can... but the power consumption penalty would terminate all and any possibility of having a commercially-viable processor. logically therefore, you don't do that!)
      Indeed, multiple round trips make no sense at all.
      Originally posted by lkcl View Post
      ah no, not quite. the vector instruction is basically not really a vector at all, it's a "for-loop from 0 to VL-1 whilst the PC is not advanced i.e. it's a bit like a SUB-PC". conceptually it sits in between instruction decode and instruction issue.

      it therefore shoves *elements* into the multi-issue execution engine.

      now, if the VL is e.g. 4 and there is room for e.g. 8-wide multi-issue, then the instruction decode does *not* stop with that first vector instruction, it goes, "hmm if i decode the next instruction as well i can shove an extra 4 elements into the 8-wide multi-issue"

      and at *that* point it will go "ok i can't do any more in this cycle"

      but because all the Computation Units are pipelines (except DIV) then on the next cycle guess what? next instruction decode gets 8 more free issue slots, and off we go again.
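a toy python sketch of that decode-to-issue packing, just to make the for-loop idea concrete. all names here (expand_to_issue_slots, issue_width) are illustrative, not the actual Libre-SOC decoder:

```python
# toy model of SV element expansion feeding a multi-issue window.
# all names here are illustrative, not from the Libre-SOC source.

def expand_to_issue_slots(instructions, VL, issue_width=8):
    """Expand each 'vector' instruction into VL per-element issue
    slots (the PC does not advance per element), packing as many
    elements as fit into each issue cycle."""
    cycles, slots = [], []
    for op in instructions:
        for element in range(VL):
            slots.append((op, element))
            if len(slots) == issue_width:   # issue window full this cycle
                cycles.append(slots)
                slots = []
    if slots:                               # partially-filled final cycle
        cycles.append(slots)
    return cycles

# two vector ops with VL=4 fill one 8-wide issue cycle completely
cycles = expand_to_issue_slots(["vadd", "vmul"], VL=4)
```

with VL=4 and an 8-wide issue window, two vector instructions fill a single cycle, which is exactly the "shove an extra 4 elements in" case described above.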
      Poor DIV, always the black sheep of the family. But I can now see how it'd be easy to multiplex the issue.

      I think I'm starting to get it. The X.org presentation helped. So the decoder sees the Vector instruction, and knows the element width, and will pack an FU/registers SIMD-style, and also continue into a neighboring FU. I suppose a Matrix takes care of unaligned/incomplete widths, but that might be harder on alternative implementations.

      And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width? Hence the proposal for the official extension.

      Anyways to be more specific, say you've entered into vector mode and the next instructions in the queue are
      load x to r13
      load Vx to r0...3
      load Vy to r4...7
      mult r13, r4...7 to r17...20
      add r4...7, r17...20 to r13...16
      store r13...16 to Vz

      And say Vy is in the L1 cache, but Vx isn't. Seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?



      Originally posted by lkcl View Post


      memory load-store is basically exactly as it would be for a multi-issue superscalar out-of-order load-store, but most first-time processor architects wouldn't dream of creating a 6 to 8 multi-issue load-store microarchitecture. even BOOM has only just recently added 2 simultaneous load-stores.

      to cope with the kind of memory load anticipated, i had to spend several months with Mitch Alsup on comp.arch last year, to get enough of an understanding of how to do it.
      Indeed, it does look like a very ambitious project, even more so once you drill into the details. It's a real shame RISC-V wasn't more accommodating.

      Originally posted by lkcl View Post

      yes. and a minimum 256-bit L2 cache data path, plus 4 "striped" L1 caches. absolutely mental. *nobody* in open hardware has tried designing something like this as a first processor! everyone does like 32-bit L1 cache data paths, or 64-bit, maybe.
      I can see why you're using a python flavor to do it.
      Originally posted by lkcl View Post

      not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:

      Vector A fits into R0 R1 R2 R3
      Vector B fits into R4 R5 R6 R7
      result C is to go into R8 R9 R10 R11

      the data paths between R0, R4, R8, R12, R16 (etc) are immediate and direct. likewise between R1, R5, R9, .... etc.

      therefore this takes 1 clock cycle to read or write, and there are 4 such "paths" between regfiles, so all *four* sets of vector ops (R8=R4+R0, R9=R5+R1) all do not interfere with each other.

      however let us say that you make the "mistake" of doing this:

      Vector A fits into R0 R1 R2 R3
      Vector B fits into R4 R5 R6 R7
      result C is to go into R9 R10 R11 R12

      now although the reads (A, B) work fine, the result of R0+R4, needing to go into R9, is in the *wrong lane* and must be dropped into the "cyclic buffer". it will be a *three* cycle latency before it gets written to the regfile.

      otherwise we have to have a full crossbar (12 or 16 way READ and 8 or 10 WRITE) and that's just completely insane.
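the lane arithmetic above can be sketched in a couple of lines of toy python. the function names and the 3-cycle figure simply restate the example; none of this is the actual Libre-SOC lane logic:

```python
# toy model of the 4-lane "striped" regfile: register R<n> lives in
# lane n % 4, and a result can write back in a single cycle only if
# its destination sits in the same lane as its source operands.
# purely illustrative, not the actual Libre-SOC implementation.

NLANES = 4

def lane(regnum):
    return regnum % NLANES

def writeback_latency(src_a, src_b, dest):
    """1 cycle if everything stays in-lane; otherwise the result is
    dropped into the cyclic buffer and takes 3 cycles to land."""
    if lane(src_a) == lane(src_b) == lane(dest):
        return 1
    return 3

# aligned:    R8 = R0 + R4  -> all lane 0, direct 1-cycle path
# misaligned: R9 = R0 + R4  -> dest in lane 1, cyclic buffer, 3 cycles
fast = writeback_latency(0, 4, 8)
slow = writeback_latency(0, 4, 9)
```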
      Does that mean the entire pipeline has to stall while waiting on the buffer? If so... ouch.

      Originally posted by lkcl View Post

      they're not quite ZOHLs, but yes if you cognitively disconnect "decode" from "issue" then consider SV to be "a compressed version of decode", we can still have multi-issue decode and multi-issue execution.


      yyeah there is so much to get done before considering doing that, although hilariously we considered overloading "branch" as a way to "start threads".



      we have to do a full from-scratch redesign, in particular taking into account Condition Registers in PowerISA. sigh. https://bugs.libre-soc.org/show_bug.cgi?id=213
      Overloading the branches for predicates does seem pretty clever though.

      Comment


      • Originally posted by lkcl View Post

        yes. i mean, there's a reason why "scalar" (normal, SMP) CPUs have scratch registers for context-switching (MIPS, RISC-V in particular) it's to get fast context-switches. where you start to include bank-switching of full register sets, including SPRs, it starts to make me twitchy about implementing something like that in a hybrid context. pure (dedicated) GPU, no problem.
        Understandable, I've got some ideas, but keep getting stuck on details. Maybe normal prediction and prefetch will be plenty good in practice. And maybe not stuffing all the load units full from a single load-vector source instruction.

        Comment


        • Originally posted by WorBlux View Post
          Your register and cache video helped explain some of it, but I still don't think I fully understand the 6600 overall. 7 hours seems optimistic just for that aspect.
          ah yes: the 6600 and its precise-exception augmentations took me 5 months to understand. SimpleV's specification details, which are ISA-independent (nothing to do with the 6600), "only" took 7 hours.

          I do occasionally find myself on comp.arch sometimes just trying to glean interesting tidbits when I can.
          it's pretty high traffic and people love deviating

          Poor DIV, always the black sheep of the family. But I can now see how it'd be easy to multiplex the issue.
          jacob came up with a "combined" algorithm that covers DIV, SQRT and R-SQRT in the same unit(s). this gives us something like a 50% increase in silicon area for a *combined* unit but then a 2/3 reduction in the *number* of such units required.

          I think I'm starting to get it. The X.org presentation helped. So the decoder sees the Vector instruction, and knows the element width, and will pack a FU/registers SIMD style,
          yes

          and also continue into a neighboring FU.
          yes, by *automatically* "masking out" the elements that don't fit that particular back-end SIMD unit, so that the programmer *does NOT* have to get into SIMD setup/teardown Hell

          I suppose a Matrix takes care of unaligned/incomplete widths,
          not quite: the masking takes care of it. as far as the actual ALU is concerned it doesn't care if it's been told to do 1x64 op, 2x32 ops, 4x16 ops or 8x8 ops (masked or unmasked).

          but that might be harder on alternative implementations.
          you're telling me.

          And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width?
          well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
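a quick back-of-envelope sketch (toy python, illustrative names) of why 64 INT8 elements are tolerable where 64 FP64 elements are not:

```python
# toy arithmetic for how many 64-bit registers a vector occupies:
# narrower elements pack SIMD-style into each register, so they
# need proportionally fewer registers. numbers are illustrative.

def regs_needed(VL, elwidth_bits, reg_bits=64):
    """Number of 64-bit registers consumed by VL elements."""
    per_reg = reg_bits // elwidth_bits  # elements packed per register
    return -(-VL // per_reg)            # ceiling division

fp64 = regs_needed(VL=64, elwidth_bits=64)  # 64 registers: runs you dry
int8 = regs_needed(VL=64, elwidth_bits=8)   # 8 registers: tolerable
```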

          Hence the proposal for the official extension.
          ah the reason for the official extension proposal is because we simply cannot be the de-facto "hard fork" maintainers of u-boot, coreboot, linux kernel, gcc, llvm, debian distro, fedora distro (which Redhat will object to for Trademark reasons anyway), i mean the resources to do all that would be absolutely mental.

          this is the primary reason why we dropped RISC-V: they failed, persistently and regularly, in their legal responsibilities under Trademark Law, to respond to reasonable in-good-faith requests to be included in the enhancement of the RISC-V ISA *without* completely compromising our business objectives.

          moving on

          Anyways to be more specific, say you've entered into vector mode and the next instructions in the queue are
          load x to r13
          load Vx to r0...3
          load Vy to r4...7
          mult r13, r4...7 to r17...20
          add r4...7, r17...20 to r13...16
          store r13...16 to Vz

          And say Vy is in the L1 cache, but Vx isn't. Seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?
          if we have enough LD/ST Reservation Stations (6-12 depending on required throughput), then yes. and as long as the memory locations are non-overlapping in the lower 12 bits, yes. that's a Mitch Alsup trick which saves hugely on address-compare XOR gates. by only comparing the bottom 12 bits of the address against all other addresses (bear in mind that's an O(N^2) algorithm so is one HELL of a lot of gates if you have say 8 or 12 LD/ST RSes) you may end up "overzealously" catching some addresses that *might* not actually overlap in their upper bits, but that errs on the safe side whilst still finding the real opportunities for parallelism.

          i.e. the fallback is "these LD/STs are going to be done sequentially if we *can't* find opportunities for parallelism" rather than "assume everything's done in parallel and whoops we missed some, wark, data-corruption"
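a toy sketch of that bottom-12-bits compare. the 12-bit mask width is from the description above; everything else (names, addresses) is illustrative:

```python
# toy model of the Mitch Alsup partial-address-compare trick:
# compare only the low 12 bits of each pair of outstanding LD/ST
# addresses. a real overlap always matches in the low bits, so a
# mismatch proves the operations are safe to run in parallel; a
# match is conservatively treated as a possible conflict (it may
# be a false positive) and those operations run sequentially.
# purely illustrative, not the actual Libre-SOC logic.

LOW_BITS = 12
MASK = (1 << LOW_BITS) - 1

def may_conflict(addr_a, addr_b):
    """Conservative overlap test on the low 12 bits only."""
    return (addr_a & MASK) == (addr_b & MASK)

# different low bits: provably disjoint, safe to run in parallel
parallel_ok = not may_conflict(0x1000, 0x1008)

# same low bits, different upper bits: a false positive, serialised
# anyway -- safe, just a lost parallelism opportunity
false_positive = may_conflict(0x1008, 0x3008)
```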

          Indeed, it does look like a very ambitious project, even more so once you drill into the details.
          ahh... yah

          It's a real shame RISC-V wasn't more accommodating.
          hey they did us a favour. who wants to create a processor where the people behind it are spiteful, vengeful, arrogant d***s?

          moving on...

          I can see why you're using a python flavor to do it.
          it would be absolute hell and require 5x the engineers to not do this with OO techniques. or... we could... but we'd need to treat VHDL / Verilog as a "machine code target" with auto-generators (written probably in python) that used templates in VHDL/Verilog and filled in the gaps (size of element width) etc. the maintainability and readability of such an effort would be hell (i've tried).

          best to just stick with a modern OO programming language entirely.


          "not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:"

          Does that mean the entire pipeline has to stall while waiting on the buffer? If so... ouch.
          ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.

          couple of notes:

          1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)

          2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".


          Overloading the branches for predicates does seem pretty clever though.
          that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)

          https://bugs.libre-soc.org/show_bug.cgi?id=213#c48

          Comment


          • Originally posted by WorBlux View Post

            Understandable, I've got some ideas, but keep getting stuck on details. Maybe normal prediction and prefetch will be plenty good in practice. And maybe not stuffing all the load units full from a single load-vector source instruction.


            the nice thing about the predication is, it drops on top of the SIMD masks, and from there through to regfile byte-write-enables. no matter the element width, it's all good. it means that for a 64 bit operation, writing to the regfile we need to raise 8x byte-level write lines, but that's standard practice for SRAMs in L1 and L2 caches so cell library developers are going "yawn" at that (small) innovation.
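a toy python sketch of how a per-element predicate mask can collapse down to per-byte write-enables on one 64-bit register. names and widths are illustrative, not the actual implementation:

```python
# toy model of collapsing a per-element predicate mask into per-byte
# regfile write-enables for one 64-bit register: a masked-out element
# simply never raises its byte lanes, whatever the element width.
# purely illustrative, not the actual Libre-SOC implementation.

def byte_write_enables(predicate_bits, elwidth_bytes, reg_bytes=8):
    """Return one boolean per byte of the 64-bit register."""
    enables = []
    for element in range(reg_bytes // elwidth_bytes):
        active = bool(predicate_bits & (1 << element))
        enables.extend([active] * elwidth_bytes)  # one lane per byte
    return enables

# 4x 16-bit elements, predicate 0b1010: only elements 1 and 3 write
enables = byte_write_enables(0b1010, elwidth_bytes=2)
```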

            Comment


            • First off, thanks for the reply, it's clarified quite a few things for me.

              Originally posted by lkcl View Post
              well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
              So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways). Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.

              Originally posted by lkcl View Post
              ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.

              couple of notes:

              1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)

              2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
              Rename because of the dual FU-FU and FU-Reg DMs. If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.

              1. So several FUs might share a pipelined ALU, so long as each can track and buffer the results? But once an FU is issued an instruction it has to track it to commit?

              I may just have to go read that book. Are Mitch's addendum chapters publicly available?


              Originally posted by lkcl View Post

              that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)

              https://bugs.libre-soc.org/show_bug.cgi?id=213#c48
              I've been looking at this CR thing for a while now, digging into that bug report and the Power ISA specification, and not really getting any great ideas.

              One really bad idea - Ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.

              One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD, maybe.

              Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.



              Comment


              • Originally posted by WorBlux View Post
                First off, thanks for the reply, it's clarified quite a few things for me.



                So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways).
                pretty much, yeah. i mean, the compiler will know the register allocation / usage, and normally would shove out a batch of SIMD instructions (4x 4-wide SIMD to do 16 operations), whereas with SV it would issue *one* scalar operation with VL=16, *knowing* that this means that 16 registers will be needed.

                Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.
                well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimalist amount of porting work to add SV.


                Rename because of the dual FU-FU and FU-Reg DMs.
                the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph)

                If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.
                pretty much

                B has a Read-after-Write hazard on A, C has a Write-after-Read hazard on B. yes absolutely, C can go ahead in parallel, create the result, and once the WaR hazard is dropped by B, the "hold" goes away.

                C is then allowed to raise "Write_Request", C will (at some point) be notified "ok, RIGHT NOW, you must put data, RIGHT NOW, on this clock cycle, for one cycle only, the data you want writing to the regfile". this is the "GO_WRITE" signal, and following that GO_WRITE (the cycle after), C absolutely must drop its Write_Request (because it's done its write). that "drop" of the Write_Request also goes into the FU-FU and FU_Regs Dependency Matrices to say "i no longer have a dependency: i'm totally done, no longer busy, and therefore free to be issued another instruction".

                it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.
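a toy walk-through (python, purely illustrative: not the real 6600/Libre-SOC scoreboard) of the A/B/C hazard sequence above:

```python
# toy walk-through of the hazard chain described above: A writes r3,
# B reads r3 (RaW hazard on A), C also writes r3 (WaR hazard on B).
# C may compute early and park its result in its output latch, but it
# only raises Write_Request once B has pulled its read and the WaR
# hazard drops. purely illustrative, not the real scoreboard logic.

class FunctionUnit:
    def __init__(self, name):
        self.name = name
        self.result_latched = False   # result parked in output latch
        self.write_requested = False  # Write_Request raised to regfile

c = FunctionUnit("C")
war_hazard_on_c = True        # B has not yet pulled its read of r3

c.result_latched = True       # C computes in parallel, result held
can_write = c.result_latched and not war_hazard_on_c  # still blocked

# ... A completes its write, B pulls its read, the "hold" goes away ...
war_hazard_on_c = False
if c.result_latched and not war_hazard_on_c:
    c.write_requested = True  # C may now raise Write_Request
```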

                1. So several FUs might share a pipelined ALU,
                yes. Mitch Alsup calls this "Concurrent Computation Units". basically if you have a 4-long pipeline, you have at least 4 "RSes" and you schedule one (and only one) of them to let it get data into the front of that pipeline, in each clock cycle.

                so long as each can track and buffer the results? But once an FU is issued an instruction it has to track it to commit?
                it's non-negotiably critical that they do so. failure to keep track of results is 100% guaranteed to result in data corruption.

                the only thing that's slightly odd in the Concurrent Computation Unit case is: the FU is *not* the pipeline, it's the RS connected *to* the Pipeline. or, put another way, 4x RSes connected to a shared (mutexed) pipeline is actually *FOUR* separate Function Units.

                it got painful enough explaining on comp.arch why it matters to computer science that there be different terminology to distinguish "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".

                eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".

                Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".

                would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?


                I may just have to go read that book. Are Mitch's addendum chapters publicly available?
                yes if you send me your email address (PM me) and indicate that you agree that if you share the files with anyone else you must ask them to credit Mitch Alsup if they use any of the material in it, and to require them to (recursively) request the same conditions (recursively) on those follow-on recipients.


                I've been looking at this CR thing for a while now, digging into that bug report and the Power ISA specification, and not really getting any great ideas.


                One really bad idea - Ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.

                One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD, maybe.
                tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).

                btw there are no bad ideas at this stage.

                Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.

                Last edited by lkcl; 10-18-2020, 01:12 AM.

                Comment


                • Originally posted by lkcl View Post
                  well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimalist amount of porting work to add SV.
                  Ya it's such a huge advance in API, I'm surprised it took 4 iterations of mainstream SIMD before chip designers said "ya, that's a problem, I admit"
                  Originally posted by lkcl View Post
                  the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph)
                  I think DAG translates to Dataflow diagram, for those who are less mathematically astute.
                  Originally posted by lkcl View Post
                  it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it

                  ....

                  it got painful enough explaining on comp.arch why it matters to computer science that there be different terminology to distinguish "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".

                  eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".

                  Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".

                  would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?
                  I can believe it. After the RISC revolution it seems everyone ran off in the same direction, with most going down the CPU lane and a few going VLIW. A lot of iterative improvements, like 4% better branch predictors or an iterative improvement on LRU cache eviction. But fundamental design choices seemed to be stamped in steel (with million-dollar+ lithography screens anyways). If we can credit openrisc/RISC-V with anything, it's reviving interest and toolchains in the open commons.
                  Originally posted by lkcl View Post
                  tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).
                  Ya certainly. I'm also seeing why prior designs pushed SIMD and let the compiler deal with all the edge cases. Especially if you're on a tight timeline trying to hit that next process node first. Not being on a tight schedule or tied to a particular node gives you the time and creative space to get it right. Also, little wonder why you, Mitch, and Ivan Godard all end up in the same places.

                  Comment


                  • Originally posted by WorBlux View Post
                    Ya it's such a huge advance in API, I'm surprised it took 4 iterations of mainstream SIMD before chip designers said "ya, that's a problem, I admit"
                    all they had to do was look at the Cray architecture! Cray did vectors large enough that the regfile had to be held in external ultra-fast SRAM.

                    btw you may be intrigued to know that a number of people working for Northrop Grumman, and others who used Cray supercomputers, were significant contributors to RVV.

                    I think DAG translates to Dataflow diagram, for those who are less mathematically astute.
                    (and like blockchains which are all DAGs)

                    I can believe it. After the RISC revolution it seems everyone ran off in the same direction, with most going down the CPU lane and a few going VLIW.
                    realistically it has to be said that the only highly commercially successful VLIW processor is the TI DSP series. these are VLIW-double-interleaved, so really good for audio stereo processing. they're also typically programmed in assembler.

                    SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.

                    250.

                    the RVV equivalent is *14*

                    A lot of iterative improvements, like 4% better branch predictors or an iterative improvement on LRU cache eviction. But fundamental design choices seemed to be stamped in steel (with million-dollar+ lithography screens anyways). If we can credit openrisc/Risc-v with anything it's reviving interest and tool-chains to the open commons.
                    well we can, but only because they got there before IBM had finished the preparatory work for opening PowerISA. Hugh said that he was hilariously contacted by tons of people saying, "duuude, OPF should totally do what RISCV is doing" and he had to tell them that a small team had been preparing exactly that, quietly, for 10 years

                    Ya certainly. I'm also seeing why prior designs pushed SIMD and let the compiler deal with all the edge cases.
                    indeed. it's veeery seductive. slap another opcode on, drop in a new ALU, and as far as the architecture is concerned, the SIMD opcode is literally no different from any other scalar operation. 2xFP32 is one opcode, 1xFP64 is another, neither the ISA nor architecture knows or cares.

                    dead simple, right?

                    wark-wark

                    Especially if you're on a tight timeline trying to hit that next process node first. Not being on a tight schedule or tied to a particular node gives you the time and creative space to get it right. Also, little wonder that you, Mitch, and Ivan Godard all end up in the same places.
                    Mitch does all his designs at the gate level. he studied the 6600 very early on, recognised its genius, and based the Motorola 88000 on it. his expertise developed from there.

                    Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.

                    but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, which tags the value from that point onwards, and are carried right the way through to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.

                    me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD


                    Comment


                    • Originally posted by lkcl View Post
                      SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.

                      250.

                      the RVV equivalent is *14*
                      Which kind of makes you wonder how much performance is being left on the table by clogging your caches with code rather than data...

                      Originally posted by lkcl View Post
                      Mitch does all his designs at the gate level. he studied the 6600 very early on, recognised its genius, and based the Motorola 88000 on it. his expertise developed from there.

                      Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.

                      but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, which tags the value from that point onwards, and are carried right the way through to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
                      I'm fairly sure on the Mill that the load specifies width, and the instruction provides type (and that pointer is a hardware type distinct from unsigned integer) http://millcomputing.com/wiki/Instruction_Set

                      Of course they are changing and refining all the time. But it is a very CISC instruction set. Not only is each family member binary-incompatible, every FU slot has a different binary encoding (and set of supported ops), which is one of the ways they get away with such a wide issue.

                      This, combined with the conceptual conveyor belt, means nobody except the crypto guys are going to be writing anything in raw assembler. I think their plan for hardware initialization is to include a Forth interpreter in the ROM. Not unheard of, but a very different approach from current mainstream.

                      Originally posted by lkcl View Post
                      me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD
                      One advantage of working for yourself, I suppose.

                      Comment
