OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs

  • bridgman
    AMD Linux
    • Oct 2007
    • 13187

    Originally posted by mSparks View Post
    They call it "Ultra wide instruction arch". Doesn't mention that either.
    Have to disagree with this - by the second page the author was already talking about execution engine width and that continued as far as I had time to read. Agree that the word "ultra" was not used, although that's more of a marketing word than a technical word anyways.

    Originally posted by mSparks View Post
    We had several pages of someone (not coder iirc) trying to convince me that VLIW can only run ops in order (specifically cannot do out of order instructions)
    A lot of this discussion seems to be driven by subtle differences in terminology. As an example, I believe it's fair to say that the generally accepted definition of "out of order execution" is that the hardware executes instructions in a different sequence from the instruction stream, not different from the programmer's source code. By that definition a typical VLIW processor would not be executing OOO.

    I don't believe anyone is saying that VLIW processors absolutely can not execute instructions OOO relative to the instruction stream (it is obviously possible albeit more complex to manage dependencies across bundles than across individual instructions), just that VLIW has always been used as an alternative to dynamic scheduling (OOO) that provides some of the benefits with much less complexity.

    If you were designing an OOO processor you would generally not want a VLIW ISA because it adds even more complexity with little or no benefit, but as far as I know that is the only reason we don't see OOO VLIW processors.
    Test signature

    Comment

    • mSparks
      Senior Member
      • Oct 2007
      • 2065

      Originally posted by Khrundel View Post
      As we already discussed, that won't work.

      I don't know why Intel buys any company with VLIW cores.
      Have you actually tried an M1 or Elbrus? Or are you just spouting FUD?

      My M1 MacBook Air is performing almost as well as my AMD 5900X desktop, and outperforms the other half's AMD 4800H doing Excel calcs by like 2 or 3 times.

      Comment

      • Khrundel
        Senior Member
        • May 2016
        • 327

        Originally posted by mshigorin View Post
        Incorrect. You somehow grant OoO the inherently perfect scheduling in the first place -- did you wonder about "magic", what's yours here, decoder?
        Let's imagine we have a decent VLIW compiler and some OoO CPU. Can we adapt this compiler for this CPU? Naturally: VLIW code is just limited, sparse packs of simple instructions. Moreover, we can imagine, for example, that instead of a 5-way OoO superscalar we have a 7-way superscalar. Our good VLIW compiler will find 7 independent "threads" of execution and place them. And if 2 of these "threads" stall, our 5-way superscalar will still run at full speed, provided its OoO window can hold more than 2*latency instructions.
        I mean, why do you think an OoO CPU can't use the compiler's help? It can. It is just more forgiving than VLIW, so the compiler can afford some risk.
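
As a rough illustration of "the compiler finds independent instructions and places them", here is a minimal Python sketch of a greedy bundle packer. It is not how any real VLIW compiler (for Elbrus or anything else) works. The function name `pack_bundles`, the dependency-tuple encoding and the six-instruction example are all assumptions made up for this sketch.

```python
def pack_bundles(deps, width):
    """Greedily pack instructions into bundles of at most `width` slots.

    deps[i] is a tuple of earlier instruction indices whose results
    instruction i needs (hypothetical encoding, acyclic by construction).
    An instruction may only join a bundle if all of its producers were
    placed in an EARLIER bundle, so a bundle never depends on itself.
    """
    placed = {}      # instruction index -> bundle number
    bundles = []
    remaining = list(range(len(deps)))
    while remaining:
        bundle = []
        for i in remaining:
            if len(bundle) == width:
                break
            if all(d in placed for d in deps[i]):
                bundle.append(i)
        # Record placement only after the bundle is closed.
        for i in bundle:
            placed[i] = len(bundles)
        remaining = [i for i in remaining if i not in placed]
        bundles.append(bundle)
    return bundles


# Toy program: a load (0) with one consumer (1), plus an independent
# four-instruction dependency chain (2 -> 3 -> 4 -> 5).
DEPS = [(), (0,), (), (2,), (3,), (4,)]

if __name__ == "__main__":
    print(pack_bundles(DEPS, width=2))  # [[0, 2], [1, 3], [4], [5]]
```

The same ordering could be fed to an OoO core as a plain instruction stream. The hardware is then free to ignore the bundle boundaries whenever something stalls.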
        Speaking of compilers. I talked with Vadim who wrote radeon_si optimizations in Mesa. He told me that he was disappointed with AMD engineers being unable to do that decently in a year, so he did that in three months as a bet.
        Correct me if I'm wrong, but Southern Islands (SI) was AMD's first non-VLIW GPU. VLIW was in the Radeons before SI.

        Originally posted by mshigorin View Post
        I write this off an e2kv4 workstation that started running v3 ALT back when we only had that. 7za b performance difference between v3-on-v4 and v4-on-v4 was about 1% (small but consistent). My ALT Rescue image for v3 runs on anything from v3 to v6, at least in its [only] text mode.
        That is why I suggested the example of creating a LITTLE core. You know, ARM's big.LITTLE, where a thread goes to different cores depending on whether performance or economy is the priority. Imagine that your VLIW must, architecturally, contain 3 FPUs, but statistically you need only 2 most of the time. Or you could make some unit much simpler, at the cost of some rare operation taking a couple of cycles longer. Changes of this kind are either impossible for VLIW or will have a much greater impact.

        Comment

        • Khrundel
          Senior Member
          • May 2016
          • 327

          Originally posted by mSparks View Post
          Have you actually tried an M1 or Elbrus? Or are you just spouting FUD?

          My M1 MacBook Air is performing almost as well as my AMD 5900X desktop, and outperforms the other half's AMD 4800H doing Excel calcs by like 2 or 3 times.
          I'm afraid your "Excel works smoothly" isn't the best benchmark for a CPU.

          Comment

          • mSparks
            Senior Member
            • Oct 2007
            • 2065

            Originally posted by bridgman View Post

            Have to disagree with this - by the second page the author was already talking about execution engine width and that continued as far as I had time to read. Agree that the word "ultra" was not used, although that's more of a marketing word than a technical word anyways.
            What's the instruction width of VLIW?

            M1 is apparently
            https://images.anandtech.com/doci/16...torm_575px.png

            8 instructions wide.


            Originally posted by bridgman View Post

            A lot of this discussion seems to be driven by subtle differences in terminology. As an example, I believe it's fair to say that the generally accepted definition of "out of order execution" is that the hardware executes instructions in a different sequence from the instruction stream, not different from the programmer's source code. By that definition a typical VLIW processor would not be executing OOO.
            Well, yeah, but there is
            In Order
            and there is Out of Order

            Saying it can't do Out of Order means it must do In Order.

            And, "AIUI" the point of the "long instructions" is they have multiple instructions that have no order, aka they are "out of order" (and not in the broken sense).

            Comment

            • mSparks
              Senior Member
              • Oct 2007
              • 2065

              Originally posted by Khrundel View Post
              I'm afraid your "Excel works smoothly" isn't the best benchmark for a CPU.
              But it's what we care about.
              We paid $2700 for a Windows laptop for the other half that takes an hour before Excel becomes responsive again after you click "calculate now".
              I bought an M1 MacBook Air for like $1500 as a backup/mobile for my 5900X Linux desktop, and it completes the same calcs in 15 minutes.

              Other half is _very_ jealous of my M1, and that laptop is probably the last Windows machine we ever buy.
              Last edited by mSparks; 25 February 2022, 12:38 PM.

              Comment

              • jabl
                Senior Member
                • Nov 2011
                • 650

                So just because an OoO core is X-wide doesn't make it VLIW. There's no bundling of X instructions into an X-wide bundle; each instruction is scheduled independently. Which instructions happen to execute concurrently depends on which instructions are ready to execute, the available pipeline slots in the core, and so on. And crucially, this can vary dynamically, e.g. depending on how quickly a load instruction executes.

                A VLIW core is typically considered in-order if each instruction bundle is executed in order, notwithstanding that each individual instruction in a bundle is executed concurrently with the others in the same bundle. I think Intel had some plans to make OoO Itanium cores but never got that far before they cancelled the entire product line.
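
The difference described above can be made concrete with a toy cycle-count model in Python. This is only a sketch under made-up assumptions: a 2-wide machine, a six-instruction program, unit latencies except for one load, and a load that either hits (1 cycle) or misses (10 cycles). The names `Instr`, `make_program`, `BUNDLES`, `run_in_order_bundles` and `run_out_of_order` are invented for illustration and do not model any real core.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    deps: tuple    # indices of instructions whose results this one needs
    latency: int   # cycles from issue until the result is available


def make_program(load_latency):
    # One load with a consumer, plus an independent chain c -> d -> e -> f.
    return [
        Instr("load a", (), load_latency),
        Instr("use a", (0,), 1),
        Instr("c", (), 1),
        Instr("d", (2,), 1),
        Instr("e", (3,), 1),
        Instr("f", (4,), 1),
    ]


# Static schedule a compiler might emit when it assumes the load hits.
BUNDLES = [[0, 2], [1, 3], [4], [5]]


def run_in_order_bundles(program, bundles):
    """Bundles issue strictly in order; a bundle stalls as a whole until
    every operand of every instruction in it is ready."""
    done = {}
    cycle = 0
    for bundle in bundles:
        ready = max((done[d] for i in bundle for d in program[i].deps), default=0)
        cycle = max(cycle, ready)
        for i in bundle:
            done[i] = cycle + program[i].latency
        cycle += 1
    return max(done.values())


def run_out_of_order(program, width):
    """Each cycle, issue up to `width` instructions (oldest first) whose
    operands are ready, regardless of program order."""
    done = {}
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        ready = [i for i in pending
                 if all(d in done and done[d] <= cycle for d in program[i].deps)]
        for i in ready[:width]:
            done[i] = cycle + program[i].latency
            pending.remove(i)
        cycle += 1
    return max(done.values())


if __name__ == "__main__":
    for lat in (1, 10):
        prog = make_program(lat)
        print(lat, run_in_order_bundles(prog, BUNDLES), run_out_of_order(prog, 2))
    # load hit  (1 cycle):   both finish after 4 cycles
    # load miss (10 cycles): bundles finish after 13 cycles, OoO after 11
```

When the load behaves as the compiler assumed, the two models tie. Only when the latency changes at run time does the dynamically scheduled model pull ahead, because the independent chain keeps issuing while the bundle containing the load's consumer is stalled.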

                Comment

                • bridgman
                  AMD Linux
                  • Oct 2007
                  • 13187

                  Originally posted by mSparks View Post
                  What's the instruction width of VLIW?
                  M1 is apparently 8 instructions wide.
                  I think everyone agrees that the execution engine in M1 is very wide (I would say closer to 16-wide than 8-wide once you include load/store and FPU) but that doesn't mean it is VLIW - as far as everyone can see it is just a very wide OOO processor. If it were actually VLIW then it would need instruction bundles in the code it executes, and as far as I know the code executed by M1 looks like pretty standard ARM ISA.

                  Originally posted by mSparks View Post
                  Well, yeah, but there is In Order and there is Out of Order. Saying it can't do Out of Order means it must do In Order.
                  Right, but as far as I can see people are not saying that a VLIW processor absolutely can not do Out of Order, just that nobody has done it and that it probably doesn't make a lot of sense.

                  VLIW and OOO are different approaches to implementing a superscalar processor, and using the two together seems to add more complexity than benefit. It's certainly possible though - and would make a cool thesis project even if not a viable product.

                  Originally posted by mSparks View Post
                  And, "AIUI" the point of the "long instructions" is they have multiple instructions that have no order, aka they are "out of order" (and not in the broken sense).
                  Good point - within an instruction bundle/packet there is no implicit order. I was talking about OOO execution of instruction bundles (long instructions) rather than instructions within a bundle - should have made that more clear.
                  Test signature

                  Comment

                  • mSparks
                    Senior Member
                    • Oct 2007
                    • 2065

                    Originally posted by bridgman View Post
                    just that nobody has done it
                    What is "it"? That sounds to me like you are saying "no one has run opcodes out of order instead of in order with VLIW"?

                    But the _whole point_ of VLIW is the opcodes in a VLIW have no order.
                    Last edited by mSparks; 25 February 2022, 01:36 PM.

                    Comment

                    • bridgman
                      AMD Linux
                      • Oct 2007
                      • 13187

                      Originally posted by mSparks View Post
                      What is "it"? That sounds to me like you are saying "no one has run opcodes out of order instead of in order with VLIW"?

                      But the _whole point_ of VLIW is the opcodes in a VLIW have no order.
                      Sorry - "it" in this context means running instruction bundles (your "long instructions") out of order. There is no implicit order between instructions within a bundle but there is also an expectation that bundles will be executed in order. The only exception was EPIC, which included a "stop bit" indicating that the instructions in the next bundle do not depend on completion of the instructions in the current bundle.

                      As far as I know EPIC was the only implementation that made OOO execution of instruction bundles potentially do-able without a full OOO engine but that was limited to re-sequencing within a group of bundles where all but the last bundle had the stop bit set.

                      Even there I believe the intended purpose was to allow multiple bundles to be executed in parallel on a 6-wide or 9-wide processor (a bundle had 3 instructions) rather than executing bundles out of order on a 3-wide processor. I don't know of any implementation that took advantage of the stop bit to execute bundles out of order, but at first glance it should be possible and would only require tracking of data readiness from memory/cache reads, not the more complex dependency analysis we associate with a full OOO engine.

                      Even with that, you would still be limited to OOO execution of bundles within a group identified by the compiler, and those groups would typically only have a handful of instructions. By comparison a modern OOO processor is able to execute instructions out of order over a much larger window - hundreds of instructions on a modern processor.
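
To give a feel for why the size of the reordering window matters, here is a small Python sketch. It is a deliberately crude model, not a description of EPIC or of any real OOO engine: the "window" is simply the oldest N not-yet-issued instructions, the machine issues one instruction per cycle, and the program (one 20-cycle load, four consumers of that load, then sixteen independent single-cycle instructions) is invented for the example. The names `cycles_with_window`, `LATENCIES` and `DEPS` are hypothetical.

```python
def cycles_with_window(latencies, deps, width, window):
    """Count cycles until the last result is available.

    Only the oldest `window` un-issued instructions are candidates each
    cycle; up to `width` of the ready candidates issue, oldest first.
    """
    done = {}
    pending = list(range(len(latencies)))
    cycle = 0
    while pending:
        candidates = pending[:window]
        ready = [i for i in candidates
                 if all(d in done and done[d] <= cycle for d in deps[i])]
        for i in ready[:width]:
            done[i] = cycle + latencies[i]
            pending.remove(i)
        cycle += 1
    return max(done.values())


# Instruction 0: slow load. Instructions 1-4: all need the load result.
# Instructions 5-20: independent work that could overlap with the load.
LATENCIES = [20] + [1] * 4 + [1] * 16
DEPS = [()] + [(0,)] * 4 + [()] * 16

if __name__ == "__main__":
    for window in (4, 8):
        print(window, cycles_with_window(LATENCIES, DEPS, width=1, window=window))
    # window 4 -> 40 cycles (the four blocked consumers fill the window)
    # window 8 -> 24 cycles (independent work overlaps with the load)
```

In this toy model a window of four cannot see past the stalled consumers, so the independent work waits behind them. Doubling the window is enough to hide most of the load latency, and the same effect scales up with the much larger windows mentioned above.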
                      Last edited by bridgman; 25 February 2022, 02:23 PM.
                      Test signature

                      Comment
