OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs


  • Originally posted by Khrundel View Post
As we already discussed, that won't work.
    As we've discussed already, wrong.

    Originally posted by Khrundel View Post
That is a 100% confession of VLIW's uselessness. [...] Maybe with some shared units or real SMT for cores.
    Now go to Linus and tell him that

    Originally posted by Khrundel View Post
    No, it's not like SSE...
    ...which is a direct descendant of technology stolen from Elbrus, heh.

    Originally posted by Khrundel View Post
Let's imagine we have a decent VLIW compiler, and some OoO CPU. Can we adapt this compiler for this CPU?
    I don't have to imagine as I have both handy.
    Do you know the peculiarities of the decoder? Of its behaviour in genX? genY? given microcode Z on batch NN with AVX throttling capping clock at A.BC GHz in practice?..
    This game can be played both ways.
    But I can go to the compiler guys and just ask; can you?

    Originally posted by Khrundel View Post
    Our good VLIW compiler will find 7 independent "threads" of execution and place them.
...in order; so why drag OoO into this, then? Optimizing compilers for OoO targets do try to help the decoders as well.

The reverse side of OoO is the dark magic of the decoder (aggravated by hypocritical marketing that sells "performance" and stays silent about breaking the assumptions behind "security") -- as has been practically proven by the series of spectacular vulnerabilities over the last several years.

    Originally posted by Khrundel View Post
    I mean why you think OoO CPU can't use help of compiler?
Why do you claim I think something I never said I think? Would you like me to use this method on you? I could, but it's childish.

    Originally posted by Khrundel View Post
Imagine your VLIW, from an architectural point of view, must contain 3 FPUs, but statistically most of the time you need only 2. Or you can make some unit way simpler, but some rare operation will take a couple of cycles longer. These kinds of changes are either impossible for VLIW or will create a much greater impact.
You're blaming a solution to a different problem for being inapplicable (or hard to apply) -- the common management advice in such cases is to step back first and look at the larger problem, to understand whether the approach is right in the first place.

I've seen too many cases where people heroically fought non-problems with the "solutions" they had arrived at -- the problem was simply posed wrong in the first place, and realizing that and reformulating it correctly would have helped a lot.

    Drop the decoder and you might just not need that kind of acrobatics anymore.

PS: just in case, VLIW/EPIC are no silver bullet at all; in the particular case of Elbrus, this approach is officially known to have been chosen because HPC hardware was needed at a time when Soviet electronics tended to lag behind but Soviet programmers were very advanced; that's still largely the case for modern Russia.

PPS: you know what... I've been using this 801-PC as my main workstation for four years. It's fine with me, and no one has tried to -- or could -- force me to switch. Reading your allegations about the relevant hardware is, well, a kind of fun. :-)

    Comment


    • Originally posted by bridgman View Post

      Sorry - "it" in this context means running instruction bundles (your "long instructions") out of order. There is no implicit order between instructions within a bundle but there is also an expectation that bundles will be executed in order.
But that is what OoO silicon does as well: it creates bundles of instructions with no implicit order (from the ordered instructions it receives) that can all be run on the same clock cycle.

The only difference I'm seeing here is that VLIW allows it to be precompiled, while OoO relies on implementing more or less the same algorithms in silicon and doing it at runtime.

      Comment


      • Originally posted by mshigorin View Post
        As we've discussed already, wrong.
Sorry, your opinion just doesn't matter. In reality, you can prefetch data long enough in advance only in rare cases. If you don't understand this, just stop wasting my time.

        Originally posted by mshigorin View Post
        ...which is a direct descendant of technology stolen from Elbrus, heh.
May I ask you a personal question? Do you work at MCST? And if so, does your employer force you to tell cool stories about everybody stealing from Elbrus? I've read some old interview with the Elbrus 2000 founder who claimed the Pentium was a descendant of technology stolen from Elbrus: some ex-employee of that guy was hired by Intel and, using the stolen technology, created a CPU and even named it after himself. Everything good was invented within Elbrus. Pentium, VLIW, you name it. And SIMD too. Seymour Cray? Never heard of him.

        Originally posted by mshigorin View Post
        I don't have to imagine as I have both handy.
        Do you know the peculiarities of the decoder? Of its behaviour in genX? genY? given microcode Z on batch NN with AVX throttling capping clock at A.BC GHz in practice?..
Please, stop pretending you are smarter than you are. With all this stupid bragging you've forgotten what you were answering, and now I have to repeat myself.
        Originally posted by mshigorin View Post
...in order; so why drag OoO into this, then?
        Hmm... there is no such thing as "in order code" LOL. OoO is a property of a CPU, not code.
Optimizing compilers for OoO targets do try to help the decoders as well.
Next time, instead of writing some garbage, try to read and understand. The compiler generates code for several independent "threads", like it does for VLIW, except it has more freedom. I should clarify: "thread" is in quotes because it isn't a real thread, it is a sequence of instructions that depend on one another but are independent of other sequences. The compiler interleaves instructions from these "threads", and voila, you have the same performance as in the best-case scenario for VLIW. So why do we need OoO? As I've already written, OoO logic allows the CPU to survive a cache miss. The "thread" containing the stalled instruction just waits in the buffer while the CPU executes other "threads". VLIW, in the same situation, has to stall all "threads" until all instructions of the next VLIW word are ready.
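If it helps, here is a toy sketch in C of what I mean by independent "threads" (an example I made up, not the output of any real compiler): two accumulator chains that never read each other, so their instructions can be interleaved freely -- by a VLIW compiler at build time, or by an OoO scheduler at runtime.

#include <stddef.h>

/* Two independent dependency chains: sum0 and sum1 never depend on each
   other, so their adds can issue in the same cycle. If the load feeding
   chain 0 misses in cache, only chain 0 waits; an OoO core keeps running
   chain 1 from its buffer. */
double sum_two_chains(const double *a, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        sum0 += a[i];     /* chain 0 */
        sum1 += a[i + 1]; /* chain 1, independent of chain 0 */
    }
    return sum0 + sum1;   /* the chains only meet here, at the end */
}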

Now you can argue with me, not with some of your fantasies.

You're blaming a solution to a different problem for being inapplicable (or hard to apply) -- the common management advice in such cases is to step back first and look at the larger problem, to understand whether the approach is right in the first place.
No. I admit there is a real problem Elbrus was designed to solve. When, in the late '90s, nobody knew how to spend the transistor budget, VLIW looked like something sane. At that moment Boris Babayan tried to get government funding to create a Merced-like CPU (Merced was the codename for Itanium). Later, after it became clear that VLIW was just a bad idea, Elbrus itself still solved the main problem: it removes any competition and prevents results from being measured. 20 years of development and it is still a "promising" architecture that will flourish some day, once the compiler is finished. Had they dropped Elbrus 10 years ago, just bought an ARM license and employed some good developers, they could have had their own unique modern CPU core by now, with all the software already ported.

        Drop the decoder and you might just not need that kind of acrobatics anymore.
I think I've already heard this somewhere... Ah, of course, that was the RISC mantra. And now they have decoders too.
        PS: just in case, VLIW/EPIC are no silver bullet at all;
Yes. It is not a bullet, it is a fancy bolt which will perform at least as well as bullets once we finish our new fancy crossbow.

        Comment


        • Originally posted by Khrundel View Post
          Hmm... there is no such thing as "in order code" LOL. OoO is a property of a CPU, not code.
          Hmm, yes there is. It is the order instructions are executed.

          a+b=c
          c+a=d

order matters here; you cannot do them at the same time -- you need the answer to a+b before you can calculate (the expected) answer to c+a

          a+b=c
          c+a=d
          e+c=f

          the order of the last two instructions doesn't matter here, so you can do them "out of order", namely calculating c+a=d and e+c=f AT THE SAME TIME. This is where the perf benefit of OoO and VLIW comes from.

          VLIW works exactly the same way, only instead of determining that c+a=d and e+c=f can be run "out of order" in the CPU silicon the compiler puts c+a=d and e+c=f IN A SINGLE LONG WORD.

The whole point of "OoO" is that CPUs have more than one FP unit on a single core; if you just ran those instructions "in order" -- a+b=c, then c+a=d, then e+c=f -- it would take 50% more CPU cycles than doing a+b=c, then c+a=d and e+c=f together

There is no performance benefit to be had by "just" changing the order they run in; it actually needs to reduce the number of cycles needed to complete them by using more of the available silicon -- silicon, latency and power that OoO needlessly wastes by doing it at runtime.
          Last edited by mSparks; 25 February 2022, 07:07 PM.

          Comment


          • Originally posted by mSparks View Post
            Which is pretty much what I meant by "isn't that the whole point of VLIW"

            the reorder buffer does in silicon what the VLIW compiler calculates before the instructions are even sent to the stack.

Obviously not all RoB silicon or VLIW compilers are created equal, but they are doing pretty much the same thing: removing the requirement that every instruction runs in order, one after the other, in sequence.
Right. Doing the same job, but the hardware knows which reads haven't yet completed and dynamic branch probabilities. In contrast, the compiler has fewer limitations on its visibility and the types of optimizations it can do.

            Comment


            • Originally posted by Khrundel View Post
As we already discussed, that won't work.
Hardware prefetching can work better than you probably expect -- even for linked lists and trees. Depending on how much memory fragmentation you have, whether the list is built from start to end, and what's happening between node additions, a linked list isn't necessarily so random in its memory layout. In the simplest case, it's just as linear as an array:
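For instance, a minimal C sketch (the struct and sizes are made up for illustration): nodes appended one after another from a single fresh allocation end up contiguous, so walking the list touches memory as linearly as walking an array -- exactly what a hardware prefetcher likes.

#include <stdio.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

int main(void)
{
    /* Allocate all nodes from one block and link them in order: the
       list's memory layout is then strictly linear, just like an array. */
    enum { N = 8 };
    struct node *nodes = malloc(N * sizeof *nodes);
    if (!nodes)
        return 1;
    for (int i = 0; i < N; i++) {
        nodes[i].value = i;
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
    }
    for (struct node *p = nodes; p; p = p->next)
        printf("%p -> %d\n", (void *)p, p->value);
    free(nodes);
    return 0;
}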


              Originally posted by Khrundel View Post
That is a 100% confession of VLIW's uselessness. If you have so many threads that you can hide latencies, why bother with VLIW in the first place?
              It's a question of scale. You could make the same argument against x86 -- that if it needs SMT, it must be broken. Well, having just SMT-2 works pretty well for it, and most people seem to be happy to use it.

              BTW, AMD used a barrel architecture for their GCN GPUs and CDNA accelerators. It's simply a technique from a processor architect's bag of tricks, not an admission of failure.

              Originally posted by Khrundel View Post
And what is the working set size? Less than half of the 96 MB of L3 on a Ryzen 5800X3D?
              In the case of a DSP, it could be a couple MB or less.

              Originally posted by Khrundel View Post
Wait a minute. I wasn't aware we were discussing something so distant from a general-purpose CPU.
              You made categorical statements about their uselessness. That throws open the doors to any and all applications.

              Originally posted by Khrundel View Post
I think, as anywhere, it is more like "we are too small to put all our eggs in the same basket, so we start with something more flexible and then replace the hotspot with an ASIC".
              These are programmable cores inside of ASICs and SoCs. It seems like you're finally coming to accept that VLIW does have a niche.

              Originally posted by Khrundel View Post
              Are nvidia's tensor cores based on VLIW?
              Tensor cores are hard-wired and limited in what types of operations they can accelerate. They're very good at what they do, but too limited if it's your only computational primitive.

              Originally posted by Khrundel View Post
I don't know why Intel buys any company with VLIW cores. I suppose not because of the VLIW technology, because Intel already has EPIC and it wouldn't be a problem to develop another one for an accelerator.
Intel is a business. When they do acquisitions, they look for ready solutions that meet current and future market demands. If they built AI processors in-house, they would be late to market -- just look at how long it took them to make a dGPU from when they first disclosed the project, and that's after they already had a decade of GPU IP, experience, and a substantial HW/SW team. So, they take whatever is best among the available options. And those best options just happen to harness VLIW technology.

              Originally posted by Khrundel View Post
No, it's not like SSE... I mean yes, they are like SSE, but unlike usual CPUs, which are a hybrid between a scalar and a vector CPU, they are fully SIMD, and this allows them to look, from the programmer's perspective, like a scalar CPU which executes several threads.
              They do support scalar operations, as well. But yes, they are classical SIMD to the point that they can sustain the conceit of each SIMD lane being a scalar thread.

              Comment


              • Originally posted by coder View Post
Right. Doing the same job, but the hardware knows which reads haven't yet completed and dynamic branch probabilities. In contrast, the compiler has fewer limitations on its visibility
I'd say the compiler has much better visibility into which of the instructions it is creating depend on earlier instructions it created. It can also spend a lot more time doing that analysis over a much larger buffer.
I'd say the main problem with VLIW is that the structure of the word is going to be hardware dependent: if your VLIW definition only supports 8 simultaneous opcodes and you want to build a chip with 12 FP units, then the VLIW becomes "not fit for purpose" and everything needs recompiling to make good use of it.

                Whereas doing it all in silicon is more akin to doing it with a virtual machine.

OTOH, doing it in silicon must add a comparatively large amount of latency between starting to send instructions to the chip and getting results out of the chip. For many applications that is probably not much of an issue; for others (e.g. monitoring an engine running at 100,000 RPM) it will make the difference between the whole chip being fit for purpose or not.

                Originally posted by coder View Post
                the types of optimizations it can do.
Sounds like a RISC vs CISC argument to me. That's been a hard-fought battle, but with the M1, AMD chiplets, and cheap TFLOP-class tiny devices like the Jetson Nano, I think it's pretty clear parallel RISC has won over complex instruction sets and funky branching. Last I heard, even Intel implements their CISC instructions with RISC + firmware now.

                Comment


                • Originally posted by mSparks View Post

                  Hmm, yes there is. It is the order instructions are executed.

                  a+b=c
                  c+a=d

order matters here; you cannot do them at the same time -- you need the answer to a+b before you can calculate (the expected) answer to c+a
You're talking about dependency, not order. Any code has an order, the order in which it is written. OoO takes this code and finds what can be executed out of order.
                  the order of the last two instructions doesn't matter here, so you can do them "out of order"
Your reasoning is misleading. If I say to you "take a pen in your right hand, scratch your stomach with your left, and then write down the current date", that is not an "out of order" program. I've just given you 2 commands, one of which includes instructions for both your hands. So no, you can't call VLIW "out of order" in any way. Even the first Pentiums or the Cortex-A8 can't be called OoO; they could execute several instructions simultaneously and even check whether the second instruction depends on the result of the first (a VLIW CPU doesn't need this), but OoO requires some other functions.
                  The whole point of "OoO" is CPUs have more than one FP unit on a single core, if you just ran those instructions "in order" then a+b=c then c+a=d then e+c=f would take 50% more CPU cycles than doing a+b=c then c+a=d and e+c=f
                  you are describing superscalar.
                  Here is OoO:
#1. a+b -> c // decodes to r1+r2 -> r8, marks c as r8
#2. c+d -> e // decodes to r8+r3 -> r9, marks e as r9; this depends on #1
#3. e+c -> f // decodes to r9+r8 -> r10, marks f as r10, frees r8 on this op's retirement; this depends on #1, #2
#4. a-b -> c // note: we are overwriting c, but due to register renaming this and the following ops don't depend on the previous ones; decodes to r1-r2 -> r11, marks c as r11
#5. c+d -> g // decodes to r11+r3 -> r12, marks g as r12; this depends on #4

                  And then it executes
                  #1 and #4 // both writing to same register c!
                  #2 and #5 // both reading from same register c
                  #3
That's why we call this out of order.

                  Comment


                  • Originally posted by Khrundel View Post
                    And then it executes
                    #1 and #4 // both writing to same register c!
                    #2 and #5 // both reading from same register c
                    #3
That's why we call this out of order.
Which is exactly what a VLIW CPU executes when it receives the three long words

                    (#1,#4)
                    (#2,#5)
                    (#3,noop)

Only they were created by the compiler rather than at runtime on the silicon.

                    A good compiler would even drop the (#1,#4).
                    Last edited by mSparks; 26 February 2022, 08:26 AM.

                    Comment


                    • Originally posted by Khrundel View Post
Correct me if I'm wrong, but Southern Islands (_SI) was the first non-VLIW GPU of AMD. VLIW was in the Radeons before SI.
                      I think that was a general slight against compiler devs in the West, not meant to be directly VLIW-related.

                      Originally posted by Khrundel View Post
That is why I suggested the example of creating a LITTLE core. You know, ARM's big.LITTLE, where a thread goes to different cores depending on whether performance or economy is the priority. Imagine your VLIW, from an architectural point of view, must contain 3 FPUs, but statistically most of the time you need only 2. Or you can make some unit way simpler, but some rare operation will take a couple of cycles longer. These kinds of changes are either impossible for VLIW or will create a much greater impact.
Of course you can build a VLIW little core! It simply needs to obey the same ISA constraints as the big core. Obviously, that could mean parts of it stall because some operations have higher latencies than on the big core, for instance. And if it has only one FPU instead of 3, then maybe you get stalls whenever an instruction word arrives with more than one FPU instruction.

                      Yes, this is not optimal, but it's entirely possible, and probably not even terribly inefficient. However, this further makes the case for EPIC over VLIW.
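As a back-of-envelope illustration (a toy cost model I made up, nothing resembling the real Elbrus pipeline): the same bundle of 3 FPU operations issues in one cycle on a 3-FPU big core and takes three cycles on a 1-FPU little core -- slower, but still correct.

#include <stdio.h>

/* Toy model: a bundle carrying fpu_ops floating-point operations needs
   ceil(fpu_ops / fpu_units) issue cycles on a core with fpu_units FPUs. */
static int cycles_for_bundle(int fpu_ops, int fpu_units)
{
    return (fpu_ops + fpu_units - 1) / fpu_units;
}

int main(void)
{
    int bundle_fpu_ops = 3; /* assume the ISA allows up to 3 FPU slots per word */
    printf("big core, 3 FPUs: %d cycle(s)\n", cycles_for_bundle(bundle_fpu_ops, 3));
    printf("little core, 1 FPU: %d cycle(s)\n", cycles_for_bundle(bundle_fpu_ops, 1));
    return 0;
}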

                      Comment
