OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs

  • mshigorin
    Junior Member
    • Feb 2022
    • 17

    #81
    Originally posted by coder View Post
    FWIW, I just want to express my best wishes for the people of Ukraine and Russia.
    As someone with passports of both Russia (just got it recently) and Ukraine (I won't go to the embassy as I'm on a ukronazi blacklist, and I won't bet on those being adequate even in Moscow), I can only reply:
    THANK YOU!

    Originally posted by coder View Post
    Let's hope for the best outcome for Ukraine
    It's begun already.

    People who were oppressed by the totally inhuman dictatorship of a clown (not even a joke, unfortunately: Zelensky is a professional clown), and of a confectioner before that -- both of them oligarchs of Jewish origin, incidentally -- are being freed from it.

    The Russian Army tells Ukrainian soldiers they'll be safe if they do not fight, and that the population has nothing to fear.

    I know many people throughout all of the former Ukraine -- from Donetsk to Lvov -- and I do care for them, even those who were badly fooled and are unable to answer simple questions ("so who won that revolution?").

    Russia doesn't start wars, but it definitely ends them.

    DISCLAIMER: of course I remember that the first victim of any war is truth, and I know that Russian propaganda (of course there is some; that's the case with every state alive that I know of) does influence me as a Russian patriot.
    Originally posted by coder View Post
    Prefetching is essential, even for modern, out-of-order cores. Because even they don't have big enough reorder buffers to hide the latency of a read that has to go all the way out to DRAM. And deep reorder buffers presume you can even find enough work to do that doesn't depend on the missing data.
    I've heard (but haven't seen) that a carefully optimized crypto algorithm running on a 500 MHz Elbrus 2C+ would outrun a 1500 MHz Core 2 Duo -- yes, that took some compiler tweaking, but the estimate was that the x86 CPU would only be able to do the same with a reorder buffer roughly 1500 instructions deep (that's the number as I remember it; it might have been several hundred -- not sure on that one).
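
    Since prefetching keeps coming up, here is a minimal C sketch of what an explicit software prefetch looks like (my own illustration, nothing from OpenBLAS or lcc; the lookahead of 16 elements is an arbitrary placeholder, and __builtin_prefetch is the GCC/Clang spelling, lcc's hint may be spelled differently):

    Code:
    #include <stddef.h>

    /* Sum an array while hinting upcoming elements into cache ahead of use.
       The prefetch distance (16) is illustrative, not tuned. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
            s += a[i];
        }
        return s;
    }
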
    Originally posted by mSparks View Post
    And the best thing we can do to promote peace is have that system overwhelmingly tell them the internet doesn't believe them, doesn't support them, and they can go [...]
    Guys, you shock me -- in a good sense!

    My conventional test is whether someone claiming something is actually acting on it -- or not. Many verbal "peacemongers" add fuel to wars by what they do. John Perkins described that pretty well in his anonymously sourced "Confessions of an Economic Hit Man".
    Originally posted by tuxd3v View Post
    [...] that the compiler is very difficult to implement if you want to have good performance, because you don't have runtime input..
    It's partly difficult, but partly impossible: runtime information, e.g. whether memory regions overlap, is generally not available at compile time (unless a particular software developer explicitly warrants it).
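
    To make the overlap point concrete, a small C sketch (my own illustration): without any promise from the developer the compiler must assume the two pointers may alias, while the C99 restrict qualifier is one way to give exactly that warranty.

    Code:
    /* Without 'restrict' the compiler must assume dst and src may overlap,
       so it cannot freely reorder or vectorize the loads and stores. */
    void scale_may_alias(float *dst, const float *src, int n, float k)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];
    }

    /* The developer's promise that the regions do not overlap; now the
       compiler is free to software-pipeline or vectorize the loop. */
    void scale_no_alias(float *restrict dst, const float *restrict src,
                        int n, float k)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];
    }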

    Originally posted by tuxd3v View Post
    That big advantage for the clients didn't exist for Itanium.
    I think I'll pass your considerations to MCST folks as these are very sound to me.

    Originally posted by tuxd3v View Post
    [...] they previously had lcc, I believe? They are now using a gcc version, I think.
    MCST's compiler is the EDG-based lcc; they experimented with gcc back in the 2.9x days, but the high-level information needed for loop optimization was lost by the time the IR got down to the backend. They're experimenting with LLVM now and have built a kind of IR translator that has some potential for merging both LLVM and LCC optimizations.

    Originally posted by tuxd3v View Post
    Does the compiler already support Elbrus 16S, with SIMD instructions and such?
    It does; I'm rebuilding ALT's package base for e2kv6 with it right now.

    Regarding SIMD, it's a bit different here: there are provisions in the compiler to grok both x86 intrinsics and ARM NEON, so one can write code using those intrinsics and have it more or less optimized on e2k as well (some projects are switching to SIMDe already, and there's a patch for that too). The implementation just uses the existing register file and ALUs.
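
    For illustration, this is roughly what the SIMDe route looks like (generic SIMDe usage on my side, not code from any actual e2k port; it assumes the header-only SIMDe library is on the include path):

    Code:
    #include <simde/x86/sse.h>   /* portable SSE intrinsics via SIMDe */

    /* Add two float arrays four lanes at a time.  On x86 this lowers to
       native SSE; elsewhere SIMDe supplies an equivalent implementation
       (NEON, or plain scalar code if nothing better is available). */
    void add4(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            simde__m128 va = simde_mm_loadu_ps(a + i);
            simde__m128 vb = simde_mm_loadu_ps(b + i);
            simde_mm_storeu_ps(dst + i, simde_mm_add_ps(va, vb));
        }
        for (; i < n; i++)       /* scalar tail */
            dst[i] = a[i] + b[i];
    }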

    Originally posted by tuxd3v View Post
    Another CPU of interest to me is the MultiClet S2, operating at 2.5 GHz on 16 nm
    I've heard of Multiclet but haven't seen one so far; if you get stuck with some information available only in Russia, feel free to email me at [email protected] and maybe I'll be able to ask the guys.

    Originally posted by coder View Post
    What I really want is for Michael to keep covering all tech, whether it's from China, Russia, or anywhere else. I don't think he's easily dissuaded by arguments in the forums, but it would be nice if we can also have intelligent discussions about this tech, and not get bogged down in politics.
    Yes, he does a great job at that -- and I'm doing my best so that all of us can use and enjoy this tech wherever we need it; I've had totally great email exchanges with colleagues from many countries, and it helps me believe in humanity even in these times of change, when I see DECENT and HONEST people whose ACTIONS are not split from their WORDS.

    Folks, you almost made me cry smiling!
    Originally posted by caligula View Post
    Who's the idiot now? He started the genocide a few minutes ago.
    You can find one in the closest mirror: "he" started *stopping* the genocide last night.

    If you can't tell genocide from democracy, try finding bread-consumption numbers (heck, even official population numbers) for the former Ukraine. In a nutshell, the actual population is now about half of what it was 30 years ago. And the difference is on par with all of the USSR's losses in WWII: more than 20 million people.
    Originally posted by Khrundel View Post
    VLIW proponents usually talk about a magic compiler which will somehow solve all problems. But that is nonsense.
    Hope that every single one of us knows the real price of well-marketed silver bullets.

    Originally posted by Khrundel View Post
    PGO just allows the compiler to evaluate probabilities more correctly.
    IIRC, building xz with PGO (lcc 1.25 branch) got it running about 20% faster (or was it closer to 15%? something in that range) on e2kv4, as a matter of fact, for us here.
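
    For anyone who hasn't used PGO, a sketch of mine (generic GCC/Clang flags in the comment; lcc spells its PGO options differently) of the kind of branch-probability knowledge a profile gives the compiler, plus the manual __builtin_expect hint one can use without a profile:

    Code:
    /* Typical GCC/Clang PGO flow:
       1. build with -fprofile-generate, 2. run a training workload,
       3. rebuild with -fprofile-use.
       The profile then tells the compiler that the branch below is almost
       never taken, so it lays out and unrolls the hot path accordingly.
       __builtin_expect conveys the same hint by hand. */
    int count_escapes(const unsigned char *buf, int n)
    {
        int escapes = 0;
        for (int i = 0; i < n; i++) {
            if (__builtin_expect(buf[i] == '\\', 0))  /* rare in typical input */
                escapes++;
        }
        return escapes;
    }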

    Originally posted by Khrundel View Post
    But the unforgiving nature of VLIW forces the compiler to be more pessimistic. That means that if someone were able to create a state-of-the-art compiler able to produce near-perfect code for a 16-way VLIW, then this compiler could easily be adapted to be more optimistic and to rearrange scalar code in such a way that some 24-way OoO superscalar CPU would be able to fill all its pipelines.
    Do you know the x86 decoder well? And if yes, which one?

    Originally posted by Khrundel View Post
    OoO is always better than VLIW.
    Incorrect, as I've mentioned in this very post. OoO has many upsides, but your text resembles Maslov's in glossing over the problems inherent to OoO and its widespread implementations -- problems that have proved capable of eating a good chunk of performance when ignored.
    Originally posted by tomas View Post
    But accepting Russia Today's agenda is completely legitimate?
    You can try to avoid buying into any agenda instead of resorting to demagoguery (there are people familiar with those tricks out there, you know). You'll fail from time to time (I do too), but putting some attention into your own fact checking -- at least where you can ask people on site whom you know in person, if you can't see the matter with your own eyes -- can reward you a lot in sorting out which sources you can more or less trust and which you have witnessed spreading misinformation.

    For example, the notion of "separatism". Why wasn't Lvov shelled back in 2014, when even British sources witnessed its clear act of separatism? Read up on Ruslan Kotsaba's fearless confirmation that "Donetsk has its own maidan, except without oligarchs". He was one of the prominent pro-maidan reporters in Kiev, and he actually had the guts to go into Donetsk and ask people (with his clearly Western-Ukrainian pronunciation, which we knew all too well here) why they were against the "new government". He was declared a criminal and jailed by that "government" specifically for telling the truth publicly.

    The Minsk agreements were a DOA attempt to force Ukrainian Nazis into pulling the army and nazi death squads away from Donbass and going into negotiations. The problem was that their US puppetmasters clearly had no intention for any peace on these lands of historic Russia. Just as with other similar artificial "managed chaos" zones, they aimed for the war.

    The most prominent moment was when Biden and friends yelled "Putin will attack on [DATE]", and Zelensky replied, "no, no, there's no sign of that, stop ruining our economy!". He seems too dumb to have understood before then that he's not a partner, he's a tool.

    And when we've been blamed long enough, we chose to at least deserve that. :]

    Remember the proverbial "Russians harness for a long time, but they drive fast" when you see a fool poking a bear with a stick.


    • jabl
      Senior Member
      • Nov 2011
      • 650

      #82
      Originally posted by mSparks View Post
      So your position is Apple aren't doing with the M1 what they announced they are doing with an M1, and to prove it to myself I should look at a manual for an entirely different processor.
      Apple AFAIK hasn't stated that M1 is a VLIW CPU, or even VLIW-inspired. What they have said is that M1 can execute up to 8 instructions in parallel, which indeed is exceptionally wide, but otherwise something any superscalar OoO core can do. You don't need VLIW to take advantage of instruction level parallelism.

      Further, aarch64 isn't another CPU, but an ISA. Coincidentally, it's the ISA that the M1 executes, similar to how, say, AMD Zen executes the x86-64 ISA. Hence, if you want evidence that the M1 is a VLIW, I recommend searching a manual that describes aarch64, the ISA that the M1 implements.
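
      To illustrate with a generic sketch of mine (nothing Apple-specific): the instruction-level parallelism in ordinary code like this is found by the hardware at run time; the compiler does not have to bundle operations the way a VLIW ISA requires.

      Code:
      /* Four independent accumulators: a superscalar OoO core can issue these
         additions in parallel because it sees at run time that they do not
         depend on each other.  No VLIW-style bundling is encoded in the ISA. */
      double sum4(const double *a, int n)
      {
          double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          int i;
          for (i = 0; i + 4 <= n; i += 4) {
              s0 += a[i];
              s1 += a[i + 1];
              s2 += a[i + 2];
              s3 += a[i + 3];
          }
          for (; i < n; i++)
              s0 += a[i];
          return (s0 + s1) + (s2 + s3);
      }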

      I did read that, but I'm sorry, that is so confusingly written that the only takeaway I got was that the author has an extremely limited understanding of microarchitecture.

      I have since learned that Apple's M1 incorporates some of the innovations from this product, in particular, the Very Large Instruction Word Very Long Instruction Word architecture
      I'd be very interested and surprised to learn about anything VLIW-like in M1.


      • mSparks
        Senior Member
        • Oct 2007
        • 2111

        #83
        Originally posted by jabl View Post

        Apple AFAIK hasn't stated that M1 is a VLIW CPU,
        "Ultra wide execution architecture"
        The Apple slide at 9m22s in the video already linked on page six.
        Last edited by mSparks; 24 February 2022, 12:36 PM.


        • mSparks
          Senior Member
          • Oct 2007
          • 2111

          #84
          Originally posted by caligula

          Igor plz. It's not my fault you dropped out of high school and couldn't find a decent job where you enjoy your pathetic little life. Doesn't Putler pay you 1 ruble per shit post? How much was that again in real money? That is, using the post 24th Feb exchange ratio.
          Let me guess, you are so deeply brainwashed you shit post your Monarch's BS for free. Well thanks for your valuable contribution, please go back to playing with your sisters private parts, adults are talking here.


          • jabl
            Senior Member
            • Nov 2011
            • 650

            #85
            Originally posted by mSparks View Post
            "Ultra wide execution architecture"
            The Apple slide at 9m22s in the video already linked on page six.
            Yes, I saw that earlier, and no, "ultra wide execution architecture" doesn't prove it's a VLIW under the cover (of course as I mentioned earlier, from the user visible aarch64 ISA the M1 implements we can trivially conclude that at least the user facing part has nothing to do with VLIW).

            It's entirely possible to implement a very wide microarchitecture with a superscalar OoO design, which all available evidence suggests M1 is. For some more detailed investigation into the M1 microarchitecture look at e.g. https://www.anandtech.com/show/16226...14-deep-dive/2 and https://dougallj.github.io/applecpu/firestorm.html


            • coder
              Senior Member
              • Nov 2014
              • 8964

              #86
              Originally posted by jabl View Post
              But I have seen no indications anywhere that Apple M1 would be anything like that. Everything I've seen suggests the M1 microarchitecture is a "normal" OoO core design.
              Handley's 350-page analysis of the M1 doesn't mention VLIW anywhere.


              • mSparks
                Senior Member
                • Oct 2007
                • 2111

                #87
                Originally posted by jabl View Post

                doesn't prove it's a VLIW under the cover
                Well, at least we got past the M1 not being "long instruction words that contain multiple RISC instructions that run in parallel"

                So, back to my earlier question.
                Originally posted by mSparks View Post

                Yes, I got that, I just didn't get what you think is the difference between M1 long instruction words that contain multiple RISC instructions that run in parallel and Elbrus VLIW that contain multiple RISC instructions that run in parallel.

                You keep mentioning out of order, but these are all parallel instructions, the whole point is there is no order.
                ...
                Originally posted by coder View Post
                Handley's 350-page analysis of the M1 doesn't mention VLIW anywhere.


                They call it "Ultra wide instruction arch". Doesn't mention that either.
                Last edited by mSparks; 24 February 2022, 02:42 PM.


                • coder
                  Senior Member
                  • Nov 2014
                  • 8964

                  #88
                  Originally posted by Khrundel View Post
                  Prefetching can help, but it won't always. I mean, if prefetching were possible all the time, we wouldn't need any caches; just load the register 1000 cycles before using it.
                  It's not a perfect solution, just a mitigation.

                  Look, I'm not trying to have a debate about pure VLIW vs OoO. I'm just trying to understand why you said VLIW scales poorly with frequency. If your point was just about memory latencies, then I simply wanted a confirmation that's what you were talking about.

                  Originally posted by Khrundel View Post
                  You're talking about PGO, so you mean it: PGO just allows the compiler to evaluate probabilities more correctly.
                  Where PGO can help (besides knowing when to be more or less aggressive about unrolling and inlining) is with inserting explicit prefetches. However, as you point out, there are cases where you can't prefetch (at least, not more than 1-deep), such as when walking a tree or a linked list.
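
                  A small sketch of mine to illustrate that limitation: with a linked list, the only address you have in hand is the next node, so a prefetch hint can run at most one node ahead.

                  Code:
                  struct node { struct node *next; int payload; };

                  /* n->next is already in hand once n is loaded, so we can hint
                     the next node -- but n->next->next isn't known until that
                     load completes, hence "1-deep" at best.  An array walk has
                     no such limit, since any future address is computable. */
                  int sum_list(const struct node *n)
                  {
                      int s = 0;
                      while (n) {
                          __builtin_prefetch(n->next, 0, 1);
                          s += n->payload;
                          n = n->next;
                      }
                      return s;
                  }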

                  Originally posted by Khrundel View Post
                  OoO is always better than VLIW. And the second tragedy of VLIW: it has no real advantages.
                  This is not true. VLIW has better power efficiency, if you can keep it from stalling, because it avoids scheduler overhead. So, for signal-processing applications, which tend to have regular data-access patterns, it can be a significant win. There are lots of DSPs and AI chips that use VLIW. Older GPUs also did, until they figured out that wide SIMD + SMT was a better solution (but still in-order!).

                  Also, you're limited in your thinking. You only talk about classical VLIW, not EPIC. EPIC saves less runtime overhead than VLIW, but still allows for things like OoO and speculative execution. Compared with classical OoO, you save on having to detect data dependencies.


                  • coder
                    Senior Member
                    • Nov 2014
                    • 8964

                    #89
                    Originally posted by jabl View Post
                    I did read that, but I'm sorry, that is so confusingly written that the only takeaway I got was that the author has an extremely limited understanding of microarchitecture.
                    Same. I looked at all 3 parts, but there was no depth and no evidence presented to indicate VLIW. Pretty much the opposite of Handley's work (i.e. in terms of depth), which I linked above.


                    • coder
                      Senior Member
                      • Nov 2014
                      • 8964

                      #90
                      Originally posted by mSparks View Post
                      They call it "Ultra wide instruction arch". Doesn't mention that either.
                      I guess the main point is what else it says about the M1.

                      So, for instance, it goes into some depth about the reorder buffer (also called ROB). This is a structure used for instruction scheduling, specifically when you're executing them out of order.

                      The document has some good introductory sections that explain a lot of these concepts. It's worth a look, if you're interested in CPU micro-architecture.

                      BTW, some have mentioned here that ELBRUS is more EPIC than classical VLIW. I would like to know more about this, if anyone has details to share.

