OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs

  • #31
    Originally posted by tomas View Post
    Ukraine is a sovereign nation and Russia has no right to decide either its foreign policy nor its security policy or which alliances it wishes to join, whether it's EU-membership or Nato-membership.
    but Venezuela, Syria, Cuba, Argentina, etc, do not.

    Originally posted by tomas View Post
    The Budapest Memorandum is still valid and in play, none of the signing parties has withdrawn from it, so your whole point is moot.
    Assurances in the Budapest Memorandum are not legally binding. The Soviet Constitution and the 1990 Law on Secession precede the Budapest Memorandum, and are legally binding. Union republics were not allowed to secede with territory administratively assigned to them. Autonomous Oblasts were given the right to self-determination. Crimea was an Autonomous Oblast. These are the relevant laws. The Budapest Memorandum does not mention Crimea.
    Last edited by novideo; 01 June 2022, 06:05 PM.

    Comment


    • #32
      Originally posted by kobblestown View Post
      It'd be nice to see some Elbrus performance numbers...
      In single-core tests with optimized software it shows almost the same IPC as modern x86, but it cannot reach any decent frequency (1.5 GHz has been reached, 2 GHz is the next goal). Being a VLIW design, its performance will scale worse with frequency than OoO, so there is nothing to expect there; it is just a way to waste taxpayers' money.

      Comment


      • #33
        Originally posted by Khrundel View Post
        in single-core tests with optimized software it show almost same ipc with modern x86 except inability to reach any decent frequency (1.5GHz is reached, 2GHz is next goal). Being a VLIW performance will scale worse with frequency than OoO, so it is nothing to expect there, just a way to waste taxpayers' money.
        That is one way to look at VLIW. OTOH, VLIW performance is more deterministic, since it does not depend on OoO semantics and all the corresponding problems. Reaching high clock rates with VLIW is tough, as a single core can be more complex than in other architectures.

        Comment


        • #34
          Originally posted by blacknova View Post
          That is one way to look at the VLIW. OTOH VLIW performance is more determenistic since it is not dependent on OoO semantics and all corresponding problems. Reaching high clock rates for VLIW is tough, single core can be more complex compared to other archs.
          No way. Quite the opposite: VLIW requires a deterministic environment, and that is why it sucks so hard. When a single read operation may take anywhere from one to several hundred cycles, it is impossible to achieve any performance with compile-time operation scheduling. So VLIW may work and show its theoretical performance only when the CPU has no caches and the compiler knows exactly how long any read operation takes. But this means either a low-frequency CPU or CPU-integrated memory, like in the Cell Broadband Engine. Both are incompatible with modern general-purpose CPUs. Moreover, it is not only high frequencies: multicore harms VLIW CPUs too. Imagine some program with a usual for loop which takes one word from memory, performs some computation and then takes the next. From a compiler's perspective this is great: you can expect data locality, so a VLIW compiler doesn't need to delay data processing, as it can expect the data to be in cache and ready almost instantly. With SMT over multilevel caches with the MESI protocol for coherency this is no longer the case: the cache can be invalidated at any time in the middle of loop body processing. The more cores there are, the more probable this situation becomes.
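The stall argument above can be put into a toy model. This is only a sketch under loud assumptions: the function name, the 3-cycle hit, the 300-cycle miss and the 5 cycles of work per iteration are all made up for illustration, and the model ignores prefetching, OoO execution and anything specific to Elbrus.

```python
# Toy model only: every cycle count below is a made-up illustration,
# not a measurement of Elbrus or of any other real CPU.

def static_schedule_cycles(actual_latencies, assumed_latency, work):
    """Cycles for a loop on an in-order, statically scheduled core.

    The compiler plans each iteration as `assumed_latency + work` cycles.
    If a load really takes longer than assumed, the whole pipeline stalls
    for the difference, because nothing else was scheduled to fill the gap.
    """
    total = 0
    for actual in actual_latencies:
        total += assumed_latency + work            # the planned schedule
        total += max(0, actual - assumed_latency)  # unplanned stall
    return total


ASSUMED_HIT = 3   # hypothetical cache-hit latency the compiler plans for
MISS = 300        # hypothetical latency of a load that misses the caches
WORK = 5          # hypothetical compute cycles per iteration

all_hits = [ASSUMED_HIT] * 100
one_miss = [ASSUMED_HIT] * 99 + [MISS]

ideal = static_schedule_cycles(all_hits, ASSUMED_HIT, WORK)
real = static_schedule_cycles(one_miss, ASSUMED_HIT, WORK)
slowdown = real / ideal

print(f"planned: {ideal} cycles, with one miss: {real} cycles, "
      f"slowdown: {slowdown:.2f}x")
```

With these assumed numbers, one miss in 100 iterations stretches the 800 planned cycles to 1097. How often such misses happen on real hardware, and how much of the latency can be hidden, is exactly what the thread disagrees about.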

          Comment


          • #35
            Originally posted by mshigorin View Post
            As I said, this site is not called Moronix. You just cited Russia Today, btw -- which is pretty humorous to me.
            No, I quoted you quoting Russia Today.
            And calling someone a moron is not nice, I did not call you any names. Please keep it civil.

            It hasn't even been ratified, child.
            (Calling someone a child is a classic master suppression technique that makes you appear insecure and weakens your argument, and besides I'm 3 years older than you).

            I don't know what you are trying to pull here, but the undisputed fact is that the Budapest Memorandum from 1994 has been signed by Russia, and thereby ratified:

            "ratify - sign or give formal consent to (a treaty, contract, or agreement), making it officially valid."

            Furthermore, as I referred to in my first post, Russia has also signed (and thereby ratified) the "Astana Commemorative Declaration" at the 2010 OSCE Summit held in Astana, Kazakhstan:

            Seventh OSCE Summit of Heads of State or Government, Astana, 1-2 December 2010.


            From point 3:

            "The security of each participating State is inseparably linked to that of all others. Each participating State has an equal right to security. We reaffirm the inherent right of each and every participating State to be free to choose or change its security arrangements, including treaties of alliance, as they evolve."

            Now which of these facts that I present do you dispute?

            And you're trying to explain things you didn't even study to a reserve officer of radio-chem-bio defense who earned his M.Sc. in Chemistry in Kiev University and helped write Oliynik's bill back in 2002..
            None of this has any bearing to the discussion at hand.

            "That lame stuff boils down to "gimme money and weapons OR I go nuke".
            You can oppose the reality as long as it allows you. Do not cry when it stops doing so.
            The reality is that it is Russia that is the aggressor here and in the wrong.
            You have many fellow Russians that don't buy in to the Kremlin and Putin propaganda.
            I'm sorry to see you are not one of them, but it's never too late to change.
            Last edited by tomas; 22 February 2022, 04:51 AM.

            Comment


            • #36
              Originally posted by Khrundel View Post
              No way. Quite opposite: VLIW requires deterministic environment and that is why it sucks so hard. When single read operation may take one to several hundreds cycles it is impossible to achieve any performance with compile-time operations scheduling. So, VLIW may work and show its theoretical performance only when CPU have no caches and compiler knows exactly how much time any read operation takes. But this means either low frequency CPU or CPU-integrated memory, like in Cell Broadband Engine. Both are incompatible with modern general CPUs. Moreover, not only high frequences, multicore harm VLIW CPUs too. Imagine some program, an usual for loop which takes one word from memory, performs some computation and then takes next. From a compiler perspective this is great, you can expect data locality, so VLIW compiler doesn't need to delay data processing, it can expect data to be within cache and ready almost instantly. With SMT over multilevel caches with MESI-protocol for coherency this is no longer the case, cache can be invalidated any time in the middle of loop body processing. The more cores are here, the more probable this situation will be.
              Or it could work with large caches. I'm not saying VLIW is the best solution, but it is a working solution with its own pros and cons. VLIW certainly is not the most power-efficient solution, since all cores are very large and complex.

              Comment


              • #37
                Originally posted by blacknova View Post
                Or it could work with large caches. I'm not saying it VLIW is the best solution, but is working solution with it is own pros and cons. VLIW certainly is not most power effecient solution, since all cores are very large and complex.
                No, it won't. You can't build a cache that is large, highly associative and fast all at once; that is why modern CPUs have many levels of caching, up to 4. Moreover, if you have to add very complex cache logic and an excessive amount of cache memory to your CPU, why not add an OoO-capable decoder/scheduler instead? OoO is way better. So the only way to make it work is a Cell-like SPU with the core's own on-chip memory.

                Comment


                • #38
                  Originally posted by Khrundel View Post
                  [...] except inability to reach any decent frequency (1.5GHz is reached, 2GHz is next goal).
                  That goal is reached and works just fine a foot away from me:
                  e1601:~> inxi -Cay
                  CPU:
                  Info: model: E16C bits: 64 type: MCP arch: Elbrus-16C family: 6
                  model-id: 0xB (11) stepping: 0
                  Topology: cpus: 1x cores: 16 smt: <unsupported> cache: L1: 3 MiB
                  desc: d-16x64 KiB; i-16x128 KiB L2: 16 MiB desc: 16x1024 KiB L3: 32 MiB
                  desc: 1x32 MiB
                  Speed (MHz): avg: 2000 min/max: N/A cores: 1: 2000 2: 2000 3: 2000 4: 2000
                  5: 2000 6: 2000 7: 2000 8: 2000 9: 2000 10: 2000 11: 2000 12: 2000 13: 2000
                  14: 2000 15: 2000 16: 2000 bogomips: 64000
                  Flags: N/A
                  Vulnerabilities: No CPU vulnerability/bugs data available.


                  Originally posted by Khrundel View Post
                  Being a VLIW performance will scale worse with frequency than OoO
                  I observe quite the opposite: v4-optimized code runs on v5 at speed about proportional to frequency (actually slightly faster due to RAM generation change and full-custom memory controller design IIRC; optimizing for v5 ISA can gain another boost).

                  Originally posted by blacknova View Post
                  OTOH VLIW performance is more determenistic since it is not dependent on OoO semantics and all corresponding problems.
                  There's yet another perspective: one can have a look at what exactly gets executed, without the decoder-related guesswork or arcane knowledge required. Yes, it's still a stick with two ends as e2k has its own arcane knowledge. :]

                  Originally posted by Khrundel View Post
                  Quite opposite: VLIW requires deterministic environment and that is why it sucks so hard.
                  Don't trust Maslov, he lies in your face (while also telling some technically correct things and avoiding to mention some he must know) -- that was shown publicly more than once already.

                  First off, Elbrus isn't exactly VLIW. It's rather EPIC.

                  But yes, VLIW/EPIC arches are more like "planned economy": if everything is thoroughly prepared, it runs way more efficiently than "market economy" (that's exactly what Reagan government ran into back in 1970s -- having to invent "reaganomics" of lies, its harvest is what US reaps now in terms of state debt, goods deficit, social disproportions, etc).

                  But if things happen as they happen (read: interpreters, bytecode/JIT workloads, multitasking, and other ways to get lots of indirection and context switches), "plans" are hard or impossible to build, and the main feature of the approach, parallelism with "just-in-timeism", doesn't reach its theoretical potential (e.g. only one instruction out of 25 possible would get scheduled, or even a NOP).

                  OTOH "market economy" has its downsides like the need to waste energy on re-optimizing even the inner loops on each and every CPU running those -- if Greenpeace was actually about ecology, they'd march against OoORISC as well as HTTPS Everywhere I guess, given the extra gigawatts these alone consume.

                  Originally posted by Khrundel View Post
                  When single read operation may take one to several hundreds cycles it is impossible to achieve any performance with compile-time operations scheduling.
                  Elbrus counters this problem with APB which I have mentioned already. Of course it has its own pros and cons: APB setup takes a few cycles, and firing it for a single memory access or so would be plain counterproductive.

                  Originally posted by Khrundel View Post
                  So, VLIW may work and show its theoretical performance only when CPU have no caches and compiler knows exactly how much time any read operation takes.
                  There's a practical performance test that's used when Elbrus is to pass its state trials (someone mentioned taxpayers' money; did they hear anything about US taxpayers asking their government questions on Intel's dirty tricks to boost performance by sacrificing integrity and security -- knowingly?). That test shows the theoretical performance in practice.

                  Again, please claim what you actually can support, or at least tag with "IMHO" or whatever -- that's just more respect to you, each of us can err but being at least able to realize it does matter. I won't complain to Masyana, mate.

                  Originally posted by Khrundel View Post
                  But this means either low frequency CPU or CPU-integrated memory, like in Cell Broadband Engine.
                  Heh, I used to work with those about ten years ago too; those motherboards had regular DIMM slots.

                  Originally posted by Khrundel View Post
                  Moreover, not only high frequences, multicore harm VLIW CPUs too.
                  Have you seen the real-life performance references that I've posted above too?

                  Originally posted by Khrundel View Post
                  Imagine some program, an usual for loop which takes one word from memory, performs some computation and then takes next. From a compiler perspective this is great, you can expect data locality, so VLIW compiler doesn't need to delay data processing, it can expect data to be within cache and ready almost instantly. With SMT over multilevel caches with MESI-protocol for coherency this is no longer the case, cache can be invalidated any time in the middle of loop body processing. The more cores are here, the more probable this situation will be.
                  Elbrus doesn't do SMT; but it does do SMP and in the observed reality it does it pretty well with near-O(1) scalability observed from 4C and all the way to 8CB (I haven't laid my hands on 4E16C motherboard just yet -- short of admiring one in Room 107).

                  If you'd care to read the guide I referred to as well (not a "3--5 min reading" definitely though), you might have seen the data access optimization techniques aimed at reducing contention and improving parallelization. Those are generally useful as hardware decoders will have way better time with data access decoupled too.
                  Originally posted by tomas View Post
                  No, I quoted you quoting Russia Today.
                  So you quoted Russia Today. Stop trying to argue with someone who actually studied formal logic in a specialized maths school and applies it on a daily basis, won't help you at all. Or at least try to imagine yourself in a court and get ready to present solid proof to each and every of your words -- that's what I'm aiming for myself, it's hard to compete with a lower standard.

                  Originally posted by tomas View Post
                  And calling someone a moron is not nice, I did not call you any names. Please keep it civil.
                  So I did; even when you directly insulted my intellect with those blunt allegations about events you know nothing about in lands you never have been to (when I know those through first-hand experience).

                  Originally posted by tomas View Post
                  (Calling someone a child is a classic master suppression technique that makes you appear insecure and weakens your argument, and besides I'm 3 years older than you).
                  We can get grey but stay children (in a bad sense) if we don't at least learn to tell truth from false. A sore state.

                  Originally posted by tomas View Post
                  the undisputed fact is that the Budapest Memorandum from 1994 has been signed by Russia, and thereby ratified
                  .
                  No. Please do your homework on the very terms you try to repeat after others (who either have no clue or try to manipulate you into what they know to be false interpretation; I've referred to "The Third Wave" experiment above for a reason).

                  For example, Putin signed DPR/LPR acknowledgement documents last night but these still had to be ratified by the Parliament. These are different branches of state power, and their authorization of a given document both differs and doesn't spread automatically from one to another.

                  Originally posted by tomas View Post
                  Furthermore, as I referred to in my first post, Russia has also signed (and thereby ratified) the "Astana Commemorative Declaration" at the 2010 OSCE Summit held in Astana, Kazakhstan:
                  "The security of each participating State is inseparably linked to that of all others. Each participating State has an equal right to security. We reaffirm the inherent right of each and every participating State to be free to choose or change its security arrangements, including treaties of alliance, as they evolve."
                  Now which of these facts that I present do you dispute?
                  You've presented your own uninformed interpretation (might be media-forced down your mind, not my business) as a "fact". I'd do so from time to time either, but I'd at least thank those that point me towards problems with my interpretation that I would mistake for a fact.

                  Regarding Astana (I've been there too BTW, but back in 2001): US destroyed Libya to make it a ram against Syria, planning to further aplify the effect towards Iran and then strike Russia at Caucasus (part of it was spelled in public by Gen. Clarke as I've also mentioned already, and the arrow on the map is pretty easy to finish -- even for a smart child).

                  It's like if Russia would destroy several countries in Latin America, getting them into chaos so becoming a $50 mercenary with an AK suddenly starts being an option to feed one's children; boosting the effect until a multi-million army would roll into North America. Would you like that? And if not, please reread the very paragraph on security you cited.

                  NATO officially stated recently that its position is "safety to NATO countries, don't care about the rest" during the negotiations with Russia over our proposal on European security. Why should we still care for those people when we're stronger and they ruined the international security into "fist law" state? So we do not care anymore, and it's official too.

                  I actually grew up in Kiev. We actually tolerated those nazi jokes of West Ukrainian morons (yes, iodine deficitic morons) for decades. And it would be better for those to get their timely punch in a face in return for such a joke since thousands of those morons accustomed to impunity might be at least still alive and not wasted at war.

                  You see, when a "revolution" takes over -- thus declares itself not a coup anymore -- the first danger at hand for the "new government" is the militants that are almost inevitably used in the process: they're self-confident, most frequently uncontrollable by that time, and typically brainwashed to aim so high that the target couldn't be hit even in theory; in Ukrainian case it was "European salaries" [right off the skies] and "Ukraine from Syan [river] to Don [river]" which involves direct territorial quarrels with both Poland and Russia.

                  So that bunch of nazis was mostly sent off under Grads; some of those were smart enough to stop half a mile away from target place to see it well covered (and understand that it was their own destiny planned by their own command).

                  Originally posted by tomas View Post
                  The reality is that it is Russia that is the aggressor here and in the wrong.
                  As I've said already, you can try fiddling with reality all day. It's only a matter of time when it chooses to fiddle with you. Do not complain then.

                  I was stupid enough to spell something like "Ukrainian city" in Sebastopol back in 2006 when visiting friends (don't recall the exact wording). I was stopped short politely but definitely with a reply: "remember, Sebastopol is a RUSSIAN city; got it?". I was stunned for a moment but understood my friends instantly and deeply; they were right.

                  Go tell the Crimeans or Donbass people that "Russia is the aggressor". They'll most likely not even harm you in return, even if they have every single right to do so. Go tell those still suppressed in Kiev or Lvov. They know from their lives who is the aggressor. There's a Londoner who actually has balls to travel and ask and show what he sees, Graham Phillips -- go and see.

                  Imagine those Latin America hordes waving into USA and starting to overthrow local monuments, renaming streets and Cities (say Boston would become New Liverpool), forcing their own language upon everyone, threatening Canada to bomb them with dirty nuclear munitions (it's not a direct analogy as geography does matter but hope you get the drift).

                  Would you appreciate that? Would you do nothing against that if you were governing your people?

                  Originally posted by tomas View Post
                  You have many fellow Russians that don't buy in to the Kremlin and Putin propaganda.
                  I'm sorry to see you are not one of them, but it's never too late to change.
                  I'm sorry years have come to you but wisdom has not.

                  I've actually talked to some of those poor idiots (about 2% of the population or so); mostly real children or youngsters who have no clue, and some 50+ folks who would transmit absolutely Goebbelsish stuff like "you know, in reality 80% of Russia is against Putin".

                  I also know some decent folks considering themselves an opposition; an old friend of mine chose to go as a coordinator for observers at elections back in 2018 IIRC; he got his personal material proof that Putin was elected fairly, and that became a non-question to him.

                  You come with a paper knife against an armoured guy with a gun. You can continue trying to stick your gross misunderstanding of the part of reality you try to discuss into me for as long as you wish, it just doesn't work -- since you got yourself convinced by someone else who wasn't there either, and I know from experience.

                  Both Kiev and Elbrus.
                  Originally posted by blacknova View Post
                  Or it could work with large caches. I'm not saying it VLIW is the best solution, but is working solution with it is own pros and cons.
                  It's indeed a different approach at the same trade with its pros and cons.

                  In particular, I wish the JavaScript JIT were faster on my 8C, but it's good enough for e.g. Yandex Maps (it might even be faster than an old i5-3xxx with all of the anti-Spectre/Meltdown/... workarounds deployed!). I've also tested a 16C that we deployed as a build node as a workstation: it's fast enough already, even if that's not as fast as it can be yet.

                  Originally posted by blacknova View Post
                  VLIW certainly is not most power effecient solution, since all cores are very large and complex.
                  I'd argue here too but I lack my real job FLOPS/W arguments to be sound.

                  A fact I can present is that a 4E8CB (quad 8CB motherboard) stocked with 512 GB of RAM running SPEC CPU 2017 consumes about 410 W or so; each processor can do 570 GFLOPS FP32 / 285 GFLOPS FP64 as the specs claim ("Характеристики", i.e. "Specifications", tab).
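For a rough sense of scale, the figures quoted above can be combined into an upper bound on FLOPS per watt. This is only back-of-the-envelope arithmetic: it divides the spec-sheet peak by the board power seen under SPEC CPU 2017, which is a different workload, so it is not a measured efficiency. The constant and function names are hypothetical.

```python
# Figures as quoted in the post; everything derived from them is a rough
# upper bound, since peak FLOPS and the 410 W reading come from different
# workloads.
CPUS_PER_BOARD = 4        # 4E8CB: quad Elbrus-8CB motherboard
PEAK_FP32_GFLOPS = 570    # per processor, per the quoted specs
PEAK_FP64_GFLOPS = 285    # per processor, per the quoted specs
BOARD_POWER_W = 410       # whole board, running SPEC CPU 2017


def peak_gflops_per_watt(per_cpu_gflops, cpus=CPUS_PER_BOARD,
                         watts=BOARD_POWER_W):
    """Board-level peak GFLOPS divided by board-level power draw."""
    return per_cpu_gflops * cpus / watts


fp32_eff = peak_gflops_per_watt(PEAK_FP32_GFLOPS)
fp64_eff = peak_gflops_per_watt(PEAK_FP64_GFLOPS)

print(f"FP32: {fp32_eff:.2f} GFLOPS/W (peak, upper bound)")
print(f"FP64: {fp64_eff:.2f} GFLOPS/W (peak, upper bound)")
```

Under these assumptions the board tops out at roughly 5.6 GFLOPS/W in FP32 and 2.8 GFLOPS/W in FP64; the 410 W also covers the RAM and the rest of the board, not only the four processors.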

                  The current problem with Elbrus power efficiency is the lack of frequency scaling in production versions (or at least I don't know of any); v6 fixes that as well, I've seen 2C3 scaling 400 to 2000 MHz and back when idle about a year ago. Must say that Intel did a decent job in this department since they started working on Baytrail -- even if that brought enough problems in the meantime.

                  I've heard that current Intel CPUs can easily have half of the transistor budget dedicated to decoder; these have to eat, and eat each time.

                  Comment


                  • #39
                    Originally posted by blacknova View Post

                    Or it could work with large caches. I'm not saying it VLIW is the best solution, but is working solution with it is own pros and cons. VLIW certainly is not most power effecient solution, since all cores are very large and complex.
                    If you look into the history and motivations behind the Intel Itanium project, they were very concerned with the superscalar OoO approach to finding parallelism at runtime running into scaling limits, IIRC the OoO ROB scales as O(n**3) in power & area with increasing width and depth. So by offloading the scheduling and finding parallelism to the compiler the argument was that this would lead to a smaller and simpler core than an equivalent OoO approach.

                    This didn't work out in practice. Apart from the fundamental issue of not having runtime information about latencies available at compile time, as mentioned earlier in this thread, finding enough parallelism at compile time was difficult, which led to a lot of code bloat from NOP instructions in the instruction packets. Being statically scheduled, it also required a lot of loop unrolling, software pipelining and similar tricks to get performance, again at the cost of code bloat. So all this code bloat meant that much of the advantage of not needing to spend chip area on OoO logic was lost, as that area and power were instead consumed by instruction caches and by moving those instructions around.
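The NOP-padding effect described above can be sketched with a toy bundle packer. This is a minimal illustration, assuming unit-latency operations, fixed-width bundles and a greedy scheduler. The function names and the six-operation example are hypothetical, and real ISAs (Itanium, e2k) may encode empty slots more compactly than this model suggests.

```python
def pack_bundles(deps, width):
    """Greedily pack operations into fixed-width bundles.

    `deps` maps each operation name to the set of operations it depends on.
    An operation may only be issued in a bundle after all of its
    dependencies were issued in earlier bundles (unit latency assumed).
    """
    done = set()
    remaining = dict(deps)
    bundles = []
    while remaining:
        ready = sorted(op for op, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        for op in bundle:
            del remaining[op]
    return bundles


def nop_fraction(bundles, width):
    """Share of issue slots that stay empty (i.e. would be NOPs)."""
    slots = len(bundles) * width
    used = sum(len(b) for b in bundles)
    return (slots - used) / slots


# Hypothetical code fragment: one chain a -> b -> c -> d and one chain e -> f.
example_deps = {
    "a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"},
    "e": set(), "f": {"e"},
}
example_bundles = pack_bundles(example_deps, width=4)
example_nops = nop_fraction(example_bundles, width=4)

for i, bundle in enumerate(example_bundles):
    print(i, bundle)
print(f"empty slots: {example_nops:.1%}")
```

In this made-up fragment the dependency chain forces four bundles even though there are only six operations, so most of the 16 slots stay empty. Loop unrolling and software pipelining exist to feed such a packer more independent operations, at the price of more code.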

                    VLIW might be a useful implementation approach for some special-purpose DSP processors, but for general purpose code it seems like a pretty dead end approach.

                    Comment


                    • #40
                      Originally posted by jabl View Post
                      So all this code bloat meant that much of the advantages of not needing to spend chip area on OoO logic was lost, as that area and power was instead consumed by instruction caches and moving those instructions around.
                      Speaking of code bloat, it's definitely there -- but the extent of it is not as dramatic as I'd assume based solely on reading your message.

                      coreutils-8.31.0.3.6bd78-alt2 built for x86_64, aarch64, ppc64le, e2kv4 (the latter one is -alt1.E2K.1.e2kv4 with -O3 which includes much more aggressive inlining among other inflating things and without ppc-specific fixes for true/false); /bin/ls size in bytes is:

                      138704 at x86_64
                      130520 at aarch64
                      200016 at ppc64le
                      341920 at e2kv4
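To put the sizes above in relative terms, the following sketch normalizes them to the x86_64 build. It uses only the four numbers listed; the variable names are hypothetical, and the e2kv4 figure comes from an -O3 build, so the comparison is not strictly like for like.

```python
# /bin/ls sizes in bytes, as listed above.
LS_SIZE_BYTES = {
    "x86_64": 138704,
    "aarch64": 130520,
    "ppc64le": 200016,
    "e2kv4": 341920,   # built with -O3, unlike the others
}

baseline = LS_SIZE_BYTES["x86_64"]
size_ratio = {arch: size / baseline for arch, size in LS_SIZE_BYTES.items()}

for arch, ratio in size_ratio.items():
    print(f"{arch:8s} {ratio:.2f}x of x86_64")
```

By this measure the e2kv4 binary is a bit under 2.5 times the size of the x86_64 one, while ppc64le is about 1.4 times and aarch64 slightly smaller than the baseline.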


                      Originally posted by jabl View Post
                      VLIW might be a useful implementation approach for some special-purpose DSP processors, but for general purpose code it seems like a pretty dead end approach.
                      I rather think that world-scale interpreted/JIT code, with its enormous energy waste (in both the decoding and execution stages), is the real dead end if we look slightly beyond just IT.

                      But time will surely tell. And in the meantime, I send this message using both Elbrus and JavaScript.

                      PS: I've heard that MCST's JIT implementation for Java performs on Elbrus-8C (8-core 1300 MHz e2kv4) slightly better than Oracle's one on Baikal-M (8-core 1500 MHz A57).

                      Comment
