Linus Torvalds: "I Hope AVX512 Dies A Painful Death"


  • Originally posted by SilverBird775 View Post
    No you cannot. It is a real problem. It is way, way too wide. There are common 3D/4D matrix operations which can be coded with AVX/AVX2, but that is already too luxurious for the price: luxurious because of the data-permutation toll you have to pay to prepare the "chunks". It's like trying to move the whole Earth's baggage with a single train! Yes, it does indeed move quickly and looks epic, but loading and unloading this train is a grand problem. Anything more complex than a double vector[4] is a super rare case which can easily be expressed with just a few AVX instructions if the need arises. So you are just stating that doubling the transistor budget of a whole CPU, for something no one will ever use, is not a problem; but it is.

    AVX512 has no use for games, it has no use for CAD software; it is just a useless waste of transistors. AVX512 should never, ever be a requirement for software, and preferably not AVX/AVX2 either. More cores is a much better way to spend the transistor budget.
    ?!?!?!?!?!?
    There is so much wrong in this that I cannot even comprehend it.

    How many lines of assembly have you written in your life? And how many times have you looked at Godbolt-style output with -march=native and -O2 or higher?

    1st. AVX2 is already widely used in games, to the point that certain games raised their minimum requirements to AVX (Assassin's Creed Origins, for example). Ubisoft later released an alternative non-AVX patch so people with older CPUs could play, but AVX and AVX2 are used; and if those two are used, AVX512 probably could be too, were it not for its downclocking problems and poor adoption on the market. Even the Windows desktop uses AVX, and it sees wide usage in CAD software, Blender, codecs, etc.; all of them gladly take advantage of AVX and AVX2.

    2nd. Not everything can be parallelized, or it can only benefit from parallelization up to a point; read about Amdahl's law. That is not a big deal for scientific workloads, which mostly split well, but it is a problem for games, for example. Also, the more cores you have, the higher the core-to-core latency. That is why a 9980XE at 5 GHz is still slower in gaming than a 9900K at 5 GHz: HEDT core-to-core latency is higher. Desktop computers are not servers/supercomputers where you can just say "let's assign a thread to every single user". Games running at 144+ fps care a LOT about latency, and there more cores is not better. So in a lot of cases, if you can do the workload faster on one thread, that is a lot better.

    3rd. "No you cannot. It is a real problem. It is way, way too wide. There are common 3D/4D matrix operations which can be coded with AVX/AVX2, but that is already too luxurious for the price: luxurious because of the data-permutation toll you have to pay to prepare the "chunks". It's like trying to move the whole Earth's baggage with a single train! Yes, it does indeed move quickly and looks epic, but loading and unloading this train is a grand problem."
    ?!?!?!? A horrible comparison, and absolutely untrue for desktop users. Besides, AVX in all its variants already has instructions to load data in one go, and they don't take longer than 2 cycles. Unloading the train is actually not a big deal either, because data meant for AVX is usually nicely arrayed, which means it is not scattered all over memory and can be nicely cached, and so on. For multicore CPUs, oh boy: not only does each core make its own requests to RAM, the cores ask for data all over the memory space and often request data from other cores; and from a programming perspective you do lots of unnecessary copying and add extra data, such as locks and synchronization machinery, just to avoid problems. For encoders or web browsers, the more cores you have, the bigger the memory consumption. AVX code doesn't have such problems.

    4th. "So you are just stating that doubling the transistor budget of a whole CPU, for something no one will ever use, is not a problem; but it is." Even the 9800X is not even close to double the size of the 9900K, despite having a LOT more PCIe lanes, two more memory channels, two dummy cores, an older-generation design, and slightly less cache. Per core, the 9980XE vs the 9900K is just a 4 mm^2 difference (26 vs 22), which is totally normal considering all the other differences (such as mesh vs ring interconnect). It is the same sort of BS people repeat about NVIDIA's "ridiculously large tensor/ray-tracing cores"; in reality those cover less than 0.3% of the die in the current RTX series.



  • I have seen some posts about SIMD and energy consumption, and I made some experiments on this aspect some time ago: https://hal.archives-ouvertes.fr/hal-01795146

      Using the NEON units requires more power, but since your code runs faster, it significantly decreases the energy required. In fact, the higher power draw when running NEON code instead of scalar code is largely compensated by the speedup you obtain.
      Another thing is that the power P depends on the frequency F and the supply voltage V according to the formula P = C.F.V^2, and V also depends on F, since you need to raise V to sustain a higher F. So by lowering the frequency (and consequently the voltage) and using SIMD instructions, you may achieve better energy consumption than scalar code running at a higher frequency.



      • Originally posted by SilverBird775 View Post
        No you cannot. It is a real problem. It is way, way too wide. There are common 3D/4D matrix operations which can be coded with AVX/AVX2, but that is already too luxurious for the price: luxurious because of the data-permutation toll you have to pay to prepare the "chunks". It's like trying to move the whole Earth's baggage with a single train! Yes, it does indeed move quickly and looks epic, but loading and unloading this train is a grand problem. Anything more complex than a double vector[4] is a super rare case which can easily be expressed with just a few AVX instructions if the need arises. So you are just stating that doubling the transistor budget of a whole CPU, for something no one will ever use, is not a problem; but it is.
        A common operation is a 4x4 * 4x1 single-precision multiply. On AVX2 you'd pack two rows at a time into a register and perform the operation twice; SSE would take four operations. But you could pack all four rows of the matrix into one AVX512 register and do the whole thing in one. With the current caveats it's not worth it and is more complex, but it's not too wide to be feasible. Clarity-wise I think I'd prefer the 128-bit SSE code, but once you get into SIMD it's all unreadable anyway.



        • Also, I would like to add that the typical data set on a desktop computer is not big but rather small. For an array of, say, ~100 elements, SIMD code gives you a great advantage, while at the same time the array is too small for splitting the work across threads to make sense.



          • Originally posted by piotrj3 View Post
            How many assembly lines you wrote in your life? And how many times you saw godbolt like output with march=native and -O2 or higher?
            Indeed, I did and do optimize commercial CAD software for SSE/AVX, so I do have a real background from which to judge. Your experience may be different, but that would not change what I have seen myself. It is not like "hey, we got AVX2, let's code for it!"; it's more like "you only have SSE2 and SSE3, good luck and do your best".

            Originally posted by piotrj3 View Post
            1st. AVX2 is already widely used in games, to the point that certain games raised their minimum requirements to AVX (Assassin's Creed Origins, for example). Ubisoft later released an alternative non-AVX patch so people with older CPUs could play, but AVX and AVX2 are used; and if those two are used, AVX512 probably could be too, were it not for its downclocking problems and poor adoption on the market. Even the Windows desktop uses AVX, and it sees wide usage in CAD software, Blender, codecs, etc.; all of them gladly take advantage of AVX and AVX2.
            Overpriced CPUs to play overpriced games, well done! Ridicule intended. Someone has to pay to move the industry forward. If an imaginary AVX1024 gave a 1% boost in a certain game and the happy clients paid for it, would that really be right to you? Besides, you are shifting the attention from AVX512 to AVX2. It is the AVX512 madness that should hopefully "die a painful death" (c). AVX2 is a luxury, but still sane enough to exist.

            Originally posted by piotrj3 View Post
            2nd. Not everything can be parallelized, or it can only benefit from parallelization up to a point; read about Amdahl's law. That is not a big deal for scientific workloads, which mostly split well, but it is a problem for games, for example. Also, the more cores you have, the higher the core-to-core latency. That is why a 9980XE at 5 GHz is still slower in gaming than a 9900K at 5 GHz: HEDT core-to-core latency is higher. Desktop computers are not servers/supercomputers where you can just say "let's assign a thread to every single user". Games running at 144+ fps care a LOT about latency, and there more cores is not better. So in a lot of cases, if you can do the workload faster on one thread, that is a lot better.
            The same holds for SIMD, so what's the point? That is exactly the reason for the declining efficiency of ever wider vectors: not everything can be vectorized.

            Originally posted by piotrj3 View Post
            ?!?!?!? A horrible comparison, and absolutely untrue for desktop users. Besides, AVX in all its variants already has instructions to load data in one go, and they don't take longer than 2 cycles. Unloading the train is actually not a big deal either, because data meant for AVX is usually nicely arrayed, which means it is not scattered all over memory and can be nicely cached, and so on. For multicore CPUs, oh boy: not only does each core make its own requests to RAM, the cores ask for data all over the memory space and often request data from other cores; and from a programming perspective you do lots of unnecessary copying and add extra data, such as locks and synchronization machinery, just to avoid problems. For encoders or web browsers, the more cores you have, the bigger the memory consumption. AVX code doesn't have such problems.
            AVX of all kinds already has instructions to load data in one go?! Oh my... It took so many SSE add-ons to adapt SSE2 to real applications! Go finish some real-world case instead of just reading the specs. You cannot nicely array the data for a real-world algorithm; deal with it. The math logic branches, twists, changes, and guesses, making your coder life hell. Also, try to keep your code readable afterwards. Readability puts its own big, fat requirement on your nicely-arrayed fantasies.

            Originally posted by piotrj3 View Post
            4th. "So you are just stating that doubling the transistor budget of a whole CPU, for something no one will ever use, is not a problem; but it is." Even the 9800X is not even close to double the size of the 9900K, despite having a LOT more PCIe lanes, two more memory channels, two dummy cores, an older-generation design, and slightly less cache. Per core, the 9980XE vs the 9900K is just a 4 mm^2 difference (26 vs 22), which is totally normal considering all the other differences (such as mesh vs ring interconnect). It is the same sort of BS people repeat about NVIDIA's "ridiculously large tensor/ray-tracing cores"; in reality those cover less than 0.3% of the die in the current RTX series.
            The numbers always multiply later. The first iterations of an instruction set always lack useful tools and tricks; the instruction set is barebones. It took many SSE updates and add-ons to make SSE2 useful. Same story for AVX: at first glance it was like "OMG, it's so crippled and inefficient" (as SSE2 once was!), and then with the AVX2 update it became much more usable. There is a reason why mobile chips are limited to the full SSE bundle but have nothing of AVX: too many transistors required. Appealing to some giant, costly Core i7 monsters does not make the AVX transistor requirements any prettier; maybe in time, with technological advancement. It is not "bad adaptation on the market", it's common sense and cost-efficiency. You have to think of the lesser and mobile cores.



            • It's nice how people just make things up out of thin air: https://www.anandtech.com/show/13660...g5400-review/3
              G5400 vs i3-7100: a 10% performance boost at the cost of +50% power consumption. Some AVX instructions are made for heavy single-thread performance beyond the competition, not for better performance per watt. That extra single-thread performance can be partially compensated by game clocks in complicated apps, and overall by more cores in the same die area. So a hypothetical "Ryzen 3650 SSE6" = 8 cores @ 4.9 GHz, 80 W, while the actual Ryzen 3600 costs 50 W for games and 80 W for complex apps.



              • Originally posted by Marnfeldt View Post

                With integers, pi is precisely 3
                No, since integers support fixed-point arithmetic.



                • Originally posted by curfew View Post
                  In this form you cannot even multiply it by ten without overflowing and losing the slightest bit of accuracy, idiot. 🤦‍♀️

                  Floating point numbers are critical for doing accurate computations, not just printing random numbers on the screen.
                  "Idiot" was a great word to use, if the goal was to crash and burn and lose the debate.

                  What you are really telling the world is that you don't know about fixed-point arithmetic.



                  • Originally posted by sophisticles View Post

                    What are you, and the dumbasses upvoting this garbage, smoking?

                    Integers are significantly less precise; consider the following pseudo-code:

                    int a = 22;
                    int b = 7;
                    int c = a / b;
                    cout << c;

                    The above code will give you an answer of 3; everything after the decimal point is truncated.

                    On the other hand, the following pseudo-code:

                    int a = 22;
                    int b = 7;
                    double c = (double)a / b;  // cast so the division is not done in integers
                    cout << c;

                    will give you 3.14286 (and 3.14285714286 with enough output precision).

                    Your statement is wrong on so many different levels (mathematical, scientific, and computational) that one has to wonder whether you ever took even a rudimentary math class in your life.
                    How would you compute sqrt(17) to 1000 decimals using your double variables? You are basically making a statement as intelligent as saying that you use glue as fuel for your car.

                    Floating point is named as it is because it has a mantissa and an exponent; a floating-point register is designed to store an *approximation* of a number. In some specific cases it manages to store the exact value. Most times, not.

                    Integer arithmetic, as in fixed point, can handle the value 1.33333934565634382349214134124185194999911 exactly. No round-off. No unexpected "accidents".

                    In many situations it's impossible to use floating point, precisely because of the reverse of your claim: the floating-point registers can't maintain the precision needed. They can't store the exact values, and if you have a very numerically sensitive algorithm, it may blow up from the lack of precision.

                    Ever wondered why the world has big-number libraries, when we already know how large a value you can fit in a 64-bit or maybe 80-bit floating-point register?



                    • Originally posted by sophisticles View Post

                      I take it from your idiotic doubling down on your stupidity that you do not know that said libraries use float and double data types.
                      When was the last time you implemented fixed-point arithmetic? Or implemented a big-number library? Or did you make the idiotic mistake of posting on a subject you know nothing about?

                      Now post source code where you use a 32-bit float to compute pi to 1000 digits. See how easy it is? The way you would have to do it is to work around the imprecision of floating-point values by using a small subset of them as small integers. But it would have been faster to use real integer registers in the first place.

