SiFive HiFive Premier P550 RISC-V Price Lowered, Ubuntu 24.04 Support Ready

  • Gamer1227
    Phoronix Member
    • Mar 2024
    • 68

    #31


Originally posted by coder View Post
The main problem people run into is that they see the compiler flounder at autovectorization and arrogantly assume that it's just stupid, rather than asking why it didn't do what they expected and trying to figure out what it got hung up on. If you put in the effort to do that, you might be surprised at how good compilers can be, and you might save yourself the time and effort of doing the entire process by hand.


    It's the programmer's fault for not telling the compiler those things won't happen. The language is well-defined and the tools give you nearly all the expressiveness you need to say what you actually mean. If you fail to tell the compiler what you really mean, that's on you, not the compiler.

Seriously, do you hear machinists blaming their milling machines for being dumb and advising people to do the work by hand? No! They learn how to use their tools, and we should do the same!

There is a reason why, in every popular FOSS project that needs SIMD, the devs write it by hand in assembly or with intrinsics: FFmpeg, Blender, and so on.

Look at great devs like Daniel Lemire: on his blog, he shows he is an expert in low-level programming, and he always vectorizes by hand.

Auto-vectorization is very hard for compilers to do; it is an intrinsically complex problem, and it is the only optimization where an assembly programmer can easily outperform a compiler.

It is also very fragile: there can be a regression in a new compiler version, and it may not work on older versions.

So if you want RELIABLE vectorization, which will never stop working after a compiler update and will work on older compiler versions, write it by hand.
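
For example, here is a minimal sketch of what I mean (my own toy example, assuming x86-64 with GCC or Clang; the function name is made up): the SSE2 path is spelled out by hand, so no compiler update can silently de-vectorize it, and SSE2 exists on every x86-64 CPU.

#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Sum two int64 arrays into out. The SIMD path is written by hand with
// intrinsics, so it keeps vectorizing no matter what the optimizer decides.
void sum_arrays(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    // Two int64 lanes per iteration (paddq).
    for (; i + 2 <= n; i += 2) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_add_epi64(va, vb));
    }
#endif
    // Scalar tail, and the full fallback on non-SSE2 targets.
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}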

    Comment

    • AndyChow
      Senior Member
      • Apr 2012
      • 771

      #32
      Originally posted by ayumu View Post

      This is at the level of slander, considering each and every generation of their development boards has upstreamed support.
Slander? Then sue me. It still won't make the chip support the mainline kernel.

      Comment

      • coder
        Senior Member
        • Nov 2014
        • 8952

        #33
        Originally posted by Gamer1227 View Post
There is a reason why, in every popular FOSS project that needs SIMD, the devs write it by hand in assembly or with intrinsics: FFmpeg, Blender, and so on.
        Yes, and I explained it. To a large degree, it's developer arrogance and ignorance. I'm not actually opposed to intrinsics, but when people dive into assembly language, I suspect it's more about buffing their ego than actually about trying to use the best tool for the task at hand.

        Originally posted by Gamer1227 View Post
Look at great devs like Daniel Lemire: on his blog, he shows he is an expert in low-level programming, and he always vectorizes by hand.
        You can't just treat all vectorization problems as equal. Some are much more subtle than others. However, as I explained, a lot of the real value programmers add in manually vectorizing code is rethinking the control and data structures to be vectorization-friendly. In many cases, that's just about as far as you'd need to take it and you could just let the compiler do the rest.

        Originally posted by Gamer1227 View Post
It is also very fragile: there can be a regression in a new compiler version, and it may not work on older versions.
        I think it would be much less fragile with PGO or equivalent hinting.

        Originally posted by Gamer1227 View Post
So if you want RELIABLE vectorization, which will never stop working after a compiler update and will work on older compiler versions, write it by hand.
If you rely on everything to be vectorized by hand, the vast majority of software never will be. That's why it pays for programmers to understand the capabilities and limitations of compiler autovectorization. It's a lot less work to write easily autovectorizable C/C++ code than it is to go the whole way and do it all by hand, not to mention a lot more portable and easier to maintain.

        In these debates, I'm always struck by how people advocate for hand-vectorizing as if it has no downside, with opportunity cost being one glaring oversight. Even if we accept that it's indeed the gold standard, it's essentially saying you want like 1% of code that's near optimal and are content for the rest to sputter along as entirely scalar. If people would actually invest time in learning how to write autovectorizable code, then maybe we could have 5% or 10% of code being vectorized and it'd run fast on ARM and RISC-V (for the cores with vector extensions), not just x86.
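
To make this concrete, here is a toy sketch (my own example, assuming GCC or Clang at -O3; not from any real codebase): both loops compute the same thing, but the second phrases the condition as a select, which compilers readily turn into compare-plus-blend and vectorize even at the baseline x86-64 target.

#include <cstddef>
#include <cstdint>

// Branchy version: the conditional store tends to block vectorization,
// since the compiler must prove it is safe to write every element.
void clamp_branchy(int32_t* a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] < 5)
            a[i] = 0;
    }
}

// Branchless version: same result, expressed as a select. Every element
// is written unconditionally, so the loop maps cleanly onto SIMD.
void clamp_branchless(int32_t* a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = (a[i] < 5) ? 0 : a[i];
}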

        Comment

        • Quackdoc
          Senior Member
          • Oct 2020
          • 5072

          #34
          Originally posted by coder View Post
          Yes, and I explained it. To a large degree, it's developer arrogance and ignorance. I'm not actually opposed to intrinsics, but when people dive into assembly language, I suspect it's more about buffing their ego than actually about trying to use the best tool for the task at hand.


          You can't just treat all vectorization problems as equal. Some are much more subtle than others. However, as I explained, a lot of the real value programmers add in manually vectorizing code is rethinking the control and data structures to be vectorization-friendly. In many cases, that's just about as far as you'd need to take it and you could just let the compiler do the rest.


          I think it would be much less fragile with PGO or equivalent hinting.


If you rely on everything to be vectorized by hand, the vast majority of software never will be. That's why it pays for programmers to understand the capabilities and limitations of compiler autovectorization. It's a lot less work to write easily autovectorizable C/C++ code than it is to go the whole way and do it all by hand, not to mention a lot more portable and easier to maintain.

          In these debates, I'm always struck by how people advocate for hand-vectorizing as if it has no downside, with opportunity cost being one glaring oversight. Even if we accept that it's indeed the gold standard, it's essentially saying you want like 1% of code that's near optimal and are content for the rest to sputter along as entirely scalar. If people would actually invest time in learning how to write autovectorizable code, then maybe we could have 5% or 10% of code being vectorized and it'd run fast on ARM and RISC-V (for the cores with vector extensions), not just x86.
Don't engage with such overtly false statements. image-rs is insanely popular and found autovectorization to perform better than their hand-written code: https://github.com/image-rs/image-png/pull/512

Autovectorization is actually pretty decent now.

          Comment

          • Gamer1227
            Phoronix Member
            • Mar 2024
            • 68

            #35
            Originally posted by coder View Post

If you rely on everything to be vectorized by hand, the vast majority of software never will be. That's why it pays for programmers to understand the capabilities and limitations of compiler autovectorization. It's a lot less work to write easily autovectorizable C/C++ code than it is to go the whole way and do it all by hand, not to mention a lot more portable and easier to maintain.

            In these debates, I'm always struck by how people advocate for hand-vectorizing as if it has no downside, with opportunity cost being one glaring oversight. Even if we accept that it's indeed the gold standard, it's essentially saying you want like 1% of code that's near optimal and are content for the rest to sputter along as entirely scalar. If people would actually invest time in learning how to write autovectorizable code, then maybe we could have 5% or 10% of code being vectorized and it'd run fast on ARM and RISC-V (for the cores with vector extensions), not just x86.
Not all code is vectorized. Look at the source code of the x264 decoder: they only vectorize some hot loops that can be vectorized, which is like 5% of the code at worst. It is not that much work.

            Comment

            • Gamer1227
              Phoronix Member
              • Mar 2024
              • 68

              #36
              Originally posted by Quackdoc View Post

Don't engage with such overtly false statements. image-rs is insanely popular and found autovectorization to perform better than their hand-written code: https://github.com/image-rs/image-png/pull/512

Autovectorization is actually pretty decent now.
Yeah, go tell the FFmpeg developers that hand-written assembly is stupid and that they should "just trust the compiler, bro!"

image-rs is an exception, and the exception proves the rule.

Look at all this "stupid" hand-written assembly:
https://github.com/FFmpeg/FFmpeg/tree/master/libavcodec

              Comment

              • coder
                Senior Member
                • Nov 2014
                • 8952

                #37
                Originally posted by Gamer1227 View Post
Not all code is vectorized. Look at the source code of the x264 decoder: they only vectorize some hot loops that can be vectorized, which is like 5% of the code at worst. It is not that much work.
                Even that much is enough work that they had to train a bunch of people to do it. Then, when they want to optimize for a new architecture, they have to do it all over again. Also, what about when they want to support new decoder features, like 10-bit or 12-bit? You bet that's another code path!

                Don't make light of work you didn't do.

                Originally posted by Gamer1227 View Post
Yeah, go tell the FFmpeg developers that hand-written assembly is stupid and that they should "just trust the compiler, bro!"
We didn't say that. I won't speak for Quackdoc, but my point is that people should learn to use their tools properly.

                Comment

                • Gamer1227
                  Phoronix Member
                  • Mar 2024
                  • 68

                  #38
                  Originally posted by coder View Post
                  Even that much is enough work that they had to train a bunch of people to do it. Then, when they want to optimize for a new architecture, they have to do it all over again. Also, what about when they want to support new decoder features, like 10-bit or 12-bit? You bet that's another code path!

                  Don't make light of work you didn't do.


We didn't say that. I won't speak for Quackdoc, but my point is that people should learn to use their tools properly.
#include <cassert>
#include <cstdint>
#include <vector>

using namespace std;

// Sum the elements of two vectors into a third. (Parameters are passed by
// reference so the result actually reaches the caller.)
void sum(vector<int64_t>& a, const vector<int64_t>& b, vector<int64_t>& sum) {
    assert(a.size() == b.size());
    assert(a.size() == sum.size());
    // un-comment the if condition below
    // and see the great auto-vectorization crumble
    // works on both GCC and Clang!
    for (size_t i = 0; i < a.size(); i++) {
        //if (a[i] < 5) {
        //    a[i] = 0;
        //}
        sum[i] = a[i] + b[i];
    }
}


I wrote a small C++ example: just a function summing the elements of two vectors into a sum vector. I also helped the compiler by asserting that the three vectors have the same size.

As you can see, without the simple if condition it works well: if you hover the mouse over sum[i] = a[i] + b[i], the assembly is SSE instructions with xmm registers.

But if you uncomment that condition, the compiler falls back to scalar instructions, and it is only one very simple if condition; imagine anything more complex.

The only way to vectorize with the if statement is to enable AVX2, which has conditional instructions. But then your code would not run on CPUs without AVX2.

In hand-written code, the solution is to create a scalar C path and an AVX2 path, with the path chosen by looking at CPU features. But auto-vectorized code is not able to do that; it only generates one path.

But again, it is a very simple if statement; even with AVX turned on, the compiler would probably fail with more complex code.
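
Here is roughly what that two-path dispatch looks like by hand (a sketch assuming GCC or Clang on x86, where __builtin_cpu_supports and the target attribute are available; the function names are mine):

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// AVX2 path, hand-written with intrinsics. The target attribute lets this
// one function use AVX2 even when the rest of the file is built for
// baseline x86-64.
__attribute__((target("avx2")))
static void sum_avx2(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // four int64 lanes per 256-bit register
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi64(va, vb));
    }
    for (; i < n; i++)  // scalar tail
        out[i] = a[i] + b[i];
}

// Scalar C path for CPUs without AVX2.
static void sum_scalar(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

// Choose the path once, at runtime, by looking at CPU features.
void sum_dispatch(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
    static const bool have_avx2 = __builtin_cpu_supports("avx2");
    if (have_avx2)
        sum_avx2(a, b, out, n);
    else
        sum_scalar(a, b, out, n);
}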

                  Comment

                  • coder
                    Senior Member
                    • Nov 2014
                    • 8952

                    #39
                    Originally posted by Gamer1227 View Post
#include <cassert>
#include <cstdint>
#include <vector>

using namespace std;

// Sum the elements of two vectors into a third. (Parameters are passed by
// reference so the result actually reaches the caller.)
void sum(vector<int64_t>& a, const vector<int64_t>& b, vector<int64_t>& sum) {
    assert(a.size() == b.size());
    assert(a.size() == sum.size());
    // un-comment the if condition below
    // and see the great auto-vectorization crumble
    // works on both GCC and Clang!
    for (size_t i = 0; i < a.size(); i++) {
        //if (a[i] < 5) {
        //    a[i] = 0;
        //}
        sum[i] = a[i] + b[i];
    }
}


I wrote a small C++ example: just a function summing the elements of two vectors into a sum vector. I also helped the compiler by asserting that the three vectors have the same size.

As you can see, without the simple if condition it works well: if you hover the mouse over sum[i] = a[i] + b[i], the assembly is SSE instructions with xmm registers.

But if you uncomment that condition, the compiler falls back to scalar instructions, and it is only one very simple if condition; imagine anything more complex.
                    You failed to specify a -march option. If you simply add -march=sapphirerapids, it will vectorize that.

                    Originally posted by Gamer1227 View Post
The only way to vectorize with the if statement is to enable AVX2, which has conditional instructions. But then your code would not run on CPUs without AVX2.
                    It vectorizes with -march=sandybridge, which lacks AVX2.

                    Also, using clang-19.1 + -march=nehalem gets it vectorized. I think clang is now the leader in autovectorization. I'd focus on it, if you want to see the state of the art.

                    Originally posted by Gamer1227 View Post
In hand-written code, the solution is to create a scalar C path and an AVX2 path, with the path chosen by looking at CPU features. But auto-vectorized code is not able to do that; it only generates one path.
                    With a modicum of creativity, you can figure out how to use the preprocessor, macros, and the #include directive to utilize GCC's function multi-versioning feature to achieve the same effect, while writing the function only once.
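
The most compact form of that is a sketch like this (assuming GCC, or a recent Clang, on an x86 glibc target where target_clones is supported): you write the loop once, and the compiler emits one autovectorized clone per listed target plus a resolver that picks among them when the program loads.

#include <cstddef>
#include <cstdint>

// One source-level function, several compiled versions. GCC's function
// multi-versioning emits a clone per target and an ifunc resolver, so the
// AVX2 clone gets autovectorized with wider registers while the default
// clone stays baseline x86-64.
__attribute__((target_clones("avx2", "sse4.2", "default")))
void sum(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}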

                    Originally posted by Gamer1227 View Post
But again, it is a very simple if statement;
                    Yes, and you did almost nothing to help the compiler. Like I said, using built-ins or PGO will give it enough information to know how likely certain codepaths are. Using __restrict__ will help it avoid having to assume certain variables might alias.
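
As a sketch of those hints on the toy loop (whether a given compiler version then vectorizes the branchy form still varies, so treat this as illustration, not a guarantee):

#include <cstddef>
#include <cstdint>

// __restrict__ promises the buffers don't alias, so the vectorizer can
// skip the runtime overlap checks it would otherwise emit.
// __builtin_expect tells the compiler the clamp branch is rare.
void sum(int64_t* __restrict__ a, const int64_t* __restrict__ b,
         int64_t* __restrict__ out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(a[i] < 5, 0))
            a[i] = 0;
        out[i] = a[i] + b[i];
    }
}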

You seem to have this odd idea that optimizing code is an all-or-nothing affair: either the compiler should be all-knowing and do everything for you, without you having to lift a finger, or you have to drop into assembly language and do everything yourself. In most other professions, tradesmen learn to actually use their tools. The tool's job is to make your life easier, but you should master it if you want to extract the greatest benefit from it.

Like I said, I think the main reason a lot of people drop into assembly language is ego. They enjoy the dopamine hit they get when they believe they can do something better than the compiler. However, I've seen the generic C paths in some of these projects, and they did absolutely nothing to help the compiler out and use it more effectively. A cynical take on this is that they actually don't want the generic version to be very fast, so that they can keep justifying their hand-written assembly language.
                    Last edited by coder; 16 December 2024, 04:27 PM.

                    Comment

                    • Quackdoc
                      Senior Member
                      • Oct 2020
                      • 5072

                      #40
                      Originally posted by Gamer1227 View Post

Yeah, go tell the FFmpeg developers that hand-written assembly is stupid and that they should "just trust the compiler, bro!"

image-rs is an exception, and the exception proves the rule.

Look at all this "stupid" hand-written assembly:
https://github.com/FFmpeg/FFmpeg/tree/master/libavcodec
Bro, I said 1 cm and you took it as 100 km.

                      Comment
