Experimental -O3 Optimizing The Linux Kernel For Better Performance Brought Up Again

  • #31
    Poor Volta continues to insult me in every thread. As unhappy as I am I don't have the desire to do that to people on the net. Would be amazing if he at least added counter arguments, but, no, "You know shit" that's all he's capable of saying. And he's all alone in that - not a single other person in this discussion used the same verbiage. Amazing "intellect", what can I say.

    Comment


    • #32
      Originally posted by oiaohm View Post
      The big problem with -Ofast is
      "-fallow-store-data-races"


      The Linux kernel is multi-threaded in lots of places, so yes, this one feature of -Ofast can create a lot of race-condition locations in the Linux kernel.

      There are a few problem children inside Linux kernel space with -O3 as well, e.g. "-fpredictive-commoning"



      Ok, what if you have just performed an operation that altered a memory mapping from user space to kernel space, or the reverse? In some cases you need to replay the loads and stores exactly, so that the contents of the page tables exposed to user space and the contents of the page tables exposed to the kernel stay consistent.

      What is safe to do in user-space code, which only has a single set of page tables accessible, is not always safe when you do it from the kernel.
      This does not make any sense.

      Any and all such "commands" in kernel code would be wrapped in accessors and barriers that prohibit any and all reordering, anyway. Same goes for any accesses to memory shared between multiple threads and/or CPUs.

      Comment


      • #33
        Originally posted by birdie View Post
        Poor Volta continues to insult me in every thread. As unhappy as I am I don't have the desire to do that to people on the net. Would be amazing if he at least added counter arguments, but, no, "You know shit" that's all he's capable of saying. And he's all alone in that - not a single other person in this discussion used the same verbiage. Amazing "intellect", what can I say.
        You don't present any arguments, so there can be no counterarguments. You just stand there frothing at the mouth calling everyone names, so don't be surprised when people respond in the same way.

        Comment


        • #34
          Originally posted by intelfx View Post

          You don't present any arguments, so there can be no counterarguments. You just stand there frothing at the mouth calling everyone names, so don't be surprised when people respond in the same way.
          I haven't called anyone anything for ages. FO. And if I presented no arguments, the fuck he's saying what he's said? FO twice.

          Comment


          • #35
            Originally posted by birdie View Post

            GCC 12.1.

            O2 vs Ofast:
            Code:
            + -fallow-store-data-races [enabled]
            + -fassociative-math [enabled]
            + -fcx-limited-range [enabled]
            + -ffinite-math-only [enabled]
            + -fgcse-after-reload [enabled]
            + -fipa-cp-clone [enabled]
            + -floop-interchange [enabled]
            + -floop-unroll-and-jam [enabled]
            + -fmath-errno [disabled]
            + -fpeel-loops [enabled]
            + -fpredictive-commoning [enabled]
            + -freciprocal-math [enabled]
            + -fsemantic-interposition [disabled]
            + -fsigned-zeros [disabled]
            + -fsplit-loops [enabled]
            + -fsplit-paths [enabled]
            + -ftrapping-math [disabled]
            + -ftree-loop-distribution [enabled]
            + -ftree-partial-pre [enabled]
            + -funroll-completely-grow-size [enabled]
            + -funsafe-math-optimizations [enabled]
            + -funswitch-loops [enabled]
            + -fversion-loops-for-strides [enabled]
            O2 vs O3:
            Code:
            + -fgcse-after-reload [enabled]
            + -fipa-cp-clone [enabled]
            + -floop-interchange [enabled]
            + -floop-unroll-and-jam [enabled]
            + -fpeel-loops [enabled]
            + -fpredictive-commoning [enabled]
            + -fsplit-loops [enabled]
            + -fsplit-paths [enabled]
            + -ftree-loop-distribution [enabled]
            + -ftree-partial-pre [enabled]
            + -funroll-completely-grow-size [enabled]
            + -funswitch-loops [enabled]
            + -fversion-loops-for-strides [enabled]
            O3 vs Ofast:
            Code:
            + -fallow-store-data-races [enabled]
            + -fassociative-math [enabled]
            + -fcx-limited-range [enabled]
            + -ffinite-math-only [enabled]
            + -fmath-errno [disabled]
            + -freciprocal-math [enabled]
            + -fsemantic-interposition [disabled]
            + -fsigned-zeros [disabled]
            + -ftrapping-math [disabled]
            + -funsafe-math-optimizations [enabled]
            This doesn't cover all the differences, though. Many optimizations behave slightly differently between optimization levels, and sometimes those differences can be controlled via other parameters. Just don't assume you can reproduce -O3 by adding the extra optimization flags on top of -O2.

            Comment


            • #36
              Originally posted by Jannik2099 View Post

              No, the 1% is absolutely not OS code, and certainly not something with as much code rot as the linux kernel.

              Paging has absolutely nothing to do with this. It's neither required nor specially treated in the C standard, nor is there a magic -funsafe-paging flag in -O3 - it has absolutely nothing to do with the kind of UB that gets exposed by O3

              Again, if O3 breaks something that means YOU have a bug. And that bug WILL most likely manifest at O2 too at some point in time. Ignoring the issue because you don't want to fix it does not help.

              And again, no, all options in O3 ARE safe. All options in O3 are fully standards compliant in gcc and clang.
              Yeah, but until C11 you could barely write a kernel in standards-compliant C code. Most significantly, the Linux kernel heavily relies on specific compiler behavior that lies outside the C standard (temporarily casting variables to volatile for thread communication, which means the variables are not protected against other threads overwriting them outside those explicit reads and writes). Some of that might break if you enable -O3, since those optimizations have been tested for standards compliance, not for whether they can compile the kernel safely, so they might violate expectations the kernel has. It is slightly better these days with the standard memory model of C11 (which the kernel has used for less than a year), but with decades of code written before that, it probably isn't safe.
              Last edited by carewolf; 23 June 2022, 03:29 PM.

              Comment


              • #37
                Originally posted by birdie View Post

                I haven't called anyone anything for ages. FO. And if I presented no arguments, the fuck he's saying what he's said? FO twice.
                Still frothing at the mouth and throwing pathetic insults around. Classic birdie.

                Comment


                • #38
                  Originally posted by birdie View Post
                  In my 25+ years of using PC, laptops, etc. I've had 0 situations where ntoskrnl.exe or vmlinuz took a discernible amount of CPU time.
                  It's more of an issue for server apps, but it seems even something like a web browser can hit the kernel's memory management and filesystem code paths pretty hard. And more cores -> more threads -> more task-scheduling overhead.

                  I've mostly profiled server applications, but it's not unusual for me to see > 10% of time spent in the kernel, and that's just within the process I'm examining. I don't usually do system-wide profiling, so I can't say how much kernel time is unassociated with the process, but I've often seen overall sys time well above 10% in top.
                  Last edited by coder; 23 June 2022, 05:04 PM.

                  Comment


                  • #39
                    Originally posted by birdie View Post
                    Secondly, I bet in your example the kernel spent ~95% of time waiting for IO and ~5% getting you the data. Again, let's make the kernel twice as fast and as a result you'll shave off 0.0002ms? Woah.
                    No, that would be accounted for as iowait. You can indirectly observe it with time, on a single-threaded program, by computing real - (user + sys).

                    Originally posted by birdie View Post
                    You seemingly don't understand how the kernel works either. It's a proxy, it must be a proxy, if it does any serious work, it's badly coded.
                    Um, wow? So, filesystems, network stacks, device drivers, memory management, I/O scheduling, and thread scheduling aren't real work?

                    It really depends on how heavily you're leaning on them. Something like a database can bypass a little of that by doing direct I/O (which it shouldn't have to, if we had an ideal kernel), but the more of those areas you're touching, the more dependent you really are on kernel performance.

                    Comment


                    • #40
                      Originally posted by DanielG View Post
                      (Should probably be done with an AMD GPU, maybe their millions of lines of driver-code can benefit from more optimization?)
                      Any in-tree GPU driver should be equally exposed to kernel optimizations. And keep in mind that their line-count is inflated by lots of constant definitions.

                      Comment
