Building The Default x86_64 Linux Kernel In Just 16 Seconds


  • #41
    Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
    I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

    Here are the results

    linux-4.18 LTS compile

    Dell PowerEdge R810
    Xeon X7650 x 4
    64 GB RAM
    Funtoo Linux 1.3
    kernel 4.14.78-gentoo

    raid 6: time make -s -j 128
    real 1m19.368s
    user 49m0.871s
    sys 4m42.715s


    fusion io: time make -s -j 128
    real 1m18.847s
    user 49m46.197s
    sys 5m49.159s

    tmpfs: time make -s -j 128
    real 1m15.964s
    user 49m10.004s
    sys 4m55.751s


    So it seems, for me at least, it doesn't make much of a difference at all where the files are stored.



    • #42
      Originally posted by nuetzel View Post

      In which time...?
      15s of compile time on a Pentium 4.

      tcc does a lot less optimization than gcc or llvm.



      • #43
        Originally posted by Michael View Post

        It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
        With 128 GB of memory in both my dual-Xeon 2690 v2 and my Threadripper 2990WX boxes, I'd be mad not to use tmpfs for building just about everything I need to build.
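
        For anyone who hasn't tried it, a minimal sketch (the mount point and size here are just examples, not from this thread):
        Code:
        mkdir -p /mnt/kbuild
        mount -t tmpfs -o size=16G tmpfs /mnt/kbuild
        make O=/mnt/kbuild defconfig
        time make O=/mnt/kbuild -s -j$(nproc)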



        • #44
          Originally posted by cbdougla View Post
          Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
          I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

          Here are the results

          linux-4.18 LTS compile

          Dell PowerEdge R810
          Xeon X7650 x 4
          64 GB RAM
          Funtoo Linux 1.3
          kernel 4.14.78-gentoo

          raid 6: time make -s -j 128
          real 1m19.368s
          user 49m0.871s
          sys 4m42.715s


          fusion io: time make -s -j 128
          real 1m18.847s
          user 49m46.197s
          sys 5m49.159s

          tmpfs: time make -s -j 128
          real 1m15.964s
          user 49m10.004s
          sys 4m55.751s


          So it seems, for me at least, it doesn't make much of a difference at all where the files are stored.
          Which in turn means that there is something else acting as a scaling bottleneck*, other than I/O. RAM latency? RAM bandwidth? Build script issues? Single-threaded linking? The Linux scheduler? GCC scheduling issues, as the thread that does the scheduling is also busy with compilation? Etc., etc.

          * I'm referring to the fact that time does not improve linearly as we add extra threads, especially when we add A LOT of threads, like going from a one-socket EPYC to two sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

          From 20s to 16s, instead of being near 11-12s on the 7742.
          7601 scales better on two sockets, from 37s to 23s.
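
          As a rough back-of-the-envelope estimate (mine, not from the benchmark page): modelling the build time as T = serial + parallel/cores, the two data points above - 20s on one 7742 and 16s on two - give parallel/(2*cores) ≈ 4s and serial ≈ 12s. In other words, more than half of the single-socket wall time behaves as if it were serial (final link, make overhead, and so on), which alone would explain why the second socket only buys 4s.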



          • #45
            Originally posted by cbdougla View Post
            Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
            I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

            So it seems, for me at least, it doesn't make much of a difference at all where the files are stored.
            Perhaps try it with
            Code:
            nice --5 ionice -c2 -n7
            to see how much giving it higher scheduler and I/O priorities helps. I have my normal user and group set to be able to go down to nice -10, with 0 by default.
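
            (That's configured in /etc/security/limits.conf; the entries look roughly like this - the names are placeholders, not my actual setup:)
            Code:
            # /etc/security/limits.conf -- allow renicing down to -10
            youruser     -    nice    -10
            @yourgroup   -    nice    -10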

            I know that nicing my GUI programs to -5 and makepkg to +5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible, since the build takes all the priority without dramatically speeding up (maybe a minute at best, ~19), but it would be worth trying on a system like that to see how much of an effect priority has as the core count increases.

            I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. While it's not something I've timed a whole lot, I can say that compiling off a spinner dramatically increases GUI lag and can make a PC next to unusable, with 16 threads competing for some slow-ass I/O.



            • #46
              Originally posted by Michael View Post

              It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
              I don't build my kernels in tmpfs, but my /var/tmp/portage directory is tmpfs, which makes a big difference when compiling on Gentoo and also saves wear on the SSD.
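
              For reference, the fstab entry for that is along these lines (the size is just an example - pick one that fits your RAM):
              Code:
              tmpfs   /var/tmp/portage   tmpfs   size=16G,uid=portage,gid=portage,mode=775,noatime   0 0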



              • #47
                Originally posted by skeevy420 View Post

                Perhaps try it with
                Code:
                nice --5 ionice -c2 -n7
                to see how much giving it higher scheduler and I/O priorities helps. I have my normal user and group set to be able to go down to nice -10, with 0 by default.

                I know that nicing my GUI programs to -5 and makepkg to +5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible, since the build takes all the priority without dramatically speeding up (maybe a minute at best, ~19), but it would be worth trying on a system like that to see how much of an effect priority has as the core count increases.

                I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. While it's not something I've timed a whole lot, I can say that compiling off a spinner dramatically increases GUI lag and can make a PC next to unusable, with 16 threads competing for some slow-ass I/O.
                An interesting idea. I tried this and here's what I got:

                ella /fio1/tmp/linux-4.18 # nice --5 ionice -c2 -n7 sh compileit
                ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
                ld: warning: creating a DT_TEXTREL in object

                real 1m16.606s
                user 50m37.993s
                sys 5m21.957s

                The script "compileit" just had "time make -s -j128" in it.
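
                (The whole script was essentially just:)
                Code:
                #!/bin/sh
                # compileit -- contents as described above
                time make -s -j128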

                So still no difference.



                • #48
                  Originally posted by cbdougla View Post
                  Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
                  I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

                  Here are the results

                  linux-4.18 LTS compile

                  Dell PowerEdge R810
                  Xeon X7650 x 4
                  64 GB RAM
                  Funtoo Linux 1.3
                  kernel 4.14.78-gentoo

                  raid 6: time make -s -j 128
                  real 1m19.368s
                  user 49m0.871s
                  sys 4m42.715s


                  fusion io: time make -s -j 128
                  real 1m18.847s
                  user 49m46.197s
                  sys 5m49.159s

                  tmpfs: time make -s -j 128
                  real 1m15.964s
                  user 49m10.004s
                  sys 4m55.751s


                  So it seems, for me at least, it doesn't make much of a difference at all where the files are stored.
                  Redirect the output to /dev/null, or at least away from the terminal...
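
                  For instance:
                  Code:
                  time make -s -j128 > /dev/null 2>&1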



                  • #49
                    Originally posted by _Alex_ View Post

                    Which in turn means that there is something else acting as a scaling bottleneck*, other than I/O. RAM latency? RAM bandwidth? Build script issues? Single-threaded linking? The Linux scheduler? GCC scheduling issues, as the thread that does the scheduling is also busy with compilation? Etc., etc.

                    * I'm referring to the fact that time does not improve linearly as we add extra threads, especially when we add A LOT of threads, like going from a one-socket EPYC to two sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

                    From 20s to 16s, instead of being near 11-12s on the 7742.
                    7601 scales better on two sockets, from 37s to 23s.
                    I couldn't agree more. In particular, my choice of -j128 was only semi-arbitrary: I ran a few tests to make sure it wasn't going to suck and went with it.

                    But your post made me curious, so I ran some compiles in a loop with different thread counts (a sketch of the loop follows the results):

                    Compiling with 8 threads: real 4m5.837s
                    Compiling with 16 threads: real 2m16.562s
                    Compiling with 32 threads: real 1m33.063s
                    Compiling with 48 threads: real 1m22.983s
                    Compiling with 64 threads: real 1m16.787s
                    Compiling with 80 threads: real 1m12.038s
                    Compiling with 96 threads: real 1m17.209s
                    Compiling with 112 threads: real 1m16.330s
                    Compiling with 128 threads: real 1m17.286s
                    Compiling with 256 threads: real 1m23.081s
                    Compiling with 512 threads: real 1m22.763s
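
                    (A sketch of the loop - the exact script wasn't posted, and the clean between runs is an assumption:)
                    Code:
                    for j in 8 16 32 48 64 80 96 112 128 256 512; do
                        make -s clean
                        echo "Compiling with $j threads:"
                        time make -s -j "$j"
                    done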

                    This system has 32 cores (64 threads with HT), so it's not surprising to see almost linear improvement up to 32 threads.
                    After that, I definitely hit a wall. That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.

                    Also, this computer is under a little bit of load; the I/O system where I'm doing the compile isn't struggling at all, but I certainly can't devote all the CPUs to playing with this all the time.

                    BTW, I saw the 15 second load average get up to 329 on the 512 thread run. :-)

                    The linker seemed to run single-threaded (I only ever saw one ld process in top -- very unscientific).



                    • #50
                      Originally posted by cbdougla View Post
                      That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.
                      The surprisingly good results at 80-128 threads (overkill in threading: more context switching and task swapping per thread, more RAM use, more cache misses) are probably down to forcing the machine (which has a bit of load from other uses) to give more time slices to the compilation threads, in effect squeezing the time slices of the other tasks it is running. That would explain why 80 is perhaps the best-performing count (?).

                      I did a few tests myself on a Ryzen 2200G at a fixed 3.1 GHz, with 1, 2, 3 and 4 threads (I tried up to 128, but it was slowing down), on tmpfs:

                      1 thread:  894.47s
                      2 threads: 466.50s (52.1% of the single-threaded time vs. 50.0% for ideal scaling - 2.1 points off)
                      3 threads: 324.40s (36.2% vs. 33.3% ideal - 2.9 points off)
                      4 threads: 253.30s (28.3% vs. 25.0% ideal - 3.3 points off)

                      ...so scaling degrades as the thread count goes up, even at low core counts: the more threads, the bigger the gap from ideal (in percentage terms).

                      Another task that also has problematic scaling, even more so than kernel building, is video encoding:

                      https://openbenchmarking.org/embed.p...ha=5d172cc&p=2
                      https://openbenchmarking.org/embed.p...ha=22b4335&p=2
                      https://openbenchmarking.org/embed.p...ha=885218a&p=2

                      This is like hitting a wall in a very ugly way. How is it possible not to see a serious difference from adding a second processor (doubling the core/thread count) in tasks that can be highly parallelized? And the problem exists in both the SVT and the standard implementations of the codecs. I'm thinking that if YouTube runs video encoding on its servers, perhaps the ideal way to do it is to have, say, 128 instances of single-threaded video encodes rather than encoding one file at a time with 128 threads. But again, it makes me wonder where the bottleneck is and why scaling stops being linear (this case is more extreme - it practically stops scaling).
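
                      As a sketch of what I mean (the encoder, flags and paths are purely illustrative - not anything YouTube actually does):
                      Code:
                      # 128 independent single-threaded encodes instead of one encode using 128 threads
                      find . -name '*.y4m' -print0 | \
                          xargs -0 -P 128 -I {} ffmpeg -i {} -c:v libx264 -threads 1 -preset medium {}.mkv
                      As long as there are enough independent files to keep every core busy, this sidesteps the per-encode scaling wall entirely.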

