Building The Default x86_64 Linux Kernel In Just 16 Seconds


  • #41
    Originally posted by nuetzel View Post

    In which time...?
    15s of compile time, on a Pentium 4.

    tcc does far fewer optimizations than gcc or llvm.

    Comment


    • #42
      Originally posted by Michael View Post

      It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
      With 128GB of memory in both my dual-Xeon 2690 v2 and my Threadripper 2990WX boxes, I'd be mad not to use tmpfs for building just about everything I need to build.

      Comment


      • #43
        Originally posted by cbdougla View Post
        Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
        I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

        Here are the results

        linux-4.18 LTS compile

        Dell PowerEdge R810
        Xeon X7650 x 4
        64 GB RAM
        Funtoo Linux 1.3
        kernel 4.14.78-gentoo

        raid 6: time make -s -j 128
        real 1m19.368s
        user 49m0.871s
        sys 4m42.715s


        fusion io: time make -s -j 128
        real 1m18.847s
        user 49m46.197s
        sys 5m49.159s

        tmpfs: time make -s -j 128
        real 1m15.964s
        user 49m10.004s
        sys 4m55.751s


        So it seems, for me at least, that it doesn't make much of a difference where the files are stored.
        Which in turn means that something else acts as a scaling bottleneck* besides I/O. RAM latency? RAM bandwidth? Build-script issues? Single-threaded linking? The Linux scheduler? GCC scheduling issues, since the thread doing the scheduling is also doing compilation? Etc.

        * I'm referring to the fact that time does not improve linearly as we add extra threads, especially when we add a LOT of them, like going from a 1-socket Epyc to 2 sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

        From 20s to 16s, instead of being near 11-12s on the 7742.
        The 7601 scales better across two sockets, from 37s to 23s.
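For anyone wanting to repeat this comparison, the experiment can be sketched as a small helper (the mount points and tree name in the example are placeholders for whatever storage you want to compare; BUILD is overridable, and the tree is assumed to be already extracted and configured):

```shell
#!/bin/sh
# compare_build_dirs SRC DIR...: copy an already-configured source tree into
# each target directory and time an identical build there.
compare_build_dirs() {
    src=$1; shift
    name=$(basename "$src")
    for dir in "$@"; do
        rm -rf "${dir:?}/$name"          # clean any stale copy first
        cp -a "$src" "$dir/$name"        # stage the tree on this filesystem
        echo "=== $dir ==="
        ( cd "$dir/$name" && time sh -c "${BUILD:-make -s -j128}" )
    done
}
# Example (hypothetical mount points):
# compare_build_dirs linux-4.18 /mnt/raid6 /mnt/fusionio /run
```

Staging the copy first keeps the extraction cost out of the timed region, so `time` measures only the build.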

        Comment


        • #44
          Originally posted by cbdougla View Post
          Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
          I did it three times. Once on a raid 6 array, one on a fusion IO drive and one in /run (tmpfs)

          So it seems, for me at least, that it doesn't make much of a difference where the files are stored.
          Perhaps try it with
          Code:
          nice --5 ionice -c2 -n7
          to see how much giving it higher scheduler and I/O priorities helps. I have my normal user and group set up to be able to go as low as negative 10, with 0 by default.

          I know that nicing my GUI programs to negative 5 and makepkg to positive 5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible, taking all the priorities without dramatically speeding up build times (maybe a minute at best, ~19), but it would be worth trying on a system like that to see how much effect priority has as core count increases.

          I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. While it's not something I've timed a whole lot, I can say that compiling off a spinner dramatically increases GUI lag and can make a PC next to unusable, with 16 threads competing for some painfully slow I/O.
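The two recipes above can be made runnable with a small wrapper (a sketch, not a definitive setup: negative niceness requires the limits described in the post, and the nice/ionice values are the ones suggested there):

```shell
# with_prio NICE CLASS LEVEL CMD...: run CMD under the given CPU niceness and
# I/O scheduling class/level. Classes: 1 realtime, 2 best-effort (levels 0
# highest .. 7 lowest), 3 idle (ionice ignores the level for class 3).
with_prio() {
    n=$1 c=$2 l=$3; shift 3
    nice -n "$n" ionice -c "$c" -n "$l" "$@"
}
# Desktop-friendly build (low CPU priority, lowest best-effort I/O):
#   with_prio 5 2 7 make -s -j"$(nproc)"
# Aggressive build (needs permission for negative niceness):
#   with_prio -5 2 7 make -s -j"$(nproc)"
```

Note that `nice --5` in the quoted command is the obsolete spelling of `nice -n -5`; the `-n` form is the portable one.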

          Comment


          • #45
            Originally posted by Michael View Post

            It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
            I don't build my kernels in tmpfs, but my /var/tmp/portage directory is tmpfs, which makes a big difference when compiling on Gentoo and also saves wear on the SSD.
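For reference, that kind of setup is typically a one-line fstab entry (the size and ownership below are illustrative, not a recommendation; pick a size that fits the largest packages you build):

```
# /etc/fstab -- mount Gentoo's build directory on tmpfs
tmpfs   /var/tmp/portage   tmpfs   size=32G,uid=portage,gid=portage,mode=775,noatime   0 0
```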

            Comment


            • #46
              Originally posted by skeevy420 View Post

              Perhaps try it with
              Code:
              nice --5 ionice -c2 -n7
              to see how much giving it higher scheduler and I/O priorities helps. I have my normal user and group set up to be able to go as low as negative 10, with 0 by default.

              I know that nicing my GUI programs to negative 5 and makepkg to positive 5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible, taking all the priorities without dramatically speeding up build times (maybe a minute at best, ~19), but it would be worth trying on a system like that to see how much effect priority has as core count increases.

              I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. While it's not something I've timed a whole lot, I can say that compiling off a spinner dramatically increases GUI lag and can make a PC next to unusable, with 16 threads competing for some painfully slow I/O.
              An interesting idea. I tried this and here's what I got:

              ella /fio1/tmp/linux-4.18 # nice --5 ionice -c2 -n7 sh compileit
              ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
              ld: warning: creating a DT_TEXTREL in object

              real 1m16.606s
              user 50m37.993s
              sys 5m21.957s

              The script "compileit" just contained the "time make -s -j128" line.

              So still no difference.

              Comment


              • #47
                Originally posted by cbdougla View Post
                Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
                I did it three times. Once on a raid 6 array, one on a fusion IO drive and one in /run (tmpfs)

                So it seems, for me at least, that it doesn't make much of a difference where the files are stored.
                Redirect the output to /dev/null, or at least not to the terminal...

                Comment


                • #48
                  Originally posted by _Alex_ View Post

                  Which in turn means that something else acts as a scaling bottleneck* besides I/O. RAM latency? RAM bandwidth? Build-script issues? Single-threaded linking? The Linux scheduler? GCC scheduling issues, since the thread doing the scheduling is also doing compilation? Etc.

                  * I'm referring to the fact that time does not improve linearly as we add extra threads, especially when we add a LOT of them, like going from a 1-socket Epyc to 2 sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

                  From 20s to 16s, instead of being near 11-12s on the 7742.
                  The 7601 scales better across two sockets, from 37s to 23s.
                  I couldn't agree more. In particular, my choice of -j128 was only semi-arbitrary: I ran a few tests to make sure it wasn't going to suck, and went with it.

                  But your post made me curious so I ran some compiles in a loop with different thread counts:

                  Compiling with 8 threads: real 4m5.837s
                  Compiling with 16 threads: real 2m16.562s
                  Compiling with 32 threads: real 1m33.063s
                  Compiling with 48 threads: real 1m22.983s
                  Compiling with 64 threads: real 1m16.787s
                  Compiling with 80 threads: real 1m12.038s
                  Compiling with 96 threads: real 1m17.209s
                  Compiling with 112 threads: real 1m16.330s
                  Compiling with 128 threads: real 1m17.286s
                  Compiling with 256 threads: real 1m23.081s
                  Compiling with 512 threads: real 1m22.763s

                  This system has 32 cores (64 threads with HT), so it's not surprising to see almost linear improvement up to 32 threads.
                  After that, I definitely hit a wall. The sweet spot around 80 threads seems to be consistent for me across multiple tests. I am not sure why.

                  Also, this computer is under a little bit of load, but the I/O system where I am doing the compile is not struggling at all. I certainly can't devote all the CPUs to playing with this all the time, though.

                  BTW, I saw the 15 second load average get up to 329 on the 512 thread run. :-)

                  The linker seemed to run single-threaded (I only ever saw one ld process in top -- very unscientific).
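That sweep is naturally a loop; a minimal sketch (the build command is passed in so the loop is reusable; in the post it was `make -s`, and output is discarded so only the `time` figures remain on stderr):

```shell
# sweep_jobs CMD J1 J2 ...: time CMD at each -j value, discarding build output.
sweep_jobs() {
    cmd=$1; shift
    for j in "$@"; do
        echo "Compiling with $j threads:"
        time sh -c "$cmd -j$j > /dev/null 2>&1"
    done
}
# sweep_jobs "make -s" 8 16 32 48 64 80 96 112 128 256 512
```

A `make clean` (or a fresh tree) between iterations would be needed for the runs to be comparable; the numbers above suggest cbdougla did the equivalent.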

                  Comment


                  • #49
                    Originally posted by cbdougla View Post

                    That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.
                    The surprisingly good results at 80-128 threads (overkill threading: more context switching and task swapping per thread, more RAM use, more cache misses) are probably down to forcing the machine (which carries a bit of load from other uses) to grant more time slices to the compilation threads, in effect squeezing the time slices of the machine's other tasks. That would explain why 80 is perhaps the best performer (?).

                    I did a few tests myself on a Ryzen 3 2200G at a fixed 3.1 GHz, on 1, 2, 3 and 4 threads (I tried up to 128 but it was slowing down), on tmpfs:

                    1: 894.47s
                    2: 466.50s (52.2% of single-threaded, instead of an ideal 50.0% / 2.2 points off ideal scaling)
                    3: 324.40s (36.3% of single-threaded, instead of an ideal 33.3% / 2.9 points off)
                    4: 253.30s (28.3% of single-threaded, instead of an ideal 25.0% / 3.3 points off)

                    ...so it seems scaling efficiency degrades as thread count increases, even at low core counts. The more threads, the bigger the gap from ideal scaling (in percentage terms).
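Those ratios follow directly from the posted wall-clock seconds; as a quick check of the arithmetic:

```shell
# Share of the single-threaded time used at N threads, vs the ideal 1/N,
# from the wall-clock seconds quoted above.
awk 'BEGIN {
    base = 894.47
    split("2 466.50 3 324.40 4 253.30", f)
    for (i = 1; i <= 5; i += 2) {
        n = f[i]; t = f[i + 1]
        printf "%d threads: %.1f%% of single-threaded (ideal %.1f%%)\n",
               n, 100 * t / base, 100 / n
    }
}'
```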

                    Another task that also has problematic scaling, even more so than kernel building, is video encoding:

                    https://openbenchmarking.org/embed.p...ha=5d172cc&p=2
                    https://openbenchmarking.org/embed.p...ha=22b4335&p=2
                    https://openbenchmarking.org/embed.p...ha=885218a&p=2

                    This is like hitting a wall in a very ugly way. How is it possible not to see a serious difference from adding a second processor (doubling the core/thread count) in tasks that can be highly parallelized? And the problem exists in both the SVT and the standard implementations of the codecs. I'm thinking that if YouTube runs video encoding on its servers, the ideal way to do it is perhaps to run, say, 128 instances of single-threaded encodes rather than encoding one file at a time with 128 threads. But again it makes me wonder where the bottleneck is and why scaling stops being linear (here it's more extreme - it practically stops scaling at all).
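The "many independent single-threaded encodes" idea is essentially plain xargs; a sketch (the ffmpeg line is illustrative, assuming x264 via ffmpeg, and any single-threaded encoder command works the same way):

```shell
# parallel_each JOBS CMD...: run CMD once per stdin line, JOBS at a time,
# substituting {} with the line. This is just xargs, wrapped for clarity.
parallel_each() {
    jobs=$1; shift
    xargs -P "$jobs" -I{} "$@"
}
# One single-threaded encode per input file, one per core:
# printf '%s\n' *.y4m | parallel_each "$(nproc)" \
#     ffmpeg -i {} -threads 1 -c:v libx264 -crf 23 {}.mkv
```

This sidesteps the encoder's internal scaling wall entirely: throughput scales with the number of independent processes, at the cost of higher latency per individual file.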

                    Comment


                    • #50
                      Originally posted by tchiwam View Post

                      Redirect output to Null or at least Not on the terminal...
                      Good idea, but there was no output because of the -s switch to make.

                      Comment
