Building The Default x86_64 Linux Kernel In Just 16 Seconds


  • Building The Default x86_64 Linux Kernel In Just 16 Seconds

    Phoronix: Building The Default x86_64 Linux Kernel In Just 16 Seconds

It's now been one week since the launch of AMD's EPYC Rome processors, with up to 64 cores / 128 threads per socket and a healthy IPC uplift over their previous-generation parts. Rome has outperformed Intel Xeon Scalable CPUs in their class while offering better power efficiency and far better performance-per-dollar. One of my favorite metrics has been how quickly the new EPYC 7742 2P can build the Linux kernel...

    http://www.phoronix.com/scan.php?pag...ds-AMD-EPYC-2P
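
    For anyone wanting to try the same kind of run locally, the rough recipe looks something like this. It's only a sketch (run from inside an extracted kernel source tree), not the exact test profile Phoronix uses:

    Code:
    # Generate the default x86_64 kernel config and time a full build
    # using every hardware thread.
    make defconfig
    time make -j"$(nproc)"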

  • _Alex_
    replied
    Originally posted by cbdougla View Post
    What really surprised me in my test run is how little worse 512 threads was.
    Yeah, this kind of buries the scheduling-issue theory. Supposedly one thread has to schedule the work, so if that thread is itself busy, dispatching hundreds of jobs should get far slower. That probably happens at two levels: in the build scripts, where one job finishes and the next has to be started, and in the kernel scheduler, which assigns the jobs to hardware threads. But it wasn't very noticeable.

    The video encode benchmark is interesting. In relation to your thoughts on one encode per core, I wonder whether, if they ran two simultaneous encodes on the 2 x 7742 configuration, they would get better aggregate results, or whether by then they're hitting the limits of one of the other subsystems.
    Without having tested anything related to video encoding, my gut feeling is that it would probably go far better in this scenario. And perhaps, if one task got assigned the 128 threads of one 7742 and the other task got the 128 threads of the second 7742, it would scale almost linearly. Or one can hope, anyway.
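
    To make the "one encode per socket" idea concrete, here's a rough sketch of how one might test it. The input files are placeholders, and it assumes numactl is installed and that each socket is a single NUMA node (check numactl --hardware for the real topology):

    Code:
    # Hypothetical test on a 2 x EPYC 7742 box: pin one encode (CPU and memory)
    # to each socket and run them in parallel.
    numactl --cpunodebind=0 --membind=0 ffmpeg -i in1.y4m -c:v libx264 out1.mkv &
    numactl --cpunodebind=1 --membind=1 ffmpeg -i in2.y4m -c:v libx264 out2.mkv &
    wait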



  • cbdougla
    replied
    Originally posted by _Alex_ View Post

    Perhaps the surprisingly good results at 80-128 threads (overkill in threading: more context switching, more task swapping per thread, more RAM use, more cache misses) come down to forcing the machine (which carries a bit of load from other uses) to give more time slices to the compilation threads, in effect squeezing the time slices of the other tasks it is running. That would explain why 80 is perhaps the best performer (?).

    I did a few tests myself on a Ryzen 3 2200G at a fixed 3.1 GHz, with 1-4 threads (I tried up to 128 but it only got slower), on tmpfs:

    1 thread: 894.47 s
    2 threads: 466.50 s (52.2% of the single-threaded time; ideal scaling would be 50.0%)
    3 threads: 324.40 s (36.3%; ideal 33.3%)
    4 threads: 253.30 s (28.3%; ideal 25.0%)

    ...it seems the improvement degrades as thread count increases, even at low core counts: the gap from ideal scaling widens with every extra thread (roughly 2.2, 2.9 and 3.3 percentage points here).

    Another task that also has problematic scaling, even more so than kernel building, is video encoding:

    https://openbenchmarking.org/embed.p...ha=5d172cc&p=2
    https://openbenchmarking.org/embed.p...ha=22b4335&p=2
    https://openbenchmarking.org/embed.p...ha=885218a&p=2

    This is like hitting a wall in a very ugly way. How is it possible not to see a serious difference from adding a second processor (which doubles the core/thread count) on tasks that should parallelize well? And the problem exists in both the SVT and the standard implementations of the codecs. I'm thinking that if YouTube runs video encoding on its servers, perhaps the ideal way to do it is to have, say, 128 instances of single-threaded encodes rather than encoding one file at a time with 128 threads. But again it makes me wonder where the bottleneck is and why it stops scaling linearly (this case is more extreme - it practically stops scaling).
    What really surprised me in my test run is how little worse 512 threads was. Your theory on why 80 is a sweet spot sounds reasonable to me. I was theorizing myself that it was simply the number where all the threads are kept busy but without too much context-switch or cache-thrashing overhead.

    The video encode benchmark is interesting. In relation to your thoughts on one encode per core, I wonder whether, if they ran two simultaneous encodes on the 2 x 7742 configuration, they would get better aggregate results, or whether by then they're hitting the limits of one of the other subsystems.



  • cbdougla
    replied
    Originally posted by tchiwam View Post

    Redirect output to /dev/null, or at least not to the terminal...
    Good idea, but there was no output because of the -s switch to make.



  • _Alex_
    replied
    Originally posted by cbdougla View Post
    That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.
    Perhaps the surprisingly good results at 80-128 threads (overkill in threading: more context switching, more task swapping per thread, more RAM use, more cache misses) come down to forcing the machine (which carries a bit of load from other uses) to give more time slices to the compilation threads, in effect squeezing the time slices of the other tasks it is running. That would explain why 80 is perhaps the best performer (?).

    I did a few tests myself on a Ryzen 3 2200G at a fixed 3.1 GHz, with 1-4 threads (I tried up to 128 but it only got slower), on tmpfs:

    1 thread: 894.47 s
    2 threads: 466.50 s (52.2% of the single-threaded time; ideal scaling would be 50.0%)
    3 threads: 324.40 s (36.3%; ideal 33.3%)
    4 threads: 253.30 s (28.3%; ideal 25.0%)

    ...it seems the improvement degrades as thread count increases, even at low core counts: the gap from ideal scaling widens with every extra thread (roughly 2.2, 2.9 and 3.3 percentage points here).
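
    For what it's worth, those percentages can be reproduced with a quick one-liner (the times are the ones measured above; the "ideal" column is simply 1/N):

    Code:
    # Fraction of the single-threaded time actually achieved vs. the ideal 1/N.
    awk 'BEGIN {
        t[1]=894.47; t[2]=466.50; t[3]=324.40; t[4]=253.30
        for (n = 2; n <= 4; n++)
            printf "%d threads: %.1f%% of single-threaded (ideal %.1f%%)\n",
                   n, 100*t[n]/t[1], 100/n
    }'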

    Another task that also has problematic scaling, even more so than kernel building, is video encoding:

    https://openbenchmarking.org/embed.p...ha=5d172cc&p=2
    https://openbenchmarking.org/embed.p...ha=22b4335&p=2
    https://openbenchmarking.org/embed.p...ha=885218a&p=2

    This is like hitting a wall in a very ugly way. How is it possible not to see a serious difference from adding a second processor (which doubles the core/thread count) on tasks that should parallelize well? And the problem exists in both the SVT and the standard implementations of the codecs. I'm thinking that if YouTube runs video encoding on its servers, perhaps the ideal way to do it is to have, say, 128 instances of single-threaded encodes rather than encoding one file at a time with 128 threads. But again it makes me wonder where the bottleneck is and why it stops scaling linearly (this case is more extreme - it practically stops scaling).
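
    As a rough sketch of that "many independent single-threaded encodes" idea (assuming GNU parallel is installed; the clip names are placeholders, and whether this actually beats one 128-thread encode is exactly the open question):

    Code:
    # Hypothetical: 128 concurrent single-threaded encodes instead of one
    # encode spread across 128 threads.
    parallel -j 128 \
        'ffmpeg -loglevel error -i {} -c:v libx264 -threads 1 {.}-out.mkv' \
        ::: clips/*.y4m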



  • cbdougla
    replied
    Originally posted by _Alex_ View Post

    Which in turn means that there is something else acting as a scaling bottleneck*, other than I/O. RAM latency? RAM bandwidth? Build-script issues? Single-threaded linking? The Linux scheduler? GCC/make scheduling issues, where the thread doing the scheduling is also doing compilation work? Etc., etc.

    * I'm referring to the fact that time doesn't improve linearly as we add extra threads, especially when we add A LOT of threads, like going from a one-socket EPYC to two sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

    From 20s to 16s on the 7742, instead of being near 11-12s.
    The 7601 scales better across two sockets, from 37s to 23s.
    I couldn't agree more. In particular, my choice of -j128 was only semi-arbitrary. I ran a few tests to make sure it wasn't going to suck and used it.

    But your post made me curious so I ran some compiles in a loop with different thread counts:

    Compiling with 8 threads: real 4m5.837s
    Compiling with 16 threads: real 2m16.562s
    Compiling with 32 threads: real 1m33.063s
    Compiling with 48 threads: real 1m22.983s
    Compiling with 64 threads: real 1m16.787s
    Compiling with 80 threads: real 1m12.038s
    Compiling with 96 threads: real 1m17.209s
    Compiling with 112 threads: real 1m16.330s
    Compiling with 128 threads: real 1m17.286s
    Compiling with 256 threads: real 1m23.081s
    Compiling with 512 threads: real 1m22.763s
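
    (The loop itself isn't shown here, but it was presumably something along these lines; a sketch only, with the thread counts matching the results above:)

    Code:
    # Rough reconstruction of the test loop: clean, then time a build at each -j value.
    for j in 8 16 32 48 64 80 96 112 128 256 512; do
        make -s clean
        echo "Compiling with $j threads:"
        time make -s -j "$j"
    done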

    This system has 32 cores (64 threads with HT), so it's not surprising to see almost linear improvement up to 32 threads.
    After that, I definitely hit a wall. That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.

    Also, this computer is under a little bit of load. The I/O system where I'm doing the compile isn't struggling at all, but I certainly can't devote all the CPUs to playing with this all the time.

    BTW, I saw the 15 second load average get up to 329 on the 512 thread run. :-)

    The linker seemed to run single-threaded (I only ever saw one ld process in top -- very unscientific).



  • tchiwam
    replied
    Originally posted by cbdougla View Post
    Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
    I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

    Here are the results

    linux-4.18 LTS compile

    Dell PowerEdge R810
    Xeon X7650 x 4
    64 GB RAM
    Funtoo Linux 1.3
    kernel 4.14.78-gentoo

    raid 6: time make -s -j 128
    real 1m19.368s
    user 49m0.871s
    sys 4m42.715s


    fusion io: time make -s -j 128
    real 1m18.847s
    user 49m46.197s
    sys 5m49.159s

    tmpfs: time make -s -j 128
    real 1m15.964s
    user 49m10.004s
    sys 4m55.751s


    So it seems, for me at least, that it doesn't make much of a difference where the files are stored.
    Redirect output to /dev/null, or at least not to the terminal...
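
    For reference, something like this is what's meant (purely illustrative; the runs above already used make -s, which suppresses most of the output anyway):

    Code:
    # Discard stdout and stderr so terminal rendering can't slow the build down.
    time make -j128 > /dev/null 2>&1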



  • cbdougla
    replied
    Originally posted by skeevy420 View Post

    Perhaps try it with
    Code:
    nice --5 ionice -c2 -n7
    to see how much giving it higher scheduler and I/O priorities helps. I have my normal user and group set up so they can go as low as nice -10, with 0 by default.

    I know that nicing my GUI programs to negative 5 and makepkg to positive 5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible by grabbing all the priorities without dramatically speeding up build times (maybe a minute at best, ~19 minutes), but it would be worth trying on a system like that to see how much of an effect priority has as core count increases.

    I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. It's not something I've timed a whole lot, but I can say that compiling off a spinner dramatically increases GUI lag and can make a PC next to unusable, with 16 threads competing for some slow-ass I/O.
    An interesting idea. I tried this and here's what I got:

    ella /fio1/tmp/linux-4.18 # nice --5 ionice -c2 -n7 sh compileit
    ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
    ld: warning: creating a DT_TEXTREL in object

    real 1m16.606s
    user 50m37.993s
    sys 5m21.957s

    the script "compileit" had the "time make -s -j128" in it.

    So still no difference.



  • FireBurn
    replied
    Originally posted by Michael View Post

    It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
    I don't build my kernels in tmpfs, but my /var/tmp/portage directory is tmpfs, which makes a big difference when compiling on Gentoo; it also saves wear on the SSD.
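
    For anyone curious how that's typically set up, an fstab line along these lines does it (a sketch only; the size is a guess, and on some setups uid=/gid= may need numeric IDs):

    Code:
    # /etc/fstab - keep Portage's build directory in RAM
    tmpfs   /var/tmp/portage   tmpfs   size=16G,uid=portage,gid=portage,mode=775,noatime   0 0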



  • skeevy420
    replied
    Originally posted by cbdougla View Post
    Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
    I did it three times: once on a RAID 6 array, once on a Fusion-io drive, and once in /run (tmpfs).

    So it seems, for me at least, that it doesn't make much of a difference where the files are stored.
    Perhaps try it with
    Code:
    nice --5 ionice -c2 -n7
    to see how much giving it higher scheduler and I/O priorities helps. I have my normal user and group set up so they can go as low as nice -10, with 0 by default.

    I know that nicing my GUI programs to negative 5 and makepkg to positive 5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible by grabbing all the priorities without dramatically speeding up build times (maybe a minute at best, ~19 minutes), but it would be worth trying on a system like that to see how much of an effect priority has as core count increases.

    I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. It's not something I've timed a whole lot, but I can say that compiling off a spinner dramatically increases GUI lag and can make a PC next to unusable, with 16 threads competing for some slow-ass I/O.
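
    One common way to set up that kind of persistent negative-nice allowance is a pam_limits rule; this is just a sketch with placeholder names, not necessarily how it's configured here:

    Code:
    # /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/)
    # Allow this user and group to raise priority up to nice -10; processes
    # still start at the default niceness of 0 unless they ask for more.
    youruser    -    nice    -10
    @yourgroup  -    nice    -10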

