Building The Default x86_64 Linux Kernel In Just 16 Seconds


  • #51
    Originally posted by _Alex_ View Post

    Perhaps the surprisingly good results at 80-128 threads (overkill in threading: more context switching and task swapping per thread, more RAM use, more cache misses) come down to forcing the machine (which carries a bit of load from other uses) to give more time slices to the compilation threads, in effect reducing the time slices of the other tasks the machine is running. That would explain why 80 is perhaps the best performer (?).

    I did a few tests myself on a Ryzen 2200G @ 3.1 GHz fixed frequency with 1-2-3-4 threads (I tried up to 128 but it just slowed down), on tmpfs:

    1: 894.47 s
    2: 466.50 s (0.521x the single-threaded time, vs. 0.500x ideally; 0.021 off ideal scaling)
    3: 324.40 s (0.362x the single-threaded time, vs. 0.333x ideally; 0.029 off ideal scaling)
    4: 253.30 s (0.283x the single-threaded time, vs. 0.250x ideally; 0.033 off ideal scaling)

    ...it seems the scaling efficiency degrades as thread count increases, even at low core counts. The more threads, the bigger the gap between the measured time and what ideal scaling would predict (in relative terms). A rough check of these numbers is sketched below.
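
    As a rough back-of-the-envelope check, here is a minimal Python sketch using the timings above, assuming plain Amdahl's law (the implied serial fraction is only illustrative, not a measured quantity):

```python
# Scaling check for the 2200G timings above, assuming plain Amdahl's law.
times = {1: 894.47, 2: 466.50, 3: 324.40, 4: 253.30}  # threads -> seconds
t1 = times[1]

for n, t in times.items():
    speedup = t1 / t
    print(f"{n} threads: speedup {speedup:.3f}x, efficiency {speedup / n:.1%}")

# Amdahl's law: T(n) = T(1) * (s + (1 - s) / n). Solving for the serial
# fraction s from each measurement; if the values roughly agree, a constant
# serial/overhead share explains the growing gap to ideal scaling.
for n, t in times.items():
    if n > 1:
        s = (t / t1 - 1 / n) / (1 - 1 / n)
        print(f"serial fraction implied by {n} threads: {s:.3f}")
```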

    Another task that also has problematic scaling, even more so than kernel building, is video encoding:

    https://openbenchmarking.org/embed.p...ha=5d172cc&p=2
    https://openbenchmarking.org/embed.p...ha=22b4335&p=2
    https://openbenchmarking.org/embed.p...ha=885218a&p=2

    This is like hitting a wall in a very ugly way. How is it possible not to see a serious difference from adding a second processor (doubling the core/thread count) in tasks that can be highly parallelized? And the problem exists in both the SVT and the standard implementations of the codecs. I'm thinking that if YouTube runs video encoding on its servers, perhaps the ideal approach is to run, say, 128 single-threaded encode instances rather than encoding one file at a time with 128 threads (a sketch of that idea is below). But again, it makes me wonder where the bottleneck is and why scaling stops being linear (this case is more extreme: it practically stops scaling).
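
    Here is a minimal sketch of that "one single-threaded encode per core" idea, assuming an ffmpeg build with the SVT-AV1 encoder available; the directory name and settings are placeholders, not anything from the benchmark:

```python
# Toy encode farm: one single-threaded encode per CPU instead of one encode
# spread across every thread. Paths and encoder choice are placeholders.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def encode(src: Path) -> int:
    dst = src.with_name(src.stem + ".av1.mkv")
    # -threads 1 keeps each ffmpeg/SVT-AV1 instance on a single thread.
    cmd = ["ffmpeg", "-y", "-threads", "1", "-i", str(src),
           "-c:v", "libsvtav1", str(dst)]
    return subprocess.run(cmd, check=False).returncode

sources = sorted(Path("clips").glob("*.mkv"))  # hypothetical input directory
workers = os.cpu_count() or 1                  # e.g. 128 on a single 7742

# Threads are fine here: each worker just blocks on its ffmpeg subprocess.
with ThreadPoolExecutor(max_workers=workers) as pool:
    codes = list(pool.map(encode, sources))

print(f"{sum(c == 0 for c in codes)} of {len(codes)} encodes succeeded")
```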
    What really surprised me in my test run is that 512 threads wasn't worse than it was. Your theory on why 80 is a sweet spot sounds reasonable to me. I was theorizing myself that it was simply the number where all the threads were kept busy without too much context-switch or cache-thrashing overhead.

    The video encode benchmark is interesting. In relation to your thoughts on one encode per core, I wonder whether, if they ran two simultaneous encodes on the 2 x 7742 configuration, they would get better aggregate results, or whether by then they're reaching the limits of one of the other subsystems.



    • #52
      Originally posted by cbdougla View Post
      What really surprised me in my test run is that 512 threads wasn't worse than it was.
      Yeah, this kind of buries the scheduling-issue theory. Supposedly one thread has to schedule the work, so if the thread doing the scheduling is busy itself, handing out work to hundreds of threads should go far slower. This probably happens at two levels: at the build-script level, where one job finishes and the next one is launched, and at the kernel scheduler, which assigns tasks to hardware threads (a toy illustration of the first level is sketched below). But it wasn't very noticeable.
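
      Here is a toy Python sketch of that first, build-script level of dispatch, assuming a make -jN style parent that keeps at most N jobs in flight; it only illustrates the single-dispatcher pattern, not how make itself is implemented:

```python
# Toy single-dispatcher job runner: one parent loop hands out jobs to at most
# `jobs` concurrent workers. If this loop is slow or starved of CPU time,
# finished slots stay empty for longer and hundreds of workers sit idle.
import subprocess
import time
from collections import deque

def run_jobs(commands, jobs):
    pending = deque(commands)
    running = []
    while pending or running:
        # Dispatch: refill free slots as soon as the dispatcher gets around to it.
        while pending and len(running) < jobs:
            running.append(subprocess.Popen(pending.popleft()))
        # Reap finished jobs, then sleep briefly before dispatching again.
        running = [p for p in running if p.poll() is None]
        time.sleep(0.001)

if __name__ == "__main__":
    # Placeholder workload; a real driver would queue actual compiler commands.
    run_jobs([["true"]] * 1000, jobs=128)
```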

      The video encode benchmark is interesting. In relation to your thoughts on one encode per core, I wonder whether, if they ran two simultaneous encodes on the 2 x 7742 configuration, they would get better aggregate results, or whether by then they're reaching the limits of one of the other subsystems.
      Without having tested anything related to video encoding, my gut feeling is that it would probably go far better in this scenario. And perhaps, if one task were assigned the 128 threads of the first 7742 and the other task the 128 threads of the second 7742, it would scale almost linearly. Or one can hope, anyway (a sketch of that kind of pinning is below).
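
      A minimal sketch of that kind of pinning, assuming a Linux box where hardware threads 0-127 belong to the first socket and 128-255 to the second; the real CPU numbering depends on the machine, so it would need checking with lscpu first (the ffmpeg commands are again just placeholders):

```python
# Launch two encodes, each pinned to one socket's CPUs via sched_setaffinity.
# The CPU ranges below are an assumption for a dual 7742 box; verify with lscpu.
import os
import subprocess

SOCKET_CPUS = {
    0: set(range(0, 128)),    # assumed hardware threads of socket 0
    1: set(range(128, 256)),  # assumed hardware threads of socket 1
}

def pinned_encode(cmd, socket):
    cpus = SOCKET_CPUS[socket]
    # preexec_fn runs in the child before exec, so the encoder and all the
    # threads it spawns inherit this affinity mask.
    return subprocess.Popen(cmd, preexec_fn=lambda: os.sched_setaffinity(0, cpus))

# Placeholder encoder command lines; substitute the real benchmark commands.
jobs = [
    pinned_encode(["ffmpeg", "-y", "-i", "a.mkv", "-c:v", "libsvtav1", "a.av1.mkv"], 0),
    pinned_encode(["ffmpeg", "-y", "-i", "b.mkv", "-c:v", "libsvtav1", "b.av1.mkv"], 1),
]
for p in jobs:
    p.wait()
```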



      • #53
        How about recompiling it in less than 8 seconds on a 5900X? ;-)
        Russell published an interesting post about his first experience with Firebuild accelerating refpolicy’s and the Linux kernel’s build. It turned out a few small tweaks could accelerate …

