Building The Default x86_64 Linux Kernel In Just 16 Seconds


  • #41
    Originally posted by nuetzel View Post

    In which time...?
    15s of compile time on a Pentium 4.

    tcc does far fewer optimizations than gcc or llvm.


    • #42
      Originally posted by Michael View Post

      It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
      With 128GB of memory in both my dual-Xeon 2690 v2 and my Threadripper 2990WX boxes, I'd be mad not to use tmpfs for building just about everything I need to build.
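
      For anyone who hasn't tried it, a tmpfs build is only a couple of commands. A minimal sketch; the mount point, the 32G size and the source path below are placeholders, not anything from the article:
      Code:
      # put the source tree on a RAM-backed filesystem and build there
      sudo mkdir -p /mnt/rambuild
      sudo mount -t tmpfs -o size=32G,mode=1777 tmpfs /mnt/rambuild
      cp -a ~/src/linux-4.18 /mnt/rambuild/
      cd /mnt/rambuild/linux-4.18
      make defconfig
      time make -s -j"$(nproc)"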


      • #43
        Originally posted by cbdougla View Post
        Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
        I did it three times. Once on a raid 6 array, one on a fusion IO drive and one in /run (tmpfs)

        Here are the results

        linux-4.18 LTS compile

        Dell PowerEdge R810
        Xeon X7650 x 4
        64 GB RAM
        Funtoo Linux 1.3
        kernel 4.14.78-gentoo

        raid 6: time make -s -j 128
        real 1m19.368s
        user 49m0.871s
        sys 4m42.715s


        fusion io: time make -s -j 128
        real 1m18.847s
        user 49m46.197s
        sys 5m49.159s

        tmpfs: time make -s -j 128
        real 1m15.964s
        user 49m10.004s
        sys 4m55.751s


        So it seems, for me at least, it doesn't make much of a difference where the files are stored.
        Which in turn means that there is something else acting as a scaling bottleneck*, other than I/O. RAM latency? RAM bandwidth? Build script issues? Single-threaded linking? The Linux scheduler? GCC scheduling issues, since the thread that does the scheduling is also busy with compilation? Etc., etc.

        * I'm referring to the fact that the time does not improve linearly as we add extra threads, especially when we add A LOT of threads, like going from a 1-socket EPYC to 2 sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

        The 7742 goes from 20s to 16s, instead of being near 11-12s.
        The 7601 scales better across two sockets, from 37s to 23s.


        • #44
          Originally posted by cbdougla View Post
          Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
          I did it three times. Once on a raid 6 array, one on a fusion IO drive and one in /run (tmpfs)

          So it seems, for me at least, it doesn't make much of a difference where the files are stored.
          Perhaps try it with
          Code:
          nice --5 ionice -c2 -n7
          to see how much giving it higher scheduler and I/O priority helps. I have my normal user and group set up to be able to go down to nice -10, with 0 as the default.

          I know that nicing my GUI programs to -5 and makepkg to +5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible, since the build grabs all the priority without dramatically speeding up (maybe a minute at best, ~19), but it would be worth trying on a system like that to see how much of an effect priority has as the core count increases.

          I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. While not something I've timed a whole lot, I can say that it dramatically increases GUI lag when compiling off a spinner and can make a PC next to unusable due to 16 threads competing for some slow-ass I/O.
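
          For reference, the two setups would look roughly like this; the -j16 and the kernel tree are just examples, and a negative nice value normally needs root or a limits.conf/RLIMIT_NICE entry like the one I mentioned:
          Code:
          # "boosted" build: nice -5 (same idea as "nice --5", written with the explicit -n form)
          # plus the best-effort I/O class (-c2) at level 7
          nice -n -5 ionice -c2 -n7 make -s -j16

          # "background" build: nice +5 plus the idle I/O class (-c3), keeps the GUI smooth
          nice -n 5 ionice -c3 make -s -j16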


          • #45
            Originally posted by Michael View Post

            It's not a matter of "not heard of", but rather trying to be realistic - how many people actually build in tmpfs?
            I don't build my kernels in tmpfs, but my /var/tmp/portage directory is tmpfs, which makes a big difference when compiling on Gentoo and also saves wear on the SSD.
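
            For anyone curious, it's just an fstab entry along these lines (the size, uid/gid and mode are only an example, tune to taste):
            Code:
            # /etc/fstab: keep Portage's build area in RAM
            tmpfs   /var/tmp/portage   tmpfs   size=16G,uid=portage,gid=portage,mode=775,noatime   0 0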


            • #46
              Originally posted by skeevy420 View Post

              Perhaps try it with
              Code:
              nice --5 ionice -c2 -n7
              to see how much giving it higher scheduler and I/O priority helps. I have my normal user and group set up to be able to go down to nice -10, with 0 as the default.

              I know that nicing my GUI programs to -5 and makepkg to +5, with ionice set to -c3, has a noticeable effect on compile times; it turns 20-minute kernel builds into 22-minute builds, but I get no UI lag whatsoever (totally worth it on a desktop). I don't normally run what I suggested above because it makes the GUI lag horrible, since the build grabs all the priority without dramatically speeding up (maybe a minute at best, ~19), but it would be worth trying on a system like that to see how much of an effect priority has as the core count increases.

              I'll also add to the general conversation that building in /tmp does matter when spinning disks are involved on heavily multi-threaded systems. While not something I've timed a whole lot, I can say that it dramatically increases GUI lag when compiling off a spinner and can make a PC next to unusable due to 16 threads competing for some slow-ass I/O.
              An interesting idea. I tried this and here's what I got:

              ella /fio1/tmp/linux-4.18 # nice --5 ionice -c2 -n7 sh compileit
              ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
              ld: warning: creating a DT_TEXTREL in object

              real 1m16.606s
              user 50m37.993s
              sys 5m21.957s

              the script "compileit" had the "time make -s -j128" in it.

              So still no difference.
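
              (For the record, "compileit" is nothing fancy -- essentially just this:)
              Code:
              #!/bin/sh
              # compileit: time a silent parallel build of the already-configured tree
              time make -s -j128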


              • #47
                Originally posted by cbdougla View Post
                Just out of curiosity, I decided to try compiling this kernel on a computer I have access to.
                I did it three times. Once on a raid 6 array, one on a fusion IO drive and one in /run (tmpfs)

                Here are the results

                linux-4.18 LTS compile

                Dell PowerEdge R810
                Xeon X7650 x 4
                64 GB RAM
                Funtoo Linux 1.3
                kernel 4.14.78-gentoo

                raid 6: time make -s -j 128
                real 1m19.368s
                user 49m0.871s
                sys 4m42.715s


                fusion io: time make -s -j 128
                real 1m18.847s
                user 49m46.197s
                sys 5m49.159s

                tmpfs: time make -s -j 128
                real 1m15.964s
                user 49m10.004s
                sys 4m55.751s


                So it seems, for me at least, it doesn't make much of a difference where the files are stored.
                Redirect the output to /dev/null, or at least not to the terminal...
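
                i.e. something like this, assuming bash (the -j count is just carried over from the quoted post):
                Code:
                # send both stdout and stderr somewhere other than the terminal
                time make -j 128 > /dev/null 2>&1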


                • #48
                  Originally posted by _Alex_ View Post

                  Which in turn means that there is something else acting as a scaling bottleneck*, other than I/O. RAM latency? RAM bandwidth? Build script issues? Single-threaded linking? The Linux scheduler? GCC scheduling issues, since the thread that does the scheduling is also busy with compilation? Etc., etc.

                  * I'm referring to the fact that the time does not improve linearly as we add extra threads, especially when we add A LOT of threads, like going from a 1-socket EPYC to 2 sockets: https://openbenchmarking.org/embed.p...ha=371b7fe&p=2

                  The 7742 goes from 20s to 16s, instead of being near 11-12s.
                  The 7601 scales better across two sockets, from 37s to 23s.
                  I couldn't agree more. In particular, my choice of -j128 was only semi-arbitrary. I ran a few tests to make sure it wasn't going to suck and used it.

                  But your post made me curious so I ran some compiles in a loop with different thread counts:

                  Compiling with 8 threads: real 4m5.837s
                  Compiling with 16 threads: real 2m16.562s
                  Compiling with 32 threads: real 1m33.063s
                  Compiling with 48 threads: real 1m22.983s
                  Compiling with 64 threads: real 1m16.787s
                  Compiling with 80 threads: real 1m12.038s
                  Compiling with 96 threads: real 1m17.209s
                  Compiling with 112 threads: real 1m16.330s
                  Compiling with 128 threads: real 1m17.286s
                  Compiling with 256 threads: real 1m23.081s
                  Compiling with 512 threads: real 1m22.763s

                  This system has 32 cores (64 threads with HT) so it's not surprising to see almost linear improvement up to 32 threads.
                  After that, I do definitely hit a wall. That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.

                  Also, this computer is under a little bit of load. The I/O system where I'm doing the compile isn't struggling at all, but I certainly can't devote all the CPUs all the time to playing with this.

                  BTW, I saw the 15-second load average get up to 329 on the 512-thread run. :-)

                  The linker seemed to run single-threaded (I only ever saw one ld process in top -- very unscientific).
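
                  For anyone who wants to repeat the sweep, it was basically a loop of roughly this shape (a sketch, not the literal script; it assumes the tree is already configured):
                  Code:
                  #!/bin/bash
                  # try a range of -j values and report the wall-clock time of each fresh build
                  for j in 8 16 32 48 64 80 96 112 128 256 512; do
                      make -s clean                    # start every run from a clean tree
                      echo "Compiling with $j threads:"
                      time make -s -j"$j"
                  done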


                  • #49
                    Originally posted by cbdougla View Post

                    That sweet spot around 80 threads seems to be consistent for me in multiple tests. I am not sure why.
                    The surprisingly good results at 80-128 threads (overkill in threading: more context switching and task swapping per thread, more RAM use, more cache misses) are probably down to forcing the machine (which carries a bit of load from other uses) to give more time slices to the compilation threads, in effect taking time slices away from the other tasks the machine is running. That would explain why 80 is perhaps the best-performing count (?).

                    I did a few tests myself on a Ryzen 2200G at a fixed 3.1 GHz, with 1, 2, 3 and 4 threads (I tried up to 128 but it was slowing down), on tmpfs:

                    1 thread: 894.47s
                    2 threads: 466.50s (52.1% of the single-threaded time, vs. 50.0% ideal; 2.1 points off ideal scaling)
                    3 threads: 324.40s (36.2% vs. 33.3% ideal; 2.9 points off)
                    4 threads: 253.30s (28.3% vs. 25.0% ideal; 3.3 points off)

                    ...so it seems scaling efficiency degrades as the thread count increases, even at low core counts: the more threads, the bigger the gap between the measured time and ideal scaling (in percentage terms).
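
                    (In parallel-efficiency terms -- serial time divided by N times the parallel time -- the 4-thread run above works out to roughly:)
                    Code:
                    # efficiency = T1 / (N * TN), using the 4-thread numbers above
                    echo '894.47 / (4 * 253.30)' | bc -l    # ~0.88, i.e. about 88% efficiency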

                    Another task that also has problematic scaling, even more so than kernel building, is video encoding:

                    https://openbenchmarking.org/embed.p...ha=5d172cc&p=2
                    https://openbenchmarking.org/embed.p...ha=22b4335&p=2
                    https://openbenchmarking.org/embed.p...ha=885218a&p=2

                    This is like hitting a wall in a very ugly way. How is it possible not to see a serious difference from adding a second processor (which doubles the core/thread count) in tasks that can be highly parallelized? And the problem exists in both the SVT and the standard implementations of the codecs. I'm thinking that if YouTube runs video encoding on its servers, perhaps the ideal approach is to have, say, 128 instances of single-threaded encodes rather than encoding one file at a time with 128 threads. But again it makes me wonder where the bottleneck is and why it stops scaling linearly (this case is more extreme: it practically stops scaling).
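
                    A sketch of that "many independent single-threaded encodes" idea, using ffmpeg/x264 purely as an example (the file names, codec and process count are my assumptions, not anything YouTube is known to do):
                    Code:
                    # run one single-threaded encode per core instead of one encode using every core
                    find . -maxdepth 1 -name '*.y4m' -print0 | \
                        xargs -0 -P "$(nproc)" -I {} \
                        ffmpeg -loglevel error -i {} -c:v libx264 -threads 1 {}.mkv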


                    • #50
                      Originally posted by tchiwam View Post

                      Redirect the output to /dev/null, or at least not to the terminal...
                      Good idea, but there was no output because of the -s switch to make.
