Torvalds Is Unconvinced By LTO'ing A Linux Kernel

  • #31
    Originally posted by Brane215 View Post
    1. The kernel, glibc and other non-standard packages are not that insignificant. The kernel doesn't link to anything outside in userland (except perhaps that one link for getting accurate time, but that's insignificant), but it needs stability more than anything else in userland. And with the LTO option still being in testing, I don't see how the result can be trusted at the moment for anything except testing purposes.

    Glibc and similar libraries are not insignificant. If they miscompile with LTO, they'll affect just about everything. And if you compile them without LTO, you are missing a big part of the point: large parts will end up as opaque islands to any LTO optimisation effort.

    2. I had to do it blindly, because there simply was no one to turn to. The CFLAGS & LDFLAGS I used were given to me, IIRC, on the Gentoo forum. Someone said that it is only important to turn on optimisation and that the linker only does -O1 anyway. They also said that there is no need to repeat CFLAGS while linking and that this was needed only in early LTO versions of GCC.

    IOW, there is not much publicly known or accessible documentation online about GCC and related tools. Every now and then someone utters something, and it gets picked up by an ignorant crowd and praised without any detailed knowledge behind it.
    Had Christianity had such fu**ed up, obscure, hard-to-get, fragmented and contradictory documentation on the "Word of God", the whole thing would have ended with Stallman, I mean Christ, as its initiator.
    From what you say in your first point, it seems you don't actually understand LTO. It has nothing to do with what the binary links to after it's built... it has to do with being able to do optimisations across all the source files that will be linked together to generate a single binary. Without LTO, the scope of the optimiser is limited to a single source file, while most projects have a crapload of source files. LTO basically loads all the source files into the compiler, parses the entire thing, and then hands it off to the optimiser, rather than doing those steps individually for each file to generate individual object files that are linked to form the final binary.
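
    A minimal sketch of what that cross-file optimisation looks like in practice (the file names and build commands below are illustrative only and assume a GCC with -flto support; they are not from this thread):

    /* square.c -- hypothetical helper living in its own translation unit */
    int square(int x)
    {
        return x * x;
    }

    /* main.c -- calls square() from the other file */
    #include <stdio.h>

    int square(int x);

    int main(void)
    {
        /* Without LTO, the optimiser compiling main.c only sees an external
         * symbol "square" and must emit a real call.  With LTO, the link
         * step sees both function bodies and can inline/fold the call. */
        printf("%d\n", square(21));
        return 0;
    }

    /* Possible build:
     *   gcc -O2 -flto -c square.c main.c
     *   gcc -O2 -flto square.o main.o -o demo
     */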

    Also, have some respect for people who believe differently than you. Thanks.

    Comment


    • #32
      Originally posted by cb88 View Post
      From what you say in your first point, it seems you don't actually understand LTO. It has nothing to do with what the binary links to after it's built... it has to do with being able to do optimisations across all the source files that will be linked together to generate a single binary. Without LTO, the scope of the optimiser is limited to a single source file, while most projects have a crapload of source files. LTO basically loads all the source files into the compiler, parses the entire thing, and then hands it off to the optimiser, rather than doing those steps individually for each file to generate individual object files that are linked to form the final binary.

      Also, have some respect for people who believe differently than you. Thanks.
      1. Nope. AFAIK you are talking about -fwhole-program, which was just meant as an experiment or overture to LTO.

      Loading all the sources and processing them as one is precisely what LTO is trying to avoid, so that you can control the memory burden on the compiling system. LTO works by having the compiler emit, instead of a classic binary ELF object, an ELF object containing GIMPLE structures: functions, variables and so on, together with their attributes.

      So later, when the linker starts merging those objects, it works with the GIMPLE and so it "sees" inside those objects just as well as the compiler that produced them did.
      The linker then redoes the optimisation step during linking, doing much of the same work the compiler did earlier on each object file, only this time it works with already-processed material instead of source, and it does it globally.

      Also, that "whole program optimisation" as you describe it wouldn't work across already-compiled libraries, since to the compiler they'd be just a bunch of externally visible symbols.
      Here, the output can contain, besides the linkable binary, also its corresponding GIMPLE (a "fat" object?), so that the compiler can later redo some optimisations on the same library when compiling another program.
      I don't know or understand all the details, and I can't see how this would work for dynamic linking, so I am assuming this applies only to static linking.
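
      A rough sketch of that "fat object" idea (the flags, tools and file names below are only an illustration, assuming GCC's LTO support, and a main.c that calls sq() is presumed to exist elsewhere):

      /* libsq.c -- hypothetical code destined for a static library */
      int sq(int x)
      {
          return x * x;
      }

      /* Possible build:
       *   gcc -O2 -flto -ffat-lto-objects -c libsq.c   # libsq.o now carries BOTH
       *                                                # machine code and GIMPLE
       *   gcc-ar rcs libsq.a libsq.o                   # gcc-ar keeps the LTO data usable
       *   gcc -O2 -flto main.c libsq.a -o demo         # the link step can re-optimise
       *                                                # using the GIMPLE in the archive
       * A library built without -flto would expose only external symbols to the
       * optimiser -- the "opaque island" problem mentioned earlier in the thread.
       */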

      2. "Also have some respect for people that believe different than you."

      My whole point was that not everyone is the same and that his "don't make a fuss, it works for ME" argument was selfish.

      Comment


      • #33
        Originally posted by caligula View Post
        I understand that 5% off is important for 4 MB of flash storage. However, in desktop apps it doesn't matter at all. Hard drives are now 1 TB (SSD) and 4 TB (3.5" HDD). You can also set up RAID-6 or ZFS. So you get tens of terabytes and it's very cheap. You shouldn't bother with binary sizes. In fact there's plenty of room for more functionality in Firefox. Luckily they're working hard at implementing more new features with each release.
        This is wrong in so many ways....

        Just because HDDs are so big doesn't mean they should be filled up. The fact is that storage is the slowest part of most computers. You have a choice between waiting for storage or getting your work done.

        Comment


        • #34
          Originally posted by cb88 View Post
          From what you say in your first point, it seems you don't actually understand LTO. It has nothing to do with what the binary links to after it's built... it has to do with being able to do optimisations across all the source files that will be linked together to generate a single binary. Without LTO, the scope of the optimiser is limited to a single source file, while most projects have a crapload of source files. LTO basically loads all the source files into the compiler, parses the entire thing, and then hands it off to the optimiser, rather than doing those steps individually for each file to generate individual object files that are linked to form the final binary.

          Also, have some respect for people who believe differently than you. Thanks.
          Does it use SSE2 instructions?
          Last edited by Azrael5; 10 April 2014, 10:09 AM.

          Comment


          • #35
            Originally posted by Azrael5 View Post
            Does it use SSE2 instructions?
            SSE2 instructions are disabled in Linux builds via options sent to the compiler. This was a design decision in Linux to make context switches faster. Using them outside of special critical regions will cause bad things to happen. At best, you will panic the system. At worst, you will have information leaks into userland.

            Anyway, better algorithms always beat better interprocedural optimization. There is no reason why you cannot have both at once, but there is also no reason to rush into it. Quite honestly, I look forward to building the kernel with Clang much more than I look forward to better interprocedural optimizations.

            That being said, I just sent a patch off to the list that illustrates the effect of better algorithms. Reducing time spent in spin locks by relying on lock-free data structures does more for system performance than any amount of interprocedural optimization will ever do:

            https://lkml.org/lkml/2014/4/10/416

            Comment


            • #36
              Originally posted by ryao View Post
              SSE2 instructions are disabled in Linux builds via options sent to the compiler. This was a design decision in Linux to make context switches faster. Using them outside of special critical regions will cause bad things to happen. At best, you will panic the system. At worst, you will have information leaks into userland.

              Anyway, better algorithms always beat better interprocedural optimization. There is no reason why you cannot have both at once, but there is also no reason to rush into it. Quite honestly, I look forward to building the kernel with Clang much more than I look forward to better interprocedural optimizations.

              That being said, I just sent a patch off to the list that illustrates the effect of better algorithms. Reducing time spent in spin locks by relying on lock-free data structures does more for system performance than any amount of interprocedural optimization will ever do:

              https://lkml.org/lkml/2014/4/10/416
              Are you stating that Linux operating systems use the FPU to process algorithms? If that's true, it's a really incredible joke. The FPU is for mathematical operations in floating point. SSE2 is the way to optimize, making things more reliable and giving speed boosts of up to 1000%. Only this evolution allows Linux to outperform any other system on every hardware platform equipped with SSE2-capable processors. I don't believe Linux systems lack this feature. It's impossible.

              Comment


              • #37
                Originally posted by Azrael5 View Post
                Are you stating that Linux operating systems use the FPU to process algorithms? If that's true, it's a really incredible joke. The FPU is for mathematical operations in floating point. SSE2 is the way to optimize, making things more reliable and giving speed boosts of up to 1000%. Only this evolution allows Linux to outperform any other system on every hardware platform equipped with SSE2-capable processors. I don't believe Linux systems lack this feature. It's impossible.
                The Linux kernel does not use the FPU in general. Everything internally is integer-based. The only registers kernel developers are allowed to use in general are the integer registers. x87, MMX, SSE and other fancy registers are only usable with interrupts disabled, and that is only done when there is a clear benefit. This is a design decision in the kernel and it works rather well.
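
                For concreteness, a minimal, hypothetical sketch of kernel-side code following that rule (kernel_fpu_begin()/kernel_fpu_end() are the real x86 entry points; the function itself is made up, and the header is <asm/i387.h> on kernels of that era, <asm/fpu/api.h> on newer ones):

                #include <linux/kernel.h>
                #include <asm/i387.h>   /* kernel_fpu_begin()/kernel_fpu_end() */

                /* Illustrative only -- not actual kernel source. */
                static void demo_simd_section(void)
                {
                        kernel_fpu_begin();     /* disables preemption, saves the live FPU/SSE state as needed */

                        /* Only inside this region may x87/MMX/SSE instructions run.
                         * Real users (RAID-6, crypto) emit them via inline asm, since
                         * the rest of the kernel is built with flags such as
                         * -mno-sse -mno-mmx so the compiler never generates them. */

                        kernel_fpu_end();       /* restores state and re-enables preemption */
                }
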
                Last edited by ryao; 10 April 2014, 06:57 PM.

                Comment


                • #38
                  Originally posted by ryao View Post
                  The Linux kernel does not use the FPU in general. Everything internally is integer-based. The only registers kernel developers are allowed to use in general are the integer registers. x87, MMX, SSE and other fancy registers are only usable with interrupts disabled, and that is only done when there is a clear benefit. This is a design decision in the kernel and it works rather well.
                  OK, take a simple tool like Sumatra (a PDF reader): version 3.2 features SSE2 instructions. Open a PDF file with many images in 3.2 and in the non-SSE2 versions, and see the difference when scrolling pages.

                  The point is that we also have obsolete hardware which is not valued as it deserves... because of the software.
                  Last edited by Azrael5; 10 April 2014, 07:13 PM.

                  Comment


                  • #39
                    Originally posted by Azrael5 View Post
                    OK, take a simple tool like Sumatra (a PDF reader): version 3.2 features SSE2 instructions. Open a PDF file with many images in 3.2 and in the non-SSE2 versions, and see the difference when scrolling pages.

                    The point is that we also have obsolete hardware which is not valued as it deserves... because of the software.
                    I'm not sure what your question there was, but it seems you didn't understand the initial point. Userspace apps can use SSE all they want, the kernel doesn't control that. The kernel itself just doesn't use SSE.

                    So your PDF reader can do whatever it wants, but the filesystem code built into the kernel had better use integer math only. There are valid performance reasons to enforce that in the kernel, which don't apply to normal userspace code.
                    Last edited by smitty3268; 10 April 2014, 08:07 PM.

                    Comment


                    • #40
                      Originally posted by smitty3268 View Post
                      I'm not sure what your question there was, but it seems you didn't understand the initial point. Userspace apps can use SSE all they want, the kernel doesn't control that. The kernel itself just doesn't use SSE.

                      So your PDF reader can do whatever it wants, but the filesystem code built into the kernel had better use integer math only. There are valid performance reasons to enforce that in the kernel, which don't apply to normal userspace code.
                      So when the RAID-4/5/6/etc. subsystem uses SSE to calculate the Q syndrome, does that mean interrupts are disabled on that core for the whole time?

                      IIRC x86 has some trick that enables it to save the extra context only when needed. Something about a flag in a special register, so that a task that tries to use SSE faults, and then the fault handler saves the registers, marks the SSE registers to be saved on the next context switch, and continues.

                      Couldn't such a trick also work within the kernel?
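
                      For reference, a conceptual sketch of the lazy-FPU trick being described (pseudo-kernel code, not actual Linux source; every name below is invented, and the extern declarations stand in for privileged operations):

                      struct task {
                              int  fpu_used;         /* has this task ever touched FPU/SSE state? */
                              char fpu_state[512];   /* saved x87/SSE registers (FXSAVE area)     */
                      };

                      extern void set_cr0_ts(void);            /* set the CR0.TS "trap on FPU use" flag */
                      extern void clear_cr0_ts(void);          /* clear it again                        */
                      extern void fxsave(char *area);          /* save FPU/SSE registers to memory      */
                      extern void fxrstor(const char *area);   /* restore FPU/SSE registers from memory */

                      static struct task *fpu_owner;   /* task whose state currently lives in the FPU */

                      void on_context_switch(void)
                      {
                              /* Don't save/restore SSE state eagerly: just arm the trap so the
                               * next FPU/SSE instruction raises #NM (device-not-available). */
                              set_cr0_ts();
                      }

                      void device_not_available_handler(struct task *current_task)
                      {
                              clear_cr0_ts();                           /* allow FPU/SSE instructions again    */
                              if (fpu_owner && fpu_owner != current_task)
                                      fxsave(fpu_owner->fpu_state);     /* save the previous owner's registers */
                              if (current_task->fpu_used)
                                      fxrstor(current_task->fpu_state); /* bring back this task's registers    */
                              current_task->fpu_used = 1;
                              fpu_owner = current_task;
                              /* On return, the faulting SSE instruction is retried and now succeeds. */
                      }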

                      Comment
