Announcement

**Hirager** · 13 May 2012, 05:55 AM

Originally posted by xir_

It can allow more to fit into the L2.

I am just curious. Do -O3 optimizations make binaries "eat" more L2 than -O2 optimizations? Consider everything else left the same in comparison.

**rohcQaH** · 13 May 2012, 06:27 AM

Originally posted by Hirager

I am just curious. Do -O3 optimizations make binaries "eat" more L2 than -O2 optimizations? Consider everything else left the same in comparison.

IIRC Firefox is by default compiled with -Os because the smaller cache footprint outweighs all the other optimizations. But that's something you'll have to test for each project separately.

The linked Ubuntu docs seem to be hidden behind a login. Is there a solution for the library redundancy? Having to load x32 kdelibs+Qt AND x86_64 kdelibs+Qt for that one KDE app that benefits from >4GB of memory would probably outweigh any memory savings to be had.

**Hirager** · 13 May 2012, 06:38 AM

Originally posted by rohcQaH

IIRC Firefox is by default compiled with -Os because the smaller cache footprint outweighs all the other optimizations. But that's something you'll have to test for each project separately.

The linked Ubuntu docs seem to be hidden behind a login. Is there a solution for the library redundancy? Having to load x32 kdelibs+Qt AND x86_64 kdelibs+Qt for that one KDE app that benefits from >4GB of memory would probably outweigh any memory savings to be had.

No offence meant, but I would rather hear the answer from someone who specializes in this sort of thing.

As to your question: you forget just how big multimedia projects can get. It is not about memory savings for big programs. It is about the savings achieved in workflows which do not require 64-bit software. 64-bit programs are treated here as additions and nothing more. So this is a back-to-the-past situation, because it turned out that the drawbacks of 64-bit software can be nullified.

**jakubo** · 13 May 2012, 07:33 AM

Will there be a benefit for WINE?

**XorEaxEax** · 15 May 2012, 10:27 AM

Originally posted by Hirager

I am just curious. Do -O3 optimizations make binaries "eat" more L2 than -O2 optimizations? Consider everything else left the same in comparison.

Well, since -O3 favours speed over code size, its output is likely to be bigger than -O2's and thus fill up the CPU cache faster. However, since the optimizer aims for the fastest code, it is only supposed to make code larger (through inlining, for example) when the added cache footprint will not make performance worse.

In reality, though, the heuristics governing this are very difficult to get right, which is why the same code compiled with -O2 will sometimes beat -O3. I've never encountered this with PGO (profile-guided optimization), though, which suggests that the runtime data it uses when making optimization choices lets it accurately weigh the impact that code size and cache misses will have on performance.

**XorEaxEax** · 15 May 2012, 10:31 AM

Originally posted by jakubo

Will there be a benefit for WINE?

I don't think so. Obviously the actual Windows programs won't be faster, but I also think that the parts of Windows which Wine reimplements, and which could potentially be faster, need to run as standard 32-bit code as well and thus won't be faster. But I'm not sure about this; I don't have much insight into how Wine works.

**smitty3268** · 16 May 2012, 11:06 PM

Originally posted by rohcQaH

IIRC Firefox is by default compiled with -Os because the smaller cache footprint outweighs all the other optimizations. But that's something you'll have to test for each project separately.

I believe that was changed when they updated to a more recent GCC version and started supporting PGO. I believe they switched to -O3, along with using an option to limit the amount of inlining that -O3 normally enables.

x32 support is unlikely to decrease memory or disk size requirements. In fact, it will almost certainly increase them, because you are just adding new libraries that need to be duplicated in both architectures for compatibility. And the amount of space it will save in a particular executable is really very small. We're talking about reducing a 1024KB program to 1000KB maybe.

The benefit comes from reducing L1, L2, and L3 cache pressure, which can lead to significant speed boosts. It depends heavily on the application in question, though - and even the hardware it's running on. x32 might bring a big boost on hardware with smaller caches, while giving no boost at all on cpus with a large cache size.
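To put rough numbers on the cache-pressure argument, here is a minimal back-of-the-envelope sketch in C. Everything in it is an assumption made for illustration: the helper names `record_bytes` and `records_per_cache` are hypothetical, the record layout (some pointers plus a small payload) is made up, and the padding rule is a simplification of what a real compiler does. It also ignores cache lines and associativity entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one hypothetical record: n_ptrs pointers followed by
 * payload_bytes of data, rounded up to a multiple of the pointer size.
 * This is a simplification of real struct padding rules. */
size_t record_bytes(size_t ptr_bytes, size_t n_ptrs, size_t payload_bytes)
{
    assert(ptr_bytes == 4 || ptr_bytes == 8);
    size_t raw = n_ptrs * ptr_bytes + payload_bytes;
    return (raw + ptr_bytes - 1) / ptr_bytes * ptr_bytes;
}

/* How many whole records fit into a cache of cache_bytes. */
size_t records_per_cache(size_t cache_bytes, size_t rec_bytes)
{
    assert(rec_bytes > 0);
    return cache_bytes / rec_bytes;
}
```

For a record with two pointers and a 4-byte payload, `record_bytes(4, 2, 4)` gives 12 and `record_bytes(8, 2, 4)` gives 24, so a 32 KiB cache holds 2730 of the former but only 1365 of the latter. Real workloads are rarely this pointer-dense, which is why the gain depends so heavily on the application.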

**XorEaxEax** · 17 May 2012, 03:59 AM

Originally posted by smitty3268

x32 support is unlikely to decrease memory or disk size requirements. In fact, it will almost certainly increase them, because you are just adding new libraries that need to be duplicated in both architectures for compatibility.

That is assuming you will keep/need to run applications as x64. In particular, if you have a 64-bit CPU and 4GB or less of RAM, x32 ONLY would be the perfect fit.

Originally posted by smitty3268

And the amount of space it will save in a particular executable is really very small. We're talking about reducing a 1024KB program to 1000KB maybe.

I believe you are wrong here. I believe a full 32-bit system will typically use ~20% less RAM than an equivalent 64-bit system, due to libraries and applications being smaller (as in binaries) and using less RAM when running (due to pointer size). Also, x32 code could potentially be even smaller than 32-bit code. Even though both 32-bit and x32 have 32-bit pointers, 32-bit still suffers from having very few registers, which means it has to spend more code pushing to and popping from the stack in order to reuse registers. x32 has TWICE the number of registers, which means it can keep much more data in registers and needs much less code for stack pushes and pops, thus making the code smaller.
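The pointer-size part of this argument is easy to check on a small scale. The following is a minimal sketch in C, not a measurement of any real program: `struct node` and the three helper functions are hypothetical names invented for illustration. Compiling the same file with `-m32`, `-mx32` and `-m64` (where the toolchain supports each) and calling `node_bytes()` shows how much of a pointer-heavy structure is pointers and padding.

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical doubly linked list node: two pointers and a small payload. */
struct node {
    struct node *next;
    struct node *prev;
    int value;
};

/* Compile-time sanity check: the node is at least as big as its members. */
static_assert(sizeof(struct node) >= 2 * sizeof(void *) + sizeof(int),
              "struct node smaller than the sum of its members");

/* Width of a data pointer for the ABI this file was compiled for. */
size_t pointer_bytes(void)
{
    return sizeof(void *);
}

/* Total size of one node, including any padding the compiler added. */
size_t node_bytes(void)
{
    return sizeof(struct node);
}

/* Bytes of the node that are neither pointer nor payload. */
size_t node_padding_bytes(void)
{
    return sizeof(struct node) - 2 * sizeof(struct node *) - sizeof(int);
}
```

With GCC on x86, the node should come out at 12 bytes for `-m32` and `-mx32` (4-byte pointers, no padding) and 24 bytes for `-m64` (8-byte pointers plus 4 bytes of tail padding). How much that matters for a whole program depends on how much of its memory is made of structures like this one.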

**smitty3268** · 17 May 2012, 01:43 PM

Originally posted by XorEaxEax

That is assuming you will keep/need to run applications as x64. In particular, if you have a 64-bit CPU and 4GB or less of RAM, x32 ONLY would be the perfect fit.

You're assuming distros are going to create pure x32 distros, which I find unlikely. They already have to use the x64 kernel, so I find it hard to believe they wouldn't include x64 userland libs as well.

I could be wrong about that, but I just don't see it happening. Every new architecture they have to support just means that much more work for their limited staff - it will be much easier to just combine x32 and x64 architectures together.

If you are talking about custom building your own distro (on Gentoo? or LFS?) then maybe you have a point.

Originally posted by XorEaxEax

I believe you are wrong here. I believe a full 32-bit system will typically use ~20% less RAM than an equivalent 64-bit system, due to libraries and applications being smaller (as in binaries) and using less RAM when running (due to pointer size). Also, x32 code could potentially be even smaller than 32-bit code. Even though both 32-bit and x32 have 32-bit pointers, 32-bit still suffers from having very few registers, which means it has to spend more code pushing to and popping from the stack in order to reuse registers. x32 has TWICE the number of registers, which means it can keep much more data in registers and needs much less code for stack pushes and pops, thus making the code smaller.

And I believe I'm right. Do you have any proof?

The average size of an executable's instructions is really quite small. Most of it tends to be data - string values encoded in the program, for example. Even pointer-heavy apps are dominated in size by the data they are using, not the pointers themselves.

**XorEaxEax** · 18 May 2012, 03:22 PM

Originally posted by smitty3268

I could be wrong about that, but I just don't see it happening. Every new architecture they have to support just means that much more work for their limited staff - it will be much easier to just combine x32 and x64 architectures together.

Yes, I'm doubtful of this as well. Ubuntu, as the article states, is looking into it, but that is a long way from fully supporting it. Gentoo is very much build-it-yourself from scratch, so I believe they will 'support' x32. I'm not sure what you mean by combining the x32 and x64 architectures, though; they will use the same kernel but they will need different libraries.

Originally posted by smitty3268

And I believe I'm right. Do you have any proof?

As for 32-bit using ~20% less RAM than an equivalent 64-bit system/code, that has been verified quite well (I've done it twice myself in the past, on both Windows and Ubuntu), but since it's quick to do in these days of VMs I ran a test just now: two identical setups in terms of software, one Arch 32-bit and one Arch 64-bit. After the same base installation I installed X, Openbox and Conky on both. After starting X/Openbox, this is what Conky reported:

32-bit screenshot (image no longer available): http://img442.imageshack.us/img442/3500/32bit.png

64-bit screenshot (image no longer available): http://img232.imageshack.us/img232/8794/64bit.png

Now for the x32 vs 32-bit code size: no, I had no proof, as it was just something which seemed logical: more registers = less pushing and popping = smaller code footprint. Anyway, thanks to your scepticism I figured I should see if it was true.

As I'm running a pure 64-bit system and the GCC I'm using (Arch vanilla) wasn't configured with 32,x32 multilib, I could compile code as 32-bit and x32 but not build a final binary. That's not so bad, though, since I can generate assembly output, which actually shows us the code. I took meteor.c from the Language Shootout as the test subject, since it didn't need to link in any external functionality (I commented out main/printf), and compiled it as 32-bit and x32 into assembly output using:

gcc -Os -march=native -fomit-frame-pointer -m32 -S -c meteor.c
gcc -Os -march=native -fomit-frame-pointer -mx32 -S -c meteor.c

The resulting x32 assembly listing turned out to be quite a bit shorter than the 32-bit one (1505 vs 1691 lines respectively), but that could be the result of the 32-bit assembly containing more compiler directives rather than actually smaller code, so obviously I had to examine the listings. I can't say I did any thorough comparison of the larger functions, but from a quick scan I couldn't see any occurrence where the x32 code was larger, and I did see several places where the x32 code was smaller. I picked out some small (and thus easier to examine) examples from the generated assembly:

Code:

32-bit:
boardHasIslands:
.LFB19:
	pushl	%edi
	xorl	%eax, %eax
	pushl	%esi
	movb	12(%esp), %dl
	cmpb	$39, %dl
	jg	.L237
	movb	$5, %cl
	movsbw	%dl, %ax
	movl	board+4, %edi
	idivb	%cl
	movl	board, %esi
	movsbl	%al, %ecx
	leal	(%ecx,%ecx,4), %ecx
	shrdl	%edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	cmovne	%edi, %esi
	andl	$32767, %esi
	testb	$1, %al
	je	.L238
	movl	bad_odd_triple(,%esi,4), %eax
	jmp	.L237
.L238:
	movl	bad_even_triple(,%esi,4), %eax
.L237:
	popl	%esi
	popl	%edi
	ret

x32:
boardHasIslands:
.LFB19:
	xorl	%eax, %eax
	cmpb	$39, %dil
	jg	.L231
	movb	$5, %dl
	movsbw	%dil, %ax
	idivb	%dl
	movq	board(%rip), %rdx
	movsbl	%al, %ecx
	leal	(%rcx,%rcx,4), %ecx
	shrq	%cl, %rdx
	andl	$32767, %edx
	sall	$2, %edx
	testb	$1, %al
	movslq	%edx, %rdx
	je	.L232
	movl	bad_odd_triple(%rdx), %eax
	ret
.L232:
	movl	bad_even_triple(%rdx), %eax
.L231:
	ret

32-bit:
record_piece:
.LFB11:
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	16(%esp), %esi
	movl	20(%esp), %eax
	movl	32(%esp), %edx
	imull	$50, %esi, %ebx
	imull	$600, %esi, %esi
	addl	%eax, %ebx
	imull	$12, %eax, %eax
	movl	piece_counts(,%ebx,4), %ecx
	addl	%eax, %esi
	movl	28(%esp), %eax
	leal	(%esi,%ecx), %edi
	movl	%edx, pieces+4(,%edi,8)
	movl	%eax, pieces(,%edi,8)
	movl	24(%esp), %eax
	movb	%al, next_cell(%ecx,%esi)
	incl	%ecx
	movl	%ecx, piece_counts(,%ebx,4)
	popl	%ebx
	popl	%esi
	popl	%edi
	ret

x32:
record_piece:
.LFB11:
	imull	$50, %edi, %eax
	imull	$600, %edi, %edi
	addl	%esi, %eax
	imull	$12, %esi, %esi
	sall	$2, %eax
	cltq
	movl	piece_counts(%rax), %r8d
	addl	%edi, %esi
	addl	%r8d, %esi
	incl	%r8d
	leal	0(,%rsi,8), %edi
	movslq	%esi, %rsi
	movl	%r8d, piece_counts(%rax)
	movslq	%edi, %rdi
	movb	%dl, next_cell(%rsi)
	movq	%rcx, pieces(%rdi)
	ret

Now granted, this is not irrefutable proof. I can't swear that the x32 assembly here produces a smaller code footprint than the 32-bit one, as I'm only going by the assembly output, but it does seem likely. I also compiled with both -O2 and -O3, and in both cases the resulting x32 assembly was quite a bit smaller than the 32-bit one; I didn't examine those listings, though.
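One way to take compiler directives out of the line-count comparison is to count only the lines that look like instructions. The following is a minimal sketch in C under stated assumptions: it expects GCC-style AT&T output like the listings above, with one statement per line, and the function names `is_instruction_line` and `count_instructions` are hypothetical helpers, not part of any existing tool.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a line of GCC's AT&T-syntax output looks like an
 * instruction, 0 for blank lines, '#' comments, directives such as
 * ".cfi_startproc", and labels such as "boardHasIslands:" or ".L238:". */
int is_instruction_line(const char *line)
{
    assert(line != NULL);

    /* Skip leading whitespace. */
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;

    /* Blank lines, directives and local labels (leading '.'), comments. */
    if (*line == '\0' || *line == '.' || *line == '#')
        return 0;

    /* Trim trailing whitespace, then treat "name:" as a label. */
    size_t len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;
    if (line[len - 1] == ':')
        return 0;

    return 1;
}

/* Counts instruction-like lines in an open assembly listing.
 * Assumption: no line is longer than the buffer; a longer line would be
 * read in pieces and could be miscounted. */
long count_instructions(FILE *f)
{
    char buf[1024];
    long n = 0;

    while (fgets(buf, sizeof buf, f) != NULL)
        n += is_instruction_line(buf);
    return n;
}
```

Running `count_instructions` over both `.s` files gives an instruction count rather than a raw line count. That is still only a proxy for code size: x86 instructions vary in length, and instructions that use 64-bit operands or the registers r8 to r15 carry an extra REX prefix byte, so fewer instructions does not automatically mean fewer bytes. Comparing the text section sizes of the assembled object files would be the more direct measurement.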

When kernel 3.4 is released and I thus have the possibility to actually run and benchmark x32 code I will recompile GCC with 32,x32 multilib so that I can build and compare proper binaries.

Originally posted by smitty3268

The average size of an executable's instructions is really quite small. Most of it tends to be data - string values encoded in the program, for example. Even pointer-heavy apps are dominated in size by the data they are using, not the pointers themselves.

Again, the RAM usage difference of roughly 20% between equivalent 32-bit and 64-bit systems is pretty much confirmed. Also, code size does matter for performance, since the CPU cache isn't infinite.

Ubuntu Plans For Linux x32 ABI Support
