Ubuntu Plans For Linux x32 ABI Support


  • #11
    Originally posted by xir_
    It can allow more to fit into the L2.
    I am just curious. Do -O3 optimizations make binaries "eat" more L2 than -O2 optimizations? Assume everything else is kept the same.



    • #12
      Originally posted by Hirager
      I am just curious. Do -O3 optimizations make binaries "eat" more L2 than -O2 optimizations? Assume everything else is kept the same.
      IIRC Firefox is by default compiled with -Os because the smaller cache footprint outweighs all the other optimizations. But that's something you'll have to test for each project separately.


      The linked Ubuntu docs seem to be hidden behind a login. Is there a solution for the library redundancy? Having to load x32 kdelibs+Qt AND x86_64 kdelibs+Qt for that one KDE app that benefits from >4GB of memory would probably outweigh any memory savings to be had.



      • #13
        Originally posted by rohcQaH
        IIRC Firefox is by default compiled with -Os because the smaller cache footprint outweighs all the other optimizations. But that's something you'll have to test for each project separately.


        The linked Ubuntu docs seem to be hidden behind a login. Is there a solution for the library redundancy? Having to load x32 kdelibs+Qt AND x86_64 kdelibs+Qt for that one KDE app that benefits from >4GB of memory would probably outweigh any memory savings to be had.
        No offence meant, but I would rather hear the answer from someone who specializes in this sort of thing.

        As to your question: you forget just how big multimedia projects can be. It is not about memory savings for big programs; it is about the savings achieved in workflows that do not require 64-bit software. 64-bit programs are treated here as an addition and nothing more. So this is a back-to-the-past situation, because it turned out that the drawbacks of 64-bit software can be nullified.



        • #14
          Will there be a benefit for WINE?



          • #15
            Originally posted by Hirager
            I am just curious. Do -O3 optimizations make binaries "eat" more L2 than -O2 optimizations? Assume everything else is kept the same.
            Well, since -O3 favours speed over code size, its output is likely to be bigger than that of -O2 and thus fill up the CPU cache faster. However, since the optimizer aims for the fastest code, it will only make code larger when the added cache footprint (e.g. through inlining) will not make performance worse.

            In reality, though, the heuristics governing this are very difficult to get right, which is why the same code compiled with -O2 sometimes beats -O3. I've never encountered this with PGO (profile-guided optimization), though, which suggests that the runtime data it uses when making optimization choices lets it accurately weigh the impact that code size and cache misses will have on performance.



            • #16
              Originally posted by jakubo
              Will there be a benefit for WINE?
              I don't think so. Obviously the actual Windows programs won't be faster, but I also think that the parts of Windows that Wine reimplements, which could potentially be faster, need to run as standard 32-bit code as well and thus won't be faster. But I'm not sure about this; I don't have much insight into how Wine works.



              • #17
                Originally posted by rohcQaH
                IIRC Firefox is by default compiled with -Os because the smaller cache footprint outweighs all the other optimizations. But that's something you'll have to test for each project separately.
                I believe that was changed when they updated to a more recent GCC version and started supporting PGO. I believe they switched to -O3, along with an option that limits the amount of inlining -O3 normally enables.


                x32 support is unlikely to decrease memory or disk size requirements. In fact, it will almost certainly increase them, because you are just adding new libraries that need to be duplicated in both architectures for compatibility. And the amount of space it will save in a particular executable is really very small: we're talking about reducing a 1024KB program to maybe 1000KB.

                The benefit comes from reducing L1, L2, and L3 cache pressure, which can lead to significant speed boosts. It depends heavily on the application in question, though, and even on the hardware it's running on. x32 might bring a big boost on hardware with smaller caches, while giving no boost at all on CPUs with a large cache.
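A minimal illustration of the cache-pressure argument, assuming a simple linked-list node and a hypothetical 256 KiB L2 (`struct node` and `nodes_per_cache` are illustrative names, and the model ignores associativity, cache-line granularity and everything else competing for the cache). On x86-64 the node is 16 bytes (8-byte pointer, 4-byte int, 4 bytes of padding); under -m32 or -mx32 it is 8 bytes, so the same cache holds 16384 versus 32768 nodes.

```c
#include <stddef.h>

/* A typical pointer-heavy structure: one link plus a small payload.
   LP64 (x86-64): 8 + 4 + 4 bytes padding = 16 bytes.
   ILP32 (i386, x32): 4 + 4 = 8 bytes. */
struct node {
    struct node *next;
    int value;
};

/* How many nodes fit in a cache of cache_bytes, ignoring associativity,
   line granularity and any other data competing for the cache. */
size_t nodes_per_cache(size_t cache_bytes) {
    return cache_bytes / sizeof(struct node);
}
```

Compiling the same file with -m64 and with -mx32 and comparing `nodes_per_cache(256 * 1024)` shows the halving directly; whether it turns into a speedup depends on whether the working set was cache-bound in the first place, which matches the "depends heavily on the application" caveat.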



                • #18
                  Originally posted by smitty3268
                  x32 support is unlikely to decrease memory or disk size requirements. In fact, it will almost certainly increase them, because you are just adding new libraries that need to be duplicated in both architectures for compatibility.
                  That is assuming you will keep/need to run applications as x64; in particular, if you have a 64-bit CPU and 4 GB of RAM or less, x32 ONLY would be the perfect fit.

                  Originally posted by smitty3268
                  And the amount of space it will save in a particular executable is really very small: we're talking about reducing a 1024KB program to maybe 1000KB.
                  I believe you are wrong here. I believe a full 32-bit system typically uses ~20% less RAM than an equivalent 64-bit system, because its libraries and applications are smaller (as binaries) and use less RAM when running (due to the smaller pointer size). Also, x32 code could potentially be even smaller than 32-bit code: even though both 32-bit and x32 have 32-bit pointers, 32-bit still suffers from having very few registers, which means it needs extra code for pushing and popping to and from the stack in order to reuse registers. x32 also has 32-bit pointers but TWICE as many registers, so it can keep much more data in registers and needs much less stack push/pop code, thus making the code smaller.
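Both claims rest on the C data model each ABI uses. The sketch below (`abi_name` and `print_data_model` are illustrative names) reports it: compiled with -m32 or -mx32 it should print int=4 long=4 pointer=4, and with -m64 int=4 long=8 pointer=8. That is the point being made: x32 keeps the ILP32 sizes of i386 while using the x86-64 instruction set and its larger register file. Building and running an -mx32 binary needs x32 libraries (a multilib GCC) and kernel support; merely generating assembly with -S does not.

```c
#include <stdio.h>

/* Identify the x86 ABI this file was compiled for.
   GCC defines both __x86_64__ and __ILP32__ under -mx32. */
const char *abi_name(void) {
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";
#elif defined(__x86_64__)
    return "x86-64";
#elif defined(__i386__)
    return "i386";
#else
    return "other";
#endif
}

/* Print the C data model: ILP32 for i386 and x32, LP64 for x86-64. */
void print_data_model(void) {
    printf("%s: int=%zu long=%zu pointer=%zu bytes\n",
           abi_name(), sizeof(int), sizeof(long), sizeof(void *));
}
```

Calling `print_data_model()` from a small `main` and building the file once per flag (-m32, -mx32, -m64) gives a side-by-side comparison on a multilib system.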



                  • #19
                    Originally posted by XorEaxEax
                    That is assuming you will keep/need to run applications as x64; in particular, if you have a 64-bit CPU and 4 GB of RAM or less, x32 ONLY would be the perfect fit.
                    You're assuming distros are going to create pure x32 distros, which I find unlikely. They already have to use the x64 kernel, so I find it hard to believe they wouldn't include x64 userland libs as well.

                    I could be wrong about that, but I just don't see it happening. Every new architecture they have to support means that much more work for their limited staff; it will be much easier to just combine the x32 and x64 architectures.

                    If you are talking about custom-building your own distro (on Gentoo? or LFS?), then maybe you have a point.

                    I believe you are wrong here. I believe a full 32-bit system typically uses ~20% less RAM than an equivalent 64-bit system, because its libraries and applications are smaller (as binaries) and use less RAM when running (due to the smaller pointer size). Also, x32 code could potentially be even smaller than 32-bit code: even though both 32-bit and x32 have 32-bit pointers, 32-bit still suffers from having very few registers, which means it needs extra code for pushing and popping to and from the stack in order to reuse registers. x32 also has 32-bit pointers but TWICE as many registers, so it can keep much more data in registers and needs much less stack push/pop code, thus making the code smaller.
                    And I believe I'm right. Do you have any proof?

                    The instruction portion of the average executable is really quite small. Most of it tends to be data, such as string values encoded in the program. Even pointer-heavy apps are dominated in size by the data they use, not by the pointers themselves.
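The disagreement largely comes down to what share of a footprint consists of pointers. The back-of-envelope model below is hypothetical (`ilp32_bytes` and the pointer shares are illustrative, not measured): when pointers go from 8 to 4 bytes, only that share halves, while strings, pixel data and 32-bit integers stay the same size.

```c
/* Hypothetical model: of total_bytes, the fraction pointer_share is
   8-byte pointers. Moving to 4-byte pointers (i386 or x32) halves only
   that share. Ignores structure padding and 'long', which also shrinks
   from 8 to 4 bytes under ILP32, so real savings can be somewhat larger. */
double ilp32_bytes(double total_bytes, double pointer_share) {
    return total_bytes * (1.0 - pointer_share / 2.0);
}

/* ilp32_bytes(1024.0, 0.05) is about 998.4 (a ~2.5% saving)
   ilp32_bytes(1024.0, 0.40) is about 819.2 (a ~20% saving) */
```

With a 5% pointer share, a 1024KB footprint drops to roughly 998KB, close to the 1024KB-to-1000KB estimate above; it takes a 40% pointer share to reach the ~20% saving quoted earlier, so the two positions differ mainly in how pointer-heavy they assume typical workloads are.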



                    • #20
                      Originally posted by smitty3268
                      I could be wrong about that, but I just don't see it happening. Every new architecture they have to support means that much more work for their limited staff; it will be much easier to just combine the x32 and x64 architectures.
                      Yes, I'm doubtful of this as well. Ubuntu, as the article states, is looking into it, but that is a long way from fully supporting it. Gentoo is very much build-it-yourself from scratch, so I believe they will 'support' x32. I'm not sure what you mean by combining the x32 and x64 architectures, though; they will use the same kernel, but they will need different libraries.

                      Originally posted by smitty3268
                      And I believe I'm right. Do you have any proof?
                      As for 32-bit using ~20% less RAM than an equivalent 64-bit system, that has been fairly well verified (I've done it twice myself in the past, on both Windows and Ubuntu). But since it's quick to do in these days of VMs, I did a test just now with two setups that are identical in terms of software, one Arch 32-bit and one Arch 64-bit. After the same base installation I installed X, Openbox and Conky on both.
                      After starting X/Openbox, this is what Conky reported:

                      [Conky screenshots of RAM usage on the 32-bit and 64-bit systems not preserved]

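To repeat this kind of RAM comparison without relying on a screenshot, one option is to read /proc/meminfo directly. The sketch below is hypothetical: `used_kb` and `used_kb_from_proc` are illustrative names, and "used" is computed with the classic free(1)-style formula MemTotal - MemFree - Buffers - Cached, which is an assumption; the Conky version used in the thread may compute it slightly differently.

```c
#include <stdio.h>
#include <string.h>

/* Parse /proc/meminfo-style text and return "used" memory in kB as
   MemTotal - MemFree - Buffers - Cached. Returns -1 if a field is missing.
   "SwapCached:" does not match "Cached:" because sscanf must match the
   literal from the start of the line. */
long used_kb(const char *meminfo) {
    long total = -1, free_kb = -1, buffers = -1, cached = -1;
    const char *line = meminfo;
    while (line && *line) {
        long v;
        if (sscanf(line, "MemTotal: %ld kB", &v) == 1) total = v;
        else if (sscanf(line, "MemFree: %ld kB", &v) == 1) free_kb = v;
        else if (sscanf(line, "Buffers: %ld kB", &v) == 1) buffers = v;
        else if (sscanf(line, "Cached: %ld kB", &v) == 1) cached = v;
        line = strchr(line, '\n');      /* advance to the next line */
        if (line) line++;
    }
    if (total < 0 || free_kb < 0 || buffers < 0 || cached < 0) return -1;
    return total - free_kb - buffers - cached;
}

/* Read the live file (Linux only); returns -1 on any failure. */
long used_kb_from_proc(void) {
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return used_kb(buf);
}
```

Running the same binary (or a 32-bit and a 64-bit build of it) on both VMs right after X/Openbox starts gives numbers that can be pasted as text.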
                      Now, for x32 vs 32-bit code size: no, I had no proof, as it was just something that seemed logical (more registers = less pushing and popping = smaller code footprint). Anyway, thanks to your scepticism I figured I should check whether it was true.

                      As I'm running a pure 64-bit system and the GCC I'm using (Arch vanilla) wasn't configured with 32,x32 multilib, I could compile code as 32-bit and x32 but not build a final binary. That's not so bad, though, since I can generate assembly output, which actually shows us the code. I took meteor.c from the Language Shootout as the test subject, since it didn't need to link in any external functionality (I commented out main/printf), and compiled it as 32-bit and x32 into assembly output using:

                      gcc -Os -march=native -fomit-frame-pointer -m32 -S -c meteor.c
                      gcc -Os -march=native -fomit-frame-pointer -mx32 -S -c meteor.c

                      The resulting x32 assembly listing turned out to be quite a bit smaller than the 32-bit one (1505 vs 1691 lines respectively), but that could be the result of the 32-bit assembly containing more compiler directives rather than actually smaller code, so obviously I had to examine the listings. I can't say I did any thorough comparison of the larger functions, but from quickly scanning I couldn't see any occurrence where the x32 code was larger, and I did see several places where it was smaller. I picked out some small (and thus easier to examine) examples from the generated assembly:
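Raw line counts mix instructions with directives and labels, which is exactly the doubt raised above. A rough way to separate them is to count only instruction lines in each .s file. The helper below is a hypothetical heuristic (not part of GCC or binutils) that skips blank lines, `#` comments, lines starting with `.` (directives and local labels such as `.LFB19:`) and lines ending in `:` (labels). It still counts lines rather than bytes; assembling both files and comparing section sizes would measure the actual footprint more directly.

```c
#include <ctype.h>
#include <string.h>

/* Count instruction lines in GCC -S output. Heuristic: skip blank lines,
   '#' comments, directives/local labels (first non-space char is '.')
   and labels (last non-space char is ':'). */
size_t count_instructions(const char *text) {
    size_t count = 0;
    const char *line = text;
    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        size_t i = 0, j = len;
        while (i < len && isspace((unsigned char)line[i])) i++;   /* skip indent */
        while (j > i && isspace((unsigned char)line[j - 1])) j--;  /* trim tail */
        if (j > i && line[i] != '.' && line[i] != '#' && line[j - 1] != ':')
            count++;
        if (!end) break;
        line = end + 1;
    }
    return count;
}
```

Feeding it the full text of the -m32 and -mx32 .s files and comparing the two counts shows whether the 1505 vs 1691 gap survives once directives are excluded.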

                      Code:
                      32-bit:
                      boardHasIslands:
                      .LFB19:
                      	pushl	%edi
                      	xorl	%eax, %eax
                      	pushl	%esi
                      	movb	12(%esp), %dl
                      	cmpb	$39, %dl
                      	jg	.L237
                      	movb	$5, %cl
                      	movsbw	%dl, %ax
                      	movl	board+4, %edi
                      	idivb	%cl
                      	movl	board, %esi
                      	movsbl	%al, %ecx
                      	leal	(%ecx,%ecx,4), %ecx
                      	shrdl	%edi, %esi
                      	shrl	%cl, %edi
                      	testb	$32, %cl
                      	cmovne	%edi, %esi
                      	andl	$32767, %esi
                      	testb	$1, %al
                      	je	.L238
                      	movl	bad_odd_triple(,%esi,4), %eax
                      	jmp	.L237
                      .L238:
                      	movl	bad_even_triple(,%esi,4), %eax
                      .L237:
                      	popl	%esi
                      	popl	%edi
                      	ret
                      
                      x32:
                      boardHasIslands:
                      .LFB19:
                      	xorl	%eax, %eax
                      	cmpb	$39, %dil
                      	jg	.L231
                      	movb	$5, %dl
                      	movsbw	%dil, %ax
                      	idivb	%dl
                      	movq	board(%rip), %rdx
                      	movsbl	%al, %ecx
                      	leal	(%rcx,%rcx,4), %ecx
                      	shrq	%cl, %rdx
                      	andl	$32767, %edx
                      	sall	$2, %edx
                      	testb	$1, %al
                      	movslq	%edx, %rdx
                      	je	.L232
                      	movl	bad_odd_triple(%rdx), %eax
                      	ret
                      .L232:
                      	movl	bad_even_triple(%rdx), %eax
                      .L231:
                      	ret
                      
                      32-bit:
                      record_piece:
                      .LFB11:
                      	pushl	%edi
                      	pushl	%esi
                      	pushl	%ebx
                      	movl	16(%esp), %esi
                      	movl	20(%esp), %eax
                      	movl	32(%esp), %edx
                      	imull	$50, %esi, %ebx
                      	imull	$600, %esi, %esi
                      	addl	%eax, %ebx
                      	imull	$12, %eax, %eax
                      	movl	piece_counts(,%ebx,4), %ecx
                      	addl	%eax, %esi
                      	movl	28(%esp), %eax
                      	leal	(%esi,%ecx), %edi
                      	movl	%edx, pieces+4(,%edi,8)
                      	movl	%eax, pieces(,%edi,8)
                      	movl	24(%esp), %eax
                      	movb	%al, next_cell(%ecx,%esi)
                      	incl	%ecx
                      	movl	%ecx, piece_counts(,%ebx,4)
                      	popl	%ebx
                      	popl	%esi
                      	popl	%edi
                      	ret
                      
                      x32:
                      record_piece:
                      .LFB11:
                      	imull	$50, %edi, %eax
                      	imull	$600, %edi, %edi
                      	addl	%esi, %eax
                      	imull	$12, %esi, %esi
                      	sall	$2, %eax
                      	cltq
                      	movl	piece_counts(%rax), %r8d
                      	addl	%edi, %esi
                      	addl	%r8d, %esi
                      	incl	%r8d
                      	leal	0(,%rsi,8), %edi
                      	movslq	%esi, %rsi
                      	movl	%r8d, piece_counts(%rax)
                      	movslq	%edi, %rdi
                      	movb	%dl, next_cell(%rsi)
                      	movq	%rcx, pieces(%rdi)
                      	ret

                      Now granted, this is not irrefutable proof. I can't swear that the x32 code here has a smaller footprint than the 32-bit code, as I'm only going by the assembly output, but it does seem likely. I also compiled with both -O2 and -O3, and in both cases the resulting x32 assembly was quite a bit smaller than the 32-bit one; I didn't examine those listings, though.
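For readers who don't read AT&T syntax, here is a hypothetical C reconstruction of `boardHasIslands`, inferred from the two listings rather than copied from meteor.c (the names, the `int` element type and `TRIPLE_MASK` are assumptions read off the assembly). It highlights a second reason the x32 version is shorter, besides the argument arriving in a register (%dil) instead of on the stack: `board` is a 64-bit bitboard, so the variable shift compiles to a single `shrq` on x32, while i386 has to emulate it with `shrdl`, `shrl`, `testb $32` and `cmovne`.

```c
#include <stdint.h>

#define TRIPLE_MASK 0x7FFF            /* matches 'andl $32767' in both listings */

uint64_t board;                       /* 64-bit bitboard (two movl loads on i386) */
int bad_even_triple[TRIPLE_MASK + 1]; /* 4-byte elements: index is scaled by 4 */
int bad_odd_triple[TRIPLE_MASK + 1];

/* cell is assumed to be in 0..49 */
int boardHasIslands(signed char cell) {
    if (cell >= 40)                   /* cmpb $39 + jg: early exit returning 0 */
        return 0;
    int row = cell / 5;               /* movb $5 + idivb: quotient lands in %al */
    /* One 64-bit shift by row*5: shrq on x32, shrdl/shrl/testb/cmovne on i386 */
    unsigned triple = (unsigned)(board >> (row * 5)) & TRIPLE_MASK;
    return (row & 1) ? bad_odd_triple[triple]   /* testb $1, %al */
                     : bad_even_triple[triple];
}
```

This matches the listings only in structure; the exact meteor.c source and its types may differ.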

                      When kernel 3.4 is released and I can actually run and benchmark x32 code, I will recompile GCC with 32,x32 multilib so that I can build and compare proper binaries.

                      Originally posted by smitty3268
                      The instruction portion of the average executable is really quite small. Most of it tends to be data, such as string values encoded in the program. Even pointer-heavy apps are dominated in size by the data they use, not by the pointers themselves.
                      Again, the RAM usage difference of roughly 20% between equivalent 32-bit and 64-bit systems is pretty much confirmed. Also, code size does matter for performance, since the CPU cache isn't infinite.
