LTO'ing Mesa Is Getting Discussed For Performance & Binary Size Reasons


  • #21
    Originally posted by haagch
    I also learned about other flags from https://lists.freedesktop.org/archiv...ay/118929.html so I'm now compiling with

    export CFLAGS="$CFLAGS -O3 -flto=9 -ffat-lto-objects -flto-odr-type-merging"
    export CXXFLAGS="$CFLAGS"
    export LDFLAGS=" -flto=9"
    With the linker plugin, -ffat-lto-objects should no longer be needed. It only makes GCC produce assembly at compile time, which doubles compile times. It is also possible that the plugin is not used during linking (because the build machinery somehow bypasses gcc-ar, gcc or gcc-nm and calls binutils directly); the LTO is then silently discarded and you won't see any LTO optimizations. So try to get your build working without -ffat-lto-objects.

    -flto-odr-type-merging is the default, so no need to specify it.
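Following the advice above, a minimal sketch of the same environment with the two redundant flags dropped might look like this. The -O3 level and the job count of 9 are copied from the quoted post; pointing AR, NM and RANLIB at the GCC wrappers is an assumption about how the build system picks its tools, so that the archiver steps load the LTO plugin too.

```shell
# Sketch only: LTO flags without -ffat-lto-objects and -flto-odr-type-merging.
# "${VAR:-}" keeps this safe when the variable is not set yet.
export CFLAGS="${CFLAGS:-} -O3 -flto=9"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="${LDFLAGS:-} -flto=9"

# Assumption: the build honours these variables. The gcc-* wrappers call
# the binutils tools with the LTO plugin loaded.
export AR=gcc-ar
export NM=gcc-nm
export RANLIB=gcc-ranlib
```

Without fat objects, a link step that bypasses the plugin typically fails with undefined symbols instead of quietly producing a non-LTO binary.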



    • #22
      Originally posted by CochainComplex
      Has anyone used PGO for big programs? According to the GCC docs it is easy for a few files. But is there any way to use PGO for Mesa?
      It is used by Firefox and Google, so yes, there are big programs compiled with PGO.
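For reference, PGO with GCC is a two-pass build regardless of project size. A rough sketch of the flags per pass follows; pgo_flags is a hypothetical helper name, -O2 is an arbitrary choice, and how to wire this into Mesa's build is not covered in this thread.

```shell
# Hypothetical helper: print the GCC flags for each PGO pass.
pgo_flags() {
    case "$1" in
        instrument) echo "-O2 -fprofile-generate" ;;  # pass 1: instrumented build
        optimize)   echo "-O2 -fprofile-use" ;;       # pass 2: rebuild using the profile
        *) echo "usage: pgo_flags instrument|optimize" >&2; return 1 ;;
    esac
}

# Outline (build and workload commands are placeholders):
#   CFLAGS="$(pgo_flags instrument)" LDFLAGS="-fprofile-generate"  <build>
#   <run a representative workload; GCC writes .gcda profile files>
#   CFLAGS="$(pgo_flags optimize)"  <clean rebuild>
```

The hard part for a library like Mesa is picking the workload for the middle step, since the profile only helps the code paths that workload actually exercises.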



      • #23
        Originally posted by atomsymbol

        In my opinion, we should slowly be moving to a world where -fPIC passed to a compiler is ignored as a command-line option because the compiler (and the "system" behind) will automatically decide when to use position-independent code.

        As far as I know we aren't moving to such a world at all, which is unfortunate.
        GCC 6+ uses the linker plugin to drop -fPIC at link time when the resulting binary is not PIC. -fPIC limits the optimizations the compiler can do (because with PIC you can dynamically interpose symbols), and thus it needs to be known when the code is being optimized. Not sure you would move away from specifying it explicitly to the compiler/linker.



        • #24
          Originally posted by CochainComplex

          Works. I applied the mentioned patch and used your flags. No Problems at all. Thx

          Which patch? The one from the bug? I don't think it's necessary - or rather I'm not using it



          • #25
            @atomsymbol: Thanks for the sample. I am actually quite shocked that relative jumps aren't used (independent of PIC or not). On ARM the code is quite similar whether PIC is used or not (PIC has branch trampolines, non-PIC a jump table). I have been working on non-Linux, non-Intel platforms for a long time.

            Originally posted by hubicka
            GCC 6+ uses the linker plugin to drop -fPIC at link time when the resulting binary is not PIC. -fPIC limits the optimizations the compiler can do (because with PIC you can dynamically interpose symbols), and thus it needs to be known when the code is being optimized. Not sure you would move away from specifying it explicitly to the compiler/linker.
            That's an ELF requirement; however, it should only apply to visible symbols, and LTO should help hide anything that's not externally visible. But after the example above it seems there is a lot of unused potential (to put it optimistically; I could also label this a clear WTF).



            • #26
              Originally posted by hubicka
              With the linker plugin, -ffat-lto-objects should no longer be needed. It only makes GCC produce assembly at compile time, which doubles compile times. It is also possible that the plugin is not used during linking (because the build machinery somehow bypasses gcc-ar, gcc or gcc-nm and calls binutils directly); the LTO is then silently discarded and you won't see any LTO optimizations. So try to get your build working without -ffat-lto-objects.

              -flto-odr-type-merging is the default, so no need to specify it.
              Thanks for all the info. I have already switched to fireburn's flags though: https://www.phoronix.com/forums/foru...333#post875333

              I did not have to set AR, NM and RANLIB though, it just works like that. Not sure if it is using gcc-ar etc. I'll try setting the variables and see whether anything changes for the next build.
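One way to take some of the guesswork out of this is to look at the intermediate object files. GCC keeps its LTO bytecode in ELF sections whose names start with .gnu.lto_, so a small check like the following (has_lto_sections is a hypothetical helper name) shows whether the compile step emitted LTO data at all. It does not prove that the final link actually used it.

```shell
# Hypothetical helper: succeed if the given object file carries GCC LTO sections.
has_lto_sections() {
    # -S lists section headers, -W avoids truncating long section names.
    readelf -S -W "$1" 2>/dev/null | grep -q '\.gnu\.lto_'
}

# Usage sketch (the object path is a placeholder):
#   has_lto_sections path/to/some_object.o && echo "LTO bytecode present"
```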



              • #27
                Would this noticeably improve performance per watt?



                • #28
                  Originally posted by discordian
                  @atomsymbol: Thanks for the sample. I am actually quite shocked that relative jumps aren't used (independent of PIC or not). On ARM the code is quite similar whether PIC is used or not (PIC has branch trampolines, non-PIC a jump table). I have been working on non-Linux, non-Intel platforms for a long time.


                  That's an ELF requirement; however, it should only apply to visible symbols, and LTO should help hide anything that's not externally visible. But after the example above it seems there is a lot of unused potential (to put it optimistically; I could also label this a clear WTF).
                  LTO privatizes symbols based on the linker resolution file, so it will optimize across internal symbols well.

                  I can see an extra zero-extend of edi in your switch codegen, both PIC and non-PIC. Will fix that. Otherwise all the work is done to keep the jump table free of dynamic relocations. What is WTF?



                  • #29
                    Originally posted by hubicka
                    LTO privatizes symbols based on the linker resolution file, so it will optimize across internal symbols well.

                    I can see an extra zero-extend of edi in your switch codegen, both PIC and non-PIC. Will fix that. Otherwise all the work is done to keep the jump table free of dynamic relocations. What is WTF?
                    WTF = What the f. Are you a GCC developer?

                    Actually, I assumed the code is that complicated because the table is for some reason "externally visible" and thus needs extra code for interposing. I am beginning to realize that I was wrong and that AMD64 apparently doesn't have the simple constructs for pc-relative jumps that I took for granted with a more recent architecture.

                    In short, I expected a 2-3 instruction sequence.
                    1) load offset from table [.pclabel + scale * index]
                    2) jump by adding offset to pc
                    .pclabel
                    [.branch0 - .pclabel]
                    [.branch1 - .pclabel]
                    [.branch2 - .pclabel]
                    ....

                    How this looks for ARM Thumb2 should give you an idea of what I had in mind when I talked about pc-relative addressing, and why it can be just as fast as non-pc-relative addressing (or even faster, since you don't need a PLT).
                    Code:
                    -O2
                    00000000 <fn>:
                       0:    e3500005     cmp    r0, #5
                       4:    979ff100     ldrls    pc, [pc, r0, lsl #2]
                       8:    ea000011     b    54 <fn+0x54>
                       c:    0000002c     .word    0x0000002c
                      10:    00000034     .word    0x00000034
                      14:    0000003c     .word    0x0000003c
                      18:    00000044     .word    0x00000044
                      1c:    0000004c     .word    0x0000004c
                      20:    00000024     .word    0x00000024
                    
                    -O2 -fPIC (really not optimal, should be closer to the Thumb2 code below)
                    00000000 <fn>:
                       0:    e3500005     cmp    r0, #5
                       4:    908ff100     addls    pc, pc, r0, lsl #2
                       8:    ea000011     b    54 <fn+0x54>
                       c:    ea000006     b    2c <fn+0x2c>
                      10:    ea000007     b    34 <fn+0x34>
                      14:    ea000008     b    3c <fn+0x3c>
                      18:    ea000009     b    44 <fn+0x44>
                      1c:    ea00000a     b    4c <fn+0x4c>
                      20:    eaffffff     b    24 <fn+0x24>
                    
                    Thumb2 (identical code for PIC and no-PIC)
                    -O2 -fPIC -mthumb -mcpu=cortex-a15
                    00000000 <fn>:
                       0:    2805          cmp    r0, #5
                       2:    d816          bhi.n    32 <fn+0x32>
                       4:    e8df f000     tbb    [pc, r0]
                       8:    0f0c0906     .word    0x0f0c0906
                       c:    0312          .short    0x0312
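The Thumb2 byte table above can be decoded by hand, which makes the pc-relative scheme concrete. This is a sketch under two assumptions: the object is little-endian (so .word 0x0f0c0906 followed by .short 0x0312 yields the bytes 06 09 0c 0f 12 03), and tbb branches to PC + 2 * byte, where PC reads as the address of the tbb instruction plus 4. tbb_target is a hypothetical helper name.

```shell
# Hypothetical helper: compute the branch target of one tbb table entry.
#   $1 = address of the tbb instruction, $2 = table byte for the case index
tbb_target() {
    # In Thumb state the PC value seen by tbb is the instruction address + 4.
    printf '0x%x\n' $(( $1 + 4 + 2 * $2 ))
}

# Table bytes for cases 0..5, taken from the listing (tbb sits at address 0x4).
for b in 0x06 0x09 0x0c 0x0f 0x12 0x03; do
    tbb_target 0x4 "$b"
done
# prints 0x14, 0x1a, 0x20, 0x26, 0x2c, 0xe (one per line)
```

Every entry is a one-byte offset relative to the table itself, so nothing in it depends on the load address and no relocation is needed.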



                    • #30
                      Originally posted by haagch
                      Thanks for all the info. I have already switched to fireburn's flags though: https://www.phoronix.com/forums/foru...333#post875333

                      I did not have to set AR, NM and RANLIB though, it just works like that. Not sure if it is using gcc-ar etc. I'll try setting the variables and see whether anything changes for the next build.
                      So with AR, NM etc. set it does not look any different at all, so it's not necessary for my setup.

                      Oh and now I see the problem I have had some time ago with -fPIC and otherwise similar flags when trying to use wine:
                      Code:
                      /usr/lib32/xorg/modules/dri/i965_dri.so: undefined symbol: V4F_COUNT
                      Of course according to google I am literally the only human on the planet who has posted about this issue... once.

                      Time to find out which flag exactly causes it.

                      *starts compiling*

                      edit: Can confirm, the exact same flags with just -flto removed work with wine.
                      Grepping mesa for V4F_COUNT... Wow, that's some low level ASM stuff right there. Probably need some compiler insight to know what's going on there... hubicka maybe?
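A quick way to compare the LTO and non-LTO builds is to ask nm whether the driver still lists the symbol as an undefined dynamic reference. A minimal sketch (undefined_in is a hypothetical helper name; the library path in the usage line is the one from the error message):

```shell
# Hypothetical helper: print matching undefined dynamic symbols of a shared object.
#   $1 = shared object, $2 = symbol name
undefined_in() {
    # -D reads the dynamic symbol table; --undefined-only keeps unresolved entries.
    nm -D --undefined-only "$1" 2>/dev/null | grep -w -- "$2"
}

# Usage sketch:
#   undefined_in /usr/lib32/xorg/modules/dri/i965_dri.so V4F_COUNT
```

If the symbol shows up only in the -flto build, that would suggest LTO dropped or localized a definition that the hand-written assembly still refers to by name, though that is a guess until someone checks the Mesa sources.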
                      Last edited by haagch; 01 June 2016, 05:26 PM.
