
LTO'ing Mesa Is Getting Discussed For Performance & Binary Size Reasons


  • #31
    Originally posted by CochainComplex View Post

    Works. I applied the mentioned patch and used your flags. No Problems at all. Thx

    Which patch? The one from the bug? I don't think it's necessary - or rather I'm not using it



    • #32
      @atomsymbol: Thanks for the sample. I am actually quite shocked that relative jumps aren't used (independent of PIC or not). On ARM the code is quite similar whether or not PIC is used (PIC has branch trampolines, non-PIC a jump table). I've been working on non-Linux, non-Intel platforms for a long time.

      Originally posted by hubicka View Post
      GCC 6+ uses the linker plugin to drop -fPIC at link time when the resulting binary is not PIC. -fPIC limits the optimizations the compiler can do (because with PIC you can dynamically interpose symbols), and thus it needs to be known when the code is being optimized. Not sure you would move away from specifying it explicitly to the compiler/linker.
      That's an ELF requirement; however, it should only apply to visible symbols, and LTO should help hide anything that's not externally visible. But after the example above it seems there is a lot of unused potential (to put it optimistically; I could also label this a clear WTF).
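To make the visibility point concrete, here is a minimal sketch in C. All function names are made up for illustration, and the visibility attribute is a GCC/Clang extension. As far as I understand it, GCC by default stays conservative about inlining the default-visibility function when building with -fPIC, because another definition could be interposed at run time, while the hidden and static ones are fair game.

```c
/* Hypothetical names, for illustration only. */

/* Default visibility: in a shared object built with -fPIC, a call to this
 * symbol may end up bound to another definition at run time
 * (interposition), so the optimizer has to treat it conservatively. */
int scale_public(int x)
{
    return x * 3;
}

/* Hidden visibility (GCC/Clang attribute): the symbol is not exported from
 * the shared object, so it cannot be interposed from outside. */
__attribute__((visibility("hidden")))
int scale_hidden(int x)
{
    return x * 3;
}

/* static: the symbol never leaves this translation unit at all. */
static int scale_static(int x)
{
    return x * 3;
}

/* Caller that uses all three variants. */
int sum_scaled(int x)
{
    return scale_public(x) + scale_hidden(x) + scale_static(x);
}
```

The three helpers compute the same thing on purpose; only the linkage differs, which is the part LTO can tighten when it knows a symbol is not needed outside the binary.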



      • #33
        Originally posted by hubicka View Post
        With the linker plugin, -ffat-lto-objects should no longer be needed. It only makes GCC produce assembly at compile time, which doubles compile times. It is also possible that the plugin is not used during linking (because the build machinery somehow bypasses gcc-ar, gcc or gcc-nm and calls binutils directly); then the LTO is silently discarded and you won't see any LTO options. So try to get your build working without -ffat-lto-objects.

        -flto-odr-type-merging is the default, so no need to specify it.
        Thanks for all the info. I have already switched to fireburn's flags though: https://www.phoronix.com/forums/foru...333#post875333

        I did not have to set AR, NM and RANLIB though, it just works like that. Not sure if it is using gcc-ar etc. I'll try setting the variables and see whether anything changes for the next build.



        • #34
          Would this noticeably improve performance per watt?



          • #35
            Originally posted by discordian View Post
            @atomsymbol: Thanks for the sample. I am actually quite shocked that relative jumps aren't used (independent of PIC or not). On ARM the code is quite similar whether or not PIC is used (PIC has branch trampolines, non-PIC a jump table). I've been working on non-Linux, non-Intel platforms for a long time.

            That's an ELF requirement; however, it should only apply to visible symbols, and LTO should help hide anything that's not externally visible. But after the example above it seems there is a lot of unused potential (to put it optimistically; I could also label this a clear WTF).
            LTO privatizes symbols based on the linker resolution file, so it will optimize across internal symbols well.

            I can see an extra zero-extend of edi in your switch codegen, both PIC and non-PIC. Will fix that. Otherwise all the work is done to keep the jump table without dynamic relocations. What is WTF?



            • #36
              Originally posted by hubicka View Post
              LTO privatizes symbols based on the linker resolution file, so it will optimize across internal symbols well.

              I can see an extra zero-extend of edi in your switch codegen, both PIC and non-PIC. Will fix that. Otherwise all the work is done to keep the jump table without dynamic relocations. What is WTF?
              WTF = What the f. Are you a gcc developer?

              Actually I assumed the code is that complicated because the table is for some reason "externally visible" and thus needs extra code for interposing. I am beginning to realize that I was wrong and that AMD64 apparently doesn't have the simple constructs for pc-relative jumps that I had taken for granted with a more recent architecture.

              In short, I expected a 2-3 instruction sequence.
              1) load offset from table [.pclabel + scale * index]
              2) jump by adding offset to pc
              .pclabel
              [.branch0 - .pclabel]
              [.branch1 - .pclabel]
              [.branch2 - .pclabel]
              ....
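The same layout can be sketched at the C level with the GNU "labels as values" extension (so GCC or Clang only, not ISO C). This is only an illustration of the table-of-offsets idea; the function name and return values are made up, and which instructions the compiler actually emits for it is a separate question.

```c
/* Needs GCC or Clang: uses the "labels as values" extension (&&label).
 * noinline keeps a single copy of the labels that the static table uses. */
__attribute__((noinline))
int dispatch(unsigned idx)
{
    /* Each entry is the distance of a branch from the first label, so the
     * table holds no absolute addresses. */
    static const int offsets[] = {
        &&branch0 - &&branch0,
        &&branch1 - &&branch0,
        &&branch2 - &&branch0,
    };

    /* Bounds check, like the cmp at the top of a real table jump. */
    if (idx > 2)
        return -1;

    /* 1) load offset from table, 2) jump to base label plus offset. */
    goto *(&&branch0 + offsets[idx]);

branch0:
    return 10;
branch1:
    return 20;
branch2:
    return 30;
}
```

Calling `dispatch(1)` lands on `branch1`; anything above 2 takes the early return instead of indexing past the table.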

              How this looks for ARM Thumb2 should give you the idea I had when I talked about pc-relative addressing, and that it can be just as fast as non-pc-relative addressing (or even faster, since you don't need a PLT).
              Code:
              -O2
              00000000 <fn>:
                 0:    e3500005     cmp    r0, #5
                 4:    979ff100     ldrls    pc, [pc, r0, lsl #2]
                 8:    ea000011     b    54 <fn+0x54>
                 c:    0000002c     .word    0x0000002c
                10:    00000034     .word    0x00000034
                14:    0000003c     .word    0x0000003c
                18:    00000044     .word    0x00000044
                1c:    0000004c     .word    0x0000004c
                20:    00000024     .word    0x00000024
              
              -O2 -fPIC (really not optimal, should be closer to the Thumb2 code below)
              00000000 <fn>:
                 0:    e3500005     cmp    r0, #5
                 4:    908ff100     addls    pc, pc, r0, lsl #2
                 8:    ea000011     b    54 <fn+0x54>
                 c:    ea000006     b    2c <fn+0x2c>
                10:    ea000007     b    34 <fn+0x34>
                14:    ea000008     b    3c <fn+0x3c>
                18:    ea000009     b    44 <fn+0x44>
                1c:    ea00000a     b    4c <fn+0x4c>
                20:    eaffffff     b    24 <fn+0x24>
              
              Thumb2 (identical code for PIC and no-PIC)
              -O2 -fPIC -mthumb -mcpu=cortex-a15
              00000000 <fn>:
                 0:    2805          cmp    r0, #5
                 2:    d816          bhi.n    32 <fn+0x32>
                 4:    e8df f000     tbb    [pc, r0]
                 8:    0f0c0906     .word    0x0f0c0906
                 c:    0312          .short    0x0312



              • #37
                Originally posted by haagch View Post
                Thanks for all the info. I have already switched to fireburn's flags though: https://www.phoronix.com/forums/foru...333#post875333

                I did not have to set AR, NM and RANLIB though, it just works like that. Not sure if it is using gcc-ar etc. I'll try setting the variables and see whether anything changes for the next build.
                So with AR, NM etc. set it does not look different at all, so it's not necessary for my setup.

                Oh and now I see the problem I have had some time ago with -fPIC and otherwise similar flags when trying to use wine:
                Code:
                /usr/lib32/xorg/modules/dri/i965_dri.so: undefined symbol: V4F_COUNT
                Of course according to google I am literally the only human on the planet who has posted about this issue... once.

                Time to find out which flag exactly causes it.

                *starts compiling*

                edit: Can confirm, the exact same flags with just -flto removed work with wine.
                Grepping mesa for V4F_COUNT... Wow, that's some low level ASM stuff right there. Probably need some compiler insight to know what's going on there... hubicka maybe?
                Last edited by haagch; 01 June 2016, 05:26 PM.



                • #38
                  FireBurn I meant this one: https://lists.freedesktop.org/archiv...ay/118772.html attachement.bin. It only disables flto flags for mapi source files if you pass flto with CFLAGS. But I have heard mapi might get successfully compiled with flto.
                  Last edited by CochainComplex; 01 June 2016, 05:38 PM.



                  • #39
                    hubicka thx for your info - I've also seen that zlib in Clear Linux was profiled too (I guess some other packages as well).



                    • #40
                      Originally posted by discordian View Post
                      WTF = What the f. Are you a gcc developer?
                      Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jump table.

                      Originally posted by discordian View Post
                      Actually I assumed the code is that complicated because the table is for some reason "externally visible" and thus needs extra code for interposing. I am beginning to realize that I was wrong and that AMD64 apparently doesn't have the simple constructs for pc-relative jumps that I had taken for granted with a more recent architecture.

                      In short, I expected a 2-3 instruction sequence.
                      1) load offset from table [.pclabel + scale * index]
                      2) jump by adding offset to pc
                      .pclabel
                      [.branch0 - .pclabel]
                      [.branch1 - .pclabel]
                      [.branch2 - .pclabel]
                      ....
                      I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more of the code cache and less of the data cache, and the data cache is generally less limiting.
                      I don't think the main x86 chips have changed much in that respect. The expensive part of the table jump sequence is still the indirect branch.

                      Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
                      See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9
