LTO'ing Mesa Is Getting Discussed For Performance & Binary Size Reasons


  • #31
    FireBurn I meant this one: https://lists.freedesktop.org/archiv...ay/118772.html (attachment.bin). It only disables the -flto flags for the mapi source files if you pass -flto in CFLAGS. But I have heard mapi might now compile successfully with -flto.
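    One guess (not confirmed in the thread) at what trips mapi up under LTO: symbols that are referenced only from assembler templates have no use visible in the compiler's IR, so an optimized build can discard them and the driver later fails to load with a symbol lookup error. A minimal sketch with purely illustrative names, using GCC's documented "used" attribute to keep such a symbol:
    Code:
    /* Hypothetical sketch: a table referenced only from a top-level asm
       statement.  GCC sees no IR-level use, so at -O2 (and more aggressively
       under -flto) the definition could be dropped; __attribute__((used))
       forces it to be emitted anyway. */
    __attribute__((used))
    static const void *dispatch_table[2] = { 0, 0 };   /* illustrative only */

    /* Generated or hand-written stub that uses the table behind GCC's back
       (x86-64 AT&T syntax). */
    __asm__(
        ".globl example_stub\n"
        "example_stub:\n"
        "    leaq dispatch_table(%rip), %rax\n"
        "    retq\n"
    );

    int main(void) { return 0; }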
    Last edited by CochainComplex; 01 June 2016, 05:38 PM.



    • #32
      hubicka Thanks for your info - I've also seen that zlib in Clear Linux is profiled too (I guess some other packages are as well).



      • #33
        Originally posted by discordian View Post
        WTF = What the f. Are you a gcc developer?
        Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jumptable.

        Originally posted by discordian View Post
        Actually I assumed the code is that complicated because the table is for some reason "externally visible" and thus needs extra code for interposition. I am beginning to realize that I was wrong and that AMD64 apparently doesn't have the simple constructs for pc-relative jumps that I took for granted on a more recent architecture.

        In short, I expected a 2-3 instruction sequence:
        Code:
        1) load offset from table [.pclabel + scale * index]
        2) jump by adding offset to pc
        .pclabel
        [.branch0 - .pclabel]
        [.branch1 - .pclabel]
        [.branch2 - .pclabel]
        ....
        I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more code cache and less data cache, and the data cache is generally less limiting.
        I don't think the main x86 chips have changed much in this respect. The expensive part of the tablejump sequence is still the indirect branch.

        Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
        See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9
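        For reference, the relative-offset dispatch sketched above can be written directly in GNU C with the labels-as-values extension (the GCC manual shows exactly this "&&label - &&base" pattern); a small self-contained example, illustrative only and not Mesa code:
        Code:
        #include <stdio.h>

        /* Relative-offset dispatch: the table holds offsets from a base label
           rather than absolute addresses, and the jump adds the offset to the
           base - roughly the pc-relative scheme sketched above. */
        static int dispatch(int op, int x)
        {
            static const int offsets[] = {
                &&op_add - &&base,
                &&op_sub - &&base,
                &&op_neg - &&base,
            };
            if (op < 0 || op > 2)
                return 0;
            goto *(&&base + offsets[op]);   /* base + relative offset */
        base:
        op_add:
            return x + 1;
        op_sub:
            return x - 1;
        op_neg:
            return -x;
        }

        int main(void)
        {
            printf("%d %d %d\n", dispatch(0, 5), dispatch(1, 5), dispatch(2, 5));
            return 0;
        }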



        • #34
          Originally posted by hubicka View Post
          Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jumptable.


          I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more code cache and less data cache, and the data cache is generally less limiting.
          I don't think the main x86 chips have changed much in this respect. The expensive part of the tablejump sequence is still the indirect branch.

          Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
          See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9
          I'd say way to go, Intel, with the confusing naming, considering IA-64 is Itanium.



          • #35
            Originally posted by hubicka View Post
            Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jumptable.
            Lol, I wasn't trying to be insulting, just curious myself, sorry.
            Originally posted by hubicka View Post
            I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more code cache and less data cache, and the data cache is generally less limiting.
            I don't think the main x86 chips have changed much in this respect. The expensive part of the tablejump sequence is still the indirect branch.

            Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
            See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9
            Obviously you don't want to mix code and data since the caches are separate, but apparently the big issue is placing it immediately after the indirect jump. Also, I don't know what you mean by the "indirect jump to direct jump" sequence - the return from the routine in the example? What happens in the branches is a further topic IMHO, and in this example the indirect jump could be replaced with a table lookup + addition, or you could just as well use arithmetic for the jump offset (the same number of instructions in each branch).

            In the generic case, what about using constant pools, i.e. accumulating the data (jumptable offsets and potentially more) from potentially multiple functions and placing it nearby in its own (cacheline-aligned) block that's just data? It was common on the m68k way back.
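            That constant-pool idea can be approximated in GCC by forcing the table into its own cache-line-aligned, read-only block instead of letting it sit next to the code; a rough sketch (the section name and contents are just illustrative):
            Code:
            #include <stdio.h>

            /* "Constant pool" placement: the jump-table offsets live in their
               own cache-line-aligned read-only block, separate from the code.
               Section name and contents are illustrative. */
            __attribute__((aligned(64), section(".rodata.jumptables")))
            static const int pool_offsets[8] = { 0, 12, 24, 36, 48, 60, 72, 84 };

            /* Plain data load from the pool; a real dispatcher would add the
               offset to a base address and jump, as in the sketch above. */
            int lookup(unsigned i)
            {
                return pool_offsets[i & 7u];
            }

            int main(void)
            {
                printf("%d\n", lookup(3));
                return 0;
            }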



            • #36
              Originally posted by atomsymbol
              I wasn't able to reproduce your V4F_COUNT issue on my machine with mesa-git and GCC 4.9. What kind of configure flags and compiler flags are you using?
              It happens only with Wine, by the way; everything else seems fine. I also only tested 32-bit Wine, not 64-bit.
              It also happens only with the Intel driver; radeonsi loads fine.

              Compiled with:
              Code:
              CFLAGS="-O3 -march=native -pipe -floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block -Wno-narrowing -flto=8"
              CXXFLAGS="${CFLAGS} -fno-delete-null-pointer-checks -flifetime-dse=1 -fpermissive"
              LDFLAGS="-Wl,-O1 -Wl,--hash-style=gnu -Wl,--as-needed -Wl,-flto=8"
              ./autogen.sh \
              --prefix=/usr \
              --libdir=/usr/lib32 \
              --build=x86_64-pc-linux-gnu --host=i686-pc-linux-gnu \
              --with-dri-driverdir=/usr/lib32/xorg/modules/dri \
              --with-dri-drivers=i965 \
              --with-egl-platforms=x11,drm,wayland \
              --with-gallium-drivers=radeonsi,r600,swrast,ilo \
              --enable-glx-tls \
              --enable-egl \
              --enable-gallium-llvm \
              --enable-gles1 \
              --enable-gles2 \
              --enable-texture-float \
              --enable-vdpau \
              --enable-va \
              --enable-gbm \
              --enable-shared-glapi \
              --enable-gallium-osmesa \
              --enable-dri3 \
              --enable-nine \
              --enable-omx \
              --with-vulkan-drivers=

              Originally posted by atomsymbol
              I am assuming the following command has a non-empty output on your machine:

              Code:
              $ nm -D i965_dri.so | grep V4F_COUNT
              One result:
              U V4F_COUNT

              Here's my mesa build:

              (compiled with -march=native for ivy bridge)
              Indeed, only loading the GL driver from this build with
              LIBGL_DRIVERS_PATH=/wherever/usr/lib32/xorg/modules/dri
              and trying to start Warcraft 3 in Wine on Intel fails with the symbol lookup error (you need LIBGL_DEBUG=verbose to see it, by the way).



              • #37
                Cool, thanks. So with -ffat-lto-objects it should work?

                By the way, the Intel ANV Vulkan driver in Steam doesn't work with the same LTO settings from earlier either, at least in The Talos Principle.
                Code:
                Cannot set requested display mode 1920x1080: GfxAPI error: Dynamic module "libvulkan.so.1" not found!
                Presumably it's the same problem.

