LTO'ing Mesa Is Getting Discussed For Performance & Binary Size Reasons


  • haagch
    replied
    Cool, thanks. So with -ffat-lto-objects it should work?

    By the way, intel anv vulkan in steam doesn't work with the same LTO settings from earlier either, at least in The Talos Principle.
    Code:
    Cannot set requested display mode 1920x1080: GfxAPI error: Dynamic module "libvulkan.so.1" not found!
    Presumably it's the same problem.


  • atomsymbol
    replied
    Originally posted by haagch View Post
    U V4F_COUNT
    I located the cause of the problem: gen_matypes.c is passed to the compiler in order to generate assembly code in text form (compiler switch: -S), but with -flto GCC doesn't output any code nor data because that is postponed to link time.

    In other words: executing "gcc -S -flto foo.c -o foo.s" generates a file with empty .text and .data sections.

    This isn't a GCC bug. On the other hand, maybe passing -S together with -flto to the compiler should imply -ffat-lto-objects or at least print a warning message (hubicka)? I used GCC 4.9.3 which isn't printing any warning.
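
A minimal sketch for reproducing the effect described above, assuming a GNU gcc with LTO support is on PATH. The file foo.c, the function get_answer, and the defines_label helper are made up for illustration; they are not taken from Mesa.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if assembly file $1 defines the label $2.
defines_label() {
    grep -q "^$2:" "$1" 2>/dev/null
}

# Demo: only runs when a gcc binary is available.
if command -v gcc >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    printf 'int answer = 42;\nint get_answer(void) { return answer; }\n' > "$tmp/foo.c"

    # Plain -flto: code generation is postponed to link time.
    gcc -S -flto "$tmp/foo.c" -o "$tmp/slim.s"
    # Fat objects: regular assembly is emitted next to the LTO data.
    gcc -S -flto -ffat-lto-objects "$tmp/foo.c" -o "$tmp/fat.s"

    for f in slim fat; do
        if defines_label "$tmp/$f.s" get_answer; then
            echo "$f.s: get_answer is defined"
        else
            echo "$f.s: get_answer is missing"
        fi
    done
    rm -rf "$tmp"
fi
```

Whether the plain -flto output really is empty depends on the GCC version and on how it was configured, so treat this as a check to run locally, not as a guaranteed result.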


  • haagch
    replied
    Originally posted by atomsymbol View Post
    I wasn't able to reproduce your V4F_COUNT issue on my machine with mesa-git and GCC 4.9. What kind of configure flags and compiler flags are you using?
    Happens only with wine by the way, everything else seems fine. I also only tested 32 bit wine, not 64 bit.
    Also happens only with intel. radeonsi loads fine.

    Compiled with
    CFLAGS="-O3 -march=native -pipe -floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block -Wno-narrowing -flto=8"
    CXXFLAGS="${CFLAGS} -fno-delete-null-pointer-checks -flifetime-dse=1 -fpermissive"
    LDFLAGS="-Wl,-O1 -Wl,--hash-style=gnu -Wl,--as-needed -Wl,-flto=8"
    ./autogen.sh \
    --prefix=/usr \
    --libdir=/usr/lib32 \
    --build=x86_64-pc-linux-gnu --host=i686-pc-linux-gnu \
    --with-dri-driverdir=/usr/lib32/xorg/modules/dri \
    --with-dri-drivers=i965 \
    --with-egl-platforms=x11,drm,wayland \
    --with-gallium-drivers=radeonsi,r600,swrast,ilo \
    --enable-glx-tls \
    --enable-egl \
    --enable-gallium-llvm \
    --enable-gles1 \
    --enable-gles2 \
    --enable-texture-float \
    --enable-vdpau \
    --enable-va \
    --enable-gbm \
    --enable-shared-glapi \
    --enable-gallium-osmesa \
    --enable-dri3 \
    --enable-nine \
    --enable-omx \
    --with-vulkan-drivers=

    Originally posted by atomsymbol View Post
    I am assuming the following command has a non-empty output on your machine:

    Code:
    $ nm -D i965_dri.so | grep V4F_COUNT
    One result:
    U V4F_COUNT

    Here's my mesa build:
    http://haagch.frickel.club/files/mesa-lto.pkg.tar.gz
    (compiled with -march=native for ivy bridge)
    Indeed only loading the GL driver from this build with
    LIBGL_DRIVERS_PATH=/wherever/usr/lib32/xorg/modules/dri
    and trying to start warcraft 3 in wine on intel fails with the symbol lookup error (need LIBGL_DEBUG=verbose to see it by the way)
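
The nm check quoted above can be wrapped in a small filter. This is only a sketch: undefined_symbols is a hypothetical helper name, and it assumes the usual `nm -D` layout where undefined entries show a lone `U` in front of the symbol name.

```shell
#!/bin/sh
# Hypothetical helper: read `nm -D` output on stdin and print only the
# names of undefined ("U") symbols, one per line.
undefined_symbols() {
    awk '$1 == "U" { print $2 }'
}
```

Usage would look like `nm -D i965_dri.so | undefined_symbols | grep V4F_COUNT`. Most undefined symbols in a driver are normal (they are resolved by libc and other libraries at load time); the interesting ones are those that no loaded library provides.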


  • atomsymbol
    replied
    Originally posted by haagch View Post
    So with AR, NM etc. set it does not look different at all, so it's not necessary for my setup.

    Oh and now I see the problem I have had some time ago with -fPIC and otherwise similar flags when trying to use wine:
    Code:
    /usr/lib32/xorg/modules/dri/i965_dri.so: undefined symbol: V4F_COUNT
    Of course according to google I am literally the only human on the planet who has posted about this issue... once.

    Time to find out which flag exactly causes it.

    *starts compiling*

    edit: Can confirm, the exact same flags with just -flto removed work with wine.
    Grepping mesa for V4F_COUNT... Wow, that's some low level ASM stuff right there. Probably need some compiler insight to know what's going on there... hubicka maybe?
    I wasn't able to reproduce your V4F_COUNT issue on my machine with mesa-git and GCC 4.9. What kind of configure flags and compiler flags are you using?

    I am assuming the following command has a non-empty output on your machine:

    Code:
    $ nm -D i965_dri.so | grep V4F_COUNT


  • discordian
    replied
    Originally posted by hubicka View Post
    Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jump table.
    Lol, I wasn't trying to be insulting, just curious myself, sorry.
    Originally posted by hubicka View Post
    I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more code cache and less data cache, and data cache is generally less limiting.
    I don't think the main x86 chips have changed much in that respect. The expensive part of the tablejump sequence is still the indirect branch.

    Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
    See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9
    Obviously you don't want to mix code and data since the caches are separate, but apparently the big issue is placing the table immediately after the indirect jump. Also, I don't know what you mean by "sequence of indirect jump to direct jump": the return from the routine in the example? What happens in the branches is a further topic IMHO, and in this example the indirect jump could be replaced with a table lookup + addition, or you could just as well use arithmetic for the jump offset (same number of instructions in each branch).

    In the generic case, what about using constant pools, i.e. accumulating the data (jump table offsets and potentially more) from potentially multiple functions and placing it somewhere close by in its own (cacheline-aligned) block that's just data? It was common for the m68k way back in time.


  • nanonyme
    replied
    Originally posted by hubicka View Post
    Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jump table.


    I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more code cache and less data cache, and data cache is generally less limiting.
    I don't think the main x86 chips have changed much in that respect. The expensive part of the tablejump sequence is still the indirect branch.

    Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
    See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9
    I'd say way to go Intel with confusing namings considering ia64 is Itanium


  • hubicka
    replied
    Originally posted by discordian View Post
    WTF = What the f. Are you a gcc developer?
    Yep, I am a GCC developer, and even GCC folks know what WTF means. I was just curious what you expect the compiler to do with the jump table.

    Originally posted by discordian View Post
    Actually I assumed the code is that complicated because the table is for some reason "externally visible" and thus needs extra code for interposing. I am beginning to realize that I was wrong and that AMD64 apparently doesn't have the simple constructs for pc-relative jumps that I had taken for granted with a more recent architecture.

    In short, I expected a 2-3 instruction sequence.
    1) load offset from table [.pclabel + scale * index]
    2) jump by adding offset to pc
    .pclabel
    [.branch0 - .pclabel]
    [.branch1 - .pclabel]
    [.branch2 - .pclabel]
    ....
    I tried this kind of codegen back in the early 2000s when writing the x86-64 machine description. At that time it was a loss because the branch prediction logic did not handle the sequence of an indirect jump to a direct jump well. It also consumes more code cache and less data cache, and data cache is generally less limiting.
    I don't think the main x86 chips have changed much in that respect. The expensive part of the tablejump sequence is still the indirect branch.

    Putting the table just behind the instruction is not recommended on either Intel or AMD chips.
    See http://www.intel.com/content/dam/www...ion-manual.pdf section 3.6.9


  • CochainComplex
    replied
    hubicka, thanks for your info - I've also seen that zlib in Clear Linux was profiled too (and I guess some other packages as well).


  • CochainComplex
    replied
    FireBurn I meant this one: https://lists.freedesktop.org/archiv...ay/118772.html attachement.bin. It only disables flto flags for mapi source files if you pass flto with CFLAGS. But I have heard mapi might get successfully compiled with flto.
    Last edited by CochainComplex; 01 June 2016, 05:38 PM.
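
The idea of dropping LTO flags for a subset of sources can be sketched in plain shell. This is not the linked patch, only an illustration of the filtering step; strip_lto_flags is a hypothetical helper name, and the list of flags it drops is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the flag string $1 with LTO-related compiler
# flags removed. Flags are split on whitespace, so quoted arguments that
# contain spaces are not supported.
strip_lto_flags() {
    out=
    for flag in $1; do
        case $flag in
            -flto|-flto=*|-ffat-lto-objects|-fno-fat-lto-objects) ;;  # drop
            *) out="${out:+$out }$flag" ;;                            # keep
        esac
    done
    printf '%s\n' "$out"
}
```

For example, `MAPI_CFLAGS=$(strip_lto_flags "$CFLAGS")` would give a flag set for the affected sources, where MAPI_CFLAGS is an invented variable name. Linker options wrapped in `-Wl,` are left untouched by this sketch.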


  • haagch
    replied
    Originally posted by haagch View Post
    Thanks for all the info. I have already switched to fireburn's flags though: https://www.phoronix.com/forums/foru...333#post875333

    I did not have to set AR, NM and RANLIB though, it just works like that. Not sure if it is using gcc-ar etc. I'll try setting the variables and see whether anything changes for the next build.
    So with AR, NM etc. set it does not look different at all, so it's not necessary for my setup.

    Oh and now I see the problem I have had some time ago with -fPIC and otherwise similar flags when trying to use wine:
    Code:
    /usr/lib32/xorg/modules/dri/i965_dri.so: undefined symbol: V4F_COUNT
    Of course according to google I am literally the only human on the planet who has posted about this issue... once.

    Time to find out which flag exactly causes it.

    *starts compiling*

    edit: Can confirm, the exact same flags with just -flto removed work with wine.
    Grepping mesa for V4F_COUNT... Wow, that's some low level ASM stuff right there. Probably need some compiler insight to know what's going on there... hubicka maybe?
    Last edited by haagch; 01 June 2016, 05:26 PM.
