A Fresh Look At The PGO Performance With GCC 8

  • eva2000
    replied
    Originally posted by Michael

    Yes, that snapshot reports 8.1.1.

    Maybe it is time to revisit PGO with GCC 8 and 9 on newer Intel and AMD Ryzen CPUs?


  • Grinch
    replied
    Originally posted by hubicka

    it outputs something like

    This is what I get on my Arch install:
    -plugin /usr/lib/gcc/x86_64-pc-linux-gnu/8.1.1/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-pc-linux-gnu/8.1.1/lto-wrapper

    Which seems to indicate that the plugin is working correctly, so no problem on that end. Thanks for the help!


  • hubicka
    replied
    Originally posted by Grinch

    Thanks for the info.

    I should do some more testing with LTO by the sound of it. Do you still have to pass the compiler optimization options at the link stage as well (I'm on GCC 8.1)?

    -flto=<numthreads> should just work, but you need to be sure that your linker plugin is set up correctly. The non-plugin path will produce worse code because the compiler will not get the resolution info.

    One way to check that the linker plugin is used is to link with --verbose:
    gcc --verbose test.o
    It outputs something like
    /usr/lib64/gcc/x86_64-suse-linux/8/collect2 -plugin /usr/lib64/gcc/x86_64-suse-linux/8/liblto_plugin.so ...
    on the line invoking collect2.
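
The check described above can also be scripted. The following is a minimal sketch, assuming the output of `gcc --verbose` has already been captured as text (GCC writes it to stderr). The function name `uses_lto_plugin` is hypothetical and not part of any GCC tooling.

```python
def uses_lto_plugin(verbose_output: str) -> bool:
    """Return True if a collect2 invocation line passes -plugin with liblto_plugin."""
    for line in verbose_output.splitlines():
        tokens = line.split()
        # Only look at the line that invokes collect2.
        if not tokens or not tokens[0].endswith("collect2"):
            continue
        # -plugin is followed by the path of the plugin shared object.
        for flag, value in zip(tokens, tokens[1:]):
            if flag == "-plugin" and "liblto_plugin" in value:
                return True
    return False


# The collect2 line quoted in the post, with the elided tail left out.
sample = ("/usr/lib64/gcc/x86_64-suse-linux/8/collect2 "
          "-plugin /usr/lib64/gcc/x86_64-suse-linux/8/liblto_plugin.so")
print(uses_lto_plugin(sample))
```

To feed it real output, one option is `subprocess.run(["gcc", "--verbose", "test.o"], capture_output=True, text=True).stderr`, assuming a `test.o` exists in the current directory.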


  • Grinch
    replied
    Originally posted by hubicka

    and LTO+PGO improves performance by about 7% while each of them improves by less than 3% alone.

    Thanks for the info.

    I should do some more testing with LTO by the sound of it. Do you still have to pass the compiler optimization options at the link stage as well (I'm on GCC 8.1)?


  • hubicka
    replied
    Originally posted by Grinch

    Great hearing from someone directly involved with the optimization implementation in question, thanks a lot for chiming in!

    Yes, I've noticed that PGO builds are almost always smaller than their non-PGO equivalents despite enabling loop unrolling. As for LTO + PGO, I've seldom noticed any increase in performance when combining them, and very little when it happens, so I typically don't use LTO due to the extra time it takes during the linking stage. In fact, I've wondered if I am doing something wrong.

    'carewolf' suggested using -O2/-Os instead of -O3 in order to avoid large binary sizes while still retaining a lot of the performance advantages of PGO. Anything you'd care to comment on?

    -O2 -fprofile-use and -O3 -fprofile-use are pretty much equivalent. GCC 8 does not enable unroll-and-jam at -O2 -fprofile-use, but I plan to look into the heuristics and fix it.

    It always depends on what you try, but LTO+PGO usually combines well. http://www.ucw.cz/~hubicka/slides/opensuse2018-e.pdf has some SPEC scores, and LTO+PGO improves performance by about 7% while each of them improves by less than 3% alone.


  • Grinch
    replied
    Originally posted by hubicka

    As the maintainer of -fprofile-use

    Great hearing from someone directly involved with the optimization implementation in question, thanks a lot for chiming in!

    Yes, I've noticed that PGO builds are almost always smaller than their non-PGO equivalents despite enabling loop unrolling. As for LTO + PGO, I've seldom noticed any increase in performance when combining them, and very little when it happens, so I typically don't use LTO due to the extra time it takes during the linking stage. In fact, I've wondered if I am doing something wrong.

    'carewolf' suggested using -O2/-Os instead of -O3 in order to avoid large binary sizes while still retaining a lot of the performance advantages of PGO. Anything you'd care to comment on?


  • hubicka
    replied
    Originally posted by Grinch

    I was looking at an older version of the documentation. The current options enabled by PGO (-fprofile-use), as per the docs, are: -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops, -ftracer, -ftree-vectorize, -ftree-loop-distribute-patterns.

    You were right, as we can see here, given that at least -finline-functions and -funswitch-loops, which belong to -O3, are enabled by PGO. Thanks for the heads up!

    As the maintainer of -fprofile-use, I am trying to enable all optimizations done at -O3 and modify them so that their heuristics are driven by the profile. You can still get more benefit from -Ofast because it permits transformations that are not standard compliant. Most of the benefit comes from combining link-time optimization and profile use, because cross-module inlining enables a lot more transformations.

    A nice thing about profile-use is that it makes cold code small, so it would be nice to see it enabled more often for distribution builds (it is easy to do for most interpreters and many other core packages). With LTO it also changes the code layout so that the binary loads sequentially and touches fewer pages, which improves startup times with demand paging.
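
As a rough illustration of combining link-time optimization with profile use, the sketch below assembles the usual three steps: an instrumented build, a training run, and an optimized rebuild. `-fprofile-generate`, `-fprofile-use`, and `-flto` are real GCC options; the function name, the file names, the workload argument, and the default `-O2` level are assumptions made for illustration only.

```python
import shlex


def pgo_lto_commands(source, binary, training_args, opt="-O2"):
    """Build the three command lines of a PGO + LTO cycle (nothing is executed)."""
    common = ["gcc", opt, "-flto", source, "-o", binary]
    return [
        # 1. Instrumented build: the binary records profile data when it runs.
        common + ["-fprofile-generate"],
        # 2. Training run on a representative workload; this writes the profile.
        ["./" + binary] + list(training_args),
        # 3. Rebuild with the same options, now driven by the collected profile.
        common + ["-fprofile-use"],
    ]


# Hypothetical file name and workload argument.
for cmd in pgo_lto_commands("app.c", "app", ["--benchmark"]):
    print(shlex.join(cmd))
```

The commands are only printed here, so the sketch can be inspected without a compiler installed; running them is left to the build script that calls it.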


  • Grinch
    replied
    Originally posted by carewolf

    I was told by a GCC developer once, but I just checked and also can't find any documentation saying it, so I checked the sources. Here is the code that enables extra options when -fprofile-use is given:

    I was looking at an older version of the documentation. The current options enabled by PGO (-fprofile-use), as per the docs, are: -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops, -ftracer, -ftree-vectorize, -ftree-loop-distribute-patterns.

    You were right, as we can see here, given that at least -finline-functions and -funswitch-loops, which belong to -O3, are enabled by PGO. Thanks for the heads up!


  • carewolf
    replied
    Originally posted by Grinch

    If you are right then that is very interesting, but I've seen no such information. Are you sure?

    I was told by a GCC developer once, but I just checked and also can't find any documentation saying it, so I checked the sources. Here is the code that enables extra options when -fprofile-use is given:

    static void
    enable_fdo_optimizations (struct gcc_options *opts,
                              struct gcc_options *opts_set,
                              int value)
    {
      if (!opts_set->x_flag_branch_probabilities)
        opts->x_flag_branch_probabilities = value;
      if (!opts_set->x_flag_profile_values)
        opts->x_flag_profile_values = value;
      if (!opts_set->x_flag_unroll_loops)
        opts->x_flag_unroll_loops = value;
      if (!opts_set->x_flag_peel_loops)
        opts->x_flag_peel_loops = value;
      if (!opts_set->x_flag_tracer)
        opts->x_flag_tracer = value;
      if (!opts_set->x_flag_value_profile_transformations)
        opts->x_flag_value_profile_transformations = value;
      if (!opts_set->x_flag_inline_functions)
        opts->x_flag_inline_functions = value;
      if (!opts_set->x_flag_ipa_cp)
        opts->x_flag_ipa_cp = value;
      if (!opts_set->x_flag_ipa_cp_clone
          && value && opts->x_flag_ipa_cp)
        opts->x_flag_ipa_cp_clone = value;
      if (!opts_set->x_flag_ipa_bit_cp
          && value && opts->x_flag_ipa_cp)
        opts->x_flag_ipa_bit_cp = value;
      if (!opts_set->x_flag_predictive_commoning)
        opts->x_flag_predictive_commoning = value;
      if (!opts_set->x_flag_split_loops)
        opts->x_flag_split_loops = value;
      if (!opts_set->x_flag_unswitch_loops)
        opts->x_flag_unswitch_loops = value;
      if (!opts_set->x_flag_gcse_after_reload)
        opts->x_flag_gcse_after_reload = value;
      if (!opts_set->x_flag_tree_loop_vectorize)
        opts->x_flag_tree_loop_vectorize = value;
      if (!opts_set->x_flag_tree_slp_vectorize)
        opts->x_flag_tree_slp_vectorize = value;
      if (!opts_set->x_flag_vect_cost_model)
        opts->x_flag_vect_cost_model = VECT_COST_MODEL_DYNAMIC;
      if (!opts_set->x_flag_tree_loop_distribute_patterns)
        opts->x_flag_tree_loop_distribute_patterns = value;
    }
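
The pattern in this function is that each flag is switched on only when the user did not set it explicitly on the command line (`opts_set` records the explicit settings). A minimal Python model of that logic, covering just two of the flags, is sketched below. The dictionaries and the function name `enable_fdo_defaults` are stand-ins for illustration, not GCC's actual data structures.

```python
def enable_fdo_defaults(opts, opts_set, value):
    """Set each profile-related flag to `value` unless the user set it explicitly."""
    for flag in ("unroll_loops", "inline_functions"):
        # Mirrors: if (!opts_set->x_flag_...) opts->x_flag_... = value;
        if not opts_set.get(flag):
            opts[flag] = value
    return opts


# Case 1: nothing was given explicitly, so both flags are turned on.
print(enable_fdo_defaults({}, {}, 1))

# Case 2: the user passed -fno-unroll-loops; that choice is recorded in
# opts_set, so the flag is left alone while the other one is still enabled.
print(enable_fdo_defaults({"unroll_loops": 0}, {"unroll_loops": 1}, 1))
```

Read this way, an explicit option such as -fno-unroll-loops would still take precedence over what -fprofile-use turns on by default.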


  • carewolf
    replied
    Originally posted by Grinch

    Hmmm... PGO does not enable any of the -O3 optimizations, which in turn are '-finline-functions', '-fweb', '-frename-registers'.

    PGO enables '-fbranch-probabilities', '-fvpt', '-funroll-loops', '-fpeel-loops', '-ftracer'.

    If you are right then that is very interesting, but I've seen no such information. Are you sure?

    No, it doesn't enable them as if you passed them as arguments, because when passed as arguments they are used on all code. The clever thing with PGO is that it will only use them on code profiled to be hot.

    Still, I am not sure how well it works, which is why I would love to see it benchmarked. In theory it should be basically the same as using -O3 directly, but it probably isn't, since PGO isn't as widely used, especially with -O2, so it is likely to have mistakes or bugs. Mostly I am just curious.
