Building The Linux Kernel With LLVM's Clang Yields Comparable Performance


  • #21
    Originally posted by behanw View Post
    The LLVMLinux build system actually times both clang and gcc builds of the same kernel source (amongst other things). Just be sure to unset USE_CCACHE to make it fair (we use ccache to make the gcc builds faster).

    Run "make kernel-clean kernel-build" and look at the build time at the end of the build.

    Then run "unset USE_CCACHE; make kernel-gcc-clean kernel-gcc-build" and look at the build time at the end of that build.

    Only the actual compile of the kernel is timed.
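    (If you want comparable numbers outside our build system: the timing report below is GNU time's verbose format, so something like the following - with the make targets adjusted to your tree - produces the same kind of output:)

    $ /usr/bin/time -v make -j8 vmlinux modules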

    On my i7-3840QM compiling the vexpress target to SSD I get:

    $ make list-versions | egrep 'GCC|LLVM|CLANG'
    GCC = gcc-4.8.real (Ubuntu 4.8.2-19ubuntu1) 4.8.2
    LLVM = LLVM version 3.5.0svn r209864 commit
    CLANG = clang version 3.5.0 r209859 commit
    $ make kernel-clean kernel-build
    ...
    User time (seconds): 768.03
    System time (seconds): 70.10
    Percent of CPU this job got: 598%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:19.97
    Maximum resident set size (kbytes): 117260
    ...
    $ unset USE_CCACHE; make kernel-gcc-clean kernel-gcc-build
    ...
    User time (seconds): 905.61
    System time (seconds): 69.82
    Percent of CPU this job got: 630%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:34.78
    Maximum resident set size (kbytes): 136424
    ...

    I imagine you will opt to do multiple builds of the longer x86_64 build since it will give you a better test set.

    The build system is there to do precisely these things, as well as to make it easy for people to replicate the same builds (barring the patch issues you described, which we are investigating).

    Your readers may also like to see the output of "make list-versions list-settings" which shows the versions of all the important SW involved and the settings used by the build system (git repos, branches, commit numbers, etc).

    I hope this helps.

    It is in my best interests to have 100% fair comparisons of the 2 toolchains through 3rd party benchmarks such as yours, so I am extremely happy to help in any way I can with your future benchmarks if you are amenable.

    Behan
    I assume neither gcc nor llvm is using LTO at this point --- or have both tamed their LTO memory usage enough that it is now feasible for something as large as the Linux kernel? If so, does LTO lead to interesting improvements?
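
    (For context, classic LTO defers optimization to link time; a minimal sketch on a toy project, nothing kernel-specific:)

    $ gcc -O2 -flto -c foo.c
    $ gcc -O2 -flto -c bar.c             # object files carry GIMPLE bytecode for the link-time step
    $ gcc -O2 -flto foo.o bar.o -o app   # cross-module inlining happens here, at link time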

    I expect the next great frontier, once LTO is nailed down, is profile directed feedback. GCC has that in a kinda-sorta-yeah-it-works mode where (as I understand it) it works just fine for SPEC-like code, but has problems with large projects and/or highly threaded projects. LLVM appears to be behind when it comes to PDF (even though PDF was one of the LLVM goals way back when, when Lattner designed the architecture), but, again as I understand it, this is because they want to solve the problems GCC has punted on (like I said, support for very large projects and a high degree of threading --- think Apple and Google both wanting to use PDF to profile initially their web browsers and then their OS's).
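
    (For concreteness, GCC's flavour of PDF is the familiar two-pass cycle - a toy example with hypothetical file names, nothing kernel-specific:)

    $ gcc -O2 -fprofile-generate app.c -o app   # instrumented build, writes .gcda counter files at exit
    $ ./app < training-input                    # run a representative workload
    $ gcc -O2 -fprofile-use app.c -o app        # recompile, using the counters for inlining/layout decisions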

    Comment


    • #22
      LTO/PDO (or FDO)

      GCC's PDO is used by some large projects (not just SPEC), including Firefox and Google's internal apps (https://gcc.gnu.org/wiki/LightweightIpo). For threaded code you basically want to use -fprofile-correction, which makes GCC ignore inconsistencies caused by concurrent counter updates (these are not that common). Google has some patches for GCC 4.10 to improve threading support (thread-safe but expensive profiling). As far as I know, LLVM does pretty much the same.
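
      For illustration, with a threaded program the correction flag goes on the feedback build (hypothetical file name):

      $ gcc -O2 -pthread -fprofile-generate worker.c -o worker
      $ ./worker                                # racing threads can leave the counters slightly inconsistent
      $ gcc -O2 -pthread -fprofile-use -fprofile-correction worker.c -o worker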

      Google has patches to get PDO working with the kernel; I have not tried them personally. I plan to do a bit more benchmarking on this (but currently I am stuck on libreoffice). Firefox data are at http://hubicka.blogspot.ca/2014/04/l...2-firefox.html

      LTO kernel memory usage is manageable with GCC 4.9 (below 4GB, so it builds on my box). I have not seen any really good benchmarks - there are not that many kernel workloads that are CPU-bound inside the kernel and not already heavily hand-optimized, so the wins are primarily in code size. Some data on this are in the following thread: http://lkml.iu.edu/hypermail/linux/k...4.1/00275.html
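
      If you want to check the code-size win yourself, comparing an LTO and a non-LTO image of the same config is straightforward (a sketch; the file names are hypothetical):

      $ size vmlinux.lto vmlinux.nolto          # compare the text/data/bss columns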

      Comment


      • #23
        Originally posted by hubicka View Post
        GCC's PDO is used by some large projects (not just SPEC), including Firefox and Google's internal apps (https://gcc.gnu.org/wiki/LightweightIpo). For threaded code you basically want to use -fprofile-correction, which makes GCC ignore inconsistencies caused by concurrent counter updates (these are not that common). Google has some patches for GCC 4.10 to improve threading support (thread-safe but expensive profiling). As far as I know, LLVM does pretty much the same.

        Google has patches to get PDO working with the kernel; I have not tried them personally. I plan to do a bit more benchmarking on this (but currently I am stuck on libreoffice). Firefox data are at http://hubicka.blogspot.ca/2014/04/l...2-firefox.html

        LTO kernel memory usage is manageable with GCC 4.9 (below 4GB, so it builds on my box). I have not seen any really good benchmarks - there are not that many kernel workloads that are CPU-bound inside the kernel and not already heavily hand-optimized, so the wins are primarily in code size. Some data on this are in the following thread: http://lkml.iu.edu/hypermail/linux/k...4.1/00275.html
        There was a thread on the LLVM dev list about six weeks ago discussing PDF. The essential point, as I recall it, is that the position you describe (that inconsistencies caused by concurrent updates can safely be ignored) was considered unacceptable by some particular users (obviously not all; there are many circumstances where it is just fine) --- for what they had in mind it was absolutely essential to get correct numbers.
        You can get the gist of the issue here: http://article.gmane.org/gmane.comp....vm.devel/72247
        A related issue (which appears to be on its way to a solution) is finding more efficient ways to communicate the (possibly large and complex) data structures acquired and populated by profiling to later consumers. The complication here (which may be less of an issue for GCC) is LLVM's structure as a library of tools, and the desire to allow independently created tools to all utilize this profiling information.

        Regarding LTO benefits for the kernel, thanks for the thread you referenced. Reading through it, apart from the usual talking past each other and general unwillingness to accept that "what's unimportant to you is vital for me, and vice versa", the consensus seems to be that right now, using GCC LTO on the Linux kernel results in a smaller binary (smaller by 5 to 30+% depending on precise config details) but not a noticeably faster binary.
        This suggests to me that right now most of the optimizations that LTO enables in theory are not yet enabled in practice (or they may, perhaps, be enabled for smaller targets but switched off when they would require too much RAM). In particular, I'm guessing that basic code rearrangement (pack functions that call each other as close to each other as possible) is not yet enabled. Beyond that, I expect *aggressive* code rearrangement (detect rarely utilized code paths like error-handling code, and move it out of the function to a completely different page) is neither implemented, nor useful until PDF is in place to provide the necessary data.
        [Given the fury of the arguments regarding LTO today, I can just imagine how ballistic some people will get when presented with the idea that not only has their code been rewritten/reordered by the compiler optimizer but large swathes of functions have been moved megabytes away, with strands of JMPs connecting the main code flow to these outliers far far away!]

        Comment


        • #24
          Originally posted by name99 View Post
          There was a thread on the LLVM dev list about six weeks ago discussing PDF. The essential point, as I recall it, is that the position you describe (that inconsistencies caused by concurrent updates can safely be ignored) was considered unacceptable by some particular users (obviously not all; there are many circumstances where it is just fine) --- for what they had in mind it was absolutely essential to get correct numbers.
          You can get the gist of the issue here: http://article.gmane.org/gmane.comp....vm.devel/72247
          Thanks for the pointer. GCC had patches for counters in thread-local storage (which really costs several MB per thread) and for atomic counters (which are very slow). I was thinking about adding per-function counters and flushing them to the global counters from time to time; that would perhaps work better with path-profiling-style algorithms.

          Originally posted by name99 View Post
          A related issue (which appears to be on its way to a solution) is finding more efficient ways to communicate the (possibly large and complex) data structures acquired and populated by profiling to later consumers. The complication here (which may be less of an issue for GCC) is LLVM's structure as a library of tools, and the desire to allow independently created tools to all utilize this profiling information.
          This is being worked on on the GCC side, too. Rong Xu (from Google) has contributed gcov-tool, which lets you manipulate profiles offline: https://gcc.gnu.org/ml/gcc-patches/2.../msg00667.html
          He made libgcov useful for general profile handling, not just for the runtime. I would like to get more of the functionality isolated into a library, because I plan to use it for link-time instrumentation.
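
          For instance, merging the profiles from two separate runs offline might look like this (a sketch; the directory names are hypothetical):

          $ gcov-tool merge -o merged-profile run1-profile run2-profile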

          Originally posted by name99 View Post
          Regarding LTO benefits for the kernel, thanks for the thread you referenced. Reading through it, apart from the usual talking past each other and general unwillingness to accept that "what's unimportant to you is vital for me, and vice versa", the consensus seems to be that right now, using GCC LTO on the Linux kernel results in a smaller binary (smaller by 5 to 30+% depending on precise config details) but not a noticeably faster binary.
          This suggests to me that right now most of the optimizations that LTO enables in theory are not yet enabled in practice (or they may, perhaps, be enabled for smaller targets but switched off when they would require too much RAM). In particular, I'm guessing that basic code rearrangement (pack functions that call each other as close to each other as possible) is not yet enabled. Beyond that, I expect *aggressive* code rearrangement (detect rarely utilized code paths like error-handling code, and move it out of the function to a completely different page) is neither implemented, nor useful until PDF is in place to provide the necessary data.
          [Given the fury of the arguments regarding LTO today, I can just imagine how ballistic some people will get when presented with the idea that not only has their code been rewritten/reordered by the compiler optimizer but large swathes of functions have been moved megabytes away, with strands of JMPs connecting the main code flow to these outliers far far away!]
          GCC has function splitting, which is pretty much what you call aggressive code placement (-freorder-blocks-and-partition). The code sat in the tree for ages but was disabled because of various issues - problems with unwind information, and debuggers not being ready for split function bodies. Teresa Johnson (from Google) fixed many of the issues for GCC 4.9 https://gcc.gnu.org/ml/gcc-patches/2.../msg00094.html and it is now enabled by default with PDF on some targets. My plan is to enable it by default at -O2+ and to start offloading C++ EH cleanups and other provably cold regions. We need some infrastructure updates for that, since GCC's profile code was always centered more on identifying very hot regions than on identifying very cold ones.
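
          In flag terms, building on the usual -fprofile-generate/-fprofile-use cycle (hypothetical file name):

          $ gcc -O2 -fprofile-use -freorder-blocks-and-partition app.c -o app
          $ readelf -S app | grep .text          # cold halves of split functions land in .text.unlikely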

          Function reordering is done with PDF for GCC 4.9+ (it was implemented by Martin Liska http://arxiv.org/abs/1403.6997) and it works well with Firefox, GCC itself, and others. For the non-PDF case: as one of the first experiments with Firefox I implemented a non-PDO function reordering pass in 2010. I was never able to prove any benefit on the benchmarks I tested (but I did not try terribly hard). The thing is that today's code caches are rather big, and code placement is critical only for very large applications (Firefox, libreoffice and such). Code placement is very hard on these because of the presence of indirect calls and a function profile that is difficult to estimate. This has changed since the 90s, when caches were small and program control flow within the regions of interest was much more predictable. I plan to return to it for 4.10 - for GCC 4.9 I added may edges for virtual calls http://hubicka.blogspot.ca/2014/01/d...-c-part-1.html which may make the analysis more reliable.

          GCC does the same inter-procedural optimizations with LTO as without (i.e. no passes are disabled for scalability reasons). The only optimization with significant scalability issues is ipa-PTA, and that one is not enabled even at -O3.
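
          (If anyone wants to experiment with it, ipa-PTA has to be requested explicitly - hypothetical file name:)

          $ gcc -O3 -fipa-pta foo.c -o foo       # interprocedural points-to analysis, off by default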

          Comment


          • #25
            Originally posted by hubicka View Post

            Function reordering is done with PDF for GCC 4.9+ (it was implemented by Martin Liska http://arxiv.org/abs/1403.6997) and it works well with Firefox, GCC itself, and others. For the non-PDF case: as one of the first experiments with Firefox I implemented a non-PDO function reordering pass in 2010. I was never able to prove any benefit on the benchmarks I tested (but I did not try terribly hard). The thing is that today's code caches are rather big, and code placement is critical only for very large applications (Firefox, libreoffice and such). Code placement is very hard on these because of the presence of indirect calls and a function profile that is difficult to estimate. This has changed since the 90s, when caches were small and program control flow within the regions of interest was much more predictable. I plan to return to it for 4.10 - for GCC 4.9 I added may edges for virtual calls http://hubicka.blogspot.ca/2014/01/d...-c-part-1.html which may make the analysis more reliable.
            Thanks for the tech update. I follow LLVM closely, but I only get to learn what's really happening in GCC from these sorts of posts.
            To my mind the real benefit from code reordering is not so much cache as TLB coverage, but you are right that the numbers I have in mind are all based on the early work, done in the 90s on rather different target machines. Of course the equivalent of low-end machines now is mobile, and presumably tighter cache packing, even if it doesn't result in much speedup (because of OoO and prefetching), does result in lower energy consumption. But we need this stuff to migrate to ARM and be tested there to get that information...

            Comment


            • #26
              Originally posted by name99 View Post
              Thanks for the tech update. I follow LLVM closely, but I only get to learn what's really happening in GCC from these sorts of posts.
              To my mind the real benefit from code reordering is not so much cache as TLB coverage, but you are right that the numbers I have in mind are all based on the early work, done in the 90s on rather different target machines. Of course the equivalent of low-end machines now is mobile, and presumably tighter cache packing, even if it doesn't result in much speedup (because of OoO and prefetching), does result in lower energy consumption. But we need this stuff to migrate to ARM and be tested there to get that information...
              Doing experiments on non-x86_64 architectures is definitely interesting. I did not have time for these - my prototype just seemed to work but did not show enough potential to draw my immediate attention at that time (http://arxiv.org/pdf/1010.2196.pdf). Back then LTO was still pretty green and other things took priority. I think it is an interesting topic to return to now.

              Note that gold also has a way to do code reordering: https://gcc.gnu.org/ml/gcc-patches/2.../msg01440.html
              I am, however, not aware of any benchmarks of this feature, and I have not looked into the details of the implementation yet.
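
              For reference, the mechanism gold exposes is section-level ordering; a sketch with hypothetical file and section names (the linked patch may wire things up differently):

              $ gcc -O2 -ffunction-sections -c hot.c cold.c
              $ printf '.text.hot_func\n.text.helper\n' > order.txt   # desired section order
              $ gcc -fuse-ld=gold -Wl,--section-ordering-file,order.txt hot.o cold.o -o app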

              Comment


              • #27
                Originally posted by behanw View Post
                It does seem that the wiki instructions have fallen behind. We will fix that.
                The link in the OP to the build instructions still talks about building an old kernel version, 3.3, while the downloaded scripts will attempt to fetch the very latest kernel (an RC3 version ATM)...

                Is there going to be a time when one can just download a kernel source tarball (or do the equivalent of `apt-get source linux-image-$version-generic`) and then invoke (the equivalent of) `make-kpkg` to build with the compiler of choice?

                My interest in building custom kernels is mostly to roll my own flavour of my distro's current kernel package, with optimisation for my CPU, no annoying things like AppArmor, and with the Con Kolivas patch set. Building with clang mostly has the advantage of being faster (and it also makes it simple to use my Mac workstation as a build slave without needing a dedicated cross compiler, using distcc).

                If possible, it'd be very nice if one could just download a patchfile to apply to the kernel source tree that allows building a given kernel version with clang, rather than having to use a more or less closed build system which makes it hard to apply the distro and Kolivas patches and to use one's own .config.
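
                The hoped-for end state would be something like a plain upstream build - a sketch, assuming the clang patches were already applied to the tree:

                $ cd linux-3.16     # hypothetical patched tree
                $ make CC=clang HOSTCC=clang -j$(nproc)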

                Comment


                  • #29
                    Originally posted by behanw View Post
                    It does seem that the wiki instructions have fallen behind. We will fix that.
                    If we're talking about the same thing, the page with instructions the OP links to still seems to be about Linux 3.3, while the build system is configured to fetch the latest kernel from git.

                    I wonder if and when it'll be possible simply to grab a kernel tarball from kernel.org (or use the equivalent of `apt-get source linux-image-$VERSION-generic`), do the usual personal tweaking, and then run the equivalent of `make-kpkg`.

                    I regularly get the sources of the current (K)Ubuntu kernel version I'm using that way, apply the Con Kolivas patches after the Ubuntu patches, and then build the image and headers packages with maximum optimisation for my CPU. I'd love to use Clang for that if it speeds up the build (also because with distcc and clang I can turn my Mac workstation into a build slave without needing dedicated cross compilers).
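
                    For readers unfamiliar with that flow, it looks roughly like this (a sketch from memory; the patch steps and version suffix are placeholders):

                    $ apt-get source linux-image-$(uname -r)
                    $ cd linux-*/
                    $ # apply the Ubuntu and -ck patches, tweak .config ...
                    $ make-kpkg --initrd --append-to-version=-ck kernel_image kernel_headers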

                    Doing this with llvmlinux seems hard if not impossible (I just discovered that the current clang patches fail on the 3.16.1 sources, which I'd want to use).

                    Comment


                    • #30
                      (my bad, but not sure what I did to post the same message 3x ... hope it wasn't simply because I reloaded the page in my browser!)

                      Comment
