Glibc Enables A Per-Thread Cache For Malloc - Big Performance Win


  • #21
    Originally posted by higuita View Post

    Because not all apps really improve with the new mallocs... some workloads favor one malloc, others favor another. Remember that many kinds of apps run on Linux, from render farms to memory-hungry, single-threaded, or CPU-bound programs. There is no "one size fits all".
    Also, unless you really understand how everything works, it is usually better to use the default... many apps tried switching mallocs to solve some problem and then ran into other kinds of problems. Mozilla took years to switch to and fine-tune their malloc.

    Bottom line: glibc malloc is not perfect, but it works well in most cases.
    I'd say my biggest complaint about not using something like tcmalloc or jemalloc is the lack of tools. Both of those come with pretty decent tools to graph how your program uses memory. They also give very explicit control over everything... for instance, you can set the fill value of freed memory, or you can tell the allocator to zero all uninitialized memory (don't do this in production; it's meant for testing or debugging).
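
    For instance, jemalloc lets you set those fill options straight from the application (a minimal sketch; malloc_conf is jemalloc's documented configuration hook, and the junk option assumes a jemalloc built with --enable-fill):

    Code:
    // jemalloc reads this application-provided symbol at startup.
    // "junk:true" scribbles a fill pattern over freshly allocated and freed
    // memory; "zero:true" would instead zero all new allocations.
    // Both are debugging/testing knobs, not production settings.
    extern "C" const char* malloc_conf = "junk:true";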

    In my experience, I've yet to see a practical area where jemalloc is beaten by either the NT allocator or the glibc allocator.
    That said, I usually just use the default allocator since I'm not a speed demon and it simplifies distribution. But I do sometimes hook up jemalloc just for debugging and testing purposes.

    Comment


    • #22
      Originally posted by DrYak View Post
      Genuine question:
      - Which developer in his right mind would be calling malloc in a performance-critical part of the code?
      I know this one… C++ programmers, who live by the "optimization is premature or something" mantra. A common thing I see all the time is a function that takes a full-blown heap-allocated string where a string slice (or a C string) would suffice:

      Code:
      void my::c_api_wrapper(const std::string& s) {
          c_api(s.c_str());
      }
      
      my::c_api_wrapper("string literal");
      The point where "optimization is premature or something" becomes a fallacy is that most code isn't performance-critical until it is.

      Of course, nobody writes code like this in a language with string slices, like Rust or Go. Btw, Rust uses jemalloc.
      Last edited by andreano; 09 July 2017, 12:27 PM.

      Comment


      • #23
        Originally posted by andreano View Post

        I know this one… C++ programmers, who live by the "optimization is premature or something" mantra. A common thing I see all the time is a function that takes a full-blown heap-allocated string where a string slice (or a C string) would suffice:

        Code:
        void my::c_api_wrapper(const std::string& s) {
            c_api(s.c_str());
        }
        
        my::c_api_wrapper("string literal");
        The point where "optimization is premature or something" becomes a fallacy is that most code isn't performance-critical until it is.

        This class of performance bug is much less likely in a language with string slices, like Rust or Go. Btw, Rust uses jemalloc.
        This is what C++17's std::string_view / std::basic_string_view is for. It wraps any array of char and makes it act like a std::string without having to create one. Except that it can't guarantee null termination, so you have to write your own equivalent of c_str that checks for null termination before using it.
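
        A minimal illustration of that caveat (a hedged sketch; the safe generic fallback is simply to make a terminated copy):

        Code:
        #include <string>
        #include <string_view>
        
        extern "C" void c_api(const char*) { /* stand-in for the real C function */ }
        
        // std::string_view can't guarantee null termination, so the safe
        // fallback is a terminated copy (the SSO keeps short strings off
        // the heap; long ones still allocate).
        void c_api_wrapper(std::string_view sv) {
            c_api(std::string(sv).c_str());
        }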

        In my C++ code, when I have a function like the one you listed, I usually give it a const char* overload as well and call that one from the std::string overload. That way I don't have to use c_str at the call sites. But then, I like to optimize, and I always keep an eye on c_str calls because those have led to so many, many use-after-free bugs.
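
        Something like this (a minimal sketch; c_api is just a stand-in for whatever the real C function is):

        Code:
        #include <string>
        
        extern "C" void c_api(const char*) { /* stand-in for the real C function */ }
        
        namespace my {
            // Callers passing literals bind here: no std::string is constructed.
            void c_api_wrapper(const char* s) { c_api(s); }
        
            // The std::string overload forwards, so c_str() lives in one place.
            void c_api_wrapper(const std::string& s) { c_api_wrapper(s.c_str()); }
        }
        
        int main() {
            my::c_api_wrapper("string literal");    // const char* overload: no allocation
            my::c_api_wrapper(std::string("abc"));  // std::string overload still works
        }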

        Comment


        • #24
          Originally posted by Pawlerson View Post
          Glibc was always quite slow. It's still faster than Windows NT when it comes to thread creation, but way slower than the Linux kernel, which is the fastest kernel in the world in this respect.

          This change should have a significant impact on scalability. Sysbench could be interesting.
          Yet musl is apparently slower. I'd like to see some benchmarks, though.

          Comment


          • #25
            Originally posted by andreano View Post

            I know this one… C++ programmers, who live by the "optimization is premature or something" mantra. A common thing I see all the time is a function that takes a full-blown heap-allocated string where a string slice (or a C string) would suffice:

            Code:
            void my::c_api_wrapper(const std::string& s) {
                c_api(s.c_str());
            }
            
            my::c_api_wrapper("string literal");
            The point where "optimization is premature or something" becomes a fallacy is that most code isn't performance-critical until it is.

            Of course, nobody writes code like this in a language with string slices, like Rust or Go. Btw, Rust uses jemalloc.
            I don't recall ever running into that problem with a C API wrapper in my 17 years.

            Comment


            • #26
              Originally posted by andreano View Post

              I know this one… C++ programmers, who live by the "optimization is premature or something" mantra. A common thing I see all the time is a function that takes a full-blown heap-allocated string where a string slice (or a C string) would suffice:

              Code:
              void my::c_api_wrapper(const std::string& s) {
                  c_api(s.c_str());
              }
              
              my::c_api_wrapper("string literal");
              The point where "optimization is premature or something" becomes a fallacy is that most code isn't performance-critical until it is.

              Of course, nobody writes code like this in a language with string slices, like Rust or Go. Btw, Rust uses jemalloc.

              There should be no memory allocation in your code anyway. Your string falls under the short string optimization with C++11.

              Indeed, premature optimisation is the root of all evil, or bullshit like that.
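
              (For the record, whether the SSO actually kicks in depends on the string length and the implementation; a minimal sketch, assuming libstdc++'s 15-byte buffer:)

              Code:
              #include <string>
              
              int main() {
                  std::string small = "string literal";  // 14 chars: fits in the SSO
                                                         // buffer, so no heap allocation
                  std::string big(100, 'x');             // exceeds the buffer: this one
                                                         // does call the allocator
              }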


              Comment


              • #27
                Originally posted by coder View Post
                Well, pretty much anything not dealing only with small strings and fixed-sized arrays. So, that's basically most real-world software. Most interesting data structures require dynamic allocation.
                And in my own experience/designs, in such cases you allocate a big (enough) pool of memory at the beginning and only grow the pool from time to time when it's getting close to full.

                (E.g., if your inner loop is streaming data from an input file, you do not allocate on each read: you read into a fixed buffer, and if the reading function aborts with an "end-of-buffer reached" condition, only then do you re-alloc and resume the read. Thus you only fumble with memory allocation once in a while; see the sketch below.)

                (The same reasoning applies to buffered file I/O APIs, or to memory mapping: the actual file access isn't made on every read, but only when you deplete the buffer / page-fault the memory-mapped region.)
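
                A minimal sketch of that grow-only read pattern (hypothetical names, error handling elided):

                Code:
                #include <cstddef>
                #include <cstdio>
                #include <vector>
                
                // Reads a whole file, reallocating only when the buffer fills up
                // (roughly doubling) instead of allocating on every read.
                std::vector<char> read_all(std::FILE* f) {
                    std::vector<char> buf(1 << 16);  // one up-front 64 KiB chunk
                    std::size_t used = 0;
                    for (;;) {
                        std::size_t n = std::fread(buf.data() + used, 1, buf.size() - used, f);
                        used += n;
                        if (n == 0)                      // EOF (or error): done reading
                            break;
                        if (used == buf.size())          // "end-of-buffer reached":
                            buf.resize(buf.size() * 2);  // grow once, then resume
                    }
                    buf.resize(used);                    // trim to the bytes actually read
                    return buf;
                }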

                Originally posted by Hi-Angel View Post
                I imagine Firefox might benefit. AFAIK it still has GPU acceleration disabled by default, and so is pretty CPU-heavy.
                I was referring to the general strategy that one should always avoid doing complex stuff like memory allocation or disk I/O inside time-critical inner loops.
                (At least that's what I've learned.)

                I/O should be buffered (or memory-mapped) to avoid actual disk I/O on each loop iteration.
                Memory is best allocated in reusable pools at the beginning (a toy pool sketch follows below), rather than allocated/freed on each iteration.

                I was just wondering about legitimate uses of malloc in inner loops.

                And @andreano pointed out that it might get called inadvertently by C++ object allocation and/or some C++ data conversions, if your inner loop uses objects that the compiler cannot optimize out.
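
                A toy version of such a pool (hypothetical; a real one needs exhaustion and alignment handling):

                Code:
                #include <cstddef>
                #include <vector>
                
                // One up-front allocation; acquire/release inside the loop are
                // just pointer swaps, with no malloc/free involved.
                template <typename T>
                class Pool {
                    std::vector<T>  slots;      // backing storage, allocated once
                    std::vector<T*> free_list;  // slots currently available
                public:
                    explicit Pool(std::size_t n) : slots(n) {
                        free_list.reserve(n);
                        for (auto& s : slots) free_list.push_back(&s);
                    }
                    T* acquire() {              // toy code: assumes the pool isn't exhausted
                        T* p = free_list.back();
                        free_list.pop_back();
                        return p;
                    }
                    void release(T* p) { free_list.push_back(p); }
                };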


                (DISCLAIMER: despite the nickname, I'm not only a medical doctor; I've also studied bioinformatics, and we regularly need to deal with fucktons of data.
                Rewriting some time-critical part from Perl/Python/R into C/C++ for performance reasons is something we periodically need to do.)

                Comment


                • #28
                  Originally posted by DrYak View Post

                  And in my own experience/designs, in such cases you allocate a big (enough) pool of memory at the beginning and only grow the pool from time to time when it's getting close to full.

                  (E.g., if your inner loop is streaming data from an input file, you do not allocate on each read: you read into a fixed buffer, and if the reading function aborts with an "end-of-buffer reached" condition, only then do you re-alloc and resume the read. Thus you only fumble with memory allocation once in a while.)

                  (The same reasoning applies to buffered file I/O APIs, or to memory mapping: the actual file access isn't made on every read, but only when you deplete the buffer / page-fault the memory-mapped region.)



                  I was referring to the general strategy that one should always avoid doing complex stuff like memory allocation or disk I/O inside time-critical inner loops.
                  (At least that's what I've learned.)

                  I/O should be buffered (or memory-mapped) to avoid actual disk I/O on each loop iteration.
                  Memory is best allocated in reusable pools at the beginning, rather than allocated/freed on each iteration.

                  I was just wondering about legitimate uses of malloc in inner loops.

                  And @andreano pointed out that it might get called inadvertently by C++ object allocation and/or some C++ data conversions, if your inner loop uses objects that the compiler cannot optimize out.
                  You know, it's hard to put into words… I could say that you can't buffer I/O in all use cases, and in particular browsers cannot keep memory allocated just in case. For example, when I open Firefox, it often (I guess it's Pentadactyl being odd) takes about 1 GB of memory, but then frees it. Also, you wouldn't be glad if a browser kept the memory it had recently used to render a heavy site.

                  But if I then say that those cases would benefit from the libc cache, we've got a contradiction. Or maybe not quite: the thing is, because of the above, a browser turns out to be a heavy user of memory (de)allocation, so thanks to libc it would now render pages a tiny bit faster.

                  But you're right that most apps wouldn't put malloc into time-critical loops. GTK wouldn't. Qt wouldn't. Viber wouldn't. A messenger, a music player in the background, a desktop compositor, a terminal shell, a bunch of info widgets on your panel (which are almost never written in a language requiring manual memory management) wouldn't. Just on occasion. But globally that's many occasions every second, so your power consumption might benefit, and your experience on a heavily loaded system might benefit too.

                  Originally posted by DrYak View Post
                  (DISCLAIMER: despite the nickname, I'm not only a medical doctor; I've also studied bioinformatics, and we regularly need to deal with fucktons of data.
                  Rewriting some time-critical part from Perl/Python/R into C/C++ for performance reasons is something we periodically need to do.)
                  Well, I actually thought two things: α) the "Dr" is a reference to a Ph.D., and β) based on subjective statistics, it's unlikely to be a real Ph.D. — it's like my nickname being a reference to High-Tech-Angel: I am neither an angel, nor is my environment hi-tech ☺ Though in my case hi-tech is probably a reference to ambition.

                  Comment


                  • #29
                    DrYak so, this evening I was bugged by curiosity about how much games could benefit from the cache. I quickly made wrappers around malloc and free to count and print calls, exported them through LD_PRELOAD, ran GTAⅣ on a separate X server, and… it froze. I used magic SysRq to switch back to my main graphical session and found the terminal output scrolling rapidly: the game calls malloc and free so much that my innocent prints slowed it down considerably. So I made it print only every 10th call. Then every 100th. It still didn't help; you know what did? Every 10000th (I'd put an exclamation mark here, but I'm afraid of confusion with the factorial).
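
                    The wrappers were roughly like this (a rough reconstruction, not my exact code; it assumes glibc, whose __libc_malloc/__libc_free entry points sidestep the usual dlsym bootstrapping problem):

                    Code:
                    // Build: g++ -shared -fPIC -o libcount.so count.cpp
                    // Run:   LD_PRELOAD=./libcount.so ./game
                    #include <atomic>
                    #include <cstddef>
                    #include <cstdio>
                    
                    extern "C" void* __libc_malloc(std::size_t);  // glibc's real entry points
                    extern "C" void  __libc_free(void*);
                    
                    static std::atomic<unsigned long> mallocs{0}, frees{0};
                    
                    extern "C" void* malloc(std::size_t size) {
                        unsigned long n = ++mallocs;
                        if (n % 10000 == 0)  // print only every 10000th call; anything
                            std::fprintf(stderr, "malloc calls: %lu\n", n);  // chattier crawls
                        return __libc_malloc(size);
                    }
                    
                    extern "C" void free(void* ptr) {
                        unsigned long n = ++frees;
                        if (n % 10000 == 0)
                            std::fprintf(stderr, "free calls: %lu\n", n);
                        __libc_free(ptr);
                    }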

                    Geez, GTAⅣ calls both free() and malloc() ≈500,000 times just while loading up to the game menu! 500k times! Interestingly, within this 10k granularity, the counts for free() and malloc() are the same, so it is clearly a bug. That said, the culprit might be some "gta4browser.exe" — no idea what it is for, but if I don't kill it, the counters keep rising by 10k every second.

                    But the game menu doesn't really matter, right? So I loaded a game level and walked around a room a bit, often switching back to the terminal. At this point the malloc()/free() counts start to differ, but both are still called a lot: approximately 10-20k times a second, I think.

                    That said, I don't think the libc cache would matter much for me personally with regard to this game — I am (un?)fortunately GPU-limited.

                    NB: I might be a bit dishonest at this point, because actually every time I try to optimize something, I benchmark the results with GTAⅣ. It is an anomaly of game programming: it has managed to hit probably every single bottleneck possible!

                    Comment


                    • #30
                      Originally posted by DrYak View Post
                      And in my own experience/designs, in such cases you allocate a big (enough) pool of memory at the beginning and only grow the pool from time to time when it's getting close to full.
                      Who among us has not written a buffer pool at one time or another? But why would you want to do that every time? And what if you sometimes need different-sized buffers - are you just going to multiply your pools to cover all your bases? And what if you're trying to write generic code without knowing whether it'll be used in inner loops, or whether it's appropriate to hang on to a pool of these buffers for the data structure's lifetime?

                      What I've found is that buffer pools often provide little or no measurable benefit. Sure, there will be exceptions to that, but far fewer if malloc() is doing it for you. And besides just wasting memory with your own buffer pools, you're potentially hurting cache locality by taking memory out of circulation while it's still resident in the cache hierarchy.

                      That's to say nothing about complicating your code and obfuscating the truly interesting parts. Code optimization can carry high costs, so you'd better make sure it's justified by the benefits.

                      Comment
