Benchmarking The Linux Kernel With An "-O3" Optimized Build


  • NobodyXu
    replied
    Originally posted by F.Ultra View Post

    Well, we will have to see once that work is done. What to keep in mind, though, is that software like PostgreSQL internally uses many techniques similar to io_uring.
    Yeah, we will have to wait and see, but I don't think a userspace approach similar to io_uring can outperform io_uring.

    Regardless of the approach, it needs to issue one syscall per I/O request, one by one (at best it can group reads or writes on the same fd with preadv/pwritev), while with io_uring it can submit a whole batch of I/O requests with a single syscall (or with no syscall at all when submission-queue polling is enabled), and those requests can interleave however they like.

    The kernel also has access to the underlying I/O interfaces, which are already fully asynchronous.

    There has also been an effort to optimize filesystem usage for io_uring, and I think XFS is already optimized for it.

    So I think io_uring is 100% going to be faster than PostgreSQL's homebrew solution, though in reality PostgreSQL also does a lot more than I/O, so in areas where the CPU is the bottleneck or there is too much locking, there won't be much of an improvement.
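
    For illustration only, here is a minimal sketch of the batching being described, using liburing (the file name, block size, and request count are made up for the example; assumes liburing is installed and the program is linked with -luring):

    /* Batched reads with io_uring: many requests, one submission syscall.
     * Minimal sketch; not PostgreSQL code. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NREQ  8
    #define BLOCK 4096

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        static char bufs[NREQ][BLOCK];

        int fd = open("datafile", O_RDONLY);   /* hypothetical input file */
        if (fd < 0 || io_uring_queue_init(NREQ, &ring, 0) < 0) {
            perror("setup");
            return 1;
        }

        /* Queue NREQ reads; no syscall is issued in this loop. */
        for (int i = 0; i < NREQ; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLOCK, (__u64)i * BLOCK);
            io_uring_sqe_set_data(sqe, (void *)(long)i);
        }

        /* One syscall submits the whole batch; the kernel interleaves the I/O. */
        io_uring_submit(&ring);

        /* Reap completions in whatever order they finish. */
        for (int i = 0; i < NREQ; i++) {
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            printf("request %ld completed: %d bytes\n",
                   (long)io_uring_cqe_get_data(cqe), cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }

    With IORING_SETUP_SQPOLL even the submit syscall goes away, which is the no-syscall submission mentioned above.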



  • F.Ultra
    replied
    Originally posted by NobodyXu View Post

    Oh yeah, you are right, though I think io_uring with worker threads will still bring some reduction in latency, in addition to the savings on syscalls.

    I still think that the performance improvement for PostgreSQL seen here is largely due to the reduction in context switching, since it is I/O heavy.

    Once PostgreSQL adopts io_uring, the O2-versus-O3 performance gap for PostgreSQL will be much smaller.

    And io_uring with worker threads might offer a bigger performance improvement than using O3 does here.
    Well, we will have to see once that work is done. What to keep in mind, though, is that software like PostgreSQL internally uses many techniques similar to io_uring.



  • NobodyXu
    replied
    Originally posted by F.Ultra View Post

    Yes, and everything that you wrote here is about throughput, not latency. One major part of the latency of an SQL query on a large database is waiting for I/O, and that wait has to occur regardless of whether the request is done synchronously or asynchronously.

    That benchmark that Jens did is 100% throughput and 0% latency.
    Oh yeah, you are right, though I think io_uring with worker threads will still bring some reduction in latency, in addition to the savings on syscalls.

    I still think that the performance improvement for PostgreSQL seen here is largely due to the reduction in context switching, since it is I/O heavy.

    Once PostgreSQL adopts io_uring, the O2-versus-O3 performance gap for PostgreSQL will be much smaller.

    And io_uring with worker threads might offer a bigger performance improvement than using O3 does here.



  • carewolf
    replied
    Originally posted by coder View Post
    Isn't retpoline support baked into GCC? If its optimizer breaks them, that would be a compiler bug.

    https://security.stackexchange.com/q...etpoline-flags
    True, but there are multiple different workarounds: some in the compiler, some in microcode, some in what the kernel does. They should all be safe against optimizations, with emphasis on should. As an experienced user of computers, I am wary of "should". Again, we can fix it, and definitely should in the long run, but unless there is something significant to gain, it might be more worthwhile to focus on other things first. I trust the kernel developers to know their priorities. If they decide a year from now that it is worth trying out and fixing anything that pops up, I would also support that, though I would be a little surprised.
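
    For anyone curious what the compiler-side workaround looks like in practice, here is a hedged sketch (the function names are made up for illustration; the flags shown are GCC's, and the kernel itself uses the thunk-extern variant with its own thunk implementations):

    /* An ordinary indirect call, which GCC rewrites to go through a retpoline
     * thunk when built with its Spectre v2 mitigation flags, e.g.:
     *
     *     gcc -O2 -mindirect-branch=thunk -mindirect-branch-register demo.c
     *
     * The point of the "should be safe against optimizations" discussion is
     * that the generated thunk must survive whatever -O3 does around it. */
    #include <stdio.h>

    typedef int (*op_fn)(int, int);   /* hypothetical callback type */

    static int add(int a, int b) { return a + b; }

    int run_op(op_fn fn, int a, int b)
    {
        /* Without the flags this is a plain indirect call; with them it is
         * routed through an __x86_indirect_thunk_* helper instead. */
        return fn(a, b);
    }

    int main(void)
    {
        printf("%d\n", run_op(add, 2, 3));
        return 0;
    }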



  • F.Ultra
    replied
    Originally posted by NobodyXu View Post

    I disagree; io_uring is able to execute a batch of I/O requests concurrently on one worker thread. This means that PostgreSQL no longer has to submit I/O requests one by one and can instead submit them in batches and let the kernel interleave and execute them concurrently.

    Jens Axboe was able to achieve 14M IOPS out of a single core, on a machine that has 5 NVMe SSDs (or perhaps more?).

    https://twitter.com/axboe/status/150...8YUeR8ZzUYroRQ
    Yes, and everything that you wrote here is about throughput, not latency. One major part of the latency of an SQL query on a large database is waiting for I/O, and that wait has to occur regardless of whether the request is done synchronously or asynchronously.

    That benchmark that Jens did is 100% throughput and 0% latency.



  • sinepgib
    replied
    Originally posted by carewolf View Post
    It wasn't unrolling; I think it was a more aggressive conversion of branches into conditional move instructions, which is also necessary for vectorization (and usually faster unless you create a new data dependency). While it might be theoretically possible to add such exact details to the CPU arch descriptions, the micro-op cache and queue are rather finicky, and we do not have all the details from Intel on how instructions are split into micro-ops, or how they are sometimes recombined into other micro-ops while in the queue. Generally GCC tries to avoid such details; I don't think it optimizes based on cache-line length either, though I know the Intel compiler used to do that (before it switched to just being Clang).
    Thanks for the detailed explanation. I was talking more in general, though. Like, if you -march for Atom, it would make sense to aim for more cache-friendly unrolling (shorter loops so they fit in the icache), but for a really beefy Xeon with AVX2 you would probably prefer more aggressive unrolling so you can take advantage of the bigger SIMD registers, for example. It doesn't necessarily need to go down to the lowest of the low level, but I wanted to know whether it's taken into account at all at the pre-code-generation stages.
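
    As a concrete (if simplified) illustration of the kind of decision being asked about, the same loop gets unrolled and vectorized quite differently depending on the selected -march, and GCC also accepts an explicit per-loop hint; the file name, targets, and unroll factor below are only examples:

    /* A trivially vectorizable loop; how aggressively GCC unrolls and
     * vectorizes it depends on the target, e.g.:
     *
     *     gcc -O3 -march=bonnell -c saxpy.c          (in-order Atom)
     *     gcc -O3 -march=skylake-avx512 -c saxpy.c   (wide-SIMD Xeon)
     */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
    #pragma GCC unroll 8   /* explicit hint (GCC 8+); 8 is just an example */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }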



  • coder
    replied
    Originally posted by carewolf View Post
    While it might be theoretically possible to add such exact details to the CPU arch descriptions, the micro-op cache and queue are rather finicky, and we do not have all the details from Intel on how instructions are split into micro-ops, or how they are sometimes recombined into other micro-ops while in the queue.
    That's unfortunate. You might imagine Intel could provide a first-order approximation in the form of a table mapping each instruction to its uOP count. The sizes of the uOP caches in different CPU cores don't seem to be a secret.

    Originally posted by carewolf View Post
    GCC tries to avoid such details; I don't think it optimizes based on cache-line length either,
    Hmmm.... seems like they could at least align branch targets.



  • coder
    replied
    Originally posted by discordian View Post
    That's hard to do without running the code; it could probably be done with PGO. It's plainly not possible for the "complete view": when your routine evicts cache lines that other parts of the program will use shortly after, you pay the penalty of potentially loading more than you need, and then pay one more time later.
    You're arguing a more complex case than what carewolf described. Yes, there's a limit to what a compiler could reasonably be expected to do with complex control flow, but if we're talking about a single loop that either fits in the micro-op cache or doesn't, I think that's not an unreasonable expectation. And simple loops containing zero or one internal branches are common enough to be worth optimizing for.

    Originally posted by discordian View Post
    I would not trust either PGO or humans alone, and certainly not a compiler to make good decisions.
    Human developers are generally good at knowing when a branch is either extremely likely or extremely unlikely. It's the middle ground that's tricky.

    For instance, loops designed to process lots of data are easy to flag as likely, and error-handling paths are obvious candidates for flagging as unlikely. It's plausible compilers could be imbued with AI to spot these same sorts of patterns, but I still think humans will tend to be better at this trivial level of classification.
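
    To make that trivial level of classification concrete, this is the sort of annotation being talked about; the macros below are written in the style of the kernel's likely()/unlikely(), which wrap GCC's __builtin_expect (the function and its use case here are hypothetical):

    #include <stddef.h>
    #include <stdio.h>

    /* Kernel-style branch-probability hints. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_buffer(const int *buf, size_t n)
    {
        if (unlikely(buf == NULL)) {          /* error path: flagged as cold */
            fprintf(stderr, "sum_buffer: NULL buffer\n");
            return -1;
        }

        long total = 0;
        for (size_t i = 0; i < n; i++)        /* hot data-processing loop */
            total += buf[i];
        return total;
    }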



  • coder
    replied
    Originally posted by discordian View Post
    some pretty hard-to-measure higher cache pollution (won't show up in synthetic tests).
    Well, that's what performance counters are for.
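
    For completeness, a minimal sketch of reading one such counter from userspace with perf_event_open(2), counting cache misses around a hypothetical workload() function (the event choice and the workload are placeholders):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static void workload(void) { /* code whose cache behaviour is under test */ }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        /* Count for the calling thread, on any CPU. */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        workload();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses = 0;
        if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
            printf("cache misses: %llu\n", (unsigned long long)misses);

        close(fd);
        return 0;
    }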



  • coder
    replied
    Originally posted by carewolf View Post
    Spectre is about how the hardware instructions operate. They lie outside the C standard and are not subject to standard C rules, so yes, optimizations can break them.
    Isn't retpoline support baked into GCC? If its optimizer breaks them, that would be a compiler bug.

    As of version 8 (later backported to 7.3), GCC has added retpoline support [0]. While I understand that it is intended[citation needed] for use in kernel patching for Spectre (ie: [1][2]), that doe...

