DragonFlyBSD Is Seeing Better Performance Following A Big VM Rework
DragonFlyBSD lead developer Matthew Dillon has been reworking the virtual memory (VM) infrastructure within their kernel and it's leading to measurable performance improvements.
This mailing list post outlines the restructuring of the kernel's VM pmap code, which can conserve memory, help processes that share large amounts of memory, and improve concurrent page fault performance. The performance bits are what we're after, and they appear quite compelling, at least in Dillon's testing so far on both big (Threadripper) and small (Raven Ridge) AMD test systems:
These changes significantly improve page fault performance, particularly under heavy concurrent loads.
* kernel overhead during the 'synth everything' bulk build is now under 15% system time. It used to be over 20%. (system time / (system time + user time)). Tested on the threadripper (32-core/64-thread).
* The heavy use of shared mmap()s across processes no longer multiplies the pv_entry use, saving a lot of memory. This can be particularly important for postgres.
* Concurrent page faults now have essentially no SMP lock contention and only four cache-line bounces for atomic ops per fault (something that we may now also be able to deal with, using the new work as a basis).
* Zero-fill fault rate appears to max-out the CPU chip's internal data busses, though there is still room for improvement. I top out at 6.4M zfod/sec (around 25 GBytes/sec worth of zero-fill faults) on the threadripper and I can't seem to get it to go higher. Note that obviously there is a little more dynamic ram overhead than that from the executing kernel code, but still...
* Heavy concurrent exec rate on the TR (all 64 threads) for a shared dynamic binary increases from around 6000/sec to 45000/sec. This is actually important, because bulk builds exec large numbers of processes.
* Heavy concurrent exec rate on the TR for independent static binaries now caps out at around 450000 execs per second. Which is an insanely high number.
* Single-threaded page fault rate is still a bit wonky but hit 500K-700K faults/sec (2-3 GBytes/sec).
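To put those zero-fill numbers in context: a zero-fill fault (zfod) is taken the first time a process touches a page of a fresh anonymous mapping, forcing the kernel to allocate and zero a page for it. A microbenchmark for this is little more than mmap() plus a loop that writes one byte per page. The sketch below is our own illustration of that pattern, not Dillon's actual test program; the mapping size and timing code are arbitrary choices.

/*
 * Minimal sketch of a zero-fill (zfod) fault microbenchmark.
 * The 1 GiB mapping size and the timing loop are illustrative
 * assumptions, not the harness used for the numbers quoted above.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = 262144;		/* ~1 GiB at 4K pages */
	size_t len = npages * pgsz;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);

	/*
	 * The first write to each page of a fresh anonymous mapping
	 * takes a zero-fill fault: the kernel must allocate and zero
	 * a page before the store can complete.
	 */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (size_t i = 0; i < npages; i++)
		p[i * pgsz] = 1;	/* touch one byte per page */
	munmap(p, len);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
	    (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.0f zfod faults/sec (%.2f GB/sec zero-fill)\n",
	    npages / secs, (double)len / secs / 1e9);
	return 0;
}

A real harness would repeat the map/touch/unmap cycle and run one copy per hardware thread to produce concurrent figures like the 6.4M zfod/sec quoted above.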
--
Small system comparison using a Ryzen 2400G (4-core/8-thread), release vs master (this includes other work that has gone into master since the last release, too):
* Single threaded exec rate (shared dynamic binary) - 3180/sec to 3650/sec
* Single threaded exec rate (independent static binary) - 10307/sec to 12443/sec
* Concurrent exec rate (shared dynamic binary x 8) - 15160/sec to 19600/sec
* Concurrent exec rate (independent static binary x 8) - 60800/sec to 78900/sec
* Single threaded zero-fill fault rate - 550K zfod/sec -> 604K zfod/sec
* Concurrent zero-fill fault rate (8 threads) - 1.2M zfod/sec -> 1.7M zfod/sec
* make -j 16 buildkernel test (tmpfs /usr/src, tmpfs /usr/obj):
4.4% improvement in overall time on the first run (6.2% improvement on subsequent runs). system% 15.6% down to 11.2% of total cpu seconds. This is a kernel overhead reduction of 31%. Note that the increased time on release is probably due to inefficient buffer cache recycling.
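The exec-rate figures are similarly easy to picture: a handful of worker processes doing nothing but fork(), exec a tiny binary, and wait in a loop, with aggregate execs per second reported at the end. The sketch below is an illustrative stand-in, not the benchmark Dillon ran; the worker count, iteration count, and the use of /usr/bin/true as the target are assumptions. Pointing it at a statically linked binary instead would approximate the "independent static binaries" case.

/*
 * Minimal sketch of a concurrent exec-rate microbenchmark.
 * WORKERS, ITERS and the /usr/bin/true target are illustrative
 * assumptions only.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define WORKERS	8	/* e.g. one per hardware thread on the 2400G */
#define ITERS	5000	/* fork/exec/wait cycles per worker */

static void
worker(void)
{
	char *argv[] = { "true", NULL };

	for (int i = 0; i < ITERS; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			execv("/usr/bin/true", argv);
			_exit(127);	/* exec failed */
		}
		waitpid(pid, NULL, 0);
	}
}

int
main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int w = 0; w < WORKERS; w++) {
		if (fork() == 0) {
			worker();
			_exit(0);
		}
	}
	while (wait(NULL) > 0)	/* reap all workers */
		;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
	    (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.0f execs/sec\n", (double)WORKERS * ITERS / secs);
	return 0;
}

This is essentially the workload a bulk package build generates constantly, which is why the jump from roughly 6,000 to 45,000 execs/sec on the Threadripper matters in practice.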
DragonFlyBSD appears on track for a great 2019, with other recent accomplishments including prompt handling of the MDS/Zombieload mess, DRM code updates, HAMMER2 improvements, flipping on compiler-based Retpoline support, and FUSE work.