Suggestions for improving the pts/postgresql test


  • #11
    400 concurrent clients on a 12-core/24-thread CPU is, quite frankly, nuts. CPU utilization never tops 40%, with significant I/O wait times, and I see delays of up to 2s per statement due to locking.

    To test the currently derived test configuration across multiple setups, I will iterate over different file systems: btrfs, ext4, and xfs. A sketch of the sweep follows.
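    For anyone wanting to reproduce the sweep, a minimal sketch of how it could be scripted; the device path /dev/nvme1n1, the mount point /mnt/pgdata, and the trailing steps are placeholder assumptions, not my actual harness:

      #!/bin/bash
      # Recreate PGDATA on each file system under test.
      DEV=/dev/nvme1n1      # assumed dedicated test device
      MNT=/mnt/pgdata       # assumed mount point
      for FS in btrfs ext4 xfs; do
          sudo umount "$MNT" 2>/dev/null
          sudo wipefs -a "$DEV"            # clear old fs signatures
          sudo "mkfs.$FS" "$DEV"
          sudo mount "$DEV" "$MNT"
          sudo chown postgres: "$MNT"
          sudo -u postgres initdb -D "$MNT/data"
          # ...start PostgreSQL, apply config, run the pgbench scenario...
      done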

    Latency is drastically improved with ext4 over btrfs (6.8ms vs. 13.7ms), which leads to significant tps improvements. Interestingly, peak throughput on ext4 is achieved with even more clients (600). xfs improves latency further still (3.5ms), with much improved peak throughput at 400 clients.
    CPU utilization is improved, but still barely exceeds 50%.
    The WAL reaches a max of 8.5GB.
    The longest delay due to locking shrinks to ~200ms with xfs.


    • #12
      Continuing the theme of testing the updated test definition, I ran it across the NVMe drives accessible to me:
      1. Samsung 980 Pro 1TB
      2. WD SN850 1TB
      3. Intel Optane 905 960GB
      The test configuration uses the xfs file system. PostgreSQL is configured with default settings, plus
      • shared_buffers=16GB (1/4 of machine memory)
      • max_wal_size=100GB
      • max_connections=1500
      • log_checkpoints=on (irrelevant to test performance, but good for troubleshooting)
      • log_lock_waits=on (irrelevant to test performance, but good for troubleshooting)
      • deadlock_timeout=100ms (irrelevant to test performance, but good for troubleshooting)
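      Collected as a postgresql.conf snippet for anyone reproducing this (values verbatim from the list above; everything else stays at its default):

        # postgresql.conf additions -- all other settings at defaults
        shared_buffers = 16GB        # 1/4 of machine memory
        max_wal_size = 100GB
        max_connections = 1500
        log_checkpoints = on         # troubleshooting only
        log_lock_waits = on          # troubleshooting only
        deadlock_timeout = 100ms     # troubleshooting only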
      pgbench was configured with
      • scale=3210 (~48GB dataset, ~3/4 of machine memory)
      • protocol=prepared
      • time-based execution=30s
      Each test execution would
      • warm up the caches with a 30s run of 20 clients (data discarded)
      • force a checkpoint (SQL run as superuser) before each execution of pgbench
      • measure the size of the WAL before each test execution (via SQL; see the sketch below)
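      A minimal sketch of that per-run procedure, assuming the database is named test and superuser access via sudo -u postgres (pg_ls_waldir() needs PostgreSQL 10+ and superuser rights):

        #!/bin/bash
        DB=test
        CLIENTS=$1                        # workload size under test

        # warm up the caches; this result is discarded
        pgbench -c 20 --protocol=prepared -T 30 "$DB" > /dev/null

        # force a checkpoint so every run starts from a clean WAL state
        sudo -u postgres psql "$DB" -c 'CHECKPOINT;'

        # record the WAL size before the measured run
        sudo -u postgres psql "$DB" -Atc \
            "SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();"

        # measured run
        pgbench -c "$CLIENTS" --protocol=prepared -T 30 "$DB"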
      I ran this scenario multiple times because I could not believe the performance of the WD SN850, and I had to make the x-axis logarithmic to separate its curve from the Optane drive's.
      Observations:
      • Samsung 980 Pro SSD achieves > 35,000 tps, which is 10x the PTS result in the stock Fedora configuration (btrfs)
      • Samsung achieves highest throughput at a load of 400 clients (on a 24 thread system)
      • WD SN850 achieves max throughput (>40,000 tps), at a workload of 75 clients
      • Optane 905 achieves max throughput (>41,000 tps), at a workload of 40 clients
      • Average latency is very close between the SN850 and the Optane; the 980 Pro is in a different class
      • Minimum latency is a paltry 0.2ms for Optane, 0.5ms for SN850, both achieved at the lowest tested workload of 5 clients.
      • CPU utilization on the AMD Ryzen 9 5900X barely broke 50%
      • Disk IO had a mostly 1:2 read/write ratio. Max throughput was about 450MB/s read vs. 900MB/s write.
      • The Optane drive had only a single transaction delayed for more than 100ms; it occurred at a workload of 300 clients
      • The WD SN850 similarly had only a few transactions delayed, for a max of 165ms
      • The Samsung 980 Pro had fewer than 30 transactions delayed, for a max of 242ms (very impressive; it makes the point that transaction locking is well controlled)
      • The WAL reached a max size of ~10GB


      • #13
        Michael - even without reading all my posts you can tell that I had a lot of fun running the pgbench benchmark.

        TL;DR
        I recommend expanding the current pts/postgresql benchmark with the following two independent options:
        1. Light read-only workload
          Definition: High-concurrency OLTP lookup workload against small dataset (~1.5GB).
          Technical implementation: pgbench --select-only --protocol=prepared -c $((num_hw_threads * 4)) -T 120
        2. TPC-B-like OLTP workload
          Definition: High-concurrency OLTP read-write workload against a medium-size dataset (sized to ~3/4 of machine memory). Requires up to 100GB of disk space in addition to the dataset.
          Technical implementation: scale=$(echo "mem = $(grep MemTotal /proc/meminfo | grep -o '[[:digit:]]*')/1024; s = mem/15*0.75; scale = 0; (s+0.5)/1" | bc -l); pgbench -i -s $scale test; psql test -c 'checkpoint;'; pgbench -c $((num_hw_threads * 4)) --protocol=prepared -T 60 test (expanded into a readable sketch after this list)
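        Written out as a readable script, the two implementations look like this; $(nproc) stands in for num_hw_threads, the database name test is an assumption, and --select-only is pgbench's read-only (-S) built-in script:

          #!/bin/bash
          DB=test
          CLIENTS=$(( $(nproc) * 4 ))

          # Option 1: light read-only lookup workload
          pgbench --select-only --protocol=prepared -c "$CLIENTS" -T 120 "$DB"

          # Option 2: TPC-B-like read-write workload, dataset ~3/4 of memory.
          # One pgbench scale unit occupies roughly 15MB on disk.
          MEM_MB=$(( $(grep MemTotal /proc/meminfo | grep -o '[[:digit:]]*') / 1024 ))
          SCALE=$(( MEM_MB * 3 / 4 / 15 ))   # integer truncation is close enough
          pgbench -i -s "$SCALE" "$DB"
          psql "$DB" -c 'CHECKPOINT;'
          pgbench --protocol=prepared -c "$CLIENTS" -T 60 "$DB"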
        Since my background is in analytical workloads I may propose an OLAP workload based on custom scripts at a later time, but that will have to wait for a while.

        Discussion:
        • Adding workload-based tests will be beneficial, as they are designed to scale with the tested hardware configuration and are more meaningful to the user/reader of the results.
        • Adding new options to the test keeps the existing test results on Phoronix.com valid.
        • The biggest concern to me is the increased storage requirement. Hundreds of GB of storage space is not much for any modern database platform, but potentially a show-stopper for the current user base of PTS. Arguably this is what matures the test into a meaningful exercise. I wonder if PTS already has a mechanism to check for free disk space and can limit the test requirement (e.g. the max_wal_size parameter) to the available space; a sketch of such a check follows this list.
        • These tests were designed with the smallest number of modifications to an out-of-the-box install in mind. Each change resulted in a significant increase in performance. Without sharing the results in this thread, I have evaluated a range of additional options; while they did measurably improve performance, none had such a profound effect.
        • Running regular checkpoints has been key to highly repeatable results for the default "tpcb-like" workload. I found them to be processed quickly (a few seconds at most). Especially for this workload I recommend keeping the "vacuum" step, as this process represents critical table maintenance for optimal performance. Again, this did not have a significant impact on overall test duration.
        • Consider having a quick warm-up run to load data into the shared buffer caches. 20-30s duration is sufficient.
        • To balance out the increased test duration due to warm-up and frequent checkpointing, it is quite reasonable to reduce the overall pgbench test duration parameter to 30-60s. Across hundreds of test runs my results have been very consistent.
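        On the free-disk-space point above, a pre-flight check is easy to sketch in shell. This is illustrative only: the PGDATA path and the policy of capping max_wal_size at half the free space on that volume are my assumptions, not an existing PTS feature as far as I know:

          #!/bin/bash
          PGDATA=/var/lib/pgsql/data        # assumed data directory
          # free space on the PGDATA volume, in GB
          FREE_GB=$(df --output=avail -BG "$PGDATA" | tail -1 | tr -d 'G ')
          # cap max_wal_size at half the free space, up to the 100GB target
          WAL_GB=$(( FREE_GB / 2 < 100 ? FREE_GB / 2 : 100 ))
          sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '${WAL_GB}GB';"
          sudo -u postgres psql -c "SELECT pg_reload_conf();"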
