Suggestions for improving the pts/postgresql test

  • Suggestions for improving the pts/postgresql test

    TL;DR:
    Michael, please update or enhance your pts/postgresql test to use prepared query mode.
    Also update the pts/postgresql test to allow significantly larger numbers of clients (up to 5000) to account for modern hardware capabilities in your public tests.

    Long story...

    PostgreSQL is a very mature open-source RDBMS that can be operated effectively across a huge span of hardware configurations, from a low-powered Raspberry Pi (Arm) to dual-EPYC servers with hundreds of cores.
    PostgreSQL can also be tuned to support a large range of usage profiles, from fast-paced OLTP lookups in small data sets to complex analytics queries across terabytes of data.
    The latter, I assume, is the reason why the default configuration of PostgreSQL is rather limiting and requires adjustment before any meaningful conclusion can be drawn about the performance of the software.

    Following the 320-CPU-thread limit of Clear Linux discussed in the recent AMD EPYC Genoa articles, I wondered how the postgresql test configuration in those articles was affected, given that only 250 clients were simulated.
    I assumed that an optimal result would be achieved with at least as many simulated clients as the hardware has compute threads. But how many? I decided this was the opportunity to look under the covers of the pts/postgresql test.

    To figure that out I went into my home lab, ran the test on my hardware first, and then looked at the options and implementation of the PTS test.

    pgbench is a relatively simple tool that by default implements a light OLTP workload loosely based on the well-known TPC-B benchmark. It can be extended with custom scripts to simulate other workloads.
    When simulating the workload from the article (scale 100, 250 clients, select-only), the first thing I noticed was that by default the tool uses only a single thread for load generation. That made it impossible to fully stress my AM4 Ryzen 5900X CPU. With the -j option, pgbench can scale load generation across multiple threads. I quickly observed that it takes 5 threads to saturate my 24-thread CPU - meaning ~20% of CPU capacity is consumed by load generation alone. Interesting.
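For reference, a sketch of the flags involved (the database name test and the 60s duration are my assumptions, not values from the articles):

```shell
# Build the pgbench command line for a select-only run.
# -S : built-in select-only script
# -c : number of simulated clients
# -j : worker threads for load generation (defaults to 1!)
# -T : run duration in seconds
clients=250
jobs=5
echo "pgbench -S -c ${clients} -j ${jobs} -T 60 test"
```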

    Reading further in the pgbench man page, I stumbled across the query mode parameter (-M). By default pgbench sends the full query text to the server for every transaction (simple mode). It also supports an optimized prepared query mode which caches the "parse analysis", meaning it skips the step of compiling an execution plan from the query text on repeated executions. This is a very important optimization in transactional workloads.
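Switching modes is a single flag; a sketch of the two variants (commands shown via echo - run them against your own database):

```shell
# Simple mode (the default): the query text is parsed and planned every time.
echo "pgbench -S -c 250 -j 5 -M simple -T 60 test"
# Prepared mode: statements are prepared once per session and then reused.
echo "pgbench -S -c 250 -j 5 -M prepared -T 60 test"
```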

    Wondering if pts/postgresql takes advantage of this, I looked at the test code. To my delight I saw that
    1. postgresql is using the generally recommended 25% of main memory for the shared_buffers allocation. Good!
    2. postgresql allows 500 connections, required for a successful simulation of that many clients.
    3. pgbench is using num_of_cores threads to simulate the workload (-j parameter). Good!
    4. pgbench is executing the test in the default simple query mode.

    I decided to run tests with the goal of evaluating how many clients it takes for optimal results. I did that with a simple logarithmic sweep over the client parameter (-c), starting at 5 and ending at 500 (one step beyond the workload given to the dual-EPYC server).
    Knowing that huge pages can have an effect on performance, I added a test run with huge pages (2MB) enabled.
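The sweep can be scripted; a minimal sketch (the client steps are the ones from my range; everything else as above):

```shell
# Logarithmic-ish sweep over the client count (-c), from 5 to 500.
for c in 5 10 25 50 100 250 500; do
  echo "pgbench -S -M prepared -c ${c} -j 5 -T 60 test"
done
```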

    My test configuration:
    CPU: AMD Ryzen 9 5900X
    Mobo: ASUS WS-X570-ASE
    RAM: 2x 32GB DDR4-4000
    Storage: 1x Samsung 980 Pro 1TB
    OS: Fedora 37, fully updated with default postgresql install except for adjustment in number of connections (500) and shared buffers (16GB) - same as in PTS configuration.
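In postgresql.conf terms, the two adjustments are (values from my setup above):

```
# postgresql.conf - adjustments matching the PTS configuration
max_connections = 500     # allow up to 500 concurrent clients
shared_buffers = 16GB     # ~25% of 64GB main memory
```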
    [image: benchmark results graph]

    Observations:
    • tps are significantly lower when clients < # CPU threads (24).
    • the highest tps is reached at 250 clients in all tests - that's 10x the CPU thread count.
    • prepared query mode has about half the latency of simple mode and conversely achieves about twice the tps in fully loaded configurations (>= 25 clients).
    • huge pages improve latency and tps in prepared query mode, but not significantly.
    • the maximum tps topped 1M transactions per second on my 12-core CPU.
    Conclusion:
    The simulated load pattern queries a relatively small dataset (scale 100, ~1.5GB of data) repeatedly. Most hardware configurations should handle this workload entirely in memory. If the hardware has 8GB or more of main memory, the test runs in access-optimized shared buffers in postgresql. This configuration is well suited to finding the largest number of supported transactions for any given hardware. However, maximum tps requires the use of prepared query mode.

    A naive reading might suggest that simple vs. prepared mode simulates a much larger population of clients that happen to arrive 250 at a time, vs. a population of 250 clients that repeatedly query the system. In reality, however, simple mode just represents non-optimized software.

    The less than 3M tps documented on Phoronix across two 96-core EPYC server CPUs pales in comparison to the 1M tps reached on a last-gen 12-core desktop CPU. The likely culprit is that the test scenario was chosen too small to saturate such monster hardware. Adding a simple tuning parameter promises to double the tps yet again.

    It is time to update the pts/postgresql test definition!

  • #2
    Just to be sure I am not missing something, for the prepared query mode it's just a matter of adding --protocol=prepared to the pgbench options, correct? Also can easily extend the client count as well. Thanks.
    Michael Larabel
    https://www.michaellarabel.com/



    • #3
      Originally posted by Michael View Post
      Just to be sure I am not missing something, for the prepared query mode it's just a matter of adding --protocol=prepared to the pgbench options, correct? Also can easily extend the client count as well. Thanks.
      Yep - that's it!



      • #4
        Michael - if I may provide some constructive criticism on the pts suite...

        The breadth of pts is staggering, and few, if any, will understand all the things that can be tested with it.
        Most tests have parameters of a highly technical nature. One needs inherent insight into the tested software to make sound decisions.

        I am working on a suggestion to change this for postgresql: moving away from technical specifications of scale and clients toward offering a list of simulated use cases.

        The use case discussed above is valid - it represents a read-only, in-memory (in all but the smallest hw configurations) OLTP workload. Based on my documented test results it would be easy to set pgbench parameters for any hardware configuration: simply choose "--scale=100 --client=4*$NR_CPU_CORES --protocol=prepared".
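A minimal sketch of that auto-tuning idea (hypothetical; note that nproc counts hardware threads, so the multiplier may need adjusting if physical cores are intended):

```shell
# Derive the client count from the machine itself: 4x the CPU count,
# as proposed above. Worker threads (-j) set to the CPU count.
cores=$(nproc)
clients=$((4 * cores))
echo "pgbench -S -M prepared -c ${clients} -j ${cores} -T 60 test"
```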

        Similarly, I think I can define a meaningful read-write workload that is auto-tuned to a given hardware. That would enable meaningful comparison across hw and software configurations.

        Will present my proposal in a followup post in this thread.



        • #5
          Looking at the current PTS implementation of the default operation of pgbench, the "tpcb-like" script, we first need to recognize that this script contains select, insert, update, and delete operations. Invariably, data will be altered, which means this test stresses more system components than the simpler "select-only" script.

          I see several issues in the current test implementation that lead to lower-than-possible test results:
          1. Table locking. The pgbench man page explains that a number of clients higher than the scale factor will lead to lock contention in the default pgbench script. At that point the benchmark mostly measures how long rows stay locked before transactions complete.
            I tried this out with scale=50 (smaller than the optimal number of clients from my first post) and scale=3210 (a database sized at ~3/4 of memory). As the graph below shows, both latency and tps are much improved with the larger test database.
            [image: scale comparison graph]



          • #6
            2. Write-ahead log (WAL). WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage.
            Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. Checkpoints are triggered either after a set timeout has occurred or when the allotted space for the WAL is full. Checkpoints are fairly expensive: first because they require writing out all currently dirty buffers, and second because they result in extra subsequent WAL traffic.
            In the default configuration the WAL is too small, resulting in too-frequent checkpoints. Checkpoints that occur during a test run lower the test score because they take disk resources away from the test. And because checkpoints in the default configuration are triggered at different points within and between test runs, the results are not very consistent. Production deployments tune the WAL such that checkpoints are expected to be time-triggered based on expected loads, and the checkpoint frequency is monitored.
            I tried this out with the default WAL configuration and with the WAL size increased to the point that a checkpoint is guaranteed not to be triggered for the duration of the test. I also forced a checkpoint before every run of pgbench. As a result the latencies are lower and more consistent, and a 60% higher tps is achieved.



            • #7
              Originally posted by jochendemuth View Post
              I tried this out with the default WAL configuration and with the WAL size increased to the point that a checkpoint is guaranteed not to be triggered for the duration of the test. I also forced a checkpoint before every run of pgbench. As a result the latencies are lower and more consistent, and a 60% higher tps is achieved.
              What was the WAL configuration values you used? Thanks.
              Michael Larabel
              https://www.michaellarabel.com/



              • #8
                Originally posted by Michael View Post

                What was the WAL configuration values you used? Thanks.
                I set max_wal_size = '100GB'.

                I created a checkpoint by issuing the command "checkpoint;" as superuser. Here as a command ahead of the pgbench command:
                Code:
                $ psql test -c 'checkpoint;'; pgbench ...
                It's possible to query the amount of WAL used by postgresql with the following command:
                Code:
                $ psql test -c 'select sum(size) from pg_ls_waldir();'
                In my testing WAL reached about 4GB in size. The size is dependent on the number of transactions between checkpoints. Fast setups will consume more.



                • #9
                  An upper bound for the max_wal_size parameter is given by the sequential write throughput of the storage medium multiplied by the test duration.

                  So, in the case of the dual AMD EPYC Genoa on an Optane P5800X, I'd assume the WAL will take about 7GB/s for 120s, meaning up to ~1TB of storage. Hard-coding this amount of storage would not be a great choice for PTS. I wonder if you have a way to query free storage capacity in the setup phase and, more crucially, to delete the postgresql folder after a successful test run to free up space.
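Querying free capacity from a setup script is straightforward; a sketch (the data-directory path is a placeholder - substitute the real PGDATA):

```shell
# Free space, in bytes, on the filesystem holding the data directory.
datadir="."   # placeholder; e.g. /var/lib/pgsql/data
avail=$(df -B1 --output=avail "$datadir" | tail -n 1 | tr -d ' ')
echo "$avail"
```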

                  I personally find that pgbench reaches steady state relatively quickly, so I only run it for 30s at a time.

                  I used the following bash calculation to initialize pgbench with a database of ~3/4 of memory size. I assume that PTS has an easier way to do that.

                  Code:
                  $ scale=$(echo "mem = $(grep MemTotal /proc/meminfo | grep -o '[[:digit:]]*')/1024; s = mem/15*0.75; scale = 0; (s+0.5)/1" | bc -l)
                  $ pgbench -i -s $scale test
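The same calculation can be done without bc, e.g. with awk (an equivalent sketch, assuming the same ~15MB per scale unit as above):

```shell
# MemTotal is reported in kB; one pgbench scale unit is roughly 15 MB.
# Target a database of ~3/4 of main memory.
scale=$(awk '/MemTotal/ { printf "%d", $2 / 1024 / 15 * 0.75 }' /proc/meminfo)
echo "$scale"
```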



                  • #10
                    The last set of tests still reaches only ~40% CPU utilization, completely bottlenecked by the default storage config on Fedora 37. Now I'm adding the current default optimizations (shared_buffers = 1/4 of memory) and the suggested option from the select-only test (--protocol=prepared) for completeness.
                    We can see that these optimizations predictably lead to slightly lower latency and, especially at high concurrency (400 clients), to significantly higher tps.

                    [image: read-write workload results graph]

                    Sorry about spreading the information across many posts, but my account is limited to one picture per post and the graphs really tell the story.

