TL;DR:
Michael, please update or enhance your pts/postgresql test to use prepared query mode.
Also, please allow the pts/postgresql test to run with significantly larger client counts (up to 5000) to account for modern hardware capabilities in your public tests.
Long story...
PostgreSQL is a very mature open-source RDBMS that can be operated effectively across a huge span of hardware configurations, from a low-powered Raspberry Pi (Arm) to dual-EPYC servers with hundreds of cores.
PostgreSQL can also be tuned to support a wide range of usage profiles, from fast-paced OLTP lookups on small data sets to complex analytics queries across terabytes of data.
The latter, I assume, is why the default configuration of PostgreSQL is rather conservative and requires adjustment before any meaningful conclusion can be drawn about the software's performance.
Following the 320-CPU-thread limit of Clear Linux discussed in the recent AMD EPYC Genoa articles, I wondered how the PostgreSQL test configuration in those articles was affected, given that only 250 clients were simulated.
I assumed that an optimal result would require at least as many simulated clients as the hardware has compute threads. But how many exactly? I decided this was the opportunity to look under the covers of the pts/postgresql test.
To figure that out, I went into my home lab, ran the test on my own hardware first, and then looked at the options and implementation of the PTS test.
pgbench is a relatively simple tool that by default implements a light OLTP workload loosely based on the industry-standard TPC-B benchmark. It can also be extended with custom scripts to simulate other workloads.
When simulating the workload of the article (scale 100, 250 clients, select-only), the first thing I noticed was that by default the tool uses only a single thread for load generation. That made it impossible to fully stress my AM4 Ryzen 5900X CPU. With the -j option, pgbench can scale load generation across multiple threads, and I quickly observed that it takes 5 threads to saturate my 24-thread CPU. That means ~20% of CPU capacity is needed for load generation - interesting.
Reading further in the pgbench man page, I stumbled across the query mode parameter (-M). By default, pgbench sends the full query text with every transaction, so the server must parse and plan each statement anew. It also supports a prepared query mode that caches the "parse analysis", skipping the step of compiling an execution plan from the query text on repeated executions. This is a very important optimization for transactional workloads.
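For reference, the query mode is selected on the pgbench command line roughly like this (a sketch that only prints the commands, since actually running them needs a live server; the database name "pgbench", the 60-second duration, and the client/thread counts are my assumptions, not taken from the PTS test):

```shell
# Print the pgbench invocation for each query protocol mode (-M).
# -S selects the built-in select-only script; "simple" is the default mode.
for mode in simple extended prepared; do
  echo "pgbench -S -M $mode -c 250 -j 5 -T 60 pgbench"
done
```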
Wondering whether pts/postgresql takes advantage of this, I looked at the test code. I saw that:
1. PostgreSQL is configured with the generally recommended 25% of main memory for shared_buffers. Good!
2. PostgreSQL allows 500 connections, which is required to successfully simulate that many clients.
3. pgbench uses num_of_cores threads for load generation (-j parameter). Good!
4. pgbench executes the test in the default simple query mode.
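The 25%-of-RAM rule from item 1 can be sketched as follows (my reconstruction for Linux, not the actual PTS code):

```shell
# Read total memory (reported in kB) from /proc/meminfo and take 25% of it.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
shared_buffers_mb=$((mem_kb / 4 / 1024))  # 25% of RAM, converted from kB to MB
echo "shared_buffers = ${shared_buffers_mb}MB"
```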
I decided to run tests with the goal of finding how many clients it takes to reach optimal results. I did that with a simple logarithmic sweep over the client parameter (-c), starting at 5 and ending at 500 (one step beyond the workload given to the dual-EPYC server).
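The sweep boils down to a loop over -c values like this (a sketch that echoes the commands instead of executing them; it assumes a scale-100 database initialized with pgbench -i -s 100, and the 60-second duration per step is my choice):

```shell
# Roughly logarithmic client steps from 5 to 500; -j 5 matches the
# load-generation thread count that saturated my 24-thread CPU.
for clients in 5 10 25 50 100 250 500; do
  echo "pgbench -S -M prepared -c $clients -j 5 -T 60 pgbench"
done
```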
Knowing that huge pages can have an effect on performance, I added a test run with huge pages (2MB) enabled.
My test configuration:
CPU: AMD Ryzen 9 5900X
Mobo: ASUS WS-X570-ASE
RAM: 2x 32GB DDR4-4000
Storage: 1x Samsung 980 Pro 1TB
OS: Fedora 37, fully updated, with a default PostgreSQL install except for adjustments to the number of connections (500) and shared buffers (16GB) - the same as the PTS configuration.
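The corresponding configuration adjustments looked roughly like this (an excerpt reconstructed from memory; the huge_pages line applies only to the huge-page run, and the vm.nr_hugepages sizing is my estimate for a 16GB buffer pool backed by 2MB pages):

```ini
# postgresql.conf (excerpt)
max_connections = 500
shared_buffers = 16GB
huge_pages = on            # only for the huge-page test run

# /etc/sysctl.d/hugepages.conf
# 16GB / 2MB = 8192 huge pages, plus a little headroom
vm.nr_hugepages = 8500
```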
[image.png: tps and latency results across client counts]
Observations:
- tps is significantly lower when clients < # CPU threads (24).
- the highest tps is reached at 250 clients in all tests - that's 10x the CPU thread count.
- prepared query mode has about half the latency of simple mode and, conversely, achieves about twice the tps in fully loaded configurations (>= 25 clients).
- huge pages improve latency and tps in prepared query mode, but not significantly.
- the maximum tps topped 1M transactions per second on my 12-core CPU.
The simulated load pattern repeatedly queries a relatively small dataset (scale 100, ~1.5GB of data). Most hardware configurations should handle this workload entirely in memory. If the hardware has 8GB or more of main memory, the whole dataset fits into PostgreSQL's access-optimized shared buffers. This configuration is well suited to finding the largest number of supported transactions for any given hardware. However, maximum tps requires the use of prepared statements.
A naive reading might suggest that simple vs. prepared mode contrasts a much larger population of clients that happen to arrive 250 at a time with a population of 250 clients that repeatedly query the system. In reality, however, simple mode just represents non-optimized software.
The less than 3M tps documented on Phoronix across two 96-core EPYC server CPUs pales in comparison to the 1M tps reached on a last-gen 12-core desktop CPU. The likely culprit is that the test scenario was chosen too small to saturate such monster hardware. And adding a simple tuning parameter promises to double the tps yet again.
It is time to update the pts/postgresql test definition!