β Back to Connection Routing & Pooling Strategies
Connection Pool Architecture for Read Replicas
Scaling read-heavy workloads across distributed database deployments requires more than provisioning additional nodes. The connection pool layer sitting between application services and the replica topology dictates latency boundaries, resource exhaustion thresholds, and consistency guarantees. A misconfigured pool creates exactly the cascading failures you added replicas to prevent: connection exhaustion, thundering-herd reconnections after failover, and stale reads silently served to users who just wrote data. This page covers production-grade pool design, proxy integration, lifecycle configuration, failure-mode mitigation, and the monitoring signals that keep the whole system observable.
The Operational Problem This Addresses
Connection Routing & Pooling Strategies defines how traffic is partitioned across a replica set. This page goes one level deeper: how to design and operate the pool itself so that partitioning decisions are actually enforced under load, during topology changes, and when individual replicas degrade. Without deliberate pool architecture the most common outcomes are:
- Connection exhaustion on the primary when replicas are removed from rotation and traffic falls back without backpressure.
- Stale reads bypassing lag checks because pool state is not propagated to routing logic.
- Thundering-herd reconnections when a replica restarts and every idle connection validates simultaneously.
- Session-mode pools that cap concurrency far below what transaction-mode pools can sustain for the same replica
max_connectionsbudget.
The SLA risks are concrete: a pool sized for steady-state traffic that has no reserve or circuit-breaker logic will exhaust available connections within seconds of a moderate traffic spike, causing HTTP 500s or database timeouts that cascade across the entire application tier.
Concept Definition and Scope
A connection pool for read replicas is a broker between application threads and backend database sockets. It maintains a set of open, authenticated connections to one or more replica endpoints, lends them to application threads for the duration of a query or transaction, then recycles them. The pool enforces limits on concurrency (max_pool_size), manages idle socket lifetimes, and β in the context of read replicas specifically β must also carry awareness of replication lag and node health so it can refuse to lend connections to degraded backends.
This is distinct from a generic connection pool in two ways:
- Routing is a first-class concern. The pool must integrate with lag metrics and health checks to dynamically exclude nodes. A generic pool simply balances across a fixed list of backends.
- Write-after-read consistency is a pool-level responsibility. After a write to the primary, reads issued within a causal window must bypass the replica pool entirely. This requires the pool β or the routing layer above it β to carry per-request write timestamps.
Connection pooling modes (session, transaction, statement) have direct consequences for read-replica deployments. Transaction mode is the production standard: it releases the backend connection at COMMIT/ROLLBACK, enabling far higher client fan-out per replica max_connections budget. Session mode binds a backend socket for the clientβs entire session lifetime β wasteful for read-only API calls that open a session, execute one SELECT, and close. Statement mode is too restrictive for most workloads because it prohibits multi-statement transactions.
Architecture Overview
The diagram below shows the three-tier pool architecture: application threads β pool/proxy layer β replica set, with the primary as the write target and fallback for lag-breached reads.
Mechanism Deep-Dive
Routing Decision Pipeline
Before a client connection is handed a backend socket, the pool/proxy evaluates a short decision pipeline:
- Query classification. Parse the SQL digest or inspect the first keyword.
SELECT β¦ FOR UPDATEandSELECT β¦ FOR SHAREare locking reads that must route to the primary. PlainSELECTqueries are candidates for replica routing. DDL and DML always go to the primary. - Causal window check. If the requesting thread issued a write within the last N milliseconds (configurable, commonly 1500β2000 ms), the request bypasses the replica pool and goes directly to the primary, enforcing read-after-write consistency.
- Lag guard evaluation. Query the poolβs in-memory lag cache (refreshed every few seconds from
pg_stat_replicationor a heartbeat table). Replicas withreplay_lag > thresholdare excluded from the candidate set. - Health check gate. Replicas that have failed
Nconsecutive TCP health checks are inDRAININGstate and receive no new connections. - Load balancing. From the remaining healthy, within-lag candidates, select using
least_connections(preferred for heterogeneous node capacities) orweighted_round_robinwhen replicas have differing vCPU/memory profiles. - Connection checkout. Lend an idle, validated socket from the pool. If none is available and
pool_size < max_pool_size, open a new backend connection. Ifmax_pool_sizeis reached, queue the request or fail fast.
Transaction-Mode Connection Lifecycle
In transaction mode the backend socket lifecycle is:
CLIENT CONNECTS β [IDLE socket assigned] β BEGIN (optional)
β query 1 β query 2 β β¦ β COMMIT/ROLLBACK
β [socket returned to pool, reset to clean state]
CLIENT DISCONNECTS (pool connection persists)
The reset step (DISCARD ALL or RESET ALL) clears session-level parameters, temporary tables, and advisory locks. This is what makes connection sharing safe across unrelated client sessions. PgBouncer performs this reset automatically; verify that server_reset_query = DISCARD ALL is set and that your application does not rely on session state persisting across transactions.
Proxy Rule Evaluation in ProxySQL
ProxySQL evaluates query rules in ascending rule_id order and applies the first match. The evaluation is regex-based against the query digest, so order and specificity matter. Locking reads (FOR UPDATE) must appear before general SELECT rules, otherwise they are misrouted to replicas.
-- ProxySQL routing rules (admin interface)
INSERT INTO mysql_query_rules
(rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
(1, 1, '^SELECT .* FOR UPDATE', 10, 1),
(2, 1, '^SELECT .* FOR SHARE', 10, 1),
(3, 1, '^SELECT .*', 20, 1),
(4, 1, '^(INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)', 10, 1);
INSERT INTO mysql_servers
(hostgroup_id, hostname, port, weight, max_connections, max_replication_lag)
VALUES
(10, 'primary.db.internal', 3306, 1, 500, 0),
(20, 'replica-01.db.internal',3306, 1, 500, 2),
(20, 'replica-02.db.internal',3306, 1, 500, 2),
(20, 'replica-03.db.internal',3306, 2, 1000, 5);
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;max_replication_lag (seconds) is enforced per server: when Seconds_Behind_Master exceeds this value, ProxySQL moves the server to the writer hostgroup or marks it offline β it will never silently route stale reads.
ORM-Level Routing and Causal Windows
When ORM middleware for automatic query routing is the routing layer rather than a dedicated proxy, the causal window must be tracked explicitly in application state. The following Spring Boot interceptor demonstrates the pattern:
// Spring Boot DataSource Routing Interceptor
public class ReplicaRoutingInterceptor implements MethodInterceptor {
private static final long CAUSAL_WINDOW_MS = 1500L;
@Override
public Object invoke(MethodInvocation invocation) throws Throwable {
Transactional tx = invocation.getMethod().getAnnotation(Transactional.class);
boolean isWrite = (tx != null && !tx.readOnly());
long lastWrite = ThreadLocalContext.getLastWriteTimestamp();
if (isWrite || (System.currentTimeMillis() - lastWrite < CAUSAL_WINDOW_MS)) {
DataSourceRouter.setActiveDataSource("primary");
} else {
DataSourceRouter.setActiveDataSource("replica-pool");
}
try {
return invocation.proceed();
} catch (Exception e) {
// Circuit-break to primary on any replica failure
DataSourceRouter.setActiveDataSource("primary");
return invocation.proceed();
}
}
}If the ORM detects sustained replica unavailability, it should open a circuit and route all traffic to the primary until the replica pool recovers β preventing thread starvation in the application tier.
Trade-Off Comparison Table
| Strategy | Concurrency headroom | Lag/stale-read risk | Failover behaviour | Operational complexity |
|---|---|---|---|---|
| Transaction-mode pool (PgBouncer) | Very high β many clients per backend socket | Low when lag guard is wired | Queue drains; reconnect on resume | Low β single config file |
| Session-mode pool | Low β 1:1 client:socket | Low | Connections drop; reconnect storm | Low |
| Proxy with AST-based routing (ProxySQL) | High | Very low β per-query decisions | Automatic server demotion | Medium β rule ordering, digest tuning |
| ORM-level routing (no proxy) | Depends on driver pool | Medium β causal window must be coded | Application-controlled circuit break | High β per-framework implementation |
| DNS-based routing (read endpoint) | Depends on backend pool | Medium β lag awareness absent | TTL-bound (30β60 s delay) | Very low β no proxy needed |
Configuration Runbook
PgBouncer for Read Replica Pools
Full details are in Configuring PgBouncer for read-only connection pools. The minimal working configuration with critical parameters annotated:
; pgbouncer.ini β read replica pool
[databases]
; Route this DSN name to an internal load balancer in front of replicas
app_read = host=replica-lb.internal port=5432 dbname=app_db
[pgbouncer]
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; Transaction mode: release backend socket at COMMIT/ROLLBACK
pool_mode = transaction
; max_client_conn: 3Γ expected peak application connections
; Queues are created beyond default_pool_size but before this limit
max_client_conn = 2000
; default_pool_size: replica max_connections / number of app instances
; Leave headroom for superuser and replication connections (~15%)
default_pool_size = 150
; reserve_pool_size: burst buffer for administrative queries
reserve_pool_size = 20
reserve_pool_timeout = 5
; Lifecycle: reclaim idle sockets after 60 s, recycle after 1 h
server_idle_timeout = 60
server_lifetime = 3600
server_connect_timeout = 5
; Background health check interval (seconds)
server_check_delay = 30
; TCP keepalives: detect silent node failures before OS timeout
tcp_keepalives_idle = 30
tcp_keepalives_interval = 10
; Clients waiting beyond this limit are rejected β handle in application
query_wait_timeout = 30
; Reset session state between client reuses
server_reset_query = DISCARD ALLSizing rules of thumb:
max_client_conn= 3Γ expected peak application connectionsdefault_pool_size=replica_max_connections Γ 0.85 / number_of_pgbouncer_instancesreserve_pool_size= 10β20 connections for burst absorption and administrative queries
TCP and TLS Optimisation
Terminate TLS at the pool boundary rather than at each application instance. This reduces handshake overhead and centralises certificate rotation. tcp_keepalives_idle=30 and tcp_keepalives_interval=10 detect silent node failures before application timeouts trigger β a common source of connection leaks when a replica is taken out of service without a clean TCP close.
Monitoring and Alerting Signals
Key Metrics
| Metric | Source | Alert threshold |
|---|---|---|
pgbouncer_pools_active_connections |
PgBouncer stats | β |
pgbouncer_pools_waiting_connections |
PgBouncer stats | > 50 for 30 s β warning |
pgbouncer_pools_active_connections / pgbouncer_pools_max_connections |
Derived | > 0.85 for 5 min β warning |
pgbouncer_clients_waiting |
PgBouncer stats | > 100 for 2 min β critical |
pg_replication_lag_seconds |
pg_stat_replication |
> lag threshold β replica demotion |
pgbouncer_stats_avg_wait_time |
PgBouncer stats | p99 > 200 ms β capacity review |
Prometheus Alerting Rules
groups:
- name: replica_pool_capacity
rules:
- alert: ReplicaPoolUtilizationHigh
expr: >
pgbouncer_pools_active_connections
/ pgbouncer_pools_max_connections > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "Read replica pool approaching capacity"
description: >
Pool utilisation at {{ $value | humanizePercentage }}.
Consider horizontal scaling or read throttling.
- alert: ReplicaPoolQueueDepthCritical
expr: pgbouncer_pools_waiting_connections > 100
for: 2m
labels:
severity: critical
annotations:
summary: "Connection queue depth critical"
description: >
Queue depth at {{ $value }}.
Enable fail-fast routing for low-priority queries immediately.
- alert: ReplicaLagBreach
expr: pg_replication_lag_seconds > 0.5
for: 1m
labels:
severity: warning
annotations:
summary: "Replica lag exceeds 500 ms SLA threshold"
description: >
Replica {{ $labels.server }} lag at {{ $value | humanizeDuration }}.
Pool should demote this replica from the read candidate set.Diagnostic Queries
-- PgBouncer live pool stats
SHOW POOLS;
-- Columns: database, user, cl_active, cl_waiting, sv_active, sv_idle, sv_used, sv_tested, sv_login, maxwait
-- Active waits (connections queued longer than 1 s)
SELECT database, user, cl_waiting, maxwait
FROM pgbouncer.pools
WHERE maxwait > 1;
-- PostgreSQL replication lag on the primary
SELECT
application_name,
state,
EXTRACT(EPOCH FROM (now() - write_lag)) AS write_lag_s,
EXTRACT(EPOCH FROM (now() - replay_lag)) AS replay_lag_s
FROM pg_stat_replication
ORDER BY replay_lag_s DESC NULLS LAST;Failure Modes and Recovery Steps
1. Connection Queue Saturation (client login has been waiting too long)
Root cause. max_client_conn has been reached β the PgBouncer queue is full. This usually means default_pool_size is too small for the current load, or one or more replicas are degraded and traffic has concentrated on the survivors.
Remediation.
- Run
SHOW POOLS;on the PgBouncer admin socket to confirmcl_waiting > 0and identify which database is saturated. - If a replica is degraded, remove it from the pool temporarily:
DISABLE replica-01.db.internalin the ProxySQL admin or updatepgbouncer.iniandRELOAD. - Increase
default_pool_sizeon the remaining replicas within theirmax_connectionsbudget. Apply withRELOADβ no restart required. - At the application layer, return HTTP
429 Too Many Requestsfor non-critical read paths rather than queuing β queued threads consume memory and block healthier requests. - Post-incident: recalibrate
default_pool_sizeandmax_client_connusing p99 wait times from the incident window.
2. Thundering Herd After Replica Restart
Root cause. When a replica restarts, every idle connection in a pool configured with testOnBorrow=true (HikariCP) or equivalent performs a validation query simultaneously, generating a spike of up to max_pool_size simultaneous connections to the newly available backend.
Remediation.
- Replace
testOnBorrowwithtestWhileIdleand a 30-second validation interval. - Set
pool_pre_ping=Truein SQLAlchemy, orkeepaliveTimein HikariCP for background validation. - During planned replica restart, pause the PgBouncer pool (
PAUSE app_read;). Queue client connections. After the replica is ready, rehydrate in batches: incrementdefault_pool_sizeby 10 every 5 seconds until the target is reached, then resume (RESUME app_read;). - Set
server_connect_timeout = 5so failed reconnect attempts fail fast rather than blocking.
3. Stale Reads After Write (Missing Causal Window)
Root cause. A write was committed on the primary; the calling thread then issued a read that was routed to a replica before the WAL segment was applied. If the causal window is not tracked or is shorter than the actual replication lag, users see data that predates their own action.
Remediation.
- Audit the causal window implementation: confirm that
last_write_timestampis tracked per-thread or per-request (not per-process). - Extend the causal window to at least
p99(replication_lag) + 200 msbased on observed metrics. - For critical read-after-write paths (e.g., confirmation pages, balance checks), force primary routing explicitly via a query hint (
/* ROUTE=PRIMARY */) or an ORM routing annotation β do not rely solely on the causal window. - Verify using
SELECT now() - pg_last_xact_replay_timestamp() AS lagon the replica immediately after writes in staging to confirm the window is sufficient.
4. DNS Flapping During Failover
Root cause. When a replica is promoted or replaced, its DNS endpoint changes. If client pools cached the old IP and DNS TTL is set too high, they continue sending connections to the departed address, causing ECONNREFUSED or silent timeouts. If TTL is set very low, aggressive re-resolution during the transition causes rapid pool churn.
Remediation.
- Set DNS TTL to 30 seconds for replica endpoints β low enough for fast cutover, high enough to avoid resolution overhead.
- Configure a 15-second hysteresis window in health checks to prevent rapid state oscillation (marking a node unhealthy, then healthy, then unhealthy) during network partitions.
- Use a static load-balancer DNS name (
replica-lb.internal) that re-resolves to healthy replicas rather than individual replica hostnames β insulates the pool from single-replica DNS changes.
5. Prepared Statement Routing Corruption
Root cause. Prepared statements are session-scoped on the backend. In transaction-mode pools, backend connections are shared across sessions, so a prepared statement created by client A may no longer exist on the backend connection lent to client B.
Remediation.
- Enable
server_reset_query = DISCARD ALLin PgBouncer β this deallocates prepared statements on connection return. - Alternatively, use the extended query protocol with
protocol = extendedand let PgBouncer track prepared statement handles per client. - For ProxySQL: cache prepared statement handles at the proxy level and map them to backend handles per connection. Verify with
SHOW MYSQL PREPARED STATEMENTS CACHE;.
Production-Readiness Checklist
Related Pages in This Section
- Configuring PgBouncer for read-only connection pools β Transaction-mode pool sizing,
userlist.txtauth setup, and memory footprint calculations for replica-only PgBouncer deployments. - Avoiding connection exhaustion during replica failover β Circuit-breaker patterns, backpressure configuration, and step-by-step rehydration procedures to prevent cascade failures when a replica is removed from rotation.
FAQ
What pool mode should I use for read replicas in PgBouncer?
Transaction mode (pool_mode = transaction) is the standard choice for read replica pools. It recycles server connections between client transactions, supporting far more client connections per replica than session mode without requiring a dedicated backend socket for each client. Session mode is only appropriate when applications rely on session-level state (temporary tables, advisory locks, SET variables) that must persist across multiple queries β rare for read-only analytics or API workloads.
How do I prevent a thundering herd after a replica failover?
Disable testOnBorrow and replace it with testWhileIdle on a 30-second validation interval. Use pool_pre_ping=True in SQLAlchemy or keepaliveTime in HikariCP for background connection validation. During failover, pause the PgBouncer pool and rehydrate in small batches β incrementing default_pool_size by 10 connections every 5 seconds β to avoid overwhelming the newly promoted replica.
When should reads be rerouted to the primary instead of a replica?
Route reads to the primary when replication lag exceeds your SLA threshold (commonly 500 ms), within a causal consistency window after a write (typically 1500β2000 ms), or when all replicas fail health checks. Configure these conditions explicitly rather than letting the application silently consume stale data. For critical paths such as payment confirmations, always force primary routing regardless of lag.
How do I size the connection pool if I have multiple app instances?
Calculate default_pool_size per PgBouncer instance as (replica_max_connections Γ 0.85) / number_of_pgbouncer_instances. Reserve the remaining ~15% for the replication process, superuser connections, and monitoring queries. Validate under synthetic load using pgbench -c <clients> -j <threads> -T 60 -h replica-lb.internal before promoting to production.
Related
- β Connection Routing & Pooling Strategies β Parent section covering the full spectrum of routing architectures, from proxy-layer splitting to ORM middleware.
- Implementing Read/Write Splitting at the Proxy Layer β How ProxySQL and HAProxy enforce the write-primary, read-replica boundary at the wire-protocol level.
- ORM Middleware for Automatic Query Routing β Framework-level routing without a dedicated proxy, covering Spring, Django, and SQLAlchemy patterns.
- Detecting and Handling Replication Lag in Real Time β How to measure
replay_lag, set dynamic lag thresholds, and wire lag data into pool routing decisions. - Fallback Strategies When Replicas Fall Behind β Circuit-breaker patterns and graceful degradation for read workloads when the replica set cannot keep up with write throughput.