← Back to Connection Routing & Pooling Strategies

Connection Pool Architecture for Read Replicas

Scaling read-heavy workloads across distributed database deployments requires more than provisioning additional nodes. The connection pool layer sitting between application services and the replica topology dictates latency boundaries, resource exhaustion thresholds, and consistency guarantees. A misconfigured pool creates exactly the cascading failures you added replicas to prevent: connection exhaustion, thundering-herd reconnections after failover, and stale reads silently served to users who just wrote data. This page covers production-grade pool design, proxy integration, lifecycle configuration, failure-mode mitigation, and the monitoring signals that keep the whole system observable.

The Operational Problem This Addresses

Connection Routing & Pooling Strategies defines how traffic is partitioned across a replica set. This page goes one level deeper: how to design and operate the pool itself so that partitioning decisions are actually enforced under load, during topology changes, and when individual replicas degrade. Without deliberate pool architecture the most common outcomes are:

  • Connection exhaustion on the primary when replicas are removed from rotation and traffic falls back without backpressure.
  • Stale reads bypassing lag checks because pool state is not propagated to routing logic.
  • Thundering-herd reconnections when a replica restarts and every idle connection validates simultaneously.
  • Session-mode pools that cap concurrency far below what transaction-mode pools can sustain for the same replica max_connections budget.

The SLA risks are concrete: a pool sized for steady-state traffic that has no reserve or circuit-breaker logic will exhaust available connections within seconds of a moderate traffic spike, causing HTTP 500s or database timeouts that cascade across the entire application tier.

Concept Definition and Scope

A connection pool for read replicas is a broker between application threads and backend database sockets. It maintains a set of open, authenticated connections to one or more replica endpoints, lends them to application threads for the duration of a query or transaction, then recycles them. The pool enforces limits on concurrency (max_pool_size), manages idle socket lifetimes, and β€” in the context of read replicas specifically β€” must also carry awareness of replication lag and node health so it can refuse to lend connections to degraded backends.

This is distinct from a generic connection pool in two ways:

  1. Routing is a first-class concern. The pool must integrate with lag metrics and health checks to dynamically exclude nodes. A generic pool simply balances across a fixed list of backends.
  2. Write-after-read consistency is a pool-level responsibility. After a write to the primary, reads issued within a causal window must bypass the replica pool entirely. This requires the pool β€” or the routing layer above it β€” to carry per-request write timestamps.

Connection pooling modes (session, transaction, statement) have direct consequences for read-replica deployments. Transaction mode is the production standard: it releases the backend connection at COMMIT/ROLLBACK, enabling far higher client fan-out per replica max_connections budget. Session mode binds a backend socket for the client’s entire session lifetime β€” wasteful for read-only API calls that open a session, execute one SELECT, and close. Statement mode is too restrictive for most workloads because it prohibits multi-statement transactions.


Architecture Overview

The diagram below shows the three-tier pool architecture: application threads β†’ pool/proxy layer β†’ replica set, with the primary as the write target and fallback for lag-breached reads.

Connection Pool Architecture for Read Replicas Three-tier diagram: application instances on the left connect to a pool/proxy layer in the centre, which routes writes to the primary and reads to replicas on the right. Health checks and lag monitors feed back into the routing layer. App instance A thread pool App instance B thread pool App instance C thread pool Pool / Proxy PgBouncer Β· ProxySQL Query classifier Lag guard Health checker Circuit breaker Connection recycler writes reads lag fallback Primary writes + fallback reads Replica 1 lag ≀ threshold βœ“ Replica 2 lag ≀ threshold βœ“ Replica 3 lag exceeded β€” drained health / lag feedback

Mechanism Deep-Dive

Routing Decision Pipeline

Before a client connection is handed a backend socket, the pool/proxy evaluates a short decision pipeline:

  1. Query classification. Parse the SQL digest or inspect the first keyword. SELECT … FOR UPDATE and SELECT … FOR SHARE are locking reads that must route to the primary. Plain SELECT queries are candidates for replica routing. DDL and DML always go to the primary.
  2. Causal window check. If the requesting thread issued a write within the last N milliseconds (configurable, commonly 1500–2000 ms), the request bypasses the replica pool and goes directly to the primary, enforcing read-after-write consistency.
  3. Lag guard evaluation. Query the pool’s in-memory lag cache (refreshed every few seconds from pg_stat_replication or a heartbeat table). Replicas with replay_lag > threshold are excluded from the candidate set.
  4. Health check gate. Replicas that have failed N consecutive TCP health checks are in DRAINING state and receive no new connections.
  5. Load balancing. From the remaining healthy, within-lag candidates, select using least_connections (preferred for heterogeneous node capacities) or weighted_round_robin when replicas have differing vCPU/memory profiles.
  6. Connection checkout. Lend an idle, validated socket from the pool. If none is available and pool_size < max_pool_size, open a new backend connection. If max_pool_size is reached, queue the request or fail fast.

Transaction-Mode Connection Lifecycle

In transaction mode the backend socket lifecycle is:

code
CLIENT CONNECTS β†’ [IDLE socket assigned] β†’ BEGIN (optional)
  β†’ query 1 β†’ query 2 β†’ … β†’ COMMIT/ROLLBACK
  β†’ [socket returned to pool, reset to clean state]
CLIENT DISCONNECTS (pool connection persists)

The reset step (DISCARD ALL or RESET ALL) clears session-level parameters, temporary tables, and advisory locks. This is what makes connection sharing safe across unrelated client sessions. PgBouncer performs this reset automatically; verify that server_reset_query = DISCARD ALL is set and that your application does not rely on session state persisting across transactions.

Proxy Rule Evaluation in ProxySQL

ProxySQL evaluates query rules in ascending rule_id order and applies the first match. The evaluation is regex-based against the query digest, so order and specificity matter. Locking reads (FOR UPDATE) must appear before general SELECT rules, otherwise they are misrouted to replicas.

sql
-- ProxySQL routing rules (admin interface)
INSERT INTO mysql_query_rules
  (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (1, 1, '^SELECT .* FOR UPDATE',                    10, 1),
  (2, 1, '^SELECT .* FOR SHARE',                     10, 1),
  (3, 1, '^SELECT .*',                               20, 1),
  (4, 1, '^(INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)', 10, 1);

INSERT INTO mysql_servers
  (hostgroup_id, hostname, port, weight, max_connections, max_replication_lag)
VALUES
  (10, 'primary.db.internal',   3306, 1, 500,  0),
  (20, 'replica-01.db.internal',3306, 1, 500,  2),
  (20, 'replica-02.db.internal',3306, 1, 500,  2),
  (20, 'replica-03.db.internal',3306, 2, 1000, 5);

LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL SERVERS TO RUNTIME;     SAVE MYSQL SERVERS TO DISK;

max_replication_lag (seconds) is enforced per server: when Seconds_Behind_Master exceeds this value, ProxySQL moves the server to the writer hostgroup or marks it offline β€” it will never silently route stale reads.

ORM-Level Routing and Causal Windows

When ORM middleware for automatic query routing is the routing layer rather than a dedicated proxy, the causal window must be tracked explicitly in application state. The following Spring Boot interceptor demonstrates the pattern:

java
// Spring Boot DataSource Routing Interceptor
public class ReplicaRoutingInterceptor implements MethodInterceptor {
    private static final long CAUSAL_WINDOW_MS = 1500L;

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        Transactional tx = invocation.getMethod().getAnnotation(Transactional.class);
        boolean isWrite = (tx != null && !tx.readOnly());
        long lastWrite = ThreadLocalContext.getLastWriteTimestamp();

        if (isWrite || (System.currentTimeMillis() - lastWrite < CAUSAL_WINDOW_MS)) {
            DataSourceRouter.setActiveDataSource("primary");
        } else {
            DataSourceRouter.setActiveDataSource("replica-pool");
        }

        try {
            return invocation.proceed();
        } catch (Exception e) {
            // Circuit-break to primary on any replica failure
            DataSourceRouter.setActiveDataSource("primary");
            return invocation.proceed();
        }
    }
}

If the ORM detects sustained replica unavailability, it should open a circuit and route all traffic to the primary until the replica pool recovers β€” preventing thread starvation in the application tier.


Trade-Off Comparison Table

Strategy Concurrency headroom Lag/stale-read risk Failover behaviour Operational complexity
Transaction-mode pool (PgBouncer) Very high β€” many clients per backend socket Low when lag guard is wired Queue drains; reconnect on resume Low β€” single config file
Session-mode pool Low β€” 1:1 client:socket Low Connections drop; reconnect storm Low
Proxy with AST-based routing (ProxySQL) High Very low β€” per-query decisions Automatic server demotion Medium β€” rule ordering, digest tuning
ORM-level routing (no proxy) Depends on driver pool Medium β€” causal window must be coded Application-controlled circuit break High β€” per-framework implementation
DNS-based routing (read endpoint) Depends on backend pool Medium β€” lag awareness absent TTL-bound (30–60 s delay) Very low β€” no proxy needed

Configuration Runbook

PgBouncer for Read Replica Pools

Full details are in Configuring PgBouncer for read-only connection pools. The minimal working configuration with critical parameters annotated:

ini
; pgbouncer.ini β€” read replica pool
[databases]
; Route this DSN name to an internal load balancer in front of replicas
app_read = host=replica-lb.internal port=5432 dbname=app_db

[pgbouncer]
listen_port = 6432
auth_type   = scram-sha-256
auth_file   = /etc/pgbouncer/userlist.txt

; Transaction mode: release backend socket at COMMIT/ROLLBACK
pool_mode = transaction

; max_client_conn: 3Γ— expected peak application connections
; Queues are created beyond default_pool_size but before this limit
max_client_conn = 2000

; default_pool_size: replica max_connections / number of app instances
; Leave headroom for superuser and replication connections (~15%)
default_pool_size = 150

; reserve_pool_size: burst buffer for administrative queries
reserve_pool_size    = 20
reserve_pool_timeout = 5

; Lifecycle: reclaim idle sockets after 60 s, recycle after 1 h
server_idle_timeout    = 60
server_lifetime        = 3600
server_connect_timeout = 5

; Background health check interval (seconds)
server_check_delay = 30

; TCP keepalives: detect silent node failures before OS timeout
tcp_keepalives_idle     = 30
tcp_keepalives_interval = 10

; Clients waiting beyond this limit are rejected β€” handle in application
query_wait_timeout = 30

; Reset session state between client reuses
server_reset_query = DISCARD ALL

Sizing rules of thumb:

  • max_client_conn = 3Γ— expected peak application connections
  • default_pool_size = replica_max_connections Γ— 0.85 / number_of_pgbouncer_instances
  • reserve_pool_size = 10–20 connections for burst absorption and administrative queries

TCP and TLS Optimisation

Terminate TLS at the pool boundary rather than at each application instance. This reduces handshake overhead and centralises certificate rotation. tcp_keepalives_idle=30 and tcp_keepalives_interval=10 detect silent node failures before application timeouts trigger β€” a common source of connection leaks when a replica is taken out of service without a clean TCP close.


Monitoring and Alerting Signals

Key Metrics

Metric Source Alert threshold
pgbouncer_pools_active_connections PgBouncer stats β€”
pgbouncer_pools_waiting_connections PgBouncer stats > 50 for 30 s β†’ warning
pgbouncer_pools_active_connections / pgbouncer_pools_max_connections Derived > 0.85 for 5 min β†’ warning
pgbouncer_clients_waiting PgBouncer stats > 100 for 2 min β†’ critical
pg_replication_lag_seconds pg_stat_replication > lag threshold β†’ replica demotion
pgbouncer_stats_avg_wait_time PgBouncer stats p99 > 200 ms β†’ capacity review

Prometheus Alerting Rules

yaml
groups:
  - name: replica_pool_capacity
    rules:
      - alert: ReplicaPoolUtilizationHigh
        expr: >
          pgbouncer_pools_active_connections
          / pgbouncer_pools_max_connections > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Read replica pool approaching capacity"
          description: >
            Pool utilisation at {{ $value | humanizePercentage }}.
            Consider horizontal scaling or read throttling.

      - alert: ReplicaPoolQueueDepthCritical
        expr: pgbouncer_pools_waiting_connections > 100
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection queue depth critical"
          description: >
            Queue depth at {{ $value }}.
            Enable fail-fast routing for low-priority queries immediately.

      - alert: ReplicaLagBreach
        expr: pg_replication_lag_seconds > 0.5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replica lag exceeds 500 ms SLA threshold"
          description: >
            Replica {{ $labels.server }} lag at {{ $value | humanizeDuration }}.
            Pool should demote this replica from the read candidate set.

Diagnostic Queries

sql
-- PgBouncer live pool stats
SHOW POOLS;
-- Columns: database, user, cl_active, cl_waiting, sv_active, sv_idle, sv_used, sv_tested, sv_login, maxwait

-- Active waits (connections queued longer than 1 s)
SELECT database, user, cl_waiting, maxwait
FROM   pgbouncer.pools
WHERE  maxwait > 1;

-- PostgreSQL replication lag on the primary
SELECT
  application_name,
  state,
  EXTRACT(EPOCH FROM (now() - write_lag))  AS write_lag_s,
  EXTRACT(EPOCH FROM (now() - replay_lag)) AS replay_lag_s
FROM pg_stat_replication
ORDER BY replay_lag_s DESC NULLS LAST;

Failure Modes and Recovery Steps

1. Connection Queue Saturation (client login has been waiting too long)

Root cause. max_client_conn has been reached β€” the PgBouncer queue is full. This usually means default_pool_size is too small for the current load, or one or more replicas are degraded and traffic has concentrated on the survivors.

Remediation.

  1. Run SHOW POOLS; on the PgBouncer admin socket to confirm cl_waiting > 0 and identify which database is saturated.
  2. If a replica is degraded, remove it from the pool temporarily: DISABLE replica-01.db.internal in the ProxySQL admin or update pgbouncer.ini and RELOAD.
  3. Increase default_pool_size on the remaining replicas within their max_connections budget. Apply with RELOAD β€” no restart required.
  4. At the application layer, return HTTP 429 Too Many Requests for non-critical read paths rather than queuing β€” queued threads consume memory and block healthier requests.
  5. Post-incident: recalibrate default_pool_size and max_client_conn using p99 wait times from the incident window.

2. Thundering Herd After Replica Restart

Root cause. When a replica restarts, every idle connection in a pool configured with testOnBorrow=true (HikariCP) or equivalent performs a validation query simultaneously, generating a spike of up to max_pool_size simultaneous connections to the newly available backend.

Remediation.

  1. Replace testOnBorrow with testWhileIdle and a 30-second validation interval.
  2. Set pool_pre_ping=True in SQLAlchemy, or keepaliveTime in HikariCP for background validation.
  3. During planned replica restart, pause the PgBouncer pool (PAUSE app_read;). Queue client connections. After the replica is ready, rehydrate in batches: increment default_pool_size by 10 every 5 seconds until the target is reached, then resume (RESUME app_read;).
  4. Set server_connect_timeout = 5 so failed reconnect attempts fail fast rather than blocking.

3. Stale Reads After Write (Missing Causal Window)

Root cause. A write was committed on the primary; the calling thread then issued a read that was routed to a replica before the WAL segment was applied. If the causal window is not tracked or is shorter than the actual replication lag, users see data that predates their own action.

Remediation.

  1. Audit the causal window implementation: confirm that last_write_timestamp is tracked per-thread or per-request (not per-process).
  2. Extend the causal window to at least p99(replication_lag) + 200 ms based on observed metrics.
  3. For critical read-after-write paths (e.g., confirmation pages, balance checks), force primary routing explicitly via a query hint (/* ROUTE=PRIMARY */) or an ORM routing annotation β€” do not rely solely on the causal window.
  4. Verify using SELECT now() - pg_last_xact_replay_timestamp() AS lag on the replica immediately after writes in staging to confirm the window is sufficient.

4. DNS Flapping During Failover

Root cause. When a replica is promoted or replaced, its DNS endpoint changes. If client pools cached the old IP and DNS TTL is set too high, they continue sending connections to the departed address, causing ECONNREFUSED or silent timeouts. If TTL is set very low, aggressive re-resolution during the transition causes rapid pool churn.

Remediation.

  1. Set DNS TTL to 30 seconds for replica endpoints β€” low enough for fast cutover, high enough to avoid resolution overhead.
  2. Configure a 15-second hysteresis window in health checks to prevent rapid state oscillation (marking a node unhealthy, then healthy, then unhealthy) during network partitions.
  3. Use a static load-balancer DNS name (replica-lb.internal) that re-resolves to healthy replicas rather than individual replica hostnames β€” insulates the pool from single-replica DNS changes.

5. Prepared Statement Routing Corruption

Root cause. Prepared statements are session-scoped on the backend. In transaction-mode pools, backend connections are shared across sessions, so a prepared statement created by client A may no longer exist on the backend connection lent to client B.

Remediation.

  1. Enable server_reset_query = DISCARD ALL in PgBouncer β€” this deallocates prepared statements on connection return.
  2. Alternatively, use the extended query protocol with protocol = extended and let PgBouncer track prepared statement handles per client.
  3. For ProxySQL: cache prepared statement handles at the proxy level and map them to backend handles per connection. Verify with SHOW MYSQL PREPARED STATEMENTS CACHE;.

Production-Readiness Checklist



FAQ

What pool mode should I use for read replicas in PgBouncer?

Transaction mode (pool_mode = transaction) is the standard choice for read replica pools. It recycles server connections between client transactions, supporting far more client connections per replica than session mode without requiring a dedicated backend socket for each client. Session mode is only appropriate when applications rely on session-level state (temporary tables, advisory locks, SET variables) that must persist across multiple queries β€” rare for read-only analytics or API workloads.

How do I prevent a thundering herd after a replica failover?

Disable testOnBorrow and replace it with testWhileIdle on a 30-second validation interval. Use pool_pre_ping=True in SQLAlchemy or keepaliveTime in HikariCP for background connection validation. During failover, pause the PgBouncer pool and rehydrate in small batches β€” incrementing default_pool_size by 10 connections every 5 seconds β€” to avoid overwhelming the newly promoted replica.

When should reads be rerouted to the primary instead of a replica?

Route reads to the primary when replication lag exceeds your SLA threshold (commonly 500 ms), within a causal consistency window after a write (typically 1500–2000 ms), or when all replicas fail health checks. Configure these conditions explicitly rather than letting the application silently consume stale data. For critical paths such as payment confirmations, always force primary routing regardless of lag.

How do I size the connection pool if I have multiple app instances?

Calculate default_pool_size per PgBouncer instance as (replica_max_connections Γ— 0.85) / number_of_pgbouncer_instances. Reserve the remaining ~15% for the replication process, superuser connections, and monitoring queries. Validate under synthetic load using pgbench -c <clients> -j <threads> -T 60 -h replica-lb.internal before promoting to production.