Avoiding Connection Exhaustion During Replica Failover

Operational scenario: a read replica is being promoted or has failed, and within seconds application error rates spike with FATAL: too many connections for role, pool exhausted, or ECONNREFUSED. The pool has hit its ceiling, and new requests are queueing or being dropped.

This page is a configuration-first runbook. It covers how to detect exhaustion early, why failover reliably triggers it, how to harden your proxy and application pool beforehand, and how to execute a controlled drain-and-reroute without cascading service degradation.


Symptom Identification

Distinguish genuine exhaustion from transient churn before deciding on remediation.

Error patterns

| Signal | Transient churn | True exhaustion |
|---|---|---|
| Log pattern | connection reset, idle timeout, server closed | too many connections, pool exhausted, FATAL.*role, ECONNREFUSED |
| PgBouncer SHOW POOLS | cl_waiting briefly spikes then drops | cl_waiting stays elevated and keeps growing |
| pg_stat_activity count | Fluctuates under max_connections * 0.85 | Pinned at or above max_connections * 0.85 |
| App latency | P99 spike then recovery | P50 degrades; new requests time out entirely |

Real-time Prometheus alert rules

Deploy these before your next planned failover window:

promql
# Pool saturation warning: fires at 85 % utilisation
(pgbouncer_pools_active_connections / pgbouncer_pools_max_connections) > 0.85

# Connection wait time degradation
pgbouncer_pools_wait_time_seconds > 0.5

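# HAProxy health-check failure rate against read replicas
# (rate() needs at least two samples in the range: [10s] only works with a
# scrape interval of 5 s or less; widen the range otherwise)
rate(haproxy_server_check_failures_total{backend="db_read_replicas"}[10s]) > 0.3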

Database-side headroom check (run ad-hoc during an incident):

sql
-- Must stay below max_connections * 0.85 during normal operations
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

Correlate timestamps across proxy access logs, database audit logs, and application stdout to isolate the exact failover window.
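
The 85 % headroom rule used in the table and in the ad-hoc query above can also be evaluated in code once the per-state counts are fetched. The following is a minimal sketch, not part of any library: the function name connection_headroom and the sample numbers are assumptions made for illustration.

```python
# Sketch only: the function name and the sample numbers are illustrative.
HEADROOM_RATIO = 0.85  # threshold used throughout this runbook


def connection_headroom(state_counts, max_connections, ratio=HEADROOM_RATIO):
    """Return (total, ceiling, exhausted) for one pg_stat_activity sample.

    state_counts maps each pg_stat_activity.state value to its row count,
    i.e. the result of the GROUP BY query above.
    """
    total = sum(state_counts.values())
    ceiling = max_connections * ratio
    return total, ceiling, total >= ceiling


sample = {"active": 40, "idle": 120, "idle in transaction": 15}
total, ceiling, exhausted = connection_headroom(sample, max_connections=200)
print(f"{total} of {ceiling:.0f} allowed connections, exhausted={exhausted}")
```

A single sample at or above the ceiling is not proof of exhaustion; per the table above, the count has to stay pinned there across consecutive samples.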


Root Cause Analysis

Why failover reliably triggers exhaustion

Three mechanics combine during any topology change:

DNS propagation lag. DNS-based routing introduces a window (typically TTL 30 s–300 s) where clients resolve to a node that is now offline or still promoting. Those clients receive ECONNREFUSED or PostgreSQL error 57P01 (admin_shutdown), and every one of those failures triggers an immediate retry.

Proxy routing lag. Proxy-based routing (L4/L7) eliminates DNS TTL lag but still requires health-check synchronisation. A PgBouncer or HAProxy instance that has not yet marked the old replica OFFLINE continues sending new connections to it, accumulating cl_waiting entries until the health-check cycle fires.

Retry storm from unsynchronised backoff. Default exponential backoff (base_delay * 2^attempt) without jitter causes client fleets to retry in lock-step. During a 5-second promotion window, 1 000 clients retrying at 1 s, 2 s, 4 s intervals can generate more than 10 000 concurrent connection attempts, instantly saturating max_connections on the database and the proxy’s max_client_conn buffer.
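
To see the lock-step effect concretely, the sketch below computes when each retry lands if every client fails at the same moment and uses the same un-jittered backoff. It is an illustration only: retry_times and attempts_per_instant are hypothetical helper names, and the model counts one attempt per client per retry.

```python
from collections import Counter


def retry_times(base_delay, attempts):
    """Cumulative retry instants, in seconds after the first failure."""
    times, elapsed = [], 0.0
    for attempt in range(attempts):
        elapsed += base_delay * 2 ** attempt  # wait 1 s, then 2 s, then 4 s
        times.append(elapsed)
    return times


def attempts_per_instant(clients, base_delay=1.0, attempts=3):
    """Count attempts landing on each instant when all clients share
    the same failure time and the same backoff schedule."""
    counts = Counter()
    for _ in range(clients):
        counts.update(retry_times(base_delay, attempts))
    return dict(counts)


print(attempts_per_instant(1000))
```

Every retry wave arrives as one spike the size of the whole fleet, which is the pattern the proxy and database have to absorb.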

The connection pool architecture for read replicas determines how much headroom exists to absorb that spike. Under-provisioned reserve_pool_size and absent server_connect_timeout leave no margin.

Failover connection lifecycle: state diagram

[Figure: state diagram of the connection lifecycle during replica failover. States: Connected (normal ops), Failover (topology change), Retry storm (exhaustion risk), Circuit open (retries throttled), Draining → standby. Transition labels: ECONNREFUSED / 57P01, no jitter, breaker fires. Legend: exhaustion risk path, mitigated path.]

Step-by-Step Resolution

Step 1: Pause the pool and halt new server connections

bash
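# PgBouncer: suspend query dispatch to the target database.
# Admin commands go to the "pgbouncer" admin console database and must be
# run by a user listed in admin_users.
psql -p 6432 -d pgbouncer -c "PAUSE app_db;"

# Verify: sv_active should fall to 0; while paused, clients queue in cl_waiting
psql -p 6432 -d pgbouncer -c "SHOW POOLS;" | grep app_db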

For ProxySQL (MySQL clusters):

sql
-- Soft-offline the replica so new queries queue rather than fail
UPDATE mysql_servers
SET status = 'OFFLINE_SOFT'
WHERE hostgroup_id = 10;
LOAD MYSQL SERVERS TO RUNTIME;

-- Verify: ConnUsed should fall to 0 as connections drain
-- (per-server connection and latency stats live in stats_mysql_connection_pool)
SELECT srv_host, status, ConnUsed, Latency_us
FROM stats_mysql_connection_pool
WHERE hostgroup = 10;

Step 2: Wait for active transactions to drain

sql
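-- Run on the primary or a healthy replica until this returns 0.
-- Excludes this monitoring session itself, which is always 'active'.
SELECT count(*)
FROM pg_stat_activity
WHERE state IN ('active', 'idle in transaction')
  AND backend_type = 'client backend'
  AND pid <> pg_backend_pid();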

Do not proceed to Step 3 until the count reaches zero or your RTO deadline forces a hard cutover.
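
The wait-or-deadline rule above can be sketched as a small polling loop. This is a minimal illustration under stated assumptions: wait_for_drain is a hypothetical helper, and count_active stands in for whatever code runs the pg_stat_activity count in your environment.

```python
import time


def wait_for_drain(count_active, deadline_s, poll_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll count_active() until it returns 0 or deadline_s elapses.

    Returns True when drained, False when the RTO deadline forces a
    hard cutover. count_active is a caller-supplied callable returning
    the current number of in-flight transactions as an int.
    """
    start = clock()
    while True:
        if count_active() == 0:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(poll_s)


# Stand-in for the real query: 3, 2, 1, then 0 active transactions.
remaining = iter([3, 2, 1, 0])
drained = wait_for_drain(lambda: next(remaining), deadline_s=30, poll_s=0)
print("drained" if drained else "hard cutover")
```

Using a monotonic clock keeps the deadline immune to wall-clock adjustments during the incident.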

Step 3: Activate circuit breakers to throttle retry volume

Before rerouting, ensure circuit breakers are open so the reconnect wave is metered:

Resilience4j (Java):

yaml
resilience4j:
  circuitbreaker:
    instances:
      replicaPool:
        failureRateThreshold: 50
        waitDurationInOpenState: 5000
        slidingWindowSize: 20
        permittedNumberOfCallsInHalfOpenState: 5

Envoy proxy:

yaml
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1000
      max_pending_requests: 1000
      max_retries: 3

Custom middleware: implement a token bucket or semaphore limiting concurrent connection attempts to pool_size * 1.2. Return 503 Service Unavailable when the bucket is empty, so that clients back off instead of reconverging as a thundering herd.
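
One possible shape for the semaphore variant of that middleware is sketched below. The class name ConnectAttemptLimiter and its try_connect method are hypothetical, and the connect callable stands in for your pool's own connect routine; treat it as a starting point rather than a finished component.

```python
import threading


class ConnectAttemptLimiter:
    """Caps concurrent connection attempts at pool_size * factor."""

    def __init__(self, pool_size, factor=1.2):
        self.limit = round(pool_size * factor)
        self._slots = threading.BoundedSemaphore(self.limit)

    def try_connect(self, connect):
        """Run connect() if a slot is free, otherwise shed the request.

        Returns (status, result): (200, connect()) or (503, None).
        """
        if not self._slots.acquire(blocking=False):
            return 503, None  # no slot free: tell the client to back off
        try:
            return 200, connect()
        finally:
            self._slots.release()  # free the slot even if connect() raised


limiter = ConnectAttemptLimiter(pool_size=50)
print(limiter.limit)
print(limiter.try_connect(lambda: "conn"))
```

The non-blocking acquire is the important design choice: a blocked caller would simply move the queue from the database into the middleware, whereas an immediate 503 lets the client's backoff do its job.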

Step 4: Reroute the pool to the standby

Update PgBouncer’s target host and reload without a full restart:

bash
# Edit /etc/pgbouncer/pgbouncer.ini and update the host line:
# app_db = host=standby1 port=5432 dbname=app_db

# Send HUP to reload config in place (no connection drop for existing clients)
kill -HUP $(pgrep pgbouncer)

# Resume query dispatch (admin commands go to the "pgbouncer" console database)
psql -p 6432 -d pgbouncer -c "RESUME app_db;"

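Inline verification: server connections (sv_*) should be opening against the new host:

bash
# SHOW SERVERS reports the backend in its addr column; grep for the standby's IP if its host name is not shown
psql -p 6432 -d pgbouncer -c "SHOW SERVERS;" | grep standby1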

Step 5: Signal application readiness endpoints

Set your /db/readiness health probe to return 200 only after the pool has at least min_pool_size server connections established against the standby. Kubernetes readiness probes can poll this endpoint; returning 503 during Steps 1–4 prevents Kubernetes from routing traffic to pods that would fail anyway.
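
The readiness rule reduces to a single predicate. The sketch below shows only that decision; readiness_status is a hypothetical function, and how you obtain the count of established connections depends on your pool library.

```python
def readiness_status(established_server_conns, min_pool_size, rerouted):
    """HTTP status code for the /db/readiness probe.

    rerouted is False while Steps 1-4 are still in progress.
    established_server_conns is the number of pool connections currently
    open against the standby.
    """
    if rerouted and established_server_conns >= min_pool_size:
        return 200
    return 503


print(readiness_status(0, min_pool_size=10, rerouted=False))
print(readiness_status(4, min_pool_size=10, rerouted=True))
print(readiness_status(10, min_pool_size=10, rerouted=True))
```

Wire the return value into whatever HTTP handler your framework exposes for the probe path.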


Configuration Snippet

Minimal PgBouncer configuration tuned for failover resilience. Every critical parameter is annotated:

ini
[databases]
; Point to the active replica; update and HUP on failover
app_db = host=replica1 port=5432 dbname=app_db

[pgbouncer]
listen_port = 6432

; Hard ceiling on client-side connections: set to 2× expected peak
max_client_conn = 10000

; Server-side pool size per database/user pair
default_pool_size = 50

; Maintain warm connections to prevent a cold-start spike post-failover
min_pool_size = 10

; Emergency slots, used only once a client has waited longer than reserve_pool_timeout
reserve_pool_size = 10
reserve_pool_timeout = 3

; Close idle server connections after 30 s to release slots faster during failover
server_idle_timeout = 30

; Fail fast on a dead replica (seconds)
server_connect_timeout = 5

; Recycle server connections after 1 hour to prevent stale TCP state
server_lifetime = 3600

; TCP keepalive: detect dead connections without waiting for server_connect_timeout.
; Keep comments on their own lines; PgBouncer does not treat ";" later in a line as a comment.
tcp_keepalive = 1
; seconds before first probe
tcp_keepidle = 30
; seconds between probes
tcp_keepintvl = 10
; probes before declaring dead
tcp_keepcnt = 3

Application-layer pool settings (adjust to your framework):

  • HikariCP: maximumPoolSize=50, minimumIdle=10, connectionTimeout=2000, validationTimeout=1000, leakDetectionThreshold=30000, keepaliveTime=30000
  • SQLAlchemy: pool_size=20, max_overflow=10, pool_timeout=30, pool_recycle=1800, pool_pre_ping=True
  • Go database/sql: SetMaxOpenConns(50), SetMaxIdleConns(10), SetConnMaxLifetime(30*time.Minute), SetConnMaxIdleTime(5*time.Minute)

Verification and Rollback

Confirming the fix is working

Run immediately after resuming traffic:

sql
-- Active/idle ratio should be healthy (active < 30 % of total)
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

bash
# PgBouncer: cl_waiting must be 0; sv_active should sit below default_pool_size
psql -p 6432 -d pgbouncer -c "SHOW POOLS;"

sql
-- ProxySQL: replica must show ONLINE
SELECT hostname, status, max_replication_lag
FROM runtime_mysql_servers
WHERE hostgroup_id = 10;

-- Latest measured lag (repl_lag) must be below max_replication_lag
SELECT hostname, repl_lag
FROM monitor.mysql_server_replication_lag_log
ORDER BY time_start_us DESC
LIMIT 5;

Target thresholds: max_connections utilisation below 60 %, cl_waiting = 0, TCP keepalive probes stable, application P99 latency returning to baseline within 2× the server_connect_timeout.

Rollback procedure

If post-failover validation fails (replication lag > 10 s, connection error rate > 5 % for more than 60 s, or promotion itself failed):

bash
# 1. Halt new query routing to the problem replica
#    (admin commands go to the "pgbouncer" console database)
psql -p 6432 -d pgbouncer -c "PAUSE app_db;"

# 2. Revert pgbouncer.ini host to the original replica
sed -i 's/host=standby1/host=replica1/' /etc/pgbouncer/pgbouncer.ini

# 3. Reload config
kill -HUP $(pgrep pgbouncer)

# 4. Resume traffic
psql -p 6432 -d pgbouncer -c "RESUME app_db;"

Log every state transition. Do not attempt automatic rollback retry without manual DBA sign-off: a runaway retry loop against a degraded node compounds the problem.
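
The rollback criteria listed at the start of this procedure can be encoded so that the on-call engineer sees exactly which one tripped. The sketch below only reports; it deliberately does not trigger the rollback, in line with the manual sign-off rule. FailoverCheck and rollback_reasons are hypothetical names, and the measurements are assumed to come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class FailoverCheck:
    replication_lag_s: float      # measured lag on the new target
    error_rate: float             # connection error rate, 0.0 to 1.0
    error_rate_duration_s: float  # how long the rate has been above 5 %
    promotion_succeeded: bool


def rollback_reasons(check):
    """Return the rollback criteria that are currently violated."""
    reasons = []
    if not check.promotion_succeeded:
        reasons.append("promotion failed")
    if check.replication_lag_s > 10:
        reasons.append("replication lag > 10 s")
    if check.error_rate > 0.05 and check.error_rate_duration_s > 60:
        reasons.append("connection error rate > 5 % for more than 60 s")
    return reasons


check = FailoverCheck(replication_lag_s=12.0, error_rate=0.08,
                      error_rate_duration_s=90, promotion_succeeded=True)
for reason in rollback_reasons(check):
    print("rollback candidate:", reason)
```

An empty list means validation passed; a non-empty list is input for the DBA's decision, not an instruction to roll back automatically.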


Edge Cases and Gotchas

Cascading replication topologies

When a failed replica is itself a replication source for a downstream cascade, removing it from the pool is not enough. Downstream replicas lose their WAL stream and begin accumulating replication lag. Their pools become starved of fresh data while remaining technically connectable. The symptom looks like slow queries, not connection exhaustion: clients connect successfully but read stale rows. Implement a separate lag-based health check (SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) > 30) that excludes lagging downstream replicas from the read pool independently of TCP liveness.
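
On the routing side, the lag check amounts to filtering the replica list by measured lag rather than by reachability. A minimal sketch follows, assuming the lag values have already been collected with the query above; eligible_replicas and the host names are hypothetical.

```python
MAX_LAG_S = 30  # same threshold as the SQL health check above


def eligible_replicas(lag_by_host, max_lag_s=MAX_LAG_S):
    """Keep replicas whose replay lag is known and within the threshold.

    lag_by_host maps host name -> lag in seconds, or None when the lag
    query returned NULL or could not be run.
    """
    return sorted(
        host for host, lag in lag_by_host.items()
        if lag is not None and lag <= max_lag_s
    )


lags = {"replica-a": 2.0, "replica-b": 45.0, "replica-c": None}
print(eligible_replicas(lags))
```

Treating an unknown lag as ineligible is the conservative choice: a replica that cannot report its lag is excluded until it can.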

Multi-region deployments with cross-AZ pools

In multi-region setups, reserve_pool_size must be sized against peak regional traffic, not global traffic. A failover in us-east-1 that drains its pool does not drain eu-west-1’s pool, but if your load balancer shifts global traffic to the surviving region, that region’s pool receives a traffic spike it was not sized for. Pre-stage a second reserve_pool_size allocation in the receiving region before a planned cross-region failover, or use read/write splitting at the proxy layer to cap per-region read traffic independently.

Logical replication during failover

Physical replica promotion is clean: the promoted standby is a full-fidelity copy of the primary’s data. Logical replication slots do not transfer during promotion. If your pool routes to a logically replicated node (e.g., a partial subset replica for analytics), failover will leave the slot orphaned on the old primary. Reconnecting the pool to the new logical replica requires recreating the replication slot and resynchronising, which can take minutes on a large table. During that window, the pool connects successfully but reads return data from the pre-failover snapshot. Add a pg_replication_slots health check to your runbook and plan for this window explicitly.


Frequently Asked Questions

Why does adding jitter to retry logic help so much?

Without jitter, base_delay * 2^attempt causes every client that failed at time T to retry at T+1 s, T+3 s, T+7 s, all in synchrony. With jitter (base_delay * 2^attempt * random(0.5, 1.5)), retries spread across a window, reducing peak concurrent connection attempts by roughly the jitter factor. For a fleet of 1 000 clients, this converts a 10 000-connection burst into a 5 000–7 000 connection ramp, enough to stay under max_connections * 0.85 on most configurations.
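
The jittered formula above translates directly into a small helper. This is a sketch of that one formula, not a full retry client; jittered_delay is a hypothetical name, and the seeded generator is there only to make the demo repeatable.

```python
import random


def jittered_delay(base_delay, attempt, rng=random):
    """base_delay * 2^attempt scaled by a random factor in [0.5, 1.5]."""
    return base_delay * 2 ** attempt * rng.uniform(0.5, 1.5)


rng = random.Random(42)  # seeded only so repeated runs print the same values
for attempt in range(3):
    print(attempt, round(jittered_delay(1.0, attempt, rng), 2))
```

In production, leave rng at its default so that each client draws independent values; sharing a seed across the fleet would put the retries back in lock-step.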

Should I use transaction or session pooling mode in PgBouncer during failover?

Transaction pooling (pool_mode = transaction) releases the server connection back to the pool after every transaction boundary, so failover drains faster: there are no long-lived server connections blocking slot reuse. Session pooling holds a server connection for the client’s entire lifetime, which means a stuck client session can hold a slot indefinitely during a failover pause. Use transaction pooling for read replica pools wherever your application does not rely on session-scoped state (prepared statements, session-level SET variables, advisory locks). SET LOCAL is transaction-scoped and is therefore safe under transaction pooling.

What happens if PAUSE app_db blocks indefinitely?

PAUSE waits for all active transactions on the pooled connections to finish. If a long-running query is executing, PAUSE blocks until it completes or is cancelled. To unblock it, cancel or terminate the offending backend on the database server (pg_cancel_backend(pid) or pg_terminate_backend(pid)), and keep server_idle_timeout low enough that idle server connections are recycled before the failover window opens. As a last resort, psql -p 6432 -d pgbouncer -c "KILL app_db;" immediately drops all client and server connections for that database, and new clients then wait until RESUME is issued. Use it only when RTO requires it.

