Avoiding Connection Exhaustion During Replica Failover
Connection exhaustion during replica promotion is not a stochastic anomaly; it is a deterministic failure mode resulting from misaligned pool configurations, aggressive client-side retry logic, and topology routing lag. This runbook provides a configuration-first, step-by-step execution guide to maintain connection stability, enforce strict pool boundaries, and execute controlled failovers without cascading service degradation.
Symptom Identification & Telemetry Baselines
Detect exhaustion before cascading failure. Monitor pool saturation thresholds (>90% utilized), TCP SYN retransmission spikes, proxy queue depth metrics, and application-level error patterns (FATAL: too many connections for role, connection refused, pool exhausted).
Real-Time Alert Thresholds
Deploy actionable alert rules in Prometheus/Grafana to trigger automated mitigation:
# Pool saturation warning (PgBouncer/ProxySQL)
(pgbouncer_pools_active_connections / pgbouncer_pools_max_connections) > 0.85
# Connection wait latency degradation
pgbouncer_pools_wait_time_seconds > 0.5
# Proxy health check failure rate
rate(haproxy_server_check_failures_total{backend="db_read_replicas"}[1m]) > 0.3
Database-side baseline validation:
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Must remain < max_connections * 0.85 during normal operations
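The same guard can be automated in the monitoring pipeline. A minimal sketch, assuming the active-session count and `max_connections` value are fetched by your monitoring agent (e.g. via the query above):

```python
def pool_headroom_ok(active_count: int, max_connections: int,
                     threshold: float = 0.85) -> bool:
    """Mirror of the SQL baseline: active sessions must stay below
    max_connections * threshold during normal operations."""
    return active_count < max_connections * threshold
```

Alert when this returns False for consecutive scrape intervals rather than on a single sample, to avoid flapping on momentary bursts.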
Log Pattern Extraction
Differentiate transient connection churn from true exhaustion using structured log parsing:
- Churn/Reconnect: (?i)connection.*reset|tcp.*reset|idle.*timeout|server.*closed
- True Exhaustion: (?i)too many connections|pool exhausted|connection refused|FATAL.*role|ECONNREFUSED
Correlate timestamps across proxy access logs, database audit logs, and application stdout to isolate the exact failover window and identify the originating client fleet.
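The two pattern classes above can be wired into a small log classifier. A sketch, with the regexes copied from the list above and exhaustion checked first so ambiguous lines escalate rather than get dismissed as churn:

```python
import re

# Patterns from the runbook; extend to match your actual log formats.
CHURN_RE = re.compile(r"(?i)connection.*reset|tcp.*reset|idle.*timeout|server.*closed")
EXHAUSTION_RE = re.compile(
    r"(?i)too many connections|pool exhausted|connection refused|FATAL.*role|ECONNREFUSED"
)

def classify_log_line(line: str) -> str:
    """Label a log line as exhaustion, churn, or other."""
    if EXHAUSTION_RE.search(line):
        return "exhaustion"
    if CHURN_RE.search(line):
        return "churn"
    return "other"
```

Feed the classifier output into per-fleet counters; a rising exhaustion rate on one client fleet pinpoints the origin of the retry storm.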
Root Cause Analysis: Why Failover Triggers Exhaustion
Topology changes disrupt the connection lifecycle. DNS TTL propagation delays, proxy routing lag, and unbounded client retries generate a thundering herd. Stale connection validation often bypasses pool limits, and misconfigured connection routing and pooling strategies amplify retry storms during replica promotion events.
DNS vs Proxy Routing Latency
DNS-based routing introduces propagation windows (TTL 30s–300s) where clients resolve to decommissioned or read-only replicas. This creates split-brain connection states: half the fleet connects to an offline node, triggering immediate ECONNREFUSED or 57P01 errors. Proxy-based routing (L4/L7) eliminates DNS lag but requires active health-check synchronization to avoid routing to nodes in recovering or promoting states.
Retry Storm Mechanics
Default exponential backoff (base_delay * 2^attempt) without jitter causes synchronized retry waves. During a 5-second promotion window, 1,000 clients retrying at 1s, 2s, and 4s intervals pile thousands of near-simultaneous connection requests into the same instants, saturating max_connections and exhausting the proxy's max_client_conn buffer. Without jitter, retries align to the same millisecond, overwhelming TCP handshake queues.
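The fix is to add jitter to the backoff. A minimal sketch of "full jitter" (delay drawn uniformly from [0, capped exponential]), which spreads the retry waves described above across the whole window:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, cap: float = 30.0,
                  jitter: bool = True) -> float:
    """Exponential backoff with optional full jitter.

    Without jitter every client retries at identical offsets (1s, 2s, 4s...);
    full jitter desynchronizes the fleet so connection attempts arrive spread
    out instead of in aligned waves.
    """
    delay = min(cap, base_delay * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay
```

The cap matters as much as the jitter: without it, late attempts back off so far that recovery traffic arrives long after the promotion window has closed.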
Pre-Failover Configuration & Pool Hardening
Align pool sizing with the concurrency expected during failover (see Connection Pool Architecture for Read Replicas). Enforce strict limits, idle timeouts, and validation queries to prevent stale connections from consuming slots during topology transitions.
Proxy Layer Configuration
PgBouncer (pgbouncer.ini)
[databases]
app_db = host=replica1 port=5432 dbname=app_db
[pgbouncer]
listen_port = 6432
max_client_conn = 10000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
server_lifetime = 3600
server_idle_timeout = 30
server_connect_timeout = 5
tcp_keepalive = 1
tcp_keepidle = 30
tcp_keepintvl = 10
tcp_keepcnt = 3
ProxySQL (proxysql.cnf)
mysql_servers =
(
    { address="replica1" , port=3306 , hostgroup=10 , max_connections=2000 , max_replication_lag=5 }
)
mysql_query_rules =
(
    { rule_id=1 , active=1 , match_pattern="^SELECT" , destination_hostgroup=10 , apply=1 }
)
Application Pool Tuning
Apply failover-safe defaults across common frameworks:
- Java/HikariCP: maximumPoolSize=50, minimumIdle=10, connectionTimeout=2000, validationTimeout=1000, leakDetectionThreshold=30000
- Python/SQLAlchemy: pool_size=20, max_overflow=10, pool_timeout=30, pool_recycle=1800, pool_pre_ping=True
- Go database/sql: SetMaxOpenConns(50), SetMaxIdleConns(10), SetConnMaxLifetime(30 * time.Minute), SetConnMaxIdleTime(5 * time.Minute)
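The semantics behind these knobs are the same across frameworks. A minimal stdlib sketch of the pool_size / max_overflow / pool_timeout behavior (the `factory` callable standing in for a real connection constructor is hypothetical):

```python
import queue
import threading

class BoundedPool:
    """Sketch of bounded-pool semantics: pool_size idle slots, max_overflow
    burst headroom, and a hard timeout when every slot is checked out."""

    def __init__(self, factory, pool_size=20, max_overflow=10, pool_timeout=30.0):
        self._factory = factory
        self._idle = queue.Queue()
        self._timeout = pool_timeout
        self._pool_size = pool_size
        # Semaphore caps total checked-out connections at pool_size + max_overflow.
        self._slots = threading.Semaphore(pool_size + max_overflow)

    def acquire(self):
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("pool exhausted: no slot freed within pool_timeout")
        try:
            return self._idle.get_nowait()  # reuse an idle connection if any
        except queue.Empty:
            return self._factory()          # otherwise open a new one

    def release(self, conn):
        # Keep at most pool_size idle connections; overflow ones are discarded.
        if self._idle.qsize() < self._pool_size:
            self._idle.put(conn)
        self._slots.release()
```

The key failover property is the hard TimeoutError: a saturated pool fails fast and surfaces backpressure instead of queueing unbounded work behind a promoting replica.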
Mitigation Runbook: Active Failover Execution
Execute chronologically. Do not skip steps. Maintain strict observability during each phase.
Graceful Drain Sequence
- Pre-emptive Pause: Halt new connections to the target replica.
# PgBouncer
psql -p 6432 -c "PAUSE app_db;"
# ProxySQL
mysql -u admin -p -h 127.0.0.1 -P 6032 -e "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=10;"
- Connection Eviction: Wait for active transactions to commit/rollback. Monitor pg_stat_activity until count(*) drops to 0.
- Client Notification: Trigger application-level SIGUSR1 or have the health endpoint /db/readiness return 503 for read traffic.
- Resume Routing: Once drained, point the pool to the standby.
psql -p 6432 -c "RESUME app_db;"
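The eviction wait can be scripted rather than eyeballed. A sketch of the polling loop, where `count_active` is a hypothetical callable that runs the pg_stat_activity query above and returns the active-session count:

```python
import time

def wait_for_drain(count_active, deadline_s=120.0, poll_s=2.0, sleep=time.sleep):
    """Poll until active sessions reach 0 or the drain deadline passes.

    Returns True when drained; False means sessions did not drain in time
    and the operator should escalate (e.g. consider pg_terminate_backend)
    rather than resume routing against a half-drained node.
    """
    waited = 0.0
    while waited <= deadline_s:
        if count_active() == 0:
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

Gate the RESUME command on a True result so routing never resumes while transactions are still in flight.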
Circuit Breaker Activation
Throttle connection attempts during the promotion window to prevent retry storms.
- Resilience4j: Configure failureRateThreshold=50, waitDurationInOpenState=5000ms, slidingWindowSize=20.
- Envoy Proxy: Set circuit_breakers thresholds: max_connections: 1000, max_pending_requests: 1000, max_retries: 3.
- Custom Middleware: Implement a token bucket or semaphore limiting concurrent connection attempts to pool_size * 1.2. Return 429 Too Many Requests or 503 Service Unavailable when exhausted.
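The custom-middleware option reduces to a bounded semaphore. A minimal sketch capping concurrent connection attempts at pool_size * 1.2, with non-blocking acquisition so rejected callers can immediately return 429/503:

```python
import threading

class ConnectionGate:
    """Cap concurrent connection attempts during the promotion window."""

    def __init__(self, pool_size: int, headroom: float = 1.2):
        # Allow slight headroom over the pool so validation churn isn't rejected.
        self._sem = threading.BoundedSemaphore(max(1, int(pool_size * headroom)))

    def try_acquire(self) -> bool:
        # Non-blocking: a False result should map to 429/503 upstream,
        # never to a queued wait that feeds the retry storm.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()
```

The non-blocking acquire is the important design choice: shedding load at the edge converts a connection storm into fast, visible 429/503 responses instead of a pile-up of handshakes at the database.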
Post-Failover Validation & Rollback Procedures
Verify connection normalization. Define strict rollback triggers: replication lag > 10s, connection error rate > 5% for > 60s, or primary promotion failure.
Connection Normalization Checks
Run validation queries immediately post-failover:
-- Verify active/idle ratio
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
-- Verify wait queue clearance (PgBouncer)
SHOW POOLS; -- cl_waiting and maxwait must be 0
-- Verify proxy routing table (ProxySQL)
SELECT * FROM mysql_servers WHERE hostgroup_id=10; -- status must be ONLINE
Confirm max_connections utilization is < 60% and tcp_keepalive probes are stable.
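These checks can be combined into a single pass/fail gate for the rollback decision. A sketch, with the inputs assumed to come from the queries above (state counts from pg_stat_activity, waiting clients from the proxy's pool stats):

```python
def failover_healthy(state_counts: dict, max_connections: int,
                     waiting_clients: int, util_ceiling: float = 0.60) -> bool:
    """Post-failover gate: total sessions under 60% of max_connections
    and an empty proxy wait queue. Any failure should trigger rollback."""
    total = sum(state_counts.values())
    return total < max_connections * util_ceiling and waiting_clients == 0
```

Run the gate repeatedly for the full validation window (e.g. 60s); a single healthy sample immediately after promotion does not prove the retry wave has subsided.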
Automated Rollback Triggers
If validation fails, execute idempotent rollback:
# 1. Halt write routing immediately
psql -p 6432 -c "PAUSE app_db;"
# 2. Revert proxy routing to original topology
mysql -u admin -p -h 127.0.0.1 -P 6032 -e "UPDATE mysql_servers SET status='ONLINE' WHERE hostgroup_id=10 AND hostname='original_replica';"
# 3. Flush stale connections
psql -p 6432 -c "RECONNECT app_db;"
# 4. Resume traffic
psql -p 6432 -c "RESUME app_db;"
Log all state transitions. Do not attempt automatic retry without manual DBA sign-off.
Long-Term Architectural Safeguards
Prevent recurrence through topology-aware scaling and automated validation.
Topology-Aware Pool Scaling
Integrate pool auto-scalers with cluster state managers (e.g., Patroni, Orchestrator). When a replica transitions to promoting, dynamically reduce max_client_conn and increase server_idle_timeout to absorb connection spikes. Use Kubernetes HPA or custom controllers to adjust application SetMaxOpenConns based on pg_stat_activity metrics.
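The scaling policy itself is simple; the integration work is wiring it to the cluster-state API. A sketch of the sizing function, where the state string is assumed to come from a Patroni or Orchestrator health endpoint:

```python
def target_pool_size(cluster_state: str, base_size: int,
                     shrink_factor: float = 0.5) -> int:
    """Shrink client pools while a node is promoting or recovering so the
    reconnect wave lands on a reduced budget, then restore the base size."""
    if cluster_state in ("promoting", "recovering"):
        return max(1, int(base_size * shrink_factor))
    return base_size
```

A controller would poll the cluster state and push the result into SetMaxOpenConns (Go) or the framework equivalent; the shrink_factor of 0.5 is an illustrative starting point, not a recommendation from this runbook.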
Chaos Engineering & Failover Drills
Schedule monthly controlled failover simulations using tools like Chaos Mesh or Gremlin. Inject network partitions, DNS TTL spikes, and proxy health-check delays. Validate that circuit breakers engage, pools drain gracefully, and rollback procedures execute within defined RTO/RPO windows. Document telemetry baselines and update runbooks quarterly.