Fallback Strategies When Replicas Fall Behind
Replication lag is an operational certainty in distributed database topologies, not an anomaly. Under sustained write throughput, network micro-partitions, or heavy compaction cycles, asynchronous replicas will inevitably fall behind. Treating fallback routing as an ad-hoc exception handler invites cascading failures; instead, it must be engineered as a deterministic routing contract with explicit tolerance windows, state machine transitions, and automated recovery paths.
This page covers the full lifecycle: threshold detection, circuit-breaker state machines, degraded-mode weight redistribution, primary read escalation, and pool reintegration. It assumes you have a proxy or middleware layer in front of one or more read replicas and are operating under asynchronous replication.
← Back to Replication Lag & Consistency Management
The Operational Problem
When a replica’s replay position drifts beyond your application’s data-freshness SLA, three risks materialise simultaneously:
- Stale reads — users see outdated state for critical entities (account balances, inventory counts, order statuses).
- Connection pool saturation — slow replay means query runtimes increase, pools fill, and new requests queue or timeout.
- Cascading primary overload — if the proxy blindly shifts all traffic to the primary the moment replicas lag, primary CPU and lock contention spike, potentially degrading write throughput for the entire cluster.
Fallback routing must address all three risks: it should progressively degrade replica load before problems compound, escalate only the minimum traffic to the primary, and restore replicas gracefully once they catch up.
Routing Architecture Overview
The diagram below shows how read queries move through the routing stack under normal, degraded, and escalated conditions.
Defining Lag Thresholds and Routing Triggers
Effective fallback routing begins with precise telemetry and deliberate threshold calibration. Native replication status views — pg_stat_replication.replay_lag for PostgreSQL, SHOW REPLICA STATUS / Seconds_Behind_Source for MySQL — must be polled at the proxy or middleware layer, not the application tier. Polling intervals should use exponential backoff and jitter to prevent metric flapping during transient network hiccups.
A robust detection pipeline integrates with detecting and handling replication lag in real-time to maintain a continuous baseline before routing logic evaluates state transitions.
Threshold evaluation should distinguish between absolute lag (bytes or rows behind) and temporal lag (seconds). Temporal lag is generally preferred for routing triggers because it directly maps to user-perceived staleness. The evaluation_window parameter is critical — it requires sustained lag above the threshold before demoting a replica, which prevents a brief I/O spike from triggering unnecessary traffic redistribution:
# proxy-router/lag-evaluator.yaml
replica_monitoring:
poll_interval: 2s
backoff_multiplier: 1.5
max_poll_interval: 15s
jitter_range_ms: 200
thresholds:
warning_lag_ms: 150 # Emit metric; no routing change yet
critical_lag_ms: 300 # Begin weight reduction
demotion_threshold_ms: 500 # Must exceed your application's max staleness SLA
evaluation_window: 30s # Sustained lag required before state transition firesThe demotion_threshold_ms must be set higher than your application’s maximum acceptable staleness. If your SLA allows 200ms of stale data, a demotion threshold of 500ms gives enough headroom to avoid false positives while still catching genuine drift before it compounds.
Trade-off Comparison: Fallback Approaches
| Approach | Staleness exposure | Primary load increase | Complexity | Best fit |
|---|---|---|---|---|
| Route all reads to primary immediately | None | High — may saturate | Low | Small replica sets, write-heavy OLTP |
| Proportional weight reduction (degraded mode) | Bounded by demotion_threshold_ms |
Minimal | Medium | Read-heavy apps with staleness tolerance |
| Circuit breaker + quarantine | None once quarantined | Medium | High | Large replica fleets, HA-critical workloads |
| Sticky session routing to low-lag replicas | Per-session SLA | None | Medium | User-facing reads requiring read-your-writes |
| Application-level timestamp bypass | None for flagged queries | Low (selective) | Medium | Mixed workloads with inconsistent freshness needs |
Proportional weight reduction (the middle option) is the best default for read-heavy workloads: it sheds load from struggling replicas without abruptly flooding the primary. The circuit-breaker pattern is appropriate when a replica is so far behind that even 20% traffic would cause connection pool timeouts.
Mechanism Deep-Dive: The Circuit-Breaker State Machine
Replicas should transition through four states: ACTIVE → DEGRADED → QUARANTINED → RECOVERING. Each state change must be driven by the evaluator’s sustained-lag assessment, not by individual query failures.
ACTIVE: Lag is within the warning threshold. Full traffic weight applies.
DEGRADED: Lag has exceeded critical_lag_ms for the full evaluation_window. The proxy reduces the replica’s traffic weight proportionally — typically to 20% — while healthy replicas absorb the difference. Consistency-sensitive queries are redirected to the primary or a healthy replica. Connection pool depth is capped to prevent queue buildup on a slow node.
QUARANTINED: Lag has exceeded demotion_threshold_ms for the evaluation window, or the replica has returned repeated query errors. Traffic weight drops to zero; the replica is removed from the active pool. The circuit breaker enters half-open state: a small probe traffic allowance (half_open_max_requests) periodically tests whether the replica has recovered.
RECOVERING: The replica’s lag has dropped below reintegration_lag_ms and held there for the full stability_window. Traffic is restored gradually — not all at once — to avoid a thundering herd when multiple replicas recover simultaneously.
# circuit-breaker/pool-demotion.yaml
degradation_policy:
circuit_breaker:
error_threshold: 5 # Consecutive errors before state escalates
timeout: 10s # Time in OPEN before entering half-open
half_open_max_requests: 3 # Probe requests during half-open evaluation
weight_redistribution:
degraded_weight: 0.2 # Cap lagging replica at 20% of normal load
healthy_weight: 0.8 # Remaining weight on healthy replicas
sticky_session_ttl: 300s # Keep in-flight sessions on the same host
fallback_target: remaining_healthy_replicas
degraded_state_behavior:
allow_stale_reads: false # Block stale reads for consistency-sensitive endpoints
route_critical_queries: primary
drain_timeout: 5s # Graceful drain before pool removal
log_level: WARNsticky_session_ttl preserves transaction affinity for in-flight requests: a session that started reading from Replica B will continue routing there until TTL expires or the replica is quarantined. Abrupt mid-flight routing shifts cause inconsistent result sets for paginated queries.
Baseline Routing Before Degradation Occurs
Before degradation events occur, read/write splitting middleware — PgBouncer, ProxySQL, HAProxy, or Vitess — must enforce strict query classification. Routing directives map to consistency SLAs: analytical or background jobs route to relaxed-consistency pools; user-facing reads target low-lag replicas. This baseline operates under eventual consistency patterns for read-heavy workloads as the default operational state.
Query classification via ProxySQL (MySQL) looks like this:
-- ProxySQL: route by query digest to hostgroup
INSERT INTO mysql_query_rules
(rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
(10, 1, '^SELECT.*FROM users WHERE id=', 2, 1), -- Low-lag replica group
(20, 1, '^SELECT.*FROM analytics', 3, 1); -- High-capacity async group
-- Align lag monitoring with your evaluation_window
UPDATE global_variables
SET variable_value = '2000'
WHERE variable_name = 'mysql-monitor_replication_lag_interval';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;mysql-monitor_replication_lag_interval (here, 2000ms) must align with your evaluation_window. If the proxy polls lag every 2s but your evaluation window is 30s, you get 15 data points per state transition — enough to distinguish a real trend from a spike. Query parsing overhead should remain under 2ms per statement; offload heavy regex matching to compiled middleware modules rather than inline Lua or Python.
For routing queries based on data freshness requirements, configure separate hostgroups per consistency tier so that degradation of one tier does not force all traffic into the same fallback path.
Critical Path Override: Primary Read Escalation
When all replicas exceed acceptable lag thresholds, high-value transactions require guaranteed data freshness. Primary read escalation bypasses replica routing entirely, injecting query hints or transaction-scoped routing flags to force reads against the writer node. This guarantees consistency but introduces significant trade-offs: primary write amplification, connection exhaustion under sustained escalation, and increased lock contention.
Implement escalation at the application middleware layer using explicit routing flags rather than relying on proxy auto-detection. Transaction-scoped overrides ensure that only the critical path bypasses replicas; background analytics continues routing to degraded pools:
# app/middleware/routing_interceptor.py
class PrimaryEscalationInterceptor:
def __init__(self, max_primary_conns: int, escalation_ttl: int):
self.max_primary_conns = max_primary_conns # Hard cap — prevent primary saturation
self.escalation_ttl = escalation_ttl # Max seconds a session bypasses replicas
self._active_escapes = 0
def route_query(self, query: str, context: dict) -> str:
if context.get("consistency_requirement") == "STRICT":
if self._active_escapes < self.max_primary_conns:
self._active_escapes += 1
try:
# Proxy parses this hint and routes to the writer hostgroup
return f"/* FORCE_MASTER */ {query}"
finally:
self._active_escapes -= 1
else:
raise ConnectionError(
"Primary read pool saturated — escalation limit reached"
)
return querymax_primary_conns must be strictly bounded — typically 10–20% of the primary’s max_connections. Monitor pg_stat_activity.wait_event_type during escalation windows; if Lock waits exceed 50ms, throttle escalation requests immediately. escalation_ttl limits how long a session can bypass replicas before it is forced back to the degraded pool, preventing a single long-running report from monopolising the escalation budget.
For a complete walkthrough of this pattern, see how to force primary reads for critical user transactions.
Monitoring and Alerting Signals
Instrumenting the fallback lifecycle requires metrics at three layers: the replication channel, the proxy routing state, and the application’s observed latency. The following Prometheus recording rules cover the most critical signals:
# prometheus/rules/replica-fallback.yaml
groups:
- name: replica_fallback
rules:
# Track replicas currently in DEGRADED or QUARANTINED state
- record: replica:routing_state:count
expr: count by (replica_host, state) (proxy_replica_state)
# Measure rate of fallback activations per minute
- record: replica:fallback_activation:rate1m
expr: rate(proxy_fallback_activations_total[1m])
# Fraction of queries routed to primary due to escalation
- record: replica:primary_escalation_ratio:rate5m
expr: |
rate(proxy_queries_routed_primary_total[5m])
/
rate(proxy_queries_total[5m])
# Alert when more than 15% of queries are being escalated to primary
- alert: ReplicaFallbackPrimaryOverload
expr: replica:primary_escalation_ratio:rate5m > 0.15
for: 2m
labels:
severity: critical
annotations:
summary: "Primary read escalation exceeds 15% of total query volume"
# Alert on sustained fallback activation rate
- alert: ReplicaFallbackRateHigh
expr: replica:fallback_activation:rate1m > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Replica fallback activations exceeding 5/min for 5 minutes"In addition to Prometheus, pull the following SQL snapshots during incident review:
-- PostgreSQL: lag per standby in seconds
SELECT
application_name,
state,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replica_lag_s,
replay_lag,
sent_lsn,
replay_lsn
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
-- PostgreSQL: queries waiting on locks during escalation windows
SELECT pid, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
AND state = 'active';Track routing_drift_percent — the percentage of queries that landed on an unintended hostgroup — as a proxy misconfiguration signal. A drift rate above 5% warrants a routing rule audit before it causes a widespread staleness incident.
Failure Modes and Recovery Steps
-
Primary connection pool saturation during escalation. Cause:
max_primary_connsset too high relative to the primary’smax_connections, orescalation_ttlnot enforced. Diagnosis:pg_stat_activityshows hundreds of connections inidle in transaction. Remediation: immediately lowermax_primary_connsin the interceptor config, setidle_in_transaction_session_timeout = 10sinpostgresql.conf, and restart the proxy to flush the connection pool. -
Flapping: replicas oscillate between ACTIVE and DEGRADED. Cause:
evaluation_windowis too short relative to your workload’s I/O burst pattern. Diagnosis: proxy logs show rapid state transitions with lag values fluctuating around the threshold. Remediation: double theevaluation_windowand add hysteresis — require lag to drop belowwarning_lag_ms(notdemotion_threshold_ms) before promoting back to ACTIVE. -
Thundering herd on replica reintegration. Cause: all quarantined replicas pass the stability window simultaneously and are promoted at once. Diagnosis: spike in error rate and primary CPU immediately after replicas recover. Remediation: enforce
sequential_promotion: trueand add a staggered delay of 5–10s between each replica’s promotion. Cap new-connection rate on reintegrated replicas. -
Split-brain routing: proxies disagree on replica state. Cause: clock skew between proxy nodes, or network partition between the proxy and the monitoring agent. Diagnosis: queries returning inconsistent data across requests because one proxy routes to Replica B while another does not. Remediation: centralise lag state in a shared store (Redis, etcd) rather than per-proxy memory; all proxies read from the same source of truth for replica health.
-
Stale cache propagation during fallback. Cause: application-level caches (Redis, Memcached) are not invalidated when a replica is demoted, so stale reads continue even after escalation routes to the primary. Diagnosis: cache hit rate remains high but users see outdated data. Remediation: set
cache_invalidation_hook: truein the demotion config; emit a cache-bust event keyed on the affected entity types when a replica transitions to DEGRADED.
Configuration Runbook: Automated Pool Reintegration
# recovery/automated-reintegration.yaml
reconciliation:
reintegration_lag_ms: 50 # Replica must be within 50ms before reintegration starts
stability_window: 60s # Must hold that lag for 60s — no transient catches
sequential_promotion: true # Bring replicas online one at a time
promotion_stagger_s: 8 # Wait 8s between each sequential promotion
max_retries: 3 # Give up after 3 failed stability windows
cache_invalidation_hook: true # Emit cache-bust events before reintegration
observability:
track_fallback_activation_rate: true
track_routing_drift: true
track_primary_cpu_during_fallback: true
alert_thresholds:
fallback_rate_per_min: 5
primary_cpu_percent: 75 # Throttle escalation if primary CPU exceeds 75%
routing_drift_percent: 15After reintegration completes, verify catch-up completeness by comparing replication slot positions:
-- Confirm all standbys are fully caught up post-reintegration
SELECT
slot_name,
confirmed_flush_lsn,
pg_current_wal_lsn() - confirmed_flush_lsn AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical';Zero lag_bytes across all slots confirms that replica pools have fully replayed the primary’s WAL before you close the incident.
Production Readiness Checklist
FAQ
How do I prevent flapping when a replica briefly spikes in lag?
Use an evaluation_window (typically 20–60 s) that requires lag to remain above the demotion threshold continuously before a state transition fires. Combine this with exponential backoff and jitter on your polling interval so transient I/O spikes don’t trigger premature demotions. For promotion back to ACTIVE, apply hysteresis: require lag to drop below warning_lag_ms rather than demotion_threshold_ms so there is a meaningful gap between demotion and promotion thresholds.
What is the safest way to reintegrate a lagging replica after it catches up?
Require the replica to hold lag below reintegration_lag_ms for the full stability_window (60 s minimum). Promote replicas one at a time (sequential_promotion) to avoid a thundering-herd reconnect storm, and invalidate any application-level caches that served stale data before reintegration completes. Confirm slot positions with the pg_replication_slots query above before closing the incident.
When should I route reads to the primary instead of a lagging replica?
Escalate to primary reads only for queries tagged with a strict consistency requirement — for example, post-write reads inside the same user session where read-your-writes consistency must be guaranteed — or when every replica in the pool exceeds the demotion threshold. For analytical or background reads, routing to a degraded replica with explicit staleness tolerance is preferable to saturating the primary.
Does this pattern differ for synchronous vs asynchronous replication?
For synchronous replication, the replica’s replay lag is bounded by the commit acknowledgement protocol — typically under 10ms — so the demotion threshold can be set very low and fallback fires quickly. For asynchronous replication, operational baselines often tolerate 50–200ms before triggering degradation. The circuit-breaker state machine applies identically in both cases; only the threshold values differ.
Child Pages
- How to Force Primary Reads for Critical User Transactions — step-by-step implementation of the
FORCE_MASTERhint pattern, connection-pool budget sizing, and rollback procedure when primary saturation occurs.
Related
- ← Replication Lag & Consistency Management — parent section covering the full lag/SLA problem space including detection, measurement, and consistency model selection.
- Detecting and Handling Replication Lag in Real-Time — upstream detection signals that feed the thresholds used in this fallback pipeline.
- Eventual Consistency Patterns for Read-Heavy Workloads — baseline consistency model this fallback strategy degrades from.
- Routing Queries Based on Data Freshness Requirements — per-query consistency tagging that determines which queries qualify for primary escalation.
- Connection Pool Architecture for Read Replicas — pool sizing and saturation patterns that interact with degraded-mode weight redistribution.
- Implementing Read/Write Splitting at the Proxy Layer — the baseline routing layer that fallback logic is layered on top of.