Fallback Strategies When Replicas Fall Behind

Replication lag is an operational certainty in distributed database topologies, not an anomaly. Under sustained write throughput, network micro-partitions, or heavy compaction cycles, asynchronous replicas will inevitably fall behind. Treating fallback routing as an ad-hoc exception handler invites cascading failures; instead, it must be engineered as a deterministic routing contract with explicit tolerance windows, state machine transitions, and automated recovery paths.

This page covers the full lifecycle: threshold detection, circuit-breaker state machines, degraded-mode weight redistribution, primary read escalation, and pool reintegration. It assumes you have a proxy or middleware layer in front of one or more read replicas and are operating under asynchronous replication.

Back to Replication Lag & Consistency Management


The Operational Problem

When a replica’s replay position drifts beyond your application’s data-freshness SLA, three risks materialise simultaneously:

  1. Stale reads — users see outdated state for critical entities (account balances, inventory counts, order statuses).
  2. Connection pool saturation — slow replay means query runtimes increase, pools fill, and new requests queue or timeout.
  3. Cascading primary overload — if the proxy blindly shifts all traffic to the primary the moment replicas lag, primary CPU and lock contention spike, potentially degrading write throughput for the entire cluster.

Fallback routing must address all three risks: it should progressively degrade replica load before problems compound, escalate only the minimum traffic to the primary, and restore replicas gracefully once they catch up.


Routing Architecture Overview

The diagram below shows how read queries move through the routing stack under normal, degraded, and escalated conditions.

Replica Fallback Routing State Diagram Queries enter through the application, pass through a lag evaluator in the proxy layer, and are routed to healthy replicas, degraded replicas at reduced weight, or the primary for escalated reads depending on replica lag state. Application READ query Proxy Lag Evaluator + Circuit Breaker full weight 20% weight STRICT only Replica A ACTIVE (lag < 150ms) Replica B DEGRADED (lag > 300ms) Primary escalated reads only lag clears → stability window → RECOVERING normal path degraded path escalated path

Defining Lag Thresholds and Routing Triggers

Effective fallback routing begins with precise telemetry and deliberate threshold calibration. Native replication status views — pg_stat_replication.replay_lag for PostgreSQL, SHOW REPLICA STATUS / Seconds_Behind_Source for MySQL — must be polled at the proxy or middleware layer, not the application tier. Polling intervals should use exponential backoff and jitter to prevent metric flapping during transient network hiccups.

A robust detection pipeline integrates with detecting and handling replication lag in real-time to maintain a continuous baseline before routing logic evaluates state transitions.

Threshold evaluation should distinguish between absolute lag (bytes or rows behind) and temporal lag (seconds). Temporal lag is generally preferred for routing triggers because it directly maps to user-perceived staleness. The evaluation_window parameter is critical — it requires sustained lag above the threshold before demoting a replica, which prevents a brief I/O spike from triggering unnecessary traffic redistribution:

yaml
# proxy-router/lag-evaluator.yaml
replica_monitoring:
  poll_interval: 2s
  backoff_multiplier: 1.5
  max_poll_interval: 15s
  jitter_range_ms: 200
  thresholds:
    warning_lag_ms: 150          # Emit metric; no routing change yet
    critical_lag_ms: 300         # Begin weight reduction
    demotion_threshold_ms: 500   # Must exceed your application's max staleness SLA
    evaluation_window: 30s       # Sustained lag required before state transition fires

The demotion_threshold_ms must be set higher than your application’s maximum acceptable staleness. If your SLA allows 200ms of stale data, a demotion threshold of 500ms gives enough headroom to avoid false positives while still catching genuine drift before it compounds.


Trade-off Comparison: Fallback Approaches

Approach Staleness exposure Primary load increase Complexity Best fit
Route all reads to primary immediately None High — may saturate Low Small replica sets, write-heavy OLTP
Proportional weight reduction (degraded mode) Bounded by demotion_threshold_ms Minimal Medium Read-heavy apps with staleness tolerance
Circuit breaker + quarantine None once quarantined Medium High Large replica fleets, HA-critical workloads
Sticky session routing to low-lag replicas Per-session SLA None Medium User-facing reads requiring read-your-writes
Application-level timestamp bypass None for flagged queries Low (selective) Medium Mixed workloads with inconsistent freshness needs

Proportional weight reduction (the middle option) is the best default for read-heavy workloads: it sheds load from struggling replicas without abruptly flooding the primary. The circuit-breaker pattern is appropriate when a replica is so far behind that even 20% traffic would cause connection pool timeouts.


Mechanism Deep-Dive: The Circuit-Breaker State Machine

Replicas should transition through four states: ACTIVE → DEGRADED → QUARANTINED → RECOVERING. Each state change must be driven by the evaluator’s sustained-lag assessment, not by individual query failures.

ACTIVE: Lag is within the warning threshold. Full traffic weight applies.

DEGRADED: Lag has exceeded critical_lag_ms for the full evaluation_window. The proxy reduces the replica’s traffic weight proportionally — typically to 20% — while healthy replicas absorb the difference. Consistency-sensitive queries are redirected to the primary or a healthy replica. Connection pool depth is capped to prevent queue buildup on a slow node.

QUARANTINED: Lag has exceeded demotion_threshold_ms for the evaluation window, or the replica has returned repeated query errors. Traffic weight drops to zero; the replica is removed from the active pool. The circuit breaker enters half-open state: a small probe traffic allowance (half_open_max_requests) periodically tests whether the replica has recovered.

RECOVERING: The replica’s lag has dropped below reintegration_lag_ms and held there for the full stability_window. Traffic is restored gradually — not all at once — to avoid a thundering herd when multiple replicas recover simultaneously.

yaml
# circuit-breaker/pool-demotion.yaml
degradation_policy:
  circuit_breaker:
    error_threshold: 5            # Consecutive errors before state escalates
    timeout: 10s                  # Time in OPEN before entering half-open
    half_open_max_requests: 3     # Probe requests during half-open evaluation

  weight_redistribution:
    degraded_weight: 0.2          # Cap lagging replica at 20% of normal load
    healthy_weight: 0.8           # Remaining weight on healthy replicas
    sticky_session_ttl: 300s      # Keep in-flight sessions on the same host
    fallback_target: remaining_healthy_replicas

  degraded_state_behavior:
    allow_stale_reads: false       # Block stale reads for consistency-sensitive endpoints
    route_critical_queries: primary
    drain_timeout: 5s             # Graceful drain before pool removal
    log_level: WARN

sticky_session_ttl preserves transaction affinity for in-flight requests: a session that started reading from Replica B will continue routing there until TTL expires or the replica is quarantined. Abrupt mid-flight routing shifts cause inconsistent result sets for paginated queries.


Baseline Routing Before Degradation Occurs

Before degradation events occur, read/write splitting middleware — PgBouncer, ProxySQL, HAProxy, or Vitess — must enforce strict query classification. Routing directives map to consistency SLAs: analytical or background jobs route to relaxed-consistency pools; user-facing reads target low-lag replicas. This baseline operates under eventual consistency patterns for read-heavy workloads as the default operational state.

Query classification via ProxySQL (MySQL) looks like this:

sql
-- ProxySQL: route by query digest to hostgroup
INSERT INTO mysql_query_rules
  (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (10, 1, '^SELECT.*FROM users WHERE id=', 2, 1),   -- Low-lag replica group
  (20, 1, '^SELECT.*FROM analytics',       3, 1);   -- High-capacity async group

-- Align lag monitoring with your evaluation_window
UPDATE global_variables
  SET variable_value = '2000'
  WHERE variable_name = 'mysql-monitor_replication_lag_interval';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;

mysql-monitor_replication_lag_interval (here, 2000ms) must align with your evaluation_window. If the proxy polls lag every 2s but your evaluation window is 30s, you get 15 data points per state transition — enough to distinguish a real trend from a spike. Query parsing overhead should remain under 2ms per statement; offload heavy regex matching to compiled middleware modules rather than inline Lua or Python.

For routing queries based on data freshness requirements, configure separate hostgroups per consistency tier so that degradation of one tier does not force all traffic into the same fallback path.


Critical Path Override: Primary Read Escalation

When all replicas exceed acceptable lag thresholds, high-value transactions require guaranteed data freshness. Primary read escalation bypasses replica routing entirely, injecting query hints or transaction-scoped routing flags to force reads against the writer node. This guarantees consistency but introduces significant trade-offs: primary write amplification, connection exhaustion under sustained escalation, and increased lock contention.

Implement escalation at the application middleware layer using explicit routing flags rather than relying on proxy auto-detection. Transaction-scoped overrides ensure that only the critical path bypasses replicas; background analytics continues routing to degraded pools:

python
# app/middleware/routing_interceptor.py
class PrimaryEscalationInterceptor:
    def __init__(self, max_primary_conns: int, escalation_ttl: int):
        self.max_primary_conns = max_primary_conns   # Hard cap — prevent primary saturation
        self.escalation_ttl = escalation_ttl         # Max seconds a session bypasses replicas
        self._active_escapes = 0

    def route_query(self, query: str, context: dict) -> str:
        if context.get("consistency_requirement") == "STRICT":
            if self._active_escapes < self.max_primary_conns:
                self._active_escapes += 1
                try:
                    # Proxy parses this hint and routes to the writer hostgroup
                    return f"/* FORCE_MASTER */ {query}"
                finally:
                    self._active_escapes -= 1
            else:
                raise ConnectionError(
                    "Primary read pool saturated — escalation limit reached"
                )
        return query

max_primary_conns must be strictly bounded — typically 10–20% of the primary’s max_connections. Monitor pg_stat_activity.wait_event_type during escalation windows; if Lock waits exceed 50ms, throttle escalation requests immediately. escalation_ttl limits how long a session can bypass replicas before it is forced back to the degraded pool, preventing a single long-running report from monopolising the escalation budget.

For a complete walkthrough of this pattern, see how to force primary reads for critical user transactions.


Monitoring and Alerting Signals

Instrumenting the fallback lifecycle requires metrics at three layers: the replication channel, the proxy routing state, and the application’s observed latency. The following Prometheus recording rules cover the most critical signals:

yaml
# prometheus/rules/replica-fallback.yaml
groups:
  - name: replica_fallback
    rules:
      # Track replicas currently in DEGRADED or QUARANTINED state
      - record: replica:routing_state:count
        expr: count by (replica_host, state) (proxy_replica_state)

      # Measure rate of fallback activations per minute
      - record: replica:fallback_activation:rate1m
        expr: rate(proxy_fallback_activations_total[1m])

      # Fraction of queries routed to primary due to escalation
      - record: replica:primary_escalation_ratio:rate5m
        expr: |
          rate(proxy_queries_routed_primary_total[5m])
          /
          rate(proxy_queries_total[5m])

      # Alert when more than 15% of queries are being escalated to primary
      - alert: ReplicaFallbackPrimaryOverload
        expr: replica:primary_escalation_ratio:rate5m > 0.15
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Primary read escalation exceeds 15% of total query volume"

      # Alert on sustained fallback activation rate
      - alert: ReplicaFallbackRateHigh
        expr: replica:fallback_activation:rate1m > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica fallback activations exceeding 5/min for 5 minutes"

In addition to Prometheus, pull the following SQL snapshots during incident review:

sql
-- PostgreSQL: lag per standby in seconds
SELECT
  application_name,
  state,
  EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replica_lag_s,
  replay_lag,
  sent_lsn,
  replay_lsn
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

-- PostgreSQL: queries waiting on locks during escalation windows
SELECT pid, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND state = 'active';

Track routing_drift_percent — the percentage of queries that landed on an unintended hostgroup — as a proxy misconfiguration signal. A drift rate above 5% warrants a routing rule audit before it causes a widespread staleness incident.


Failure Modes and Recovery Steps

  1. Primary connection pool saturation during escalation. Cause: max_primary_conns set too high relative to the primary’s max_connections, or escalation_ttl not enforced. Diagnosis: pg_stat_activity shows hundreds of connections in idle in transaction. Remediation: immediately lower max_primary_conns in the interceptor config, set idle_in_transaction_session_timeout = 10s in postgresql.conf, and restart the proxy to flush the connection pool.

  2. Flapping: replicas oscillate between ACTIVE and DEGRADED. Cause: evaluation_window is too short relative to your workload’s I/O burst pattern. Diagnosis: proxy logs show rapid state transitions with lag values fluctuating around the threshold. Remediation: double the evaluation_window and add hysteresis — require lag to drop below warning_lag_ms (not demotion_threshold_ms) before promoting back to ACTIVE.

  3. Thundering herd on replica reintegration. Cause: all quarantined replicas pass the stability window simultaneously and are promoted at once. Diagnosis: spike in error rate and primary CPU immediately after replicas recover. Remediation: enforce sequential_promotion: true and add a staggered delay of 5–10s between each replica’s promotion. Cap new-connection rate on reintegrated replicas.

  4. Split-brain routing: proxies disagree on replica state. Cause: clock skew between proxy nodes, or network partition between the proxy and the monitoring agent. Diagnosis: queries returning inconsistent data across requests because one proxy routes to Replica B while another does not. Remediation: centralise lag state in a shared store (Redis, etcd) rather than per-proxy memory; all proxies read from the same source of truth for replica health.

  5. Stale cache propagation during fallback. Cause: application-level caches (Redis, Memcached) are not invalidated when a replica is demoted, so stale reads continue even after escalation routes to the primary. Diagnosis: cache hit rate remains high but users see outdated data. Remediation: set cache_invalidation_hook: true in the demotion config; emit a cache-bust event keyed on the affected entity types when a replica transitions to DEGRADED.


Configuration Runbook: Automated Pool Reintegration

yaml
# recovery/automated-reintegration.yaml
reconciliation:
  reintegration_lag_ms: 50         # Replica must be within 50ms before reintegration starts
  stability_window: 60s            # Must hold that lag for 60s — no transient catches
  sequential_promotion: true       # Bring replicas online one at a time
  promotion_stagger_s: 8           # Wait 8s between each sequential promotion
  max_retries: 3                   # Give up after 3 failed stability windows
  cache_invalidation_hook: true    # Emit cache-bust events before reintegration

  observability:
    track_fallback_activation_rate: true
    track_routing_drift: true
    track_primary_cpu_during_fallback: true
    alert_thresholds:
      fallback_rate_per_min: 5
      primary_cpu_percent: 75      # Throttle escalation if primary CPU exceeds 75%
      routing_drift_percent: 15

After reintegration completes, verify catch-up completeness by comparing replication slot positions:

sql
-- Confirm all standbys are fully caught up post-reintegration
SELECT
  slot_name,
  confirmed_flush_lsn,
  pg_current_wal_lsn() - confirmed_flush_lsn AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical';

Zero lag_bytes across all slots confirms that replica pools have fully replayed the primary’s WAL before you close the incident.


Production Readiness Checklist


FAQ

How do I prevent flapping when a replica briefly spikes in lag?

Use an evaluation_window (typically 20–60 s) that requires lag to remain above the demotion threshold continuously before a state transition fires. Combine this with exponential backoff and jitter on your polling interval so transient I/O spikes don’t trigger premature demotions. For promotion back to ACTIVE, apply hysteresis: require lag to drop below warning_lag_ms rather than demotion_threshold_ms so there is a meaningful gap between demotion and promotion thresholds.

What is the safest way to reintegrate a lagging replica after it catches up?

Require the replica to hold lag below reintegration_lag_ms for the full stability_window (60 s minimum). Promote replicas one at a time (sequential_promotion) to avoid a thundering-herd reconnect storm, and invalidate any application-level caches that served stale data before reintegration completes. Confirm slot positions with the pg_replication_slots query above before closing the incident.

When should I route reads to the primary instead of a lagging replica?

Escalate to primary reads only for queries tagged with a strict consistency requirement — for example, post-write reads inside the same user session where read-your-writes consistency must be guaranteed — or when every replica in the pool exceeds the demotion threshold. For analytical or background reads, routing to a degraded replica with explicit staleness tolerance is preferable to saturating the primary.

Does this pattern differ for synchronous vs asynchronous replication?

For synchronous replication, the replica’s replay lag is bounded by the commit acknowledgement protocol — typically under 10ms — so the demotion threshold can be set very low and fallback fires quickly. For asynchronous replication, operational baselines often tolerate 50–200ms before triggering degradation. The circuit-breaker state machine applies identically in both cases; only the threshold values differ.


Child Pages