Eventual Consistency Patterns for Read-Heavy Workloads
β Back to Replication Lag & Consistency Management
Problem Framing
Scaling reads beyond a single nodeβs IOPS ceiling forces a concrete architectural choice: accept that replicas lag behind the primary, and build systems that handle that lag deliberately, or pay the write-latency tax of synchronous replication everywhere. Most production read fleets choose the former β but βeventual consistencyβ is not a passive state you fall into. It is an active operational posture with routing rules, observability thresholds, cache integration, and recovery runbooks.
This page addresses the three SLA risks that eventual consistency introduces in read replica deployments: stale reads reaching users who just wrote data, unbounded lag accumulation during apply-thread contention, and cache coherence breakdown during replica promotion. Each section below ties a concrete mechanism to its failure surface and recovery path.
Concept Definition and Scope
Eventual consistency in a replica fleet means: a write committed on the primary will, in the absence of further writes, propagate to all replicas within a bounded time window. That window is replication lag β the gap between when the primary commits a transaction and when a replica finishes applying it.
This differs from two adjacent concepts:
- Read-your-writes consistency β a single client always sees its own writes, typically enforced by routing that clientβs reads to the primary for a fixed window after any write it issues.
- Monotonic read consistency β successive reads by the same client never go backwards in time, enforced by session affinity to a single replica.
Eventual consistency makes neither guarantee by default. The patterns in this page layer those guarantees back in selectively, applying them only to the data domains that require them, so the remainder of the read fleet can serve reads from asynchronous replicas freely.
Architecture: Async and Semi-Sync Replication Topologies
Asynchronous replication decouples primary commit acknowledgment from replica apply cycles entirely. The primary writes to the binary log (MySQL) or WAL (PostgreSQL) and returns success to the client without waiting for any replica to confirm receipt. This maximises write throughput but introduces unbounded staleness if a replica stalls β during network partitions or heavy I/O contention, lag can grow to minutes.
Semi-synchronous replication requires at least one replica to receive and acknowledge the log event before the primary commits. This reduces worst-case data loss to near-zero, at the cost of ~2β5 ms of added write latency per transaction. In a typical tiered topology, semi-sync is applied only to the primary-to-first-tier-replica link. Downstream replicas operate asynchronously to absorb bulk read traffic without constraining write throughput.
MySQL semi-sync is configured via rpl_semi_sync_source_* variables (MySQL 8.0.26+; earlier versions use rpl_semi_sync_master_*):
# my.cnf β Primary
[mysqld]
# Semi-sync for the first replica only
rpl_semi_sync_source_enabled = 1
# Fall back to async after 1 000 ms if no ACK received
rpl_semi_sync_source_timeout = 1000
# Durability trade-off: accept up to 1s of loss on crash to gain write IOPS
innodb_flush_log_at_trx_commit = 2
sync_binlog = 100When rpl_semi_sync_source_timeout expires, the primary silently degrades to async. Applications must tolerate this transition: lag will spike until network conditions normalise.
Trade-off Comparison
| Topology | Write latency overhead | Worst-case RPO | Apply throughput | Best fit |
|---|---|---|---|---|
| Fully async | None | Unbounded (minutes under partition) | Highest | Analytics, feeds, search indexing |
| Semi-sync (1 replica) | +2β5 ms | Near-zero (one replica guaranteed) | High | OLTP with RPO < 1 s |
| Semi-sync (all replicas) | +5β20 ms | Zero | Lower (blocked by slowest replica) | Compliance, financial ledgers |
| Synchronous (DRBD / Galera) | +10β50 ms | Zero | Lowest | Strong consistency required everywhere |
The majority of read-heavy fleets use the second row: semi-sync to one replica for durability, async to the rest for read throughput.
Freshness-Aware Query Routing
Static read/write splits β all SELECT statements to replicas, all writes to primary β fail under variable load because they treat all reads as equally tolerant of staleness. A payment-status lookup requires the same freshness guarantee as the write that recorded the payment; a product-catalogue listing does not.
The correct model is a routing matrix keyed on data domain and lag threshold. Routing queries based on data freshness requirements covers this in full; the matrix below captures the three common tiers:
| Domain | Max acceptable lag | Routing rule | Fallback on threshold breach |
|---|---|---|---|
| Auth, sessions, payments | < 1 s | Primary or lag-gated replica | Primary only |
| Profile, catalogue, inventory | < 5 s | Any replica meeting threshold | Primary with rate limit |
| Feeds, analytics, search index | Unbounded | Any replica, including lagged nodes | None β serve stale data |
ProxySQL Hostgroup Configuration
-- ProxySQL: three hostgroups keyed on lag threshold
-- Hostgroup 10: writer (primary)
-- Hostgroup 20: standard replicas (lag < 5 s)
-- Hostgroup 30: tolerant replicas (all replicas)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
(1, 1, '^SELECT.*FROM sessions WHERE', 10, 1), -- auth: primary
(2, 1, '^SELECT.*FROM payment_records WHERE', 10, 1), -- payments: primary
(3, 1, '^SELECT.*FROM product_catalog', 20, 1), -- catalogue: standard replicas
(4, 1, '^SELECT.*FROM analytics_events', 30, 1); -- analytics: tolerant replicas
-- Sub-500 ms health checks to catch replica degradation early
UPDATE global_variables SET variable_value = '500'
WHERE variable_name = 'mysql-monitor_ping_interval';
UPDATE global_variables SET variable_value = '1000'
WHERE variable_name = 'mysql-monitor_connect_interval';
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;When all replicas in hostgroup 20 exceed the 5 s lag threshold, ProxySQL spills those queries to hostgroup 10 (the primary). Connection pools must implement a drain timeout (3β5 s) during topology changes to prevent Too many connections errors on promotion.
Monitoring and Alerting Signals
Seconds_Behind_Source (MySQL) and pg_stat_replication.write_lag (PostgreSQL) are the most-cited lag metrics, but both have blind spots. Seconds_Behind_Source returns NULL when replication threads stop and reports 0 during network partitions while the SQL thread catches up from a stalled I/O thread. For production alerting, combine three signal types:
- Heartbeat-derived lag β write a timestamp to a dedicated table on the primary every second (
pt-heartbeator a custom writer); read it on replicas to compute millisecond-accurate lag. - Relay log byte offset delta β tracks how many bytes of log remain to be applied, useful for detecting I/O vs. apply-thread bottlenecks.
- Apply thread state β
SHOW REPLICA STATUSfieldReplica_SQL_Running; ifNo, the apply thread has stopped and lag is unbounded.
Detecting and handling replication lag in real-time covers the full instrumentation stack; the Prometheus alerting rule below is the minimum production threshold:
# Prometheus alert rule β replication lag
groups:
- name: replication_lag
rules:
- alert: ReplicaLagCritical
expr: mysql_replication_lag_seconds{job="mysql_replicas"} > 5
for: 30s
labels:
severity: critical
action: shed_traffic
annotations:
summary: "Replica {{ $labels.instance }} lag exceeds 5 s"
description: >
Circuit breaker will open for this replica.
Traffic rerouted to healthy nodes or primary.
- alert: ReplicaApplyThreadStopped
expr: mysql_replica_sql_running{job="mysql_replicas"} == 0
for: 10s
labels:
severity: page
annotations:
summary: "Apply thread stopped on {{ $labels.instance }}"
description: >
Lag is now unbounded. Investigate relay log for errors
before re-enabling replication.Lag is non-linear. A 2 s spike during batch ingestion is normal; a sustained 8 s lag during peak traffic indicates apply-thread starvation. Use a 30 s evaluation window to filter jitter before circuit breakers fire.
Cache-Backed Read Optimisation
When replica IOPS saturate under high read concurrency, shifting hot-path queries to an in-memory layer decouples scaling from replication topology entirely. The two primary patterns:
Write-through (synchronous): The application writes to the cache and the database in the same request path, keeping the cache consistent with the primary. Adds write latency but guarantees no stale cache hits immediately after a write. Appropriate for user sessions, inventory counts, and pricing.
Write-behind (asynchronous): The application writes to the cache immediately and queues a database write. Improves write throughput but risks cache-primary divergence if the queue drains slowly or the process crashes. Appropriate for event counters and activity logs where temporary divergence is acceptable.
Stampede Prevention with Probabilistic TTL Jitter
Fixed TTLs cause cache stampedes: when thousands of keys expire simultaneously, a thundering herd of requests hits the replica fleet. Spread expiration by jittering TTLs:
import redis
import random
import json
import time
r = redis.Redis(host="cache-primary", decode_responses=True)
def cache_write_through(key: str, value: dict, base_ttl: int = 3600) -> str:
"""Write to cache with Β±15 % TTL jitter and versioned key for failover coherence."""
jitter = random.uniform(-0.15 * base_ttl, 0.15 * base_ttl)
effective_ttl = int(base_ttl + jitter)
# Versioned key: incrementing version on failover invalidates old entries
version = r.incr(f"{key}:version")
versioned_key = f"{key}:v{version}"
pipe = r.pipeline()
pipe.set(versioned_key, json.dumps(value), ex=effective_ttl)
pipe.execute()
return versioned_key
def get_with_stale_fallback(versioned_key: str, replica_fn, max_staleness_s: int = 2):
"""
Return cached value; if missing, fetch from replica.
If replica is slow (>200 ms), serve a brief stale response via a background refresh.
"""
cached = r.get(versioned_key)
if cached:
return json.loads(cached)
# Cache miss: fetch from replica with tight timeout
start = time.monotonic()
fresh = replica_fn()
elapsed = time.monotonic() - start
if elapsed > 0.2:
# Replica is under pressure β return None and let caller handle gracefully
return None
cache_write_through(versioned_key.rsplit(":v", 1)[0], fresh)
return freshCache Coherence During Replica Promotion
During a replica promotion, stale cache entries can continue serving pre-failover values after the new primary accepts writes. Two remediation patterns:
- Versioned key prefix β on promotion, increment a global cache epoch. All new writes use
key:epoch:N+1; old entries atkey:epoch:Nexpire naturally within one TTL window. - Targeted invalidation β the promotion script iterates affected key patterns and calls
DELorUNLINK. Works well for small key spaces; impractical for millions of keys.
If Redis itself experiences latency spikes, fall back to direct replica reads with a strict 200 ms timeout. Cache misses during failover are served with the stale-while-revalidate pattern above.
Failure Modes and Recovery Steps
1. Cascading Lag from Long-Running Read Queries
Root cause: Unbounded SELECT queries on replicas consume I/O bandwidth and, in MySQLβs single-threaded replication mode, block the SQL apply thread from processing relay log events.
Diagnosis: Monitor Relay_log_space growth rate and divergence between Seconds_Behind_Source and heartbeat-derived lag. Growing relay log with flat Seconds_Behind_Source indicates an I/O thread stall; growing Seconds_Behind_Source with static relay log indicates an apply-thread stall.
Remediation:
-- MySQL 8.0+: enforce per-statement execution limits on replicas
SET GLOBAL max_execution_time = 5000; -- 5 s; applies to SELECT only
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;
-- Enable parallel apply workers (reduces single-thread bottleneck)
SET GLOBAL replica_parallel_type = LOGICAL_CLOCK;
SET GLOBAL replica_parallel_workers = 8;2. Split-Brain After Network Partition
Root cause: During a partition, a replica is promoted before the original primary is definitively fenced, resulting in two nodes accepting writes simultaneously.
Diagnosis: Compare binary log positions (SHOW BINARY LOG STATUS) on both nodes. Diverging GTIDs confirm split-brain.
Remediation: Use quorum-based orchestration tools (Orchestrator, Patroni) with strict fencing (STONITH or VIP removal). Adjust DNS TTLs to 30β60 s for database endpoints, paired with connection pool health checks to prevent stale DNS from holding open connections to the demoted primary.
3. Circuit Breaker Opens; Primary Overloads
Root cause: A Prometheus alert fires, the circuit breaker opens on one or more replicas, and all traffic spills to the primary faster than connection pools can drain. The primary hits max_connections, cascading writes and reads into timeouts.
Diagnosis: Primary Threads_running spikes; max_connections errors appear in the application error log.
Remediation: Implement graduated traffic shedding β do not route 100 % of reads to the primary on first breach. Drop the tolerant-tier reads entirely (return 503 or serve from cache) before allowing standard-tier reads to fall through to the primary. Rate-limit primary spillover at the proxy layer.
4. Idempotent Retry Logic for Stale Read Remediation
Applications that experience stale reads or circuit-breaker rejections must retry idempotently. Implement exponential backoff with jitter: 200 ms β 400 ms β 800 ms, capped at 2 s. For write-after-read sequences, verify data freshness before committing a dependent transaction β either by routing the read to the primary or by comparing the valueβs updated_at timestamp against the write epoch.
5. Replica Apply Thread Stopped
Root cause: A DDL statement, large transaction, or corrupt relay log event causes the SQL apply thread to stop.
Diagnosis: SHOW REPLICA STATUS\G β Replica_SQL_Running: No, with a non-empty Last_SQL_Error.
Remediation:
-- Inspect the error
SHOW REPLICA STATUS\G
-- If safe to skip (idempotent DDL already applied on replica):
SET GLOBAL SQL_REPLICA_SKIP_COUNTER = 1;
START REPLICA SQL_THREAD;
-- Verify apply thread has restarted
SHOW REPLICA STATUS\G -- Replica_SQL_Running should be YesRecovery Runbook
The following checklist codifies the steps for a full replica recovery, from detection through traffic reinstatement. The site renders these as interactive toggles:
Related Pages
- Routing queries based on data freshness requirements β how to build a per-domain staleness budget and wire it into proxy routing rules
- Detecting and handling replication lag in real-time β heartbeat instrumentation, Prometheus exporters, and alert correlation for replica fleets
- Using application-level timestamps to bypass stale replicas β client-side read-your-writes enforcement without proxy involvement
- Fallback strategies when replicas fall behind β primary fallback, partial reads, and graceful degradation patterns for high-lag events
FAQ
What is the difference between eventual consistency and strong consistency in a replica fleet?
Strong consistency routes every read to the primary (or a synchronous standby), guaranteeing the client always sees the latest committed write. Eventual consistency routes reads to asynchronous replicas that may lag by milliseconds to seconds; the guarantee is that, absent further writes, replicas will converge to the primary state. The trade-off is write latency and primary IOPS pressure against acceptable staleness.
How do I prevent cache stampedes when using Redis in front of read replicas?
Apply probabilistic TTL jitter: instead of a fixed expiry, set effective_ttl = base_ttl + random(-jitter, +jitter). This spreads expiration events across a window so thousands of keys do not expire simultaneously and trigger a thundering herd against replicas. Combine with a stale-while-revalidate pattern so the first thread to miss the cache kicks off a background refresh while other threads serve the briefly stale value.
When should I fall back to the primary instead of reading from a replica?
Always route to the primary for auth tokens, payment records, inventory reservations, and any write-after-read sequence where the subsequent write depends on the value just read. For everything else, assign a staleness budget and fall back to the primary only when no replica meets that budget β not as a default catch-all, which defeats the purpose of read scaling.
Does semi-synchronous replication eliminate replication lag?
No. Semi-sync guarantees the primary does not commit until at least one replica has received the log event β but βreceivedβ means written to the relay log, not applied to the replicaβs data. The apply thread still processes events asynchronously after acknowledgment. Semi-sync reduces RPO (data loss on crash) to near-zero; it does not eliminate the read lag between replica receipt and apply.
Related
- Replication Lag & Consistency Management β parent section covering the full lag lifecycle: measurement, routing, and recovery
- Routing queries based on data freshness requirements β per-domain staleness budgets and proxy routing matrices
- Fallback strategies when replicas fall behind β circuit breakers, primary fallback, and partial-read degradation
- Connection pool architecture for read replicas β ProxySQL and PgBouncer pool design, drain windows, and failover routing
- Evaluating consistency models for distributed reads β CAP theorem trade-offs and consistency model selection for distributed read fleets