Eventual Consistency Patterns for Read-Heavy Workloads

← Back to Replication Lag & Consistency Management

Problem Framing

Scaling reads beyond a single node’s IOPS ceiling forces a concrete architectural choice: accept that replicas lag behind the primary, and build systems that handle that lag deliberately, or pay the write-latency tax of synchronous replication everywhere. Most production read fleets choose the former β€” but β€œeventual consistency” is not a passive state you fall into. It is an active operational posture with routing rules, observability thresholds, cache integration, and recovery runbooks.

This page addresses the three SLA risks that eventual consistency introduces in read replica deployments: stale reads reaching users who just wrote data, unbounded lag accumulation during apply-thread contention, and cache coherence breakdown during replica promotion. Each section below ties a concrete mechanism to its failure surface and recovery path.

Concept Definition and Scope

Eventual consistency in a replica fleet means: a write committed on the primary will, in the absence of further writes, propagate to all replicas within a bounded time window. That window is replication lag β€” the gap between when the primary commits a transaction and when a replica finishes applying it.

This differs from two adjacent concepts:

  • Read-your-writes consistency β€” a single client always sees its own writes, typically enforced by routing that client’s reads to the primary for a fixed window after any write it issues.
  • Monotonic read consistency β€” successive reads by the same client never go backwards in time, enforced by session affinity to a single replica.

Eventual consistency makes neither guarantee by default. The patterns in this page layer those guarantees back in selectively, applying them only to the data domains that require them, so the remainder of the read fleet can serve reads from asynchronous replicas freely.

Architecture: Async and Semi-Sync Replication Topologies

Async vs Semi-Sync Replication Topology A primary database sends binlog/WAL events to a first-tier replica via semi-synchronous replication (primary waits for ACK before committing). That first-tier replica fans out to two async downstream replicas which serve read traffic. A proxy routes write traffic to the primary and read traffic to replicas based on lag thresholds. Primary writes committed here semi-sync (ACK required) Replica (tier 1) semi-sync target async async Replica 2 read traffic Replica 3 read traffic Proxy ProxySQL / PgBouncer writes lag-gated read routing

Asynchronous replication decouples primary commit acknowledgment from replica apply cycles entirely. The primary writes to the binary log (MySQL) or WAL (PostgreSQL) and returns success to the client without waiting for any replica to confirm receipt. This maximises write throughput but introduces unbounded staleness if a replica stalls β€” during network partitions or heavy I/O contention, lag can grow to minutes.

Semi-synchronous replication requires at least one replica to receive and acknowledge the log event before the primary commits. This reduces worst-case data loss to near-zero, at the cost of ~2–5 ms of added write latency per transaction. In a typical tiered topology, semi-sync is applied only to the primary-to-first-tier-replica link. Downstream replicas operate asynchronously to absorb bulk read traffic without constraining write throughput.

MySQL semi-sync is configured via rpl_semi_sync_source_* variables (MySQL 8.0.26+; earlier versions use rpl_semi_sync_master_*):

ini
# my.cnf β€” Primary
[mysqld]
# Semi-sync for the first replica only
rpl_semi_sync_source_enabled = 1
# Fall back to async after 1 000 ms if no ACK received
rpl_semi_sync_source_timeout = 1000
# Durability trade-off: accept up to 1s of loss on crash to gain write IOPS
innodb_flush_log_at_trx_commit = 2
sync_binlog = 100

When rpl_semi_sync_source_timeout expires, the primary silently degrades to async. Applications must tolerate this transition: lag will spike until network conditions normalise.

Trade-off Comparison

Topology Write latency overhead Worst-case RPO Apply throughput Best fit
Fully async None Unbounded (minutes under partition) Highest Analytics, feeds, search indexing
Semi-sync (1 replica) +2–5 ms Near-zero (one replica guaranteed) High OLTP with RPO < 1 s
Semi-sync (all replicas) +5–20 ms Zero Lower (blocked by slowest replica) Compliance, financial ledgers
Synchronous (DRBD / Galera) +10–50 ms Zero Lowest Strong consistency required everywhere

The majority of read-heavy fleets use the second row: semi-sync to one replica for durability, async to the rest for read throughput.

Freshness-Aware Query Routing

Static read/write splits β€” all SELECT statements to replicas, all writes to primary β€” fail under variable load because they treat all reads as equally tolerant of staleness. A payment-status lookup requires the same freshness guarantee as the write that recorded the payment; a product-catalogue listing does not.

The correct model is a routing matrix keyed on data domain and lag threshold. Routing queries based on data freshness requirements covers this in full; the matrix below captures the three common tiers:

Domain Max acceptable lag Routing rule Fallback on threshold breach
Auth, sessions, payments < 1 s Primary or lag-gated replica Primary only
Profile, catalogue, inventory < 5 s Any replica meeting threshold Primary with rate limit
Feeds, analytics, search index Unbounded Any replica, including lagged nodes None β€” serve stale data

ProxySQL Hostgroup Configuration

sql
-- ProxySQL: three hostgroups keyed on lag threshold
-- Hostgroup 10: writer (primary)
-- Hostgroup 20: standard replicas (lag < 5 s)
-- Hostgroup 30: tolerant replicas (all replicas)

INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (1, 1, '^SELECT.*FROM sessions WHERE',        10, 1),  -- auth: primary
  (2, 1, '^SELECT.*FROM payment_records WHERE', 10, 1),  -- payments: primary
  (3, 1, '^SELECT.*FROM product_catalog',       20, 1),  -- catalogue: standard replicas
  (4, 1, '^SELECT.*FROM analytics_events',      30, 1);  -- analytics: tolerant replicas

-- Sub-500 ms health checks to catch replica degradation early
UPDATE global_variables SET variable_value = '500'
  WHERE variable_name = 'mysql-monitor_ping_interval';
UPDATE global_variables SET variable_value = '1000'
  WHERE variable_name = 'mysql-monitor_connect_interval';

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;

When all replicas in hostgroup 20 exceed the 5 s lag threshold, ProxySQL spills those queries to hostgroup 10 (the primary). Connection pools must implement a drain timeout (3–5 s) during topology changes to prevent Too many connections errors on promotion.

Monitoring and Alerting Signals

Seconds_Behind_Source (MySQL) and pg_stat_replication.write_lag (PostgreSQL) are the most-cited lag metrics, but both have blind spots. Seconds_Behind_Source returns NULL when replication threads stop and reports 0 during network partitions while the SQL thread catches up from a stalled I/O thread. For production alerting, combine three signal types:

  1. Heartbeat-derived lag β€” write a timestamp to a dedicated table on the primary every second (pt-heartbeat or a custom writer); read it on replicas to compute millisecond-accurate lag.
  2. Relay log byte offset delta β€” tracks how many bytes of log remain to be applied, useful for detecting I/O vs. apply-thread bottlenecks.
  3. Apply thread state β€” SHOW REPLICA STATUS field Replica_SQL_Running; if No, the apply thread has stopped and lag is unbounded.

Detecting and handling replication lag in real-time covers the full instrumentation stack; the Prometheus alerting rule below is the minimum production threshold:

yaml
# Prometheus alert rule β€” replication lag
groups:
  - name: replication_lag
    rules:
      - alert: ReplicaLagCritical
        expr: mysql_replication_lag_seconds{job="mysql_replicas"} > 5
        for: 30s
        labels:
          severity: critical
          action: shed_traffic
        annotations:
          summary: "Replica {{ $labels.instance }} lag exceeds 5 s"
          description: >
            Circuit breaker will open for this replica.
            Traffic rerouted to healthy nodes or primary.
      - alert: ReplicaApplyThreadStopped
        expr: mysql_replica_sql_running{job="mysql_replicas"} == 0
        for: 10s
        labels:
          severity: page
        annotations:
          summary: "Apply thread stopped on {{ $labels.instance }}"
          description: >
            Lag is now unbounded. Investigate relay log for errors
            before re-enabling replication.

Lag is non-linear. A 2 s spike during batch ingestion is normal; a sustained 8 s lag during peak traffic indicates apply-thread starvation. Use a 30 s evaluation window to filter jitter before circuit breakers fire.

Cache-Backed Read Optimisation

When replica IOPS saturate under high read concurrency, shifting hot-path queries to an in-memory layer decouples scaling from replication topology entirely. The two primary patterns:

Write-through (synchronous): The application writes to the cache and the database in the same request path, keeping the cache consistent with the primary. Adds write latency but guarantees no stale cache hits immediately after a write. Appropriate for user sessions, inventory counts, and pricing.

Write-behind (asynchronous): The application writes to the cache immediately and queues a database write. Improves write throughput but risks cache-primary divergence if the queue drains slowly or the process crashes. Appropriate for event counters and activity logs where temporary divergence is acceptable.

Stampede Prevention with Probabilistic TTL Jitter

Fixed TTLs cause cache stampedes: when thousands of keys expire simultaneously, a thundering herd of requests hits the replica fleet. Spread expiration by jittering TTLs:

python
import redis
import random
import json
import time

r = redis.Redis(host="cache-primary", decode_responses=True)

def cache_write_through(key: str, value: dict, base_ttl: int = 3600) -> str:
    """Write to cache with Β±15 % TTL jitter and versioned key for failover coherence."""
    jitter = random.uniform(-0.15 * base_ttl, 0.15 * base_ttl)
    effective_ttl = int(base_ttl + jitter)

    # Versioned key: incrementing version on failover invalidates old entries
    version = r.incr(f"{key}:version")
    versioned_key = f"{key}:v{version}"

    pipe = r.pipeline()
    pipe.set(versioned_key, json.dumps(value), ex=effective_ttl)
    pipe.execute()

    return versioned_key


def get_with_stale_fallback(versioned_key: str, replica_fn, max_staleness_s: int = 2):
    """
    Return cached value; if missing, fetch from replica.
    If replica is slow (>200 ms), serve a brief stale response via a background refresh.
    """
    cached = r.get(versioned_key)
    if cached:
        return json.loads(cached)

    # Cache miss: fetch from replica with tight timeout
    start = time.monotonic()
    fresh = replica_fn()
    elapsed = time.monotonic() - start

    if elapsed > 0.2:
        # Replica is under pressure β€” return None and let caller handle gracefully
        return None

    cache_write_through(versioned_key.rsplit(":v", 1)[0], fresh)
    return fresh

Cache Coherence During Replica Promotion

During a replica promotion, stale cache entries can continue serving pre-failover values after the new primary accepts writes. Two remediation patterns:

  • Versioned key prefix β€” on promotion, increment a global cache epoch. All new writes use key:epoch:N+1; old entries at key:epoch:N expire naturally within one TTL window.
  • Targeted invalidation β€” the promotion script iterates affected key patterns and calls DEL or UNLINK. Works well for small key spaces; impractical for millions of keys.

If Redis itself experiences latency spikes, fall back to direct replica reads with a strict 200 ms timeout. Cache misses during failover are served with the stale-while-revalidate pattern above.

Failure Modes and Recovery Steps

1. Cascading Lag from Long-Running Read Queries

Root cause: Unbounded SELECT queries on replicas consume I/O bandwidth and, in MySQL’s single-threaded replication mode, block the SQL apply thread from processing relay log events.

Diagnosis: Monitor Relay_log_space growth rate and divergence between Seconds_Behind_Source and heartbeat-derived lag. Growing relay log with flat Seconds_Behind_Source indicates an I/O thread stall; growing Seconds_Behind_Source with static relay log indicates an apply-thread stall.

Remediation:

sql
-- MySQL 8.0+: enforce per-statement execution limits on replicas
SET GLOBAL max_execution_time = 5000;   -- 5 s; applies to SELECT only
SET GLOBAL read_only        = ON;
SET GLOBAL super_read_only  = ON;

-- Enable parallel apply workers (reduces single-thread bottleneck)
SET GLOBAL replica_parallel_type    = LOGICAL_CLOCK;
SET GLOBAL replica_parallel_workers = 8;

2. Split-Brain After Network Partition

Root cause: During a partition, a replica is promoted before the original primary is definitively fenced, resulting in two nodes accepting writes simultaneously.

Diagnosis: Compare binary log positions (SHOW BINARY LOG STATUS) on both nodes. Diverging GTIDs confirm split-brain.

Remediation: Use quorum-based orchestration tools (Orchestrator, Patroni) with strict fencing (STONITH or VIP removal). Adjust DNS TTLs to 30–60 s for database endpoints, paired with connection pool health checks to prevent stale DNS from holding open connections to the demoted primary.

3. Circuit Breaker Opens; Primary Overloads

Root cause: A Prometheus alert fires, the circuit breaker opens on one or more replicas, and all traffic spills to the primary faster than connection pools can drain. The primary hits max_connections, cascading writes and reads into timeouts.

Diagnosis: Primary Threads_running spikes; max_connections errors appear in the application error log.

Remediation: Implement graduated traffic shedding β€” do not route 100 % of reads to the primary on first breach. Drop the tolerant-tier reads entirely (return 503 or serve from cache) before allowing standard-tier reads to fall through to the primary. Rate-limit primary spillover at the proxy layer.

4. Idempotent Retry Logic for Stale Read Remediation

Applications that experience stale reads or circuit-breaker rejections must retry idempotently. Implement exponential backoff with jitter: 200 ms β†’ 400 ms β†’ 800 ms, capped at 2 s. For write-after-read sequences, verify data freshness before committing a dependent transaction β€” either by routing the read to the primary or by comparing the value’s updated_at timestamp against the write epoch.

5. Replica Apply Thread Stopped

Root cause: A DDL statement, large transaction, or corrupt relay log event causes the SQL apply thread to stop.

Diagnosis: SHOW REPLICA STATUS\G β€” Replica_SQL_Running: No, with a non-empty Last_SQL_Error.

Remediation:

sql
-- Inspect the error
SHOW REPLICA STATUS\G

-- If safe to skip (idempotent DDL already applied on replica):
SET GLOBAL SQL_REPLICA_SKIP_COUNTER = 1;
START REPLICA SQL_THREAD;

-- Verify apply thread has restarted
SHOW REPLICA STATUS\G  -- Replica_SQL_Running should be Yes

Recovery Runbook

The following checklist codifies the steps for a full replica recovery, from detection through traffic reinstatement. The site renders these as interactive toggles:

FAQ

What is the difference between eventual consistency and strong consistency in a replica fleet?

Strong consistency routes every read to the primary (or a synchronous standby), guaranteeing the client always sees the latest committed write. Eventual consistency routes reads to asynchronous replicas that may lag by milliseconds to seconds; the guarantee is that, absent further writes, replicas will converge to the primary state. The trade-off is write latency and primary IOPS pressure against acceptable staleness.

How do I prevent cache stampedes when using Redis in front of read replicas?

Apply probabilistic TTL jitter: instead of a fixed expiry, set effective_ttl = base_ttl + random(-jitter, +jitter). This spreads expiration events across a window so thousands of keys do not expire simultaneously and trigger a thundering herd against replicas. Combine with a stale-while-revalidate pattern so the first thread to miss the cache kicks off a background refresh while other threads serve the briefly stale value.

When should I fall back to the primary instead of reading from a replica?

Always route to the primary for auth tokens, payment records, inventory reservations, and any write-after-read sequence where the subsequent write depends on the value just read. For everything else, assign a staleness budget and fall back to the primary only when no replica meets that budget β€” not as a default catch-all, which defeats the purpose of read scaling.

Does semi-synchronous replication eliminate replication lag?

No. Semi-sync guarantees the primary does not commit until at least one replica has received the log event β€” but β€œreceived” means written to the relay log, not applied to the replica’s data. The apply thread still processes events asynchronously after acknowledgment. Semi-sync reduces RPO (data loss on crash) to near-zero; it does not eliminate the read lag between replica receipt and apply.