Replication Lag & Consistency Management

Asynchronous replication is the architectural trade-off at the heart of every read replica deployment: you gain horizontal read throughput at the cost of introducing a propagation window during which replicas may return data that the primary has already superseded. In practice, that window ranges from a few milliseconds during steady-state OLTP to minutes during bulk ingestion, DDL migrations, or network events — and the gap between “acceptable staleness” and “silent data corruption” is often just one poorly-routed query. Managing replication lag means detecting the delay accurately, setting per-workload freshness SLOs, routing reads intelligently based on real-time telemetry, and defining explicit degraded-state behavior for the moments when replicas fall outside those thresholds. This page maps the full engineering surface of that problem.


Architecture Overview

The diagram below shows how a write on the primary flows through the replication pipeline, where lag accumulates, and which components observe and act on that lag:

Replication lag pipeline and routing decision layer A write flows from the application to the primary database, generating a WAL record that travels over the network to each replica's receive buffer. The apply worker processes records from the buffer into the replica's data pages. Prometheus scrapes lag metrics from each replica. The query router reads those metrics and decides whether to send an incoming read to a replica or fall back to the primary. Application writes + reads Query Router ProxySQL / client-side lib writes reads (if lag ok) Primary DB WAL / Binlog writer sent_lsn / exec_gtid WAL stream (async) Replica A recv buffer → apply lag: replay_lsn delta e.g. 120 ms Replica B recv buffer → apply lag: replay_lsn delta e.g. 4 200 ms Prometheus scrapes /replication/metrics scrape lag telemetry → routing weights Lag accumulates here: network + apply queue

The two key positions in the pipeline where lag is observable are the WAL position delta (bytes between sent_lsn on the primary and replay_lsn on the replica, exposed via pg_stat_replication) and the heartbeat timestamp delta (a dedicated heartbeat row written to the primary every few seconds, compared against the replica’s copy). Use both: position delta tells you queue depth in bytes; heartbeat delta gives you actual wall-clock staleness. For MySQL the equivalent is Seconds_Behind_Source in SHOW REPLICA STATUS, supplemented by GTID position tracking.


Replication Topology & Lag Mechanics

Asynchronous replication decouples write throughput from read scalability by not requiring replicas to acknowledge writes before the primary commits. This produces the propagation delay managed throughout this page.

The replication graph shapes baseline latency. Star topologies fan out from a single primary, minimizing hop count but concentrating network egress on one node. Chain topologies cascade writes through intermediate nodes, reducing primary egress load but compounding RTT variance — lag on a mid-chain node magnifies on every downstream node. Multi-region topologies introduce WAN RTT as a hard floor on write propagation, often 30–80 ms per continent crossing.

The apply pipeline on each replica processes changes through four serial stages: network receive, log decode, transaction replay, and disk flush. Single-threaded apply workers become the bottleneck under high-concurrency write streams. Lock contention during replay stalls downstream commits, and disk I/O saturation causes WAL segment flush to fall behind arrival rate.

Topology Write Throughput Read Freshness Network Partition Tolerance Best Use Case
Star High Moderate Low (primary SPOF) Standard OLTP
Chain Moderate Low Moderate Edge/regional caching
Multi-Primary Very High Variable High Geo-distributed writes

Enabling parallel apply workers (PostgreSQL):

ini
# postgresql.conf — increases replay parallelism to reduce apply queue buildup
wal_level = logical              # required for logical parallel apply
max_logical_replication_workers = 8
max_wal_senders = 12
max_replication_slots = 12       # one per replica + headroom for tools
wal_compression = on             # reduces network transfer under bursty write load

For MySQL, parallel replication is controlled by replica_parallel_type = LOGICAL_CLOCK and replica_parallel_workers = 8 (MySQL 8.0+, replacing the deprecated slave_* names).

Failure modes:

  • Disk I/O saturation on the replica causes receive buffer queue buildup; lag grows at the rate of write throughput minus apply throughput.
  • Network jitter amplifies RTT variance, producing false lag spikes that briefly exceed SLO thresholds before self-correcting.
  • Bulk loads on the primary saturate the WAL sender, starving the replication stream and causing exponential lag accumulation in downstream apply workers.

Consistency Models & Application Trade-offs

The right consistency model for each query is determined by how much stale data is tolerable, not by a single global database setting. Selecting the wrong model causes either silent data corruption (inventory oversell, double-spend) or unnecessary read latency from over-routing to the primary.

Evaluating consistency models for distributed reads covers the formal definitions. The operational mapping is:

Strong consistency routes all reads to the primary or to a synchronous replica. Zero staleness, but caps horizontal read throughput — every read hits the write path.

Session consistency ties reads to a replica that has replayed at least the session’s last committed LSN/GTID. A user sees their own write immediately; other users’ concurrent writes may lag.

Monotonic reads guarantee that a client never observes older state than its previous read. Implemented by tracking the highest observed LSN in the client session token and refusing dispatch to any replica whose replay_lsn is behind it.

Eventual consistency routes reads to any replica regardless of lag. Acceptable for analytics dashboards and content delivery; not acceptable for balance reads or inventory state.

Consistency Tier Staleness Window Developer Complexity Infrastructure Overhead
Strong 0 ms Low High (sync commits or primary reads)
Session 0–500 ms Medium Medium (LSN affinity)
Monotonic 0–2 000 ms Medium Low (client token tracking)
Eventual Seconds–minutes High (app must tolerate stale) Low

Map workload profiles to staleness windows explicitly. Analytics dashboards tolerate minutes of lag. User profile updates require sub-second alignment. Payment state transitions demand strong consistency.

Eventual consistency patterns for read-heavy workloads covers cache TTL alignment with replication lag percentiles to prevent cache stampedes from amplifying underlying delay.

Failure modes:

  • Stale reads on inventory or balance checks trigger oversell or overdraft when two concurrent sessions read the same pre-decrement value from different replicas.
  • Cache stampedes amplify lag by flooding replicas with identical stale queries when cache TTLs are shorter than replication lag.
  • Monotonic-read token loss (session expiry, device switch) forces fallback to eventual consistency without explicit UI notification.

Connection Routing & Query Dispatch

Dynamic traffic routing prevents stale reads from reaching freshness-sensitive endpoints. Routing decisions must consume real-time lag telemetry — not static DNS records or connection strings that do not reflect current replica health.

Two deployment patterns dominate. Proxy-based routing (ProxySQL for MySQL, pgpool-II or Pgcat for PostgreSQL) centralizes policy enforcement and lag-aware weight adjustment at the cost of one additional network hop (typically 1–3 ms). Client-side routing (custom connection pool wrapper, SQLAlchemy multi-engine selector, Prisma datasource switching) eliminates that hop but scatters routing logic across service boundaries. For the trade-offs between these approaches see connection routing and pooling strategies.

Routing queries based on data freshness requirements covers the full decision tree: how to classify queries by staleness tolerance, how to encode freshness annotations in query hints or ORM contexts, and how to implement per-endpoint SLO enforcement without hardcoding replica hostnames.

ProxySQL routing rules by query pattern (MySQL):

sql
-- Route transactional selects (orders table) to low-lag hostgroup 10
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FROM orders', 10, 1);

-- Route analytical selects to high-lag-tolerant hostgroup 20
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT.*FROM analytics', 20, 1);

-- Monitor interval controls how quickly lag changes re-weight the pool
UPDATE global_variables SET variable_value='2000'
  WHERE variable_name='mysql-monitor_connect_interval';
UPDATE global_variables SET variable_value='1000'
  WHERE variable_name='mysql-monitor_read_only_interval';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
Routing Strategy Latency Overhead Failover Complexity Maintenance Burden
Proxy-based 1–3 ms Low Centralized
Client-side 0 ms High Distributed
DNS TTL Variable (cache-dependent) Very high Manual

Failure modes:

  • Split-brain routing dispatches writes to an isolated replica during a network partition, producing divergent state that is expensive to reconcile.
  • Proxy connection pool exhaustion during network degradation causes timeout cascades that are indistinguishable from replica failure.
  • DNS cache poisoning routes traffic to decommissioned nodes when TTLs are set higher than the deployment cycle.

Real-Time Monitoring & Detection

Visibility into lag must span the entire apply pipeline. Instrumenting only the final Seconds_Behind_Source or heartbeat delta misses intermediate bottlenecks: a saturated receive buffer may not yet show up in apply lag but is already accumulating a queue.

Instrument WAL/Binlog position tracking alongside apply queue depth. Correlate OS-level signals — CPU steal, IOPS utilization, network retransmit rate — with replication throughput. Report lag percentiles, not averages: a p50 of 50 ms and a p99 of 8 000 ms indicate a fundamentally different problem than uniformly low lag.

Detecting and handling replication lag in real-time covers the full alerting stack: heartbeat insertion frequency, Prometheus recording rules, and circuit-breaker integration that removes replicas from the pool before SLO violations reach application users.

Prometheus scrape configuration and alerting rules:

yaml
# prometheus.yml scrape config for replica exporters
scrape_configs:
  - job_name: postgres_replicas
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ['replica-a:9187', 'replica-b:9187']

# alerting rules
groups:
  - name: replication_lag
    rules:
      - alert: ReplicationLagP95Exceeded
        expr: |
          histogram_quantile(0.95,
            rate(replication_lag_seconds_bucket[5m])
          ) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} p95 lag exceeds 2 s SLO"

      - alert: ApplyQueueSaturation
        expr: replication_apply_queue_depth > 10000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Apply queue depth on {{ $labels.instance }} exceeds 10 k records"

      - alert: ReplicationHeartbeatStale
        expr: |
          time() - replication_heartbeat_timestamp_seconds > 30
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Heartbeat on {{ $labels.instance }} not updated for 30 s — replica may be disconnected"

Per-tier SLO thresholds:

Service Tier Max Acceptable Lag Alert Condition Routing Action
Tier-1 (user-facing APIs) 500 ms p95 > 500 ms for 1 min Route to primary
Tier-2 (internal tools) 5 s p95 > 5 s for 2 min Exclude from pool
Tier-3 (batch/analytics) 60 s p95 > 60 s for 5 min Alert only

Failure modes:

  • Metric collection lag — caused by long Prometheus scrape intervals or slow exporter response — masks actual replication delay by reporting stale telemetry.
  • GC pauses on the JVM-based monitoring agent produce false positive lag spikes that trigger unnecessary failovers.
  • Missing heartbeat timeouts leave degraded replicas in the routing pool; heartbeat absence must itself be alertable.

Client-Side Guarantees & Read-After-Write

Post-write reads routed naively to replicas return stale data in the window between commit and replication. The standard mitigation is read-after-write routing: attach the committed LSN or GTID to the session token, then on each subsequent read check whether the target replica’s replay_lsn is at or ahead of that value before dispatching.

Session token based routing:

javascript
// After a write commits, capture the position from the primary's response
const sessionToken = {
  last_commit_ts: Date.now(),
  lsn: primaryResult.pg_current_wal_lsn  // returned by your write wrapper
};

// On each read, decide routing by comparing replica lag against the token
async function routeRead(query, token) {
  const candidates = await replicaPool.getReplicasWithReplayLsnGte(token.lsn);
  if (candidates.length > 0) {
    return candidates[0].execute(query);        // use freshest qualifying replica
  }
  // No replica has caught up yet — fall back to primary for this query only
  return primary.execute(query);
}

For PostgreSQL, pg_last_wal_replay_lsn() on each replica exposes the current apply position. For MySQL, compare Retrieved_Gtid_Set and Executed_Gtid_Set from SHOW REPLICA STATUS.

Handle edge cases explicitly:

  • Cross-device sessions require a distributed token store (Redis, Memcached) so that a mobile app and a web browser belonging to the same user share the same last-write LSN.
  • Mobile offline states must queue writes locally and replay on reconnect; the LSN token is not valid until the queued write commits on the primary.
  • Token expiration (e.g. after 30 seconds of inactivity) should trigger graceful fallback to eventual consistency, surfaced as a UI indicator where data freshness is user-visible.
Guarantee Latency Penalty Session Overhead Routing Precision
Primary pinning +primary RTT Low Exact
LSN/GTID timestamp check +validation query Medium High
Monotonic read token +token lookup High High

Failure modes:

  • Session token loss on reconnect forces stale-replica reads, breaking post-write read expectations silently.
  • Clock skew across application servers breaks wall-clock timestamp ordering when not using LSN-based tokens.
  • Retry storms under high load saturate the primary connection pool when every post-write read falls back simultaneously.

Operational Resilience & Fallback Workflows

Service availability must degrade gracefully when replicas fall outside acceptable lag thresholds. Fallback paths must protect the primary from absorbing the full read load that replicas would otherwise serve.

Define three threshold levels for automatic response:

  1. Warning (p95 lag > SLO / 2): Reduce routing weight to this replica; shift load to fresher replicas.
  2. Critical (p95 lag > SLO): Exclude replica from active pool; route to cached responses or queue reads.
  3. Emergency (all replicas excluded): Direct reads to primary with rate limiting; activate read-through cache; enqueue non-critical reads for later replay.

Fallback strategies when replicas fall behind provides the complete circuit-breaker implementation and covers how to size the fallback cache, set queue depth limits, and recover replicas back into the pool after lag subsides.

Circuit breaker and rate limit configuration:

yaml
circuit_breaker:
  lag_threshold_ms: 5000           # open circuit when p95 lag exceeds 5 s
  open_duration: 30s               # hold open for 30 s before half-open probe
  half_open_requests: 5            # probe with 5 requests before full re-admission
  fallback_strategy: cache_then_queue  # try cache first; queue if cache miss

rate_limit:
  analytical_rps: 50               # shed analytical load first
  transactional_rps: 5000          # protect transactional SLA
  primary_fallback_rps: 200        # cap primary fallback traffic during degraded state

Conduct chaos drills quarterly for: replica desync during bulk import, network partition isolating one replica, disk saturation on replica (inject via stress-ng), and primary failover during high read load.

Fallback Path Primary Load Impact Data Freshness Recovery Complexity
Primary fallback High Strong Low
Cached response None Stale by TTL Medium
Queued reads Low Delayed High
Read-only mode None Blocked Low

Failure modes:

  • Primary saturation from fallback traffic triggers cascading timeouts if the primary connection pool is sized only for write traffic.
  • Split-brain during network partitions causes divergent writes when fallback logic does not distinguish between replica lag and network isolation.
  • Stale cache poisoning returns outdated data after replicas recover if the cache does not invalidate on replica re-admission.

Schema Evolution & Maintenance Workflows

DDL changes disrupt replication streams if applied without considering the apply pipeline. Every table lock on the primary stalls the WAL sender; every incompatible column type change causes apply errors on downstream replicas.

Apply backward-compatible changes in phases:

  1. Add nullable columns with server-side defaults (no app changes needed, no lock propagated).
  2. Deploy application code that writes both old and new columns.
  3. Backfill the new column offline during low-traffic windows using batched UPDATE statements with small LIMIT clauses to avoid holding long transactions.
  4. Drop the old column only after confirming no reader references it.

Configure replication filters to bypass non-critical schema sync during peak hours:

sql
-- PostgreSQL: non-blocking index creation — does not stall WAL sender
CREATE INDEX CONCURRENTLY idx_orders_fulfillment
  ON orders(fulfillment_status);

-- PostgreSQL 11+: add column without table rewrite
ALTER TABLE orders
  ADD COLUMN fulfillment_status VARCHAR(32) DEFAULT 'pending';

-- MySQL 8.0: online DDL — no table lock, no replication pause
ALTER TABLE orders
  ADD COLUMN fulfillment_status VARCHAR(32) DEFAULT 'pending',
  ALGORITHM=INPLACE,
  LOCK=NONE;

-- MySQL: exclude staging schema from replication during migration testing
CHANGE REPLICATION FILTER REPLICATE_IGNORE_DB = (temp_staging);
Migration Strategy Replication Impact Rollback Complexity Downtime Risk
Online DDL Low Medium None
Table swap High (brief lock) Low Brief window
Dual-write Very high (2x WAL volume) High None

Failure modes:

  • DDL that acquires ACCESS EXCLUSIVE lock on the primary blocks the WAL sender until the lock releases, causing lag proportional to DDL duration plus ongoing write volume.
  • Incompatible column type changes (e.g. INT to BIGINT in MySQL row-based replication) cause apply errors that silently halt replication on replicas that do not share the schema change.
  • Long-running transactions open during DDL hold row-level locks that prevent VACUUM from reclaiming dead tuples, triggering table bloat that compounds apply lag.

Topic Index

The following pages go deep on the specific mechanisms summarized above.

Detection & Monitoring

Detecting and Handling Replication Lag in Real-Time Covers heartbeat-based lag measurement, WAL position tracking via pg_stat_replication, Prometheus alerting rules, and the circuit-breaker integration that removes replicas from the routing pool before lag violations reach application users. Includes specific metric queries for both PostgreSQL and MySQL.

Routing by Freshness

Routing Queries Based on Data Freshness Requirements Explains how to classify queries by staleness tolerance, encode freshness annotations in query hints, and implement per-endpoint SLO enforcement using ProxySQL rules and client-side LSN/GTID token validation. Shows how to avoid hardcoding replica hostnames in application configuration.

Eventual Consistency Patterns

Eventual Consistency Patterns for Read-Heavy Workloads Details how to design application read paths that tolerate replication lag without sacrificing correctness. Covers cache TTL alignment with lag percentiles, read-your-writes guarantees via session tokens, and how to surface data freshness state to end users without degrading UX.

Fallback Strategies

Fallback Strategies When Replicas Fall Behind Documents the complete degraded-state playbook: circuit breaker thresholds, fallback to read-through cache, queued read strategies, and primary protection rate limits. Includes runnable ProxySQL configuration and a recovery re-admission checklist.


Production Readiness Checklist


FAQ

What is replication lag and why does it matter?

Replication lag is the elapsed time between a write committing on the primary and that change becoming readable on a replica. It matters because applications routing reads to replicas may return stale data, causing silent correctness bugs in inventory checks, balance reads, and session state. In practice, lag ranges from a few milliseconds during steady-state OLTP to minutes during bulk ingestion or DDL migrations.

How do I measure replication lag accurately?

Use heartbeat-based measurement alongside WAL position deltas. Write a timestamp to a dedicated heartbeat table on the primary every few seconds; compare it against the replica’s copy for wall-clock lag. Supplement with sent_lsn - replay_lsn from pg_stat_replication (PostgreSQL) or Seconds_Behind_Source from SHOW REPLICA STATUS (MySQL) for byte-level queue depth. WAL position tells you how much data is queued; heartbeat delta tells you actual staleness.

When should reads be routed to the primary instead of a replica?

Route to the primary immediately after a write that the same session must read back (read-after-write), for any query requiring strict serializability, and when all replicas exceed the service tier’s lag SLO. Attach the committed LSN or GTID to the session token and compare it against each candidate replica’s replay position before dispatching. Only fall back to the primary if no qualifying replica is available.

What causes replication lag spikes under bulk load?

Single-threaded apply workers serialize high-concurrency write streams into sequential replay, creating a queue. Long-running transactions on the primary hold open WAL segments, blocking WAL purge. DDL statements (especially without CONCURRENTLY or INPLACE) lock replication threads. Disk I/O saturation on the replica causes WAL segment flush to fall behind arrival rate. The root signal is apply queue depth growing faster than the apply worker drains it.

How do I protect the primary when all replicas are lagging?

Implement a circuit breaker that opens when p95 lag exceeds the SLO, routing reads to a read-through cache first. Set a hard rate limit on primary fallback traffic (primary_fallback_rps) to prevent the read load from overwhelming the write path. For non-latency-sensitive reads, queue them and replay after replica recovery rather than falling back to the primary at all. Size your connection pool to handle the worst-case fallback load before an incident occurs.