Replication Lag & Consistency Management
Asynchronous replication is the architectural trade-off at the heart of every read replica deployment: you gain horizontal read throughput at the cost of introducing a propagation window during which replicas may return data that the primary has already superseded. In practice, that window ranges from a few milliseconds during steady-state OLTP to minutes during bulk ingestion, DDL migrations, or network events — and the gap between “acceptable staleness” and “silent data corruption” is often just one poorly-routed query. Managing replication lag means detecting the delay accurately, setting per-workload freshness SLOs, routing reads intelligently based on real-time telemetry, and defining explicit degraded-state behavior for the moments when replicas fall outside those thresholds. This page maps the full engineering surface of that problem.
Architecture Overview
The diagram below shows how a write on the primary flows through the replication pipeline, where lag accumulates, and which components observe and act on that lag:
The two key positions in the pipeline where lag is observable are the WAL position delta (bytes between sent_lsn on the primary and replay_lsn on the replica, exposed via pg_stat_replication) and the heartbeat timestamp delta (a dedicated heartbeat row written to the primary every few seconds, compared against the replica’s copy). Use both: position delta tells you queue depth in bytes; heartbeat delta gives you actual wall-clock staleness. For MySQL the equivalent is Seconds_Behind_Source in SHOW REPLICA STATUS, supplemented by GTID position tracking.
Replication Topology & Lag Mechanics
Asynchronous replication decouples write throughput from read scalability by not requiring replicas to acknowledge writes before the primary commits. This produces the propagation delay managed throughout this page.
The replication graph shapes baseline latency. Star topologies fan out from a single primary, minimizing hop count but concentrating network egress on one node. Chain topologies cascade writes through intermediate nodes, reducing primary egress load but compounding RTT variance — lag on a mid-chain node magnifies on every downstream node. Multi-region topologies introduce WAN RTT as a hard floor on write propagation, often 30–80 ms per continent crossing.
The apply pipeline on each replica processes changes through four serial stages: network receive, log decode, transaction replay, and disk flush. Single-threaded apply workers become the bottleneck under high-concurrency write streams. Lock contention during replay stalls downstream commits, and disk I/O saturation causes WAL segment flush to fall behind arrival rate.
| Topology | Write Throughput | Read Freshness | Network Partition Tolerance | Best Use Case |
|---|---|---|---|---|
| Star | High | Moderate | Low (primary SPOF) | Standard OLTP |
| Chain | Moderate | Low | Moderate | Edge/regional caching |
| Multi-Primary | Very High | Variable | High | Geo-distributed writes |
Enabling parallel apply workers (PostgreSQL):
# postgresql.conf — increases replay parallelism to reduce apply queue buildup
wal_level = logical # required for logical parallel apply
max_logical_replication_workers = 8
max_wal_senders = 12
max_replication_slots = 12 # one per replica + headroom for tools
wal_compression = on # reduces network transfer under bursty write loadFor MySQL, parallel replication is controlled by replica_parallel_type = LOGICAL_CLOCK and replica_parallel_workers = 8 (MySQL 8.0+, replacing the deprecated slave_* names).
Failure modes:
- Disk I/O saturation on the replica causes receive buffer queue buildup; lag grows at the rate of write throughput minus apply throughput.
- Network jitter amplifies RTT variance, producing false lag spikes that briefly exceed SLO thresholds before self-correcting.
- Bulk loads on the primary saturate the WAL sender, starving the replication stream and causing exponential lag accumulation in downstream apply workers.
Consistency Models & Application Trade-offs
The right consistency model for each query is determined by how much stale data is tolerable, not by a single global database setting. Selecting the wrong model causes either silent data corruption (inventory oversell, double-spend) or unnecessary read latency from over-routing to the primary.
Evaluating consistency models for distributed reads covers the formal definitions. The operational mapping is:
Strong consistency routes all reads to the primary or to a synchronous replica. Zero staleness, but caps horizontal read throughput — every read hits the write path.
Session consistency ties reads to a replica that has replayed at least the session’s last committed LSN/GTID. A user sees their own write immediately; other users’ concurrent writes may lag.
Monotonic reads guarantee that a client never observes older state than its previous read. Implemented by tracking the highest observed LSN in the client session token and refusing dispatch to any replica whose replay_lsn is behind it.
Eventual consistency routes reads to any replica regardless of lag. Acceptable for analytics dashboards and content delivery; not acceptable for balance reads or inventory state.
| Consistency Tier | Staleness Window | Developer Complexity | Infrastructure Overhead |
|---|---|---|---|
| Strong | 0 ms | Low | High (sync commits or primary reads) |
| Session | 0–500 ms | Medium | Medium (LSN affinity) |
| Monotonic | 0–2 000 ms | Medium | Low (client token tracking) |
| Eventual | Seconds–minutes | High (app must tolerate stale) | Low |
Map workload profiles to staleness windows explicitly. Analytics dashboards tolerate minutes of lag. User profile updates require sub-second alignment. Payment state transitions demand strong consistency.
Eventual consistency patterns for read-heavy workloads covers cache TTL alignment with replication lag percentiles to prevent cache stampedes from amplifying underlying delay.
Failure modes:
- Stale reads on inventory or balance checks trigger oversell or overdraft when two concurrent sessions read the same pre-decrement value from different replicas.
- Cache stampedes amplify lag by flooding replicas with identical stale queries when cache TTLs are shorter than replication lag.
- Monotonic-read token loss (session expiry, device switch) forces fallback to eventual consistency without explicit UI notification.
Connection Routing & Query Dispatch
Dynamic traffic routing prevents stale reads from reaching freshness-sensitive endpoints. Routing decisions must consume real-time lag telemetry — not static DNS records or connection strings that do not reflect current replica health.
Two deployment patterns dominate. Proxy-based routing (ProxySQL for MySQL, pgpool-II or Pgcat for PostgreSQL) centralizes policy enforcement and lag-aware weight adjustment at the cost of one additional network hop (typically 1–3 ms). Client-side routing (custom connection pool wrapper, SQLAlchemy multi-engine selector, Prisma datasource switching) eliminates that hop but scatters routing logic across service boundaries. For the trade-offs between these approaches see connection routing and pooling strategies.
Routing queries based on data freshness requirements covers the full decision tree: how to classify queries by staleness tolerance, how to encode freshness annotations in query hints or ORM contexts, and how to implement per-endpoint SLO enforcement without hardcoding replica hostnames.
ProxySQL routing rules by query pattern (MySQL):
-- Route transactional selects (orders table) to low-lag hostgroup 10
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FROM orders', 10, 1);
-- Route analytical selects to high-lag-tolerant hostgroup 20
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT.*FROM analytics', 20, 1);
-- Monitor interval controls how quickly lag changes re-weight the pool
UPDATE global_variables SET variable_value='2000'
WHERE variable_name='mysql-monitor_connect_interval';
UPDATE global_variables SET variable_value='1000'
WHERE variable_name='mysql-monitor_read_only_interval';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;| Routing Strategy | Latency Overhead | Failover Complexity | Maintenance Burden |
|---|---|---|---|
| Proxy-based | 1–3 ms | Low | Centralized |
| Client-side | 0 ms | High | Distributed |
| DNS TTL | Variable (cache-dependent) | Very high | Manual |
Failure modes:
- Split-brain routing dispatches writes to an isolated replica during a network partition, producing divergent state that is expensive to reconcile.
- Proxy connection pool exhaustion during network degradation causes timeout cascades that are indistinguishable from replica failure.
- DNS cache poisoning routes traffic to decommissioned nodes when TTLs are set higher than the deployment cycle.
Real-Time Monitoring & Detection
Visibility into lag must span the entire apply pipeline. Instrumenting only the final Seconds_Behind_Source or heartbeat delta misses intermediate bottlenecks: a saturated receive buffer may not yet show up in apply lag but is already accumulating a queue.
Instrument WAL/Binlog position tracking alongside apply queue depth. Correlate OS-level signals — CPU steal, IOPS utilization, network retransmit rate — with replication throughput. Report lag percentiles, not averages: a p50 of 50 ms and a p99 of 8 000 ms indicate a fundamentally different problem than uniformly low lag.
Detecting and handling replication lag in real-time covers the full alerting stack: heartbeat insertion frequency, Prometheus recording rules, and circuit-breaker integration that removes replicas from the pool before SLO violations reach application users.
Prometheus scrape configuration and alerting rules:
# prometheus.yml scrape config for replica exporters
scrape_configs:
- job_name: postgres_replicas
scrape_interval: 5s
metrics_path: /metrics
static_configs:
- targets: ['replica-a:9187', 'replica-b:9187']
# alerting rules
groups:
- name: replication_lag
rules:
- alert: ReplicationLagP95Exceeded
expr: |
histogram_quantile(0.95,
rate(replication_lag_seconds_bucket[5m])
) > 2
for: 2m
labels:
severity: warning
annotations:
summary: "Replica {{ $labels.instance }} p95 lag exceeds 2 s SLO"
- alert: ApplyQueueSaturation
expr: replication_apply_queue_depth > 10000
for: 1m
labels:
severity: critical
annotations:
summary: "Apply queue depth on {{ $labels.instance }} exceeds 10 k records"
- alert: ReplicationHeartbeatStale
expr: |
time() - replication_heartbeat_timestamp_seconds > 30
for: 30s
labels:
severity: critical
annotations:
summary: "Heartbeat on {{ $labels.instance }} not updated for 30 s — replica may be disconnected"Per-tier SLO thresholds:
| Service Tier | Max Acceptable Lag | Alert Condition | Routing Action |
|---|---|---|---|
| Tier-1 (user-facing APIs) | 500 ms | p95 > 500 ms for 1 min | Route to primary |
| Tier-2 (internal tools) | 5 s | p95 > 5 s for 2 min | Exclude from pool |
| Tier-3 (batch/analytics) | 60 s | p95 > 60 s for 5 min | Alert only |
Failure modes:
- Metric collection lag — caused by long Prometheus scrape intervals or slow exporter response — masks actual replication delay by reporting stale telemetry.
- GC pauses on the JVM-based monitoring agent produce false positive lag spikes that trigger unnecessary failovers.
- Missing heartbeat timeouts leave degraded replicas in the routing pool; heartbeat absence must itself be alertable.
Client-Side Guarantees & Read-After-Write
Post-write reads routed naively to replicas return stale data in the window between commit and replication. The standard mitigation is read-after-write routing: attach the committed LSN or GTID to the session token, then on each subsequent read check whether the target replica’s replay_lsn is at or ahead of that value before dispatching.
Session token based routing:
// After a write commits, capture the position from the primary's response
const sessionToken = {
last_commit_ts: Date.now(),
lsn: primaryResult.pg_current_wal_lsn // returned by your write wrapper
};
// On each read, decide routing by comparing replica lag against the token
async function routeRead(query, token) {
const candidates = await replicaPool.getReplicasWithReplayLsnGte(token.lsn);
if (candidates.length > 0) {
return candidates[0].execute(query); // use freshest qualifying replica
}
// No replica has caught up yet — fall back to primary for this query only
return primary.execute(query);
}For PostgreSQL, pg_last_wal_replay_lsn() on each replica exposes the current apply position. For MySQL, compare Retrieved_Gtid_Set and Executed_Gtid_Set from SHOW REPLICA STATUS.
Handle edge cases explicitly:
- Cross-device sessions require a distributed token store (Redis, Memcached) so that a mobile app and a web browser belonging to the same user share the same last-write LSN.
- Mobile offline states must queue writes locally and replay on reconnect; the LSN token is not valid until the queued write commits on the primary.
- Token expiration (e.g. after 30 seconds of inactivity) should trigger graceful fallback to eventual consistency, surfaced as a UI indicator where data freshness is user-visible.
| Guarantee | Latency Penalty | Session Overhead | Routing Precision |
|---|---|---|---|
| Primary pinning | +primary RTT | Low | Exact |
| LSN/GTID timestamp check | +validation query | Medium | High |
| Monotonic read token | +token lookup | High | High |
Failure modes:
- Session token loss on reconnect forces stale-replica reads, breaking post-write read expectations silently.
- Clock skew across application servers breaks wall-clock timestamp ordering when not using LSN-based tokens.
- Retry storms under high load saturate the primary connection pool when every post-write read falls back simultaneously.
Operational Resilience & Fallback Workflows
Service availability must degrade gracefully when replicas fall outside acceptable lag thresholds. Fallback paths must protect the primary from absorbing the full read load that replicas would otherwise serve.
Define three threshold levels for automatic response:
- Warning (p95 lag > SLO / 2): Reduce routing weight to this replica; shift load to fresher replicas.
- Critical (p95 lag > SLO): Exclude replica from active pool; route to cached responses or queue reads.
- Emergency (all replicas excluded): Direct reads to primary with rate limiting; activate read-through cache; enqueue non-critical reads for later replay.
Fallback strategies when replicas fall behind provides the complete circuit-breaker implementation and covers how to size the fallback cache, set queue depth limits, and recover replicas back into the pool after lag subsides.
Circuit breaker and rate limit configuration:
circuit_breaker:
lag_threshold_ms: 5000 # open circuit when p95 lag exceeds 5 s
open_duration: 30s # hold open for 30 s before half-open probe
half_open_requests: 5 # probe with 5 requests before full re-admission
fallback_strategy: cache_then_queue # try cache first; queue if cache miss
rate_limit:
analytical_rps: 50 # shed analytical load first
transactional_rps: 5000 # protect transactional SLA
primary_fallback_rps: 200 # cap primary fallback traffic during degraded stateConduct chaos drills quarterly for: replica desync during bulk import, network partition isolating one replica, disk saturation on replica (inject via stress-ng), and primary failover during high read load.
| Fallback Path | Primary Load Impact | Data Freshness | Recovery Complexity |
|---|---|---|---|
| Primary fallback | High | Strong | Low |
| Cached response | None | Stale by TTL | Medium |
| Queued reads | Low | Delayed | High |
| Read-only mode | None | Blocked | Low |
Failure modes:
- Primary saturation from fallback traffic triggers cascading timeouts if the primary connection pool is sized only for write traffic.
- Split-brain during network partitions causes divergent writes when fallback logic does not distinguish between replica lag and network isolation.
- Stale cache poisoning returns outdated data after replicas recover if the cache does not invalidate on replica re-admission.
Schema Evolution & Maintenance Workflows
DDL changes disrupt replication streams if applied without considering the apply pipeline. Every table lock on the primary stalls the WAL sender; every incompatible column type change causes apply errors on downstream replicas.
Apply backward-compatible changes in phases:
- Add nullable columns with server-side defaults (no app changes needed, no lock propagated).
- Deploy application code that writes both old and new columns.
- Backfill the new column offline during low-traffic windows using batched
UPDATEstatements with smallLIMITclauses to avoid holding long transactions. - Drop the old column only after confirming no reader references it.
Configure replication filters to bypass non-critical schema sync during peak hours:
-- PostgreSQL: non-blocking index creation — does not stall WAL sender
CREATE INDEX CONCURRENTLY idx_orders_fulfillment
ON orders(fulfillment_status);
-- PostgreSQL 11+: add column without table rewrite
ALTER TABLE orders
ADD COLUMN fulfillment_status VARCHAR(32) DEFAULT 'pending';
-- MySQL 8.0: online DDL — no table lock, no replication pause
ALTER TABLE orders
ADD COLUMN fulfillment_status VARCHAR(32) DEFAULT 'pending',
ALGORITHM=INPLACE,
LOCK=NONE;
-- MySQL: exclude staging schema from replication during migration testing
CHANGE REPLICATION FILTER REPLICATE_IGNORE_DB = (temp_staging);| Migration Strategy | Replication Impact | Rollback Complexity | Downtime Risk |
|---|---|---|---|
| Online DDL | Low | Medium | None |
| Table swap | High (brief lock) | Low | Brief window |
| Dual-write | Very high (2x WAL volume) | High | None |
Failure modes:
- DDL that acquires
ACCESS EXCLUSIVElock on the primary blocks the WAL sender until the lock releases, causing lag proportional to DDL duration plus ongoing write volume. - Incompatible column type changes (e.g.
INTtoBIGINTin MySQL row-based replication) cause apply errors that silently halt replication on replicas that do not share the schema change. - Long-running transactions open during DDL hold row-level locks that prevent
VACUUMfrom reclaiming dead tuples, triggering table bloat that compounds apply lag.
Topic Index
The following pages go deep on the specific mechanisms summarized above.
Detection & Monitoring
Detecting and Handling Replication Lag in Real-Time
Covers heartbeat-based lag measurement, WAL position tracking via pg_stat_replication, Prometheus alerting rules, and the circuit-breaker integration that removes replicas from the routing pool before lag violations reach application users. Includes specific metric queries for both PostgreSQL and MySQL.
Routing by Freshness
Routing Queries Based on Data Freshness Requirements Explains how to classify queries by staleness tolerance, encode freshness annotations in query hints, and implement per-endpoint SLO enforcement using ProxySQL rules and client-side LSN/GTID token validation. Shows how to avoid hardcoding replica hostnames in application configuration.
Eventual Consistency Patterns
Eventual Consistency Patterns for Read-Heavy Workloads Details how to design application read paths that tolerate replication lag without sacrificing correctness. Covers cache TTL alignment with lag percentiles, read-your-writes guarantees via session tokens, and how to surface data freshness state to end users without degrading UX.
Fallback Strategies
Fallback Strategies When Replicas Fall Behind Documents the complete degraded-state playbook: circuit breaker thresholds, fallback to read-through cache, queued read strategies, and primary protection rate limits. Includes runnable ProxySQL configuration and a recovery re-admission checklist.
Production Readiness Checklist
FAQ
What is replication lag and why does it matter?
Replication lag is the elapsed time between a write committing on the primary and that change becoming readable on a replica. It matters because applications routing reads to replicas may return stale data, causing silent correctness bugs in inventory checks, balance reads, and session state. In practice, lag ranges from a few milliseconds during steady-state OLTP to minutes during bulk ingestion or DDL migrations.
How do I measure replication lag accurately?
Use heartbeat-based measurement alongside WAL position deltas. Write a timestamp to a dedicated heartbeat table on the primary every few seconds; compare it against the replica’s copy for wall-clock lag. Supplement with sent_lsn - replay_lsn from pg_stat_replication (PostgreSQL) or Seconds_Behind_Source from SHOW REPLICA STATUS (MySQL) for byte-level queue depth. WAL position tells you how much data is queued; heartbeat delta tells you actual staleness.
When should reads be routed to the primary instead of a replica?
Route to the primary immediately after a write that the same session must read back (read-after-write), for any query requiring strict serializability, and when all replicas exceed the service tier’s lag SLO. Attach the committed LSN or GTID to the session token and compare it against each candidate replica’s replay position before dispatching. Only fall back to the primary if no qualifying replica is available.
What causes replication lag spikes under bulk load?
Single-threaded apply workers serialize high-concurrency write streams into sequential replay, creating a queue. Long-running transactions on the primary hold open WAL segments, blocking WAL purge. DDL statements (especially without CONCURRENTLY or INPLACE) lock replication threads. Disk I/O saturation on the replica causes WAL segment flush to fall behind arrival rate. The root signal is apply queue depth growing faster than the apply worker drains it.
How do I protect the primary when all replicas are lagging?
Implement a circuit breaker that opens when p95 lag exceeds the SLO, routing reads to a read-through cache first. Set a hard rate limit on primary fallback traffic (primary_fallback_rps) to prevent the read load from overwhelming the write path. For non-latency-sensitive reads, queue them and replay after replica recovery rather than falling back to the primary at all. Size your connection pool to handle the worst-case fallback load before an incident occurs.
Related
- Database Replication Fundamentals & Architecture — covers the WAL flow, synchronous vs asynchronous commit paths, and topology patterns that determine your baseline lag floor.
- Connection Routing & Pooling Strategies — proxy and client-side routing architectures that consume lag telemetry to gate query dispatch.
- Understanding Synchronous vs Asynchronous Replication — the commit-path trade-offs that set the replication lag floor for your deployment.
- Designing Multi-Region Read Replica Topologies — how WAN RTT and topology choices affect lag, and how to structure cross-region fallback.
- Connection Pool Architecture for Read Replicas — how pool sizing and session affinity interact with lag-based routing decisions.