Detecting and Handling Replication Lag in Real-Time
Scaling read-heavy architectures introduces asynchronous replication latency that directly impacts user experience, transactional integrity, and SLA compliance. Solid replication lag and consistency management requires shifting from reactive, post-mortem monitoring to proactive, real-time telemetry extraction and adaptive traffic routing. This guide details the operational patterns required to quantify replication drift, intercept queries at the middleware layer, and enforce application-level consistency boundaries under dynamic load.
The Operational Problem
When a write commits on the primary, changes propagate to replicas asynchronously. The window between the primary commit and the replica applying that change is replication lag. Under normal conditions lag is milliseconds; under write spikes, network congestion, or long-running replica queries, it can balloon to seconds or minutes.
The SLA risks this creates are concrete:
- Read-after-write violations: A user submits a form, is redirected to a confirmation page, and their own update does not appear because the read hit a lagging replica.
- Stale cache poisoning: A cache layer populated from a lagging replica serves outdated data even after the primary has the correct state.
- Cascading connection exhaustion: When replicas fall behind during a write spike, routing all reads to the primary can saturate primary connections, turning a lag event into an availability incident.
Without real-time lag detection, none of these failure modes is visible until users report errors.
Concept Definition: What “Replication Lag” Measures
Replication lag has two distinct representations that serve different purposes:
Time-based lag measures the difference between the timestamp of the last transaction committed on the primary and the timestamp of the last transaction applied on the replica. It is the human-readable metric surfaced in pg_stat_replication.write_lag / replay_lag and MySQL’s Seconds_Behind_Master. Time-based lag is useful for alerting thresholds but is vulnerable to NTP clock skew across nodes.
Byte-offset lag measures the difference in WAL byte positions between the primary’s current write head and the replica’s last applied position. pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) returns this value directly. Byte-offset lag is immune to clock skew and is the correct metric for routing decisions that propagate write positions from application code.
Both metrics are complementary: use byte-offset for routing logic and read-after-write guarantees; use time-based for human-readable SLO dashboards and alerting.
Lag Detection Architecture
The following diagram shows the full signal path from primary WAL emission through exporter scraping, Prometheus alerting, and proxy routing decisions.
Defining SLOs and Telemetry Baselines
Before deploying detection infrastructure, establish explicit lag SLOs aligned to workload criticality:
| Workload type | Max acceptable lag | Examples |
|---|---|---|
| User-facing synchronous flows | < 500 ms |
Checkout, profile updates, balance reads |
| Background and analytical jobs | < 2 s |
Reporting queries, batch exports |
| Eventual consistency consumers | < 5 s |
Search indexing, notification queues |
Deploy centralized metrics aggregation using Prometheus or VictoriaMetrics with scrape_interval = 1s for lag exporters. Polling gaps greater than 2 seconds routinely mask transient spikes that trigger cascading timeouts. Establish baseline replication throughput under peak write load to distinguish normal network jitter from genuine replication storms.
# PostgreSQL: Cap standby conflict delay to prevent indefinite query cancellation
max_standby_streaming_delay = 5s# MySQL (my.cnf on source): Ensure semi-sync waits for replica disk flush
rpl_semi_sync_source_wait_point = AFTER_SYNCPrimary failure modes at this stage:
- Silent lag accumulation during bulk
INSERT/UPDATEoperations when WAL generation outpaces network I/O. - Metric polling gaps that smooth over sub-second spikes, causing alerting to miss threshold breaches.
- NTP/clock skew across distributed nodes that corrupts time-based lag calculations, requiring monotonic LSN/position tracking instead.
Mechanism Deep-Dive: How Lag Detection Works Internally
PostgreSQL: WAL LSN Delta
PostgreSQL exposes replication state through pg_stat_replication on the primary. Each row represents one connected standby and includes four LSN positions: sent_lsn, write_lsn, flush_lsn, and replay_lsn. The critical one for lag is replay_lsn — the position the standby has actually applied.
-- On the primary: byte-level lag per connected replica
SELECT
application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
replay_lag AS lag_time
FROM pg_stat_replication
ORDER BY lag_bytes DESC;replay_lag is a PostgreSQL-native interval computed from commit timestamps; it is zero when the standby is caught up. lag_bytes via pg_wal_lsn_diff is the authoritative metric for routing decisions because it does not rely on timestamps.
MySQL/MariaDB: GTID Position
MySQL with GTID replication reports lag through performance_schema.replication_connection_status and the classic SHOW REPLICA STATUS. GTID-based lag measurement compares the set of executed transactions on the source against those on the replica:
-- On a MySQL replica: confirm lag and GTID position
SHOW REPLICA STATUS\G
-- Check: Seconds_Behind_Source, Executed_Gtid_Set vs Retrieved_Gtid_SetFor programmatic consumption, query performance_schema directly:
SELECT
CHANNEL_NAME,
SERVICE_STATE,
LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;Heartbeat Table Pattern (Cross-Engine)
When native views are unavailable (e.g., through a proxy that strips system queries) or cross-engine compatibility is required, a heartbeat table gives millisecond-precision lag without relying on internal replication metadata:
-- Schema (PostgreSQL; adapt types for MySQL)
CREATE TABLE replication_heartbeat (
id INT PRIMARY KEY,
ts TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);
-- Upsert from pg_cron or an external process every 500 ms on the primary:
INSERT INTO replication_heartbeat (id, ts) VALUES (1, clock_timestamp())
ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts;
-- On each replica: compute observed lag
SELECT EXTRACT(EPOCH FROM (clock_timestamp() - ts)) * 1000 AS lag_ms
FROM replication_heartbeat
WHERE id = 1;Degraded-state behavior: During network partitioning, heartbeat writes may succeed locally but fail to replicate. Implement dual-path validation: if heartbeat lag exceeds threshold AND TCP keepalive RTT exceeds baseline, mark the replica as UNREACHABLE rather than LAGGING to prevent routing loops.
Trade-off Comparison: Detection Approaches
| Detection method | Latency precision | Clock-skew immune | Works through proxy | Operational overhead |
|---|---|---|---|---|
pg_stat_replication replay_lag |
~100 ms | No | No (requires direct primary conn) | None |
WAL LSN delta (pg_wal_lsn_diff) |
~1 ms | Yes | No | None |
MySQL Seconds_Behind_Source |
~1 s | No | No | None |
| Heartbeat table (custom writer) | ~500 ms | Yes | Yes | Low (pg_cron or cron job) |
| ProxySQL built-in monitor | ~1 s | No | Native | Config only |
For routing queries based on data freshness requirements, WAL LSN delta combined with a heartbeat table provides the strongest guarantees: the LSN is immune to clock skew and the heartbeat survives proxy interposition.
Configuration Runbook
ProxySQL: Per-Replica Lag Thresholds
-- mysql_servers: set per-server max lag in seconds
UPDATE mysql_servers
SET max_replication_lag = 1 -- demote if lag > 1s
WHERE hostgroup_id = 20; -- read replica hostgroup
UPDATE mysql_servers
SET max_connections = 200
WHERE hostgroup_id = 20;
-- Query rules: pin high-freshness queries to the primary (hostgroup 10)
INSERT INTO mysql_query_rules
(rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
(1, 1, '^SELECT.*FROM orders WHERE user_id', 10, 1);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;# ProxySQL: scrape replicas every 1 second (milliseconds)
mysql-monitor_replication_lag_interval: 1000Hysteresis: Require lag to stay below threshold for 3 consecutive checks before re-promoting a demoted replica. This prevents routing oscillation during brief jitter.
PgBouncer: Transaction-Mode Pooling for Failover
When replicas resync after a lag event, connection demand spikes. Switch to transaction-mode pooling to cap backend sessions:
[pgbouncer]
pool_mode = transaction
max_client_conn = 500
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
query_wait_timeout = 10 ; shed load gracefully rather than queue indefinitelyPostgreSQL: WAL Slot Safety
Inactive replication slots cause unbounded WAL accumulation that eventually fills disk and crashes the primary:
-- Cap WAL retained for slots; prevents disk exhaustion
ALTER SYSTEM SET max_slot_wal_keep_size = '5GB';
SELECT pg_reload_conf();
-- Monitor for inactive slots (drop them if confirmed abandoned)
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE active = false;Monitoring and Alerting Signals
Key Prometheus Metrics
| Metric | Source | Alert threshold |
|---|---|---|
pg_replication_lag_bytes |
postgres_exporter | > 50 MB for 2 min |
pg_replication_lag_seconds |
postgres_exporter | > 2 s for 3 min |
mysql_slave_lag_seconds |
mysqld_exporter | > 1 s for 2 min |
proxysql_server_status |
proxysql_exporter | Status = SHUNNED |
pgbouncer_pool_cl_waiting |
pgbouncer_exporter | > 50 for 1 min |
# Grafana / Prometheus alert rules
groups:
- name: replication_lag
rules:
- alert: ReplicationLagCritical
expr: pg_replication_lag_seconds > 2
for: 3m
labels:
severity: critical
annotations:
summary: "Replica {{ $labels.instance }} lag exceeds 2s for 3 minutes"
runbook: "Check pg_stat_replication on primary; verify network I/O between nodes"
- alert: ReplicationSlotInactive
expr: pg_replication_slots_active == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Inactive replication slot accumulating WAL"OpenTelemetry Span Tags for Trace Correlation
Inject routing context into every database span so incidents are traceable end-to-end:
db.replication.lag_seconds— lag at time of routing decisiondb.routing.target—primaryorreplica:{name}db.routing.reason—lag_threshold_exceeded,read_after_write,healthy
Application-Level Consistency Guarantees
Middleware routing alone is insufficient for strict read-after-write scenarios. Application code must propagate the primary’s commit position and validate replica freshness before routing subsequent reads. This pairs directly with the eventual consistency patterns that govern how stale reads are tolerated across different workload types.
Write-Position Propagation
Immediately after a write, capture the primary’s LSN or GTID and attach it to the request context:
# Post-write: capture primary position (PostgreSQL via psycopg2)
cursor.execute("SELECT pg_current_wal_lsn()::text")
write_lsn = cursor.fetchone()[0] # e.g. "0/1A2B3C40"
# Store in session or propagate via HTTP header
session["write_lsn"] = write_lsn
session["write_ts_ms"] = int(time.time() * 1000)# HTTP Header propagation for inter-service calls
X-Write-LSN: 0/1A2B3C40
X-Write-Timestamp: 1715432100000Client-Side Routing Logic
def safe_read(query, write_ts_ms=None, max_lag_ms=500):
"""Route to primary if within lag window, else replica pool."""
if write_ts_ms is not None:
elapsed = int(time.time() * 1000) - write_ts_ms
if elapsed < max_lag_ms:
return primary_conn.execute(query)
return replica_pool.execute(query)For session-scoped routing, a Redis token prevents stale reads without requiring every service to implement LSN comparison:
# Redis: Route all reads to primary for 30 s after a write
SETEX write_session:{user_id} 30 "primary_lsn:0/1A2B3C40"
Degraded-state behavior: Under sustained replication storms, excessive primary routing saturates CPU and I/O. Cap primary read traffic at 20% of total read capacity via a circuit breaker. If the cap is reached, return HTTP 429 Too Many Requests or serve explicitly flagged stale data, rather than cascading into primary overload.
The using application-level timestamps to bypass stale replicas page covers the full SDK implementation and session-token patterns in detail.
Failure Modes and Recovery Steps
1. Silent lag accumulation during write bursts
Root cause: WAL generation from bulk INSERT/UPDATE outpaces replica I/O throughput. pg_stat_replication shows growing lag_bytes while replay_lag climbs.
Remediation:
-- Check WAL send rate vs apply rate on primary
SELECT application_name,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS unsent_bytes,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- If unsent_bytes > 100 MB, consider throttling writes or adding replica capacity.2. Heartbeat table write contention
Root cause: High-frequency upserts to a single-row heartbeat table cause lock contention under concurrent application reads.
Remediation: Use INSERT ... ON CONFLICT DO UPDATE (PostgreSQL) or INSERT ... ON DUPLICATE KEY UPDATE (MySQL) with a low-priority write to avoid lock escalation. Move the heartbeat writer to a dedicated connection outside the application pool.
3. Routing oscillation from rapid lag changes
Root cause: A replica oscillates around the lag threshold, being promoted and demoted repeatedly, causing connection thrashing and application errors.
Remediation: Enforce hysteresis in the proxy (3 consecutive checks below threshold before re-promotion). In ProxySQL, this is handled automatically by mysql-monitor_replication_lag_interval combined with max_replication_lag; for custom routers, implement a counter rather than a binary threshold.
4. Stale cache after replica recovery
Root cause: During a lag event, cache entries are populated from lagging replicas. After recovery, those entries persist until TTL expiry, serving outdated data.
Remediation:
-- Redis: Conditional invalidation (Lua for atomicity)
local val = redis.call('GET', KEYS[1])
if val then
return redis.call('DEL', KEYS[1])
end
return 0
-- Execute: redis-cli EVAL "<script>" 1 cache_keyTie cache invalidation jobs to Prometheus alerting: fire purge when pg_replication_lag_seconds drops below SLO for more than 60 seconds.
5. Inactive replication slot causing WAL bloat
Root cause: A consumer (logical replication subscriber or CDC tool) disconnects without dropping its slot. PostgreSQL retains all WAL from the slot’s restart_lsn forward.
Remediation:
-- Identify and drop abandoned slots
SELECT slot_name, active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1e9 AS retained_gb
FROM pg_replication_slots
WHERE active = false;
-- Drop only after confirming the consumer is genuinely gone:
SELECT pg_drop_replication_slot('slot_name');Operational Checklist
Pages in This Section
- Using Application-Level Timestamps to Bypass Stale Replicas — Full SDK implementation and session-token patterns for read-after-write consistency without proxy changes.
FAQ
How do I measure replication lag in PostgreSQL without relying on wall-clock time?
Use pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) queried on the primary against each replica’s reported LSN position. This gives byte-level drift that is immune to NTP skew and survives network partitions where timestamps cannot be trusted.
What hysteresis value prevents replica routing oscillation in ProxySQL?
Require lag to remain below threshold for at least 3 consecutive checks (3 seconds at mysql-monitor_replication_lag_interval = 1000ms) before re-promoting a replica. This prevents connection thrashing during brief network jitter.
When should I route reads to the primary instead of a replica?
Route to the primary for read-after-write scenarios where the elapsed time since the write is less than your max_lag_ms SLO, when all replicas exceed their lag threshold, or when you detect clock skew that makes timestamp-based validation unreliable. Use LSN/GTID propagation via HTTP headers or session tokens rather than wall-clock comparisons.
Related
← Back to Replication Lag & Consistency Management
- Routing Queries Based on Data Freshness Requirements — Per-query freshness policies and lag-aware hostgroup demotion in ProxySQL and PgBouncer.
- Eventual Consistency Patterns for Read-Heavy Workloads — How to tolerate and bound staleness for background, analytical, and cache workloads.
- Fallback Strategies When Replicas Fall Behind — Circuit breaker patterns, primary fallback thresholds, and connection drain procedures.
- Connection Pool Architecture for Read Replicas — PgBouncer and ProxySQL pool sizing, mode selection, and failover topology design.
- Understanding Synchronous vs Asynchronous Replication — The commit-path trade-offs that determine your baseline lag exposure.