Detecting and Handling Replication Lag in Real-Time

Scaling read-heavy architectures introduces asynchronous replication latency that directly impacts user experience, transactional integrity, and SLA compliance. Solid replication lag and consistency management requires shifting from reactive, post-mortem monitoring to proactive, real-time telemetry extraction and adaptive traffic routing. This guide details the operational patterns required to quantify replication drift, intercept queries at the middleware layer, and enforce application-level consistency boundaries under dynamic load.


The Operational Problem

When a write commits on the primary, changes propagate to replicas asynchronously. The window between the primary commit and the replica applying that change is replication lag. Under normal conditions lag is milliseconds; under write spikes, network congestion, or long-running replica queries, it can balloon to seconds or minutes.

The SLA risks this creates are concrete:

  • Read-after-write violations: A user submits a form, is redirected to a confirmation page, and their own update does not appear because the read hit a lagging replica.
  • Stale cache poisoning: A cache layer populated from a lagging replica serves outdated data even after the primary has the correct state.
  • Cascading connection exhaustion: When replicas fall behind during a write spike, routing all reads to the primary can saturate primary connections, turning a lag event into an availability incident.

Without real-time lag detection, none of these failure modes is visible until users report errors.


Concept Definition: What “Replication Lag” Measures

Replication lag has two distinct representations that serve different purposes:

Time-based lag measures the difference between the timestamp of the last transaction committed on the primary and the timestamp of the last transaction applied on the replica. It is the human-readable metric surfaced in pg_stat_replication.write_lag / replay_lag and MySQL’s Seconds_Behind_Master. Time-based lag is useful for alerting thresholds but is vulnerable to NTP clock skew across nodes.

Byte-offset lag measures the difference in WAL byte positions between the primary’s current write head and the replica’s last applied position. pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) returns this value directly. Byte-offset lag is immune to clock skew and is the correct metric for routing decisions that propagate write positions from application code.

Both metrics are complementary: use byte-offset for routing logic and read-after-write guarantees; use time-based for human-readable SLO dashboards and alerting.


Lag Detection Architecture

The following diagram shows the full signal path from primary WAL emission through exporter scraping, Prometheus alerting, and proxy routing decisions.

Replication Lag Detection Architecture Data flow from PostgreSQL primary through WAL stream to replicas, scraped by lag exporters into Prometheus, which feeds Grafana alerts and ProxySQL routing decisions. Application reads are routed by ProxySQL based on per-replica lag metrics. Primary DB WAL / binlog Replica 1 lag: 120 ms Replica 2 lag: 3.4 s ⚠ scrape 1s Prometheus + Grafana alerts threshold ProxySQL route / demote App client read queries Heartbeat writer (500 ms upsert on primary) WAL / sync path metrics scrape heartbeat

Defining SLOs and Telemetry Baselines

Before deploying detection infrastructure, establish explicit lag SLOs aligned to workload criticality:

Workload type Max acceptable lag Examples
User-facing synchronous flows < 500 ms Checkout, profile updates, balance reads
Background and analytical jobs < 2 s Reporting queries, batch exports
Eventual consistency consumers < 5 s Search indexing, notification queues

Deploy centralized metrics aggregation using Prometheus or VictoriaMetrics with scrape_interval = 1s for lag exporters. Polling gaps greater than 2 seconds routinely mask transient spikes that trigger cascading timeouts. Establish baseline replication throughput under peak write load to distinguish normal network jitter from genuine replication storms.

ini
# PostgreSQL: Cap standby conflict delay to prevent indefinite query cancellation
max_standby_streaming_delay = 5s
ini
# MySQL (my.cnf on source): Ensure semi-sync waits for replica disk flush
rpl_semi_sync_source_wait_point = AFTER_SYNC

Primary failure modes at this stage:

  • Silent lag accumulation during bulk INSERT/UPDATE operations when WAL generation outpaces network I/O.
  • Metric polling gaps that smooth over sub-second spikes, causing alerting to miss threshold breaches.
  • NTP/clock skew across distributed nodes that corrupts time-based lag calculations, requiring monotonic LSN/position tracking instead.

Mechanism Deep-Dive: How Lag Detection Works Internally

PostgreSQL: WAL LSN Delta

PostgreSQL exposes replication state through pg_stat_replication on the primary. Each row represents one connected standby and includes four LSN positions: sent_lsn, write_lsn, flush_lsn, and replay_lsn. The critical one for lag is replay_lsn — the position the standby has actually applied.

sql
-- On the primary: byte-level lag per connected replica
SELECT
  application_name,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)  AS lag_bytes,
  replay_lag                                           AS lag_time
FROM pg_stat_replication
ORDER BY lag_bytes DESC;

replay_lag is a PostgreSQL-native interval computed from commit timestamps; it is zero when the standby is caught up. lag_bytes via pg_wal_lsn_diff is the authoritative metric for routing decisions because it does not rely on timestamps.

MySQL/MariaDB: GTID Position

MySQL with GTID replication reports lag through performance_schema.replication_connection_status and the classic SHOW REPLICA STATUS. GTID-based lag measurement compares the set of executed transactions on the source against those on the replica:

sql
-- On a MySQL replica: confirm lag and GTID position
SHOW REPLICA STATUS\G
-- Check: Seconds_Behind_Source, Executed_Gtid_Set vs Retrieved_Gtid_Set

For programmatic consumption, query performance_schema directly:

sql
SELECT
  CHANNEL_NAME,
  SERVICE_STATE,
  LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;

Heartbeat Table Pattern (Cross-Engine)

When native views are unavailable (e.g., through a proxy that strips system queries) or cross-engine compatibility is required, a heartbeat table gives millisecond-precision lag without relying on internal replication metadata:

sql
-- Schema (PostgreSQL; adapt types for MySQL)
CREATE TABLE replication_heartbeat (
  id  INT PRIMARY KEY,
  ts  TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);

-- Upsert from pg_cron or an external process every 500 ms on the primary:
INSERT INTO replication_heartbeat (id, ts) VALUES (1, clock_timestamp())
  ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts;

-- On each replica: compute observed lag
SELECT EXTRACT(EPOCH FROM (clock_timestamp() - ts)) * 1000 AS lag_ms
FROM replication_heartbeat
WHERE id = 1;

Degraded-state behavior: During network partitioning, heartbeat writes may succeed locally but fail to replicate. Implement dual-path validation: if heartbeat lag exceeds threshold AND TCP keepalive RTT exceeds baseline, mark the replica as UNREACHABLE rather than LAGGING to prevent routing loops.


Trade-off Comparison: Detection Approaches

Detection method Latency precision Clock-skew immune Works through proxy Operational overhead
pg_stat_replication replay_lag ~100 ms No No (requires direct primary conn) None
WAL LSN delta (pg_wal_lsn_diff) ~1 ms Yes No None
MySQL Seconds_Behind_Source ~1 s No No None
Heartbeat table (custom writer) ~500 ms Yes Yes Low (pg_cron or cron job)
ProxySQL built-in monitor ~1 s No Native Config only

For routing queries based on data freshness requirements, WAL LSN delta combined with a heartbeat table provides the strongest guarantees: the LSN is immune to clock skew and the heartbeat survives proxy interposition.


Configuration Runbook

ProxySQL: Per-Replica Lag Thresholds

sql
-- mysql_servers: set per-server max lag in seconds
UPDATE mysql_servers
  SET max_replication_lag = 1  -- demote if lag > 1s
WHERE hostgroup_id = 20;       -- read replica hostgroup

UPDATE mysql_servers
  SET max_connections = 200
WHERE hostgroup_id = 20;

-- Query rules: pin high-freshness queries to the primary (hostgroup 10)
INSERT INTO mysql_query_rules
  (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
  (1, 1, '^SELECT.*FROM orders WHERE user_id', 10, 1);

LOAD MYSQL SERVERS TO RUNTIME;    SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
yaml
# ProxySQL: scrape replicas every 1 second (milliseconds)
mysql-monitor_replication_lag_interval: 1000

Hysteresis: Require lag to stay below threshold for 3 consecutive checks before re-promoting a demoted replica. This prevents routing oscillation during brief jitter.

PgBouncer: Transaction-Mode Pooling for Failover

When replicas resync after a lag event, connection demand spikes. Switch to transaction-mode pooling to cap backend sessions:

ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 500
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
query_wait_timeout = 10    ; shed load gracefully rather than queue indefinitely

PostgreSQL: WAL Slot Safety

Inactive replication slots cause unbounded WAL accumulation that eventually fills disk and crashes the primary:

sql
-- Cap WAL retained for slots; prevents disk exhaustion
ALTER SYSTEM SET max_slot_wal_keep_size = '5GB';
SELECT pg_reload_conf();

-- Monitor for inactive slots (drop them if confirmed abandoned)
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE active = false;

Monitoring and Alerting Signals

Key Prometheus Metrics

Metric Source Alert threshold
pg_replication_lag_bytes postgres_exporter > 50 MB for 2 min
pg_replication_lag_seconds postgres_exporter > 2 s for 3 min
mysql_slave_lag_seconds mysqld_exporter > 1 s for 2 min
proxysql_server_status proxysql_exporter Status = SHUNNED
pgbouncer_pool_cl_waiting pgbouncer_exporter > 50 for 1 min
yaml
# Grafana / Prometheus alert rules
groups:
  - name: replication_lag
    rules:
      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 2
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Replica {{ $labels.instance }} lag exceeds 2s for 3 minutes"
          runbook: "Check pg_stat_replication on primary; verify network I/O between nodes"

      - alert: ReplicationSlotInactive
        expr: pg_replication_slots_active == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inactive replication slot accumulating WAL"

OpenTelemetry Span Tags for Trace Correlation

Inject routing context into every database span so incidents are traceable end-to-end:

  • db.replication.lag_seconds — lag at time of routing decision
  • db.routing.targetprimary or replica:{name}
  • db.routing.reasonlag_threshold_exceeded, read_after_write, healthy

Application-Level Consistency Guarantees

Middleware routing alone is insufficient for strict read-after-write scenarios. Application code must propagate the primary’s commit position and validate replica freshness before routing subsequent reads. This pairs directly with the eventual consistency patterns that govern how stale reads are tolerated across different workload types.

Write-Position Propagation

Immediately after a write, capture the primary’s LSN or GTID and attach it to the request context:

python
# Post-write: capture primary position (PostgreSQL via psycopg2)
cursor.execute("SELECT pg_current_wal_lsn()::text")
write_lsn = cursor.fetchone()[0]  # e.g. "0/1A2B3C40"

# Store in session or propagate via HTTP header
session["write_lsn"] = write_lsn
session["write_ts_ms"] = int(time.time() * 1000)
http
# HTTP Header propagation for inter-service calls
X-Write-LSN: 0/1A2B3C40
X-Write-Timestamp: 1715432100000

Client-Side Routing Logic

python
def safe_read(query, write_ts_ms=None, max_lag_ms=500):
    """Route to primary if within lag window, else replica pool."""
    if write_ts_ms is not None:
        elapsed = int(time.time() * 1000) - write_ts_ms
        if elapsed < max_lag_ms:
            return primary_conn.execute(query)
    return replica_pool.execute(query)

For session-scoped routing, a Redis token prevents stale reads without requiring every service to implement LSN comparison:

code
# Redis: Route all reads to primary for 30 s after a write
SETEX write_session:{user_id} 30 "primary_lsn:0/1A2B3C40"

Degraded-state behavior: Under sustained replication storms, excessive primary routing saturates CPU and I/O. Cap primary read traffic at 20% of total read capacity via a circuit breaker. If the cap is reached, return HTTP 429 Too Many Requests or serve explicitly flagged stale data, rather than cascading into primary overload.

The using application-level timestamps to bypass stale replicas page covers the full SDK implementation and session-token patterns in detail.


Failure Modes and Recovery Steps

1. Silent lag accumulation during write bursts

Root cause: WAL generation from bulk INSERT/UPDATE outpaces replica I/O throughput. pg_stat_replication shows growing lag_bytes while replay_lag climbs.

Remediation:

sql
-- Check WAL send rate vs apply rate on primary
SELECT application_name,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS unsent_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- If unsent_bytes > 100 MB, consider throttling writes or adding replica capacity.

2. Heartbeat table write contention

Root cause: High-frequency upserts to a single-row heartbeat table cause lock contention under concurrent application reads.

Remediation: Use INSERT ... ON CONFLICT DO UPDATE (PostgreSQL) or INSERT ... ON DUPLICATE KEY UPDATE (MySQL) with a low-priority write to avoid lock escalation. Move the heartbeat writer to a dedicated connection outside the application pool.

3. Routing oscillation from rapid lag changes

Root cause: A replica oscillates around the lag threshold, being promoted and demoted repeatedly, causing connection thrashing and application errors.

Remediation: Enforce hysteresis in the proxy (3 consecutive checks below threshold before re-promotion). In ProxySQL, this is handled automatically by mysql-monitor_replication_lag_interval combined with max_replication_lag; for custom routers, implement a counter rather than a binary threshold.

4. Stale cache after replica recovery

Root cause: During a lag event, cache entries are populated from lagging replicas. After recovery, those entries persist until TTL expiry, serving outdated data.

Remediation:

lua
-- Redis: Conditional invalidation (Lua for atomicity)
local val = redis.call('GET', KEYS[1])
if val then
  return redis.call('DEL', KEYS[1])
end
return 0
-- Execute: redis-cli EVAL "<script>" 1 cache_key

Tie cache invalidation jobs to Prometheus alerting: fire purge when pg_replication_lag_seconds drops below SLO for more than 60 seconds.

5. Inactive replication slot causing WAL bloat

Root cause: A consumer (logical replication subscriber or CDC tool) disconnects without dropping its slot. PostgreSQL retains all WAL from the slot’s restart_lsn forward.

Remediation:

sql
-- Identify and drop abandoned slots
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1e9 AS retained_gb
FROM pg_replication_slots
WHERE active = false;

-- Drop only after confirming the consumer is genuinely gone:
SELECT pg_drop_replication_slot('slot_name');

Operational Checklist


Pages in This Section


FAQ

How do I measure replication lag in PostgreSQL without relying on wall-clock time?

Use pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) queried on the primary against each replica’s reported LSN position. This gives byte-level drift that is immune to NTP skew and survives network partitions where timestamps cannot be trusted.

What hysteresis value prevents replica routing oscillation in ProxySQL?

Require lag to remain below threshold for at least 3 consecutive checks (3 seconds at mysql-monitor_replication_lag_interval = 1000ms) before re-promoting a replica. This prevents connection thrashing during brief network jitter.

When should I route reads to the primary instead of a replica?

Route to the primary for read-after-write scenarios where the elapsed time since the write is less than your max_lag_ms SLO, when all replicas exceed their lag threshold, or when you detect clock skew that makes timestamp-based validation unreliable. Use LSN/GTID propagation via HTTP headers or session tokens rather than wall-clock comparisons.


← Back to Replication Lag & Consistency Management