How to Calculate Replication Lag Thresholds for SLA Compliance

Problem statement: Your monitoring fires a lag alert but you have no principled threshold to tell you whether 800 ms of replica delay is acceptable or an SLA breach in progress. You need a deterministic method to derive warning and hard-breach thresholds from your actual SLA budget, baseline telemetry, and infrastructure overhead β€” before the next incident.

Asynchronous replication always introduces a delay between the primary committing a write and a replica applying it. That delay β€” replication_delay_seconds in PostgreSQL, Seconds_Behind_Source in MySQL β€” is not a fixed constant: it fluctuates with write volume, network conditions, and replica I/O. Without thresholds grounded in real measurements, alert tuning becomes guesswork and SLA commitments cannot be enforced.

Symptom Identification

Look for these signals in production before applying any threshold calculation. Each one indicates lag has already crossed a boundary.

Metrics and telemetry to watch:

  • replication_delay_seconds (PostgreSQL pg_stat_replication.write_lag / replay_lag) rising monotonically over a 3-minute evaluation window
  • Seconds_Behind_Source > 0 sustained in MySQL SHOW REPLICA STATUS
  • wal_lag_bytes climbing without a corresponding increase in pg_stat_replication.sent_lsn delta
  • read_consistency_errors counter incrementing in application-layer replica routing logic

Application-level symptoms:

  • Stale query results that violate idempotency: a record that was just written returns as absent on the next read
  • 409 Conflict / SQLSTATE 23505 errors on retry paths that read-then-write through a replica
  • Unexpected connection pool exhaustion caused by retry storms from clients re-querying a stale replica

Critical distinction: differentiate between transient jitter (sub-second spikes during checkpoint flushes) and sustained apply lag (monotonic increase over minutes). Configure alert evaluation windows (for: 3m in Prometheus/Alertmanager) to suppress false positives from the former while catching the latter.


Replication Lag Threshold Zones A horizontal timeline showing three zones: normal operating range below 70% of SLA budget, warning zone between 70% and 100%, and hard breach above 100% which triggers failover. Normal 0 β†’ 70% of SLA budget Routing: read replicas active Warning 70% β†’ 100% Scale / re-route Hard Breach > 100% of SLA budget Failover to primary 0 ms Warning threshold Hard breach ∞ e.g. 1155 ms for 2 s SLA e.g. 1650 ms for 2 s SLA

Root Cause Analysis

Before adjusting any threshold, isolate which infrastructure layer is causing the lag. Masking a deficit with a relaxed threshold only defers the breach.

Network/Transport layer: Round-trip latency variance above 50 ms, packet loss above 0.1%, or TCP window exhaustion disrupts WAL streaming continuity. Verify tcp_keepalives_idle, tcp_keepalives_interval, and MTU alignment across availability zones. Cross-AZ multi-region replica topologies amplify this effect.

Primary write saturation: High wal_buffers flush latency, max_wal_size exhaustion, or aggressive checkpoint thrashing (checkpoint_completion_target < 0.9) stalls WAL segment generation. Monitor pg_stat_bgwriter.checkpoint_write_time and buffers_checkpoint for symptoms.

Replica I/O contention: Disk queue depth above 10, undersized shared_buffers, or long-running analytical queries holding AccessShareLock and blocking WAL apply when hot_standby_feedback = on is set. This is the most common source of sustained lag spikes on read-heavy replicas.

Routing and session stickiness: Connection pool misconfiguration β€” particularly pool_mode = transaction in PgBouncer with SET statements that rely on session state β€” can force clients to remain bound to a degraded replica after the routing policy has already redirected new connections away. See connection pool architecture for read replicas for the full failure mode.

Step-by-Step Threshold Calculation

Step 1 β€” Define workload tiers

Assign a maximum acceptable staleness per query class. This becomes your SLA budget for each tier. If no explicit SLA exists, derive the limit from user-facing consequences: a stale balance display is tolerable for 2 s; a stale inventory lock is not tolerable past 100 ms.

Query class Max staleness (SLA budget) Consistency model
Transactional reads (checkout, auth) 500 ms READ COMMITTED minimum
Reporting / dashboards 2 000 ms Eventual
Background analytics 5 000 ms Eventual

Step 2 β€” Capture P95 baseline lag

Extract historical lag percentiles under peak write throughput. Use a 30-day window that includes your busiest days. Never derive thresholds from mean lag β€” mean hides the tail behaviour that causes SLA breaches.

sql
-- PostgreSQL: lag in seconds at the moment of query
SELECT
  application_name,
  EXTRACT(EPOCH FROM write_lag)   AS write_lag_s,
  EXTRACT(EPOCH FROM flush_lag)   AS flush_lag_s,
  EXTRACT(EPOCH FROM replay_lag)  AS replay_lag_s
FROM pg_stat_replication;
sql
-- MySQL / MariaDB: seconds behind source
SHOW REPLICA STATUS\G
-- Read: Seconds_Behind_Source

If you have Prometheus with pg_replication_lag or mysql_slave_seconds_behind_master, query the P95 over 30 days:

promql
histogram_quantile(0.95,
  rate(pg_replication_lag_bucket[30d])
)

Record your P95 and P99 values. If P99 already exceeds your SLA budget, you have an infrastructure problem β€” resolve it before setting thresholds.

Inline verification:

sql
-- Confirm you are reading live replication state, not cached stats
SELECT now(), pg_last_xact_replay_timestamp(),
  EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_s;

Step 3 β€” Apply the safety-margin formula

code
Hard_Breach_Threshold = SLA_Budget - (Network_RTT + Apply_Overhead + Safety_Buffer)
Warning_Threshold     = Hard_Breach_Threshold Γ— 0.70

Component definitions:

  • Network_RTT: measured round-trip time between primary and replica (use pg_ping or equivalent; include cross-AZ variance)
  • Apply_Overhead: the measured time between a WAL record being received and being applied, visible as flush_lag - write_lag in pg_stat_replication
  • Safety_Buffer: cushion for measurement jitter and alert evaluation delay; a minimum of 150 ms is reasonable for sub-second SLAs; 200–300 ms for longer windows

Worked example (2 000 ms SLA, transatlantic replica):

code
Hard_Breach = 2000ms - (80ms RTT + 120ms apply + 200ms safety)
Hard_Breach = 1600ms

Warning     = 1600ms Γ— 0.70 = 1120ms

Inline verification: Run the lag query above immediately after a known write and confirm the observed lag is below your calculated warning threshold under normal conditions.

Step 4 β€” Cross-reference proxy timeouts

The calculated hard-breach threshold must be strictly lower than the proxy’s read timeout. If it is not, a replica that is sitting at exactly the breach threshold will also be timing out client connections, causing a retry storm on top of the failover.

code
proxy_read_timeout β‰₯ Hard_Breach_Threshold + 500ms

For a 1 600 ms hard-breach threshold, set proxy_read_timeout = 2100ms (or the nearest supported value). Align this in your read/write splitting proxy configuration before deploying threshold-based routing.

Inline verification: Use SELECT pg_sleep(2) through the proxy and confirm the proxy returns the result rather than timing out at the connection level.

Step 5 β€” Deploy tiered alerting

Configure two alert rules, not one. A single alert fires too late for proactive action.

yaml
# Prometheus alerting rules
groups:
  - name: replication_lag
    rules:
      - alert: ReplicationLagWarning
        expr: pg_replication_lag_seconds > 1.12   # 70% of 1600ms hard breach
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} approaching SLA lag limit"
          description: "Lag {{ $value }}s exceeds warning threshold 1.12s. Investigate before breach."

      - alert: ReplicationLagBreach
        expr: pg_replication_lag_seconds > 1.60   # hard breach threshold
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replica {{ $labels.instance }} breached SLA lag limit"
          description: "Lag {{ $value }}s exceeds hard breach 1.60s. Failover routing now."

The for: 3m evaluation window on the warning suppresses false positives during routine checkpoint flushes. The for: 1m on the breach fires faster to trigger automated failover.

Inline verification: Use amtool alert query ReplicationLagWarning to confirm the alert is loaded and its threshold matches the value you calculated.

Step 6 β€” Version-control the thresholds

Parameterize calculated thresholds in Terraform or Ansible. Never hardcode them in application logic or runtime environment variables β€” they will drift and become unauditable.

hcl
# terraform/variables.tf
variable "replication_lag_warning_s" {
  description = "Replication lag warning threshold in seconds (70% of hard breach)"
  type        = number
  default     = 1.12
}

variable "replication_lag_breach_s" {
  description = "Replication lag hard breach threshold in seconds"
  type        = number
  default     = 1.60
}

Configuration Snippet

Deploy dynamic routing rules using the calculated thresholds. The proxy evaluates real-time lag and routes reads to compliant replicas or fails over to the primary.

ProxySQL (MySQL/MariaDB) β€” max_replication_lag in seconds:

sql
-- Define reader hostgroup with lag threshold
INSERT INTO mysql_replication_hostgroups
  (writer_hostgroup, reader_hostgroup, comment, max_replication_lag)
VALUES
  (10, 20, 'prod-readers', 2);  -- 2 s = hard breach threshold

-- Register replica in the reader hostgroup
INSERT INTO mysql_servers
  (hostgroup_id, hostname, port, max_connections, max_replication_lag)
VALUES
  (20, 'replica-01.db.internal', 3306, 500, 2);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

PostgreSQL health check for HAProxy / load balancer:

sql
-- Returns lag in seconds; exit non-zero (HTTP 503) if above threshold
SELECT
  CASE
    WHEN EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) < 1.60
    THEN 'OK'
    ELSE 'LAGGING'
  END AS replica_status,
  EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::NUMERIC(8,3) AS lag_s;

Return HTTP 200 when lag_s < 1.60; return HTTP 503 when breached. Integrate this with fallback strategies when replicas fall behind to ensure the primary absorbs read traffic cleanly during failover.

PgBouncer β€” pool drain sequence on breach:

bash
# Pause new queries to the lagging pool, redirect to primary, then resume
psql -h localhost -p 6432 -U pgbouncer -c "PAUSE replica_pool;"
# (update HAProxy backend weight to 0 for the lagging replica here)
psql -h localhost -p 6432 -U pgbouncer -c "RESUME replica_pool;"

# Once replica catches up (lag < 50% of threshold for 5 consecutive minutes):
psql -h localhost -p 6432 -U pgbouncer -c "RECONNECT replica_pool;"

Verification and Rollback

Verify the fix is working

Run this immediately after deploying threshold-based routing to confirm the routing policy is being applied:

sql
-- PostgreSQL: confirm lag is below threshold and replica is streaming
SELECT
  application_name,
  state,
  sync_state,
  EXTRACT(EPOCH FROM replay_lag) AS replay_lag_s
FROM pg_stat_replication
WHERE EXTRACT(EPOCH FROM replay_lag) IS NOT NULL;
-- Expect: state = 'streaming', replay_lag_s < hard_breach_threshold
sql
-- MySQL: confirm replica is running and lag is within threshold
SHOW REPLICA STATUS\G
-- Expect: Replica_IO_Running = Yes, Replica_SQL_Running = Yes
--         Seconds_Behind_Source < hard_breach_threshold_in_seconds

Check Prometheus to confirm ReplicationLagWarning and ReplicationLagBreach alerts have resolved.

Rollback procedure

If threshold changes cause unexpected routing behaviour or cascade connection failures:

  1. Revert mysql_replication_hostgroups.max_replication_lag to the previous value via LOAD MYSQL SERVERS TO RUNTIME without saving to disk (preserves rollback path).
  2. Restore the previous Prometheus alerting rule from version control and curl -X POST http://alertmanager:9093/-/reload.
  3. Restore PgBouncer pool configuration from the IaC-managed copy and pgbouncer -R (graceful reload without dropping connections).
  4. Do not re-introduce a previously breached replica until lag has remained below 50% of the hard-breach threshold for 5 consecutive minutes.
  5. Commit the reverted threshold values to the IaC repository within 24 hours, noting the reason in the commit message.

Edge Cases and Gotchas

Cascading replication (replica-of-replica): In a topology where a replica feeds a downstream replica, lag accumulates additively at each tier. Each hop adds its own apply overhead and network RTT. A downstream replica that appears compliant by its own pg_last_xact_replay_timestamp() measurement may still be 2–3 hops behind the primary in wall-clock terms. Measure per-hop lag independently using pg_stat_replication on each intermediate node, and set progressively tighter warning thresholds on downstream nodes to account for the additive delay.

Logical replication with large transactions: Logical replication applies changes in full transaction batches, not individual WAL records. A single large bulk INSERT or UPDATE can cause lag to spike from 0 to several seconds instantaneously as the apply worker processes the complete transaction. This is expected behaviour, not infrastructure failure. Use for: 5m evaluation windows for warning alerts on logical replica deployments to avoid alert fatigue, and exclude known batch-job windows from SLA calculations.

Multi-region deployments with asymmetric RTT: Cross-continent replicas have substantially higher and more variable RTT than cross-AZ replicas. The Network_RTT component of the safety-margin formula must be measured at P99 latency, not average β€” otherwise the safety buffer will be too thin during peak network congestion. For multi-region topologies, set separate threshold profiles per region; a US-East primary to EU-West replica requires a fundamentally different threshold than the same primary to a US-West replica.

FAQ

What is a safe replication lag threshold for a 2-second SLA?

Subtract network RTT, WAL apply overhead, and a safety buffer from your SLA budget. For a 2 000 ms SLA with 50 ms RTT, 100 ms apply overhead, and a 200 ms safety margin, the hard-breach threshold is 1 650 ms. Set your warning threshold at 70% of that value β€” approximately 1 155 ms. Always verify these values against your measured P95 baseline before deploying.

Why should I not just relax the lag threshold when replicas fall behind?

Relaxing the threshold masks the underlying bottleneck β€” network congestion, replica I/O saturation, or primary write pressure β€” and guarantees a future hard breach under heavier load. Diagnose root cause first using pg_stat_replication, disk queue metrics, and network latency data. Only adjust thresholds once the infrastructure deficit is resolved and a new P95 baseline has been measured.

How do I handle replication lag thresholds in cascading replication setups?

In cascading topologies (replica-of-replica), lag accumulates additively at each tier. Each hop adds its own apply overhead and network RTT, so downstream replicas need proportionally tighter warning thresholds relative to their inherited lag. Measure per-hop lag independently using pg_stat_replication on each intermediate node, and account for the additive delay when deriving thresholds for downstream consumers.


← Back to Understanding Synchronous vs Asynchronous Replication