How to Calculate Replication Lag Thresholds for SLA Compliance
Problem statement: Your monitoring fires a lag alert but you have no principled threshold to tell you whether 800 ms of replica delay is acceptable or an SLA breach in progress. You need a deterministic method to derive warning and hard-breach thresholds from your actual SLA budget, baseline telemetry, and infrastructure overhead β before the next incident.
Asynchronous replication always introduces a delay between the primary committing a write and a replica applying it. That delay β replication_delay_seconds in PostgreSQL, Seconds_Behind_Source in MySQL β is not a fixed constant: it fluctuates with write volume, network conditions, and replica I/O. Without thresholds grounded in real measurements, alert tuning becomes guesswork and SLA commitments cannot be enforced.
Symptom Identification
Look for these signals in production before applying any threshold calculation. Each one indicates lag has already crossed a boundary.
Metrics and telemetry to watch:
replication_delay_seconds(PostgreSQLpg_stat_replication.write_lag/replay_lag) rising monotonically over a 3-minute evaluation windowSeconds_Behind_Source > 0sustained in MySQLSHOW REPLICA STATUSwal_lag_bytesclimbing without a corresponding increase inpg_stat_replication.sent_lsndeltaread_consistency_errorscounter incrementing in application-layer replica routing logic
Application-level symptoms:
- Stale query results that violate idempotency: a record that was just written returns as absent on the next read
409 Conflict/SQLSTATE 23505errors on retry paths that read-then-write through a replica- Unexpected connection pool exhaustion caused by retry storms from clients re-querying a stale replica
Critical distinction: differentiate between transient jitter (sub-second spikes during checkpoint flushes) and sustained apply lag (monotonic increase over minutes). Configure alert evaluation windows (for: 3m in Prometheus/Alertmanager) to suppress false positives from the former while catching the latter.
Root Cause Analysis
Before adjusting any threshold, isolate which infrastructure layer is causing the lag. Masking a deficit with a relaxed threshold only defers the breach.
Network/Transport layer: Round-trip latency variance above 50 ms, packet loss above 0.1%, or TCP window exhaustion disrupts WAL streaming continuity. Verify tcp_keepalives_idle, tcp_keepalives_interval, and MTU alignment across availability zones. Cross-AZ multi-region replica topologies amplify this effect.
Primary write saturation: High wal_buffers flush latency, max_wal_size exhaustion, or aggressive checkpoint thrashing (checkpoint_completion_target < 0.9) stalls WAL segment generation. Monitor pg_stat_bgwriter.checkpoint_write_time and buffers_checkpoint for symptoms.
Replica I/O contention: Disk queue depth above 10, undersized shared_buffers, or long-running analytical queries holding AccessShareLock and blocking WAL apply when hot_standby_feedback = on is set. This is the most common source of sustained lag spikes on read-heavy replicas.
Routing and session stickiness: Connection pool misconfiguration β particularly pool_mode = transaction in PgBouncer with SET statements that rely on session state β can force clients to remain bound to a degraded replica after the routing policy has already redirected new connections away. See connection pool architecture for read replicas for the full failure mode.
Step-by-Step Threshold Calculation
Step 1 β Define workload tiers
Assign a maximum acceptable staleness per query class. This becomes your SLA budget for each tier. If no explicit SLA exists, derive the limit from user-facing consequences: a stale balance display is tolerable for 2 s; a stale inventory lock is not tolerable past 100 ms.
| Query class | Max staleness (SLA budget) | Consistency model |
|---|---|---|
| Transactional reads (checkout, auth) | 500 ms | READ COMMITTED minimum |
| Reporting / dashboards | 2 000 ms | Eventual |
| Background analytics | 5 000 ms | Eventual |
Step 2 β Capture P95 baseline lag
Extract historical lag percentiles under peak write throughput. Use a 30-day window that includes your busiest days. Never derive thresholds from mean lag β mean hides the tail behaviour that causes SLA breaches.
-- PostgreSQL: lag in seconds at the moment of query
SELECT
application_name,
EXTRACT(EPOCH FROM write_lag) AS write_lag_s,
EXTRACT(EPOCH FROM flush_lag) AS flush_lag_s,
EXTRACT(EPOCH FROM replay_lag) AS replay_lag_s
FROM pg_stat_replication;-- MySQL / MariaDB: seconds behind source
SHOW REPLICA STATUS\G
-- Read: Seconds_Behind_SourceIf you have Prometheus with pg_replication_lag or mysql_slave_seconds_behind_master, query the P95 over 30 days:
histogram_quantile(0.95,
rate(pg_replication_lag_bucket[30d])
)Record your P95 and P99 values. If P99 already exceeds your SLA budget, you have an infrastructure problem β resolve it before setting thresholds.
Inline verification:
-- Confirm you are reading live replication state, not cached stats
SELECT now(), pg_last_xact_replay_timestamp(),
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_s;Step 3 β Apply the safety-margin formula
Hard_Breach_Threshold = SLA_Budget - (Network_RTT + Apply_Overhead + Safety_Buffer)
Warning_Threshold = Hard_Breach_Threshold Γ 0.70
Component definitions:
Network_RTT: measured round-trip time between primary and replica (usepg_pingor equivalent; include cross-AZ variance)Apply_Overhead: the measured time between a WAL record being received and being applied, visible asflush_lag - write_laginpg_stat_replicationSafety_Buffer: cushion for measurement jitter and alert evaluation delay; a minimum of 150 ms is reasonable for sub-second SLAs; 200β300 ms for longer windows
Worked example (2 000 ms SLA, transatlantic replica):
Hard_Breach = 2000ms - (80ms RTT + 120ms apply + 200ms safety)
Hard_Breach = 1600ms
Warning = 1600ms Γ 0.70 = 1120ms
Inline verification: Run the lag query above immediately after a known write and confirm the observed lag is below your calculated warning threshold under normal conditions.
Step 4 β Cross-reference proxy timeouts
The calculated hard-breach threshold must be strictly lower than the proxyβs read timeout. If it is not, a replica that is sitting at exactly the breach threshold will also be timing out client connections, causing a retry storm on top of the failover.
proxy_read_timeout β₯ Hard_Breach_Threshold + 500ms
For a 1 600 ms hard-breach threshold, set proxy_read_timeout = 2100ms (or the nearest supported value). Align this in your read/write splitting proxy configuration before deploying threshold-based routing.
Inline verification: Use SELECT pg_sleep(2) through the proxy and confirm the proxy returns the result rather than timing out at the connection level.
Step 5 β Deploy tiered alerting
Configure two alert rules, not one. A single alert fires too late for proactive action.
# Prometheus alerting rules
groups:
- name: replication_lag
rules:
- alert: ReplicationLagWarning
expr: pg_replication_lag_seconds > 1.12 # 70% of 1600ms hard breach
for: 3m
labels:
severity: warning
annotations:
summary: "Replica {{ $labels.instance }} approaching SLA lag limit"
description: "Lag {{ $value }}s exceeds warning threshold 1.12s. Investigate before breach."
- alert: ReplicationLagBreach
expr: pg_replication_lag_seconds > 1.60 # hard breach threshold
for: 1m
labels:
severity: critical
annotations:
summary: "Replica {{ $labels.instance }} breached SLA lag limit"
description: "Lag {{ $value }}s exceeds hard breach 1.60s. Failover routing now."The for: 3m evaluation window on the warning suppresses false positives during routine checkpoint flushes. The for: 1m on the breach fires faster to trigger automated failover.
Inline verification: Use amtool alert query ReplicationLagWarning to confirm the alert is loaded and its threshold matches the value you calculated.
Step 6 β Version-control the thresholds
Parameterize calculated thresholds in Terraform or Ansible. Never hardcode them in application logic or runtime environment variables β they will drift and become unauditable.
# terraform/variables.tf
variable "replication_lag_warning_s" {
description = "Replication lag warning threshold in seconds (70% of hard breach)"
type = number
default = 1.12
}
variable "replication_lag_breach_s" {
description = "Replication lag hard breach threshold in seconds"
type = number
default = 1.60
}Configuration Snippet
Deploy dynamic routing rules using the calculated thresholds. The proxy evaluates real-time lag and routes reads to compliant replicas or fails over to the primary.
ProxySQL (MySQL/MariaDB) β max_replication_lag in seconds:
-- Define reader hostgroup with lag threshold
INSERT INTO mysql_replication_hostgroups
(writer_hostgroup, reader_hostgroup, comment, max_replication_lag)
VALUES
(10, 20, 'prod-readers', 2); -- 2 s = hard breach threshold
-- Register replica in the reader hostgroup
INSERT INTO mysql_servers
(hostgroup_id, hostname, port, max_connections, max_replication_lag)
VALUES
(20, 'replica-01.db.internal', 3306, 500, 2);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;PostgreSQL health check for HAProxy / load balancer:
-- Returns lag in seconds; exit non-zero (HTTP 503) if above threshold
SELECT
CASE
WHEN EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) < 1.60
THEN 'OK'
ELSE 'LAGGING'
END AS replica_status,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::NUMERIC(8,3) AS lag_s;Return HTTP 200 when lag_s < 1.60; return HTTP 503 when breached. Integrate this with fallback strategies when replicas fall behind to ensure the primary absorbs read traffic cleanly during failover.
PgBouncer β pool drain sequence on breach:
# Pause new queries to the lagging pool, redirect to primary, then resume
psql -h localhost -p 6432 -U pgbouncer -c "PAUSE replica_pool;"
# (update HAProxy backend weight to 0 for the lagging replica here)
psql -h localhost -p 6432 -U pgbouncer -c "RESUME replica_pool;"
# Once replica catches up (lag < 50% of threshold for 5 consecutive minutes):
psql -h localhost -p 6432 -U pgbouncer -c "RECONNECT replica_pool;"Verification and Rollback
Verify the fix is working
Run this immediately after deploying threshold-based routing to confirm the routing policy is being applied:
-- PostgreSQL: confirm lag is below threshold and replica is streaming
SELECT
application_name,
state,
sync_state,
EXTRACT(EPOCH FROM replay_lag) AS replay_lag_s
FROM pg_stat_replication
WHERE EXTRACT(EPOCH FROM replay_lag) IS NOT NULL;
-- Expect: state = 'streaming', replay_lag_s < hard_breach_threshold-- MySQL: confirm replica is running and lag is within threshold
SHOW REPLICA STATUS\G
-- Expect: Replica_IO_Running = Yes, Replica_SQL_Running = Yes
-- Seconds_Behind_Source < hard_breach_threshold_in_secondsCheck Prometheus to confirm ReplicationLagWarning and ReplicationLagBreach alerts have resolved.
Rollback procedure
If threshold changes cause unexpected routing behaviour or cascade connection failures:
- Revert
mysql_replication_hostgroups.max_replication_lagto the previous value viaLOAD MYSQL SERVERS TO RUNTIMEwithout saving to disk (preserves rollback path). - Restore the previous Prometheus alerting rule from version control and
curl -X POST http://alertmanager:9093/-/reload. - Restore PgBouncer pool configuration from the IaC-managed copy and
pgbouncer -R(graceful reload without dropping connections). - Do not re-introduce a previously breached replica until lag has remained below 50% of the hard-breach threshold for 5 consecutive minutes.
- Commit the reverted threshold values to the IaC repository within 24 hours, noting the reason in the commit message.
Edge Cases and Gotchas
Cascading replication (replica-of-replica): In a topology where a replica feeds a downstream replica, lag accumulates additively at each tier. Each hop adds its own apply overhead and network RTT. A downstream replica that appears compliant by its own pg_last_xact_replay_timestamp() measurement may still be 2β3 hops behind the primary in wall-clock terms. Measure per-hop lag independently using pg_stat_replication on each intermediate node, and set progressively tighter warning thresholds on downstream nodes to account for the additive delay.
Logical replication with large transactions: Logical replication applies changes in full transaction batches, not individual WAL records. A single large bulk INSERT or UPDATE can cause lag to spike from 0 to several seconds instantaneously as the apply worker processes the complete transaction. This is expected behaviour, not infrastructure failure. Use for: 5m evaluation windows for warning alerts on logical replica deployments to avoid alert fatigue, and exclude known batch-job windows from SLA calculations.
Multi-region deployments with asymmetric RTT: Cross-continent replicas have substantially higher and more variable RTT than cross-AZ replicas. The Network_RTT component of the safety-margin formula must be measured at P99 latency, not average β otherwise the safety buffer will be too thin during peak network congestion. For multi-region topologies, set separate threshold profiles per region; a US-East primary to EU-West replica requires a fundamentally different threshold than the same primary to a US-West replica.
FAQ
What is a safe replication lag threshold for a 2-second SLA?
Subtract network RTT, WAL apply overhead, and a safety buffer from your SLA budget. For a 2 000 ms SLA with 50 ms RTT, 100 ms apply overhead, and a 200 ms safety margin, the hard-breach threshold is 1 650 ms. Set your warning threshold at 70% of that value β approximately 1 155 ms. Always verify these values against your measured P95 baseline before deploying.
Why should I not just relax the lag threshold when replicas fall behind?
Relaxing the threshold masks the underlying bottleneck β network congestion, replica I/O saturation, or primary write pressure β and guarantees a future hard breach under heavier load. Diagnose root cause first using pg_stat_replication, disk queue metrics, and network latency data. Only adjust thresholds once the infrastructure deficit is resolved and a new P95 baseline has been measured.
How do I handle replication lag thresholds in cascading replication setups?
In cascading topologies (replica-of-replica), lag accumulates additively at each tier. Each hop adds its own apply overhead and network RTT, so downstream replicas need proportionally tighter warning thresholds relative to their inherited lag. Measure per-hop lag independently using pg_stat_replication on each intermediate node, and account for the additive delay when deriving thresholds for downstream consumers.
β Back to Understanding Synchronous vs Asynchronous Replication
Related
- Understanding Synchronous vs Asynchronous Replication β the commit-path mechanics that determine which lag profile your deployment will exhibit
- Detecting and Handling Replication Lag in Real Time β production telemetry patterns and real-time detection strategies for lag events
- Fallback Strategies When Replicas Fall Behind β automated and manual procedures for routing around a lagging replica without dropping connections
- Routing Queries Based on Data Freshness Requirements β how to classify queries by staleness tolerance and wire that into your proxy routing rules
- Configuring MySQL Read-Only Routing with Lag Thresholds β ProxySQL and MySQL Router configuration specifics for lag-aware read routing