Configuring MySQL Read-Only Routing with Lag Thresholds

Operational problem: A MySQL replica serving read traffic has drifted behind the primary, but the load balancer keeps routing queries to it because it is still accepting connections β€” users see outdated balances, missing orders, or stale inventory counts.

Blind round-robin across MySQL read replicas violates application SLAs by serving stale data during replication drift. Deterministic threshold enforcement gates routing decisions on measurable replication latency rather than raw connection availability. This runbook covers production-ready configuration for MySQL 8.0+ (GTID-based), ProxySQL v2.5+, and HAProxy 2.6+, and ties directly into the routing queries based on data freshness requirements framework: replicas that exceed a configured lag threshold are removed from the read pool before stale data reaches users.


Symptom Identification

Look for these signals in production before assuming the problem is elsewhere:

  • Stale read complaints: Users report seeing old account balances, missing recent orders, or profile changes that have not propagated β€” typically within seconds of a write.
  • ProxySQL status showing SHUNNED: SELECT * FROM runtime_mysql_servers WHERE status != 'ONLINE'; returns replicas in SHUNNED state, yet application errors are 5xx rather than graceful fallback.
  • Seconds_Behind_Source climbing: On the replica itself, SHOW REPLICA STATUS\G shows Seconds_Behind_Source growing steadily rather than hovering near zero.
  • Performance Schema lag divergence: The heartbeat-table lag (sub-second) disagrees with Seconds_Behind_Source (coarser, integer seconds) β€” this indicates the SQL applier thread is stalling between transactions.
  • Connection pool exhaustion: All replicas exceed threshold simultaneously β†’ pool drains β†’ application falls back to primary β†’ primary CPU spikes above baseline.
sql
-- Quick lag snapshot across all replicas (run on each replica host)
SELECT
  @@hostname AS replica,
  TIMESTAMPDIFF(MICROSECOND, ts, NOW(3)) / 1000.0 AS heartbeat_lag_ms
FROM _replication_heartbeat;

-- Confirm applier thread health
SELECT
  SERVICE_STATE,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status;

Root Cause Analysis

Replication lag in MySQL 8.0 originates from a bounded set of conditions:

Single-threaded applier bottleneck. With slave_parallel_workers = 0 (or 1), the SQL applier processes the relay log serially. A single large INSERT ... SELECT, an ALTER TABLE, or a batch update stalls all subsequent transactions behind it.

Parallel replication misconfiguration. slave_parallel_type = DATABASE distributes work across schemas, not transactions. On single-schema workloads this is identical to single-threaded. Switch to LOGICAL_CLOCK with binlog_transaction_dependency_tracking = WRITESET to unlock true parallel apply.

I/O saturation on the replica. Replicas with sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 buffer writes aggressively β€” correct for durability trade-offs β€” but disk queue saturation during write bursts causes the applier to stall waiting for fsync completion.

Missing primary keys. Tables without a primary key force full-table scans on the replica for every row-based replication event that touches them. This can turn a sub-millisecond write on the primary into a multi-second replay on the replica.

Network jitter between primary and replica. Asymmetric latency causes the I/O thread to stall reading from the binary log. The SQL thread runs out of work in the relay log and falls behind the primary.


Step-by-Step Resolution

Step 1 β€” Instrument Real-Time Lag Detection

SHOW REPLICA STATUS is insufficient for routing decisions: it is blocking, integer-precision, and returns NULL when the I/O thread is disconnected. Use a heartbeat table instead.

1.1 Create and populate the heartbeat table on the primary:

sql
-- Run once on the primary
CREATE TABLE IF NOT EXISTS _replication_heartbeat (
  id INT NOT NULL DEFAULT 1,
  ts DATETIME(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Keep updated every 100ms via the Event Scheduler or pt-heartbeat
-- pt-heartbeat is preferred: pt-heartbeat --update --database heartbeat_db
UPDATE _replication_heartbeat SET ts = NOW(3) WHERE id = 1;

Verification: On each replica, confirm the table is replicating and lag is sub-second under normal load:

sql
SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(3)) / 1000.0 AS lag_ms
FROM _replication_heartbeat;
-- Expected: < 50ms under light load; < 500ms during write bursts

1.2 Enable non-blocking Performance Schema monitoring:

sql
-- Confirm replication channels are connected and error-free
SELECT
  channel_name,
  SERVICE_STATE,
  LAST_HEARTBEAT_TIMESTAMP,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status
WHERE CHANNEL_NAME = '';

-- Check parallel applier worker progress
SELECT
  WORKER_ID,
  LAST_APPLIED_TRANSACTION,
  LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;

1.3 Configure ProxySQL’s built-in lag monitor:

sql
-- Polling interval and timeout in milliseconds
UPDATE global_variables
  SET variable_value = '2000'
  WHERE variable_name = 'mysql-monitor_replication_lag_interval';

UPDATE global_variables
  SET variable_value = '1000'
  WHERE variable_name = 'mysql-monitor_replication_lag_timeout';

LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;

ProxySQL reads Seconds_Behind_Source from each server in the reader hostgroup and marks any replica SHUNNED when its lag exceeds max_replication_lag. For heartbeat-table precision, supplement this with an external monitoring agent writing lag into ProxySQL’s mysql_server_ping_log or a custom health-check script.


Step 2 β€” Define Threshold Policies in ProxySQL

Thresholds must map to business SLA tiers, not arbitrary second counts. Maintaining separate hostgroups per freshness tier prevents a lagging analytics replica from contaminating the read pool for transactional queries.

2.1 Create tiered hostgroup assignments:

sql
-- Writer hostgroup=10 serves all writes and read-after-write queries
-- Reader hostgroup=20 enforces a strict 5s max lag (user-facing transactional reads)
-- Reader hostgroup=30 enforces a relaxed 30s max lag (analytics, reporting)

INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'prod_strict');

INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 30, 'prod_relaxed');

-- Apply per-server lag limits (seconds)
UPDATE mysql_servers SET max_replication_lag = 5  WHERE hostgroup_id = 20;
UPDATE mysql_servers SET max_replication_lag = 30 WHERE hostgroup_id = 30;

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

Verification: Confirm servers are in the expected status:

sql
SELECT hostgroup_id, hostname, port, status, max_replication_lag
FROM runtime_mysql_servers
ORDER BY hostgroup_id, hostname;
-- All reader servers should show status='ONLINE' if lag is below threshold

2.2 Map query rules to hostgroups:

Route by SQL comment annotations so applications can express freshness requirements explicitly (see routing queries based on data freshness requirements for annotation conventions):

sql
-- Route strict reads to hostgroup 20
INSERT INTO mysql_query_rules
  (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
  (100, 1, '^/\* MAX_STALENESS=strict \*/', 20, 1);

-- Route relaxed reads to hostgroup 30
INSERT INTO mysql_query_rules
  (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
  (110, 1, '^/\* MAX_STALENESS=relaxed \*/', 30, 1);

-- All writes to primary (hostgroup 10)
INSERT INTO mysql_query_rules
  (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
  (200, 1, '^(INSERT|UPDATE|DELETE|REPLACE|CREATE|ALTER|DROP|TRUNCATE)', 10, 1);

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

Step 3 β€” HAProxy Lag-Aware External Check

If you use HAProxy in front of ProxySQL or directly in front of replicas, add an external health check that gates on lag rather than port availability:

ini
backend mysql_readers
  option external-check
  external-check path "/usr/bin:/usr/local/bin"
  external-check command /etc/haproxy/check_mysql_lag.sh
  server replica1 10.0.1.11:3306 check inter 2s fall 3 rise 2
  server replica2 10.0.1.12:3306 check inter 2s fall 3 rise 2
bash
#!/bin/bash
# /etc/haproxy/check_mysql_lag.sh
# HAProxy passes server IP as $3 and port as $4
HOST=$3
PORT=$4
THRESHOLD=5   # seconds; adjust per SLA tier

LAG=$(mysql -h "$HOST" -P "$PORT" -u monitor --password='monitorpass' \
  -e "SHOW REPLICA STATUS\G" 2>/dev/null \
  | awk '/Seconds_Behind_Source/ {print $2}')

if [[ "$LAG" == "NULL" ]] || [[ -z "$LAG" ]] || (( LAG > THRESHOLD )); then
  exit 1
fi
exit 0

Verification: After deploying, trigger artificial lag on a replica with STOP REPLICA SQL_THREAD; and wait for inter * fall seconds. Confirm HAProxy marks the backend DOWN:

bash
echo "show servers state" | socat stdio /run/haproxy/admin.sock | grep mysql_readers
# The lagging replica should show state=6 (DOWN)

Step 4 β€” Implement Client-Side Routing Logic

Proxies route at the connection level; applications must reinforce routing at the query and transaction boundary to prevent stale-read leakage through connection reuse.

4.1 Separate connection pools per freshness tier:

The architecture below shows the routing path from application to primary/replica with ProxySQL enforcing lag thresholds at the proxy layer.

MySQL lag-threshold routing architecture Diagram showing two application connection pools (strict and relaxed) connecting to ProxySQL, which maintains hostgroup 20 (max 5s lag) and hostgroup 30 (max 30s lag) for two replicas, plus hostgroup 10 for the primary writer. A shunned replica is shown grayed out when its lag exceeds the threshold. App: strict pool MAX_STALENESS=strict App: relaxed pool MAX_STALENESS=relaxed App: write pool DML + read-after-write ProxySQL HG 20 Β· max 5s strict reads HG 30 Β· max 30s relaxed reads HG 10 Β· primary writes Replica 1 lag: 1s Β· ONLINE Replica 2 lag: 18s Β· SHUNNED Primary HG 10 Β· writes active route blocked (lag exceeded)

4.2 HikariCP (Java) β€” separate pools per tier:

java
// Bind separate HikariCP pools to each ProxySQL hostgroup
// Hostgroup is selected via a custom JDBC URL parameter or ProxySQL query rules
public DataSource getDataSourceForQuery(QueryType type) {
    return switch (type.getStalenessTolerance()) {
        case STRICT   -> strictReaderPool;   // routes to HG 20 via query rule
        case RELAXED  -> relaxedReaderPool;  // routes to HG 30 via query rule
        default       -> writerPool;         // routes to HG 10
    };
}

4.3 SQL comment annotation (any client):

sql
-- Strict: routed to HG 20 (max 5s lag replicas only)
/* MAX_STALENESS=strict */ SELECT balance FROM accounts WHERE user_id = ?;

-- Relaxed: routed to HG 30 (max 30s lag replicas)
/* MAX_STALENESS=relaxed */ SELECT * FROM audit_log WHERE created_at > ?;

Configuration Snippet

Minimal complete ProxySQL configuration with critical parameters annotated:

sql
-- ── Lag monitor settings ───────────────────────────────────────────
-- How often ProxySQL polls Seconds_Behind_Source (ms)
UPDATE global_variables SET variable_value='2000'
  WHERE variable_name='mysql-monitor_replication_lag_interval';

-- Max time ProxySQL waits for a lag-check response before treating
-- the replica as unreachable (ms); keep < monitor interval
UPDATE global_variables SET variable_value='1000'
  WHERE variable_name='mysql-monitor_replication_lag_timeout';

-- ── Hostgroup topology ─────────────────────────────────────────────
-- ProxySQL automatically moves servers between HG 10↔20 based on
-- read_only flag AND max_replication_lag threshold below
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'prod_strict');

-- Per-server max lag in seconds; servers exceeding this β†’ SHUNNED
UPDATE mysql_servers SET max_replication_lag = 5  WHERE hostgroup_id = 20;
UPDATE mysql_servers SET max_replication_lag = 30 WHERE hostgroup_id = 30;

-- ── Query routing rules ────────────────────────────────────────────
-- Writes always go to primary (HG 10)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (200, 1, '^(INSERT|UPDATE|DELETE|REPLACE|CREATE|ALTER|DROP|TRUNCATE)', 10, 1);

-- Default SELECT β†’ strict reader pool
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (300, 1, '^SELECT', 20, 1);

LOAD MYSQL VARIABLES TO RUNTIME;   SAVE MYSQL VARIABLES TO DISK;
LOAD MYSQL SERVERS TO RUNTIME;     SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;

Verification and Rollback

Confirm the fix is working

sql
-- 1. Check server status in ProxySQL runtime
SELECT hostgroup_id, hostname, port, status, max_replication_lag, replication_lag
FROM runtime_mysql_servers
ORDER BY hostgroup_id, hostname;
-- All reader servers: status='ONLINE', replication_lag < max_replication_lag

-- 2. Simulate lag breach: on a replica, pause the SQL applier for 10s
STOP REPLICA SQL_THREAD;
-- Wait 10s, then query ProxySQL
SELECT hostname, status FROM runtime_mysql_servers WHERE hostgroup_id = 20;
-- Expected: status='SHUNNED' for the paused replica

-- 3. Resume and confirm automatic recovery
START REPLICA SQL_THREAD;
-- After lag drops below threshold, status returns to 'ONLINE' automatically

-- 4. Verify routing from the application side
/* MAX_STALENESS=strict */ SELECT @@hostname;
-- Should return a replica hostname, never the primary, under normal lag

Rollback procedure

If threshold routing introduces instability (proxy flapping, connection pool starvation), revert to static routing immediately:

sql
-- Step 1: Remove lag-aware hostgroup rules
DELETE FROM mysql_replication_hostgroups;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

-- Step 2: Remove lag limits from all servers (set to 0 = no limit)
UPDATE mysql_servers SET max_replication_lag = 0;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

-- Step 3: Remove query routing rules if you want primary-only traffic
DELETE FROM mysql_query_rules WHERE destination_hostgroup IN (20, 30);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

After rollback, restart application instances to flush stale connection pool entries, then verify primary-only traffic:

bash
for host in replica1 replica2 replica3; do
  lag=$(mysql -h "$host" -e "SHOW REPLICA STATUS\G" \
    | awk '/Seconds_Behind_Source/ {print $2}')
  echo "$host: lag=${lag}s"
done
# Confirm lag is still low before re-enabling threshold routing

Maintain primary-only routing for at least 30 minutes, then re-enable with conservative defaults (max_replication_lag = 15) before tightening again.


Edge Cases and Gotchas

Cascading replication topologies. In a chain where replica B replicates from replica A, Seconds_Behind_Source on B measures lag from A’s relay log, not from the primary. ProxySQL will see B as healthy if its lag from A is low, even though B might be 30s behind the primary. Use heartbeat tables anchored to the primary to get true end-to-end lag across the chain.

Multi-region deployments. Cross-region replicas routinely carry 50–200ms of baseline lag from WAN round-trip time alone. Set max_replication_lag thresholds per region, not globally. A 5s threshold appropriate for same-AZ replicas will continuously shun a cross-region replica under normal operating conditions. Consider maintaining a separate ProxySQL instance per region with region-appropriate thresholds, as detailed in designing multi-region read replica topologies.

NULL Seconds_Behind_Source after I/O thread disconnect. When the I/O thread loses its connection to the primary, SHOW REPLICA STATUS reports Seconds_Behind_Source = NULL. ProxySQL treats NULL as exceeding any threshold and shuns the server, which is correct behavior β€” but the root cause (network partition, primary restart, credential expiry) must be investigated separately. Detecting and handling replication lag in real-time covers the alerting signals that distinguish a disconnected I/O thread from a slow SQL applier.


FAQ

What happens when all replicas exceed the lag threshold simultaneously?

ProxySQL marks all reader hostgroup servers SHUNNED and routes queries to the writer hostgroup. Prevent primary overload by pre-configuring a fallback hostgroup with a looser threshold, and implement an application-level circuit breaker that returns cached or stale data with an explicit header rather than hammering the primary.

Why does ProxySQL report lag as 0 when Seconds_Behind_Source shows a non-zero value?

ProxySQL’s built-in lag monitor reads Seconds_Behind_Source, which resets to 0 when the SQL thread is idle even if the replica is behind. Use a heartbeat table for accurate sub-second granularity. ProxySQL also needs mysql-monitor_replication_lag_interval set lower than your threshold check interval.

Can HAProxy enforce lag thresholds without ProxySQL?

Yes. Use HAProxy’s external-check directive with a shell script that queries Seconds_Behind_Source and exits non-zero when the threshold is exceeded. The drawback is that external-check executes per health-check interval (typically 2s), so it cannot react faster than that interval, and it requires OS-level MySQL credentials on the HAProxy host.


← Back to Routing Queries Based on Data Freshness Requirements