Configuring MySQL Read-Only Routing with Lag Thresholds
Blind round-robin load balancing across MySQL read replicas violates application SLAs by serving stale data during replication drift. Effective replication lag and consistency management requires deterministic threshold enforcement, where routing decisions are gated by measurable replication latency rather than simple connection availability. This runbook covers production-ready configuration for MySQL 8.0+ (GTID-based), ProxySQL v2.5+, HAProxy 2.6+, and application-level routing logic. The objective is to isolate lagging nodes before they impact user-facing consistency, while maintaining high read throughput under normal operating conditions.
Step 1: Instrumenting Real-Time Lag Detection
Accurate routing depends on sub-second lag telemetry. Relying solely on Seconds_Behind_Master from SHOW SLAVE STATUS (SHOW REPLICA STATUS in MySQL 8.0.22+) is insufficient for routing decisions: it has whole-second granularity, reports NULL whenever the SQL thread stops, and measures lag against relay-log timestamps rather than the writer's clock.
1.1 Extract Lag Metrics
Poll Seconds_Behind_Master alongside heartbeat table timestamps for precision:
-- Heartbeat table approach (recommended for sub-second accuracy)
CREATE TABLE IF NOT EXISTS _replication_heartbeat (
ts TIMESTAMP(3) DEFAULT CURRENT_TIMESTAMP(3),
PRIMARY KEY (ts)
);
-- Insert from writer every 100ms via cron or event scheduler
-- Replica calculates lag: NOW(3) - MAX(ts)
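The replica-side calculation above can be sketched in application code. A minimal example follows; the function names are illustrative, the 2s default threshold is an assumption to tune per workload, and the math assumes writer and replica clocks are NTP-synchronized:

```python
from datetime import datetime

def heartbeat_lag_seconds(newest_heartbeat: datetime, replica_now: datetime) -> float:
    """Replica-side lag estimate: NOW(3) - MAX(ts) from _replication_heartbeat."""
    return max(0.0, (replica_now - newest_heartbeat).total_seconds())

def is_routable(lag_seconds: float, threshold_seconds: float = 2.0) -> bool:
    """Gate routing on measured lag, not on connection availability."""
    return lag_seconds <= threshold_seconds

# Example: writer stamped 10:00:00.000, replica clock reads 10:00:01.500
lag = heartbeat_lag_seconds(datetime(2024, 1, 1, 10, 0, 0),
                            datetime(2024, 1, 1, 10, 0, 1, 500000))
```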
1.2 Performance Schema Polling
For non-blocking, continuous monitoring, query performance_schema:
SELECT
  CHANNEL_NAME,
  SERVICE_STATE,
  LAST_QUEUED_TRANSACTION,
  LAST_HEARTBEAT_TIMESTAMP,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status
WHERE CHANNEL_NAME = '';  -- default (unnamed) channel
-- Applier-side progress (LAST_APPLIED_TRANSACTION) is exposed per worker in
-- performance_schema.replication_applier_status_by_worker (see Diagnostic Queries below).
1.3 Continuous Monitoring Integration
Deploy mysqld_exporter with --collect.slave_status and --collect.heartbeat. Configure ProxySQL to ingest lag natively:
-- ProxySQL native replication monitoring
SET mysql-monitor_replication_lag_interval=2000;  -- poll every 2s (milliseconds)
SET mysql-monitor_replication_lag_timeout=1000;   -- per-check timeout (milliseconds)
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
ProxySQL's monitor module then measures lag on every backend and temporarily shuns any server whose lag exceeds its max_replication_lag setting in mysql_servers, bypassing manual heartbeat scripts.
Step 2: Defining Threshold Policies in the Connection Router
Thresholds must align with query criticality. Absolute thresholds (fixed seconds) suit most OLTP workloads, while relative thresholds (% of write throughput) apply to high-churn analytical pipelines.
2.1 ProxySQL Lag-Aware Hostgroups
-- writer_hostgroup=10, reader_hostgroup=20; the monitor promotes/demotes servers via read_only checks
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, check_type, comment)
VALUES (10, 20, 'read_only', 'prod');
-- max_replication_lag is a per-server column in mysql_servers, not a hostgroup attribute
UPDATE mysql_servers SET max_replication_lag=5 WHERE hostgroup_id=20;   -- tight readers (lag <= 5s)
-- Relaxed reporting replicas are registered directly in mysql_servers under hostgroup 30
-- (a writer hostgroup can be paired with only one reader hostgroup)
UPDATE mysql_servers SET max_replication_lag=30 WHERE hostgroup_id=30;  -- relaxed readers (lag <= 30s)
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
2.2 HAProxy Equivalent
HAProxy requires an external check for lag-aware routing:
backend mysql_readers
    # external-check supersedes "option mysql-check" (only one check method applies per
    # backend) and must also be enabled via "external-check" in the global section
    option external-check
    external-check command /etc/haproxy/check_mysql_lag.sh
    server replica1 10.0.1.11:3306 check inter 2s fall 3 rise 2
    server replica2 10.0.1.12:3306 check inter 2s fall 3 rise 2
The external script must exit 0 if Seconds_Behind_Master < threshold, else 1.
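A minimal sketch of such a check script follows, written in Python for readability (the production path /etc/haproxy/check_mysql_lag.sh would point at an equivalent executable). The 5s threshold is an illustrative assumption, and the script assumes the mysql CLI is on PATH with passwordless monitor credentials:

```python
#!/usr/bin/env python3
"""Lag check invoked by HAProxy external-check: exit 0 when the replica's
Seconds_Behind_Master is under THRESHOLD_SECONDS, exit 1 otherwise."""
import subprocess
import sys

THRESHOLD_SECONDS = 5  # illustrative; match the hostgroup policy

def parse_lag(status_output: str):
    """Pull Seconds_Behind_Master out of SHOW SLAVE STATUS output.
    Returns None when replication is stopped (value is NULL) or absent."""
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("Seconds_Behind_Master:"):
            value = line.split(":", 1)[1].strip()
            return None if value == "NULL" else int(value)
    return None

def main() -> int:
    # HAProxy passes <proxy_addr> <proxy_port> <server_addr> <server_port>
    server_addr = sys.argv[3]
    result = subprocess.run(
        ["mysql", "-h", server_addr, "-e", "SHOW SLAVE STATUS\\G"],
        capture_output=True, text=True, timeout=2,
    )
    lag = parse_lag(result.stdout)
    return 0 if lag is not None and lag < THRESHOLD_SECONDS else 1

if __name__ == "__main__" and len(sys.argv) >= 4:
    sys.exit(main())
```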
2.3 Routing Policy Mapping
Align application SLAs with database routing rules per Routing Queries Based on Data Freshness Requirements:
- hostgroup=20: User profile, cart, payment verification (max lag: 1–2s)
- hostgroup=30: Analytics, audit logs, search indexing (max lag: 15–30s)
- hostgroup=10: All writes, read-after-write guarantees, session state
Step 3: Implementing Client-Side Routing Logic
Proxies route at the connection level, but applications must enforce routing boundaries at the transaction and query layer.
3.1 Connection Pooler Configuration
- HikariCP (Java): Use separate pools per hostgroup. Inject max_staleness via connection properties.
- SQLAlchemy (Python): Bind engines to hostgroups. Use execution_options({"max_staleness": 2}) for custom routing hooks.
- Go database/sql + sqlx: Maintain *sql.DB instances per hostgroup. Route via middleware.
3.2 Dynamic Datasource Switching
// Java/HikariCP example: pick a pool by the query's staleness tolerance
public DataSource getDataSourceForQuery(QueryType type) {
    int maxStaleness = type.getStalenessTolerance();
    if (maxStaleness == 0) return writerPool;         // read-after-write: no staleness tolerated
    if (maxStaleness <= 2) return tightReaderPool;    // hostgroup 20
    if (maxStaleness <= 30) return relaxedReaderPool; // hostgroup 30
    return writerPool;                                // default to the safest pool
}
3.3 Middleware & Header Propagation
Implement a read/write splitting middleware that parses X-Max-Staleness or SQL comments:
/* MAX_STALENESS=5 */ SELECT * FROM orders WHERE user_id = ?;
The proxy intercepts the hint, evaluates replica lag, and routes to the first hostgroup where replication_lag <= hint_value. If no replica qualifies, it falls back to the primary.
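That interception step can be sketched as a small routing function; the hostnames, lag snapshot, and "pick the freshest qualifying replica" tiebreak are illustrative assumptions:

```python
import re
from typing import Optional

HINT_RE = re.compile(r"/\*\s*MAX_STALENESS=(\d+)\s*\*/")

def max_staleness_hint(sql: str) -> Optional[int]:
    """Extract the MAX_STALENESS value from a SQL comment hint, if present."""
    m = HINT_RE.search(sql)
    return int(m.group(1)) if m else None

def route(sql: str, replica_lags: dict, primary: str = "primary") -> str:
    """Route to the least-lagged replica that satisfies the hint; fall back
    to the primary when no replica qualifies or no hint is given."""
    hint = max_staleness_hint(sql)
    if hint is None:
        return primary  # no hint: preserve read-after-write safety
    for host, lag in sorted(replica_lags.items(), key=lambda kv: kv[1]):
        if lag <= hint:
            return host
    return primary
```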
Symptom Identification & Root Cause Analysis
Observable Symptoms
- Stale Reads: Users see outdated balances/orders immediately after writes.
- Routing Loops: Application retries on a lagged replica → proxy marks it offline → app reconnects → proxy marks it online → loop.
- Connection Pool Exhaustion: All replicas exceed threshold → pool drains → app falls back to primary → primary CPU spikes.
- Proxy Failover Storms: Rapid ONLINE ↔ OFFLINE_SOFT flapping during network jitter.
Root Cause Diagnosis
- Network Partition: Asymmetric latency between writer and replicas.
- Write Bursts: Large INSERT ... SELECT, DDL, or missing primary keys causing full-table scans on the replica.
- I/O Bottlenecks: Full-durability settings (sync_binlog=1, innodb_flush_log_at_trx_commit=1) on slow replica disks causing fsync queue saturation.
- Parallel Replication Misalignment: slave_parallel_type=DATABASE instead of LOGICAL_CLOCK causing applier thread starvation.
Diagnostic Queries
SELECT
SERVICE_STATE,
LAST_HEARTBEAT_TIMESTAMP,
LAST_ERROR_NUMBER,
LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;
SELECT
WORKER_ID,
LAST_APPLIED_TRANSACTION,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;
Cross-reference relay_log_space_limit and innodb_io_capacity to identify disk-bound appliers.
Mitigation Strategies & Fallback Execution
When lag thresholds are breached, execute containment before cascading failures occur.
Immediate Proxy Isolation
-- Soft-offline a lagging replica to drain its active connections gracefully
-- (target the breaching host; do not blanket-offline every reader)
UPDATE mysql_servers
SET status='OFFLINE_SOFT'
WHERE hostname='10.0.1.11'          -- the replica breaching its threshold
  AND hostgroup_id IN (20, 30)
  AND status='ONLINE';
LOAD MYSQL SERVERS TO RUNTIME;
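Automating that decision — which backends to soft-offline given a lag snapshot — can be sketched as follows; the hostnames, threshold, and helper names are illustrative assumptions:

```python
def replicas_to_isolate(lag_by_host: dict, threshold: float) -> list:
    """Return replicas whose measured lag breaches the threshold, worst first,
    ready to feed into the OFFLINE_SOFT UPDATE above."""
    return sorted((h for h, lag in lag_by_host.items() if lag > threshold),
                  key=lambda h: -lag_by_host[h])

def offline_soft_sql(host: str, hostgroups=(20, 30)) -> str:
    """Build the ProxySQL admin statement for one breaching replica."""
    groups = ", ".join(str(g) for g in hostgroups)
    return (f"UPDATE mysql_servers SET status='OFFLINE_SOFT' "
            f"WHERE hostname='{host}' AND hostgroup_id IN ({groups});")
```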
Application Circuit Breaker
Implement a fallback circuit breaker (e.g., Resilience4j, Go sony/gobreaker):
- Open State: All reads route to the primary or return cached/stale data with an explicit X-Data-Stale: true header.
- Half-Open State: Probe one replica with max_replication_lag=10. If successful, transition to Closed.
- Timeout: Force primary-only routing if lag > 60s for more than 30s.
Replica Promotion/Rebuild
If lag persists > 5 minutes:
- Verify GTID consistency: SELECT @@global.gtid_executed;
- If the replica is irrecoverable, rebuild it from backup or use the CLONE plugin.
- Route traffic to the primary until the replica catches up.
Rollback Procedures & Configuration Validation
If threshold routing introduces instability, revert to static routing immediately.
Step-by-Step Rollback
- Disable Lag-Aware Routing:
DELETE FROM mysql_replication_hostgroups;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
- Flush Application Pools: Restart app instances or call pool close()/reset() to clear stale connections.
- Verify Primary-Only Traffic: Monitor proxy query logs. Confirm hostgroup_id=10 receives 100% of traffic.
- Validate Replication Health:
#!/bin/bash
for host in replica1 replica2 replica3; do
  lag=$(mysql -h "$host" -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
  echo "$host: lag=${lag}s"
done
Ensure Seconds_Behind_Master is 0, or at worst sub-second, across all nodes.
Post-Rollback Monitoring
- Maintain primary-only routing for 30 minutes.
- Re-enable threshold routing with conservative defaults (max_replication_lag=15).
- Verify no connection pool starvation or proxy flapping.
Performance Tuning & Observability Integration
Threshold Calibration
Do not set thresholds arbitrarily. Analyze historical lag percentiles:
- p95 lag: Set as the baseline max_replication_lag for standard reads.
- p99 lag: Set as the alert threshold; trigger the circuit breaker if exceeded.
- Adjust slave_parallel_workers and binlog_transaction_dependency_tracking=WRITESET to reduce baseline lag.
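Percentile-based calibration from a lag history might look like the following sketch; the sample data is hypothetical (e.g. scraped from mysqld_exporter), and nearest-rank is one of several valid percentile definitions:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over historical lag samples (p in 0..100)."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

# Hypothetical lag history in seconds
lag_history = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6,
               0.7, 0.8, 0.9, 1.0, 1.2, 1.5, 2.0, 3.0, 8.0, 25.0]
baseline_max_replication_lag = percentile(lag_history, 95)  # routing threshold
alert_threshold = percentile(lag_history, 99)               # breaker/alert trigger
```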
Observability Stack
- OpenTelemetry: Export db.mysql.replication.lag as a gauge metric. Attach routing-decision traces to span.kind=CLIENT.
- Grafana Dashboards: Panelize proxy_mysql_servers_status, mysql_replication_lag_seconds, and connection_pool_active. Alert on lag > threshold for > 60s.
- Logs: Correlate proxy OFFLINE_SOFT events with application 5xx spikes.
Consistency vs. Throughput Trade-offs
Strict thresholds (max_replication_lag=1) increase primary load and reduce read scalability. Relaxed thresholds (max_replication_lag=30) improve throughput but risk serving stale data. Align routing policies with business impact matrices, not engineering convenience.
Production Deployment Checklist
- Use sysbench and pt-kill to verify routing under simulated lag.
- Run STOP SLAVE; on a replica → confirm the proxy drains its connections.
- Alert on Seconds_Behind_Master > threshold and proxy_mysql_servers_status != ONLINE.