Configuring MySQL Read-Only Routing with Lag Thresholds

Blind round-robin load balancing across MySQL read replicas violates application SLAs by serving stale data during replication drift. Effective Replication Lag & Consistency Management requires deterministic threshold enforcement, where routing decisions are gated by measurable replication latency rather than simple connection availability. This runbook covers production-ready configuration for MySQL 8.0+ (GTID-based), ProxySQL v2.5+, HAProxy 2.6+, and application-level routing logic. The objective is to isolate lagging nodes before they impact user-facing consistency, while maintaining high read throughput under normal operating conditions.

Step 1: Instrumenting Real-Time Lag Detection

Accurate routing depends on sub-second lag telemetry. Relying solely on SHOW SLAVE STATUS is insufficient for routing decisions due to its blocking nature, coarse granularity, and inability to report fractional seconds.

1.1 Extract Lag Metrics Poll Seconds_Behind_Master alongside heartbeat table timestamps for precision:

-- Heartbeat table approach (recommended for sub-second accuracy)
CREATE TABLE IF NOT EXISTS _replication_heartbeat (
 ts TIMESTAMP(3) DEFAULT CURRENT_TIMESTAMP(3),
 PRIMARY KEY (ts)
);
-- Insert from writer every 100ms via cron or event scheduler
-- Replica calculates lag: NOW(3) - MAX(ts)

1.2 Performance Schema Polling For non-blocking, continuous monitoring, query performance_schema:

SELECT 
 channel_name,
 LAST_QUEUED_TRANSACTION,
 LAST_APPLIED_TRANSACTION,
 LAST_HEARTBEAT_TIMESTAMP,
 LAST_ERROR_NUMBER,
 LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status
WHERE CHANNEL_NAME = '';

1.3 Continuous Monitoring Integration Deploy mysqld_exporter with --collect.slave_status and --collect.heartbeat. Configure ProxySQL to ingest lag natively:

-- ProxySQL native replication monitoring
SET mysql-monitor_replication_lag_interval=2000;
SET mysql-monitor_replication_lag_timeout=1000;

ProxySQL will automatically populate mysql_replication_hostgroups with real-time lag values, bypassing manual heartbeat scripts.

Step 2: Defining Threshold Policies in the Connection Router

Thresholds must align with query criticality. Absolute thresholds (fixed seconds) suit most OLTP workloads, while relative thresholds (% of write throughput) apply to high-churn analytical pipelines.

2.1 ProxySQL Lag-Aware Hostgroups

-- writer_hostgroup=10, reader_hostgroup=20, max_replication_lag=5s
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment, max_replication_lag) 
VALUES (10, 20, 'prod_tight', 5);

-- Relaxed hostgroup for non-critical reporting (lag <= 30s)
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment, max_replication_lag) 
VALUES (10, 30, 'prod_relaxed', 30);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

2.2 HAProxy Equivalent HAProxy requires external checks for lag-aware routing:

backend mysql_readers
 option mysql-check user monitor
 external-check command /etc/haproxy/check_mysql_lag.sh
 server replica1 10.0.1.11:3306 check inter 2s fall 3 rise 2
 server replica2 10.0.1.12:3306 check inter 2s fall 3 rise 2

The external script must exit 0 if Seconds_Behind_Master < threshold, else 1.

2.3 Routing Policy Mapping Align application SLAs with database routing rules per Routing Queries Based on Data Freshness Requirements:

hostgroup=20: User profile, cart, payment verification (max lag: 1–2s)
hostgroup=30: Analytics, audit logs, search indexing (max lag: 15–30s)
hostgroup=10: All writes, read-after-write guarantees, session state

Step 3: Implementing Client-Side Routing Logic

Proxies route at the connection level, but applications must enforce routing boundaries at the transaction and query layer.

3.1 Connection Pooler Configuration

HikariCP (Java): Use separate pools per hostgroup. Inject max_staleness via connection properties.
SQLAlchemy (Python): Bind engines to hostgroups. Use execution_options({"max_staleness": 2}) for custom routing hooks.
Go database/sql + sqlx: Maintain *sql.DB instances per hostgroup. Route via middleware.

3.2 Dynamic Datasource Switching

// Java/HikariCP example
public DataSource getDataSourceForQuery(QueryType type) {
 int maxStaleness = type.getStalenessTolerance();
 if (maxStaleness <= 2) return tightReaderPool;
 if (maxStaleness <= 30) return relaxedReaderPool;
 return writerPool;
}

3.3 Middleware & Header Propagation Implement a read/write splitting middleware that parses X-Max-Staleness or SQL comments:

/* MAX_STALENESS=5 */ SELECT * FROM orders WHERE user_id = ?;

The proxy intercepts the hint, evaluates replica lag, and routes to the first hostgroup where replication_lag <= hint_value. If no replica qualifies, fallback to the primary.

Symptom Identification & Root Cause Analysis

Observable Symptoms

Stale Reads: Users see outdated balances/orders immediately after writes.
Routing Loops: Application retries on lagged replica → proxy marks offline → app reconnects → proxy marks online → loop.
Connection Pool Exhaustion: All replicas exceed threshold → pool drains → app falls back to primary → primary CPU spikes.
Proxy Failover Storms: Rapid ONLINE ↔ OFFLINE_SOFT flapping during network jitter.

Root Cause Diagnosis

Network Partition: Asymmetric latency between writer and replicas.
Write Bursts: Large INSERT ... SELECT, DDL, or missing primary keys causing full-table scans on replica.
I/O Bottlenecks: Misconfigured sync_binlog=0 or innodb_flush_log_at_trx_commit=2 on replicas causing disk queue saturation.
Parallel Replication Misalignment: slave_parallel_type=DATABASE instead of LOGICAL_CLOCK causing thread starvation.

Diagnostic Queries

SELECT 
 SERVICE_STATE, 
 LAST_HEARTBEAT_TIMESTAMP, 
 LAST_ERROR_NUMBER, 
 LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;

SELECT 
 WORKER_ID, 
 LAST_APPLIED_TRANSACTION, 
 LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;

Cross-reference relay_log_space_limit and innodb_io_capacity to identify disk-bound appliers.

Mitigation Strategies & Fallback Execution

When lag thresholds are breached, execute containment before cascading failures occur.

3.1 Immediate Proxy Isolation

-- Soft-offline lagging replicas to drain active connections gracefully
UPDATE mysql_servers 
SET status='OFFLINE_SOFT' 
WHERE hostgroup_id IN (20, 30) AND status='ONLINE';

LOAD MYSQL SERVERS TO RUNTIME;

3.2 Application Circuit Breaker Implement a fallback circuit breaker (e.g., Resilience4j, Go sony/gobreaker):

Open State: All reads route to primary or return cached/stale data with explicit X-Data-Stale: true headers.
Half-Open State: Probe one replica with max_replication_lag=10. If successful, transition to Closed.
Timeout: Force primary-only routing if lag > 60s for > 30s.

3.3 Replica Promotion/Rebuild If lag persists > 5 minutes:

Verify GTID consistency: SELECT @@global.gtid_executed;
If replica is irrecoverable, rebuild from backup or use CLONE plugin.
Route traffic to primary until replica catches up.

Rollback Procedures & Configuration Validation

If threshold routing introduces instability, revert to static routing immediately.

Step-by-Step Rollback

Disable Lag-Aware Routing:

DELETE FROM mysql_replication_hostgroups;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

Flush Application Pools: Restart app instances or call pool close()/reset() to clear stale connections.
Verify Primary-Only Traffic: Monitor proxy query logs. Confirm hostgroup_id=10 receives 100% of traffic.
Validate Replication Health:

#!/bin/bash
for host in replica1 replica2 replica3; do
lag=$(mysql -h $host -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master | awk '{print $2}')
echo "$host: lag=${lag}s"
done

Ensure Seconds_Behind_Master = 0 or < 1 across all nodes.

Post-Rollback Monitoring

Maintain primary-only routing for 30 minutes.
Re-enable threshold routing with conservative defaults (max_replication_lag=15).
Verify no connection pool starvation or proxy flapping.

Performance Tuning & Observability Integration

Threshold Calibration Do not set thresholds arbitrarily. Analyze historical lag percentiles:

p95 lag: Set as baseline max_replication_lag for standard reads.
p99 lag: Set as alert threshold; trigger circuit breaker if exceeded.
Adjust slave_parallel_workers and binlog_transaction_dependency_tracking=WRITESET to reduce baseline lag.

Observability Stack

OpenTelemetry: Export db.mysql.replication.lag as a gauge metric. Attach routing decision traces to span.kind=CLIENT.
Grafana Dashboards: Panelize proxy_mysql_servers_status, mysql_replication_lag_seconds, and connection_pool_active. Alert on lag > threshold for > 60s.
Logs: Correlate proxy OFFLINE_SOFT events with application 5xx spikes.

Consistency vs. Throughput Trade-offs Strict thresholds (max_replication_lag=1) increase primary load and reduce read scalability. Relaxed thresholds (max_replication_lag=30) improve throughput but risk serving stale data. Align routing policies with business impact matrices, not engineering convenience.

Production Deployment Checklist

Load test with sysbench and pt-kill to verify routing under simulated lag.
Validate failover behavior: STOP SLAVE; on replica → confirm proxy drains connections.
Implement alerting for Seconds_Behind_Master > threshold and proxy_mysql_servers_status != ONLINE.
Document runbook access, escalation paths, and rollback commands in incident response repository.
Schedule quarterly threshold recalibration based on traffic growth and replication topology changes.