Using Application-Level Timestamps to Bypass Stale Replicas
Problem statement: After a successful write, the application’s next read returns pre-mutation state because replication lag on the replica exceeds the user’s perceived consistency window — and the standard polling interval for lag metrics is too coarse to catch it in time.
This page defines the end-to-end pattern: capture the primary’s commit epoch immediately after COMMIT, propagate it through the request lifecycle, and compare it against the replica’s measured lag at query time to decide whether to route to the replica or fall back to the primary. The approach is database-agnostic (PostgreSQL and MySQL covered), requires no schema changes, and degrades gracefully via circuit-breaker thresholds.
Symptom Identification
Stale replica consumption produces deterministic application failures rather than random latency spikes. Look for these signals in production before deploying timestamp routing:
Observable failure modes:
- Users submit a mutation, immediately reload, and observe pre-mutation state. This is distinct from slow rendering — the state is genuinely wrong, not slow.
- Idempotency key collision: two requests with identical keys both succeed because the secondary read hasn’t materialized the first write yet.
- Background workers polling
status = 'PENDING'process the same row twice, generating duplicate downstream events.
Metrics to check:
-- PostgreSQL: current replica lag across all standbys
SELECT client_addr, state,
EXTRACT(EPOCH FROM replay_lag) * 1000 AS lag_ms
FROM pg_stat_replication
WHERE state = 'streaming';
-- MySQL: lag on the local replica
SHOW REPLICA STATUS\G
-- Look at: Seconds_Behind_Source, Replica_SQL_RunningIf lag_ms at p95 exceeds your application’s read-after-write tolerance (commonly 150–200 ms for user-facing flows), detecting and handling replication lag in real-time with coarse polling alone is insufficient — the commit epoch pattern described here closes the gap.
Root Cause Analysis
Two distinct failure domains combine to cause stale reads:
WAL propagation delay. After COMMIT, the primary writes to the WAL and asynchronously ships those records to standby nodes. Until the standby’s WAL receiver and applier process the segment, the row does not exist on the replica. Delays arise from: checkpoint intervals (checkpoint_timeout), wal_level misconfigurations, network jitter causing TCP retransmission in the wal_receiver thread, and replica I/O saturation from analytical queries consuming shared_buffers.
Polling cadence mismatch. Standard lag polling (SHOW REPLICA STATUS or pg_stat_replication) runs at 5–10 s intervals in most monitoring stacks. A write that commits and is followed immediately by a read lands in a gap where the router’s cached lag value is stale by definition. The router sees “lag = 20 ms” because that was accurate 8 seconds ago — but the write just happened 50 ms ago and hasn’t propagated yet.
Application-level timestamps bypass this by anchoring routing to the commit event itself rather than to an independently-polled lag estimate.
Architecture: Commit Epoch Propagation Flow
The diagram below shows how the commit epoch travels from the primary database through the application tier to the connection router and back.
Step-by-Step Resolution
Step 1: Capture the Primary Commit Epoch
Call the timestamp function on the primary connection immediately after COMMIT. Do not use client-generated timestamps — they drift with NTP skew and are not authoritative over commit order.
PostgreSQL:
-- Execute on the primary connection, immediately after COMMIT
SELECT EXTRACT(EPOCH FROM clock_timestamp()) * 1000 AS commit_epoch_ms;clock_timestamp() returns the current wall-clock time; calling it within milliseconds of COMMIT produces an epoch that closely bounds the actual commit time.
MySQL 8.0:
-- Execute on the primary connection, immediately after COMMIT
SELECT UNIX_TIMESTAMP(NOW(3)) * 1000 AS commit_epoch_ms;For GTID-based tracking, capture @@GLOBAL.gtid_executed post-commit and propagate the GTID set. The subscriber can then call SELECT WAIT_FOR_EXECUTED_GTID_SET('uuid:N', timeout_seconds) on the replica to block until that GTID is applied before serving the read.
Inline verification: After instrumenting, log the epoch alongside request IDs. Confirm that commit_epoch_ms increments monotonically within a single primary and that consecutive writes in the same second produce distinct values.
Step 2: Propagate the Epoch Through the Request Context
Store the epoch in the request’s thread-local or async context immediately after capturing it. Clear it on request teardown to prevent timestamp bleed across connection pool reuse.
// Java/Spring Boot: store commit epoch as a request attribute
@PostMapping("/orders")
public ResponseEntity<Order> createOrder(
@RequestBody OrderRequest req, HttpServletRequest request) {
Order order = orderService.create(req);
long commitEpochMs = orderService.getLastCommitEpochMs();
request.setAttribute("COMMIT_EPOCH_MS", commitEpochMs);
return ResponseEntity.ok(order);
}For inter-service calls propagate via HTTP/2 header:
X-Write-Commit-Epoch-Ms: 1718432000123
Storage by runtime:
- Java:
ThreadLocal<Long>cleared in a servlet filter’sfinallyblock - Go:
context.Contextwith a typed key; never stored in a global - Node.js:
AsyncLocalStoragescoped to the request’s async context
Inline verification: Add a debug log at the middleware layer that prints whether COMMIT_EPOCH_MS is present and non-zero for write endpoints. Absence on a write path means the instrumentation is not wired correctly.
Step 3: Configure the Connection Router
The router extracts the epoch, compares it against the replica’s measured lag, and selects the target pool. The lag measurement must come from a low-latency source — ideally a background goroutine/thread polling pg_stat_replication or SHOW REPLICA STATUS at 1 s intervals and storing the result in an atomic integer.
// Go: router decision at query time
func RouteQuery(ctx context.Context, query string) (*sql.DB, error) {
commitEpoch, ok := ctx.Value(commitEpochKey).(int64)
if !ok || commitEpoch == 0 {
return replicaPool.Get(), nil // no write context → replica
}
replicaLagMs := lagMonitor.CurrentLagMs() // updated every 1 s
toleranceWindow := config.StaleReadToleranceMs // e.g. 150
if replicaLagMs > toleranceWindow {
return primaryPool.Get(), nil // replica too stale → primary
}
return replicaPool.Get(), nil
}For connection pool routing rules enforced at the proxy layer rather than in application code, configure ProxySQL query rules to direct requests carrying the stale-read header to the primary hostgroup:
-- ProxySQL: route requests with stale-read header to primary (hostgroup 1)
INSERT INTO mysql_query_rules
(rule_id, active, proxy_port, match_digest, destination_hostgroup, apply)
VALUES
(10, 1, 6033, '^SELECT.*', 1, 1); -- override: header logic handled upstream
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;PgBouncer pool configuration (transaction mode required for per-query routing):
[databases]
app_primary = host=primary.db port=5432 dbname=app
app_replica = host=replica.db port=5432 dbname=app
[pgbouncer]
pool_mode = transaction ; required — session mode prevents per-query rerouting
max_client_conn = 2000
default_pool_size = 50
min_pool_size = 20 ; pre-warmed; eliminates cold-start on routing shift
server_idle_timeout = 30
server_lifetime = 3600Inline verification: Log routing.decision: primary|replica per query. Under zero write load, 100% of reads should route to the replica. After injecting a write with a large synthetic lag, confirm the router selects the primary.
Configuration Snippet
The minimal complete routing configuration with annotated thresholds:
# routing.yaml — timestamp-based read routing
routing:
# Maximum replica lag before reads fall back to primary.
# Start at p95 of historical lag; tighten for user-facing flows.
stale_read_tolerance_ms: 150
# Upper bound on primary RPS before fallback_mode activates.
# Set to ~70% of primary's measured saturation point.
max_primary_rps: 5000
circuit_breaker:
# Fraction of primary requests timing out before opening the breaker.
failure_threshold: 0.15
# Time the breaker stays half-open before probing again.
half_open_timeout_ms: 10000
# What to serve when the breaker is open:
# PRIMARY — route everything to primary (may cascade)
# CACHE_OR_QUEUE — serve Redis cache or queue the read for retry
fallback_mode: CACHE_OR_QUEUE
lag_poll_interval_ms: 1000 # background lag monitor cadence
header_name: X-Write-Commit-Epoch-Ms
context_key: COMMIT_EPOCH_MSVerification and Rollback
Confirming the Fix
Deploy synthetic probes that perform INSERT → SELECT across each AZ every 10 s:
# OpenTelemetry span attributes to confirm correct routing
otel:
attributes:
db.statement: "SELECT * FROM orders WHERE id = ?"
routing.decision: "primary|replica"
replica.lag.ms: 42
write.commit.epoch.ms: 1718432000123SLO targets to validate against:
| Signal | Target |
|---|---|
| Routing accuracy (correct hostgroup per tolerance window) | ≥ 99.95% |
| Replica freshness p99 under sustained write load | < 150 ms |
| Primary offload ratio during steady state | > 80% |
Rolling Back
Execute this runbook if timestamp routing causes primary saturation or inconsistent routing:
-
Disable dynamic routing via feature flag:
bashcurl -X POST https://config.internal/api/v1/flags/dynamic_timestamp_routing \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -d '{"enabled": false}' -
Flush connection pools without dropping clients:
bash# PgBouncer: reload config in place kill -HUP $(pgrep pgbouncer) # ProxySQL: reload rules mysql -u admin -padmin -h 127.0.0.1 -P 6032 \ -e "LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;" -
Verify replica catch-up:
sql-- PostgreSQL SELECT client_addr, state, replay_lag FROM pg_stat_replication WHERE state = 'streaming'; -- Target: replay_lag < interval '1 second' -- MySQL SHOW REPLICA STATUS\G -- Verify: Seconds_Behind_Source = 0 AND Replica_SQL_Running = Yes -
Restore default read routing: Update load balancer weights to
primary: 0%, replica: 100%and confirm health check returns 200:bashcurl -s -o /dev/null -w "%{http_code}" http://app.internal/health/replica-freshness
Edge Cases and Gotchas
Cascading replication (replica-of-replica). With a cascading topology, the downstream node’s lag is the cumulative sum of each hop — primary → intermediate → leaf. The router must compare the commit epoch against the leaf replica’s lag, not the immediate standby’s. In practice, avoid routing timestamp-sensitive reads to cascading replicas; route them to the primary or the first-hop standby only.
Multi-region deployments. Cross-region replicas commonly carry 50–200 ms of baseline network latency in addition to WAL propagation delay. If your tolerance window is 150 ms, every write will route reads to the primary for cross-region replicas during normal operation. Either widen the tolerance for reads that can accept eventual consistency, or accept that cross-region replicas serve only latency-insensitive analytics. The fallback strategies when replicas fall behind page covers graceful degradation patterns for this case.
Logical replication and partial subscriptions. With logical replication, the subscriber applies only the tables it subscribes to. If the commit includes changes across both subscribed and unsubscribed tables, the epoch captured post-commit does not guarantee that the subscribed tables have caught up — only that the transaction committed on the primary. Verify using pg_stat_subscription.received_lsn vs the LSN recorded at commit time, not a wall-clock epoch alone.
FAQ
Why not use client-generated timestamps instead of the database commit epoch?
Client clocks drift due to NTP skew and are not authoritative over the commit sequence. clock_timestamp() or UNIX_TIMESTAMP(NOW(3)) called on the primary connection immediately after COMMIT is anchored to the database’s own timeline, eliminating cross-node clock disagreement from routing decisions.
What tolerance window should I start with?
Start at the p95 of historical replica lag — commonly 100–200 ms for same-AZ replicas. Tighten to 50 ms for user-facing read-after-write consistency paths; widen to 500 ms for background analytical workloads. Measure before setting — guessing causes either unnecessary primary load or genuine consistency violations.
Does this pattern work if the replica is also handling analytical queries?
Heavy analytical queries on the replica exhaust shared_buffers and temp_buffers, starving the WAL apply process and inflating lag. In that case the tolerance window trips more often, routing more reads to the primary. The fix is workload separation: move analytics to a dedicated replica not part of the timestamp-routing pool, or implement routing queries based on data freshness requirements to exclude the analytics replica from the eligible set for freshness-sensitive reads.
Related
← Back to Detecting and Handling Replication Lag in Real-Time
- Fallback Strategies When Replicas Fall Behind — circuit-breaker patterns and cache-or-queue fallback when lag is sustained
- Routing Queries Based on Data Freshness Requirements — broader framework for freshness-aware read routing beyond the timestamp pattern
- How to Force Primary Reads for Critical User Transactions — simpler per-transaction override when timestamp routing is not warranted
- Connection Pool Architecture for Read Replicas — PgBouncer and ProxySQL pool configuration that underpins the router layer