Managing Sticky Sessions in Distributed Database Reads
Sticky session routing in distributed read topologies introduces a deliberate trade-off: sacrificing uniform load distribution to guarantee read-your-writes consistency, optimise cache locality, and reduce connection churn. For backend engineers, DBAs, and SREs operating at scale, unmanaged session affinity becomes a vector for hot-spotting, connection pool exhaustion, and cascading replica desynchronisation. This page details production-grade patterns for pinning read sessions across distributed replicas β from proxy configuration to driver-level consistency enforcement and incident-aware fallback routing β as part of the broader Connection Routing & Pooling Strategies discipline.
Problem Framing
Stateless routing β round-robin or least-connections across all replicas β maximises throughput but introduces replication lag visibility windows that break transactional workflows. A user who writes a record and immediately reads it back may hit a replica that has not yet applied that WAL segment, returning stale data. Sticky session routing eliminates this read-your-writes violation by deterministically pinning a session identifier to a specific replica endpoint.
The operational risks of doing this poorly are significant:
- Hot-spotting: Without explicit pool partitioning, a single replica accumulates all sessions from a high-traffic tenant while others sit idle.
- Silent failover blindness: Long affinity TTLs leave sessions pinned to degraded replicas during partial network partitions or replica desync events.
- Pool exhaustion cascade: Sticky routing bypasses traditional connection pool load balancing. One hot replica exhausts
max_connectionsand blocks all pinned sessions simultaneously.
Understanding these failure modes is essential before configuring any affinity mechanism.
Concept Definition and Scope
Session affinity (also called connection pinning or read-session stickiness) defines the boundary at which a clientβs read requests are deterministically routed to a specific replica. The pinning key is typically one of:
- User or session ID β suitable for SaaS applications where per-user consistency matters most
- Tenant ID β suitable for multi-tenant platforms where tenant-scoped data isolation is the primary constraint
- Transaction trace ID β suitable for microservice architectures where a distributed trace spans multiple read operations that must see the same data version
Sticky routing is distinct from read/write splitting, which separates write traffic (primary) from read traffic (replicas) without enforcing which replica serves which session. The two mechanisms compose: splitting routes all reads away from the primary; affinity then selects which replica within the read pool handles a given session.
Affinity Scope and TTL Design
When designing affinity boundaries, evaluate these constraints:
| Dimension | Short TTL (< 30s) | Medium TTL (60β120s) | Long TTL (> 5m) |
|---|---|---|---|
| Connection re-routing overhead | High β pool thrashing on every request | Moderate β acceptable for most workloads | Low |
| Exposure to degraded replicas | Low | Moderate β needs active health checks | High β silent failovers go undetected |
| Cache locality benefit | None | Good | Strong |
| Pool saturation risk | Low | Medium | High β tenant skew accumulates |
A 60β90 second TTL aligned with your replication lag SLA is the recommended starting point for most applications. Adjust downward if you observe silent failovers; adjust upward if you observe pool thrashing in metrics.
Mechanism Deep-Dive
How Proxy Affinity Evaluation Works
When a query arrives at a proxy with affinity rules configured:
- The proxy extracts the session token from the connection context (username, application name, or a preceding
SETcommand). - It looks up the affinity map:
session_token β backend endpoint. - If a mapping exists and the backend is healthy (lag below threshold, not
OFFLINE_SOFT), the query is dispatched there. - If no mapping exists, the proxy applies its default routing rule (e.g., least-connections across the read hostgroup) and records the new mapping.
- Each dispatch resets or decrements the TTL counter for that mapping.
- On TTL expiry, health-check failure, or explicit reset, the mapping is deleted and the next query re-enters step 2.
The critical invariant is that health checks must run more frequently than your TTL. If health check intervals exceed TTL, a failing replica can accumulate new sessions before the proxy marks it offline.
ProxySQL Sticky Routing Configuration
ProxySQL enforces connection persistence via sticky_conn in mysql_query_rules and via multiplex=0 at the server level. The following configuration pins read traffic with explicit replication lag thresholds:
-- ProxySQL Admin Interface: create a read hostgroup with lag enforcement
INSERT INTO mysql_query_rules (
rule_id, active, match_digest, destination_hostgroup,
max_replication_lag, timeout, retries, apply
) VALUES (
100, 1, '^SELECT.*', 2,
5, -- max_replication_lag in seconds before bypassing to writer hostgroup (group 1)
2000, -- connection timeout in ms
2, -- retry attempts on transient errors
1
);
-- Poll replication lag every 2 seconds so affinity decisions stay fresh
UPDATE global_variables SET variable_value='2000'
WHERE variable_name='mysql-monitor_replication_lag_interval';
-- Default query timeout (adjust to your p99 read latency budget)
UPDATE global_variables SET variable_value='30000'
WHERE variable_name='mysql-default_query_timeout';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;Critical parameters and degraded-state behaviour:
max_replication_lag: When a replica exceeds this threshold, ProxySQL routes that query to the writer hostgroup instead of the pinned replica. This prevents stale reads during replication storms but breaks affinity temporarily β your application must handle the consistency implications of the brief primary redirect.- Session persistence scope: ProxySQL maintains the same backend connection for a session when
multiplex=0is set on the server or when a transaction is open. For affinity that persists across multiple transactions, you must implement application-layer session tokens (see below) β ProxySQL alone does not retain cross-transaction pinning. - Failover threshold: During partial network partitions, ProxySQL transitions lagging replicas to
OFFLINE_SOFTstate, triggering re-routing without dropping active client connections. Setmysql-monitor_replication_lag_intervalto a value no larger than half your target TTL.
Trade-off Comparison Table
| Affinity Strategy | Consistency Guarantee | Hot-spot Risk | Failover Behaviour | Best Fit |
|---|---|---|---|---|
Proxy-level (ProxySQL sticky_conn) |
Read-your-writes within a transaction | Medium β depends on session distribution | Automatic via OFFLINE_SOFT; may redirect to primary |
High-throughput MySQL/MariaDB with ProxySQL |
Application-layer ContextVar pool |
Read-your-writes across requests | High if TTL not enforced | Manual fallback required | Python/SQLAlchemy services needing cross-request affinity |
| Driver tag-based (MongoDB read preference) | Bounded staleness via maxStalenessSeconds |
Low β driver selects from tag set | Driver widens selection window automatically | Document workloads with AZ-tagged replica sets |
HAProxy stick-table |
None (transport only) | Low | No built-in replication awareness | TCP-level persistence without read consistency needs |
Configuration Runbook
Application-Side Pool Partitioning (Python / SQLAlchemy)
When proxy-level routing introduces unacceptable latency or lacks granular session context, embed replica targeting directly into the applicationβs connection manager. Thread-local storage and Python ContextVar enable deterministic routing without manual query tagging. This pattern complements ORM middleware for automatic query routing by providing the underlying pool infrastructure that the middleware layer dispatches into.
from sqlalchemy import create_engine, event
from sqlalchemy.pool import QueuePool
from contextvars import ContextVar
import random
import logging
# Context variable stores the pinned replica URI for this request's lifecycle
session_replica_uri: ContextVar[str | None] = ContextVar("session_replica_uri", default=None)
logger = logging.getLogger(__name__)
class StickyReadPool:
"""
Partitioned connection pool manager for sticky read routing.
Each replica gets its own QueuePool to prevent cross-replica pool interference.
"""
def __init__(self, primary_uri: str, replica_uris: list[str], partition_size: int = 20):
# Primary pool: sized for writes + critical post-write reads
self.primary = create_engine(
primary_uri,
pool_size=partition_size,
max_overflow=5,
pool_timeout=3, # fail fast rather than queue behind write spikes
pool_pre_ping=True,
)
# Replica pools: one QueuePool per replica, keyed by URI
self.replica_pools: dict[str, object] = {
uri: create_engine(
uri,
pool_size=partition_size,
max_overflow=10,
pool_timeout=5,
pool_pre_ping=True, # evict stale connections proactively
)
for uri in replica_uris
}
self._replica_uris = list(replica_uris)
def get_read_connection(self):
"""
Returns a connection from the pinned replica pool, or falls back to
least-loaded (random) if the pinned pool is exhausted or not set.
"""
target_uri = session_replica_uri.get()
if target_uri and target_uri in self.replica_pools:
try:
return self.replica_pools[target_uri].connect()
except Exception:
# Pool exhausted or replica unreachable β fall through to stateless routing
logger.warning(
"sticky_fallback: pinned replica %s unavailable, degrading to stateless",
target_uri,
)
return self._stateless_connect()
def _stateless_connect(self):
"""Round-robin fallback when affinity cannot be satisfied."""
uri = random.choice(self._replica_uris)
return self.replica_pools[uri].connect()Implementation notes:
- Propagate
session_replica_urivia middleware interceptors at request ingress. Clear the context variable in afinallyblock post-request to prevent context leaking across request boundaries in async frameworks. - Set
pool_pre_ping=Trueon all replica engines to evict stale connections proactively; this prevents the pool from handing out connections to replicas that have restarted without draining. - If
pool_timeoutis exceeded on the pinned pool, the code above falls back immediately rather than blocking the request thread β log the fallback with trace IDs for post-incident analysis.
MongoDB Driver Tag-Based Affinity
For document databases, read preference tags combined with maxStalenessSeconds provide precise control over session pinning and consistency guarantees:
# MongoDB Node.js driver / Mongoose configuration
clientOptions:
readPreference: "nearest" # lowest RTT within the tag set
readPreferenceTags:
- region: "us-east-1"
rack: "A" # primary tag set: prefer rack-A replicas
- region: "us-east-1"
rack: "B" # fallback tag set if rack-A has no healthy members
- {} # final fallback: any available member
maxStalenessSeconds: 120 # reject reads from replicas lagging > 2 minutes
localThresholdMS: 15 # ping latency tolerance window for replica selection
serverSelectionTimeoutMS: 5000 # give up and throw MongoServerSelectionError after 5s
socketTimeoutMS: 10000Consistency enforcement and driver behaviour:
maxStalenessSecondsacts as a hard consistency boundary. The minimum valid value is 90 seconds (enforced by the MongoDB driver); values below this are rejected with a configuration error. If no replica meets the staleness threshold, the driver throwsMongoServerSelectionError, forcing the application to retry with relaxed consistency or fall back to the primary.localThresholdMSprevents routing to replicas experiencing network jitter. During AZ degradation, the driver automatically widens the selection window across tag sets, breaking strict affinity to maintain availability.- The
{}final fallback tag set is essential for production deployments β without it, a complete rack failure causes all reads to fail rather than degrading gracefully.
Monitoring and Alerting Signals
Sticky routing failures are often silent until they cascade. Instrument the following signals:
| Metric | Collection Method | Alert Threshold |
|---|---|---|
affinity_hit_rate |
Increment on pinned-pool success; track ratio vs. fallback | Alert if sustained < 70% over 5 minutes |
replication_lag_ms |
ProxySQL: SELECT * FROM stats_mysql_replication_lag; PostgreSQL: SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) * 1000 |
Alert at 80% of your max_replication_lag threshold |
pool_wait_time_ms |
SQLAlchemy: engine.pool.status() or checkout_latency event; ProxySQL: Queries_backends_wait_time_us |
Alert if p95 > 200 ms |
circuit_breaker_state_changes |
Application log counter | Alert on any transition to OPEN state |
sticky_fallback_total |
Application log counter | Alert if rate > 1% of reads over 2 minutes |
Prometheus alert rule for sustained affinity degradation:
groups:
- name: sticky_session_routing
rules:
- alert: StickyAffinityDegraded
expr: |
rate(sticky_hit_total[5m]) /
(rate(sticky_hit_total[5m]) + rate(sticky_fallback_total[5m])) < 0.70
for: 5m
labels:
severity: warning
annotations:
summary: "Sticky session affinity hit rate below 70%"
description: "Affinity hit rate is {{ $value | humanizePercentage }} β check replica health and TTL configuration."
- alert: ReplicaLagApproachingThreshold
expr: pg_replication_lag_seconds > 4 # 80% of a 5s max_replication_lag
for: 2m
labels:
severity: warning
annotations:
summary: "Replica lag within 20% of affinity threshold"Failure Modes and Recovery Steps
1. Hot-spot: One Replica Accumulates All Pinned Sessions
Root cause: Tenant skew β a small number of high-traffic tenants all hash to the same replica β combined with long TTLs that prevent rebalancing.
Diagnosis:
-- ProxySQL: check per-backend connection distribution
SELECT hostgroup, srv_host, ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats_mysql_connection_pool
ORDER BY ConnUsed DESC;Remediation: Lower TTL to 60 seconds to allow natural rebalancing. Add a jitter factor (Β±10%) to TTL to prevent synchronized expiry. For severe imbalance, flush the affinity map explicitly:
-- ProxySQL: reset all backend connections (brief interruption)
PROXYSQL FLUSH MYSQL SESSIONS;2. Silent Failover: Sessions Pinned to a Degraded Replica
Root cause: Health check interval longer than TTL, or max_replication_lag set higher than the lag at which the replica actually becomes operationally degraded for your workload.
Diagnosis: Check for a gap between replication_lag_ms metric values and your max_replication_lag setting. If the replica is lagging at 4s but max_replication_lag=5, you have a 1-second window where queries succeed but return increasingly stale data.
Remediation: Set mysql-monitor_replication_lag_interval to at most half of your max_replication_lag in milliseconds. Add application-level timestamp-based staleness detection as a secondary guard:
def is_replica_fresh(conn, max_lag_seconds: float = 4.0) -> bool:
"""Returns False if replica lag exceeds threshold β caller should fall back to primary."""
row = conn.execute(
"SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_s"
).fetchone()
return row and row.lag_s is not None and row.lag_s < max_lag_seconds3. Connection Pool Exhaustion on Pinned Replica
Root cause: Sticky routing concentrates connections onto one replica; max_connections is reached; new sessions queue or fail.
Diagnosis:
-- PostgreSQL: current connection count per database
SELECT datname, count(*) AS connections
FROM pg_stat_activity
GROUP BY datname
ORDER BY connections DESC;
-- Check proximity to max_connections limit
SHOW max_connections;Remediation: Reduce partition_size per-replica pool and add a PgBouncer layer in front of each replica to multiplex application connections. Ensure your fallback to stateless routing triggers before pool_timeout is exceeded β a 3β5 second pool_timeout with immediate fallback is preferable to queuing.
4. Circuit Breaker Trips and Does Not Recover
Root cause: Probe requests during the half-open state continue failing because the underlying issue (replica restart, network partition) has not resolved, keeping the breaker open indefinitely.
Diagnosis: Look for circuit_breaker_state_changes{state="open"} increments without subsequent half_open or closed transitions.
Circuit breaker implementation (Go):
package routing
import (
"time"
"github.com/sony/gobreaker"
)
var StickyCircuitBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "sticky_read_affinity",
MaxRequests: 100, // max requests allowed in half-open state
Interval: 30 * time.Second,
Timeout: 15 * time.Second, // time in open state before attempting recovery probe
ReadyToTrip: func(counts gobreaker.Counts) bool {
// Trip if failure rate > 5% or 3+ consecutive failures on at least 20 requests
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 20 && (failureRatio > 0.05 || counts.ConsecutiveFailures > 3)
},
OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
if to == gobreaker.StateOpen {
// Emit metric, trigger PagerDuty alert, log trace context
affinityBreakdowns.Inc()
}
},
})Recovery: When the breaker opens, routing immediately degrades to stateless least-connections. This prevents connection pool starvation while the replica recovers. Probe requests in half-open state should bypass the full connection pool and use a dedicated health-check connection.
5. Affinity Broken After Application Restart
Root cause: In-memory affinity maps are lost on process restart. Sessions that previously mapped to replica-2 are re-assigned arbitrarily, causing brief read-your-writes violations.
Remediation: Store the affinity map in a shared, low-latency cache (Redis, Memcached) with a TTL matching your affinity TTL. On startup, pre-warm the map from cache rather than starting fresh.
Further Reading
- Avoiding Connection Exhaustion During Replica Failover β How to configure connection limits, pre-drain pools before planned failovers, and handle the connection surge when traffic redistributes across surviving replicas.
- Configuring PgBouncer for Read-Only Connection Pools β PgBouncer transaction pooling settings that work cleanly under sticky routing, including
server_reset_queryandapplication_namepropagation. - ORM Middleware for Automatic Query Routing β Application-layer routers for SQLAlchemy, Django, and Prisma that dispatch reads to the pinned replica pool built in this guide.
- Using Application-Level Timestamps to Bypass Stale Replicas β A staleness-detection pattern that complements affinity routing by detecting when a pinned replica has fallen behind before executing a critical read.
FAQ
What is the minimum safe TTL for sticky session routing?
Anything under 30 seconds causes excessive connection re-routing and pool thrashing. Values over 5 minutes risk routing to silently-degraded replicas. A 60β90 second TTL aligned with your replication lag SLA is a good starting point. Add Β±10% jitter to prevent thundering-herd TTL expiry across many simultaneous sessions.
How does sticky routing interact with read/write splitting?
Write traffic always bypasses affinity and goes to the primary. Sticky routing only applies to the read path. After a committed write, the session should be pinned to a replica that has already applied that WAL segment β most proxies require explicit logic or a brief fallback to the primary for post-write reads. See implementing read/write splitting at the proxy layer for how to wire both mechanisms together.
When should I break affinity and fall back to stateless routing?
Break affinity when the pinned replica exceeds your replication lag threshold, when its connection pool is exhausted (pool_timeout exceeded), or when a circuit breaker trips after consecutive failures. Log every fallback with trace IDs β a sustained fallback rate above 1% of reads warrants a replica health investigation.
β Back to Connection Routing & Pooling Strategies
Related
- Connection Routing & Pooling Strategies β Parent section covering the full spectrum of routing patterns: read/write splitting, pool architecture, ORM middleware, and session affinity.
- Implementing Read/Write Splitting at the Proxy Layer β The complementary mechanism: how ProxySQL and HAProxy separate write from read traffic before affinity routing takes effect.
- Connection Pool Architecture for Read Replicas β How to size, partition, and monitor the underlying pools that sticky routing dispatches into.
- Fallback Strategies When Replicas Fall Behind β What to do when the pinned replicaβs lag exceeds your SLA: primary fallback, lag-aware routing, and retry budgets.
- Routing Queries Based on Data Freshness Requirements β How to combine per-query staleness tolerances with session affinity so that time-sensitive queries always reach a fresh replica.