Evaluating Consistency Models for Distributed Reads

Routing reads across a replica fleet forces an explicit choice: how much staleness is your application willing to accept, and what is the operational cost of reducing it? This question sits at the heart of every distributed read architecture. Before you wire up connection routing, you need a consistency model that aligns with your SLA budget, your write throughput, and your failure tolerance — because that model determines which replicas are eligible for a given read, how session state must be tracked, and how the routing layer behaves when a replica falls behind.

This page maps the three principal consistency levels — strong, causal, and eventual — to concrete proxy configurations, session-affinity patterns, and partition-recovery procedures. All of it connects back to the replication internals covered in Database Replication Fundamentals & Architecture.


Problem Framing

The operational risk this page addresses is consistency drift: reads that return stale data in ways the application cannot detect or compensate for. Consistency drift causes silent correctness bugs — a user who just updated their payment method sees the old one, an inventory check approves an order against already-depleted stock, a fraud signal routes on yesterday’s risk score. Unlike a database crash, these failures produce no error codes and no obvious alert.

The failure modes this analysis mitigates:

  • Read-after-write violations — a client reads a replica before its own write has propagated
  • Monotonic-read violations — a client reads replica A (lag: 10 ms), then replica B (lag: 3 s), and observes data going backwards in time
  • Causality violations — client B observes the effect of client A’s write before observing the write itself
  • Partition-induced divergence — a network split isolates replicas that continue serving reads from increasingly stale state

Each violation class maps to a different routing mechanism. Choosing the wrong mechanism either over-constrains you (every read hits the primary, defeating the purpose of replicas) or under-constrains you (users see data correctness bugs in production).


Concept Definition and Scope

Consistency models define the ordering guarantees that a distributed system promises to readers. In the context of read replicas they are not theoretical abstractions; they are enforced at the routing layer through proxy configurations, session state tokens, and transaction isolation settings.

The three models relevant to replica routing:

Strong consistency — every read returns the most recently committed write, regardless of which node services it. In PostgreSQL terms, this means either routing all reads to the primary or using synchronous streaming replication where at least one standby has confirmed WAL receipt before the primary acknowledges the write. The cost is write latency: the primary blocks on each commit until the synchronous standby responds.

Causal consistency — reads respect the happens-before relationship between writes. If write W₁ causally precedes W₂, any client that has observed W₂ will also observe W₁. Implementing this in a replica topology requires propagating a version vector or LSN (Log Sequence Number) through the application so the routing layer can verify a replica has replayed past a known point before directing a read to it.

Eventual consistency — replicas will converge to the same state given no new writes, but there is no bound on when. Reads may observe any past committed state. The consistency window is bounded by replication lag — typically single-digit milliseconds on the same LAN, seconds to tens of seconds across WAN links. Eventual consistency supports the highest read throughput but requires the application to tolerate stale reads on all code paths that use replica connections.

Disambiguation: Session consistency (read-your-own-writes) is a specific guarantee within the eventual consistency family. It does not require strong consistency globally; it only requires that a single session’s reads route to nodes that have replayed that session’s writes. This is cheaper to implement than strong consistency and covers the majority of user-visible correctness requirements.


How the Mechanism Works Internally

WAL propagation and the consistency window

Every write on the primary generates a WAL record. The replication stream carries those records to standbys, which replay them in order. The gap between the LSN the primary has committed and the LSN a replica has replayed is the consistency window — the range of writes a read on that replica might miss.

The sequence for an asynchronous replica read:

  1. Application issues INSERT or UPDATE on primary — primary writes WAL, flushes (if synchronous_commit = on) or buffers (if off), returns to client.
  2. WAL sender process streams the record to the replica’s WAL receiver over a TCP connection.
  3. WAL receiver writes the record to the replica’s pg_wal/ directory and signals the startup process.
  4. Startup process replays the record, advancing pg_last_wal_replay_lsn().
  5. Only after step 4 does a read on the replica reflect the write.

The gap between steps 1 and 4 is the consistency window. It is bounded by network RTT plus replica I/O throughput. Under synchronous_commit = on, the primary’s COMMIT blocks between steps 1 and 2, waiting for the replica to confirm WAL receipt — this collapses the window to near-zero at the cost of write latency equal to the primary-to-replica RTT.

The diagram below shows how the LSN advances through the commit pipeline and where each consistency level draws its guarantee boundary.

WAL Commit Pipeline and Consistency Guarantee Boundaries Sequence diagram showing WAL record flow from primary write through WAL sender, network, WAL receiver, and replay, with annotations showing where each consistency model's guarantee is established. PRIMARY WAL SENDER WAL RECEIVER REPLAY Write WAL Flush / Buffer Stream record Write to pg_wal/ Replay record Strong Causal (LSN check) Eventual time →

Proxy rule evaluation order

When a query arrives at the proxy, it evaluates routing rules sequentially. For ProxySQL, rule evaluation proceeds by ascending rule_id. The first matching rule wins. Rules must be ordered from most-specific (locking reads, transaction context) to least-specific (plain SELECT):

  1. SELECT ... FOR UPDATE → primary (rule_id 10)
  2. Queries carrying a session-affinity token marking an active write transaction → primary (rule_id 20)
  3. Plain SELECT within consistency window → replica pool (rule_id 30)

This ordering ensures correctness: a locking read that reaches a replica will not see the lock table, producing phantom-read errors or deadlock confusion.


Trade-off Comparison Table

Consistency Level Write Latency Added Read Staleness Bound Routing Complexity Failure Surface Best-Fit Workload
Strong (sync commit) +RTT to synchronous standby (typically 1–15 ms LAN, 50–150 ms WAN) Zero — reads always current Low — route all reads to primary or sync standby Standby disconnect stalls all writes Financial transactions, inventory reservations, auth tokens
Causal (LSN-gated) Near-zero — async replication Bounded by LSN propagation; sub-100 ms typical Medium — application must carry LSN token; proxy must compare against pg_last_wal_replay_lsn() LSN token staleness if not propagated correctly Social feeds, comment threads, any “read-your-own-writes + causally consistent” requirement
Session (read-your-writes) Zero — writes remain async Zero within session; stale for other clients Low-medium — session affinity tokens, TTL-based primary pinning Affinity token expiry or pool reset breaks guarantee User profile pages, shopping cart, any single-user interaction loop
Eventual Zero Unbounded — determined by replication lag Low — round-robin or least-connections to any healthy replica Silent staleness; no routing-layer detection Aggregation dashboards, read-heavy analytics, cached catalog data

Configuration Runbook

1. Strong consistency: synchronous commit on PostgreSQL primary

ini
# postgresql.conf — Primary
# Block COMMIT until at least one replica has flushed WAL to disk
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (replica_us_east_1, replica_us_west_2)'
wal_level = replica

# Degraded-state protection: if the sync standby exceeds this,
# the primary will stall. Set a sensible ceiling.
wal_sender_timeout = 60s

synchronous_commit = on causes the primary to wait for the standby to confirm WAL flush before returning to the client. synchronous_standby_names = 'FIRST 1 (...)' means the first available standby in the list satisfies the quorum — if replica_us_east_1 is connected, it acts as the synchronous standby; if it disconnects, replica_us_west_2 takes over automatically.

Failure-state annotation: If all named standbys disconnect and synchronous_standby_names is non-empty, the primary stalls all writes. The routing proxy must detect this via connection timeout and redirect reads to the primary, accepting elevated primary load.

2. Causal consistency: LSN-gated replica selection

The application writes to the primary, receives back the committed LSN, stores it in the session (cookie, header, or cache key), and passes it to the proxy as a routing hint. The proxy compares the hint against pg_last_wal_replay_lsn() on candidate replicas and routes only to replicas that have replayed past that point.

sql
-- After a write, capture the primary LSN
SELECT pg_current_wal_lsn() AS committed_lsn;
-- Returns e.g. 1/A3F4B290

-- On the replica, before routing a read:
SELECT pg_last_wal_replay_lsn() >= '1/A3F4B290'::pg_lsn AS is_caught_up;
-- Returns true → safe to route this read here

This pattern is compatible with PgBouncer at the application layer (the application chooses which connection string to use based on the freshness check) or at a middleware layer that evaluates the LSN comparison before dispatching the query.

3. Session consistency: ProxySQL transaction-scope pinning

sql
-- ProxySQL: route locking reads and writes to primary hostgroup (10)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (10, 1, '^SELECT.*FOR UPDATE', 10, 1);

-- Route standard reads to replica pool (20)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (20, 1, '^SELECT', 20, 1);

-- Enable multiplexing only outside active transactions
-- (ProxySQL automatically pins connections during explicit transactions)
UPDATE mysql_servers SET max_connections=200 WHERE hostgroup_id=20;

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

ProxySQL’s transaction tracking automatically detects BEGIN / START TRANSACTION and pins the connection to the same host for the duration, satisfying session consistency within transaction boundaries. Between transactions, standard reads fall through to the replica pool.

4. Cross-region replica: WAL streaming tuning

For replicas that serve cross-region reads — as detailed in designing multi-region read replica topologies — WAL streaming parameters need explicit tuning for high-latency links:

ini
# postgresql.conf — Cross-region replica
max_standby_streaming_delay = 30s    # Allow long-running analytical queries
                                     # before conflict resolution triggers
wal_receiver_timeout = 60s           # Declare stream dead after 60 s of silence
hot_standby_feedback = on            # Prevent primary from vacuuming tuples
                                     # still needed by active replica queries
recovery_min_apply_delay = 0         # No intentional delay; set >0 only for
                                     # delayed standbys used as logical backups

hot_standby_feedback = on is critical for analytical replicas: without it, the primary’s autovacuum can remove row versions that a long-running query on the replica still needs, producing a canceling statement due to conflict with recovery error.

5. HAProxy health-check and failover configuration

haproxy
backend db_replicas
  mode tcp
  option pgsql-check user monitor
  # Mark server down after 3 consecutive failures; re-enable after 2 successes
  server replica_1 10.0.1.10:5432 check inter 5s fall 3 rise 2
  server replica_2 10.0.2.10:5432 check inter 5s fall 3 rise 2
  # Primary as last-resort backup when all replicas fail
  server primary  10.0.0.10:5432 backup check inter 5s fall 3 rise 2
  timeout connect 500ms
  timeout server  2000ms

The backup directive on the primary ensures that if both replicas fail health checks, reads fall through to the primary rather than returning connection errors. This degrades consistency (primary handles both reads and writes) but preserves availability.


Monitoring and Alerting Signals

The key metrics to track for consistency SLA compliance:

Metric Source Alert Threshold Routing Action
pg_replication_lag_seconds pg_stat_replication.replay_lag on primary > 5 s for 2 min Pin affected sessions to primary
pg_last_wal_replay_lsn delta Compare across replicas > 100 MB LSN divergence Remove lagging replica from pool
haproxy_server_status HAProxy stats socket DOWN state Already handled by fall 3
ProxySQL hostgroup_is_read_only runtime_mysql_servers Unexpected 0 on replica pool Alert; potential misconfiguration
Application stale_read_event counter Custom instrumentation > 0.1% of reads Increase primary-pin TTL
yaml
# prometheus/alerts.yml
groups:
  - name: replication_consistency
    rules:
      - alert: HighReplicationLag
        expr: pg_replication_lag_seconds > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} lag exceeds 5 s"
          description: >
            Route affected sessions to primary until
            pg_replication_lag_seconds normalises below 1 s.

      - alert: ReplicaLSNDivergence
        expr: |
          max(pg_current_wal_lsn_bytes) - min(pg_last_wal_replay_lsn_bytes) > 104857600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replica LSN divergence exceeds 100 MB"
          description: "Remove diverged replica from read pool immediately."

The for: 2m duration on HighReplicationLag prevents alert flapping during transient write spikes. Couple these Prometheus alerts with distributed tracing (OpenTelemetry db.statement and net.peer.name attributes) to map latency spikes to specific routing decisions during post-incident review.

Synthetic read-after-write probe — deploy a cron job that writes a unique UUID to the primary, records the timestamp, polls each replica until the UUID appears, and emits replica_consistency_window_ms as a histogram. This gives you an empirical distribution of your actual consistency window rather than relying solely on lag estimates from pg_stat_replication.


Failure Modes and Recovery Steps

1. Synchronous standby disconnect stalls all writes

Root cause: The named synchronous standby disconnects (network failure, OOM kill, disk full) while synchronous_standby_names is non-empty and no other standby can satisfy the quorum.

Diagnosis:

sql
-- On primary: check sync state of all standbys
SELECT application_name, sync_state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- If sync_state = 'quorum' or 'sync' for zero rows, writes will stall

Recovery:

  1. If the standby is recoverable, restart it — it will reconnect and the primary resumes.
  2. If the standby is unrecoverable, temporarily demote it: ALTER SYSTEM SET synchronous_standby_names = ''; then SELECT pg_reload_conf();. This immediately unblocks writes at the cost of dropping to asynchronous replication.
  3. Promote an asynchronous standby to synchronous by adding it to synchronous_standby_names once the original is replaced.

2. Replica LSN falls behind and serves stale reads

Root cause: Write spike on primary outpaces the replica’s WAL apply rate, or max_standby_streaming_delay triggers frequent conflict cancellations that slow replay.

Diagnosis:

sql
-- On replica: how far behind is replay?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

-- On primary: per-replica breakdown
SELECT application_name,
       write_lag, flush_lag, replay_lag,
       pg_size_pretty(sent_lsn - replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

Recovery:

  1. Remove the lagging replica from the proxy pool (HAProxy: set server db_replicas/replica_1 state drain).
  2. Check for long-running queries blocking replay: SELECT pid, query, state FROM pg_stat_activity WHERE backend_type = 'startup'; — terminate conflicting queries if necessary.
  3. Once replay_lag drops below 100 ms, re-enable: set server db_replicas/replica_1 state ready.

3. Session-affinity token expires mid-transaction

Root cause: The TTL on the primary-pin token (set to prevent sessions from being permanently pinned to the primary) expires before the client finishes a multi-step transaction involving a write followed by reads.

Diagnosis: Application-level stale_read_event counter spikes; users see their own recent writes disappear momentarily.

Recovery:

  1. Increase the pin TTL: if your write-to-read latency P99 is 800 ms, set pin_ttl = 2000ms minimum.
  2. Extend the token on each write within the session rather than setting a fixed expiry at first write.
  3. For multi-step wizards or checkout flows, pin the session to the primary for the entire flow duration and release the pin only on session end or explicit commit.

4. Read preference tag mismatch — no replica matches

Root cause: All replicas matching a specific tag set (e.g., {region: 'us-east-1', consistency: 'strong'}) are unavailable or lagging.

Diagnosis: MongoDB driver emits MongoServerSelectionError; Prometheus shows mongodb_mongod_connections drop to zero for the tagged set.

Recovery:

  1. Application must catch MongoServerSelectionError and implement exponential backoff with jitter: base = 100 ms, cap = 5 s, jitter = random(0, base).
  2. Fall back to primaryPreferred rather than failing the request.
  3. Serve a cached or degraded response for non-critical reads rather than propagating the database error to the end user.

5. Partial network partition isolates one replica

Root cause: A subset of replicas becomes unreachable from the proxy but remains reachable from the primary. The isolated replica continues accepting read connections from clients that can reach it directly, serving increasingly stale data.

Diagnosis: pg_stat_activity on the isolated replica shows active client connections; pg_last_xact_replay_timestamp() has not advanced in minutes; pg_stat_wal_receiver shows status = 'stopped'.

Recovery:

  1. The proxy will already have marked the replica DOWN via health checks — no action needed for clients routed through the proxy.
  2. For clients with direct replica connections (e.g., analytics pipelines configured with a direct DSN): update connection strings to use the proxy, or add a firewall rule to drop direct replica connections until the partition resolves.
  3. After the partition heals, the replica will replay the accumulated WAL. Keep it marked draining until replay_lag_bytes drops below your consistency budget before re-enabling in the pool.

FAQ

What is the difference between strong and eventual consistency in read replicas?

Strong consistency guarantees that every read reflects the latest committed write, enforced by blocking the primary until replicas acknowledge WAL receipt. Eventual consistency allows reads from replicas that may be milliseconds to seconds behind the primary, trading staleness tolerance for lower write latency and higher read throughput.

How does causal consistency differ from session consistency?

Session consistency scopes ordering guarantees to a single client session — you will always read your own writes within that session. Causal consistency is broader: any client that has observed a write will see all writes that causally preceded it, even across different sessions, as long as the causal context (LSN or vector clock) is propagated through the application tier.

When should I use synchronous_commit = on versus remote_write?

Use synchronous_commit = on (full durability) when you cannot tolerate data loss on primary crash — it waits for WAL to be flushed to disk on the standby. Use remote_write when you need replication confirmation without the I/O cost of disk flush: the WAL record is in the standby’s OS buffer, giving crash-safety against primary failure but not standby OS crash.

What happens to read routing when the synchronous standby disconnects?

The primary stalls new commits until wal_receiver_timeout expires or synchronous_standby_names is reconfigured. During the stall, routing proxies should detect the blocked write path via connection timeout and temporarily redirect read traffic to the primary, accepting higher primary load over write starvation. Once a replacement standby is promoted to synchronous, normal replica routing resumes.


← Back to Database Replication Fundamentals & Architecture