Designing Multi-Region Read Replica Topologies

Multi-region read replicas solve two distinct problems: placing read capacity close to users to reduce query latency, and providing geographic redundancy when a primary region becomes unavailable. Both goals introduce non-trivial network physics. Cross-region WAN links carry 60–200 ms round-trip times, variable packet loss, and asymmetric bandwidth — conditions that stress every layer of the replication stack from TCP window sizing through to replication lag accumulation and consistency guarantees. This page is part of the Database Replication Fundamentals & Architecture reference and covers topology selection, engine-level WAN tuning, lag-aware routing, monitoring, and incident recovery specific to geographically distributed read replicas.


Topology Patterns and Their Tradeoffs

Centralized Primary with Regional Read Replicas

The most common pattern: one primary region generates WAL or binary log and streams it directly to read replicas in each remote region. Each remote replica is read-only and applies changes asynchronously. Simplicity is the main advantage — there is one authoritative write path, no write conflict resolution, and straightforward promotion logic.

The weakness is bandwidth concentration. Every remote region pulls the full WAL stream from a single primary, so a high-write primary can saturate its WAN links and cause all replicas to lag together. When the primary region fails, promoting a remote replica requires checking how much WAL it actually received: compare its replay LSN with the primary's last known LSN, and accept that any WAL it had not yet received is lost.

Cascading Relay Architecture

In a cascading setup, a regional relay replica acts as a WAL forwarding node: it receives the stream from the primary and re-streams it to downstream replicas within its own region. The primary sees only one outbound WAL sender per remote region rather than one per replica. This pattern suits high-replica-count deployments (analytics clusters, regional CDN-adjacent nodes) where the primary’s max_wal_senders budget would otherwise be exhausted.

The tradeoff is added replication depth. A downstream replica is two hops from the primary, so its effective lag is the sum of the primary→relay lag and the relay→downstream lag. Relay node failure severs all downstream replicas in that region simultaneously, which makes the relay itself a regional single point of failure that needs its own HA treatment.


Figure: Multi-Region Read Replica Topology Patterns. Left: a centralized primary (us-east-1) streams WAL directly to three regional read replicas (eu-west-1, ap-southeast-1, sa-east-1). Right: a cascading relay, where the primary (us-east-1) streams to a relay node (ap-southeast-1) that forwards to downstream replicas A, B, and C within the region. All WAL/binlog streams are asynchronous.

Trade-off Comparison

| Dimension | Centralized primary | Cascading relay |
| --- | --- | --- |
| WAN sender load on primary | One sender per remote replica | One sender per remote region |
| Replication depth | 1 hop | 2 hops (higher effective lag) |
| Relay SPOF risk | None | Relay failure severs all downstream replicas |
| Promotion complexity | Low: any replica is one hop from source | High: downstream replicas must re-point to promoted relay or primary |
| Best fit | Low-to-medium replica count, predictable WAL volume | High replica count per region, analytics fan-out |
| max_wal_senders pressure | Scales with replica count | Bounded by region count |
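
To make the comparison concrete, the following minimal sketch computes the two quantities the table contrasts: the number of WAL sender connections the primary serves and the effective lag a replica sees. The function names, region layout, and lag figures are illustrative assumptions, not measurements.

```python
from typing import Dict


def primary_wal_senders(replicas_per_region: Dict[str, int], cascading: bool) -> int:
    """WAL sender connections the primary serves for its remote regions."""
    if cascading:
        # One relay per region that hosts at least one replica.
        return sum(1 for count in replicas_per_region.values() if count > 0)
    # Centralized: every remote replica connects to the primary directly.
    return sum(replicas_per_region.values())


def effective_lag_ms(primary_to_relay_ms: float, relay_to_downstream_ms: float = 0.0) -> float:
    """Lag seen by a replica; the second hop is zero in the centralized pattern."""
    return primary_to_relay_ms + relay_to_downstream_ms


# Hypothetical fleet: replica counts per remote region.
regions = {"eu-west-1": 4, "ap-southeast-1": 6, "sa-east-1": 2}
centralized_senders = primary_wal_senders(regions, cascading=False)
cascading_senders = primary_wal_senders(regions, cascading=True)
downstream_lag = effective_lag_ms(180.0, 15.0)
```

With these assumed numbers the primary serves 12 senders in the centralized layout and 3 with relays, while a downstream replica behind a relay sees 195 ms of lag instead of 180 ms.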

Synchronous vs. Asynchronous Across WAN

The choice between synchronous and asynchronous replication is the single most consequential decision for cross-region topologies. With PostgreSQL synchronous_commit = remote_apply and a remote standby listed in synchronous_standby_names, every write commit blocks until the standby confirms it has applied the WAL record. On a 100 ms RTT WAN link, that adds at least 100 ms to every write. During network degradation commits stall completely, and the waiting sessions exhaust connection pool slots within seconds.

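Asynchronous replication (the default synchronous_commit = on with no synchronous_standby_names set, or synchronous_commit = local when synchronous standbys are configured) allows the primary to continue at full write speed. Setting synchronous_commit = off also makes replication asynchronous, but it additionally skips the wait for the local WAL flush and so weakens durability on the primary itself. The replica falls behind by however much WAL the link has not yet delivered. The operational cost is an RPO greater than zero: if the primary fails before the replica has received all WAL, recent committed transactions are lost.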

Practical guidance: use asynchronous replication for all cross-region read replicas and limit synchronous replication to same-region or same-AZ standbys where RTT is under 2 ms. If your SLA requires zero RPO across regions, you need a consensus-replicated distributed database (CockroachDB, YugabyteDB, Spanner) rather than streaming replication with synchronous standbys.

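For MySQL deployments, semi-synchronous replication with rpl_semi_sync_source_wait_point = AFTER_SYNC offers a middle ground: the primary waits for a replica to acknowledge that it has written the transaction to its relay log before committing, but does not wait for the replica to apply it. This limits RPO to the transactions still awaiting acknowledgment (typically one per session) without blocking on apply latency. Each commit still pays one WAN round trip, and if no acknowledgment arrives within rpl_semi_sync_source_timeout (default 10 s) the primary reverts to asynchronous replication.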
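
The cost of any commit that waits for a cross-region acknowledgment can be estimated with simple arithmetic. The sketch below is a rough model rather than a benchmark: it assumes each commit waits exactly one round trip plus a fixed local commit time (`local_commit_ms`, an assumed 1 ms), and uses Little's law to estimate how many connections are held open at a given commit rate.

```python
def max_commits_per_second_per_connection(rtt_ms: float, local_commit_ms: float = 1.0) -> float:
    """Upper bound on serial commits for one connection that waits one RTT per commit."""
    return 1000.0 / (rtt_ms + local_commit_ms)


def connections_in_flight(commits_per_second: float, rtt_ms: float,
                          local_commit_ms: float = 1.0) -> float:
    """Little's law: average concurrency = arrival rate x time spent per commit."""
    return commits_per_second * (rtt_ms + local_commit_ms) / 1000.0


# Assumed workload: 2000 commits per second.
cross_region = connections_in_flight(2000, rtt_ms=100)  # synchronous standby over WAN
same_az = connections_in_flight(2000, rtt_ms=2)         # synchronous standby in the same AZ
```

Under these assumptions the same workload keeps about 202 connections busy across a 100 ms link versus about 6 within an AZ, which is why a pool sized for local commits drains within seconds once commits start waiting on the WAN.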


Engine-Level WAN Tuning

Cross-region WAL streaming fails silently in misconfigured environments. The TCP connection persists but throughput degrades to near-zero as receive buffers fill and window scaling stalls. Address this at three layers.

TCP Kernel Parameters

ini
# /etc/sysctl.d/99-wan-replication.conf
# Increase socket buffer sizes to match high bandwidth-delay product links
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432

# Detect dead peers faster than the default 2-hour keepalive
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

# Enable BBR congestion control for satellite / high-latency WAN
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

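Apply with sysctl -p /etc/sysctl.d/99-wan-replication.conf. BBR requires the tcp_bbr kernel module (Linux 4.9 or later). Verify on the replica with ss -tiO dport = :5432 and look for rcv_space on the active replication connection growing well beyond its initial value, toward the rmem_max ceiling.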
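
The buffer and keepalive values above follow from two small calculations: the bandwidth-delay product (BDP) of the link, and the time the kernel needs to declare a silent peer dead. The following sketch shows both; the link speeds and RTTs passed in are illustrative assumptions, and the throughput figure is a ceiling because the kernel reserves part of each socket buffer for its own bookkeeping.

```python
def bdp_bytes(bandwidth_mbit_s: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return int(bandwidth_mbit_s * 1_000_000 / 8 * rtt_ms / 1000)


def max_throughput_mbit_s(buffer_bytes: int, rtt_ms: float) -> float:
    """Throughput ceiling of one TCP connection whose window is capped by its buffer."""
    return buffer_bytes * 8 / (rtt_ms / 1000) / 1_000_000


def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Time until an unresponsive peer is declared dead by TCP keepalive."""
    return idle + interval * probes


needed = bdp_bytes(bandwidth_mbit_s=1000, rtt_ms=100)    # 1 Gbit/s link, 100 ms RTT
ceiling = max_throughput_mbit_s(33_554_432, rtt_ms=200)  # 32 MiB buffer, 200 ms RTT
detect = keepalive_detection_seconds(60, 10, 6)          # values from the sysctl file
```

A 1 Gbit/s link at 100 ms RTT needs about 12.5 MB in flight, so the 32 MiB ceiling leaves headroom even at 200 ms, and the keepalive settings detect a dead peer after 120 seconds.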

PostgreSQL WAL Sender Configuration

ini
# postgresql.conf — production WAN streaming replication
wal_level = replica

# Sender slots: one per replica plus headroom for pg_basebackup
max_wal_senders = 10
max_replication_slots = 10

# Compress full-page images written to WAL (this is not stream-level compression);
# reduces WAL volume ~30–60% for typical OLTP workloads
# CPU overhead is low on modern hardware; disable if the primary is CPU-bound
wal_compression = on

# Default to async for cross-region standbys; override per-standby if needed
synchronous_standby_names = ''

# Mirror the kernel keepalive settings at the application layer
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6

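# Retain at least 2 GB of WAL so a slot-less standby can catch up after a short disconnect.
# This is a minimum, not a cap; max_slot_wal_keep_size (PostgreSQL 13+) bounds slot retention.
wal_keep_size = 2048   # MB

# Terminate a replication connection that sends no reply for 120 s (default 60 s);
# the longer window tolerates transient WAN interruptions
wal_sender_timeout = 120s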

Replication Slot Hygiene

Named replication slots guarantee that the primary retains WAL until the consumer has confirmed receiving it. In a cross-region context, this is a double-edged sword: a disconnected replica whose slot still exists (active = false) causes the primary to accumulate WAL indefinitely, eventually filling its data volume. Instrument pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) from pg_replication_slots as a byte-lag metric and alert when it exceeds 5 GB. Drop slots that have been inactive for longer than your retention window.
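
The slot policy described above can be expressed as a small decision function. This is a minimal sketch under stated assumptions: `SlotStatus`, `slot_action`, and the six-hour retention window are hypothetical, and the inactivity duration is assumed to be tracked by your monitoring system rather than read from PostgreSQL.

```python
from dataclasses import dataclass
from datetime import timedelta

ALERT_BYTES = 5_000_000_000            # 5 GB byte-lag threshold from the text
RETENTION_WINDOW = timedelta(hours=6)  # assumed retention window; choose your own


@dataclass
class SlotStatus:
    slot_name: str
    active: bool
    lag_bytes: int           # pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    inactive_for: timedelta  # time since the consumer disconnected (from monitoring)


def slot_action(slot: SlotStatus,
                alert_bytes: int = ALERT_BYTES,
                retention_window: timedelta = RETENTION_WINDOW) -> str:
    """Return 'drop', 'alert', or 'ok' for one replication slot."""
    if not slot.active and slot.inactive_for > retention_window:
        return "drop"   # consumer is gone for longer than we are willing to retain WAL
    if slot.lag_bytes > alert_bytes:
        return "alert"  # retained WAL is large enough to threaten the data volume
    return "ok"


stale = SlotStatus("ap_replica", active=False, lag_bytes=7_000_000_000,
                   inactive_for=timedelta(hours=9))
decision = slot_action(stale)
```

Treating "drop" as a recommendation for a human or a runbook, rather than executing pg_drop_replication_slot automatically, is a reasonable default because dropping a slot forces the replica to be rebuilt.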


Lag-Aware Read Routing

Routing cross-region reads requires knowing which replicas are current enough for a given query. Two complementary tools handle this: proxy-layer lag filtering and session-level LSN tracking for read-your-writes consistency.

HAProxy with PostgreSQL Health Check

HAProxy’s native pgsql-check verifies that a backend is accepting connections as a valid PostgreSQL server. Combine it with an external health-check script that queries pg_stat_replication or a Prometheus exporter to dynamically remove lagging backends.

haproxy
# haproxy.cfg — read replica backend with PostgreSQL health checks
backend pg_read_pool
  mode tcp
  balance leastconn
  option pgsql-check user monitor    # requires a 'monitor' role in pg_hba.conf

  # Declare replicas; use 'check' to enable the pgsql-check above
  # 'inter 2s fall 3 rise 2' — check every 2s, eject after 3 failures, restore after 2 successes
  server us-east-replica  10.0.1.10:5432 check inter 2s fall 3 rise 2
  server eu-west-replica  10.0.2.10:5432 check inter 2s fall 3 rise 2
  server ap-south-replica 10.0.3.10:5432 check inter 2s fall 3 rise 2

For lag-aware ejection, pair this with a health-check endpoint script that returns HTTP 503 when pg_replication_lag_seconds > 10, and switch HAProxy to option httpchk against that endpoint.
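
One possible shape for such an endpoint, using only the Python standard library, is sketched below. The names `status_for_lag`, `make_handler`, and `serve`, the port, and the `get_lag` callback are all assumptions for illustration; how the lag value is obtained (a query against the replica, or a scrape of an exporter) is left to the caller.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Optional

LAG_THRESHOLD_SECONDS = 10.0  # ejection threshold from the text


def status_for_lag(lag_seconds: Optional[float],
                   threshold: float = LAG_THRESHOLD_SECONDS) -> int:
    """Map replica lag to an HTTP status; unknown lag is treated as unhealthy."""
    if lag_seconds is None or lag_seconds > threshold:
        return 503
    return 200


def make_handler(get_lag: Callable[[], Optional[float]]):
    """Build a request handler class around a caller-supplied lag probe."""

    class LagHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                lag = get_lag()
            except Exception:
                lag = None  # a failing probe must not report the replica as healthy
            body = f"lag_seconds={lag}\n".encode()
            self.send_response(status_for_lag(lag))
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep health-check polling out of the logs

    return LagHandler


def serve(get_lag: Callable[[], Optional[float]], port: int = 8008) -> None:
    """Run the health endpoint until interrupted."""
    HTTPServer(("0.0.0.0", port), make_handler(get_lag)).serve_forever()
```

The handler answers GET only, so the HAProxy backend would need option httpchk GET / with each server line checking the endpoint's port. Failing closed on an unknown lag value is a deliberate choice: a broken probe ejects the replica instead of silently serving stale reads.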

ProxySQL Lag-Threshold Routing (MySQL)

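ProxySQL implements lag filtering natively through the max_replication_lag column of mysql_servers. When a replica's measured lag exceeds the threshold, ProxySQL shuns it until it catches up. Route reads to the replica hostgroup, and also list the primary in that hostgroup with a low weight if reads should fall back to it when every replica is shunned.

sql
-- ProxySQL admin: lag-aware read routing
-- Shun replicas in reader hostgroup 20 while their lag exceeds 5 s
UPDATE mysql_servers SET max_replication_lag = 5 WHERE hostgroup_id = 20;

-- Send reads to hostgroup 20; other statements follow the user's default hostgroup
INSERT INTO mysql_query_rules
  (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (10, 1, '^SELECT', 20, 1);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;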

Read-Your-Writes via LSN Tracking

When a write completes, capture the primary’s current LSN (SELECT pg_current_wal_lsn() in PostgreSQL or @@gtid_executed in MySQL) and store it in the user session. On subsequent reads, route only to replicas where the applied LSN is at or beyond the session LSN:

sql
-- PostgreSQL: check if replica has applied up to the required LSN before reading
SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn AS is_current;

If no replica is current enough, fall back to the primary, guarded by a connection-pool circuit breaker so that fallback reads do not saturate the write path. The related page "Routing queries by data freshness requirements" covers the full fallback decision tree.
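
On the application side, the routing decision reduces to comparing LSNs. The sketch below assumes the application already holds the session LSN and a recent replay LSN for each replica (for example from periodic polling); `parse_lsn` and `pick_read_target` are hypothetical helper names, and the circuit breaker is out of scope here.

```python
from typing import Dict


def parse_lsn(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_read_target(session_lsn: str,
                     replica_replay_lsns: Dict[str, str],
                     primary: str = "primary") -> str:
    """Return the first replica that has replayed the session's last write, else the primary.

    replica_replay_lsns maps replica name -> pg_last_wal_replay_lsn() output,
    ordered by preference (for example nearest region first).
    """
    required = parse_lsn(session_lsn)
    for name, replay_lsn in replica_replay_lsns.items():
        if parse_lsn(replay_lsn) >= required:
            return name
    return primary


target = pick_read_target(
    "0/3000000",
    {"eu-west-replica": "0/2FFFFFF", "ap-south-replica": "0/3000000"},
)
```

LSNs must be compared numerically rather than as strings, because '0/A000000' sorts after '0/10000000' lexically even though it is the smaller position.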


Clock Synchronization and Temporal Consistency

Timestamp drift across regions breaks ordering guarantees and complicates audit trails. Enforce UTC at both application and database layers (timezone = 'UTC' in postgresql.conf). OS-level timezone conversions introduce silent ordering errors during DST transitions.

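Clock skew affects anything that compares timestamps across nodes. Physical streaming replication orders changes by WAL position and is unaffected, but timestamp-based conflict resolution in logical replication setups and time-based lag metrics both depend on synchronized clocks. When skew exceeds 50 ms, timestamp-based ordering may misorder events and produce incorrect conflict-resolution decisions. Deploy chrony with iburst and a bounded polling interval (maxpoll 10), aiming for sub-millisecond synchronization against a nearby time source:

ini
# /etc/chrony.conf: NTP for database hosts
# AWS Time Sync Service (or equivalent cloud NTP); maxpoll is a per-server option
server 169.254.169.123 prefer iburst maxpoll 10
makestep 1.0 3
rtcsync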

For application-side reconciliation, track an explicit updated_at column alongside LSN metadata rather than relying on wall-clock ordering alone.


Monitoring and Alerting Signals

Blind trust in replication status leads to silent data divergence. Export these metrics from every regional node.

Key PostgreSQL metrics via pg_stat_replication and replication slot views:

sql
-- Lag in bytes per standby, plus seconds since the last standby reply (run on primary)
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)  AS send_lag_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn)             AS apply_lag_bytes,
       extract(epoch from (now() - reply_time))           AS last_reply_seconds
FROM pg_stat_replication;

-- Slot byte-lag (run on primary)
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

Prometheus alerting rules:

yaml
# prometheus_alerts.yml
groups:
  - name: multi_region_replication
    rules:
      - alert: ReplicationLagWarning
        expr: pg_replication_lag_seconds > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Cross-region replica lag exceeds 3 s"
          description: "Verify WAN RTT and WAL apply queue depth on {{ $labels.instance }}"

      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replica lag > 10 s — ejecting from read pool"
          description: "HAProxy/ProxySQL health check should be returning 503 for {{ $labels.instance }}"

      - alert: ReplicationSlotInactive
        expr: pg_replication_slots_active == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication slot inactive — WAL accumulation risk"
          description: "Check network partition or replica crash. Slot: {{ $labels.slot_name }}"

      - alert: WalSlotByteLagHigh
        expr: pg_replication_slot_lag_bytes > 5e9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "WAL slot byte-lag > 5 GB — disk risk on primary"
          description: "Consider dropping slot {{ $labels.slot_name }} if replica is unrecoverable"

Tiered alert thresholds:

  • Warning: lag > 3 s or connection pool usage > 70 %
  • Critical: lag > 10 s or WAL slot byte-lag > 5 GB
  • Page: slot active = false for more than 1 min or TCP retransmit rate > 5 %
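
The three tiers above can be captured in a single function, which is useful for unit-testing threshold changes before they reach the alerting rules. This is an illustrative sketch; the function name and argument names are assumptions, and the thresholds are the ones listed above.

```python
def alert_tier(lag_seconds: float = 0.0,
               pool_usage_pct: float = 0.0,
               slot_lag_bytes: float = 0.0,
               slot_inactive_seconds: float = 0.0,
               tcp_retransmit_pct: float = 0.0) -> str:
    """Return the highest tier triggered: 'page', 'critical', 'warning', or 'ok'."""
    # Page: slot inactive for more than 1 min, or TCP retransmit rate above 5 %.
    if slot_inactive_seconds > 60 or tcp_retransmit_pct > 5:
        return "page"
    # Critical: lag above 10 s, or slot byte-lag above 5 GB.
    if lag_seconds > 10 or slot_lag_bytes > 5e9:
        return "critical"
    # Warning: lag above 3 s, or connection pool usage above 70 %.
    if lag_seconds > 3 or pool_usage_pct > 70:
        return "warning"
    return "ok"


tier = alert_tier(lag_seconds=4.2, pool_usage_pct=55)
```

Evaluating the tiers from most to least severe ensures that a replica which is both lagging and has an inactive slot is reported once, at the highest applicable level.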

Failure Modes and Recovery Steps

1. DNS Resolution Delay During Regional Failover

Root cause: Some OS-level and application-level DNS caches do not honor record TTLs, so stale records keep pointing traffic to a demoted or unreachable primary for minutes after failover.

Resolution:

  1. Set DNS record TTL to 60 s for all database endpoints before you need to fail over.
  2. Use client-side DNS bypass: connect via IP address written to a config file that your failover automation rewrites, or use a service mesh sidecar (Envoy, Linkerd) for connection-level rerouting.
  3. After failover, monitor dig +short <endpoint> from each application region to confirm propagation.
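
The propagation check in step 3 can also be run from inside the application hosts, where it exercises the same resolver path (including local caches) that the database clients use. The sketch below is one way to do this under stated assumptions: the function names are hypothetical, only IPv4 is handled, and the endpoint name and IP address in the example are placeholders.

```python
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses through the OS resolver."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return sorted(addresses)


def has_propagated(endpoint: str, expected_ip: str,
                   resolver: Callable[[str], List[str]] = resolve_ipv4) -> bool:
    """True when the endpoint resolves only to the new primary's address."""
    try:
        return resolver(endpoint) == [expected_ip]
    except OSError:
        return False  # resolution failure counts as not yet propagated


# Example with a stubbed resolver; replace with the default resolver in production.
stale_resolver = lambda host: ["10.0.1.10"]
ok = has_propagated("db-primary.example.internal", "10.0.2.10", resolver=stale_resolver)
```

Unlike dig, which queries DNS servers directly, this check goes through the host's resolver configuration, so it can reveal a stale local cache that dig would miss.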

2. Replica Promotion Race (Split-Brain)

Root cause: Multiple regional controllers simultaneously detect primary unreachability and each independently promote their local replica.

Resolution:

  1. Enforce a single global coordinator for promotion decisions (Patroni with etcd, or Orchestrator with Raft-based topology manager).
  2. Use STONITH or network-ACL fencing to guarantee the old primary cannot accept writes before the new primary is active.
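  3. If split-brain occurs: isolate the rogue primary immediately, then compare write positions on all candidates (SHOW BINARY LOG STATUS on MySQL, or SHOW MASTER STATUS before MySQL 8.4; SELECT pg_current_wal_lsn() on PostgreSQL). Keep the node with the furthest-ahead position as primary and rebuild the others from a fresh base backup. Writes accepted only by a discarded node are lost unless you extract and reapply them manually.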

3. WAL Slot Bloat and Disk Exhaustion

Root cause: A replica disconnects but its named replication slot remains. The primary retains all WAL from the slot’s restart_lsn onward, filling the data volume.

Resolution:

sql
-- Step 1: Identify the bloated slot
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots
WHERE active = false;

-- Step 2: Drop the inactive slot (replica must rebuild from base backup)
SELECT pg_drop_replication_slot('slot_name_here');

-- Step 3: Verify WAL directory is reclaiming space
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

After dropping: rebuild the replica from pg_basebackup, verify archive_mode = on and archive_command health to prevent future WAL gaps, and recreate the slot only after the replica has connected and begun streaming.

4. TCP Keepalive Timeout Causing Phantom Replica

Root cause: A long-idle WAL sender connection passes TCP keepalive checks at the OS level but the PostgreSQL process on the replica has stalled (OOM kill, lock contention). The primary considers the standby connected and does not advance WAL cleanup.

Resolution: Keep wal_sender_timeout at a finite value on the primary. The default is 60s, and the WAN configuration above raises it to 120s; verify that it has not been set to 0, which disables the timeout. The primary terminates a sender that receives no standby reply within this window, allowing the standby to reconnect or the slot to become visibly inactive.

5. Replication Stall Under Write Spike

Root cause: A large bulk write (schema migration, ETL load) generates WAL faster than the WAN link can transmit it. Replicas fall behind. If synchronous_standby_names names any remote standby, the primary stalls waiting for acknowledgment.

Resolution:

  • For one-off bulk operations: temporarily remove the remote standby from synchronous_standby_names, perform the operation, then restore after the replica catches up.
  • For recurring spikes: tune max_slot_wal_keep_size (PostgreSQL 13+) to limit per-slot WAL retention, and configure a monitoring gate that alerts when WAL sender queue depth exceeds 500 MB.
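  • On MySQL replicas, increase replica_parallel_workers (slave_parallel_workers before MySQL 8.0.26) to raise apply throughput. On PostgreSQL, physical WAL replay runs in a single startup process, so max_worker_processes and max_parallel_workers do not speed it up; on PostgreSQL 15+ recovery_prefetch can help, as can faster storage on the replica.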

Production-Readiness Checklist


Child Pages

Comparing PostgreSQL Streaming Replication vs MySQL GTID: Side-by-side breakdown of WAL-based physical streaming and GTID-based binlog replication: slot mechanics, failover behavior, multi-source support, and the specific tuning parameters that differ between the two protocols on long-haul WAN links.

Step-by-Step Guide to Setting Up Cross-AZ Read Replicas: Runnable walkthrough for establishing read replicas across availability zones as a lower-latency precursor to full cross-region deployment; it validates baseline sync performance and replication slot hygiene before WAN expansion.


Frequently Asked Questions

What is the safe replication lag threshold before ejecting a cross-region replica?
A threshold of 5–10 seconds is typical for most SLAs. Alert at 3 s, eject at 10 s, and page on slot inactivity. The correct value depends on your read-your-writes requirements and the cost of falling back to the primary.

Should I use synchronous or asynchronous replication across regions?
Asynchronous replication is almost always correct for cross-region reads. Synchronous replication across a 60–150 ms WAN link adds that latency to every write commit and risks write stall during network degradation. Reserve synchronous mode for same-AZ or same-datacenter standbys.

How do I prevent WAL bloat when a cross-region replica disconnects?
Drop inactive replication slots within your monitoring alert window. Set max_slot_wal_keep_size (PostgreSQL 13+) as a backstop that caps per-slot retention, and do not rely on wal_keep_size for this: it sets a minimum amount of retained WAL, not a limit. Without a cap, a disconnected named slot accumulates WAL unboundedly until the slot is dropped or the replica reconnects.

