Database Replication Fundamentals & Architecture

Production systems that scale reads horizontally face a hard engineering constraint: every replica is a delayed mirror of the primary, and every application that reads from it must be designed around that delay. The core challenge is not adding replicas β€” it is making deliberate, auditable decisions about topology, commit acknowledgment, consistency boundaries, and routing before the first read lands on a standby node. Get those decisions wrong and the failure modes are silent: stale reads that corrupt transactional flows, split-brain after an uncoordinated failover, or a connection pool exhaustion that takes down the primary alongside every replica. This guide covers the full decision surface β€” from WAL streaming configuration and topology selection through connection routing and pooling strategies, replication lag management, and automated failover β€” with annotated configurations, explicit trade-offs, and incident-aware runbooks for backend engineers, DBAs, SREs, and platform engineers.


Architecture overview

The diagram below maps the three primary replication patterns β€” star (direct streaming), cascading, and logical multi-subscriber β€” against the key decision boundaries: commit acknowledgment, WAL fan-out, and query routing.

Database Replication Topology Overview Three replication patterns: star topology where the primary streams WAL directly to all replicas; cascading topology where an intermediate standby fans out to downstream nodes; and logical replication where a publisher sends decoded row changes to multiple independent subscribers. Star (direct streaming) Primary Replica 1 Replica 2 Replica 3 All WAL from primary Low complexity Β· High WAN load Cascading Primary Intermediate Replica A Replica B Lower WAN load Β· Chain failure risk Logical multi-subscriber Publisher (primary) Subscriber 1 (subset A) Subscriber 2 (subset B) Decoded row changes Β· Table-level filter Partial replication Β· No slot sharing Query routing layer (sits above all topologies) App / ORM Proxy / router ProxySQL Β· PgBouncer Β· HAProxy Primary (W) Replica pool (R) INSERT/UPDATE SELECT Replication lag monitor pg_stat_replication Β· pg_replication_lag_seconds Β· replay_lag Feeds circuit-breaker: route to primary when lag exceeds SLA threshold

1. Topology design

Replication topology dictates baseline write amplification, WAN bandwidth consumption, and the blast radius of a single-node failure. Before adding the first standby, choose a pattern you can operate and recover from at 2 a.m.

The three patterns in the diagram above map to concrete operational trade-offs:

  • Star (direct streaming): Every replica maintains its own wal_sender connection to the primary. Simple to reason about and fast to recover β€” promote any replica directly. The cost is linear growth in WAL sender slots and WAN egress as replica count grows.
  • Cascading: An intermediate standby re-streams WAL to downstream nodes. Reduces primary load and cross-region bandwidth but introduces a single point of amplification failure: if the intermediate node stalls, all downstream replicas lag in lockstep. See Designing Multi-Region Read Replica Topologies for latency boundaries, data sovereignty constraints, and cross-AZ failover readiness.
  • Logical multi-subscriber: The primary decodes WAL into row-level change events and publishes them per-table. Subscribers can consume subsets of the schema and run different major versions. The trade-off is slot-level WAL accumulation risk β€” if a subscriber falls behind, the primary cannot reclaim WAL segments until it catches up, threatening primary disk exhaustion.

Topology trade-off matrix

Topology pattern WAN bandwidth Primary load Failover complexity Best fit
Star (direct) High β€” all replicas pull from primary Linear with replica count Low β€” any replica promotes directly Regional apps, fewer than 5 replicas
Cascading Low — only primary→intermediate link is heavy Bounded (one or two sender slots) High — intermediate failure cascades Cross-region, bandwidth-constrained
Logical multi-subscriber Medium β€” decoded rows, not raw WAL pages Moderate β€” decoding CPU High β€” slot management, DDL restrictions Partial replication, cross-version upgrades
Mesh (multi-primary) Very high β€” bidirectional conflict traffic Very high Very high β€” conflict resolution required Active-active, near-zero RPO tolerance

WAL streaming configuration baseline

ini
# postgresql.conf β€” physical streaming replication
wal_level = replica                   # minimum for streaming + slots
max_wal_senders = 10                  # one per replica + headroom for basebackup
max_replication_slots = 10            # must match or exceed replica count
wal_compression = on                  # cuts WAN traffic ~30-50% on typical OLTP
checkpoint_completion_target = 0.9    # spreads I/O to reduce checkpoint spikes
wal_keep_size = 512MB                 # floor for slot-less replicas catching up

# Recovery side (postgresql.conf on standby, PG 12+)
restore_command = ''
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
primary_slot_name = 'replica_slot_1'  # tie WAL retention to this replica's LSN
hot_standby = on                      # allow read queries on standby
hot_standby_feedback = on             # prevent long-running replica queries from
                                      # causing table bloat on primary (HOT chains)

Failure modes: A missing max_wal_senders increase silently drops new replica connections. Setting wal_keep_size too low causes replicas that miss a checkpoint cycle to require a full pg_basebackup restart. hot_standby_feedback = on prevents vacuum from cleaning rows the replica is actively reading, but at the cost of bloat on the primary β€” monitor pg_stat_user_tables.n_dead_tup.


2. Synchronous vs asynchronous commit paths

Every write on the primary traverses a WAL pipeline. Where on that pipeline you require acknowledgment determines your RPO and your write latency ceiling. The two canonical modes and a quorum variant are covered in depth at Understanding Synchronous vs Asynchronous Replication, but the production decision is this:

  • synchronous_commit = off (async): The primary acknowledges the commit after flushing WAL locally. The replica may lag by one or many WAL segments. RPO is non-zero β€” typically milliseconds to seconds in normal operation, minutes under WAN congestion. This is the default and the right choice for most non-financial workloads.
  • synchronous_commit = on (sync): The primary waits for the standby to confirm flush before returning success to the client. RPO = 0 on the named standbys. Write latency increases by one round-trip (typically 1–10 ms local-AZ, 20–80 ms cross-region).
  • synchronous_commit = remote_write: A middle ground β€” the primary waits for the standby to have written WAL to its OS buffer but not necessarily flushed to disk. Loses the protection against a standby OS crash but is faster than full on.
ini
# postgresql.conf β€” quorum synchronous commit (PG 10+)
synchronous_standby_names = 'FIRST 1 (replica-az-a, replica-az-b, replica-az-c)'
# FIRST 1 = require one of the listed standbys; loses one AZ without write stall
# ANY 1   = any one named standby (useful when replica-az-c is geographically far)

synchronous_commit = on               # global default; override per transaction:
                                      # SET LOCAL synchronous_commit = off;

Failure modes: Naming a standby in synchronous_standby_names that is offline or lagging will block all writes until the standby reconnects or you remove it from the list. Monitor pg_stat_replication.sync_state and alert on sync_state = 'potential' for more than 30 seconds β€” that indicates your named synchronous standby has fallen back to async mode.


3. Consistency guarantees and read isolation

Once a replica exists, you must map every read path in the application to a consistency level. The three levels and the trade-offs they carry are detailed in Evaluating Consistency Models for Distributed Reads, which covers session consistency, monotonic reads, and read-after-write enforcement mechanisms.

The operational decision per read path:

Consistency level Read target Use case Anomaly risk
Linearizable (strong) Primary only Financial ledgers, inventory reservation, auth token creation None
Read-your-writes Primary for T seconds after write, then replica User profile updates, shopping cart Low β€” bounded to session
Causal / session Replica with LSN watermark check User feeds, paginated listings Low β€” per-session monotonic
Eventual Any replica, no freshness check Analytics, search indexing, reporting High β€” stale reads possible

Enforcing read-after-write at the application layer

python
# Pattern: route to primary for N seconds after a write (Python / SQLAlchemy)
import time

WRITE_FOLLOWTHROUGH_SECONDS = 2

def get_db_session(last_write_ts: float | None):
    """
    Return a session bound to the primary if a recent write occurred,
    otherwise route to the replica pool.
    """
    if last_write_ts and (time.time() - last_write_ts) < WRITE_FOLLOWTHROUGH_SECONDS:
        return primary_session_factory()
    return replica_session_factory()

For systems that cannot tolerate any stale reads on specific tables, use PostgreSQL’s pg_snapshot or the LSN-based consistency check:

sql
-- On replica: confirm it has caught up past the LSN recorded at write time
SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn AS is_consistent;
-- $1 = LSN captured on primary immediately after the write via pg_current_wal_lsn()

Failure modes: Applications that route all reads to replicas without a read-after-write guard exhibit a classic anomaly: a user creates a record, is immediately redirected to a list page, and the record does not appear because the replica has not yet replayed the insert. The same pattern causes inventory over-sells when a stock decrement on the primary has not reached the replica before the next availability check.


4. Connection routing and proxy patterns

Intelligent traffic distribution prevents primary overload and maximizes replica utilization. Routing decisions must account for connection churn, session state, and fault tolerance. Read scaling trade-offs in high-traffic applications covers how these factors interact with connection pool saturation and proxy resource usage.

For the connection routing and pooling strategies decision, three approaches dominate production deployments:

Routing strategy Overhead model State management Failover agility Best fit
Client-side (driver) Low β€” per-process Application-managed Fast β€” no proxy restart Microservices, polyglot stacks
Middleware proxy Centralized CPU/memory Proxy-managed Moderate β€” config reload Monoliths, strict compliance
DNS-based Effectively none OS resolver cache Slow β€” TTL-bound Legacy systems, low-churn

ProxySQL read/write split (MySQL)

sql
-- ProxySQL 2.x: route SELECT to hostgroup 100 (replicas), writes to 10 (primary)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
  (10, 1, '^SELECT',                               100, 1),
  (20, 1, '^(INSERT|UPDATE|DELETE|BEGIN|COMMIT)',  10,  1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

-- Verify routing decisions in real time
SELECT rule_id, hits, destination_hostgroup
FROM stats_mysql_query_rules
ORDER BY hits DESC;

PgBouncer pool-per-replica (PostgreSQL)

ini
# pgbouncer.ini β€” one pool targeting the replica
[databases]
mydb_read = host=replica-1.internal port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction            # transaction-level pooling for read replicas
max_client_conn = 2000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_check_query = SELECT 1
server_check_delay = 10

Failure modes: Routing loops trigger during rapid topology changes when a newly promoted replica still appears in the replicas hostgroup. Connection exhaustion occurs under burst traffic when default_pool_size is not sized to the replica’s max_connections ceiling. Single-proxy deployments become single points of failure β€” run at least two PgBouncer instances behind a load balancer, sharing the same userlist.txt.


5. Replication lag monitoring and alerting

Telemetry pipelines must track replication health, query distribution, and capacity baselines. Metric design determines incident response speed and false-positive rates. The full lag detection and handling workflow is in detecting and handling replication lag in real time.

The canonical Prometheus alert stack for a PostgreSQL streaming setup:

yaml
# prometheus/rules/replication.yml
groups:
  - name: replication_health
    rules:
      - alert: ReplicaLagCritical
        expr: pg_replication_lag_seconds > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replica {{ $labels.instance }} lag > 10 s β€” stale reads risk"
          runbook: "https://database-read-replicas.com/replication-lag-consistency-management/detecting-and-handling-replication-lag-in-real-time/"

      - alert: ReplicaLagWarning
        expr: pg_replication_lag_seconds > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} lag > 2 s (sustained)"

      - alert: ProxyPoolSaturation
        expr: pgbouncer_pools_active_connections / pgbouncer_pools_max_connections > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PgBouncer pool {{ $labels.pool }} is at {{ $value | humanizePercentage }} capacity"

      - alert: ReplicationSlotInactive
        expr: pg_replication_slots_active == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication slot {{ $labels.slot_name }} inactive β€” WAL accumulating"

Query the engine directly to correlate positions:

sql
-- PostgreSQL: replication stream health per connected standby
SELECT
  client_addr,
  state,                                        -- 'streaming' | 'catchup' | 'backup'
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  replay_lag,                                   -- wall-clock lag at destination
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS byte_lag
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

Failure modes: Metric gaps emerge during network partitions when the exporter itself loses connectivity to the replica. Transient GC pauses on the replica cause momentary lag spikes that fire false-positive alerts β€” tune for: 5m to distinguish sustained lag from transient jitter. Observability backpressure during incident storms can cause metric ingestion delays that mask the true lag trajectory.


6. Failover, recovery, and disaster workflows

Automated promotion must balance speed against split-brain risk. Manual intervention increases MTTR but prevents data loss during ambiguous network states.

Failover mode Promotion speed Split-brain risk Best fit
Fully automated (Patroni/repmgr) 10–30 seconds High without strong quorum Stateless apps, high RTO tolerance
Semi-automated (manual gate) 1–5 minutes Low β€” operator confirms network state Financial systems, regulated data
Manual orchestrated 10+ minutes Near-zero Legacy, strict compliance

Patroni configuration baseline

yaml
# patroni.yml β€” excerpt for a three-node PostgreSQL cluster
bootstrap:
  dcs:
    ttl: 30                          # seconds before leader lease expires
    loop_wait: 10                    # how often each node checks DCS
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1 MB β€” skip replicas further behind
    synchronous_mode: false          # set true to require sync standby ack

postgresql:
  use_pg_rewind: true                # allow demoted primary to rejoin without
                                     # full pg_basebackup after failover
  use_slots: true
  pg_hba:
    - host replication replicator 0.0.0.0/0 scram-sha-256

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

Promotion runbook (semi-automated):

  1. Confirm the primary is unreachable from at least two independent observers (not just the replica) β€” this rules out a network partition between the replica and primary.
  2. Check pg_last_wal_replay_lsn() on all candidates; promote the replica closest to the primary’s last known LSN.
  3. Execute pg_promote() or touch /tmp/postgresql.trigger on the chosen replica.
  4. Update the write-target DNS record or proxy hostgroup immediately after promotion completes.
  5. If the old primary comes back online, fence it with pg_rewind before allowing it to rejoin as a standby β€” never let it accept writes.

Failure modes: Delayed partition detection triggers split-brain when both nodes believe they hold the leader lease. Incomplete promotion leaves downstream replicas orphaned in recovery mode pointing at a demoted node. Stale DNS TTLs or proxy config caches route write traffic to the old primary β€” always set DNS TTL to 30 seconds or lower on write-target records.


7. Debugging and troubleshooting runbook

Systematic diagnostics isolate replication stalls, routing misconfigurations, and performance degradation under load.

sql
-- Step 1: Identify apply stalls on the replica
SELECT pid, wait_event_type, wait_event, query_start, state, query
FROM pg_stat_activity
WHERE backend_type = 'walsender' OR backend_type = 'walreceiver';

-- Step 2: Check for lock contention blocking WAL apply
SELECT blocking_pids, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Step 3: Slow query analysis on the replica (enable first)
ALTER SYSTEM SET log_min_duration_statement = 500;  -- log queries > 500ms
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- Step 4: Replication slot WAL accumulation
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

Failure modes: Log rotation gaps mask root causes during extended incidents β€” ensure log_rotation_age is set and logs are shipped to a central store before the local file rotates. Stale diagnostic data leads to incorrect remediation; always re-run diagnostic queries after each remediation step. Runbook drift from engine version changes (e.g., pg_stat_replication column renames between PG versions) causes failed recovery attempts β€” version-pin your runbook SQL and test it in staging after every major upgrade.


Child pages in this section

Each child page below covers one decision domain in depth. Read the one that matches your immediate operational problem; all of them cross-reference shared concepts defined in this page.

Topology and WAN design

Designing Multi-Region Read Replica Topologies Covers latency envelopes for cross-AZ and cross-region WAL shipping, data sovereignty placement constraints, and failover readiness testing for geographically distributed standby fleets. Includes a comparison of AWS RDS cross-region replicas vs self-managed PostgreSQL streaming across cloud VPCs, and step-by-step guidance in setting up cross-AZ read replicas.

Also covers the mechanics of PostgreSQL streaming replication vs MySQL GTID-based replication β€” diverging failure modes under network partition and how each engine tracks replication position.

Commit acknowledgment and RPO

Understanding Synchronous vs Asynchronous Replication Deep-dives into WAL pipeline stages, quorum commit modes (FIRST N vs ANY N), and the write-latency vs RPO trade-off surface. Includes annotated synchronous_standby_names configurations and a lag threshold calculator in how to calculate replication lag thresholds for SLA compliance. Also addresses when to use logical vs physical replication for read scaling.

Consistency and read isolation

Evaluating Consistency Models for Distributed Reads Maps consistency levels (linearizable, causal, eventual) to PostgreSQL isolation modes and routing constraints. Covers session consistency enforcement via LSN watermarking, monotonic read guarantees in connection pool deployments, and the read-after-write patterns that prevent user-visible anomalies.

Read capacity and throughput

Read Scaling Trade-offs in High-Traffic Applications Quantifies the throughput ceiling of common replica topologies under sustained read load, connection pool saturation patterns, and the diminishing returns of adding more replicas past the WAL apply bottleneck. Includes benchmarks for PgBouncer transaction vs session pooling modes under replica query mix.


Production readiness checklist


Frequently Asked Questions

What is the difference between synchronous and asynchronous replication?

Synchronous replication requires the primary to wait for at least one standby to confirm WAL receipt before acknowledging a commit to the client β€” guaranteeing zero data loss (RPO=0) at the cost of added write latency equal to one round-trip to the standby. Asynchronous replication lets the primary commit immediately and ships WAL in the background, trading a non-zero RPO (typically milliseconds to seconds) for lower write latency.

How do I prevent stale reads from read replicas?

The standard pattern is read-after-write routing: for a bounded window after any write (typically 1–5 seconds), route all reads for that session to the primary. For finer control, capture pg_current_wal_lsn() on the primary after the write and check pg_last_wal_replay_lsn() >= $captured_lsn on the replica before using it for a read.

When should I use cascading replication vs direct primary streaming?

Use cascading when WAN bandwidth between regions is expensive or constrained, or when the primary would otherwise need to maintain more than 8–10 concurrent wal_sender processes. Accept the trade-off of increased failover complexity: if the intermediate node fails, all downstream replicas must be re-pointed to the primary or another intermediate. Never put your only synchronous standby in a cascade chain.

What is the risk of replication slots to primary disk space?

A replication slot prevents the primary from recycling WAL segments until the subscribing replica has confirmed it has consumed them. If the replica disconnects or falls far behind, WAL accumulates on the primary without bound. Set max_slot_wal_keep_size (PG 13+) to cap this, and alert when retained WAL exceeds 20% of available disk space.

How many read replicas can a single primary support?

The practical ceiling is determined by max_wal_senders (default 10), available WAL sender CPU, and outbound network bandwidth. Beyond 8–10 direct-streaming replicas, consider a cascading architecture or logical replication to distribute WAL fan-out. Connection count to the primary is a separate constraint β€” use PgBouncer in front of each replica, not the primary, to avoid proxy connection pressure adding to the primary’s load.


← Home: Database Read Replicas & Connection Routing Patterns