Building Multi-Primary Replication with PostgreSQL Logical Replication
PostgreSQL logical replication offers a straightforward path to unidirectional streaming. With careful design, it can also underpin multi-primary (multi-master) deployments by addressing two core challenges:
- Concurrency conflicts when multiple nodes modify the same row
- Replication loops that endlessly re-apply the same change across peers
Conflict management
Practical conflict handling typically relies on deterministic rules the system can enforce automatically.
- INSERT
  - Partition primary keys per node to avoid collisions. Use per-node sequences with different starting points and a fixed increment.
  - If duplicates can occur, decide a winner using a stable rule (e.g., latest timestamp or a preferred node). Add metadata columns to support this decision (e.g., node identifier and last_modified).
- UPDATE
  - Choose a rule such as "highest last_modified wins" or "preferred node wins." Maintain last_modified on each write to make decisions consistent.
- DELETE
  - Deletes usually do not conflict, but ensure they are replicated and applied in the desired order relative to updates.
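As a sketch of such a deterministic rule, here is a minimal Python simulation of "highest last_modified wins, preferred node breaks ties." The node names and the priority map are illustrative assumptions, not part of the setup below; the point is that every node evaluating the same two row versions reaches the same decision independently.

```python
from datetime import datetime

# Hypothetical node preference order: lower index wins exact-timestamp ties.
NODE_PRIORITY = {"node_1922": 0, "node_1923": 1, "node_1924": 2}

def resolve(local, remote):
    """Pick the winning row version.

    Each version is a dict carrying the metadata columns suggested above:
    'last_modified' (datetime) and 'node_id' (str). Highest last_modified
    wins; on an exact tie, the preferred node wins.
    """
    if local["last_modified"] != remote["last_modified"]:
        return max((local, remote), key=lambda v: v["last_modified"])
    return min((local, remote), key=lambda v: NODE_PRIORITY[v["node_id"]])

a = {"node_id": "node_1922",
     "last_modified": datetime(2024, 1, 1, 12, 0, 0), "payload": "old"}
b = {"node_id": "node_1923",
     "last_modified": datetime(2024, 1, 1, 12, 0, 5), "payload": "new"}
print(resolve(a, b)["payload"])  # later timestamp wins -> new
```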
Recovering from apply errors
PostgreSQL 10's logical apply worker halts on any error (e.g., a unique violation) and keeps retrying the failing change, stalling the subscription until the problem is resolved. To resume, either fix the data or advance the origin LSN carefully.
- Option 1: Correct the data on the subscriber so the conflicting change can apply.
  - Example: if an INSERT fails due to a unique violation, remove or modify the conflicting row on the subscriber; the retry then succeeds. If you disabled the subscription while repairing, run:
    ALTER SUBSCRIPTION sub_name ENABLE;
- Option 2: Skip ahead with pg_replication_origin_advance. Use with caution.
  - Determine the origin name and current LSN (for a subscription, the origin is named pg_<subscription oid>):
    SELECT * FROM pg_replication_origin_status; -- external_id and remote_lsn columns
  - Advance the replication origin past the failing change:
    SELECT pg_replication_origin_advance('pg_XXXXX', '0/XXXXXXXX');
Careless advancement can introduce data divergence; apply only when you fully understand the implications.
Loop prevention strategy
Bidirectional replication can produce infinite cycles if a change replicated from node A to B is then replicated back to A, and so on. A simple approach is to tag rows and adjust them upon apply.
- Add metadata columns to replicated tables:
  - Boolean flag that marks whether a given change originated locally
  - Optional origin identifier for traceability across nodes
- Use a BEFORE trigger on INSERT/UPDATE that:
  - Marks locally-initiated writes as local
  - Marks remotely-applied writes as non-local and prevents re-applying a change that already bounced through a peer
Trigger logic
Below is an implementation that detects logical apply workers via application_name and adjusts a per-row flag. (Checking current_setting('session_replication_role') = 'replica' is an alternative detection method: apply workers run with the replica role, and an ENABLE ALWAYS trigger still fires under it.) Returning NULL for a replicated change that already carries a "not local" mark prevents the second hop from writing back, which breaks the loop.
CREATE OR REPLACE FUNCTION prevent_cycle() RETURNS trigger AS $$
DECLARE
app_name text := current_setting('application_name', true);
BEGIN
-- Non-replication (client) backends initiate local DML
IF app_name IS NULL OR app_name !~* '^logical replication worker' THEN
NEW.from_local := true;
ELSE
-- DML originating from a logical apply worker
IF NEW.from_local IS NOT DISTINCT FROM false THEN
-- Already marked non-local by a peer; drop this write to stop the bounce
RETURN NULL;
ELSE
-- First arrival from a peer; accept but mark as non-local
NEW.from_local := false;
END IF;
END IF;
-- Keep a fresh timestamp for conflict-resolution policies
NEW.touched_at := clock_timestamp();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
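To see why the loop terminates after exactly one replicated hop, the trigger's branches can be traced as a small simulation. This Python sketch mirrors the logic above, with an is_apply_worker flag standing in for the application_name test:

```python
def apply_trigger(new_from_local, is_apply_worker):
    """Mirror of the prevent_cycle() trigger's control flow.

    Returns the from_local value that would be stored, or None when the
    write is dropped (the trigger's RETURN NULL), which breaks the loop.
    """
    if not is_apply_worker:
        return True      # local client write: mark as local
    if new_from_local is False:
        return None      # already bounced through a peer: drop it
    return False         # first arrival from a peer: accept, mark non-local

# Hop 0: client writes on node A -> stored with from_local = True
flag = apply_trigger(None, is_apply_worker=False)
# Hop 1: A's change applied on node B -> accepted, re-marked False
flag = apply_trigger(flag, is_apply_worker=True)
# Hop 2: B replicates it back to A -> dropped, cycle ends
result = apply_trigger(flag, is_apply_worker=True)
print(result)  # None
```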
End-to-end setup
Environment
Three PostgreSQL 10 instances on localhost:
- Ports: 1922, 1923, 1924
Schema and sequences
Create each node's sequence first; the table's column default references it. Per-node sequences with a fixed step keep primary keys disjoint (the step must be at least the number of nodes; 16 leaves headroom to add nodes later).
-- Node on port 1922
CREATE SEQUENCE node_seq INCREMENT BY 16 START WITH 1;
-- Node on port 1923
CREATE SEQUENCE node_seq INCREMENT BY 16 START WITH 2;
-- Node on port 1924
CREATE SEQUENCE node_seq INCREMENT BY 16 START WITH 3;
Replicated table with metadata for loop control and conflict policy.
-- Same DDL on all nodes
CREATE TABLE items (
    id int PRIMARY KEY DEFAULT nextval('node_seq'),
    payload text,
    created_at timestamp,
    from_local boolean,
    touched_at timestamp DEFAULT clock_timestamp()
);
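The disjointness of the per-node ID spaces can be sanity-checked outside the database. A small Python simulation of the three sequences: because the START values fall in different residue classes modulo the step, the generated IDs never collide.

```python
from itertools import islice

def node_seq(start, step=16):
    """Simulate CREATE SEQUENCE ... INCREMENT BY step START WITH start."""
    value = start
    while True:
        yield value
        value += step

# First 1000 IDs each node would hand out.
ids = {port: set(islice(node_seq(start), 1000))
       for port, start in [(1922, 1), (1923, 2), (1924, 3)]}

# Different starts mod 16 -> pairwise disjoint ID spaces.
assert not (ids[1922] & ids[1923])
assert not (ids[1923] & ids[1924])
assert not (ids[1922] & ids[1924])
```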
Trigger and activation
-- Same on all nodes
CREATE TRIGGER items_prevent_cycle
BEFORE INSERT OR UPDATE ON items
FOR EACH ROW EXECUTE PROCEDURE prevent_cycle();
-- Fire trigger on both local and replication apply
ALTER TABLE items ENABLE ALWAYS TRIGGER items_prevent_cycle;
Publications
-- Same on all nodes
CREATE PUBLICATION pub_items FOR TABLE items WITH (publish = 'insert, update, delete');
-- Alternatively, add tables later:
-- ALTER PUBLICATION pub_items ADD TABLE items;
Subscriptions
Create peer-to-peer subscriptions to form a full mesh. If the tables already contain data, consider adding WITH (copy_data = false) so the initial table snapshot is not copied a second time.
-- On node 1922
CREATE SUBSCRIPTION sub_1922_1923
CONNECTION 'host=127.0.0.1 port=1923 user=postgres dbname=postgres'
PUBLICATION pub_items;
CREATE SUBSCRIPTION sub_1922_1924
CONNECTION 'host=127.0.0.1 port=1924 user=postgres dbname=postgres'
PUBLICATION pub_items;
-- On node 1923
CREATE SUBSCRIPTION sub_1923_1922
CONNECTION 'host=127.0.0.1 port=1922 user=postgres dbname=postgres'
PUBLICATION pub_items;
CREATE SUBSCRIPTION sub_1923_1924
CONNECTION 'host=127.0.0.1 port=1924 user=postgres dbname=postgres'
PUBLICATION pub_items;
-- On node 1924
CREATE SUBSCRIPTION sub_1924_1922
CONNECTION 'host=127.0.0.1 port=1922 user=postgres dbname=postgres'
PUBLICATION pub_items;
CREATE SUBSCRIPTION sub_1924_1923
CONNECTION 'host=127.0.0.1 port=1923 user=postgres dbname=postgres'
PUBLICATION pub_items;
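For more than a handful of nodes, writing the full-mesh DDL by hand gets error-prone. A Python sketch that generates the statements, assuming the sub_<local>_<remote> naming scheme used above (host, user, and dbname are illustrative):

```python
def mesh_subscriptions(ports, user="postgres", dbname="postgres"):
    """Generate CREATE SUBSCRIPTION statements for a full mesh.

    Returns a dict mapping each local port to the statements to run there:
    one subscription per remote peer, n * (n - 1) in total.
    """
    stmts = {}
    for local in ports:
        stmts[local] = [
            f"CREATE SUBSCRIPTION sub_{local}_{remote}\n"
            f"  CONNECTION 'host=127.0.0.1 port={remote} "
            f"user={user} dbname={dbname}'\n"
            f"  PUBLICATION pub_items;"
            for remote in ports if remote != local
        ]
    return stmts

subs = mesh_subscriptions([1922, 1923, 1924])
assert sum(len(v) for v in subs.values()) == 6  # n * (n - 1) for 3 nodes
print(subs[1922][0])
```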
Workload generation
To avoid PK conflicts while stressing updates, use staggered ID spaces and overlapping UPDATE ranges.
test1.sql (target node: 1922)
\set vid random(1,300000)
INSERT INTO items (id, payload, created_at)
SELECT 3*(random()*100000)::int, 't', now()
ON CONFLICT (id) DO UPDATE SET payload = 'x', touched_at = clock_timestamp();
UPDATE items SET payload = 'u', touched_at = clock_timestamp() WHERE id = :vid;
test2.sql (target node: 1923)
\set vid random(1,300000)
INSERT INTO items (id, payload, created_at)
SELECT 3*(random()*100000)::int + 1, 't', now()
ON CONFLICT (id) DO UPDATE SET payload = 'x', touched_at = clock_timestamp();
UPDATE items SET payload = 'u', touched_at = clock_timestamp() WHERE id = :vid;
test3.sql (target node: 1924)
\set vid random(1,300000)
INSERT INTO items (id, payload, created_at)
SELECT 3*(random()*100000)::int + 2, 't', now()
ON CONFLICT (id) DO UPDATE SET payload = 'x', touched_at = clock_timestamp();
UPDATE items SET payload = 'u', touched_at = clock_timestamp() WHERE id = :vid;
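The staggered ID formulas can be sanity-checked: the three scripts insert into disjoint residue classes mod 3 (3k, 3k+1, 3k+2), so INSERTs never collide across nodes, while the UPDATE statements all draw :vid from the same 1..300000 range, so cross-node update conflicts do occur, which is the point of the test. A quick Python check of the insert ID spaces:

```python
# ID spaces generated by test1.sql, test2.sql, and test3.sql respectively.
spaces = {
    1922: {3 * k for k in range(100001)},      # multiples of 3
    1923: {3 * k + 1 for k in range(100001)},  # 3k + 1
    1924: {3 * k + 2 for k in range(100001)},  # 3k + 2
}

# Distinct residues mod 3 -> pairwise disjoint INSERT targets.
assert not (spaces[1922] & spaces[1923])
assert not (spaces[1923] & spaces[1924])
assert not (spaces[1922] & spaces[1924])
```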
Run the load for each node:
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 10 -j 10 -T 10 -p 1922
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 10 -j 10 -T 10 -p 1923
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 10 -j 10 -T 10 -p 1924
Validation
A quick consistency check compares a hash aggregate across nodes.
-- Run on each node
SELECT SUM(hashtext((id::text || payload || created_at::text))) FROM items;
If the sums match across all three nodes, the replicated state is aligned.
Delete propagation test
Delete on any node and verify it propagates once without bouncing:
-- Execute on node 1922
DELETE FROM items WHERE id IN (SELECT id FROM items ORDER BY id LIMIT 10);
-- After replication catches up, confirm the rows are gone on 1923 and 1924
SELECT COUNT(*) FROM items;