pgrac layers a complete cluster resident-process system on top of PostgreSQL's process model. The postmaster fork mechanism remains unchanged; the BackendType enum gains 14 new values appended to the existing set. At steady state, a 4-node primary cluster runs approximately 24 processes per node — 10 PG-native processes (several with targeted adjustments) plus 14 new pgrac cluster daemons.
Understanding the process tree is the first step in diagnosing cluster problems: the backend_type column in pg_stat_activity maps directly to the BackendType enum, and the names visible in ps aux output — postgres: lms 0, postgres: lmd, and so on — correspond one-to-one with the design documents. This chapter builds a comprehensive mental model of the process tree: what each process is, why it exists as a separate process, the order in which processes start and shut down, and the mechanisms they use to communicate. Internal protocol details (GES message encoding, PCM state-machine transitions, RDMA QP management) are left to deep-dive pages; this chapter establishes only the structural vocabulary.
postmaster is the root of the entire process tree. After shared memory and lock structures are initialized, it forks all child processes in five sequential phases (Phase 0 through Phase 4). Child processes communicate with each other via shared memory, signals, and the inter-node Interconnect — there is no direct parent–child dependency chain between processes (postmaster is a supervisor, not an intermediate router).
               postmaster
                   │
    ┌──────────────┴──────────────┐
    │                             │
PG natives                   pgrac new
──────────                   ─────────────
walwriter (+BOC)             LMS × N (default 4)
bgwriter                     LMD
autovacuum                   LCK0
checkpointer                 LMHB
archiver                     DIAG
logical_repl                 RECO
...                          GRD0
                             ...
Several processes on the "PG natives" side carry pgrac-specific adjustments (see §8.2) — they do not retain entirely stock behavior. Every process on the "pgrac new" side is wholly new, appended to the BackendType enum beyond the PG-native values, without disturbing any existing ABI.
Steady-state process count estimate: 4 LMS workers + 1 LMD + 1 LCK + 1 LMON + 1 Interconnect Listener + 1 Heartbeat + 1 Undo Cleaner + 1 TT GC + 1 DIAG + 1 Cluster Stats + 1 Sinval Broadcaster = 14 mandatory pgrac processes; plus approximately 10 PG-native processes, for a total of roughly 24 processes / node (primary, steady state).
Recovery Coordinator, Recovery Worker (dynamic count), and MRP are not counted in the steady-state figure. Recovery Coordinator / Worker are forked on demand during reconfiguration only and exit when finished; MRP starts only in standby + ADG mode.
All PG-native processes are retained; several carry targeted extensions. The extension principle: if additional logic can be embedded in an existing process, no new process is created (BOC embedded in walwriter is the canonical example). Extensions take effect only in cluster mode — the single-instance code path is unchanged.
| PG-native process | Adjustment | Notes |
|---|---|---|
| postmaster | Initializes GES client at startup; registers with GRD; supervises all pgrac processes; decides restart vs. instance crash on child death | Core supervisor |
| walwriter | Embeds BOC: 100 μs flush cycle; SCN piggyback maintenance; per-thread WAL stream handling | Most significant adjustment — see note below |
| bgwriter | PCM state check before writing dirty blocks: flush only in X mode; skip if not X mode (let the PCM master coordinate) | Prevents cross-node buffer conflicts |
| checkpointer | Cross-node barrier checkpoint; triggers cluster checkpoint barrier | Related to #18 |
| archiver | Per-thread WAL archiving; thread_id-isolated archive paths; completion reported to GRD | Related to AD-009 |
| autovacuum launcher | XID wraparound calculation is aware of per-instance XID segmentation (AD-012 exception 10) | Logic unchanged; boundary-aware |
| startup | One-shot recovery only: crash recovery entry point; detects merged-recovery requirement; triggers Recovery Coordinator; exits after completion. No longer responsible for continuous standby apply | Continuous apply delegated to MRP |
| walsender | Retained, no changes | — |
| walreceiver | Per-thread receive under ADG | — |
| logger | Retained, no changes | — |
| logical rep launcher / worker | Retained, no changes | — |
walwriter-embedded BOC is the implementation host for the "BOC flush" described in the SCN chapter (§3.2). BOC fires frequently (every 100 μs) but each invocation does minimal work — a separate process would cost more than it saves. BOC is also tightly coupled to WAL flush timing (after commit, BOC advances the SCN), so embedding preserves temporal consistency. Oracle's BOC is likewise an embedded responsibility of LGWR; the design stays aligned.
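As a minimal sketch of why BOC lives inside walwriter, the following toy model (all names hypothetical: `WalWriterWithBOC`, `boc_tick`) shows the two steps a single tick performs: flush pending commit records, then expose the advanced SCN for piggybacking on cluster messages.

```python
class WalWriterWithBOC:
    """Sketch of the BOC duty embedded in walwriter (names hypothetical).

    Each 100 us tick flushes pending commit records, then returns the
    advanced SCN; the real process piggybacks that SCN on outgoing
    cluster messages rather than broadcasting it separately.
    """
    BOC_INTERVAL_US = 100  # the 100 us flush cycle described above

    def __init__(self):
        self.local_scn = 0         # stand-in for the shared SCN counter
        self.pending_commits = []  # commits awaiting the next BOC tick
        self.flushed = []          # stand-in for durable WAL

    def commit(self, xid):
        self.local_scn += 1        # each commit allocates the next SCN
        self.pending_commits.append((xid, self.local_scn))

    def boc_tick(self):
        # 1) flush WAL for all pending commits (stand-in for XLogFlush)
        self.flushed.extend(self.pending_commits)
        self.pending_commits.clear()
        # 2) the flushed SCN is the value piggybacked cluster-wide
        return self.local_scn
```

Because the tick does so little per invocation, hosting it in walwriter avoids a process whose wakeup overhead would dwarf its work, while keeping the SCN advance adjacent to the WAL flush it depends on.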
Important change to the startup process: The PG-native startup process, in standby mode, continuously applies WAL until promote. That behavior is delegated to MRP in pgrac. pgrac's startup performs only the startup-time crash recovery and exits afterward — a single, narrow responsibility that simplifies fault isolation.
pgrac adds 14 categories of background process, organized into 5 subsystem groups. The table below gives each process's name, steady-state count, and one-line responsibility. Detailed design is covered in §8.5 (IPC Model) and the per-feature deep-dive pages.
| # | Subsystem | Process | Steady-state count | One-line responsibility |
|---|---|---|---|---|
| 1 | Lock & Cache | LMS (Lock Master Service) | N=4 (default; GUC cluster.lms_workers tunable 1–16) | Handles cross-node PCM/GES remote requests, responds to buffer-ship requests, executes lock grant / revoke decisions, carries SCN piggyback. |
| 2 | Lock & Cache | LMD (Lock Manager Daemon) | 1 | Receives local-node enqueue requests, maintains wait queue (FIFO + 3-level priority), constructs wait-for graph fragments for deadlock detection. |
| 3 | Lock & Cache | LCK (Lock Process) | 1 | Holds instance-level locks (dictionary lock, cluster catalog lock), preventing LMS workers from being long-blocked by instance locks. |
| 4 | Lock & Cache | LMON (Lock Monitor) | 1 | Monitors cluster node state, coordinates Reconfiguration, triggers GRD rebuild and fence decisions, launches Recovery Coordinator. |
| 5 | Cluster Comms | Interconnect Listener | 1 | Listens on RDMA QP / TCP fallback port, receives messages and dispatches them to LMS / LMD / LCK worker queues. |
| 6 | Cluster Comms | Heartbeat | 1 | Sends heartbeats to all nodes every second; maintains node liveness state; marks SUSPECT at 3 s no-response, DEAD at 6 s and notifies LMON. |
| 7 | Undo / TT | Undo Cleaner | 1 | Scans local instance undo segments every 30 s, reclaims RECYCLABLE space, maintains the retention window, advances the WRAP counter. |
| 8 | Undo / TT | TT GC (Transaction Table GC) | 1 | Scans TT slots every 10 s, reclaims expired slots whose commit_scn has been surpassed by the cluster-wide oldest_active_scn for reuse by new transactions. |
| 9 | Observability | DIAG | 1 | Cross-node diagnostic snapshots: detects long-waits (default 60 s) and triggers hang dumps, receives diagnostic requests from other nodes, aggregates cluster logs. |
| 10 | Observability | Cluster Stats | 1 | Samples cluster metrics every 10 s, populates pg_stat_cluster_* views, cross-node aggregation of wait-event history (default retention 7 days). |
| 11 | Observability | Sinval Broadcaster | 1 | Batch-broadcasts local-node catcache / relcache invalidation messages to all other nodes and injects them into the peer sinval queue, maintaining catalog consistency. |
| 12 | Cluster Recovery | Recovery Coordinator | 1 (reconfig only) | Collects WAL from the dead node, coordinates k-way SCN merge, allocates Recovery Workers, coordinates PCM lock-state restoration; exits when complete. |
| 13 | Cluster Recovery | Recovery Worker | M dynamic (reconfig only) | Receives WAL segments assigned by the Coordinator, executes redo / undo apply, reports progress; exits when complete. |
| 14 | Cluster Recovery | MRP (Managed Recovery Process) | 1 (standby + ADG only) | Continuously receives the per-thread WAL stream from walreceiver, applies it centrally (aligned with Oracle's MRP model), advances apply_scn; exits on promote. |
Sinval Broadcaster is a critical safety process: after a crash, postmaster restarts it immediately; more than 3 restarts → instance crash. catcache / relcache inconsistency is a data-correctness issue, not a performance issue — it cannot be degraded.
BackendType enum extension: The 14 new processes correspond to 14 new enum values appended to miscadmin.h (B_CLUSTER_STATS / B_DIAG / B_HEARTBEAT / B_INTERCONNECT / B_LCK / B_LMD / B_LMON / B_LMS_WORKER / B_MRP / B_RECOVERY_COORD / B_RECOVERY_WORKER / B_SINVAL_BCAST / B_TT_GC / B_UNDO_CLEANER), appended after the existing values without altering any existing value, maintaining PG 16.13 ABI compatibility. The backend_type column in pg_stat_activity displays them automatically; no changes to the view layer are required.
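The Heartbeat thresholds in the table above (1 s send interval, SUSPECT at 3 s without a response, DEAD at 6 s) reduce to a small state function; the name `liveness_state` and its signature are illustrative, not pgrac API.

```python
def liveness_state(last_ack_age_s,
                   suspect_after_s=3.0, dead_after_s=6.0):
    """Map time since the last heartbeat response to a node state.

    Thresholds follow the process table: SUSPECT at 3 s with no
    response, DEAD at 6 s (at which point LMON is notified).
    """
    if last_ack_age_s >= dead_after_s:
        return "DEAD"     # notify LMON; triggers reconfiguration
    if last_ack_age_s >= suspect_after_s:
        return "SUSPECT"  # watched closely, not yet evicted
    return "ALIVE"
```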
The startup sequence reflects the cluster dependency chain: networking and heartbeats must exist before the lock service; the lock service must exist before recovery; recovery must complete before client connections are accepted.
postmaster
│
├── Phase 0: Foundation
│ └─ logger (logging first)
│
├── Phase 1: Cluster foundation ← 60 s timeout; failure → instance crash
│ ├─ Interconnect Listener (network layer ready)
│ ├─ Heartbeat (heartbeat established)
│ └─ LMON (join cluster / GRD sync)
│
├── Phase 2: Lock service ← 30 s timeout; failure → instance crash
│ ├─ LMS0..LMSn (parallel fork)
│ ├─ LMD
│ └─ LCK
│
├── Phase 3: Recovery (on demand) ← 600 s timeout (GUC cluster.recovery_timeout)
│ ├─ startup process (crash recovery entry point)
│ │ ├─ detect merged recovery → LMON launches Recovery Coordinator
│ │ ├─ Recovery Coordinator → spawn Recovery Workers
│ │ └─ startup / Coordinator / Workers all exit when complete
│ └─ [standby + ADG only] MRP starts
│
└── Phase 4: Normal service ← 30 s timeout; single-process failure restarts 3×
├─ checkpointer / bgwriter / walwriter (with embedded BOC)
├─ archiver / autovacuum launcher
├─ TT GC / Undo Cleaner / Sinval Broadcaster
├─ DIAG / Cluster Stats / logical rep launcher
└─ begin accepting client connections
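The phased startup above can be sketched as a driver table plus a loop. Phase names and the `fork_phase` callback are hypothetical; the timeout and failure-policy values are taken from the tree (Phase 0 has no stated timeout).

```python
# Phase timeouts and failure policies from the startup tree above.
PHASES = [
    ("phase0_foundation",         None, "crash"),    # logger first
    ("phase1_cluster_foundation",   60, "crash"),    # failure -> instance crash
    ("phase2_lock_service",         30, "crash"),
    ("phase3_recovery",            600, "crash"),    # GUC cluster.recovery_timeout
    ("phase4_normal_service",       30, "restart"),  # per-process restart, 3x
]

def run_startup(fork_phase):
    """Drive phased startup; fork_phase(name, timeout_s) returns True on
    success within the timeout. A sketch, not actual postmaster code."""
    for name, timeout_s, on_fail in PHASES:
        if not fork_phase(name, timeout_s):
            if on_fail == "crash":
                raise RuntimeError(f"{name} failed -> instance crash")
            return False  # Phase 4 falls back to per-process restarts
    return True           # begin accepting client connections
```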
Three critical dependency arrows:
1. Phase 1 → Phase 2: the lock service cannot start until the Interconnect and heartbeats are up — lock messages need a transport.
2. Phase 2 → Phase 3: recovery requires the lock service, because PCM/GES lock state must be restorable during replay.
3. Phase 3 → Phase 4: client connections are accepted only after recovery completes.
Shutdown is the reverse: The shutdown order is the inverse of startup. The critical invariant is "global locks must be released first (Phase 2 processes shut down) before the network is torn down (Phase 1 processes shut down)" — reversing this order leaves other nodes unable to detect the lock release, causing cluster state inconsistency.
Shutdown order:
1. Reject new connections
2. Wait for client backends to exit (default 30 s)
3. Phase 4 processes (Cluster Stats / DIAG / Sinval Broadcaster / Undo Cleaner / TT GC / archiver / autovacuum)
4. walwriter / bgwriter / checkpointer (final checkpoint)
5. [standby] MRP
6. Phase 2 processes (LCK / LMD / LMS0..LMSn) ← release global locks
7. Phase 1 processes (LMON / Heartbeat / Interconnect Listener) ← notify graceful leave
8. logger
9. postmaster exits
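The shutdown invariant stated above (global locks released before the network is torn down) can be made machine-checkable; step names here are illustrative.

```python
# Shutdown steps from the numbered list above, as a checkable order.
SHUTDOWN_ORDER = [
    "reject_new_connections",
    "drain_client_backends",           # default 30 s
    "phase4_daemons",                  # stats / DIAG / sinval / GC / archiver / AV
    "walwriter_bgwriter_checkpointer", # final checkpoint
    "mrp",                             # standby only
    "phase2_lock_service",             # LCK / LMD / LMS: release global locks
    "phase1_cluster_foundation",       # LMON / Heartbeat / Listener: graceful leave
    "logger",
    "postmaster_exit",
]

def locks_released_before_network_teardown(order):
    """Invariant from the text: Phase 2 processes must shut down
    before Phase 1 processes, so peers can observe the lock release."""
    return (order.index("phase2_lock_service")
            < order.index("phase1_cluster_foundation"))
```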
pgrac inter-process communication operates in two layers: same-node processes rely on shared memory + signals + in-process queues; cross-node processes rely on the message queues dispatched by the Interconnect Listener. The boundary between the two layers is explicit — there is no design that "directly accesses shared memory across nodes." All cross-node data access goes through protocol messages (PCM block ship / GES lock grant).
Same-node IPC:
| Mechanism | Purpose |
|---|---|
| Shared memory (SysV / mmap) | Lock structures, buffer pool, TT slots, GRD cache |
| Signals (SIGTERM / SIGUSR1 / SIGUSR2) | postmaster → child process control (identical to PG-native behavior) |
| Latch (SetLatch) | Wake waiting backends / workers (reuses PG mechanism) |
| In-process queue (lock-free ring buffer) | LMS dispatcher → LMS worker (see §8.5.1) |
Cross-node IPC:
| Mechanism | Purpose |
|---|---|
| Interconnect (RDMA / TCP) | All cross-node protocol messages (PCM / GES / SCN / heartbeat) |
| Listener → worker queue | Interconnect Listener dispatches inbound messages by resource hash |
| Worker → Listener queue | Workers deliver outbound messages to the Listener for unified sending |
LMS is the highest-concurrency component in the process tree: N workers (default 4) share a single Interconnect Listener entry point, but each worker handles an independent subset of resources — there is no inter-worker lock contention.
Sharding strategy: worker_id = hash(resource_id) % N. For PCM, resource_id is the three-tuple (tablespace_oid, relfilenode, block_no); for GES it is the lock resource name. The same resource is always handled by the same worker, preventing concurrent races.
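The sharding rule can be sketched directly; `lms_worker_for` and the md5 choice are assumptions for illustration, not the dispatcher's actual hash.

```python
import hashlib

def lms_worker_for(resource_id, n_workers=4):
    """worker_id = hash(resource_id) % N, as described above.

    md5 stands in for whatever hash the real dispatcher uses; the
    property that matters is determinism, so the same resource always
    routes to the same worker.
    """
    key = repr(resource_id).encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return h % n_workers

# PCM resource: (tablespace_oid, relfilenode, block_no); GES would use
# the lock resource name string instead.
pcm_res = (1663, 16384, 42)
assert lms_worker_for(pcm_res) == lms_worker_for(pcm_res)
```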
Message flow:
Cross-node message arrives
│
Interconnect Listener (single inbound point)
│
├─ read msg.resource_id
├─ compute worker_id = hash(resource_id) % N
└─ deliver to workers[worker_id].queue (lock-free ring buffer)
LMS worker inner loop:
while running:
    msg = my_queue.recv()                # blocking wait
    reply = handle_pcm_or_ges_msg(msg)   # PCM state machine / GES grant
    update_local_scn(msg.piggyback_scn)  # Lamport advance
    if reply is not None:
        outbound_queue.send(reply)       # hand to Listener for sending
The key property of this design: the Listener is a single-threaded fan-out; each worker is a single-threaded serial processor of its own queue. There is no scenario in the system where multiple writers compete for the same GRD entry (all messages for a given resource are serialized to the same worker), so RDMA write operations require no additional per-resource locking.
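The `update_local_scn` step in the worker loop follows the standard Lamport receive rule. This two-argument sketch is illustrative; whether pgrac applies the +1 on receive is an SCN deep-dive detail, but either way the local SCN never runs behind any SCN the node has observed.

```python
def update_local_scn(local_scn, piggyback_scn):
    """Textbook Lamport receive rule: max(local, remote) + 1.

    Illustrative functional variant of the worker-loop step; the real
    code updates a shared counter in place.
    """
    return max(local_scn, piggyback_scn) + 1
```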
Choosing N: The default N=4 corresponds to approximately 1.0–2.0 CPU cores for a 4-node OLTP cluster at 100K TPS. When N is too small, worker queues accumulate (monitor pg_stat_cluster_workers.queue_depth); when N is too large, LRU cache sharding loses efficiency (each worker caches fewer GRD entries). Production tuning guidance: consult the queue_depth field in the pg_stat_cluster_workers view.
Failure classification: LMS / LMD / LCK / LMON / Heartbeat / Interconnect Listener / Sinval Broadcaster are critical processes — after a crash, postmaster restarts them; more than 3 restarts → instance crash (fenced by other nodes). Undo Cleaner / TT GC / DIAG / Cluster Stats are gracefully-degradable processes — after a crash, postmaster restarts them; more than 3 restarts → WARNING only, no instance crash (GC falls behind or monitoring degrades, but cluster correctness is unaffected).
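The failure classification reduces to a small decision function in postmaster; `RestartPolicy` and its method names are hypothetical.

```python
class RestartPolicy:
    """Sketch of postmaster's per-child crash handling described above.

    Critical processes escalate to instance crash after 3 restarts;
    gracefully-degradable ones only log a WARNING past that point.
    """
    CRITICAL = {"LMS", "LMD", "LCK", "LMON", "Heartbeat",
                "Interconnect Listener", "Sinval Broadcaster"}
    MAX_RESTARTS = 3

    def __init__(self):
        self.crashes = {}

    def on_child_crash(self, proc):
        n = self.crashes[proc] = self.crashes.get(proc, 0) + 1
        if n <= self.MAX_RESTARTS:
            return "restart"
        return "instance_crash" if proc in self.CRITICAL else "warning_only"
```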
For deeper protocol details, refer to the following resources:
- BackendType enum definitions, pg_stat_cluster_workers view fields, the per-process failure-decision table, and the full GUC parameter list (cluster.lms_workers / cluster.recovery_timeout / cluster.heartbeat_interval, etc.)
- background-process-design.md — full specification for the 14 new process types + 7 adjusted PG-native processes: memory footprint estimates (steady-state ~375 MB for pgrac processes), CPU usage (4-node 100K TPS ~2.5–3.5 cores), implementation phases (Phase 1–4 mapped to Stages 1–20), test strategy