pgrac extends PostgreSQL's 9 native wait event categories with a design taxonomy of 10 Cluster:* classes totalling 46 events (full taxonomy in wait-events-design.md §2.1). This chapter is the complete catalog of that design target, with every event labelled by its Stage availability. Stage 2 has all 46 names registered in the catalog but wires none of them at runtime; they remain production targets for Stages 3-6. Descriptions are intentionally concise; timing diagrams, state machines, and alert thresholds belong to the design document itself.
This chapter reflects the full design target (46 events across 10 classes). spec-0.11 in Stage 0 already registered all 46 event names plus the 10 class IDs (0x10000000..0x19000000) into the catalog — they appear in the pg_wait_events system view — but Stage 2 wires zero pgstat_report_wait_start call sites for any of them (the L62 wait-event-registration ≠ runtime anti-pattern; see docs/spec-drafting-lessons.md §L62 / §L13). The events Stage 2 actually times are a separate, finer-grained set named in WAIT_EVENT_CLUSTER_* form — see §6.13.
The pgrac wait-event framework is source-compatible with PostgreSQL's pgstat_report_wait_start / pgstat_report_wait_end API. All new events hang under the Cluster: prefix; sub-categories use space + name (for example Cluster: PCM, Cluster: BufferShip). Each event's Stage tag follows these rules:
| Stage tag | Meaning |
|---|---|
| Stage 0 catalog | spec-0.11 registered the enum name; appears in pg_wait_events; no runtime call site (decorative surface) |
| Stage 3 design target | Planned for implementation in Stage 3 (MVCC + Undo + CR build) |
| Stage 4 design target | Planned for implementation in Stage 4 (WAL + crash recovery) |
| Stage 5 design target | Planned for implementation in Stage 5 (full RAC core + GES 8-mode) |
| Stage 6 design target | Planned for implementation in Stage 6 (RDMA / DRM / ADG / production) |
Stage tags are derived from docs/stage2-6-detailed-spec-roadmap.md's subsystem ownership; specific spec numbers are noted under each table. A Stage tag does not mean a spec has been frozen — only that the owning subsystem is scheduled for that Stage.
| Class | Design events | ID base | Focus |
|---|---|---|---|
| Cluster: GES | 5 | 0x10000000 | Global Enqueue Service: lock acquire, master query, convert / release |
| Cluster: PCM | 6 | 0x11000000 | Page Consistency Manager: block state transitions N→S / N→X / S→X, etc. |
| Cluster: BufferShip | 5 | 0x12000000 | Cache Fusion: CR / current block transfer across nodes |
| Cluster: SCN | 4 | 0x13000000 | System Change Number: BOC flush, piggyback merge, cross-node compare |
| Cluster: Reconfig | 5 | 0x14000000 | Membership change: GRD rebuild, lock recovery, fence, master selection, barrier |
| Cluster: Recovery | 5 | 0x15000000 | Crash recovery: WAL fetch, k-way merge, parallel apply, PCM state restore |
| Cluster: Sinval | 3 | 0x16000000 | Shared invalidation: cross-node catcache / relcache invalidation broadcast |
| Cluster: Interconnect | 5 | 0x17000000 | Interconnect transport: RDMA send / recv, TCP fallback, tier switch, connect retry |
| Cluster: Undo | 4 | 0x18000000 | Remote undo: CR build reads of remote undo blocks / TT slots, batch fetch, retention |
| Cluster: ADG | 4 | 0x19000000 | Active Data Guard: MRP apply, WAL receive, read snapshot, SCN sync |
| Total | 46 | — | Each class reserves 256 ID slots |
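The ID bases in the table advance by 0x01000000 per class. As a quick sanity check, the sketch below recomputes each base and the 46-event total on stock PostgreSQL. The stride is inferred from the 0x10000000..0x19000000 range rather than quoted from a spec, and the inline VALUES list is only an illustrative copy of the table, not a pgrac catalog.

```sql
-- Illustrative only: class list copied from the table above.
-- ASSUMPTION: consecutive classes are 0x01000000 apart (inferred from the range).
SELECT c.class_name,
       c.design_events,
       '0x' || to_hex(x'10000000'::int + (c.ord - 1) * x'01000000'::int) AS id_base,
       sum(c.design_events) OVER () AS total_events   -- same total on every row
FROM (VALUES
        (1,  'Cluster: GES',          5),
        (2,  'Cluster: PCM',          6),
        (3,  'Cluster: BufferShip',   5),
        (4,  'Cluster: SCN',          4),
        (5,  'Cluster: Reconfig',     5),
        (6,  'Cluster: Recovery',     5),
        (7,  'Cluster: Sinval',       3),
        (8,  'Cluster: Interconnect', 5),
        (9,  'Cluster: Undo',         4),
        (10, 'Cluster: ADG',          4)
     ) AS c(ord, class_name, design_events)
ORDER BY c.ord;
```

The last row should show id_base 0x19000000, and total_events should read 46 on every row, matching the Total line of the table.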
| Item | Value | Source |
|---|---|---|
| Design target (full taxonomy) | 10 classes / 46 events | wait-events-design.md §2.1 |
| Stage 0 catalog registration (no runtime) | 46 / 46 | spec-0.11 §3 enum one-shot registration |
| Stage 2 wired design-target events | 0 / 46 | 2026-05-16 full spec-corpus grep — no pgstat_report_wait_start(WAIT_EVENT_GES_*/PCM_*/...) call sites |
| Stage 2 events introduced and wired (outside taxonomy) | 23 | spec-2.2 / 2.4 / 2.5 / 2.6 / 2.18-2.23 / 2.28 / 2.29 — see §6.13 |
| pg_wait_events total row count (end of Stage 2) | 66 | spec-2.29 §D14 wait_events 65→66 baseline |
| View | Description |
|---|---|
| pg_wait_events (PG-native) | Lists every wait event name and type in the catalog; the 46 design-target events plus all Stage 2 wired Cluster events appear automatically |
| pg_stat_cluster_wait_events | Per-event cumulative call count and wait duration; during Stage 2 the design-target rows show calls = 0 (catalog-only) |
| pg_stat_cluster_wait_events_history | Time-bucketed slices (default 10-second buckets, ring buffer holds ~1 hour) |
| pg_stat_activity.wait_event / wait_event_type | Real-time wait state for the current backend; pgrac events surface here directly, with no schema change |
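To make the history view's sizing concrete, the following sketch shows how a timestamp would map to a 10-second bucket and how many buckets a one-hour ring would hold. It assumes that buckets are aligned to 10-second boundaries and that "~1 hour" means 3600 s; neither detail is stated in this chapter. It runs on stock PostgreSQL 14 or later (date_bin) and does not touch any pgrac view.

```sql
-- ASSUMPTION: buckets start on 10-second boundaries; ring covers exactly 3600 s.
SELECT date_bin(interval '10 seconds',
                timestamp '2026-05-16 12:00:37',       -- sample wait timestamp
                timestamp '2000-01-01 00:00:00')        -- alignment origin
           AS bucket_start,
       (extract(epoch FROM interval '1 hour') / 10)::int AS buckets_in_ring;
```

Under those assumptions the sample timestamp falls into the bucket starting at 12:00:30, and the ring holds 360 buckets.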
- Cluster: is the top-level category; sub-category names use a space separator (Cluster: PCM).
- Event names are space-separated phrases, for example PCM block read N→S, GES enqueue acquire.
- High-frequency events carry a (sampled) suffix and are sampled at 1/100; counters are aggregated by the Cluster Stats process.
- Every wait is bracketed by pgstat_report_wait_start / pgstat_report_wait_end; debug builds use assertions to detect nesting violations (INV1 no-overlap principle).

GES (Global Enqueue Service) wait events cover global lock acquire, mode conversion, release acknowledgment, and master node lookup. Stage 2 has shipped LMS / LMD daemon skeletons + the 7-step state machine + cross-node deadlock detection (spec-2.18-2.23), but none of those paths inject the 5 GES-taxonomy wait events; what Stage 2 actually times are GesS4Wait / GesS4Reply / GesReplyWait (finer-grained, see §6.13).
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| GES enqueue acquire | Stage 5 design target | Lock request enters the LMD queue, waiting for master to grant | 8 μs (no contention) / ms–s range (holder not yet released) |
| GES enqueue convert | Stage 5 design target | Held lock needs mode upgrade (e.g. S→X) | 10–15 μs (no contention) / tens of ms (coordination) |
| GES enqueue release ack | Stage 5 design target | After LockRelease sends a RELEASE message, waits for master RELEASE_ACK | 5–7 μs |
| GES master query | Stage 5 design target | GRD cache miss; broadcasts a query for the resource master node_id | 5–10 μs |
| GES local fast path (sampled) | Stage 5 design target | This node is the master and no contention exists; pure local GRD lookup | < 1 μs (sampled at 1/100) |
Stage ownership: Stage 5 spec 5.1 (Full GES 8-mode lock matrix). Stage 0 catalog enum registered in spec-0.11 §3.
PCM (Page Consistency Manager) wait events cover block-level state transitions: N (no copy) → S / X, cross-node downgrades, and the temporary upgrade triggered by ITL cleanout. The PCM 9-state machine sits in Stage 2 spec-2.26 (not yet frozen); the activated path lands late Stage 2 + Stage 3.
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| PCM block read N→S | Stage 3 design target | First-time block read on this node; no local copy, so fetch S from master / holder | 9–18 μs |
| PCM block read N→X | Stage 3 design target | UPDATE / DELETE path requests exclusive mode directly | 10–15 μs |
| PCM block write S→X | Stage 3 design target | This node holds an S copy; a write triggers an S→X upgrade | 12–20 μs |
| PCM block convert wait | Stage 3 design target | Master convert queue is non-empty; this request enqueues and waits to be scheduled | < 50 μs (normal) / ms range (peak) |
| PCM block downgrade | Stage 3 design target | LMS receives a downgrade request from master; includes PI creation | 3–10 μs (no dirty) / 1–100 ms (with dirty writeback) |
| PCM ITL cleanout | Stage 3 design target | Block read detects an ITL cleanup is needed; triggers a temporary S→X upgrade | 15–25 μs |
Stage ownership: Stage 3 spec 3.6-3.8 (ITL slot / cleanout) + Stage 2 spec-2.26 (PCM 9-state activation, not yet frozen).
BufferShip wait events cover the Cache Fusion path — CR block construction and cross-node transfer, plus current-block send / receive. Buffer ship cr build is attributed to the LMS process rather than the requesting backend (LMS-attributed). The Cache Fusion 2-way / 3-way protocol sits in Stage 2 spec-2.30 / 2.31 (not yet frozen).
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| Buffer ship cr build (LMS-attributed) | Stage 3 design target | LMS constructs a CR version of the block from the undo chain | 5–15 μs |
| Buffer ship cr send | Stage 3 design target | CR block transmitted over IC to the requester | 2–5 μs (with RDMA) |
| Buffer ship cr receive | Stage 3 design target | Backend waits for the CR block to arrive and be installed into the CR chain | 10–20 μs (2-way / 3-way) |
| Buffer ship current send | Stage 3 design target | Holder LMS ships the current block to the requester | 2–5 μs |
| Buffer ship current receive | Stage 3 design target | DML path receives a current block shipped from the peer | 10–18 μs |
Stage ownership: Stage 3 spec 3.12-3.13 (CR block construction / cache); depends on Stage 2 spec-2.30/2.31 Cache Fusion protocol.
SCN (System Change Number) wait events cover commit-boundary SCN flushes, in-message piggyback advancement, cross-node comparison, and broadcast. Stage 1 (spec-1.17) already shipped the walwriter BOC framework; Stage 2 spec-2.9-2.12 shipped SCN broadcast / piggyback / cross-instance lookup observability, but none of those paths inject the 4 SCN-taxonomy wait events.
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| SCN BOC flush wait | Stage 3 design target | Batch on Commit: commit boundary flushes the accumulated SCN advance | ~100 μs |
| SCN piggyback merge (sampled) | Stage 3 design target | Inbound message carries a piggyback SCN > local_scn; CAS advance | 0.5–1 μs (sampled at 1/100) |
| SCN cross-node compare | Stage 3 design target | Cross-node SCN comparison (scn_recovery_cmp(), includes LSN + node_id tie-break) | 1–3 μs |
| SCN advance broadcast | Stage 3 design target | This node broadcasts its SCN advance to other nodes | 3–8 μs |
Stage ownership: Stage 3 spec 3.9 (cluster snapshot read_scn) + existing walwriter BOC framework (Stage 1). Stage 2 spec-2.9-2.12 ship observability only; no wait event wiring.
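The SCN cross-node compare row mentions an LSN + node_id tie-break. The sketch below illustrates what such a total order could look like on stock PostgreSQL. The precedence (SCN first, then LSN, then node_id) and all sample values are assumptions made for illustration; the actual rules of scn_recovery_cmp() are defined in the design document, not here.

```sql
-- ASSUMPTION: compare by scn, break ties by lsn, then by node_id.
-- The rows are made-up records from four hypothetical nodes.
SELECT scn, lsn, node_id
FROM (VALUES
        (1000, '0/3000060'::pg_lsn, 2),
        (1000, '0/3000028'::pg_lsn, 3),
        (1000, '0/3000028'::pg_lsn, 1),
        ( 999, '0/4000000'::pg_lsn, 4)
     ) AS r(scn, lsn, node_id)
ORDER BY scn, lsn, node_id;
```

With this ordering the SCN 999 record sorts first despite its higher LSN, and the two records that tie on both SCN and LSN are separated by node_id (node 1 before node 3).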
Reconfig wait events appear during node join / leave / failure. Detailed Freeze / Rebuild / Thaw sequencing is covered in Chapter 5. The Stage 2 reconfig minimum closure (spec-2.5 / 2.6 / 2.28 / 2.29) added one fine-grained wait event BgProcLmonReconfigTick (in the BgProc class — see §6.13), but did not implement the 5 Reconfig-taxonomy events; the real 5-phase breakdown is deferred to Stage 5 spec 5.12 (clean leave / fail-stop reconfig).
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| Reconfig: GRD rebuild | Stage 5 design target | Rebuild of the global resource directory after a node joins or leaves | 50–200 ms (4 nodes) / 300 ms–1 s (16 nodes) |
| Reconfig: lock recovery | Stage 5 design target | Reassignment / release of global locks held by a failed node | 20–100 ms |
| Reconfig: fence wait | Stage 5 design target | Wait for fence-lite self-isolation to take effect or voting disk lease to expire | 1–5 s |
| Reconfig: master selection | Stage 5 design target | Coordinator elected via min(survivor_set) | 1–10 ms |
| Reconfig: barrier wait | Stage 5 design target | All active nodes reach the global barrier synchronization point | 10–100 ms |
Stage ownership: Stage 5 spec 5.12-5.14 (reconfig + node join / leave). The Stage 2 reconfig minimum closure timing belongs to BgProcLmonReconfigTick in §6.13.
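The master selection rule min(survivor_set) from the table can be sketched in one query. The four-node membership and the failed node below are hypothetical values chosen for illustration only.

```sql
-- Hypothetical membership: nodes 1..4, node 1 has failed.
SELECT min(node_id) AS coordinator
FROM (VALUES (1), (2), (3), (4)) AS members(node_id)
WHERE node_id NOT IN (1);   -- exclude the failed set to get survivor_set
```

Here the survivor set is {2, 3, 4}, so the query returns 2 as coordinator.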
Recovery wait events appear during crash recovery — Recovery Workers fetch WAL, perform k-way merge by SCN, apply in parallel, replay undo, and finally restore PCM state. The full crash recovery subsystem is in Stage 4 scope.
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| Recovery: WAL fetch | Stage 4 design target | Fetch WAL segments from shared storage / failed node | 10–50 ms per GB (depends on storage bandwidth) |
| Recovery: k-way merge | Stage 4 design target | Multi-node WAL streams merged in commit_scn order | 10–100 ms per 1 GB WAL |
| Recovery: apply per-thread | Stage 4 design target | Worker applies a segment of merged WAL | Throughput ~100 MB/s (per worker) |
| Recovery: undo replay | Stage 4 design target | Replay undo for uncommitted transactions to restore a consistent snapshot | 10–200 ms (depends on in-flight volume) |
| Recovery: PCM state restore | Stage 4 design target | After apply, rebuilds PCM state: which block is held with which lock on which node | 50–500 ms |
Stage ownership: Stage 4 spec 4.3-4.13 (Recovery Coordinator / Worker / k-way merge / GES/PCM/Undo recovery).
Sinval wait events cover cross-node catcache / relcache invalidation broadcast. Every schema change on the DDL path triggers one cluster-wide sinval broadcast. The Sinval Broadcaster + cluster invalidation sit in Stage 2 spec-2.33 / 2.34 (not yet frozen).
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| Sinval broadcast send | Stage 3 design target | This node sends a sinval invalidation message to all peers | 2–4 μs (RDMA) |
| Sinval broadcast receive | Stage 3 design target | Sinval broadcast received from another node; injected into local sinval queue | 3–5 μs |
| Sinval inject local queue | Stage 3 design target | Message moves from inbound buffer into the backend-local queue | 0.5–1 μs |
Stage ownership: Stage 2 spec-2.33/2.34 (SI broadcaster / catalog cluster invalidation, not yet frozen) → Stage 3 dual-dimension visibility companion.
Interconnect wait events cover the IC transport layer: RDMA send / recv (Stage 6+ introduces the hardware RDMA tier), TCP fallback (Stage 2 default Tier 4), tier switch, and connect retry.
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| Interconnect RDMA send | Stage 6 design target | After ibv_post_send(), wait for the send completion event | 1–3 μs |
| Interconnect RDMA recv | Stage 6 design target | After ibv_post_recv(), wait for the recv completion event | 1–3 μs |
| Interconnect TCP fallback | Stage 6 design target | AD-007 Tier 4: cluster messages sent / received over TCP | 50–200 μs |
| Interconnect tier switch | Stage 6 design target | RDMA ↔ TCP tier switch; messages buffered and re-routed during the transition | 10–50 ms (transition instant) |
| Interconnect connect retry | Stage 6 design target | Peer reconnect: retry connect with exponential backoff | 50 ms → 1 s (exp backoff) |
Stage ownership: Stage 6 spec 6.1-6.6 (RDMA provider vtable / dual transport / tier1 send-recv / memory registration / zero-copy). Stage 2 actual IC timing lives in §6.13's six IcTcp* / IcHeartbeatWait / IcReconnect events — providing finer TCP-tier breakdown than the design taxonomy.
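The Interconnect connect retry row gives a 50 ms → 1 s exponential backoff range. One possible schedule that fits that range is sketched below; the doubling factor and the hard cap at 1000 ms are assumptions, since this chapter states only the endpoints.

```sql
-- ASSUMPTION: delay doubles per attempt, starting at 50 ms, capped at 1000 ms.
SELECT attempt,
       least(50 * power(2, attempt), 1000)::int AS delay_ms
FROM generate_series(0, 6) AS attempt;
```

Under those assumptions the delays run 50, 100, 200, 400, 800 ms and then stay at the 1000 ms cap from the sixth attempt onward.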
The "alert on any occurrence" semantics for the design-target Interconnect TCP fallback only hold once Stage 6+ RDMA is in place — on a Stage 2 cluster, TCP is the default and only path, so that alert rule does not apply. During Stage 2, monitor the §6.13 events IcTcpRecv / IcTcpSend tail latency and IcReconnect frequency instead.
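A rough way to follow that Stage 2 monitoring advice is sketched below, using only the view and column names that appear in this chapter. The history view stores per-bucket totals rather than individual samples, so the worst per-bucket average is used here as a coarse stand-in for tail latency; that substitution, and the 10-minute window, are choices made for this example rather than documented practice.

```sql
-- Sketch: last 10 minutes of Stage 2 interconnect events.
-- worst_bucket_avg_us is only a proxy for tail latency (per-bucket averages).
SELECT wait_event,
       sum(calls)                                AS calls_10min,
       max(total_wait_us / NULLIF(calls, 0))     AS worst_bucket_avg_us
FROM pg_stat_cluster_wait_events_history
WHERE wait_event IN ('IcTcpRecv', 'IcTcpSend', 'IcReconnect')
  AND bucket_start > now() - interval '10 minutes'
GROUP BY wait_event
ORDER BY wait_event;
```

For IcReconnect the interesting column is calls_10min (reconnect frequency); for IcTcpRecv / IcTcpSend it is worst_bucket_avg_us.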
Undo wait events cover remote undo access — reading undo blocks / TT slots from other nodes during CR construction, batched undo segment fetch, and undo recycle blocked by a long-running transaction. The Undo / TT subsystem is in Stage 3 scope.
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| Undo remote read | Stage 3 design target | Reads a remote node's undo block during CR build | 10–20 μs |
| Undo TT lookup remote | Stage 3 design target | Queries a remote Transaction Table slot to obtain commit_scn | 8–15 μs |
| Undo segment fetch | Stage 3 design target | Batched fetch of a remote undo segment (CR build optimization path) | 20–50 μs per segment |
| Undo retention wait | Stage 3 design target | Long-running transaction holds old undo, blocking undo segment recycle | Business-dependent (s–min) |
Stage ownership: Stage 3 spec 3.1-3.5 (Undo segment / record / tablespace / TT slot) + spec 3.14-3.15 (retention / Undo Cleaner).
ADG (Active Data Guard) wait events cover standby-side MRP apply, WAL receive lag, read query waiting for apply_scn, and cross-instance SCN convergence. ADG is in Stage 6 scope.
| Event Name | Stage | Description | Typical Time |
|---|---|---|---|
| ADG MRP apply wait | Stage 6 design target | Standby MRP (Managed Recovery Process) applying the current WAL segment | WAL-rate dependent (ms–s) |
| ADG WAL receive lag | Stage 6 design target | walreceiver lags behind primary, caused by network or load | Normal < 100 ms / abnormal seconds range |
| ADG read snapshot wait | Stage 6 design target | Read query waits for apply_scn to advance to the required snapshot scn | 10–100 ms (typical) |
| ADG SCN sync wait | Stage 6 design target | Cross-instance SCN convergence: standby SCN aligns with primary | 1–10 ms |
Stage ownership: Stage 6 spec 6.10-6.11 (ADG physical standby skeleton / apply lag / read-only service).
The table below tallies, for each of the 10 design-target classes in §6.2-§6.11, how many events Stage 2 has actually wired vs left as catalog-only. Every one of the 46 events has its enum name registered in the catalog as of Stage 0 (spec-0.11) — they appear in pg_wait_events — but none has a pgstat_report_wait_start call site (per the 2026-05-16 full spec corpus grep).
| Class | Design events | Stage 2 wired | Stage 2 catalog-only |
|---|---|---|---|
| Cluster: GES | 5 | 0 | 5 |
| Cluster: PCM | 6 | 0 | 6 |
| Cluster: BufferShip | 5 | 0 | 5 |
| Cluster: SCN | 4 | 0 | 4 |
| Cluster: Reconfig | 5 | 0 | 5 |
| Cluster: Recovery | 5 | 0 | 5 |
| Cluster: Sinval | 3 | 0 | 3 |
| Cluster: Interconnect | 5 | 0 | 5 |
| Cluster: Undo | 4 | 0 | 4 |
| Cluster: ADG | 4 | 0 | 4 |
| Total | 46 | 0 | 46 |
Stage 2 specs introduced and wired 23 finer-grained events outside the design taxonomy. They describe what the current implementation layer is actually doing (LMS / LMD daemon startup / drain / idle / scan, IC TCP send / receive, CSSD / Qvotec / LMON ticks, and so on) — at a finer granularity than the design-target events. They follow the PG CamelCase naming convention (BgProcLmonReconfigTick), which distinguishes them from the design taxonomy's space-separated names (GES enqueue acquire).
| Event Name | Spec Source | Description |
|---|---|---|
| BgProcCssdMainLoop | spec-2.5 §D8 | CSSD heartbeat daemon main-loop idle wait |
| BgProcQvotecMainLoop | spec-2.6 §Hardening | Voting disk quorum daemon main-loop idle wait |
| BgProcLmonReconfigTick | spec-2.29 §D9 | LMON tick executing reconfig coordinator decision + epoch++ + ProcSignal broadcast path |
Additional BgProcLmonMainLoop, BgProcLckMainLoop, BgProcDiagMainLoop, and BgProcClusterStatsMainLoop events shipped in Stage 1 (spec-1.11 / 1.12 / 1.13 / 1.14) and are not enumerated here.
| Event Name | Spec Source | Description |
|---|---|---|
| IcTcpAccept | spec-2.2 §2.5 | Listener fd waiting for an incoming connection |
| IcTcpConnect | spec-2.2 §2.5 | Active-side nonblocking connect waiting for socket-writable |
| IcTcpRecv | spec-2.2 §2.5 | Per-peer socket waiting readable (envelope / payload reads) |
| IcTcpSend | spec-2.2 §2.5 | Per-peer socket waiting writable (short-write blocking) |
| IcHeartbeatWait | spec-2.2 §2.5 | Waiting for the next heartbeat tick (WaitLatch(WL_TIMEOUT)) |
| IcReconnect | spec-2.2 §2.5 | Reconnect backoff sleep after connect failure |
| IcChunkReassembly | spec-2.4 §D10 | Multi-segment frame reassembly wait |
| IcEpochEnforceDrop | spec-2.4 §D10 | Stale-epoch inbound message drop timing |
| Event Name | Spec Source | Description |
|---|---|---|
| GesS4Wait | spec-2.20 §D12 / spec-2.21 §D4 | S4 remote wait: backend has sent GES_REQUEST, waiting for GES_GRANT/REJECT |
| GesS4Reply | spec-2.21 §D15 | S4 reply processing window |
| GesReplyWait | spec-2.23 §D12 | D2 backend cross-node reply wait (BAST / production deadlock path) |
| Event Name | Spec Source | Description |
|---|---|---|
| LmsStartup | spec-2.18 §Q12 D | LMS daemon startup phase |
| LmsDrain | spec-2.18 §Q12 D | LMS daemon drain phase |
| LmsIdle | spec-2.18 §Q12 D | LMS daemon waiting for work_queue wake |
| LmdStartup | spec-2.19 §Q12 D | LMD daemon startup phase |
| LmdScan | spec-2.19 §D12 + spec-2.20 §I9 (semantic fill) + spec-2.22 §D2 | LMD Tarjan SCC scan in progress |
| LmdIdle | spec-2.19 §Q12 D | LMD daemon waiting for submission counter wake |
| LmdProbe | spec-2.22 §D6 | Handler processing a DEADLOCK_PROBE message |
| LmdProbeCollect | spec-2.23 §D8 | Coordinator awaiting all PROBE REPORT replies |
| Event Name | Spec Source | Description |
|---|---|---|
| ClusterFenceBackendInterruptCheck | spec-2.28 §D9 | Backend checking the freeze flag inside ProcessInterrupts (~1 μs; the hook lets perf views identify freeze-induced abort sources) |
These 23 Stage 2 events, plus the Stage 1 BgProc / SharedFs / StartupPhase events, plus the 46 catalog-only design-target enums from spec-0.11, make up the 66 rows in pg_wait_events at the end of Stage 2 (spec-2.29 §D14 baseline).
The 5 design-taxonomy Cluster: Reconfig events and the Stage 2 BgProcLmonReconfigTick event do not overlap — the former breaks down by reconfig business phase (GRD rebuild / lock recovery / fence wait / master selection / barrier wait), while the latter times the LMON tick execution path. When Stage 5 spec-5.12 lands, BgProcLmonReconfigTick's scope will be subsumed by or co-exist with the 5 design-taxonomy events; this chapter will be reorganised at that point.
Real-time inspection of the current backend's wait state, plus historical aggregates:
```sql
-- Current wait events across all backends (PG-native view)
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND wait_event_type LIKE 'Cluster:%'
ORDER BY wait_event_type, wait_event;
```
```sql
-- Per-event cumulative call count + wait time (pgrac-added)
-- During Stage 2, design-target rows have calls = 0; §6.13 wired events have real values
SELECT wait_event_type, wait_event,
       calls, total_wait_us,
       (total_wait_us / NULLIF(calls, 0))::int AS avg_us
FROM pg_stat_cluster_wait_events
WHERE wait_event_type LIKE 'Cluster:%'
ORDER BY total_wait_us DESC
LIMIT 20;
```
```sql
-- Time-bucketed history (default 10 s buckets, ring buffer keeps ~1 hour)
SELECT bucket_start, wait_event, calls, total_wait_us
FROM pg_stat_cluster_wait_events_history
WHERE wait_event = 'IcTcpRecv'
  AND bucket_start > now() - interval '10 minutes'
ORDER BY bucket_start;
```
```sql
-- List every available wait event (PG-native; includes catalog-only design-target + Stage 2 wired)
SELECT type, name, description
FROM pg_wait_events
WHERE type LIKE 'Cluster:%'
ORDER BY type, name;
```
Typical Stage 2 diagnostic flow: first scan pg_stat_activity for the live distribution — currently observable event names should come from §6.13's 23-event set (the 46 design-taxonomy events have calls = 0 by definition). If a WAIT_EVENT_GES_* / WAIT_EVENT_PCM_* event from the Stage 0 catalog-only set ever appears in real time, it means a spec has wired it ahead of schedule — review the spec drafting trail.
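The "wired ahead of schedule" check described above can also be run against the cumulative view, which catches occurrences that a momentary pg_stat_activity sample would miss. The sketch below relies on one heuristic taken from §6.13: design-taxonomy names are space-separated while Stage 2 events use CamelCase. Treating "name contains a space" as the marker for a design-target event is an assumption of this example, not a documented catalog flag.

```sql
-- Sketch: design-taxonomy events that have been observed at runtime.
-- ASSUMPTION: a space in wait_event identifies a design-taxonomy name.
SELECT wait_event_type, wait_event, calls, total_wait_us
FROM pg_stat_cluster_wait_events
WHERE wait_event_type LIKE 'Cluster:%'
  AND wait_event LIKE '% %'
  AND calls > 0
ORDER BY calls DESC;
```

On a Stage 2 cluster this is expected to return no rows; any row it does return names an event whose spec trail is worth reviewing.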
- BgProcLmonReconfigTick
- docs/wait-events-design.md: complete 46-event taxonomy, timing diagrams, alert thresholds, Wait Event ID allocation
- docs/stage2-6-detailed-spec-roadmap.md: per-Stage spec lists (the basis for this chapter's Stage labels)
- docs/spec-drafting-lessons.md §L62 / §L13: the wait-event-registration ≠ runtime anti-pattern (why this chapter labels Stage 0 catalog-only)