pgrac extends PostgreSQL's existing 9 wait event categories with 10 new Cluster:* sub-categories, introducing 46 new wait events (Wait Event ID range 0x1000_0000–0x19FF_FFFF). This chapter is the reference catalog for those events: event name, type, start/end definitions, typical durations, and trigger scenarios. Descriptions are intentionally concise; for deeper detail refer to the wait-events-design.md design document linked in Cross-references.
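The stated ID range fits a simple layout: with 10 sub-categories, a class ID of 0x10–0x19 can occupy the top byte and a per-class event index the low 24 bits. The sketch below illustrates that layout; the encoding scheme and function names are assumptions for illustration, not the actual pgrac definitions.

```python
# Hypothetical pgrac wait-event ID layout (an assumption, for illustration):
#   bits 24-31: class ID, 0x10..0x19 for the 10 Cluster:* sub-categories
#   bits 0-23:  per-class event index

def encode_event_id(class_id: int, event_index: int) -> int:
    assert 0x10 <= class_id <= 0x19, "pgrac classes occupy 0x10-0x19"
    assert 0 <= event_index < (1 << 24)
    return (class_id << 24) | event_index

def decode_event_id(event_id: int) -> tuple:
    """Split an event ID back into (class_id, event_index)."""
    return event_id >> 24, event_id & 0xFFFFFF

# Under this layout the full pgrac range is 0x1000_0000 .. 0x19FF_FFFF:
lowest = encode_event_id(0x10, 0)
highest = encode_event_id(0x19, 0xFFFFFF)
```

Ten class IDs times 2^24 event slots comfortably covers the 46 events defined in this chapter.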
All new pgrac events are attached under the Cluster: top-level category, with the sub-category name following the colon to form a two-level hierarchy (e.g. Cluster: PCM, Cluster: GES).
| Rule | Description | Example |
|---|---|---|
| Prefix | All pgrac events belong to the Cluster: top-level category | Cluster: PCM |
| Sub-category separator | Colon + space separates the top-level category from the sub-category name | Cluster: GES |
| Event name | Subsystem prefix followed by a short descriptive phrase; lock-mode transitions are written with N/S/X arrows | PCM block read N→S |
| Sampled event suffix | Very short paths (< 5 μs) are annotated with (sampled) | GES local fast path |
Why are precise start/end points required? Imprecise event boundaries cause two classes of problems: (1) overlap — the same time interval is double-counted by two events, breaking end-to-end summation; (2) gap — CPU time is incorrectly attributed as wait time, masking true bottlenecks. pgrac enforces strictly paired calls to pgstat_report_wait_start / pgstat_report_wait_end at the implementation level, and uses assertions in debug builds to detect nesting violations (INV1 no-overlap principle).
| Standard | Description |
|---|---|
| Minimum measurable threshold | Spin-waits shorter than 100 ns do not enter a wait event; they are attributed to CPU |
| Start point | The line that calls pgstat_report_wait_start(WAIT_EVENT_*) |
| End point | The line that calls pgstat_report_wait_end() |
| No overlap (INV1) | At any instant a backend can be in at most one active wait event |
| Sub-event handling | Parent event calls wait_end first; child event then calls wait_start; after the child ends the parent may restart its timer |
| Sampled events | Events typically shorter than 5 μs are sampled at 1/100; cumulative values are aggregated by the Cluster Stats process |
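The pairing rules above can be modeled as a tiny state machine: at most one active event per backend, strict start/end pairing, and parent events ending before a child starts. The class and names below are hypothetical illustrations, not the actual pgrac C API (which uses pgstat_report_wait_start / pgstat_report_wait_end).

```python
# Illustrative model of the INV1 no-overlap rule (names are hypothetical):
# a backend may have at most one active wait event at a time; nesting is
# a bug that debug builds catch with an assertion.

class WaitReporter:
    def __init__(self):
        self.active = None   # currently reported event, or None
        self.totals = {}     # event name -> completed wait count

    def wait_start(self, event: str) -> None:
        # INV1: starting a new event while another is active is a violation.
        assert self.active is None, f"nested wait: {self.active} -> {event}"
        self.active = event

    def wait_end(self) -> None:
        assert self.active is not None, "wait_end without wait_start"
        self.totals[self.active] = self.totals.get(self.active, 0) + 1
        self.active = None

# Sub-event handling: the parent ends its interval before the child starts,
# so the two intervals never overlap and sum cleanly end-to-end.
r = WaitReporter()
r.wait_start("PCM block read N->S")     # parent begins
r.wait_end()                            # ...suspended for the child
r.wait_start("Interconnect RDMA send")  # child interval
r.wait_end()
r.wait_start("PCM block read N->S")     # parent resumes its timer
r.wait_end()
```

Because intervals never overlap, summing per-event totals reconstructs the backend's total wait time with no double counting.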
```
pgrac wait events
├─ Cluster · IPC
│ ├─ lms_pcm_msg (PCM block read N→S / N→X / S→X)
│ ├─ ges_grant_wait (GES enqueue acquire / convert)
│ └─ rdma_completion (Interconnect RDMA send / recv)
├─ Cluster · Lock
│ ├─ pcm_convert_wait (PCM block convert wait)
│ ├─ ges_master_wait (GES enqueue acquire — contended path)
│ └─ deadlock_probe_wait (GES deadlock detection)
├─ Cluster · Recovery
│ ├─ reconfig_freeze (Reconfig: barrier wait / fence wait)
│ └─ merged_redo_apply (Recovery: k-way merge / apply per-thread)
└─ Cluster · Disk I/O
  ├─ voting_disk_io (Reconfig: fence wait — disk heartbeat)
  └─ shared_fs_io (IO: DataFileRead/Write on shared storage)
```
IPC events cover waits produced by the round-trip between a backend and its node's background processes (LMS / LMD), as well as cross-node RDMA / TCP Interconnect messages. This category represents the primary clustering overhead in pgrac; typical durations are in the microsecond range.
| Event Name | wait_event_type | Start | End | Typical Time | Trigger Scenario |
|---|---|---|---|---|---|
| PCM block read N→S | Cluster: PCM | ReadBufferExtended detects PCM state N | Buffer installed into the local node's buffer pool | 9–18 μs | Backend reads a block for the first time; no local copy exists on this node |
| PCM block read N→X | Cluster: PCM | Same as N→S, but request mode is X | Same as N→S | 10–15 μs | UPDATE/DELETE path requests exclusive mode directly |
| PCM block write S→X | Cluster: PCM | MarkBufferDirty detects an S→X upgrade is needed | Upgrade complete; block is now writable | 12–20 μs | This node holds an S copy and needs to upgrade for a write |
| Buffer ship cr receive | Cluster: BufferShip | GetCRBuffer decides to fetch from remote | CR buffer installed into CR chain | 10–20 μs | Read transaction needs a consistent-read version (2-way / 3-way) |
| Buffer ship current receive | Cluster: BufferShip | X request issued; waiting to receive current block | Current buffer installed | 10–18 μs | DML path receives a current block shipped from the peer |
| Interconnect RDMA send | Cluster: Interconnect | Before ibv_post_send() is called | Send completion event | 1–3 μs (L1 RDMA) / 50–100 μs (TCP) | All outbound PCM / GES / Sinval messages |
| Interconnect RDMA recv | Cluster: Interconnect | After ibv_post_recv() is submitted | Recv completion event | 1–3 μs | All inbound message reception |
| Sinval broadcast receive | Cluster: Sinval | Sinval broadcast received from another node | Message injected into local node's sinval queue | 3–5 μs | DDL cross-node catcache / relcache invalidation propagation |
| GES enqueue release ack | Cluster: GES | LockRelease sends RELEASE message | Master RELEASE_ACK received | 5–7 μs | Lock release requires master acknowledgment |
| SCN piggyback merge | Cluster: SCN | Inbound message carries piggyback SCN > local_scn | local_scn CAS advance completes | 0.5–1 μs (sampled) | Every inbound RDMA message carries a SCN advance |
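The SCN piggyback merge row describes a monotonic-advance rule: the local SCN only ever moves forward, and only when an inbound message carries a newer value. The sketch below models that rule; the class and method names are hypothetical, and the real implementation would perform the advance with an atomic CAS loop rather than a plain assignment.

```python
# Sketch of the "SCN piggyback merge" event (hypothetical names): every
# inbound message carries the sender's SCN, and the receiver advances
# its local SCN monotonically. This single-threaded model shows the
# monotonic-advance rule only; real code uses an atomic CAS.

class ScnClock:
    def __init__(self, scn: int = 0):
        self.local_scn = scn

    def merge_piggyback(self, msg_scn: int) -> bool:
        """Advance local_scn if the message carries a newer SCN.
        Returns True when an advance happened (the sampled wait event)."""
        if msg_scn > self.local_scn:
            self.local_scn = msg_scn  # CAS loop in the real code
            return True
        return False
```

Stale or equal SCNs are ignored, so the clock never moves backwards regardless of message ordering.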
Lock events cover GES (Global Enqueue Service) lock contention and queuing waits within the PCM (Page Consistency Manager) block lock state machine. These events are nearly absent during a contention-free steady state; their appearance indicates lock contention or cluster congestion.
| Event Name | wait_event_type | Start | End | Typical Time | Trigger Scenario |
|---|---|---|---|---|---|
| GES enqueue acquire | Cluster: GES | LockAcquireExtended confirms the lock cannot be granted immediately | Lock granted; backend woken up | 8 μs (no contention) / ms–s range (with holder) | Transaction-level / object-level / DDL-level lock waits |
| GES enqueue convert | Cluster: GES | Held lock needs mode upgrade (e.g. S→X) | Upgrade complete | 10–15 μs (no contention) / tens of ms (coordination) | Lock mode upgrade path |
| GES master query | Cluster: GES | Enters find_resource_master() with a GRD cache miss | Master node_id returned | 5–10 μs | GRD cache miss triggers broadcast query for master |
| PCM block convert wait | Cluster: PCM | Master convert queue is non-empty; this request is enqueued | Request is dequeued and scheduled | < 50 μs (normal) / ms range (peak) | Master-side convert queue backlog |
| PCM block downgrade | Cluster: PCM | LMS receives a downgrade request from master | Downgrade complete (including PI creation) | 3–10 μs (no dirty) / 1–100 ms (with dirty writeback) | Another node requests this node to downgrade an X/S copy |
| PCM ITL cleanout | Cluster: PCM | heap_page_prune detects an ITL cleanout is needed | Cleanout complete (temporary S→X upgrade) | 15–25 μs | Block read reveals ITL cleanup is needed, triggering an exclusive upgrade |
When GES enqueue acquire is contended, wait duration is entirely determined by business logic (the commit speed of the lock-holding transaction). A P99 exceeding 100 μs indicates a lock contention hotspot; investigate using pg_stat_activity.wait_event together with pg_locks to identify the holding transaction.
Recovery events appear in two scenarios: backends blocked by Freeze during Reconfiguration (node join / leave / failure), and Recovery Workers performing parallel WAL apply during crash recovery. Under normal operation these events should be near-zero; sustained waiting indicates the cluster is processing a membership change or recovering from a failure.
| Event Name | wait_event_type | Start | End | Typical Time | Trigger Scenario |
|---|---|---|---|---|---|
| Reconfig: barrier wait | Cluster: Reconfig | Node enters global barrier synchronization point | All active nodes have reached the barrier | 10–100 ms | Global synchronization barrier before GRD switchover |
| Reconfig: fence wait | Cluster: Reconfig | LMON decides to fence a dead node and begins waiting | Fence confirmation complete | 1–5 s | Confirming the dead node has stopped writing to disk / network has been isolated |
| Reconfig: GRD rebuild | Cluster: Reconfig | LMON triggers GRD rebuild | GRD is consistent across all active nodes | 50–200 ms (4 nodes) / 300 ms–1 s (16 nodes) | GRD rebuild after a node joins or leaves |
| Recovery: k-way merge | Cluster: Recovery | Merge algorithm starts (multiple WAL streams ready) | Redo stream sorted by SCN and ready | 10–100 ms per 1 GB WAL | Multi-threaded WAL merge by SCN during crash recovery |
| Recovery: apply per-thread | Cluster: Recovery | Worker receives a thread WAL segment | That segment has been fully applied | Throughput ~100 MB/s | Parallel apply phase of crash recovery (per worker) |
| Recovery: PCM state restore | Cluster: Recovery | Starts PCM state reconstruction after apply completes | PCM state is consistent across all nodes | 50–500 ms | Rebuilds which blocks are held with which locks on which nodes after recovery |
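The k-way merge row describes a textbook operation: several per-thread WAL streams, each already SCN-ordered, are merged into one globally SCN-ordered redo stream. The sketch below shows the shape of that merge using the standard-library heap merge; record layouts and function names are hypothetical.

```python
# Sketch of the "Recovery: k-way merge" phase: per-thread WAL streams,
# each sorted by SCN, are merged into one SCN-ordered redo stream.
# heapq.merge is the standard k-way merge; record shapes are hypothetical.
import heapq

def merge_wal_streams(streams):
    """streams: iterables of (scn, record) tuples, each sorted by SCN.
    Yields one globally SCN-ordered redo stream."""
    return heapq.merge(*streams, key=lambda rec: rec[0])

# Two hypothetical per-node streams, already SCN-ordered:
node_a = [(10, "A1"), (30, "A2")]
node_b = [(20, "B1"), (40, "B2")]
redo = list(merge_wal_streams([node_a, node_b]))
```

Because each input is already sorted, the merge is O(n log k) for n records across k streams and can run lazily, feeding the parallel apply workers without materializing the whole redo stream.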
During Reconfiguration, all backends block at the CHECK_FROZEN() macro and report Reconfig: barrier wait (Freeze phase). The target Freeze elapsed time is ≤ 2 seconds; on timeout, CSSD escalates to a full cluster restart.
Disk I/O events cover the new storage access paths introduced by pgrac: voting disk heartbeat writes, shared storage data file access, and cluster WAL / archive I/O. PostgreSQL's native IO: event category is semantically unchanged, but its latency distribution shifts due to shared storage (NVMe-oF adds 5–10 μs; iSCSI adds roughly 50 μs). When monitoring, distinguish local disk from shared storage using the pg_stat_cluster_io view.
| Event Name | wait_event_type | Start | End | Typical Time | Trigger Scenario |
|---|---|---|---|---|---|
| IO: DataFileRead (shared) | IO | Before smgrread() is called | Data returned | 50–200 μs (iSCSI) / 10–30 μs (NVMe-oF) | Buffer cache miss; page read from shared storage |
| IO: DataFileWrite (shared) | IO | Before smgrwrite() is called | Write confirmed | 50–200 μs (iSCSI) | Dirty page flushed to shared storage |
| Reconfig: fence wait (disk) | Cluster: Reconfig | CSSD detects disk heartbeat timeout | IO fence complete | 1–5 s | Disk isolation triggered by voting disk heartbeat timeout |
| Interconnect TCP fallback | Cluster: Interconnect | RDMA detected as unavailable; entering TCP fallback | TCP path message send/receive complete | 50–200 μs | Fallback path when RDMA link fails (alert-level event) |
Interconnect TCP fallback warrants an alert on any occurrence — normal operation should run entirely on the RDMA tier. A TCP fallback means Interconnect hardware or configuration is abnormal; a single fallback costs 25–100× more than RDMA.
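The fallback logic described above has a simple shape: try RDMA first, and only on link failure send over TCP while counting the event so monitoring can alert on any non-zero value. The sketch below models that control flow with hypothetical transport callables; it is not the pgrac send path.

```python
# Sketch of an Interconnect send path with TCP fallback (hypothetical
# names): RDMA is tried first; only when the link fails does the message
# go over TCP, and each fallback is counted for alerting.

def send_with_fallback(msg, rdma_send, tcp_send, stats):
    try:
        rdma_send(msg)   # ~1-3 us on a healthy RDMA link
    except ConnectionError:
        # Alert-level event: count it, then take the 25-100x slower path.
        stats["tcp_fallback"] = stats.get("tcp_fallback", 0) + 1
        tcp_send(msg)    # ~50-200 us over TCP

def broken_rdma(msg):
    """Stand-in for a failed RDMA link."""
    raise ConnectionError("RDMA link down")

sent = []
stats = {}
send_with_fallback("ping", broken_rdma, sent.append, stats)
```

A healthy cluster should keep stats["tcp_fallback"] at zero; any increment is worth an alert, since the 50–200 μs TCP path is 25–100× the 1–3 μs RDMA latency.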
Cross-references:

- pg_stat_cluster_wait_events view
- Buffer Ship events (Buffer ship cr receive / Buffer ship current receive)
- GES events (GES enqueue acquire / GES enqueue convert, etc.)