pgrac's cluster communication layer is abstracted as the Interconnect Tier, controlled by the cluster.interconnect_tier GUC. Every upper-layer subsystem (Cache Fusion, GES, SCN broadcast, CSSD heartbeat, reconfig coordinator) calls the same cluster_ic_* API; the underlying vtable routes calls to the active tier's implementation. The tier is chosen at postmaster startup and is immutable at runtime.
Current Stage 2 scope: tier1 = TCP (spec-2.2 loopback pseudo-cluster, then cross-host TCP). Stage 6+ plan: tier1 / tier2 / tier3 = escalating RDMA implementations (starting at spec-6.1, targeting 3–10 μs p99). This chapter describes the current Stage 2 protocol stack (spec-2.3 Envelope ABI + spec-2.4 framing), with the Stage 6 RDMA roadmap noted as forward-looking.
Oracle-style hardware labels — "Tier1 = InfiniBand / Tier2 = RoCE / Tier3 = TCP" — do not correspond to pgrac's cluster.interconnect_tier enum. In pgrac, tier numbers represent implementation maturity; all RDMA substitution happens at Stage 6 and beyond (see §7.5). The L1–L4 labels in wait events (wait-events-design.md §10) are a separate latency-bucket taxonomy unrelated to this enum.
cluster.interconnect_tier is a PGC_POSTMASTER enum GUC, defined in cluster-ic-design.md. Legal values depend on the stage:
| Stage | Legal values | Behavior of unimplemented values |
|---|---|---|
| Stage 0.18 (early) | stub | ereport(ERROR, ERRCODE_FEATURE_NOT_SUPPORTED); postmaster startup fails |
| Stage 2 (current) | stub, tier1 | Same — tier2/tier3 rejected at Stage 2 |
| Stage 6+ (planned) | stub, tier1, tier2, tier3 | tier3 = RDMA mlx5dv direct verbs target state |
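The stage-dependent legality in the table can be sketched as a small standalone check. This is an illustration only: the SketchStage enum and sketch_tier_is_legal() are hypothetical names, and where this sketch returns false, pgrac raises ereport(ERROR, ERRCODE_FEATURE_NOT_SUPPORTED) and postmaster startup fails.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stage identifiers, ordered so that later stages compare greater. */
typedef enum { SKETCH_STAGE_0_18, SKETCH_STAGE_2, SKETCH_STAGE_6_PLUS } SketchStage;

/* Returns true if `tier` is a legal cluster.interconnect_tier value at `stage`. */
static bool
sketch_tier_is_legal(SketchStage stage, const char *tier)
{
    if (strcmp(tier, "stub") == 0)
        return true;                          /* legal at every stage */
    if (strcmp(tier, "tier1") == 0)
        return stage >= SKETCH_STAGE_2;       /* TCP arrives at Stage 2 */
    if (strcmp(tier, "tier2") == 0 || strcmp(tier, "tier3") == 0)
        return stage >= SKETCH_STAGE_6_PLUS;  /* RDMA tiers are Stage 6+ */
    return false;                             /* not an enum member at all */
}
```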
The low-level vtable interface (ClusterICOps) has 5 members (four function pointers plus a tier_name string), frozen since Stage 0.18:
```c
struct ClusterICOps {
    int (*send_bytes)(int32 target, const void *buf, size_t len);
    int (*recv_bytes)(int32 *out_sender, void *buf, size_t bufsize, size_t *out_received);
    int (*tier_init)(void);
    int (*tier_shutdown)(void);
    const char *tier_name;
};
```
Each tier provides one vtable instance: Stage 2 ships ClusterICOps_TCP; Stage 6.1 onward adds ClusterICOps_RDMA_Tier1/Tier2/Tier3 incrementally. Upper layers always call the same high-level API (§7.3) and are unaware of vtable swaps.
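A minimal sketch of how a vtable swap can stay invisible to callers, assuming a single active-ops pointer selected once at startup. Everything prefixed sketch_ or stub_ is hypothetical, and the PostgreSQL typedef int32 is replaced by int32_t so the example compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct ClusterICOps {
    int (*send_bytes)(int32_t target, const void *buf, size_t len);
    int (*recv_bytes)(int32_t *out_sender, void *buf, size_t bufsize, size_t *out_received);
    int (*tier_init)(void);
    int (*tier_shutdown)(void);
    const char *tier_name;
} ClusterICOps;

/* Hypothetical stub tier: every call succeeds and moves no data. */
static int stub_send(int32_t target, const void *buf, size_t len)
{ (void) target; (void) buf; (void) len; return 0; }
static int stub_recv(int32_t *out_sender, void *buf, size_t bufsize, size_t *out_received)
{ (void) buf; (void) bufsize; *out_sender = -1; *out_received = 0; return 0; }
static int stub_noop(void) { return 0; }

static const ClusterICOps sketch_ops_stub =
    { stub_send, stub_recv, stub_noop, stub_noop, "stub" };

static const ClusterICOps *sketch_active_ops = NULL;

/* Pick the vtable whose tier_name matches; called once at startup.
 * Returns tier_init()'s result, or -1 if no tier has that name. */
static int
sketch_ic_select(const char *tier)
{
    static const ClusterICOps *const known[] = { &sketch_ops_stub };

    for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (strcmp(known[i]->tier_name, tier) == 0) {
            sketch_active_ops = known[i];
            return sketch_active_ops->tier_init();
        }
    }
    return -1;
}

/* Upper layers call wrappers like this one and never see which tier is active. */
static int
sketch_ic_send(int32_t target, const void *buf, size_t len)
{
    return sketch_active_ops->send_bytes(target, buf, len);
}
```

Adding a tier in this sketch means adding one entry to `known`; no caller of sketch_ic_send() changes.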
Stage 2 replaces tier1 from stub to TCP in two steps:
1. spec-2.2: a loopback pseudo-cluster over TCP.
2. Cross-host TCP, with a B_IC_LISTENER background process to accept cross-host connections.

The TCP path uses standard sockets (SO_KEEPALIVE + TCP_KEEPIDLE/INTVL/CNT), pairing with the spec-2.5 CSSD heartbeat to form the two-layer dead detection (see Ch 5 §5.3). Stage 2 has no RDMA path: no RDMA-related GUCs or code exist yet.
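The keepalive options named above are standard Linux socket options. A minimal sketch of enabling them on a connected or freshly created TCP socket follows; sketch_enable_keepalive() is a hypothetical helper, and the idle/interval/count values are left to the caller because this chapter does not state which values pgrac uses.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive probing on `fd` (Linux option names).
 * idle_s:  seconds of idleness before the first probe
 * intvl_s: seconds between probes
 * cnt:     unanswered probes before the kernel declares the peer dead
 * Returns 0 on success, -1 on the first failing setsockopt() (errno is set). */
static int
sketch_enable_keepalive(int fd, int idle_s, int intvl_s, int cnt)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) != 0)
        return -1;
    return 0;
}
```

With these settings the kernel reports a dead peer after roughly idle_s + intvl_s × cnt seconds of silence, which is the transport-level half of the two-layer detection.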
Typical Stage 2 cross-host TCP single-message latency ≈ 50 μs; this drops to 3–10 μs p99 after Stage 6+ swaps to RDMA tier3 (spec-6.1). Stage 2's goal is correctness and protocol maturity, not latency optimization — Cache Fusion performance benchmarks become meaningful only after the Stage 6+ roadmap lands.
spec-2.3 replaces the Stage 0.18 24-byte ClusterMsgHeader with a 36-byte ClusterICEnvelope — the unified envelope for all IC messages. From Stage 2 onward every cluster message wears the envelope; when Stage 6 swaps the underlying transport for RDMA, the envelope format does not change.
Envelope field layout (36 bytes, fixed):
| Offset | Size | Field | Notes |
|---|---|---|---|
| 0 | 2 B | magic | 0x4943 ("IC" LE, sanity marker) |
| 2 | 1 B | version | V1 = 1 |
| 3 | 1 B | msg_type | ClusterICMsgType enum (see table below) |
| 4 | 4 B | source_node_id | Sending node |
| 8 | 4 B | dest_node_id | Target node; 0xFFFFFFFF = broadcast |
| 12 | 8 B | epoch | cluster_epoch piggyback; spec-2.4 enforces match |
| 20 | 8 B | scn | Lamport SCN piggyback |
| 28 | 4 B | payload_length | ≤ 16 MB hard cap |
| 32 | 4 B | payload_crc32c | CRC32C over envelope (excl. self) + payload |
Four StaticAssertDecl lines lock sizeof == 36 plus the three offsets epoch@12, scn@20, payload_length@28.
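The layout and the CRC coverage can be sketched in standalone C. Assumptions, none confirmed by this chapter: the struct is byte-packed with #pragma pack (with natural alignment a 64-bit epoch would not sit at offset 12), C11 _Static_assert stands in for StaticAssertDecl, and "envelope (excl. self)" is read as the first 32 bytes of the envelope. The bitwise CRC32C loop is a plain reference implementation, not necessarily what pgrac uses.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct ClusterICEnvelope {
    uint16_t magic;            /* 0x4943 */
    uint8_t  version;          /* V1 = 1 */
    uint8_t  msg_type;
    uint32_t source_node_id;
    uint32_t dest_node_id;     /* 0xFFFFFFFF = broadcast */
    uint64_t epoch;
    uint64_t scn;
    uint32_t payload_length;
    uint32_t payload_crc32c;
} ClusterICEnvelope;
#pragma pack(pop)

_Static_assert(sizeof(ClusterICEnvelope) == 36, "envelope must be 36 bytes");
_Static_assert(offsetof(ClusterICEnvelope, epoch) == 12, "epoch@12");
_Static_assert(offsetof(ClusterICEnvelope, scn) == 20, "scn@20");
_Static_assert(offsetof(ClusterICEnvelope, payload_length) == 28, "payload_length@28");

/* Feed `len` bytes into a running CRC32C (Castagnoli, reflected poly 0x82F63B78).
 * The caller supplies the initial value and applies the final XOR. */
static uint32_t
sketch_crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC over the envelope up to (not including) payload_crc32c, then the payload. */
static uint32_t
sketch_envelope_crc(const ClusterICEnvelope *env, const void *payload)
{
    uint32_t crc = 0xFFFFFFFFu;

    crc = sketch_crc32c_update(crc, env, offsetof(ClusterICEnvelope, payload_crc32c));
    crc = sketch_crc32c_update(crc, payload, env->payload_length);
    return crc ^ 0xFFFFFFFFu;
}
```

Because the header bytes are covered, a frame whose epoch, scn, or payload_length was corrupted in flight fails the check just as a corrupted payload does.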
Message type enum (ClusterICMsgType):
| Value | Name | Source spec | Notes |
|---|---|---|---|
| 0 | reserved sentinel | — | Illegal; verify rejects |
| 1 | HEARTBEAT | spec-2.5 | CSSD heartbeat |
| 2 | SCN_BROADCAST | spec-2.9 | Cross-node SCN broadcast |
| 3 | BOC_BROADCAST | spec-2.10 | BOC batch advance notification |
| 4 | GES_REQUEST | spec-2.16 | GES enqueue / convert / release request |
| 5 | GES_REPLY | spec-2.16 | GES grant / convert ack / release ack |
| 6 | CF_BLOCK_SHIP | spec-3.x | Cache Fusion block transfer (Stage 3+) |
| 7 | SINVAL | spec-2.x | Cluster-level sinval broadcast |
| 8 | FENCE_NOTIFY | spec-2.28 | fence-lite announcement |
| 9 | RECONFIG | spec-2.29 | Reconfig coordinator broadcast |
| 11 | CSSD_HEARTBEAT | spec-2.5 | CSSD application-level heartbeat payload (12 bytes) |
| 255 | CHUNK | spec-2.4 | Wrap payloads larger than 16 MB |
Public high-level API (cluster/cluster_ic_router.h):
- cluster_ic_register_msg_type(const ClusterICMsgTypeInfo *info): register msg_type + handler + producer mask at postmaster phase; duplicate = FATAL
- cluster_ic_send_envelope(uint8 msg_type, int32 dest_node_id, const void *payload, uint32 payload_len): build envelope + CRC + enqueue
- cluster_ic_dispatch_envelope(const ClusterICEnvelope *env, const void *payload): verify + dispatch_table lookup + invoke handler (LMON recv path)
- cluster_ic_envelope_build / verify: build/verify pair (5-arg signature; spec-2.4 splits the stateful observation paths)
- cluster_ic_send_envelope_chunked(uint8 inner_msg_type, int32 dest_node_id, const void *payload, size_t len): chunked variant for large payloads

Stage 2.3 removes the Stage 0.18-era ClusterMsgHeader, cluster_msg_send, and cluster_msg_recv. cluster_rpc_call (RPC request/reply style) is deferred to post-Stage 6 discussion.
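The register-then-dispatch pattern behind this API can be sketched with a 256-slot table indexed by msg_type. This is a simplified illustration: the sketch_ functions and the handler signature are hypothetical, the producer mask and envelope verification are omitted, and a duplicate registration returns -1 here whereas pgrac treats it as FATAL.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*SketchHandler)(const void *payload, uint32_t payload_len);

/* One slot per possible uint8 msg_type; NULL means "not registered". */
static SketchHandler sketch_dispatch_table[256];

/* Register a handler. Returns 0 on success, -1 for the reserved type 0,
 * a NULL handler, or a msg_type that already has a handler. */
static int
sketch_register(uint8_t msg_type, SketchHandler handler)
{
    if (msg_type == 0 || handler == NULL || sketch_dispatch_table[msg_type] != NULL)
        return -1;
    sketch_dispatch_table[msg_type] = handler;
    return 0;
}

/* Look up and invoke the handler. Returns -1 if the type is unregistered. */
static int
sketch_dispatch(uint8_t msg_type, const void *payload, uint32_t payload_len)
{
    SketchHandler handler = sketch_dispatch_table[msg_type];

    if (handler == NULL)
        return -1;
    handler(payload, payload_len);
    return 0;
}

/* Example handler: counts heartbeats (msg_type 1 in the table above). */
static int sketch_heartbeats_seen;

static void
sketch_on_heartbeat(const void *payload, uint32_t payload_len)
{
    (void) payload;
    (void) payload_len;
    sketch_heartbeats_seen++;
}
```

Registering everything before any message is dispatched, as the postmaster-phase rule requires, means the table is read-only by the time the receive path consults it.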
spec-2.4 upgrades the envelope from "format definition" to "verification active": mismatched epochs are dropped + counted, and the Lamport clock converges via envelope piggyback.
Messages are length-prefixed; no delimiter bytes. The receiver always reads exactly 36 bytes of envelope first, then envelope.payload_length bytes of payload. magic = 0x4943 is a sanity marker, not a frame delimiter. TCP in-order delivery is assumed; partial-IO is handled by spec-2.2 v1.0.1's per-peer buffer state machine.
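A sketch of the length-prefixed framing decision on a per-peer receive buffer: wait for 36 bytes, read payload_length at offset 28, then wait for that many more bytes. sketch_frame_ready() is a hypothetical helper, not the spec-2.2 state machine itself, and it assumes payload_length is stored in host byte order, which this chapter does not specify.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ENVELOPE_SIZE    36u
#define SKETCH_PAYLOAD_LEN_OFF  28u
#define SKETCH_MAX_PAYLOAD      (16u * 1024u * 1024u)   /* 16 MB hard cap */

/* Inspect `have` buffered bytes starting at `buf`.
 * Returns the total frame length (envelope + payload) if a complete frame
 * is buffered, 0 if more bytes are needed, -1 if payload_length exceeds the cap. */
static long
sketch_frame_ready(const uint8_t *buf, size_t have)
{
    uint32_t payload_len;

    if (have < SKETCH_ENVELOPE_SIZE)
        return 0;                           /* partial envelope: keep reading */

    /* memcpy avoids an unaligned load; host byte order is assumed. */
    memcpy(&payload_len, buf + SKETCH_PAYLOAD_LEN_OFF, sizeof payload_len);
    if (payload_len > SKETCH_MAX_PAYLOAD)
        return -1;                          /* corrupt or hostile length */

    if (have < (size_t) SKETCH_ENVELOPE_SIZE + payload_len)
        return 0;                           /* partial payload: keep reading */

    return (long) (SKETCH_ENVELOPE_SIZE + payload_len);
}
```

Rejecting an oversized payload_length before waiting for the payload keeps a corrupted header from making the receiver buffer indefinitely.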
Chunking (spec-2.4 §3): payloads > 16 MB are wrapped by cluster_ic_send_envelope_chunked() with a 16-byte ClusterICChunkHeader (chunk_seq u32 / chunk_total u32 / total_payload_len u32 / inner_msg_type u8 + 3B pad); the outer envelope's msg_type = 255 (PGRAC_IC_CHUNK_MSG_TYPE). Default cap 64 MB, hard cap 256 MB.
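The chunk arithmetic can be sketched as follows. Two assumptions are not confirmed by this chapter: the 16-byte chunk header counts against each outer envelope's 16 MB payload cap, leaving 16 MB minus 16 bytes of data per chunk, and chunk_seq is zero-based. The helper names are hypothetical.

```c
#include <stdint.h>

typedef struct ClusterICChunkHeader {
    uint32_t chunk_seq;           /* assumed zero-based */
    uint32_t chunk_total;
    uint32_t total_payload_len;
    uint8_t  inner_msg_type;
    uint8_t  pad[3];
} ClusterICChunkHeader;

_Static_assert(sizeof(ClusterICChunkHeader) == 16, "chunk header must be 16 bytes");

#define SKETCH_ENVELOPE_PAYLOAD_CAP  (16u * 1024u * 1024u)
#define SKETCH_CHUNK_DATA_MAX \
    (SKETCH_ENVELOPE_PAYLOAD_CAP - (uint32_t) sizeof(ClusterICChunkHeader))

/* Number of chunks needed to carry total_payload_len bytes (ceiling division). */
static uint32_t
sketch_chunk_total(uint32_t total_payload_len)
{
    return total_payload_len / SKETCH_CHUNK_DATA_MAX
         + (total_payload_len % SKETCH_CHUNK_DATA_MAX != 0);
}

/* Data bytes carried by chunk number chunk_seq; 0 if chunk_seq is out of range. */
static uint32_t
sketch_chunk_data_len(uint32_t chunk_seq, uint32_t total_payload_len)
{
    uint64_t offset = (uint64_t) chunk_seq * SKETCH_CHUNK_DATA_MAX;
    uint64_t remaining;

    if (offset >= total_payload_len)
        return 0;
    remaining = total_payload_len - offset;
    return remaining < SKETCH_CHUNK_DATA_MAX ? (uint32_t) remaining
                                             : SKETCH_CHUNK_DATA_MAX;
}
```

Under these assumptions a payload at the 64 MB default cap needs five chunks rather than four, because each chunk gives up 16 bytes to its header.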
Each envelope's epoch field (offset 12, 8 bytes) is written by the sender as cluster_epoch_get_current(). In verify step 7 the spec-2.4 receiver compares it with the local epoch: if env->epoch != current_epoch, it drops the frame without dispatching, logs a line under SQLSTATE 53R20 (CLUSTER_IC_STALE_EPOCH_DROP), and increments stale_epoch_drop_count. This guarantees that stale envelopes from before a reconfig never pollute the new topology: once the reconfig coordinator (Ch 5 §5.2.1) advances the epoch, in-flight old-epoch messages are invalidated automatically.
At Stage 2.4, CLUSTER_EPOCH_INITIAL = 0 and the epoch stays at 0. The real epoch++ happens in spec-2.29's reconfig coordinator (see Ch 5).
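The epoch gate reduces to one comparison and one counter. A minimal single-threaded sketch, with hypothetical sketch_ names standing in for the real epoch accessor and statistics counter, and with logging omitted:

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t sketch_current_epoch = 0;          /* CLUSTER_EPOCH_INITIAL = 0 */
static uint64_t sketch_stale_epoch_drop_count = 0;

/* Returns true if the envelope may be dispatched, false if it must be dropped.
 * The test is inequality, so an envelope from a newer epoch is dropped too. */
static bool
sketch_epoch_check(uint64_t env_epoch)
{
    if (env_epoch != sketch_current_epoch) {
        sketch_stale_epoch_drop_count++;
        return false;
    }
    return true;
}
```

While the epoch stays at 0 this check always passes for well-formed traffic; it starts dropping frames only once a reconfig advances the local epoch.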
Each envelope's scn field (offset 20, 8 bytes) is populated by the sender via cluster_scn_current(). After CRC + auth verification the receiver calls cluster_scn_observe(env->scn), advancing local SCN by the Lamport >= rule. The counter lamport_observe_advance_count tracks trigger frequency.
This is the cross-node convergence mechanism of the SCN protocol (Ch 4): every cross-node message carries an SCN automatically, so convergence does not depend on dedicated SCN-broadcast traffic. spec-2.9 / 2.10 additionally define the SCN_BROADCAST / BOC_BROADCAST msg_types for explicit convergence points.
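A sketch of the observe step, reading the Lamport rule as "after observing, the local SCN is at least the observed SCN". That reading, the single-threaded counters, and the sketch_ names are assumptions; a real implementation would need atomic updates, and this chapter does not say whether the local SCN is also incremented past the observed value.

```c
#include <stdint.h>

static uint64_t sketch_local_scn;
static uint64_t sketch_lamport_observe_advance_count;

/* Pull the local SCN forward to `observed` if it is behind; never move it back.
 * The counter records how often an incoming envelope actually advanced the clock. */
static void
sketch_scn_observe(uint64_t observed)
{
    if (observed > sketch_local_scn) {
        sketch_local_scn = observed;
        sketch_lamport_observe_advance_count++;
    }
}
```

A high advance count relative to received messages would indicate that this node's SCN routinely lags its peers.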
Stage 6 production hardening replaces tier1 / tier2 / tier3 with escalating RDMA implementations. Semantics stay constant (envelope ABI unchanged, msg types unchanged, upper API unchanged); only the vtable swaps:
| Stage | vtable | Implementation focus | Target latency |
|---|---|---|---|
| 6.1 | ClusterICOps_RDMA_Tier1 | libibverbs basic wrapping; replaces TCP in tier1 slot, TCP kept as fallback | 10–20 μs |
| 6.2 | ClusterICOps_RDMA_Tier2 | MR pre-registration (mr_cache), zero-copy DMA | 5–10 μs |
| 6.3 | ClusterICOps_RDMA_Tier3 | mlx5dv direct verbs, lock-free path; target state | 3–10 μs p99 |
Design intent for Stage 6 RDMA (reserved for Stage 6 landing — not yet implemented in Stage 2):
- MR cache (mr_cache) keyed by {addr, size}: the first send registers the region and caches the MR handle; subsequent sends hit the cache and skip ibv_reg_mr(), saving ~2–5 μs per call.

The RDMA details in this section are Stage 6 design targets; implementation has not started. No pg_cluster_rdma_* view, cluster_rdma_* GUC, or pgrac_ctl verify rdma tool is part of Stage 2 scope. Operators cannot, and need not, configure these before the Stage 6 design is delivered.
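The register-once idea can be illustrated without RDMA hardware. In this sketch sketch_register_mr() is a hypothetical stand-in for ibv_reg_mr() that only hands out fake handles and counts calls, {addr, size} is assumed to be the cache key, and the fixed-size linear table with no eviction is chosen purely for brevity. None of it reflects a Stage 6 implementation, which does not exist yet.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MR_CACHE_SLOTS 8

typedef struct SketchMrEntry {
    const void *addr;
    size_t      size;
    uint32_t    handle;
    int         used;
} SketchMrEntry;

static SketchMrEntry sketch_mr_cache[SKETCH_MR_CACHE_SLOTS];
static uint32_t      sketch_register_calls;

/* Stand-in for the expensive registration call; returns a fake handle. */
static uint32_t
sketch_register_mr(const void *addr, size_t size)
{
    (void) addr;
    (void) size;
    return ++sketch_register_calls;
}

/* Return the handle for {addr, size}, registering only on a cache miss. */
static uint32_t
sketch_mr_lookup(const void *addr, size_t size)
{
    int      free_slot = -1;
    uint32_t handle;

    for (int i = 0; i < SKETCH_MR_CACHE_SLOTS; i++) {
        if (sketch_mr_cache[i].used) {
            if (sketch_mr_cache[i].addr == addr && sketch_mr_cache[i].size == size)
                return sketch_mr_cache[i].handle;      /* hit: no registration */
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }

    handle = sketch_register_mr(addr, size);           /* miss: pay the cost once */
    if (free_slot >= 0)
        sketch_mr_cache[free_slot] = (SketchMrEntry){ addr, size, handle, 1 };
    return handle;                                      /* full cache: not cached */
}
```

The same buffer with a different size is a different key, so it registers again.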
The Cluster: Interconnect class (5 events, wait-events-design.md §10) records IC-layer latency. Stage 2-applicable events:
| Event | Trigger | Typical latency |
|---|---|---|
| Interconnect TCP fallback | Stage 2 default TCP path (and Stage 6+ AD-007 L4 bucket) | ~50 μs |
| Interconnect connect retry | Exponential-backoff reconnect after peer disconnect | 100 ms–10 s |
| Interconnect tier switch | Stage 6+ only; not visible in Stage 2 | — |
| Interconnect RDMA send | Stage 6+ only | — |
| Interconnect RDMA recv | Stage 6+ only | — |
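The 100 ms–10 s range of the connect-retry event is consistent with a capped exponential backoff. A sketch under the assumption that the delay doubles per attempt (the growth factor is not stated in this chapter, and sketch_backoff_ms() is a hypothetical name):

```c
#include <stdint.h>

#define SKETCH_BACKOFF_MIN_MS  100u
#define SKETCH_BACKOFF_MAX_MS  10000u

/* Delay before reconnect attempt number `attempt` (0 = first retry):
 * 100 ms doubled once per prior attempt, capped at 10 s. */
static uint32_t
sketch_backoff_ms(unsigned attempt)
{
    uint32_t delay = SKETCH_BACKOFF_MIN_MS;

    /* Stop doubling once the cap is reached, so large attempt counts cannot overflow. */
    while (attempt-- > 0 && delay < SKETCH_BACKOFF_MAX_MS)
        delay *= 2u;

    return delay > SKETCH_BACKOFF_MAX_MS ? SKETCH_BACKOFF_MAX_MS : delay;
}
```

Under this assumption the cap is reached on the eighth retry, after which every wait in this event lasts 10 s.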
View pg_stat_cluster_wait_events (spec-0.x, wait-events-design.md L1106) provides backend-aggregated stats; pg_stat_cluster_wait_events_history exposes time-bucketed slices. The AD-007 L1–L4 latency buckets (wait-events-design.md §10 L867-887) are a separate taxonomy at the wait-event dimension: L1 RDMA WRITE/READ, L2 RDMA SEND/RECV, L3 verbs queue congestion, L4 TCP fallback — these labels do not correspond to cluster.interconnect_tier enum values.
For deeper protocol detail, see:
- CF_BLOCK_SHIP (msg_type 6), activated at Stage 3+
- cluster-ic-design.md: full design (vtable ClusterICOps, enum evolution, Stage 0.18 → Stage 2 → Stage 6 path)
- spec-2.3-envelope-abi-ratify-transport-agnostic-api.md: Envelope ABI frozen version (v0.2, 2026-05-07)
- spec-2.4-framing-epoch-enforce-lamport-piggyback.md: Framing / epoch enforcement / Lamport piggyback frozen version (v0.2, 2026-05-08)

Chapter 8, Background Processes, covers how the LMON, CSSD, qvotec, LMS, and LMD daemons send and receive messages through the IC framework: LMON's recv main loop (the dispatch_envelope entry point), CSSD heartbeat's send path, and the LMS daemon's work_queue model.