pgrac abstracts its cluster communication layer into three Interconnect Tiers, covering every deployment scenario — from bare-metal InfiniBand to CI loopback TCP — under a single unified API. Tier selection is determined at startup by automatic runtime detection; upper-layer subsystems (Cache Fusion, GES, SCN broadcast, heartbeat) are fully unaware of the underlying network type. They always call the same cluster_msg_* / cluster_rpc_* API; the vtable internally routes each call to the concrete implementation for the active Tier.
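The routing itself can be pictured as a small struct of function pointers. The following is an illustrative sketch only (the field and function names are hypothetical; the real ClusterICOps layout belongs to the deep-dive pages):

```c
#include <stddef.h>

/* Illustrative sketch of vtable-based tier dispatch; not the actual
 * pgrac definitions. */
typedef struct ClusterICOps
{
    int     (*send)(int target_node, const void *buf, size_t len);
    int     (*recv)(int *from_node, void *buf, size_t len);
} ClusterICOps;

/* Filled in once at startup by tier detection (Tier 1 / 2 / 3 or stub). */
static const ClusterICOps *active_ic_ops;

/* Upper layers (Cache Fusion, GES, SCN broadcast) only ever call this. */
int
cluster_msg_send(int target_node, const void *buf, size_t len)
{
    return active_ic_ops->send(target_node, buf, len);
}
```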
The essential difference between tiers is not capability but latency magnitude: Tier 1 RDMA verbs achieves sub-5 μs single-block transfer on Mellanox mlx5 hardware — Cache Fusion performance is directly determined by this figure. Tier 2 RoCE offers a middle ground on Ethernet hardware, suitable for budget-constrained deployments that still need the RDMA programming model. Tier 3 TCP raises latency to approximately 50 μs, but is 100% available in CI environments and on development machines without RDMA hardware. The correctness of committed transactions is identical across all three Tiers — Tiers affect throughput and latency, not semantics.
This chapter establishes the conceptual framework needed to understand the three-tier strategy: why three tiers rather than two, how the Tier 1 zero-copy data path bypasses the kernel, the RoCE congestion-control mechanism in Tier 2, the trigger conditions for Tier 3 automatic downgrade and the pre-deployment checks, and common NIC tuning and monitoring practices in cluster operations. Protocol internals (vtable data structures, wire format, QP pool management algorithms) are left to deep-dive pages; this chapter builds only the conceptual vocabulary and operational intuition.
Production RDMA hardware can push Cache Fusion single-block transfer latency below 5 μs — one of pgrac's core performance targets when benchmarking against Oracle RAC. But not every environment has RDMA NICs: CI pipelines typically run on virtual machines without InfiniBand, and developer laptops almost never have an RDMA HCA. A simple binary strategy — "use RDMA if available, otherwise unsupported" — would drastically reduce CI test coverage: cross-node code paths could only be verified on production hardware, which contradicts pgrac's testability principle.
The three-tier design encapsulates hardware differences inside the vtable rather than scattering them across call sites:
- Tier 1 (RDMA verbs): production first choice; Mellanox mlx5 as the first-class citizen; sub-5 μs single-block target.
- Tier 2 (RoCE): Ethernet RDMA middle ground; generic verbs path; DCQCN congestion control; ~5–8 μs latency.
- Tier 3 (TCP fallback): automatically selected when no RDMA hardware is present; ~50 μs; default path for CI and development environments.

All three tiers share the same call API; switching is performed at startup detection and is immutable at runtime (PGC_POSTMASTER level).
| Tier | Network Type | Typical Latency (CF single block) | Use Case |
|---|---|---|---|
| 1 | RDMA verbs (InfiniBand / mlx5) | < 5 μs | Production cluster, Mellanox hardware |
| 2 | RoCE (RDMA over Converged Ethernet) | ~5–8 μs | Ethernet RDMA, generic HCA |
| 3 | TCP socket | ~50 μs | No RDMA hardware, CI, developer machine |
This layering also mirrors Oracle RAC's evolution: early RAC deployments relied primarily on an Ethernet private interconnect (analogous to Tier 2), with InfiniBand arriving later in Exadata (analogous to Tier 1). pgrac provides all three tiers from day one, avoiding large-scale refactoring of call sites down the road.
The cluster.interconnect_tier GUC controls tier selection and is typed as an enum (stub / tier1 / tier2 / tier3). Production environments configure it explicitly; when it is left unconfigured, CI environments end up on stub (the no-network stub implementation) or on tier3 (via the automatic detection downgrade).
The Tier 1 implementation targets Mellanox ConnectX-5/6/7 series HCAs, using libibverbs as the base API and layering three key optimizations on top: pre-registered MR cache, zero-copy DMA path, and hugepage alignment. Together these three push hot-path per-message latency from "tens of microseconds" down to the 5 μs target.
Before an RDMA send, the sender must register its memory buffer with the HCA via ibv_reg_mr(), obtaining a Memory Region handle (MR handle). A single ibv_reg_mr() call takes approximately 2–5 μs; registering on every message would immediately blow the Cache Fusion latency budget.
pgrac's MR pre-registration cache (mr_cache) uses {addr, size} as its key and maintains an LRU hash table: the first send to a given buffer registers and caches the MR handle; subsequent sends hit the cache directly without triggering ibv_reg_mr(). The production target for MR cache hit rate is > 98%.
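The pattern is "look up first, register only on a miss." Below is a minimal sketch assuming a hypothetical mr_cache_lookup() / mr_cache_insert() pair for the LRU table; ibv_reg_mr() and the access flags are the standard libibverbs API:

```c
#include <infiniband/verbs.h>

/* Hypothetical LRU cache keyed by {addr, size}; eviction deregisters the
 * least recently used MR when the table is full. */
struct ibv_mr *mr_cache_lookup(void *addr, size_t size);
void mr_cache_insert(void *addr, size_t size, struct ibv_mr *mr);

/* Return a usable MR handle for buf, calling ibv_reg_mr() only on a miss. */
static struct ibv_mr *
get_block_mr(struct ibv_pd *pd, void *buf, size_t size)
{
    struct ibv_mr *mr = mr_cache_lookup(buf, size);

    if (mr != NULL)
        return mr;              /* hot path: skips the 2-5 us registration */

    mr = ibv_reg_mr(pd, buf, size,
                    IBV_ACCESS_LOCAL_WRITE |
                    IBV_ACCESS_REMOTE_READ |
                    IBV_ACCESS_REMOTE_WRITE);
    if (mr != NULL)
        mr_cache_insert(buf, size, mr);
    return mr;
}
```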
Standard TCP send path: application buffer → kernel socket buffer → DMA to NIC — at least two memory copies. The core advantage of RDMA is the ability to bypass the kernel entirely: the HCA reads data directly from user-space memory via DMA and writes it into the remote node's user-space buffer, never touching the kernel socket stack.
pgrac Tier 1 uses RDMA Read / Write operations to take this zero-copy path. Cache Fusion places block data in a user-space buffer registered with an MR and issues an RDMA Write; the HCA independently completes the transfer — the CPU is free to process other work during transfer, confirming completion only when a completion event arrives on the CQ (Completion Queue).
Sender (Node A) Receiver (Node B)
┌─────────────────┐ ┌─────────────────┐
│ userspace │ │ userspace │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ MR / QP │ ─── RDMA Read ────▶│ │ buffer │ │
│ └───────────┘ │ │ └───────────┘ │
│ ↓ │ │ ↑ │
└───────|─────────┘ └───────|─────────┘
| |
| ╳ kernel (bypassed) |
| |
↓ ↑
NIC ───────── physical network ──────── NIC
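The data path in the diagram maps onto a handful of standard libibverbs calls. The sketch below assumes the QP, CQ, local MR, and the peer's remote_addr/rkey were exchanged during connection setup; it is illustrative rather than the actual pgrac send routine:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Push one block to the remote node with a single signaled RDMA Write. */
static int
rdma_write_block(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                 void *block, uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge      sge = { .addr = (uintptr_t) block,
                                .length = len,
                                .lkey = mr->lkey };
    struct ibv_send_wr  wr = { 0 };
    struct ibv_send_wr *bad_wr;
    struct ibv_wc       wc;

    wr.opcode = IBV_WR_RDMA_WRITE;      /* HCA DMAs straight into the peer buffer */
    wr.send_flags = IBV_SEND_SIGNALED;  /* request a completion entry on the CQ */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr) != 0)
        return -1;

    /* The CPU is free to do other work until completion; a simple busy-poll: */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```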
MR registration efficiency is positively correlated with TLB hit rate: under standard 4 KB pages, large buffers see elevated TLB miss rates. 2 MB hugepages expand TLB coverage by 512×, making ibv_reg_mr() pin operations more efficient and reducing TLB miss overhead during DMA. The Tier 1 MR allocator draws buffers from a 2 MB hugepage pool by default; the configuration knob is cluster_rdma_hugepage_size (default 2MB, upgradeable to 1GB).
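A sketch of hugepage-backed allocation, assuming the pool simply mmaps anonymous 2 MB hugepages and registers them once; the real allocator manages a pool and falls back to standard pages when hugepages are unavailable:

```c
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Allocate a hugepage-backed buffer and register it as a single MR.
 * size must be a multiple of the hugepage size (2 MB here). */
static struct ibv_mr *
alloc_hugepage_mr(struct ibv_pd *pd, size_t size)
{
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (buf == MAP_FAILED)
        return NULL;            /* caller falls back to standard 4 KB pages */

    /* One pin covers 512x more address space per TLB entry than 4 KB pages. */
    return ibv_reg_mr(pd, buf, size,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```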
A Queue Pair (QP) is the RDMA communication endpoint. Under a naive scheme, each session needs one QP per target node, so QP count grows as O(N × S): at 10 nodes × 1,000 sessions the QP count explodes. Tier 1 uses SRQ (Shared Receive Queue) to share receive-side buffers across those QPs, and on Mellanox hardware it additionally uses XRC (Extended Reliable Connection) to reduce the QP count to O(N), maintaining a fixed QP footprint regardless of session count.
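The receive-side sharing looks roughly as follows: one SRQ is created and every session QP points at it, so receive buffers are posted once to the SRQ rather than per QP. A sketch using the standard verbs calls; the attribute values are placeholders, not pgrac defaults:

```c
#include <infiniband/verbs.h>

/* One shared receive queue for all session QPs on this node. */
static struct ibv_srq *
create_shared_rq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = 4096, .max_sge = 1 },   /* placeholder sizing */
    };

    return ibv_create_srq(pd, &attr);
}

/* Each session QP attaches to the shared SRQ instead of owning a receive queue. */
static struct ibv_qp *
create_session_qp(struct ibv_pd *pd, struct ibv_cq *cq, struct ibv_srq *srq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq = srq,
        .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 256, .max_send_sge = 1 },
    };

    return ibv_create_qp(pd, &attr);
}
```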
RoCE (RDMA over Converged Ethernet) runs the RDMA verbs protocol layer over standard Ethernet, requiring no InfiniBand switch — only a switch that supports PFC (Priority-based Flow Control) + ECN (Explicit Congestion Notification). The Tier 2 programming model is identical to Tier 1: the same ibv_reg_mr() / ibv_post_send() / CQ polling; the vtable implementation differs only at the underlying transport layer.
When RoCE runs over Ethernet, the network lacks InfiniBand's lossless flow control — if a switch buffer overflows, packets are dropped, and the retransmit overhead of RDMA RC QP for dropped packets far exceeds TCP's. DCQCN (Data Center Quantized Congestion Notification) is Mellanox's congestion control algorithm, natively implemented in NIC firmware on Mellanox hardware:
DCQCN reacts to congestion in three steps:

- The switch marks packets with ECN once buffer occupancy crosses the configured thresholds (the K_MIN / K_MAX buffer thresholds).
- The receiving NIC returns CNPs (Congestion Notification Packets) to the sender.
- The sending NIC reduces its injection rate in response (the DCQCN_ALPHA parameter controls the reduction magnitude).

DCQCN runs in NIC firmware and is transparent to the pgrac application layer. Administrators must verify the switch's PFC/ECN configuration in RoCE deployments (see §7.5); pgrac_ctl verify rdma checks and reports switch configuration status.
The performance ceiling of the generic verbs path (non-Mellanox-specific XRC/DCQCN) is bounded by Ethernet hardware: single-block CF transfer ~5–8 μs, P99 under heavy load ~50 μs, bandwidth ~40–60 Gbps. This is already 6–10× faster than Tier 3 TCP, which is sufficient for most budget-constrained production clusters. Achieving Tier 1's extreme latency requires either InfiniBand hardware or a Mellanox NIC with full RoCE support.
Tier 3 is pgrac's functional safety net: in environments without any RDMA hardware, cluster communication falls back to standard TCP sockets — ~50 μs latency, bandwidth limited by Ethernet — but all cross-node semantics (Cache Fusion, GES lock protocol, SCN broadcast) are fully correct.
At startup, rdma_feature_detect() probes in the following order: first HCA presence (RDMA device enumeration), then vendor and transport identification. An InfiniBand-capable mlx5 HCA selects Tier 1; a RoCE-capable Ethernet HCA selects Tier 2. If no RDMA HCA is detected, interconnect_tier is immediately set to TIER_3, all RDMA initialization paths are skipped, and log_warn("No RDMA HCA detected; running at Tier 3 (TCP)") is recorded.
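A minimal sketch of how such a probe can be written against libibverbs; the device-name heuristic and tier mapping shown here are illustrative, not the actual rdma_feature_detect() logic:

```c
#include <string.h>
#include <infiniband/verbs.h>

typedef enum { TIER_1, TIER_2, TIER_3 } InterconnectTier;

/* Illustrative detection: prefer mlx5 HCAs (Tier 1), accept any other RDMA
 * device as Tier 2, and fall back to TCP (Tier 3) when nothing is present. */
static InterconnectTier
detect_tier(void)
{
    int                 num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    InterconnectTier    tier = TIER_3;

    for (int i = 0; i < num; i++)
    {
        if (strncmp(ibv_get_device_name(devs[i]), "mlx5", 4) == 0)
        {
            tier = TIER_1;
            break;
        }
        tier = TIER_2;          /* non-mlx5 HCA: generic verbs / RoCE path */
    }

    if (devs)
        ibv_free_device_list(devs);
    return tier;
}
```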
Before a production deployment, it is recommended to run a full hardware readiness check with pgrac_ctl verify rdma:
pgrac_ctl verify rdma
├─ HCA presence
├─ Vendor identification (mlx5 / other)
├─ Driver version (MLNX_OFED 5.x+ or rdma-core 30+)
├─ Hugepage configuration (2MB / 1GB)
├─ NUMA topology sanity
├─ Switch PFC/ECN configuration (RoCE deployments)
├─ Loopback latency / bandwidth test
└─ Final reported Tier level (1 / 2 / 3)
A failed check does not block instance startup, but failing items are logged as WARNINGs and listed in the verify_issues field of the pg_cluster_rdma_status view. All checks should pass before a production deployment, or the performance implications of each failing item should be understood.
Tier 3 TCP latency is approximately 10–20× that of Tier 1: single-block CF transfer ~50 μs (Tier 1 target < 5 μs), P99 under heavy load ~500 μs (Tier 1 < 10 μs), CPU utilization ~40% (Tier 1 < 5%). Under production OLTP load, Tier 3 Cache Fusion throughput is typically 5–10× lower than Tier 1.
For deployments that need OLTP performance but lack RDMA hardware at present, one approach is to go live on Tier 3 to validate functionality while procuring Mellanox NICs. Switching to Tier 1 requires only setting cluster.interconnect_tier = 'tier1' and restarting the instance; no application-layer changes are needed.
RDMA HCAs are typically bound to a single NUMA node. If the IRQ-handling CPUs and the HCA are on different NUMA nodes, every interrupt incurs a cross-NUMA access, degrading latency by 2–3×. The correct configuration pins HCA IRQs to the CPU cores on the HCA's own NUMA node:
# Query the HCA's NUMA node
cat /sys/class/infiniband/mlx5_0/device/numa_node
# Bind mlx5 IRQs to the corresponding NUMA CPU set (example: NUMA 0 = CPUs 0-15)
for irq in $(cat /proc/interrupts | grep mlx5 | awk '{print $1}' | tr -d ':'); do
echo 0-15 > /proc/irq/$irq/smp_affinity_list
done
The NUMA topology check in pgrac_ctl verify rdma automatically detects and reports the current IRQ affinity state. When cluster_rdma_numa_auto_bind = on (default), pgrac performs the above binding automatically at startup — no manual configuration required.
Hugepages must be pre-configured at the OS level; pgrac does not dynamically allocate them at runtime. The recommended /etc/sysctl.conf settings are:
# 2 MB hugepages: 1024 × 2 MB = 2 GB, suitable for mid-scale clusters
vm.nr_hugepages = 1024
# 1 GB hugepages cannot be reserved through sysctl; on large-memory servers use the
# kernel boot parameters "default_hugepagesz=1G hugepagesz=1G hugepages=2" instead.
Apply after reboot or with sysctl -p. The cluster_rdma_hugepage_size = '2MB' GUC (default) controls pgrac's MR buffer allocation strategy. If hugepages are unavailable, pgrac falls back to standard pages and logs a WARNING.
Each RDMA QP is bound to a Completion Queue (CQ). If the application layer consumes CQ entries more slowly than the NIC produces completion events, the CQ overflows — subsequent RDMA operations return errors and connections must be rebuilt. CQ overflow is a rare but high-impact failure mode in Tier 1 deployments.
To monitor: inspect the cq_overflow_count field in the pg_cluster_rdma_counters view. A continuously rising value typically means cluster_rdma_cq_depth (default 4096) needs to be increased, or the CQ consumption frequency of the IC Listener process (its poll interval) needs adjustment.
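The consumer side is a batch drain loop in the listener: completions must be pulled from the CQ faster than the HCA produces them, otherwise the queue (sized by cluster_rdma_cq_depth) eventually fills. A sketch in which the batch size and handle_completion() are placeholders:

```c
#include <infiniband/verbs.h>

#define CQ_POLL_BATCH 64            /* placeholder batch size */

void handle_completion(const struct ibv_wc *wc);    /* placeholder handler */

/* Drain all pending completion entries; returns the number consumed.
 * If this loop falls behind the HCA's completion rate, the CQ overflows
 * and affected QPs transition to the error state. */
static int
drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[CQ_POLL_BATCH];
    int           total = 0;
    int           n;

    while ((n = ibv_poll_cq(cq, CQ_POLL_BATCH, wc)) > 0)
    {
        for (int i = 0; i < n; i++)
            handle_completion(&wc[i]);
        total += n;
    }
    return total;
}
```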
Tier 1 defaults to SRQ (Shared Receive Queue) mode; on Mellanox hardware XRC is additionally enabled, controlled by cluster_rdma_use_xrc = auto (default). XRC reduces QP count from O(N × S) to O(N) — the key to maintaining a fixed QP resource footprint at 1,000+ sessions. If the HCA does not support XRC (certain Intel / Broadcom models), pgrac automatically falls back to SRQ mode; QP count then scales linearly with session count, requiring adequate QP resources to be reserved in the HCA firmware configuration (see the max_qp field in ibv_devinfo).
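The XRC-or-SRQ decision can be derived from the capability flags the verbs layer reports for the HCA. A sketch of only that check, assuming the rest of the fallback wiring lives elsewhere:

```c
#include <stdbool.h>
#include <infiniband/verbs.h>

/* True if the HCA advertises XRC support; otherwise pgrac stays on plain SRQ
 * and QP count scales with session count (bounded by device_attr.max_qp). */
static bool
hca_supports_xrc(struct ibv_context *ctx)
{
    struct ibv_device_attr attr;

    if (ibv_query_device(ctx, &attr) != 0)
        return false;

    return (attr.device_cap_flags & IBV_DEVICE_XRC) != 0;
}
```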
| View | Key Fields | Purpose |
|---|---|---|
| pg_cluster_rdma_status | tier, hca_vendor, xrc, dcqcn | Current Tier and hardware feature state |
| pg_cluster_rdma_counters | mr_cache_hit_rate, p99_latency_us, cq_overflow_count | Performance metrics and overflow alerts |
| pg_cluster_rdma_errors | type, severity, description | Hardware error event stream |
A sustained p99_latency_us above 10 μs (Tier 1) or 50 μs (Tier 2) is an early signal of performance regression — typically pointing to CQ overflow, incorrect NUMA affinity, or misconfigured DCQCN parameters on the switch.
For deeper protocol details, refer to the following resources:
- pg_cluster_rdma_* view field definitions
- cluster-ic-design.md — Complete cluster_ic framework design: vtable data structure (ClusterICOps), wire format (24-byte fixed header), Stage evolution path (Stage 0 stub → Stage 2 TCP → Stage 6+ RDMA), dual-layer API (cluster_msg_* high-level / cluster_ic_* low-level)
- Chapter 8 — LMS Workers covers how Lock Manager Service worker processes send and receive GES lock messages over the IC framework: the LMS message main loop, cluster_msg_recv call sites, and the impact of LMS per-message latency on global lock protocol throughput under Tier 1 RDMA.