Cache Fusion is the central cross-node mechanism in pgrac: when node B needs a block that node A already holds in its buffer pool, pgrac transfers the block directly from A's memory to B's memory, bypassing disk entirely. Behind that deceptively simple description lie three tightly coupled subsystems: buffer classification (which node holds which version), PCM locks (who has permission to read or write), and the transfer protocol (how blocks move safely between nodes).
This chapter builds the conceptual framework required to understand Cache Fusion — the functional boundaries of the three buffer types, the tri-state PCM lock state machine, the participants in the 3-way protocol, and the cross-node write-ahead log constraint — without covering message formats or field-level details, which belong to their respective deep pages. The chapter is ordered by dependency: from "why Cache Fusion is needed" (motivation) → "what is stored" (buffer types) → "who controls access rights" (PCM locks) → "how blocks are transferred" (3-way protocol) → "how durability is guaranteed" (WAL constraint) → "when it does not help" (limits and mitigations).
If you are already familiar with Oracle RAC's GCS layer, the concepts here map nearly 1:1; pgrac deliberately aligns with Oracle Cache Fusion semantics (N/S/X state names, the PI concept, the 3-way path) to minimize the learning curve for experienced cluster DBAs.
The fundamental tension in a shared-disk cluster is that multiple nodes need to access the same data, yet at any given moment the "authoritative version" of a block resides in only one node's memory. The classical approach is to flush the block back to disk and let the requesting node re-read it; that is viable on a shared-disk model, but expensive.
Typical disk I/O latency sits in the millisecond range (SSD ~0.1–1 ms, HDD ~5–10 ms). By contrast, the target latency for a single block transfer over an RDMA interconnect is 5 μs — 20× to 2,000× faster than disk. For OLTP workloads, a typical read transaction touches dozens of blocks; if every cross-node access required a disk round-trip, the scaling benefit of clustering would be near zero.
Cache Fusion turns cross-node block access into a memory-level transfer, which is the prerequisite for near-linear scaling on a shared-disk cluster. Without Cache Fusion, a shared-disk cluster degrades into the anti-pattern of "multiple nodes fighting over the same disk I/O bandwidth."
pgrac block transfers are built on a three-tier interconnect architecture:
| Tier | Technology | Single 3-way total latency | Typical deployment |
|---|---|---|---|
| Tier 1 | RDMA InfiniBand / RoCE + hardware offload | ~4–5 μs | High-performance nodes within a rack |
| Tier 2 | RoCE (software path) | ~35 μs | Cross-rack within the same datacenter |
| Tier 3 | TCP/IP | ~300 μs | Cross-datacenter / disaster recovery |
Tier 1 RDMA enables zero-copy block transfer (an 8 KB block written directly into the target node's buffer pool frame), with CPU overhead of roughly 2 μs (sender + receiver combined) and minimal memory-bandwidth consumption.
These latency targets apply to a full 3-way transfer. Even the slowest Tier 3 TCP path (~300 μs) is still roughly 3× faster than a ~1 ms random SSD read.
pgrac's buffer pool introduces three buffer categories on top of PG's native single-version buffer, each serving a distinct access pattern. For the full data-structure design see
The functional boundary of each buffer type:
| Type | Role | Who reads | Who writes / how produced | Transferred cross-node? |
|---|---|---|---|---|
| Current (XCUR/SCUR) | Latest version; mutable | All nodes | The node holding the X lock | Yes (3-way / 2-way) |
| CR (Consistent Read) | Historical version; read-only | Queries reading old snapshots | Constructed from Current; immutable | No (constructed locally) |
| PI (Past Image) | Retained dirty-copy mirror | Consistent-read fallback | Auto-generated when X lock is surrendered | No (retained locally) |
Current buffers are further subdivided into two PCM lock modes: XCUR (exclusive, writable) and SCUR (shared, read-only). These two sub-modes correspond to the X and S states of the PCM lock, detailed in the next section.
Across the cluster there is exactly one "authoritative current" copy of a block (either XCUR on one node or SCUR shared across multiple nodes). All writes must occur on the node holding the X lock; other nodes that need the latest version must first acquire an S lock or trigger the 3-way protocol to receive the block.
CR buffers are read-only copies constructed locally on the requesting node: when a node needs a historical snapshot version but the Current buffer has already been updated to a newer version, pgrac applies Current + undo information to roll back to the target SCN and produces a CR buffer. CR buffers are entirely local and require no cross-node transfer. The SCN ceiling of a CR buffer is determined by the snapshot of the initiating transaction (snapshot isolation); once constructed, the CR buffer can serve read requests immediately, without entering the GRD.
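To make these boundaries concrete, the sketch below shows one way the three buffer classes and the XCUR/SCUR sub-modes could be represented alongside a buffer. Every name here (PgracBufferClass, PgracBufferMeta, and their fields) is an illustrative assumption, not an actual pgrac data structure.

```c
/* Illustrative sketch only: assumed names, not actual pgrac definitions. */
#include "postgres.h"
#include "access/xlogdefs.h"        /* XLogRecPtr */

typedef enum PgracBufferClass
{
    BUF_CURRENT_XCUR,   /* latest version, exclusive PCM lock, writable */
    BUF_CURRENT_SCUR,   /* latest version, shared PCM lock, read-only */
    BUF_CR,             /* locally constructed historical version, never shipped */
    BUF_PI              /* dirty copy retained after shipping the block away */
} PgracBufferClass;

typedef struct PgracBufferMeta
{
    PgracBufferClass buf_class;     /* which of the categories above */
    XLogRecPtr       block_lsn;     /* LSN of the last change applied to this copy */
    uint64           snapshot_scn;  /* CR only: highest SCN this copy can serve */
} PgracBufferMeta;
```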
PI buffers are the most pgrac-specific of the three buffer types and are essential for understanding Cache Fusion correctness.
PI creation: when node A, which holds an X lock (i.e., XCUR), is asked to ship a block to node B, and that block is dirty (modified but not yet flushed to disk), A must retain a copy before sending — that copy is the PI. The PI allows node A to satisfy subsequent consistent-read requests (for the historical version before the X lock was surrendered) without re-reading from disk.
PI retirement: a PI is retired under either of two conditions:
Until retired, a PI must remain in the original node's buffer pool and cannot be evicted, even under memory pressure. This is why the buffer-pool eviction policy must be PI-aware (Current buffers may be evicted; PI buffers may not be evicted while their WAL has not yet been persisted).
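A minimal sketch of such a PI-aware eviction gate, reusing the assumed PgracBufferMeta above; the function name and the way the flushed-WAL position is passed in are assumptions, not the actual pgrac eviction code.

```c
/* PI-aware eviction gate (sketch, reusing the assumed PgracBufferMeta above). */
static bool
pgrac_buffer_evictable(const PgracBufferMeta *meta, XLogRecPtr wal_flushed_upto)
{
    /*
     * A PI may not leave the buffer pool until the WAL covering its changes
     * is durable, i.e. until the PI has been retired.
     */
    if (meta->buf_class == BUF_PI && meta->block_lsn > wal_flushed_upto)
        return false;

    /*
     * Current buffers are evictable, but the holder must first release its
     * PCM lock (BLOCK_RELEASE to the master) before reclaiming the frame.
     */
    return true;
}
```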
PI and WAL coupling (feature-019 rule): before a dirty block is transferred cross-node, the corresponding WAL records must be fsync'd. This guarantees that even if the holder node crashes immediately after sending the block, the receiver can recover the block from the holder's redo log — without depending on the holder being available. Violating this rule creates a data-loss window where "the block is already in use on the requester, but the corresponding WAL on the holder was never persisted." This rule is the subject of Section 2.5.
PI is not an alias for CR buffer. CR is "a historical version constructed on demand"; PI is "a dirty copy retained before shipping away." Their lifecycles are entirely different.
PCM (Parallel Cache Management) locks are the control layer of Cache Fusion: they track the access-permission state of every block across the cluster, ensuring that write authority over a block never conflicts between two nodes at the same time. Every block has a single PCM lock record in the GRD (Global Resource Directory), managed by that block's master node.
The GRD is pgrac's global resource metadata dictionary, hash-partitioned by resource_id (for blocks, the BufferTag hash) across all nodes. Each block's PCM lock state is stored in the GRD shard on its master node:
The GRD sharding strategy (which node acts as master for a given hash bucket) is configured at cluster initialization and can be reassigned after a reconfig event (DRM, Dynamic Resource Mastering). This means the master node for a given block can migrate as the cluster topology changes, but at any stable moment the master is unique and unambiguous.
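As an illustration of the hash-partitioned master lookup, the sketch below assumes a bucket-to-node table that is rewritten on reconfiguration or DRM; none of these names are actual pgrac symbols.

```c
/* Illustrative GRD master lookup: resource_id (BufferTag) -> bucket -> node. */
#include "postgres.h"
#include "common/hashfn.h"              /* hash_bytes */
#include "storage/buf_internals.h"      /* BufferTag */

#define GRD_NUM_BUCKETS 4096

static int grd_bucket_master[GRD_NUM_BUCKETS];  /* rewritten on reconfig / DRM */

static int
grd_master_node(const BufferTag *tag)
{
    uint32 bucket = hash_bytes((const unsigned char *) tag,
                               sizeof(BufferTag)) % GRD_NUM_BUCKETS;

    return grd_bucket_master[bucket];   /* unique and unambiguous at any stable moment */
}
```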
PCM locks have three fundamental states: N (Null, no access rights), S (Shared, read access, which may be held by multiple nodes simultaneously), and X (Exclusive, write access, held by at most one node at a time).
has_pi orthogonal flag: the X state carries an orthogonal flag has_pi, indicating "this node retained a PI copy when it shipped the X lock away." This flag does not alter the lock's grant semantics, but it affects how the master routes consistent-read requests: if a requester needs an older version, the master knows which node holds the PI and can route directly.
The three states plus has_pi together cover all cross-node buffer coherency scenarios.
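One compact way to picture the master's per-block record is the struct below: a tri-state grant per node plus the orthogonal has_pi flag. All names, and the fixed-size node array, are assumptions for illustration only.

```c
/* Illustrative per-block PCM record in the master's GRD shard (assumed names). */
#include "postgres.h"
#include "storage/buf_internals.h"      /* BufferTag */

#define PGRAC_MAX_NODES 8               /* assumption: small fixed cluster size */

typedef enum PcmLockMode
{
    PCM_N,      /* Null: no access rights */
    PCM_S,      /* Shared: read access, may be held by several nodes at once */
    PCM_X       /* Exclusive: write access, at most one node at a time */
} PcmLockMode;

typedef struct PcmLockEntry
{
    BufferTag   resource_id;                    /* the block this record governs */
    PcmLockMode granted[PGRAC_MAX_NODES];       /* current grant per node */
    bool        has_pi[PGRAC_MAX_NODES];        /* node retained a PI copy */
    int         xcur_holder;                    /* node holding XCUR, or -1 */
} PcmLockEntry;
```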
The lifecycle of a PCM lock is bound to the block's residency in the buffer pool: when a block is evicted, the node holding S or X must first send a BLOCK_RELEASE message to the master, releasing the PCM lock, before reclaiming the buffer frame. This constraint ensures the master's GRD view never drifts into a state where "the block has already been evicted but the master still believes the holder has it cached."
For the complete state-transition details (including BAST notification paths) see
```
  +-----+     N→S      +-----+
  |  N  | -----------> |  S  |
  +-----+              +-----+
     |          ↗          |
 N→X |        ↗  X→N       | S→X
     ↓      ↗              ↓
  +---------------------------+
  |             X             |   (has_pi orthogonal)
  +---------------------------+
```
The complete PCM state-machine transition set is the full permutation of three states (3×3 = 9 directed edges):
| Transition | Triggering scenario | Initiator |
|---|---|---|
| N → S | Request read access to block | Requester → Master |
| N → X | Request write access (new page) | Requester → Master |
| S → X | Upgrade to exclusive write | Requester → Master |
| S → N | Another node requests X; this node's S is revoked | Master → Holder |
| X → S | Downgrade: another node requests S | Master → Holder |
| X → N | Full invalidation: another node requests X | Master → Holder |
| X → X | Ownership transfer (uncommon; reconfig scenario) | Master coordinates |
| S → S | Shared-holder count changes (new member joins) | No explicit message |
| N → N | No-op | — |
S → X upgrade requires the master to first send INVALIDATE to every node holding S, wait for all acknowledgements, and only then grant X. This is the highest-overhead lock path in the cluster; application layers should avoid "read-then-write" access patterns that trigger S→X.
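The cost of that path is easiest to see in a sketch of the master-side grant logic, reusing the PcmLockEntry sketch above; send_invalidate and wait_for_invalidate_acks are assumed stand-ins for pgrac's real messaging layer, not actual APIs.

```c
/* Master-side S→X upgrade (sketch; message helpers are assumed, not real APIs). */
extern void send_invalidate(int node, const BufferTag *tag);        /* BAST to holder */
extern void wait_for_invalidate_acks(const BufferTag *tag, int n);

static void
pcm_grant_upgrade_to_x(PcmLockEntry *lock, int requester)
{
    int pending = 0;

    /* 1. Revoke every other shared holder (their S drops to N). */
    for (int node = 0; node < PGRAC_MAX_NODES; node++)
    {
        if (node != requester && lock->granted[node] == PCM_S)
        {
            send_invalidate(node, &lock->resource_id);
            pending++;
        }
    }

    /* 2. X may be granted only after all acknowledgements arrive; this
     *    wait is what makes S→X the most expensive lock path. */
    wait_for_invalidate_acks(&lock->resource_id, pending);

    /* 3. Record the exclusive grant. */
    for (int node = 0; node < PGRAC_MAX_NODES; node++)
        lock->granted[node] = (node == requester) ? PCM_X : PCM_N;
    lock->xcur_holder = requester;
}
```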
The 3-way protocol is Cache Fusion's canonical transfer path, named for its three participants: the Requester (the node that needs the block), the Master (the PCM-lock master for that block, determined by GRD hash), and the Current Holder (the node currently holding the block). The three parties play different roles in a single block transfer, using two message round trips to move a dirty block directly from the Holder's memory into the Requester's memory.
Two invariants hold across all protocol paths:
```
 Node B (req)              Node M (master)              Node A (holder)
      |                          |                            |
      |------ 1. CR_REQ -------->|                            |
      |                          |------ 2. ASK_HOLDER ------>|
      |                          |                            |
      |                          |<----- 3. SEND_BLOCK -------|
      |<------ 4. SEND ----------|                            |
      |                          |                            |
      |          steady state: 0-way (cached locally)         |
```
The four protocol steps, as numbered in the diagram: (1) the Requester sends CR_REQ to the block's Master; (2) the Master forwards the request to the Current Holder as ASK_HOLDER; (3) the block is shipped out of the Holder's buffer pool with SEND_BLOCK; (4) SEND completes delivery into the Requester's buffer pool.
0-way steady state: once node B has the block in its local cache, subsequent accesses to that block require zero network messages (PCM = S or X, direct local read or write). This is why Cache Fusion has near-zero overhead for "hot block" scenarios. In typical OLTP workloads, the majority of block accesses occur on the 0-way path; the 3-way fires only on first acquisition or under lock contention.
2-way path: when the Master and the Holder are the same node, steps 2–3 become node-internal and only two messages (a single round trip) are needed (~3 μs on Tier 1).
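The distinction among the three paths can be captured in a tiny decision helper; the sketch below uses assumed names (CfPath, cf_transfer_path) and simply encodes the 0-way / 2-way / 3-way classification described above.

```c
/* Which transfer path a block request takes (illustrative, assumed names). */
typedef enum CfPath
{
    CF_0WAY,    /* already cached locally under S or X: no messages */
    CF_2WAY,    /* master is also the holder: request + block, one round trip */
    CF_3WAY     /* master forwards to a third node that holds the block */
} CfPath;

static CfPath
cf_transfer_path(bool cached_locally, int master_node, int holder_node)
{
    if (cached_locally)
        return CF_0WAY;
    if (master_node == holder_node)
        return CF_2WAY;
    return CF_3WAY;
}
```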
For the complete message field definitions, X→S downgrade sequencing, and S→X upgrade sequencing, see
Shipping dirty blocks through Cache Fusion introduces semantics that the single-instance WAL-before-page rule does not cover, so an additional constraint is required.
On a single PG instance, before a dirty buffer can be evicted, the corresponding WAL must be fsync'd first (the write-ahead log principle). In a cluster, this constraint must be extended to cross-node transfer scenarios:
feature-019 rule: before a dirty block is shipped from the holder to another node, all committed-transaction WAL records for that block must already be fsync'd into the holder node's redo stream.
This rule guarantees that if the Holder node crashes immediately after shipping the block, the Requester can still reconstruct the block from the Holder's redo log — even if the Holder is no longer available. Without this rule, there is a data-loss window where "the block is already in use on the Requester, but the corresponding WAL on the Holder was never persisted."
In practice, the LMS (Lock Manager Server) process checks block_lsn (the LSN corresponding to the block's last modification) before a block transfer, confirms that the WAL writer has already persisted that LSN, and only then issues the block. This check is free for non-dirty blocks (unmodified shared reads) and adds no fsync wait; only dirty blocks can potentially trigger a synchronous wait on the WAL writer.
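A sketch of that check, under the assumption that the block's page-header LSN is the block_lsn being compared; lms_prepare_block_transfer is an assumed name, while PageGetLSN and XLogFlush are standard PostgreSQL primitives used here only for illustration.

```c
/* Pre-transfer WAL check (feature-019 rule), sketched with assumed names. */
#include "postgres.h"
#include "access/xlog.h"            /* XLogFlush */
#include "storage/bufpage.h"        /* Page, PageGetLSN */

static void
lms_prepare_block_transfer(Page page, bool is_dirty)
{
    if (is_dirty)
    {
        /* LSN stamped by the last WAL-logged change to this block. */
        XLogRecPtr block_lsn = PageGetLSN(page);

        /* No-op if that LSN is already durable; otherwise waits for the
         * WAL writer before the block may leave this node. */
        XLogFlush(block_lsn);
    }
    /* Clean blocks ship immediately: no WAL check, no fsync wait. */
}
```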
From the PI perspective, the feature-019 rule is tightly coupled to the PI lifecycle: a PI may be retired once its corresponding WAL has been persisted, precisely because "WAL fsync'd = durable version on disk = the disk has taken over the PI's fallback role." The two mechanisms (pre-transfer WAL fsync + PI retention) together form a dual safety guarantee for Cache Fusion correctness.
For WAL record format and LSN management design details see
The feature-019 rule is a critical correctness constraint for Cache Fusion. Violating it — for example, bypassing the WAL fsync and shipping the block directly — leaves some cluster nodes holding unpersisted data; if a node crashes, redo-based recovery becomes impossible.
Cache Fusion provides limited benefit — and can become a performance bottleneck — in the following scenarios.
Single-row hot tuple (Hot Row Contention)
If a single high-frequency-update row (e.g., a counter or sequence) is written alternately by multiple nodes, the PCM X lock for that block migrates continuously between nodes: A writes → X moves to B → B writes → X moves back to A → A writes… Each migration triggers a full 3-way protocol. On Tier 1 RDMA this costs ~5 μs per trip; on Tier 3 TCP it costs ~300 μs. At write QPS above 10K/s, PCM protocol overhead can exceed the application logic itself.
Cross-node Ping-Pong
If the application-layer load balancer distributes reads and writes on the same row randomly across nodes, the block's PCM state oscillates between S and X (ping-pong). Every S→X upgrade requires the master to first invalidate all S holders — and when many nodes hold S, this is the most expensive lock protocol path in the cluster.
Many Small Transactions Competing Cross-Node on the Same Relation
In OLTP scenarios where the same hot table (e.g., an orders or inventory table) is written concurrently by all nodes, and the writes land on a small number of blocks (low-cardinality block layout), X→S→X oscillation triggers frequent INVALIDATE broadcasts. In Oracle RAC production environments this class of problem is known as the "gc buffer busy" wait event; pgrac's equivalent monitoring metric is cf_buffer_busy in pg_cluster_wait_events.
Application-layer mitigations:
- For single-row hot counters and sequence-style updates, use pg_cluster_sequence instead (per-instance range allocation, batch prefetch, no cross-node contention).
- Use the pg_cluster_cf_message_dist view to observe BLOCK_INVALIDATE frequency; a block tag with a high invalidate rate is a direct signal of a ping-pong hotspot.

The "doesn't help" scenarios are fundamentally application access-pattern problems, not protocol defects. In Oracle RAC production deployments, the most common performance problems almost always stem from indiscriminate hot-block access at the application layer, not from Cache Fusion protocol overhead itself. The first diagnostic step is to inspect pg_cluster_wait_events, focusing on the wait-time distribution of cf_buffer_busy and cf_cr_request.
This chapter has established the Cache Fusion conceptual framework: the functional boundaries of the three buffer types (Current / CR / PI), the PCM lock tri-state machine (N/S/X + the has_pi orthogonal flag), the three participants in the 3-way protocol (Requester / Master / Holder), and the cross-node write-ahead log correctness constraint (the feature-019 rule). This framework is the foundational vocabulary for understanding pgrac cluster behavior; all mechanisms in subsequent chapters build on these concepts.
The next chapter, Chapter 3 — GES Concepts, introduces the other cross-node lock protocol that operates above the buffer layer: GES (Global Enqueue Service) manages cross-instance enqueue locks (row locks, table locks, transaction locks), with responsibilities entirely orthogonal to PCM locks but sharing the same GRD master-routing infrastructure at the architectural level. Understanding GES requires mastery of this chapter's PCM state machine, because the GES BAST (Blocking AST) notification mechanism is structurally symmetric with the PCM INVALIDATE path.
For protocol deep-dives (message formats, field definitions, timing diagrams, performance budgets) read the Cache Fusion deep page.