The previous chapter (Ch 10) described the physical layout of per-instance undo tablespaces and the cross-node visibility path: undo records live in each instance's independent segment, and CR block construction traverses the undo chain via UBA in reverse application. This chapter goes deeper into the buffer pool layer — where undo and heap blocks ultimately reside — and how pgrac extends PG's native single-machine buffer pool with cross-node buffer coordination.
The core challenge for the pgrac buffer pool is extending PG's single-machine, single-copy model into a cluster multi-copy model without slowing PG's native hot path. Three kinds of copy coexist: current (XCUR / SCUR, the two PCM forms), CR (consistent-read copies derived per snapshot read SCN), and PI (past-image copies retained after relinquishing X). Global coherency is maintained through the PCM lock state machine (AD-002) and the Cache Fusion protocol (AD-005). BufferTag remains identical to PG vanilla: CR and PI copies do not enter the BufTable hash; they are linked to the current buffer via cr_chain_head / pi_buf_id (buffer-pool-design.md §4.2).
PG's native buffer pool is a single-machine, single-version design: at most one copy (current) of each block exists in memory, all intra-node concurrency is serialized by LWLock (content_lock), there is no cross-instance coherency protocol, and there is no CR / PI concept. pgrac adds cross-node copy semantics on top of this while preserving PG's BufTable hash path and pin/unpin mechanics, which keeps the change minimally invasive.
| Dimension | PG native | pgrac |
|---|---|---|
| In-memory copies per block | 1 (current) | 1 current (XCUR or SCUR) + 0..N CR copies (chain bound 8) + 0..1 PI |
| Cross-node coherency | ❌ None | PCM lock state machine (N/S/X) + Cache Fusion |
| Visibility copies | Heap dead tuples + CLOG | current + in-buffer CR copies constructed on demand (never written to disk) |
| BufferTag | RelFileLocator + ForkNumber + BlockNumber (20 B) | Unchanged (CR/PI associated via chain, not in BufTable) |
| BufferDesc size | 64 B (1 cache line) | 128 B (2 cache lines; hot fields all in first 64 B) |
| Eviction policy | Clock-sweep (single priority) | Four-pool differentiated: CR > PI > current eviction priority (descending) |
| Cross-node block access | ❌ Not supported | Cache Fusion RDMA transfer (~5 μs Tier 1) |
| CR block | ❌ Not supported | Dedicated BufferDesc slot inside CRPool + CR chain; in-memory only |
Every pgrac buffer slot carries exactly one copy type at any moment. The two PCM forms of the current copy (XCUR / SCUR) and the PI copy are derived from pcm_state and pi_flags (no independent type field — zero redundancy); CR is a separate copy type marked explicitly by buffer_type = BUF_TYPE_CR and linked into the current buffer's cr_chain:
| Type | Meaning | Buffer slot | Mapping / persistence |
|---|---|---|---|
| XCUR (Exclusive Current) | Exclusive write; at most 1 node holds it cluster-wide | current buffer; in BufTable | pcm_state = X, has_pi = false; on disk |
| SCUR (Shared Current) | Shared read; multiple nodes may hold simultaneously | current buffer; in BufTable | pcm_state = S, has_pi = false; on disk |
| CR (Consistent Read) | Historical version constructed at a given read_scn | Dedicated buffer slot (CRPool); hangs off current via cr_chain | buffer_type = BUF_TYPE_CR, cr_scn; in-memory only, never persisted |
| PI (Past Image) | Stale dirty copy retained after relinquishing X | Dedicated buffer slot (PIPool); linked from current via pi_buf_id | has_pi = true; in-memory only, 5 min TTL |
Of the four copy types, only current (XCUR / SCUR) is indexed by the BufTable hash (keyed by BufferTag). CR and PI are derivatives of the current copy and are reached via cr_chain_head / cr_chain_next (CR multi-version chain, sorted descending by cr_scn) and pi_buf_id (single PI reference) respectively. This keeps the BufTable hash dimension identical to PG vanilla — no extra hash keys are needed for historical versions (buffer-pool-design.md §4.2 / §6.1 / §7.2).
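The chain walk described above can be sketched as follows. This is a minimal illustration, not the pgrac implementation: `MiniDesc`, `cr_chain_find` and the `CR_CHAIN_END` marker are hypothetical names, descriptors are assumed to live in an array indexed by buf_id, and a CR copy is assumed to serve a reader only when its cr_scn equals the reader's read_scn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SCN;
#define CR_CHAIN_END (-1)   /* hypothetical end-of-chain marker */

/* Hypothetical, trimmed-down descriptor: only the fields the walk needs. */
typedef struct MiniDesc {
    int cr_chain_head;  /* on the current buffer: newest CR copy, or CR_CHAIN_END */
    int cr_chain_next;  /* on a CR buffer: next (older) CR copy, or CR_CHAIN_END */
    SCN cr_scn;         /* on a CR buffer: read SCN the copy was built for */
} MiniDesc;

/*
 * Return the buf_id of the CR copy built for exactly read_scn, or
 * CR_CHAIN_END if none is linked. The chain is sorted descending by
 * cr_scn, so the walk can stop at the first copy older than read_scn.
 */
static int
cr_chain_find(const MiniDesc *descs, int current_id, SCN read_scn)
{
    assert(descs != NULL && current_id >= 0);

    for (int id = descs[current_id].cr_chain_head;
         id != CR_CHAIN_END;
         id = descs[id].cr_chain_next)
    {
        if (descs[id].cr_scn == read_scn)
            return id;          /* reusable CR copy */
        if (descs[id].cr_scn < read_scn)
            break;              /* everything further down is older still */
    }
    return CR_CHAIN_END;        /* caller would have to construct a new CR copy */
}
```

Because the lookup starts from the current buffer found through BufTable, no additional hash key is involved, which is the property the paragraph above relies on.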
                      cluster-wide buffer state
            Node 1                 Node 2                 Node 3
          ┌────────┐             ┌────────┐             ┌────────┐
          │  pool  │             │  pool  │             │  pool  │
          │        │             │        │             │        │
block A:  │  XCUR  │ ──── X ──── │   ·    │ ──── X ──── │   ·    │  exclusive write
          │        │             │        │             │        │
block B:  │  SCUR  │ ──── S ──── │  SCUR  │ ──── S ──── │  SCUR  │  shared read
          │        │             │        │             │        │
block C:  │   CR   │             │   ·    │             │   CR   │  constructed at read_scn
          │ @SCN 99│             │        │             │ @SCN 99│  (dedicated CRPool slot, in-memory)
          │        │             │        │             │        │
block D:  │   PI   │             │  XCUR  │             │   PI   │  stale page retained
          │ @SCN 75│             │ @SCN 80│             │ @SCN 75│  (ordered by SCN)
          └────────┘             └────────┘             └────────┘
Explicit buffer_type enum (buffer-pool-design.md §4.4):
typedef enum {
BUF_TYPE_CURRENT, /* Current block copy; readable/writable per PCM lock (N/S/X) */
BUF_TYPE_CR, /* Consistent Read copy; read-only, built per cr_scn */
BUF_TYPE_PI, /* Past Image; read-only, retained after X relinquishment */
} BufferType;
buffer_type is an explicit field (hot tail, offset 52) that the eviction, flush, and Cache Fusion paths use to branch quickly by copy type. XCUR vs SCUR is not a distinct buffer_type value: both are BUF_TYPE_CURRENT combined with pcm_state = X or S (per buffer-pool-design.md §5.2).
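To make the combination rule concrete, here is a small sketch of how the four copy types could be told apart from buffer_type and pcm_state. The `PcmState` encoding, the `CopyKind` enum and the `copy_kind` function are assumptions introduced for illustration; only the `BufferType` enum comes from the text.

```c
#include <assert.h>

typedef enum { BUF_TYPE_CURRENT, BUF_TYPE_CR, BUF_TYPE_PI } BufferType;
typedef enum { PCM_N, PCM_S, PCM_X } PcmState;   /* hypothetical encoding */

/* Hypothetical classification result, for illustration only. */
typedef enum {
    COPY_XCUR, COPY_SCUR, COPY_CR, COPY_PI,
    COPY_CURRENT_NOLOCK   /* current slot whose PCM state is N (transitional) */
} CopyKind;

static CopyKind
copy_kind(BufferType type, PcmState pcm)
{
    assert(pcm == PCM_N || pcm == PCM_S || pcm == PCM_X);

    switch (type)
    {
        case BUF_TYPE_CR:
            return COPY_CR;
        case BUF_TYPE_PI:
            return COPY_PI;
        case BUF_TYPE_CURRENT:
            /* XCUR and SCUR share one buffer_type; pcm_state splits them. */
            if (pcm == PCM_X)
                return COPY_XCUR;
            if (pcm == PCM_S)
                return COPY_SCUR;
            break;
    }
    return COPY_CURRENT_NOLOCK;
}
```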
pgrac extends PG's native BufferDesc (64 B) to 128 B, appending cluster fields guarded by the USE_PGRAC_CLUSTER compile guard. This follows the same pattern as Ch 9's PageHeaderData extension and Ch 10's undo segment header extension: extend existing PG structs rather than introduce parallel structures.
/* BufferDesc — PG 16.13 measured layout (USE_PGRAC_CLUSTER mode, 128 B)
* Conceptually named ClusterBufferDesc; code retains PG's original name BufferDesc
* with compile-guard-appended fields.
*/
typedef struct BufferDesc {
/* === Cache line 1 first half: PG original fields [0, 52), HOT, compatible with PG vanilla === */
BufferTag tag; /* 20 B: RelFileLocator(12) + ForkNumber(4) + BlockNumber(4) */
int buf_id; /* 4 B */
pg_atomic_uint32 state; /* 4 B: refcount + usage_count + flags */
int wait_backend; /* 4 B */
int freeNext; /* 4 B */
LWLock content_lock; /* 16 B; ends at offset 52 */
/* === Cache line 1 cluster hot tail [52, 64), 12 B; hot path access === */
uint8 buffer_type; /* offset 52: BUF_TYPE_CURRENT / CR / PI (derived; redundant snapshot) */
uint8 pcm_state; /* offset 53: N / S / X */
uint8 pi_flags; /* offset 54: has_pi and related bits */
uint8 _pad; /* offset 55: 1 B padding for 8 B alignment of block_scn */
SCN block_scn; /* offset 56: 8 B; ends at 64 = cache line 1 boundary */
/* === Cache line 2 cold body [64, 128), 64 B; cluster-specific paths only === */
int cr_chain_head; /* offset 64: PIVOT B — moved here (CR construction is cold path) */
int cr_chain_next; /* offset 68 */
SCN cr_scn; /* offset 72: CR buffers only (equals snapshot.read_scn at construction) */
int pi_buf_id; /* offset 80 */
XLogRecPtr pi_lsn; /* offset 88: PI buffers only */
uint16 grd_master_node; /* offset 96 */
uint16 grd_master_seq; /* offset 98 */
uint8 cf_state; /* offset 100: Cache Fusion protocol state */
uint8 cf_owner_node; /* offset 101 */
uint16 cf_request_count; /* offset 102 */
LWLock pcm_lock; /* offset 104: accessed only during lock transition */
TimestampTz pi_created_at; /* offset 120: ends at 128 */
/* total: 128 B (BUFFERDESC_PAD_TO_SIZE = 128 in USE_PGRAC_CLUSTER mode) */
} BufferDesc;
During v1.2 implementation (2026-05-02), measurement turned up a critical finding: on PG 16.13, sizeof(BufferTag) is 20 B (RelFileLocator 12 B + ForkNumber 4 B + BlockNumber 4 B), not the 16 B assumed in early design documents. PG's original fields therefore occupy offsets [0, 52), leaving only 12 B for the cluster hot tail. The three type/state bytes plus one padding byte take 4 B of that and block_scn takes the remaining 8 B, so cr_chain_head (4 B) cannot also fit in cache line 1.
PIVOT B trade-off: block_scn is the critical field on the Stage 2–3 visibility hot path (every buffer access must compare block_scn against snapshot.read_scn) and must reside in cache line 1. cr_chain_head is only accessed during CR construction (a cold path) — it is sacrificed to free space in cache line 1, moved to the start of cache line 2 (offset 64).
hot path access pattern (cache line 1 only = first 64 B):
BufTableLookup → IncreaseRefcount → read pcm_state → read block_scn → LWLockAcquire(content_lock)
cache line 2 is never touched; identical overhead to PG native hot path (1 cache miss)
cold path (cache line 2, triggered only in new scenarios):
CR construction → access cr_chain_head / cr_chain_next / cr_scn
PI creation → access pi_buf_id / pi_lsn / pi_created_at
Cache Fusion → access cf_state / cf_owner_node / pcm_lock
At compile time, five StaticAssertDecl statements lock layout invariants via semantic constraints — for example, offsetof(block_scn) + sizeof(SCN) <= 64 (block_scn within cache line 1) and offsetof(cr_chain_head) >= 64 (cr_chain_head at the start of cache line 2) — rather than hardcoded magic offset numbers. If a future PG version expands BufferTag again, the assertions fire at compile time rather than silently miscalculating.
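The idea of pinning the layout with semantic assertions can be reproduced in a standalone sketch. The struct below mirrors the field order and sizes listed earlier, but `FakeBufferTag`, `FakeLWLock` and `FakeBufferDesc` are stand-ins rather than the real PG types, C11 `static_assert` is used in place of PG's StaticAssertDecl, and only the two constraints quoted in the text plus the total size are shown. A typical 64-bit ABI with 8-byte alignment for 64-bit integers is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SCN;

/* Stand-ins with the sizes quoted in the text; not the real PG types. */
typedef struct { uint32_t words[5]; } FakeBufferTag;   /* 20 B */
typedef struct { uint32_t words[4]; } FakeLWLock;      /* 16 B */

typedef struct FakeBufferDesc {
    /* PG original fields, [0, 52) */
    FakeBufferTag tag;
    int32_t   buf_id;
    uint32_t  state;
    int32_t   wait_backend;
    int32_t   freeNext;
    FakeLWLock content_lock;
    /* cluster hot tail, [52, 64) */
    uint8_t   buffer_type;
    uint8_t   pcm_state;
    uint8_t   pi_flags;
    uint8_t   _pad;
    SCN       block_scn;
    /* cold body, [64, 128) */
    int32_t   cr_chain_head;
    int32_t   cr_chain_next;
    SCN       cr_scn;
    int32_t   pi_buf_id;
    uint64_t  pi_lsn;             /* compiler pads 84..87 before this field */
    uint16_t  grd_master_node;
    uint16_t  grd_master_seq;
    uint8_t   cf_state;
    uint8_t   cf_owner_node;
    uint16_t  cf_request_count;
    FakeLWLock pcm_lock;
    int64_t   pi_created_at;
} FakeBufferDesc;

/* Semantic constraints: stated as relations to the 64 B line, not as fixed offsets. */
static_assert(offsetof(FakeBufferDesc, block_scn) + sizeof(SCN) <= 64,
              "block_scn must stay inside cache line 1");
static_assert(offsetof(FakeBufferDesc, cr_chain_head) >= 64,
              "cr_chain_head must live in cache line 2");
static_assert(sizeof(FakeBufferDesc) == 128,
              "descriptor must span exactly two cache lines");
```

If the stand-in tag were widened to 24 B, the first assertion would fail at compile time, which is the failure mode the paragraph above describes.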
pgrac buffer pool concurrency safety is jointly guaranteed by two orthogonal and independent dimensions that cannot be merged:
Dimension 1: Pin (refcount)
refcount > 0 prevents a buffer from being evicted
Dimension 2: PCM Lock (N/S/X)
pcm_state field stored in ClusterBufferDesc hot tail (offset 53)
/* Valid combinations of the two dimensions */
/* Pin + S: backend holds buffer reference, node holds shared PCM lock, can read locally */
/* Pin + X: backend holds buffer reference, node holds exclusive PCM lock, can write locally */
/* Unpinned + X: no backend reference but node still holds X lock → cannot evict immediately (see below) */
/* Pin + N: intermediate state during PCM lock transition → rare but valid */
Critical constraint for eviction and PCM X lock: a buffer holding a PCM X lock cannot be evicted directly even when refcount = 0 (unpinned). The reason is that the PCM X lock signals to the GRD that "the master of this block is on this node" — direct eviction would desynchronize GRD state from local buffer state. The correct path is to first notify the GRD to release the X lock (pcm_release_x_lock), transition the node's pcm_state → N, flush the dirty block, then remove it from BufTable and return the buffer slot.
Acquisition order: the PCM lock and content_lock are always acquired strictly in "PCM first, then content" order to prevent deadlock (§5 of the AD-002 design document gives a complete formal proof).
The 9 valid PCM state transitions (pcm-lock-protocol-design.md §4, derived from AD-002):
| # | Transition | Triggering scenario |
|---|---|---|
| 1 | N → S | Node's first read of the block (LOCK_REQUEST(S) → master) |
| 2 | N → X | Node's first write to the block (LOCK_REQUEST(X) → master) |
| 3 | S → X (self upgrade) | Node holds S and wants to write (LOCK_REQUEST(X_UPGRADE)) |
| 4 | X → S (retain PI) | Another node requests S; this node downgrades (DOWNGRADE(X→S, keep_pi=true)) |
| 5 | X → N (retain PI) | Another node requests X; this node fully relinquishes (DOWNGRADE(X→N, keep_pi=true)) |
| 6 | X → N (no PI) | Proactive release before evict (RELEASE → master) |
| 7 | S → N (invalidated) | Another node requests X; this node receives INVALIDATE |
| 8 | S → N (proactive) | Proactive release before evict (RELEASE → master) |
| 9 | ITL cleanout S → X | Reader-triggered commit_scn write-back (AD-006 round four); immediately downgrades X → S once cleanout completes |
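Rows 4, 5 and 7 of the table describe how a holder reacts when another node asks for the block. A compact sketch of that reaction follows; `HolderReaction` and `holder_on_remote_request` are hypothetical names, the `PcmState` encoding is an assumption, and keep_pi is set exactly as the table's DOWNGRADE messages show.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PCM_N, PCM_S, PCM_X } PcmState;   /* hypothetical encoding */

typedef struct HolderReaction {
    PcmState next;      /* holder's PCM state after the remote request */
    bool     keep_pi;   /* whether a past image is retained */
} HolderReaction;

/* How the current holder's state changes when another node requests S or X. */
static HolderReaction
holder_on_remote_request(PcmState held, PcmState requested)
{
    HolderReaction r = { held, false };

    assert(requested == PCM_S || requested == PCM_X);

    if (held == PCM_X)
    {
        /* rows 4 and 5: X -> S for a reader, X -> N for a writer; PI kept */
        r.next = (requested == PCM_S) ? PCM_S : PCM_N;
        r.keep_pi = true;
    }
    else if (held == PCM_S && requested == PCM_X)
    {
        /* row 7: INVALIDATE, S -> N, nothing to retain */
        r.next = PCM_N;
    }
    /* held == S with a remote S request: shared holders coexist, no change */
    return r;
}
```

The remaining rows (1 to 3, 6, 8, 9) are initiated by the local node rather than by a remote request, so they do not appear in this function.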
pgrac adapts PG's clock-sweep eviction with a four-pool differentiated scheme (buffer-pool-design.md §9.2). current (XCUR / SCUR) gets the lowest eviction priority — hot data is the most precious. CR copies are the easiest to evict — their reconstruction cost is bounded (rebuilt by walking the undo chain). PI sits in between but is protected by a 5 min TTL to avoid being dropped during Reconfig.
Four-pool static partition (default; tunable via GUC cluster_cr_pool_pct / cluster_pi_pool_pct):
| Pool | Default share | Size (shared_buffers = 16 GB) | Eviction priority |
|---|---|---|---|
| CurrentPool (XCUR / SCUR) | 60% | 9.6 GB | Lowest (most precious; dirty must flush first) |
| CRPool | 20% | 3.2 GB | Highest (rebuilt O(undo) on demand; never written) |
| PIPool | 10% | 1.6 GB | Medium (evictable after TTL 5 min) |
| Reserve | 10% | 1.6 GB | Dynamically adjusted |
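The arithmetic behind the table can be checked with a short sketch that splits a buffer count by percentage. `PoolSizes` and `partition_pools` are hypothetical names, the 8 KB block size is PG's default BLCKSZ, and treating the reserve as "whatever is left" is an assumption made here so that rounding never loses a buffer; the text only names GUCs for the CR and PI shares.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: buffer counts per pool. */
typedef struct PoolSizes {
    uint64_t current;
    uint64_t cr;
    uint64_t pi;
    uint64_t reserve;
} PoolSizes;

/*
 * Split nbuffers into the four pools. Percentages are integers in [0, 100];
 * the reserve absorbs the remainder, including rounding leftovers.
 */
static PoolSizes
partition_pools(uint64_t nbuffers, unsigned current_pct,
                unsigned cr_pct, unsigned pi_pct)
{
    PoolSizes p;

    assert(current_pct + cr_pct + pi_pct <= 100);

    p.current = nbuffers * current_pct / 100;
    p.cr      = nbuffers * cr_pct / 100;
    p.pi      = nbuffers * pi_pct / 100;
    p.reserve = nbuffers - p.current - p.cr - p.pi;
    return p;
}
```

With shared_buffers = 16 GB and 8 KB blocks there are 2,097,152 buffers; calling `partition_pools(2097152, 60, 20, 10)` gives the 60/20/10/10 split in the table, to within one buffer of rounding.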
Adapted StrategyGetBuffer three-stage flow (buffer-pool-design.md §9.3):
StrategyGetBuffer():
    /* 1. Prefer a victim in CRPool (stale CR copies, or the LRU-oldest CR) */
    victim = sweep_cr_pool();
    if (victim) return victim;
    /* 2. Then look in PIPool (expired TTL preferred) */
    victim = sweep_pi_pool();
    if (victim) return victim;
    /* 3. Classic clock-sweep fallback in CurrentPool
     * (must release PCM X lock + flush dirty first) */
    victim = sweep_current_pool();
    if (victim->pcm_state == X) pcm_release_x_lock(victim);   /* X → N before eviction */
    if (victim->dirty) flush(victim);
    return victim;
Per-current CR chain length is capped at 8 by default (buffer-pool-design.md §6.3; when cr_chain exceeds the cap the oldest CR copy is evicted). Total CR-pool occupancy is constrained by the 20% quota; CR hit ratio is observable via the pg_cluster_cr_chain_stats view.
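A sketch of how the cap could be enforced on insertion is shown below. It assumes, as the earlier sections state, that the chain is kept sorted descending by cr_scn, so the tail is always the oldest copy. `CrDesc`, `cr_chain_insert`, `CR_CHAIN_END` and `CR_CHAIN_MAX` are hypothetical names, and the chain is assumed to hold at most 8 entries before the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SCN;
#define CR_CHAIN_END (-1)   /* hypothetical end-of-chain marker */
#define CR_CHAIN_MAX 8      /* default cap from the text */

typedef struct CrDesc {
    int cr_chain_head;   /* on the current buffer */
    int cr_chain_next;   /* on a CR buffer */
    SCN cr_scn;          /* on a CR buffer */
} CrDesc;

/*
 * Link CR buffer new_id into current_id's chain, keeping descending cr_scn
 * order. If the chain then exceeds CR_CHAIN_MAX, unlink the tail (the oldest
 * copy) and return its buf_id; otherwise return CR_CHAIN_END.
 */
static int
cr_chain_insert(CrDesc *descs, int current_id, int new_id)
{
    int *link;
    int  len = 0;

    assert(descs != NULL && current_id != new_id);

    /* Find the first link whose target is not newer than the new copy. */
    link = &descs[current_id].cr_chain_head;
    while (*link != CR_CHAIN_END && descs[*link].cr_scn > descs[new_id].cr_scn)
        link = &descs[*link].cr_chain_next;
    descs[new_id].cr_chain_next = *link;
    *link = new_id;

    /* Walk again; the entry at position CR_CHAIN_MAX + 1 is the tail. */
    link = &descs[current_id].cr_chain_head;
    while (*link != CR_CHAIN_END)
    {
        if (++len > CR_CHAIN_MAX)
        {
            int evicted = *link;

            *link = descs[evicted].cr_chain_next;
            descs[evicted].cr_chain_next = CR_CHAIN_END;
            return evicted;     /* caller returns this slot to CRPool */
        }
        link = &descs[*link].cr_chain_next;
    }
    return CR_CHAIN_END;
}
```

One consequence of this policy is that a new copy older than everything in a full chain is itself the one unlinked; whether pgrac handles that case differently is not stated in the text.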
PI TTL and eviction: a PI buffer's pi_created_at field (offset 120, cache line 2) records the creation timestamp; after the default 5 minutes (cluster_pi_ttl_sec = 300) it is marked as an eviction candidate. PI is also cleaned up early in the following cases: the node re-acquires an X lock on the same block (the PI is then meaningless); after Phase 4 master reconstruction completes during Reconfig; or when the cluster_undo_retention_sec window closes and the associated undo data becomes invalid.
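The TTL test itself reduces to one comparison on pi_created_at. The sketch below assumes TimestampTz is a 64-bit microsecond count, as in PostgreSQL, and that a PI becomes a candidate once its age reaches the TTL (the text does not say whether the boundary is inclusive). `pi_ttl_expired` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;              /* microseconds, as in PostgreSQL */
#define USECS_PER_SEC INT64_C(1000000)

/*
 * True once the PI has lived for at least ttl_sec seconds
 * (ttl_sec would come from cluster_pi_ttl_sec, default 300).
 */
static bool
pi_ttl_expired(TimestampTz pi_created_at, TimestampTz now, int ttl_sec)
{
    assert(ttl_sec >= 0);
    return now - pi_created_at >= (TimestampTz) ttl_sec * USECS_PER_SEC;
}
```

The early-cleanup cases listed above (re-acquiring X, Reconfig completion, undo retention closing) are independent of this check and would be handled by separate triggers.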
OLTP impact: after PIVOT B the hot path reads only cache line 1 (first 64 B), adding just 1 byte of pcm_state read + branch overhead (~5 ns) over PG native. The four-pool structure gives current hot data priority retention while CRPool absorbs full-table scans, preventing them from polluting the OLTP working set. Overall OLTP TPS impact is < 1% (design analysis conclusion; Stage 1.6 empirical validation in progress).
The grd_master_node / grd_master_seq fields in BufferDesc (cache line 2, offsets 96–99) cache the block's master routing in the GRD (Global Resource Directory), avoiding a GRD lookup on every cross-node operation. The full GRD resource-identity model, shard routing, and holders/waiters table structure are covered in Chapter 3 · GES Concepts. After Reconfig rebuilds a master, these fields are invalidated via sinval (buffer-pool-design.md §11.3).
For deeper design details and related features:
- ClusterBufferDesc C struct, 5 StaticAssertDecl semantic constraints, three-pool GUC parameters (cluster_cr_pool_pct / cluster_pi_pool_pct), pg_cluster_buffer_pool_stats view field definitions, memory budget (BufferDesc array 128 MB increment = +0.8% shared_buffers)
- pcm_lock (LWLock at offset 104)
- BCT_INVALID buffer triggering CF transfer, RDMA zero-copy path and cf_state field lifecycle
- FlushBuffer path, the relationship between pi_lsn and WAL truncation point, and how checkpoint coordinates dirty buffer flush order across the three pools