pgrac applies a precise, surgical upgrade to PostgreSQL's 8 KB block format: the physical layout of PageHeaderData and all 30+ page-access macros remain fully compatible, 8 bytes of pd_block_scn are appended to the end of the PageHeader, and the ITL slot array is placed in PG's special area (the block trailer). This reuses the same special-area mechanism shared by btree, hash, gin, and other index AMs, and mirrors Oracle's actual block layout (ITL at the block trailer). The result is a "dual-track visibility" system: local transactions take the PG-native xmin/xmax + CLOG hot path, while cross-node transactions take the ITL slot → commit_scn → TT SCN path. Both paths coexist on the same physical block without interference.
The PG-native block was designed for single-instance use: PageHeader holds structural metadata, row data grows upward from the bottom, row pointers grow downward from the header, and all visibility information lives in each row's xmin/xmax fields — the block itself carries no transaction state. pgrac adds three incremental changes on top: PageHeader +8 bytes for SCN, tuple header +1 byte for the ITL index, and a new ITL slot array at the block trailer.
The pgrac block format is incompatible with the PG-native format: pd_pagesize_version is bumped from 4 to 5, and pg_upgrade cannot be used directly. The migration path is dump/restore or a dedicated migration tool (to be provided before spec-1.25 acceptance). Extensions that rely on pageinspect to read block contents directly will need to be adapted for the new format; most extensions that do not read block contents remain compatible.
| Field / Region | PG native | pgrac | Delta |
|---|---|---|---|
| PageHeader | 24 B | 32 B (+8 B pd_block_scn) | +8 B |
| Row pointers (pd_linp[]) | Immediately after header, offset 24 | Same, but offset 32 | Offset shifted +8 B |
| Free space | pd_lower ↔ pd_upper | Same | Unchanged |
| Row data | Grows upward from bottom | Same, with +1 B t_itl_slot_idx per tuple | +1 B per row |
| ITL slot array | ❌ None | ✅ special area, default 384 B (8 × 48 B) | New |
| Special area | Empty for heap (0 B) | ITL slot array (384 B) | New |
| Block-level SCN | ❌ | ✅ pd_block_scn (8 B) | New |
| Delayed cleanout flag | ❌ | ✅ PD_DELAYED_CLEANOUT (pd_flags bit) | New |
| Available user data (8 KB) | ~8168 B | ~7776 B | −392 B (~4.8%) |
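The header delta in the table can be sketched as a C struct: the first eight fields mirror PostgreSQL's 24-byte PageHeaderData, with pd_block_scn appended at offset 24. This is an illustrative sketch, not pgrac's actual definition; the ClusterPageHeaderData and ClusterSCN names are borrowed from the reference material, and the exact typedefs are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ClusterSCN;   /* assumed typedef for an 8-byte SCN */

/* Sketch of the pgrac page header: the PG part matches the field order
 * of upstream storage/bufpage.h, followed by the appended block SCN. */
typedef struct ClusterPageHeaderData
{
    uint64_t   pd_lsn;              /* 8 B  LSN of last WAL record for this page */
    uint16_t   pd_checksum;         /* 2 B  page checksum */
    uint16_t   pd_flags;            /* 2 B  flag bits (PD_HAS_ITL, PD_DELAYED_CLEANOUT, ...) */
    uint16_t   pd_lower;            /* 2 B  end of the row-pointer array */
    uint16_t   pd_upper;            /* 2 B  start of row data */
    uint16_t   pd_special;          /* 2 B  start of the special area (ITL array) */
    uint16_t   pd_pagesize_version; /* 2 B  bumped from 4 to 5 in pgrac */
    uint32_t   pd_prune_xid;        /* 4 B  oldest prunable XID on the page */
    ClusterSCN pd_block_scn;        /* 8 B  pgrac addition: block-level SCN */
} ClusterPageHeaderData;
```

With natural alignment the PG fields pack to 24 bytes, so pd_block_scn lands at offset 24 and the whole header is 32 bytes, matching the table's +8 B delta.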
The ITL (Interested Transaction List) slot array resides in the block's special area (pd_special pointer, 384 bytes from the end of the block). Each slot is 48 bytes; the default count is 8 (INITRANS = 8). Access goes through the ClusterPageGetItlSlots(page) function, which wraps PageGetSpecialPointer with dual assertions (a PageHasItl flag check and a special-size check).
Each ITL slot's 48 bytes are laid out as follows:
| Field | Size | Meaning |
|---|---|---|
| xid | 4 B | PG 32-bit transaction ID; high bits encode instance_id (A1-v2 cross-instance XID segmentation) |
| wrap | 2 B | Slot reuse counter; prevents ABA false-positive matches |
| flags | 1 B | ACTIVE / COMMITTED / ABORTED / CLEAN / NEEDS_CLEANOUT |
| lock_count | 1 B | Number of row locks held by this slot |
| undo_segment_head (UBA) | 16 B | Undo Block Address; points to the per-instance undo segment (see §9.3) |
| commit_scn | 8 B | Filled on commit; INVALID (0) means not yet committed |
| write_scn | 8 B | Local SCN at write time (companion to AD-008 Lamport monotonic advance) |
| first_change_lsn | 8 B | LSN of this transaction's first modification to this block; crash-recovery anchor |
+--------------------------------------+ 0
| PageHeaderData (24 B) |
| pd_block_scn ( 8 B) | pgrac addition (+8 B block-level SCN)
+--------------------------------------+ 32
| pd_linp[0..N] Row Pointers | (PG-compatible, unchanged)
| (each 4 B; FLEXIBLE_ARRAY) |
+--------------------------------------+
| |
| Free Space |
| (between row pointers |
| and row data) |
| |
+--------------------------------------+
| |
| Row Data | (grows upward from bottom)
| ClusterHeapTupleHeader (+1 B idx) |
| |
+--------------------------------------+
| ITL Slot 0 (48 B) | ↑ New region (block-trailer special area)
| ITL Slot 1 |
| ... |
| ITL Slot N-1 | N = INITRANS, default 8
+--------------------------------------+ 8192
Each ITL Slot (48 B):
+------+------+-------+-------+--------+------------+----------+-----------------+
| xid | wrap | flags | lock | UBA | commit_scn | write_scn| first_change_lsn|
| 4 B | 2 B | 1 B | 1 B | 16 B | 8 B | 8 B | 8 B |
+------+------+-------+-------+--------+------------+----------+-----------------+
PD_HAS_ITL (pd_flags bit 0x0008) identifies that this block's special area is an ITL array (as opposed to btree opaque data or similar). The heap AM initializes via PageInitHeapPage(page, 8192, 384); index AMs continue to call PageInit without ITL, leaving PD_HAS_ITL permanently 0.
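A minimal sketch of what PageInitHeapPage(page, 8192, 384) has to do: carve the ITL array out of the special area and set PD_HAS_ITL. The SketchPageHeader stand-in spells out only the fields this touches, and the helper name is hypothetical; the real routine lives behind PostgreSQL's PageInit machinery.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ        8192
#define PD_HAS_ITL    0x0008   /* pd_flags bit, per the text */

/* Simplified stand-in for the 32-byte pgrac page header. */
typedef struct {
    uint8_t  pd_head[10];   /* pd_lsn + pd_checksum (untouched here) */
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint8_t  pd_tail[14];   /* version, prune_xid, pd_block_scn */
} SketchPageHeader;

/* Zero the page, reserve itl_bytes at the trailer for the ITL array,
 * point free space at it, and mark the page as ITL-bearing. */
static void
SketchPageInitHeapPage(void *page, size_t pagesize, size_t itl_bytes)
{
    SketchPageHeader *hdr = (SketchPageHeader *) page;

    memset(page, 0, pagesize);
    hdr->pd_special = (uint16_t) (pagesize - itl_bytes);  /* 8192 - 384 = 7808 */
    hdr->pd_lower   = 32;                                 /* row pointers start after header */
    hdr->pd_upper   = hdr->pd_special;                    /* free space ends at ITL array */
    hdr->pd_flags  |= PD_HAS_ITL;
}
```

Note that pd_upper − pd_lower comes out to 7776 bytes, matching the "available user data" row in the format table.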
UBA (Undo Block Address) is the type of the undo_segment_head field in an ITL slot — 16 bytes, acting as a precise pointer serving two query paths simultaneously:
- commit_scn path: uses (segment_id, tt_slot_offset) to reach the TT slot directly inside the undo segment header
- historical-version path: uses (segment_id, block_no, row_offset) to reach the specific undo record directly

```c
typedef struct UBA {
    UndoSegmentId segment_id;     // 4 B  undo segment number
    BlockNumber   block_no;       // 4 B  undo block number within the segment
    uint16        tt_slot_offset; // 2 B  TT slot index in segment header (0–47)
    uint16        row_offset;     // 2 B  specific record offset within undo block
    uint32        reserved;       // 4 B  reserved, MemSet to 0
    /* total: 16 B */
} UBA;
```
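Pairing the UBA with the 48-byte field table gives the full slot layout. This is a sketch under natural-alignment assumptions: the ItlSlotData name is hypothetical, and the UBA is flattened to 16 raw bytes to keep the block self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ITL slot, mirroring the 48-byte field table. */
typedef struct ItlSlotData {
    uint32_t xid;                    /* 4 B  PG XID; high bits = instance_id */
    uint16_t wrap;                   /* 2 B  reuse counter (anti-ABA) */
    uint8_t  flags;                  /* 1 B  ACTIVE/COMMITTED/ABORTED/CLEAN/NEEDS_CLEANOUT */
    uint8_t  lock_count;             /* 1 B  row locks held by this slot */
    uint8_t  undo_segment_head[16];  /* 16 B UBA, as defined above */
    uint64_t commit_scn;             /* 8 B  0 = not yet committed */
    uint64_t write_scn;              /* 8 B  local SCN at write time */
    uint64_t first_change_lsn;       /* 8 B  crash-recovery anchor */
} ItlSlotData;
```

The first four fields pack to 8 bytes, the UBA fills bytes 8–23, and the three 8-byte values follow at offsets 24, 32, and 40, so the struct needs no padding to reach exactly 48 bytes.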
tt_slot_offset indexes into the TT slot array in the undo segment header (48 slots × 32 B = 1536 B per segment header). Each TT slot holds commit_scn + status (ACTIVE / COMMITTED / ABORTED) and is the authoritative data source for the SCN visibility path.
Query chain: Row → ITL slot (block) → UBA → TT slot (undo segment header) → commit_scn; if historical row versions are needed, continue to UBA → undo record (undo block).
A pgrac ITL slot (48 B) is 16 bytes larger than an Oracle ITL slot (~32 B): the delta comes from write_scn (8 B, AD-008 Lamport companion) and first_change_lsn (8 B, crash recovery anchor), while the 16-byte UBA replaces Oracle's more compact undo address. The choice of 48 B total vs Oracle's 32 B was explicitly justified and locked in as a capacity-vs-functionality trade-off.
Delayed cleanout is a key performance optimization adopted from Oracle: at transaction commit time, pgrac does not immediately traverse all modified rows to update the flags field of their ITL slots — in high-concurrency batch-write scenarios this can reduce commit-path I/O by 70–90%.
The mechanism works as follows:
1. At commit, only the TT slot in the undo segment header is updated (status → COMMITTED, commit_scn written); the ITL slot's flags remains ACTIVE, and the block's PD_DELAYED_CLEANOUT flag is set.
2. A later reader that follows the UBA and finds the TT slot COMMITTED writes commit_scn back to the ITL slot's commit_scn field and changes flags to CLEAN.
3. When HeapTupleSatisfiesMVCC_scn performs a TT lookup and finds the slot COMMITTED but the ITL slot not yet CLEAN, it writes back immediately; no extra pass is required, and cleanout happens opportunistically.
4. Once all ITL slots are CLEAN or INACTIVE, the PD_DELAYED_CLEANOUT flag is cleared.

The PD_DELAYED_CLEANOUT bit serves as a fast filter on the read path: if a block does not have this flag set, there is no need to scan ITL slot state, and the read can proceed directly on the PG-native hint-bit path.
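The opportunistic write-back described above can be sketched as a small routine. The status encodings and the ItlSlotView / TtSlotView structs here are assumptions for illustration; the real code would operate on the 48-byte ITL slot and the 32-byte TT slot in place.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed numeric encodings; the text names only the states. */
enum { ITL_ACTIVE = 1, ITL_CLEAN = 2 };
enum { TT_ACTIVE = 1, TT_COMMITTED = 2, TT_ABORTED = 3 };

typedef struct {            /* ITL-slot fields that cleanout touches */
    uint8_t  flags;
    uint64_t commit_scn;
} ItlSlotView;

typedef struct {            /* TT slot in the undo segment header */
    uint8_t  status;
    uint64_t commit_scn;
} TtSlotView;

/* If the TT slot says COMMITTED but the ITL slot is not yet CLEAN,
 * copy commit_scn back and mark the slot CLEAN. Returns 1 on write-back. */
static int
SketchItlCleanout(ItlSlotView *itl, const TtSlotView *tt)
{
    if (tt->status == TT_COMMITTED && itl->flags != ITL_CLEAN)
    {
        itl->commit_scn = tt->commit_scn;
        itl->flags = ITL_CLEAN;
        return 1;
    }
    return 0;
}
```

A second call on the same slot is a no-op, which is what lets readers perform cleanout without coordination.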
pgrac implements "dual-track visibility": each row tuple on a block is checked via one of two paths, depending on the XID origin and the snapshot type. Both paths coexist physically, and the AD-012 dual-dimension routing logic selects between them at read time.
Path 1: PG-native path (local hot path)
Prerequisites: XID belongs to this instance + snapshot is a local snapshot created by this session.
- Read t_xmin / t_xmax from the tuple header
- Check ProcArray for in-progress status
- Check snapshot.xip[] for visibility

This path is identical to PG-native code, with no additional overhead.
Path 2: SCN path (cross-node / imported snapshot)
Prerequisites: XID belongs to a remote instance, or the snapshot is a cross-node SCN snapshot passed in from another node.
- Read t_itl_slot_idx to obtain the ITL slot index
- If flags == CLEAN: compare commit_scn directly against snapshot.read_scn
- If flags == ACTIVE or NEEDS_CLEANOUT: follow the UBA to the TT slot; determine visibility from the TT slot's status and commit_scn; perform cleanout opportunistically

                    Read Heap Tuple
│
┌───────────┴───────────┐
↓ ↓
xmin/xmax ITL slot
in tuple in block
│ │
↓ ↓
PG CLOG check cluster TT lookup
(single instance) (cross-instance)
│ │
↓ ↓
commit / abort commit_scn compare
visible / invisible visible at read SCN?
│ │
└──── combined result ──┘
↓
visible?
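The routing in the diagram condenses to two predicates: is the XID local, and if not, does the commit SCN precede the snapshot's read SCN. A sketch with assumed encodings; in particular, the text says only that the XID's high bits encode instance_id, so the 8-bit split here is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SLOT_CLEAN = 2 };   /* assumed encoding for the CLEAN state */

typedef struct {
    uint8_t  flags;
    uint64_t commit_scn;   /* 0 = not yet committed */
} ItlView;

/* AD-012 routing sketch: local XIDs take the PG-native CLOG path,
 * remote XIDs take the SCN path. High-8-bit split is hypothetical. */
static bool
SketchXidIsLocal(uint32_t xid, uint32_t local_instance_id)
{
    return (xid >> 24) == local_instance_id;
}

/* SCN-path visibility for the CLEAN fast case: a committed change is
 * visible iff its commit_scn is at or below the snapshot's read SCN.
 * ACTIVE/NEEDS_CLEANOUT slots would instead chase the UBA to the TT slot. */
static bool
SketchScnVisible(const ItlView *slot, uint64_t read_scn)
{
    if (slot->flags == SLOT_CLEAN && slot->commit_scn != 0)
        return slot->commit_scn <= read_scn;
    return false;
}
```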
CR block construction: when block_scn > snapshot.read_scn, pgrac clones a CR copy of the block in the buffer pool, applies undo for each ITL slot whose commit_scn > read_scn (reverting its modifications), and produces a historical version of the block. CR blocks exist only in the buffer pool and are never written to disk; they are identified by the is_cr_block = true and cr_scn fields.
The number of ITL slots is configured per-table via the INITRANS parameter, with syntax aligned to Oracle:
-- Specify at table creation
CREATE TABLE orders (...) WITH (INITRANS = 16);
-- Modify an existing table (takes effect on newly allocated blocks)
ALTER TABLE orders SET (INITRANS = 8);
| Scenario | Recommended INITRANS | Notes |
|---|---|---|
| Default | 8 | Suitable for general OLTP workloads; 384 B at block trailer |
| High-concurrency hot table (OLTP) | 16 – 32 | Frequent concurrent DML; prevents ITL overflow triggering block reorganization |
| OLAP / read-heavy | 4 | Reduces ITL footprint, leaving more space for user data |
| Index pages | 0 (N/A) | Index AM does not add ITL; PD_HAS_ITL = 0 |
The capacity loss from the INITRANS = 8 default is approximately 4.8% (~392 B / 8192 B). Reducing to INITRANS = 4 brings the loss down to ~2.4% (200 B); INITRANS = 16 raises it to ~9.5% (776 B). When all ITL slots are occupied by active transactions, pgrac attempts in order: reuse an already-COMMITTED slot (released after cleanout) → reorganize the block to free space → return "block full" (the writer automatically moves to the next block). Until Stage 3 visibility rework is complete, INITRANS DDL and ITL overflow handling are placeholder implementations.
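The byte figures follow from a simple per-block overhead formula: 8 B of pd_block_scn plus 48 B per ITL slot (the +1 B per tuple is workload-dependent and excluded here). The helper name is hypothetical.

```c
#include <assert.h>

/* Fixed per-block overhead of the pgrac additions, in bytes:
 * 8 B block-level SCN + 48 B per ITL slot. */
static unsigned
SketchItlOverheadBytes(unsigned initrans)
{
    return 8 + 48 * initrans;
}
```

For the default INITRANS = 8 this yields 392 B, i.e. 392 / 8192 ≈ 4.8% of the block.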
For deeper design details, refer to the following resources:
- ClusterPageHeader C struct, field-by-field annotations for the ITL slot 48 B layout, PD_HAS_ITL / PD_DELAYED_CLEANOUT flag semantics, PageInitHeapPage inline implementation, MaxHeapTupleSize formula derivation
- BufferDesc extended fields (is_cr_block / cr_scn / cr_chain_next / has_pi / pi_lsn), complete CR block construction flow, PI block ownership timing