The previous chapter (Ch 9) described pgrac's 8 KB block dual-track layout: the ITL slot array is embedded in the page's special area, each row tuple indexes its corresponding ITL slot via t_itl_slot_idx, and the ITL slot's undo_segment_head (a 16-byte UBA) points into the per-instance undo segment. This chapter follows the UBA to its destination — the physical structure and lifecycle of the undo subsystem.
pgrac's undo subsystem implements full Oracle-style MVCC: each instance holds an independent undo tablespace, all DML writes undo before heap, and both rollback and CR block construction are performed via the undo chain. The core design goal is zero cross-node undo write contention — each instance writes only its own undo tablespace, while other instances access it read-only via Cache Fusion.
PG-native has no dedicated undo subsystem. Historical row versions live inside the heap (dead tuples) and are cleaned up asynchronously by VACUUM; visibility is determined by xmin/xmax + CLOG — a classic "MVCC by heap versioning" model. After pgrac introduces a full undo subsystem, historical versions leave the heap and move into dedicated undo segments, dramatically shortening the lifecycle of heap rows.
| Dimension | PG native | pgrac Per-instance Undo |
|---|---|---|
| Historical version location | Heap (dead tuples) | Dedicated undo segment |
| Visibility information | xmin/xmax in tuple + CLOG | ITL slot → UBA → TT slot (commit_scn) |
| MVCC model | Heap versioning | Oracle MVCC (undo-based) |
| Cross-node MVCC | ❌ Not supported | ✅ Cluster-wide SCN + undo chain |
| Undo tablespace | ❌ None | Per-instance (AD-010) |
| Write contention | — | No cross-node undo write contention |
| Reclamation mechanism | VACUUM scans heap | undo_vacuum bgworker reclaims by retention |
| Rollback path | Mark xmax invalid | Traverse undo chain in reverse, applying inverse operations |
If all nodes share a single undo tablespace, every DML operation must contend for segment write locks: segment allocation, TT slot acquisition, and undo block writes all introduce cross-node lock protocol overhead — a serious bottleneck under high-concurrency write workloads.
pgrac's choice of per-instance undo tablespace (AD-010) is motivated by:
- pg_basebackup does not need to back up undo tablespaces, significantly reducing backup volume.
- Cross-node reads of undo data (CR construction, reading commit_scn from a remote TT slot) are handled through Cache Fusion (#119) — read-only access with no write contention.
Each instance is provisioned by default with 16 segments × 64 MB = 1 GB of undo space, created automatically during the pgrac_ctl initdb phase. High-load instances can expand to 10+ GB (max_undo_segments_per_instance defaults to a ceiling of 64).
The storage layout separates the cluster-wide shared data tablespace from the per-instance undo tablespaces:
┌─────────────────────────────┐
│ shared data tablespace │ (cluster-wide, all nodes read/write)
└─────────────────────────────┘
↑
┌─────────────┴─────────────┐
│ │
Node 1 Node N
┌────────────────┐ ┌────────────────┐
│ undo_node_1 │ │ undo_node_N │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ seg_001 │ │ │ │ seg_001 │ │
│ │ 64 MB │ │ │ │ 64 MB │ │
│ └──────────┘ │ │ └──────────┘ │
│ ... │ │ ... │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ seg_016 │ │ │ │ seg_016 │ │
│ └──────────┘ │ │ └──────────┘ │
│ 16 × 64MB │ │ 16 × 64MB │
│ = 1 GB │ │ = 1 GB │
└────────────────┘ └────────────────┘
(only Node 1 writes) (only Node N writes)
The tablespace catalog entry in pg_tablespace is identified by spctype = 'undo' (AD-004 new type); the spcowner_instance column records which instance owns it:
SELECT spcname, spctype, spcparams, spcowner_instance
FROM pg_tablespace
WHERE spctype = 'undo';
spcname | spctype | spcparams | spcowner_instance
----------------------+---------+----------------------------------------+-------------------
undo_tbs_instance_1 | undo | {size=1GB, segments=16, retention=900} | 1
undo_tbs_instance_2 | undo | {size=1GB, segments=16, retention=900} | 2
The Segment Header occupies the full 8 KB of Block 0. The first 24 bytes reuse PageHeaderData (LSN + checksum + flags), followed in order by segment metadata, block allocation pointer, retention information, and a transaction table of 48 TT slots (1.5 KB).
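The byte budget of Block 0 can be sketched from the sizes above. This is illustrative arithmetic only, not pgrac's actual UndoSegmentHeader struct; the exact sizes of the segment metadata, block allocation pointer, and retention fields are not specified in the text, so they are lumped into one remaining budget.

```python
# Illustrative layout of the 8 KB segment header block (Block 0), using only
# the sizes stated in the text; field grouping is an assumption.
BLOCK_SIZE = 8192
PAGE_HEADER = 24          # reused PageHeaderData: LSN + checksum + flags
TT_SLOT_SIZE = 32
TT_SLOT_COUNT = 48

tt_table_bytes = TT_SLOT_COUNT * TT_SLOT_SIZE
assert tt_table_bytes == 1536                           # 1.5 KB, as stated
assert round(tt_table_bytes / BLOCK_SIZE * 100) == 19   # ~19% of the block

# Everything between the page header and the TT table holds segment metadata,
# the block allocation pointer, and retention info (exact sizes unspecified).
metadata_budget = BLOCK_SIZE - PAGE_HEADER - tt_table_bytes
print(metadata_budget)  # 6632 bytes available for the middle section
```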
A segment's lifecycle passes through five states: ALLOCATED → ACTIVE → COMMITTED → RECYCLABLE → ALLOCATED. By default, 8 segments per instance are ACTIVE (held by live transactions), 4 are COMMITTED (within the retention window), and the remainder are idle or recyclable. Each transaction exclusively occupies one segment (per-transaction exclusive strategy), eliminating intra-segment concurrency contention and giving all undo records of the same transaction good locality.
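The five-state cycle can be written down as a small transition table. This is a minimal sketch of the state machine named above, not pgrac's implementation; the abort path is folded into the ACTIVE → COMMITTED step for brevity.

```python
# Minimal sketch of the five-state segment lifecycle; the transition table
# mirrors ALLOCATED → ACTIVE → COMMITTED → RECYCLABLE → ALLOCATED.
TRANSITIONS = {
    "ALLOCATED":  {"ACTIVE"},      # a transaction claims the segment exclusively
    "ACTIVE":     {"COMMITTED"},   # the owning transaction ends (commit/abort)
    "COMMITTED":  {"RECYCLABLE"},  # retention window expires, no old readers
    "RECYCLABLE": {"ALLOCATED"},   # undo_vacuum returns it to the free pool
}

def advance(state: str, target: str) -> str:
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

s = "ALLOCATED"
for nxt in ("ACTIVE", "COMMITTED", "RECYCLABLE", "ALLOCATED"):
    s = advance(s, nxt)
print(s)  # ALLOCATED — the segment is back in the free pool
```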
Each TT slot is a fixed 32 bytes and is the authoritative source of transaction state for the undo subsystem:
| Field | Size | Meaning |
|---|---|---|
| xid | 4 B | PG 32-bit transaction ID (high bits encode instance_id, A1-v2 cross-instance XID segmentation) |
| wrap | 2 B | Slot reuse counter (WRAP); prevents ABA false-positive matches |
| status | 1 B | ACTIVE / COMMITTED / ABORTED / RECYCLABLE |
| flags | 1 B | Cleanout state + reserved bits |
| commit_scn | 8 B | Written at commit time; INVALID (0) means not yet committed |
| first_undo_block (UBA) | 16 B | Precise address of this transaction's first undo record (segment_id, block_no, tt_slot_offset, row_offset) |
WRAP prevents ABA: a TT slot is not immediately freed after COMMITTED — readers may still need to query commit_scn. A slot is only marked RECYCLABLE and eligible for reuse when commit_scn < oldest_active_snapshot_scn and all undo records associated with that slot have been reclaimed. On reuse wrap++; when an ITL slot references a TT slot, it compares the wrap value — a mismatch indicates the slot has been reused, and the ITL slot falls back to its cached commit_scn.
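The wrap comparison can be sketched as follows. The shapes of TTSlot and the ITL-side reference are assumptions for illustration (the text only specifies the TT slot fields); the fallback to a cached commit_scn follows the description above.

```python
# Sketch of the wrap-based ABA check. Assumes an ITL slot captures
# (tt_index, wrap, cached_commit_scn) when it is filled; names are illustrative.
from dataclasses import dataclass

INVALID_SCN = 0

@dataclass
class TTSlot:
    xid: int
    wrap: int
    status: str          # ACTIVE / COMMITTED / ABORTED / RECYCLABLE
    commit_scn: int

@dataclass
class ITLSlotRef:
    tt_index: int
    wrap: int            # wrap value captured when the ITL slot was filled
    cached_commit_scn: int

def resolve_commit_scn(itl: ITLSlotRef, tt_slots: list) -> int:
    slot = tt_slots[itl.tt_index]
    if slot.wrap != itl.wrap:
        # Slot was recycled and reused by a newer transaction: the live
        # commit_scn belongs to someone else, so fall back to the cached value.
        return itl.cached_commit_scn
    return slot.commit_scn

tt = [TTSlot(xid=101, wrap=7, status="COMMITTED", commit_scn=5000)]
itl = ITLSlotRef(tt_index=0, wrap=7, cached_commit_scn=5000)
assert resolve_commit_scn(itl, tt) == 5000   # wrap matches: read live slot

tt[0] = TTSlot(xid=202, wrap=8, status="ACTIVE", commit_scn=INVALID_SCN)
assert resolve_commit_scn(itl, tt) == 5000   # wrap mismatch: cached value wins
```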
48 slots × 32 B = 1.5 KB, roughly 19% of the segment header; combined with the default 16 segments, each instance supports 8 × 48 = 384 concurrently active transactions (excluding COMMITTED slots).
Each undo record consists of a fixed 32-byte UndoRecordHeader plus a variable-length payload. The header's prev_undo_in_tx (16 B UBA) points to the previous undo record of the same transaction, forming a reverse singly-linked list — the core data structure for rollback.
Tx A write order (chronological):
Block 5: Record 1 (INSERT row P) prev = NULL
Block 5: Record 2 (UPDATE row Q) prev → (seg X, blk 5, rec 0)
Block 5: Record 3 (DELETE row R) prev → (seg X, blk 5, rec 1)
Block 6: Record 4 (UPDATE row S) prev → (seg X, blk 5, rec 2)
TT slot[A].first_undo_block → (seg X, blk 6, rec 3) ← most recent (rollback entry point)
Rollback path (reverse):
Record 4 → Record 3 → Record 2 → Record 1 → NULL
Each step applies the inverse operation (INSERT→delete row; UPDATE→restore pre-image; DELETE→re-insert)
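The reverse walk above can be sketched in a few lines. UBA addresses are simplified to plain list indices and the heap to a dict, purely for illustration; the record layout here is an assumption, not pgrac's actual undo record format.

```python
# Sketch of rollback via the reverse undo chain: start at the TT slot's
# first_undo_block (the newest record) and follow prev_undo_in_tx to NULL,
# applying the inverse of each operation.
def rollback(first_undo, undo_records, heap):
    uba = first_undo
    while uba is not None:
        rec = undo_records[uba]
        if rec["op"] == "INSERT":            # inverse: delete the row
            del heap[rec["row"]]
        elif rec["op"] == "UPDATE":          # inverse: restore the pre-image
            heap[rec["row"]] = rec["pre_image"]
        elif rec["op"] == "DELETE":          # inverse: re-insert the old row
            heap[rec["row"]] = rec["pre_image"]
        uba = rec["prev"]                    # prev_undo_in_tx

# Tx A's four records from the diagram (rec 0..3), entry point at rec 3:
undo = [
    {"op": "INSERT", "row": "P", "pre_image": None,    "prev": None},
    {"op": "UPDATE", "row": "Q", "pre_image": "Q_old", "prev": 0},
    {"op": "DELETE", "row": "R", "pre_image": "R_old", "prev": 1},
    {"op": "UPDATE", "row": "S", "pre_image": "S_old", "prev": 2},
]
heap = {"P": "P_new", "Q": "Q_new", "S": "S_new"}   # R was deleted by Tx A
rollback(3, undo, heap)
print(heap)  # {'Q': 'Q_old', 'S': 'S_old', 'R': 'R_old'}
```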
Payload sizes differ by operation type: INSERT undo is lightest (only the row offset, ~40 B total); UPDATE undo uses a column-delta optimization, storing only the pre-image of modified columns (~80 B for a 5-column change); DELETE undo is heaviest, requiring the full old row (~240 B total for a 200 B row).
CR block construction: when reading a block where block_scn > snapshot.read_scn, pgrac clones a CR copy in the buffer pool, traverses the UBA chain for each ITL slot whose commit_scn > read_scn to fetch undo records, and reverse-applies them to reconstruct the historical version. CR blocks exist only in the buffer pool and are never written to disk.
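The CR path can be sketched under a simplified model in which each too-new ITL slot contributes one pre-image per row it covers (real reconstruction may need several chained versions per row); block and record shapes here are illustrative assumptions.

```python
# Sketch of CR block construction: clone the block, then for every ITL slot
# whose commit_scn is newer than the snapshot, walk its UBA chain and
# reverse-apply the pre-images. The clone never touches the original block.
import copy

def build_cr_block(block, read_scn, fetch_undo_chain):
    cr = copy.deepcopy(block)              # clone a CR copy in the buffer pool
    for itl in cr["itl_slots"]:
        if itl["commit_scn"] > read_scn:   # this version is too new to see
            for rec in fetch_undo_chain(itl["uba"]):   # walk the UBA chain
                cr["rows"][rec["row"]] = rec["pre_image"]  # reverse-apply
    return cr

block = {
    "itl_slots": [{"commit_scn": 900, "uba": "A"},
                  {"commit_scn": 120, "uba": "B"}],
    "rows": {"R1": "new", "R2": "old"},
}
chains = {"A": [{"row": "R1", "pre_image": "prev"}], "B": []}
cr = build_cr_block(block, read_scn=500, fetch_undo_chain=lambda u: chains[u])
assert cr["rows"] == {"R1": "prev", "R2": "old"}   # R1 rolled back, R2 kept
assert block["rows"]["R1"] == "new"                # original block untouched
```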
Undo retention: the default retention window is 15 minutes (cluster_undo_retention_sec = 900), using max mode by default (retention extends indefinitely when an active reader holds an older snapshot, preventing STO). The background undo_vacuum worker scans every 60 seconds and reclaims expired segments in three stages: TT slot → RECYCLABLE, undo records → discardable, entire segment → returned to the free pool.
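The "max mode" behavior boils down to a two-part predicate: the retention window must have expired and no active snapshot may still need the data. The sketch below assumes that predicate shape; the function and parameter names are illustrative, not pgrac's API.

```python
# Sketch of the undo_vacuum reclamation predicate in "max" retention mode:
# retention extends indefinitely while an older active snapshot exists.
RETENTION_SEC = 900   # cluster_undo_retention_sec default

def reclaimable(commit_time, commit_scn, now, oldest_active_snapshot_scn):
    window_expired = (now - commit_time) >= RETENTION_SEC
    no_reader_needs_it = commit_scn < oldest_active_snapshot_scn
    return window_expired and no_reader_needs_it

# Window expired, but an old snapshot still covers the segment: keep it (no STO).
assert not reclaimable(commit_time=0, commit_scn=800, now=1000,
                       oldest_active_snapshot_scn=700)
# Window expired and every active snapshot is newer: eligible for the
# three-stage reclaim (TT slot -> records -> whole segment).
assert reclaimable(commit_time=0, commit_scn=800, now=1000,
                   oldest_active_snapshot_scn=900)
```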
Cross-node visibility is the most complex path in pgrac MVCC: Node A reads a row tuple whose last modification was performed by a transaction on Node B, with the corresponding undo data residing in Node B's undo segment.
Node A: read tuple at row R
│
↓
Tuple has ITL slot index → look up ITL slot
│
↓
ITL slot has UBA = (segment_id, block_no, record_offset)
│
↓
cluster TT lookup → which node owns this undo segment? Node B
│
↓
fetch undo block (segment_id, block_no) from Node B's undo
(via Cache Fusion if cached, else shared storage IO)
│
↓
SCN check: is this version visible at read_scn?
│
↓
visible / invisible
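The flow above can be condensed into one function. Ownership lookup and the Cache Fusion / shared-storage fetch are stubbed out with dicts and a callback; all names here are illustrative, and the SCN check is reduced to the simple "visible if committed at or before read_scn" case.

```python
# Sketch of the cross-node visibility walk:
# tuple -> ITL slot -> UBA -> owning node -> fetch undo -> SCN check.
def check_visibility(tuple_ref, itl_slots, segment_owner, fetch_remote_undo,
                     read_scn):
    itl = itl_slots[tuple_ref["itl_idx"]]          # tuple -> ITL slot
    seg_id, block_no, rec_off = itl["uba"]         # ITL slot -> UBA
    owner = segment_owner[seg_id]                  # cluster TT lookup
    rec = fetch_remote_undo(owner, seg_id, block_no, rec_off)  # CF or storage IO
    return rec["commit_scn"] <= read_scn           # SCN check

itl_slots = [{"uba": (3, 17, 0)}]
segment_owner = {3: "node_B"}                      # segment 3 lives on Node B
remote = {("node_B", 3, 17, 0): {"commit_scn": 4200}}
fetch = lambda owner, s, b, r: remote[(owner, s, b, r)]

assert check_visibility({"itl_idx": 0}, itl_slots, segment_owner, fetch,
                        read_scn=5000) is True     # committed before snapshot
assert check_visibility({"itl_idx": 0}, itl_slots, segment_owner, fetch,
                        read_scn=4000) is False    # committed after snapshot
```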
Segment header cache: after Node A fetches Node B's segment header (containing the TT slot array), it is cached locally in an LRU cache for 30 seconds (segment_header_cache_ttl = 30s), avoiding a Cache Fusion request for every SCN lookup. Once the ITL slot's commit_scn is written back (delayed cleanout), subsequent reads retrieve it directly from the block without consulting the TT slot again.
Undo block cache: undo blocks fetched during CR construction are cached locally for 5 minutes (undo_block_cache_ttl = 5min), subject to PI pool quota limits. Cache hit rates are typically above 90% (see pg_cluster_undo_cf_activity).
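Both caches follow the same TTL pattern. The sketch below shows that pattern with an injected clock for determinism; the LRU bound and PI pool quota from the text are deliberately omitted, and the class is an illustration, not pgrac code.

```python
# Sketch of the TTL caching used for segment headers (30 s) and undo
# blocks (5 min); an expired entry is dropped so the caller refetches it
# via Cache Fusion.
class TTLCache:
    def __init__(self, ttl_sec, clock):
        self.ttl, self.clock, self.store = ttl_sec, clock, {}

    def put(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, expires = hit
        if self.clock() >= expires:      # TTL expired: force a refetch
            del self.store[key]
            return None
        return value

now = [0]
cache = TTLCache(ttl_sec=30, clock=lambda: now[0])   # segment_header_cache_ttl
cache.put(("node_B", 3), "tt-slot-array")
assert cache.get(("node_B", 3)) == "tt-slot-array"   # fresh: served locally
now[0] = 31
assert cache.get(("node_B", 3)) is None              # expired after 30 s
```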
Cross-node undo access is a read-only operation (Node A never writes to Node B's undo), so there is no write contention. Cache Fusion here only needs to issue shared requests (S mode) and does not trigger the XCR (Exclusive CR) protocol — overhead is significantly lower than cross-node heap write contention.
For deeper design details and related features:
- UndoSegmentHeader C struct, three undo record op payload formats, five-state lifecycle state machine, TT slot reuse protocol, undo_vacuum bgworker implementation, CREATE UNDO TABLESPACE DDL
- segment_id / block_no / tt_slot_offset / row_offset four-field layout, and how ITL slots hold the UBA
- BufferDesc extended fields (is_cr_block / cr_scn / cr_chain_next), complete CR block construction flow, PI block and undo block coordination