GES (Global Enqueue Service) is pgrac's second cross-node locking protocol, orthogonal to Cache Fusion's PCM locks: PCM manages block-level concurrency inside the buffer cache, while GES manages all cross-node locks outside the buffer cache — row locks, table locks, transaction locks, DDL locks, advisory locks, and control-file locks. The two protocols share the same GRD master-routing infrastructure (see
PCM locks are designed for block-level concurrent access inside the buffer cache: the 8 KB block is their unit of operation, and the N/S/X tri-state plus the orthogonal has_pi flag cover all buffer-transfer scenarios. But a large class of coordination needs in a database has nothing to do with the buffer cache:

- DDL statements (ALTER TABLE, DROP) must block concurrent DML on all nodes.
- pg_advisory_lock requires cluster-wide mutual exclusion, not single-node mutual exclusion.
- Control-file writes (pg_control) require a global exclusive lock.

The common characteristic of these scenarios is that the lock object is a logical resource (a row, table, object, or advisory key) — not a physical block. PCM has no concept of "table lock" or "row lock"; it only knows BufferTag (file number + block number). Forcing row-lock semantics into PCM would produce semantic confusion and uncontrolled granularity — this is the fundamental reason Oracle's design separates GCS and GES responsibilities, and pgrac follows the same architectural decision.
Boundary summary: PCM locks operate on BufferTag (physical blocks); GES locks operate on LockTag (logical resources). Both protocols share the GRD master routing table, but their keyspaces are completely isolated — a PCM lock record for a block and a GES lock record for a table can never interfere with each other.
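To make the keyspace isolation concrete, here is a minimal C sketch of the two key types; the struct and field names are illustrative assumptions, not pgrac's actual definitions.

```c
/*
 * Illustrative sketch only: the GRD routes both key types, but each lives
 * in its own hash table, so a block entry and a table-lock entry can never
 * collide.  Field names and layouts are hypothetical.
 */
#include <stdint.h>

typedef struct PcmResourceKey       /* buffer cache IN: a physical block */
{
    uint32_t relfilenode;           /* file number  */
    uint32_t blocknum;              /* block number */
} PcmResourceKey;

typedef struct GesResourceKey       /* buffer cache OUT: a logical resource */
{
    uint8_t  locktag_type;          /* LOCKTAG_RELATION, LOCKTAG_TUPLE, ... */
    uint32_t field1;                /* e.g. database OID                    */
    uint32_t field2;                /* e.g. relation OID or XID             */
} GesResourceKey;
```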
pgrac GES covers all cluster-aware lock types from PG's native heavyweight lock manager, plus dedicated cluster lock types introduced by pgrac:
| Lock type | Abbrev | Purpose | Feature |
|---|---|---|---|
| Transaction row lock | TX | Cross-node row-level conflict detection (LOCKTAG_TUPLE / LOCKTAG_TRANSACTION) | #23 |
| Table-level DML lock | TM | DDL and DML cross-node coordination (LOCKTAG_RELATION) | #24 |
| Sequence lock | SEQ | Cross-node sequence range allocation | #27 |
| Control-file lock | CF | Cross-node exclusive write of pg_control | #77 |
| User lock | UL | pg_advisory_lock cluster-wide mutual exclusion | #78 |
| Tablespace / Instance-state / Call / Quiesce locks | TT/IS/CI/XR | Four cluster-management enqueue types | #79 |
| PG-specific locks | PG | SSI, SPECULATIVE TOKEN, VXID, FROZEN_ID | #124 |
Two lock types explicitly outside GES scope (excluded per AD-011): LC Lock (Library Cache — an Oracle-specific shared-pool metadata lock; PG has no equivalent concept) and RC Lock (Row Cache — Oracle's data-dictionary cache lock; PG's catalog cache is managed by relcache / catcache and requires no cross-node row-cache lock).
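As a rough mapping of the table above onto code, the following enum sketches the cluster-aware enqueue types; the identifiers are hypothetical, not pgrac's actual symbols.

```c
/* Hypothetical identifiers for the GES enqueue types listed above. */
typedef enum GesLockType
{
    GES_LOCK_TX,    /* transaction / row lock                                */
    GES_LOCK_TM,    /* table-level DML lock                                  */
    GES_LOCK_SEQ,   /* sequence range allocation                             */
    GES_LOCK_CF,    /* control-file (pg_control) lock                        */
    GES_LOCK_UL,    /* user (advisory) lock                                  */
    GES_LOCK_TT,    /* tablespace lock                                       */
    GES_LOCK_IS,    /* instance-state lock                                   */
    GES_LOCK_CI,    /* call lock                                             */
    GES_LOCK_XR,    /* quiesce lock                                          */
    GES_LOCK_PG     /* PG-specific: SSI, SPECULATIVE TOKEN, VXID, FROZEN_ID  */
} GesLockType;
```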
The division of responsibility between PCM and GES follows a clear boundary: buffer cache in / buffer cache out. The table below compares the two protocols across five dimensions:
| Dimension | PCM | GES |
|---|---|---|
| Lock object | BufferTag (file number + block number, physical block) | LockTag (relation OID, XID, object OID, etc., logical resource) |
| Lock mode set | N / S / X (3 states) + has_pi orthogonal flag | AccessShare → AccessExclusive (PG's 8 modes) |
| Primary message forms | BLOCK_REQUEST / SEND_BLOCK / PI_PUBLISH | GES_LOCK_REQUEST / GES_BAST / GES_LOCK_GRANT |
| Access frequency | High (OLTP: multiple block accesses per SQL statement) | Medium (transaction-granular / DDL-granular: a few requests per transaction) |
| GRD keyspace | pcm_resource (block hash) | ges_resource (LockTag hash) |
The only shared infrastructure between the two protocols is GRD master routing and DRM (Dynamic Resource Mastering) — hash(key) % N determines which node is master, and DRM can migrate the master for a hot resource to the active node to reduce cross-node messages.
The PCM/GES boundary is not a convention — it is a semantic incompatibility:
PCM lock correctness depends on the physical address uniqueness of blocks — a BufferTag uniquely identifies a single 8 KB block on shared storage, and the PCM state machine's invariant is "at most one node in the cluster holds an X lock." Violating this invariant allows two nodes to simultaneously modify the same block, causing lost writes.
GES lock semantics are about transaction visibility — a LOCKTAG_TRANSACTION lock represents "whether XID has committed," with no relation to the physical location of any block. If PCM's N/S/X tri-state were used to represent row locks, PCM's "X→S downgrade" would be misinterpreted as "row-lock downgrade," causing transaction isolation to break down.
In the other direction, GES cannot manage buffer-cache block transfers: GES's BAST is an asynchronous notification (waiting for the holder's transaction to finish naturally), whereas PCM's block transfer requires synchronous coordination (the requester must wait until the block content arrives before proceeding). These two timing semantics are incompatible.
GES master selection is identical to PCM: for each LockTag, a hash value is computed and taken modulo the number of nodes to determine which node is master for that resource:
master_node = hash(LockTag) % N
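A minimal sketch of this routing rule, assuming an FNV-1a hash over a simplified three-field LockTag; both the hash function and the field layout are placeholders, not pgrac's actual implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified LockTag: all-uint32 fields so the struct has no padding. */
typedef struct LockTag
{
    uint32_t locktag_type;   /* LOCKTAG_RELATION, LOCKTAG_TUPLE, ... */
    uint32_t db_oid;
    uint32_t obj_oid;
} LockTag;

/* FNV-1a over the raw bytes of the tag (placeholder hash). */
static uint64_t
hash_locktag(const LockTag *tag)
{
    const uint8_t *p = (const uint8_t *) tag;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned i = 0; i < sizeof(LockTag); i++)
    {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* master_node = hash(LockTag) % N */
static int
ges_master_node(const LockTag *tag, int n_nodes)
{
    return (int) (hash_locktag(tag) % (uint64_t) n_nodes);
}

int
main(void)
{
    LockTag tag = { .locktag_type = 1, .db_oid = 16384, .obj_oid = 24576 };
    printf("master = node %d\n", ges_master_node(&tag, 3));
    return 0;
}
```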
The master node is responsible for maintaining two core data structures for that resource:
- grant_list: the list of (node, mode) pairs that have been granted the lock. Multiple nodes may appear simultaneously in the grant_list for the same resource, as long as their modes are compatible (e.g., multiple AccessShare holders).
- convert_queue: the queue of requests waiting for a lock conversion (or initial acquisition), ordered by FIFO + priority to prevent starvation.

The local_lockid held by non-master nodes is merely a local cache of the master's view — it reduces round-trip messages to the master (local fast path), but the authoritative state always lives in the master's grant_list / convert_queue.
```
                 +----------+   +----------+   +----------+
 resource -----> |  Master  |   |  Master  |   |  Master  |
 hash mod N      |  Node 1  |   |  Node 2  |   |  Node 3  |
                 +----------+   +----------+   +----------+
                  hash%3=0       hash%3=1       hash%3=2

 local_lockid (Node N)  →  forwards to  →  master(resource)
                                                   ↓
                                     grant_list + convert_queue
```
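One plausible shape for the master-side state, with grant_list and convert_queue as simple linked lists; every type and field name below is an assumption for illustration, not pgrac's data structure.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum LockMode            /* PG's 8 modes, abbreviated identifiers */
{
    ACCESS_SHARE = 1, ROW_SHARE, ROW_EXCLUSIVE, SHARE_UPDATE_EXCLUSIVE,
    SHARE, SHARE_ROW_EXCLUSIVE, EXCLUSIVE, ACCESS_EXCLUSIVE
} LockMode;

typedef struct GesGrant          /* one entry in grant_list */
{
    int              node_id;    /* holder node  */
    LockMode         mode;       /* granted mode */
    struct GesGrant *next;
} GesGrant;

typedef struct GesConvertReq     /* one entry in convert_queue (FIFO + priority) */
{
    int                   node_id;    /* requesting node           */
    LockMode              req_mode;   /* mode being requested      */
    bool                  bast_sent;  /* BAST delivered to holder? */
    struct GesConvertReq *next;
} GesConvertReq;

typedef struct GesResource       /* authoritative state, lives on the master */
{
    uint64_t       key_hash;        /* hash(LockTag)      */
    GesGrant      *grant_list;      /* compatible holders */
    GesConvertReq *convert_queue;   /* waiting requests   */
} GesResource;
```

Compatibility checks run against grant_list; a request that fails them waits in convert_queue until a release lets the master promote it, which is the BAST flow described next.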
The DRM migration mechanism for master routing (dynamic migration of the master for a hot resource to the active node) is shared with PCM and uses the same GRD protocol; for details see
BAST (Blocking AST, Blocking Asynchronous System Trap) is the most important design in GES that distinguishes it from traditional blocking locks. In a traditional single-instance lock manager, when node B requests a lock that conflicts with one held by node A, B's request is placed in a wait queue and B's process is suspended — this is synchronous wait.
The GES BAST pattern does not leave the requester suspended in a wait queue with nothing happening; instead:

1. The requester (node 1) sends GES_LOCK_CONVERT (or GES_LOCK_REQUEST) to the master (node 2); the master detects the conflict and records the request in the convert_queue.
2. The master sends GES_BAST to the current holder (node 3), notifying it that "someone needs this lock — please release it at your convenience."
3. The holder sends GES_LOCK_RELEASE at commit or rollback.
4. The master removes the request from the convert_queue, grants the mode requested by node 1, and sends GES_LOCK_GRANT.

```
Node 1 (requester)        Node 2 (master)        Node 3 (current holder)
      |                         |                         |
      |--- 1. CONVERT --------->|                         |
      |                         |--- 2. BAST ------------>|
      |                         |                         |
      |                         |<-- 3. RELEASE ----------|
      |                         |      on commit          |
      |<-- 4. GRANT ------------|                         |
      |                         |                         |
```
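To tie the four steps together, here is a deliberately oversimplified, single-waiter sketch of the master-side handling; message sending is stubbed with printf, and all names are illustrative rather than pgrac's actual API.

```c
#include <stdio.h>

typedef enum { GES_BAST, GES_LOCK_GRANT } GesMsgType;

/* Stub: in a real cluster this would travel over the interconnect. */
static void
send_msg(int node, GesMsgType type)
{
    printf("send %s to node %d\n",
           type == GES_BAST ? "GES_BAST" : "GES_LOCK_GRANT", node);
}

typedef struct
{
    int holder_node;   /* -1 if unheld (single-holder simplification)  */
    int waiter_node;   /* -1 if the convert_queue is empty             */
} GesResource;

/*
 * Steps 1+2: the requester's CONVERT arrives at the master.  Real GES
 * checks mode compatibility against the full grant_list; here any existing
 * holder counts as a conflict.
 */
static void
on_lock_convert(GesResource *res, int requester)
{
    if (res->holder_node < 0)
    {
        res->holder_node = requester;        /* no conflict: grant at once */
        send_msg(requester, GES_LOCK_GRANT);
        return;
    }
    res->waiter_node = requester;            /* record in convert_queue    */
    send_msg(res->holder_node, GES_BAST);    /* ask the holder to release  */
}

/* Steps 3+4: the holder's RELEASE arrives at commit/rollback; grant waiter. */
static void
on_lock_release(GesResource *res)
{
    res->holder_node = res->waiter_node;
    res->waiter_node = -1;
    if (res->holder_node >= 0)
        send_msg(res->holder_node, GES_LOCK_GRANT);
}

int
main(void)
{
    GesResource res = { .holder_node = 3, .waiter_node = -1 };
    on_lock_convert(&res, 1);   /* node 1 requests; node 3 holds -> BAST  */
    on_lock_release(&res);      /* node 3 commits -> grant to node 1      */
    return 0;
}
```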
The core value of BAST is that it does not interrupt the holder's normal transaction execution. Traditional blocking locks suspend the requester until the holder's transaction commits and then wake it via an OS signal or futex, which in a cross-node scenario adds an extra network round trip to the requester's wait path. BAST converts this "request signal" into an asynchronous notification to the holder — the holder handles it naturally within its own transaction lifecycle, without the requester polling repeatedly and without unnecessary context switches.
On the conflict-free path, the target cross-instance acquisition latency for a GES lock is approximately 5 μs (Tier 1 RDMA). With a conflict where the holder's transaction is about to commit, the latency is approximately 10–20 μs. With a conflict where the holder's transaction is still running, the wait time is determined by the holder transaction (potentially seconds).
BAST is cooperative release, not preemptive. After receiving BAST, the holder may continue executing its current transaction until it commits; the master will not forcibly interrupt it. This means long-running transactions can cause unpredictably long wait times for the requester — long-transaction management is a performance strategy that application layers using GES must actively address.
pgrac deliberately does not change PG's native 8 lock modes — this is one of the core decisions in AD-012. The complete set from AccessShareLock to AccessExclusiveLock is preserved unchanged in the cluster:
| Mode | Typical trigger | Conflicts with |
|---|---|---|
| AccessShareLock | SELECT | AccessExclusiveLock only |
| RowShareLock | SELECT FOR UPDATE | ExclusiveLock, AccessExclusiveLock |
| RowExclusiveLock | INSERT / UPDATE / DELETE | Share and above |
| ShareUpdateExclusiveLock | VACUUM, ANALYZE | ShareUpdateExclusive and above |
| ShareLock | CREATE INDEX (non-concurrent) | RowExclusive and above (but not Share itself) |
| ShareRowExclusiveLock | Triggers, etc. | RowExclusive and above |
| ExclusiveLock | REFRESH MATERIALIZED VIEW CONCURRENTLY | RowShare and above |
| AccessExclusiveLock | ALTER TABLE, DROP | All modes |
pgrac changes only two things: scope (extended from single-node to cluster-wide) and acquire path (local fast path unchanged; cross-node scenarios are coordinated through GES master routing). The compatibility matrix, lock-escalation semantics, and wait graph (deadlock detection) are fully consistent with single-instance PG — this ensures that existing PG applications migrating to a pgrac cluster experience no unexpected differences in lock behavior.
The local fast path remains efficient: when the node's LOCALLOCK cache hits (the node already holds a compatible mode for that lock), GES produces no network messages whatsoever, and latency is identical to single-instance PG. Only when there is a LOCALLOCK miss and the resource is a cluster-aware type does GES trigger the cross-node protocol.
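A sketch of that decision flow follows, under the simplifying assumption that one held mode covers any weaker request (PG's real LOCALLOCK tracks per-mode hold counts); the names and the compatibility test are placeholders, not pgrac's code.

```c
#include <stdbool.h>

typedef enum
{
    ACCESS_SHARE = 1, ROW_SHARE, ROW_EXCLUSIVE, SHARE_UPDATE_EXCLUSIVE,
    SHARE, SHARE_ROW_EXCLUSIVE, EXCLUSIVE, ACCESS_EXCLUSIVE
} LockMode;

typedef struct
{
    bool     held;            /* does this node already hold the lock?     */
    LockMode held_mode;       /* strongest mode held locally               */
    bool     cluster_aware;   /* is this a GES-managed lock type?          */
} LocalLock;

/* Simplification: a held mode at least as strong as the request covers it. */
static bool
mode_covers(LockMode held, LockMode requested)
{
    return held >= requested;
}

/* Stub for the cross-node slow path (GES_LOCK_REQUEST to the master). */
static bool
ges_request_from_master(LocalLock *ll, LockMode mode)
{
    ll->held = true;          /* pretend the master granted the request */
    ll->held_mode = mode;
    return true;
}

static bool
lock_acquire(LocalLock *ll, LockMode mode)
{
    /* Fast path: LOCALLOCK hit -> no network messages, single-instance cost. */
    if (ll->held && mode_covers(ll->held_mode, mode))
        return true;

    /* Locks that are not cluster-aware keep PG's purely local behaviour. */
    if (!ll->cluster_aware)
    {
        ll->held = true;
        ll->held_mode = mode;
        return true;
    }

    /* Slow path: LOCALLOCK miss on a cluster-aware type -> GES protocol. */
    return ges_request_from_master(ll, mode);
}

int
main(void)
{
    LocalLock ll = { .held = false, .cluster_aware = true };
    return lock_acquire(&ll, ROW_EXCLUSIVE) ? 0 : 1;
}
```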
This chapter has established the GES conceptual framework: the PCM/GES boundary is determined by "buffer cache in / buffer cache out" (Section 3.3.1 explains why the boundary cannot be crossed); master routing uses hash(LockTag) % N to shard each enqueue resource to a unique master node, with grant_list + convert_queue as the authoritative state maintained by the master; the BAST pattern implements cooperative release, allowing the holder to handle lock release naturally within its own transaction lifecycle without being forcibly interrupted; and PG's 8 lock modes are fully preserved, with scope extended from single-node to cluster and the acquire path using the GES protocol on LOCALLOCK miss.
The following deep pages provide complete implementation-level details:
Subsequent manual chapters build on the GES conceptual foundation: Chapter 4 — SCN (Lamport clocks; all cross-node messages — including GES and PCM — piggyback the current SCN for causal ordering and consistent reads); Chapter 5 — Reconfiguration (when the cluster topology changes, the complete state-reconstruction process for GRD master reallocation, grant_list rebuild, and orphaned-lock cleanup).