Write-Ahead Log (WAL)

Purpose & Guarantees

The Write-Ahead Log (WAL) ensures that every committed transaction is recoverable after any failure, including power loss and OS crashes. The rule is simple: no data page may be written to disk unless the corresponding WAL record has already been durably flushed.

This gives Absolute DB two critical guarantees:

Durability (D in ACID): Committed transactions are never lost, even if the server crashes immediately after the commit acknowledgement is sent.
Crash recovery: On startup after a crash, the WAL is replayed from the last checkpoint to reconstruct any in-flight pages that were not yet flushed to their data files.

The WAL also serves as the source for streaming replication to read replicas and standby nodes, and as the input to the CDC (Change Data Capture) engine.

LSN — Log Sequence Number

Every WAL record is addressed by a monotonically increasing Log Sequence Number (LSN). LSNs are 64-bit integers displayed in the format segment/offset (e.g., 0/1048576). The LSN uniquely identifies a position in the WAL stream.

sql — Working with LSNs

-- Current WAL insert LSN (latest committed position)
SELECT absdb_current_lsn();

-- LSN of the last checkpoint
SELECT absdb_last_checkpoint_lsn();

-- Check replication lag as WAL bytes behind primary
SELECT
  replica_name,
  sent_lsn,
  replay_lsn,
  sent_lsn - replay_lsn AS lag_bytes
FROM absdb_replication_status;
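
Client code that stores or compares LSNs needs to parse the segment/offset text form. The helper below is a minimal sketch, not part of Absolute DB: the `Lsn` class is hypothetical, and it assumes both fields are printed in decimal and that the 64-bit value keeps the segment in the upper 32 bits and the offset in the lower 32 bits, which this page does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Lsn:
    """A WAL position; ordering compares segment first, then offset."""
    segment: int
    offset: int

    @classmethod
    def parse(cls, text: str) -> "Lsn":
        # Assumption: both parts are decimal, as in '0/1048576'.
        segment, sep, offset = text.partition("/")
        if not sep:
            raise ValueError(f"not an LSN: {text!r}")
        return cls(int(segment), int(offset))

    def to_int(self) -> int:
        # Assumption: segment in the high 32 bits, offset in the low 32 bits.
        return (self.segment << 32) | self.offset

    def __str__(self) -> str:
        return f"{self.segment}/{self.offset}"


saved = Lsn.parse("0/1048576")
newer = Lsn.parse("0/2097152")
print(saved < newer)                # ordering follows WAL order
print(newer.offset - saved.offset)  # bytes apart within the same segment
```

Comparing parsed values rather than raw strings avoids ordering bugs such as '0/9' sorting after '0/10'.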

CRC-32C Integrity

Every WAL record includes a CRC-32C checksum computed over the record header and body. During crash recovery and WAL replay, each record's checksum is verified before it is applied. A checksum mismatch indicates a corrupt or truncated WAL segment — the server halts and reports the offending LSN rather than applying corrupted data.

CRC-32C (Castagnoli) is used rather than CRC-32 because it is hardware-accelerated on modern CPUs (the CRC32 instruction in Intel SSE4.2, the CRC32C instructions in the ARM64 CRC extension) and has better error-detection properties for database workloads.
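
To make the verify-before-apply rule concrete, here is a rough sketch of a checksummed log scan. It is illustrative only: the record layout (4-byte length header, body, 4-byte little-endian checksum) is invented for the example and is not Absolute DB's on-disk format, and the checksum is a slow bitwise software CRC-32C rather than the hardware instruction.

```python
import struct

_POLY = 0x82F63B78  # CRC-32C (Castagnoli) polynomial, bit-reflected


def crc32c(data: bytes) -> int:
    """Bitwise software CRC-32C; slow, but yields the standard CRC-32C value."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def encode_record(body: bytes) -> bytes:
    # Hypothetical layout: length header, body, then CRC-32C over header + body.
    header = struct.pack("<I", len(body))
    return header + body + struct.pack("<I", crc32c(header + body))


def scan(log: bytes):
    """Return (verified bodies, byte offset at which verification stopped)."""
    records, pos = [], 0
    while pos + 4 <= len(log):
        (length,) = struct.unpack_from("<I", log, pos)
        end = pos + 4 + length + 4
        if end > len(log):
            break  # truncated record
        (stored,) = struct.unpack_from("<I", log, end - 4)
        if crc32c(log[pos : end - 4]) != stored:
            break  # corrupt record: stop before applying it
        records.append(log[pos + 4 : end - 4])
        pos = end
    return records, pos


log = encode_record(b"insert row 1") + encode_record(b"insert row 2")
print(scan(log))
```

The returned offset plays the role of the offending LSN: everything before it verified cleanly, and nothing at or after it is applied.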

Group Commit

Absolute DB batches WAL records from multiple concurrent transactions into a single fsync() call using group commit. Up to 64 WAL records are flushed per fsync, dramatically reducing the number of I/O operations required under concurrent write load.

Scenario	Without Group Commit	With Group Commit
64 concurrent commits	64 fsync calls	1 fsync call
Throughput impact	Limited by fsync latency per commit	Throughput scales with concurrency
Commit latency	1× fsync latency	1× fsync latency (shared)

Group commit is transparent — every transaction still gets a durable commit guarantee. The group commit window is bounded by the 64-record batch limit; transactions beyond that batch do not wait.
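
The effect of batching on the fsync count can be sketched in a few lines. This is a simplified, single-threaded illustration, not the server's implementation: `group_commit` is a hypothetical function, and it assumes that records beyond a full batch simply go into the next batch.

```python
import os
import tempfile


def group_commit(path: str, records: list[bytes], batch_size: int = 64) -> int:
    """Append records to path, calling fsync once per batch; return the fsync count."""
    fsyncs = 0
    with open(path, "ab") as wal:
        for start in range(0, len(records), batch_size):
            batch = records[start : start + batch_size]
            wal.write(b"".join(batch))  # one buffered write for the whole batch
            wal.flush()                 # push Python's buffer to the OS
            os.fsync(wal.fileno())      # one durable flush covers every record in the batch
            fsyncs += 1
    return fsyncs


with tempfile.TemporaryDirectory() as tmp:
    commits = [f"commit {i}\n".encode() for i in range(64)]
    print(group_commit(os.path.join(tmp, "wal.log"), commits))  # 64 commits, 1 fsync
```

A real server would assemble the batch from concurrently waiting transactions and wake all of them after the shared fsync returns; the loop above only shows the arithmetic behind the table.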

bash — Configure WAL commit behaviour

# Tune group commit batch size (default 64)
./bin/absdb-server --wal-group-commit 64

# Synchronous commit modes
# fsync  — full durability, maximum safety (default)
# nosync — highest throughput, risk of loss on crash (dev only)
./bin/absdb-server --wal-sync-mode fsync

WAL Streaming for Replication

Read replicas and standby nodes receive the WAL stream in real time over the Raft replication channel (port 9091) or directly via the WAL streaming replication protocol. Replicas apply WAL records continuously, maintaining a replication lag that is typically under 100 ms on a local network.

sql — Monitor replication

-- Replication status on the primary
SELECT
  replica_name,
  state,           -- streaming | catchup | idle
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  sync_state       -- async | sync | quorum
FROM absdb_replication_status;

-- Replication status on a replica
SELECT
  primary_host,
  receive_lsn,
  replay_lsn,
  lag_seconds
FROM absdb_replica_status;

WAL Tap & Change Data Capture (CDC)

The CDC engine taps the WAL to produce a stream of row-level change events — INSERT, UPDATE, DELETE — in Debezium-compatible JSON and Protobuf binary formats. This stream is delivered over WebSocket (ws://host:8080/cdc) and gRPC (AbsoluteDB.Subscribe).

sql — CDC subscription

-- Subscribe to all changes on the orders table
SUBSCRIBE TO TABLE orders;

-- Subscribe starting from a specific LSN (resume after restart)
SUBSCRIBE TO TABLE orders STARTING AT LSN '0/1048576';

-- Subscribe to the whole database
SUBSCRIBE TO DATABASE myapp STARTING AT LSN '0/1048576';

-- Filter events server-side (only ship relevant changes)
SUBSCRIBE TO TABLE orders WHERE status = 'paid';

json — Debezium-compatible CDC event

{
  "op":    "c",                    // c=create, u=update, d=delete
  "ts_ms": 1743850200000,
  "source": {
    "db":    "myapp",
    "table": "orders",
    "lsn":   1048576
  },
  "before": null,
  "after": {
    "id":     101,
    "user_id": 42,
    "total":  99.95,
    "status": "paid"
  }
}
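
A consumer typically folds these events into its own copy of the data. The sketch below shows one possible way to do that. It assumes that rows are keyed by an `id` column, that delete events carry the old row in `before` (the usual Debezium convention), and that events on the wire are plain JSON without the explanatory `//` comment shown above. `apply_event` is a hypothetical helper, not a client-library function.

```python
import json


def apply_event(table: dict, event: dict) -> None:
    """Apply one change event to an in-memory copy of the table, keyed by id."""
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


raw = """{"op": "c", "ts_ms": 1743850200000,
  "source": {"db": "myapp", "table": "orders", "lsn": 1048576},
  "before": null,
  "after": {"id": 101, "user_id": 42, "total": 99.95, "status": "paid"}}"""

orders = {}
apply_event(orders, json.loads(raw))
print(orders[101]["status"])
```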

The CDC ring buffer holds 100 MB of unacknowledged events. If a consumer falls too far behind, it receives a buffer overflow signal and must re-subscribe from a saved LSN. ACK messages advance the server-side cursor and free buffer space.
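
The interaction between the buffer limit, the overflow signal, and ACKs can be modelled with a byte-bounded queue. This is a toy model under stated assumptions, not the server's data structure: `CdcBuffer` is hypothetical, LSNs are plain integers, and an ACK is assumed to cover every event at or before the acknowledged LSN.

```python
from collections import deque


class CdcBuffer:
    """Byte-bounded queue of (lsn, payload) pairs awaiting acknowledgement."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.pending = deque()

    def publish(self, lsn: int, payload: bytes) -> bool:
        """Queue an event; return False to signal overflow to the consumer."""
        if self.used + len(payload) > self.capacity:
            return False
        self.pending.append((lsn, payload))
        self.used += len(payload)
        return True

    def ack(self, lsn: int) -> None:
        """Advance the cursor: drop every event at or before lsn."""
        while self.pending and self.pending[0][0] <= lsn:
            _, payload = self.pending.popleft()
            self.used -= len(payload)


buf = CdcBuffer(capacity_bytes=10)
buf.publish(1, b"aaaa")
buf.publish(2, b"bbbb")
print(buf.publish(3, b"cccc"))  # buffer full: overflow signalled
buf.ack(1)
print(buf.publish(3, b"cccc"))  # space freed by the ACK
```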

Re-Read Before Shutdown

During a clean shutdown, Absolute DB performs a Re-Read Before Shutdown pass: the WAL is re-scanned from the last checkpoint to the current insert LSN to ensure no in-flight records are dropped. This guarantees that even if the shutdown signal arrives while a group-commit batch is being assembled, all committed transactions are safely flushed.

On dirty shutdown (power loss, SIGKILL), recovery begins from the last valid checkpoint and replays all WAL records up to the last intact CRC-32C-verified record.

SUBSCRIBE … STARTING AT LSN

Any WAL consumer — a replica, a CDC subscriber, or an application — can resume from an arbitrary LSN position. This is essential for fault-tolerant consumers that must survive restarts without missing events.

sql — Resume CDC from a saved LSN

-- Application saves the last processed LSN to its own store
-- On restart, resume exactly where it left off
SUBSCRIBE TO TABLE payments
  STARTING AT LSN '0/2097152'
  FORMAT DEBEZIUM_JSON;

The WAL segments required to serve a given LSN must still be present on disk or in the WAL archive. Requests for LSNs older than the WAL retention window return an error; the consumer must fall back to a full snapshot and then resume streaming.
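
On the consumer side, resuming depends on saving the last processed LSN reliably. The following sketch shows one common approach (write to a temporary file, fsync, then rename). The file location and the helper names `save_lsn`, `load_lsn`, and `subscribe_statement` are hypothetical, and the table name and LSN are assumed to come from trusted configuration, since they are interpolated directly into the statement.

```python
import os
import tempfile


def save_lsn(path: str, lsn: str) -> None:
    """Persist the last processed LSN atomically (temp file, fsync, rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(lsn)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see either the old or the new value, never a partial write


def load_lsn(path: str):
    """Return the saved LSN, or None when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def subscribe_statement(table: str, saved_lsn) -> str:
    # With no saved LSN, subscribe without a starting position.
    if saved_lsn is None:
        return f"SUBSCRIBE TO TABLE {table};"
    return f"SUBSCRIBE TO TABLE {table} STARTING AT LSN '{saved_lsn}';"
```

Saving the LSN only after an event has been fully processed gives at-least-once delivery: after a crash the consumer may see the last event again, so handlers should be idempotent.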

WAL Archiving to Object Storage

Completed WAL segments are archived to object storage (S3, GCS, Azure Blob) automatically. Archived segments are the source for PITR recovery and serve as a durable off-site backup of all changes.

bash — WAL archive configuration

# absdb.conf
wal_archive_enabled  = true
wal_archive_target   = s3://my-bucket/wal-archive/
wal_archive_compress = zstd    # none | lz4 | zstd
wal_archive_interval = 60      # seconds between segment uploads
wal_segment_size     = 16MB    # segment size (16 MB default)

# View archived segments
absdb wal-list s3://my-bucket/wal-archive/

# Restore a specific WAL segment for inspection
absdb wal-fetch \
  s3://my-bucket/wal-archive/000000010000000000000001 \
  /tmp/wal-segment-inspect

PITR Walkthrough

To recover to an exact point in time (e.g., one minute before an accidental mass delete):

bash — Full PITR workflow

# 1. Identify the base backup closest to the target time
absdb backup --list s3://my-bucket/backups/
# → full-base-20260404  (taken 2026-04-04 02:00 UTC)

# 2. Restore the base backup to a temporary directory
absdb restore \
  --from s3://my-bucket/backups/full-base-20260404 \
  --to   /tmp/absdb-pitr

# 3. Configure recovery target
cat > /tmp/absdb-pitr/recovery.conf <

Configuring WAL Retention for PITR

WAL segments on disk are retained until they are no longer needed for crash recovery or replication. The retention window determines how far back a PITR recovery can reach.

bash — WAL retention settings

# Retain 14 days of WAL for PITR
./bin/absdb-server --wal-retention-days 14

# Or in absdb.conf
wal_retention_days = 14

# Force removal of WAL segments no longer needed
# (normally automatic; use only for emergency disk recovery)
absdb admin wal-cleanup --before-lsn '0/8000000'

If WAL disk usage is a concern, enable archive compression (wal_archive_compress = zstd) and move segments to object storage promptly. The PITR window is determined by the oldest WAL segment still available, whether in the local archive or in the object storage archive.
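
The retention rule can be written as a small predicate. This page does not say exactly how the retention window combines with the crash-recovery and replication requirements, so the sketch assumes a segment is removable only when it is both older than the window and entirely behind the oldest LSN still needed. `Segment`, `removable`, and the segment names are hypothetical, and LSNs are plain integers for brevity.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    end_lsn: int         # position just past the segment's last record
    closed_at: datetime  # when the segment was completed


def removable(segments, retention_days, oldest_needed_lsn, now):
    """Names of segments outside the retention window and no longer needed."""
    cutoff = now - timedelta(days=retention_days)
    return [
        s.name
        for s in segments
        if s.closed_at < cutoff and s.end_lsn <= oldest_needed_lsn
    ]


now = datetime(2026, 4, 20, tzinfo=timezone.utc)
segments = [
    Segment("seg-1", 100, now - timedelta(days=30)),
    Segment("seg-2", 200, now - timedelta(days=20)),
    Segment("seg-3", 300, now - timedelta(days=2)),
]
print(removable(segments, 14, oldest_needed_lsn=150, now=now))
```

Here only the first segment qualifies: the second is old enough but still needed (for example by a lagging replica), and the third is inside the 14-day window.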

absdb_wal_stats Virtual Table

The absdb_wal_stats virtual table provides a real-time view of WAL health and throughput:

sql — WAL statistics

SELECT * FROM absdb_wal_stats;

-- Key columns:
--   current_lsn          — current WAL insert position
--   last_checkpoint_lsn  — LSN of most recent checkpoint
--   wal_bytes_written     — total bytes written since start
--   wal_records_written   — total records since start
--   group_commit_batches  — total group-commit fsync operations
--   avg_records_per_fsync — efficiency indicator (target: 30–64)
--   last_archived_lsn     — most recently archived segment's end LSN
--   last_archived_time    — timestamp of last successful archive
--   archive_lag_seconds   — how far behind archiving is (target: < 120)
--   wal_segment_size_bytes — configured segment size

-- Alert if archive lag exceeds 5 minutes
SELECT CASE
  WHEN archive_lag_seconds > 300
  THEN 'WARNING: WAL archive lag exceeds 5 minutes'
  ELSE 'OK'
END AS archive_health
FROM absdb_wal_stats;
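
A monitoring job might apply the same thresholds outside the database. The function below is a sketch that takes one absdb_wal_stats row as a dictionary. How the row is fetched is left out because no client library is described here, and `wal_health` is a hypothetical name. The thresholds are the targets listed in the column comments above.

```python
def wal_health(stats: dict) -> list[str]:
    """Return warnings for one absdb_wal_stats row, using the documented targets."""
    warnings = []
    lag = stats["archive_lag_seconds"]
    if lag > 300:
        warnings.append("WAL archive lag exceeds 5 minutes")
    elif lag >= 120:
        warnings.append("WAL archive lag is above the 120 s target")
    # A low average can simply mean light write load, so treat it as informational.
    if stats["avg_records_per_fsync"] < 30:
        warnings.append("avg_records_per_fsync is below the 30-64 target")
    return warnings


print(wal_health({"archive_lag_seconds": 45, "avg_records_per_fsync": 48}))
print(wal_health({"archive_lag_seconds": 400, "avg_records_per_fsync": 12}))
```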
