Storage Format

Overview

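Absolute DB supports multiple storage formats, each optimised for a specific access pattern. The three table storage formats (B+Tree, PAX columnar, and LSM-Tree) share the same MVCC transaction layer, WAL, and LIRS buffer pool; they differ only in how data is physically arranged on disk. The remaining formats cover logging (WAL records) and data interchange (Binary COPY, Arrow IPC, Parquet).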

Format	Page Size	Best For
B+Tree (row store)	4 KB (default)	OLTP: point lookups, range scans, random writes
PAX columnar	64 KB	OLAP: full-column scans, aggregations, compression
LSM-Tree SSTables	4 KB (variable)	Write-heavy: sensor data, logs, append-mostly workloads
WAL records	Variable	Durability, replication, PITR, CDC
Binary COPY	Streaming	Bulk load and export: about 3× faster than text COPY
Arrow IPC	Aligned (64B)	Analytics interop: zero-copy transfer to Arrow consumers
Parquet	Row groups	Data lake interop: Spark, DuckDB, Snowflake, BigQuery

B+Tree Row Pages (4 KB)

The default row-store uses B+Tree with 4 KB pages. Each page has a fixed header, a slot array pointing to variable-length row tuples, and free space at the end of the page. Pages are aligned to 4 KB boundaries for efficient direct I/O.

Key characteristics that affect user-visible behaviour:

Bloom filters on leaf pages: Before fetching a leaf page from disk for a point lookup, a per-page bloom filter is consulted. If the filter is negative, the page is skipped entirely — eliminating unnecessary I/O for keys that don't exist.
MVCC tuples: Each row version includes a transaction ID range. Older versions are retained until VACUUM determines no active transaction can still see them.
Pluggable page size: PAGE_SIZE, set at table creation, raises the page size to 64 KB for analytics or 2 MB for bulk ingest. It cannot be changed afterwards.
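
The bloom-filter check described above can be sketched as follows. This is a minimal illustration, not Absolute DB's implementation: the filter size, the number of hash functions, the use of SHA-256, and the names BloomFilter, point_lookup, and read_page are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        # num_bits is assumed to be a multiple of 8 in this sketch.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive the bit positions from one SHA-256 digest (illustrative choice).
        digest = hashlib.sha256(key).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i : 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def point_lookup(key: bytes, leaf_filter: BloomFilter, read_page):
    """Skip the page read when the filter proves the key is absent."""
    if not leaf_filter.might_contain(key):
        return None  # definite miss: no I/O performed
    # The filter answered "maybe", so the page has to be read and searched.
    return read_page().get(key)
```

A negative answer is always correct, so skipping the page is safe. A positive answer can be a false positive, which costs one wasted page read but never a wrong result.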

sql — Page size selection

-- OLTP (default 4 KB)
CREATE TABLE orders (id BIGINT, ...) PAGE_SIZE 4096;

-- Analytics (64 KB — fewer I/O calls for large scans)
CREATE TABLE metrics (ts TIMESTAMP, ...) PAGE_SIZE 65536;

-- Bulk ingest (2 MB — maximum sequential write throughput)
CREATE TABLE raw_events (id BIGINT, ...) PAGE_SIZE 2097152;

PAX Columnar Pages (64 KB)

PAX (Partition Attributes Across) pages keep a set of rows together on one page but store each column's values contiguously within that page. This layout achieves much higher compression ratios than row-oriented storage because adjacent values in a column tend to be similar. SIMD instructions can process entire columns in bulk without row-hopping.

Each 64 KB PAX page contains:

Header (64 bytes): magic number, column count, row count, per-column data offsets, and per-column zone maps (min/max values).
Per-column null bitmap: one bit per row, 32-byte aligned, indicates which rows have NULL in this column.
Per-column data array: fixed-width values for numeric types; offset/length pairs for variable-width types.
Footer: CRC-32C checksum over the entire page.
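
As a rough illustration of the page layout listed above, the sketch below serialises a single nullable BIGINT column into a 64 KB page and verifies its checksum. The exact header fields and their widths, the magic value PAX1, the bit order of the null bitmap, and the choice to checksum everything except the 4-byte footer are assumptions; the real on-disk layout is not specified here.

```python
import struct

PAGE_SIZE = 64 * 1024
HEADER_SIZE = 64
# Assumed header fields: magic, column count, row count, bitmap offset,
# data offset, zone-map min, zone-map max. The rest of the 64 bytes is padding.
HEADER = struct.Struct("<4sHIIIqq")

# Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
_CRC_TABLE = []
for _n in range(256):
    _c = _n
    for _ in range(8):
        _c = (_c >> 1) ^ (0x82F63B78 if _c & 1 else 0)
    _CRC_TABLE.append(_c)


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def align(offset: int, boundary: int) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def build_page(values) -> bytes:
    """Serialise one nullable BIGINT column (None means NULL) into one page."""
    rows = len(values)
    bitmap = bytearray((rows + 7) // 8)
    for i, value in enumerate(values):
        if value is None:
            bitmap[i // 8] |= 1 << (i % 8)  # bit set = NULL (assumed convention)
    bitmap_off = HEADER_SIZE  # 64 is already 32-byte aligned
    data_off = align(bitmap_off + len(bitmap), 32)
    if data_off + 8 * rows > PAGE_SIZE - 4:
        raise ValueError("column does not fit in one page")
    present = [v for v in values if v is not None]
    page = bytearray(PAGE_SIZE)
    HEADER.pack_into(page, 0, b"PAX1", 1, rows, bitmap_off, data_off,
                     min(present, default=0), max(present, default=0))
    page[bitmap_off:bitmap_off + len(bitmap)] = bitmap
    struct.pack_into(f"<{rows}q", page, data_off,
                     *(0 if v is None else v for v in values))
    # Footer: checksum over everything before the last 4 bytes.
    struct.pack_into("<I", page, PAGE_SIZE - 4, crc32c(bytes(page[:-4])))
    return bytes(page)


def verify_page(page: bytes) -> bool:
    (stored,) = struct.unpack_from("<I", page, PAGE_SIZE - 4)
    return stored == crc32c(page[:-4])


page = build_page([10, None, 30])
print(len(page), verify_page(page))
```

Because the null bitmap and the data array sit at fixed, aligned offsets recorded in the header, a reader can locate one column's values without touching the others.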

Encoding	Applies To	Effect
RLE (Run-Length Encoding)	Low-cardinality columns, sorted data	Repeating values stored as (value, count) pairs — 10–100× compression
Bit-packing	Small integers, enum codes, boolean flags	Store values at their minimum bit width — 2–8× compression
Dictionary	String columns with ≤ 256 distinct values	Replace strings with 1-byte codes — 4–32× compression
Delta + RLE	Timestamps, monotonic counters	Store deltas between adjacent values, then RLE — 5–20× compression
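
The four encodings in the table can be illustrated with a few lines each. This is a simplified sketch of the general techniques, assuming non-negative integers for bit-packing and a non-empty input for delta encoding; it does not reproduce Absolute DB's actual page encoders, and all function names are hypothetical.

```python
from itertools import accumulate, groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]


def bit_pack(values):
    """Store non-negative integers at the minimum bit width that fits them all."""
    width = max(max(values).bit_length(), 1)
    packed = 0
    for i, value in enumerate(values):
        packed |= value << (i * width)
    return width, packed.to_bytes((len(values) * width + 7) // 8, "little")


def bit_unpack(width, data, count):
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def dictionary_encode(strings):
    """Replace each string with a 1-byte code (at most 256 distinct values)."""
    dictionary = sorted(set(strings))
    if len(dictionary) > 256:
        raise ValueError("too many distinct values for 1-byte codes")
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, bytes(index[s] for s in strings)


def delta_rle_encode(values):
    """Store the first value and the successive differences, then apply RLE."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return rle_encode(deltas)


def delta_rle_decode(pairs):
    # A running sum over the deltas restores the original values.
    return list(accumulate(rle_decode(pairs)))


timestamps = [1000, 1010, 1020, 1030, 1040]
print(delta_rle_encode(timestamps))  # [(1000, 1), (10, 4)]
print(bit_pack([3, 0, 7, 1]))        # width 3, so four values fit in 2 bytes
```

Regularly spaced timestamps show why Delta + RLE works well: five values collapse into two pairs, and the pair count stays at two however long the evenly spaced run becomes.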

Zone maps (stored in the page header) allow the query planner to skip entire pages when the query predicate falls outside the page's min/max range. For example, a query with WHERE ts > '2026-04-01' skips all pages whose max timestamp is before that date — without reading a single row from those pages.
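
The page-skipping decision for the WHERE ts > '2026-04-01' example can be sketched as below. The PageZoneMap class and the pages_to_scan function are hypothetical names for illustration; the planner's real data structures are not described in this document.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageZoneMap:
    """Min/max summary for the ts column of one page (illustrative)."""
    page_id: int
    min_ts: date
    max_ts: date


def pages_to_scan(zone_maps, lower_bound: date):
    """Return the pages that may hold rows satisfying ts > lower_bound."""
    # A page whose maximum is not above the bound cannot contain a match,
    # so it is skipped using only its header.
    return [z.page_id for z in zone_maps if z.max_ts > lower_bound]


zone_maps = [
    PageZoneMap(1, date(2026, 1, 1), date(2026, 2, 28)),
    PageZoneMap(2, date(2026, 3, 15), date(2026, 4, 10)),
    PageZoneMap(3, date(2026, 4, 11), date(2026, 5, 1)),
]
print(pages_to_scan(zone_maps, date(2026, 4, 1)))  # [2, 3]
```

Page 2 still has to be read even though some of its rows fall before the bound: a zone map can rule a page out, but it cannot confirm that every row in a page matches.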

LSM-Tree SSTables

The LSM-Tree (Log-Structured Merge-Tree) backend is optimised for write-heavy workloads. Writes go to an in-memory MemTable first, which is periodically flushed to immutable disk files called SSTables (Sorted String Tables). Background compaction merges SSTables into larger levels.

Component	Description
MemTable	In-memory write buffer (sorted by key). Flushed to L0 when full (default 64 MB).
L0 SSTables	Freshly flushed MemTables. May overlap in key range. Compacted to L1.
L1–LN SSTables	Leveled compaction: each level is 10× larger than the previous one. No key overlap within a level.
Bloom filter	Per-SSTable bloom filter eliminates disk reads for missing keys.

LSM is ideal for time-series data, sensor readings, and audit logs where write throughput is the priority. Point lookup performance is slightly lower than B+Tree because multiple SSTable levels may need to be checked.
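
The write and read paths described above can be modelled in a few lines. This toy sketch keeps everything in memory, uses a tiny MemTable limit instead of 64 MB, and merges all SSTables in one step rather than level by level; the TinyLSM class and its methods are hypothetical and only illustrate why a lookup may have to check several SSTables.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM tree: a dict MemTable plus immutable sorted tables."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable = {}
        self.sstables = []  # newest first; each is (sorted keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the MemTable out as a new immutable, sorted table."""
        if self.memtable:
            items = sorted(self.memtable.items())
            table = ([k for k, _ in items], [v for _, v in items])
            self.sstables.insert(0, table)
            self.memtable = {}

    def get(self, key):
        """Return (value, number of SSTables checked)."""
        if key in self.memtable:
            return self.memtable[key], 0
        for checked, (keys, values) in enumerate(self.sstables, start=1):
            i = bisect_left(keys, key)  # binary search in the sorted table
            if i < len(keys) and keys[i] == key:
                return values[i], checked  # the newest version wins
        return None, len(self.sstables)

    def compact(self) -> None:
        """Merge all tables into one, keeping the newest value per key."""
        merged = {}
        for keys, values in reversed(self.sstables):  # oldest first
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.sstables = (
            [([k for k, _ in items], [v for _, v in items])] if items else []
        )


db = TinyLSM()
for key, value in [("a", 1), ("b", 2), ("a", 10), ("c", 3)]:
    db.put(key, value)
print(db.get("b"))  # (2, 2): found only in the older of two tables
db.compact()
print(db.get("b"))  # (2, 1): one table left after compaction
```

Writes never modify an existing table, which is what keeps ingest fast. The cost shows up on reads, and compaction is what keeps the number of tables to check bounded.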

sql — Enable LSM backend

-- Create a table using the LSM backend
CREATE TABLE sensor_readings (
    device_id  TEXT,
    ts         TIMESTAMP,
    value      DOUBLE PRECISION
) USING LSM;

-- Force manual compaction (normally automatic)
CALL absdb_lsm_compact('sensor_readings');

WAL Records

WAL records are variable-length. Each record begins with a fixed header followed by the payload. The header includes the LSN, transaction ID, record type, payload length, and a CRC-32C checksum. Group commit batches up to 64 records per fsync.
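
A record with this header can be framed and validated roughly as follows. The field widths, the little-endian byte order, the numeric record-type codes, and the choice to checksum the header fields plus the payload are assumptions for illustration; only the list of header fields comes from the description above.

```python
import struct

# Assumed header layout: LSN (u64), transaction ID (u64), record type (u8),
# payload length (u32), CRC-32C (u32).
HEADER = struct.Struct("<QQBII")
RECORD_TYPES = {"INSERT": 1, "UPDATE": 2, "DELETE": 3, "COMMIT": 4}  # hypothetical codes


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def encode_record(lsn: int, txid: int, rtype: int, payload: bytes) -> bytes:
    covered = struct.pack("<QQBI", lsn, txid, rtype, len(payload)) + payload
    return HEADER.pack(lsn, txid, rtype, len(payload), crc32c(covered)) + payload


def decode_record(buf: bytes, offset: int = 0):
    """Return (lsn, txid, type, payload, offset of the next record)."""
    lsn, txid, rtype, length, stored = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    payload = buf[start:start + length]
    covered = struct.pack("<QQBI", lsn, txid, rtype, length) + payload
    if len(payload) != length or crc32c(covered) != stored:
        raise ValueError(f"corrupt or truncated WAL record at offset {offset}")
    return lsn, txid, rtype, payload, start + length


def replay(log: bytes):
    """Yield every record in a log buffer, in LSN order as written."""
    offset = 0
    while offset < len(log):
        *record, offset = decode_record(log, offset)
        yield tuple(record)


log = encode_record(1, 7, RECORD_TYPES["INSERT"], b"row") \
    + encode_record(2, 7, RECORD_TYPES["COMMIT"], b"")
for record in replay(log):
    print(record)
```

The payload length in the header is what makes variable-length records walkable, and the checksum lets recovery stop at the first torn or corrupt record instead of replaying garbage.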

Record Type	Contents
INSERT	Table OID, page ID, slot number, new row data
UPDATE	Table OID, old page/slot, new page/slot, new row data
DELETE	Table OID, page ID, slot number
CHECKPOINT	Snapshot of all active transaction IDs, buffer pool dirty list
COMMIT	Transaction ID, commit timestamp
ROLLBACK	Transaction ID, undo chain pointer
DDL	Schema change descriptor (CREATE/ALTER/DROP)

Binary COPY Format

Absolute DB supports PostgreSQL-compatible binary COPY format for bulk data load and export. Binary COPY is approximately 3× faster than text COPY because it eliminates text parsing and format conversion overhead.

sql — Binary COPY commands

-- Bulk import from binary COPY file
COPY orders FROM '/data/orders-dump.bin' WITH (FORMAT binary);

-- Bulk export to binary COPY file
COPY (SELECT * FROM orders WHERE created_at > '2026-01-01')
TO '/tmp/orders-export.bin' WITH (FORMAT binary);

-- Load a client-side file through psql or any PG-compatible client
\copy orders FROM 'orders.bin' WITH (FORMAT binary)

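The binary format is compatible with PostgreSQL's binary COPY protocol, so files written by PostgreSQL with COPY ... TO ... WITH (FORMAT binary), and output from tools that emit the same binary COPY stream, load directly into Absolute DB without conversion. Note that pg_dump archives use their own formats rather than binary COPY.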

Apache Arrow IPC

Absolute DB exports query results in Apache Arrow IPC format — a columnar, zero-copy memory layout widely used by analytics frameworks (pandas, DuckDB, Polars, Spark, BigQuery Storage API). No libarrow dependency is needed — the IPC format is generated natively.

sql / bash — Arrow IPC export

-- Export a query result as Arrow IPC (file format)
COPY (SELECT * FROM metrics WHERE ts > '2026-01-01')
TO '/tmp/metrics.arrow' WITH (FORMAT arrow);

# Shell: fetch Arrow IPC over HTTP (REST endpoint) and save it to a file
curl http://localhost:8080/api/query \
  -H 'Accept: application/vnd.apache.arrow.file' \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT * FROM metrics LIMIT 1000000"}' \
  --output /tmp/metrics-http.arrow

python — Read Absolute DB Arrow export with pandas

import pyarrow.ipc as ipc

# Read the Arrow IPC file produced by Absolute DB
with open('/tmp/metrics.arrow', 'rb') as f:
    reader = ipc.open_file(f)
    table = reader.read_all()

# to_pandas() requires pandas to be installed
df = table.to_pandas()
print(df.describe())

Arrow IPC output is aligned to 64-byte boundaries, enabling zero-copy reads on systems that support memory-mapped files. Record batches are self-describing — consumers do not need the database schema separately.

Apache Parquet

Parquet is a widely adopted columnar file format for data lakes and analytics platforms. Absolute DB reads and writes Parquet files natively — no libparquet or external library is required.

Parquet Feature	Absolute DB Support
Column encodings	PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY, BIT_PACKED
Compression codecs	LZ4 (fast), Zstd (high ratio)
Row groups	Configurable size (default 128 MB)
Column statistics	min/max/null_count per column per row group
Schema	Derived from SQL table definition; type mapping documented in API reference

sql — Parquet export and import

-- Export to Parquet (local file)
COPY (SELECT * FROM orders WHERE year = 2026)
TO '/data/orders-2026.parquet' WITH (FORMAT parquet, COMPRESSION zstd);

-- Export to Parquet on S3
COPY (SELECT * FROM orders)
TO 's3://my-bucket/exports/orders.parquet'
WITH (FORMAT parquet, COMPRESSION lz4);

-- Import from Parquet
COPY orders_archive FROM '/data/orders-2025.parquet'
WITH (FORMAT parquet);

Parquet files produced by Absolute DB are compatible with Apache Spark, DuckDB, Polars, Snowflake external tables, BigQuery, AWS Athena, and any other Parquet-compatible tool. The row group statistics are populated so predicate pushdown works in downstream tools.

Pluggable Page Sizes

Page size is configurable per table at creation time. The choice affects I/O efficiency for different access patterns:

Page Size	Best For	Trade-off
4 KB (default)	OLTP, mixed workloads, random access	Lowest overhead per point lookup
64 KB	Analytics, PAX columnar, time-series	More data per I/O — better scan throughput
2 MB	Bulk ingest, archival, write-intensive	Maximum sequential write throughput; higher memory per page in pool

Page size cannot be changed after table creation. If you need to change page size, export the data (using COPY or Parquet), drop and recreate the table with the new page size, and reload.
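
The trade-offs in the table come down to simple arithmetic, sketched below. The figures ignore page headers, compression, and read-ahead, and the 1 GiB table and 256 MiB buffer pool are arbitrary example sizes, so treat the output as an order-of-magnitude guide only.

```python
PAGE_SIZES = {"4 KB": 4 * 1024, "64 KB": 64 * 1024, "2 MB": 2 * 1024 * 1024}

GIB = 1024 ** 3
MIB = 1024 ** 2


def page_reads_for_scan(table_bytes: int, page_size: int) -> int:
    """Number of page reads needed to scan the whole table (rounded up)."""
    return -(-table_bytes // page_size)


def pages_in_pool(pool_bytes: int, page_size: int) -> int:
    """Number of distinct pages a buffer pool of this size can hold."""
    return pool_bytes // page_size


for label, size in PAGE_SIZES.items():
    print(
        f"{label:>5}: {page_reads_for_scan(GIB, size):>7} reads to scan 1 GiB, "
        f"{size:>8} bytes fetched per point lookup, "
        f"{pages_in_pool(256 * MIB, size):>6} pages fit in a 256 MiB pool"
    )
```

Larger pages cut the number of reads for a scan by the same factor that they increase the bytes fetched for a single-row lookup, which is why the default stays at 4 KB for OLTP.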

Compression Overview

Algorithm	Where Used	Ratio	Speed
RLE	PAX columnar, time-series chunks	10–100×	Very fast (CPU only)
Bit-packing	PAX integer columns	2–8×	Very fast (SIMD)
Dictionary	PAX string columns	4–32×	Fast
Gorilla (delta-of-delta)	Time-series float columns	~10×	Fast
LZ4	WAL archive, Parquet, object storage backup	2–4×	Very fast (> 500 MB/s)
Zstd	WAL archive, Parquet, backup compression	3–7×	Fast (200–400 MB/s)

All compression is transparent to SQL queries — compressed data is decompressed automatically during reads. Compression is applied at page or chunk granularity so the query engine can decompress only the pages it needs.
