How Absolute DB stores data on disk — row pages, columnar pages, LSM-Tree SSTables, WAL records, bulk load formats, and open interoperability formats (Arrow IPC, Parquet).
Absolute DB supports multiple storage formats, each optimised for a specific access pattern. All formats share the same MVCC transaction layer, WAL, and LIRS buffer pool — they differ only in how data is physically arranged on disk.
| Format | Page Size | Best For |
|---|---|---|
| B+Tree (row store) | 4 KB (default) | OLTP: point lookups, range scans, random writes |
| PAX columnar | 64 KB | OLAP: full-column scans, aggregations, compression |
| LSM-Tree SSTables | 4 KB (variable) | Write-heavy: sensor data, logs, append-mostly workloads |
| WAL records | Variable | Durability, replication, PITR, CDC |
| Binary COPY | Streaming | Bulk load: 3× faster than text COPY |
| Arrow IPC | Aligned (64B) | Analytics interop: zero-copy transfer to Arrow consumers |
| Parquet | Row groups | Data lake interop: Spark, DuckDB, Snowflake, BigQuery |
The default row-store uses B+Tree with 4 KB pages. Each page has a fixed header, a slot array pointing to variable-length row tuples, and free space at the end of the page. Pages are aligned to 4 KB boundaries for efficient direct I/O.
Key characteristics that affect user-visible behaviour:
Page size is set per table with PAGE_SIZE at table creation.
-- OLTP (default 4 KB)
CREATE TABLE orders (id BIGINT, ...) PAGE_SIZE 4096;
-- Analytics (64 KB — fewer I/O calls for large scans)
CREATE TABLE metrics (ts TIMESTAMP, ...) PAGE_SIZE 65536;
-- Bulk ingest (2 MB — maximum sequential write throughput)
CREATE TABLE raw_events (id BIGINT, ...) PAGE_SIZE 2097152;
PAX (Partition Attributes Across) pages store each column's data contiguously within the page. Because adjacent values in a column tend to be similar, this layout typically compresses much better than row-oriented storage, and SIMD instructions can process a column's values in bulk without stepping over unrelated row fields.
Each 64 KB PAX page contains:
| Encoding | Applies To | Effect |
|---|---|---|
| RLE (Run-Length Encoding) | Low-cardinality columns, sorted data | Repeating values stored as (value, count) pairs — 10–100× compression |
| Bit-packing | Small integers, enum codes, boolean flags | Store values at their minimum bit width — 2–8× compression |
| Dictionary | String columns with ≤ 256 distinct values | Replace strings with 1-byte codes — 4–32× compression |
| Delta + RLE | Timestamps, monotonic counters | Store deltas between adjacent values, then RLE — 5–20× compression |
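The encodings in the table above can be illustrated with a minimal Python sketch. The function names are hypothetical and the code says nothing about Absolute DB's actual on-disk byte layout; it only shows why each technique shrinks the kinds of columns listed.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


def dictionary_encode(strings):
    """Replace each string with a 1-byte code (at most 256 distinct values)."""
    codes = {}
    encoded = [codes.setdefault(s, len(codes)) for s in strings]
    if len(codes) > 256:
        raise ValueError("more than 256 distinct values; 1-byte codes do not fit")
    return list(codes), bytes(encoded)


def bit_width(values):
    """Minimum number of bits needed to store every non-negative value."""
    return max(1, max(values).bit_length())


# Low-cardinality column: three runs instead of ten values.
status = ["ok"] * 5 + ["fail"] * 2 + ["ok"] * 3
print(rle_encode(status))

# Regularly spaced timestamps: deltas repeat, so RLE collapses them.
timestamps = [1000, 1010, 1020, 1030, 1040]
print(rle_encode(delta_encode(timestamps)))

# Enum-like integers 0..5 fit in 3 bits instead of 32 or 64.
print(bit_width([0, 3, 5, 1]))
```

Delta + RLE works well only when the spacing between values is regular; irregular timestamps produce distinct deltas and RLE gains little.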
Zone maps (stored in the page header) allow the query planner to skip entire pages when the query predicate falls outside the page's min/max range. For example, a query with WHERE ts > '2026-04-01' skips all pages whose max timestamp is before that date — without reading a single row from those pages.
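The page-skipping decision described above reduces to a comparison against each page's stored min/max. The sketch below assumes a hypothetical `PageSummary` record holding those two values; the real page header layout is not documented here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageSummary:
    """Hypothetical stand-in for the zone map kept in a page header."""
    page_id: int
    min_ts: date
    max_ts: date


def pages_to_scan(pages, lower_bound):
    """Return ids of pages that may hold rows with ts > lower_bound.

    A page whose max_ts is not above the bound cannot contain a match,
    so it is skipped without reading any of its rows.
    """
    return [p.page_id for p in pages if p.max_ts > lower_bound]


pages = [
    PageSummary(0, date(2026, 1, 1), date(2026, 2, 14)),
    PageSummary(1, date(2026, 2, 15), date(2026, 3, 31)),
    PageSummary(2, date(2026, 3, 30), date(2026, 4, 20)),
    PageSummary(3, date(2026, 4, 21), date(2026, 5, 2)),
]

# WHERE ts > '2026-04-01': pages 0 and 1 are skipped entirely.
print(pages_to_scan(pages, date(2026, 4, 1)))
```

Zone maps are most effective when data is roughly ordered by the filtered column; if every page spans the full value range, no page can be skipped.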
The LSM-Tree (Log-Structured Merge-Tree) backend is optimised for write-heavy workloads. Writes go to an in-memory MemTable first, which is periodically flushed to immutable disk files called SSTables (Sorted String Tables). Background compaction merges SSTables into larger levels.
| Component | Description |
|---|---|
| MemTable | In-memory write buffer (sorted by key). Flushed to L0 when full (default 64 MB). |
| L0 SSTables | Freshly flushed MemTables. May overlap in key range. Compacted to L1. |
| L1–LN SSTables | Leveled compaction: each level is 10× larger than the previous one. No key overlap within a level. |
| Bloom filter | Per-SSTable bloom filter eliminates disk reads for missing keys. |
LSM is ideal for time-series data, sensor readings, and audit logs where write throughput is the priority. Point lookup performance is slightly lower than B+Tree because multiple SSTable levels may need to be checked.
-- Create a table using the LSM backend
CREATE TABLE sensor_readings (
device_id TEXT,
ts TIMESTAMP,
value DOUBLE PRECISION
) USING LSM;
-- Force manual compaction (normally automatic)
CALL absdb_lsm_compact('sensor_readings');
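The LSM write and read paths described above can be sketched in a few dozen lines of Python. This is a toy model under stated assumptions: the class names are hypothetical, the MemTable limit counts entries rather than bytes, SSTables live in memory, and compaction is omitted.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable, key-sorted run with its own Bloom filter."""

    def __init__(self, items):
        self.rows = dict(sorted(items.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # skip the (simulated) disk read
        return self.rows.get(key)


class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # newest first, like L0

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the MemTable into a new immutable SSTable.
            self.sstables.insert(0, SSTable(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:  # newest version wins
            value = table.get(key)
            if value is not None:
                return value
        return None


db = TinyLSM(memtable_limit=2)
db.put("dev-1", 20.5)
db.put("dev-2", 19.0)   # triggers a flush
db.put("dev-1", 21.0)   # newer value shadows the flushed one
print(db.get("dev-1"), db.get("dev-2"), db.get("dev-9"))
```

The read path shows why point lookups can cost more than in a B+Tree: a key that is not in the MemTable may require probing several SSTables, and the Bloom filter exists to make most of those probes free.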
WAL records are variable-length. Each record begins with a fixed header followed by the payload. The header includes the LSN, transaction ID, record type, payload length, and a CRC-32C checksum. Group commit batches up to 64 records per fsync.
| Record Type | Contents |
|---|---|
| INSERT | Table OID, page ID, slot number, new row data |
| UPDATE | Table OID, old page/slot, new page/slot, new row data |
| DELETE | Table OID, page ID, slot number |
| CHECKPOINT | Snapshot of all active transaction IDs, buffer pool dirty list |
| COMMIT | Transaction ID, commit timestamp |
| ROLLBACK | Transaction ID, undo chain pointer |
| DDL | Schema change descriptor (CREATE/ALTER/DROP) |
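To make the header-plus-payload framing concrete, the sketch below encodes and verifies one record. The field widths, field order, and numeric type codes are assumptions chosen for illustration, not Absolute DB's documented wire format; only the list of header fields and the use of CRC-32C come from the description above.

```python
import struct

# Assumed layout: LSN (u64), transaction ID (u64), type (u8), payload length (u32),
# followed by a CRC-32C (u32) and then the payload.
_PREFIX = struct.Struct("<QQBI")
_CRC = struct.Struct("<I")

# Hypothetical numeric codes for a few record types.
RECORD_TYPES = {"INSERT": 1, "UPDATE": 2, "DELETE": 3, "COMMIT": 5}


def crc32c(data):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def encode_record(lsn, txid, rtype, payload):
    """Frame a payload; the checksum covers the header fields and the payload."""
    prefix = _PREFIX.pack(lsn, txid, rtype, len(payload))
    return prefix + _CRC.pack(crc32c(prefix + payload)) + payload


def decode_record(buf, offset=0):
    """Return ((lsn, txid, rtype, payload), next_offset) or raise on corruption."""
    lsn, txid, rtype, length = _PREFIX.unpack_from(buf, offset)
    (stored_crc,) = _CRC.unpack_from(buf, offset + _PREFIX.size)
    start = offset + _PREFIX.size + _CRC.size
    payload = bytes(buf[start:start + length])
    prefix = bytes(buf[offset:offset + _PREFIX.size])
    if len(payload) != length or crc32c(prefix + payload) != stored_crc:
        raise ValueError("torn or corrupt WAL record")
    return (lsn, txid, rtype, payload), start + length


record = encode_record(lsn=42, txid=7, rtype=RECORD_TYPES["INSERT"], payload=b"row-bytes")
decoded, next_offset = decode_record(record)
print(decoded, next_offset)
```

Because each record carries its own length and checksum, a reader can walk the log record by record and stop at the first one that fails verification, which is how a partially written tail is usually detected during recovery.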
Absolute DB supports PostgreSQL-compatible binary COPY format for bulk data load and export. Binary COPY is approximately 3× faster than text COPY because it eliminates text parsing and format conversion overhead.
-- Bulk import from binary COPY file
COPY orders FROM '/data/orders-dump.bin' WITH (FORMAT binary);
-- Bulk export to binary COPY file
COPY (SELECT * FROM orders WHERE created_at > '2026-01-01')
TO '/tmp/orders-export.bin' WITH (FORMAT binary);
-- Load a client-side file through psql (\copy is a psql meta-command) or any PG-compatible client
\copy orders FROM 'orders.bin' WITH (FORMAT binary)
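The binary format is compatible with PostgreSQL's binary COPY protocol, so files written by PostgreSQL's COPY ... TO ... WITH (FORMAT binary), or by any client or tool that emits the same format, load directly into Absolute DB without conversion. pg_dump archives are not binary COPY files (pg_dump writes table data as text-format COPY), so restore those through their usual tooling instead.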
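For readers who generate load files themselves, here is a minimal sketch of a writer for PostgreSQL's binary COPY file layout: an 11-byte signature, a flags word, a header-extension length, one length-prefixed tuple per row, and a 16-bit -1 trailer, all in network byte order. The helper names `pg_bigint` and `pg_text` are hypothetical, and only two column types are covered.

```python
import struct

SIGNATURE = b"PGCOPY\n\xff\r\n\x00"


def pg_bigint(value):
    """BIGINT in binary COPY: 8 bytes, signed, big-endian."""
    return struct.pack("!q", value)


def pg_text(value):
    """TEXT in binary COPY: the raw encoded bytes (UTF-8 assumed here)."""
    return value.encode("utf-8")


def encode_copy_binary(rows):
    """Serialise rows of already-encoded fields; None marks SQL NULL."""
    out = bytearray(SIGNATURE)
    out += struct.pack("!ii", 0, 0)            # flags, header extension length
    for row in rows:
        out += struct.pack("!h", len(row))     # field count for this tuple
        for field in row:
            if field is None:
                out += struct.pack("!i", -1)   # NULL: length -1, no data
            else:
                out += struct.pack("!i", len(field)) + field
    out += struct.pack("!h", -1)               # file trailer
    return bytes(out)


data = encode_copy_binary([
    (pg_bigint(1), pg_text("widget")),
    (pg_bigint(2), None),
])
print(len(data), data[:11])
```

The field encodings must match the target column types exactly, since binary COPY performs no text parsing or implicit conversion; that strictness is also where the speed advantage over text COPY comes from.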
Absolute DB exports query results in Apache Arrow IPC format — a columnar, zero-copy memory layout widely used by analytics frameworks (pandas, DuckDB, Polars, Spark, BigQuery Storage API). No libarrow dependency is needed — the IPC format is generated natively.
-- Export a query result as Arrow IPC (file format)
COPY (SELECT * FROM metrics WHERE ts > '2026-01-01')
TO '/tmp/metrics.arrow' WITH (FORMAT arrow);
-- Stream Arrow IPC over HTTP (REST endpoint)
curl http://localhost:8080/api/query \
-H 'Accept: application/vnd.apache.arrow.file' \
-H 'Content-Type: application/json' \
-d '{"sql": "SELECT * FROM metrics LIMIT 1000000"}' \
--output /tmp/metrics.arrow
import pyarrow.ipc as ipc

# Read the Arrow IPC file produced by Absolute DB
with open('/tmp/metrics.arrow', 'rb') as f:
    reader = ipc.open_file(f)
    table = reader.read_all()

# to_pandas() requires pandas to be installed
df = table.to_pandas()
print(df.describe())
Arrow IPC output is aligned to 64-byte boundaries, enabling zero-copy reads on systems that support memory-mapped files. Record batches are self-describing — consumers do not need the database schema separately.
Parquet is a widely adopted columnar file format for data lakes and analytics platforms. Absolute DB reads and writes Parquet files natively — no libparquet or external library is required.
| Parquet Feature | Absolute DB Support |
|---|---|
| Column encodings | PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY, BIT_PACKED |
| Compression codecs | LZ4 (fast), Zstd (high ratio) |
| Row groups | Configurable size (default 128 MB) |
| Column statistics | min/max/null_count per column per row group |
| Schema | Derived from SQL table definition; type mapping documented in API reference |
-- Export to Parquet (local file)
COPY (SELECT * FROM orders WHERE year = 2026)
TO '/data/orders-2026.parquet' WITH (FORMAT parquet, COMPRESSION zstd);
-- Export to Parquet on S3
COPY (SELECT * FROM orders)
TO 's3://my-bucket/exports/orders.parquet'
WITH (FORMAT parquet, COMPRESSION lz4);
-- Import from Parquet
COPY orders_archive FROM '/data/orders-2025.parquet'
WITH (FORMAT parquet);
Parquet files produced by Absolute DB are compatible with Apache Spark, DuckDB, Polars, Snowflake external tables, BigQuery, AWS Athena, and any other Parquet-compatible tool. The row group statistics are populated so predicate pushdown works in downstream tools.
Page size is configurable per table at creation time. The choice affects I/O efficiency for different access patterns:
| Page Size | Best For | Trade-off |
|---|---|---|
| 4 KB (default) | OLTP, mixed workloads, random access | Lowest overhead per point lookup |
| 64 KB | Analytics, PAX columnar, time-series | More data per I/O — better scan throughput |
| 2 MB | Bulk ingest, archival, write-intensive | Maximum sequential write throughput; higher memory per page in pool |
Page size cannot be changed after table creation. If you need to change page size, export the data (using COPY or Parquet), drop and recreate the table with the new page size, and reload.
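The trade-off in the table above is easy to quantify with back-of-the-envelope arithmetic. The sketch below uses an illustrative 1 GiB table and a 256 MiB buffer pool (both figures are assumptions) and ignores per-page header overhead.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

PAGE_SIZES = {"4 KB": 4096, "64 KB": 65536, "2 MB": 2097152}


def pages_for_scan(table_bytes, page_size):
    """Pages that must be read to scan the whole table (ceiling division)."""
    return -(-table_bytes // page_size)


def pages_in_pool(pool_bytes, page_size):
    """How many pages of this size fit in the buffer pool at once."""
    return pool_bytes // page_size


for label, size in PAGE_SIZES.items():
    reads = pages_for_scan(1 * GIB, size)
    cached = pages_in_pool(256 * MIB, size)
    print(f"{label:>5}: {reads:>7} page reads per full scan, {cached:>6} pages fit in pool")
```

Larger pages cut the number of I/O requests per scan by orders of magnitude, but each cached page pins more memory, so a point lookup that needs one row drags far more unrelated data into the pool.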
| Algorithm | Where Used | Ratio | Speed |
|---|---|---|---|
| RLE | PAX columnar, time-series chunks | 10–100× | Very fast (CPU only) |
| Bit-packing | PAX integer columns | 2–8× | Very fast (SIMD) |
| Dictionary | PAX string columns | 4–32× | Fast |
| Gorilla (XOR of consecutive values) | Time-series float columns | ~10× | Fast |
| LZ4 | WAL archive, Parquet, object storage backup | 2–4× | Very fast (> 500 MB/s) |
| Zstd | WAL archive, Parquet, backup compression | 3–7× | Fast (200–400 MB/s) |
All compression is transparent to SQL queries — compressed data is decompressed automatically during reads. Compression is applied at page or chunk granularity so the query engine can decompress only the pages it needs.
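Page-granular compression can be sketched as follows. Python's standard library has no LZ4 or Zstd codec, so `zlib` is swapped in purely as a stand-in; the function names and the 4 KB page size are illustrative assumptions.

```python
import zlib

PAGE_SIZE = 4096


def compress_pages(data, page_size=PAGE_SIZE):
    """Compress each page independently so pages can be read in isolation."""
    return [
        zlib.compress(data[start:start + page_size])
        for start in range(0, len(data), page_size)
    ]


def read_page(compressed_pages, page_no):
    """Decompress only the requested page; the others stay untouched."""
    return zlib.decompress(compressed_pages[page_no])


raw = bytes(range(256)) * 64           # 16 KB of sample data, four pages
stored = compress_pages(raw)

page_2 = read_page(stored, 2)
print(len(stored), len(page_2), sum(len(p) for p in stored))
```

Compressing pages independently gives up some ratio compared with compressing the whole file as one stream, but it keeps random access cheap because a reader never has to decompress data it does not need.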