Storage format 2.3 · Design sketch

Make sparse nested columns physically sparse.

Rep/def streams describe every structural event in row order. That is correct, but sparse nested data pays for empty rows again and again. The discrete sparse layout keeps logical row coordinates, then stores only the structural facts that lead to real child values.

Bytes: Do not encode empty branches as dense level events.

Scan: Read compact structure and compact value payloads.

Take: Probe structure first, then read only reachable leaf ranges.

Figure: list<int32>, 6 rows, slot domains. Panels: row domain; slot facts ([1,4], [2,1], [2]); leaf values. Legend: empty, structural fact, null, leaf value.

The problem is structural density, not value density.

A sparse nested column can have very few values but a huge logical row domain. If the format still writes structure as a dense event stream, empty rows keep participating in compression, decoding, and random access.

FullZip · dense events

One structural stream covers the full row range. Take can hit a broad compressed page even when selected rows are empty.

MiniBlock · smaller pages

The dense structure is split into smaller units. Random access improves, but empty regions still shape the page plan, and a sparse enough page exceeds the structural budget entirely.

Discrete Sparse · slot domains

Structure is stored as sparse mappings over each parent slot domain. Payload pages only contain reachable leaf values.

The layout is a chain of slot-domain mappings.

The outermost slot domain is the page row domain. Every nested layer maps selected parent slots to child slots. Leaf value buffers live in the final compact value domain.

Rows stay logical

Row ids are not rewritten. The format keeps the page row domain so slices and takes preserve Arrow semantics.

Lists store presence

A list layer records the non-empty parent positions and the child count of each. A valid (non-null) parent position that is absent from the non-empty set denotes an empty list. When every non-empty list has the same length, a constant count set replaces the counts buffer.
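The constant-count shortcut can be sketched as below. This is a minimal illustration, assuming counts arrive as a plain Python list; `normalize_counts`, `ConstantCounts`, and `ExplicitCounts` are hypothetical names, not identifiers from the format.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class ConstantCounts:
    """Every non-empty list has the same length, so no counts buffer is written."""
    value: int
    num_lists: int


@dataclass(frozen=True)
class ExplicitCounts:
    """Lengths differ, so one count per non-empty list is stored in a buffer."""
    values: List[int]


def normalize_counts(counts: List[int]) -> Union[ConstantCounts, ExplicitCounts]:
    # Hypothetical helper: collapse to the constant form when all counts agree.
    # The empty case is left explicit here; this sketch does not model the
    # separate "empty" normalized form.
    if counts and all(c == counts[0] for c in counts):
        return ConstantCounts(value=counts[0], num_lists=len(counts))
    return ExplicitCounts(values=list(counts))


print(normalize_counts([3, 3, 3]))
print(normalize_counts([2, 1]))
```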

Nulls stay local

Nullable layers record null positions in their own parent slot domain, not in a global event stream.

Fixed-size lists are free

A fixed-size-list layer stores only its dimension and null positions. Child ranges are deterministic, so no positions or counts are written.

Leaves are compact

Value buffers contain only visible leaf values and keep mini-block compression for the physical payload.

Encoding `list<int>`.

This is the smallest example that shows the contract: empty lists are represented by absence from sparse structural facts, not by payload gaps.

From rows to sparse structure to leaf values

Input rows are converted into a list structural layer, and the leaf payload is packed independently.

input rows: [], [10, 11], null, [], [12], []

parent row domain: 0 1 2 3 4 5

non_empty: [1, 4]

counts: [2, 1]

nulls: [2]

leaf value domain: [10, 11, 12]
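The conversion from rows to structural facts can be sketched as follows. It assumes non-nullable leaf items and treats the row values 10, 11, and 12 as illustrative; `encode_list_page` is a hypothetical name, not part of the format.

```python
from typing import List, Optional, Tuple


def encode_list_page(
    rows: List[Optional[List[int]]],
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Return (non_empty, counts, nulls, leaf_values) for one page of list rows."""
    non_empty: List[int] = []    # row positions that hold a non-empty list
    counts: List[int] = []       # child count for each non-empty position
    nulls: List[int] = []        # row positions that are null
    leaf_values: List[int] = []  # compact payload, visible values only
    for row_id, row in enumerate(rows):
        if row is None:
            nulls.append(row_id)
        elif row:
            non_empty.append(row_id)
            counts.append(len(row))
            leaf_values.extend(row)
        # A valid empty list records nothing: absence is the encoding.
    return non_empty, counts, nulls, leaf_values


rows = [[], [10, 11], None, [], [12], []]
non_empty, counts, nulls, leaf_values = encode_list_page(rows)
print(non_empty, counts, nulls, leaf_values)
```

Row ids are never rewritten in this sketch: the positions stored in `non_empty` and `nulls` are coordinates in the original six-row domain.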

take [0,1,5]: row 0 stops empty, row 1 reads leaf range 0..2, row 5 stops empty.
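A take over this page might look like the sketch below. The structural facts are hard-coded with illustrative values, leaf ranges are treated as half-open, and `probe` and `take` are hypothetical helper names. Binary search over sorted positions is one plausible lookup strategy, not necessarily the one a real reader uses.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Optional, Tuple

# Structural facts for six rows: [], [10, 11], null, [], [12], [] (illustrative).
NON_EMPTY = [1, 4]
COUNTS = [2, 1]
NULLS = [2]
LEAF_VALUES = [10, 11, 12]

# OFFSETS[i] is where the i-th non-empty list starts in the leaf value domain.
OFFSETS = [0] + list(accumulate(COUNTS))


def _index_of(sorted_positions: List[int], x: int) -> int:
    """Index of x in a sorted position list, or -1 if absent."""
    i = bisect_left(sorted_positions, x)
    return i if i < len(sorted_positions) and sorted_positions[i] == x else -1


def probe(row_id: int) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Classify a row from structure alone; no leaf payload is touched."""
    if _index_of(NULLS, row_id) >= 0:
        return ("null", None)
    i = _index_of(NON_EMPTY, row_id)
    if i < 0:
        return ("empty", None)
    return ("range", (OFFSETS[i], OFFSETS[i + 1]))  # half-open leaf range


def take(row_ids: List[int]) -> List[Optional[List[int]]]:
    out: List[Optional[List[int]]] = []
    for row_id in row_ids:
        kind, leaf_range = probe(row_id)
        if kind == "null":
            out.append(None)
        elif kind == "empty":
            out.append([])
        else:
            start, end = leaf_range
            out.append(LEAF_VALUES[start:end])  # only reachable values are read
    return out


print(take([0, 1, 5]))
```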

Deep nesting uses the same rule recursively.

There is no special global row-index trick. Each layer owns a parent slot domain and emits the child ranges needed by the next layer.

`struct<profile: struct<events: list<struct<score:int32, tags:list<int32>>>>>`

The selected rows either stop at a null or empty layer, or continue downward to compact score and tag payloads. In the file, each leaf column (score, tags) is its own page repeating these shared outer layers — the diagram shows one merged view.

Figure (merged view): row domain → profile validity (null) → events list ([]) → event slot domain (null) → score values, and tags list ([]) → tag values.

take [1,2,5]: row 1 stops at empty events, row 5 stops at profile null. Row 2 reaches event slots 0 and 1 — slot 1 is a null event, slot 0 reads score 91 and tags [3,5].
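The walk for the tags path can be sketched layer by layer. Several things here are assumptions made for illustration: null slots are kept in their slot domain rather than compacted away, rows 0, 3, and 4 are treated as empty, and names such as `ListLayer` and `read_tags` are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Optional


def find(sorted_positions: List[int], x: int) -> int:
    """Index of x in a sorted position list, or -1 if absent."""
    i = bisect_left(sorted_positions, x)
    return i if i < len(sorted_positions) and sorted_positions[i] == x else -1


@dataclass
class ListLayer:
    non_empty: List[int]  # parent slots that hold a non-empty list
    counts: List[int]     # child count per non-empty parent slot

    def __post_init__(self) -> None:
        self.offsets = [0] + list(accumulate(self.counts))

    def child_range(self, parent_slot: int) -> range:
        """Child slots reachable from parent_slot; empty if absent."""
        i = find(self.non_empty, parent_slot)
        if i < 0:
            return range(0)
        return range(self.offsets[i], self.offsets[i + 1])


# Illustrative page: row 5 has a null profile, row 2 has two events,
# event slot 1 is a null event, event slot 0 carries tags [3, 5].
PROFILE_NULLS = [5]
EVENTS = ListLayer(non_empty=[2], counts=[2])
EVENT_NULLS = [1]
TAGS = ListLayer(non_empty=[0], counts=[2])
TAG_VALUES = [3, 5]


def read_tags(row: int) -> Optional[List[Optional[List[int]]]]:
    """Resolve one row down the tags path, stopping at null or empty layers."""
    if find(PROFILE_NULLS, row) >= 0:
        return None  # stops at profile validity
    events: List[Optional[List[int]]] = []
    for event_slot in EVENTS.child_range(row):
        if find(EVENT_NULLS, event_slot) >= 0:
            events.append(None)  # stops at a null event
            continue
        events.append([TAG_VALUES[t] for t in TAGS.child_range(event_slot)])
    return events


print([read_tags(r) for r in (1, 2, 5)])
```

Each layer only answers one question, which child slots a parent slot reaches, so adding another level of nesting adds one more lookup of the same shape.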

Why this improves sparse workloads.

The important change is not just fewer pages. It is the ability to decide whether a selected path exists before reading and decoding value payloads.

Smaller bytes

Empty rows do not produce dense rep/def events. Sparse positions and counts scale with non-empty structure, not with total row count.

cost ~= positions + counts + nulls + leaf_values
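A rough worked comparison, counting entries before any compression. The dense model assumes one level entry per leaf value plus one per empty or null row, which is how rep/def streams generally behave; the page numbers are made up for illustration, and both function names are hypothetical.

```python
def dense_entry_count(num_rows: int, non_empty_rows: int, leaf_values: int) -> int:
    """Level entries plus leaf values for a dense rep/def stream (pre-compression)."""
    # Every empty or null row contributes one level entry; every value contributes one.
    level_entries = (num_rows - non_empty_rows) + leaf_values
    return level_entries + leaf_values


def sparse_entry_count(non_empty_rows: int, null_rows: int, leaf_values: int) -> int:
    """positions + counts + nulls + leaf_values, as in the formula above."""
    positions = non_empty_rows
    counts = non_empty_rows
    nulls = null_rows
    return positions + counts + nulls + leaf_values


# Illustrative page: 1,000,000 rows, 100 non-empty rows, 50 null rows, 500 values.
print(dense_entry_count(1_000_000, 100, 500))
print(sparse_entry_count(100, 50, 500))
```

Long runs of identical levels compress well, so the byte gap is smaller than the entry gap, but the dense stream still has to be decoded across the full row range.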

Faster scans

Sequential reads walk compact structural metadata and compact value buffers. There is less structural data to inflate before rebuilding Arrow arrays.

scan = decode_sparse_layers + decode_leaf_payload

Faster takes

Take first slices slot-domain mappings. Empty and null branches stop in metadata; non-empty branches become exact leaf ranges.

take = probe_structure -> read_reachable_chunks

When the writer picks it.

Sparse is a page layout, not a column-level promise. The writer decides per page; readers identify it only from PageLayout.sparse_layout, never from field metadata.

Explicit request · 2.3+ only

lance-encoding:structural-encoding=sparse

Field metadata forces the sparse layout for the column's pages. A request fails with an error, and never silently falls back, on file versions before 2.3 or when the column has no native structural layers (or uses dictionary / packed-struct blocks).

Automatic escape · 2.3+ auto

repdef_over_budget && has_sparse_plan

With no explicit request, the writer switches to sparse when two conditions hold. First, the dense rep/def stream exceeds the mini-block structural page budget: the page would need splitting, or a single row alone is over budget. Second, the page's structure has a native sparse mapping.

Everything else · default

miniblock | fullzip

When the structural budget is satisfied, the existing mini-block / full-zip selection path is unchanged. Dense data never pays for the sparse machinery.
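The three cases can be summarized as a decision function. This is a sketch under stated assumptions: `choose_layout`, `Layout`, and `SparseRequestError` are hypothetical names, the file version is modeled as a tuple, and `has_sparse_plan` stands in for every structural precondition (native structural layers, no dictionary or packed-struct blocks).

```python
from enum import Enum
from typing import Optional, Tuple


class Layout(Enum):
    SPARSE = "sparse"
    DENSE = "miniblock | fullzip"  # existing selection path, unchanged


class SparseRequestError(ValueError):
    """Raised when an explicit sparse request cannot be honored."""


def choose_layout(
    file_version: Tuple[int, int],
    requested: Optional[str],
    has_sparse_plan: bool,
    repdef_over_budget: bool,
) -> Layout:
    if requested == "sparse":
        # Explicit requests never fall back silently.
        if file_version < (2, 3):
            raise SparseRequestError("sparse layout requires file version 2.3+")
        if not has_sparse_plan:
            raise SparseRequestError("column has no native sparse mapping")
        return Layout.SPARSE
    # Automatic escape: only when the dense structure blows the page budget.
    if file_version >= (2, 3) and repdef_over_budget and has_sparse_plan:
        return Layout.SPARSE
    return Layout.DENSE


print(choose_layout((2, 3), None, has_sparse_plan=True, repdef_over_budget=True))
print(choose_layout((2, 3), None, has_sparse_plan=True, repdef_over_budget=False))
```

The function is called per page, matching the point that sparse is a page layout rather than a column-level promise.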

The format contract.

This layout belongs to file version 2.3 because it changes the physical representation of nested structure. Readers must reject malformed sparse pages with a format error.

buffer 0 · metadata

Value chunk metadata: one 8-byte entry per chunk, holding the packed chunk size and the number of visible values. The sums must match the value buffer size and num_visible_items exactly.

buffer 1 · values

Mini-block encoded value chunks. Leaf payload only, with no rep/def levels inside.

buffer 2+ · structure

Structural buffers, one per explicit position or count set, in SparseStructuralLayer order from outermost to innermost.

What must be true

Positions decode to strictly increasing values inside their parent slot domain.
List counts sum to exactly the next child slot domain size.
Position and count sets are normalized: empty, all-slots, contiguous-range, and constant-count forms live in metadata alone.
Only explicit sets write a structural buffer — exactly one each. Normalized sets carry no buffer and no compression descriptor.
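The first two invariants could be checked by a reader roughly as follows. This is a sketch, assuming positions and counts have already been decoded into Python lists; `FormatError`, `validate_positions`, and `validate_counts` are hypothetical names.

```python
from typing import List


class FormatError(ValueError):
    """Raised when a sparse page violates the format contract."""


def validate_positions(positions: List[int], parent_domain_size: int) -> None:
    """Positions must be strictly increasing and inside the parent slot domain."""
    previous = -1
    for p in positions:
        if p <= previous:
            raise FormatError("positions must be non-negative and strictly increasing")
        if p >= parent_domain_size:
            raise FormatError("position falls outside the parent slot domain")
        previous = p


def validate_counts(counts: List[int], child_domain_size: int) -> None:
    """List counts must sum to exactly the next child slot domain size."""
    if sum(counts) != child_domain_size:
        raise FormatError("counts do not sum to the child slot domain size")


# Illustrative page: 6 parent rows, non-empty rows 1 and 4, three child slots.
validate_positions([1, 4], parent_domain_size=6)
validate_counts([2, 1], child_domain_size=3)
print("page structure is consistent")
```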

Where it wins

Most selected rows are empty or null before reaching leaf values.
Non-empty values are clustered enough to coalesce leaf ranges.
Structural metadata is small enough to cache or read cheaply.
The workload mixes full scans with random sparse takes.
