Storage format 2.3 design sketch

Make sparse nested columns physical sparse.

Rep/def streams describe every structural event in row order. That is correct, but sparse nested data pays for empty rows again and again. The discrete sparse layout keeps logical row coordinates, then stores only the structural facts that lead to real child values.

BytesDo not encode empty branches as dense level events.
ScanRead compact structure and compact value payloads.
TakeProbe structure first, then read only reachable leaf ranges.
row domain
0
1
2
3
4
5
slot facts
[1,4]
[2,1]
[2]
leaf values
10
11
12

The problem is structural density, not value density.

A sparse nested column can have very few values but a huge logical row domain. If the format still writes structure as a dense event stream, empty rows keep participating in compression, decoding, and random access.

FullZip

dense events

One structural stream covers the full row range. Take can hit a broad compressed page even when selected rows are empty.

MiniBlock

smaller pages

The dense structure is split into smaller units. Random access improves, but empty regions still shape the page plan.

Discrete Sparse

slot domains

Structure is stored as sparse mappings over each parent slot domain. Payload pages only contain reachable leaf values.

The layout is a chain of slot-domain mappings.

The outermost slot domain is the page row domain. Every nested layer maps selected parent slots to child slots. Leaf value buffers live in the final compact value domain.

Rows stay logical

Row ids are not rewritten. The format keeps the page row domain so slices and takes preserve Arrow semantics.

Lists store presence

A list layer records non-empty parent positions and child counts. Missing valid positions mean empty lists.

Nulls stay local

Nullable layers record null positions in their own parent slot domain, not in a global event stream.

Leaves are compact

Value buffers contain only visible leaf values and keep mini-block compression for the physical payload.

Encoding list<int>.

This is the smallest example that shows the contract: empty lists are represented by absence from sparse structural facts, not by payload gaps.

From rows to sparse structure to leaf values

Input rows are converted into a list structural layer, and the leaf payload is packed independently.

input rows
0
[]
1
[ 1011 ]
2
null
3
[]
4
[ 12 ]
5
[]
leaf value domain
10
11
12
take [0,1,5]: row 0 stops empty, row 1 reads leaf range 0..2, row 5 stops empty.

Deep nesting uses the same rule recursively.

There is no special global row-index trick. Each layer owns a parent slot domain and emits the child ranges needed by the next layer.

struct<profile: struct<events: list<struct<score:int32, tags:list<int32>>>>

The selected rows either stop at a null or empty layer, or continue downward to compact score and tag payloads.

row domain
0
1
2
3
4
5
6
7
profile validity
null
1
1
1
1
null
1
1
events list
-
[]
2
[]
1
-
[]
1
event slots
0
1
2
3
score values
91
-
84
77
tags list
2
-
[]
2
tag values
3
5
8
13
take [1,2,5]: row 1 stops at empty events, row 2 descends to event slots 0..2, row 5 stops at profile null.

Why this improves sparse workloads.

The important change is not just fewer pages. It is the ability to decide whether a selected path exists before reading and decoding value payloads.

Smaller bytes

Empty rows do not produce dense rep/def events. Sparse positions and counts scale with non-empty structure, not with total row count.

cost ~= positions + counts + nulls + leaf_values

Faster scans

Sequential reads walk compact structural metadata and compact value buffers. There is less structural data to inflate before rebuilding Arrow arrays.

scan = decode_sparse_layers + decode_leaf_payload

Faster takes

Take first slices slot-domain mappings. Empty and null branches stop in metadata; non-empty branches become exact leaf ranges.

take = probe_structure -> read_reachable_chunks

The format contract.

This layout belongs to file version 2.3 because it changes the physical representation of nested structure. Older stable versions keep their existing encodings.

What must be true

  • Structural layers are stored outermost to innermost.
  • Positions are sorted and delta-compressed within their parent slot domain.
  • List counts sum to the next child slot domain size.
  • Leaf payload chunks contain no rep/def buffers.

Where it wins

  • Most selected rows are empty or null before reaching leaf values.
  • Non-empty values are clustered enough to coalesce leaf ranges.
  • Structural metadata is small enough to cache or read cheaply.
  • The workload mixes full scans with random sparse takes.