Storage format 2.3 design sketch
Rep/def streams describe every structural event in row order. That is correct, but sparse nested data pays for empty rows again and again. The discrete sparse layout keeps logical row coordinates, then stores only the structural facts that lead to real child values.
A sparse nested column can have very few values but a huge logical row domain. If the format still writes structure as a dense event stream, empty rows keep participating in compression, decoding, and random access.
FullZip (dense events): One structural stream covers the full row range. A take can hit a broad compressed page even when the selected rows are empty.
The dense structure is split into smaller units. Random access improves, but empty regions still shape the page plan.
Structure is stored as sparse mappings over each parent slot domain. Payload pages only contain reachable leaf values.
The outermost slot domain is the page row domain. Every nested layer maps selected parent slots to child slots. Leaf value buffers live in the final compact value domain.
Row ids are not rewritten. The format keeps the page row domain so slices and takes preserve Arrow semantics.
A list layer records non-empty parent positions and child counts. A valid (non-null) parent position that has no entry in the layer is an empty list.
Nullable layers record null positions in their own parent slot domain, not in a global event stream.
Value buffers contain only visible leaf values and keep mini-block compression for the physical payload.
list<int>
This is the smallest example that shows the contract: empty lists are represented by absence from sparse structural facts, not by payload gaps.
Input rows are converted into a list structural layer, and the leaf payload is packed independently.
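The conversion can be sketched as follows. This is a minimal illustration only: `SparseListLayer`, `encode_list_int`, and `decode_list_int` are hypothetical names, and the in-memory field layout is an assumption, not the on-disk 2.3 encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SparseListLayer:
    """Hypothetical in-memory view of one list structural layer."""
    num_slots: int                                            # size of the parent slot domain (page rows here)
    null_positions: List[int] = field(default_factory=list)  # parent slots holding a null list
    positions: List[int] = field(default_factory=list)       # parent slots holding a non-empty list
    counts: List[int] = field(default_factory=list)          # child count for each entry in positions


def encode_list_int(rows: List[Optional[List[int]]]):
    """Split rows into sparse structural facts and a compact leaf payload."""
    layer = SparseListLayer(num_slots=len(rows))
    values: List[int] = []
    for slot, row in enumerate(rows):
        if row is None:
            layer.null_positions.append(slot)
        elif row:
            layer.positions.append(slot)
            layer.counts.append(len(row))
            values.extend(row)
        # An empty list records nothing: absence is the encoding.
    return layer, values


def decode_list_int(layer: SparseListLayer, values: List[int]):
    """Rebuild the rows; every slot starts as an empty list."""
    rows: List[Optional[List[int]]] = [[] for _ in range(layer.num_slots)]
    for slot in layer.null_positions:
        rows[slot] = None
    offset = 0
    for slot, count in zip(layer.positions, layer.counts):
        rows[slot] = values[offset:offset + count]
        offset += count
    return rows


rows = [[], [1, 2], None, [], [], [3], []]
layer, values = encode_list_int(rows)
print(layer.positions, layer.counts, layer.null_positions, values)
# [1, 5] [2, 1] [2] [1, 2, 3]
```

Seven logical rows produce two position/count pairs, one null position, and three payload values. The four empty rows leave no trace in either the structure or the payload.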
There is no special global row-index trick. Each layer owns a parent slot domain and emits the child ranges needed by the next layer.
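One way to picture this layering, as a sketch under stated assumptions: the `SlotLayer` class and `resolve_row` helper below are hypothetical, nulls are left out for brevity, and the example column is a list<list<int>> chosen only to show two layers chained together.

```python
from bisect import bisect_left
from itertools import accumulate


class SlotLayer:
    """Hypothetical list layer over one parent slot domain (nulls omitted)."""

    def __init__(self, num_slots, positions, counts):
        self.num_slots = num_slots
        self.positions = positions                      # sorted non-empty parent slots
        self.offsets = [0] + list(accumulate(counts))   # prefix sums into the child domain

    def child_range(self, start, end):
        """Map the parent slot range [start, end) to a child slot range."""
        i0 = bisect_left(self.positions, start)
        i1 = bisect_left(self.positions, end)
        return self.offsets[i0], self.offsets[i1]


def resolve_row(layers, values, row):
    """Walk a row down every layer and return its flattened leaf values."""
    start, end = row, row + 1
    for layer in layers:
        start, end = layer.child_range(start, end)
        if start == end:
            return []          # the path stopped at an empty layer
    return values[start:end]


# Logical rows: [[], [[1, 2], []], [], [[3]]]
outer = SlotLayer(num_slots=4, positions=[1, 3], counts=[2, 1])  # rows -> inner list slots
inner = SlotLayer(num_slots=3, positions=[0, 2], counts=[2, 1])  # inner list slots -> values
values = [1, 2, 3]

print(resolve_row([outer, inner], values, 1))  # [1, 2]
print(resolve_row([outer, inner], values, 0))  # []
```

Each layer only knows its own parent domain and the child range it hands to the next layer, so the same `child_range` call works at any depth.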
struct<profile: struct<events: list<struct<score: int32, tags: list<int32>>>>>
The selected rows either stop at a null or empty layer, or continue downward to compact score and tag payloads.
The important change is not just fewer pages. It is the ability to decide whether a selected path exists before reading and decoding value payloads.
Empty rows do not produce dense rep/def events. Sparse positions and counts scale with non-empty structure, not with total row count.
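A rough count makes the scaling concrete. The sketch below assumes a simplified cost model: a dense rep/def stream emits one event per empty row and one per leaf value, while the sparse layer stores one position and one count per non-empty row. The function names and the numbers are illustrative, not measurements.

```python
def dense_event_count(num_rows: int, list_lengths: dict) -> int:
    """Events in a dense stream: one per empty row plus one per leaf value."""
    empty_rows = num_rows - len(list_lengths)
    return empty_rows + sum(list_lengths.values())


def sparse_fact_count(list_lengths: dict) -> int:
    """Sparse facts: one position and one count per non-empty row."""
    return 2 * len(list_lengths)


# Hypothetical page: one million rows, three of which hold two values each.
lengths = {10: 2, 500_000: 2, 999_999: 2}
print(dense_event_count(1_000_000, lengths))  # 1000003
print(sparse_fact_count(lengths))             # 6
```

Under this model the dense count grows with the row domain and the sparse count grows only with the number of non-empty rows.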
Sequential reads walk compact structural metadata and compact value buffers. There is less structural data to inflate before rebuilding Arrow arrays.
A take first slices the slot-domain mappings. Empty and null branches are resolved in metadata alone; non-empty branches become exact leaf ranges.
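A take over a single list layer might be planned like this. The `plan_take` helper and its tuple-based plan are hypothetical, and the layer arrays are assumed to be already loaded and sorted; only the rows tagged "values" would go on to touch payload pages.

```python
from bisect import bisect_left
from itertools import accumulate


def plan_take(rows_to_take, positions, counts, null_positions):
    """Classify each selected row using structural metadata only."""
    offsets = [0] + list(accumulate(counts))   # leaf offsets per non-empty row
    nulls = set(null_positions)
    plan = []
    for row in rows_to_take:
        if row in nulls:
            plan.append((row, "null", None))       # stops in metadata
            continue
        i = bisect_left(positions, row)
        if i < len(positions) and positions[i] == row:
            plan.append((row, "values", (offsets[i], offsets[i + 1])))
        else:
            plan.append((row, "empty", None))      # stops in metadata
    return plan


# Layer for rows [[], [1, 2], None, [], [], [3], []]
plan = plan_take([0, 2, 5], positions=[1, 5], counts=[2, 1], null_positions=[2])
print(plan)
# [(0, 'empty', None), (2, 'null', None), (5, 'values', (2, 3))]
```

Row 5 resolves to the leaf range [2, 3), so a reader needs only that slice of the value buffer; rows 0 and 2 never reach the payload.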
This layout belongs to file version 2.3 because it changes the physical representation of nested structure. Older stable versions keep their existing encodings.