Blob v2 preparation · Current API direction

Prepare blob layout before the data file is written.

Blob v2 now has a clear boundary for advanced use: bind a LanceBlobSession to one data_file_path, produce a writer-side prepared blob column, then write that column alongside ordinary columns through the existing LanceFileWriter and DataReplacement path.

Core decision

The boundary is not a second fragment writer and not a blob-only commit path. It is a preparation session that owns the data-file-local blob namespace, while the data file and transaction stay on the existing write path.

Public breaks: 0

Namespace: file

Commit: existing

Before and after

The file format does not need a new concept here. The difference is whether users can explicitly prepare writer-side blob values before the data file is encoded.

Before: preprocessing owns everything (coupled)

Logical input only

Struct<data, uri> is the stable entry point. Users describe values, not the underlying layout.

BlobPreprocessor chooses layout

Inline, packed, dedicated, and external values are all converted inside the dataset write path. Blob ids are allocated there as well.

Existing sidecars are awkward

Advanced users cannot directly register a packed blob they already own, even if they also own the replacement data file id.

After: prepare first, write normally (decoupled)

Bind a LanceBlobSession

The session is created from data_file_path and owns the shared data-file-local blob_id allocator.

Prepare writer-side values

Create new packed or dedicated sidecars, or call load_packed / load_dedicated for sidecars already placed under this namespace.

Finish into a prepared array

Write the prepared array and ordinary columns into the same .lance data file, then commit with existing transaction APIs.
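The session's role in these steps can be pictured with a small stand-in model. This is a sketch under stated assumptions, not the LanceBlobSession implementation: the class name BlobSessionModel, the sidecar path layout, and the choice to start ids at 1 are all hypothetical (the last one only follows from blob_id == 0 being rejected during validation).

```python
from pathlib import PurePosixPath


class BlobSessionModel:
    """Hypothetical stand-in for a session bound to one data file path."""

    def __init__(self, data_file_path: str) -> None:
        path = PurePosixPath(data_file_path)
        if path.suffix != ".lance":
            raise ValueError(f"expected a .lance data file path, got {data_file_path!r}")
        self.data_dir = path.parent      # directory that holds the data file
        self.data_file_key = path.stem   # file name without the ".lance" suffix
        self._next_blob_id = 1           # assumption: 0 is reserved as invalid

    def allocate_blob_id(self) -> int:
        """Return the next id; all writers opened on this session share the counter."""
        blob_id = self._next_blob_id
        self._next_blob_id += 1
        return blob_id

    def sidecar_path(self, blob_id: int) -> PurePosixPath:
        """Assumed layout (illustrative only): <data_dir>/<data_file_key>/<blob_id>.blob."""
        if blob_id <= 0:
            raise ValueError("blob_id must be positive")
        return self.data_dir / self.data_file_key / f"{blob_id}.blob"


session = BlobSessionModel("my_dataset/data/abc123.lance")
first = session.allocate_blob_id()
second = session.allocate_blob_id()
```

Because the counter lives on the session rather than on each column writer, two blob columns written into the same data file cannot collide on a blob id.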

Data flow

The key change is recognizing three distinct schema surfaces. Logical input keeps the old behavior. Prepared input goes directly to the encoder. Footer descriptors remain the persisted reader view.

Logical input

Struct<data, uri> remains the regular-user API. No blob id or sidecar knowledge is required.

BlobPreprocessor

Only logical schemas trigger placement policy. Internally it should reuse the same blob writer primitives as the advanced API.

Prepared input

Struct<kind, data, uri, blob_id, blob_size, position> is validated exactly and passed to the file encoder.

LanceFileWriter

Still writes one complete data file. The encoder converts writer-side prepared values to the persisted descriptor view.

DataReplacement

DataFile.create reads the footer metadata, then the existing transaction path swaps the replacement file into the fragment.
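The routing between the logical and prepared surfaces can be sketched as an exact match on the struct's child names. The function names are hypothetical, the check ignores child types, and treating the match as order-sensitive is an assumption; only the two field lists come from the text above.

```python
# Child names of the two writer-facing schema surfaces described above.
LOGICAL_FIELDS = ("data", "uri")
PREPARED_FIELDS = ("kind", "data", "uri", "blob_id", "blob_size", "position")


def classify_blob_struct(child_names):
    """Return "logical" or "prepared" for an exact match; reject anything else."""
    names = tuple(child_names)
    if names == LOGICAL_FIELDS:
        return "logical"
    if names == PREPARED_FIELDS:
        return "prepared"
    raise ValueError(f"blob struct is neither exact logical nor exact prepared: {names}")


def route(child_names):
    """Hypothetical routing: only logical input goes through placement policy."""
    surface = classify_blob_struct(child_names)
    return "BlobPreprocessor" if surface == "logical" else "encoder"
```

A struct that carries only some of the prepared children, such as (data, uri, blob_id), is rejected rather than guessed at, so no marker or mode flag is needed to tell the two surfaces apart.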

API shape

The public API should expose only objects with stable contracts. There is no open_fragment_writer and no dataset-bound second write path.

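Geneva-style replacement (Python)

import uuid

import lance
import pyarrow as pa
from lance.file import LanceFileWriter

# Supplied by the caller: dataset, dataset_uri (path-like, supports "/"),
# fragment_id, id_field, id_array, image_0, image_1, existing_blob_id.
file_id = str(uuid.uuid4())
data_file_path = dataset_uri / "data" / f"{file_id}.lance"
data_file_name = f"{file_id}.lance"

# Bind the session to the data file, then open a writer for the blob column.
session = lance.LanceBlobSession(str(data_file_path))
images = session.open_writer("image")

# New packed sidecar: finish() returns prepared values in write order.
packed = images.new_packed()
packed.write_blob(image_0)
packed.write_blob(image_1)
images.extend(packed.finish())

# Existing packed sidecar already placed under this data file's namespace.
images.extend(images.load_packed(
    existing_blob_id,
    offsets=[0, 4096],
    sizes=[4096, 8192],
))

# Finish the prepared column and write it with the ordinary columns.
image_array = images.finish()
physical_schema = pa.schema([id_field, images.field])
batch = pa.record_batch([id_array, image_array],
                        schema=physical_schema)

with LanceFileWriter(str(data_file_path),
                     schema=physical_schema,
                     version="2.2") as writer:
    writer.write_batch(batch)

# Commit through the existing DataReplacement transaction path.
data_file = lance.fragment.DataFile.create(dataset, data_file_name)
operation = lance.LanceOperation.DataReplacement([
    lance.LanceOperation.DataReplacementGroup(fragment_id, data_file)
])
dataset = lance.LanceDataset.commit(dataset, operation,
                                    read_version=dataset.version)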

session (scope): LanceBlobSession(data_file_path) derives data_dir, data_file_key, and the sidecar path namespace.

new_packed (new file): Create a Lance-owned packed sidecar and return a PackedBlobWriter. finish() returns prepared values in write order.

new_dedicated (new file): Create one dedicated sidecar. Bytes may be appended in multiple writes; finish() returns one prepared value.

load_packed (existing): Load an existing packed sidecar under the current data file key, validate offset/size ranges, and return values.

load_dedicated (existing): Load an existing dedicated sidecar and use the object size as the descriptor size.
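To make "prepared values in write order" concrete, here is a toy in-memory model of a packed writer. It is not the PackedBlobWriter implementation: the class names PackedWriterModel and PreparedPackedValue are hypothetical, and the sidecar is just a bytearray. The position and blob_size names are borrowed from the prepared struct fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreparedPackedValue:
    """Hypothetical prepared value: where one blob lives inside a packed sidecar."""
    blob_id: int
    position: int   # byte offset of the blob inside the sidecar
    blob_size: int  # byte length of the blob


class PackedWriterModel:
    """In-memory stand-in for a packed sidecar writer (illustration only)."""

    def __init__(self, blob_id: int) -> None:
        self.blob_id = blob_id
        self._buffer = bytearray()
        self._values = []

    def write_blob(self, payload: bytes) -> None:
        # Record the offset before appending, so each value points at its own bytes.
        self._values.append(
            PreparedPackedValue(self.blob_id, len(self._buffer), len(payload))
        )
        self._buffer.extend(payload)

    def finish(self):
        """Return one prepared value per write_blob call, in write order."""
        return list(self._values)


writer = PackedWriterModel(blob_id=1)
writer.write_blob(b"a" * 4096)
writer.write_blob(b"b" * 8192)
values = writer.finish()
```

The two values come back with positions 0 and 4096 and sizes 4096 and 8192, the same offset/size shape that load_packed takes for a sidecar that already exists.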

Compatibility

Compatibility comes from exact schema inference and narrow guardrails around the advanced path. The ordinary logical API remains the default.

Existing users (unchanged): Keep using BlobArrayBuilder / blob_field. Logical blob input behaves as before.

Advanced users (new): Can construct prepared blob columns through LanceBlobWriter, or pass exact prepared structs when they take responsibility for sidecars.

Schema inference (exact): No marker and no public BlobInputMode. Only exact logical or exact prepared input is accepted.

Dataset schema (stable): On create/overwrite, prepared writer-side fields are normalized back to the logical blob field so physical structure does not leak into the dataset schema.

DataFile.create (guarded): Must keep recursive schema validation for normal columns. The only special case is the blob footer descriptor view, whose child field ids are not public API.

Validation (boundary): blob_id == 0, range overflow, and out-of-object ranges are rejected before prepared values reach the commit path.
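The three rejection rules under Validation can be sketched as a single check over a packed sidecar's ranges. The function name validate_packed_ranges is hypothetical, and treating offsets and sizes as unsigned 64-bit values is an assumption; Python integers do not overflow, so the bound is checked explicitly.

```python
U64_MAX = 2**64 - 1  # assumption: offsets and sizes are stored as u64


def validate_packed_ranges(blob_id, offsets, sizes, object_size):
    """Raise ValueError if any range is unusable; return None when all are valid."""
    if blob_id == 0:
        raise ValueError("blob_id 0 is not a valid blob id")
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have the same length")
    for offset, size in zip(offsets, sizes):
        if offset < 0 or size < 0:
            raise ValueError("offset and size must be non-negative")
        end = offset + size
        if end > U64_MAX:
            raise ValueError(f"range overflow: {offset} + {size} exceeds u64")
        if end > object_size:
            raise ValueError(
                f"range [{offset}, {end}) extends past the end of a "
                f"{object_size}-byte object"
            )
```

With the ranges from the replacement example (offsets [0, 4096], sizes [4096, 8192]), the sidecar object must be at least 12288 bytes long for the check to pass.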

From first principles, this is not a blob side channel. It splits “prepare writer-side blob values” from “write and commit a complete data file.” The old API stays simple; the new API gives advanced users control without changing the file format.