feat(blob_v2): add ref_id deduplication (Plan A — input-only hint) #6600

DanielMao1 wants to merge 1 commit into lance-format:main
Multiple rows with the same positive `ref_id` share one physical blob. `ref_id = 0` or null means no sharing (existing behavior — unchanged).

Plan A design: the on-disk Blob v2 descriptor stays at 5 fields. `ref_id` flows through the write-time pipeline as an in-memory input hint only:

- BlobPreprocessor::preprocess_batch (Packed/Dedicated dedup cache)
- BlobV2StructuralEncoder::maybe_encode (Inline dedup cache)

It is dropped before the descriptor is emitted to disk.

Benefits:

- Zero on-disk format change; full compatibility with existing readers
- User API is new but additive: Blob(data=, ref_id=42) + blob_array(...)
- Dedup works across Inline (≤64KB), Packed (64KB–4MB), Dedicated (>4MB)

Trade-offs vs persisting ref_id in the descriptor:

- No post-write observability (cannot SELECT ref_id, COUNT(*) GROUP BY)
- No compaction hint (future compactors must rehash or fall back)

These can be added later as a separate opt-in feature without breaking the format.

Verification (test_ref_id_dedup.py, 20 rows × 1 ref_id):

- inline_32kb: 1.05x amplification
- packed_1mb: 1.00x amplification (1 sidecar file)
- dedicated_6mb: 1.00x amplification (1 sidecar file instead of 20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(blob_v2): add ref_id deduplication across Inline / Packed / Dedicated
Zero on-disk format change. `ref_id` is a write-time input hint only — consumed by the preprocessor and encoder, then dropped before any byte touches disk. The on-disk Blob v2 descriptor remains exactly the 5 fields it has today.
Motivation
Blob v2 today assumes 1 row = 1 blob. Every row owns its own bytes. There
is no API, no descriptor field, no internal cache to say "these rows reuse
that row's blob."
But real multimodal / time-series workloads routinely align rows at different logical frequencies into a single table, where many rows reference the same underlying object — e.g. 8 label rows per video GOP. Without format-level dedup, every such row must carry its own full copy of the bytes, with the write amplification quantified below.
Concrete impact
Today — 20 rows carrying the same 6 MB payload produce 20 independent
sidecar files totaling 120 MB. The 120 MB is purely duplicate bytes.
With this PR — the same 20 rows produce 1 sidecar file of 6 MB. The 20 rows' descriptors all share one `blob_id`; read-back returns byte-identical payloads per row.
Verified savings (`python/test_ref_id_dedup.py`, 20 rows, single `ref_id`):

| Case | Write amplification |
| --- | --- |
| inline_32kb | 1.05× |
| packed_1mb | 1.00× (1 sidecar file) |
| dedicated_6mb | 1.00× (1 sidecar file instead of 20) |

For real training workloads where 8 labels reference 1 GOP (video column), expected savings on the video column are ~8×, which typically dominates dataset size.
User-facing API
The only user-visible change is one new optional field on `Blob`: `ref_id`. `ref_id = None` or `0` (the default) means "no sharing" — identical to pre-PR behavior. Existing code without `ref_id` is unaffected. On disk, the 20 rows share a single sidecar file.
Read-back is invisible to readers — every row sees the full 6 MB payload;
only the disk layout differs.
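A minimal end-to-end sketch of the new API, assuming `Blob` and `blob_array` are importable from the top-level `lance` module (the exact import path and `blob_array` signature are assumptions for illustration):

```python
import lance
import pyarrow as pa
from lance import Blob, blob_array  # assumed import path

gop = open("clip_000.h264", "rb").read()  # one 6 MB GOP

# 20 label rows carry the same payload; the shared positive ref_id tells
# the writer they may share one physical blob.
blobs = [Blob(data=gop, ref_id=42) for _ in range(20)]

table = pa.table({
    "label": pa.array(range(20)),
    "video": blob_array(blobs),
})
lance.write_dataset(table, "/tmp/refid_demo")
```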
Design: where dedup happens
The PR hooks into two existing layers; it does not add a new pipeline
stage. Each hook consults one in-memory cache keyed by `ref_id`, then updates it. The cache lives for exactly one fragment's write and is dropped at finalization.
```mermaid
flowchart TD
    P["<b>Python</b><br/>20 × Blob(data=bytes, ref_id=42)<br/>lance.write_dataset(batch, ...)"]
    P -->|PyO3| R
    R["<b>Rust orchestration</b><br/>Dataset::write → InsertBuilder::execute_stream<br/>→ write_fragments_internal <i>(gate: version ≥ 2.2)</i><br/>→ do_write_fragments"]
    R --> V
    V["V2WriterAdapter::write<br/><i>one BlobPreprocessor per fragment</i>"]
    V --> H1
    H1(["<b>Hook 1 · BlobPreprocessor::preprocess_batch</b><br/>5-field Struct ──▶ 7-field Struct<br/>Packed / Dedicated dedup<br/><i>via ref_id_sidecar_cache</i>"])
    H1 --> F
    F["FileWriter::write_batch<br/>encode_batch — per-column dispatch"]
    F --> H2
    H2(["<b>Hook 2 · BlobV2StructuralEncoder::maybe_encode</b><br/>7-field Struct ──▶ 5-field descriptor<br/>Inline dedup<br/><i>via ref_dedup_tmp_map</i>"])
    H2 --> D
    D[("<b>On disk</b><br/>kind · position · size · blob_id · blob_uri<br/><i>ref_id not persisted</i>")]

    classDef hook fill:#fff3e0,stroke:#e65100,stroke-width:2.5px,color:#3e2723
    classDef normal fill:#fafafa,stroke:#9e9e9e,color:#212121
    classDef disk fill:#e0f2f1,stroke:#00695c,stroke-width:2px,color:#004d40
    class H1,H2 hook
    class P,R,V,F normal
    class D disk
```

The two amber pill nodes are the only new decision points.
`ref_id` enters the pipeline with the user's input StructArray, flows through the in-memory 7-field intermediate struct between preprocessor and encoder, and exits at Hook 2 when the 5-field descriptor is constructed for disk.
Where the dedup decision happens
Hook 1 — `BlobPreprocessor::preprocess_batch` (`rust/lance/src/dataset/blob.rs`)

Owns a per-fragment cache covering the two preprocessor-written paths (Packed and Dedicated). The row-level loop consults the cache before routing by size; a sketch of the decision follows.
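A minimal Python model of the per-row decision, under stated assumptions: the real code is Rust, `write_sidecar` is a hypothetical stand-in for the Packed/Dedicated write path, and only the cache name comes from this PR.

```python
# Model of Hook 1 (real code: Rust, BlobPreprocessor::preprocess_batch).
ref_id_sidecar_cache: dict[int, tuple[int, int]] = {}  # ref_id -> (blob_id, size)
_next_blob_id = 0

def write_sidecar(payload: bytes) -> int:
    """Hypothetical stand-in for the Packed (64KB-4MB) / Dedicated (>4MB) write."""
    global _next_blob_id
    _next_blob_id += 1
    return _next_blob_id

def preprocess_row(ref_id: int, payload: bytes) -> tuple[int, int]:
    # ref_id <= 0 (or null) means "no sharing": always write, never cache.
    if ref_id > 0 and ref_id in ref_id_sidecar_cache:
        return ref_id_sidecar_cache[ref_id]  # hit: reuse (blob_id, size), no I/O
    entry = (write_sidecar(payload), len(payload))  # miss: write exactly once
    if ref_id > 0:
        ref_id_sidecar_cache[ref_id] = entry
    return entry
```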
For the example (20 rows, ref_id=42, 6 MB each): row 0 misses and writes one Dedicated sidecar file; rows 1..19 hit and reuse the cached `(blob_id, size)` without any additional I/O.
Hook 2 — `BlobV2StructuralEncoder::maybe_encode` (`rust/lance-encoding/src/encodings/logical/blob.rs`)

Owns a symmetric cache for Inline, because only the encoder knows the out-of-line buffer offset; a matching sketch follows.
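The same shape as Hook 1, keyed to the buffer offset instead of a blob id. Again a hedged Python model, with only the cache name taken from this PR:

```python
# Model of Hook 2 (real code: Rust, BlobV2StructuralEncoder::maybe_encode).
ref_dedup_tmp_map: dict[int, tuple[int, int]] = {}  # ref_id -> (offset, size)

def encode_inline(ref_id: int, payload: bytes, out_buf: bytearray) -> tuple[int, int]:
    if ref_id > 0 and ref_id in ref_dedup_tmp_map:
        return ref_dedup_tmp_map[ref_id]  # later rows point at the first copy
    offset = len(out_buf)
    out_buf.extend(payload)               # first occurrence appends the bytes
    entry = (offset, len(payload))
    if ref_id > 0:
        ref_dedup_tmp_map[ref_id] = entry
    return entry
```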
The two caches partition the problem cleanly:

| Cache | Owner | Covers |
| --- | --- | --- |
| `ref_id_sidecar_cache` | `BlobPreprocessor` | Packed / Dedicated |
| `ref_dedup_tmp_map` | `BlobV2StructuralEncoder` | Inline |

What lands on disk
`BLOB_V2_DESC_FIELDS` is unchanged from upstream: the encoder constructs the descriptor StructArray with exactly 5 children; `ref_id` is read in Hook 2 for cache lookup but intentionally not appended to the children vector.
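For illustration, the 5 children as a PyArrow struct. Field names come from the flowchart above, but the concrete types are assumptions; the authoritative definition is the Rust `BLOB_V2_DESC_FIELDS`.

```python
import pyarrow as pa

# Types are illustrative assumptions, not the Rust BLOB_V2_DESC_FIELDS definition.
blob_v2_descriptor = pa.struct([
    pa.field("kind", pa.uint8()),
    pa.field("position", pa.uint64()),
    pa.field("size", pa.uint64()),
    pa.field("blob_id", pa.uint64()),
    pa.field("blob_uri", pa.string()),
    # note: no ref_id child -- it is consumed at Hook 2 and never written
])
```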
Verified via `LanceFileReader.metadata().schema` on a dataset written by this PR: byte-identical structure to any dataset written by upstream.
Non-intrusiveness
Readers: zero change required
Old readers see 20 rows with the same `blob_id` in the same 5-field descriptor. Each row independently resolves `blob_id=1` to the same sidecar file — 20 reads, 20 correct byte payloads. No coordination needed. No schema version bump.
Compaction / GC: unaffected
A blob's sidecar is garbage-collected only once its owning data file is gone. Shared rows all live in one data file, so deleting any subset leaves the fragment alive; deleting the whole fragment triggers the blob's GC. No changes.
A future compactor rewriting the 20 rows that share `blob_id=1` will either preserve the sharing (if the compactor treats equal `blob_id` as a single blob) or produce 20 independent blob_ids (regressing to upstream behavior). Correctness is preserved either way; only storage efficiency may regress. No compactor change needed for this PR.
Old files work unchanged
The schema validator in `blob_version_from_descriptions` accepts the 5-field V2 descriptor exactly as upstream — this PR doesn't widen or narrow that check. Files written by any prior Lance version continue to work.
Implementation footprint
Production code (excluding the 97-line test script): ~188 lines added.
All changes are additive: two `HashMap` fields, one `SidecarRef` enum, two cache lookup/insert call sites, and the Python `Blob.ref_id` plumbing.

Testing
```
cd python
python test_ref_id_dedup.py
```

Expected output — all three size classes dedup, read-back is byte-identical: inline_32kb at 1.05× amplification, packed_1mb and dedicated_6mb at 1.00× (one sidecar file each).
Explicitly not in scope
- Cross-fragment sharing: the dedup cache lives for one `write_dataset` invocation / one fragment. Cross-fragment / cross-write sharing is natural future work.
- Post-write observability: `SELECT ref_id, COUNT(*) GROUP BY ref_id` is not possible because `ref_id` is not persisted. The discussion thread includes "Option B" (persist `ref_id` in the descriptor) as a candidate for users who need this.
- Read-side caching: every row still issues its own read for the shared blob, regardless of `ref_id`. Lance's `FileScheduler` already merges adjacent reads to the same file, so the penalty is modest. Explicit `ref_id`-aware read caching can be layered on separately.
Built on existing machinery: the Packed write path (`PackWriter` reused) and the data-file-bound GC model (shared rows live in a single data file, so data-file-bound GC remains correct).