
fix: use OutputBatches metric variant for DF52 compatibility#11

Closed
wjones127 wants to merge 363 commits into rerun-io:main from wjones127:fix/df52-avg-coercion

Conversation

@wjones127

Summary

  • DF52 introduced a dedicated MetricValue::OutputBatches variant. Using the generic Count variant with name "output_batches" causes a panic in aggregate_by_name() due to mismatched enum variants.
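A hedged sketch of the change, assuming DataFusion 52's metrics API; `MetricBuilder::output_batches` is assumed here to mirror the existing `output_rows` helper and may not match the exact call sites in this PR:

```rust
use datafusion::physical_plan::metrics::{ExecutionPlanMetricsSet, MetricBuilder};

fn register_metrics(metrics: &ExecutionPlanMetricsSet, partition: usize) {
    // Before: a generic named counter. DF52 stores this as
    // MetricValue::Count { name: "output_batches", .. }, which then mismatches
    // the new built-in OutputBatches variant inside aggregate_by_name().
    // let batches = MetricBuilder::new(metrics).counter("output_batches", partition);

    // After: the dedicated variant, so aggregation compares like with like.
    let batches = MetricBuilder::new(metrics).output_batches(partition);
    batches.add(1);
}
```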

Test plan

  • pytest python/tests/test_scalar_index.py::test_fts_with_filter passes

🤖 Generated with Claude Code

BubbleCal and others added 30 commits December 24, 2025 19:15
Co-authored-by: lijinglun <lijinglun@bytedance.com>
…ance-format#5569)

In a previous change, we manually added each supported catalog integration, but that meant every time we added a new one we still needed to make a PR to lance, which was inconvenient. This PR simplifies things by keeping the nav `.pages` in the `lance-namespace-impls` repo and just adding the template at the end, so we only need to update that repo to support new catalog implementations.
…t#5566)

This PR introduces the credentials vending feature to the namespace impl, allowing us to vend credentials when we run a directory namespace, or when we run it as the backend for a REST namespace. This lets us fully test the credentials vending code path end to end.

The actual vending logic mainly follows the same feature implemented in Apache Polaris. The support covers AWS, GCP, and Azure.
Adds docs showing how to use the new Lance-DuckDB community extension
(will need updates based on new updates by @Xuanwo in the coming days).

---------

Co-authored-by: Xuanwo <github@xuanwo.io>
We should link to the Lance research paper on the landing page alongside
the introduction to Lance to encourage technical readers to read the
paper.

Also left-aligns the text in the intro box to make it nicer to read.
This PR optimizes the RLE implementation.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
…mat#5588)

This PR avoids a panic when hitting a non-null empty multi-vector.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
This document doesn't change the spec but instead writes down what the
existing code does. So I'm not sure it needs a vote. However, I would
like to get lots of eyes on it and have multiple approvals if possible.

---------

Co-authored-by: Xuanwo <github@xuanwo.io>
Make fixes for the following breaking changes in 0.4.0:
1. 0.4.0 introduced a full error handling spec. Update the Rust interface to implement the spec, along with the dir and rest implementations in Rust, Python, and Java based on it.
2. `create_empty_table` is deprecated and `declare_table` is introduced. We start using `declare_table` and mark `create_empty_table` as deprecated, to be removed in 3.0.0. If `declare_table` fails, we currently fall back to `create_empty_table`.
3. Add `deregister_table` support for the dir namespace (without manifest) by adding a `.lance-deregistered` file. When checking table existence, we check that (1) the table directory exists and (2) if it does, the `.lance-deregistered` file does not exist. This allows users to deregister a table immediately without deleting data, which could be long-running.
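A minimal sketch of the existence check described in item 3, using std paths rather than the object-store APIs the real code presumably goes through:

```rust
use std::path::Path;

/// A table exists iff its directory exists and no `.lance-deregistered`
/// tombstone file has been written into it.
fn table_exists(table_dir: &Path) -> bool {
    table_dir.is_dir() && !table_dir.join(".lance-deregistered").exists()
}
```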
Prior to this commit, the ZstdBufferCompressor would construct a new
zstd stream encoder on every call to compress. With this change we
create one compression context for the ZstdBufferCompressor, and reuse
it across calls to compress.

Reuse is not implemented for lz4 compression or for decompression. These
were both explored but did not bring meaningful benefits over the
existing code.
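A minimal sketch of the reuse pattern using the `zstd` crate's bulk API; the actual ZstdBufferCompressor internals may differ:

```rust
use zstd::bulk::Compressor;

struct ZstdBufferCompressor {
    // One compression context, created once and reused across calls.
    ctx: Compressor<'static>,
}

impl ZstdBufferCompressor {
    fn new(level: i32) -> std::io::Result<Self> {
        Ok(Self { ctx: Compressor::new(level)? })
    }

    fn compress(&mut self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // Reusing the context avoids rebuilding encoder state on every call,
        // unlike constructing a fresh stream encoder each time.
        self.ctx.compress(data)
    }
}
```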
When we cherry-pick bug fixes onto release branches, we sometimes need
to modify the commits to work on a new base. When we do that, we change
the commit so it's not the same as the original merge commit from the
PR. The GitHub release-notes API excludes these commits from the notes
it generates. This PR reproduces the same change notes format, but uses
the commit messages to grab the PRs, even if they have been rebased.
…mat#5591)

Null map entries can have non-zero length with garbage values that
should be ignored. MapStructuralEncoder was passing all entries to the
child encoder, but repdef only counted valid entries, which caused
errors to occur when encoding Structs with Map values.

Add MapArrayExt trait (mirroring ListArrayExt) with
filter_garbage_nulls() and trimmed_entries() methods, and use them in
MapStructuralEncoder.
In some cases, we initially considered using RLE but ultimately found
that the data is better stored with bitpacking. This PR implements that
change.

| Metric | Parquet (reference) | Lance (before change) | Lance (after change) | Delta (after vs before) |
|---|---:|---:|---:|---:|
| `int_score` compressed size (bytes) | 56,035 | 377,838 | 71,556 | -306,282 (-81.06%) |
| `int_score` vs Parquet (ratio) | 1.00x | 6.74x | 1.28x | -5.47x |
| Lance chosen encoding (hint) | `RLE_DICTIONARY` (plus `RLE`, `PLAIN`, `SNAPPY`) | `rle` | `inline_bitpacking` | n/a |

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
…e-format#5594)

This PR introduces dictionary encoding for 64-bit types like int64/double.


| Field | Parquet (bytes) | Lance 2.1 (bytes) | Lance 2.2 before (bytes) | Lance 2.2 after (bytes) | vs Parquet | vs Lance 2.1 | vs Lance 2.2 before |
|---|---:|---:|---:|---:|---:|---:|---:|
| `token_count` | 806,050 | 350,852 | 351,208 | 312,168 | -493,882 (-61.3%) | -38,684 (-11.0%) | -39,040 (-11.1%) |
| `score` | 254,145 | 1,438,596 | 1,438,952 | 164,048 | -90,097 (-35.5%) | -1,274,548 (-88.6%) | -1,274,904 (-88.6%) |

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
```
create_hnsw_sq(100000x128)
                        time:   [7.1499 s 7.1644 s 7.1840 s]
                        change: [-1.1794% -0.9172% -0.6107%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

search_hnsw_sq100000x128
                        time:   [253.49 µs 253.87 µs 254.24 µs]
                        change: [-3.6161% -3.4660% -3.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
```

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…#5571)

Currently, when `lance.auto_cleanup.interval` is set to 0, dataset commits panic with a division-by-zero error:

```
thread 'dataset::cleanup::tests::test_auto_cleanup_interval_zero' panicked at rust/lance/src/dataset/cleanup.rs:672:12:
attempt to calculate the remainder with a divisor of zero
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

To fix the panic, we can interpret `lance.auto_cleanup.interval = 0` as triggering cleanup after each commit, which is equivalent to `lance.auto_cleanup.interval = 1`.
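A minimal sketch of the guard, assuming the interval check lives in a commit hook; the real cleanup.rs logic is more involved:

```rust
/// Decide whether this commit should trigger auto-cleanup.
fn should_auto_cleanup(version: u64, interval: u64) -> bool {
    // Treat interval == 0 as "clean up after every commit" (i.e. interval == 1)
    // instead of computing `version % 0`, which panics.
    let interval = interval.max(1);
    version % interval == 0
}
```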
… `nearest` config (lance-format#5486)

Rust core now supports limiting the distance range during vector
searches. This parameter can be exposed in the Python SDK so that users
can limit the distance range of the returned results when performing
vector queries.

Users can use this to do a vector distance range search:
```python
import lance

ds = lance.dataset("vec_test.lance")
q = ds.sample(1).column("vector").to_pylist()[0]
distance_range = (0.0, 1.0)
results = ds.to_table(
    columns=["id"],
    nearest={
        "column": "vector",
        "q": q,
        "k": 20,
        "distance_range": distance_range,
    },
)
```

Co-authored-by: xloya <xiaojiebao@apache.org>
Co-authored-by: Will Jones <willjones127@gmail.com>
Closes lance-format#4723

**Key Changes:**        
The distributed index creation leverages the existing IVF framework
while adding coordination mechanisms for multi-node execution. The index
merger component now handles distributed fragment consolidation and
metadata synchronization. This work enables scalable vector index
creation for large-scale datasets, significantly reducing index build
time.

- Implemented distributed IVF index building infrastructure for parallel
index construction across multiple nodes
- Enhanced the index merger component for distributed operations
- For the IVF_HNSW part, the HNSW graph is built locally within the shard as a sub-index of the partition; there is no cross-shard graph merging and no cross-shard edges. HNSW indexes are still supported, but distribution only happens at the IVF level.
- CPU only; the torch accelerator is not supported and falls back to single-node IVF index creation.

**Current Status in this PR:**
• FLAT/SQ: Should work now; it is in an active testing phase, validating distributed performance and accuracy.
• PQ (Product Quantization): Currently depends on a globally trained codebook, requiring centralized training before distributed deployment.
• RQ (Residual Quantization): I didn't consider this when I designed this PR. Not yet supported in distributed mode; planned for future implementation.

Once I finish all the testing on my side for performance and recall accuracy, I will mark this ready for review.

---------

Co-authored-by: yanghua <yanghua1127@gmail.com>
The test covers 5 parameters (cache, k, nprobes, refine factor, dataset
size) for a total of 64 tests. It takes ~3-4 minutes to run on my
system.

There are other parameters (distance type, number of dimensions,
PQ-v-SQ, etc.) but these should only really affect compute time and/or
recall and are probably better tested in rust benchmarks. Recall
benchmarks will be done separately as they require real data.
…5613)

This PR updates the Lance docs for the DuckDB extension per the latest
fixes [here](lance-format/lance-duckdb#119). The
examples show the latest public-facing API for vector search, FTS and
hybrid search. For other functionality (including the full SQL reference
and cloud reference), we point users to the source repo files.
Closes lance-format#2271

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This is part 1 of improving boolean query performance

## Summary

- Introduce an internal BooleanMatchPlan that normalizes match-only boolean FTS queries into a single-column plan.
- Add unit tests covering valid/invalid plan construction.
- No runtime behavior changes yet; this is a prep step for WAND-based boolean execution.

## Motivation

- Current boolean FTS planning drops the limit, preventing WAND pruning. This PR lays the groundwork for an index-level boolean execution path by providing a normalized, index-executable plan.
Introduce some builder classes to make creating scalar indexes easy. For example:

```java
ScalarIndexParams scalarParams = BTreeIndexParams.builder()
    .zoneSize(2048)
    .build();

IndexParams indexParams = IndexParams.builder().setScalarIndexParams(scalarParams).build();

// Create a BTree index on the 'id' column
dataset.createIndex(
    Collections.singletonList("id"),
    IndexType.BTREE,
    Optional.of("btree_id_index"),
    indexParams,
    true);
```

---------

Co-authored-by: lijinglun <lijinglun@bytedance.com>
westonpace and others added 25 commits February 21, 2026 16:52
When training an FTS index we load all of the partitions and merge them. The code was set up to load partitions in parallel. However, there was no spawn call, so nothing was actually loaded in parallel. This led to starvation on HEAD calls. The order of operations (everything was serialized) was something like...
Start load of part 1 file 1
Start load of part 2 file 1
...
Start load of part N file 1
Finish load and do CPU work to load file 1
Start load of part 1 file 2
Finish load and do CPU work to load file 2
...

The load request for part N file 1 would not get polled again until a lot of CPU work was done (parsing files 1 through N-1). This resulted in the HEAD request being starved, which looked like an S3 timeout.

This fix uses spawn. I've tested it against 15M rows of fineweb data and
ensured the RAM is still reasonably bounded even with parallel loading.
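A minimal sketch of the fix's pattern, assuming tokio and the futures crate; the loader is a hypothetical stand-in for the real partition reads:

```rust
use futures::stream::{self, StreamExt};

// Hypothetical loader standing in for reading an FTS index partition file.
async fn load_partition(path: String) -> Vec<u8> {
    tokio::fs::read(path).await.expect("read failed")
}

async fn load_all(paths: Vec<String>) -> Vec<Vec<u8>> {
    stream::iter(paths)
        // tokio::spawn moves each load onto the runtime, so the IO futures
        // keep being polled even while another task does CPU-heavy parsing.
        .map(|p| tokio::spawn(load_partition(p)))
        // Bound the number of in-flight loads so RAM stays reasonably bounded.
        .buffered(8)
        .map(|joined| joined.expect("task panicked"))
        .collect()
        .await
}
```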
…#5980)

During testing, we found some issues with the `RestNamespace` implementation of the table versions API, which did not fully follow the spec definition.
This PR adds a basic lance skill as user guide.

Users can install them via

```shell
npx skills add lance-format/lance
```

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
…t#5987)

This PR replaces lance-format#5985 since we don't need a new dep.

## Summary

- remove `shellexpand` from workspace and `lance-io` dependencies
- replace tilde expansion in `lance-io` with `std::env::home_dir`-based
logic
- keep support for `~` and `~/...` paths (plus `~\\...` on Windows)
- update lockfile to drop `shellexpand`

## Validation

- `cargo check -p lance-io`
- `cargo test -p lance-io test_tilde_expansion -- --nocapture`
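A minimal sketch of the replacement logic, assuming the deprecated-but-available `std::env::home_dir`; the real lance-io code may differ in detail:

```rust
use std::path::PathBuf;

#[allow(deprecated)] // std::env::home_dir still works for this purpose
fn expand_tilde(path: &str) -> PathBuf {
    match (path, std::env::home_dir()) {
        // A bare `~` resolves to the home directory itself.
        ("~", Some(home)) => home,
        // `~/...` (and `~\...` on Windows) resolve relative to home.
        (p, Some(home)) if p.starts_with("~/") || p.starts_with("~\\") => home.join(&p[2..]),
        // Anything else, or no home directory, passes through unchanged.
        (p, _) => PathBuf::from(p),
    }
}
```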
Fixing error:

```
FAILED python/tests/test_table_provider.py::test_table_loading - ImportError: Incompatible libraries. DataFusion 52.0.0 introduced an incompatible signature change for table providers. Either downgrade DataFusion or upgrade your function library.
```
…ance-format#5953)

If the client specifies `.with_fragments`, vector and FTS searches on indexed fragments should respect the target fragments.

Previous PR to fix the unindexed fragments path:
lance-format#5924

Co-authored-by: stevie9868 <yingjianwu2@email.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
…ting_point (lance-format#5836)

## What

Fix error messages in the `convert_to_floating_point` method of the `FixedSizeListArrayExt` trait.

## Why

Error messages previously always showed "Int8Type" regardless of the
actual type being cast, which was misleading and made debugging
difficult.

## Changes

- Update error messages to accurately reflect the actual type being cast
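A hypothetical sketch of the shape of the fix, formatting the real source type into the message instead of a hard-coded "Int8Type":

```rust
use std::any::type_name;

// `T` stands in for the arrow primitive type being cast.
fn cast_error<T>() -> String {
    format!("cannot convert {} to a floating point type", type_name::<T>())
}
```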

Co-authored-by: Will Jones <willjones127@gmail.com>
Prior to this commit, data_stats did a serial loop over fragment metadata with one IO per fragment. For datasets with large numbers of fragments, this could take a long time.

This commit parallelises the call across the parallelism of the object store.
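A minimal sketch of the pattern with the futures crate; the fetch function is a hypothetical stand-in for the per-fragment metadata IO:

```rust
use futures::stream::{self, StreamExt};

// Hypothetical stand-in: one network round trip per fragment in the real code.
async fn fetch_metadata(fragment_id: u64) -> u64 {
    fragment_id
}

async fn collect_stats(fragment_ids: Vec<u64>, io_parallelism: usize) -> Vec<u64> {
    stream::iter(fragment_ids)
        .map(fetch_metadata)
        // Keep up to `io_parallelism` requests in flight, replacing the old
        // one-request-at-a-time loop.
        .buffer_unordered(io_parallelism)
        .collect()
        .await
}
```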
…nce-format#5934)

Make geospatial dependencies (geodatafusion, geoarrow-array,
geoarrow-schema, geo-traits, geo-types) optional across the lance crate
stack via a new `geo` feature flag.

This allows consumers that don't need spatial indexing or geospatial
UDFs to avoid pulling in ~40 transitive dependencies and the associated
runtime overhead of registering geo UDFs in every DataFusion
SessionContext.

The `geo` feature is enabled by default in the top-level `lance` crate,
preserving backward compatibility. Consumers can opt out with
`default-features = false`.

Crates modified:
- lance-geo: all geo deps made optional, bbox module gated
- lance-datafusion: lance-geo made optional, geo UDF registration gated
- lance-index: lance-geo + geoarrow deps optional, rtree module and
RTreeIndexPlugin registration gated, geo bench requires feature
- lance: geo feature added to defaults, propagates to sub-crates

---------

Co-authored-by: Miroslav Drbal <miroslav.drbal@gendigital.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Unreleased version after creating v3.0.0-rc.1
1. Fix the Python binding short-circuit for the DirectoryNamespace and RestNamespace bindings
2. Fix local file system access to LanceFileSession for namespace-based access
3. Fix propagating storage options to the __manifest table in DirectoryNamespace
…nce-format#5995)

In full-zip variable packed decoding, rep/def may produce visible rows
with empty payloads (for null/invalid items). The decoder previously
assumed every visible row had bytes for each child and failed with
`Packed struct fixed child exceeds row bounds`.

This happened while writing new tables with blob v2.

## Summary

- fix `PackedStructVariablePerValueDecompressor` to handle empty packed
rows (`row_start == row_end`)
- append one per-child placeholder value for empty rows so child
builders remain row-aligned
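A minimal sketch of the empty-row handling; the real decompressor is considerably more involved, and the types here are illustrative:

```rust
/// Decode one visible row of a packed struct into per-child builders.
fn decode_row(row_start: usize, row_end: usize, children: &mut [Vec<Option<u64>>]) {
    if row_start == row_end {
        // Empty payload for a null/invalid item: append one placeholder per
        // child so the child builders stay row-aligned.
        for child in children.iter_mut() {
            child.push(None);
        }
        return;
    }
    // ... normal per-child decoding over the bytes in [row_start, row_end)
}
```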

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.3-codex`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
- add a warning that `drop_columns` is metadata-only but data can become
unrecoverable after `compact_files` + `cleanup_old_versions`
- add operational guidance for rollback windows (tag/snapshot, delayed
cleanup, validation before aggressive cleanup)

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.3-codex`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
This introduces a DeleteResult with num_rows_deleted, similar to
UpdateResult.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
From
lance-format#5983 (comment),
we currently use `CommitConflict` for two situations:

1. Incompatible transactions: there is a conflict that is not retryable. For example, you are trying to create an index, but a concurrent transaction overwrote the table and changed the schema.
2. Commit step ran out of retries: we hit the max number of rebase attempts, and even though we could retry again, we aren't. This is in effect just throttling.
This makes them separate errors.
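A hypothetical sketch of the split; the actual error names in this change may differ:

```rust
#[derive(Debug)]
enum CommitError {
    /// An incompatible concurrent transaction: retrying the same transaction
    /// can never succeed (e.g. the table was overwritten under you).
    CommitConflict { reason: String },
    /// The commit loop exhausted its rebase attempts; the caller may retry
    /// later. This is effectively throttling, not a true conflict.
    TooManyRetries { attempts: u32 },
}
```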
…at#6002)

Also added helper function `extract_namespace_arc` for shared logic

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…format#6006)

## Summary
- short-circuit FTS scans when `fast_search` is enabled and no indexed
fragments exist so we return an empty plan instead of scanning unindexed
data
- skip the unindexed-match planning path entirely under `fast_search`,
forcing only index-backed queries even when fragments exist
- add plan verification and a regression test proving `fast_search`
excludes rows appended after building the FTS index
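A minimal sketch of the short-circuit, assuming DataFusion's EmptyExec; the Lance planner code is structured differently:

```rust
use std::sync::Arc;
use datafusion::arrow::datatypes::Schema;
use datafusion::physical_plan::{empty::EmptyExec, ExecutionPlan};

/// With fast_search on and no indexed fragments, return an empty plan rather
/// than falling through to a scan of unindexed data.
fn fts_short_circuit(
    fast_search: bool,
    indexed_fragments: usize,
    schema: Arc<Schema>,
) -> Option<Arc<dyn ExecutionPlan>> {
    if fast_search && indexed_fragments == 0 {
        return Some(Arc::new(EmptyExec::new(schema)));
    }
    None // continue with the normal index-backed planning path
}
```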
DataFusion 52 changed AVG's type signature from UserDefined to
Coercible, so the old UserDefined-only guard skipped coercion and
AVG(Int64) failed at execution time. Use fields_with_udf to resolve
coerced types from the function signature, which handles all signature
variants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DF52 introduced a dedicated `MetricValue::OutputBatches` variant.
Using the generic `Count` variant with name "output_batches" causes
a panic in `aggregate_by_name()` due to mismatched enum variants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot added the bug (Something isn't working) label Feb 25, 2026
wjones127 marked this pull request as ready for review February 25, 2026 19:24
wjones127 changed the base branch from tsaucer/df52 to main February 25, 2026 19:41
wjones127 closed this Feb 25, 2026
