fix: use OutputBatches metric variant for DF52 compatibility#11
Closed
wjones127 wants to merge 363 commits intorerun-io:mainfrom
Conversation
Co-authored-by: lijinglun <lijinglun@bytedance.com>
…ance-format#5569) In the previous change, we manually added each supported catalog integration, but that meant every time we added a new one we still needed to make a PR to lance, which was inconvenient. This PR makes it simpler by having the nav `.pages` in the `lance-namespace-impls` repo and just adding the template at the end, so we only need to update that repo to support new catalog implementations.
…t#5566) This PR introduces the credentials vending feature to the namespace impl, allowing us to vend credentials if we run the directory namespace, or run it as the backend for the rest namespace. This allows us to fully test the credentials vending code path end to end. The actual vending logic mainly consults the same feature implemented in Apache Polaris. The support covers AWS, GCP, and Azure.
Adds docs showing how to use the new Lance-DuckDB community extension (will need updates based on new updates by @Xuanwo in the coming days). --------- Co-authored-by: Xuanwo <github@xuanwo.io>
We should link to the Lance research paper on the landing page alongside the introduction to Lance to encourage technical readers to read the paper. Also left-aligns the text in the intro box to make it nicer to read.
This PR optimizes the RLE implementation. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
…mat#5588) This PR avoids a panic when hitting a non-null empty multi-vector. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
This document doesn't change the spec but instead writes down what the existing code does. So I'm not sure it needs a vote. However, I would like to get lots of eyes on it and have multiple approvals if possible. --------- Co-authored-by: Xuanwo <github@xuanwo.io>
Make fixes for the following breaking changes in 0.4.0:

1. Introduced a full error handling spec; update the Rust interface to implement the spec, and also the dir and rest implementations in Rust, Python, and Java based on it.
2. `create_empty_table` is deprecated and `declare_table` is introduced. We will start to use `declare_table`, mark `create_empty_table` as deprecated, and remove it in 3.0.0. If `declare_table` fails, it currently falls back to `create_empty_table`.
3. We add `deregister_table` support for the dir namespace (without a manifest) by adding a `.lance-deregistered` file. When checking table existence, we check (1) that the table directory exists, and (2) if it does, that the `.lance-deregistered` file does not exist. This allows a user to deregister a table immediately without deleting data, which could be long-running.
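The existence check in item 3 can be sketched in Python (illustrative only; the real implementation is in the namespace code, and `DEREGISTERED_MARKER` and the function names here are hypothetical):

```python
# Sketch: a directory-namespace table counts as existing only if its directory
# exists AND no ".lance-deregistered" marker file is present inside it.
import os
import tempfile

DEREGISTERED_MARKER = ".lance-deregistered"  # marker name from the PR description

def table_exists(table_dir: str) -> bool:
    if not os.path.isdir(table_dir):
        return False
    return not os.path.exists(os.path.join(table_dir, DEREGISTERED_MARKER))

def deregister_table(table_dir: str) -> None:
    # Deregistration is immediate and metadata-only; data cleanup can happen later.
    open(os.path.join(table_dir, DEREGISTERED_MARKER), "w").close()

with tempfile.TemporaryDirectory() as d:
    assert table_exists(d)
    deregister_table(d)
    assert not table_exists(d)
```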
Prior to this commit, the ZstdBufferCompressor would construct a new zstd stream encoder on every call to compress. With this change we create one compression context for the ZstdBufferCompressor, and reuse it across calls to compress. Reuse is not implemented for lz4 compression or for decompression. These were both explored but did not bring meaningful benefits over the existing code.
When we cherry-pick bug fixes onto release branches, we sometimes need to modify the commits to work on a new base. When we do that, we change the commit so it's not the same as the original merge commit from the PR. The GitHub release-notes API excludes these commits from the notes it generates. This PR reproduces the same change notes format, but uses the commit messages to grab the PRs, even if they have been rebased.
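The commit-message approach can be sketched as follows (a hedged illustration, not the actual release-notes script; it assumes the `(#NNNN)` suffix convention from GitHub squash-merge subjects):

```python
# Sketch: recover PR numbers from commit message subjects, e.g.
# "fix: something (#1234)", instead of relying on merge-commit SHAs,
# so rebased cherry-picks still show up in release notes.
import re

PR_RE = re.compile(r"\(#(\d+)\)\s*$")

def pr_number(subject: str):
    """Return the trailing PR number of a squash-merge subject, or None."""
    m = PR_RE.search(subject)
    return int(m.group(1)) if m else None

assert pr_number("fix: avoid panic on empty multi-vector (#5588)") == 5588
assert pr_number("chore: bump version") is None
```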
…mat#5591) Null map entries can have non-zero length with garbage values that should be ignored. MapStructuralEncoder was passing all entries to the child encoder, but repdef only counted valid entries, which caused errors to occur when encoding Structs with Map values. Add MapArrayExt trait (mirroring ListArrayExt) with filter_garbage_nulls() and trimmed_entries() methods, and use them in MapStructuralEncoder.
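The trimming idea can be illustrated with plain Python lists (hypothetical names mirroring `trimmed_entries()`; the real code operates on Arrow MapArray offsets and validity buffers):

```python
# Sketch: a null map row can still span "garbage" entries between its offsets.
# Only entries belonging to valid (non-null) rows should reach the child encoder.
def trimmed_entries(offsets, validity, entries):
    """Keep only the entries that belong to valid map rows."""
    out = []
    for i, valid in enumerate(validity):
        if valid:
            out.extend(entries[offsets[i]:offsets[i + 1]])
    return out

# Row 1 is null, but its offsets still span garbage entries c, d, e.
offsets = [0, 2, 5, 6]
validity = [True, False, True]
entries = ["a", "b", "c", "d", "e", "f"]
assert trimmed_entries(offsets, validity, entries) == ["a", "b", "f"]
```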
In some cases, we initially considered using RLE but ultimately found that the data is better stored with bitpacking. This PR implements that change.

| Metric | Parquet (reference) | Lance (before change) | Lance (after change) | Delta (after vs before) |
|---|---:|---:|---:|---:|
| `int_score` compressed size (bytes) | 56,035 | 377,838 | 71,556 | -306,282 (-81.06%) |
| `int_score` vs Parquet (ratio) | 1.00x | 6.74x | 1.28x | -5.47x |
| Lance chosen encoding (hint) | `RLE_DICTIONARY` (plus `RLE`, `PLAIN`, `SNAPPY`) | `rle` | `inline_bitpacking` | n/a |

---

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
…e-format#5594) This PR introduces dictionary encoding for 64-bit types like int64/double.

| Field | Parquet (bytes) | Lance 2.1 (bytes) | Lance 2.2 before (bytes) | Lance 2.2 after (bytes) | vs Parquet | vs Lance 2.1 | vs Lance 2.2 before |
|---|---:|---:|---:|---:|---:|---:|---:|
| `token_count` | 806,050 | 350,852 | 351,208 | 312,168 | -493,882 (-61.3%) | -38,684 (-11.0%) | -39,040 (-11.1%) |
| `score` | 254,145 | 1,438,596 | 1,438,952 | 164,048 | -90,097 (-35.5%) | -1,274,548 (-88.6%) | -1,274,904 (-88.6%) |

---

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
```
create_hnsw_sq(100000x128)
time: [7.1499 s 7.1644 s 7.1840 s]
change: [-1.1794% -0.9172% -0.6107%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe
search_hnsw_sq100000x128
time: [253.49 µs 253.87 µs 254.24 µs]
change: [-3.6161% -3.4660% -3.3038%] (p = 0.00 < 0.05)
Performance has improved.
```
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…#5571) For now, when `lance.auto_cleanup.interval` is set to 0, dataset commits panic with a "division by zero" error:

```
thread 'dataset::cleanup::tests::test_auto_cleanup_interval_zero' panicked at rust/lance/src/dataset/cleanup.rs:672:12:
attempt to calculate the remainder with a divisor of zero
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

To fix the panic, we can interpret `lance.auto_cleanup.interval = 0` as triggering cleanup after each commit, which is equivalent to `lance.auto_cleanup.interval = 1`.
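A minimal sketch of this interpretation, assuming the cleanup check is a `commit_count % interval` style test (names here are illustrative, not Lance's actual API):

```python
# Sketch: clamp interval to at least 1 so the modulo never divides by zero;
# interval == 0 then means "clean up after every commit".
def should_auto_cleanup(commit_count: int, interval: int) -> bool:
    interval = max(interval, 1)
    return commit_count % interval == 0

assert should_auto_cleanup(7, 0)        # interval 0: cleanup on every commit, no panic
assert should_auto_cleanup(10, 5)       # 10 % 5 == 0
assert not should_auto_cleanup(11, 5)   # 11 % 5 != 0
```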
… `nearest` config (lance-format#5486) The Rust core now supports limiting the distance range during vector searches. This parameter can be exposed in the Python SDK so that users can limit the distance range of the returned results when performing vector queries. Users can do a vector distance range search like this:

```python
import lance

ds = lance.dataset("vec_test.lance")
q = ds.sample(1).column("vec").to_pylist()[0]
distance_range = (0.0, 1.0)
results = ds.to_table(
    columns=["id"],
    nearest={
        "column": "vector",
        "q": q,
        "k": 20,
        "distance_range": distance_range,
    },
)
```

Co-authored-by: xloya <xiaojiebao@apache.org>
Co-authored-by: Will Jones <willjones127@gmail.com>
close lance-format#4723

**Key Changes:**

The distributed index creation leverages the existing IVF framework while adding coordination mechanisms for multi-node execution. The index merger component now handles distributed fragment consolidation and metadata synchronization. This work enables scalable vector index creation for large-scale datasets, significantly reducing index build time.

- Implemented distributed IVF index building infrastructure for parallel index construction across multiple nodes
- Enhanced the index merger component for distributed operations
- For the IVF_HNSW part, the HNSW graph is built locally within the shard as a sub-index of the partition; there is no cross-shard graph merging and no cross-shard edges. These are supported, but distribution only happens in IVF.
- CPU only; the torch accelerator will not be supported and falls back to single-node IVF index creation.

**Current Status in this PR:**

- FLAT/SQ: should work now; it is in an active testing phase, validating distributed performance and accuracy.
- PQ (Product Quantization): currently depends on a globally trained codebook, requiring centralized training before distributed deployment.
- RQ (Residual Quantization): not considered in the initial design of this PR; not yet supported in distributed mode, planned for future implementation.

Once I finish all the testing on my side on performance and recall accuracy, I will mark it ready for review.

--------- Co-authored-by: yanghua <yanghua1127@gmail.com>
The test covers 5 parameters (cache, k, nprobes, refine factor, dataset size) for a total of 64 tests. It takes ~3-4 minutes to run on my system. There are other parameters (distance type, number of dimensions, PQ-v-SQ, etc.) but these should only really affect compute time and/or recall and are probably better tested in rust benchmarks. Recall benchmarks will be done separately as they require real data.
…5613) This PR updates the Lance docs for the DuckDB extension per the latest fixes [here](lance-format/lance-duckdb#119). The examples show the latest public-facing API for vector search, FTS and hybrid search. For other functionality (including the full SQL reference and cloud reference), we point users to the source repo files.
Closes lance-format#2271 --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This is part 1 of improving boolean query performance.

## Summary

- Introduce an internal BooleanMatchPlan that normalizes match-only boolean FTS queries into a single-column plan. Add unit tests covering valid/invalid plan construction.
- No runtime behavior changes yet; this is a prep step for WAND-based boolean execution.

## Motivation

- Current boolean FTS planning drops the limit, preventing WAND pruning. This PR lays the groundwork for an index-level boolean execution path by providing a normalized, index-executable plan.
Introduce some builder classes to make creating a scalar index easy. For
example:
```java
ScalarIndexParams scalarParams = BTreeIndexParams.builder()
    .zoneSize(2048)
    .build();
IndexParams indexParams = IndexParams.builder().setScalarIndexParams(scalarParams).build();
// Create BTree index on 'id' column
dataset.createIndex(
    Collections.singletonList("id"),
    IndexType.BTREE,
    Optional.of("btree_id_index"),
    indexParams,
    true);
```
---------
Co-authored-by: lijinglun <lijinglun@bytedance.com>
When training an FTS index we load all of the partitions and merge them. The code was set up to load partitions in parallel. However, there was no spawn call, so it wasn't actually loading anything in parallel. This led to starvation on HEAD calls. The order of operations (everything was serialized) was something like:

Start load of part 1 file 1
Start load of part 2 file 1
...
Start load of part N file 1
Finish load and do CPU work to load file 1
Start load of part 1 file 2
Finish load and do CPU work to load file 2
...

The load request of part N file 1 would not get polled again until a lot of CPU work was done (to parse files 1 through N-1). This resulted in the HEAD request being starved, which looked like an S3 timeout. This fix uses spawn. I've tested it against 15M rows of fineweb data and ensured the RAM is still reasonably bounded even with parallel loading.
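The difference can be demonstrated with a toy asyncio program (a Python stand-in for the Rust/tokio behavior; `load_partition` is a hypothetical substitute for an object-store read):

```python
# Awaiting each coroutine in turn serializes the I/O; creating tasks up front
# (the moral equivalent of tokio::spawn) lets all loads proceed concurrently.
import asyncio
import time

async def load_partition(i):
    await asyncio.sleep(0.05)  # stand-in for an object-store GET
    return i

async def load_serial(n):
    # No spawn: each load only starts after the previous one finishes.
    return [await load_partition(i) for i in range(n)]

async def load_parallel(n):
    # Spawned tasks: all loads are in flight at once.
    tasks = [asyncio.create_task(load_partition(i)) for i in range(n)]
    return [await t for t in tasks]

t0 = time.monotonic()
asyncio.run(load_serial(10))
t_serial = time.monotonic() - t0       # roughly 10 * 0.05 s

t0 = time.monotonic()
asyncio.run(load_parallel(10))
t_parallel = time.monotonic() - t0     # roughly 0.05 s

assert t_parallel < t_serial
```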
…#5980) During testing, we found some issues with the `RestNamespace` implementation of the table versions API, which did not fully match the spec definition.
This PR adds a basic lance skill as a user guide. Users can install it via:

```shell
npx skills add lance-format/lance
```

---

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**

--------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
…t#5987) This PR will replace lance-format#5985 since we don't need a new dep.

## Summary

- remove `shellexpand` from workspace and `lance-io` dependencies
- replace tilde expansion in `lance-io` with `std::env::home_dir`-based logic
- keep support for `~` and `~/...` paths (plus `~\\...` on Windows)
- update lockfile to drop `shellexpand`

## Validation

- `cargo check -p lance-io`
- `cargo test -p lance-io test_tilde_expansion -- --nocapture`
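A rough Python analogue of the home-dir-based expansion described above (the real change is in Rust's `lance-io`; `expand_tilde` is an illustrative name, not the library's API):

```python
# Sketch: expand a leading "~" using the user's home directory, supporting
# "~", "~/..." and (on Windows) "~\..." forms; everything else passes through.
from pathlib import Path

def expand_tilde(path: str) -> str:
    home = str(Path.home())
    if path == "~":
        return home
    if path.startswith("~/") or path.startswith("~\\"):
        return home + path[1:]
    return path  # "~user" forms and non-tilde paths are left unchanged

assert expand_tilde("~") == str(Path.home())
assert expand_tilde("~/data.lance") == str(Path.home()) + "/data.lance"
assert expand_tilde("/abs/path") == "/abs/path"
```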
Fixing error:

```
FAILED python/tests/test_table_provider.py::test_table_loading - ImportError: Incompatible libraries. DataFusion 52.0.0 introduced an incompatible signature change for table providers. Either downgrade DataFusion or upgrade your function library.
```
…ance-format#5953) If the client specifies `.with_fragments`, vector and FTS searches on indexed fragments should respect the target fragments. Previous PR fixing the unindexed fragments path: lance-format#5924 Co-authored-by: stevie9868 <yingjianwu2@email.com> Co-authored-by: Will Jones <willjones127@gmail.com>
…ting_point (lance-format#5836)

## What

Fix error messages in the `convert_to_floating_point` method of the `FixedSizeListArrayExt` trait.

## Why

Error messages previously always showed "Int8Type" regardless of the actual type being cast, which was misleading and made debugging difficult.

## Changes

- Update error messages to accurately reflect the actual type being cast

Co-authored-by: Will Jones <willjones127@gmail.com>
Prior to this commit, data_stats did a serial loop over fragment metadata with one IO per fragment. For datasets with large numbers of fragments, this can take a long time. This commit parallelises the call using the object store's IO parallelism.
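The shape of the change can be sketched in Python (illustrative only; `fetch_fragment_meta` is a hypothetical stand-in for the per-fragment IO, and the bound mirrors the object store's parallelism):

```python
# Sketch: fan the per-fragment metadata reads out over a bounded thread pool
# instead of a serial loop, so N fragments cost ~N/parallelism round trips.
from concurrent.futures import ThreadPoolExecutor

def fetch_fragment_meta(fragment_id: int) -> dict:
    # Pretend this is one network call to the object store.
    return {"id": fragment_id, "rows": 1000}

def data_stats(fragment_ids, io_parallelism: int = 8) -> int:
    with ThreadPoolExecutor(max_workers=io_parallelism) as pool:
        metas = list(pool.map(fetch_fragment_meta, fragment_ids))
    return sum(m["rows"] for m in metas)

assert data_stats(range(100)) == 100_000
```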
…nce-format#5934) Make geospatial dependencies (geodatafusion, geoarrow-array, geoarrow-schema, geo-traits, geo-types) optional across the lance crate stack via a new `geo` feature flag. This allows consumers that don't need spatial indexing or geospatial UDFs to avoid pulling in ~40 transitive dependencies and the associated runtime overhead of registering geo UDFs in every DataFusion SessionContext. The `geo` feature is enabled by default in the top-level `lance` crate, preserving backward compatibility. Consumers can opt out with `default-features = false`.

Crates modified:
- lance-geo: all geo deps made optional, bbox module gated
- lance-datafusion: lance-geo made optional, geo UDF registration gated
- lance-index: lance-geo + geoarrow deps optional, rtree module and RTreeIndexPlugin registration gated, geo bench requires feature
- lance: geo feature added to defaults, propagates to sub-crates

--------- Co-authored-by: Miroslav Drbal <miroslav.drbal@gendigital.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Will Jones <willjones127@gmail.com>
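A consumer that doesn't need the geo stack might opt out like this (illustrative `Cargo.toml` fragment; the version is deliberately left as a wildcard placeholder):

```toml
# Opt out of the default `geo` feature to skip the geospatial dependency tree.
[dependencies]
lance = { version = "*", default-features = false }
```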
Unreleased version after creating v3.0.0-rc.1
1. Fix Python binding short-circuit for the DirectoryNamespace and RestNamespace bindings.
2. Fix local file system access to LanceFileSession for namespace-based access.
3. Fix propagating storage options to the `__manifest` table in DirectoryNamespace.
…nce-format#5995) In full-zip variable packed decoding, rep/def may produce visible rows with empty payloads (for null/invalid items). The decoder previously assumed every visible row had bytes for each child and failed with `Packed struct fixed child exceeds row bounds`. This happened when writing new tables with blob v2.

## Summary

- fix `PackedStructVariablePerValueDecompressor` to handle empty packed rows (`row_start == row_end`)
- append one per-child placeholder value for empty rows so child builders remain row-aligned

---

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
- add a warning that `drop_columns` is metadata-only but data can become unrecoverable after `compact_files` + `cleanup_old_versions`
- add operational guidance for rollback windows (tag/snapshot, delayed cleanup, validation before aggressive cleanup)

---

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
This introduces a DeleteResult with num_rows_deleted, similar to UpdateResult. --------- Co-authored-by: Will Jones <willjones127@gmail.com>
From lance-format#5983 (comment), we currently use `CommitConflict` for two situations:

1. Incompatible transactions: there is a conflict that is not retryable. For example, you are trying to create an index, but a concurrent transaction overwrote the table and changed the schema.
2. Commit step ran out of retries: we hit the max number of rebase attempts, and even though we could retry again, we aren't. This is really just throttling.

This PR makes them separate errors.
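The split can be sketched in Python with hypothetical names (the real errors are Rust enum variants; `_TransientConflict` models a single rebase-able conflict):

```python
# Sketch: separate a non-retryable conflict from a retries-exhausted error.
class CommitConflict(Exception):
    """Incompatible concurrent transaction; retrying cannot help."""

class TooManyRetries(Exception):
    """Ran out of rebase attempts; effectively throttling, retry later."""

class _TransientConflict(Exception):
    """Internal: a conflict that can be resolved by rebasing and retrying."""

def commit_with_rebase(attempt, max_retries=3):
    for _ in range(max_retries):
        try:
            return attempt()
        except _TransientConflict:
            continue  # rebase and try again
        # CommitConflict is deliberately NOT caught: it propagates immediately.
    raise TooManyRetries(f"gave up after {max_retries} rebase attempts")

# Usage: the two failure modes are now distinguishable by the caller.
assert commit_with_rebase(lambda: "ok") == "ok"

calls = {"n": 0}
def always_conflicts():
    calls["n"] += 1
    raise _TransientConflict

try:
    commit_with_rebase(always_conflicts)
except TooManyRetries:
    assert calls["n"] == 3
```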
…at#6002) Also added helper function `extract_namespace_arc` for shared logic --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…format#6006)

Summary
- short-circuit FTS scans when `fast_search` is enabled and no indexed fragments exist, so we return an empty plan instead of scanning unindexed data
- skip the unindexed-match planning path entirely under `fast_search`, forcing only index-backed queries even when fragments exist
- add plan verification and a regression test proving `fast_search` excludes rows appended after building the FTS index
DataFusion 52 changed AVG's type signature from UserDefined to Coercible, so the old UserDefined-only guard skipped coercion and AVG(Int64) failed at execution time. Use fields_with_udf to resolve coerced types from the function signature, which handles all signature variants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DF52 introduced a dedicated `MetricValue::OutputBatches` variant. Using the generic `Count` variant with name "output_batches" causes a panic in `aggregate_by_name()` due to mismatched enum variants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

DF52 introduced a dedicated `MetricValue::OutputBatches` variant. Using the generic `Count` variant with the name `"output_batches"` causes a panic in `aggregate_by_name()` due to mismatched enum variants.

Test plan

`pytest python/tests/test_scalar_index.py::test_fts_with_filter` passes.

🤖 Generated with Claude Code