From 13ea73a351fe1bb39a433f104ec90d450715c6ce Mon Sep 17 00:00:00 2001
From: Weston Pace
Date: Mon, 23 Feb 2026 06:50:55 -0800
Subject: [PATCH 1/2] Expand the FTS index doc explaining the training process
 and how multiple partitions work

---
 docs/src/format/table/index/scalar/fts.md | 54 +++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/docs/src/format/table/index/scalar/fts.md b/docs/src/format/table/index/scalar/fts.md
index 5af36d294b8..33c4a5ed0da 100644
--- a/docs/src/format/table/index/scalar/fts.md
+++ b/docs/src/format/table/index/scalar/fts.md
@@ -18,6 +18,8 @@ The FTS index consists of multiple files storing the token dictionary, document
 3. `invert.lance` - Compressed posting lists for each token
 4. `metadata.lance` - Index metadata and configuration
 
+An FTS index may contain multiple partitions. Each partition has its own set of token, document, and posting list files, prefixed with the partition ID (e.g. `part_0_tokens.lance`, `part_0_docs.lance`, `part_0_invert.lance`). The `metadata.lance` file lists all partition IDs in the index. At query time, every partition must be searched and the results combined to produce the final ranked output. Fewer partitions generally mean better query performance, since each partition requires its own token dictionary lookup and posting list scan. The number of partitions is controlled by the training configuration -- specifically `LANCE_FTS_TARGET_SIZE` determines how large each merged partition can grow (see [Training Process](#training-process) for details).
+
 ### Token Dictionary File Schema
 
 | Column | Type | Nullable | Description |
@@ -189,6 +191,58 @@ address.city:San
 address.city:Francisco
 ```
 
+## Training Process
+
+Building an FTS index is a multi-phase pipeline: the source column is scanned, documents are tokenized in parallel, intermediate results are spilled to part files on disk, and the part files are merged into final output partitions.
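The two phases described below can be sketched end-to-end as follows. This is an illustrative simplification, not the Lance implementation: tiny in-memory dicts stand in for the part files, and posting counts stand in for the byte-based `LANCE_FTS_PARTITION_SIZE` / `LANCE_FTS_TARGET_SIZE` limits.

```python
from collections import defaultdict

# Stand-ins for LANCE_FTS_PARTITION_SIZE / LANCE_FTS_TARGET_SIZE, counted in
# postings rather than bytes to keep the illustration small.
PARTITION_SIZE = 4
TARGET_SIZE = 8

def tokenize_worker(docs):
    """Phase 1: accumulate postings in memory; spill a part when the buffer fills."""
    parts, postings, size = [], defaultdict(list), 0
    for doc_id, text in docs:
        for token in text.split():
            postings[token].append(doc_id)
            size += 1
        if size >= PARTITION_SIZE:       # buffer limit reached: spill a part file
            parts.append(dict(postings))
            postings, size = defaultdict(list), 0
    if postings:                         # flush the final partial buffer
        parts.append(dict(postings))
    return parts

def merge_parts(parts):
    """Phase 2: stream parts, unify token dictionaries, and cut an output
    partition whenever the merged size reaches the target."""
    partitions, merged, size = [], defaultdict(list), 0
    for part in parts:
        for token, doc_ids in part.items():
            merged[token].extend(doc_ids)
            size += len(doc_ids)
        if size >= TARGET_SIZE:
            partitions.append(dict(merged))
            merged, size = defaultdict(list), 0
    if merged:
        partitions.append(dict(merged))
    return partitions
```

With these limits, four short documents spill into two parts, and merging yields a single output partition whose posting lists concatenate the per-part lists in document order -- the same reason fewer, larger partitions are cheaper to query.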
+ +### Phase 1: Tokenization + +The input column is read as a stream of record batches and dispatched to a pool of tokenizer worker tasks. Each worker tokenizes documents independently, accumulating tokens, posting lists, and document metadata in memory. + +When a worker's accumulated data reaches the partition size limit or the document count hits `u32::MAX`, it flushes the data to disk as a set of part files (`part__tokens.lance`, `part__invert.lance`, `part__docs.lance`). A single worker may produce multiple part files if it processes enough data. + +### Phase 2: Merge + +After all workers finish, the part files are merged into output partitions. Part files are streamed with bounded buffering so that not all data needs to be loaded into memory at once. For each part file, the token dictionaries are unified, document sets are concatenated, and posting lists are rewritten with adjusted IDs. + +When a merged partition reaches the target size, it is written to the destination store and a new one is started. After all part files are consumed the final partition is flushed, and a `metadata.lance` file is written listing the partition IDs and index parameters. + +### Configuration + +| Environment Variable | Default | Description | +|----------------------------|----------------------------------|-----------------------------------------------------------------------------------------------------------------------| +| `LANCE_FTS_NUM_SHARDS` | Number of compute-intensive CPUs | Number of parallel tokenizer worker tasks. Higher values increase indexing throughput but use more memory. | +| `LANCE_FTS_PARTITION_SIZE` | 256 (MiB) | Maximum uncompressed size of a worker's in-memory buffer before it is spilled to a part file. | +| `LANCE_FTS_TARGET_SIZE` | 4096 (MiB) | Target uncompressed size for merged output partitions. Fewer, larger partitions improve query performance. 
| + +### Memory and Performance Considerations + +Memory usage is primarily determined by two factors: + +- **`LANCE_FTS_NUM_SHARDS`** -- Each worker holds an independent in-memory buffer. Peak memory is roughly `NUM_SHARDS * PARTITION_SIZE` plus the overhead of token dictionaries and posting list structures. +- **`LANCE_FTS_PARTITION_SIZE`** -- Larger values reduce the number of part files and make the merge phase cheaper. Smaller values reduce per-worker memory at the cost of more part files. + +Merge phase memory is bounded by the streaming approach: part files are loaded one at a time with a small concurrency buffer. The merged partition's in-memory size is bounded by `LANCE_FTS_TARGET_SIZE`. + +Building an FTS index requires temporary disk space to store the part files generated during tokenization. The amount of temporary space depends heavily on whether position information is enabled. An index with `with_position: true` stores the position of every token occurrence in every document, which can easily require 10x the size of the original column or more in temporary disk space. An index without positions tends to be smaller than the original column and will typically need less than 2x the size of the column in total disk space. + +Performance tips: + +- Larger `LANCE_FTS_TARGET_SIZE` produces fewer output partitions, which is beneficial for query performance because queries must scan every partition's token dictionary. When memory allows, prefer fewer, larger partitions. +- `with_position: true` significantly increases index size because term positions are stored for every occurrence. Only enable it when phrase queries are needed. +- The ngram tokenizer generates many more tokens per document than word-level tokenizers, so expect larger index sizes and higher memory usage. + +### Distributed Training + +The FTS index supports distributed training where different worker nodes each index a subset of the data and the results are assembled afterward. + +1. 
Each distributed worker is assigned a **fragment mask** (`(fragment_id as u64) << 32`) that is OR'd into the partition IDs it generates, ensuring globally unique IDs across workers. +2. Workers set `skip_merge: true` so they write their part files directly without running the merge phase. +3. Instead of a single `metadata.lance`, each worker writes per-partition metadata files named `part__metadata.lance`. +4. After all workers finish, a coordinator merges the metadata files: it collects all partition IDs, remaps them to a sequential range starting from 0 (renaming the corresponding data files), and writes the final unified `metadata.lance`. + +This allows each worker to operate independently during the tokenization phase. Only the final metadata merge requires a single-node step, and it is lightweight since it only renames files and writes a small metadata file. + ## Accelerated Queries Lance SDKs provide dedicated full text search APIs to leverage the FTS index capabilities. From 46fe8c0bc7e8c2742f26fa6493853090140c2ad1 Mon Sep 17 00:00:00 2001 From: Weston Pace Date: Tue, 24 Feb 2026 09:25:10 -0800 Subject: [PATCH 2/2] Add link to FTS index training details from quickstart. Fix various broken links. --- docs/src/format/table/index/index.md | 36 ++-- docs/src/format/table/index/system/mem_wal.md | 2 +- docs/src/format/table/index/vector/index.md | 75 +++---- docs/src/format/table/mem_wal.md | 186 +++++++++--------- docs/src/guide/performance.md | 2 +- docs/src/quickstart/full-text-search.md | 17 +- docs/src/quickstart/versioning.md | 4 +- 7 files changed, 164 insertions(+), 158 deletions(-) diff --git a/docs/src/format/table/index/index.md b/docs/src/format/table/index/index.md index 54827c71ba5..767d92c6c9c 100644 --- a/docs/src/format/table/index/index.md +++ b/docs/src/format/table/index/index.md @@ -1,12 +1,11 @@ - # Indices in Lance Lance supports three main categories of indices to accelerate data access: scalar indices, vector indices, and system indices. 
-**Scalar indices** are traditional indices that speed up queries on scalar data types, such as +**Scalar indices** are traditional indices that speed up queries on scalar data types, such as integers and strings. Examples include [B-trees](scalar/btree.md) and -[full-text search indices](scalar/full_text.md). Typically, scalar indices receive a +[full-text search indices](scalar/fts.md). Typically, scalar indices receive a query predicate, such as equality or range conditions, and output a set of row addresses that satisfy the predicate. @@ -59,7 +58,7 @@ all rows in the fragments they cover, with one exception: if a fragment has dele of index creation, the index segment is allowed to not contain the deleted rows. The fragments an index covers are those recorded in the `fragment_bitmap` field. -Index segments together **do not** need to cover all fragments. This means an index isn't required to +Index segments together **do not** need to cover all fragments. This means an index isn't required to be fully up-to-date. When this happens, engines can split their queries into indexed and unindexed subplans and merge the results. @@ -71,13 +70,13 @@ subplans and merge the results. Consider the example dataset in the figure above: -* The dataset contains three fragments with ids 0, 1, 2. Fragment 1 has 10 deleted rows, indicated +- The dataset contains three fragments with ids 0, 1, 2. Fragment 1 has 10 deleted rows, indicated by the deletion file. -* There is an index called "id_idx", which has two segments: one covering fragments 0 and another covering +- There is an index called "id_idx", which has two segments: one covering fragments 0 and another covering fragment 1. Fragment 2 is not covered by the index. Queries using this index will need to query both segments and then scan fragment 2 directly. Additionally, when querying the segment covering fragment 1, the engine will need to filter out the 10 deleted rows. 
-* There is another index called "vec_idx", which has a single segment covering all three fragments. +- There is another index called "vec_idx", which has a single segment covering all three fragments. Because it covers all fragments, queries using this index do not need to scan any fragments directly. They do, however, need to filter out the 10 deleted rows from fragment 1. @@ -99,13 +98,13 @@ Index segments are created and updated through a transactional process: directory, where `{UUID}` is a newly generated unique identifier. 2. **Prepare the metadata**: Create an `IndexMetadata` message with: - - `uuid`: The newly generated UUID - - `name`: The index name (must match existing segments if adding to an existing index) - - `fields`: The column(s) being indexed - - `fragment_bitmap`: The set of fragment IDs covered by this segment - - `index_details`: Index-specific configuration and parameters - - `version`: The format version of this index type - - See the full protobuf definition in [table.proto](https://github.com/lance-format/lance/blob/main/protos/table.proto). + - `uuid`: The newly generated UUID + - `name`: The index name (must match existing segments if adding to an existing index) + - `fields`: The column(s) being indexed + - `fragment_bitmap`: The set of fragment IDs covered by this segment + - `index_details`: Index-specific configuration and parameters + - `version`: The format version of this index type + - See the full protobuf definition in [table.proto](https://github.com/lance-format/lance/blob/main/protos/table.proto). 3. **Commit the transaction**: Write a new manifest that includes the new index segment in its `IndexSection`. This is done atomically using the same transaction mechanism @@ -149,13 +148,12 @@ When loading an index: The `IndexMetadata` message contains important information about the index segment: -* `uuid`: the unique identifier of the index segment. -* `fields`: the column(s) the index is built on. 
-* `fragment_bitmap`: the set of fragment IDs covered by this index segment. -* `index_details`: a protobuf `Any` message that contains index-specific details, such as index type, +- `uuid`: the unique identifier of the index segment. +- `fields`: the column(s) the index is built on. +- `fragment_bitmap`: the set of fragment IDs covered by this index segment. +- `index_details`: a protobuf `Any` message that contains index-specific details, such as index type, parameters, and storage format. This allows different index types to store their own metadata. -
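The `fragment_bitmap` coverage rule, together with the indexed/unindexed query split described earlier on this page, can be sketched as follows. This is a hypothetical helper, not the Lance API, and plain Python sets stand in for the serialized fragment bitmaps:

```python
def split_query_plan(all_fragments, segment_bitmaps):
    """Route fragments covered by any index segment to the indexed subplan;
    the remaining fragments must be scanned directly."""
    covered = set()
    for bitmap in segment_bitmaps:       # union of every segment's fragment_bitmap
        covered |= bitmap
    indexed = covered & set(all_fragments)
    unindexed = set(all_fragments) - covered
    return indexed, unindexed
```

For the example dataset earlier on this page (segments covering fragments 0 and 1, fragment 2 uncovered), this routes fragments 0 and 1 through the index and scans fragment 2 directly.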
Full protobuf definitions diff --git a/docs/src/format/table/index/system/mem_wal.md b/docs/src/format/table/index/system/mem_wal.md index 73693cd6704..f9169bcfb76 100644 --- a/docs/src/format/table/index/system/mem_wal.md +++ b/docs/src/format/table/index/system/mem_wal.md @@ -9,4 +9,4 @@ For the complete specification, see: - [MemWAL Index Overview](../../mem_wal.md#memwal-index) - Purpose and high-level description - [MemWAL Index Details](../../mem_wal.md#memwal-index-details) - Storage format, schemas, and staleness handling -- [MemWAL Index Builder](../../mem_wal.md#memwal-index-builder) - Background process and configuration updates +- [MemWAL Implementation](../../mem_wal.md#implementation-expectation) - Implementation details and expectations diff --git a/docs/src/format/table/index/vector/index.md b/docs/src/format/table/index/vector/index.md index f987c6a675e..51365ce110a 100644 --- a/docs/src/format/table/index/vector/index.md +++ b/docs/src/format/table/index/vector/index.md @@ -1,6 +1,6 @@ # Vector Indices -Lance provides a powerful and extensible secondary index system for efficient vector similarity search. +Lance provides a powerful and extensible secondary index system for efficient vector similarity search. All vector indices are stored as regular Lance files, making them portable and easy to manage. It is designed for efficient similarity search across large-scale vector datasets. @@ -12,7 +12,7 @@ Lance splits each vector index into 3 parts - clustering, sub-index and quantiza Clustering divides all the vectors into different disjoint clusters (a.k.a. partitions). Lance currently supports using Inverted File (IVF) as the primary clustering mechanism. -IVF partitions the vectors into clusters using the k-means clustering algorithm. +IVF partitions the vectors into clusters using the k-means clustering algorithm. Each cluster contains vectors that are similar to the cluster centroid. 
During search, only the most relevant clusters are examined, dramatically reducing search time. IVF can be combined with any sub-index type and quantization method. @@ -51,7 +51,7 @@ Here are the commonly used combinations: The Lance vector index format has gone through 3 versions so far. This document currently only records version 3 which is the latest version. -The specific version of the vector index is recorded in the `index_version` field of the generic [index metadata](../index.md#index-metadata). +The specific version of the vector index is recorded in the `index_version` field of the generic [index metadata](../index.md#loading-an-index). ## Storage Layout (V3) @@ -68,7 +68,7 @@ The index file stores the search structure with graph or flat organization. The Arrow schema of the Lance file varies depending on the sub-index type used. !!! note - All partitions are stored in the same file, and partitions must be written in order. +All partitions are stored in the same file, and partitions must be written in order. ##### FLAT @@ -89,7 +89,7 @@ HNSW (Hierarchical Navigable Small World) indices provide fast approximate searc | `_distance` | list | false | Distances to neighbors | !!! note - HNSW consists of multiple levels, and all levels must be written in order starting from level 0. +HNSW consists of multiple levels, and all levels must be written in order starting from level 0. #### Arrow Schema Metadata @@ -111,8 +111,8 @@ References the IVF metadata stored in the Lance file global buffer. This value records the global buffer index, currently this is always "1". !!! note - Global buffer indices in Lance files are 1-based, - so you need to subtract 1 when accessing them through code. +Global buffer indices in Lance files are 1-based, +so you need to subtract 1 when accessing them through code. 
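The note above can be made concrete with a small reader-side sketch. The helper name and dict/list shapes are hypothetical; only the subtract-by-one convention comes from the spec:

```python
def load_ivf_global_buffer(schema_metadata, global_buffers):
    """The "lance:ivf" schema-metadata value is a 1-based global-buffer index,
    so convert it to 0-based before indexing the buffer list in code."""
    one_based = int(schema_metadata["lance:ivf"])  # currently always "1"
    return global_buffers[one_based - 1]           # 0-based access
```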
##### "lance:flat" @@ -159,7 +159,7 @@ Since the auxiliary file stores the actual (quantized) vectors, the Arrow schema of the Lance file varies depending on the quantization method used. !!! note - All partitions are stored in the same file, and partitions must be written in order. +All partitions are stored in the same file, and partitions must be written in order. ##### FLAT @@ -205,11 +205,12 @@ The auxiliary file also contains metadata in its Arrow schema metadata for vecto Here are the metadata keys and their corresponding values: ##### "distance_type" + The distance metric used to compute similarity between vectors (e.g., "l2", "cosine", "dot"). ##### "lance:ivf" -Similar to the index file's "lance:ivf" but focused on vector storage layout. +Similar to the index file's "lance:ivf" but focused on vector storage layout. This doesn't contain the partitions' centroids. It's only used for tracking each partition's offset and length in the auxiliary file. @@ -254,7 +255,7 @@ For **RabitQ (RQ)**: ##### Quantization Codebook -For product quantization, the codebook is stored in `Tensor` format +For product quantization, the codebook is stored in `Tensor` format in the auxiliary file's global buffer for efficient access: ```protobuf @@ -264,7 +265,7 @@ in the auxiliary file's global buffer for efficient access: ##### Rotation Matrix For RabitQ, the rotation matrix is stored in `Tensor` format -in the auxiliary file's global buffer. The rotation matrix is an orthogonal matrix used +in the auxiliary file's global buffer. 
The rotation matrix is an orthogonal matrix used to rotate vectors before binary quantization: ```protobuf @@ -283,26 +284,26 @@ PQ uses 16 num_sub_vectors (m=16) with 8 num_bits per subvector, and distance ty #### Index File - Arrow Schema Metadata: - - `"lance:index"` → `{ "type": "IVF_PQ", "distance_type": "l2" }` - - `"lance:ivf"` → "1" (references IVF metadata in the global buffer) - - `"lance:flat"` → `["", "", ...]` (one empty string per partition; IVF_PQ uses a FLAT sub-index inside each partition) + - `"lance:index"` → `{ "type": "IVF_PQ", "distance_type": "l2" }` + - `"lance:ivf"` → "1" (references IVF metadata in the global buffer) + - `"lance:flat"` → `["", "", ...]` (one empty string per partition; IVF_PQ uses a FLAT sub-index inside each partition) - Lance File Global buffer (Protobuf): - - `Ivf` message containing: - - `centroids_tensor`: shape `[num_partitions, 128]` (float32) - - `offsets`: start offset (row) of each partition in `auxiliary.idx` - - `lengths`: number of vectors in each partition - - `loss`: k-means loss (optional) + - `Ivf` message containing: + - `centroids_tensor`: shape `[num_partitions, 128]` (float32) + - `offsets`: start offset (row) of each partition in `auxiliary.idx` + - `lengths`: number of vectors in each partition + - `loss`: k-means loss (optional) #### Auxiliary File - Arrow Schema Metadata: - - `"distance_type"` → `"l2"` - - `"lance:ivf"` → tracks per-partition `offsets` and `lengths` (no centroids here) - - `"storage_metadata"` → `[ "{"pq":{"num_sub_vectors":16,"nbits":8,"dimension":128,"transposed":true}}" ]` + - `"distance_type"` → `"l2"` + - `"lance:ivf"` → tracks per-partition `offsets` and `lengths` (no centroids here) + - `"storage_metadata"` → `[ "{"pq":{"num_sub_vectors":16,"nbits":8,"dimension":128,"transposed":true}}" ]` - Lance File Global buffer: - - `Tensor` codebook with shape `[256, num_sub_vectors, dim/num_sub_vectors]` = `[256, 16, 8]` (float32) -- Rows with Arrow schema: + - `Tensor` codebook with 
shape `[256, num_sub_vectors, dim/num_sub_vectors]` = `[256, 16, 8]` (float32) +- Rows with Arrow schema: ```python pa.schema([ @@ -319,26 +320,26 @@ RQ uses 1 bit per dimension (num_bits=1), and distance type is "l2". #### Index File - Arrow Schema Metadata: - - `"lance:index"` → `{ "type": "IVF_RQ", "distance_type": "l2" }` - - `"lance:ivf"` → "1" (references IVF metadata in the global buffer) - - `"lance:flat"` → `["", "", ...]` (one empty string per partition; IVF_RQ uses a FLAT sub-index inside each partition) + - `"lance:index"` → `{ "type": "IVF_RQ", "distance_type": "l2" }` + - `"lance:ivf"` → "1" (references IVF metadata in the global buffer) + - `"lance:flat"` → `["", "", ...]` (one empty string per partition; IVF_RQ uses a FLAT sub-index inside each partition) - Lance File Global buffer (Protobuf): - - `Ivf` message containing: - - `centroids_tensor`: shape `[num_partitions, 128]` (float32) - - `offsets`: start offset (row) of each partition in `auxiliary.idx` - - `lengths`: number of vectors in each partition - - `loss`: k-means loss (optional) + - `Ivf` message containing: + - `centroids_tensor`: shape `[num_partitions, 128]` (float32) + - `offsets`: start offset (row) of each partition in `auxiliary.idx` + - `lengths`: number of vectors in each partition + - `loss`: k-means loss (optional) #### Auxiliary File - Arrow Schema Metadata: - - `"distance_type"` → `"l2"` - - `"lance:ivf"` → tracks per-partition `offsets` and `lengths` (no centroids here) - - `"lance:rabit"` → `"{"rotate_mat_position":1,"num_bits":1,"packed":true}"` + - `"distance_type"` → `"l2"` + - `"lance:ivf"` → tracks per-partition `offsets` and `lengths` (no centroids here) + - `"lance:rabit"` → `"{"rotate_mat_position":1,"num_bits":1,"packed":true}"` - Lance File Global buffer: - - `Tensor` rotation matrix with shape `[code_dim, code_dim]` = `[128, 128]` (float32) -- Rows with Arrow schema: + - `Tensor` rotation matrix with shape `[code_dim, code_dim]` = `[128, 128]` (float32) +- Rows 
with Arrow schema: ```python pa.schema([ diff --git a/docs/src/format/table/mem_wal.md b/docs/src/format/table/mem_wal.md index 806dd3bdd40..5e6907038fb 100644 --- a/docs/src/format/table/mem_wal.md +++ b/docs/src/format/table/mem_wal.md @@ -32,10 +32,10 @@ If two regions contain rows with the same primary key, the following scenario ca 5. The row from Region A (older) now overwrites the row from Region B (newer) This violates the expected "last write wins" semantics. -By ensuring each primary key is assigned to exactly one region via the region spec, +By ensuring each primary key is assigned to exactly one region via the region spec, merge order between regions becomes irrelevant for correctness. -See [MemWAL Region Architecture](#memwal-region-architecture) for the complete region architecture. +See [MemWAL Region Architecture](#region-architecture) for the complete region architecture. ### MemWAL Index @@ -66,7 +66,7 @@ The MemTable is periodically **flushed** to storage based on memory pressure and ### MemTable -A MemTable holds rows inserted into the region before flushing to storage. +A MemTable holds rows inserted into the region before flushing to storage. It serves 2 purposes: 1. build up data and related indexes to be flushed to storage as a flushed MemTable @@ -85,7 +85,7 @@ and each write into the MemTable is a new Arrow record batch. #### MemTable Generation -Based on conditions like memory limit and durability requirements, +Based on conditions like memory limit and durability requirements, a MemTable needs to be **flushed** to storage and discarded. When that happens, new writes go to a new MemTable and the cycle repeats. Each MemTable is assigned a monotonically increasing generation number starting from 1. @@ -145,12 +145,12 @@ A flushed MemTable is created by flushing the MemTable to storage. In Lance MemWAL spec, a flushed MemTable must be a Lance table following the Lance table format spec. 
!!!note - This is called Sorted String Table (SSTable) or Sorted Run in many LSM-tree literatures and implementations. - However, since our MemTable is not sorted, we just use the term flushed MemTable to avoid confusion. +This is called Sorted String Table (SSTable) or Sorted Run in many LSM-tree literatures and implementations. +However, since our MemTable is not sorted, we just use the term flushed MemTable to avoid confusion. #### Flushed MemTable Storage Layout -The MemTable of generation `i` is flushed to `_mem_wal/{region_uuid}/{random_hex}_gen_{i}/` directory, +The MemTable of generation `i` is flushed to `_mem_wal/{region_uuid}/{random_hex}_gen_{i}/` directory, where `{random_hex}` is a random 8-character hex value generated at flush time. The random hex value is necessary to ensure if one MemTable flush attempt fails, The retry can use another directory. @@ -158,7 +158,7 @@ The content within the generation directory follows the [Lance table storage lay #### Merging MemTable to Base Table -Generation numbers determine merge order of flushed MemTable into base table: +Generation numbers determine merge order of flushed MemTable into base table: lower numbers represent older data and must be merged to the base table first to preserve correct upsert semantics. Within a single flushed MemTable, if there are multiple rows of the same primary key, @@ -193,7 +193,7 @@ The manifest is serialized as a protobuf binary file using the `RegionManifest` #### Region Manifest Versioning -Manifests are versioned starting from 1 and immutable. +Manifests are versioned starting from 1 and immutable. Each update creates a new manifest file at the next version number. Updates use put-if-not-exists or file rename to ensure atomicity depending on the storage system. If two processes compete, one wins and the other retries. @@ -212,7 +212,7 @@ To read the latest manifest version: 4. 
The latest version is the last found version !!!note - This works because the write rate to region manifests is significantly lower than read rates. Region manifests are only updated when region metadata changes (MemTable flush), not on every write. This ensures HEAD requests will eventually terminate and find the latest version. +This works because the write rate to region manifests is significantly lower than read rates. Region manifests are only updated when region metadata changes (MemTable flush), not on every write. This ensures HEAD requests will eventually terminate and find the latest version. #### Region Manifest Storage Layout @@ -235,17 +235,17 @@ The index stores its data in two parts: The `index_details` field in `IndexMetadata` contains a `MemWalIndexDetails` protobuf message with the following key fields: - **Configuration fields** (`region_specs`, `maintained_indexes`) are the source of truth for MemWAL configuration. -Writers read these fields to determine how to partition data and which indexes to maintain. + Writers read these fields to determine how to partition data and which indexes to maintain. - **Merge progress** (`merged_generations`) tracks the last generation merged to the base table for each region. -This field is updated atomically with merge-insert data commits, enabling conflict resolution when multiple mergers operate concurrently. -Each entry contains the region UUID and generation number. + This field is updated atomically with merge-insert data commits, enabling conflict resolution when multiple mergers operate concurrently. + Each entry contains the region UUID and generation number. - **Index catchup progress** (`index_catchup`) tracks which merged generation each base table index has been rebuilt to cover. -When data is merged from a flushed MemTable to the base table, the base table's indexes may be rebuilt asynchronously. 
-During this window, queries should use the flushed MemTable's pre-built indexes instead of scanning unindexed data in the base table. -See [Indexed Read Plan](#indexed-read-plan) for details. + When data is merged from a flushed MemTable to the base table, the base table's indexes may be rebuilt asynchronously. + During this window, queries should use the flushed MemTable's pre-built indexes instead of scanning unindexed data in the base table. + See [Indexed Read Plan](#indexed-read-plan) for details. - **Region snapshot fields** (`snapshot_ts_millis`, `num_regions`, `inline_snapshots`) provide a snapshot of region states. -The actual region manifests remain authoritative for region state. -When `num_regions` is 0, the `inline_snapshots` field may be `None` or an empty Lance file with 0 rows but proper schema. + The actual region manifests remain authoritative for region state. + When `num_regions` is 0, the `inline_snapshots` field may be `None` or an empty Lance file with 0 rows but proper schema.
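As an illustration of how a merger might consume the `merged_generations` progress field, consider the following hedged sketch (the helper name is hypothetical; the ordering requirement comes from the merge rules elsewhere in this spec):

```python
def pending_merges(flushed_generations, last_merged_generation):
    """Generations above the recorded merge progress still need merging into
    the base table, oldest first, to preserve last-write-wins semantics."""
    return sorted(g for g in flushed_generations if g > last_merged_generation)
```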
MemWalIndexDetails protobuf message @@ -277,13 +277,13 @@ Regions without a spec ID (`spec_id = 0`) are manually-created regions not gover A region spec's field array consists of **region field** definitions. Each region field has the following properties: -| Property | Description | -|----------|-------------| -| `field_id` | Unique string identifier for this region field | -| `source_ids` | Array of field IDs referencing source columns in the schema | -| `transform` | A well-known region expression, specify this or `expression` | -| `expression` | A DataFusion SQL expression for custom logic, specify this or `transform` | -| `result_type` | The output type of the region value | +| Property | Description | +| ------------- | ------------------------------------------------------------------------- | +| `field_id` | Unique string identifier for this region field | +| `source_ids` | Array of field IDs referencing source columns in the schema | +| `transform` | A well-known region expression, specify this or `expression` | +| `expression` | A DataFusion SQL expression for custom logic, specify this or `transform` | +| `result_type` | The output type of the region value | #### Region Expression @@ -304,16 +304,16 @@ Region expressions must satisfy the following requirements: A **Region Transform** is a well-known region expression with a predefined name. When a transform is specified, the expression is derived automatically. 
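Two of the transforms in the table below can be sketched as follows. This is illustrative only: the real `bucket` transform requires the Murmur3 x86 32-bit hash, for which any deterministic 32-bit hash is merely a stand-in here:

```python
def truncate(value, width):
    """`truncate`: left(col0, W) for strings, col0 - (col0 % W) for numerics."""
    if isinstance(value, str):
        return value[:width]
    return value - (value % width)

def bucket(value, num_buckets, hash32):
    """`bucket`: abs(hash(col0)) % N -- always a non-negative bucket id.
    `hash32` must be Murmur3 (x86, 32-bit) in a conforming implementation."""
    return abs(hash32(value)) % num_buckets
```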
-| Transform | Parameters | Region Expression | Result Type |
-|-----------|------------|-------------------|-------------|
-| `identity` | (none) | `col0` | same as source |
-| `year` | (none) | `date_part('year', col0)` | `int32` |
-| `month` | (none) | `date_part('month', col0)` | `int32` |
-| `day` | (none) | `date_part('day', col0)` | `int32` |
-| `hour` | (none) | `date_part('hour', col0)` | `int32` |
-| `bucket` | `num_buckets` | `abs(murmur3(col0)) % N` | `int32` |
-| `multi_bucket` | `num_buckets` | `abs(murmur3_multi(col0, col1, ...)) % N` | `int32` |
-| `truncate` | `width` | `left(col0, W)` (string) or `col0 - (col0 % W)` (numeric) | same as source |
+| Transform      | Parameters    | Region Expression                                         | Result Type    |
+| -------------- | ------------- | --------------------------------------------------------- | -------------- |
+| `identity`     | (none)        | `col0`                                                    | same as source |
+| `year`         | (none)        | `date_part('year', col0)`                                 | `int32`        |
+| `month`        | (none)        | `date_part('month', col0)`                                | `int32`        |
+| `day`          | (none)        | `date_part('day', col0)`                                  | `int32`        |
+| `hour`         | (none)        | `date_part('hour', col0)`                                 | `int32`        |
+| `bucket`       | `num_buckets` | `abs(murmur3(col0)) % N`                                  | `int32`        |
+| `multi_bucket` | `num_buckets` | `abs(murmur3_multi(col0, col1, ...)) % N`                 | `int32`        |
+| `truncate`     | `width`       | `left(col0, W)` (string) or `col0 - (col0 % W)` (numeric) | same as source |
 
 The `bucket` and `multi_bucket` transforms use Murmur3 hash functions:
 
@@ -326,10 +326,10 @@ The hash result is wrapped with `abs()` and modulo `N` to produce a non-negative
 
 Region snapshots are stored using one of two strategies based on the number of regions:
 
-| Region Count | Storage Strategy | Location |
-|--------------|------------------|----------|
-| <= 100 (threshold) | Inline | `inline_snapshots` field in index details |
-| > 100 | External Lance file | `_indices/{UUID}/index.lance` |
+| Region Count       | Storage Strategy    | Location                                  |
+| ------------------ | ------------------- | ----------------------------------------- |
+| <= 100 (threshold) | Inline              | `inline_snapshots` field in index details |
+| > 100              | External Lance file | `_indices/{UUID}/index.lance`             |
 
 The threshold (100 regions) is implementation-defined and may vary.
 
@@ -344,23 +344,23 @@ This file uses standard Lance format with the region snapshot schema, enabling e
 
 Region snapshots are stored as a Lance file with one row per region. The schema has one column per `RegionManifest` field plus region spec columns:
 
-| Column | Type | Description |
-|--------|------|-------------|
-| `region_id` | `fixed_size_binary(16)` | Region UUID bytes |
-| `version` | `uint64` | Region manifest version |
-| `region_spec_id` | `uint32` | Region spec ID (0 if manual) |
-| `writer_epoch` | `uint64` | Writer fencing token |
-| `replay_after_wal_entry_position` | `uint64` | Last WAL entry position (0-based) flushed to MemTable |
-| `wal_entry_position_last_seen` | `uint64` | Last WAL entry position (0-based) seen (hint) |
-| `current_generation` | `uint64` | Next generation to flush |
-| `flushed_generations` | `list>` | Flushed MemTable paths |
-| `region_field_{field_id}` | varies | Region field value (one column per field in region spec) |
+| Column                            | Type                                             | Description                                              |
+| --------------------------------- | ------------------------------------------------ | -------------------------------------------------------- |
+| `region_id`                       | `fixed_size_binary(16)`                          | Region UUID bytes                                        |
+| `version`                         | `uint64`                                         | Region manifest version                                  |
+| `region_spec_id`                  | `uint32`                                         | Region spec ID (0 if manual)                             |
+| `writer_epoch`                    | `uint64`                                         | Writer fencing token                                     |
+| `replay_after_wal_entry_position` | `uint64`                                         | Last WAL entry position (0-based) flushed to MemTable    |
+| `wal_entry_position_last_seen`    | `uint64`                                         | Last WAL entry position (0-based) seen (hint)            |
+| `current_generation`              | `uint64`                                         | Next generation to flush                                 |
+| `flushed_generations`             | `list>`                                          | Flushed MemTable paths                                   |
+| `region_field_{field_id}`         | varies                                           | Region field value (one column per field in region spec) |
 
 For example, with a region spec containing a field `user_bucket` of type `int32`:
 
-| Column | Type | Description |
-|--------|------|-------------|
-| ... | ... | (base columns above) |
+| Column                     | Type    | Description                  |
+| -------------------------- | ------- | ---------------------------- |
+| ...                        | ...     | (base columns above)         |
 | `region_field_user_bucket` | `int32` | Bucket value for this region |
 
 This schema directly corresponds to the fields in the `RegionManifest` protobuf message plus the computed region field values.
@@ -483,7 +483,7 @@ Without proper merging, queries would return duplicate or stale rows.
 
 ### Reader Consistency
 
-Reader consistency depends on two factors: 
+Reader consistency depends on two factors:
 1. access to in-memory MemTables
 2. the source of region metadata (either through MemWAL index or region manifests)
 
@@ -492,8 +492,8 @@ Strong consistency requires access to in-memory MemTables for all regions involv
 Otherwise, the query is eventually consistent due to missing unflushed data or stale MemWAL Index snapshots.
 
 !!!note
-    Reading a stale MemWAL Index does not impact correctness, only freshness: 
-
+Reading a stale MemWAL Index does not impact correctness, only freshness:
+
 - **Merged MemTable still in index**: If a flushed MemTable has been merged to the base table but still shows in the MemWAL index, readers query both. This results in some inefficiency for querying the same data twice, but [LSM-tree merging](#lsm-tree-merging-read) ensures correct results since both contain the same data. The inefficiency is also compensated by the fact that the data is covered by index and we rarely end up scanning both data.
 - **Garbage collected MemTable still in index**: If a flushed MemTable has been garbage collected, but is still in the MemWAL index, readers would fail to open it and skip it. This is also safe because if it is garbage collected, the data must already exist in the base table.
 - **Newly flushed MemTable not in index**: If a newly flushed MemTable is added after the snapshot was built, it is not queried. The result is eventually consistent but correct for the snapshot's point in time.
@@ -537,16 +537,16 @@ Region pruning applies to both scan queries and prefilters in search queries.
 
 #### Indexed Read Plan
 
-When data is merged from a flushed MemTable to the base table, the base table's indexes are rebuilt asynchronously by the [base table index builders](#base-table-index-builder).
+When data is merged from a flushed MemTable to the base table, the base table's indexes are rebuilt asynchronously by the base table index builders.
 During this window, the merged data exists in the base table but is not yet covered by the base table's indexes. Without special handling, indexed queries would fall back to expensive full scans for the unindexed part of the base table.
 
 To maintain indexed read performance, the query planner should use `index_catchup` progress to determine the optimal data source for each query. The key insight is that flushed MemTables serve as a bridge between the base table's index catchup and the current merged state.
 
-For a query that requires a specific index for acceleration, when `index_gen < merged_gen`, 
+For a query that requires a specific index for acceleration, when `index_gen < merged_gen`,
 the generations in the gap `(index_gen, merged_gen]` have data already merged in the base table but are not covered by the base table's index.
-Since flushed MemTables contain pre-built indexes (created during [MemTable flush](#memtable-flush)), queries can use these indexes instead of scanning unindexed data in the base table.
+Since flushed MemTables contain pre-built indexes (created during [MemTable flush](#flushed-memtable)), queries can use these indexes instead of scanning unindexed data in the base table.
 This ensures all reads remain indexed regardless of how far behind the async index builder is.
 
 ## Appendices
 
@@ -566,18 +566,18 @@ Region manifest (version 1):
 
 #### Scenario
 
-| Step | Writer A | Writer B | Manifest State |
-|------|----------|----------|----------------|
-| 1 | Loads manifest, sees epoch=5 | | epoch=5, version=1 |
-| 2 | Increments to epoch=6, writes manifest v2 | | epoch=6, version=2 |
-| 3 | Starts writing WAL entries 13, 14, 15 | | |
-| 4 | | Loads manifest v2, sees epoch=6 | epoch=6, version=2 |
-| 5 | | Increments to epoch=7, writes manifest v3 | epoch=7, version=3 |
-| 6 | | Starts writing WAL entries 16, 17 | |
-| 7 | Tries to flush MemTable, loads manifest | | |
-| 8 | Sees epoch=7, but local epoch=6 | | |
-| 9 | **Writer A is fenced!** Aborts all operations | | |
-| 10 | | Continues writing normally | epoch=7, version=3 |
+| Step | Writer A                                      | Writer B                                  | Manifest State     |
+| ---- | --------------------------------------------- | ----------------------------------------- | ------------------ |
+| 1    | Loads manifest, sees epoch=5                  |                                           | epoch=5, version=1 |
+| 2    | Increments to epoch=6, writes manifest v2     |                                           | epoch=6, version=2 |
+| 3    | Starts writing WAL entries 13, 14, 15         |                                           |                    |
+| 4    |                                               | Loads manifest v2, sees epoch=6           | epoch=6, version=2 |
+| 5    |                                               | Increments to epoch=7, writes manifest v3 | epoch=7, version=3 |
+| 6    |                                               | Starts writing WAL entries 16, 17         |                    |
+| 7    | Tries to flush MemTable, loads manifest       |                                           |                    |
+| 8    | Sees epoch=7, but local epoch=6               |                                           |                    |
+| 9    | **Writer A is fenced!** Aborts all operations |                                           |                    |
+| 10   |                                               | Continues writing normally                | epoch=7, version=3 |
 
 #### What Happens to Writer A's WAL Entries?
 
@@ -622,19 +622,19 @@ Region manifest (version 1):
 
 Two mergers both try to merge generation 6 concurrently.
 
-| Step | Merger A | Merger B | MemWAL Index |
-|------|----------|----------|--------------|
-| 1 | Reads index: merged_gen=5 | | merged_gen=5 |
-| 2 | Reads region manifest | | |
-| 3 | Starts merging gen 6 | | |
-| 4 | | Reads index: merged_gen=5 | merged_gen=5 |
-| 5 | | Reads region manifest | |
-| 6 | | Starts merging gen 6 | |
-| 7 | Commits (merged_gen=6) | | **merged_gen=6** |
-| 8 | | Tries to commit | |
-| 9 | | **Conflict**: reads new index | |
-| 10 | | Sees merged_gen=6 >= 6, aborts | |
-| 11 | | Reloads, continues to gen 7 | |
+| Step | Merger A                  | Merger B                       | MemWAL Index     |
+| ---- | ------------------------- | ------------------------------ | ---------------- |
+| 1    | Reads index: merged_gen=5 |                                | merged_gen=5     |
+| 2    | Reads region manifest     |                                |                  |
+| 3    | Starts merging gen 6      |                                |                  |
+| 4    |                           | Reads index: merged_gen=5      | merged_gen=5     |
+| 5    |                           | Reads region manifest          |                  |
+| 6    |                           | Starts merging gen 6           |                  |
+| 7    | Commits (merged_gen=6)    |                                | **merged_gen=6** |
+| 8    |                           | Tries to commit                |                  |
+| 9    |                           | **Conflict**: reads new index  |                  |
+| 10   |                           | Sees merged_gen=6 >= 6, aborts |                  |
+| 11   |                           | Reloads, continues to gen 7    |                  |
 
 Merger B's conflict resolution detected that generation 6 was already merged by checking the MemWAL Index in the conflicting commit.
 
@@ -642,15 +642,15 @@
 
 Merger A crashes after committing to the table.
 
-| Step | Merger A | Merger B | MemWAL Index |
-|------|----------|----------|--------------|
-| 1 | Reads index: merged_gen=5 | | merged_gen=5 |
-| 2 | Merges gen 6, commits | | **merged_gen=6** |
-| 3 | **CRASH** | | merged_gen=6 |
-| 4 | | Reads index: merged_gen=6 | merged_gen=6 |
-| 5 | | Reads region manifest | |
-| 6 | | **Skips gen 6** (already merged) | |
-| 7 | | Merges gen 7, commits | **merged_gen=7** |
+| Step | Merger A                  | Merger B                         | MemWAL Index     |
+| ---- | ------------------------- | -------------------------------- | ---------------- |
+| 1    | Reads index: merged_gen=5 |                                  | merged_gen=5     |
+| 2    | Merges gen 6, commits     |                                  | **merged_gen=6** |
+| 3    | **CRASH**                 |                                  | merged_gen=6     |
+| 4    |                           | Reads index: merged_gen=6        | merged_gen=6     |
+| 5    |                           | Reads region manifest            |                  |
+| 6    |                           | **Skips gen 6** (already merged) |                  |
+| 7    |                           | Merges gen 7, commits            | **merged_gen=7** |
 
 The MemWAL Index is the single source of truth. Merger B correctly used it to determine that generation 6 was already merged.
 
diff --git a/docs/src/guide/performance.md b/docs/src/guide/performance.md
index a23363b1115..ca1458834f1 100644
--- a/docs/src/guide/performance.md
+++ b/docs/src/guide/performance.md
@@ -64,7 +64,7 @@ debugging query performance.
 
 Lance is designed to be thread-safe and performant. Lance APIs can be called concurrently unless explicitly stated otherwise. Users may create multiple tables and share tables between threads. Operations may run in parallel on the same table, but some operations may lead to conflicts. For
-details see [conflict resolution](../format/index.md#conflict-resolution).
+details see [conflict resolution](../format/table/transaction.md#conflict-resolution).
 
 Most Lance operations will use multiple threads to perform work in parallel. There are two thread pools in lance: the IO thread pool and the compute thread pool.
 The IO thread pool is used for
diff --git a/docs/src/quickstart/full-text-search.md b/docs/src/quickstart/full-text-search.md
index 2ed477f8593..829397dd676 100644
--- a/docs/src/quickstart/full-text-search.md
+++ b/docs/src/quickstart/full-text-search.md
@@ -58,6 +58,7 @@ print(ds.schema)
 ```
 
 This prints the PyArrow schema of the dataset:
+
 ```
 id: int64
 text: large_string
@@ -77,7 +78,7 @@ ds.create_scalar_index(
 
 The index creation process builds an efficient lookup structure that maps words to the documents containing them. This enables high-performance keyword-based search, even on large datasets.
 
 !!! warning "Index Creation Time"
-    Index creation time depends on the size of your text data. For large datasets, this process may take several minutes, but the performance benefits at query time are substantial.
+    Index creation time depends on the size of your text data. For large datasets, this process may take several minutes, but the performance benefits at query time are substantial.
 
 ## Advanced Index Configuration
 
@@ -195,6 +196,7 @@ query_result = ds.to_table(
 
 You can use boolean search operators by constructing a structured query object.
 
 #### All terms: `AND`
+
 ```python
 from lance.query import FullTextOperator, MatchQuery
 
@@ -207,6 +209,7 @@ query_result = ds.to_table(full_text_query=and_query)
 ```
 
 #### Any terms: `OR`
+
 ```python
 from lance.query import FullTextOperator, MatchQuery
 
@@ -278,7 +281,7 @@ table = ds.to_table(full_text_query="'train to boston'")
 ```
 
 !!! warning "Stop Words Are Removed by Default"
-    Common words like "to", "the", etc. are categorized as stop words and are removed by default when creating the index. If you want to search exact phrases that include stop words, set `remove_stop_words=False` when creating the index.
+    Common words like "to", "the", etc. are categorized as stop words and are removed by default when creating the index. If you want to search exact phrases that include stop words, set `remove_stop_words=False` when creating the index.
 
 ### Substring matches with N-gram indexing
 
@@ -385,8 +388,8 @@ print(stats["num_unindexed_rows"], stats["num_indexed_rows"])
 ```
 
 !!! info
-    If you used a custom index name, replace `"text_idx"` with your index name.
-    If you did not set `name=...` when creating the FTS index on column `"text"`, the default index name is `"text_idx"`.
+    If you used a custom index name, replace `"text_idx"` with your index name.
+    If you did not set `name=...` when creating the FTS index on column `"text"`, the default index name is `"text_idx"`.
 
 If you changed tokenizer settings (such as `with_position`, `base_tokenizer`, stop words, or stemming), rebuild the index with `create_scalar_index(..., replace=True)` so the full dataset is indexed with the new configuration.
 
@@ -406,6 +409,10 @@ Using specific, targeted search terms often yields better performance than broad
 
 Combining full-text search with metadata filters can significantly reduce the search space and improve performance. Use structured data filters to narrow down results before applying text search, or vice versa. This approach is particularly effective for large datasets where you can eliminate many irrelevant documents early in the query process.
 
+### Further Reading
+
+For advanced tokenizer configuration and more technical detail on the index training process, including expected memory and disk usage, see the [full-text index](../format/table/index/scalar/fts.md) specification.
+
 ## Next Steps
 
-Check out the **[User Guide](../guide/read_and_write/)** and explore the Lance API in more detail.
+Check out the **[User Guide](../guide/read_and_write.md)** and explore the Lance API in more detail.
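The `AND`/`OR` operator semantics used by the quickstart's boolean queries can be illustrated with a toy in-memory inverted index. This is a hedged sketch for intuition only: the `build_index` and `match` helpers are hypothetical, not Lance APIs, and Lance's real FTS index stores tokenized, compressed posting lists per partition rather than raw Python sets.

```python
# Toy inverted index illustrating AND/OR full-text matching.
# Illustrative only -- not Lance's implementation or API.

def build_index(docs):
    """Map each lowercased token to the set of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index

def match(index, terms, operator="OR"):
    """Return doc ids containing all terms (AND) or any term (OR)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

docs = {1: "puppy dog", 2: "happy puppy", 3: "happy dog"}
index = build_index(docs)
print(sorted(match(index, ["happy", "puppy"], "AND")))  # [2]
print(sorted(match(index, ["happy", "puppy"], "OR")))   # [1, 2, 3]
```

A real FTS query additionally ranks matching documents by relevance and, when the index has multiple partitions, runs the lookup against every partition and merges the results.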
diff --git a/docs/src/quickstart/versioning.md b/docs/src/quickstart/versioning.md
index 30c41b91344..8cdf1cb35ea 100644
--- a/docs/src/quickstart/versioning.md
+++ b/docs/src/quickstart/versioning.md
@@ -5,7 +5,7 @@ description: Learn how to version your Lance datasets with append, overwrite, ta
 
 # Versioning Your Datasets with Lance
 
-Lance supports versioning natively, allowing you to track changes over time. 
+Lance supports versioning natively, allowing you to track changes over time.
 In this tutorial, you'll learn how to append new data to existing datasets while preserving historical versions and access specific versions using version numbers or meaningful tags.
 You'll also understand how to implement proper data governance practices with Lance's native versioning capabilities.
 
@@ -110,4 +110,4 @@ For more details, see [Tags and Branches](../guide/tags_and_branches.md).
 Now that you've mastered dataset versioning with Lance, check out **[Vector Indexing and Vector Search With Lance](vector-search.md)**. You can learn how to build high-performance vector search capabilities on top of your Lance tables.
 
-This will teach you how to build fast, scalable search capabilities for your versioned datasets.
\ No newline at end of file
+This will teach you how to build fast, scalable search capabilities for your versioned datasets.
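The writer-fencing scenario laid out in the appendix tables earlier in this patch can be sketched as a minimal epoch check. All class and method names below are invented for illustration; they are not Lance's actual types or API, and the real protocol must perform the manifest read and epoch bump as an atomic conditional update on storage.

```python
# Minimal sketch of epoch-based writer fencing (illustrative names only).

class FencedError(Exception):
    """Raised when a writer discovers it has been superseded."""

class RegionManifest:
    def __init__(self, writer_epoch=5, version=1):
        self.writer_epoch = writer_epoch
        self.version = version

class Writer:
    def __init__(self, manifest):
        # Claim the region: bump the epoch and write a new manifest version.
        manifest.writer_epoch += 1
        manifest.version += 1
        self.epoch = manifest.writer_epoch

    def check_fence(self, manifest):
        # Before any durable operation (e.g. a MemTable flush), re-read the
        # manifest; a higher epoch means another writer has taken over.
        if manifest.writer_epoch > self.epoch:
            raise FencedError(
                f"local epoch {self.epoch} < manifest epoch {manifest.writer_epoch}"
            )

manifest = RegionManifest()     # epoch=5, version=1
writer_a = Writer(manifest)     # epoch becomes 6 (manifest v2)
writer_b = Writer(manifest)     # epoch becomes 7 (manifest v3), fencing A

writer_b.check_fence(manifest)  # B holds the latest epoch: OK
try:
    writer_a.check_fence(manifest)
except FencedError as err:
    print("Writer A is fenced:", err)  # A must abort all in-flight operations
```

This sketch only shows the comparison that forces the stale writer to abort; it deliberately omits WAL entry handling, which the appendix covers separately.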