Merged
9 changes: 6 additions & 3 deletions docs/src/format/table/mem_wal.md
@@ -116,7 +116,7 @@ In other words, a WAL consists of an ordered list of WAL entries starting from p
Writer must flush WAL entries in sequential order from lower to higher position.
If WAL entry `N` is not flushed fully, WAL entry `N+1` must not exist in storage.
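The ordering rule above implies that the entries present in storage always form a gap-free run of positions. A minimal sketch of that check follows, assuming the positions have already been listed from storage into a set. The function names and the `start` parameter are illustrative only and are not part of the Lance API.

```rust
use std::collections::BTreeSet;

/// Returns the positions that can be replayed in order: the contiguous run
/// beginning at `start`. Replay stops at the first missing position.
/// (Hypothetical helper, not a Lance function.)
fn replayable_prefix(present: &BTreeSet<u64>, start: u64) -> Vec<u64> {
    let mut out = Vec::new();
    let mut next = start;
    while present.contains(&next) {
        out.push(next);
        // Stop cleanly instead of overflowing at u64::MAX.
        match next.checked_add(1) {
            Some(n) => next = n,
            None => break,
        }
    }
    out
}

/// True when every stored position belongs to the contiguous run from
/// `start`, i.e. no entry `N+1` exists without entry `N`.
fn is_well_formed(present: &BTreeSet<u64>, start: u64) -> bool {
    replayable_prefix(present, start).len() == present.len()
}
```

With positions `{1, 2, 4}` and `start = 1`, the replayable prefix is `[1, 2]` and the set is reported as malformed, because entry 4 exists without entry 3.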

- ### WAL Replay
+ #### WAL Replay

**Replaying** a WAL means reading data in the WAL from a lower to a higher position.
This is commonly used to recover the latest MemTable after it is lost,
@@ -161,6 +161,9 @@ The content within the generation directory follows the [Lance table storage lay
Generation numbers determine merge order of flushed MemTable into base table:
lower numbers represent older data and must be merged to the base table first to preserve correct upsert semantics.

Within a single flushed MemTable, if there are multiple rows of the same primary key,
the row that is last inserted wins.
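The two rules above (merge lower generations first, and the last inserted row wins within one flushed MemTable) can be sketched with plain in-memory maps. This is an illustration under stated assumptions rather than the actual merge code: `FlushedMemTable` and `merge_into_base` are hypothetical names, and rows are reduced to `(primary key, value)` string pairs kept in insertion order.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a flushed MemTable.
struct FlushedMemTable {
    generation: u64,
    /// Rows in insertion order: (primary key, value).
    rows: Vec<(String, String)>,
}

/// Applies flushed MemTables to the base table as upserts.
fn merge_into_base(base: &mut HashMap<String, String>, mut flushed: Vec<FlushedMemTable>) {
    // Older generations are merged first so newer ones overwrite them.
    flushed.sort_by_key(|m| m.generation);
    for memtable in flushed {
        // Iterating in insertion order means a later row with the same
        // primary key overwrites an earlier one: last inserted wins.
        for (key, value) in memtable.rows {
            base.insert(key, value);
        }
    }
}
```

Sorting before applying makes the outcome independent of the order in which the flushed MemTables were discovered.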

### Region Manifest

Each region has a manifest file. This is the source of truth for the state of a region.
@@ -465,7 +468,7 @@ Readers **MUST** merge results from multiple data sources (base table, flushed M

When the same primary key exists in multiple sources, the reader must keep only the newest version based on:

- 1. **Generation number** (`_gen`): Higher generation wins. The base table has generation -1, MemTables have positive integers starting from 1.
+ 1. **Generation number** (`_gen`): Higher generation wins. The base table has generation 0, MemTables have positive integers starting from 1.
2. **Row address** (`_rowaddr`): Within the same generation, higher row address wins (later writes within a batch overwrite earlier ones).

The ordering for "newest" is: highest `_gen` first, then highest `_rowaddr`.
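As a rough illustration of this ordering, the sketch below keeps one row per primary key by comparing the pair (`_gen`, `_rowaddr`) lexicographically. The `Row` struct and `dedup_newest` function are hypothetical and only model the rule. A real reader would operate on Arrow batches rather than a `Vec` of structs.

```rust
use std::collections::HashMap;

/// Hypothetical row model: only the fields needed for deduplication.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    pk: String,
    /// Mirrors `_gen`; signed so any base-table generation value fits.
    generation: i64,
    /// Mirrors `_rowaddr`.
    row_addr: u64,
    value: String,
}

/// Keeps the newest row per primary key: highest generation first,
/// then highest row address. Output is sorted by primary key.
fn dedup_newest(rows: Vec<Row>) -> Vec<Row> {
    let mut newest: HashMap<String, Row> = HashMap::new();
    for row in rows {
        // Tuple comparison is lexicographic: generation, then row address.
        let is_newer = match newest.get(&row.pk) {
            Some(cur) => (row.generation, row.row_addr) > (cur.generation, cur.row_addr),
            None => true,
        };
        if is_newer {
            newest.insert(row.pk.clone(), row);
        }
    }
    let mut out: Vec<Row> = newest.into_values().collect();
    out.sort_by(|a, b| a.pk.cmp(&b.pk));
    out
}
```

Because the comparison uses only the (`_gen`, `_rowaddr`) pair, the result does not depend on the order in which the sources are scanned.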
@@ -506,7 +509,7 @@ Datasets come from:
2. flushed MemTables (persisted but not yet merged)
3. optionally in-memory MemTables (if accessible).

- Each dataset is tagged with a generation number: -1 for the base table, and positive integers for MemTable generations.
+ Each dataset is tagged with a generation number: 0 for the base table, and positive integers for MemTable generations.
Within a region, the generation number determines data freshness, with higher numbers representing newer data.
Rows from different regions do not need deduplication since each primary key maps to exactly one region.

15 changes: 15 additions & 0 deletions rust/lance-index/src/scalar/inverted/builder.rs
@@ -410,6 +410,21 @@ impl InnerBuilder {
        self.id
    }

    /// Set the token set for this builder.
    pub fn set_tokens(&mut self, tokens: TokenSet) {
        self.tokens = tokens;
    }

    /// Set the document set for this builder.
    pub fn set_docs(&mut self, docs: DocSet) {
        self.docs = docs;
    }

    /// Set the posting lists for this builder.
    pub fn set_posting_lists(&mut self, posting_lists: Vec<PostingListBuilder>) {
        self.posting_lists = posting_lists;
    }

    pub async fn remap(&mut self, mapping: &HashMap<u64, Option<u64>>) -> Result<()> {
        // for the docs, we need to remove the rows that are removed from the doc set,
        // and update the row ids of the rows that are updated
5 changes: 5 additions & 0 deletions rust/lance-index/src/scalar/inverted/tokenizer.rs
@@ -223,6 +223,11 @@ impl InvertedIndexParams {
        self
    }

    /// Get whether positions are stored in this index.
    pub fn has_positions(&self) -> bool {
        self.with_position
    }

    pub fn max_token_length(mut self, max_token_length: Option<usize>) -> Self {
        self.max_token_length = max_token_length;
        self
4 changes: 4 additions & 0 deletions rust/lance/Cargo.toml
@@ -183,5 +183,9 @@ harness = false
name = "memtable_read"
harness = false

[[bench]]
name = "mem_wal_read"
harness = false

[lints]
workspace = true