diff --git a/.typos.toml b/.typos.toml index fc490c95dcf..50a08bce5e7 100644 --- a/.typos.toml +++ b/.typos.toml @@ -11,6 +11,7 @@ afe = "afe" typ = "typ" rabit = "rabit" flate = "flate" +Ines = "Ines" [default.expect] nprobs = "nprobes" diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index f3eefd9a43e..17a28732d81 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -1,5 +1,5 @@ site_name: Lance -site_description: Modern columnar data format for ML and LLMs +site_description: Open Lakehouse Format for Multimodal AI site_url: https://lancedb.github.io/lance/ docs_dir: src @@ -8,6 +8,7 @@ repo_url: https://github.com/lancedb/lance theme: name: material + custom_dir: overrides logo: logo/white.png favicon: logo/logo.png palette: @@ -40,7 +41,11 @@ theme: markdown_extensions: - admonition - pymdownx.details - - pymdownx.superfences + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format - pymdownx.highlight: anchor_linenums: true line_spans: __span @@ -70,3 +75,12 @@ extra: - icon: fontawesome/brands/twitter link: https://twitter.com/lancedb +copyright: © 2025 Lance Format. All rights reserved. + +extra_css: + - assets/stylesheets/termynal.css + - assets/stylesheets/home.css + +extra_javascript: + - assets/javascript/termynal.js + diff --git a/docs/overrides/home.html b/docs/overrides/home.html new file mode 100644 index 00000000000..77dd4c2905a --- /dev/null +++ b/docs/overrides/home.html @@ -0,0 +1,263 @@ +{% extends "main.html" %} + +{% block tabs %} + {{ super() }} + + + + +
+
+
+ +

The Open Lakehouse Format for Multimodal AI

+
+ +
+
+
+ + +
+
+
+

What is Lance?

+

Lance contains a file format, table format, and catalog spec for multimodal AI,
                    allowing you to build a complete open lakehouse on top of object storage to power your AI workflows.
                    Lance brings high-performance vector search, full-text search, random access, and feature
                    engineering capabilities to the lakehouse, while retaining all the existing lakehouse benefits,
                    such as SQL analytics, ACID transactions, time travel, and integrations with open engines (Apache Spark, Ray, Trino, DuckDB, etc.)
                    and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino, Hive Metastore, etc.).

+ Learn More +
+
+
+ + +
+
+
+
+

Expressive Hybrid Search

+

+ Lance enables powerful hybrid search combining vector similarity, full-text search, + and SQL analytics on the same dataset. All query types are accelerated by corresponding + secondary indices as part of the Lance specification. +

+

+ Run semantic search on embeddings, BM25 search on keywords, and apply complex SQL predicates - + all using a single table with a unified interface. +

+ Learn More +
+
+ +
+
+
+
+ + +
+
+
+
+

Lightning-fast Random Access

+

+ Lance delivers 100x faster random access compared to Parquet or Iceberg. With efficient + row-addressing, you can access individual records across multiple files instantly, + making it perfect for real-time ML serving, random sampling, and interactive applications. +

+

+ Unlike traditional columnar formats, Lance maintains high performance even when + randomly accessing scattered rows across your entire dataset. +

+ Learn More +
+
+
+ import lance + dataset = lance.dataset("s3://my-bucket/embeddings") + table = dataset.take([100, 5000, 1000000]) + dataset.take([0, 1], columns=["id", "vector"]) +
+
+
+
+
+ + +
+
+
+
+

Native Multimodal Data Support

+

+ Store images, videos, audio, text, and embeddings in a single unified format. + Lance's blob encoding efficiently handles large binary objects with lazy loading, + while optimized vector storage accelerates similarity search. +

+

+ Perfect for AI/ML workloads where you need to store raw data alongside embeddings + for multimodal retrieval and generation workflows. +

+ Learn More +
+
+
+ import lance + import pyarrow as pa + schema = pa.schema([ + pa.field("video", pa.large_binary(), + metadata={"lance-encoding:blob": "true"}), + pa.field("embedding", pa.list_(pa.float32(), 128))]) + lance.write_dataset(table, "multimodal.lance", schema=schema) +
+
+
+
+
+ + +
+
+
+
+

Data Evolution > Schema Evolution

+

Schema evolution in most open table formats is metadata-only and fast.
                    But when trying to backfill column values into existing rows, a full table rewrite is typically required.
                    Lance supports efficient schema evolution with backfill, making it ideal for ML
                    feature engineering, embedding updates, and media content management.

+

+ Adding a new column with data is as simple as writing new Lance files to the Lance table - + no need to rewrite your entire dataset. +

+ Learn More +
+
+
+ import lance + dataset = lance.dataset("my_data.lance") + @lance.batch_udf() + def add_embeddings(batch): + vectors = model.encode(batch["text"]) + return {"embedding": vectors} + dataset.add_columns(add_embeddings) +
+
+
+
+
+ + +
+
+
+
+

Rich Ecosystem Integrations

+

+ As an open format, Lance integrates seamlessly with the Python data ecosystem and modern data platforms. + Work with your favorite tools including Pandas, Polars, and PyTorch for data processing and machine learning. + Connect with leading query engines like Apache DataFusion, DuckDB, Apache Spark, Trino, and Apache Flink/Fluss + to run SQL analytics and distributed processing on your Lance datasets. +

+ View Integrations +
+
+ Lance Ecosystem Integrations +
+
+
+
+ + + +{% endblock %} + +{% block content %}{% endblock %} +{% block footer %} + {{ super() }} +{% endblock %} diff --git a/docs/src/assets/images/ecosystem-integrations.png b/docs/src/assets/images/ecosystem-integrations.png new file mode 100644 index 00000000000..070d7c00a78 Binary files /dev/null and b/docs/src/assets/images/ecosystem-integrations.png differ diff --git a/docs/src/assets/images/lance-mj.png b/docs/src/assets/images/lance-mj.png new file mode 100644 index 00000000000..15c30cb3b24 Binary files /dev/null and b/docs/src/assets/images/lance-mj.png differ diff --git a/docs/src/assets/javascript/termynal.js b/docs/src/assets/javascript/termynal.js new file mode 100644 index 00000000000..77ec6cb01c8 --- /dev/null +++ b/docs/src/assets/javascript/termynal.js @@ -0,0 +1,197 @@ +/** + * termynal.js + * A lightweight, modern and extensible animated terminal window, using + * async/await. + * + * @author Ines Montani + * @version 0.0.1 + * @license MIT + */ + +'use strict'; + +/** Generate a terminal widget. */ +class Termynal { + /** + * Construct the widget's settings. + * @param {(string|Node)=} container - Query selector or container element. + * @param {Object=} options - Custom settings. + * @param {string} options.prefix - Prefix to use for data attributes. + * @param {number} options.startDelay - Delay before animation, in ms. + * @param {number} options.typeDelay - Delay between each typed character, in ms. + * @param {number} options.lineDelay - Delay between each line, in ms. + * @param {number} options.progressLength - Number of characters displayed as progress bar. + * @param {string} options.progressChar – Character to use for progress bar, defaults to █. + * @param {number} options.progressPercent - Max percent of progress. + * @param {string} options.cursor – Character to use for cursor, defaults to ▋. + * @param {Object[]} lineData - Dynamically loaded line data objects. + * @param {boolean} options.noInit - Don't initialise the animation. + */ + constructor(container = '#termynal', options = {}) { + this.container = (typeof container === 'string') ? document.querySelector(container) : container; + this.pfx = `data-${options.prefix || 'ty'}`; + this.startDelay = options.startDelay + || parseFloat(this.container.getAttribute(`${this.pfx}-startDelay`)) || 600; + this.typeDelay = options.typeDelay + || parseFloat(this.container.getAttribute(`${this.pfx}-typeDelay`)) || 90; + this.lineDelay = options.lineDelay + || parseFloat(this.container.getAttribute(`${this.pfx}-lineDelay`)) || 1500; + this.progressLength = options.progressLength + || parseFloat(this.container.getAttribute(`${this.pfx}-progressLength`)) || 40; + this.progressChar = options.progressChar + || this.container.getAttribute(`${this.pfx}-progressChar`) || '█'; + this.progressPercent = options.progressPercent + || parseFloat(this.container.getAttribute(`${this.pfx}-progressPercent`)) || 100; + this.cursor = options.cursor + || this.container.getAttribute(`${this.pfx}-cursor`) || '▋'; + this.lineData = this.lineDataToElements(options.lineData || []); + if (!options.noInit) this.init() + } + + /** + * Initialise the widget, get lines, clear container and start animation. + */ + init() { + // Appends dynamically loaded lines to existing line elements. + this.lines = [...this.container.querySelectorAll(`[${this.pfx}]`)].concat(this.lineData); + + /** + * Calculates width and height of Termynal container. + * If container is empty and lines are dynamically loaded, defaults to browser `auto` or CSS. 
+ */ + const containerStyle = getComputedStyle(this.container); + this.container.style.width = containerStyle.width !== '0px' ? + containerStyle.width : undefined; + this.container.style.minHeight = containerStyle.height !== '0px' ? + containerStyle.height : undefined; + + this.container.setAttribute('data-termynal', ''); + this.container.innerHTML = ''; + this.start(); + } + + /** + * Start the animation and rener the lines depending on their data attributes. + */ + async start() { + await this._wait(this.startDelay); + + for (let line of this.lines) { + const type = line.getAttribute(this.pfx); + const delay = line.getAttribute(`${this.pfx}-delay`) || this.lineDelay; + + if (type == 'input') { + line.setAttribute(`${this.pfx}-cursor`, this.cursor); + await this.type(line); + await this._wait(delay); + } + + else if (type == 'progress') { + await this.progress(line); + await this._wait(delay); + } + + else { + this.container.appendChild(line); + await this._wait(delay); + } + + line.removeAttribute(`${this.pfx}-cursor`); + } + } + + /** + * Animate a typed line. + * @param {Node} line - The line element to render. + */ + async type(line) { + const chars = [...line.textContent]; + const delay = line.getAttribute(`${this.pfx}-typeDelay`) || this.typeDelay; + line.textContent = ''; + this.container.appendChild(line); + + for (let char of chars) { + await this._wait(delay); + line.textContent += char; + } + } + + /** + * Animate a progress bar. + * @param {Node} line - The line element to render. + */ + async progress(line) { + const progressLength = line.getAttribute(`${this.pfx}-progressLength`) + || this.progressLength; + const progressChar = line.getAttribute(`${this.pfx}-progressChar`) + || this.progressChar; + const chars = progressChar.repeat(progressLength); + const progressPercent = line.getAttribute(`${this.pfx}-progressPercent`) + || this.progressPercent; + line.textContent = ''; + this.container.appendChild(line); + + for (let i = 1; i < chars.length + 1; i++) { + await this._wait(this.typeDelay); + const percent = Math.round(i / chars.length * 100); + line.textContent = `${chars.slice(0, i)} ${percent}%`; + if (percent>progressPercent) { + break; + } + } + } + + /** + * Helper function for animation delays, called with `await`. + * @param {number} time - Timeout, in ms. + */ + _wait(time) { + return new Promise(resolve => setTimeout(resolve, time)); + } + + /** + * Converts line data objects into line elements. + * + * @param {Object[]} lineData - Dynamically loaded lines. + * @param {Object} line - Line data object. + * @returns {Element[]} - Array of line elements. + */ + lineDataToElements(lineData) { + return lineData.map(line => { + let div = document.createElement('div'); + div.innerHTML = `${line.value || ''}`; + + return div.firstElementChild; + }); + } + + /** + * Helper function for generating attributes string. + * + * @param {Object} line - Line data object. + * @returns {string} - String of attributes. + */ + _attributes(line) { + let attrs = ''; + for (let prop in line) { + attrs += this.pfx; + + if (prop === 'type') { + attrs += `="${line[prop]}" ` + } else if (prop !== 'value') { + attrs += `-${prop}="${line[prop]}" ` + } + } + + return attrs; + } +} + +/** +* HTML API: If current script has container(s) specified, initialise Termynal. 
+*/ +if (document.currentScript.hasAttribute('data-termynal-container')) { + const containers = document.currentScript.getAttribute('data-termynal-container'); + containers.split('|') + .forEach(container => new Termynal(container)) +} diff --git a/docs/src/assets/stylesheets/home.css b/docs/src/assets/stylesheets/home.css new file mode 100644 index 00000000000..e8b0aa19c7e --- /dev/null +++ b/docs/src/assets/stylesheets/home.css @@ -0,0 +1,292 @@ +/* Lance Homepage Styles */ + +* { + box-sizing: border-box; +} + +.container { + width: 100%; + max-width: 1140px; + margin-right: auto; + margin-left: auto; + padding-right: 15px; + padding-left: 15px; +} + +/* Hero Section - Fullscreen with Background Image */ +.mdx-container { + text-align: center; + color: #f8f8f8; + background: url("../images/lance-mj.png") no-repeat center center; + background-size: cover; + min-height: 100vh; + height: 100vh; + display: flex; + align-items: center; + justify-content: center; +} + +.intro-message { + position: relative; + padding: 40px 20px; + font-family: "Lato", -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif; + max-width: 1000px; + margin: 0 auto; +} + +.hero-logo { + display: inline-flex; + align-items: center; + margin-bottom: 16px; +} + +.hero-logo img { + height: 120px; + width: auto; + margin-right: 24px; + margin-top: 12px; + filter: drop-shadow(3px 3px 8px rgba(0, 0, 0, 0.9)); +} + +.intro-message h1 { + font-weight: 400; + margin: 0; + display: inline-block; + text-shadow: 3px 3px 8px rgba(0, 0, 0, 0.9), 1px 1px 3px rgba(0, 0, 0, 1); + font-size: 8em; + line-height: 1.2; + color: #ffffff; + vertical-align: middle; +} + +.intro-message h1 sup { + font-size: 2rem; + text-shadow: 2px 2px 6px rgba(0, 0, 0, 0.9); +} + +.intro-message h3 { + font-size: 1.1rem; + text-shadow: 2px 2px 6px rgba(0, 0, 0, 0.9), 1px 1px 3px rgba(0, 0, 0, 1); + font-weight: 600; + margin-bottom: 32px; + color: #ffffff; +} + +.intro-divider { + width: 400px; + max-width: 80%; + border-top: 1px solid rgba(255, 255, 255, 0.8); + border-bottom: 1px solid rgba(0, 0, 0, 0.2); + margin: 24px auto; +} + +.list-inline { + padding-left: 0; + margin-left: -5px; + list-style: none; + margin-bottom: 0; +} + +.list-inline li { + display: inline-block; + padding-right: 5px; + padding-left: 5px; +} + +.intro-message .md-button { + margin: 8px; + padding: 14px 36px; + font-size: 1.1rem; + font-weight: 600; + text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.8); + transition: all 0.3s ease; +} + +.intro-message .md-button:hover { + transform: translateY(-2px); + box-shadow: 0 8px 16px rgba(0, 0, 0, 0.3); +} + +.intro-message .md-button--primary:hover { + box-shadow: 0 8px 20px rgba(102, 126, 234, 0.5); +} + +.intro-message .md-button:not(.md-button--primary):hover { + box-shadow: 0 8px 20px rgba(255, 255, 255, 0.4); +} + +/* What is Lance Section */ +.lance-intro-section { + padding: 80px 0; + background-color: rgba(128, 128, 128, 0.03); + border-bottom: 1px solid rgba(128, 128, 128, 0.1); +} + +.lance-intro-content { + max-width: 900px; + margin: 0 auto; + text-align: center; +} + +.lance-intro-content h2 { + font-size: 36px; + font-weight: 500; + margin-bottom: 32px; + color: var(--md-primary-fg-color); +} + +.lance-intro-content p { + font-size: 16px; + line-height: 1.8; + margin-bottom: 32px; + opacity: 0.9; + text-align: left; +} + +.lance-intro-content .md-button { + margin-top: 16px; + padding: 10px 28px; + font-size: 14px; + border: 2px solid currentColor; + background-color: transparent; + transition: all 0.3s 
ease; +} + +.lance-intro-content .md-button:hover { + color: var(--md-primary-fg-color); + background-color: transparent; +} + +/* Feature Sections */ +.lance-feature-section { + padding: 80px 0; + border-bottom: 1px solid rgba(128, 128, 128, 0.1); +} + +.lance-feature-section:last-child { + border-bottom: none; +} + +.lance-feature-content { + display: flex; + flex-wrap: wrap; + align-items: center; + gap: 60px; +} + +.lance-feature-text { + flex: 1; + min-width: 300px; +} + +.lance-feature-text h2 { + font-size: 30px; + font-weight: 500; + margin-bottom: 16px; + color: var(--md-primary-fg-color); +} + +.lance-feature-text p { + font-size: 15px; + line-height: 1.6; + opacity: 0.85; + margin-bottom: 16px; +} + +.lance-feature-text .md-button { + font-size: 0.6rem; + padding: 0; + transition: all 0.3s ease; +} + +.lance-feature-text .md-button:hover { + transform: translateX(4px); + color: var(--md-primary-fg-color); +} + +.lance-feature-demo { + flex: 1; + min-width: 400px; + display: flex; + justify-content: center; + overflow: hidden; +} + +/* Alternating layout */ +.lance-feature-section.reverse .lance-feature-content { + flex-direction: row-reverse; +} + +/* Terminal container adjustments */ +[data-termynal] { + margin: 0 auto; + width: 500px; + max-width: 100%; + font-size: 14px; + box-shadow: 0 8px 24px rgba(0, 0, 0, 0.15); +} + +/* Responsive Design */ +@media (max-width: 768px) { + .container { + padding-right: 10px; + padding-left: 10px; + } + + .mdx-hero__content h1 { + font-size: 2.2rem; + } + + .mdx-hero__content p { + font-size: 1.1rem; + } + + .lance-feature-section { + padding: 50px 0; + } + + .lance-feature-content { + flex-direction: column; + gap: 40px; + } + + .lance-feature-section.reverse .lance-feature-content { + flex-direction: column; + } + + .lance-feature-text h2 { + font-size: 1.8rem; + } + + .lance-feature-text p { + font-size: 1rem; + } + + .lance-feature-demo { + min-width: 100%; + } + + [data-termynal] { + width: 100% !important; + max-width: 100% !important; + font-size: 12px; + padding: 50px 25px 25px; + } +} + +@media (max-width: 480px) { + .mdx-hero__content h1 { + font-size: 1.8rem; + } + + .mdx-hero__content .md-button { + display: block; + margin: 8px auto; + width: 90%; + } + + [data-termynal] { + font-size: 10px; + padding: 40px 15px 15px; + } +} diff --git a/docs/src/assets/stylesheets/termynal.css b/docs/src/assets/stylesheets/termynal.css new file mode 100644 index 00000000000..456f469bea3 --- /dev/null +++ b/docs/src/assets/stylesheets/termynal.css @@ -0,0 +1,107 @@ +/** + * termynal.js + * + * @author Ines Montani + * @version 0.0.1 + * @license MIT + */ + +:root { + --color-bg: #252a33; + --color-text: #eee; + --color-text-subtle: #a2a2a2; + --color-keyword: #c678dd; + --color-string: #98c379; + --color-function: #61afef; + --color-number: #d19a66; + --color-comment: #5c6370; +} + +[data-termynal] { + width: 750px; + max-width: 100%; + background: var(--color-bg); + color: var(--color-text); + font-size: 18px; + font-family: 'Fira Mono', Consolas, Menlo, Monaco, 'Courier New', Courier, monospace; + border-radius: 4px; + padding: 75px 45px 35px; + position: relative; + -webkit-box-sizing: border-box; + box-sizing: border-box; +} + +[data-termynal]:before { + content: ''; + position: absolute; + top: 15px; + left: 15px; + display: inline-block; + width: 15px; + height: 15px; + border-radius: 50%; + /* A little hack to display the window buttons in one pseudo element. 
*/ + background: #d9515d; + -webkit-box-shadow: 25px 0 0 #f4c025, 50px 0 0 #3ec930; + box-shadow: 25px 0 0 #f4c025, 50px 0 0 #3ec930; +} + +[data-termynal]:after { + content: 'Python'; + position: absolute; + color: var(--color-text-subtle); + top: 5px; + left: 0; + width: 100%; + text-align: center; +} + +[data-ty] { + display: block; + line-height: 2; + white-space: pre; +} + +[data-ty]:before { + /* Set up defaults and ensure empty lines are displayed. */ + content: ''; + display: inline-block; + vertical-align: middle; +} + +[data-ty="input"]:before, +[data-ty-prompt]:before { + margin-right: 0.75em; + color: var(--color-text-subtle); +} + +[data-ty="input"]:before { + content: '$'; +} + +[data-ty][data-ty-prompt]:before { + content: attr(data-ty-prompt); +} + +[data-ty-cursor]:after { + content: attr(data-ty-cursor); + font-family: monospace; + margin-left: 0.5em; + -webkit-animation: blink 1s infinite; + animation: blink 1s infinite; +} + + +/* Cursor animation */ + +@-webkit-keyframes blink { + 50% { + opacity: 0; + } +} + +@keyframes blink { + 50% { + opacity: 0; + } +} diff --git a/docs/src/format/index.md b/docs/src/format/index.md index 0785a65222b..7bc346cdc1e 100644 --- a/docs/src/format/index.md +++ b/docs/src/format/index.md @@ -1,16 +1,82 @@ # Lance Format Specification -The Lance format contains both a table format and a columnar file format. -When combined, we refer to it as a data format. -Because Lance can store both structured and unstructured multimodal data, Lance typically refers to tables as "datasets". -A Lance dataset is designed to efficiently handle secondary indices, fast ingestion and modification of data, -and a rich set of schema and data evolution features. - -## Feature Flags - -As the file format and dataset evolve, new feature flags are added to the format. -There are two separate fields for checking for feature flags, -depending on whether you are trying to read or write the table. -Readers should check the `reader_feature_flags` to see if there are any flag it is not aware of. -Writers should check `writer_feature_flags`. If either sees a flag they don't know, -they should return an "unsupported" error on any read or write operation. \ No newline at end of file +Lance is a **Lakehouse Format** that spans three specification layers: file format, table format, and catalog spec. + +## Understanding the Lakehouse Stack + +To understand where Lance fits in the data ecosystem, let's first map out the complete lakehouse technology stack. +The modern lakehouse architecture consists of six distinct layers: + +![Lakehouse Stack](../images/lakehouse_stack.png) + +### 1. Object Store + +At the foundation lies the **object store**—storage systems characterized by their object-based simple hierarchy, +typically providing highly durable guarantees with HTTP-based communication protocols for data transfer. +This includes systems like S3, GCS, and Azure Blob Storage. + +### 2. File Format + +Above the storage layer, the **file format** describes how a single file should be stored on disk. +This is where formats like Apache Parquet operate, defining the internal structure, encoding, and compression of individual data files. + +### 3. Table Format + +The **table format** layer describes how multiple files work together to form a logical table. +The key feature that modern table formats enable is transactional commits and read isolation to allow multiple writers and readers to safely operate against the same table. 
+All major open source table formats, including Iceberg and Lance, implement these features through MVCC (Multi-Version Concurrency Control),
+where each commit atomically produces a new table version, and all table versions form a serializable history for the table.
+This also unlocks time travel and makes features like schema evolution easy to implement.
+
+### 4. Catalog Spec
+
+The **catalog spec** defines how any system can discover and manage a collection of tables within storage.
+This is where the lower storage and format stack meets the upper service and compute stack.
+
+Table formats require at least a way to list all available tables and to describe, add, and drop tables in that list.
+This is what makes it possible to build the so-called **connectors** in compute engines, so they can discover a table and operate on it according to its format.
+Historically, Hive defined the Hive Metastore spec, which is sufficient for most table formats including Delta Lake, Hudi, Paimon, and also Lance.
+Iceberg offers its own Iceberg REST Catalog spec.
+
+From the top down, projects like Apache Polaris, Unity Catalog, and Apache Gravitino usually offer additional specifications for operating against table derivatives
+(e.g. views, materialized views, user-defined table functions) and objects used in table operations (e.g. user-defined functions, policies).
+
+This intersection between the top and bottom of the stack is also why a catalog service typically provides both the catalog specifications offered by the format side, for easy connectivity to compute engines,
+and its own APIs for extended management features.
+
+Another key difference between a catalog spec and a catalog service is that multiple vendors can implement the same spec.
+For example, the Polaris REST spec is implemented by the open source Apache Polaris server, Snowflake Horizon Catalog, and Polaris-compatible services in AWS Glue, Azure OneLake, etc.
+
+### 5. Catalog Service
+
+A **catalog service** implements one or more catalog specifications to provide both table metadata and, optionally, the continuous background maintenance (compaction, optimization, index updates) that table formats require to stay performant.
+Catalog services typically implement multiple specifications to support different table formats.
+For example, Polaris, Unity and Gravitino all support the Iceberg REST catalog specification for Iceberg tables, and each has its own generic table API for other table formats.
+
+Since table formats are static specifications, catalog services supply the active operational work needed for production deployments.
+This is often where open source transitions to commercial offerings: open source projects typically provide the metadata functionality, while commercial solutions offer the full operational experience including automated maintenance.
+There are also open source solutions like Apache Amoro emerging to fill this gap with complete open source catalog service implementations that offer both table metadata access and continuous optimization.
+
+### 6. Compute Engine
+
+Finally, **compute engines** are the workhorses that visit catalog services and leverage their knowledge of file formats, table formats, and catalog specifications to perform complex data workflows, including SQL queries, analytics processing, vector search, full-text search, and machine learning training.
+All sorts of applications can be built on top of compute engines to serve more concrete analytics, ML, and AI use cases.
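+
+As a small, hedged illustration of the stack in action, a Python process can act as a minimal compute engine: it opens a Lance table straight from object storage (the file and table format layers) and runs SQL over it with DuckDB. The bucket path and column names here are hypothetical placeholders.
+
+```python
+import duckdb
+import lance
+
+# Table format layer: resolve the latest version of the dataset.
+ds = lance.dataset("s3://my-bucket/events.lance")
+
+# File format layer: scan the columnar data files into an Arrow table.
+events = ds.to_table()
+
+# Compute engine layer: DuckDB picks up the Arrow table by variable name.
+print(duckdb.sql("SELECT count(*) FROM events").fetchall())
+```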
+
+### The Overall Lakehouse Architecture
+
+In the lakehouse architecture, the running systems are the object store, the catalog services, and the compute engines.
+The middle three layers (file format, table format, catalog spec) are specifications without compute.
+This separation enables portability and interoperability.
+
+## Understanding Lance as a Lakehouse Format
+
+Lance spans all three specification layers:
+
+1. **File Format**: The Lance columnar file format, [read specification →](file/index.md)
+2. **Table Format**: The Lance table format, [read specification →](table/index.md)
+3. **Catalog Spec**: The Lance Namespace specification, [read specification →](namespace/index.md)
+
+For comparison:
+
+- **Apache Iceberg** operates at the table format and catalog spec layers, using Apache Parquet, Apache Avro, and Apache ORC as its file formats
+- **Delta Lake** and **Apache Hudi** operate only at the table format layer, using Apache Parquet as the file format
\ No newline at end of file
diff --git a/docs/src/format/table/.pages b/docs/src/format/table/.pages
index 94701b5a1b4..ec66d452eb6 100644
--- a/docs/src/format/table/.pages
+++ b/docs/src/format/table/.pages
@@ -1,3 +1,8 @@
 nav:
   - index.md
+  - Versioning: versioning.md
+  - Transactions: transaction.md
+  - Layout: layout.md
+  - Branch & Tag: branch_tag.md
+  - Row ID & Lineage: row_id_lineage.md
   - index
diff --git a/docs/src/format/table/branch_tag.md b/docs/src/format/table/branch_tag.md
new file mode 100644
index 00000000000..e3bd328d5d2
--- /dev/null
+++ b/docs/src/format/table/branch_tag.md
@@ -0,0 +1,121 @@
+# Branch and Tag Specification
+
+## Overview
+
+Lance supports branching and tagging for managing multiple independent version histories and creating named references to specific versions.
+Branches enable parallel development workflows, while tags provide stable named references for important versions.
+
+## Branching
+
+### Branch Name
+
+Branch names must follow these validation rules:
+
+1. Cannot be empty
+2. Cannot start or end with `/`
+3. Cannot contain consecutive `//`
+4. Cannot contain `..` or `\`
+5. Segments must contain only alphanumeric characters, `.`, `-`, `_`
+6. Cannot end with `.lock`
+7. Cannot be named `main` (reserved for the main branch)
+
+### Branch Metadata Path
+
+Branch metadata is stored at `_refs/branches/{branch-name}.json` in the dataset root.
+Since branch names support hierarchical naming with `/` characters, the `/` is URL-encoded as `%2F` in the filename to distinguish it from directory separators (e.g., `bugfix/issue-123` becomes `bugfix%2Fissue-123.json`):
+
+```
+{dataset_root}/
+  _refs/
+    branches/
+      feature-a.json
+      bugfix%2Fissue-123.json      # Note: '/' encoded as '%2F'
+```
+
+### Branch Metadata File Format
+
+Each branch metadata file is a JSON file with the following fields:
+
+| JSON Key         | Type   | Optional | Description                                                                     |
+|------------------|--------|----------|---------------------------------------------------------------------------------|
+| `parent_branch`  | string | Yes      | Name of the branch this was created from. `null` indicates branched from main.  |
+| `parent_version` | number |          | Version number of the parent branch at the time this branch was created.        |
+| `create_at`      | number |          | Unix timestamp (seconds since epoch) when the branch was created.               |
+| `manifest_size`  | number |          | Size of the initial manifest file in bytes.                                     |
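+
+For example, the metadata file for a hypothetical branch `feature-a`, created from `main` at version 12, might look like this (all field values are illustrative):
+
+```json
+{
+  "parent_branch": null,
+  "parent_version": 12,
+  "create_at": 1735689600,
+  "manifest_size": 4096
+}
+```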
+
+### Branch Dataset Layout
+
+Each branch dataset is technically a [shallow clone](layout.md#shallow-clone) of the source dataset.
+Branch datasets are organized using the `tree/` directory at the dataset root:
+
+```
+{dataset_root}/
+  tree/
+    {branch_name}/
+      _versions/
+        *.manifest
+      _transactions/
+        *.txn
+      _deletions/
+        *.arrow
+        *.bin
+      _indices/
+        {UUID}/
+          index.idx
+```
+
+Named branches store their version-specific files under `tree/{branch_name}/`, resembling the GitHub branch path convention.
+The branch name is used as-is to form the path,
+which means `/` creates a logical subdirectory (e.g., `bugfix/issue-123`, `feature/user-auth`):
+
+```
+{dataset_root}/
+  tree/
+    feature-a/
+      _versions/
+        1.manifest
+        2.manifest
+    bugfix/
+      issue-123/
+        _versions/
+          1.manifest
+```
+
+## Tagging
+
+### Tag Name
+
+Tag names must follow these validation rules:
+
+1. Cannot be empty
+2. Must contain only alphanumeric characters, `.`, `-`, `_`
+3. Cannot start or end with `.`
+4. Cannot end with `.lock`
+5. Cannot contain consecutive `..`
+
+Note that tag names do not support `/` characters, unlike branch names.
+
+### Tag Storage
+
+Tags are stored as JSON files under `_refs/tags/` at the dataset root:
+
+```
+{dataset_root}/
+  _refs/
+    tags/
+      v1.0.0.json
+      v1.1.0.json
+      production.json
+```
+
+Tags are always stored at the root dataset level, regardless of which branch they reference.
+
+### Tag File Format
+
+Each tag file is a JSON file with the following fields:
+
+| JSON Key        | Type   | Optional | Description                                                                |
+|-----------------|--------|----------|----------------------------------------------------------------------------|
+| `branch`        | string | Yes      | Branch name being tagged. `null` or absent indicates the main branch.      |
+| `version`       | number |          | Version number being tagged within that branch.                            |
+| `manifest_size` | number |          | Size of the manifest file in bytes. Used for efficient manifest loading.   |
diff --git a/docs/src/format/table/index.md b/docs/src/format/table/index.md
index 9119bac7538..0114feeb0a1 100644
--- a/docs/src/format/table/index.md
+++ b/docs/src/format/table/index.md
@@ -1,289 +1,148 @@
 # Lance Table Format
 
-## Dataset Directory
+## Overview
 
-A `Lance Dataset` is organized in a directory.
+The Lance table format organizes datasets as versioned collections of fragments and indices.
+Each version is described by an immutable manifest file that references data files, deletion files, transaction files, and indices.
+The format supports ACID transactions, schema evolution, and efficient incremental updates through Multi-Version Concurrency Control (MVCC).
 
-```
-/path/to/dataset:
-    data/*.lance  -- Data directory
-    _versions/*.manifest -- Manifest file for each dataset version.
-    _indices/{UUID-*}/index.idx -- Secondary index, each index per directory.
-    _deletions/*.{arrow,bin} -- Deletion files, which contain IDs of rows
-      that have been deleted.
-```
+## Manifest
 
-A `Manifest` file includes the metadata to describe a version of the dataset.
+![Overview](../../images/table_overview.png)
 
-```protobuf
-%%% proto.message.Manifest %%%
-```
-
-### Fragments
+A manifest describes a single version of the dataset.
+It contains the complete schema definition including nested fields, the list of data fragments comprising this version,
+a monotonically increasing version number, and an optional reference to the index section that describes a list of index metadata.
 
-`DataFragment` represents a chunk of data in the dataset. 
Itself includes one or more `DataFile`, -where each `DataFile` can contain several columns in the chunk of data. -It also may include a `DeletionFile`, which is explained in a later section. +
+Manifest protobuf message ```protobuf -%%% proto.message.DataFragment %%% +%%% proto.message.Manifest %%% ``` -The overall structure of a fragment is shown below. One or more data files store the columns of a fragment. -New columns can be added to a fragment by adding new data files. The deletion file (if present), -stores the rows that have been deleted from the fragment. - -![Fragment Structure](../../images/fragment_structure.png) - -Every row has a unique ID, which is an u64 that is composed of two u32s: the fragment ID and the local row ID. -The local row ID is just the index of the row in the data files. - -## Dataset Update and Data Evolution - -`Lance` supports fast dataset update and schema evolution via manipulating the `Manifest` metadata. - -`Appending` is done by appending new `Fragment` to the dataset. While adding columns is done -by adding new `DataFile` of the new columns to each `Fragment`. Finally, -`Overwrite` a dataset can be done by resetting the `Fragment` list of the `Manifest`. - -![Data Evolution](../../images/data_evolution.png) +
+## Schema & Fields
+
+The schema of the table is written as a series of fields, plus a schema metadata map.
+The data types generally have a 1-1 correspondence with the Apache Arrow data types.
+Each field, including nested fields, has a unique integer ID. At initial table creation time, fields are assigned IDs in depth-first order.
+Afterwards, field IDs are assigned incrementally for newly added fields.
+
+Column encoding configurations are specified through field metadata using the `lance-encoding:` prefix.
+See [File Format Encoding Specification](../file/encoding.md) for details on available encodings, compression schemes, and configuration options.
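+
+For example, the schema:
+
+```
+a: i32
+b: struct {
+    c: list<int32>
+    d: i32
+}
+```
+
+might be assigned field IDs in depth-first order as follows (the IDs shown are for illustration):
+
+| name  | id | parent_id | logical_type |
+|-------|----|-----------|--------------|
+| `a`   | 1  | 0         | `"int32"`    |
+| `b`   | 2  | 0         | `"struct"`   |
+| `b.c` | 3  | 2         | `"list"`     |
+| `b.c` | 4  | 3         | `"int32"`    |
+| `b.d` | 5  | 2         | `"int32"`    |
+
+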
+Field protobuf message -Would be represented as the following field list: - -| name | id | type | parent_id | logical_type | -|-------|----|----------|-----------|--------------| -| `a` | 1 | LEAF | 0 | `"int32"` | -| `b` | 2 | PARENT | 0 | `"struct"` | -| `b.c` | 3 | REPEATED | 2 | `"list"` | -| `b.c` | 4 | LEAF | 3 | `"int32"` | -| `b.d` | 5 | LEAF | 2 | `"int32"` | - -### Field Encoding Specification - -Column-level encoding configurations are specified through PyArrow field metadata: - -```python -import pyarrow as pa - -schema = pa.schema([ - pa.field( - "compressible_strings", - pa.string(), - metadata={ - "lance-encoding:compression": "zstd", - "lance-encoding:compression-level": "3", - "lance-encoding:structural-encoding": "miniblock", - "lance-encoding:packed": "true" - } - ) -]) +```protobuf +%%% proto.message.lance.file.Field %%% ``` -| Metadata Key | Type | Description | Example Values | Example Usage (Python) | -|--------------------------------------|--------------|----------------------------------------------|-------------------|----------------------------------------------------------------| -| `lance-encoding:compression` | Compression | Specifies compression algorithm | zstd | `metadata={"lance-encoding:compression": "zstd"}` | -| `lance-encoding:compression-level` | Compression | Zstd compression level (1-22) | 3 | `metadata={"lance-encoding:compression-level": "3"}` | -| `lance-encoding:blob` | Storage | Marks binary data (>4MB) for chunked storage | true/false | `metadata={"lance-encoding:blob": "true"}` | -| `lance-encoding:packed` | Optimization | Struct memory layout optimization | true/false | `metadata={"lance-encoding:packed": "true"}` | -| `lance-encoding:structural-encoding` | Nested Data | Encoding strategy for nested structures | miniblock/fullzip | `metadata={"lance-encoding:structural-encoding": "miniblock"}` | - -## Deletion - -Rows can be marked deleted by adding a deletion file next to the data in the `_deletions` folder. -These files contain the indices of rows that have been deleted for some fragments. -For a given version of the dataset, each fragment can have up to one deletion file. -Fragments that have no deleted rows have no deletion file. - -Readers should filter out row IDs contained in these deletion files during a scan or ANN search. - -Deletion files come in two flavors: - -1. Arrow files: which store a column with a flat vector of indices -2. Roaring bitmaps: which store the indices as compressed bitmaps. +
-[Roaring Bitmaps](https://roaringbitmap.org/) are used for larger deletion sets, -while Arrow files are used for small ones. This is because Roaring Bitmaps are known to be inefficient for small sets. +## Fragments -The filenames of deletion files are structured like: +![Fragment Structure](../../images/fragment_structure.png) -``` -_deletions/{fragment_id}-{read_version}-{random_id}.{arrow|bin} -``` +A fragment represents a horizontal partition of the dataset containing a subset of rows. +Each fragment has a unique `uint32` identifier assigned incrementally based on the dataset's maximum fragment ID. +Each fragment consists of one or more data files storing columns, plus an optional deletion file. +If present, the deletion file stores the positions (0-based) of the rows that have been deleted from the fragment. +The fragment tracks the total row count including deleted rows in its physical rows field. +Column subsets can be read without accessing all data files, and each data file is independently compressed and encoded. -Where `fragment_id` is the fragment the file corresponds to, `read_version` is the version of the dataset that it was created off of (usually one less than the version it was committed to), and `random_id` is a random i64 used to avoid collisions. The suffix is determined by the file type (`.arrow` for Arrow file, `.bin` for roaring bitmap). +
+DataFragment protobuf message ```protobuf -%%% proto.message.DeletionFile %%% +%%% proto.message.DataFragment %%% ``` -Deletes can be materialized by re-writing data files with the deleted rows removed. -However, this invalidates row indices and thus the ANN indices, which can be expensive to recompute. - -## Committing Datasets - -A new version of a dataset is committed by writing a new manifest file to the `_versions` directory. - -To prevent concurrent writers from overwriting each other, -the commit process must be atomic and consistent for all writers. -If two writers try to commit using different mechanisms, they may overwrite each other's changes. -For any storage system that natively supports atomic rename-if-not-exists or put-if-not-exists, -these operations should be used. This is true of local file systems and most cloud object stores -including Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage. -For ones that lack this functionality, an external locking mechanism can be configured by the user. - -### Manifest Naming Schemes - -Manifest files must use a consistent naming scheme. The names correspond to the versions. -That way we can open the right version of the dataset without having to read all the manifests. -It also makes it clear which file path is the next one to be written. - -There are two naming schemes that can be used: - -1. V1: `_versions/{version}.manifest`. This is the legacy naming scheme. -2. V2: `_versions/{u64::MAX - version:020}.manifest`. This is the new naming scheme. - The version is zero-padded (to 20 digits) and subtracted from `u64::MAX`. - This allows the versions to be sorted in descending order, - making it possible to find the latest manifest on object storage using a single list call. - -It is an error for there to be a mixture of these two naming schemes. +
-### Conflict Resolution +### Data Evolution -If two writers try to commit at the same time, one will succeed and the other will fail. -The failed writer should attempt to retry the commit, but only if its changes are compatible -with the changes made by the successful writer. +This fragment design enables a new concept called data evolution, which means efficient schema evolution (add column, update column, drop column) with backfill. +For example, when adding a new column, new column data are added by appending new data files to each fragment, with values computed for all existing rows in the fragment. +There is no need to rewrite the entire table to just add data for a single column. +This enables efficient feature engineering and embedding updates for ML/AI workloads. -The changes for a given commit are recorded as a transaction file, -under the `_transactions` prefix in the dataset directory. -The transaction file is a serialized `Transaction` protobuf message. -See the `transaction.proto` file for its definition. +Each data file should contain a distinct set of field ids. +It is not required that all field ids in the dataset schema are found in one of the data files. +If there is no corresponding data file, that column should be read as entirely `NULL`. -![Conflict Resolution Flow](../../images/conflict_resolution_flow.png) +Field ids might be replaced with `-2`, a tombstone value. +In this case that column should be ignored. This used, for example, when rewriting a column: +The old data file replaces the field id with `-2` to ignore the old data, and a new data file is appended to the fragment. -The commit process is as follows: +## Data Files -1. The writer finishes writing all data files. -2. The writer creates a transaction file in the `_transactions` directory. - This file describes the operations that were performed, which is used for two purposes: - (1) to detect conflicts, and (2) to re-build the manifest during retries. -3. Look for any new commits since the writer started writing. - If there are any, read their transaction files and check for conflicts. - If there are any conflicts, abort the commit. Otherwise, continue. -4. Build a manifest and attempt to commit it to the next version. - If the commit fails because another writer has already committed, go back to step 3. +Data files store column data for a fragment using the Lance file format. +Each data file stores a subset of the columns in the fragment. +Field IDs are assigned either sequentially based on schema position (for Lance file format v1) +or independently of column indices due to variable encoding widths (for Lance file format v2). -When checking whether two transactions conflict, be conservative. -If the transaction file is missing, assume it conflicts. -If the transaction file has an unknown operation, assume it conflicts. +
+DataFile protobuf message -### External Manifest Store - -If the backing object store does not support *-if-not-exists operations, -an external manifest store can be used to allow concurrent writers. -An external manifest store is a KV store that supports put-if-not-exists operation. -The external manifest store supplements but does not replace the manifests in object storage. -A reader unaware of the external manifest store could read a table that uses it, -but it might be up to one version behind the true latest version of the table. - -![External Store Commit](../../images/external_store_commit.gif) - -The commit process is as follows: - -1. `PUT_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid}` stage a new manifest in object store under a unique path determined by new uuid -2. `PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest-{uuid}` commit the path of the staged manifest to the external store. -3. `COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest` copy the staged manifest to the final path -4. `PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest` update the external store to point to the final manifest - -Note that the commit is effectively complete after step 2. If the writer fails after step 2, a reader will be able to detect the external store and object store are out-of-sync, and will try to synchronize the two stores. If the reattempt at synchronization fails, the reader will refuse to load. This is to ensure that the dataset is always portable by copying the dataset directory without special tool. +```protobuf +%%% proto.message.DataFile %%% +``` -![External Store Reader](../../images/external_store_reader.gif) +
-The reader load process is as follows: +## Deletion Files -1. `GET_EXTERNAL_STORE base_uri, version, path` then, if path does not end in a UUID return the path -2. `COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest` reattempt synchronization -3. `PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest` update the external store to point to the final manifest -4. `RETURN mydataset.lance/_versions/{version}.manifest` always return the finalized path, return error if synchronization fails +Deletion files (a.k.a. deletion vectors) track deleted rows without rewriting data files. +Each fragment can have at most one deletion file per version. +Deletion files support two storage formats. +The Arrow IPC format (`.arrow` extension) stores a flat Int32Array of deleted row offsets and is efficient for sparse deletions. +The Roaring Bitmap format (`.bin` extension) stores a compressed roaring bitmap and is efficient for dense deletions. +Readers must filter rows whose offsets appear in the deletion file for the fragment. -## Feature: Stable Row IDs +Deletions can be materialized by rewriting data files with deleted rows removed. +However, this invalidates row addresses and requires rebuilding indices, which can be expensive. -The row IDs features assigns a unique u64 ID to each row in the table. -This ID is stable throughout the lifetime of the row. To make access fast, a secondary index is created that maps row IDs to their locations in the table. -The respective parts of these indices are stored in the respective fragment's metadata. +
+DeletionFile protobuf message -**row ID** -: A unique auto-incrementing u64 ID assigned to each row in the table. +```protobuf +%%% proto.message.DeletionFile %%% +``` -**row address** -: The current location of a row in the table. This is a u64 that can be thought of as a pair of two u32 values: the fragment ID and the local row offset. For example, if the row address is (42, 9), then the row is in the 42rd fragment and is the 10th row in that fragment. +
-**row ID sequence** -: The sequence of row IDs in a fragment. +## Related Specifications -**row ID index** -: A secondary index that maps row IDs to row addresses. This index is constructed by reading all the row ID sequences. +### Storage Layout -### Assigning Row IDs +File organization, base path system, and multi-location storage. -Row IDs are assigned in a monotonically increasing sequence. The next row ID is stored in the manifest as the field `next_row_id`. This starts at zero. When making a commit, the writer uses that field to assign row IDs to new fragments. If the commit fails, the writer will re-read the new `next_row_id`, update the new row IDs, and then try again. This is similar to how the `max_fragment_id` is used to assign new fragment IDs. +See [Storage Layout Specification](layout.md) -When a row updated, it is typically assigned a new row ID rather than reusing the old one. This is because this feature doesn't have a mechanism to update secondary indices that may reference the old values for the row ID. By deleting the old row ID and creating a new one, the secondary indices will avoid referencing stale data. +### Transactions -### Row ID Sequences +MVCC, commit protocol, transaction types, and conflict resolution. -The row ID values for a fragment are stored in a `RowIdSequence` protobuf message. This is described in the [protos/rowids.proto](https://github.com/lancedb/lance/blob/main/protos/rowids.proto) file. Row ID sequences are just arrays of u64 values, which have representations optimized for the common case where they are sorted and possibly contiguous. For example, a new fragment will have a row ID sequence that is just a simple range, so it is stored as a `start` and `end` value. +See [Transaction Specification](transaction.md) -These sequence messages are either stored inline in the fragment metadata, or are written to a separate file and referenced from the fragment metadata. This choice is typically made based on the size of the sequence. If the sequence is small, it is stored inline. If it is large, it is written to a separate file. By keeping the small sequences inline, we can avoid the overhead of additional IO operations. +### Row Lineage -```protobuf -oneof row_id_sequence { - // Inline sequence - bytes inline_sequence = 1; - // External file reference - string external_file = 2; -} // row_id_sequence -``` +Row address, Stable row ID, row version tracking, and change data feed. -### Row ID Index +See [Row ID & Lineage Specification](row_id_lineage.md) -To ensure fast access to rows by their row ID, a secondary index is created that maps row IDs to their locations in the table. This index is built when a table is loaded, based on the row ID sequences in the fragments. For example, if fragment 42 has a row ID sequence of `[0, 63, 10]`, then the index will have entries for `0 -> (42, 0)`, `63 -> (42, 1)`, `10 -> (42, 2)`. The exact form of this index is left up to the implementation, but it should be optimized for fast lookups. +### Indices -### Row ID masks +Vector indices, scalar indices, full-text search, and index management. -Because index files are immutable, they main contain references to row IDs that have been deleted or that have new values. -To handle this, a mask is created for the index. +See [Index Specification](index/index.md) -![Index and Row ID marks](../../images/stable_row_id_indices.png) +### Versioning -For example, consider the sequence shown in the above image. -It has a dataset with two columns, `str` and `vec`. 
-A string column and a vector column. -Each of them have indices, a scalar index for the string column and a vector index for the vector column. -There is just one fragment in the dataset, with contiguous row IDs 1 through 3. +Feature flags and format version compatibility. -When an update operation is made that modifies the `vec` column in row 2, a new fragment is created with the updated value. -A deletion file is added to the original fragment marking that row 2 as deleted in the first file. -In the `str` index, the fragment bitmap is updated to reflect the new location of the row IDs:`{1, 2}`. -Meanwhile, the `vec` index's fragment bitmap does not update, staying at `{1}`. -This is because the value in `vec` was updated, so the data in the index no longer reflects the data in the table. +See [Format Versioning Specification](versioning.md) diff --git a/docs/src/format/table/layout.md b/docs/src/format/table/layout.md new file mode 100644 index 00000000000..46efa56a908 --- /dev/null +++ b/docs/src/format/table/layout.md @@ -0,0 +1,203 @@ +# Storage Layout Specification + +## Overview + +This specification defines how Lance datasets are organized on object storage. +The layout design emphasizes portability, allowing datasets to be relocated or referenced across multiple storage systems with minimal metadata changes. + +## Dataset Root + +The dataset root is the location where the dataset was initially created. +Every Lance dataset has exactly one dataset root, which serves as the primary storage location for the dataset's files. +The dataset root contains the standard subdirectory structure (`data/`, `_versions/`, `_deletions/`, `_indices/`, `_refs/`, `tree/`) that organizes the dataset's files. + +## Basic Layout + +A Lance dataset in its basic form stores all files within the dataset root directory structure: + +``` +{dataset_root}/ + data/ + *.lance -- Data files containing column data + _versions/ + *.manifest -- Manifest files (one per version) + _transactions/ + *.txn -- Transaction files for commit coordination + _deletions/ + *.arrow -- Deletion vector files (arrow format) + *.bin -- Deletion vector files (bitmap format) + _indices/ + {UUID}/ + ... -- Index content (different for each index type) + _refs/ + tags/ + *.json -- Tag metadata + branches/ + *.json -- Branch metadata + tree/ + {branch_name}/ + ... -- Branch dataset + +``` + +## Base Path System + +### BasePath Message + +The manifest's `base_paths` field contains an array of `BasePath` entries that define alternative storage locations for dataset files. +Each base path entry has a unique numeric identifier that file metadata can reference to indicate where files are located. +The `path` field specifies an absolute path interpretable by the object store. +The `is_dataset_root` field determines how the path is interpreted: when true, the path points to a dataset root with standard subdirectories (`data/`, `_deletions/`, `_indices/`); when false, the path points directly to a file directory without subdirectories. +An optional `name` field provides a human-readable alias, which is particularly useful for referencing tags in shallow clones. + +
+BasePath protobuf message + +```protobuf +message BasePath { + uint32 id = 1; + optional string name = 2; + bool is_dataset_root = 3; + string path = 4; +} +``` + +
+ +### File Metadata Base References + +Three types of files can specify alternative base paths: data files, deletion files, and index metadata. +Each of these file types includes an optional `base_id` field in their metadata that references a base path entry by its numeric identifier. +When a file's `base_id` is absent, the file is located relative to the dataset root. +When a file's `base_id` is present, readers must look up the corresponding base path entry in the manifest's `base_paths` array to determine where the file is stored. + +At read time, path resolution follows a two-step process. +First, the reader determines the base path: if `base_id` is absent, the base path is the dataset root; otherwise, the reader looks up the base path entry using the `base_id` to obtain the path and its `is_dataset_root` flag. +Second, the reader constructs the full file path based on whether the base path represents a dataset root. +For dataset roots (when `is_dataset_root` is true), the full path includes standard subdirectories: data files are located under `data/`, deletion files under `_deletions/`, and indices under `_indices/`. +For non-root base paths (when `is_dataset_root` is false), the base path points directly to the file directory, and the file path is appended directly without subdirectory prefixes. + +### Example Complex Layout Scenarios + +#### Hot/Cold Tiering + +``` +Manifest base_paths: +[ + { id: 0, is_dataset_root: true, path: "s3://hot-bucket/dataset" }, + { id: 1, is_dataset_root: true, path: "s3://cold-bucket/dataset-archive" } +] + +Fragment 0 (recent data): + DataFile { path: "fragment-0.lance", base_id: 0 } + → resolves to: s3://hot-bucket/dataset/data/fragment-0.lance + +Fragment 100 (historical data): + DataFile { path: "fragment-100.lance", base_id: 1 } + → resolves to: s3://cold-bucket/dataset-archive/data/fragment-100.lance +``` + +This allows seamless querying across storage tiers without data movement. + +#### Multi-Region Distribution + +``` +Manifest base_paths: +[ + { id: 0, is_dataset_root: true, path: "s3://us-east-bucket/dataset" }, + { id: 1, is_dataset_root: true, path: "s3://eu-west-bucket/dataset" }, + { id: 2, is_dataset_root: true, path: "s3://ap-south-bucket/dataset" } +] + +Fragments distributed by data locality: + Fragment 0 (US users): base_id: 0 + Fragment 1 (EU users): base_id: 1 + Fragment 2 (Asia users): base_id: 2 +``` + +Compute jobs can read data from the nearest region without data transfer. + +#### Shallow Clone + +Shallow clones create a new dataset that references data files from a source dataset without copying: + +**Example: Shallow Clone** + +``` +Source dataset: s3://production/main-dataset +Clone dataset: s3://experiments/test-variant + +Clone manifest base_paths: +[ + { id: 0, is_dataset_root: true, path: "s3://experiments/test-variant" }, + { id: 1, is_dataset_root: true, path: "s3://production/main-dataset", + name: "v1.0" } +] + +Original fragments (inherited): + DataFile { path: "fragment-0.lance", base_id: 1 } + → resolves to: s3://production/main-dataset/data/fragment-0.lance + +New fragments (clone-specific): + DataFile { path: "fragment-new.lance", base_id: 0 } + → resolves to: s3://experiments/test-variant/data/fragment-new.lance +``` + +The clone can append new data, modify schemas, or delete rows without affecting the source dataset. +Only the manifest and new data files are stored in the clone location. + +**Workflow:** + +1. [Clone transaction](transaction.md#clone) creates new manifest in target location +2. 
Manifest includes base path pointing to source dataset
3. Original fragments reference source via `base_id: 1`
4. Subsequent writes reference clone location via `base_id: 0`
5. Source dataset remains immutable and can be garbage collected independently

## Dataset Portability

The base path system combined with relative file references provides strong portability guarantees for Lance datasets.
All file paths within Lance files are stored relative to their containing directory, enabling datasets to be relocated without file modifications.

To port a dataset to a new location, simply copy all contents from the dataset root directory.
The copied dataset will function immediately at the new location without any manifest updates, as all file references within the dataset root resolve through relative paths.

When a dataset uses multiple base paths (such as in shallow clones or multi-bucket configurations), users have flexibility in how to port the dataset.
The simplest approach is to copy only the dataset root, which preserves references to the original base path locations.
Alternatively, users can copy additional base paths to the new location and update the manifest's `base_paths` array to reflect the new locations.
Since only the `base_paths` field in the manifest requires modification, this remains a lightweight metadata operation that does not require rewriting any other metadata or data files.

## File Naming Conventions

### Data Files

Pattern: `data/{uuid-based-filename}.lance`

Data files use UUID-based filenames optimized for S3 throughput.
The filename is generated from a UUID (16 bytes) by converting the first 3 bytes to a 24-character binary string and the remaining 13 bytes to a 26-character hex string, resulting in a 50-character filename.
The binary prefix (rather than hex) provides maximum entropy per character, allowing S3's internal partitioning to quickly recognize access patterns and scale appropriately, minimizing throttling.

Example: `data/101100101101010011010110a1b2c3d4e5f6a7b8c9d0e1f2a3.lance`

### Deletion Files

Pattern: `_deletions/{fragment_id}-{read_version}-{id}.{extension}`

Deletion files use two extensions: `.arrow` for Arrow IPC format (sparse deletions) and `.bin` for Roaring bitmap format (dense deletions).

Example: `_deletions/42-10-a1b2c3d4.arrow`

### Transaction Files

Pattern: `_transactions/{read_version}-{uuid}.txn`

Where `read_version` is the table version the transaction was built from.

Example: `_transactions/5-550e8400-e29b-41d4-a716-446655440000.txn`

### Manifest Files

Manifest files are stored in the `_versions/` directory with naming schemes that support atomic commits.

See [Manifest Naming Schemes](transaction.md#manifest-naming-schemes) for details on the V1 and V2 patterns and their implications for version discovery.
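Returning to the data-file naming scheme above, here is a minimal Rust sketch of the 24-binary-plus-26-hex construction; the function name is illustrative, and the example bytes are chosen to reproduce the example filename shown earlier.

```rust
/// Build a 50-character data file name from 16 UUID bytes:
/// first 3 bytes as 24 binary digits, remaining 13 bytes as 26 hex digits.
fn data_file_name(uuid: [u8; 16]) -> String {
    let mut name = String::with_capacity(50);
    for byte in &uuid[..3] {
        name.push_str(&format!("{byte:08b}")); // 8 binary digits per byte
    }
    for byte in &uuid[3..] {
        name.push_str(&format!("{byte:02x}")); // 2 hex digits per byte
    }
    name
}

fn main() {
    // In practice the bytes come from a freshly generated v4 UUID.
    let name = data_file_name([
        0xB2, 0xD4, 0xD6, 0xA1, 0xB2, 0xC3, 0xD4, 0xE5,
        0xF6, 0xA7, 0xB8, 0xC9, 0xD0, 0xE1, 0xF2, 0xA3,
    ]);
    assert_eq!(name.len(), 50);
    println!("data/{name}.lance");
}
```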
diff --git a/docs/src/format/table/row_id_lineage.md b/docs/src/format/table/row_id_lineage.md
new file mode 100644
index 00000000000..3f61673128b
--- /dev/null
+++ b/docs/src/format/table/row_id_lineage.md
@@ -0,0 +1,337 @@
# Row ID and Lineage Specification

## Overview

Lance provides row identification and lineage tracking capabilities.
Row addressing enables efficient random access to rows within the table through a physical location encoding.
Stable row IDs provide persistent identifiers that remain constant throughout a row's lifetime, even as its physical location changes.
Row version tracking records when rows were created and last modified, enabling incremental processing, change data capture, and time-travel queries.

## Row ID Styles

Lance uses two different styles of row IDs:

### Row Address

Row address is the physical location of a row in the table, represented as a 64-bit identifier composed of two 32-bit values:

```
row_address = (fragment_id << 32) | local_row_offset
```

This addressing scheme enables efficient random access: given a row address, the fragment and offset are extracted with bit operations.
Row addresses change when data is reorganized through compaction or updates.

Row address is currently the primary form of identifier used for indexing purposes.
Secondary indices (vector indices, scalar indices, full-text search indices) reference rows by their row addresses.

!!! note
    Work to support stable row IDs in indices is in progress.
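A minimal Rust sketch of the address packing above (function names are illustrative):

```rust
/// Pack a fragment ID and local row offset into a 64-bit row address.
fn row_address(fragment_id: u32, local_row_offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | (local_row_offset as u64)
}

/// Recover the (fragment_id, local_row_offset) pair from a row address.
fn split_address(row_address: u64) -> (u32, u32) {
    ((row_address >> 32) as u32, row_address as u32)
}

fn main() {
    let addr = row_address(7, 42);
    assert_eq!(split_address(addr), (7, 42));
}
```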
### Stable Row ID

Stable Row ID is a unique, auto-incrementing u64 identifier assigned to each row that remains constant throughout the row's lifetime,
even when the row's physical location (row address) changes.
See the next section for more details.

!!! warning
    Historically, "row ID" was used interchangeably with "row address".
    With the introduction of stable row IDs,
    there may be places in code and documentation that mix the terms "row ID" and "row address" or "row ID" and "stable row ID".
    Please raise a PR if you find any usage that is incorrect or confusing.

## Stable Row ID

### Row ID Assignment

Row IDs are assigned using a monotonically increasing `next_row_id` counter stored in the manifest.

**Assignment Protocol:**

1. Writer reads the current `next_row_id` from the manifest at the read version
2. Writer assigns row IDs sequentially starting from `next_row_id` for new rows
3. Writer updates `next_row_id` in the new manifest to `next_row_id + num_new_rows`
4. If the commit fails due to a conflict, the writer rebases:
    - Re-reads the new `next_row_id` from the latest version
    - Reassigns row IDs to new rows using the updated counter
    - Retries the commit

This protocol mirrors fragment ID assignment and ensures row IDs are unique across all table versions.

### Row ID Behavior on Updates

When a row is updated, it is typically assigned a new row ID rather than reusing the old one.
This avoids the complexity of updating secondary indices that may reference the old values.

**Update Workflow:**

1. Original row with ID `R` exists at address `(F1, O1)`
2. Update operation creates new row with ID `R'` at address `(F2, O2)`
3. Deletion vector marks row ID `R` as deleted in fragment `F1`
4. Secondary indices referencing old row ID `R` are invalidated through fragment bitmap updates
5. New row ID `R'` requires an index rebuild for affected columns

This approach ensures secondary indices do not reference stale data.

### Row ID Sequences

#### Storage Format

Row ID sequences are stored using the `RowIdSequence` protobuf message.
The sequence is partitioned into segments, each encoded optimally based on the data pattern.

**RowIdSequence protobuf message**

```protobuf
%%% proto.message.RowIdSequence %%%
```
+ +#### Segment Encodings + +Each segment uses one of five encodings optimized for different data patterns: + +##### Range (Contiguous Values) + +For sorted, contiguous values with no gaps. +Example: Row IDs `[100, 101, 102, 103, 104]` → `Range{start: 100, end: 105}`. +Used for new fragments where row IDs are assigned sequentially. + +
**Range protobuf message**

```protobuf
%%% proto.message.Range %%%
```
+ +##### Range with Holes (Sparse Deletions) + +For sorted values with few gaps. +Example: Row IDs `[100, 101, 103, 104]` (missing 102) → `RangeWithHoles{start: 100, end: 105, holes: [102]}`. +Used for fragments with sparse deletions where maintaining the range is efficient. + +
**RangeWithHoles protobuf message**

```protobuf
%%% proto.message.RangeWithHoles %%%
```
+ +##### Range with Bitmap (Dense Deletions) + +For sorted values with many gaps. +The bitmap encodes 8 values per byte, with the most significant bit representing the first value. +Used for fragments with dense deletion patterns. + +
**RangeWithBitmap protobuf message**

```protobuf
%%% proto.message.RangeWithBitmap %%%
```
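To illustrate the bitmap layout (8 values per byte, most significant bit first), here is a hedged Rust sketch that materializes the row IDs present in a `RangeWithBitmap` segment; the function name is illustrative.

```rust
/// Decode a RangeWithBitmap segment: for each position i in [0, end - start),
/// row ID (start + i) is present when bit (7 - i % 8) of bitmap[i / 8] is set.
fn decode_range_with_bitmap(start: u64, end: u64, bitmap: &[u8]) -> Vec<u64> {
    let mut ids = Vec::new();
    for i in 0..(end - start) {
        let byte = bitmap[(i / 8) as usize];
        let bit = 7 - (i % 8) as u32; // most significant bit holds the first value
        if byte & (1u8 << bit) != 0 {
            ids.push(start + i);
        }
    }
    ids
}

fn main() {
    // Range [100, 105) with 102 missing: bits 1,1,0,1,1 followed by zero padding.
    let ids = decode_range_with_bitmap(100, 105, &[0b1101_1000]);
    assert_eq!(ids, vec![100, 101, 103, 104]);
}
```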
+ +##### Sorted Array (Sparse Values) + +For sorted but non-contiguous values, stored as an `EncodedU64Array`. +Used for merged fragments or fragments after compaction. + +##### Unsorted Array (General Case) + +For unsorted values, stored as an `EncodedU64Array`. +Rare; most operations maintain sorted order. + +#### Encoded U64 Arrays + +The `EncodedU64Array` message supports bitpacked encoding to minimize storage. +The implementation selects the most compact encoding based on the value range, choosing between base + 16-bit offsets, base + 32-bit offsets, or full 64-bit values. + +
**EncodedU64Array protobuf message**

```protobuf
%%% proto.message.EncodedU64Array %%%
```
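The selection logic can be sketched as follows; this is a simplification under the assumption that the choice is driven purely by the spread of values, with the exact cost model left to implementations.

```rust
enum U64Encoding {
    /// Base value plus 16-bit offsets.
    U16Offsets { base: u64 },
    /// Base value plus 32-bit offsets.
    U32Offsets { base: u64 },
    /// Full 64-bit values.
    Plain,
}

/// Pick the narrowest offset width that can represent every value
/// relative to the minimum value in the array.
fn choose_encoding(values: &[u64]) -> U64Encoding {
    let min = values.iter().copied().min().unwrap_or(0);
    let max = values.iter().copied().max().unwrap_or(0);
    match max - min {
        d if d <= u16::MAX as u64 => U64Encoding::U16Offsets { base: min },
        d if d <= u32::MAX as u64 => U64Encoding::U32Offsets { base: min },
        _ => U64Encoding::Plain,
    }
}

fn main() {
    // A narrow spread fits in 16-bit offsets above the base value.
    assert!(matches!(
        choose_encoding(&[1000, 1500, 2000]),
        U64Encoding::U16Offsets { base: 1000 }
    ));
}
```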
+ +#### Inline vs External Storage + +Row ID sequences are stored either inline in the fragment metadata or in external files. +Sequences smaller than ~200KB are stored inline to avoid additional I/O, while larger sequences are written to external files referenced by path and offset. +This threshold balances manifest size against the overhead of separate file reads. + +
**DataFragment row_id_sequence field**

```protobuf
message DataFragment {
  oneof row_id_sequence {
    bytes inline_row_ids = 5;
    ExternalFile external_row_ids = 6;
  }
}
```
+ +### Row ID Index + +#### Construction + +The row ID index is built at table load time by aggregating row ID sequences from all fragments: + +``` +For each fragment F with ID f: + For each (position p, row_id r) in F.row_id_sequence: + index[r] = (f, p) +``` + +This creates a mapping from row ID to current row address. + +#### Index Invalidation with Updates + +When rows are updated, the row ID index must account for stale mappings: + +**Example Scenario:** + +1. Initial state: Fragment 1 contains rows with IDs `[1, 2, 3]` at offsets `[0, 1, 2]` +2. Update operation modifies row 2: + - New fragment 2 created with row ID `4` (new ID assigned) + - Deletion vector marks row ID `2` as deleted in fragment 1 +3. Row ID index: + - `1 → (1, 0)` ✓ Valid + - `2 → (1, 1)` ✗ Invalid (deleted) + - `3 → (1, 2)` ✓ Valid + - `4 → (2, 0)` ✓ Valid (new row) + +#### Fragment Bitmaps for Index Masking + +Secondary indices use fragment bitmaps to track which row IDs remain valid: + +**Without Row ID Updates:** + +``` +String Index on column "str": + Fragment Bitmap: {1, 2} (covers fragments 1 and 2) + All indexed row IDs are valid +``` + +**With Row ID Updates:** + +``` +Vector Index on column "vec": + Fragment Bitmap: {1} (only fragment 1) + Row ID 2 was updated, so index entry for ID 2 is stale + Index query filters out ID 2 using deletion vectors +``` + +This bitmap-based approach allows indices to remain immutable while accounting for row modifications. + +## Row Version Tracking + +### Created At Version + +Each row tracks the version at which it was created. +The sequence uses run-length encoding for efficient storage, where each run specifies a span of consecutive rows and the version they were created in. + +Example: Fragment with 1000 rows created in version 5: +``` +RowDatasetVersionSequence { + runs: [ + RowDatasetVersionRun { span: Range{start: 0, end: 1000}, version: 5 } + ] +} +``` + +
**DataFragment created_at_version_sequence field**

```protobuf
message DataFragment {
  oneof created_at_version_sequence {
    bytes inline_created_at_versions = 9;
    ExternalFile external_created_at_versions = 10;
  }
}
```
+ +
**RowDatasetVersionSequence protobuf messages**

```protobuf
%%% proto.message.RowDatasetVersionSequence %%%
```
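Given the run-length layout above, looking up the created-at version for a row position is a scan (or binary search) over the runs. A minimal sketch, with illustrative in-memory types standing in for the decoded protobuf:

```rust
use std::ops::Range;

struct VersionRun {
    span: Range<u64>, // row positions covered by this run
    version: u64,     // dataset version that created these rows
}

/// Find the created-at version for a row position. Runs are sorted by
/// span start, so a binary search also works for large sequences.
fn created_at_version(runs: &[VersionRun], row_position: u64) -> Option<u64> {
    runs.iter()
        .find(|run| run.span.contains(&row_position))
        .map(|run| run.version)
}

fn main() {
    // Fragment with 1000 rows created in version 5, as in the example above.
    let runs = vec![VersionRun { span: 0..1000, version: 5 }];
    assert_eq!(created_at_version(&runs, 123), Some(5));
}
```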
+ +### Last Updated At Version + +Each row tracks the version at which it was last modified. +When a row is created, `last_updated_at_version` equals `created_at_version`. +When a row is updated, a new row is created with both `created_at_version` and `last_updated_at_version` set to the current version, and the old row is marked deleted. + +Example: Row created in version 3, updated in version 7: +``` +Old row (marked deleted): + created_at_version: 3 + last_updated_at_version: 3 + +New row: + created_at_version: 7 + last_updated_at_version: 7 +``` + +
**DataFragment last_updated_at_version_sequence field**

```protobuf
message DataFragment {
  oneof last_updated_at_version_sequence {
    bytes inline_last_updated_at_versions = 7;
    ExternalFile external_last_updated_at_versions = 8;
  }
}
```
+ +## Change Data Feed + +Lance supports querying rows that changed between versions through version tracking columns. +These queries can be expressed as standard SQL predicates on the `_row_created_at_version` and `_row_last_updated_at_version` columns. + +### Inserted Rows + +Rows created between two versions can be retrieved by filtering on `_row_created_at_version`: + +```sql +SELECT * FROM dataset +WHERE _row_created_at_version > {begin_version} + AND _row_created_at_version <= {end_version} +``` + +This query returns all rows inserted in the specified version range, including the version metadata columns `_row_created_at_version`, `_row_last_updated_at_version`, and `_rowid`. + +### Updated Rows + +Rows modified (but not newly created) between two versions can be retrieved by combining filters on both version columns: + +```sql +SELECT * FROM dataset +WHERE _row_created_at_version <= {begin_version} + AND _row_last_updated_at_version > {begin_version} + AND _row_last_updated_at_version <= {end_version} +``` + +This query excludes newly inserted rows by requiring `_row_created_at_version <= {begin_version}`, ensuring only pre-existing rows that were subsequently updated are returned. + diff --git a/docs/src/format/table/transaction.md b/docs/src/format/table/transaction.md new file mode 100644 index 00000000000..56b867a4683 --- /dev/null +++ b/docs/src/format/table/transaction.md @@ -0,0 +1,447 @@ +# Transaction Specification + +## Transaction Overview + +Lance implements Multi-Version Concurrency Control (MVCC) to provide ACID transaction guarantees for concurrent readers and writers. +Each commit creates a new immutable table version through atomic storage operations. +All table versions form a serializable history, enabling features such as time travel and schema evolution. + +Transactions are the fundamental unit of change in Lance. +A transaction describes a set of modifications to be applied atomically to create a new table version. +The transaction model supports concurrent writes through optimistic concurrency control with automatic conflict resolution. + +## Commit Protocol + +### Storage Primitives + +Lance commits rely on atomic write operations provided by the underlying object store: + +- **rename-if-not-exists**: Atomically rename a file only if the target does not exist +- **put-if-not-exists**: Atomically write a file only if it does not already exist (also known as PUT-IF-NONE-MATCH or conditional PUT) + +These primitives guarantee that exactly one writer succeeds when multiple writers attempt to create the same manifest file concurrently. + +### Manifest Naming Schemes + +Lance supports two manifest naming schemes: + +- **V1**: `{version}.manifest` - Monotonically increasing version numbers (e.g., `1.manifest`, `2.manifest`) +- **V2**: `{u64::MAX - version:020}.manifest` - Reverse-sorted lexicographic ordering (e.g., `18446744073709551614.manifest` for version 1) + +The V2 scheme enables efficient discovery of the latest version through lexicographic object listing. + +### Transaction Files + +Transaction files store the serialized transaction protobuf message for each commit attempt. +These files serve two purposes: + +1. Enable manifest reconstruction during commit retries when concurrent transactions have been committed +2. Support conflict detection by describing the operation performed + +### Commit Algorithm + +The commit process attempts to atomically write a new manifest file using the storage primitives described above. 
When concurrent writers conflict, the system loads transaction files to detect conflicts and attempts to rebase the transaction if possible.
If the atomic commit fails, the process retries with updated transaction state.
For detailed conflict detection and resolution mechanisms, see the [Conflict Resolution](#conflict-resolution) section.

## Transaction Types

The authoritative specification for transaction types is defined in [`protos/transaction.proto`](https://github.com/lancedb/lance/blob/main/protos/transaction.proto).

Each transaction contains a `read_version` field indicating the table version from which the transaction was built,
a `uuid` field uniquely identifying the transaction, and an `operation` field specifying one of the following transaction types:

### Append

Adds new fragments to the table without modifying existing data.
Fragment IDs are not assigned at transaction creation time; they are assigned during manifest construction.
**Append protobuf message**

```protobuf
%%% proto.message.Append %%%
```
+ +### Delete + +Marks rows as deleted using deletion vectors. +May update fragments (adding deletion vectors) or delete entire fragments. +The `predicate` field stores the deletion condition, enabling conflict detection with concurrent transactions. + +
**Delete protobuf message**

```protobuf
%%% proto.message.Delete %%%
```
+ +### Overwrite + +Creates or completely overwrites the table with new data, schema, and configuration. + +
**Overwrite protobuf message**

```protobuf
%%% proto.message.Overwrite %%%
```
+ +### CreateIndex + +Adds, replaces, or removes secondary indices (vector indices, scalar indices, full-text search indices). + +
**CreateIndex protobuf message**

```protobuf
%%% proto.message.CreateIndex %%%
```
+ +### Rewrite + +Reorganizes data without semantic modification. +This includes operations such as compaction, defragmentation, and re-ordering. +Rewrite operations change row addresses, requiring index updates. +New fragment IDs must be reserved via `ReserveFragments` before executing a `Rewrite` transaction. + +
**Rewrite protobuf message**

```protobuf
%%% proto.message.Rewrite %%%
```
+ +### Merge + +Adds new columns to the table, modifying the schema. +All fragments must be updated to include the new columns. + +
**Merge protobuf message**

```protobuf
%%% proto.message.Merge %%%
```
+ +### Project + +Removes columns from the table, modifying the schema. +This is a metadata-only operation; data files are not modified. + +
**Project protobuf message**

```protobuf
%%% proto.message.Project %%%
```
+ +### Restore + +Reverts the table to a previous version. + +
**Restore protobuf message**

```protobuf
%%% proto.message.Restore %%%
```
+ +### ReserveFragments + +Pre-allocates fragment IDs for use in future `Rewrite` operations. +This allows rewrite operations to reference fragment IDs before the rewrite transaction is committed. + +
**ReserveFragments protobuf message**

```protobuf
%%% proto.message.ReserveFragments %%%
```
+ +### Clone + +Creates a shallow or deep copy of the table. +Shallow clones are metadata-only copies that reference original data files through `base_paths`. +Deep clones are full copies using object storage native copy operations (e.g., S3 CopyObject). + +
**Clone protobuf message**

```protobuf
%%% proto.message.Clone %%%
```
### Update

Modifies row values without adding or removing rows.
Supports two execution modes:

- `REWRITE_ROWS`: deletes the affected rows in current fragments and rewrites them into new fragments. Optimal when the majority of columns are modified or only a small number of rows are affected.
- `REWRITE_COLUMNS`: fully rewrites the affected columns within fragments by tombstoning the old column versions. Optimal when most rows are affected but only a subset of columns is modified.
**Update protobuf message**

```protobuf
%%% proto.message.Update %%%
```
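The specification does not mandate how writers choose between the two modes above, but a plausible heuristic (purely illustrative, with made-up thresholds) follows directly from that guidance:

```rust
enum UpdateMode {
    RewriteRows,
    RewriteColumns,
}

/// Illustrative only: prefer REWRITE_ROWS when few rows or most columns
/// are touched; prefer REWRITE_COLUMNS when many rows but few columns are.
fn choose_update_mode(updated_row_fraction: f64, updated_column_fraction: f64) -> UpdateMode {
    if updated_column_fraction > 0.5 || updated_row_fraction < 0.1 {
        UpdateMode::RewriteRows
    } else {
        UpdateMode::RewriteColumns
    }
}
```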
+ +### UpdateConfig + +Modifies table configuration, table metadata, schema metadata, or field metadata without changing data. + +
**UpdateConfig protobuf message**

```protobuf
%%% proto.message.UpdateConfig %%%
```
+ +### DataReplacement + +Replaces data in specific column regions with new data files. + +
**DataReplacement protobuf message**

```protobuf
%%% proto.message.DataReplacement %%%
```
+ +### UpdateMemWalState + +Updates the state of MemWal indices (write-ahead log based indices). + +
**UpdateMemWalState protobuf message**

```protobuf
%%% proto.message.UpdateMemWalState %%%
```
+ +### UpdateBases + +Adds new base paths to the table, enabling reference to data files in additional locations. + +
**UpdateBases protobuf message**

```protobuf
%%% proto.message.UpdateBases %%%
```
+ +## Conflict Resolution + +### Terminology + +When concurrent transactions attempt to commit against the same read version, Lance employs conflict resolution to determine whether the transactions can coexist. +Three outcomes are possible: + +- **Rebasable**: The transaction can be modified to incorporate concurrent changes while preserving its semantic intent. + The transaction is transformed to account for the concurrent modification, then the commit is retried automatically within the commit layer. + +- **Retryable**: The transaction cannot be rebased, but the operation can be re-executed at the application level with updated data. + The implementation returns a retryable conflict error, signaling that the application should re-read the data and retry the operation. + The retried operation is expected to produce semantically equivalent results. + +- **Incompatible**: The transactions conflict in a fundamental way where retrying would violate the operation's assumptions or produce semantically different results than expected. + The commit fails with a non-retryable error. + Callers should proceed with extreme caution if they decide to retry, as the transaction may produce different output than originally intended. + +### Rebase Mechanism + +The `TransactionRebase` structure tracks the state necessary to rebase a transaction against concurrent commits: + +1. **Fragment tracking**: Maintains a map of fragments as they existed at the transaction's read version, marking which require rewriting +2. **Modification detection**: Tracks the set of fragment IDs that have been modified or deleted +3. **Affected rows**: For Delete and Update operations, stores the specific rows affected by the operation for fine-grained conflict detection +4. **Fragment reuse indices**: Accumulates fragment reuse index metadata from concurrent Rewrite operations + +When a concurrent transaction is detected, the rebase process: + +1. Compares fragment modifications to determine if there is overlap +2. For Delete/Update operations, compares `affected_rows` to detect whether the same rows were modified +3. Merges deletion vectors when both transactions delete rows from the same fragment +4. Accumulates fragment reuse index updates when concurrent Rewrites change fragment IDs +5. 
Modifies the transaction if rebasable, or returns a retryable/incompatible conflict error

### Conflict Scenarios

#### Rebasable Conflict Example

The following diagram illustrates a rebasable conflict where two Delete operations modify different rows in the same fragment:

```mermaid
gitGraph
    commit id: "v1"
    commit id: "v2"
    branch writer-a
    branch writer-b
    checkout writer-a
    commit id: "Delete rows 100-199" tag: "read_version=2"
    checkout writer-b
    commit id: "Delete rows 500-599" tag: "read_version=2"
    checkout main
    merge writer-a tag: "v3"
    checkout writer-b
    commit id: "Rebase: merge deletion vectors" type: HIGHLIGHT
    checkout main
    merge writer-b tag: "v4"
```

In this scenario:

- Writer A deletes rows 100-199 and successfully commits version 3
- Writer B attempts to commit but detects version 3 exists
- Writer B's transaction is rebasable because it only modified deletion vectors (not data files) and `affected_rows` do not overlap
- Writer B rebases by merging Writer A's deletion vector with its own and writing the merged vector to storage
- Writer B successfully commits version 4

#### Retryable Conflict Example

The following diagram illustrates a retryable conflict where an Update operation encounters a concurrent Rewrite (compaction) that prevents automatic rebasing:

```mermaid
gitGraph
    commit id: "v1"
    commit id: "v2"
    branch writer-a
    branch writer-b
    checkout writer-a
    commit id: "Compact fragments 1-5" tag: "read_version=2"
    checkout writer-b
    commit id: "Update rows in fragment 3" tag: "read_version=2"
    checkout main
    merge writer-a tag: "v3: fragments compacted"
    checkout writer-b
    commit id: "Detect conflict: cannot rebase" type: REVERSE
```

In this scenario:

- Writer A compacts fragments 1-5 into a single fragment and successfully commits version 3
- Writer B attempts to update rows in fragment 3 but detects version 3 exists
- Writer B's Update transaction is retryable but not rebasable: fragment 3 no longer exists after compaction
- The commit layer returns a retryable conflict error
- The application must re-execute the Update operation against version 3, locating the rows in the new compacted fragment

#### Incompatible Conflict Example

The following diagram illustrates an incompatible conflict where a Delete operation encounters a concurrent Restore that fundamentally invalidates the operation:

```mermaid
gitGraph
    commit id: "v1"
    commit id: "v2"
    commit id: "v3"
    branch writer-a
    branch writer-b
    checkout writer-a
    commit id: "Restore to v1" tag: "read_version=3"
    checkout writer-b
    commit id: "Delete rows added in v2-v3" tag: "read_version=3"
    checkout main
    merge writer-a tag: "v4: restored to v1"
    checkout writer-b
    commit id: "Detect conflict: incompatible" type: REVERSE
```

In this scenario:

- Writer A restores the table to version 1 and successfully commits version 4
- Writer B attempts to delete rows that were added between versions 2 and 3
- Writer B's Delete transaction is incompatible: the table has been restored to version 1, and the rows it intended to delete no longer exist
- The commit fails with a non-retryable error
- If the caller retries the deletion operation against version 4, it would either delete nothing (if those rows don't exist in v1) or delete different rows (if similar row IDs exist in v1), producing semantically different results than originally intended


## External Manifest Store

If the backing object store does not support atomic operations
(rename-if-not-exists or put-if-not-exists), an external manifest store can be used to enable concurrent writers.

An external manifest store is a key-value store that supports put-if-not-exists operations.
The external manifest store supplements but does not replace the manifests in object storage.
A reader unaware of the external manifest store can still read the table, but may observe a version up to one commit behind the true latest version.

### Commit Process with External Store

The commit process follows a four-step protocol:

![External Store Commit Process](../../images/external_store_commit.gif)

1. **Stage manifest**: `PUT_OBJECT_STORE {dataset}/_versions/{version}.manifest-{uuid}`
    - Write the new manifest to object storage under a unique path determined by a new UUID
    - This staged manifest is not yet visible to readers

2. **Commit to external store**: `PUT_EXTERNAL_STORE base_uri, version, {dataset}/_versions/{version}.manifest-{uuid}`
    - Atomically commit the path of the staged manifest to the external store using put-if-not-exists
    - The commit is effectively complete after this step
    - If this operation fails due to conflict, another writer has committed this version

3. **Finalize in object store**: `COPY_OBJECT_STORE {dataset}/_versions/{version}.manifest-{uuid} → {dataset}/_versions/{version}.manifest`
    - Copy the staged manifest to the final path
    - This makes the manifest discoverable by readers unaware of the external store

4. **Update external store pointer**: `PUT_EXTERNAL_STORE base_uri, version, {dataset}/_versions/{version}.manifest`
    - Update the external store to point to the finalized manifest path
    - Completes the synchronization between external store and object storage

**Fault Tolerance:**

If the writer fails after step 2 but before step 4, the external store and object store are temporarily out of sync.
Readers detect this condition and attempt to complete the synchronization.
If synchronization fails, the reader refuses to load the dataset rather than expose a state that would break dataset portability.

### Reader Process with External Store

The reader follows a validation and synchronization protocol:

![External Store Reader Process](../../images/external_store_reader.gif)

1. **Query external store**: `GET_EXTERNAL_STORE base_uri, version` → `path`
    - Retrieve the manifest path for the requested version
    - If the path does not end with a UUID, return it directly (synchronization complete)
    - If the path ends with a UUID, synchronization is required

2. **Synchronize to object store**: `COPY_OBJECT_STORE {dataset}/_versions/{version}.manifest-{uuid} → {dataset}/_versions/{version}.manifest`
    - Attempt to finalize the staged manifest
    - This operation is idempotent

3. **Update external store**: `PUT_EXTERNAL_STORE base_uri, version, {dataset}/_versions/{version}.manifest`
    - Update the external store to reflect the finalized path
    - Future readers will see the synchronized state

4. **Return finalized path**: Return `{dataset}/_versions/{version}.manifest`
    - Always return the finalized path
    - If synchronization fails, return an error to prevent reading inconsistent state

This protocol ensures that datasets using external manifest stores remain portable: copying the dataset directory preserves all data without requiring the external store.
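Putting the writer side together, here is a hedged Rust sketch of the four-step commit protocol above. The `ExternalStore` and `ObjectStore` traits, their method names, and the error type are all illustrative stand-ins for whatever key-value store and object store a deployment actually uses.

```rust
/// Key-value store supporting put-if-not-exists (used in steps 2 and 4).
trait ExternalStore {
    fn put_if_not_exists(&self, key: (&str, u64), value: &str) -> Result<(), CommitError>;
    fn put(&self, key: (&str, u64), value: &str) -> Result<(), CommitError>;
}

/// Object store; writes here need not be atomic.
trait ObjectStore {
    fn put(&self, path: &str, bytes: &[u8]) -> Result<(), CommitError>;
    fn copy(&self, src: &str, dst: &str) -> Result<(), CommitError>;
}

#[derive(Debug)]
enum CommitError {
    Conflict, // another writer committed this version
    Io(String),
}

fn commit_with_external_store(
    store: &dyn ObjectStore,
    external: &dyn ExternalStore,
    base_uri: &str,
    version: u64,
    manifest_bytes: &[u8],
    uuid: &str,
) -> Result<(), CommitError> {
    let staged = format!("{base_uri}/_versions/{version}.manifest-{uuid}");
    let finalized = format!("{base_uri}/_versions/{version}.manifest");

    // 1. Stage the manifest under a unique, reader-invisible path.
    store.put(&staged, manifest_bytes)?;
    // 2. Atomically claim the version in the external store; the commit
    //    is effectively complete once this succeeds.
    external.put_if_not_exists((base_uri, version), &staged)?;
    // 3. Copy to the final path so readers without the external store see it.
    store.copy(&staged, &finalized)?;
    // 4. Point the external store at the finalized path.
    external.put((base_uri, version), &finalized)?;
    Ok(())
}
```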
diff --git a/docs/src/format/table/versioning.md b/docs/src/format/table/versioning.md
new file mode 100644
index 00000000000..745dd1ccd87
--- /dev/null
+++ b/docs/src/format/table/versioning.md
@@ -0,0 +1,34 @@
# Format Versioning

## Feature Flags

As the table format evolves, new feature flags are added to the format.
There are two separate fields for feature flags, depending on whether you are trying to read or write the table.
Readers should check `reader_feature_flags` to see whether it contains any flags they are not aware of.
Writers should check `writer_feature_flags`.
If either sees an unknown flag, it should return an "unsupported" error on any read or write operation.
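A minimal Rust sketch of the reader-side check, assuming the flag constants listed in the table below; the writer-side check on `writer_feature_flags` is symmetric.

```rust
const FLAG_DELETION_FILES: u64 = 1;
const FLAG_STABLE_ROW_IDS: u64 = 2;
const FLAG_USE_V2_FORMAT_DEPRECATED: u64 = 4;
const FLAG_TABLE_CONFIG: u64 = 8;
const FLAG_BASE_PATHS: u64 = 16;

/// All flag bits this (hypothetical) reader understands.
const KNOWN_READER_FLAGS: u64 = FLAG_DELETION_FILES
    | FLAG_STABLE_ROW_IDS
    | FLAG_USE_V2_FORMAT_DEPRECATED
    | FLAG_TABLE_CONFIG
    | FLAG_BASE_PATHS;

/// Refuse to read a table whose manifest carries any unknown reader flag.
fn check_reader_flags(reader_feature_flags: u64) -> Result<(), String> {
    let unknown = reader_feature_flags & !KNOWN_READER_FLAGS;
    if unknown != 0 {
        return Err(format!("unsupported feature flags: {unknown:#b}"));
    }
    Ok(())
}
```

## Current Feature Flags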
+ +| Flag Bit | Flag Name | Reader Required | Writer Required | Description | +|----------|---------------------------------|-----------------|-----------------|-------------------------------------------------------------------------------------------------------------| +| 1 | `FLAG_DELETION_FILES` | Yes | Yes | Fragments may contain deletion files, which record the tombstones of soft-deleted rows. | +| 2 | `FLAG_STABLE_ROW_IDS` | Yes | Yes | Row IDs are stable for both moves and updates. Fragments contain an index mapping row IDs to row addresses. | +| 4 | `FLAG_USE_V2_FORMAT_DEPRECATED` | No | No | Files are written with the new v2 format. This flag is deprecated and no longer used. | +| 8 | `FLAG_TABLE_CONFIG` | No | Yes | Table config is present in the manifest. | +| 16 | `FLAG_BASE_PATHS` | Yes | Yes | Dataset uses multiple base paths (for shallow clones or multi-base datasets). | + +
+ +Flags with bit values 32 and above are unknown and will cause implementations to reject the dataset with an "unsupported" error. diff --git a/docs/src/images/fragment_structure.png b/docs/src/images/fragment_structure.png index e5dfd7f2e20..7590e10319f 100644 Binary files a/docs/src/images/fragment_structure.png and b/docs/src/images/fragment_structure.png differ diff --git a/docs/src/images/lakehouse_stack.png b/docs/src/images/lakehouse_stack.png new file mode 100644 index 00000000000..ff9546d9639 Binary files /dev/null and b/docs/src/images/lakehouse_stack.png differ diff --git a/docs/src/images/table_overview.png b/docs/src/images/table_overview.png new file mode 100644 index 00000000000..b20c6db96ad Binary files /dev/null and b/docs/src/images/table_overview.png differ diff --git a/docs/src/index.md b/docs/src/index.md index 4c0e823e82e..3856529f5cd 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,36 +1,9 @@ --- +template: home.html +title: Lance hide: toc --- -# Welcome to Lance Open Source Documentation! - -Lance Logo - -*Lance is a modern columnar data format optimized for machine learning and AI applications. It efficiently handles diverse multimodal data types while providing high-performance querying and versioning capabilities.* - -[Quickstart Locally With Python](quickstart){ .md-button .md-button--primary } [Read the Format Specification](format){ .md-button .md-button } [Train Your LLM on a Lance Dataset](examples/python/llm_training){ .md-button .md-button--primary } - -## 🎯 How Does Lance Work? - -Lance is designed to be used with images, videos, 3D point clouds, audio and tabular data. It supports any POSIX file systems, and cloud storage like AWS S3 and Google Cloud Storage. - -This file format is particularly suited for [**vector search**](quickstart/vector-search), full-text search and [**LLM training**](examples/python/llm_training) on multimodal data. To learn more about how Lance works, [**read the format specification**](format). - -!!! info "Looking for LanceDB?" - **This is the Lance table format project** - the open source core that powers LanceDB. - If you want the complete vector database and multimodal lakehouse built on Lance, visit [lancedb.com](https://lancedb.com) - -## ⚡ Key Features of Lance Format - -| Feature | Description | -|---------|-------------| -| 🚀 **[High-Performance Random Access](guide/performance)** | 100x faster than Parquet for random access patterns | -| 🔄 **[Zero-Copy Data Evolution](guide/data_evolution)** | Add, drop or update column data without rewriting the entire dataset | -| 🎨 **[Multimodal Data](guide/blob)** | Natively store large text, images, videos, documents and embeddings | -| 🔍 **[Vector Search](quickstart/vector-search)** | Find nearest neighbors in under 1 millisecond with IVF-PQ, IVF-SQ, HNSW | -| 📝 **[Full-Text Search](guide/tokenizer)** | Fast search over text with inverted index, Ngram index plus tokenizers | -| 💾 **[Row Level Transaction](format#conflict-resolution)** | Fully ACID transaction with row level conflict resolution | -