2 changes: 1 addition & 1 deletion README.md
@@ -46,7 +46,7 @@ The key features of Lance include:

* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

-* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure.
+* **Zero-copy versioning:** Automatic versioning with ACID transactions, time travel, tags, and branches (no extra infrastructure needed).

* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

2 changes: 1 addition & 1 deletion docs/src/guide/.pages
@@ -4,7 +4,7 @@ nav:
- Data Evolution: data_evolution.md
- Blob API: blob.md
- JSON Support: json.md
-- Tags: tags.md
+- Tags and Branches: tags_and_branches.md
- Object Store Configuration: object_store.md
- Distributed Write: distributed_write.md
- Migration Guide: migration.md
51 changes: 0 additions & 51 deletions docs/src/guide/tags.md

This file was deleted.

125 changes: 125 additions & 0 deletions docs/src/guide/tags_and_branches.md
@@ -0,0 +1,125 @@
# Manage Tags and Branches

Lance provides Git-like tag and branch capabilities through the `LanceDataset.tags` and `LanceDataset.branches` properties.

## Tags
Tags label specific versions within a branch's history.

Tags are particularly useful for tracking the evolution of datasets,
especially in machine learning workflows where datasets are frequently updated.
You can `create`, `update`, `delete`, and `list` tags.

The `reference` parameter (used in `create`, `update`, and `checkout_version`) accepts:

- An **integer**: version number in the **current branch** (e.g., `1`)
- A **string**: tag name (e.g., `"stable"`)
- A **tuple** `(branch_name, version)`: a specific version in a named branch
    - `(None, 2)` means version 2 on the main branch
    - `("main", 2)` means version 2 on the main branch (explicit)
    - `("experiment", 3)` means version 3 on the experiment branch
    - `("branch-name", None)` means the latest version on that branch
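The accepted forms can be read as different spellings of one `(branch, version)` pair. The sketch below is a toy model of that normalization, not Lance's implementation: `resolve_reference`, `LATEST`, and `TAGS` are hypothetical names introduced only for illustration, and the branch and tag data are made up.

```python
from typing import Optional, Tuple, Union

# Hypothetical type alias for illustration; not part of the Lance API.
Reference = Union[int, str, Tuple[Optional[str], Optional[int]]]

# Made-up state: latest version per branch, and tags mapped to (branch, version).
LATEST = {"main": 2, "experiment": 3}
TAGS = {"stable": ("main", 2)}


def resolve_reference(ref: Reference, current_branch: str = "main") -> Tuple[str, int]:
    """Normalize a reference into an explicit (branch, version) pair."""
    if isinstance(ref, int):
        # Bare integer: a version number on the current branch.
        return (current_branch, ref)
    if isinstance(ref, str):
        # String: a tag name, looked up in the tag table.
        return TAGS[ref]
    branch, version = ref
    if branch is None:
        branch = "main"  # None as the branch means the main branch
    if version is None:
        version = LATEST[branch]  # None as the version means the latest one
    return (branch, version)


print(resolve_reference(1))                     # ('main', 1)
print(resolve_reference("stable"))              # ('main', 2)
print(resolve_reference((None, 2)))             # ('main', 2)
print(resolve_reference(("experiment", 3)))     # ('experiment', 3)
print(resolve_reference(("experiment", None)))  # ('experiment', 3)
```

In this model, `(None, 2)` and `("main", 2)` resolve to the same pair, which matches the list above.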

!!! note

    Creating or deleting tags does not generate new dataset versions.
    Tags exist as auxiliary metadata stored in a separate directory.

```python
import lance

# Open an existing dataset; the outputs below assume it already has two versions.
ds = lance.dataset("./tags.lance")
print(len(ds.versions()))
# 2
print(ds.tags.list())
# {}

# Create a tag on version 1 of the main branch, then move it to version 2.
ds.tags.create("v1-prod", (None, 1))
print(ds.tags.list())
# {'v1-prod': {'version': 1, 'manifest_size': ...}}
ds.tags.update("v1-prod", (None, 2))
print(ds.tags.list())
# {'v1-prod': {'version': 2, 'manifest_size': ...}}

# Delete the tag; the underlying versions are unaffected.
ds.tags.delete("v1-prod")
print(ds.tags.list())
# {}

# The same sequence with list_ordered(), which returns (name, info) pairs.
print(ds.tags.list_ordered())
# []
ds.tags.create("v1-prod", (None, 1))
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 1, 'manifest_size': ...})]
ds.tags.update("v1-prod", (None, 2))
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 2, 'manifest_size': ...})]
ds.tags.delete("v1-prod")
print(ds.tags.list_ordered())
# []
```

!!! note

    Tagged versions are exempted from the `LanceDataset.cleanup_old_versions()`
    process.

    To remove a version that has been tagged, you must first delete the
    associated tag with `LanceDataset.tags.delete()`.
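The exemption rule can be illustrated with a small toy model. This is only a sketch under stated assumptions: `removable_versions` is a hypothetical helper, not a Lance function, and it ignores the age-based criteria a real cleanup applies. It treats every version except the latest as a cleanup candidate unless a tag points at it.

```python
def removable_versions(versions, tags):
    """Return the versions a cleanup pass could remove in this toy model.

    `versions` is a list of version numbers and `tags` maps tag names to
    version numbers. Both structures are hypothetical stand-ins.
    """
    tagged = set(tags.values())
    latest = max(versions)
    # Keep the latest version and anything a tag still points at.
    return [v for v in sorted(versions) if v != latest and v not in tagged]


versions = [1, 2, 3, 4]
tags = {"v1-prod": 2}

print(removable_versions(versions, tags))  # [1, 3]

# Once the tag is deleted, version 2 becomes a cleanup candidate as well.
del tags["v1-prod"]
print(removable_versions(versions, tags))  # [1, 2, 3]
```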

## Branches

Branches manage parallel lines of dataset evolution. You can create a branch from an existing version or tag, read from and write to it independently, and check out different branches. The supported operations are `create`, `delete`, `list`, and `checkout`.

The `reference` parameter works the same way as for tags (see above).

!!! note

    Creating or deleting branches does not generate new dataset versions.
    New versions are created by writes (append/overwrite/index operations).

Each branch maintains its own linear version history, so version numbers may overlap across branches. Use `(branch_name, version_number)` tuples as global identifiers for operations like `checkout_version` and `tags.create`.

`main` is a reserved branch name; Lance uses it to identify the default branch.

### Create and checkout branches
```python
import lance
import pyarrow as pa

# Open dataset
ds = lance.dataset("/tmp/test.lance")

# Create branch from latest version (default: current branch's latest)
experiment_branch = ds.create_branch("experiment")
experimental_data = pa.Table.from_pydict({"a": [11], "b": [12]})
lance.write_dataset(experimental_data, experiment_branch, mode="append")

# Create tag on the latest version of the experimental branch
ds.tags.create("experiment-rc", ("experiment", None))

# Checkout by tag name
experiment_rc = ds.checkout_version("experiment-rc")
# Checkout the latest version of the experimental branch by tuple
experiment_latest = ds.checkout_version(("experiment", None))

# Create a new branch from a tag
new_experiment = ds.create_branch("new-experiment", "experiment-rc")
```

### List branches
```python
print(ds.branches.list())
# {'experiment': {...}, 'new-experiment': {...}}
```

### Delete a branch
```python
# Ensure the branch is no longer needed before deletion
ds.branches.delete("experiment")
print(ds.branches.list_ordered(order="desc"))
# {'new-experiment': {'parent_branch': 'experiment', 'parent_version': 2, 'create_at': ..., 'manifest_size': ...}, ...}
```

!!! note

    Branches hold references to data files. Lance ensures that cleanup does not delete files still referenced by any branch.

    Delete unused branches to allow their referenced files to be cleaned up by `cleanup_old_versions()`.
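To make the retention rule concrete, here is a toy model of which files a cleanup pass could delete. It is a sketch only: `deletable_files` and the file names are hypothetical, and Lance's actual bookkeeping is not shown in this guide. The model treats a file as deletable only when no remaining branch references it.

```python
def deletable_files(all_files, branch_files):
    """Return files that no branch references (toy model, not Lance's logic).

    `all_files` is the set of data files on disk and `branch_files` maps each
    branch name to the set of files that branch references.
    """
    referenced = set().union(*branch_files.values())
    return sorted(set(all_files) - referenced)


all_files = {"a.lance", "b.lance", "c.lance"}
branch_files = {
    "main": {"a.lance"},
    "experiment": {"a.lance", "b.lance"},
}

print(deletable_files(all_files, branch_files))  # ['c.lance']

# After the experiment branch is deleted, b.lance is no longer referenced.
del branch_files["experiment"]
print(deletable_files(all_files, branch_files))  # ['b.lance', 'c.lance']
```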
21 changes: 19 additions & 2 deletions docs/src/quickstart/versioning.md
@@ -1,6 +1,6 @@
---
title: Versioning
-description: Learn how to version your Lance datasets with append, overwrite, and tag features
+description: Learn how to version your Lance datasets with append, overwrite, tags, and branches
---

# Versioning Your Datasets with Lance
@@ -75,7 +75,7 @@ lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()

## Tag Your Important Versions

-Create named tags for important versions, making it easier to reference specific versions by meaningful names. To create tags for relevant versions, do this:
+Create named tags for important versions, making it easier to reference them by meaningful names.

```python
dataset.tags.create("stable", 2)
@@ -89,6 +89,23 @@ Tags can be checked out like versions:
lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()
```

For advanced tag operations (e.g., tagging versions on specific branches), see [Tags and Branches](../guide/tags_and_branches.md).

## Work with Branches

Branches manage parallel lines of dataset evolution. You can create branches from existing versions or tags, read from and write to them independently, and check out different branches.

```python
# Create branch from current latest version
experiment_branch = dataset.create_branch("experiment")

# Write to the branch (affects only that branch's history)
tbl = pa.Table.from_pandas(pd.DataFrame({"a": [42]}))
lance.write_dataset(tbl, experiment_branch, mode="append")
```

For more details, see [Tags and Branches](../guide/tags_and_branches.md).

## Next Steps

Now that you've mastered dataset versioning with Lance, check out **[Vector Indexing and Vector Search With Lance](vector-search.md)**. You can learn how to build high-performance vector search capabilities on top of your Lance tables.