Merged
1 change: 1 addition & 0 deletions docs/src/guide/.pages
@@ -1,5 +1,6 @@
nav:
- Read and Write: read_and_write.md
- Data Types: data_types.md
- Data Evolution: data_evolution.md
- Blob API: blob.md
- JSON Support: json.md
341 changes: 341 additions & 0 deletions docs/src/guide/data_types.md
@@ -0,0 +1,341 @@
# Data Types

Lance uses [Apache Arrow](https://arrow.apache.org/) as its in-memory data format. This guide covers the supported data types with a focus on array types, which are essential for vector embeddings and machine learning applications.

## Arrow Type System

Lance supports the full Apache Arrow type system. When writing data through Python (PyArrow) or Rust (arrow-rs), the Arrow types are automatically mapped to Lance's internal representation.

### Primitive Types

| Arrow Type | Description | Example Use Case |
|------------|-------------|------------------|
| `Boolean` | True/false values | Flags, filters |
| `Int8`, `Int16`, `Int32`, `Int64` | Signed integers | IDs, counts |
| `UInt8`, `UInt16`, `UInt32`, `UInt64` | Unsigned integers | IDs, indices |
| `Float16`, `Float32`, `Float64` | Floating point numbers | Measurements, scores |
| `Decimal128`, `Decimal256` | Fixed-precision decimals | Financial data |
| `Date32`, `Date64` | Date values | Timestamps |

Copilot AI Feb 5, 2026

The "Example Use Case" column lists "Timestamps" for Date types, which is misleading. Date types (Date32, Date64) store only dates without time information, while the Timestamp type (listed separately on line 20) stores both date and time. Consider changing the example use case for Date types to something like "Birth dates, event dates" to clearly differentiate from timestamps.

Suggested change
- | `Date32`, `Date64` | Date values | Timestamps |
+ | `Date32`, `Date64` | Date values | Birth dates, event dates |

| `Time32`, `Time64` | Time values | Time of day |
| `Timestamp` | Date and time, with optional timezone | Event timestamps |
| `Duration` | Time duration | Elapsed time |

### String and Binary Types

| Arrow Type | Description | Example Use Case |
|------------|-------------|------------------|
| `Utf8` | Variable-length UTF-8 string | Text, names |
| `LargeUtf8` | Large UTF-8 string (64-bit offsets) | Large documents |
| `Binary` | Variable-length binary data | Raw bytes |
| `LargeBinary` | Large binary data (64-bit offsets) | Large blobs |
| `FixedSizeBinary(n)` | Fixed-length binary data | UUIDs, hashes |

## Array Types for Vector Embeddings

Lance has first-class support for array types, which are the standard way to store vector embeddings in AI/ML applications.

### FixedSizeList - The Preferred Type for Vector Embeddings

`FixedSizeList` is the recommended type for storing fixed-dimensional vector embeddings. Because every vector has the same number of dimensions, the values are stored contiguously with no per-row offsets, which keeps storage compact and computation efficient.

=== "Python"

Collaborator

Is it necessary? What would the === be rendered to?


```python
import lance
import pyarrow as pa
import numpy as np

# Create a schema with a vector embedding column
# This defines a 128-dimensional float32 vector
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("text", pa.utf8()),
    pa.field("vector", pa.list_(pa.float32(), 128)),  # FixedSizeList of 128 floats
])

# Create sample data with embeddings
num_rows = 1000
vectors = np.random.rand(num_rows, 128).astype(np.float32)

table = pa.Table.from_pydict({
    "id": list(range(num_rows)),
    "text": [f"document_{i}" for i in range(num_rows)],
    "vector": [v.tolist() for v in vectors],
}, schema=schema)

# Write to Lance format
ds = lance.write_dataset(table, "./embeddings.lance")
print(f"Created dataset with {ds.count_rows()} rows")
```

=== "Rust"

Collaborator

ditto

Contributor Author

ditto

In duckdb.md
=== "SQL"

```sql
INSTALL lance FROM community;
LOAD lance;
```

=== "Python"

```python
import duckdb

duckdb.sql(
    """
    INSTALL lance FROM community;
    LOAD lance;
    """
)
```
[Screenshot: Clipboard_Screenshot_1770277712]

Collaborator

Get.


```rust
use arrow_array::{
    ArrayRef, FixedSizeListArray, Float32Array, Int64Array, RecordBatch, RecordBatchIterator,
    StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use lance::dataset::WriteParams;
use lance::Dataset;
// Extension trait that provides `FixedSizeListArray::try_new_from_values`
use lance_arrow::FixedSizeListArrayExt;
use std::sync::Arc;

#[tokio::main]
async fn main() -> lance::Result<()> {
    // Define schema with a 128-dimensional vector column
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("text", DataType::Utf8, false),
        Field::new(
            "vector",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float32, true)),
                128,
            ),
            false,
        ),
    ]));

    // Create sample data
    let ids = Int64Array::from(vec![0, 1, 2]);
    let texts = StringArray::from(vec!["doc_0", "doc_1", "doc_2"]);

    // Create vector embeddings: 3 rows x 128 dimensions = 384 values
    let values: Vec<f32> = (0..384).map(|i| i as f32 / 100.0).collect();
    let values_array = Float32Array::from(values);
    let vectors = FixedSizeListArray::try_new_from_values(values_array, 128)?;

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(ids) as ArrayRef,
            Arc::new(texts) as ArrayRef,
            Arc::new(vectors) as ArrayRef,
        ],
    )?;

    // `Dataset::write` expects a RecordBatchReader, so wrap the batch in an iterator
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema.clone());

    // Write to Lance
    let dataset = Dataset::write(reader, "embeddings.lance", Some(WriteParams::default())).await?;

    // `count_rows` takes an optional filter in recent Lance releases
    println!("Created dataset with {} rows", dataset.count_rows(None).await?);
    Ok(())
}
```

### Vector Search with Embeddings

Once you have vector embeddings stored in Lance, you can perform efficient vector similarity search:

```python
import lance
import numpy as np

# Open the dataset
ds = lance.dataset("./embeddings.lance")

# Create a query vector (same dimension as stored vectors)
query_vector = np.random.rand(128).astype(np.float32).tolist()

# Perform vector search - find 10 nearest neighbors
results = ds.to_table(
    nearest={
        "column": "vector",
        "q": query_vector,
        "k": 10,
    }
)
print(results.to_pandas())
```
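
To make the query above concrete, here is a minimal NumPy sketch of exhaustive nearest-neighbor search, assuming squared L2 (Euclidean) distance as the metric. It illustrates the computation only and is not Lance's implementation. `brute_force_knn` is a hypothetical helper name.

```python
import numpy as np


def brute_force_knn(vectors, query, k):
    """Return (row_indices, squared_l2_distances) of the k closest rows."""
    diffs = vectors - query
    dists = np.einsum("ij,ij->i", diffs, diffs)  # squared L2 per row
    top = np.argsort(dists)[:k]                  # ascending distance
    return top, dists[top]


rng = np.random.default_rng(0)
vectors = rng.random((1000, 128), dtype=np.float32)
query = vectors[42].copy()  # query with a stored vector so the best match is known

ids, dists = brute_force_knn(vectors, query, k=10)
print(ids[:3], dists[:3])
```

Every query scans all rows, so the cost grows linearly with dataset size.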

For production workloads with large datasets, create a vector index for much faster search:

```python
# Create an IVF-PQ index for fast approximate nearest neighbor search
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    num_partitions=256,   # Number of IVF partitions
    num_sub_vectors=16,   # Number of PQ sub-vectors
)

# Search with the index (automatically used)
results = ds.to_table(
    nearest={
        "column": "vector",
        "q": query_vector,
        "k": 10,
        "nprobes": 20,  # Number of partitions to search
    }
)
```
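
The arithmetic behind these parameters can be sketched in a few lines. Product quantization splits each vector into `num_sub_vectors` equal chunks, so the dimension must be divisible by that number. The sketch assumes 8-bit codes (one byte per sub-vector), a common PQ setting that this guide does not state. `pq_layout` is a hypothetical helper, not a Lance API.

```python
def pq_layout(dim, num_sub_vectors, bytes_per_value=4, bits_per_code=8):
    """Summarize how PQ would split and compress one vector (illustrative only)."""
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    raw_bytes = dim * bytes_per_value                   # float32 vector
    code_bytes = num_sub_vectors * bits_per_code // 8   # assumed 8-bit codes
    return {
        "sub_dim": dim // num_sub_vectors,
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression": raw_bytes / code_bytes,
    }


layout = pq_layout(dim=128, num_sub_vectors=16)
fraction_probed = 20 / 256  # nprobes / num_partitions

print(layout)
print(f"{fraction_probed:.1%} of partitions scanned per query")
```

Raising `nprobes` scans more partitions, which generally improves recall at the cost of latency.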

### List and LargeList - Variable-Length Arrays

For variable-length arrays where each row may have a different number of elements, use `List` or `LargeList`:

```python
import lance
import pyarrow as pa

# Schema with variable-length arrays
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("tags", pa.list_(pa.utf8())),       # Variable number of string tags
    pa.field("scores", pa.list_(pa.float32())),  # Variable number of float scores
])

table = pa.Table.from_pydict({
    "id": [1, 2, 3],
    "tags": [["python", "ml"], ["rust"], ["data", "analytics", "ai"]],
    "scores": [[0.9, 0.8], [0.95], [0.7, 0.85, 0.9]],
}, schema=schema)

ds = lance.write_dataset(table, "./variable_arrays.lance")
```

## Nested and Complex Types

### Struct Types

Store structured data with multiple named fields:

```python
from datetime import datetime

import lance
import pyarrow as pa

# Schema with nested struct
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("metadata", pa.struct([
        pa.field("source", pa.utf8()),
        pa.field("timestamp", pa.timestamp("us")),
        pa.field("embedding_model", pa.utf8()),
    ])),
    pa.field("vector", pa.list_(pa.float32(), 384)),  # 384-dim embedding
])

table = pa.Table.from_pydict({
    "id": [1, 2],
    # Timestamps are passed as datetime objects; PyArrow does not parse
    # ISO-8601 strings into timestamp values during this conversion
    "metadata": [
        {
            "source": "web",
            "timestamp": datetime(2024, 1, 15, 10, 30),
            "embedding_model": "text-embedding-3-small",
        },
        {
            "source": "api",
            "timestamp": datetime(2024, 1, 15, 11, 45),
            "embedding_model": "text-embedding-3-small",
        },
    ],
    "vector": [
        [0.1] * 384,
        [0.2] * 384,
    ],
}, schema=schema)

ds = lance.write_dataset(table, "./with_metadata.lance")
```

### Map Types

Store key-value pairs with dynamic keys:

```python
import lance
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("attributes", pa.map_(pa.utf8(), pa.utf8())),
])

table = pa.Table.from_pydict({
    "id": [1, 2],
    "attributes": [
        [("color", "red"), ("size", "large")],
        [("color", "blue"), ("material", "cotton")],
    ],
}, schema=schema)

ds = lance.write_dataset(table, "./with_maps.lance")
```

## Data Type Mapping for Integrations

When integrating Lance with other systems (like Apache Flink, Spark, or Presto), the following type mappings apply:

| External Type | Lance/Arrow Type | Notes |
|--------------|------------------|-------|
| `BOOLEAN` | `Boolean` | |
| `TINYINT` | `Int8` | |
| `SMALLINT` | `Int16` | |
| `INT` / `INTEGER` | `Int32` | |
| `BIGINT` | `Int64` | |
| `FLOAT` | `Float32` | |
| `DOUBLE` | `Float64` | |
| `DECIMAL(p,s)` | `Decimal128(p,s)` | |
| `STRING` / `VARCHAR` | `Utf8` | |
| `CHAR(n)` | `Utf8` | Fixed-width string |

Copilot AI Feb 5, 2026

The mapping from CHAR(n) to Utf8 with the note "Fixed-width string" may be misleading. Arrow's Utf8 type is a variable-length string type, not fixed-width. When SQL CHAR(n) types (which are padded to a fixed width) are converted to Arrow/Lance, they become variable-length Utf8 strings. Consider clarifying the note to say "Fixed-width in source system" or "Converted from fixed-width string" to avoid confusion about Arrow's representation.

Suggested change
- | `CHAR(n)` | `Utf8` | Fixed-width string |
+ | `CHAR(n)` | `Utf8` | Fixed-width in source system; stored as variable-length Utf8 |

| `DATE` | `Date32` | |
| `TIME` | `Time64` | Microsecond precision |
| `TIMESTAMP` | `Timestamp` | |
| `TIMESTAMP WITH LOCAL TIMEZONE` | `Timestamp` | With timezone info |
| `BINARY` / `VARBINARY` | `Binary` | |
| `BYTES` | `Binary` | |
| `ARRAY<T>` | `List(T)` | Variable-length array |
| `ARRAY<T>(n)` | `FixedSizeList(T, n)` | Fixed-length array (vectors) |
| `ROW` / `STRUCT` | `Struct` | Nested structure |
| `MAP<K,V>` | `Map(K, V)` | Key-value pairs |
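
One way to read the table is as a lookup from SQL type names to Arrow type names. The sketch below encodes part of it in plain Python, covering the scalar rows, `DECIMAL(p,s)`, and the two `ARRAY` forms. `SCALAR_TYPES` and `to_arrow_name` are hypothetical names for illustration and are not part of Lance or any connector.

```python
import re

# Scalar rows of the mapping table (external type -> Arrow type name)
SCALAR_TYPES = {
    "BOOLEAN": "Boolean", "TINYINT": "Int8", "SMALLINT": "Int16",
    "INT": "Int32", "INTEGER": "Int32", "BIGINT": "Int64",
    "FLOAT": "Float32", "DOUBLE": "Float64",
    "STRING": "Utf8", "VARCHAR": "Utf8", "DATE": "Date32",
    "BINARY": "Binary", "VARBINARY": "Binary", "BYTES": "Binary",
}

ARRAY_RE = re.compile(r"^ARRAY<(.+)>(?:\((\d+)\))?$")
DECIMAL_RE = re.compile(r"^DECIMAL\((\d+),\s*(\d+)\)$")


def to_arrow_name(sql_type):
    """Translate a SQL type string into the Arrow type name from the table."""
    sql_type = sql_type.strip().upper()
    array = ARRAY_RE.match(sql_type)
    if array:
        inner = to_arrow_name(array.group(1))
        size = array.group(2)
        # ARRAY<T>(n) is fixed-length, ARRAY<T> is variable-length
        return f"FixedSizeList({inner}, {size})" if size else f"List({inner})"
    decimal = DECIMAL_RE.match(sql_type)
    if decimal:
        return f"Decimal128({decimal.group(1)},{decimal.group(2)})"
    return SCALAR_TYPES[sql_type]


print(to_arrow_name("ARRAY<FLOAT>(384)"))
print(to_arrow_name("ARRAY<STRING>"))
```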

### Vector Embeddings in Integrations

For vector embedding columns, use `ARRAY<FLOAT>(n)` or `ARRAY<DOUBLE>(n)` where `n` is the embedding dimension:

```sql
-- Example: Creating a table with vector embeddings in SQL-compatible systems
CREATE TABLE embeddings (
    id BIGINT,
    text STRING,
    vector ARRAY<FLOAT>(384)  -- 384-dimensional vector
);
```

This maps to Lance's `FixedSizeList(Float32, 384)` type, which is optimized for:

- Efficient columnar storage
- SIMD-accelerated distance computations
- Vector index creation and search

## Best Practices for Vector Data

1. **Use FixedSizeList for embeddings**: Always use `FixedSizeList` (not variable-length `List`) for vector embeddings to enable efficient storage and indexing.

2. **Choose appropriate precision**:
- `Float32` is the standard choice, balancing precision and storage
- `Float16` or `BFloat16` can reduce storage by 50% with minimal accuracy loss
- `Int8` for quantized embeddings

3. **Align dimensions for SIMD**: Vector dimensions divisible by 8 enable optimal SIMD acceleration. Common dimensions: 128, 256, 384, 512, 768, 1024, 1536.

4. **Create indexes for large datasets**: For datasets with more than ~10,000 vectors, create an ANN index for fast search:

```python
# IVF_PQ is recommended for most use cases
ds.create_index("vector", index_type="IVF_PQ", num_partitions=256, num_sub_vectors=16)

# IVF_HNSW_SQ offers better recall at the cost of more memory
ds.create_index("vector", index_type="IVF_HNSW_SQ", num_partitions=256)
```

5. **Store metadata alongside vectors**: Lance efficiently handles mixed workloads with both vector and scalar data:

```python
# Combine vector search with metadata filtering
results = ds.to_table(
    filter="category = 'electronics'",
    nearest={"column": "vector", "q": query_vector, "k": 10},
)
```
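
The precision trade-off in item 2 can be estimated with simple arithmetic. The sketch below computes raw storage for one million 768-dimensional vectors at three precisions and then applies a symmetric linear int8 quantization. The scheme is an illustrative assumption, not necessarily what Lance or any embedding model uses, and both helper names are hypothetical.

```python
import numpy as np


def raw_size_bytes(num_vectors, dim, dtype):
    """Uncompressed size of a (num_vectors, dim) embedding matrix."""
    return num_vectors * dim * np.dtype(dtype).itemsize


def quantize_int8(x):
    """Symmetric linear quantization to int8; returns (codes, scale)."""
    scale = float(np.abs(x).max()) / 127.0 or 1.0  # avoid a zero scale
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale


sizes = {
    np.dtype(t).name: raw_size_bytes(1_000_000, 768, t)
    for t in (np.float32, np.float16, np.int8)
}

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((4, 768)).astype(np.float32)
codes, scale = quantize_int8(embeddings)
restored = codes.astype(np.float32) * scale
max_error = float(np.abs(embeddings - restored).max())

print(sizes)
print(f"max round-trip error: {max_error:.5f} (scale {scale:.5f})")
```

With this scheme the round-trip error per value is bounded by about half the scale, which is why quantized embeddings should be validated against recall on real queries before adoption.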

## See Also

- [Vector Search Tutorial](../quickstart/vector-search.md) - Complete guide to vector search with Lance
- [Extension Arrays](arrays.md) - Special array types for ML (BFloat16, images)
- [Performance Guide](performance.md) - Optimization tips for large-scale deployments