Merged
Changes from 4 commits
1 change: 1 addition & 0 deletions docs/src/guide/.pages
@@ -1,5 +1,6 @@
nav:
- Read and Write: read_and_write.md
- Data Types: data_types.md
- Data Evolution: data_evolution.md
- Blob API: blob.md
- JSON Support: json.md
390 changes: 390 additions & 0 deletions docs/src/guide/data_types.md
@@ -0,0 +1,390 @@
# Data Types

Lance uses [Apache Arrow](https://arrow.apache.org/) as its in-memory data format. This guide covers the supported data types with a focus on array types, which are essential for vector embeddings and machine learning applications.

## Arrow Type System

Lance supports the full Apache Arrow type system. When writing data through Python (PyArrow) or Rust (arrow-rs), the Arrow types are automatically mapped to Lance's internal representation.

### Primitive Types

| Arrow Type | Description | Example Use Case |
|------------|-------------|------------------|
| `Boolean` | True/false values | Flags, filters |
| `Int8`, `Int16`, `Int32`, `Int64` | Signed integers | IDs, counts |
| `UInt8`, `UInt16`, `UInt32`, `UInt64` | Unsigned integers | IDs, indices |
| `Float16`, `Float32`, `Float64` | Floating point numbers | Measurements, scores |
| `Decimal128`, `Decimal256` | Fixed-precision decimals | Financial data |
| `Date32`, `Date64` | Date values | Birth dates, event dates |
| `Time32`, `Time64` | Time values | Time of day |
| `Timestamp` | Date and time, with optional timezone | Event timestamps |
| `Duration` | Time duration | Elapsed time |

### String and Binary Types

| Arrow Type | Description | Example Use Case |
|------------|-------------|------------------|
| `Utf8` | Variable-length UTF-8 string | Text, names |
| `LargeUtf8` | Large UTF-8 string (64-bit offsets) | Large documents |
| `Binary` | Variable-length binary data | Raw bytes |
| `LargeBinary` | Large binary data (64-bit offsets) | Large blobs |
| `FixedSizeBinary(n)` | Fixed-length binary data | UUIDs, hashes |

### Blob Type for Large Binary Objects

Lance provides a specialized **Blob** type for efficiently storing and retrieving very large binary objects such as videos, images, audio files, or other multimedia content. Unlike regular binary columns, blobs are stored out-of-line and support lazy loading, which means you can read portions of the data without loading everything into memory.

To create a blob column, add the `lance-encoding:blob` metadata to a `LargeBinary` field:

```python
import pyarrow as pa
import lance

# Define schema with a blob column for videos
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("filename", pa.utf8()),
    pa.field("video", pa.large_binary(), metadata={"lance-encoding:blob": "true"}),
])

# Read video file
with open("sample_video.mp4", "rb") as f:
    video_data = f.read()

# Create and write dataset
table = pa.table({
    "id": [1],
    "filename": ["sample_video.mp4"],
    "video": [video_data],
}, schema=schema)

ds = lance.write_dataset(table, "./videos.lance", schema=schema)
```

To read blob data, use `take_blobs()` which returns file-like objects for lazy reading:

```python
# Retrieve blob as a file-like object (lazy loading)
blobs = ds.take_blobs("video", ids=[0])

# Use with libraries that accept file-like objects
import av # pip install av
with av.open(blobs[0]) as container:
    for frame in container.decode(video=0):
        # Process video frames without loading entire video into memory
        pass
```
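
To make "read portions of the data" concrete, here is a small sketch that reads only the first 12 bytes of a file-like object to inspect an MP4 header. It uses `io.BytesIO` as a stand-in for a blob and assumes the objects returned by `take_blobs()` support `seek()` and `read(n)`; `read_mp4_brand` is a hypothetical helper, not part of Lance.

```python
import io


def read_mp4_brand(blob_file):
    """Read the first 12 bytes of a file-like object and return the MP4 major brand."""
    blob_file.seek(0)
    header = blob_file.read(12)
    # An MP4 file starts with a 4-byte box size, the tag b"ftyp", then a 4-byte brand.
    if len(header) < 12 or header[4:8] != b"ftyp":
        return None
    return header[8:12].decode("ascii")


# Stand-in for one element of the list returned by take_blobs()
fake_blob = io.BytesIO(b"\x00\x00\x00\x18ftypisom" + b"\x00" * 1024)
brand = read_mp4_brand(fake_blob)
print(brand)
```

Only the header bytes are requested here; the rest of the payload is never read.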

For more details, see the [Blob API Guide](blob.md).

## Array Types for Vector Embeddings

Lance stores array types natively, which makes them a good fit for vector embeddings in AI/ML applications.

### FixedSizeList - The Preferred Type for Vector Embeddings

`FixedSizeList` is the recommended type for storing fixed-dimensional vector embeddings. Each vector has the same number of dimensions, making it highly efficient for storage and computation.

#### Python Example

```python
import lance
import pyarrow as pa
import numpy as np

# Create a schema with a vector embedding column
# This defines a 128-dimensional float32 vector
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("text", pa.utf8()),
    pa.field("vector", pa.list_(pa.float32(), 128)),  # FixedSizeList of 128 floats
])

# Create sample data with embeddings
num_rows = 1000
vectors = np.random.rand(num_rows, 128).astype(np.float32)

table = pa.Table.from_pydict({
    "id": list(range(num_rows)),
    "text": [f"document_{i}" for i in range(num_rows)],
    "vector": [v.tolist() for v in vectors],
}, schema=schema)

# Write to Lance format
ds = lance.write_dataset(table, "./embeddings.lance")
print(f"Created dataset with {ds.count_rows()} rows")
```

#### Rust Example

```rust
// Signatures below follow recent Lance releases; adjust if your version differs.
use arrow_array::{
    ArrayRef, FixedSizeListArray, Float32Array, Int64Array, RecordBatch, RecordBatchIterator,
    StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use lance::dataset::WriteParams;
use lance::Dataset;
// Extension trait that provides `FixedSizeListArray::try_new_from_values`
use lance_arrow::FixedSizeListArrayExt;
use std::sync::Arc;

#[tokio::main]
async fn main() -> lance::Result<()> {
    // Define schema with a 128-dimensional vector column
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("text", DataType::Utf8, false),
        Field::new(
            "vector",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float32, true)),
                128,
            ),
            false,
        ),
    ]));

    // Create sample data
    let ids = Int64Array::from(vec![0, 1, 2]);
    let texts = StringArray::from(vec!["doc_0", "doc_1", "doc_2"]);

    // Create vector embeddings: 3 rows x 128 dimensions = 384 contiguous values
    let values: Vec<f32> = (0..384).map(|i| i as f32 / 100.0).collect();
    let values_array = Float32Array::from(values);
    let vectors = FixedSizeListArray::try_new_from_values(values_array, 128)?;

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(ids) as ArrayRef,
            Arc::new(texts) as ArrayRef,
            Arc::new(vectors) as ArrayRef,
        ],
    )?;

    // `Dataset::write` consumes a `RecordBatchReader`, so wrap the batch in an iterator
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema.clone());
    let dataset =
        Dataset::write(reader, "embeddings.lance", Some(WriteParams::default())).await?;

    println!("Created dataset with {} rows", dataset.count_rows(None).await?);
    Ok(())
}
```

### Vector Search with Embeddings

Once you have vector embeddings stored in Lance, you can perform efficient vector similarity search:

```python
import lance
import numpy as np

# Open the dataset
ds = lance.dataset("./embeddings.lance")

# Create a query vector (same dimension as stored vectors)
query_vector = np.random.rand(128).astype(np.float32).tolist()

# Perform vector search - find 10 nearest neighbors
results = ds.to_table(
    nearest={
        "column": "vector",
        "q": query_vector,
        "k": 10,
    }
)
print(results.to_pandas())
```
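
Without an index, a nearest-neighbor query has to compare the query vector against every stored vector. The pure-Python sketch below shows that idea with Euclidean (L2) distance. It is a conceptual illustration only, not Lance's implementation, and `brute_force_knn` is a hypothetical helper.

```python
import heapq
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(vectors, query, k):
    """Return (row_index, distance) pairs for the k vectors closest to `query`."""
    scored = ((i, l2_distance(v, query)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(100)]
query = vectors[17]  # query with a stored vector, so row 17 is its own nearest neighbor

neighbors = brute_force_knn(vectors, query, k=3)
print(neighbors)
```

The cost grows linearly with the number of rows, which is why an approximate index becomes attractive for large datasets.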

For production workloads with large datasets, create a vector index for much faster search:

```python
# Create an IVF-PQ index for fast approximate nearest neighbor search
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    num_partitions=256,  # Number of IVF partitions
    num_sub_vectors=16,  # Number of PQ sub-vectors
)

# Search with the index (automatically used)
results = ds.to_table(
    nearest={
        "column": "vector",
        "q": query_vector,
        "k": 10,
        "nprobes": 20,  # Number of partitions to search
    }
)
```
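
The relationship between the vector dimension and `num_sub_vectors` can be checked with simple arithmetic. The sketch below follows the textbook description of product quantization (each sub-vector is replaced by one `num_bits`-wide code); Lance's actual on-disk layout may differ, and `pq_layout` is a hypothetical helper.

```python
def pq_layout(dimension, num_sub_vectors, num_bits=8):
    """Describe how product quantization would split and compress one float32 vector."""
    if dimension % num_sub_vectors != 0:
        raise ValueError("dimension must be divisible by num_sub_vectors")
    raw_bytes = dimension * 4                     # float32 = 4 bytes per value
    code_bytes = num_sub_vectors * num_bits // 8  # one code per sub-vector
    return {
        "sub_vector_dim": dimension // num_sub_vectors,
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression_ratio": raw_bytes / code_bytes,
    }


layout = pq_layout(dimension=128, num_sub_vectors=16)
print(layout)
```

With the parameters used above, each 128-dimensional vector is split into 16 sub-vectors of 8 values each.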

### List and LargeList - Variable-Length Arrays

For variable-length arrays where each row may have a different number of elements, use `List` or `LargeList`:

```python
import lance
import pyarrow as pa

# Schema with variable-length arrays
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("tags", pa.list_(pa.utf8())),  # Variable number of string tags
    pa.field("scores", pa.list_(pa.float32())),  # Variable number of float scores
])

table = pa.Table.from_pydict({
    "id": [1, 2, 3],
    "tags": [["python", "ml"], ["rust"], ["data", "analytics", "ai"]],
    "scores": [[0.9, 0.8], [0.95], [0.7, 0.85, 0.9]],
}, schema=schema)

ds = lance.write_dataset(table, "./variable_arrays.lance")
```

## Nested and Complex Types

### Struct Types

Store structured data with multiple named fields:

```python
from datetime import datetime

import lance
import pyarrow as pa

# Schema with nested struct
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("metadata", pa.struct([
        pa.field("source", pa.utf8()),
        pa.field("timestamp", pa.timestamp("us")),
        pa.field("embedding_model", pa.utf8()),
    ])),
    pa.field("vector", pa.list_(pa.float32(), 384)),  # 384-dim embedding
])

# Timestamps are passed as datetime objects so PyArrow can convert them directly
table = pa.Table.from_pydict({
    "id": [1, 2],
    "metadata": [
        {"source": "web", "timestamp": datetime(2024, 1, 15, 10, 30), "embedding_model": "text-embedding-3-small"},
        {"source": "api", "timestamp": datetime(2024, 1, 15, 11, 45), "embedding_model": "text-embedding-3-small"},
    ],
    "vector": [
        [0.1] * 384,
        [0.2] * 384,
    ],
}, schema=schema)

ds = lance.write_dataset(table, "./with_metadata.lance")
```

### Map Types

Store key-value pairs with dynamic keys:

```python
import lance
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("attributes", pa.map_(pa.utf8(), pa.utf8())),
])

table = pa.Table.from_pydict({
    "id": [1, 2],
    "attributes": [
        [("color", "red"), ("size", "large")],
        [("color", "blue"), ("material", "cotton")],
    ],
}, schema=schema)

ds = lance.write_dataset(table, "./with_maps.lance")
```

## Data Type Mapping for Integrations

When integrating Lance with other systems (like Apache Flink, Spark, or Presto), the following type mappings apply:

| External Type | Lance/Arrow Type | Notes |
|--------------|------------------|-------|
| `BOOLEAN` | `Boolean` | |
| `TINYINT` | `Int8` | |
| `SMALLINT` | `Int16` | |
| `INT` / `INTEGER` | `Int32` | |
| `BIGINT` | `Int64` | |
| `FLOAT` | `Float32` | |
| `DOUBLE` | `Float64` | |
| `DECIMAL(p,s)` | `Decimal128(p,s)` | |
| `STRING` / `VARCHAR` | `Utf8` | |
| `CHAR(n)` | `Utf8` | Fixed-width in source system; stored as variable-length Utf8 |
| `DATE` | `Date32` | |
| `TIME` | `Time64` | Microsecond precision |
| `TIMESTAMP` | `Timestamp` | |
| `TIMESTAMP WITH LOCAL TIMEZONE` | `Timestamp` | With timezone info |
| `BINARY` / `VARBINARY` | `Binary` | |
| `BYTES` | `Binary` | |
| `BLOB` | `LargeBinary` with `lance-encoding:blob` | Large binary objects with lazy loading |
| `ARRAY<T>` | `List(T)` | Variable-length array |
| `ARRAY<T>(n)` | `FixedSizeList(T, n)` | Fixed-length array (vectors) |
| `ROW` / `STRUCT` | `Struct` | Nested structure |
| `MAP<K,V>` | `Map(K, V)` | Key-value pairs |
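
The table can be read as a lookup with two parameterized cases (`DECIMAL(p,s)` and `ARRAY<T>` with an optional length). The sketch below encodes a subset of it in plain Python to show how nested arrays resolve. `to_arrow_type_name` is a hypothetical helper for illustration and is not part of Lance or any connector.

```python
import re

# Subset of the scalar rows from the mapping table
SCALAR_TYPES = {
    "BOOLEAN": "Boolean", "TINYINT": "Int8", "SMALLINT": "Int16",
    "INT": "Int32", "INTEGER": "Int32", "BIGINT": "Int64",
    "FLOAT": "Float32", "DOUBLE": "Float64",
    "STRING": "Utf8", "VARCHAR": "Utf8", "DATE": "Date32",
    "BINARY": "Binary", "VARBINARY": "Binary", "BYTES": "Binary",
}

ARRAY_PATTERN = re.compile(r"^ARRAY<(.+)>(?:\((\d+)\))?$")
DECIMAL_PATTERN = re.compile(r"^DECIMAL\((\d+),\s*(\d+)\)$")


def to_arrow_type_name(external_type):
    """Translate an external type string into the Arrow type name from the table."""
    name = external_type.strip().upper()
    array_match = ARRAY_PATTERN.match(name)
    if array_match:
        element = to_arrow_type_name(array_match.group(1))
        size = array_match.group(2)
        if size is None:
            return f"List({element})"                 # ARRAY<T>
        return f"FixedSizeList({element}, {size})"    # ARRAY<T>(n)
    decimal_match = DECIMAL_PATTERN.match(name)
    if decimal_match:
        precision, scale = decimal_match.groups()
        return f"Decimal128({precision},{scale})"
    return SCALAR_TYPES[name]


print(to_arrow_type_name("ARRAY<FLOAT>(384)"))
print(to_arrow_type_name("ARRAY<STRING>"))
```

Unknown type names raise a `KeyError` in this sketch; a real integration would need to decide how to report unsupported types.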

### Vector Embeddings in Integrations

For vector embedding columns, use `ARRAY<FLOAT>(n)` or `ARRAY<DOUBLE>(n)` where `n` is the embedding dimension:

```sql
-- Example: Creating a table with vector embeddings in SQL-compatible systems
CREATE TABLE embeddings (
    id BIGINT,
    text STRING,
    vector ARRAY<FLOAT>(384) -- 384-dimensional vector
);
```

This maps to Lance's `FixedSizeList(Float32, 384)` type, which is optimized for:

- Efficient columnar storage
- SIMD-accelerated distance computations
- Vector index creation and search

## Best Practices for Vector Data

1. **Use FixedSizeList for embeddings**: Always use `FixedSizeList` (not variable-length `List`) for vector embeddings to enable efficient storage and indexing.

2. **Choose appropriate precision**:
    - `Float32` is the standard choice, balancing precision and storage
    - `Float16` or `BFloat16` halve storage relative to `Float32`, typically with only a small loss in search accuracy
    - `Int8` for quantized embeddings

3. **Align dimensions for SIMD**: Vector dimensions divisible by 8 enable optimal SIMD acceleration. Common dimensions: 128, 256, 384, 512, 768, 1024, 1536.

4. **Create indexes for large datasets**: For datasets with more than ~10,000 vectors, create an ANN index for fast search:

    ```python
    # IVF_PQ is recommended for most use cases
    ds.create_index("vector", index_type="IVF_PQ", num_partitions=256, num_sub_vectors=16)

    # IVF_HNSW_SQ offers better recall at the cost of more memory
    ds.create_index("vector", index_type="IVF_HNSW_SQ", num_partitions=256)
    ```

5. **Store metadata alongside vectors**: Lance efficiently handles mixed workloads with both vector and scalar data:

    ```python
    # Combine vector search with metadata filtering
    # (assumes the dataset has a `category` column and `query_vector` is defined as above)
    results = ds.to_table(
        filter="category = 'electronics'",
        nearest={"column": "vector", "q": query_vector, "k": 10}
    )
    ```
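
To put the precision and dimension recommendations into numbers, the following sketch estimates the raw (uncompressed, index-free) size of a vector column for each element type and checks divisibility by 8. The helper names and the one-million-row example are illustrative assumptions.

```python
BYTES_PER_VALUE = {"Float32": 4, "Float16": 2, "BFloat16": 2, "Int8": 1}


def raw_vector_bytes(num_vectors, dimension, value_type):
    """Raw size of a vector column, ignoring compression, indexes, and metadata."""
    return num_vectors * dimension * BYTES_PER_VALUE[value_type]


def is_simd_friendly(dimension, lane_width=8):
    """True when the dimension is a multiple of the lane width (8 per the guideline above)."""
    return dimension % lane_width == 0


sizes = {
    value_type: raw_vector_bytes(1_000_000, 768, value_type)
    for value_type in BYTES_PER_VALUE
}
for value_type, size in sizes.items():
    print(f"{value_type}: {size / 1e9:.2f} GB")
print(is_simd_friendly(768), is_simd_friendly(300))
```

For one million 768-dimensional vectors, moving from `Float32` to a 16-bit type saves about 1.5 GB of raw storage.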

## See Also

- [Vector Search Tutorial](../quickstart/vector-search.md) - Complete guide to vector search with Lance
- [Blob API Guide](blob.md) - Storing and retrieving large binary objects (videos, images)
- [Extension Arrays](arrays.md) - Special array types for ML (BFloat16, images)
- [Performance Guide](performance.md) - Optimization tips for large-scale deployments