docs: add array type support #5884
Changes from 1 commit
67e7619
a6800a8
cf69ccf
47d438f
e37fbfc
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,341 @@ | ||||||
| # Data Types | ||||||
|
|
||||||
| Lance uses [Apache Arrow](https://arrow.apache.org/) as its in-memory data format. This guide covers the supported data types with a focus on array types, which are essential for vector embeddings and machine learning applications. | ||||||
|
|
||||||
| ## Arrow Type System | ||||||
|
|
||||||
| Lance supports the full Apache Arrow type system. When writing data through Python (PyArrow) or Rust (arrow-rs), the Arrow types are automatically mapped to Lance's internal representation. | ||||||
|
|
||||||
| ### Primitive Types | ||||||
|
|
||||||
| | Arrow Type | Description | Example Use Case | | ||||||
| |------------|-------------|------------------| | ||||||
| | `Boolean` | True/false values | Flags, filters | | ||||||
| | `Int8`, `Int16`, `Int32`, `Int64` | Signed integers | IDs, counts | | ||||||
| | `UInt8`, `UInt16`, `UInt32`, `UInt64` | Unsigned integers | IDs, indices | | ||||||
| | `Float16`, `Float32`, `Float64` | Floating point numbers | Measurements, scores | | ||||||
| | `Decimal128`, `Decimal256` | Fixed-precision decimals | Financial data | | ||||||
| | `Date32`, `Date64` | Date values | Timestamps | | ||||||
| | `Time32`, `Time64` | Time values | Time of day | | ||||||
| | `Timestamp` | Date and time, with an optional timezone | Event timestamps | | ||||||
| | `Duration` | Time duration | Elapsed time | | ||||||
|
|
||||||
| ### String and Binary Types | ||||||
|
|
||||||
| | Arrow Type | Description | Example Use Case | | ||||||
| |------------|-------------|------------------| | ||||||
| | `Utf8` | Variable-length UTF-8 string | Text, names | | ||||||
| | `LargeUtf8` | Large UTF-8 string (64-bit offsets) | Large documents | | ||||||
| | `Binary` | Variable-length binary data | Raw bytes | | ||||||
| | `LargeBinary` | Large binary data (64-bit offsets) | Large blobs | | ||||||
| | `FixedSizeBinary(n)` | Fixed-length binary data | UUIDs, hashes | | ||||||
|
|
||||||
| ## Array Types for Vector Embeddings | ||||||
|
|
||||||
| Lance supports Arrow's array (list) types, which are the usual way to store vector embeddings in AI/ML applications. | ||||||
|
|
||||||
| ### FixedSizeList - The Preferred Type for Vector Embeddings | ||||||
|
|
||||||
| `FixedSizeList` is the recommended type for storing fixed-dimensional vector embeddings. Because every vector has the same number of dimensions, the values are stored back to back with no per-row offsets, which keeps both storage and computation efficient. | ||||||
|
|
||||||
| === "Python" | ||||||
|
Collaborator
Is it necessary? What would the
||||||
|
|
||||||
| ```python | ||||||
| import lance | ||||||
| import pyarrow as pa | ||||||
| import numpy as np | ||||||
|
|
||||||
| # Create a schema with a vector embedding column | ||||||
| # This defines a 128-dimensional float32 vector | ||||||
| schema = pa.schema([ | ||||||
| pa.field("id", pa.int64()), | ||||||
| pa.field("text", pa.utf8()), | ||||||
| pa.field("vector", pa.list_(pa.float32(), 128)), # FixedSizeList of 128 floats | ||||||
| ]) | ||||||
|
|
||||||
| # Create sample data with embeddings | ||||||
| num_rows = 1000 | ||||||
| vectors = np.random.rand(num_rows, 128).astype(np.float32) | ||||||
|
|
||||||
| table = pa.Table.from_pydict({ | ||||||
| "id": list(range(num_rows)), | ||||||
| "text": [f"document_{i}" for i in range(num_rows)], | ||||||
| "vector": [v.tolist() for v in vectors], | ||||||
| }, schema=schema) | ||||||
|
|
||||||
| # Write to Lance format | ||||||
| ds = lance.write_dataset(table, "./embeddings.lance") | ||||||
| print(f"Created dataset with {ds.count_rows()} rows") | ||||||
| ``` | ||||||
|
|
||||||
| === "Rust" | ||||||
|
Collaborator
ditto
Collaborator
Get.
||||||
|
|
||||||
| ```rust | ||||||
| use arrow_array::{ | ||||||
| ArrayRef, FixedSizeListArray, Float32Array, Int64Array, RecordBatch, StringArray, | ||||||
| }; | ||||||
| use arrow_schema::{DataType, Field, Schema}; | ||||||
| use lance::dataset::WriteParams; | ||||||
| use lance::Dataset; | ||||||
| // Extension trait that provides `FixedSizeListArray::try_new_from_values` | ||||||
| use lance_arrow::FixedSizeListArrayExt; | ||||||
| use std::sync::Arc; | ||||||
|
|
||||||
| #[tokio::main] | ||||||
| async fn main() -> lance::Result<()> { | ||||||
| // Define schema with a 128-dimensional vector column | ||||||
| let schema = Arc::new(Schema::new(vec![ | ||||||
| Field::new("id", DataType::Int64, false), | ||||||
| Field::new("text", DataType::Utf8, false), | ||||||
| Field::new( | ||||||
| "vector", | ||||||
| DataType::FixedSizeList( | ||||||
| Arc::new(Field::new("item", DataType::Float32, true)), | ||||||
| 128, | ||||||
| ), | ||||||
| false, | ||||||
| ), | ||||||
| ])); | ||||||
|
|
||||||
| // Create sample data | ||||||
| let ids = Int64Array::from(vec![0, 1, 2]); | ||||||
| let texts = StringArray::from(vec!["doc_0", "doc_1", "doc_2"]); | ||||||
|
|
||||||
| // Create vector embeddings (128-dimensional) | ||||||
| let values: Vec<f32> = (0..384).map(|i| i as f32 / 100.0).collect(); | ||||||
| let values_array = Float32Array::from(values); | ||||||
| let vectors = FixedSizeListArray::try_new_from_values(values_array, 128)?; | ||||||
|
|
||||||
| let batch = RecordBatch::try_new( | ||||||
| schema.clone(), | ||||||
| vec![ | ||||||
| Arc::new(ids) as ArrayRef, | ||||||
| Arc::new(texts) as ArrayRef, | ||||||
| Arc::new(vectors) as ArrayRef, | ||||||
| ], | ||||||
| )?; | ||||||
|
|
||||||
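| // Write to Lance; `Dataset::write` expects a `RecordBatchReader` and optional params | ||||||
| let reader = arrow_array::RecordBatchIterator::new(vec![Ok(batch)], schema.clone()); | ||||||
| let dataset = Dataset::write( | ||||||
| reader, | ||||||
| "embeddings.lance", | ||||||
| Some(WriteParams::default()), | ||||||
| ) | ||||||
| .await?; | ||||||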
|
|
||||||
| println!("Created dataset with {} rows", dataset.count_rows().await?); | ||||||
| Ok(()) | ||||||
| } | ||||||
| ``` | ||||||
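
To make the efficiency claim for `FixedSizeList` concrete, here is a minimal conceptual sketch in plain Python. It models the layout only and is not Lance or PyArrow internals; the names `DIM`, `num_rows`, and `get_vector` are hypothetical, and a dimension of 4 is used instead of 128 for readability.

```python
# Conceptual sketch only: models a fixed-size list layout in plain Python.
from array import array

DIM = 4  # hypothetical small dimension (the examples above use 128)

# Three vectors stored back to back in a single flat float32 buffer.
values = array("f", [0.0, 0.5, 1.0, 1.5,
                     2.0, 2.5, 3.0, 3.5,
                     4.0, 4.5, 5.0, 5.5])


def num_rows(values, dim):
    """Row count follows from the buffer length; no per-row offsets are stored."""
    if len(values) % dim != 0:
        raise ValueError("buffer length must be a multiple of dim")
    return len(values) // dim


def get_vector(values, dim, row):
    """Row i occupies the half-open range [i * dim, (i + 1) * dim)."""
    start = row * dim
    return list(values[start:start + dim])


print(num_rows(values, DIM))
print(get_vector(values, DIM, 1))
```

Because the start of every row is a simple multiplication, a reader can jump to any vector without consulting an offsets buffer.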
|
|
||||||
| ### Vector Search with Embeddings | ||||||
|
|
||||||
| Once you have vector embeddings stored in Lance, you can perform efficient vector similarity search: | ||||||
|
|
||||||
| ```python | ||||||
| import lance | ||||||
| import numpy as np | ||||||
|
|
||||||
| # Open the dataset | ||||||
| ds = lance.dataset("./embeddings.lance") | ||||||
|
|
||||||
| # Create a query vector (same dimension as stored vectors) | ||||||
| query_vector = np.random.rand(128).astype(np.float32).tolist() | ||||||
|
|
||||||
| # Perform vector search - find 10 nearest neighbors | ||||||
| results = ds.to_table( | ||||||
| nearest={ | ||||||
| "column": "vector", | ||||||
| "q": query_vector, | ||||||
| "k": 10, | ||||||
| } | ||||||
| ) | ||||||
| print(results.to_pandas()) | ||||||
| ``` | ||||||
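
For intuition, the sketch below shows what a k-nearest-neighbor query computes when no index is involved. It is a plain-Python illustration that assumes Euclidean (L2) distance; the helper names `l2_distance` and `brute_force_knn` are hypothetical and are not part of the Lance API.

```python
# Conceptual sketch only: exhaustive k-nearest-neighbor search in plain Python.
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(vectors, query, k):
    """Return (row_index, distance) pairs for the k closest vectors, nearest first."""
    scored = ((i, l2_distance(v, query)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
query = [0.0, 0.0]
top2 = brute_force_knn(vectors, query, k=2)
print(top2)
```

Every stored vector is compared against the query, so the cost grows linearly with the number of rows.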
|
|
||||||
| For production workloads with large datasets, create a vector index for much faster search: | ||||||
|
|
||||||
| ```python | ||||||
| # Create an IVF-PQ index for fast approximate nearest neighbor search | ||||||
| ds.create_index( | ||||||
| "vector", | ||||||
| index_type="IVF_PQ", | ||||||
| num_partitions=256, # Number of IVF partitions | ||||||
| num_sub_vectors=16, # Number of PQ sub-vectors | ||||||
| ) | ||||||
|
|
||||||
| # Search with the index (automatically used) | ||||||
| results = ds.to_table( | ||||||
| nearest={ | ||||||
| "column": "vector", | ||||||
| "q": query_vector, | ||||||
| "k": 10, | ||||||
| "nprobes": 20, # Number of partitions to search | ||||||
| } | ||||||
| ) | ||||||
| ``` | ||||||
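
The index parameters above can be read as simple arithmetic. The sketch below works through them for the 128-dimensional vectors used in this guide. It assumes 8-bit PQ codes, which the text does not state, and it ignores codebook and partition overhead, so treat the numbers as a rough illustration rather than measured sizes.

```python
# Back-of-the-envelope arithmetic for the IVF-PQ parameters shown above.
DIM = 128               # vector dimension used in this guide
NUM_SUB_VECTORS = 16    # num_sub_vectors from the example
NUM_PARTITIONS = 256    # num_partitions from the example
NPROBES = 20            # nprobes from the example
BYTES_PER_FLOAT32 = 4
BITS_PER_CODE = 8       # assumption: one byte per PQ code

sub_vector_dim = DIM // NUM_SUB_VECTORS            # dimensions covered by each code
raw_bytes = DIM * BYTES_PER_FLOAT32                # uncompressed float32 vector
pq_bytes = NUM_SUB_VECTORS * BITS_PER_CODE // 8    # compressed vector
compression_ratio = raw_bytes / pq_bytes
fraction_searched = NPROBES / NUM_PARTITIONS       # share of partitions scanned per query

print(sub_vector_dim, raw_bytes, pq_bytes, compression_ratio, fraction_searched)
```

Raising `nprobes` scans more partitions, which generally improves recall at the cost of latency.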
|
|
||||||
| ### List and LargeList - Variable-Length Arrays | ||||||
|
|
||||||
| For variable-length arrays where each row may have a different number of elements, use `List` or `LargeList`: | ||||||
|
|
||||||
| ```python | ||||||
| import lance | ||||||
| import pyarrow as pa | ||||||
|
|
||||||
| # Schema with variable-length arrays | ||||||
| schema = pa.schema([ | ||||||
| pa.field("id", pa.int64()), | ||||||
| pa.field("tags", pa.list_(pa.utf8())), # Variable number of string tags | ||||||
| pa.field("scores", pa.list_(pa.float32())), # Variable number of float scores | ||||||
| ]) | ||||||
|
|
||||||
| table = pa.Table.from_pydict({ | ||||||
| "id": [1, 2, 3], | ||||||
| "tags": [["python", "ml"], ["rust"], ["data", "analytics", "ai"]], | ||||||
| "scores": [[0.9, 0.8], [0.95], [0.7, 0.85, 0.9]], | ||||||
| }, schema=schema) | ||||||
|
|
||||||
| ds = lance.write_dataset(table, "./variable_arrays.lance") | ||||||
| ``` | ||||||
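
As a contrast with `FixedSizeList`, the sketch below models how a variable-length list column is laid out: one flat values buffer plus an offsets buffer with one more entry than there are rows. In Arrow, `List` uses 32-bit offsets and `LargeList` uses 64-bit offsets. This is plain Python for illustration, using the `tags` data from the example above; `get_list` is a hypothetical helper.

```python
# Conceptual sketch only: models a variable-length list layout in plain Python.

# All tag strings from the three rows, concatenated into one flat buffer.
values = ["python", "ml", "rust", "data", "analytics", "ai"]

# offsets[i] is where row i starts; offsets[i + 1] is where it ends.
offsets = [0, 2, 3, 6]


def get_list(values, offsets, row):
    """Return the elements of one row by slicing between adjacent offsets."""
    return values[offsets[row]:offsets[row + 1]]


def row_lengths(offsets):
    """Per-row element counts are the differences between adjacent offsets."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


print(get_list(values, offsets, 2))
print(row_lengths(offsets))
```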
|
|
||||||
| ## Nested and Complex Types | ||||||
|
|
||||||
| ### Struct Types | ||||||
|
|
||||||
| Store structured data with multiple named fields: | ||||||
|
|
||||||
| ```python | ||||||
| from datetime import datetime | ||||||
| import lance | ||||||
| import pyarrow as pa | ||||||
|
|
||||||
| # Schema with nested struct | ||||||
| schema = pa.schema([ | ||||||
| pa.field("id", pa.int64()), | ||||||
| pa.field("metadata", pa.struct([ | ||||||
| pa.field("source", pa.utf8()), | ||||||
| pa.field("timestamp", pa.timestamp("us")), | ||||||
| pa.field("embedding_model", pa.utf8()), | ||||||
| ])), | ||||||
| pa.field("vector", pa.list_(pa.float32(), 384)), # 384-dim embedding | ||||||
| ]) | ||||||
|
|
||||||
| # Timestamps are passed as datetime objects; PyArrow does not parse strings here | ||||||
| table = pa.Table.from_pydict({ | ||||||
| "id": [1, 2], | ||||||
| "metadata": [ | ||||||
| {"source": "web", "timestamp": datetime(2024, 1, 15, 10, 30), "embedding_model": "text-embedding-3-small"}, | ||||||
| {"source": "api", "timestamp": datetime(2024, 1, 15, 11, 45), "embedding_model": "text-embedding-3-small"}, | ||||||
| ], | ||||||
| "vector": [ | ||||||
| [0.1] * 384, | ||||||
| [0.2] * 384, | ||||||
| ], | ||||||
| }, schema=schema) | ||||||
|
|
||||||
| ds = lance.write_dataset(table, "./with_metadata.lance") | ||||||
| ``` | ||||||
|
|
||||||
| ### Map Types | ||||||
|
|
||||||
| Store key-value pairs with dynamic keys: | ||||||
|
|
||||||
| ```python | ||||||
| import lance | ||||||
| import pyarrow as pa | ||||||
|
|
||||||
| schema = pa.schema([ | ||||||
| pa.field("id", pa.int64()), | ||||||
| pa.field("attributes", pa.map_(pa.utf8(), pa.utf8())), | ||||||
| ]) | ||||||
|
|
||||||
| table = pa.Table.from_pydict({ | ||||||
| "id": [1, 2], | ||||||
| "attributes": [ | ||||||
| [("color", "red"), ("size", "large")], | ||||||
| [("color", "blue"), ("material", "cotton")], | ||||||
| ], | ||||||
| }, schema=schema) | ||||||
|
|
||||||
| ds = lance.write_dataset(table, "./with_maps.lance") | ||||||
| ``` | ||||||
|
|
||||||
| ## Data Type Mapping for Integrations | ||||||
|
|
||||||
| When integrating Lance with other systems (like Apache Flink, Spark, or Presto), the following type mappings apply: | ||||||
|
|
||||||
| | External Type | Lance/Arrow Type | Notes | | ||||||
| |--------------|------------------|-------| | ||||||
| | `BOOLEAN` | `Boolean` | | | ||||||
| | `TINYINT` | `Int8` | | | ||||||
| | `SMALLINT` | `Int16` | | | ||||||
| | `INT` / `INTEGER` | `Int32` | | | ||||||
| | `BIGINT` | `Int64` | | | ||||||
| | `FLOAT` | `Float32` | | | ||||||
| | `DOUBLE` | `Float64` | | | ||||||
| | `DECIMAL(p,s)` | `Decimal128(p,s)` | | | ||||||
| | `STRING` / `VARCHAR` | `Utf8` | | | ||||||
| | `CHAR(n)` | `Utf8` | Fixed-width string | | ||||||
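
The rows shown above can be expressed as a small lookup helper. The sketch below is a hypothetical illustration in plain Python that returns Arrow type names as strings; `SIMPLE_TYPES` and `to_arrow_type_name` are made-up names, not part of Lance or any connector, and only the mappings listed in the table are covered.

```python
# Hypothetical helper: maps the external type names listed above to Arrow type names.
import re

SIMPLE_TYPES = {
    "BOOLEAN": "Boolean",
    "TINYINT": "Int8",
    "SMALLINT": "Int16",
    "INT": "Int32",
    "INTEGER": "Int32",
    "BIGINT": "Int64",
    "FLOAT": "Float32",
    "DOUBLE": "Float64",
    "STRING": "Utf8",
    "VARCHAR": "Utf8",
}

# Matches parameterized names such as DECIMAL(10, 2) or CHAR(8).
_PARAMETERIZED = re.compile(r"^([A-Z]+)\s*\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\)$")


def to_arrow_type_name(external_type):
    """Return the Arrow type name for an external type, or raise ValueError."""
    normalized = external_type.strip().upper()
    if normalized in SIMPLE_TYPES:
        return SIMPLE_TYPES[normalized]
    match = _PARAMETERIZED.match(normalized)
    if match:
        base, first, second = match.groups()
        if base == "DECIMAL" and second is not None:
            return f"Decimal128({first},{second})"
        if base == "CHAR" and second is None:
            return "Utf8"  # the declared width is not preserved
    raise ValueError(f"no mapping listed for {external_type!r}")


print(to_arrow_type_name("bigint"))
print(to_arrow_type_name("DECIMAL(10, 2)"))
```

Raising on unknown names, instead of guessing, keeps an unsupported external type from being silently written with the wrong Arrow type.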
Suggested change:
- | `CHAR(n)` | `Utf8` | Fixed-width string |
+ | `CHAR(n)` | `Utf8` | Fixed-width in source system; stored as variable-length Utf8 |

The "Example Use Case" column lists "Timestamps" for Date types, which is misleading. Date types (`Date32`, `Date64`) store only dates without time information, while the `Timestamp` type (listed separately on line 20) stores both date and time. Consider changing the example use case for Date types to something like "Birth dates, event dates" to clearly differentiate from timestamps.