Merged
Changes from 3 commits
27 changes: 26 additions & 1 deletion docs/src/format/table/index.md
@@ -25,7 +25,7 @@ a monotonically increasing version number, and an optional reference to the inde

## Schema & Fields

-The schema of the table is written as a series of fields, plus a schema metadata map.
The schema of the table is written as a series of fields, plus a schema metadata map.
The data types generally have a 1-1 correspondence with the Apache Arrow data types.
Each field, including nested fields, has a unique integer ID. At initial table creation time, fields are assigned IDs in depth-first order.
Afterwards, field IDs are assigned incrementally for newly added fields.
@@ -42,6 +42,31 @@ See [File Format Encoding Specification](../file/encoding.md) for details on ava

</details>

### Unenforced Primary Key

Review comment (Contributor): thanks for adding this! I missed it in the doc refresh

Lance supports defining an unenforced primary key through field metadata.
This is useful for deduplication during merge-insert operations and for other use cases that benefit from logical row identity.
The primary key is "unenforced", meaning that Lance does not always validate uniqueness constraints.
Users can rely on specific workloads, such as merge-insert, to enforce uniqueness if necessary.
Once set, the primary key is fixed and must not be updated or removed.

A primary key field must satisfy:

- The field, and all its ancestors, must not be nullable.
- The field must be a leaf field (primitive data type without children).
- The field must not be within a list or map type.

To mark a field as part of the primary key, add the following metadata to the Arrow field:

- `lance-schema:unenforced-primary-key`: Set to `true`, `1`, or `yes` (case-insensitive) to indicate the field is part of the primary key.
- `lance-schema:unenforced-primary-key:field-id` (optional): A 1-based integer giving the field's ID, and therefore its position, within a composite primary key.
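
As a rough illustration of how these two metadata keys might be interpreted, the sketch below works on a plain `HashMap<String, String>` rather than a real Arrow field. The helper names `primary_key_metadata` and `primary_key_field_id` are hypothetical and exist only for this example. The parsing mirrors the fallback in this change: an explicit field ID wins, and the boolean flag alone maps to `0`.

```rust
use std::collections::HashMap;

// Metadata keys as named in the list above.
const PK_KEY: &str = "lance-schema:unenforced-primary-key";
const PK_FIELD_ID_KEY: &str = "lance-schema:unenforced-primary-key:field-id";

/// Hypothetical helper: build the metadata map for a primary key field.
/// `position` is the optional 1-based field ID within a composite key.
fn primary_key_metadata(position: Option<u32>) -> HashMap<String, String> {
    let mut metadata = HashMap::new();
    metadata.insert(PK_KEY.to_owned(), "true".to_owned());
    if let Some(p) = position {
        metadata.insert(PK_FIELD_ID_KEY.to_owned(), p.to_string());
    }
    metadata
}

/// Hypothetical helper: interpret the metadata.
/// - `None`: the field is not part of the primary key.
/// - `Some(0)`: flagged as primary key, no explicit field ID.
/// - `Some(n)`: explicit field ID `n`.
fn primary_key_field_id(metadata: &HashMap<String, String>) -> Option<u32> {
    metadata
        .get(PK_FIELD_ID_KEY)
        .and_then(|s| s.parse::<u32>().ok())
        .or_else(|| {
            // Fall back to the boolean flag; accepted values are case-insensitive.
            metadata
                .get(PK_KEY)
                .filter(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
                .map(|_| 0)
        })
}
```

With a real Arrow field, the same map would presumably be attached through the field's metadata, as the tests in this change do.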

For composite primary keys with multiple columns, the field ID determines the primary key field ordering:

- When field IDs are specified, fields are ordered by their field ID values (1, 2, 3, ...).
- When field IDs are not specified, fields are ordered by their Lance schema field ID.
- Fields with explicit field IDs are ordered before fields without explicit field IDs.
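
The three ordering rules above can be sketched with a composite sort key. This is a simplified illustration, not the library's code: `PkField` and `order_primary_key` are hypothetical stand-ins, and a `pk_field_id` of `0` is assumed to mean that no explicit field ID was given.

```rust
/// Hypothetical, simplified stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct PkField {
    name: &'static str,
    /// Lance schema field ID.
    schema_id: i32,
    /// Explicit 1-based primary key field ID; 0 means none was given.
    pk_field_id: u32,
}

/// Order primary key fields: explicit field IDs first (ascending),
/// then the remaining fields by schema field ID.
fn order_primary_key(mut fields: Vec<PkField>) -> Vec<PkField> {
    fields.sort_by_key(|f| {
        if f.pk_field_id > 0 {
            // `false` sorts before `true`, so explicit IDs come first.
            (false, i64::from(f.pk_field_id), f.schema_id)
        } else {
            (true, i64::from(f.schema_id), f.schema_id)
        }
    });
    fields
}
```

For example, given fields `e` (no explicit ID), `f` (explicit ID 1), and `g` (no explicit ID) declared in that order, this sketch yields `f`, `e`, `g`.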

## Fragments

![Fragment Structure](../../images/fragment_structure.png)
5 changes: 5 additions & 0 deletions protos/file.proto
@@ -166,6 +166,11 @@ message Field {

  bool unenforced_primary_key = 12;

  // Field ID of this field in the primary key (1-based).
  // 0 means the field is part of the primary key but no explicit field ID is set.
  // When set to a positive value, primary key fields are ordered by this field ID.
  uint32 unenforced_primary_key_field_id = 13;

  // DEPRECATED ----------------------------------------------------------------

  // Deprecated: Only used in V1 file format. V2 uses variable encodings defined
40 changes: 31 additions & 9 deletions rust/lance-core/src/datatypes/field.rs
@@ -42,6 +42,13 @@ use crate::{
/// (3) The field must not be within a list type.
pub const LANCE_UNENFORCED_PRIMARY_KEY: &str = "lance-schema:unenforced-primary-key";

/// Use this config key in Arrow field metadata to specify the field ID of a primary key column.
/// The value is a 1-based integer indicating the order within the composite primary key.
/// When specified, primary key fields are ordered by this field ID.
/// When not specified, primary key fields are ordered by their lance schema field id.
pub const LANCE_UNENFORCED_PRIMARY_KEY_FIELD_ID: &str =
    "lance-schema:unenforced-primary-key:field-id";

fn has_blob_v2_extension(field: &ArrowField) -> bool {
    field
        .metadata()
@@ -148,7 +155,11 @@ pub struct Field {

    /// Dictionary value array if this field is dictionary.
    pub dictionary: Option<Dictionary>,
-    pub unenforced_primary_key: bool,

    /// Field ID of this field in the primary key (1-based).
    /// None means the field is not part of the primary key.
    /// Some(0) means the field is part of the primary key but has no explicit field ID.
    /// Some(n) with n > 0 means this field is the nth column in the primary key.
    pub unenforced_primary_key_field_id: Option<u32>,
}

impl Field {
@@ -574,7 +585,7 @@ impl Field {
            nullable: self.nullable,
            children: vec![],
            dictionary: self.dictionary.clone(),
-            unenforced_primary_key: self.unenforced_primary_key,
            unenforced_primary_key_field_id: self.unenforced_primary_key_field_id,
        };
        if path_components.is_empty() {
            // Project stops here, copy all the remaining children.
@@ -845,7 +856,7 @@ impl Field {
                nullable: self.nullable,
                children,
                dictionary: self.dictionary.clone(),
-                unenforced_primary_key: self.unenforced_primary_key,
                unenforced_primary_key_field_id: self.unenforced_primary_key_field_id,
            };
            return Ok(f);
        }
@@ -908,7 +919,7 @@ impl Field {
                nullable: self.nullable,
                children,
                dictionary: self.dictionary.clone(),
-                unenforced_primary_key: self.unenforced_primary_key,
                unenforced_primary_key_field_id: self.unenforced_primary_key_field_id,
            })
        }
    }
@@ -1038,6 +1049,11 @@ impl Field {
    pub fn is_leaf(&self) -> bool {
        self.children.is_empty()
    }

    /// Return true if the field is part of the (unenforced) primary key.
    pub fn is_unenforced_primary_key(&self) -> bool {
        self.unenforced_primary_key_field_id.is_some()
    }
}

impl fmt::Display for Field {
@@ -1114,10 +1130,16 @@ impl TryFrom<&ArrowField> for Field {
            }
            _ => vec![],
        };
-        let unenforced_primary_key = metadata
-            .get(LANCE_UNENFORCED_PRIMARY_KEY)
-            .map(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
-            .unwrap_or(false);
        let unenforced_primary_key_field_id = metadata
            .get(LANCE_UNENFORCED_PRIMARY_KEY_FIELD_ID)
            .and_then(|s| s.parse::<u32>().ok())
            .or_else(|| {
                // Backward compatibility: use 0 for legacy boolean flag
                metadata
                    .get(LANCE_UNENFORCED_PRIMARY_KEY)
                    .filter(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
                    .map(|_| 0)
            });
        let is_blob_v2 = has_blob_v2_extension(field);

        if is_blob_v2 {
@@ -1154,7 +1176,7 @@ impl TryFrom<&ArrowField> for Field {
            nullable: field.is_nullable(),
            children,
            dictionary: None,
-            unenforced_primary_key,
            unenforced_primary_key_field_id,
        })
    }
}
131 changes: 127 additions & 4 deletions rust/lance-core/src/datatypes/schema.rs
@@ -111,11 +111,27 @@ impl<'a> Iterator for SchemaFieldIterPreOrder<'a> {
}

impl Schema {
-    /// The unenforced primary key fields in the schema
    /// The unenforced primary key fields in the schema, ordered by field ID.
    ///
    /// Fields with explicit field IDs (1, 2, 3, ...) are ordered by their field ID.
    /// Fields without explicit field IDs (using the legacy boolean flag) are ordered
    /// by their schema field id and come after fields with explicit field IDs.
    pub fn unenforced_primary_key(&self) -> Vec<&Field> {
-        self.fields_pre_order()
-            .filter(|f| f.unenforced_primary_key)
-            .collect::<Vec<_>>()
        let mut pk_fields: Vec<&Field> = self
            .fields_pre_order()
            .filter(|f| f.is_unenforced_primary_key())
            .collect();

        pk_fields.sort_by_key(|f| {
            let pk_field_id = f.unenforced_primary_key_field_id.unwrap_or(0);
            if pk_field_id > 0 {
                (false, pk_field_id as i32, f.id)
            } else {
                (true, f.id, f.id)
            }
        });

        pk_fields
    }

    pub fn compare_with_options(&self, expected: &Self, options: &SchemaCompareOptions) -> bool {
@@ -2599,4 +2615,111 @@ mod tests {
                .contains(error_message_contains[idx]));
        }
    }

    #[test]
    fn test_schema_unenforced_primary_key_ordering() {
        use crate::datatypes::field::LANCE_UNENFORCED_PRIMARY_KEY_FIELD_ID;

        // When field IDs are specified, fields are ordered by their field ID values
        let arrow_schema = ArrowSchema::new(vec![
            ArrowField::new("a", DataType::Int32, false).with_metadata(
                vec![
                    (
                        "lance-schema:unenforced-primary-key".to_owned(),
                        "true".to_owned(),
                    ),
                    (
                        LANCE_UNENFORCED_PRIMARY_KEY_FIELD_ID.to_owned(),
                        "2".to_owned(),
                    ),
                ]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
            ArrowField::new("b", DataType::Int64, false).with_metadata(
                vec![
                    (
                        "lance-schema:unenforced-primary-key".to_owned(),
                        "true".to_owned(),
                    ),
                    (
                        LANCE_UNENFORCED_PRIMARY_KEY_FIELD_ID.to_owned(),
                        "1".to_owned(),
                    ),
                ]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
        ]);
        let schema = Schema::try_from(&arrow_schema).unwrap();
        let pk_fields = schema.unenforced_primary_key();
        assert_eq!(pk_fields.len(), 2);
        assert_eq!(pk_fields[0].name, "b");
        assert_eq!(pk_fields[1].name, "a");

        // When field IDs are not specified, fields are ordered by their lance schema field id
        let arrow_schema = ArrowSchema::new(vec![
            ArrowField::new("c", DataType::Int32, false).with_metadata(
                vec![(
                    "lance-schema:unenforced-primary-key".to_owned(),
                    "true".to_owned(),
                )]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
            ArrowField::new("d", DataType::Int64, false).with_metadata(
                vec![(
                    "lance-schema:unenforced-primary-key".to_owned(),
                    "true".to_owned(),
                )]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
        ]);
        let schema = Schema::try_from(&arrow_schema).unwrap();
        let pk_fields = schema.unenforced_primary_key();
        assert_eq!(pk_fields.len(), 2);
        assert_eq!(pk_fields[0].name, "c");
        assert_eq!(pk_fields[1].name, "d");

        // Fields with explicit field IDs are ordered before fields without explicit field IDs
        let arrow_schema = ArrowSchema::new(vec![
            ArrowField::new("e", DataType::Int32, false).with_metadata(
                vec![(
                    "lance-schema:unenforced-primary-key".to_owned(),
                    "true".to_owned(),
                )]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
            ArrowField::new("f", DataType::Int64, false).with_metadata(
                vec![
                    (
                        "lance-schema:unenforced-primary-key".to_owned(),
                        "true".to_owned(),
                    ),
                    (
                        LANCE_UNENFORCED_PRIMARY_KEY_FIELD_ID.to_owned(),
                        "1".to_owned(),
                    ),
                ]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
            ArrowField::new("g", DataType::Utf8, false).with_metadata(
                vec![(
                    "lance-schema:unenforced-primary-key".to_owned(),
                    "true".to_owned(),
                )]
                .into_iter()
                .collect::<HashMap<_, _>>(),
            ),
        ]);
        let schema = Schema::try_from(&arrow_schema).unwrap();
        let pk_fields = schema.unenforced_primary_key();
        assert_eq!(pk_fields.len(), 3);
        assert_eq!(pk_fields[0].name, "f");
        assert_eq!(pk_fields[1].name, "e");
        assert_eq!(pk_fields[2].name, "g");
    }
}
11 changes: 9 additions & 2 deletions rust/lance-file/src/datatypes.rs
@@ -45,7 +45,13 @@ impl From<&pb::Field> for Field {
            nullable: field.nullable,
            children: vec![],
            dictionary: field.dictionary.as_ref().map(Dictionary::from),
-            unenforced_primary_key: field.unenforced_primary_key,
            unenforced_primary_key_field_id: if field.unenforced_primary_key_field_id > 0 {
                Some(field.unenforced_primary_key_field_id)
            } else if field.unenforced_primary_key {
                Some(0)
            } else {
                None
            },
        }
    }
}
@@ -77,7 +83,8 @@ impl From<&Field> for pb::Field {
                .map(|name| name.to_owned())
                .unwrap_or_default(),
            r#type: 0,
-            unenforced_primary_key: field.unenforced_primary_key,
            unenforced_primary_key: field.unenforced_primary_key_field_id.is_some(),
            unenforced_primary_key_field_id: field.unenforced_primary_key_field_id.unwrap_or(0),
        }
    }
}
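
To make the mapping between the two protobuf fields and the in-memory `Option<u32>` easier to follow, here is a minimal standalone sketch. `PbPrimaryKey`, `from_pb`, and `to_pb` are hypothetical names for illustration only. They isolate the two conversions shown in this file and are not the generated protobuf types.

```rust
/// Hypothetical stand-in for the two primary-key fields of the protobuf `Field` message.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct PbPrimaryKey {
    unenforced_primary_key: bool,
    unenforced_primary_key_field_id: u32,
}

/// Protobuf -> in-memory representation.
fn from_pb(pb: PbPrimaryKey) -> Option<u32> {
    if pb.unenforced_primary_key_field_id > 0 {
        // Explicit 1-based field ID.
        Some(pb.unenforced_primary_key_field_id)
    } else if pb.unenforced_primary_key {
        // Boolean flag only (for example, data written before the field ID existed).
        Some(0)
    } else {
        None
    }
}

/// In-memory representation -> protobuf.
fn to_pb(id: Option<u32>) -> PbPrimaryKey {
    PbPrimaryKey {
        unenforced_primary_key: id.is_some(),
        unenforced_primary_key_field_id: id.unwrap_or(0),
    }
}
```

Under these assumptions the conversion round-trips for `None`, `Some(0)`, and any `Some(n)` with `n > 0`, and a message that carries only the boolean flag reads back as `Some(0)`.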