Merged
26 changes: 25 additions & 1 deletion docs/src/format/table/index.md
@@ -25,7 +25,7 @@ a monotonically increasing version number, and an optional reference to the inde

## Schema & Fields

The schema of the table is written as a series of fields, plus a schema metadata map.
The schema of the table is written as a series of fields, plus a schema metadata map.
The data types generally have a 1-1 correspondence with the Apache Arrow data types.
Each field, including nested fields, has a unique integer id. At initial table creation time, fields are assigned ids in depth-first order.
Afterwards, field IDs are assigned incrementally for newly added fields.
@@ -42,6 +42,30 @@ See [File Format Encoding Specification](../file/encoding.md) for details on ava

</details>

### Unenforced Primary Key

Review comment (Contributor): thanks for adding this! I missed it in the doc refresh


Lance supports defining an unenforced primary key through field metadata.
This is useful for deduplication during merge-insert operations and other use cases that benefit from logical row identity.
The primary key is "unenforced", meaning that Lance does not always validate uniqueness.
Users can rely on specific workloads, such as merge-insert, to enforce uniqueness when necessary.

A primary key field must satisfy:

- The field, and all its ancestors, must not be nullable.
- The field must be a leaf field (primitive data type without children).
- The field must not be within a list or map type.

To mark a field as part of the primary key, add the following metadata to the Arrow field:

- `lance-schema:unenforced-primary-key`: Set to `true`, `1`, or `yes` (case-insensitive) to indicate the field is part of the primary key.
- `lance-schema:unenforced-primary-key:position` (optional): A 1-based integer specifying the field's position within a composite primary key.
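
As a rough illustration of how these two keys combine, the sketch below mirrors the parsing logic in the `field.rs` hunk of this change. It works on a plain `HashMap` rather than a real Arrow field, and the helper name `primary_key_position` and the constant names are hypothetical, chosen only for this example.

```rust
use std::collections::HashMap;

// Metadata keys as documented above.
const PK_KEY: &str = "lance-schema:unenforced-primary-key";
const PK_POSITION_KEY: &str = "lance-schema:unenforced-primary-key:position";

/// Hypothetical helper (not part of the Lance API).
/// Returns `None` when the field is not part of the primary key,
/// `Some(0)` when only the boolean flag is set (no explicit position),
/// and `Some(n)` when an explicit position `n` parses successfully.
fn primary_key_position(metadata: &HashMap<String, String>) -> Option<u32> {
    metadata
        .get(PK_POSITION_KEY)
        // A position value that does not parse as u32 is ignored here.
        .and_then(|s| s.parse::<u32>().ok())
        .or_else(|| {
            // Fall back to the boolean flag, compared case-insensitively.
            metadata
                .get(PK_KEY)
                .filter(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
                .map(|_| 0)
        })
}
```

In this sketch a field carrying only the boolean flag maps to `Some(0)`, which the ordering rules treat as "part of the key, no explicit position".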

For composite primary keys with multiple columns, the position determines the column ordering:

- When positions are specified, fields are ordered by their position values (1, 2, 3, ...).
- When positions are not specified, fields are ordered by their Lance schema field id.
- Fields with explicit positions are ordered before fields without explicit positions.
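
The ordering rules above can be sketched as a single sort key. This is a minimal illustration under stated assumptions, not the Lance implementation: `PkField` and `order_primary_key` are hypothetical names, and a position of `0` is assumed to stand for "no explicit position", as in the protobuf comment in this change.

```rust
/// Hypothetical stand-in for a Lance field, reduced to what ordering needs.
#[derive(Debug)]
struct PkField {
    name: &'static str,
    /// Lance schema field id.
    id: i32,
    /// 0 = part of the key without an explicit position; n > 0 = explicit 1-based position.
    position: u32,
}

/// Orders primary key fields: explicit positions first (by position),
/// then fields without a position (by field id).
fn order_primary_key(mut fields: Vec<PkField>) -> Vec<PkField> {
    fields.sort_by_key(|f| {
        if f.position > 0 {
            // `false` sorts before `true`, so positioned fields come first.
            (false, i64::from(f.position), f.id)
        } else {
            (true, i64::from(f.id), f.id)
        }
    });
    fields
}
```

The sketch widens both values to `i64` so that a very large position cannot wrap around; the `schema.rs` hunk in this change casts the position to `i32` instead.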

## Fragments

![Fragment Structure](../../images/fragment_structure.png)
5 changes: 5 additions & 0 deletions protos/file.proto
@@ -166,6 +166,11 @@ message Field {

bool unenforced_primary_key = 12;

// Position of this field in the primary key (1-based).
// 0 means no explicit position is set; the field is then part of the primary key
// only if unenforced_primary_key is true.
// When set to a positive value, primary key fields are ordered by this position.
uint32 unenforced_primary_key_position = 13;

// DEPRECATED ----------------------------------------------------------------

// Deprecated: Only used in V1 file format. V2 uses variable encodings defined
42 changes: 33 additions & 9 deletions rust/lance-core/src/datatypes/field.rs
@@ -42,6 +42,13 @@ use crate::{
/// (3) The field must not be within a list type.
pub const LANCE_UNENFORCED_PRIMARY_KEY: &str = "lance-schema:unenforced-primary-key";

/// Use this config key in Arrow field metadata to specify the position of a primary key column.
/// The value is a 1-based integer indicating the order within the composite primary key.
/// When specified, primary key fields are ordered by this position.
/// When not specified, primary key fields are ordered by their lance schema field id.
pub const LANCE_UNENFORCED_PRIMARY_KEY_POSITION: &str =
"lance-schema:unenforced-primary-key:position";

fn has_blob_v2_extension(field: &ArrowField) -> bool {
field
.metadata()
@@ -148,7 +155,11 @@ pub struct Field {

/// Dictionary value array if this field is dictionary.
pub dictionary: Option<Dictionary>,
pub unenforced_primary_key: bool,

/// Position of this field in the primary key (1-based).
/// None means the field is not part of the primary key.
/// Some(0) means the field is part of the primary key but has no explicit position.
/// Some(n) with n > 0 means this field is the nth column in the primary key.
pub unenforced_primary_key_position: Option<u32>,
}

impl Field {
@@ -574,7 +585,7 @@ impl Field {
nullable: self.nullable,
children: vec![],
dictionary: self.dictionary.clone(),
unenforced_primary_key: self.unenforced_primary_key,
unenforced_primary_key_position: self.unenforced_primary_key_position,
};
if path_components.is_empty() {
// Project stops here, copy all the remaining children.
@@ -845,7 +856,7 @@ impl Field {
nullable: self.nullable,
children,
dictionary: self.dictionary.clone(),
unenforced_primary_key: self.unenforced_primary_key,
unenforced_primary_key_position: self.unenforced_primary_key_position,
};
return Ok(f);
}
@@ -908,7 +919,7 @@ impl Field {
nullable: self.nullable,
children,
dictionary: self.dictionary.clone(),
unenforced_primary_key: self.unenforced_primary_key,
unenforced_primary_key_position: self.unenforced_primary_key_position,
})
}
}
@@ -1038,6 +1049,11 @@ impl Field {
pub fn is_leaf(&self) -> bool {
self.children.is_empty()
}

/// Return true if the field is part of the (unenforced) primary key.
pub fn is_unenforced_primary_key(&self) -> bool {
self.unenforced_primary_key_position.is_some()
}
}

impl fmt::Display for Field {
@@ -1114,10 +1130,18 @@ impl TryFrom<&ArrowField> for Field {
}
_ => vec![],
};
let unenforced_primary_key = metadata
.get(LANCE_UNENFORCED_PRIMARY_KEY)
.map(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
.unwrap_or(false);
// Parse primary key position: first try explicit position, then fall back to boolean flag
let unenforced_primary_key_position = metadata
.get(LANCE_UNENFORCED_PRIMARY_KEY_POSITION)
.and_then(|s| s.parse::<u32>().ok())
.or_else(|| {
// Backward compatibility: if only the boolean flag is set, use 0 to indicate
// "is PK but no explicit position" (will be ordered by field id)
metadata
.get(LANCE_UNENFORCED_PRIMARY_KEY)
.filter(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
.map(|_| 0)
});
let is_blob_v2 = has_blob_v2_extension(field);

if is_blob_v2 {
@@ -1154,7 +1178,7 @@ impl TryFrom<&ArrowField> for Field {
nullable: field.is_nullable(),
children,
dictionary: None,
unenforced_primary_key,
unenforced_primary_key_position,
})
}
}
135 changes: 131 additions & 4 deletions rust/lance-core/src/datatypes/schema.rs
@@ -111,11 +111,31 @@ impl<'a> Iterator for SchemaFieldIterPreOrder<'a> {
}

impl Schema {
/// The unenforced primary key fields in the schema
/// The unenforced primary key fields in the schema, ordered by position.
///
/// Fields with explicit positions (1, 2, 3, ...) are ordered by their position.
/// Fields without explicit positions (using the legacy boolean flag) are ordered
/// by their schema field id and come after explicitly positioned fields.
pub fn unenforced_primary_key(&self) -> Vec<&Field> {
self.fields_pre_order()
.filter(|f| f.unenforced_primary_key)
.collect::<Vec<_>>()
let mut pk_fields: Vec<&Field> = self
.fields_pre_order()
.filter(|f| f.is_unenforced_primary_key())
.collect();

// Sort by position, with fields lacking explicit position (position=0)
// coming after explicitly positioned fields, sorted by field id
pk_fields.sort_by_key(|f| {
let pos = f.unenforced_primary_key_position.unwrap_or(0);
if pos > 0 {
// Explicit position: sort by position, then by field id for stability
(false, pos as i32, f.id)
} else {
// No explicit position: sort by field id, after explicit positions
(true, f.id, f.id)
}
});

pk_fields
}

pub fn compare_with_options(&self, expected: &Self, options: &SchemaCompareOptions) -> bool {
@@ -2599,4 +2619,111 @@ mod tests {
.contains(error_message_contains[idx]));
}
}

#[test]
fn test_schema_unenforced_primary_key_ordering() {
use crate::datatypes::field::LANCE_UNENFORCED_PRIMARY_KEY_POSITION;

// Test 1: Explicit positions should order by position
let arrow_schema = ArrowSchema::new(vec![
ArrowField::new("a", DataType::Int32, false).with_metadata(
vec![
(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
),
(
LANCE_UNENFORCED_PRIMARY_KEY_POSITION.to_owned(),
"2".to_owned(),
),
]
.into_iter()
.collect::<HashMap<_, _>>(),
),
ArrowField::new("b", DataType::Int64, false).with_metadata(
vec![
(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
),
(
LANCE_UNENFORCED_PRIMARY_KEY_POSITION.to_owned(),
"1".to_owned(),
),
]
.into_iter()
.collect::<HashMap<_, _>>(),
),
]);
let schema = Schema::try_from(&arrow_schema).unwrap();
let pk_fields = schema.unenforced_primary_key();
assert_eq!(pk_fields.len(), 2);
assert_eq!(pk_fields[0].name, "b"); // position 1
assert_eq!(pk_fields[1].name, "a"); // position 2

// Test 2: No explicit positions should order by field id
let arrow_schema = ArrowSchema::new(vec![
ArrowField::new("c", DataType::Int32, false).with_metadata(
vec![(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
)]
.into_iter()
.collect::<HashMap<_, _>>(),
),
ArrowField::new("d", DataType::Int64, false).with_metadata(
vec![(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
)]
.into_iter()
.collect::<HashMap<_, _>>(),
),
]);
let schema = Schema::try_from(&arrow_schema).unwrap();
let pk_fields = schema.unenforced_primary_key();
assert_eq!(pk_fields.len(), 2);
assert_eq!(pk_fields[0].name, "c"); // field_id 0
assert_eq!(pk_fields[1].name, "d"); // field_id 1

// Test 3: Mixed - explicit positions come before fields without explicit positions
let arrow_schema = ArrowSchema::new(vec![
ArrowField::new("e", DataType::Int32, false).with_metadata(
vec![(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
)]
.into_iter()
.collect::<HashMap<_, _>>(),
),
ArrowField::new("f", DataType::Int64, false).with_metadata(
vec![
(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
),
(
LANCE_UNENFORCED_PRIMARY_KEY_POSITION.to_owned(),
"1".to_owned(),
),
]
.into_iter()
.collect::<HashMap<_, _>>(),
),
ArrowField::new("g", DataType::Utf8, false).with_metadata(
vec![(
"lance-schema:unenforced-primary-key".to_owned(),
"true".to_owned(),
)]
.into_iter()
.collect::<HashMap<_, _>>(),
),
]);
let schema = Schema::try_from(&arrow_schema).unwrap();
let pk_fields = schema.unenforced_primary_key();
assert_eq!(pk_fields.len(), 3);
assert_eq!(pk_fields[0].name, "f"); // explicit position 1
assert_eq!(pk_fields[1].name, "e"); // no explicit position, field_id 0
assert_eq!(pk_fields[2].name, "g"); // no explicit position, field_id 2
}
}
11 changes: 9 additions & 2 deletions rust/lance-file/src/datatypes.rs
@@ -45,7 +45,13 @@ impl From<&pb::Field> for Field {
nullable: field.nullable,
children: vec![],
dictionary: field.dictionary.as_ref().map(Dictionary::from),
unenforced_primary_key: field.unenforced_primary_key,
unenforced_primary_key_position: if field.unenforced_primary_key_position > 0 {
Some(field.unenforced_primary_key_position)
} else if field.unenforced_primary_key {
Some(0)
} else {
None
},
}
}
}
@@ -77,7 +83,8 @@ impl From<&Field> for pb::Field {
.map(|name| name.to_owned())
.unwrap_or_default(),
r#type: 0,
unenforced_primary_key: field.unenforced_primary_key,
unenforced_primary_key: field.unenforced_primary_key_position.is_some(),
unenforced_primary_key_position: field.unenforced_primary_key_position.unwrap_or(0),
}
}
}
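
The two conversions in this file together define how `Option<u32>` maps onto the pair of protobuf fields. The sketch below isolates that mapping so the round trip is easier to check. `PbPk`, `from_pb`, and `to_pb` are hypothetical names introduced for illustration; the real code converts full `pb::Field` and `Field` values.

```rust
/// Hypothetical stand-in for the two protobuf fields involved.
#[derive(Debug, PartialEq, Clone, Copy)]
struct PbPk {
    unenforced_primary_key: bool,
    unenforced_primary_key_position: u32,
}

/// Protobuf -> in-memory: a positive position wins; otherwise the legacy
/// boolean flag decides between "key without position" and "not a key".
fn from_pb(pb: PbPk) -> Option<u32> {
    if pb.unenforced_primary_key_position > 0 {
        Some(pb.unenforced_primary_key_position)
    } else if pb.unenforced_primary_key {
        Some(0)
    } else {
        None
    }
}

/// In-memory -> protobuf: the boolean flag is set for every key field,
/// so readers that only know the flag still see the field as part of the key.
fn to_pb(position: Option<u32>) -> PbPk {
    PbPk {
        unenforced_primary_key: position.is_some(),
        unenforced_primary_key_position: position.unwrap_or(0),
    }
}
```

Under these assumptions, `None`, `Some(0)`, and `Some(n)` each survive a `to_pb` then `from_pb` round trip, and a message with only the boolean flag set reads back as `Some(0)`.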