This repository has been archived by the owner on Aug 13, 2024. It is now read-only.

feat: Add Transform, Partition and Sorting #8

Merged
merged 1 commit into from
Jun 16, 2023

169 changes: 169 additions & 0 deletions src/types.rs
@@ -144,3 +144,172 @@ pub struct SchemaV2 {
/// types contained in this schema.
pub types: Struct,
}

/// Transform is used to transform predicates to partition predicates,
/// in addition to transforming data values.
///
/// Deriving partition predicates from column predicates on the table data
/// is used to separate the logical queries from physical storage: the
/// partitioning can change and the correct partition filters are always
/// derived from column predicates.
///
/// This simplifies queries because users don’t have to supply both logical
/// predicates and partition predicates.
///
/// All transforms must return `null` for a `null` input value.
pub enum Transform {
/// Source value, unmodified
///
/// - Source type could be `Any`.
/// - Return type is the same as the source type.
Identity,
/// Hash of value, mod `N`.
///
/// Bucket partition transforms use a 32-bit hash of the source value.
/// The 32-bit hash implementation is the 32-bit Murmur3 hash, x86
/// variant, seeded with 0.
///
/// Transforms are parameterized by a number of buckets, `N`. The hash mod
/// `N` must produce a non-negative value, which is achieved by first
/// discarding the sign bit of the hash value. In pseudo-code, the function is:
///
/// ```text
/// def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
/// ```
///
/// - Source type could be `int`, `long`, `decimal`, `date`, `time`,
/// `timestamp`, `timestamptz`, `string`, `uuid`, `fixed`, `binary`.
/// - Return type is `int`.
Bucket(i32),
/// Value truncated to width `W`
///
/// For `int`:
///
/// - `v - (v % W)`
/// - example: W=10: 1 → 0, -1 → -10
/// - note: The remainder, `v % W`, must be non-negative, so negative
/// values truncate toward negative infinity.
///
/// For `long`:
///
/// - `v - (v % W)`
/// - example: W=10: 1 → 0, -1 → -10
/// - note: The remainder, `v % W`, must be non-negative, so negative
/// values truncate toward negative infinity.
///
/// For `decimal`:
///
/// - `v - (v % scaled_W)`, where `scaled_W = decimal(W, scale(v))`
/// - example: W=50, s=2: 10.65 → 10.50
///
/// For `string`:
///
/// - Substring of length `L` (the truncate width): `v.substring(0, L)`
/// - example: L=3: iceberg → ice
/// - note: Strings are truncated to a valid UTF-8 string with no more
/// than L code points.
///
/// - Source type could be `int`, `long`, `decimal`, `string`
/// - Return type is the same as the source type.
Truncate(i32),
/// Extract a date or timestamp year, as years from 1970
///
/// - Source type could be `date`, `timestamp`, `timestamptz`
/// - Return type is `int`
Year,
/// Extract a date or timestamp month, as months from 1970-01-01
///
/// - Source type could be `date`, `timestamp`, `timestamptz`
/// - Return type is `int`
Month,
/// Extract a date or timestamp day, as days from 1970-01-01
///
/// - Source type could be `date`, `timestamp`, `timestamptz`
/// - Return type is `int`
Day,
/// Extract a timestamp hour, as hours from 1970-01-01 00:00:00
///
/// - Source type could be `timestamp`, `timestamptz`
/// - Return type is `int`
Hour,
/// Always produces `null`
///
/// The void transform may be used to replace the transform in an
/// existing partition field so that the field is effectively dropped in
/// v1 tables.
///
/// - Source type could be `Any`.
/// - Return type is the source type or `int`.
Void,
}
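
The truncate rules documented on `Transform::Truncate` can be sketched in a few lines. This is a minimal illustration, not the implementation from this PR: the free functions `truncate_i32`, `truncate_i64`, and `truncate_str` are hypothetical names, and the width is assumed to be positive. Rust's `%` operator keeps the sign of the dividend, so the sketch uses `rem_euclid` to obtain the non-negative remainder the documentation requires.

```rust
/// Truncates `v` down to a multiple of `width`.
/// Assumption: `width > 0`. Overflow near `i32::MIN` is not handled here.
fn truncate_i32(v: i32, width: i32) -> i32 {
    // `rem_euclid` always returns a remainder in `0..width`.
    v - v.rem_euclid(width)
}

/// Same rule for `long` values. Assumption: `width > 0`.
fn truncate_i64(v: i64, width: i64) -> i64 {
    v - v.rem_euclid(width)
}

/// Keeps at most `len` Unicode code points of `v`.
/// Slicing at a `char_indices` boundary keeps the result valid UTF-8.
fn truncate_str(v: &str, len: usize) -> &str {
    match v.char_indices().nth(len) {
        // Byte offset of the first code point that is dropped.
        Some((byte_idx, _)) => &v[..byte_idx],
        // The string has `len` code points or fewer: keep all of it.
        None => v,
    }
}
```

With these definitions, `truncate_i32(-1, 10)` gives `-10` and `truncate_str("iceberg", 3)` gives `"ice"`, matching the documented examples. For a `decimal`, one option is to apply `truncate_i64` to the unscaled value: with scale 2, `10.65` is stored as `1065`, and `truncate_i64(1065, 50)` gives `1050`, that is `10.50`.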

/// Data files are stored in manifests with a tuple of partition values
/// that are used in scans to filter out files that cannot contain records
/// that match the scan’s filter predicate.
///
/// Partition values for a data file must be the same for all records stored
/// in the data file. (Manifests store data files from any partition, as long
/// as the partition spec is the same for the data files.)
pub struct Partition {
/// A source column id from the table’s schema
pub source_column_id: i32,
/// A partition field id that is used to identify a partition field
/// and is unique within a partition spec.
///
/// In v2 table metadata, it is unique across all partition specs.
pub partition_field_id: i32,
/// A transform that is applied to the source column to produce
/// a partition value
///
/// The source column, selected by id, must be a primitive type
/// and cannot be contained in a map or list, but may be nested in
/// a struct.
pub transform: Transform,
/// A partition name
pub name: String,
}
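
To illustrate how a partition value could be derived from a source value, the following sketch implements the bucket transform described on `Transform::Bucket`: a 32-bit Murmur3 hash (x86 variant, seed 0), with the sign bit discarded before taking the remainder. The function names are hypothetical and not part of this PR. Hashing a `long` as 8 little-endian bytes and a `string` as its UTF-8 bytes is an assumption taken from the Apache Iceberg spec's hash requirements; any real implementation should be checked against the test vectors published there.

```rust
const C1: u32 = 0xcc9e_2d51;
const C2: u32 = 0x1b87_3593;

/// Scrambles one 32-bit block before it is mixed into the hash state.
fn mix_k(k: u32) -> u32 {
    k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2)
}

/// 32-bit Murmur3 hash, x86 variant.
fn murmur3_x86_32(data: &[u8], seed: u32) -> u32 {
    let mut h = seed;
    // Body: consume the input four bytes at a time, little-endian.
    let mut blocks = data.chunks_exact(4);
    for block in &mut blocks {
        let k = u32::from_le_bytes([block[0], block[1], block[2], block[3]]);
        h ^= mix_k(k);
        h = h.rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }
    // Tail: the remaining 1 to 3 bytes, if any.
    let tail = blocks.remainder();
    if !tail.is_empty() {
        let mut k = 0u32;
        for (i, &byte) in tail.iter().enumerate() {
            k |= u32::from(byte) << (8 * i);
        }
        h ^= mix_k(k);
    }
    // Finalization: mix in the length and avalanche the bits.
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// `(hash & Integer.MAX_VALUE) % N` from the pseudo-code.
/// Assumption: `n > 0` (a zero `n` would panic on the remainder).
fn bucket_of_hash(hash: u32, n: i32) -> i32 {
    ((hash & i32::MAX as u32) as i32) % n
}

/// Bucket for a `long` value (assumed encoding: 8 bytes, little-endian).
fn bucket_i64(value: i64, n: i32) -> i32 {
    bucket_of_hash(murmur3_x86_32(&value.to_le_bytes(), 0), n)
}

/// Bucket for a `string` value (assumed encoding: UTF-8 bytes).
fn bucket_str(value: &str, n: i32) -> i32 {
    bucket_of_hash(murmur3_x86_32(value.as_bytes(), 0), n)
}
```

Masking with `i32::MAX` clears the sign bit, so the result always falls in `0..n` regardless of the hash value.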

/// Users can sort their data within partitions by columns to gain
/// performance. The information on how the data is sorted can be declared
/// per data or delete file, by a sort order.
///
/// - Order id `0` is reserved for the unsorted order.
/// - Sorting floating-point numbers should produce the following behavior:
/// `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity`
/// < `NaN`
pub struct SortOrder {
/// The sort order id of this SortOrder
pub id: i32,
/// The order of the sort fields within the list defines the order in
/// which the sort is applied to the data
pub fields: Vec<SortField>,
}
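
The floating-point ordering listed on `SortOrder` matches the IEEE 754 `totalOrder` predicate, which the Rust standard library exposes as `f64::total_cmp`. A minimal sketch, assuming the values are already available as `f64` (the helper name `sort_floats` is hypothetical):

```rust
/// Sorts in the order
/// -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN.
fn sort_floats(values: &mut [f64]) {
    // `total_cmp` is a total order, so it also distinguishes -0.0 from
    // 0.0 and orders NaNs by their sign bit.
    values.sort_by(|a, b| a.total_cmp(b));
}
```

Unlike `partial_cmp`, `total_cmp` never returns `None`, so the sort cannot fail on `NaN` inputs.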

/// Field of the specified sort order.
pub struct SortField {
/// A source column id from the table’s schema
pub source_column_id: i32,
/// A transform that is applied to the source column to produce
/// the values to sort on
///
/// The source column, selected by id, must be a primitive type
/// and cannot be contained in a map or list, but may be nested in
/// a struct.
pub transform: Transform,
/// Sort direction, which can only be either `asc` or `desc`
pub direction: SortDirection,
/// A null order that describes the order of null values when sorted.
/// Can only be either nulls-first or nulls-last
pub null_order: NullOrder,
}

/// Sort direction, which can only be either `asc` or `desc`
pub enum SortDirection {
ASC,
DESC,
}

/// A null order that describes the order of null values when sorted.
/// Can only be either nulls-first or nulls-last
pub enum NullOrder {
NullsFirst,
NullsLast,
}
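
One way a reader or writer might combine `SortDirection` and `NullOrder` is a comparator over nullable values. The sketch below is self-contained, so it redeclares the two enums. The function `compare_nullable` is a hypothetical helper, and it assumes that the null order is applied independently of the sort direction (`NullsLast` puts nulls at the end for both `ASC` and `DESC`).

```rust
use std::cmp::Ordering;

// Redeclared here only so the snippet compiles on its own.
#[derive(Clone, Copy)]
enum SortDirection {
    ASC,
    DESC,
}

#[derive(Clone, Copy)]
enum NullOrder {
    NullsFirst,
    NullsLast,
}

/// Compares two nullable values for one sort field.
/// Assumption: null placement does not flip with the direction.
fn compare_nullable<T: Ord>(
    a: &Option<T>,
    b: &Option<T>,
    direction: SortDirection,
    null_order: NullOrder,
) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => match null_order {
            NullOrder::NullsFirst => Ordering::Less,
            NullOrder::NullsLast => Ordering::Greater,
        },
        (Some(_), None) => match null_order {
            NullOrder::NullsFirst => Ordering::Greater,
            NullOrder::NullsLast => Ordering::Less,
        },
        (Some(x), Some(y)) => match direction {
            SortDirection::ASC => x.cmp(y),
            // Descending order swaps the operands.
            SortDirection::DESC => y.cmp(x),
        },
    }
}
```

A multi-field sort would call this once per `SortField`, in list order, and return the first result that is not `Ordering::Equal`.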