Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[draft] Add LogicalType, try to support user-defined types #8143

Closed
wants to merge 1 commit into from

Conversation

yukkit
Copy link
Contributor

@yukkit yukkit commented Nov 12, 2023

Which issue does this PR close?

Closes #7923 .

Current Pull Request is an Experimental Demo for Validating the Feasibility of Logical Types

Rationale for this change

What changes are included in this PR?

Features

  • Create User-Defined Types (UDTs) through SQL, specifying the field types as UDTs during table creation.
  • Support the use of UDT as a function signature in udf/udaf.
  • Register extension types through the register_data_type function in the SessionContext.

New Additions

  • LogicalType struct.
  • ExtensionType trait. Abstraction for extension types.
  • TypeSignature struct. Uniquely identifies a data type.

Major Changes

  • Added get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType> function to the ContextProvider trait.
  • In DFSchema, DFField now uses LogicalType, removing arrow Field and retaining only data_type, nullable, metadata since dict_id, dict_is_ordered are not necessary at the logical stage.
  • ExprSchemable and ExprSchema now use LogicalType.
  • ast to logical plan conversion now uses LogicalType.

To Be Implemented

  • TypeCoercionRewriter in the analyze stage uses logical types. For example, functions like comparison_coercion, get_input_types, get_valid_types, etc.
  • Functions signatures for udf/udaf use TypeSignature instead of the existing DataType for ease of use in udf/udaf.

To Be Determined

  • Should ScalarValue use LogicalType or arrow DataType?
    • LogicalType.
    • DataType
  • Should TableSource return DFSchema or arrow Schema?
    • Schema.
    • DFSchema
  • Conversion between physical types and logical types (in Datafusion, type conversion is achieved through the conversion of DFSchema to Schema; logical plans use DFSchema, physical plans use Schema).
  • Conversion between Schema and DFSchema
    • When to convert Schema to DFSchema?
      • During the construction of the logical TableScan node, obtain arrow Schema through TableSource/TableProvider and then convert it to DFSchema.
      • TableSource/TableProvider returns DFSchema instead of Schema.
    • When to convert DFSchema to Schema?
      • Directly obtain arrow Schema from TableSource in physical planner, no need for conversion.
      • Convert the DFSchema returned by TableSource to Schema in the physical planner stage.

Some Thoughts

  • In this comment, the use case of converting from dyn Array to LineStringArray or MultiPointArray was raised. In my perspective, assuming there is a function specifically designed for handling LineString data, the function signature can be defined as LineString, ensuring that the input data must be of a type acceptable by LineStringArray.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Nov 12, 2023
@yukkit
Copy link
Contributor Author

yukkit commented Nov 12, 2023

Current PR has some unresolved issues requiring collaboration for discussion.
Once there is a consensus on all the issues among the team, I will reorganize the PR accordingly.

@yukkit
Copy link
Contributor Author

yukkit commented Nov 13, 2023

I've organized the logic for the mutual conversion between DFSchema and Schema in datafusion. In theory, there should be no conversion logic from Schema to DFSchema. I've outlined all the modifications below.

DFSchema to Schema

No need to change

DefaultPhysicalPlanner

  • DescribeTable
  • Values -> ValuesExec
  • EmptyRelation -> EmptyExec
  • Unnest -> UnnestExec
  • CopyTo
  • Explain
  • Analyze

To be changed

  • TableProvider::schema

    • ViewTable
    • ListingTable
    • EmptyTable
    • MemTable
    • StreamingTable
  • DataFrame

    • write_table: replace with DFSchema
    • cache: build MemTable

Schema to DFSchema (To be changed)

  • LogicalPlanBuilder::insert_into: can directly use DFSchema
  • LogicalPlanBuilder::explain: can directly use DFSchema
  • ConstEvaluator: construct DFSchema then to Schema
  • SqlToRel::explain_to_plan: output schema can directly use DFSchema
  • SqlToRel::describe_table_to_plan: output schema can directly use DFSchema
  • SqlToRel::insert_to_plan: depends on table_source.schema()
  • SqlToRel::delete_to_plan: depends on table_source.schema()
  • ListingTable::scan: used to create_physical_expr

@alamb
Copy link
Contributor

alamb commented Nov 13, 2023

Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow

@@ -661,6 +667,28 @@ impl SessionContext {
}
}

async fn create_type(&self, cmd: CreateType) -> Result<DataFrame> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core processing logic for creating SQL UDT.

Comment on lines +684 to +687
self.register_data_type(
type_signature.to_owned_type_signature(),
extension_type,
)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Register the extension types created through SQL UDT.

@@ -1846,6 +1895,22 @@ impl SessionState {
}
}

impl TypeManager for SessionState {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SessionState implements TymeManager for registering and retrieving extended data types.

@@ -37,7 +37,7 @@ use std::sync::Arc;
/// trait to allow expr to typable with respect to a schema
pub trait ExprSchemable {
/// given a schema, return the type of the expr
fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<DataType>;
fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<LogicalType>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace all arrow DataType used in the logical planning stage with LogicalType(literal ScalarValue TBD).

Comment on lines +95 to +101
// TODO not convert to DataType
let data_types = data_types
.into_iter()
.map(|e| e.physical_type())
.collect::<Vec<_>>();

Ok((fun.return_type)(&data_types)?.as_ref().clone().into())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Functions signatures for udf/udaf use TypeSignature instead of the arrow DataType

.ok_or_else(|| {
plan_datafusion_err!(
// FIXME support logical data types, not convert to DataType
let data_type = comparison_coercion(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modify the function comparison_coercion to accept LogicalType as parameters.

@@ -886,39 +887,34 @@ pub(crate) fn find_column_indexes_referenced_by_expr(
/// can this data type be used in hash join equal conditions??
/// data types here come from function 'equal_rows', if more data types are supported
/// in equal_rows(hash join), add those data types here to generate join logical plan.
pub fn can_hash(data_type: &DataType) -> bool {
pub fn can_hash(data_type: &LogicalType) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can_hash might be abstracted as a property of logical data types.

let right_type = pattern.get_type(&self.schema)?;
// FIXME like_coercion use LogicalType
let left_type = expr.get_type(&self.schema)?.physical_type();
let right_type = pattern.get_type(&self.schema)?.physical_type();
let coerced_type = like_coercion(&left_type, &right_type).ok_or_else(|| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Modify the function like_coercion to accept LogicalType as parameters.

@@ -295,15 +303,16 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
})
}

pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
pub(crate) fn convert_data_type(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert data types in the AST to logical data types.

Comment on lines +387 to +396
SQLDataType::Custom(name, params) => {
let name = object_name_to_string(name);
let params = params.iter().map(Into::into).collect();
let type_signature = TypeSignature::new_with_params(name, params);
if let Some(data_type) = self.context_provider.get_data_type(&type_signature) {
return Ok(data_type);
}

plan_err!("User-Defined SQL type {sql_type:?} not found")
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert custom data types.

@alamb alamb changed the title [draft] try to support user-defined types [draft] Add LogicalType, try to support user-defined types Nov 14, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an epic PR @yukkit -- I looked through it and I really liked the structure and the overall code. Really nice 👏

I think the next steps for this PR are to circulate it more widely -- perhaps via a note to the arrow mailing list, and discord / slack channels (I can help with this) and potentially have people gauge how much disruption this will cause with downstream crates (I can test with IOx perhaps, and see what that experience is like)

Again, thanks for pushing this forward

_ => dt1 == dt2,
}
fn datatype_is_logically_equal(dt1: &LogicalType, dt2: &LogicalType) -> bool {
dt1 == dt2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is certainly nicer

data_type: LogicalType,
nullable: bool,
/// A map of key-value pairs containing additional custom meta data.
metadata: HashMap<String, String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could store in a BTreeMap to avoid sorting them each time 🤔

use arrow_schema::{DataType, Field, IntervalUnit, TimeUnit};

#[derive(Clone, Debug)]
pub enum LogicalType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually we should add doc comments here, but it also makes sense to avoid over doing it on RFC / draft.

.collect::<Vec<_>>();
DataType::Struct(fields.into())
}
other => panic!("not implemented {other:?}"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the idea that DataType::Dictionary and DataType::RunEndEncoded would also be included here? If so it makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up. This should include DataType::Dictionary and DataType::RunEndEncoded here.

@@ -253,6 +253,7 @@ pub struct ListingOptions {
pub format: Arc<dyn FileFormat>,
/// The expected partition column names in the folder structure.
/// See [Self::with_table_partition_cols] for details
/// TODO this maybe LogicalType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one usecase is to use dictionary encoding for the partition columns to minimize the overhead of creating such columns. As long as they can be represented / specified with LogicalType I think it is a good idea to try changing this too.

@lewiszlw
Copy link
Member

What's the status of this pr? This should be a very useful feature.

@alamb
Copy link
Contributor

alamb commented Jan 11, 2024

I think this PR is stalled and I don't have any update

@yukkit
Copy link
Contributor Author

yukkit commented Mar 9, 2024

Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now proceed to resume work on this PR.

@alamb
Copy link
Contributor

alamb commented Mar 9, 2024

No worries at all -- I hope all is well and we look forward to this work

Copy link

github-actions bot commented May 9, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 9, 2024
@therealsharath
Copy link

Hello, sorry if this is a redundant question. What is the status of this PR?

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label May 14, 2024
@alamb
Copy link
Contributor

alamb commented May 14, 2024

Hello, sorry if this is a redundant question. What is the status of this PR?

I think it is stale and on track to be closed from what I can see

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jul 14, 2024
@github-actions github-actions bot closed this Jul 22, 2024
@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

FYI #11160 tracks a new proposal for this feature. It seems to be gaining traction

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Proposal] Support User-Defined Types (UDT)
4 participants