Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make builtin window function output datatype to be derived from schema #9686

Merged
merged 5 commits into from
Mar 20, 2024

Conversation

comphead
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

There are scenarios when output datatype for builtin window functions are not the same as hardcoded in DataFusion.
For instance Apache Spark requires some functions like ROW_NUMBER, RANK, etc to be Int rather than UInt which is hardcoded in DF. It would be great to derive builtin window function output type from the incoming schema rather than having the type hardcoded. Having this derivation from the schema increases DF extensibility to be used in other hybrid systems.

What changes are included in this PR?

Changed physical planner for window builtin functions to derive the output data type from schema.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@comphead
Copy link
Contributor Author

@mustafasrepo @alamb @viirya appreciate for your thoughts

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @comphead, have a few questions

datafusion/core/tests/fuzz_cases/window_fuzz.rs Outdated Show resolved Hide resolved
.map(|arg| arg.data_type(input_schema))
.collect::<Result<_>>()?;
// derive the output datatype from incoming schema
let out_data_type: &DataType = input_schema.field_with_name(&name)?.data_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this works (or makes sense), can you help me understand? How can you decide the output type of a function without knowing what the function is (i.e. using fun.return_type)? Is it always guaranteed that you will find name in input_schema? If yes, no problem. But if no, shouldn't you fall back to using fun.return_type?

I understand that some people may want to have ROW_NUMBER output an Int instead of an UInt (which doesn't make sense but I can see how such an ask may arise for backwards compatibility reasons).

Copy link
Contributor

@mustafasrepo mustafasrepo Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think with these changes parameter input_schema: &Schema, name is misleading in the create_built_in_window_expr function. Similarly, logical_input_schema: &DFSchema, parameter for the create_window_expr is misleading. Now these parameters are the schema of the window function (not input schema). Hence, I think it is guaranteed to find name of the window function in the schema (As long as window function name calculations are consistent).

I checked, how LogicalPlan::Window calculates its schema.
It calls exprlist_to_fields util to calculate fields of window expressions.

  • Which in turn uses .to_field method on input_schema
  • Which in turn uses Expr::get_type method on input_schema
  • which calculates output type using fun.return_type(&data_types).

In this sense, I think this PR behaves same with current version, and removes duplication. However, we should change parameter/variable names to reflect which data it actually uses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mustafasrepo for clarification, I would also like to add the part were fun.report_type() called for window builtin functions in logical planner https://github.com/apache/arrow-datafusion/blob/b87dd6143c2dc089b07f74780bd525c4369e68a3/datafusion/expr/src/expr_schema.rs#L184

So when physical layer gets called all the metadata(names, datatypes) is already known

@@ -742,7 +742,7 @@ impl DefaultPhysicalPlanner {
);
}

let logical_input_schema = input.schema();
let logical_input_schema = logical_plan.schema();
Copy link
Contributor

@mustafasrepo mustafasrepo Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell this schema is no longer input_schema.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there will always will be a tension between different desired semantics for window functions the same as there is a tension between different desired semantics for scalar functions.

DataFusion currently has several hard coded window function implementations, but it is hard to customize how they work without the code getting complicated quickly (e.g. here)

In my opinion, the "cleanest" way to handle this challenge is to remove "BuiltInWindowFunctions" entirely, and switch to only using WindowUDFImpl API: #8709, the same way we are doing this for scalar functions.

Using a UDF for each function means we could move the rules for determining the output type of each function can be handled individually (or if needed a second implementation provided that returns a different type).

BuiltInWindowFunction::DenseRank => Arc::new(dense_rank(name, data_type)),
BuiltInWindowFunction::PercentRank => Arc::new(percent_rank(name, data_type)),
BuiltInWindowFunction::CumeDist => Arc::new(cume_dist(name, data_type)),
BuiltInWindowFunction::RowNumber => Arc::new(RowNumber::new(name, out_data_type)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange in general that we have to pass a "output type" to RowNumber when it is (effectively) always the same.

Copy link
Contributor Author

@comphead comphead Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right @alamb, It is always the same for DataFusion, but when building the hybrid system, like something on top of the DF, it happens expected external output type its not the same as hardcoded in DF.

Imho, derivation from schema improves DF extensibility as it reduces coupling between physical and logical layers and the schema is the contract. Thus external systems on top of DF may safely use physical layer just providing the expected schema.

UPD: Thus external systems on top of DF may safely use physical layer just providing the expected query plan(schema is the part of the query plan)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reg to trait unification this is also a great idea to improve extensibility.
I'm thinking if can we move on with this PR, and migrate all window funcs to unified trait in follow up activity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if can we move on with this PR, and migrate all window funcs to unified trait in follow up activity?

That seems fine to me (and the migration is already tracked, albiet at a high level) by another ticket

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!. Thanks @comphead

@comphead comphead merged commit 1d0171a into apache:main Mar 20, 2024
23 checks passed
@comphead comphead mentioned this pull request Mar 21, 2024
mwylde added a commit to ArroyoSystems/arrow-datafusion that referenced this pull request May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants