feat(planner): Allowing setting sort order of parquet files without specifying the schema #12466
Thank you @devanbenz -- I am sorry, I thought I had left a review of this PR before, but apparently I had not hit submit
datafusion/sql/src/statement.rs
Outdated
@@ -1028,8 +1030,26 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
            .into_iter()
            .collect();

        let schema = self.build_schema(columns)?;
        let df_schema = schema.to_dfschema_ref()?;
        let df_schema = match file_type.as_str() {
I am sorry for the delayed feedback @devanbenz -- I swear I typed this feedback but I must not have clicked "submit"
Basically my concerns about this approach are twofold:
- This code assumes the parquet file is on the local filesystem (when for many systems it may be on remote object storage)
- It also adds a dependency in SQL parsing to the parquet format. Since parquet has quite a few dependencies, this new dependency is likely non-ideal for systems that are using DataFusion for SQL parsing (like dask-sql for example)
Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?
The table provider: https://github.com/apache/datafusion/blob/2521043ddcb3895a2010b8e328f3fa10f77fc094/datafusion/expr/src/planner.rs#L35-L34
Once the table provider is resolved, the table's schema can be known
Another benefit of this approach is that it would work for all formats, not just parquet
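To make the suggestion concrete, here is a minimal sketch of deferring the ORDER BY binding until the schema is available. Everything in it is hypothetical and only for illustration: `ResolvedTable`, `UnresolvedSortKey`, `ResolvedSortKey`, and `resolve_order` are stand-in names, not DataFusion APIs, and the real table provider exposes much more than a list of column names.

```rust
/// Hypothetical stand-in for a resolved table provider.
/// Assumption: all we need from it here is the ordered list of column names.
struct ResolvedTable {
    columns: Vec<String>,
}

/// A sort key as written in the SQL text, before any schema is known.
#[derive(Debug, Clone, PartialEq)]
struct UnresolvedSortKey {
    column: String,
    ascending: bool,
}

/// A sort key bound to a column index in the resolved schema.
#[derive(Debug, PartialEq)]
struct ResolvedSortKey {
    index: usize,
    ascending: bool,
}

/// Bind the ORDER BY keys only after the provider (and so the schema) exists.
/// The SQL planner never reads the file; it just carries the unresolved keys.
fn resolve_order(
    keys: &[UnresolvedSortKey],
    table: &ResolvedTable,
) -> Result<Vec<ResolvedSortKey>, String> {
    keys.iter()
        .map(|k| {
            table
                .columns
                .iter()
                .position(|c| c == &k.column)
                .map(|index| ResolvedSortKey {
                    index,
                    ascending: k.ascending,
                })
                .ok_or_else(|| format!("unknown sort column: {}", k.column))
        })
        .collect()
}
```

In this shape the planner would keep the `UnresolvedSortKey` values as plain names, and whichever component resolves the provider would call something like `resolve_order`. Nothing in the sketch is specific to parquet or to the local filesystem.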
> I am sorry for the delayed feedback @devanbenz -- I swear I typed this feedback but I must not have clicked "submit"
That's alright, it happens to me all the time 😅
> Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?
Sounds good, I like this idea. 👍
Converting to a draft until I have the final implementation done 👍
Which issue does this PR close?
Closes #7317
Rationale for this change
This allows setting the sort order when creating tables from parquet files without having to specify the schema. Since parquet already has the schema readily available in its metadata, this is a relatively quick fix that makes downstream usage less cumbersome, specifically when setting up reproductions of issues.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?