feat(planner): Allowing setting sort order of parquet files without specifying the schema #12466

Draft
wants to merge 10 commits into
base: main

devanbenz
Contributor

@devanbenz devanbenz commented Sep 14, 2024

Which issue does this PR close?

Closes #7317

Rationale for this change

This allows setting the sort order when creating tables from parquet files without having to specify the schema. Since parquet already stores the schema in the file metadata, this is a relatively quick fix that makes downstream usage less cumbersome, particularly when setting up reproductions of issues.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the sql SQL Planner label Sep 14, 2024
@devanbenz devanbenz marked this pull request as draft September 14, 2024 17:12
@devanbenz devanbenz marked this pull request as ready for review September 14, 2024 19:06
Contributor

@alamb alamb left a comment

Thank you @devanbenz -- I am sorry, I thought I had left a review of this PR before, but apparently I had not hit submit

@@ -1028,8 +1030,26 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
.into_iter()
.collect();

let schema = self.build_schema(columns)?;
let df_schema = schema.to_dfschema_ref()?;
let df_schema = match file_type.as_str() {
Contributor

I am sorry for the delayed feedback @devanbenz -- I swear I typed this feedback but I must not have clicked "submit"

Basically my concerns about this approach are twofold:

  1. This code assumes the parquet file is on the local filesystem (when for many systems it may be on remote object storage)
  2. It also makes SQL parsing depend on the parquet format. Since parquet has quite a few dependencies, this new dependency is likely not ideal for systems that use DataFusion for SQL parsing (dask-sql, for example)

Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?

The table provider: https://github.com/apache/datafusion/blob/2521043ddcb3895a2010b8e328f3fa10f77fc094/datafusion/expr/src/planner.rs#L35-L34

Once the table provider is resolved, the table's schema is known

Another benefit of this approach is that it would work for all formats, not just parquet
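
A minimal sketch of the deferred approach, to make the idea concrete. Every type and function name here (`UnresolvedSort`, `ResolvedSort`, `SchemaSource`, `plan_order`, `resolve_order`) is a hypothetical stand-in and not part of DataFusion's API. The planner only records the column names from the ORDER clause, and the names are bound to the schema later, once something that can report a schema (the table provider in the real system) is available:

```rust
/// Hypothetical: a sort key as written in the SQL, not yet checked against any schema.
#[derive(Debug, Clone, PartialEq)]
struct UnresolvedSort {
    column: String,
    ascending: bool,
}

/// Hypothetical: a sort key bound to a column position in a known schema.
#[derive(Debug, Clone, PartialEq)]
struct ResolvedSort {
    column_index: usize,
    ascending: bool,
}

/// Hypothetical stand-in for a table provider: anything that can report
/// its column names, whatever the file format or storage location.
trait SchemaSource {
    fn column_names(&self) -> Vec<String>;
}

/// Toy schema source used for illustration only.
struct InMemoryTable {
    columns: Vec<String>,
}

impl SchemaSource for InMemoryTable {
    fn column_names(&self) -> Vec<String> {
        self.columns.clone()
    }
}

/// Planning step: record the requested order without touching any file.
fn plan_order(order: &[(&str, bool)]) -> Vec<UnresolvedSort> {
    order
        .iter()
        .map(|(column, ascending)| UnresolvedSort {
            column: column.to_string(),
            ascending: *ascending,
        })
        .collect()
}

/// Resolution step: runs only after the schema source exists, so it needs
/// no format-specific code and no filesystem access of its own.
fn resolve_order(
    order: &[UnresolvedSort],
    source: &dyn SchemaSource,
) -> Result<Vec<ResolvedSort>, String> {
    let names = source.column_names();
    order
        .iter()
        .map(|sort| {
            names
                .iter()
                .position(|name| name == &sort.column)
                .map(|column_index| ResolvedSort {
                    column_index,
                    ascending: sort.ascending,
                })
                .ok_or_else(|| format!("unknown column '{}' in ORDER BY", sort.column))
        })
        .collect()
}
```

Because `resolve_order` depends only on the `SchemaSource` trait, the planning code needs no parquet dependency, and the same path would serve any format whose provider can report a schema.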

Contributor Author

> I am sorry for the delayed feedback @devanbenz -- I swear I typed this feedback but I must not have clicked "submit"

That's alright - happens to me all the time 😅

> Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?

Sounds good, I like this idea. 👍

@devanbenz devanbenz marked this pull request as draft September 17, 2024 18:28
@devanbenz
Contributor Author

Converting to a draft until I have the final implementation done 👍
