feat: Conditionally allow to keep partition_by columns when using PARTITIONED BY enhancement #11107

Merged Jun 28, 2024 (15 commits)

Contributor

@hveiga hveiga commented Jun 24, 2024

Which issue does this PR close?

Closes #10971

What changes are included in this PR?

  • Added a flag to FileSinkConfig to conditionally enable this feature. Disabled by default.

Are these changes tested?

Test is included as part of copy.slt.

Are there any user-facing changes?

No breaking changes.
Added some documentation for this new option.
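
To make the effect of the flag concrete, here is a minimal, standard-library-only sketch of hive-style partitioned output. It assumes that, by default, the PARTITIONED BY columns are encoded in the directory path and left out of the written files, and that the new flag keeps them in the files as well. `SinkSettings` and `plan_row` are hypothetical names used only for illustration; they are not the actual FileSinkConfig API.

```rust
/// Hypothetical stand-in for the sink setting; not DataFusion's actual type.
struct SinkSettings {
    keep_partition_by_columns: bool,
}

/// Splits one row into (hive-style directory, column names written to the file).
fn plan_row(
    row: &[(&str, &str)],
    partition_by: &[&str],
    settings: &SinkSettings,
) -> (String, Vec<String>) {
    // Directory segments follow the order of `partition_by`, e.g. "region=eu".
    let dir = partition_by
        .iter()
        .filter_map(|p| {
            row.iter()
                .find(|(name, _)| name == p)
                .map(|(name, value)| format!("{name}={value}"))
        })
        .collect::<Vec<_>>()
        .join("/");

    // Partition columns stay in the file only when the flag is enabled.
    let file_columns = row
        .iter()
        .filter(|(name, _)| settings.keep_partition_by_columns || !partition_by.contains(name))
        .map(|(name, _)| name.to_string())
        .collect();

    (dir, file_columns)
}

fn main() {
    let row = [("id", "1"), ("region", "eu"), ("amount", "10")];
    for keep in [false, true] {
        let settings = SinkSettings { keep_partition_by_columns: keep };
        let (dir, cols) = plan_row(&row, &["region"], &settings);
        println!("keep={keep}: dir={dir} columns={cols:?}");
    }
}
```

With the flag off, the sketch writes only `id` and `amount` under `region=eu`; with it on, `region` is written to the file too.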

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 24, 2024
Contributor

@alamb alamb left a comment

Thank you for this contribution @hveiga -- I think this PR looks quite close.

I think we should make sure this configuration setting matches the pattern of the other settings (it is somewhat special as it isn't a format-specific option, and it doesn't have special DML syntax either...)

datafusion/core/src/physical_planner.rs
datafusion/sqllogictest/test_files/copy.slt
docs/source/user-guide/sql/dml.md
Contributor

@devinjdangelo devinjdangelo left a comment

Thank you @hveiga, this looks great! I think we should clean up the fact that "format." is being added as a prefix to this option and create a different name space for it instead. This could help avoid the TableOptions raising an error for an unexpected format option as well.

datafusion/core/src/datasource/listing/table.rs (outdated)
datafusion/core/src/physical_planner.rs
datafusion/core/src/physical_planner.rs (outdated)
docs/source/user-guide/sql/dml.md
@devinjdangelo
Contributor

One last thing we could add in this PR (or a ticket for a follow on) would be to add a session level configuration setting for this, so users could control the default behavior.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jun 25, 2024
@hveiga
Contributor Author

hveiga commented Jun 25, 2024

I think I have addressed all the comments in the PR; the only lingering one is about flowing the new option from SQL when using CREATE EXTERNAL TABLE. Thank you for the quick turnaround on the reviews and comments. I believe the new option keep_partition_by_columns would also need to be added to ListingOptions and parsed accordingly. Is that the way to go?

Contributor

@devinjdangelo devinjdangelo left a comment

Thank you for the updates @hveiga ! The new session parameter is a great addition. I do think we should avoid changing the CopyTo struct.

Updates to create external table can wait for a future PR in my view.

datafusion/expr/src/logical_plan/dml.rs (outdated)
Héctor Veiga Ortiz added 3 commits June 25, 2024 15:37
 - separate options by prefix 'hive.'
 - add hive_options to CopyTo struct
 - add more documentation
 - add session execution flag to enable feature, false by default
@hveiga
Contributor Author

hveiga commented Jun 25, 2024

> Thank you for the updates @hveiga ! The new session parameter is a great addition. I do think we should avoid changing the CopyTo struct.
>
> Updates to create external table can wait for a future PR in my view.

Thanks for the review. I reverted the hive_options changes and the PR is ready for (yet another) review :)

@alamb
Contributor

alamb commented Jun 25, 2024

I took the liberty of pushing some commits to this PR to fix some CI errors

Contributor

@alamb alamb left a comment

Thank you @hveiga and @devinjdangelo and @berkaysynnada for the reviews

It would be great if @devinjdangelo and @berkaysynnada could give this another review, but I think it is good enough to merge now.

I agree we should file a follow on ticket to figure out how to support this as part of the CREATE EXTERNAL TABLE syntax

@@ -888,7 +888,15 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
Some(v) => v,
};
if !(&key.contains('.')) {

if key.to_lowercase().contains("keep_partition_by_columns") {
Contributor

this special case is unfortunate, but I don't have a great idea of how to make it better

Contributor

After further consideration, IMO we should restrict this handling to format options. For other configurations, users must specify the prefix. Otherwise, this list of special cases will keep growing, and the prefixes would lose their meaning (which is why all these refactors were done to have a structured configuration).

Additionally, instead of using "hive," we need a more general term. Perhaps the "execution" prefix would be a better alternative.
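
The namespacing rule proposed here can be sketched roughly as follows, using only the standard library. In this sketch, keys without a prefix get the "format." shorthand, while every other option must carry its own prefix such as "execution.". The function name `split_options` and the choice to reject unknown namespaces with an error are assumptions made for illustration; this is not the code that was merged.

```rust
use std::collections::HashMap;

type Options = HashMap<String, String>;

/// Returns (format options, execution options), both keyed by their full name.
/// Hypothetical helper: rejecting other namespaces is an assumption of this sketch.
fn split_options(raw: &[(&str, &str)]) -> Result<(Options, Options), String> {
    let mut format = Options::new();
    let mut execution = Options::new();
    for (key, value) in raw {
        // Only un-prefixed keys receive the `format.` shorthand.
        let key = if key.contains('.') {
            key.to_lowercase()
        } else {
            format!("format.{}", key.to_lowercase())
        };
        if key.starts_with("execution.") {
            execution.insert(key, value.to_string());
        } else if key.starts_with("format.") {
            format.insert(key, value.to_string());
        } else {
            return Err(format!("unsupported option namespace: {key}"));
        }
    }
    Ok((format, execution))
}

fn main() {
    let raw = [
        ("compression", "zstd"),
        ("execution.keep_partition_by_columns", "true"),
    ];
    let (format, execution) = split_options(&raw).expect("valid options");
    println!("format options: {format:?}");
    println!("execution options: {execution:?}");
}
```

Keeping the two maps separate means only the format options would be handed to the table options, so an execution-level key is never rejected there as an unexpected format option.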

Contributor Author

@hveiga hveiga Jun 26, 2024

Could you elaborate on what you mean by "we should restrict this handling to format options"? I am unsure whether you are suggesting using a different struct for the non-"format." options, or continuing to use the HashMap with a mix of options and then extracting the logic that filters out the non-"format." options so they are not passed down through table_options.alter_with_string_hash_map.

I agree on changing the prefix to "execution.", which also aligns with ExecutionOptions.

Contributor Author

I gave this a try in e91397a. Let me know if that's what you had in mind. Thanks for the feedback.

Contributor

That's exactly what I had in mind. Thank you for the collaboration. I would just like to mention two more points:

  1. I think you missed adding the execution prefix to the example in datafusion/sql/src/parser.rs.
  2. In the following code:
let keep_partition_by_columns = source_option_tuples
                    .get("execution.keep_partition_by_columns")
                    .map(|v| v.trim() == "true")
                    .unwrap_or(...

If the user provides anything other than "true", it is interpreted as "false". It might be wiser to return an error if the value is neither "true" nor "false".

Contributor Author

@hveiga hveiga Jun 27, 2024

Good catch! I just fixed that test case and also slightly modified the handling of the value to account for what we discussed:

  • Give preference to what is explicitly provided.
  • If what is provided is invalid, throw a config error. Added a test in copy.slt for this.
  • If not provided, fall back to the ExecutionOptions value, false by default.

Hopefully e352203 is finally ready. Thanks.
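
The precedence described above can be sketched as follows. This is a minimal illustration, not the merged implementation: `resolve_keep_partition_by_columns` is a hypothetical helper, the error is a plain `String` rather than a DataFusion config error, and `session_default` stands in for the ExecutionOptions value.

```rust
use std::collections::HashMap;

/// Resolves the flag: explicit statement option first, then the session default.
/// Hypothetical helper for illustration only.
fn resolve_keep_partition_by_columns(
    options: &HashMap<String, String>,
    session_default: bool,
) -> Result<bool, String> {
    match options.get("execution.keep_partition_by_columns") {
        // Not provided: fall back to the session-level setting.
        None => Ok(session_default),
        // Provided: accept only "true" or "false", otherwise report an error.
        Some(v) => match v.trim() {
            "true" => Ok(true),
            "false" => Ok(false),
            other => Err(format!(
                "invalid value for execution.keep_partition_by_columns: {other:?}"
            )),
        },
    }
}

fn main() {
    let mut options = HashMap::new();
    println!("{:?}", resolve_keep_partition_by_columns(&options, false));

    options.insert(
        "execution.keep_partition_by_columns".to_string(),
        "true".to_string(),
    );
    println!("{:?}", resolve_keep_partition_by_columns(&options, false));

    options.insert(
        "execution.keep_partition_by_columns".to_string(),
        "yes".to_string(),
    );
    println!("{:?}", resolve_keep_partition_by_columns(&options, false));
}
```

Unlike the earlier `.map(|v| v.trim() == "true")` form, an unrecognized value such as "yes" is reported as an error instead of being silently treated as false.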

Contributor

@berkaysynnada berkaysynnada Jun 28, 2024

This PR is a good first step towards adding settings other than format. Thanks for your effort, @hveiga. I think the PR is ready to be merged once conflicts are resolved.

Contributor

@berkaysynnada berkaysynnada left a comment

I have one minor issue left. Once that is resolved, we can merge the PR. Thanks, @hveiga.

datafusion/core/src/physical_planner.rs (outdated)
@alamb
Contributor

alamb commented Jun 28, 2024

I merged up from main to resolve some conflicts (and also now so that CI will run automatically). Once CI passes I plan to merge this PR

@alamb
Contributor

alamb commented Jun 28, 2024

Thanks again @hveiga and @berkaysynnada

@hveiga
Contributor Author

hveiga commented Jun 28, 2024

Thanks everyone for the help getting this over the finish line! Excited to see this in the next release of DataFusion 🎉

@alamb
Contributor

alamb commented Jun 28, 2024

🚀

@alamb alamb merged commit 330ece8 into apache:main Jun 28, 2024
24 checks passed
comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Jul 2, 2024
…TITIONED BY enhancement (apache#11107)

* feat: conditionally allow to keep partition_by columns

* feat: add flag to file sink config, add tests

* this commit contains:
 - separate options by prefix 'hive.'
 - add hive_options to CopyTo struct
 - add more documentation
 - add session execution flag to enable feature, false by default

* do not add hive_options to CopyTo

* npx prettier

* fmt

* change prefix to execution. , update override order for condition.

* improve handling of flag, added test for config error

* trying to make CI happier

* prettier

* Update test

* update doc

---------

Co-authored-by: Héctor Veiga Ortiz <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…TITIONED BY enhancement (apache#11107)

This pull request was closed.
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Conditionally allow to keep partition_by columns when using PARTITIONED BY
4 participants