ARROW-6086: [Rust] [DataFusion] Add support for partitioned Parquet data sources #5494
Sorry @paddyhoran, I found this last-minute issue ... pretty small fix though.
paddyhoran
left a comment
LGTM, just one question.
    Err(ExecutionError::General("No files found".to_string()))
} else {
    let parquet_file = ParquetFile::open(&filenames[0], None, 0)?;
    let schema = parquet_file.projection_schema.clone();
What happens if the schemas of the files differ? I guess it just fails at execution time when a different schema is encountered?
Yes, this code assumes that all of the partitions have the same schema currently. It's pretty basic. I imagine we could eventually have schema merging.
I will write up a JIRA to add validation that all the partitions have the same schema. That would be a nice improvement.
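For what it's worth, here is a rough sketch of what that validation could look like. It is not the DataFusion or Arrow API: `Field`, `Schema`, and `validate_partition_schemas` are hypothetical, simplified stand-ins (data types are plain strings), and it assumes the schema of every partition file has already been read into memory alongside its filename.

```rust
/// Hypothetical, simplified stand-in for a column definition
/// (not the Arrow `Field` type).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Hypothetical, simplified stand-in for a file schema
/// (not the Arrow `Schema` type).
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
}

/// Returns the common schema if every partition has the same one.
/// Fails up front (rather than at execution time) with the name of
/// the first file whose schema differs from the first partition's.
pub fn validate_partition_schemas(
    partitions: &[(String, Schema)],
) -> Result<Schema, String> {
    // Mirror the existing behaviour for an empty directory.
    let (first_name, first_schema) = partitions
        .first()
        .ok_or_else(|| "No files found".to_string())?;

    // Compare every remaining partition against the first one.
    for (filename, schema) in &partitions[1..] {
        if schema != first_schema {
            return Err(format!(
                "Schema of {} differs from schema of {}",
                filename, first_name
            ));
        }
    }
    Ok(first_schema.clone())
}
```

The strict equality check is deliberate for a first pass; schema merging would replace the comparison with a step that reconciles compatible fields instead of rejecting them.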
Makes sense. Feel free to merge.
I discovered this at the last minute while running manual tests. I have been able to run parallel queries against Parquet files using this branch as a dependency.