Change FileScanConfig.table_partition_cols from (String, DataType) to Fields #7890
@alamb and @crepererum |
@@ -101,7 +101,7 @@ pub struct FileScanConfig {
     /// all records after filtering are returned.
     pub limit: Option<usize>,
     /// The partitioning columns
-    pub table_partition_cols: Vec<(String, DataType)>,
+    pub table_partition_cols: Vec<Field>,
Here is the key change
You probably want to use FieldRef not Field
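For context on this suggestion: in arrow-schema, FieldRef is a type alias for Arc<Field>, so cloning a list of partition columns copies pointers instead of deep-copying each field. The sketch below illustrates that with a simplified stand-in Field type; the stand-in struct and the shared_partition_fields helper are hypothetical and are not part of this PR.

```rust
use std::sync::Arc;

// Simplified, hypothetical stand-in for arrow's `Field`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

// Mirrors the arrow-schema alias `FieldRef = Arc<Field>`.
type FieldRef = Arc<Field>;

/// Hypothetical helper: build shared, non-nullable partition fields from names.
fn shared_partition_fields(names: &[&str]) -> Vec<FieldRef> {
    names
        .iter()
        .map(|name| {
            Arc::new(Field {
                name: name.to_string(),
                nullable: false,
            })
        })
        .collect()
}

fn main() {
    let cols = shared_partition_fields(&["year", "month"]);

    // Cloning the Vec bumps reference counts; the Field values are shared.
    let copy = cols.clone();
    assert!(Arc::ptr_eq(&cols[0], &copy[0]));
    assert_eq!(Arc::strong_count(&cols[0]), 2);

    for field in &copy {
        println!("{} (nullable: {})", field.name, field.nullable);
    }
}
```

Whether the cheaper clone matters depends on how often the config is copied during planning, which this sketch does not measure.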
@@ -135,8 +135,7 @@ impl FileScanConfig {
                 table_cols_stats.push(self.statistics.column_statistics[idx].clone())
             } else {
                 let partition_idx = idx - self.file_schema.fields().len();
-                let (name, dtype) = &self.table_partition_cols[partition_idx];
-                table_fields.push(Field::new(name, dtype.to_owned(), false));
+                table_fields.push(self.table_partition_cols[partition_idx].to_owned());
And this is where we convert table_partition_cols to Field
Thank you @NGA-TRAN -- I think the idea of passing a real Field as the partition column makes a lot of sense and that this PR does it very nicely 👍
I had a few code improvement suggestions, but nothing I think is required to merge this.
Thanks again
datafusion/core/src/datasource/physical_plan/file_scan_config.rs
    )
}

fn config_for_proj_with_field_tab_part(
I find this name confusing given the three letter abbreviations and I don't think this is common elsewhere in the DataFusion codebase.
How about something like
- fn config_for_proj_with_field_tab_part(
+ fn config_for_projection_with_partition_fields(
Or maybe instead you could change config_for_projection to take table_partition_cols: Vec<Field>, and make a function like
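/// Convert all (name, data type) pairs into non-nullable partition `Field`s
fn partition_cols(table_partition_cols: Vec<(&str, DataType)>) -> Vec<Field> {
    table_partition_cols
        // Consume the Vec so each `name` is a plain `&str` and `dtype` is owned
        .into_iter()
        .map(|(name, dtype)| Field::new(name, dtype, false))
        .collect::<Vec<_>>()
}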
And then convert the call sites of config_for_projection to be config_for_projection(.., partition_cols(..))
I implemented your second suggestion @alamb . Thanks
FileScanConfig.table_partition_cols from (String, DataType) to Fields
Co-authored-by: Andrew Lamb <[email protected]>
I have addressed all the comments. Thanks @alamb
Thanks @NGA-TRAN
Which issue does this PR close?
Closes #7875
Rationale for this change
Currently, FileScanConfig.table_partition_cols has the type Vec<(String, DataType)>, which stores only each column's name and data type. A column can carry more information, such as whether it is nullable and extra metadata. Thus, when we convert table_partition_cols to Fields here, all other information of a field is either empty or set to a default.
We want table_partition_cols to be a vector of Fields in the first place, so that when we need to store a Field we won't lose any information.
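The information loss described above can be sketched with simplified stand-in types. The DataType enum and Field struct below are hypothetical reductions of the arrow types, and the "origin" metadata key is made up for illustration; the point is only that a round trip through a (String, DataType) pair cannot preserve nullability or metadata.

```rust
use std::collections::HashMap;

// Simplified, hypothetical stand-ins for arrow's `DataType` and `Field`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Old representation: only the name and data type survive.
fn to_pair(field: &Field) -> (String, DataType) {
    (field.name.clone(), field.data_type.clone())
}

/// Rebuild a Field from the pair: nullability and metadata fall back to defaults.
fn from_pair(pair: &(String, DataType)) -> Field {
    Field {
        name: pair.0.clone(),
        data_type: pair.1.clone(),
        nullable: false,
        metadata: HashMap::new(),
    }
}

fn main() {
    let region = Field {
        name: "region".to_string(),
        data_type: DataType::Utf8,
        nullable: true,
        // Made-up metadata entry, for illustration only.
        metadata: HashMap::from([("origin".to_string(), "iox".to_string())]),
    };
    let year = Field {
        name: "year".to_string(),
        data_type: DataType::Int32,
        nullable: false,
        metadata: HashMap::new(),
    };

    // `region` loses its nullable flag and metadata in the round trip.
    assert_ne!(from_pair(&to_pair(&region)), region);
    // `year` happens to match the defaults, so nothing is lost.
    assert_eq!(from_pair(&to_pair(&year)), year);
}
```

Storing Vec<Field> directly removes the round trip, so there is nothing to reconstruct.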
FYI: IOx needs this requirement.
What changes are included in this PR?
Replace the data type of FileScanConfig.table_partition_cols from Vec<(String, DataType)> to Vec<Field>.
Are these changes tested?
Yes
Are there any user-facing changes?
The API to create FileScanConfig now needs a vector of Fields for table_partition_cols. In most places it is an empty vector, which means it is not used.
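As a rough sketch of what this means for callers, the example below uses stand-in types: DataType, Field, and a FileScanConfig reduced to the single member discussed here are all hypothetical simplifications, not the real DataFusion definitions. The partition_cols helper follows the shape suggested in the review and builds non-nullable fields from (name, type) pairs.

```rust
// Simplified, hypothetical stand-ins; the real types live in arrow and DataFusion.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    // Same argument order as arrow's `Field::new(name, data_type, nullable)`.
    fn new(name: impl Into<String>, data_type: DataType, nullable: bool) -> Self {
        Field {
            name: name.into(),
            data_type,
            nullable,
        }
    }
}

// Reduced to the one member this PR changes.
#[derive(Debug)]
struct FileScanConfig {
    table_partition_cols: Vec<Field>,
}

/// Convert (name, data type) pairs into non-nullable partition fields.
fn partition_cols(cols: Vec<(&str, DataType)>) -> Vec<Field> {
    cols.into_iter()
        .map(|(name, dtype)| Field::new(name, dtype, false))
        .collect()
}

fn main() {
    // Callers without partition columns keep passing an empty vector.
    let unpartitioned = FileScanConfig {
        table_partition_cols: vec![],
    };
    assert!(unpartitioned.table_partition_cols.is_empty());

    // Callers holding (name, type) pairs convert them once at the call site.
    let partitioned = FileScanConfig {
        table_partition_cols: partition_cols(vec![
            ("year", DataType::Int32),
            ("region", DataType::Utf8),
        ]),
    };
    for field in &partitioned.table_partition_cols {
        println!("{}: {:?} (nullable: {})", field.name, field.data_type, field.nullable);
    }
}
```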