-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Reading and Writing Extension FileTypes #8667
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -154,6 +154,10 @@ pub enum FileTypeWriterOptions { | |
JSON(JsonWriterOptions), | ||
Avro(AvroWriterOptions), | ||
Arrow(ArrowWriterOptions), | ||
/// For extension [FileType]s, FileTypeWriterOptions simply stores | ||
/// any passed StatementOptions to be handled later by any custom | ||
/// physical plans (Such as a FileFormat::create_writer_physical_plan) | ||
Extension(Option<StatementOptions>), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FileTypeWriterOptions does not handle any parsing or validation for externally defined FileTypes. It simply acts as a container for the options passed in. |
||
} | ||
|
||
impl FileTypeWriterOptions { | ||
|
@@ -184,14 +188,17 @@ impl FileTypeWriterOptions { | |
FileType::ARROW => { | ||
FileTypeWriterOptions::Arrow(ArrowWriterOptions::try_from(options)?) | ||
} | ||
FileType::Extension(_) => { | ||
FileTypeWriterOptions::Extension(Some(statement_options.clone())) | ||
} | ||
}; | ||
|
||
Ok(file_type_write_options) | ||
} | ||
|
||
/// Constructs a FileTypeWriterOptions from session defaults only. | ||
pub fn build_default( | ||
file_type: &FileType, | ||
file_type: &mut FileType, | ||
config_defaults: &ConfigOptions, | ||
) -> Result<Self> { | ||
let empty_statement = StatementOptions::new(vec![]); | ||
|
@@ -214,6 +221,7 @@ impl FileTypeWriterOptions { | |
FileType::ARROW => { | ||
FileTypeWriterOptions::Arrow(ArrowWriterOptions::try_from(options)?) | ||
} | ||
FileType::Extension(_) => FileTypeWriterOptions::Extension(None), | ||
}; | ||
|
||
Ok(file_type_write_options) | ||
|
@@ -290,6 +298,7 @@ impl Display for FileTypeWriterOptions { | |
FileTypeWriterOptions::JSON(_) => "JsonWriterOptions", | ||
#[cfg(feature = "parquet")] | ||
FileTypeWriterOptions::Parquet(_) => "ParquetWriterOptions", | ||
FileTypeWriterOptions::Extension(_) => "ExensionWriterOptions", | ||
}; | ||
write!(f, "{}", name) | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -150,6 +150,9 @@ impl ListingTableConfig { | |
), | ||
#[cfg(feature = "parquet")] | ||
FileType::PARQUET => Arc::new(ParquetFormat::default()), | ||
FileType::Extension(_) => { | ||
unreachable!("FileType::from_str cannot return Extension variant!") | ||
} | ||
}; | ||
|
||
Ok((file_format, ext)) | ||
|
@@ -768,7 +771,7 @@ impl TableProvider for ListingTable { | |
let file_type_writer_options = match &self.options().file_type_write_options { | ||
Some(opt) => opt.clone(), | ||
None => FileTypeWriterOptions::build_default( | ||
&file_format.file_type(), | ||
&mut file_format.file_type(), | ||
state.config_options(), | ||
)?, | ||
}; | ||
|
@@ -1716,6 +1719,9 @@ mod tests { | |
) | ||
.await?; | ||
} | ||
FileType::Extension(_) => { | ||
panic!("Extension file type not implemented in write path.") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we instead return an NotImplement exception? |
||
} | ||
} | ||
|
||
// Create and register the source table with the provided schema and inserted data | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -605,6 +605,7 @@ impl DefaultPhysicalPlanner { | |
FileType::JSON => Arc::new(JsonFormat::default()), | ||
FileType::AVRO => Arc::new(AvroFormat {} ), | ||
FileType::ARROW => Arc::new(ArrowFormat {}), | ||
FileType::Extension(_ext) => return not_impl_err!("Extension FileTypes not supported in Copy To statements.") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The built in Copy plans can't support external types (since we can't initialize them in DataFusion), but it should be not too difficult to build a custom COPY PhysicalPlan that uses a custom FileFormat, but otherwise is the same and relies on the build in Copy logical plan. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we could have the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried this (as well as combining ExtensionFileType trait into FileFormat), but that would require adding lots of dependencies between the DataFusion crates. FileType is in common and used in most of DataFusion while FileFormat is only in core. This could work but I think the consequence would be DataFusion becomes one big crate. I think the only way to keep the independent crates is to have at least two File related abstractions. One simple one for logical planning in Common and one heavy duty one for Execution plans in core. Even linking them is a challenge because then common depends on core 🤔 |
||
}; | ||
|
||
sink_format.create_writer_physical_plan(input_exec, session_state, config, None).await | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -891,6 +891,9 @@ impl TryFrom<&FileTypeWriterOptions> for protobuf::FileTypeWriterOptions { | |
FileTypeWriterOptions::Arrow(ArrowWriterOptions {}) => { | ||
return not_impl_err!("Arrow file sink protobuf serialization") | ||
} | ||
FileTypeWriterOptions::Extension(_) => { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should be able to support serializing this since it is ultimately just a Vec<(String, String)> There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could also punt it into the |
||
return not_impl_err!("Extension file sink protobuf serialization") | ||
} | ||
}; | ||
Ok(Self { | ||
file_type: Some(file_type), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the manual and auto traits should be compatible, so silencing this is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably also manually implement
Hash
pretty easily as well, but I agree it isn't a bad thing