-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Reading and Writing Extension FileTypes #8667
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left built in FileTypes as concrete enum variants, while supporting externally defined FileTypes via a boxed trait object contained in a FileType variant. I think this is the path of least resistance towards supporting externally defined FileFormats along the read and write path in a ListingTable.
We could potentially also refactor the built in types (parquet, csv, ect) to be externally defined structs which implement ExtensionFileType and remove the FileType enum entirely. This would entail a more significant refactor of FileTypeWriterOptions as well.
@@ -41,7 +43,8 @@ pub trait GetExt { | |||
} | |||
|
|||
/// Readable file type | |||
#[derive(Debug, Clone, PartialEq, Eq, Hash)] | |||
#[allow(clippy::derived_hash_with_manual_eq)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the manual and auto traits should be compatible, so silencing this is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably also manually implement Hash
pretty easily as well, but I agree it isn't a bad thing
/// For extension [FileType]s, FileTypeWriterOptions simply stores | ||
/// any passed StatementOptions to be handled later by any custom | ||
/// physical plans (Such as a FileFormat::create_writer_physical_plan) | ||
Extension(Option<StatementOptions>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FileTypeWriterOptions does not handle any parsing or validation for externally defined FileTypes. It simply acts as a container for the options passed in.
@@ -891,6 +891,9 @@ impl TryFrom<&FileTypeWriterOptions> for protobuf::FileTypeWriterOptions { | |||
FileTypeWriterOptions::Arrow(ArrowWriterOptions {}) => { | |||
return not_impl_err!("Arrow file sink protobuf serialization") | |||
} | |||
FileTypeWriterOptions::Extension(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to support serializing this since it is ultimately just a Vec<(String, String)>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also punt it into the FileType::Extension
(e.g. add a serialize(&self) -> Result<Vec<u8>>
type method)
@@ -605,6 +605,7 @@ impl DefaultPhysicalPlanner { | |||
FileType::JSON => Arc::new(JsonFormat::default()), | |||
FileType::AVRO => Arc::new(AvroFormat {} ), | |||
FileType::ARROW => Arc::new(ArrowFormat {}), | |||
FileType::Extension(_ext) => return not_impl_err!("Extension FileTypes not supported in Copy To statements.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The built in Copy plans can't support external types (since we can't initialize them in DataFusion), but it should be not too difficult to build a custom COPY PhysicalPlan that uses a custom FileFormat, but otherwise is the same and relies on the build in Copy logical plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could have the FileTypeExtension
have a method that returned Arc<dyn FileFormat>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this (as well as combining ExtensionFileType trait into FileFormat), but that would require adding lots of dependencies between the DataFusion crates. FileType is in common and used in most of DataFusion while FileFormat is only in core.
This could work but I think the consequence would be DataFusion becomes one big crate. I think the only way to keep the independent crates is to have at least two File related abstractions. One simple one for logical planning in Common and one heavy duty one for Execution plans in core.
Even linking them is a challenge because then common depends on core 🤔
I think this would make an excellent follow on PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @devinjdangelo -- this is a very neat PR.
As you say, I think making FileType
itself a trait would be the ideal solution, but doing so would require refactoring a bunch more refactoring (likely FileTypeWriterOptions
a trait for example)
I wonder if you have time to try that alternate approach and see how invasive it turns out to be 🤔
@@ -98,7 +98,7 @@ jobs: | |||
with: | |||
rust-version: stable | |||
- name: Run tests (excluding doctests) | |||
run: cargo test --lib --tests --bins --features avro,json,backtrace | |||
run: RUST_MIN_STACK=504857600 cargo test --lib --tests --bins --features avro,json,backtrace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the new stack size needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. Locally I did not need this. To get CI tests to pass I needed to boost stack size.
@@ -41,7 +43,8 @@ pub trait GetExt { | |||
} | |||
|
|||
/// Readable file type | |||
#[derive(Debug, Clone, PartialEq, Eq, Hash)] | |||
#[allow(clippy::derived_hash_with_manual_eq)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably also manually implement Hash
pretty easily as well, but I agree it isn't a bad thing
@@ -1716,6 +1719,9 @@ mod tests { | |||
) | |||
.await?; | |||
} | |||
FileType::Extension(_) => { | |||
panic!("Extension file type not implemented in write path.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead return an NotImplement exception?
@@ -891,6 +891,9 @@ impl TryFrom<&FileTypeWriterOptions> for protobuf::FileTypeWriterOptions { | |||
FileTypeWriterOptions::Arrow(ArrowWriterOptions {}) => { | |||
return not_impl_err!("Arrow file sink protobuf serialization") | |||
} | |||
FileTypeWriterOptions::Extension(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also punt it into the FileType::Extension
(e.g. add a serialize(&self) -> Result<Vec<u8>>
type method)
@@ -605,6 +605,7 @@ impl DefaultPhysicalPlanner { | |||
FileType::JSON => Arc::new(JsonFormat::default()), | |||
FileType::AVRO => Arc::new(AvroFormat {} ), | |||
FileType::ARROW => Arc::new(ArrowFormat {}), | |||
FileType::Extension(_ext) => return not_impl_err!("Extension FileTypes not supported in Copy To statements.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could have the FileTypeExtension
have a method that returned Arc<dyn FileFormat>
?
Thank you for the review @alamb. It should be achievable to move the built in FileTypes to use the same trait interface and retire the concrete enum types. This would serve as a built in example of how to implement the trait for extension types as well and would even allow pulling out the "built in" files supported to their own crates. I'll try to find some time within the next week to try this. |
I think this would be an ideal long term outcome (e.g. have a |
I did not get back to this when Id hoped. It is still on my list, but I probably will not get to it until this weekend. |
I've been thinking about this more, and I think we need a similar interface to how we register a custom The above is based on my understanding having studied a bit how we handle external My main uncertainty to the above is how we would handle serialization. Perhaps the |
Yes.
I think the way serialization is handled today is via "ExtensionCodec"
Which do exactly as you suggest and serialize table providers to/from bytes maybe @thinkharderdev has some more thoughts to share |
Yeah, the extension codecs seem like a logical place to handle this. |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #8657
Closes #8637
Closes #8345
Rationale for this change
DataFusion is built to be as extensible as possible. We currently have a FileType enum which prevents external crates adding support for reading/writing new FileFormats in a ListingTable.
What changes are included in this PR?
Creates a new
FileType::Exension(Box<dyn ExtensionFileType>)
variant.Are these changes tested?
Partially by existing tests, additional likely needed
Are there any user-facing changes?
Ability to extend DataFusion to support reading/writing additional file formats more easily