-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support proto serialization of InsertExec
#7303
Comments
I'm not sure it would be desirable to extend I have in mind three solutions. 1 Leave as is, and require anyone desiring this behaviour to implement a 2 There are currently 3 implementations of
( So we could introduce three new message ParquetFileSinkExecNode {
...
}
message CsvFileSinkExecNode {
...
}
message JsonFileSinkExecNode {
...
} These represent the default supported 3 Instead of smashing the message FileSinkExec {
PhysicalPlanNode input = 1;
DataSink sink = 2;
...
}
message DataSink {
oneof DataSinkImpl {
ParquetSink parquet_sink = 1;
CsvSink csv_sink = 2;
JsonSink json_sink = 3;
DataSinkExtension extension = 4;
}
}
message ParquetSink {
...
}
message CsvSink {
...
}
message JsonSink {
...
}
message DataSinkExtension {
bytes sink = 1;
} This would more accurately model how it's represented in DataFusion, but has the disadvantage of requiring a new extension type, which presumably would need a new codec as well (similar to how I'm taking a stab at the third option (though maybe the second option would be better? 🤔), but would greatly appreciate any insight on the matter. |
FWIW option 2 is what we are doing now and it works easily enough. But I think option 3 sounds pretty good to me. But that seems like it would basically just boil down to adding methods to |
Yeah haven't given too much thought to the interop code yet, just mainly been considering the proto side, so unsure how the code would turn out. If you've already been implementing option 2, how have you been handling encoding/decoding |
We actually have our own |
I see. If we go ahead with option 2 you might still have to use your own implementation of the |
Decided to go with JSON first as it was the most trivial to implement. Parquet is difficult as it relies on WriterProperties struct from arrow-rs crate, and CSV similarly relies on WriterBuilder from arrow-rs so these need extra work on the serde for those structs |
One potential option for WriterProperities / builder would be to port the parsing logic upstream to parquet-rs: apache/arrow-rs#4693 And then you could serialize the options as a string 🤔 |
Is your feature request related to a problem or challenge?
Currently plans that include an
InsertExec
cannot be serialized to protobuf (and hence used in Ballista)Describe the solution you'd like
The easiest way to support this would be to modify
PhysicalExtensionCodec
to support serializing/deserializing adyn DataSink
. So something like:In this case the "standard" implementations would be handled directly within the main serde logic and if a given
Arc<dyn DataSink>
wasn't one of the standard cases then it would try and use the extension codec.Alternatively we might push serialization to the
DataSink
trait itself:Describe alternatives you've considered
Not do anything
Additional context
No response
The text was updated successfully, but these errors were encountered: