pass bytes directly to parquet's KeyValue #4317
Comments
The parquet format defines the key value metadata as strings (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L674), which according to the thrift specification are UTF-8 (https://thrift.apache.org/docs/types). It would therefore be ill-formed for us to write non-UTF-8 data here...
One option might be to support writing arbitrary data before the footer, and then encode just its file offset in the metadata. This is similar to how bloom filters, indices, etc. are stored. Would this be workable? It would mean additional IO on your end to actually fetch this data when needed.
Edit: Although at that point you might as well just store said metadata in a separate location 🤔
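To make the offset idea concrete, here is a minimal sketch using only the Rust standard library. It is not parquet-rs code: `write_blob` and `read_blob` are hypothetical helpers, and storing the offset and length as decimal strings is an assumption made purely for illustration. The point is that the raw bytes live in the file body, while the string-only metadata carries nothing but UTF-8 numbers.

```rust
use std::io::{Read, Seek, SeekFrom, Write};

/// Hypothetical helper: write `blob` at the current position of `out` and
/// return (offset, length) as decimal strings, which are valid UTF-8 and so
/// fit a string-only key/value metadata entry.
fn write_blob<W: Write + Seek>(out: &mut W, blob: &[u8]) -> std::io::Result<(String, String)> {
    let offset = out.stream_position()?;
    out.write_all(blob)?;
    Ok((offset.to_string(), blob.len().to_string()))
}

/// Hypothetical helper: parse the stored strings, seek to the offset, and
/// read the raw bytes back. This is the additional IO mentioned above.
fn read_blob<R: Read + Seek>(input: &mut R, offset: &str, len: &str) -> std::io::Result<Vec<u8>> {
    let invalid = |e: std::num::ParseIntError| {
        std::io::Error::new(std::io::ErrorKind::InvalidData, e)
    };
    let offset: u64 = offset.parse().map_err(invalid)?;
    let len: usize = len.parse().map_err(invalid)?;

    input.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    input.read_exact(&mut buf)?;
    Ok(buf)
}
```

A writer would call `write_blob` after the data pages and before the footer, then put the two returned strings into the key value metadata. A reader would look them up and call `read_blob` only when the bytes are actually needed.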
This would require more work on our side to refactor our application, so I'm not very interested in this, at least for now. 😅
If I did choose this solution, which file should I refer to? I want to evaluate how much work it would require.
I don't think the work to support this in parquet-rs would be particularly complicated. However, I can't say how much work it would be to integrate into your application. I would have thought it not materially more or less complicated than storing the bloom data in a separate file.
Thanks. It seems that writing that info to another file is more flexible. I will reconsider more carefully how to fix this.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are using parquet's key value metadata to store our application-related data. The key value interface only accepts strings, so we use base64 to convert bytes to strings to fit the current interface.
base64 increases the size in bytes, so I would like to know whether parquet could support key value values being Vec<u8> directly.
Describe the solution you'd like
See above
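For reference, the size cost of the base64 workaround can be estimated with a small sketch. It assumes the standard padded base64 alphabet, where every 3 input bytes become 4 output characters; `base64_encoded_len` is a hypothetical helper name, not part of any parquet API.

```rust
/// Length of standard, padded base64 output for `n` input bytes:
/// every group of up to 3 bytes is encoded as 4 characters.
fn base64_encoded_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}
```

Under that assumption a 3000-byte value becomes 4000 characters, roughly a 33% increase, which is the overhead this request is trying to avoid.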
Describe alternatives you've considered
Additional context
#2444 (comment)