pass bytes directly to parquet's KeyValue #4317

Closed
jiacai2050 opened this issue May 31, 2023 · 4 comments
Labels
enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)

Comments

@jiacai2050
Contributor

jiacai2050 commented May 31, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We use Parquet's key-value metadata to store application-related data. The key-value interface only accepts strings, so we base64-encode our bytes to fit the current interface.

Base64 increases the size of the data, so it would be helpful if parquet could accept the value as Vec<u8> directly.
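To make the size cost concrete, here is a minimal std-only sketch of padded base64 encoding. The helper names `base64_encode` and `base64_len` are hypothetical and written for illustration only; they are not part of the parquet crate, and a real application would more likely use an existing base64 library.

```rust
// Standard base64 alphabet (RFC 4648), used with '=' padding.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical helper: encode arbitrary bytes as a padded base64 string.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity(base64_len(input.len()));
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling missing bytes.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Encoded length: four output characters for every three input bytes, rounded up.
fn base64_len(raw_len: usize) -> usize {
    (raw_len + 2) / 3 * 4
}

fn main() {
    // Stand-in for an application's binary metadata blob.
    let payload: Vec<u8> = (0..=255u8).cycle().take(3000).collect();
    let value = base64_encode(&payload);
    println!("raw = {} bytes, base64 = {} bytes", payload.len(), value.len());
}
```

Padded base64 emits four characters per three input bytes, so the string stored in the metadata is about one third larger than the raw bytes.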

Describe the solution you'd like
See above

Describe alternatives you've considered

Additional context

#2444 (comment)

jiacai2050 added the enhancement label on May 31, 2023
@tustvold
Contributor

tustvold commented May 31, 2023

The parquet format defines the key value metadata as strings - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L674 - which, according to the thrift specification, are UTF-8 - https://thrift.apache.org/docs/types. It would therefore be ill-formed for us to write non-UTF-8 data here...
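As a small illustration of the UTF-8 constraint, the sketch below checks whether a byte buffer could be stored as a string value without re-encoding. The function name `is_valid_utf8` is hypothetical; the check itself uses only `std::str::from_utf8` from the standard library.

```rust
/// Hypothetical helper: true if `bytes` could be written as a UTF-8 string as-is.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // Text metadata is already valid UTF-8.
    let text = "schema_version=3".as_bytes();
    // Arbitrary binary data generally is not: 0xff never appears in valid UTF-8.
    let binary: [u8; 4] = [0xff, 0xfe, 0x00, 0x80];

    println!("text ok: {}", is_valid_utf8(text));
    println!("binary ok: {}", is_valid_utf8(&binary));
}
```

This is why arbitrary bytes need an encoding such as base64 before they can go into a string field.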

One option might be to support writing arbitrary data before the footer, and then encode just this file offset in the metadata. This is similar to how bloom filters, indices, etc... are stored. Would this be workable? It would mean additional IO on your end to actually fetch this data when needed.
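The offset idea above could be sketched roughly as follows. This is not the parquet-rs API: `write_blob`, `read_blob`, and the `"offset:length"` string format are assumptions made for illustration, and an in-memory `Cursor` stands in for the file. The real Parquet footer layout is not modeled.

```rust
use std::io::{Cursor, Error, ErrorKind, Read, Result, Seek, SeekFrom, Write};

/// Hypothetical helper: write `blob` at the current position and return an
/// "offset:length" pointer, which is plain ASCII and so fits in a string value.
fn write_blob<W: Write + Seek>(w: &mut W, blob: &[u8]) -> Result<String> {
    let offset = w.stream_position()?;
    w.write_all(blob)?;
    Ok(format!("{}:{}", offset, blob.len()))
}

/// Hypothetical helper: resolve an "offset:length" pointer back to the raw bytes.
/// This is the additional IO mentioned above.
fn read_blob<R: Read + Seek>(r: &mut R, pointer: &str) -> Result<Vec<u8>> {
    let (offset, len) = pointer
        .split_once(':')
        .ok_or_else(|| invalid("missing ':' separator"))?;
    let offset: u64 = offset.parse().map_err(|_| invalid("bad offset"))?;
    let len: usize = len.parse().map_err(|_| invalid("bad length"))?;

    r.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

fn invalid(msg: &str) -> Error {
    Error::new(ErrorKind::InvalidData, msg)
}

fn main() -> Result<()> {
    let mut file = Cursor::new(Vec::<u8>::new());

    // Stand-in for the data pages that precede the footer.
    file.write_all(b"...row group data...")?;

    // Raw, non-UTF-8 application bytes go into the file body, unencoded.
    let pointer = write_blob(&mut file, &[0xff, 0x00, 0x80, 0x7f])?;

    // Stand-in for the footer; the pointer string would be stored in its metadata.
    file.write_all(b"...footer...")?;

    let blob = read_blob(&mut file, &pointer)?;
    assert_eq!(blob, vec![0xff, 0x00, 0x80, 0x7f]);
    println!("pointer = {pointer}, recovered {} bytes", blob.len());
    Ok(())
}
```

The trade-off is the one raised in the comment: the metadata stays a small valid string and the payload avoids base64 overhead, but reading it back needs an extra seek and read.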

Edit: Although at that point you might as well just store said metadata in a separate location 🤔

@jiacai2050
Contributor Author

jiacai2050 commented Jun 2, 2023

Although at that point you might as well just store said metadata in a separate location.

That would require more refactoring work in our application, so I'm not very interested in it, at least for now. 😅

One option might be to support writing arbitrary data before the footer, and then encode just this file offset in the metadata. This is similar to how bloom filters, indices, etc... are stored.

If I choose this solution, which file should I look at? I want to evaluate how much work it would require.

@tustvold
Contributor

tustvold commented Jun 2, 2023

I want to evaluate how much work it would require

I don't think the work to support this in parquet-rs would be particularly complicated. However, I can't say how much work it would be to integrate in your application. I would have thought it not materially more or less complicated than storing the bloom data in a separate file.

@jiacai2050
Contributor Author

Thanks. It seems that writing this information to another file is more flexible. I will think more carefully about how to fix this.

tustvold added the parquet label on Aug 15, 2023