-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PARQUET-951] Pull request for handling protobuf field id #410
Conversation
How about doing some configurable fallback to old behaviour ? |
I was thinking about that, but I did not see the point to keeping old behavior. |
@qinghui-xu for backward compatibility, the default behavior should be the old one. |
@qinghui-xu, just fyi @costimuraru is also contributing to parquet-proto. See: #411 @lukasnalezenec @qinghui-xu @costimuraru: you are all interested in parquet-protos :) |
@julienledem @lukasnalezenec @costimuraru
What do you think about it? Edit 2017.05.05:Regarding to the merge of files from both old schema (without field id) and new schema (with field id), using this flag we can choose one of the following behaviors:
[Note] This suggestion is regarding to the general purpose parquet file merge (not limited to using parquet-protobuf to do the merge), I think we would perhaps need this for use cases such as parquet hive? |
Thanks @qinghui-xu, Here is some context: Ideally we want the new behavior to become the preferred default (new users have schema matching by field id instead of by name) while preserving the current behavior (existing files are still interpreted the same way) So I propose that we use the following default: But we add a read-time flag to turn it off if people really want to maintain the name based mapping all the time. However it would be great to hear from people who use parquet-proto in production. @lukasnalezenec, @qinghui-xu, @costimuraru, others? |
Hello @julienledem Otherwise I edited my previous comment, to propose something for managing the (general-purpose) file merge by using the flag that I proposed. Perhaps the metadata flag will be necessary regarding to this purpose? |
I think it is not necessary. |
To clarify the read-time flag that I am proposing is passed to the reader while reading and not saved in the file. Which means it is true or false. However you may have a mix of schemas with ids (new) and schemas without (old).
yes
In this case there is a user provided schema which is the reference. Things that are not found in this schema are ignored since we don't need to read them they will be skipped when reading the data. This is just projection push down (for example we removed the field from the IDL) |
IMHO we should have some protobuf specific metadata about way how data were written. We can also use them when matching fields ids (together with value passed to reader). We can remove instalation of protobuf libraries from travis.xml We can remove instructions about installing protobuffers from main project readme We can also add some decent logging during initialization. It can be trace level. |
Do you mean similar to how we add the thrift schema in the parquet footer? |
Putting a flag (e.g. "parquet.field.id.persistent") as a protobuf specific metadata would be enough to tell how the schema is handled. If the value is absent, we will consider it as false. |
328db78
to
13926a5
Compare
Use protobuf-maven-plugin to manage the protoc for compiling proto. This way the build is independent of the environment and no protoc is needed to be installed.
Field id is the key element for serialization framework such as protobuf. Persist field id in the file metadata if the schema supports it.
When read back a parquet file with protobuf schema, try to match the parquet schema to protobuf with respect to field id. If the parquet file is written with a previsous parquet-protobuf ( prior to 1.9.1), field id is not persisted in parquet file metadata. When reading these legacy file, schema converter falls back to matching fields by name.
Use parameter "parquet.schema.field.with.id" to enable schema field id persistence. When this flag is on, all schema fields should contain id (this is generally different from the field index which is the field position), otherwise it will be considered as an error. This flag is set as extra metadata in the footer, and its default value is false. For parquet-protobuf, this flag is systematically set to true, except that it's set to false explicitly in the configuration, which makes field id persistence as the default behavior for parquet-protobuf.
Protoc is now managed by maven plugin.
In this commit, we seperate the new behavior of reading fields by id and the old behavoir of parquet-protobuf. In new behavior: For schema containing field id (flag "parquet.schema.field.with.id" on), parquet-protobuf now supports reading parquet files with new protobuf schema after removing fields. When reading a field already removed from the schema, it will be treated as unknown field (for the moment the implementation just safely ignores them). Also, there is a more strict check on the field id presence. If the footer has the flag "parquet.schema.field.with.id" on, protobuf reader expects all fields in the schema contain id, or it will raise an error. This new behavior is the default behavior, but can be explicitly disabled by setting flag off in the configuration. This will fall back to the old behavior: mapping fields by name, and error will be raised if unknown fields are found.
So I reworked a little bit on the pull request, in a whole:
Remark: regarding to the removed fields, current implementation will just ignore the value in that field. Future improvement could leverage protobuf's unkown fields capability, such that the field value can be kept in the protobuf message's unknownFieldSet. |
I think this is getting close. |
@qinghui-xu @julienledem Leveraging the (protobuf) field ids is a great idea! @qinghui-xu, would you be so kind to rebase this PR, so we could give it a try on our platform? |
Hello, @costimuraru |
@qinghui-xu that would be great! |
Hello @costimuraru @qinghui-xu @julienledem |
Hi,
We already write field ids to schema.
https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java#L80
2018-05-29 19:03 GMT+02:00 Benoît Hanotte <[email protected]>:
… Hello @costimuraru <https://github.com/costimuraru> @qinghui-xu
<https://github.com/qinghui-xu> @julienledem
<https://github.com/julienledem>
As the protobuf descriptor is already serialized in the file metadata (
https://github.com/apache/parquet-mr/blob/master/
parquet-protobuf/src/main/java/org/apache/parquet/proto/
ProtoWriteSupport.java#L132) and contains all the information required to
map the protobuf field id to its name, can't we leverage this instead of
changing the way we set the field id in the parquet schema?
Not only would this isolate the change to the protobuf part of the logic,
it would also bring backward compatibility as files already contain the
descriptor in its serialized form. In this case we would only need to set a
flag at read-time, instead of also having to add a flag when writing.
If we were setting the parquet field ids according to the protobuf ids, I
don't think we would be able to support schema compatibility for files
written with a previous version of parquet as the parquet schema of the
file would be missing the required information.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#410 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEJuUHhtTNYvu9_46NR2goixJ-6POb-uks5t3X9agaJpZM4NImRv>
.
|
I close this because this PR seems need a big refactoring to be able to merge into upstream, also we do not currently work with it actively. |
In current implementation, field id is not persisted in parquet file metadata. We propose this patch to address the problem (especially for parquet-protobuf). The field id is important with regards to the schema backward/forward compatiblity in protobuf.