
[PARQUET-951] Pull request for handling protobuf field id #410

Closed
wants to merge 6 commits into from

Conversation

qinghui-xu
Contributor

In the current implementation, the field id is not persisted in the parquet file metadata. We propose this patch to address the problem (especially for parquet-protobuf). The field id is important for schema backward/forward compatibility in protobuf.

@lukasnalezenec
Contributor

How about adding a configurable fallback to the old behaviour?

@qinghui-xu
Contributor Author

qinghui-xu commented Apr 26, 2017

I was thinking about that, but I did not see the point in keeping the old behavior.
And there is a second question: if we keep the old behavior behind some configuration, which one will be the default behavior?

@julienledem
Member

@qinghui-xu for backward compatibility, the default behavior should be the old one.
Currently, schema evolution is relying on field names.
Possibly you could make a new top-level ProtoParquetReader (with a different name) that would have this property set to use the ids by default.

@julienledem
Member

@qinghui-xu, just fyi @costimuraru is also contributing to parquet-proto. See: #411

@lukasnalezenec @qinghui-xu @costimuraru: you are all interested in parquet-protos :)

@qinghui-xu
Contributor Author

qinghui-xu commented May 2, 2017

@julienledem @lukasnalezenec @costimuraru
Hello, for keeping the old behavior as the default, I would suggest the following:

  1. When writing parquet files with parquet-protobuf, we will write out a flag "parquet.field.id.persistent" as extra metadata in the footer. "True" means the field id is persisted and "False" means it is not. Absence of the flag is equivalent to "False", which preserves the current default behavior.
    This flag will be set in org.apache.hadoop.conf.Configuration during job initialization.
  2. When reading, the reader/read support (for protobuf) will depend on this metadata flag and behave accordingly.

What do you think about it?


Edit 2017.05.05:

Regarding the merge of files from both the old schema (without field id) and the new schema (with field id), using this flag we can choose one of the following behaviors:

  1. Do not merge when file footers contain both true and false value for the flag.
  2. Merge by name to old style schema (without field id).
  3. If the new schema contains all fields from the old schema (checked by name and type for all child fields), merge to the new-style schema (with field id). Otherwise the schemas cannot be merged (throw an error? or simply do nothing but warn).

[Note] This suggestion concerns general-purpose parquet file merging (not limited to using parquet-protobuf to do the merge); I think we would perhaps need this for use cases such as parquet hive?
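The proposed footer flag could be sketched roughly as follows. This is only an illustration, assuming (hypothetically) that the footer's extra metadata is exposed as a plain key/value map rather than the real parquet-mr types; the key name "parquet.field.id.persistent" comes from the proposal above:

```java
import java.util.Map;

public class FieldIdFlag {
    // Hypothetical metadata key from the proposal above.
    static final String KEY = "parquet.field.id.persistent";

    // Absence of the flag is treated as "false", which preserves
    // the current default behavior for existing files.
    static boolean isFieldIdPersistent(Map<String, String> footerMetadata) {
        return Boolean.parseBoolean(footerMetadata.getOrDefault(KEY, "false"));
    }
}
```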

@julienledem
Member

Thanks @qinghui-xu,
After reading your proposal, I’ve been thinking about this a bit more and I’m wondering if the presence of the ids themselves in the schema can be used as the flag to decide how to do the logic.

Here is some context:
This schema matching/merging logic is used when several files that have been produced at different times are read together. In parquet-proto, the reader will provide its own schema as a reference (based on the version of the IDL it is using) and all the file schemas will be mapped to it.
So it is likely that after updating the writer, at least for a while, some files will have the field IDs and some won't.

Ideally we want the new behavior to become the preferred default (new users have schema matching by field id instead of by name) while preserving the current behavior (existing files are still interpreted the same way)

So I propose that we use the following default:
When reading from a new schema with the field ids, old schemas get mapped by name (they don’t have the ids anyway) and new schemas get mapped by id. This has the benefit of keeping the old behavior for existing files and having the preferred default for new files. Only schema changes happening from now on will have the new behavior.

But we add a read-time flag to turn it off if people really want to maintain the name-based mapping all the time.
We have several flags like this. Here is an example in the pig integration: this one merges columns per index rather than by name (similar to the hive logic):
https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-pig/src/main/java/org/apache/parquet/pig/TupleReadSupport.java#L55
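A minimal sketch of such a read-time flag, in the spirit of the pig example above. The property name `parquet.proto.map.by.name` is invented here purely for illustration, and the reader configuration is reduced to a plain map rather than a Hadoop `Configuration`:

```java
import java.util.Map;

public class ReadTimeFlag {
    // Hypothetical read-time property, mirroring flags like
    // parquet-pig's column-index-access pattern. Not a real key.
    static final String MAP_BY_NAME = "parquet.proto.map.by.name";

    // Default proposed above: map by field id when the file schema
    // carries ids, unless the reader explicitly asks for name mapping.
    static boolean mapById(Map<String, String> readerConf, boolean fileHasFieldIds) {
        boolean forceByName =
                Boolean.parseBoolean(readerConf.getOrDefault(MAP_BY_NAME, "false"));
        return fileHasFieldIds && !forceByName;
    }
}
```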

However it would be great to hear from people who use parquet-proto in production. @lukasnalezenec, @qinghui-xu, @costimuraru, others?

@qinghui-xu
Contributor Author

Hello @julienledem
Thanks for replying.
Some details regarding your proposition: at write time, do we need to keep the old behavior behind some configuration, which would ignore the field id when writing metadata? Or is this unnecessary, since at read time the ids can be ignored by some configuration flag anyway?

Also, I edited my previous comment to propose something for managing the (general-purpose) file merge using the flag that I proposed. Perhaps the metadata flag will be necessary for this purpose?

@julienledem
Member

julienledem commented May 12, 2017

@qinghui-xu

At write time, do we need to keep the old behavior behind some configuration, which would ignore the field id when writing metadata? Or is this unnecessary, since at read time the ids can be ignored by some configuration flag anyway?

I think it is not necessary.

@julienledem
Member

Do not merge when file footers contain both true and false value for the flag.

To clarify: the read-time flag that I am proposing is passed to the reader while reading, not saved in the file, so it is simply true or false. However, you may have a mix of schemas with ids (new) and schemas without (old).

Merge by name to old style schema (without field id).

yes

If new schema contains all fields from old schema (check by name and type for all child fields), merge to new style schema (with field id). Otherwise could not merge (throw an error? or simply do nothing but warning)

In this case there is a user-provided schema which is the reference. Fields that are not found in this schema are ignored: we don't need to read them, so they are skipped when reading the data. This is just projection push-down (for example, we removed the field from the IDL).
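The matching against a reference schema could be pictured like this sketch; `Field` is a toy stand-in for a schema field, not the real parquet-mr type, and the quadratic lookup is kept deliberately simple:

```java
import java.util.ArrayList;
import java.util.List;

public class FieldMatcher {
    // Toy stand-in for a schema field: a name plus an optional id.
    record Field(String name, Integer id) {}

    // Resolve each file field against the reader's reference schema:
    // by id when both sides carry one, otherwise by name.
    // File fields absent from the reference schema are simply skipped,
    // i.e. projection push-down as described above.
    static List<Field> resolve(List<Field> fileFields, List<Field> reference) {
        List<Field> matched = new ArrayList<>();
        for (Field f : fileFields) {
            for (Field r : reference) {
                boolean hit = (f.id() != null && r.id() != null)
                        ? f.id().equals(r.id())
                        : f.name().equals(r.name());
                if (hit) {
                    matched.add(r);
                    break;
                }
            }
        }
        return matched;
    }
}
```

Note how an old file (ids all null) degrades naturally to name matching, which is the mixed-schema situation discussed above.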

@lukasnalezenec
Contributor

IMHO we should have some protobuf-specific metadata about the way the data were written. We can also use it when matching field ids (together with the value passed to the reader).

We can remove the installation of the protobuf libraries from .travis.yml
https://github.com/apache/parquet-mr/blob/master/.travis.yml
We can remove lines 4-19.

We can remove the instructions about installing protobuf from the main project README.

We can also add some decent logging during initialization. It can be at trace level.

@julienledem
Member

@lukasnalezenec

IMHO we should have some protobuf-specific metadata about the way the data were written. We can also use it when matching field ids (together with the value passed to the reader).

Do you mean similar to how we add the thrift schema in the parquet footer?
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/AbstractThriftWriteSupport.java#L90

@qinghui-xu
Contributor Author

IMHO we should have some protobuf-specific metadata about the way the data were written. We can also use it when matching field ids (together with the value passed to the reader).

Do you mean similar to how we add the thrift schema in the parquet footer?
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/AbstractThriftWriteSupport.java#L90

Putting a flag (e.g. "parquet.field.id.persistent") in the protobuf-specific metadata would be enough to tell how the schema is handled. If the value is absent, we will consider it false.
But I agree with @julienledem: the default behavior of the parquet-protobuf writer will be to write out the field id in the schema (and at the same time the flag "parquet.field.id.persistent" is set to true and written as extra metadata). And we will give people the possibility to explicitly ignore the field id (in which case the flag will be set to false and written as extra metadata as well).
This flag will provide us with a way to safely handle the cohabitation of new and old schemas. We can also use it to validate the schema: if the flag is on but one of the fields in the schema has no field id, then the footer of the file is probably corrupted.
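The validation described here (flag on but a field without an id suggests a corrupted footer) might look like this sketch, again with the footer metadata reduced to a plain map and a toy `Field` type standing in for the real schema classes:

```java
import java.util.List;
import java.util.Map;

public class SchemaValidator {
    // Toy stand-in for a schema field: a name plus an optional id.
    record Field(String name, Integer id) {}

    // If the (hypothetical) flag claims ids are persisted, every field
    // must carry one; otherwise the footer is probably corrupted.
    static void validate(Map<String, String> footerMetadata, List<Field> fields) {
        boolean idsExpected = Boolean.parseBoolean(
                footerMetadata.getOrDefault("parquet.field.id.persistent", "false"));
        if (!idsExpected) {
            return; // old-style schema, nothing to check
        }
        for (Field f : fields) {
            if (f.id() == null) {
                throw new IllegalStateException(
                        "field '" + f.name() + "' has no id but the footer claims ids are persisted");
            }
        }
    }
}
```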

@qinghui-xu qinghui-xu force-pushed the master branch 3 times, most recently from 328db78 to 13926a5 on May 29, 2017 16:55
Use protobuf-maven-plugin to manage the protoc for compiling proto.
This way the build is independent of the environment and no protoc
needs to be installed.
Field id is the key element for serialization frameworks such as
protobuf. Persist the field id in the file metadata if the schema
supports it.
When reading back a parquet file with a protobuf schema, try to
match the parquet schema to protobuf with respect to the field id.
If the parquet file was written with a previous parquet-protobuf
(prior to 1.9.1), the field id is not persisted in the parquet
file metadata. When reading these legacy files, the schema
converter falls back to matching fields by name.
Use the parameter "parquet.schema.field.with.id" to enable schema
field id persistence. When this flag is on, all schema fields
should contain an id (this is generally different from the field
index, which is the field position); otherwise it is considered
an error. This flag is set as extra metadata in the footer, and
its default value is false.

For parquet-protobuf, this flag is systematically set to true
unless it is explicitly set to false in the configuration, which
makes field id persistence the default behavior for
parquet-protobuf.
Protoc is now managed by maven plugin.
In this commit, we separate the new behavior of reading fields
by id from the old behavior of parquet-protobuf.

In the new behavior:
For a schema containing field ids (flag "parquet.schema.field.with.id"
on), parquet-protobuf now supports reading parquet files with a new
protobuf schema after fields have been removed. When reading a field
already removed from the schema, it is treated as an unknown field
(for the moment the implementation just safely ignores them).
Also, there is a stricter check on field id presence. If the footer
has the flag "parquet.schema.field.with.id" on, the protobuf reader
expects all fields in the schema to contain an id, or it will raise
an error.

This new behavior is the default, but it can be explicitly
disabled by setting the flag off in the configuration. This falls
back to the old behavior: mapping fields by name, with an error
raised if unknown fields are found.
@qinghui-xu
Contributor Author

qinghui-xu commented Jun 2, 2017

@julienledem @lukasnalezenec

So I reworked the pull request a little bit; as a whole:

  1. For the field id, use a flag "parquet.schema.field.with.id" to indicate whether field ids should be present in the schema. At write time, if the flag is on but some fields do not contain an id, it is considered an error. (This avoids problems if people try to merge files with old/new schemas.)
  2. This flag defaults to false, which means the previous behavior is the default for all frameworks (except parquet-protobuf).
  3. For protobuf, the flag is on by default, unless it is disabled explicitly in the configuration. This means that by default the protobuf field id is now preserved when writing with parquet-protobuf.
  4. When reading old-schema files (the flag is absent) with parquet-protobuf, the behavior is the same as before. When reading new-schema files (flag on), there is systematically a check on the field id, and fields are mapped by id instead of name; but we still have the possibility to fall back to the old read-by-name behavior for new-schema files, by setting the flag off explicitly in the configuration.
  5. For projection, the field id is handled transparently. If the projection is without field ids (people tend not to put them), it is decorated with ids according to the schema fields.
  6. Now if a field is removed from the protobuf schema, we can read a file written with an older version of the schema, if we are using the new behavior for parquet-protobuf.
  7. Use the protobuf maven plugin and remove the protobuf build from the travis config; this saves about 20 minutes per travis build.

Remark: regarding the removed fields, the current implementation will just ignore the value in that field. A future improvement could leverage protobuf's unknown-fields capability, such that the field value is kept in the protobuf message's unknownFieldSet.
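Point 5 above (decorating an id-less projection from the full schema) could be sketched as follows, with hypothetical toy types rather than the real parquet-mr schema classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProjectionDecorator {
    // Toy stand-in for a schema field: a name plus an optional id.
    record Field(String name, Integer id) {}

    // A projection written without ids gets each field's id filled in
    // from the full schema, matched by name; fields that already carry
    // an id are left untouched.
    static List<Field> decorate(List<Field> projection, List<Field> fullSchema) {
        Map<String, Integer> idsByName = new HashMap<>();
        for (Field f : fullSchema) {
            idsByName.put(f.name(), f.id());
        }
        List<Field> decorated = new ArrayList<>();
        for (Field p : projection) {
            Integer id = p.id() != null ? p.id() : idsByName.get(p.name());
            decorated.add(new Field(p.name(), id));
        }
        return decorated;
    }
}
```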

@julienledem
Member

I think this is getting close.
@matt-martin @lukasnalezenec @costimuraru @kgalieva: comments?

@costimuraru

@qinghui-xu @julienledem Leveraging the (protobuf) field ids is a great idea!

@qinghui-xu, would you be so kind as to rebase this PR, so we could give it a try on our platform?

@qinghui-xu
Contributor Author

Hello, @costimuraru
I would be happy to rebase this.

@costimuraru

@qinghui-xu that would be great!

@BenoitHanotte

Hello @costimuraru @qinghui-xu @julienledem
As the protobuf descriptor is already serialized in the file metadata (https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java#L132) and contains all the information required to map a protobuf field id to its name, can't we leverage this instead of changing the way we set the field id in the parquet schema?
Not only would this isolate the change to the protobuf part of the logic, it would also bring backward compatibility, as existing files already contain the descriptor in its serialized form. In this case we would only need to set a flag at read time, instead of also having to add a flag when writing.
If we set the parquet field ids according to the protobuf ids, I don't think we would be able to support schema compatibility for files written with a previous version of parquet, as the parquet schema of the file would be missing the required information.
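The alternative sketched here is to recover the id-to-name mapping from the descriptor already stored in the footer, so no write-time change is needed. A toy illustration, with the descriptor reduced to (number, name) pairs rather than real protobuf `FieldDescriptor`s:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DescriptorMapping {
    // Toy stand-in for a protobuf field descriptor: number and name only.
    record FieldDesc(int number, String name) {}

    // Derive the field-id -> field-name mapping from the descriptor
    // serialized in the footer, so existing files stay readable
    // without any new write-time flag.
    static Map<Integer, String> idToName(List<FieldDesc> descriptorFields) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (FieldDesc f : descriptorFields) {
            map.put(f.number(), f.name());
        }
        return map;
    }
}
```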

@lukasnalezenec
Contributor

lukasnalezenec commented Jun 5, 2018 via email

@qinghui-xu qinghui-xu closed this Nov 24, 2018
@qinghui-xu
Contributor Author

I am closing this because the PR seems to need a big refactoring to be mergeable into upstream, and we are not currently working on it actively.
