
[PARQUET-951] Pull request for handling protobuf field id #410

Closed
wants to merge 6 commits into from

Conversation

qinghui-xu
Contributor

In the current implementation, the field id is not persisted in the parquet file metadata. We propose this patch to address the problem (especially for parquet-protobuf). The field id is important for schema backward/forward compatibility in protobuf.

@lukasnalezenec
Contributor

How about adding a configurable fallback to the old behaviour?

@qinghui-xu
Contributor Author

qinghui-xu commented Apr 26, 2017

I was thinking about that, but I did not see the point in keeping the old behavior.
And there is a second question: if we keep the old behavior behind some configuration, which one will be the default behavior?

@julienledem
Member

@qinghui-xu for backward compatibility, the default behavior should be the old one.
Currently, schema evolution is relying on field names.
Possibly you could make a new top-level ProtoParquetReader (with a different name) that would have this property set to use the ids by default.

@julienledem
Member

@qinghui-xu, just fyi @costimuraru is also contributing to parquet-proto. See: #411

@lukasnalezenec @qinghui-xu @costimuraru: you are all interested in parquet-protos :)

@qinghui-xu
Contributor Author

qinghui-xu commented May 2, 2017

@julienledem @lukasnalezenec @costimuraru
Hello, for keeping the old behavior as the default, I would suggest the following:

  1. When writing parquet files with parquet-protobuf, we will write out a flag "parquet.field.id.persistent" as extra metadata in the footer. "True" means the field id is persisted and "False" means it is not. Absence of the flag is equivalent to "False", which preserves the current default behavior.
    This flag will be set in org.apache.hadoop.conf.Configuration during job initialization.
  2. When reading, the reader/read support (for protobuf) will depend on this metadata flag and behave accordingly.

What do you think about it?


Edit 2017.05.05:

Regarding the merge of files from both the old schema (without field id) and the new schema (with field id), using this flag we can choose one of the following behaviors:

  1. Do not merge when file footers contain both true and false value for the flag.
  2. Merge by name to old style schema (without field id).
  3. If the new schema contains all fields from the old schema (checked by name and type for all child fields), merge to the new-style schema (with field id). Otherwise the schemas cannot be merged (throw an error? or simply do nothing but warn).

[Note] This suggestion concerns general-purpose parquet file merging (not limited to using parquet-protobuf to do the merge); I think we would perhaps need this for use cases such as parquet hive?
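The proposed footer flag could be sketched roughly as follows. This is only an illustration, assuming (hypothetically) that the footer's extra metadata is exposed as a plain key/value map rather than the real parquet-mr types; the key name "parquet.field.id.persistent" comes from the proposal above:

```java
import java.util.Map;

public class FieldIdFlag {
    // Hypothetical metadata key from the proposal above.
    static final String KEY = "parquet.field.id.persistent";

    // Absence of the flag is treated as "false", which preserves
    // the current default behavior for existing files.
    static boolean isFieldIdPersistent(Map<String, String> footerMetadata) {
        return Boolean.parseBoolean(footerMetadata.getOrDefault(KEY, "false"));
    }
}
```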

@julienledem
Member

Thanks @qinghui-xu,
After reading your proposal, I’ve been thinking about this a bit more and I’m wondering if the presence of the ids themselves in the schema can be used as the flag to decide how to do the logic.

Here is some context:
This schema matching/merging logic is used when several files that have been produced at different times are read together. In parquet-proto, the reader will provide its own schema as a reference (based on the version of the IDL it is using) and all the file schemas will be mapped to it.
So it is likely that after updating the writer, at least for a while, some files will have the field IDs and some won't.

Ideally we want the new behavior to become the preferred default (new users have schema matching by field id instead of by name) while preserving the current behavior (existing files are still interpreted the same way)

So I propose that we use the following default:
When reading from a new schema with the field ids, old schemas get mapped by name (they don’t have the ids anyway) and new schemas get mapped by id. This has the benefit of keeping the old behavior for existing files and having the preferred default for new files. Only schema changes happening from now on will have the new behavior.

But we add a read-time flag to turn it off if people really want to maintain the name-based mapping all the time.
We have several flags like this. Here is an example in the pig integration: this one merges columns per index rather than by name (similar to the hive logic):
https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-pig/src/main/java/org/apache/parquet/pig/TupleReadSupport.java#L55
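A minimal sketch of such a read-time flag, in the spirit of the pig example above. The property name `parquet.proto.map.by.name` is invented here purely for illustration, and the reader configuration is reduced to a plain map rather than a Hadoop `Configuration`:

```java
import java.util.Map;

public class ReadTimeFlag {
    // Hypothetical read-time property, mirroring flags like
    // parquet-pig's column-index-access pattern. Not a real key.
    static final String MAP_BY_NAME = "parquet.proto.map.by.name";

    // Default proposed above: map by field id when the file schema
    // carries ids, unless the reader explicitly asks for name mapping.
    static boolean mapById(Map<String, String> readerConf, boolean fileHasFieldIds) {
        boolean forceByName =
                Boolean.parseBoolean(readerConf.getOrDefault(MAP_BY_NAME, "false"));
        return fileHasFieldIds && !forceByName;
    }
}
```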

However it would be great to hear from people who use parquet-proto in production. @lukasnalezenec, @qinghui-xu, @costimuraru, others?

@qinghui-xu
Contributor Author

Hello @julienledem
Thanks for replying.
Some details regarding your proposition: at write time, do we need to keep the old behavior behind some configuration, which would ignore the field id when writing metadata? Or is this unnecessary, since at read time the ids can be ignored by some configuration flag anyway?

Also, I edited my previous comment to propose something for managing the (general-purpose) file merge using the flag that I proposed. Perhaps the metadata flag will be necessary for this purpose?

@julienledem
Member

julienledem commented May 12, 2017

@qinghui-xu

At write time, do we need to keep the old behavior behind some configuration, which would ignore the field id when writing metadata? Or is this unnecessary, since at read time the ids can be ignored by some configuration flag anyway?

I think it is not necessary.

@julienledem
Member

Do not merge when file footers contain both true and false value for the flag.

To clarify: the read-time flag that I am proposing is passed to the reader while reading, not saved in the file, so it is simply true or false. However, you may have a mix of schemas with ids (new) and schemas without (old).

Merge by name to old style schema (without field id).

yes

If new schema contains all fields from old schema (check by name and type for all child fields), merge to new style schema (with field id). Otherwise could not merge (throw an error? or simply do nothing but warning)

In this case there is a user-provided schema which is the reference. Fields that are not found in this schema are ignored: we don't need to read them, so they are skipped when reading the data. This is just projection push-down (for example, we removed the field from the IDL).
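The matching against a reference schema could be pictured like this sketch; `Field` is a toy stand-in for a schema field, not the real parquet-mr type, and the quadratic lookup is kept deliberately simple:

```java
import java.util.ArrayList;
import java.util.List;

public class FieldMatcher {
    // Toy stand-in for a schema field: a name plus an optional id.
    record Field(String name, Integer id) {}

    // Resolve each file field against the reader's reference schema:
    // by id when both sides carry one, otherwise by name.
    // File fields absent from the reference schema are simply skipped,
    // i.e. projection push-down as described above.
    static List<Field> resolve(List<Field> fileFields, List<Field> reference) {
        List<Field> matched = new ArrayList<>();
        for (Field f : fileFields) {
            for (Field r : reference) {
                boolean hit = (f.id() != null && r.id() != null)
                        ? f.id().equals(r.id())
                        : f.name().equals(r.name());
                if (hit) {
                    matched.add(r);
                    break;
                }
            }
        }
        return matched;
    }
}
```

Note how an old file (ids all null) degrades naturally to name matching, which is the mixed-schema situation discussed above.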

@lukasnalezenec
Contributor

IMHO we should have some protobuf-specific metadata about the way the data were written. We can also use it when matching field ids (together with the value passed to the reader).

We can remove the installation of the protobuf libraries from .travis.yml
https://github.com/apache/parquet-mr/blob/master/.travis.yml
We can remove lines 4-19.

We can remove the instructions about installing protobuf from the main project README.

We can also add some decent logging during initialization. It can be at trace level.

@julienledem
Member

@lukasnalezenec

IMHO we should have some protobuf-specific metadata about the way the data were written. We can also use it when matching field ids (together with the value passed to the reader).

Do you mean similar to how we add the thrift schema in the parquet footer?
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/AbstractThriftWriteSupport.java#L90

@qinghui-xu
Contributor Author

IMHO we should have some protobuf-specific metadata about the way the data were written. We can also use it when matching field ids (together with the value passed to the reader).

Do you mean similar to how we add the thrift schema in the parquet footer?
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/AbstractThriftWriteSupport.java#L90

Putting a flag (e.g. "parquet.field.id.persistent") in the protobuf-specific metadata would be enough to tell how the schema is handled. If the value is absent, we will consider it false.
But I agree with @julienledem: the default behavior of the parquet-protobuf writer will be to write out the field id in the schema (and at the same time the flag "parquet.field.id.persistent" is set to true and written as extra metadata). And we will give people the possibility to explicitly ignore the field id (in which case the flag will be set to false and written as extra metadata as well).
This flag will provide us with a way to safely handle the cohabitation of new and old schemas. We can also use it to validate the schema: if the flag is on but one of the fields in the schema has no field id, then the footer of the file is probably corrupted.
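The validation described here (flag on but a field without an id suggests a corrupted footer) might look like this sketch, again with the footer metadata reduced to a plain map and a toy `Field` type standing in for the real schema classes:

```java
import java.util.List;
import java.util.Map;

public class SchemaValidator {
    // Toy stand-in for a schema field: a name plus an optional id.
    record Field(String name, Integer id) {}

    // If the (hypothetical) flag claims ids are persisted, every field
    // must carry one; otherwise the footer is probably corrupted.
    static void validate(Map<String, String> footerMetadata, List<Field> fields) {
        boolean idsExpected = Boolean.parseBoolean(
                footerMetadata.getOrDefault("parquet.field.id.persistent", "false"));
        if (!idsExpected) {
            return; // old-style schema, nothing to check
        }
        for (Field f : fields) {
            if (f.id() == null) {
                throw new IllegalStateException(
                        "field '" + f.name() + "' has no id but the footer claims ids are persisted");
            }
        }
    }
}
```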

@qinghui-xu qinghui-xu force-pushed the master branch 3 times, most recently from 328db78 to 13926a5 on May 29, 2017 16:55
Use protobuf-maven-plugin to manage the protoc for compiling proto.
This way the build is independent of the environment and no protoc
needs to be installed.
Field id is the key element for serialization frameworks such as
protobuf. Persist the field id in the file metadata if the schema
supports it.
When reading back a parquet file with a protobuf schema, try to
match the parquet schema to protobuf with respect to the field id.
If the parquet file was written with a previous parquet-protobuf
(prior to 1.9.1), the field id is not persisted in the parquet
file metadata. When reading these legacy files, the schema
converter falls back to matching fields by name.
Use the parameter "parquet.schema.field.with.id" to enable schema
field id persistence. When this flag is on, all schema fields
should contain an id (this is generally different from the field
index, which is the field position); otherwise it is considered
an error. This flag is set as extra metadata in the footer, and
its default value is false.

For parquet-protobuf, this flag is systematically set to true
unless it is explicitly set to false in the configuration, which
makes field id persistence the default behavior for
parquet-protobuf.
Protoc is now managed by maven plugin.
In this commit, we separate the new behavior of reading fields
by id from the old behavior of parquet-protobuf.

In the new behavior:
For a schema containing field ids (flag "parquet.schema.field.with.id"
on), parquet-protobuf now supports reading parquet files with a new
protobuf schema after fields have been removed. When reading a field
already removed from the schema, it is treated as an unknown field
(for the moment the implementation just safely ignores them).
Also, there is a stricter check on field id presence. If the footer
has the flag "parquet.schema.field.with.id" on, the protobuf reader
expects all fields in the schema to contain an id, or it will raise
an error.

This new behavior is the default, but it can be explicitly
disabled by setting the flag off in the configuration. This falls
back to the old behavior: mapping fields by name, with an error
raised if unknown fields are found.
@qinghui-xu
Contributor Author

qinghui-xu commented Jun 2, 2017

@julienledem @lukasnalezenec

So I reworked the pull request a little bit; as a whole:

  1. For the field id, use a flag "parquet.schema.field.with.id" to indicate whether field ids should be present in the schema. At write time, if the flag is on but some fields do not contain an id, it is considered an error. (This avoids problems if people try to merge files with old/new schemas.)
  2. This flag defaults to false, which means the previous behavior is the default for all frameworks (except parquet-protobuf).
  3. For protobuf, the flag is on by default, unless it is disabled explicitly in the configuration. This means that by default the protobuf field id is now preserved when writing with parquet-protobuf.
  4. When reading old-schema files (the flag is absent) with parquet-protobuf, the behavior is the same as before. When reading new-schema files (flag on), there is systematically a check on the field id, and fields are mapped by id instead of name; but we still have the possibility to fall back to the old read-by-name behavior for new-schema files, by setting the flag off explicitly in the configuration.
  5. For projection, the field id is handled transparently. If the projection is without field ids (people tend not to put them), it is decorated with ids according to the schema fields.
  6. Now if a field is removed from the protobuf schema, we can read a file written with an older version of the schema, if we are using the new behavior for parquet-protobuf.
  7. Use the protobuf maven plugin and remove the protobuf build from the travis config; this saves about 20 minutes per travis build.

Remark: regarding the removed fields, the current implementation will just ignore the value in that field. A future improvement could leverage protobuf's unknown-fields capability, such that the field value is kept in the protobuf message's unknownFieldSet.
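Point 5 above (decorating an id-less projection from the full schema) could be sketched as follows, with hypothetical toy types rather than the real parquet-mr schema classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProjectionDecorator {
    // Toy stand-in for a schema field: a name plus an optional id.
    record Field(String name, Integer id) {}

    // A projection written without ids gets each field's id filled in
    // from the full schema, matched by name; fields that already carry
    // an id are left untouched.
    static List<Field> decorate(List<Field> projection, List<Field> fullSchema) {
        Map<String, Integer> idsByName = new HashMap<>();
        for (Field f : fullSchema) {
            idsByName.put(f.name(), f.id());
        }
        List<Field> decorated = new ArrayList<>();
        for (Field p : projection) {
            Integer id = p.id() != null ? p.id() : idsByName.get(p.name());
            decorated.add(new Field(p.name(), id));
        }
        return decorated;
    }
}
```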

@julienledem
Member

I think this is getting close.
@matt-martin @lukasnalezenec @costimuraru @kgalieva: comments?

@costimuraru

@qinghui-xu @julienledem Leveraging the (protobuf) field ids is a great idea!

@qinghui-xu, would you be so kind as to rebase this PR, so we could give it a try on our platform?

@qinghui-xu
Contributor Author

Hello, @costimuraru
I would be happy to rebase this.

@costimuraru

@qinghui-xu that would be great!

@BenoitHanotte

Hello @costimuraru @qinghui-xu @julienledem
As the protobuf descriptor is already serialized in the file metadata (https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java#L132) and contains all the information required to map a protobuf field id to its name, can't we leverage this instead of changing the way we set the field id in the parquet schema?
Not only would this isolate the change to the protobuf part of the logic, it would also bring backward compatibility, as existing files already contain the descriptor in its serialized form. In this case we would only need to set a flag at read time, instead of also having to add a flag when writing.
If we set the parquet field ids according to the protobuf ids, I don't think we would be able to support schema compatibility for files written with a previous version of parquet, as the parquet schema of the file would be missing the required information.
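The alternative sketched here is to recover the id-to-name mapping from the descriptor already stored in the footer, so no write-time change is needed. A toy illustration, with the descriptor reduced to (number, name) pairs rather than real protobuf `FieldDescriptor`s:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DescriptorMapping {
    // Toy stand-in for a protobuf field descriptor: number and name only.
    record FieldDesc(int number, String name) {}

    // Derive the field-id -> field-name mapping from the descriptor
    // serialized in the footer, so existing files stay readable
    // without any new write-time flag.
    static Map<Integer, String> idToName(List<FieldDesc> descriptorFields) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (FieldDesc f : descriptorFields) {
            map.put(f.number(), f.name());
        }
        return map;
    }
}
```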

@lukasnalezenec
Contributor

lukasnalezenec commented Jun 5, 2018 via email

@qinghui-xu qinghui-xu closed this Nov 24, 2018
@qinghui-xu
Contributor Author

I am closing this because the PR seems to need a big refactoring to be mergeable into upstream, and we are not currently working on it actively.
