ORC column map fix #227
Conversation
Nice work, @edgarRd! I think this is pretty close. Eventually, I'd also like to apply some of the schema tests that we use for Parquet and Avro to ORC to validate that all projections work. Seems like a good follow-up. Could you also document the serialization format for the column map, and what property you're using? I'd like to include this in the spec.
@rdblue, @edgarRd: I was thinking about this and wanted to get your input here. I wonder if the implementation would become simpler and cleaner if we maintained the column ids close to the ORC schema. For instance, we could do something similar to the Avro schema, where the column ids are maintained as metadata properties tied to each schema element. This, of course, is not possible today with ORC's TypeDescription. The big benefit I see is that we don't have to expose two data structures or spread the schema information across two classes. Another potential benefit of this approach is that, if we can get the ORC community to adopt a patch which attaches metadata per field, then we can completely get rid of our own mapping class.
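For reference, the Avro precedent mentioned above looks roughly like the following. This is a minimal sketch in Java, assuming Avro's `SchemaBuilder` and `Schema.Field#addProp`; the `field-id` property name follows Iceberg's Avro convention, and the record shape is purely illustrative.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AvroFieldIdExample {
  public static void main(String[] args) {
    // Illustrative record; any Avro schema element can carry properties.
    Schema record = SchemaBuilder.record("example").fields()
        .name("id").type().longType().noDefault()
        .name("data").type().nullable().stringType().noDefault()
        .endRecord();

    // Attach the Iceberg field id directly to each schema element,
    // so a single schema object carries both structure and ids.
    record.getField("id").addProp("field-id", 1);
    record.getField("data").addProp("field-id", 2);

    // Any reader can recover the id from the schema itself,
    // with no side-car map to keep in sync with the file.
    System.out.println(record.getField("id").getObjectProp("field-id")); // prints 1
  }
}
```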
@rdsr, are you proposing that we attach an alternative schema to a data file? If so, why not just write the JSON form of the Iceberg schema? I like the idea of getting the changes we need into upstream ORC (@omalley may be able to help there), but I think we should get this working before those are written, committed, and released.
The mechanism to persist field ids could be anything, depending on our implementation, and it would be neatly encapsulated in the ORC class. Today we serialize a map; we could just as well serialize JSON, as you mentioned. But the schema class that we construct should have the field ids, as described above. I'm just curious whether having a single schema class with ids, instead of a map plus a schema class, would make current and future changes simpler. I'm also totally fine with trying to fix ORC first before we make any changes here.
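For context, here's a minimal sketch of what serializing the id map into the ORC file's own metadata could look like, assuming ORC's `Writer#addUserMetadata`; both the property key and the `id:column` string format are hypothetical, not necessarily what this PR uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import org.apache.orc.Writer;

public class ColumnMapProperty {
  // Hypothetical property key; the key this PR actually uses may differ.
  private static final String COLUMN_MAP_KEY = "iceberg.column.ids";

  // Serialize an Iceberg-field-id -> ORC-column map as "1:0,2:1,..." and
  // store it in the file's user metadata so it travels with the file.
  static void writeColumnMap(Writer writer, Map<Integer, Integer> idToOrcColumn) {
    String serialized = new TreeMap<>(idToOrcColumn).entrySet().stream()
        .map(entry -> entry.getKey() + ":" + entry.getValue())
        .collect(Collectors.joining(","));
    writer.addUserMetadata(COLUMN_MAP_KEY,
        ByteBuffer.wrap(serialized.getBytes(StandardCharsets.UTF_8)));
  }
}
```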
@rdblue @rdsr I agree on getting this working before trying to fix ORC. As long as we agree on how to persist the schema in ORC files, since that'll go in the spec, we can later change the implementation to something more maintainable that avoids the additional mapping class. I can follow up in the ORC project to get the metadata into fields. While we could store the JSON form of the Iceberg schema, I think we'd still need the mapping of Iceberg column ids to ORC columns. Btw, thanks for taking a look at this!
I've filed a JIRA for adding type annotations to ORC; please feel free to comment over there: https://issues.apache.org/jira/browse/ORC-522. Please don't make a JSON file that needs to be parsed before opening an ORC file in Iceberg; that is really slow and painful. Using the ORC file metadata is much, much better, since that means it stays in sync with the file. Sorry about not getting back to the Iceberg ORC support to finish the schema evolution. It is unfortunate that Iceberg uses id-based schema evolution rather than the name-based evolution that ORC natively supports. My original plan for ORC in Iceberg was to use the column ids to munge the column names so that ORC's schema evolution would do the right thing. So, if the file had column "foo" as id 5 and the reader wanted "bar" as id 5, the Iceberg reader would munge "bar" to "foo" and pass that down to ORC, and ORC's schema evolution would do the appropriate translations.
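A rough sketch of that name-munging idea, under the assumption that we already have id-to-name maps for both the file schema and the read schema; `mungeReadNames` is hypothetical glue code, not ORC API.

```java
import java.util.HashMap;
import java.util.Map;

public class NameMunging {
  // If the file wrote id 5 as "foo" and the reader asks for id 5 as "bar",
  // rewrite the read schema to ask for "foo" so that ORC's name-based
  // schema evolution lines the columns up.
  static Map<Integer, String> mungeReadNames(Map<Integer, String> fileIdToName,
                                             Map<Integer, String> readIdToName) {
    Map<Integer, String> munged = new HashMap<>();
    readIdToName.forEach((id, readName) ->
        // Prefer the file's name for this id; fall back to the read name
        // for columns that don't exist in the file.
        munged.put(id, fileIdToName.getOrDefault(id, readName)));
    return munged;
  }
}
```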
@omalley, I think the proposal is not to write separate JSON files; it is to embed the metadata in ORC's file-level properties. My point was that, as long as we are encoding type ID and required/optional, we may as well just store the Iceberg schema that was used to write the file.
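A minimal sketch of that alternative, embedding the Iceberg schema JSON in the ORC file's user metadata via Iceberg's `SchemaParser`; the `iceberg.schema` key is assumed here for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.orc.Reader;
import org.apache.orc.Writer;

public class EmbeddedIcebergSchema {
  // Assumed metadata key, for illustration only.
  private static final String SCHEMA_KEY = "iceberg.schema";

  // Write the full Iceberg schema (ids, names, required/optional) as JSON
  // into the ORC file's own metadata, so it always stays in sync with the file.
  static void writeSchema(Writer writer, Schema schema) {
    String json = SchemaParser.toJson(schema);
    writer.addUserMetadata(SCHEMA_KEY,
        ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8)));
  }

  // Read it back when opening the file; assumes the key is present.
  static Schema readSchema(Reader reader) {
    ByteBuffer buffer = reader.getMetadataValue(SCHEMA_KEY);
    return SchemaParser.fromJson(StandardCharsets.UTF_8.decode(buffer).toString());
  }
}
```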
FYI, there's a PR out for metadata properties attached to TypeDescription: apache/orc#410. Thanks @omalley!
Use the ORC schema to read, based on the Iceberg-to-ORC mapping if found.
* Remove workaround on timestamp nanos as per ORC-546: apache/orc#420
* Use ORC TypeDescription attributes for column mapping (see the sketch below)
* Rename `toOrc` and `toIceberg` to `convert` for consistency
* Throw an error on projection if a required field does not exist
* Use withZone on timestamp conversions
* Rename OrcColumn to OrcField
* Fix typos
* Use assertThrows from Iceberg AssertHelpers
* Add support for the time type
* Test round-trip for all Iceberg types
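The TypeDescription-attributes item above builds on apache/orc#410 (available in ORC 1.6+). Here's a minimal sketch of the attribute round-trip; the `iceberg.id` and `iceberg.required` attribute keys are illustrative.

```java
import org.apache.orc.TypeDescription;

public class TypeAttributeExample {
  public static void main(String[] args) {
    TypeDescription struct = TypeDescription.createStruct();

    // Attach the Iceberg field id (and required flag) to each ORC type,
    // so the mapping lives inside the ORC schema itself.
    TypeDescription idType = TypeDescription.createLong();
    idType.setAttribute("iceberg.id", "1");
    idType.setAttribute("iceberg.required", "true");
    struct.addField("id", idType);

    TypeDescription dataType = TypeDescription.createString();
    dataType.setAttribute("iceberg.id", "2");
    struct.addField("data", dataType);

    // The ids round-trip with the schema; no separate mapping class needed.
    System.out.println(idType.getAttributeValue("iceberg.id")); // prints 1
  }
}
```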
+1 Thanks for all your work on this, @edgarRd! I'm merging it. Also, thanks to @rdsr, @shardulm94, @lxynov, and @omalley for reviewing!
Thanks, @rdblue! Just a reminder: could you also update https://iceberg.apache.org/ so that we can reference the updated spec?
This PR addresses #213
Thanks!
cc @rdsr