ORC column map fix #227
Conversation
Nice work, @edgarRd! I think this is pretty close. Eventually, I'd also like to apply some of the schema tests that we use for Parquet and Avro to ORC to validate that all projections work. Seems like a good follow-up. Could you also document the serialization format for the column map, and what property you're using? I'd like to include this in the spec.
@rdblue, @edgarRd: I was thinking about this and wanted to get your input here. I wonder if the implementation would become simpler and cleaner if we maintained the column ids close to the ORC schema. For instance, we could do something similar to the Avro schema, where the column ids are maintained as metadata properties tied to each schema element. This, of course, is not possible today with ORC's TypeDescription. The big benefit I see is that we don't have to expose two data structures or spread the schema information across two classes. Another potential benefit of this approach is that, if we can get the ORC community to adopt a patch which attaches metadata per field, then we can completely get rid of our own mapping class.
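For reference, the Avro precedent mentioned above looks roughly like the following. This is a minimal sketch in Java, assuming Avro's `SchemaBuilder` and `Schema.Field#addProp`; the `field-id` property name follows Iceberg's Avro convention, and the record shape is purely illustrative.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AvroFieldIdExample {
  public static void main(String[] args) {
    // Illustrative record; any Avro schema element can carry properties.
    Schema record = SchemaBuilder.record("example").fields()
        .name("id").type().longType().noDefault()
        .name("data").type().nullable().stringType().noDefault()
        .endRecord();

    // Attach the Iceberg field id directly to each schema element,
    // so a single schema object carries both structure and ids.
    record.getField("id").addProp("field-id", 1);
    record.getField("data").addProp("field-id", 2);

    // Any reader can recover the id from the schema itself,
    // with no side-car map to keep in sync with the file.
    System.out.println(record.getField("id").getObjectProp("field-id")); // prints 1
  }
}
```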
@rdsr, are you proposing that we attach an alternative schema to a data file? If so, why not just write the JSON form of the Iceberg schema? I like the idea of getting the changes we need into upstream ORC (@omalley may be able to help there), but I think we should get this working before those are written, committed, and released.
The mechanism to persist field ids could be anything, depending on our implementation, and it would be neatly encapsulated in the ORC class. Today we serialize a map; we could just as well serialize JSON, as you mentioned. But the schema class that we construct should have the field ids, as described above. I'm just curious whether having a single schema class with ids, instead of a map plus a schema class, would make current and future changes simpler. I'm also totally fine with trying to fix ORC first before we make any changes here.
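For context, here's a minimal sketch of what serializing the id map into the ORC file's own metadata could look like, assuming ORC's `Writer#addUserMetadata`; both the property key and the `id:column` string format are hypothetical, not necessarily what this PR uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import org.apache.orc.Writer;

public class ColumnMapProperty {
  // Hypothetical property key; the key this PR actually uses may differ.
  private static final String COLUMN_MAP_KEY = "iceberg.column.ids";

  // Serialize an Iceberg-field-id -> ORC-column map as "1:0,2:1,..." and
  // store it in the file's user metadata so it travels with the file.
  static void writeColumnMap(Writer writer, Map<Integer, Integer> idToOrcColumn) {
    String serialized = new TreeMap<>(idToOrcColumn).entrySet().stream()
        .map(entry -> entry.getKey() + ":" + entry.getValue())
        .collect(Collectors.joining(","));
    writer.addUserMetadata(COLUMN_MAP_KEY,
        ByteBuffer.wrap(serialized.getBytes(StandardCharsets.UTF_8)));
  }
}
```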
@rdblue @rdsr I agree on getting this working before trying to fix ORC. As long as we agree on how to persist the schema in ORC files, since that'll go in the spec, we can later change the implementation to something more maintainable that avoids the additional mapping class. I can follow up in the ORC project to get the metadata into fields. While we could store the JSON form of the Iceberg schema, I think we'd still need the mapping of Iceberg column ids to ORC columns. Btw, thanks for taking a look at this!
I've filed a JIRA for adding type annotations to ORC; please feel free to comment over there: https://issues.apache.org/jira/browse/ORC-522. Please don't make a JSON file that needs to be parsed before opening an ORC file in Iceberg; that is really slow and painful. Using the ORC file metadata is much, much better, since that means it stays in sync with the file. Sorry about not getting back to the Iceberg ORC support to finish the schema evolution. It is unfortunate that Iceberg uses id-based schema evolution rather than the name-based evolution that ORC natively supports. My original plan for ORC in Iceberg was to use the column ids to munge the column names so that ORC's schema evolution would do the right thing. So, if the file had column "foo" as id 5 and the reader wanted "bar" as id 5, the Iceberg reader would munge "bar" to "foo" and pass that down to ORC, and ORC's schema evolution would do the appropriate translations.
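A rough sketch of that name-munging idea, under the assumption that we already have id-to-name maps for both the file schema and the read schema; `mungeReadNames` is hypothetical glue code, not ORC API.

```java
import java.util.HashMap;
import java.util.Map;

public class NameMunging {
  // If the file wrote id 5 as "foo" and the reader asks for id 5 as "bar",
  // rewrite the read schema to ask for "foo" so that ORC's name-based
  // schema evolution lines the columns up.
  static Map<Integer, String> mungeReadNames(Map<Integer, String> fileIdToName,
                                             Map<Integer, String> readIdToName) {
    Map<Integer, String> munged = new HashMap<>();
    readIdToName.forEach((id, readName) ->
        // Prefer the file's name for this id; fall back to the read name
        // for columns that don't exist in the file.
        munged.put(id, fileIdToName.getOrDefault(id, readName)));
    return munged;
  }
}
```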
@omalley, I think the proposal is not to write separate JSON files; it is to embed the metadata in ORC's file-level properties. My point was that, as long as we are encoding type ID and required/optional, we may as well just store the Iceberg schema that was used to write the file.
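A minimal sketch of that alternative, embedding the Iceberg schema JSON in the ORC file's user metadata via Iceberg's `SchemaParser`; the `iceberg.schema` key is assumed here for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.orc.Reader;
import org.apache.orc.Writer;

public class EmbeddedIcebergSchema {
  // Assumed metadata key, for illustration only.
  private static final String SCHEMA_KEY = "iceberg.schema";

  // Write the full Iceberg schema (ids, names, required/optional) as JSON
  // into the ORC file's own metadata, so it always stays in sync with the file.
  static void writeSchema(Writer writer, Schema schema) {
    String json = SchemaParser.toJson(schema);
    writer.addUserMetadata(SCHEMA_KEY,
        ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8)));
  }

  // Read it back when opening the file; assumes the key is present.
  static Schema readSchema(Reader reader) {
    ByteBuffer buffer = reader.getMetadataValue(SCHEMA_KEY);
    return SchemaParser.fromJson(StandardCharsets.UTF_8.decode(buffer).toString());
  }
}
```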
FYI, there's a PR out for metadata properties attached to TypeDescription: apache/orc#410. Thanks @omalley!
Use the ORC schema to read, based on the Iceberg-to-ORC mapping if found.
* Remove workaround on timestamp nanos as per ORC-546: apache/orc#420
* Use ORC TypeDescription attributes for column mapping (see the sketch below)
* Rename `toOrc` and `toIceberg` to `convert` for consistency
* Throw an error on projection if a required field does not exist
* Use withZone on timestamp conversions
* Rename OrcColumn to OrcField
* Fix typos
* Use assertThrows from Iceberg AssertHelpers
* Add support for the time type
* Test round-trip for all Iceberg types
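The TypeDescription-attributes item above builds on apache/orc#410 (available in ORC 1.6+). Here's a minimal sketch of the attribute round-trip; the `iceberg.id` and `iceberg.required` attribute keys are illustrative.

```java
import org.apache.orc.TypeDescription;

public class TypeAttributeExample {
  public static void main(String[] args) {
    TypeDescription struct = TypeDescription.createStruct();

    // Attach the Iceberg field id (and required flag) to each ORC type,
    // so the mapping lives inside the ORC schema itself.
    TypeDescription idType = TypeDescription.createLong();
    idType.setAttribute("iceberg.id", "1");
    idType.setAttribute("iceberg.required", "true");
    struct.addField("id", idType);

    TypeDescription dataType = TypeDescription.createString();
    dataType.setAttribute("iceberg.id", "2");
    struct.addField("data", dataType);

    // The ids round-trip with the schema; no separate mapping class needed.
    System.out.println(idType.getAttributeValue("iceberg.id")); // prints 1
  }
}
```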
+1 Thanks for all your work on this, @edgarRd! I'm merging it. Also, thanks to @rdsr, @shardulm94, @lxynov, and @omalley for reviewing!
Thanks, @rdblue! Just a reminder: could you also update https://iceberg.apache.org/ so that we can reference the updated spec?
This PR addresses #213
Thanks!
cc @rdsr