-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Hive: Vectorized ORC reads for Hive #2613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
Outdated
Show resolved
Hide resolved
mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
Outdated
Show resolved
Hide resolved
hive3/src/main/java/org/apache/iceberg/mr/hive/vector/VectorizedRowBatchIterator.java
Show resolved
Hide resolved
hive3/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java
Show resolved
Hide resolved
mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
Show resolved
Hide resolved
mr/src/main/java/org/apache/iceberg/mr/mapred/AbstractMapredIcebergRecordReader.java
Show resolved
Hide resolved
mr/src/main/java/org/apache/iceberg/mr/mapred/MapredIcebergInputFormat.java
Outdated
Show resolved
Hide resolved
mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
Show resolved
Hide resolved
rymurr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a hive expert so I don't have a strong opinion on the Hive specific stuff. the only thing I notice is we have added some less than stellar code from Hive that I would rather see cleanly implemented. I understand this is temporary, what are the timelines for Hive4 and iceberg being able to drop the copied Hive code?
| if (jdkVersion == '8') { | ||
| // The purpose of this module is to re-shade org.apache.orc.storage to the original org.apache.hadoop.hive package | ||
| // name. This is to be used by Hive3 for features including e.g. vectorization. | ||
| project(':iceberg-hive3-orc-bundle') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this compare to the original hive bundle? Do we need to support both? Apologies if thats a dumb question, I am just not 100% sure I understand what will be in this resulting bundle and who will use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look on this @rymurr!
Originally I wanted to stick to not modifying any code snippets that needs to be copied from Hive to Iceberg as it's meant to be temporary. But.. since I don't see Hive4 getting released in the foreseeable future yet I have now rather amended these as per checkstyle.
About this hive3-orc-bundle: it's something that only hive3 depends on, and the soul purpose of it is to re-shade hive-storage-api classes that are currently being referred to as org.apache.orc.storage... within Iceberg.
This bundle will produce iceberg-data, and iceberg-orc classes with the original package name references to storage-api classes which is required runtime, where Hive3 code (from Hive codebase) is already linked with the original class names.
The reason this is not a dependency of iceberg-mr module, is that we won't support vectorization for the Hive2 integration of Iceberg
hive3/src/main/java/org/apache/iceberg/mr/hive/vector/CompatibilityHiveVectorUtils.java
Outdated
Show resolved
Hide resolved
hive3/src/main/java/org/apache/iceberg/mr/hive/vector/CompatibilityHiveVectorUtils.java
Outdated
Show resolved
Hide resolved
mr/src/main/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedSupport.java
Outdated
Show resolved
Hide resolved
|
Thanks for the review @rymurr! @rdblue, @RussellSpitzer: Would you like to take a look before we merge this change? If you have time, I would really appreciate it! Thanks, |
|
Thanks for the reviews and the commit! |
* Add 0.12.0 release notes pt 2 * Add more blurbs and fix formatting. - Add blurbs for #2565, #2583, and #2547. - Make formatting consistent. * Add blurb for #2613 Hive Vectorized Reader * Reword blurbs for #2565 and #2365 * More changes based on review comments * More updates to the 0.12.0 release notes * Add blurb for #2232 fix parquet row group filters * Add blurb for #2308
Merge remote-tracking branch 'upstream/merge-master-20210816' into master ## 该MR主要解决什么? merge upstream/master,引入最近的一些bugFix和优化 ## 该MR的修改是什么? 核心关注PR: > Predicate PushDown 支持,https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files > Spark场景写入空dataset 报错问题,直接skip掉即可, apache#2960 > Flink UI补充uidPrefix到operator方便跟踪多个iceberg sink任务, apache#288 > Spark 修复nested Struct Pruning问题, apache#2877 > 可以使用Table Properties指定创建v2 format表,apache#2887 > 补充SortRewriteStrategy框架,逐步支持不同rewrite策略, apache#2609 (WIP:apache#2829) > Spark 为catalog配置hadoop属性支持, apache#2792 > Spark 针对timestamps without timezone读写支持, apache#2757 > Spark MicroBatch支持配置属性skip delete snapshots, apache#2752 > Spark V2 RewriteDatafilesAction 支持 > Core: Add validation for row-level deletes with rewrites, apache#2865 > schema time travel 功能相关,补充schema-id, Core: add schema id to snapshot > Spark Extension支持identifier fields操作, apache#2560 > Parquet: Update to 1.12.0, apache#2441 > Hive: Vectorized ORC reads for Hive, apache#2613 > Spark: Add an action to remove all referenced files, apache#2415 ## 该MR是如何测试的? UT
Currently Hive relies on org.apache.iceberg.mr.mapreduce.IcebergInputFormat for reading data from Iceberg. This in turn relies on the internal reader implementations for the various file formats, such as GenericOrcReader(s). Unfortunately for Hive, the whole read pipeline is record based, whereas GenericRecord instances hold the information on which Hive has to use the Iceberg object inspectors to retrieve the encoded data from.
From a design point of view I see 3 layers where during reading data flows upwards:
The current design however combines the two bottom layers, and although Hive already has support for vectorized execution, and already has implementations for vectorized reading from various file formats; with Iceberg in the picture it can't have the advantages of vectorization.
In this change I propose to swap out the bottom layer with the implementations Hive already has, so in the case of ORC: with VectorizedOrcInputFormat. This produces VectorizedRowBatch instances which Hive expects as its vectorized format. This could be used as the 'Hive' in memory data model, something IcebergInputFormat already distinguishes upon.
One big obstacle is that Iceberg depends on a type of ORC that has the nohive classifier. This means that classes that are actually coming from Hive/storage-api are shaded inside ORC with an ORC classname. In order to revert this while also causing the minimal disruption to other parts of Iceberg I propose to create an intermediary module for iceberg-hive3 to depend on (rather then depending on iceberg-orc and iceberg-data)
Tracking issue: #2612