
Conversation

@szlta (Contributor) commented May 19, 2021

Currently Hive relies on org.apache.iceberg.mr.mapreduce.IcebergInputFormat for reading data from Iceberg. This in turn relies on Iceberg's internal reader implementations for the various file formats, such as GenericOrcReader. Unfortunately for Hive, this makes the whole read pipeline record based: data arrives as GenericRecord instances, from which Hive has to retrieve the encoded values using the Iceberg object inspectors.

From a design point of view I see 3 layers where during reading data flows upwards:

  • Execution engine: Hive
  • Table format: Iceberg
  • File format: ORC/Parquet/Avro

The current design, however, combines the two bottom layers. Although Hive already supports vectorized execution and already has vectorized reader implementations for the various file formats, with Iceberg in the picture it cannot take advantage of vectorization.

In this change I propose to swap out the bottom layer for the implementations Hive already has, so in the case of ORC: VectorizedOrcInputFormat. This produces VectorizedRowBatch instances, which is the vectorized format Hive expects. This could serve as the 'Hive' in-memory data model, something IcebergInputFormat already distinguishes on.
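To illustrate why batch-based reading matters, here is a self-contained toy sketch, not the real Hive or Iceberg classes, contrasting record-at-a-time processing (the GenericRecord-per-row pipeline) with processing a primitive column vector in the style of VectorizedRowBatch:

```java
// Toy illustration only: real Hive vectorization uses
// org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch,
// whose column vectors are primitive arrays much like the one below.
public class BatchVsRecord {

    // Record-at-a-time: each value is a boxed object,
    // mirroring the GenericRecord-per-row pipeline.
    static long sumRecords(Long[] records) {
        long sum = 0;
        for (Long r : records) {
            sum += r; // unboxing on every row
        }
        return sum;
    }

    // Batch-at-a-time: one tight loop over a primitive column
    // vector, which the JIT can optimize far more aggressively.
    static long sumBatch(long[] column, int size) {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += column[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1024; // VectorizedRowBatch.DEFAULT_SIZE is also 1024
        Long[] records = new Long[n];
        long[] column = new long[n];
        for (int i = 0; i < n; i++) {
            records[i] = (long) i;
            column[i] = i;
        }
        System.out.println(sumRecords(records) == sumBatch(column, n)); // true
    }
}
```

The per-row path pays boxing and virtual-dispatch costs on every value; the batch path amortizes them over 1024 rows at a time, which is the core win vectorization brings.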

One big obstacle is that Iceberg depends on a build of ORC with the nohive classifier. This means that classes that actually come from Hive's storage-api are shaded inside ORC under an ORC package name. In order to revert this while causing minimal disruption to other parts of Iceberg, I propose creating an intermediary module for iceberg-hive3 to depend on (rather than depending on iceberg-orc and iceberg-data directly).
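For context, re-shading of this kind is typically expressed with the Gradle Shadow plugin's relocate directive. A minimal sketch of what such a bundle module could look like follows; the module name comes from the PR, but the exact dependency wiring here is an assumption and the real build script may differ:

```groovy
// Hypothetical sketch, assuming the Gradle Shadow plugin; the actual
// iceberg-hive3-orc-bundle configuration in the PR may differ in details.
project(':iceberg-hive3-orc-bundle') {
  apply plugin: 'com.github.johnrengelman.shadow'

  dependencies {
    implementation project(':iceberg-data')
    implementation project(':iceberg-orc')
  }

  shadowJar {
    // Rename the nohive-shaded storage-api classes back to the original
    // Hive package names that Hive 3 is already linked against.
    relocate 'org.apache.orc.storage', 'org.apache.hadoop.hive'
  }
}
```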

Tracking issue: #2612

@szlta (Contributor, Author) commented May 25, 2021

cc: @rdblue @rymurr

@rymurr (Contributor) left a comment

I am not a Hive expert, so I don't have a strong opinion on the Hive-specific parts. The only thing I notice is that we have added some less-than-stellar code from Hive that I would rather see cleanly implemented. I understand this is temporary; what are the timelines for Hive 4, and for Iceberg being able to drop the copied Hive code?

```groovy
if (jdkVersion == '8') {
  // The purpose of this module is to re-shade org.apache.orc.storage to the original
  // org.apache.hadoop.hive package name. This is to be used by Hive3 for features
  // including e.g. vectorization.
  project(':iceberg-hive3-orc-bundle') {
```

How does this compare to the original Hive bundle? Do we need to support both? Apologies if that's a dumb question; I'm just not 100% sure I understand what will be in the resulting bundle and who will use it.

@szlta (Contributor, Author) replied:

Thanks for taking a look at this @rymurr!
Originally I wanted to avoid modifying any code snippets that need to be copied from Hive to Iceberg, since they are meant to be temporary. But since I don't see Hive 4 being released in the foreseeable future, I have now amended them to comply with checkstyle.

About this hive3-orc-bundle: it is something that only hive3 depends on, and its sole purpose is to re-shade the hive-storage-api classes that are currently referred to as org.apache.orc.storage... within Iceberg.
The bundle will produce iceberg-data and iceberg-orc classes with the original package-name references to storage-api classes. This is required at runtime, where the Hive3 code (from the Hive codebase) is already linked against the original class names.
The reason this is not a dependency of the iceberg-mr module is that we won't support vectorization for the Hive2 integration of Iceberg.

@pvary (Contributor) commented Jun 3, 2021

Thanks for the review @rymurr!

@rdblue, @RussellSpitzer: Would you like to take a look before we merge this change? If you have time, I would really appreciate it!

Thanks,
Peter

@pvary pvary merged commit 282b6f9 into apache:master Jun 7, 2021
@pvary (Contributor) commented Jun 7, 2021

Thanks @szlta for the PR and @rymurr for the review!

@szlta (Contributor, Author) commented Jun 7, 2021

Thanks for the reviews and the commit!

@szlta szlta mentioned this pull request Jun 21, 2021
cwsteinbach added a commit to cwsteinbach/apache-iceberg that referenced this pull request Aug 17, 2021
cwsteinbach added a commit that referenced this pull request Aug 19, 2021
* Add 0.12.0 release notes pt 2

* Add more blurbs and fix formatting.

- Add blurbs for #2565, #2583, and #2547.
- Make formatting consistent.

* Add blurb for #2613 Hive Vectorized Reader

* Reword blurbs for #2565 and #2365

* More changes based on review comments

* More updates to the 0.12.0 release notes

* Add blurb for #2232 fix parquet row group filters

* Add blurb for #2308
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR mainly solve?

Merges upstream/master to bring in recent bug fixes and optimizations.

## What changes does this MR make?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: skip writing an empty dataset instead of raising an error, apache#2960
> Flink: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs in the UI, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, gradually supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots configuration property, apache#2752
> Spark V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel support: add schema-id; Core: add schema id to snapshot
> Spark Extension: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests.