
Conversation

@szlta (Contributor) commented May 19, 2021

Currently Hive relies on org.apache.iceberg.mr.mapreduce.IcebergInputFormat for reading data from Iceberg. This in turn relies on Iceberg's internal reader implementations for the various file formats, such as GenericOrcReader. Unfortunately for Hive, this makes the whole read pipeline record based: data arrives as GenericRecord instances, from which Hive has to retrieve the encoded values using the Iceberg object inspectors.

From a design point of view I see 3 layers where during reading data flows upwards:

  • Execution engine: Hive
  • Table format: Iceberg
  • File format: ORC/Parquet/Avro

The current design, however, combines the two bottom layers. Although Hive already supports vectorized execution and already has vectorized reader implementations for the various file formats, with Iceberg in the picture it cannot take advantage of vectorization.

In this change I propose to swap out the bottom layer for the implementations Hive already has, so in the case of ORC: VectorizedOrcInputFormat. This produces VectorizedRowBatch instances, which is the vectorized format Hive expects. This could serve as the 'Hive' in-memory data model, something IcebergInputFormat already distinguishes on.
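To illustrate why batch-based reading matters, here is a self-contained toy sketch, not the real Hive or Iceberg classes, contrasting record-at-a-time processing (the GenericRecord-per-row pipeline) with processing a primitive column vector in the style of VectorizedRowBatch:

```java
// Toy illustration only: real Hive vectorization uses
// org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch,
// whose column vectors are primitive arrays much like the one below.
public class BatchVsRecord {

    // Record-at-a-time: each value is a boxed object,
    // mirroring the GenericRecord-per-row pipeline.
    static long sumRecords(Long[] records) {
        long sum = 0;
        for (Long r : records) {
            sum += r; // unboxing on every row
        }
        return sum;
    }

    // Batch-at-a-time: one tight loop over a primitive column
    // vector, which the JIT can optimize far more aggressively.
    static long sumBatch(long[] column, int size) {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += column[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1024; // VectorizedRowBatch.DEFAULT_SIZE is also 1024
        Long[] records = new Long[n];
        long[] column = new long[n];
        for (int i = 0; i < n; i++) {
            records[i] = (long) i;
            column[i] = i;
        }
        System.out.println(sumRecords(records) == sumBatch(column, n)); // true
    }
}
```

The per-row path pays boxing and virtual-dispatch costs on every value; the batch path amortizes them over 1024 rows at a time, which is the core win vectorization brings.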

One big obstacle is that Iceberg depends on a build of ORC with the nohive classifier. This means that classes that actually come from Hive's storage-api are shaded inside ORC under an ORC package name. In order to revert this while causing minimal disruption to other parts of Iceberg, I propose creating an intermediary module for iceberg-hive3 to depend on (rather than depending on iceberg-orc and iceberg-data directly).
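For context, re-shading of this kind is typically expressed with the Gradle Shadow plugin's relocate directive. A minimal sketch of what such a bundle module could look like follows; the module name comes from the PR, but the exact dependency wiring here is an assumption and the real build script may differ:

```groovy
// Hypothetical sketch, assuming the Gradle Shadow plugin; the actual
// iceberg-hive3-orc-bundle configuration in the PR may differ in details.
project(':iceberg-hive3-orc-bundle') {
  apply plugin: 'com.github.johnrengelman.shadow'

  dependencies {
    implementation project(':iceberg-data')
    implementation project(':iceberg-orc')
  }

  shadowJar {
    // Rename the nohive-shaded storage-api classes back to the original
    // Hive package names that Hive 3 is already linked against.
    relocate 'org.apache.orc.storage', 'org.apache.hadoop.hive'
  }
}
```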

Tracking issue: #2612

@szlta (Contributor, Author) commented May 25, 2021

cc: @rdblue @rymurr

@rymurr (Contributor) left a comment

I am not a Hive expert, so I don't have a strong opinion on the Hive-specific parts. The only thing I notice is that we have added some less-than-stellar code from Hive that I would rather see cleanly implemented. I understand this is temporary; what are the timelines for Hive 4, and for Iceberg being able to drop the copied Hive code?

```groovy
if (jdkVersion == '8') {
  // The purpose of this module is to re-shade org.apache.orc.storage to the original
  // org.apache.hadoop.hive package name. This is to be used by Hive3 for features
  // including e.g. vectorization.
  project(':iceberg-hive3-orc-bundle') {
```

How does this compare to the original Hive bundle? Do we need to support both? Apologies if that's a dumb question; I'm just not 100% sure I understand what will be in the resulting bundle and who will use it.

@szlta (Contributor, Author) replied:

Thanks for taking a look at this @rymurr!
Originally I wanted to avoid modifying any code snippets that need to be copied from Hive to Iceberg, since they are meant to be temporary. But since I don't see Hive 4 being released in the foreseeable future, I have now amended them to comply with checkstyle.

About this hive3-orc-bundle: it is something that only hive3 depends on, and its sole purpose is to re-shade the hive-storage-api classes that are currently referred to as org.apache.orc.storage... within Iceberg.
The bundle will produce iceberg-data and iceberg-orc classes with the original package-name references to storage-api classes. This is required at runtime, where the Hive3 code (from the Hive codebase) is already linked against the original class names.
The reason this is not a dependency of the iceberg-mr module is that we won't support vectorization for the Hive2 integration of Iceberg.

@pvary (Contributor) commented Jun 3, 2021

Thanks for the review @rymurr!

@rdblue, @RussellSpitzer: Would you like to take a look before we merge this change? If you have time, I would really appreciate it!

Thanks,
Peter

@pvary pvary merged commit 282b6f9 into apache:master Jun 7, 2021
@pvary (Contributor) commented Jun 7, 2021

Thanks @szlta for the PR and @rymurr for the review!

@szlta (Contributor, Author) commented Jun 7, 2021

Thanks for the reviews and the commit!

@szlta szlta mentioned this pull request Jun 21, 2021
cwsteinbach added a commit to cwsteinbach/apache-iceberg that referenced this pull request Aug 17, 2021
cwsteinbach added a commit that referenced this pull request Aug 19, 2021
* Add 0.12.0 release notes pt 2

* Add more blurbs and fix formatting.

- Add blurbs for #2565, #2583, and #2547.
- Make formatting consistent.

* Add blurb for #2613 Hive Vectorized Reader

* Reword blurbs for #2565 and #2365

* More changes based on review comments

* More updates to the 0.12.0 release notes

* Add blurb for #2232 fix parquet row group filters

* Add blurb for #2308
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR mainly solve?

Merges upstream/master to bring in recent bug fixes and optimizations.

## What changes does this MR make?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: skip writing an empty dataset instead of raising an error, apache#2960
> Flink: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs in the UI, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, gradually supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots configuration property, apache#2752
> Spark V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel support: add schema-id; Core: add schema id to snapshot
> Spark Extension: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests.