[UMBRELLA] Support ORC Storage #14450

Description

[https://github.com//issues/68]

#155


Comments

15/Mar/19 04:59;ambition119;I could try to fix this issue, thanks.;;;


18/Mar/19 10:37;taherk77;As I understand it, the ORC writer should implement HoodieStorageWriter. However, what I wanted to confirm is: just like HoodieParquetWriter writes parquet files from Avro records, should the Hudi ORC writer class follow the same pattern?;;;


18/Mar/19 12:50;ambition119;I have already started developing this, [~taherk77]; are you also planning to work on it?;;;


18/Mar/19 13:03;taherk77;[~ambition119] I think an easier way to do this would be to use the createWriter function in [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java], where you can just pass the schema and get the writer back.;;;
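A minimal sketch of that approach, assuming the standalone Apache ORC API (org.apache.orc) rather than the Hive-internal class linked above; the schema string here is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriterSketch {
  // Open an ORC writer for a given path and schema; in Hudi the schema would be
  // derived from the table's Avro schema rather than hard-coded like this.
  public static Writer openWriter(Configuration conf, Path path) throws java.io.IOException {
    TypeDescription schema = TypeDescription.fromString("struct<_row_key:string,ts:bigint>");
    return OrcFile.createWriter(path, OrcFile.writerOptions(conf).setSchema(schema));
  }
}
```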


18/Mar/19 13:04;taherk77;[~ambition119] We can do this together if you want to!;;;


18/Mar/19 18:05;vinoth;Great to see this picking up steam! :)

Maybe you both can do a HIP together, if this becomes a large enough project (which it could, as you figure out the query side of things)?;;;


20/Mar/19 02:07;ambition119;[~taherk77] thanks for your suggestion; the new code is here:

[https://github.com/ambition119/incubator-hudi/blob/dd67c4f92e98d9457349003dfd9bd68776b276b3/hoodie-client/src/main/java/com/uber/hoodie/io/storage/HoodieOrcWriter.java]

[~vc] All the best; I hope Hudi keeps getting better and better.;;;


20/Mar/19 06:19;taherk77;Hi [~ambition119], the code looks good. This is what I was talking about: if we use the Hive ORC writer, much of our work should be easy.

However, in HoodieOrcConfig we should initialize the defaults per the standard ORC settings.

Follow this link https://orc.apache.org/docs/hive-config.html#configuration-properties and take a look at the section where the default properties for stripe size, block size, and row index stride are given. Also, in my opinion we should set the default ORC compression to ZLIB in HoodieOrcConfig, since ORC works best with ZLIB.

Include a unit test case for this and we should be good to go on this one. [~vc] can then advise us on how to close this.;;;
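For reference, a minimal sketch of what such defaults could look like, assuming the standalone Apache ORC writer API (org.apache.orc); the constant values mirror the defaults documented on the page linked above, and the class and field names are illustrative, not the actual HoodieOrcConfig code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;

// Illustrative holder for ORC defaults; names and wiring are assumptions, not Hudi's HoodieOrcConfig.
public class OrcDefaultsSketch {
  static final long DEFAULT_STRIPE_SIZE = 64L * 1024 * 1024;   // orc.stripe.size default (64 MB)
  static final long DEFAULT_BLOCK_SIZE = 256L * 1024 * 1024;   // orc.block.size default (256 MB)
  static final int DEFAULT_ROW_INDEX_STRIDE = 10_000;          // orc.row.index.stride default
  static final CompressionKind DEFAULT_COMPRESSION = CompressionKind.ZLIB;

  static OrcFile.WriterOptions defaultWriterOptions(Configuration conf) {
    return OrcFile.writerOptions(conf)
        .stripeSize(DEFAULT_STRIPE_SIZE)
        .blockSize(DEFAULT_BLOCK_SIZE)
        .rowIndexStride(DEFAULT_ROW_INDEX_STRIDE)
        .compress(DEFAULT_COMPRESSION);
  }
}
```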


20/Mar/19 19:10;vinoth;Supporting ORC is a major step for Hudi. So thanks for attempting this!

One main aspect to consider is all the read paths, and ensuring things continue to work.

I will list them out here, AFAIK.

  • Currently, we use HoodieBloomIndex for indexing, which sort of implicitly assumes the file is a parquet file. This would definitely break.
  • HoodieInputFormat and HoodieRealtimeInputFormat both subclass ParquetInputFormat, which helps Hive execute queries well. We need to think about how to support ORC ReadOptimized views and how to merge ORC base files and log files.
  • The same goes for testing the Spark and Presto implementations.

I feel the scope is large here, so it's probably better if you both start a HIP around this?

[https://cwiki.apache.org/confluence/display/HUDI/Hudi+Improvement+Plan+Details+and+Process];;;


27/Mar/19 07:52;ambition119;[~taherk77] Can we work together to complete this HIP? Thanks.;;;


27/Mar/19 07:55;ambition119;I fixed HoodieOrcConfig; details here:

https://github.com/ambition119/incubator-hudi/blob/41162b5b217fc1de8e5d3d9448db176e09eab5fe/hoodie-client/src/main/java/com/uber/hoodie/io/storage/HoodieOrcConfig.java;;;


27/Mar/19 17:40;taherk77;Sure we can; we'll start the HIP on this.;;;


27/Mar/19 17:41;taherk77;[~ambition119] Defaults look good.;;;


27/Mar/19 23:24;vinoth;Awesome! Looking forward to the HIP!;;;


28/Mar/19 11:12;ambition119;[~taherk77] Happy for the HIP work to be led by you, thanks!;;;


29/Mar/19 09:34;taherk77;[~vc] [~ambition119] A small HIP has been prepared here [https://cwiki.apache.org/confluence/display/HUDI/2019/03/29/Hudi+ORC+storage];;;


02/Apr/19 06:44;ambition119;[~taherk77] Is the next task to develop a new implementation of HoodieInputFormat to support ORC?;;;


02/Apr/19 12:52;taherk77;So we have more tasks here:

  • Making HoodieBloomIndex work with ORC; for now it strictly works with parquet.
  • Making ORC work with HoodieRealtimeInputFormat.
  • Letting Hudi know the file format: we should have something like --orc or --parquet, and Hudi should be able to tell which file format it has to start with. Check the HoodieDeltaStreamer class.;;;


02/Apr/19 12:59;taherk77;On the 2nd point: HoodieRealtimeInputFormat extends HoodieInputFormat, which in turn extends MapredParquetInputFormat. Try changing that to extend just FileInputFormat, and then at run time we can pass either OrcInputFormat or MapredParquetInputFormat. Try this out and see how it fits.;;;
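A rough sketch of that delegation idea, with hypothetical class names; the unchecked cast makes the open question explicit, namely that the ORC and parquet readers produce different value types that a real implementation would still have to reconcile:

```java
import java.io.IOException;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical sketch: extend FileInputFormat directly and delegate to either
// format at run time, instead of subclassing MapredParquetInputFormat.
public class DelegatingHoodieInputFormatSketch extends FileInputFormat<NullWritable, ArrayWritable> {

  private final InputFormat<NullWritable, ?> delegate;

  public DelegatingHoodieInputFormatSketch(boolean useOrc) {
    if (useOrc) {
      this.delegate = new OrcInputFormat();
    } else {
      this.delegate = new MapredParquetInputFormat();
    }
  }

  // getSplits() is inherited from FileInputFormat purely for brevity in this sketch;
  // a real implementation would delegate split computation as well.

  @Override
  @SuppressWarnings("unchecked")
  public RecordReader<NullWritable, ArrayWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Unchecked cast: ORC record readers produce OrcStruct values while parquet
    // produces ArrayWritable, so the record types still need to be reconciled.
    return (RecordReader<NullWritable, ArrayWritable>) delegate.getRecordReader(split, job, reporter);
  }
}
```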


02/Apr/19 13:09;taherk77;[~vc] Is there a reason why MapredParquetInputFormat is used for HoodieInputFormat instead of ParquetInputFormat? Why are we not using ParquetInputFormat, which implements the new mapreduce package rather than mapred?;;;


02/Apr/19 18:36;vinoth;[~taherk77] yes.. that's the class used by Hive to recognize a table as a parquet table. When it does so, it turns on a few optimizations.. MapredParquetInputFormat wraps ParquetInputFormat internally.

For ORC, we have to create new input format classes, say HoodieOrcInputFormat / HoodieOrcRealtimeInputFormat.. To make some initial progress, you can just copy code around and fix HiveSyncTool to register the tables using these.. We can figure out how to restructure the code in a backwards-compatible way once you know what works..

Will review the HIP shortly!;;;


03/Apr/19 18:18;vinoth;I have copied over the HIP to Google Docs

[https://docs.google.com/document/d/13bAc9A7U__am_pWVxKtqSeLUsOWXpx3sV02n0qAmwws/edit#|https://docs.google.com/document/d/13bAc9A7U__am_pWVxKtqSeLUsOWXpx3sV02n0qAmwws/edit]

for better commenting.. the outline seems OK to me.. if you can flesh it out with more details, especially on the query-side integration, that would be great;;;


03/Apr/19 18:40;vinoth;can you also start a DISCUSS thread on the ML, so people look at the HIP when ready;;;


04/Apr/19 08:37;ambition119;The DISCUSS thread has been initiated, and I agree with creating the new HoodieOrcInputFormat / HoodieOrcRealtimeInputFormat classes.

HoodieCreateHandle / HoodieMergeHandle also use HoodieParquetWriter by default, so should I add a writerType field to choose between HoodieParquetWriter and HoodieOrcWriter?;;;


08/Apr/19 12:18;ambition119;[~taherk77] [~vc] I have pushed a new commit with the development so far:

[https://github.com/ambition119/incubator-hudi/commit/1574f2d76a04de38e69464528b9df7638fc02c0f]

If you have time, please give me some suggestions. Thanks!;;;


09/Apr/19 17:17;vinoth;[~ambition119] can you open a PR against master? You can tag it as WIP, like the other PRs you'd see. Happy to review that.. We can also merge the support in smaller phases, as long as parquet continues to work and the ORC stuff can be turned off by a flag..

> HoodieCreateHandle / HoodieMergeHandle also use HoodieParquetWriter by default, so should I add a writerType field to choose between HoodieParquetWriter and HoodieOrcWriter?

There is an abstraction for StorageWriter which is picked by the handles dynamically, I think? If not, what you mention makes total sense to me.;;;
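For illustration, a writer abstraction keyed on the base file format could look like the following sketch; every name here (BaseFileFormat, HoodieStorageWriter, newWriter) is hypothetical and not the actual Hudi StorageWriter API:

```java
// Hypothetical sketch of a storage-writer factory that branches on the base file
// format; the interface and class names are illustrative, not Hudi's actual API.
public final class StorageWriterFactorySketch {

  public enum BaseFileFormat { PARQUET, ORC }

  public interface HoodieStorageWriter<R> extends AutoCloseable {
    void write(String recordKey, R record) throws java.io.IOException;
  }

  public static <R> HoodieStorageWriter<R> newWriter(
      BaseFileFormat format /* plus path, schema and write config in a real factory */) {
    switch (format) {
      case ORC:
        // return new HoodieOrcWriter<>(...);      // ORC-backed writer, as proposed above
        throw new UnsupportedOperationException("ORC writer not wired up in this sketch");
      case PARQUET:
      default:
        // return new HoodieParquetWriter<>(...);  // the existing parquet-backed writer
        throw new UnsupportedOperationException("Parquet writer not wired up in this sketch");
    }
  }
}
```

If the handles always go through such a factory, adding ORC would mostly mean adding the new case, rather than threading a writerType through each handle.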


09/Apr/19 17:17;vinoth;sorry for being scarce.. Still recovering from illness. :/ ;;;


03/Oct/19 04:32;vinoth;[~ambition119] are you still interested in working on this?..;;;


11/Nov/19 06:09;ambition119;[~vinoth] Sorry, I am not working on this anymore. Thank you!;;;


08/Apr/20 16:24;vinoth;[~lamberken] do you think we can target this work for 0.6.0?;;;


08/Apr/20 19:35;lamber-ken;[~vinoth], I'm not sure.

||task||version||
|1. add storage type|0.6.0|
|2. support write|0.6.0|
|3. support reader|0.6.0|
|4. hive / spark query|maybe 0.6.1 / 0.6.2|

BTW, [~garyli1019] is also interested in this; we will work together to accelerate the implementation.

PS:

  • I'm working on BloomV2
  • [~garyli1019] is working on the Spark realtime query

We will come back here soon; any suggestions are welcome :);;;


08/Apr/20 22:13;vinoth;Great! That sounds like a plan... having just one engine supported with ORC initially would help.. e.g. we can say you can write ORC using Hudi and query it using Hive;;;


09/Sep/20 18:58;manijndl77;Hi [~vinoth], I am new to the community and would love to pick up any task here;;;


09/Sep/20 19:35;vinoth;[~manijndl77] assigned it to you. There is a fair bit of prior work that attempted this; you can search the PRs and RFCs. There is probably an easier way to do this now, given the base file format etc. have been abstracted out nicely;;;


20/Nov/20 08:36;lrz;Hi [~vinoth], we are eager to use this feature. Could you share any updates when you are free? Also, if you can help break down the sub-tasks, we would love to pick some of them up. Thank you very much!;;;


20/Nov/20 17:32;vinoth;[~lrz] absolutely.

[~manijndl77] do you have an update for us? Happy to hop on a call with you all and plan this out end-to-end as well.;;;


29/Nov/20 12:35;manijndl77;Hi [~vinoth] / [~lrz], sorry for replying late due to some health issues. I will raise the PR this week, and then we can plan the remaining things.;;;


10/Dec/20 04:53;manijndl77;Hi [~vinoth], can you please review the ORC writer PR and the plan?

ORC writer: completed, with all major supported data types (primitives, maps, structs, ...): [https://github.com//pull/2320|https://github.com//pull/2320].

ORC reader: targeting completion this week.

Hive sync to support ORC tables: targeting next week, although the input format and output format are completed.

Documentation: minor.;;;


11/Dec/20 08:12;vinoth;Will do. Planning to wrap up the critical parts of the RFC-15 impl this week, so I should have a fair amount of time next week for the PRs;;;


23/Mar/21 00:40;Mithalee;Hi. I am planning to generate ORC files from Hudi. Is this task still under development?;;;


23/Mar/21 05:11;vinoth;[~pwason] can you please update this JIRA? Also should we assign this to the intern? ;;;


15/Apr/21 20:11;nishith29;[~Teresa] Please create the tickets for the remaining work around fixing test cases as well as the HoodieORCInputFormat under this ticket. We will use that to collaborate and source help from other members of the community.;;;


19/May/21 10:23;manasaks;Hi... I am willing to contribute to this task (support for ORC storage); could you please assign me a JIRA ticket?;;;


20/May/21 06:22;nishith29;[~manasaks] I think you can start with https://issues.apache.org/jira/browse/HUDI-1827. At a high level, take a look at how to remove the hard dependency on Parquet. Once you have read through the code a bit, I can guide you further on that ticket.

I'm unable to assign any tickets to you :(;;;


31/May/21 08:14;manasaks;[~nishith29] ORC support for the bootstrap implementation is complete... I need your inputs on a better way to pass the parameter below to the configuration: HOODIE_BASE_FILE_FORMAT_PROP_NAME.

I have tested SparkRDDWriteClient's bootstrap method in Java by passing a sample ORC file to the configuration.

This is regarding JIRA https://issues.apache.org/jira/browse/HUDI-1827.

I have also started working on the Spark integration with ORC (HUDI-1824).

Looking at the ORC API, we don't seem to have corresponding APIs like ParquetWriteSupport or an ORC write method which accepts InternalRow; instead there is an abstraction of VectorizedRowBatch.

So I presume we would have to explicitly convert from InternalRow to VectorizedRowBatch.

Also, is there any alternative class like ParquetWriteSupport for ORC for implementing the bloom filter functionality?;;;
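On the InternalRow to VectorizedRowBatch question, a minimal sketch of that conversion for a fixed two-column struct<key:string,ts:bigint> layout (the column layout is an assumption; real code would walk the table schema generically):

```java
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.Writer;
import org.apache.spark.sql.catalyst.InternalRow;

public class InternalRowToOrcSketch {
  // Append one InternalRow to the current batch, flushing the batch to the ORC
  // writer whenever it fills up. Assumes column 0 is a string and column 1 a long.
  public static void append(Writer writer, VectorizedRowBatch batch, InternalRow row)
      throws java.io.IOException {
    int r = batch.size++;
    ((BytesColumnVector) batch.cols[0]).setVal(r, row.getUTF8String(0).getBytes());
    ((LongColumnVector) batch.cols[1]).vector[r] = row.getLong(1);
    if (batch.size == batch.getMaxSize()) {
      writer.addRowBatch(batch);
      batch.reset();
    }
  }
}
```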


11/Aug/21 18:08;manasaks;[~nishith29] / [~vinoth] Could you please review the PR below for ORC support for the bootstrap operation?

[https://github.com//pull/3457];;;
