@yihua
Contributor

@yihua yihua commented Jan 26, 2022

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua
Contributor Author

yihua commented Feb 1, 2022

cc @vinothchandar

My approach is to pull the HFile-format classes from the HBase repo at release 2.4.9 into the hudi-io module of the Hudi repo, renaming the package from org.apache.hadoop.hbase to org.apache.hudi.hbase. I trimmed some classes to limit the number of deps pulled in. All the backward-compatibility logic for KeyValue.KVComparator (hbase1) vs CellComparator (hbase2) is pulled in as well, so we can control it. This way, any Hudi logic using the HFile format uses the internal org.apache.hudi.hbase classes, while SparkHoodieHBaseIndex still uses the hbase lib with org.apache.hadoop.hbase classes (the two are independent).

A few things to finalize:

  • I'm questioning whether we should flip the hbase version in the hudi repo, since if we can unlock the HFile format for metadata table, Presto, Trino, with the first WIP PR, there is no real need to upgrade hbase version to 2.x, which could introduce compatibility issues for SparkHoodieHBaseIndex. Am I missing anything here? wdyt?
  • Right now, protobuf is used to generate the proto classes, and I pulled in the .proto files and protobuf libs (hudi-io-proto module). Should I just put the generated java classes inside the repo and get rid of the proto related files altogether? I can keep the hudi-io-proto module, though, and have hudi-io include the generated code without depending on hudi-io-proto, so we can still evolve the protos in the future.
  • Regarding the new dependencies pulled in, I can further trim the list down if some can cause conflict, e.g., commons-lang3, protobuf:
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <scope>provided</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <scope>provided</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.hbase.thirdparty</groupId>
        <artifactId>hbase-shaded-protobuf</artifactId>
        <version>4.0.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hbase.thirdparty</groupId>
        <artifactId>hbase-shaded-miscellaneous</artifactId>
        <version>4.0.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hbase.thirdparty</groupId>
        <artifactId>hbase-shaded-gson</artifactId>
        <version>4.0.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hbase.thirdparty</groupId>
        <artifactId>hbase-shaded-netty</artifactId>
        <version>4.0.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.htrace</groupId>
        <artifactId>htrace-core4</artifactId>
        <version>4.2.0-incubating</version>
      </dependency>
      <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.12.0</version>
        <scope>compile</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.yetus</groupId>
        <artifactId>audience-annotations</artifactId>
        <version>0.13.0</version>
      </dependency>
      <dependency>
        <groupId>com.esotericsoftware</groupId>
        <artifactId>kryo-shaded</artifactId>
        <version>4.0.2</version>
      </dependency>
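
If one of these does turn out to conflict, one way to trim it (a sketch only, not something in this PR) is a Maven exclusion on whichever dependency brings it in transitively. The com.example:some-upstream-lib coordinates and version below are hypothetical placeholders; the excluded artifact is the commons-lang3 entry from the list above.

```xml
<!-- Hypothetical upstream dependency; replace with the artifact that
     actually pulls in the conflicting library transitively. -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>some-upstream-lib</artifactId>
  <version>1.0.0</version>
  <exclusions>
    <!-- Keep commons-lang3 off the classpath when it arrives via this dependency -->
    <exclusion>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

An exclusion only helps if no remaining code path needs the excluded classes at runtime, so it would have to be checked against the trimmed HFile classes.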

@vinothchandar
Member

@yihua thanks for taking a stab at this.

since if we can unlock the HFile format for metadata table, Presto, Trino, with the first WIP PR, there is no real need to upgrade hbase version to 2.x

The real issue with HFile usage in Hudi has been the bundling (shading and making the size smaller). As for HBase 2.x vs 1.x, it's more about getting on a version that is not 5 years old :) . I don't think we saw any large perf improvements between 1.x and 2.x. I think even with the 1.x hbase we are on version 3 of HFile? (http://www.devdoc.net/bigdata/hbase-0.98.7-hadoop1/book/hfilev3.html ; HFile has its own version, like the Hudi table version.) @codope can chime in here as well. The urgency to do this stems from wanting to finalize this before all the indexing work lands.

Should I just put the generated java classes inside the repo and get rid of the proto related files altogether?

Need to take a closer look. If proto is used to define the storage format, maybe we should keep it in? How big is that?

Regarding the new dependencies pulled in, I can further trim the list down if some can cause conflict, e.g., commons-lang3, protobuf:

Right. The desired way for us is to trim the HFile code down to a much, much smaller amount. We should not bring in any new dependencies that Hudi has gotten rid of, e.g. commons-lang, guava. Otherwise it defeats the purpose a little bit.

@vinothchandar
Member

At 66K lines, this is currently still too much code to maintain for the return. Wondering if it's easier to think about how we can support different base file formats within the same table/partition and punt on this. We could just write our own format, which can be a lot thinner.

@vinothchandar
Member

How much more of the code do we think we can trim (not just the deps)?

@nsivabalan
Contributor

@yihua : Can we close this if it's not valid anymore?

@yihua
Contributor Author

yihua commented Mar 16, 2022

Per discussion, if we want to pull HFile-related code into Hudi, there is more work to do beyond just pulling in the necessary classes: trimming irrelevant code and rewriting code inside classes, to bring the LoC well below 66K. This direction is much more involved.

For now, we'll go with the approach of upgrading HBase to 2.x and properly shading the dependencies, before we write our own file format for the same purpose. #5004 is ready for review for HBase upgrade to 2.x along this line. Closing this WIP PR.
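
As a rough illustration of what "properly shading" means here (a sketch, not taken from #5004), a maven-shade-plugin relocation rewrites the HBase packages under a bundle-private prefix so they cannot clash with another HBase version on the classpath. The shaded prefix org.apache.hudi.org.apache.hadoop.hbase. is an assumption chosen for the example:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Move HBase classes under a Hudi-private package in the bundle jar.
               The shadedPattern prefix is an assumed name for illustration. -->
          <relocation>
            <pattern>org.apache.hadoop.hbase.</pattern>
            <shadedPattern>org.apache.hudi.org.apache.hadoop.hbase.</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Unlike copying the classes into the repo under a renamed package, relocation happens at bundle build time, so no HBase source has to be maintained inside Hudi.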
