@codope
Member

@codope codope commented Jan 10, 2022

What is the purpose of the pull request

We introduced hudi-presto-bundle in the presto-hive module, which caused a build issue in the Presto installation. See prestodb/presto#17164 for more details. This PR fixes hudi-presto-bundle to resolve that issue. Specifically:

  • Removes parquet-avro and avro from hudi-presto-bundle. Presto ships a higher version of both, and I checked that the APIs we use have not changed in the Presto versions of parquet and avro.
  • Removes hbase-shaded-server and shades the hbase-server that is included as part of hudi-common. This ensures that the only namespaces in the jar are Hudi-related, i.e. com/uber/hoodie/hadoop/* and org/apache/hudi/*.
  • Total size of bundle: 16 MB.
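
The namespace claim above can be spot-checked against a built jar. The following is a minimal sketch, not part of this PR: the class name `BundleNamespaceCheck` and its helper methods are hypothetical, and it assumes that relocated (shaded) classes end up under one of the two prefixes listed in the description. It prints every `.class` entry that falls outside those prefixes.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper: reports class files in a jar that live outside the expected namespaces.
public class BundleNamespaceCheck {
    // Prefixes taken from the PR description; assumed to also cover relocated classes.
    static final List<String> ALLOWED =
            Arrays.asList("com/uber/hoodie/hadoop/", "org/apache/hudi/");

    static boolean isAllowed(String entryName) {
        return ALLOWED.stream().anyMatch(entryName::startsWith);
    }

    // Returns the sorted set of .class entries that do not start with an allowed prefix.
    static Set<String> unexpectedClasses(Iterable<String> entryNames) {
        Set<String> unexpected = new TreeSet<>();
        for (String name : entryNames) {
            if (name.endsWith(".class") && !isAllowed(name)) {
                unexpected.add(name);
            }
        }
        return unexpected;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: BundleNamespaceCheck <path-to-bundle-jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            List<String> names = jar.stream()
                    .map(JarEntry::getName)
                    .collect(Collectors.toList());
            // An empty result means every class sits under an allowed prefix.
            unexpectedClasses(names).forEach(System.out::println);
        }
    }
}
```

Running it with the path to the built bundle jar as the only argument should print nothing if the bundle contains only Hudi-related namespaces.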

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codope codope added the priority:critical Production degraded; pipelines stalled label Jan 10, 2022
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Member

@vinothchandar vinothchandar left a comment

LGTM

@alexeykudinkin
Contributor

alexeykudinkin commented Jan 13, 2022

@codope can you please help me understand how Presto would work without parquet-avro (which the RecordReaders rely on to read using ParquetFileReaders) and hbase-shaded-server (which carries HFile, so reads are poised to fail if the Metadata Table is enabled)?

@codope
Member Author

codope commented Jan 13, 2022

> @codope can you please help me understand how Presto would work without parquet-avro (which the RecordReaders rely on to read using ParquetFileReaders) and hbase-shaded-server (which carries HFile, so reads are poised to fail if the Metadata Table is enabled)?

parquet-avro and avro are part of the hive-apache artifact in Presto, so we are going to use those. I removed hbase-shaded-server because hbase-server (part of hudi-common) is already included.

@alexeykudinkin
Contributor

Oh, I missed that you removed the exclusions.

@alexeykudinkin
Contributor

@codope I had to revert these changes in my PR, since Presto queries are failing after the rebase:

2022-01-14T20:45:04.265Z	WARN	hive-hive-0	com.facebook.presto.hive.util.ResumableTasks	ResumableTask completed exceptionally
java.lang.NoClassDefFoundError: org/apache/avro/message/BinaryMessageEncoder
	at org.apache.hudi.avro.model.HoodieMetadataRecord.<clinit>(HoodieMetadataRecord.java:23)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:330)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$openReadersIfNeeded$2(HoodieBackedTableMetadata.java:262)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.openReadersIfNeeded(HoodieBackedTableMetadata.java:239)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:129)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:124)
	at org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:154)
	at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:98)
	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:297)
	at org.apache.hudi.AbstractHoodieTableFileIndex.getAllQueryPartitionPaths(AbstractHoodieTableFileIndex.scala:233)
	at org.apache.hudi.AbstractHoodieTableFileIndex.loadPartitionPathFiles(AbstractHoodieTableFileIndex.scala:195)
	at org.apache.hudi.AbstractHoodieTableFileIndex.refresh0(AbstractHoodieTableFileIndex.scala:108)
	at org.apache.hudi.AbstractHoodieTableFileIndex.<init>(AbstractHoodieTableFileIndex.scala:88)
	at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.<init>(HiveHoodieTableFileIndex.java:52)
	at org.apache.hudi.hadoop.HoodieFileInputFormatBase.listStatusForSnapshotMode(HoodieFileInputFormatBase.java:170)
	at org.apache.hudi.hadoop.HoodieFileInputFormatBase.listStatus(HoodieFileInputFormatBase.java:141)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:362)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:258)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:93)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:187)
	at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
	at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
	at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.message.BinaryMessageEncoder
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at com.facebook.presto.server.PluginClassLoader.loadClass(PluginClassLoader.java:80)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 29 more

https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=5256&view=logs&j=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de&t=30b5aae4-0ea0-5566-42d0-febf71a7061a

#4556

@codope
Member Author

codope commented Jan 15, 2022

> @codope I had to revert these changes in my PR, since Presto queries are failing after the rebase:
>
> 2022-01-14T20:45:04.265Z	WARN	hive-hive-0	com.facebook.presto.hive.util.ResumableTasks	ResumableTask completed exceptionally
> java.lang.NoClassDefFoundError: org/apache/avro/message/BinaryMessageEncoder
> 	at org.apache.hudi.avro.model.HoodieMetadataRecord.<clinit>(HoodieMetadataRecord.java:23)

@alexeykudinkin Let's not revert this. Instead, we should upgrade the Presto version used in the Hudi integ tests. It is currently 0.217, which is over 3 years old and did not package avro.message. We want our bundles to be as lightweight as possible, so we rely on deps provided by Presto as much as possible. Moreover, 0.217 is far removed from reality: it does not contain the Hudi-specific changes that we made in Presto. Also, most Hudi users that I have interacted with are on 0.246 or later.
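
Whether a given runtime provides the class named in the trace can be probed directly. This is a minimal sketch under stated assumptions: `ClassProbe` and `isLoadable` are hypothetical names, and the result only reflects the classpath the probe itself runs with, so it would need to be launched with the same jars the Presto Hive plugin loads for the answer to be meaningful.

```java
// Hypothetical probe: checks whether a class can be resolved by a given class loader.
public class ClassProbe {
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: resolve the class without running its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised by missing transitive classes.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class that was missing in the stack trace above.
        String name = args.length > 0
                ? args[0]
                : "org.apache.avro.message.BinaryMessageEncoder";
        boolean found = isLoadable(name, ClassProbe.class.getClassLoader());
        System.out.println(name + " loadable: " + found);
    }
}
```

Running the probe against the jars of each Presto version under consideration would show whether its bundled avro includes the `org.apache.avro.message` package before the integration-test images are changed.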

@alexeykudinkin
Contributor

alexeykudinkin commented Jan 15, 2022

That makes total sense to me. But for that we have to update the Docker images we're using in the ITs, right? If I reverted those changes, my PR would have ITs failing because of missing classes.

Let me know when you are able to update the Docker images and I'll revert the POM changes.

EDIT

Please keep in mind that Hive flows on the current master don't involve the Metadata Table (which uses HFile), so we'd need to validate that this works either by triggering that flow manually or by basing it on top of #4556, which does trigger Metadata Table usage in Hive flows.

@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022

4 participants