[HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex
#4531
Conversation
Force-pushed from 8848863 to 53b67bd
@hudi-bot run azure
Force-pushed from 8203659 to 5dae550
@hudi-bot run azure
<!-- Scala -->
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>${scala.version}</version>
</dependency>
I'm not sure it's a good idea to have Scala deps in hudi-common. Then there will be hudi-common_2.11 and hudi-common_2.12, which is not really necessary.

> I'm not sure it's a good idea to have Scala deps in hudi-common.

Yeah, this is a by-product of keeping this FileIndex in Scala for now. There's also no other module shared by both Spark and Hive, unfortunately. HUDI-3239 would actually resolve this by converting it to Java.

> Then there will be hudi-common_2.11 and hudi-common_2.12, which is not really necessary.

This is not necessary -- since we don't depend on any diverging behavior here.

Cool
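
For context, a hedged illustration of the cross-building concern raised above: if hudi-common carried Scala dependencies, consumers would typically have to depend on per-Scala-version artifacts. The coordinates below are hypothetical, not taken from this PR.

```xml
<!-- Hypothetical illustration only: a module with Scala deps is usually
     cross-built, publishing one artifact per Scala binary version,
     e.g. hudi-common_2.11 and hudi-common_2.12 -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-common_${scala.binary.version}</artifactId>
  <version>${hudi.version}</version>
</dependency>
```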
 * path with the partition columns in this case.
 *
 */
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Maybe we should have this implemented in Java instead? Since it's no longer Spark-specific, it is not required to be in Scala.

Yeah, there's a task for it: HUDI-3239
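
As a hedged sketch, the HUDI-3239 conversion might yield a Java skeleton along these lines; the constructor parameter mirrors the Scala signature shown above, everything else is an assumption, not the actual implementation from this PR.

```java
import org.apache.hudi.common.engine.HoodieEngineContext;

// Hypothetical Java skeleton of the converted class (HUDI-3239);
// not the actual implementation from this PR.
public abstract class AbstractHoodieTableFileIndex {
  private final HoodieEngineContext engineContext;

  protected AbstractHoodieTableFileIndex(HoodieEngineContext engineContext) {
    this.engineContext = engineContext;
  }

  // Engine-specific subclasses (Spark, Hive) would supply listing and caching behavior.
}
```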
public static File prepareParquetTable(java.nio.file.Path basePath, Schema schema, int numberOfFiles,
                                       int numberOfRecords, String commitNumber, HoodieTableType tableType) throws IOException {
  HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath.toString(), tableType, HoodieFileFormat.PARQUET);
nit: remove the added empty lines in several places?

The reason I'm adding those is to logically separate clusters of operations (like setting up the partition, inserting data, etc.) to make them easier to digest at a glance. WDYT?

Got it, sounds good. I'll let you make the judgement call.
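
Purely for illustration, a self-contained toy example of that spacing style, with blank lines separating the setup, insert, and verify clusters; all names here are made up, not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Toy example of the spacing style under discussion; not code from this PR.
class SpacingStyleExample {
  static void run() {
    // Set up the partition
    List<String> partitions = new ArrayList<>();
    partitions.add("2022/01/10");

    // Insert data
    List<Integer> records = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      records.add(i);
    }

    // Verify
    if (partitions.size() != 1 || records.size() != 100) {
      throw new AssertionError("unexpected state");
    }
  }
}
```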
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieTableFileIndex.java
// TODO cleanup
validate(targetFiles, listStatusForSnapshotModeLegacy(job, tableMetaClientMap, snapshotPaths));
Have a flag to turn off validation by default, for query performance?

This was actually just for CI validation. Will clean it up.

Got it. I'll take another look once done.

Sorry, I wasn't clear enough -- I'm using this hook to validate that the behavior is the same as it was prior to the refactoring. I'm planning to clean this up in the topmost PR on the stack.

Let's file a JIRA for this for tracking.
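
A minimal sketch of what such a temporary validation hook might look like -- purely illustrative, assuming the `validate` helper compares the new file-index listing against the legacy one and fails fast on any divergence:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hudi.common.util.ValidationUtils;

// Illustrative sketch of the temporary CI-validation hook discussed above;
// not the actual code from this PR.
final class ListingValidation {
  static void validate(List<FileStatus> targetFiles, List<FileStatus> legacyFiles) {
    List<String> target = targetFiles.stream()
        .map(f -> f.getPath().toString()).sorted().collect(Collectors.toList());
    List<String> legacy = legacyFiles.stream()
        .map(f -> f.getPath().toString()).sorted().collect(Collectors.toList());
    // Fail fast if the refactored listing diverges from the legacy one
    ValidationUtils.checkState(target.equals(legacy),
        "File listings diverged: " + target + " vs " + legacy);
  }
}
```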
Force-pushed from e138780 to 29076ce
 * @param shouldIncludePendingCommits flags whether file-index should include any pending commits
 * @param fileStatusCache transient cache of fetched [[FileStatus]]es
 */
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Let's rename the class with a Base prefix here.
Commits:
- …instead of single one)
- …ndex` for table's files listing in snapshot mode
- Rebased `AbstractHoodieTableFileIndex` onto `HoodieTableQueryType`
- …ns w/ partition-metadata
- … its listing (for Hive compatibility)
- hudi-hive-sync-bundle, presto-bundle, trino-bundle
- … jars into Hive CLI (addresses issues of tests failing after enabling Metadata table)
Force-pushed from d8313ea to 08b9ee5
yihua left a comment:
LGTM. I'll merge this one to unblock the stacked PRs.
vinothchandar left a comment:
I'm actually not sure this is delivering a real reduction in code duplication here. Also, please let's not change bundles temporarily and leave things in an intermediate state.
 * </ol>
 */
public enum HoodieTableQueryType {
  QUERY_TYPE_SNAPSHOT,
Drop the QUERY_TYPE prefix?
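
For illustration, the enum after the suggested rename might read as follows; the READ_OPTIMIZED and INCREMENTAL values are assumed from Hudi's standard query types, since the hunk above only shows the first constant.

```java
// Hypothetical shape of the enum after dropping the QUERY_TYPE prefix;
// values other than SNAPSHOT are assumed from Hudi's standard query types.
public enum HoodieTableQueryType {
  SNAPSHOT,
  READ_OPTIMIZED,
  INCREMENTAL
}
```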
configProperties: TypedProperties,
specifiedQueryInstant: Option[String] = None,
@transient fileStatusCache: FileStatusCacheTrait) {
abstract class HoodieTableFileIndexBase(engineContext: HoodieEngineContext,
BaseHoodieTableIndex, or even just HoodieTableFileIndex?
 * are queried</li>
 * </ol>
 */
public enum HoodieTableQueryType {
Also just HoodieQueryType?

Fine either way, but in HoodieQueryType the Hoodie prefix seems to be affiliated with the query ("Hoodie's query"), which is a little confusing compared to "Hoodie's table query".
import org.apache.hudi.common.model.HoodieTableQueryType;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.util.Option;
import org.slf4j.Logger;
Let's stick to log4j?

slf4j is dictated by Spark (otherwise we'd need to provide a wrapper of log4j over the slf4j interface).
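
A minimal sketch of the slf4j facade pattern being referred to -- illustrative only, and the class name is hypothetical:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative: logging through the slf4j facade lets the runtime (e.g. Spark's
// bundled binding) decide the backend, instead of hard-coding log4j.
class FileIndexExample {
  private static final Logger LOG = LoggerFactory.getLogger(FileIndexExample.class);

  void listFiles() {
    LOG.info("Listing table files via the file index");
  }
}
```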
try {
  return HoodieInputFormatUtils.getFileStatus(baseFileOpt.get());
} catch (IOException ioe) {
  throw new RuntimeException(ioe);
Wrap into HoodieIOException?

Yeah, missed those.

Already addressed in #4559
rtFileStatus.setDeltaLogFiles(sortedLogFiles);
return rtFileStatus;
} catch (IOException e) {
  throw new RuntimeException(e);
Same here.
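
The suggested change would presumably look something like the following sketch; the method and message are made up stand-ins, while HoodieIOException is Hudi's standard wrapper for checked IOExceptions.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hudi.exception.HoodieIOException;

// Illustrative sketch of the suggested fix: wrap the checked IOException into
// Hudi's HoodieIOException rather than a bare RuntimeException.
// getFileStatusInternal is a hypothetical stand-in for the real call.
class ExceptionWrappingExample {
  FileStatus getFileStatusSafely() {
    try {
      return getFileStatusInternal();
    } catch (IOException ioe) {
      throw new HoodieIOException("Failed to fetch file status", ioe);
    }
  }

  private FileStatus getFileStatusInternal() throws IOException {
    throw new IOException("stub");
  }
}
```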
<!-- TODO(HUDI-3239) remove this -->
<include>org.scala-lang:scala-library</include>
<include>org.apache.parquet:parquet-avro</include>

All of this is problematic. cc @codope
@vinothchandar responding inline:

Can you elaborate?

MT (the Metadata table) doesn't work in Presto bundles on master. @codope is working to update Hudi's Presto Docker image. As soon as this is done, either Sagar or I will revert these changes.
What is the purpose of the pull request

Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex

Brief change log

- Moved AbstractHoodieTableFileIndex to "hudi-spark-common" (temporarily, will be migrated to "hudi-common")
- Added HiveHoodieTableFileIndex, an impl of AbstractHoodieTableFileIndex for Hive
- Rebased HiveFileInputFormatBase onto HiveHoodieTableFileIndex

Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.