Conversation

@alexeykudinkin (Contributor):

What is the purpose of the pull request

Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex

Brief change log

  • Moving AbstractHoodieTableFileIndex to "hudi-spark-common" (temporarily, will be migrated to "hudi-common")
  • Bootstrapping HiveHoodieTableFileIndex impl of AbstractHoodieTableFileIndex for Hive
  • Rebasing HiveFileInputFormatBase onto HiveHoodieTableFileIndex
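The layering described in the change log (an engine-agnostic file index with per-engine subclasses) can be sketched roughly as follows. All class and method names below are simplified stand-ins for illustration, not the actual AbstractHoodieTableFileIndex API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Engine-agnostic core: subclasses only supply engine-specific listing.
// This mirrors the AbstractHoodieTableFileIndex / HiveHoodieTableFileIndex
// split in this PR, but every member below is a simplified stand-in.
abstract class AbstractFileIndexSketch {

  // Engine-specific way of enumerating candidate partition paths.
  protected abstract List<String> listPartitionPaths();

  // Shared logic: filter partitions by a simple prefix "predicate".
  public List<String> filterPartitions(String prefix) {
    return listPartitionPaths().stream()
        .filter(p -> p.startsWith(prefix))
        .collect(Collectors.toList());
  }
}

// Hive-side impl: in the real PR this is backed by Hive's input paths.
class HiveFileIndexSketch extends AbstractFileIndexSketch {
  private final List<String> partitions;

  HiveFileIndexSketch(List<String> partitions) {
    this.partitions = partitions;
  }

  @Override
  protected List<String> listPartitionPaths() {
    return partitions;
  }
}

public class FileIndexDemo {
  public static void main(String[] args) {
    AbstractFileIndexSketch index =
        new HiveFileIndexSketch(Arrays.asList("2022/01/07", "2022/01/08", "2021/12/31"));
    System.out.println(index.filterPartitions("2022/"));
  }
}
```

The point of the split is that partition-pruning and file-slicing logic lives once in the abstract base, while each engine only adapts its own listing mechanics.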

Verify this pull request

This pull request is covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex [WIP][HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex Jan 7, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-3 branch 4 times, most recently from 8848863 to 53b67bd Compare January 8, 2022 02:00
@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex [HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex Jan 8, 2022
@alexeykudinkin (Contributor Author):

@hudi-bot run azure

@alexeykudinkin (Contributor Author):

@hudi-bot run azure

Comment on lines +159 to +165
<!-- Scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>

Contributor:

I'm not sure if it's a good idea to have scala deps in hudi-common. Then there will be hudi-common_2.11 and hudi-common_2.12, which is not really necessary.

@alexeykudinkin (Contributor Author), Jan 14, 2022:

> I'm not sure if it's a good idea to have scala deps in hudi-common.

Yeah, this is a by-product of keeping this FileIndex in Scala for now. Unfortunately, there's also no other module shared by both Spark and Hive.

HUDI-3239 would actually resolve this by converting it to Java.

> Then there will be hudi-common_2.11 and hudi-common_2.12, which is not really necessary.

This is not necessary, since we don't depend on any diverging behavior here.

Contributor:

Cool

* path with the partition columns in this case.
*
*/
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Contributor:

Maybe we should implement this in Java instead? Since it's no longer Spark-specific, it's not required to be in Scala.

Contributor Author:

Yeah, there's a task for it: HUDI-3239.

public static File prepareParquetTable(java.nio.file.Path basePath, Schema schema, int numberOfFiles,
int numberOfRecords, String commitNumber, HoodieTableType tableType) throws IOException {
HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath.toString(), tableType, HoodieFileFormat.PARQUET);

Contributor:

nit: remove the added empty lines in several places?

Contributor Author:

The reason I'm adding those is to logically separate clusters of operations (like setting up partitions, inserting data, etc.), to make the code easier to digest at a glance. WDYT?

Contributor:

Got it, sounds good. I'll let you make the judgement call.

Comment on lines +202 to +203
// TODO cleanup
validate(targetFiles, listStatusForSnapshotModeLegacy(job, tableMetaClientMap, snapshotPaths));
Contributor:

Should we have a flag to turn validation off by default, for query performance?

Contributor Author:

This was actually just for CI validation. Will clean it up.

Contributor:

Got it. I'll take another look once done.

Contributor Author:

Sorry, I wasn't elaborate enough: I'm using this hook to validate that the behavior is the same as it was prior to the refactoring. I'm planning to clean this up in the topmost PR of the stack.

Member:

Let's file a JIRA for this for tracking.
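The flag suggested earlier in this thread (validation off by default, so production queries skip the expensive legacy comparison) could look roughly like this. The property name and method shapes are hypothetical, not from the PR:

```java
import java.util.List;

public class ValidationFlagDemo {
  // Off by default; enable via -Dhoodie.internal.validate.listing=true (made-up key).
  static final boolean VALIDATE =
      Boolean.parseBoolean(System.getProperty("hoodie.internal.validate.listing", "false"));

  // Fails loudly if the new file-index listing diverges from the legacy path.
  static void validate(List<String> fast, List<String> legacy) {
    if (!fast.equals(legacy)) {
      throw new IllegalStateException("File-index listing diverged from legacy listing");
    }
  }

  public static List<String> listFiles(List<String> fromIndex, List<String> fromLegacy) {
    if (VALIDATE) {
      validate(fromIndex, fromLegacy); // only pay this cost when the flag is set
    }
    return fromIndex;
  }

  public static void main(String[] args) {
    System.out.println(listFiles(List.of("a.parquet"), List.of("a.parquet")));
  }
}
```

This keeps the CI-only cross-check available without penalizing regular query paths.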

* @param shouldIncludePendingCommits flags whether file-index should exclude any pending operations
* @param fileStatusCache transient cache of fetched [[FileStatus]]es
*/
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Contributor:

Let's rename the class with a Base prefix here.


@alexeykudinkin alexeykudinkin changed the title [HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex [HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex Jan 18, 2022
@hudi-bot (Collaborator):

CI report:


@yihua (Contributor) left a comment:

LGTM. I'll merge this one to unblock the stacked PRs.

@yihua yihua merged commit 4bea758 into apache:master Jan 18, 2022
@vinothchandar (Member) left a comment:

I'm actually not quite sure this is delivering real code-duplication reduction here. Also, let's please not change bundles temporarily and leave things in an intermediate state.

* </ol>
*/
public enum HoodieTableQueryType {
QUERY_TYPE_SNAPSHOT,
Member:

Drop the QUERY_TYPE prefix?

configProperties: TypedProperties,
specifiedQueryInstant: Option[String] = None,
@transient fileStatusCache: FileStatusCacheTrait) {
abstract class HoodieTableFileIndexBase(engineContext: HoodieEngineContext,
Member:

BaseHoodieTableIndex, or even just HoodieTableFileIndex?

* are queried</li>
* </ol>
*/
public enum HoodieTableQueryType {
Member:

Also, just HoodieQueryType?

Contributor Author:

Fine either way, but in HoodieQueryType the Hoodie prefix seems affiliated with the query itself ("Hoodie's query"), which is a little confusing compared to "Hoodie's table query".
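To illustrate the naming point under discussion: since the enclosing enum name already carries the context, a QUERY_TYPE_ prefix on each constant reads as redundant at call sites. The enum body below is a sketch for comparison, not the merged definition:

```java
// With the prefix dropped, call sites read as "table query type: snapshot"
// instead of repeating "query type" twice. Constant names are illustrative.
enum HoodieTableQueryTypeSketch {
  SNAPSHOT,
  READ_OPTIMIZED,
  INCREMENTAL
}

public class EnumNamingDemo {
  public static void main(String[] args) {
    // Compare: HoodieTableQueryType.QUERY_TYPE_SNAPSHOT vs. the form below.
    HoodieTableQueryTypeSketch type = HoodieTableQueryTypeSketch.SNAPSHOT;
    System.out.println(type);
  }
}
```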

import org.apache.hudi.common.model.HoodieTableQueryType;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.util.Option;
import org.slf4j.Logger;
Member:

Let's stick to log4j?

Contributor Author:

slf4j is dictated by Spark (otherwise we'd need to provide a wrapper of log4j over the slf4j interface).

try {
return HoodieInputFormatUtils.getFileStatus(baseFileOpt.get());
} catch (IOException ioe) {
throw new RuntimeException(ioe);
Member:

Wrap into HoodieIOException?

Contributor Author:

Yeah, I missed those.

Contributor Author:

Already addressed in #4559
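The suggestion here is to replace the bare RuntimeException wrapper with Hudi's domain-specific unchecked exception, preserving the IO cause. A self-contained sketch (with a local stand-in class, since the real HoodieIOException lives in Hudi's exception package):

```java
import java.io.IOException;

// Local stand-in for org.apache.hudi.exception.HoodieIOException,
// so this snippet compiles on its own.
class DemoHoodieIOException extends RuntimeException {
  DemoHoodieIOException(String msg, IOException cause) {
    super(msg, cause);
  }
}

public class WrapDemo {
  // The 'fail' switch simulates the IO layer; the real code calls
  // HoodieInputFormatUtils.getFileStatus(...) here.
  static String readFileStatus(boolean fail) {
    try {
      if (fail) {
        throw new IOException("disk error");
      }
      return "status-ok";
    } catch (IOException ioe) {
      // Domain-specific unchecked wrapper instead of bare RuntimeException:
      // callers can catch it precisely, and the IO cause is preserved.
      throw new DemoHoodieIOException("Failed to fetch file status", ioe);
    }
  }

  public static void main(String[] args) {
    System.out.println(readFileStatus(false));
  }
}
```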

rtFileStatus.setDeltaLogFiles(sortedLogFiles);
return rtFileStatus;
} catch (IOException e) {
throw new RuntimeException(e);
Member:

Same here.


<!-- TODO(HUDI-3239) remove this -->
<include>org.scala-lang:scala-library</include>

<include>org.apache.parquet:parquet-avro</include>
Member:

All of this is problematic. cc @codope

@alexeykudinkin (Contributor Author):

@vinothchandar responding inline:

> I'm actually not quite sure this is delivering real code-duplication reduction here.

Can you elaborate?

> Also, let's please not change bundles temporarily and leave things in an intermediate state.

MT doesn't work in the Presto bundles on master. @codope is working to update Hudi's Presto Docker image. As soon as that is done, either Sagar or I will revert these changes.

@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022