[HUDI-6055] Fix input format for bootstrap tables #8397

codope · 2023-04-06T10:40:22Z

Change Logs

Push down instantiation of table schema resolver to the point when it's actually needed. This helps in avoiding some cycles reading commit metadata unless needed.

Impact

None.

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default values of configs are changed.
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

Push down instantiation of table schema resolver

hudi-bot · 2023-04-10T16:13:14Z

CI report:

63b8f47 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua

This is a good catch.

yihua · 2023-05-05T05:20:04Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
+    if (!disableSchemaEvolution) {
+      if (!internalSchemaOption.isPresent()) {


Should this condition be internalSchemaOption == null since it may not be initialized?
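The concern above can be illustrated with a minimal sketch. It is not Hudi code: `LazySchemaHolder` is a hypothetical stand-in for `SchemaEvolutionContext`, `java.util.Optional` stands in for Hudi's `Option`, and a `Supplier` stands in for the `TableSchemaResolver` call. The assumption is that the field is initialized to an empty option instead of being left `null`, so an `isPresent()` check is safe and the resolver runs only on first use.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-in for SchemaEvolutionContext; names are illustrative only.
final class LazySchemaHolder {
  // Start as empty rather than null, so isPresent() never throws a NullPointerException.
  private Optional<String> internalSchemaOption = Optional.empty();
  // Stand-in for the expensive schema lookup from commit metadata.
  private final Supplier<Optional<String>> resolver;
  private int resolveCalls = 0;

  LazySchemaHolder(Supplier<Optional<String>> resolver) {
    this.resolver = resolver;
  }

  // Resolves the schema on first use and reuses the result once one is present.
  Optional<String> getOrResolve() {
    if (!internalSchemaOption.isPresent()) {
      resolveCalls++;
      internalSchemaOption = resolver.get();
    }
    return internalSchemaOption;
  }

  int resolveCalls() {
    return resolveCalls;
  }
}
```

If the field were declared without an initializer, the same check would have to be `internalSchemaOption == null || !internalSchemaOption.isPresent()`.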

yihua · 2023-05-05T05:22:34Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

-    if (internalSchemaOption.isPresent()) {
+    List<String> requiredColumns = getRequireColumn(job);
+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());


To clarify, does requiredColumns contain the columns from the predicate(s), e.g., count(*) where col1 is not null?

I see. This is existing logic. Still wondering the same question.
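For reference, the condition from the diff can be exercised in isolation. The sketch below copies only the boolean expression into a hypothetical helper class (`RequiredColumnsCheck` is not part of Hudi). It does not answer whether Hive actually places predicate columns into the required-column list; it only shows that, if it does, the check evaluates to `false` and schema evolution stays enabled.

```java
import java.util.List;

// Hypothetical helper wrapping the condition quoted in the diff.
final class RequiredColumnsCheck {
  // True when no columns are projected: an empty list, or a single empty name.
  static boolean disableSchemaEvolution(List<String> requiredColumns) {
    return requiredColumns.isEmpty()
        || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
  }
}
```

Under this expression, an empty list and a list holding one empty string both disable schema evolution, while a list such as `["col1"]` does not.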

yihua · 2023-05-05T05:25:06Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

      // reading hoodie schema evolution table
      job.setBoolean(HIVE_EVOLUTION_ENABLE, true);
-      Path finalPath = ((FileSplit)split).getPath();
+      Path finalPath = ((FileSplit) split).getPath();
      InternalSchema prunedSchema;
-      List<String> requiredColumns = getRequireColumn(job);
-      // No need trigger schema evolution for count(*)/count(1) operation
-      boolean disableSchemaEvolution =
-          requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
-      if (!disableSchemaEvolution) {
-        prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), requiredColumns);
-        InternalSchema querySchema = prunedSchema;
-        Long commitTime = Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
-        InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
-        InternalSchema mergedInternalSchema = new InternalSchemaMerger(fileSchema, querySchema, true,
-            true).mergeSchema();
-        List<Types.Field> fields = mergedInternalSchema.columns();
-        setColumnNameList(job, fields);
-        setColumnTypeList(job, fields);
-        pushDownFilter(job, querySchema, fileSchema);
-      }
+      prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), requiredColumns);
+      InternalSchema querySchema = prunedSchema;
+      Long commitTime = Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
+      InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
+      InternalSchema mergedInternalSchema = new InternalSchemaMerger(fileSchema, querySchema, true, true).mergeSchema();
+      List<Types.Field> fields = mergedInternalSchema.columns();
+      setColumnNameList(job, fields);
+      setColumnTypeList(job, fields);
+      pushDownFilter(job, querySchema, fileSchema);


and should this part be guarded by !internalSchemaOption.isPresent()?

yihua · 2023-05-05T05:29:37Z

hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java

-      throw new HoodieException("Failed to parse partition column values from the partition-path:"
-          + " likely non-encoded slashes being used in partition column's values. You can try to"
-          + " work this around by switching listing mode to eager");
+      LOG.warn(">>> PartitionColumns: " + partitionColumns + "  PartitionValues: " + partitionColumnValues);


So I assume we still need to fail here instead of printing the warning and letting it return?
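A rough sketch of the fail-fast behavior the comment asks about, under stated assumptions: `PartitionValueParser` is a hypothetical class, the partition path is split on `/`, and `IllegalStateException` stands in for `HoodieException`. The real logic in `BaseHoodieTableFileIndex` is more involved; this only contrasts throwing on a mismatch with logging and returning.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified parser; not the BaseHoodieTableFileIndex implementation.
final class PartitionValueParser {
  // Splits a partition path into values and fails when the count does not match the columns.
  static List<String> parse(List<String> partitionColumns, String partitionPath) {
    List<String> values = Arrays.asList(partitionPath.split("/"));
    if (values.size() != partitionColumns.size()) {
      // Fail fast instead of logging a warning and returning partial values.
      throw new IllegalStateException(
          "Failed to parse partition column values from the partition-path:"
              + " likely non-encoded slashes being used in partition column's values."
              + " PartitionColumns: " + partitionColumns + " PartitionValues: " + values);
    }
    return values;
  }
}
```

With a single partition column `dt`, the path `2023-04-06` parses to one value, whereas `2023/04/06` yields three fragments and raises the exception.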

yihua · 2023-05-05T05:30:52Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
+    if (!disableSchemaEvolution) {
+      if (!internalSchemaOption.isPresent()) {
+        internalSchemaOption = new TableSchemaResolver(metaClient).getTableInternalSchemaFromCommitMetadata();


Should we still wrap this in try .. catch .. in case the internal schema cannot be read?
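One way to read this suggestion, sketched with stand-ins: `SafeSchemaLoader` is hypothetical, `java.util.Optional` replaces Hudi's `Option`, and a `Callable` replaces the `TableSchemaResolver` call. The assumption is that a failed read should fall back to an empty option (schema evolution skipped) rather than fail the query.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical wrapper; the Callable stands in for the schema lookup from commit metadata.
final class SafeSchemaLoader {
  // Returns the resolved schema, or empty if the lookup throws.
  static Optional<String> loadOrEmpty(Callable<Optional<String>> resolver) {
    try {
      return resolver.call();
    } catch (Exception e) {
      // Report the failure and continue without an internal schema.
      System.err.println("Failed to read internal schema: " + e.getMessage());
      return Optional.empty();
    }
  }
}
```

Whether swallowing the exception is the right policy is a design choice; rethrowing with more context would be the stricter alternative.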

yihua · 2023-05-05T05:32:52Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieTableFileIndex.java

        true,
        new NoopCache(),
-        false);
+        true);


Now I remember we need to fix the lazy listing for Hive File Index. Should this be in a separate PR?

Yes, this should not have been part of this PR. Actually, it doesn't really matter for the Hudi connector, as it doesn't go through the COW input format code. And for the Hive connector, we already saw that the partition loader instantiates this on every call. However, the actual perf issue for the Hive connector was fixed by #7527 (comment)

xiarixiaoyao · 2023-05-05T11:40:12Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

+  public SchemaEvolutionContext(InputSplit split, JobConf job, Option<HoodieTableMetaClient> metaClientOption) {
    this.split = split;
    this.job = job;
    this.metaClient = metaClientOption.isPresent() ? metaClientOption.get() : setUpHoodieTableMetaClient();


now the init of internalSchemaOption has been removed.

hudi/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java

Line 91 in 83d4fe1

if (schemaEvolutionContext.internalSchemaOption.isPresent()) {

need to modify

oh ok then there is no need of this PR.

codope · 2023-05-06T14:06:17Z

@yihua @xiarixiaoyao thanks for reviewing. As mentioned in the comments, the unnecessary instantiation is already removed and we don't need lazy listing in Hive file index impl for now. So, I am going to close the PR.

Fix COW input format for bootstrap tables

63b8f47

Push down instantiation of table schema resolver

codope changed the title ~~[WIP] Fix COW input format for bootstrap tables~~ [HUDI-6055] Fix parquet input format for bootstrap tables Apr 10, 2023

codope requested a review from xiarixiaoyao April 10, 2023 13:12

codope changed the title ~~[HUDI-6055] Fix parquet input format for bootstrap tables~~ [HUDI-6055] Fix input format for bootstrap tables Apr 10, 2023

vinothchandar added the release-0.13.1 label May 2, 2023

codope assigned yihua and xiarixiaoyao May 4, 2023

yihua reviewed May 5, 2023

xiarixiaoyao reviewed May 5, 2023

codope closed this May 6, 2023

hudi-bot mentioned this pull request Dec 9, 2025

Avoid loading TableSchemaResolver until needed in reading bootstrap tables #15864

Open

5 participants
