Skip to content

Conversation

@codope
Copy link
Member

@codope codope commented Apr 6, 2023

Change Logs

Push down instantiation of table schema resolver to the point when it's actually needed. This helps in avoiding some cycles reading commit metadata unless needed.

Impact

None.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Push down instantiattion of table schema resolver
@codope codope changed the title [WIP] Fix COW input format for bootstrap tables [HUDI-6055] Fix parquet input format for bootstrap tables Apr 10, 2023
@codope codope requested a review from xiarixiaoyao April 10, 2023 13:12
@codope codope changed the title [HUDI-6055] Fix parquet input format for bootstrap tables [HUDI-6055] Fix input format for bootstrap tables Apr 10, 2023
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch.

// No need trigger schema evolution for count(*)/count(1) operation
boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
if (!disableSchemaEvolution) {
if (!internalSchemaOption.isPresent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this condition be internalSchemaOption == null since it may not be initialized?

if (internalSchemaOption.isPresent()) {
List<String> requiredColumns = getRequireColumn(job);
// No need trigger schema evolution for count(*)/count(1) operation
boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, do requiredColumns contain the columns from the predicate(s), e.g., count(*) where col1 is not null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This is existing logic. Still wondering the same question.

Comment on lines 161 to +173
// reading hoodie schema evolution table
job.setBoolean(HIVE_EVOLUTION_ENABLE, true);
Path finalPath = ((FileSplit)split).getPath();
Path finalPath = ((FileSplit) split).getPath();
InternalSchema prunedSchema;
List<String> requiredColumns = getRequireColumn(job);
// No need trigger schema evolution for count(*)/count(1) operation
boolean disableSchemaEvolution =
requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
if (!disableSchemaEvolution) {
prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), requiredColumns);
InternalSchema querySchema = prunedSchema;
Long commitTime = Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
InternalSchema mergedInternalSchema = new InternalSchemaMerger(fileSchema, querySchema, true,
true).mergeSchema();
List<Types.Field> fields = mergedInternalSchema.columns();
setColumnNameList(job, fields);
setColumnTypeList(job, fields);
pushDownFilter(job, querySchema, fileSchema);
}
prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), requiredColumns);
InternalSchema querySchema = prunedSchema;
Long commitTime = Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
InternalSchema mergedInternalSchema = new InternalSchemaMerger(fileSchema, querySchema, true, true).mergeSchema();
List<Types.Field> fields = mergedInternalSchema.columns();
setColumnNameList(job, fields);
setColumnTypeList(job, fields);
pushDownFilter(job, querySchema, fileSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and should this part be guarded by !internalSchemaOption.isPresent()?

throw new HoodieException("Failed to parse partition column values from the partition-path:"
+ " likely non-encoded slashes being used in partition column's values. You can try to"
+ " work this around by switching listing mode to eager");
LOG.warn(">>> PartitionColumns: " + partitionColumns + " PartitionValues: " + partitionColumnValues);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I assume we still need to fail here instead of printing the warning and letting it return?

boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
if (!disableSchemaEvolution) {
if (!internalSchemaOption.isPresent()) {
internalSchemaOption = new TableSchemaResolver(metaClient).getTableInternalSchemaFromCommitMetadata();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still do try .. catch .. here in case the internal schema cannot be read?

true,
new NoopCache(),
false);
true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I remember we need to fix the lazy listing for Hvie File Index. Should this be in a separate PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should not have been part of this PR. Actually, it doesn't really matter for Hudi connector as it doesn't go through COW input format code. And for Hive connector, we already saw that it was partition loader will instantiate this every call. However, the actual perf issue for Hive connector was fixed due to #7527 (comment)

public SchemaEvolutionContext(InputSplit split, JobConf job, Option<HoodieTableMetaClient> metaClientOption) {
this.split = split;
this.job = job;
this.metaClient = metaClientOption.isPresent() ? metaClientOption.get() : setUpHoodieTableMetaClient();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now the init of internalSchemaOption has been removed.

if (schemaEvolutionContext.internalSchemaOption.isPresent()) {
need to modify

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok then there is no need of this PR.

@codope
Copy link
Member Author

codope commented May 6, 2023

@yihua @xiarixiaoyao thanks for reviewing. As mentioned in the comments, the unnecessary instantiation is already removed and we don't need lazy listing in Hive file index impl for now. So, I am going to close the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants