Skip to content

[HUDI-5194] Fix schema files cleaning by FileBasedInternalSchemaStorageManager#7183

Open
xiarixiaoyao wants to merge 2 commits intoapache:masterfrom
xiarixiaoyao:schemaFix
Open

[HUDI-5194] Fix schema files cleaning by FileBasedInternalSchemaStorageManager#7183
xiarixiaoyao wants to merge 2 commits intoapache:masterfrom
xiarixiaoyao:schemaFix

Conversation

@xiarixiaoyao
Copy link
Copy Markdown
Contributor

Change Logs

Fix the bug, history schema files cannot be cleaned by FileBasedInternalSchemaStorageManager
Fix the bug, schema evolution cannot worked very well on non-batch read mode under spark3.1x
optimize implement for compaction.

Impact

none

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xiarixiaoyao xiarixiaoyao force-pushed the schemaFix branch 2 times, most recently from ac3457c to 8a74bb3 Compare November 11, 2022 07:38
@xushiyan xushiyan added area:schema Schema evolution and data types priority:critical Production degraded; pipelines stalled labels Nov 11, 2022
@alexeykudinkin
Copy link
Copy Markdown
Contributor

Thanks for putting up a fix @xiarixiaoyao!

Let's sequence this one w/ #6358, which tackles some of our longstanding issues w/ schema handling (slightly overlapping w/ your PR)

@xiarixiaoyao
Copy link
Copy Markdown
Contributor Author

Thanks for putting up a fix @xiarixiaoyao!

Let's sequence this one w/ #6358, which tackles some of our longstanding issues w/ schema handling (slightly overlapping w/ your PR)

ok, will until 6358 is merged and then update this pr.

@xiarixiaoyao xiarixiaoyao force-pushed the schemaFix branch 2 times, most recently from 113b079 to 91ffb11 Compare November 28, 2022 10:32
@xiarixiaoyao xiarixiaoyao force-pushed the schemaFix branch 3 times, most recently from e0299dc to 3f352e7 Compare November 28, 2022 10:57
@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@xiarixiaoyao xiarixiaoyao changed the title [HUDI-5194][WIP]Fix problems found in schema evolution in the production enviroment [HUDI-5194]Fix problems found in schema evolution in the production enviroment Dec 6, 2022
@codope codope changed the title [HUDI-5194]Fix problems found in schema evolution in the production enviroment [HUDI-5194] Fix schema files cleaning by FileBasedInternalSchemaStorageManager Dec 7, 2022
@codope
Copy link
Copy Markdown
Member

codope commented Dec 7, 2022

@xiarixiaoyao Can you please rebase?

@xiarixiaoyao xiarixiaoyao force-pushed the schemaFix branch 2 times, most recently from 90c90ea to 3cb5b94 Compare December 8, 2022 10:28
@xiarixiaoyao
Copy link
Copy Markdown
Contributor Author

@codope already rebased code.

InternalSchema mergedSchema = new InternalSchemaMerger(writeInternalSchema, querySchema,
true, false, false).mergeSchema();
Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName());
Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove useless check

boolean schemaEvolutionEnable = new TableSchemaResolver(table.getMetaClient()).getTableInternalSchemaFromCommitMetadata().isPresent();
Pair<Option<String>, Option<String>> schemaPair = Pair.of(Option.empty(), Option.empty());
if (schemaEvolutionEnable) {
schemaPair = InternalSchemaCache.getInternalSchemaAndAvroSchemaForClusteringAndCompaction(table.getMetaClient(), instantTime);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimize the code, trigger the corresponding logic only when the schema evolution is enabled

@xiarixiaoyao xiarixiaoyao force-pushed the schemaFix branch 2 times, most recently from 15bb7a5 to cee3746 Compare December 8, 2022 10:35
hoodieTableMetaClient, false);
if (historicalSchemas == null) {
FileBasedInternalSchemaStorageManager schemaStorageManager = new FileBasedInternalSchemaStorageManager(hoodieTableMetaClient);
historicalSchemas = SerDeHelper.parseSchemas(schemaStorageManager.getHistorySchemaStr());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache historical schema,reduce the overhead of search fileSchema.
in our env. we have a log with 1700+ avroBlock,
In original logic, it is very time-consuming to do 1700 fileSchema lookup operations

@xiarixiaoyao xiarixiaoyao force-pushed the schemaFix branch 3 times, most recently from ee75297 to 0265ffa Compare December 8, 2022 10:49
@xiarixiaoyao
Copy link
Copy Markdown
Contributor Author

@hudi-bot run azure

@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Dec 9, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@bvaradar
Copy link
Copy Markdown
Contributor

@xiarixiaoyao : Is this still an issue ? If so, Can you please update this PR based on changes from #6358 and rebase. I will review and land this diff.

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiarixiaoyao @bvaradar is this PR still necessary based on the latest master?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:schema Schema evolution and data types priority:critical Production degraded; pipelines stalled release-0.12.2 Patches targetted for 0.12.2 size:L PR with lines of changes in (300, 1000] type:bug Bug reports and fixes

Projects

Status: 🏗 Under discussion

Development

Successfully merging this pull request may close these issues.

9 participants