@yihua yihua commented Mar 22, 2022

What is the purpose of the pull request

This PR addresses the failure of a retried clean action in the following scenario:

(1) C5.clean.requested in data table
(2) C5.clean.inflight in data table
(3) Apply changes to metadata table
(4) C5.deltacommit.requested and C5.deltacommit.inflight in metadata table
(5) Job crashes
(6) Restart the job to rerun the same clean action C5

The following exception is thrown:

19747 [main] WARN  org.apache.hudi.table.action.clean.CleanActionExecutor  - Failed to perform previous clean operation, instant: [==>00000000000005__clean__INFLIGHT]
org.apache.hudi.exception.HoodieIOException: Failed to create file /var/folders/60/wk8qzx310fd32b2dp7mhzvdc0000gn/T/junit94869438059871451/dataset/.hoodie/metadata/.hoodie/00000000000005.deltacommit.requested
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:673)
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createFileInMetaPath(HoodieActiveTimeline.java:655)
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createNewInstant(HoodieActiveTimeline.java:163)
	at org.apache.hudi.client.BaseHoodieWriteClient.startCommit(BaseHoodieWriteClient.java:895)
	at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:877)
	at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:860)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:141)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:670)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:694)
	at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:69)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
	at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:69)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:211)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:176)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$6(CleanActionExecutor.java:238)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:232)
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:339)
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:781)
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:750)
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:804)
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:794)
	at org.apache.hudi.table.TestCleaner.runCleaner(TestCleaner.java:693)
...
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists: file:/var/folders/60/wk8qzx310fd32b2dp7mhzvdc0000gn/T/junit94869438059871451/dataset/.hoodie/metadata/.hoodie/00000000000005.deltacommit.requested
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$create$2(HoodieWrapperFileSystem.java:222)
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:101)
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.create(HoodieWrapperFileSystem.java:221)
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:668)
	... 142 more
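
The `Caused by` section above comes down to creating a file that must not already exist. The following is a minimal sketch of that behavior using only `java.nio` on the local file system, not Hudi or Hadoop code; the class `ImmutableInstantFileSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: mimics "create a file that must not exist yet"
// with java.nio. This is NOT the Hudi or Hadoop implementation.
class ImmutableInstantFileSketch {

    // Hypothetical helper: returns true if the file was created,
    // false if a file with the same name was already present.
    static boolean tryCreate(Path file) {
        try {
            Files.createFile(file); // throws if the file already exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates the same "requested" file twice, as a retried action would
    // if it did not notice the leftover file from the first attempt.
    static boolean[] createTwice() {
        try {
            Path dir = Files.createTempDirectory("timeline-sketch");
            Path requested = dir.resolve("00000000000005.deltacommit.requested");
            return new boolean[] {tryCreate(requested), tryCreate(requested)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] results = createTwice();
        System.out.println("first attempt created file: " + results[0]);
        System.out.println("retry created file: " + results[1]);
    }
}
```

The second call fails for the same reason the retried clean fails in step (6): the file left behind by step (4) is still there.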

The root cause is that the retried clean action reuses the same instant timestamp, and the metadata table (MDT) is not expected to roll back the pending deltacommit with that timestamp before committing the changes again. As a result, starting the commit tries to create 00000000000005.deltacommit.requested a second time and fails. This PR fixes the problem by checking all instants on the active timeline of the metadata table, not just the completed instants, before starting a commit in the metadata table.
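
The difference between checking only completed instants and checking all instants can be sketched with a toy timeline. This is a simplified model under stated assumptions, not the code changed in this PR: `MetadataCommitStartSketch`, `Instant`, `State`, and both `isNewInstant...` methods are hypothetical, and the "completed only" variant is inferred from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a timeline; all names here are hypothetical and
// do not correspond to Hudi classes or methods.
class MetadataCommitStartSketch {

    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static final class Instant {
        final String timestamp;
        final State state;

        Instant(String timestamp, State state) {
            this.timestamp = timestamp;
            this.state = state;
        }
    }

    // Looks only at completed instants, so a pending instant with the
    // same timestamp goes unnoticed and a new commit would be started.
    static boolean isNewInstantCompletedOnly(List<Instant> timeline, String ts) {
        return timeline.stream()
                .noneMatch(i -> i.state == State.COMPLETED && i.timestamp.equals(ts));
    }

    // Looks at every instant regardless of state, so a leftover
    // requested or inflight instant is detected.
    static boolean isNewInstantAllStates(List<Instant> timeline, String ts) {
        return timeline.stream().noneMatch(i -> i.timestamp.equals(ts));
    }

    // Metadata table timeline after step (4) of the scenario.
    static List<Instant> timelineAfterCrash() {
        return Arrays.asList(
                new Instant("00000000000005", State.REQUESTED),
                new Instant("00000000000005", State.INFLIGHT));
    }

    public static void main(String[] args) {
        List<Instant> timeline = timelineAfterCrash();
        String ts = "00000000000005";
        System.out.println("completed-only check says new: "
                + isNewInstantCompletedOnly(timeline, ts));
        System.out.println("all-states check says new: "
                + isNewInstantAllStates(timeline, ts));
    }
}
```

In this model the completed-only check reports the instant as new, which would lead to a second attempt at creating the requested file, while the all-states check reports that it already exists.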

Brief change log

  • Fixes the commit-starting logic in SparkHoodieBackedTableMetadataWriter and FlinkHoodieBackedTableMetadataWriter
  • Adds new unit and functional tests for the behavior

Verify this pull request

This change adds new tests in TestCleaner and TestCleanPlanExecutor for the failure scenario described above. Before the fix, the tests fail; after the fix, they succeed without any exception.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@yihua yihua force-pushed the HUDI-3624-reattempt-failed-cleaning branch from 1c6c768 to 7ef79d8 Compare March 23, 2022 20:39
@yihua yihua changed the title from "[WIP][HUDI-3624] Check all instants before starting a commit in metadata table" to "[HUDI-3624] Check all instants before starting a commit in metadata table" Mar 23, 2022
@yihua yihua added the priority:blocker (Production down; release blocker) label Mar 23, 2022
@yihua yihua force-pushed the HUDI-3624-reattempt-failed-cleaning branch from 7ef79d8 to 62aa788 Compare March 24, 2022 21:37
@nsivabalan nsivabalan left a comment

LGTM

@yihua yihua merged commit 9b3dd2e into apache:master Mar 25, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022