Contributor

@bvaradar bvaradar commented Oct 6, 2019

Before this change, the Cleaner deleted old file versions and then recorded the deleted files in .clean files.
With that setup, file deletions cannot be tracked if the cleaner fails after deleting files but before writing the .clean metadata.
This is fine for regular file-system view generation, but incremental timeline syncing relies on clean/commit/compaction metadata to keep the file-system view consistent.

Cleaner state transitions are now similar to those of compaction.

  1. Requested : HoodieWriteClient.scheduleClean() selects the list of files that need to be deleted and stores them in metadata
  2. Inflight : HoodieWriteClient marks the state as inflight before it starts deleting
  3. Completed : HoodieWriteClient marks the state as completed after finishing the deletions according to the cleaner plan
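
The three transitions above can be sketched roughly as follows. This is a minimal in-memory illustration, not Hudi's actual code: `CleanLifecycleSketch`, `CleanInstant`, and `runClean` are hypothetical names, and only `scheduleClean` echoes a method named in the description. The point it shows is that the plan is recorded before any deletion, so a failure mid-clean still leaves a record of what was meant to be deleted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the three-state clean lifecycle; not Hudi's real classes. */
final class CleanLifecycleSketch {

  enum State { REQUESTED, INFLIGHT, COMPLETED }

  /** A clean instant whose plan is recorded before any file is deleted. */
  static final class CleanInstant {
    final String instantTime;
    final List<String> filesToDelete;                     // the recorded cleaner plan
    final List<String> deletedFiles = new ArrayList<>();  // what was actually removed
    State state = State.REQUESTED;

    CleanInstant(String instantTime, List<String> filesToDelete) {
      this.instantTime = instantTime;
      this.filesToDelete = new ArrayList<>(filesToDelete);
    }
  }

  /** Step 1: select the files and record them as a REQUESTED instant. */
  static CleanInstant scheduleClean(String instantTime, List<String> candidates) {
    return new CleanInstant(instantTime, candidates);
  }

  /** Steps 2 and 3: mark INFLIGHT, delete according to the plan, then mark COMPLETED. */
  static void runClean(CleanInstant instant) {
    if (instant.state == State.COMPLETED) {
      throw new IllegalStateException("Clean " + instant.instantTime + " already completed");
    }
    instant.state = State.INFLIGHT;
    for (String file : instant.filesToDelete) {
      // A real implementation would delete from storage here (assumption).
      if (!instant.deletedFiles.contains(file)) {
        instant.deletedFiles.add(file);
      }
    }
    instant.state = State.COMPLETED;
  }
}
```

Because the plan lives in the REQUESTED instant, a reader of the timeline can always tell which files a clean intended to delete, even if the process died before reaching COMPLETED.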

Follow-up PRs will come after this:

  1. HUDI-294 for making cleaner stats use relative paths
  2. HUDI-137 for similar handling of rollback
  3. HUDI-80 for incrementalizing cleaning

@bvaradar bvaradar changed the title from "[WIP] [HUDI-137] Fix state transitions for Hudi cleaning action" to "[HUDI-137] Fix state transitions for Hudi cleaning action" on Oct 6, 2019
Comment on lines 1119 to 1127
Contributor

Should this snippet of code be removed?

Contributor Author

Will confirm with @n3nash before I remove this part.

Contributor

leesf commented Oct 6, 2019

Thanks for opening the PR, @bvaradar. I left a few comments.

Member

@vinothchandar vinothchandar left a comment

A few questions.

Member

Can we create a HoodieCleanerClient and consolidate all the cleaning code there, in the interest of keeping HoodieWriteClient more readable?

Contributor Author

Done.

Member

can we pass the table from clean() above?

Contributor Author

Done.

Member

Is your formatting consistent? It looks like this line could easily fit on the previous line.

Contributor Author

Will run spotcheck to fix this.

Member

Space between , and 1? Is checkstyle not enforcing this?

Member

inflight?

Contributor Author

Done

Member

some javadocs

Contributor Author

Added

Member

In the real code, we could have the same instant time for commit and clean? Is this needed?

Contributor Author

IIRC, I added this to force different timestamps for the same actions (1.clean, 2.clean, ...)

Member

What are your thoughts on moving forward and skipping the requested -> inflight transition if the clean is already inflight, instead of reverting the state back? Do we do the same thing for compaction?

Contributor Author

Agreed. In the case of compaction, we do this and also roll back partially written parquet. In this case, that is not needed.
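
A rough sketch of the "move forward" behaviour discussed here, under the assumption that deleting an already-deleted file is a harmless no-op. `ResumeCleanSketch`, `PendingClean`, and `resume` are hypothetical names for illustration, and storage is modelled as a plain set of file names rather than a real file system.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: resume a pending clean without reverting its state. */
final class ResumeCleanSketch {

  enum State { REQUESTED, INFLIGHT, COMPLETED }

  static final class PendingClean {
    State state;
    final Set<String> plannedFiles;

    PendingClean(State state, Set<String> plannedFiles) {
      this.state = state;
      this.plannedFiles = new HashSet<>(plannedFiles);
    }
  }

  /** Drives a pending clean to COMPLETED from whichever state it was left in. */
  static State resume(PendingClean clean, Set<String> storage) {
    if (clean.state == State.REQUESTED) {
      // Only a REQUESTED clean needs the requested -> inflight transition.
      clean.state = State.INFLIGHT;
    }
    if (clean.state == State.INFLIGHT) {
      // Already inflight: re-run the deletions from the recorded plan.
      // Files removed before the failure are simply absent, so this is a no-op for them.
      storage.removeAll(clean.plannedFiles);
      clean.state = State.COMPLETED;
    }
    return clean.state;
  }
}
```

Unlike compaction, nothing is written during a clean, so there is no partial output to roll back; replaying the plan is enough.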

Contributor Author

Done.

Member

Since we finish up any inflight/requested cleans before getting here, would the list built here mostly pick up only new files to clean?

Contributor Author

Yes
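
The ordering confirmed above might look like the sketch below: pending plans are applied first, and only then is a new plan computed from what remains. All names (`CleanOrderingSketch`, `finishPendingThenPlan`, the `isObsolete` predicate) are assumptions made for illustration, with storage again modelled as a set of file names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: finish pending cleans before planning a new one. */
final class CleanOrderingSketch {

  /**
   * Applies every pending plan to storage, then returns a new plan containing
   * only the remaining files that the predicate marks as obsolete.
   */
  static List<String> finishPendingThenPlan(
      List<List<String>> pendingPlans, Set<String> storage, Predicate<String> isObsolete) {
    for (List<String> plan : pendingPlans) {
      // Files from earlier requested/inflight cleans are removed first.
      storage.removeAll(plan);
    }
    List<String> newPlan = new ArrayList<>();
    for (String file : storage) {
      if (isObsolete.test(file)) {
        newPlan.add(file);  // not covered by any pending plan, so genuinely new
      }
    }
    return newPlan;
  }
}
```

Since files covered by pending plans are gone by the time the new plan is built, the new plan cannot list them a second time.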

@bvaradar
Contributor Author

@vinothchandar: Addressed review comments. I also re-enabled spotless temporarily to fix lint issues automatically, then disabled it again. We can wait for the tests to pass before reviewing again.

@bvaradar bvaradar force-pushed the hudi_137_clean branch 5 times, most recently from 8bf4d20 to aaa73d5, November 5, 2019 15:52
Contributor Author

bvaradar commented Nov 5, 2019

@vinothchandar: Updated PR. Please review when you get a chance

Member

@vinothchandar vinothchandar left a comment

Please fix the two small things and go ahead and merge

Member

Please move these out of the code and into a JIRA.

Member

Yes. We can revisit everything more holistically for the archival redesign.

Member

Should this be refreshed for every operation, like the table? Or is that not needed since the table is passed in? I think the latter; just calling it out to confirm.

Contributor Author

Yes, it is the latter. No need to refresh the Cleaner client itself.

@vinothchandar vinothchandar self-assigned this Nov 11, 2019
@bvaradar bvaradar merged commit 1032fc3 into apache:master Nov 11, 2019
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Oct 1, 2024
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Oct 4, 2024
3 participants