Skip to content

Conversation

@nsivabalan
Copy link
Contributor

What is the purpose of the pull request

Add data table validation tool to validate data table in sane condition.

  • No data files found for commits prior to active timeline.
  • No extra data files for completed commits in active timeline compared to whats stored in commit metadata.

Can run in sync once mode or continuous mode as well

Verify this pull request

verified by running locally.
both sync once and continious. injected failures in continuous after few rounds and ensured the tool caught it.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan requested a review from yihua February 24, 2022 14:20
@nsivabalan nsivabalan changed the title [HUDI-3497] Adding Datavalidator tool [HUDI-3497] Adding Datatable validator tool Feb 24, 2022
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Feb 24, 2022
@nsivabalan
Copy link
Contributor Author

@zhangyue19921010 : Can you help review this as well.

@zhangyue19921010
Copy link
Contributor

@zhangyue19921010 : Can you help review this as well.

Sure will do it ASAP

Copy link
Contributor

@zhangyue19921010 zhangyue19921010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a few comments. Thanks for this powerful patch :)

Just a little Question, we do validation based on active timeline. Are these information enough or maybe we could take archived instants into account ?

After all not all the data files can be tracked by archive timeline.

LOG.error("Data table validation failed due to extra files found for completed commits " + danglingFiles.size());
danglingFiles.forEach(entry -> LOG.error("Dangling file: " + entry.toString()));
finalResult = false;
if (!cfg.ignoreFailed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could throw this HoodieValidationException("Data table validation failed due to dangling files " + danglingFiles.size()); each time danglingFiles not empty and check !cfg.ignoreFailed only in catch block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I am throwing HoodieValidationException. not sure I understand your feedback. I also want to catch any other exceptions in general and not just HoodieValidationException. hence I had to throw here and again catch in catch block and again throw if cfg.ignoreFailed is set to false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay got it. Thanks for your explanation. We're all good here.

@nsivabalan
Copy link
Contributor Author

Have addressed all feedback. asked a clarification on one of the comment.

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@zhangyue19921010 zhangyue19921010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

LOG.error("Data table validation failed due to extra files found for completed commits " + danglingFiles.size());
danglingFiles.forEach(entry -> LOG.error("Dangling file: " + entry.toString()));
finalResult = false;
if (!cfg.ignoreFailed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay got it. Thanks for your explanation. We're all good here.

@nsivabalan nsivabalan merged commit f7088a9 into apache:master Mar 1, 2022
@xushiyan xushiyan added the status:triaged Issue has been reviewed and categorized label Mar 8, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled status:triaged Issue has been reviewed and categorized

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants