[HUDI-3497] Adding Datatable validator tool #4902
@zhangyue19921010: Can you help review this as well?
Sure, will do it ASAP.
LGTM, left a few comments. Thanks for this powerful patch :)
Just one small question: we validate based on the active timeline. Is that information enough, or should we take archived instants into account as well?
After all, not all data files can be tracked by the archived timeline.
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieDataTableValidator.java
LOG.error("Data table validation failed due to extra files found for completed commits " + danglingFiles.size());
danglingFiles.forEach(entry -> LOG.error("Dangling file: " + entry.toString()));
finalResult = false;
if (!cfg.ignoreFailed) {
Maybe we could throw HoodieValidationException("Data table validation failed due to dangling files " + danglingFiles.size()) whenever danglingFiles is not empty, and check !cfg.ignoreFailed only in the catch block?
Yes, I am throwing HoodieValidationException, so I am not sure I understand your feedback. I also want to catch any other exception in general, not just HoodieValidationException. Hence I throw here, catch it in the catch block, and throw again if cfg.ignoreFailed is set to false.
Okay got it. Thanks for your explanation. We're all good here.
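The control flow agreed on in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: `ValidationException` is a hypothetical stand-in for `HoodieValidationException`, and `validateOnce` and its parameters are names invented for the example. Only the idea is taken from the discussion: throw when dangling files are found, catch any exception in one place, and rethrow unless `ignoreFailed` is set.

```java
import java.util.List;

// Hypothetical stand-in for HoodieValidationException so the sketch compiles on its own.
class ValidationException extends RuntimeException {
    ValidationException(String message) {
        super(message);
    }
}

class ValidationFlowSketch {
    /**
     * Runs one validation round (assumed shape, not the patch's actual method).
     * Returns true if no problem was found, false if a problem was found but
     * ignoreFailed is set. Rethrows the failure when ignoreFailed is false.
     */
    static boolean validateOnce(List<String> danglingFiles, boolean ignoreFailed) {
        try {
            if (!danglingFiles.isEmpty()) {
                // Always throw on dangling files; the ignoreFailed decision is made in the catch block.
                throw new ValidationException(
                    "Data table validation failed due to dangling files " + danglingFiles.size());
            }
            return true;
        } catch (Exception e) {
            // Catches ValidationException as well as any other unexpected runtime failure.
            if (!ignoreFailed) {
                throw e;
            }
            return false;
        }
    }
}
```

With this shape, a single catch block decides whether a failure ends the run, regardless of which check raised it.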
Have addressed all feedback. Asked for a clarification on one of the comments.
zhangyue19921010 left a comment
LGTM :)
What is the purpose of the pull request
Add a data table validation tool that checks whether the data table is in a sane state.
It can run in sync-once mode or in continuous mode.
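For readers unfamiliar with the two modes, the difference can be sketched as below. This is an illustrative assumption about the overall shape, not the tool's actual implementation: `ValidatorModesSketch`, `run`, `maxRounds`, and `sleepMillis` are hypothetical names, and `maxRounds` exists only so that the sketch terminates.

```java
import java.util.function.BooleanSupplier;

class ValidatorModesSketch {
    /**
     * Runs the given validation once, or repeatedly when continuous is true.
     * Returns the number of rounds in which validation reported a failure.
     * maxRounds is a sketch-only bound (an assumption) so the loop terminates.
     */
    static int run(BooleanSupplier validateOnce, boolean continuous, int maxRounds, long sleepMillis) {
        int failures = 0;
        int rounds = continuous ? maxRounds : 1;
        for (int i = 0; i < rounds; i++) {
            if (!validateOnce.getAsBoolean()) {
                failures++;
            }
            // Pause between rounds in continuous mode, but not after the last one.
            if (continuous && i < rounds - 1) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop validating.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return failures;
    }
}
```

In once mode the validation callback runs a single time. In continuous mode it runs every round, so a problem introduced after the first few rounds is still counted.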
Verify this pull request
Verified by running locally, in both sync-once and continuous modes. Injected failures in continuous mode after a few rounds and confirmed that the tool caught them.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.