Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-2193] Fail Azkaban job on Temporal Job Failure #4096

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

iPalash
Copy link
Contributor

@iPalash iPalash commented Feb 6, 2025

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    Currently when the temporal job running on Yarn fails, we don't propagate the error back to Azkaban job which launches the Yarn Application.

The change here bubbles the issues encountered when the job fails upto the GobblinYarnAppLaucher run by the Azkaban job and fails with a RuntimeException after logging the issues summary.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Tested manually

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@iPalash iPalash changed the title [GOBBLIN-2193] Fail Azkaban job on GaaS Job (running on temporal) Failure [GOBBLIN-2193] Fail Azkaban job on Temporal Job Failure Feb 6, 2025
@Blazer-007 Blazer-007 requested a review from Copilot February 6, 2025 07:02
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

gobblin-temporal/src/test/java/org/apache/gobblin/temporal/yarn/YarnServiceTest.java:148

  • The testHandleJobFailureEvent method should also test the scenario where the job state is successful.
}

@codecov-commenter
Copy link

codecov-commenter commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 13.33333% with 13 lines in your changes missing coverage. Please review.

Project coverage is 51.49%. Comparing base (adef734) to head (ba88388).

Files with missing lines Patch % Lines
...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java 13.33% 13 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4096      +/-   ##
============================================
+ Coverage     45.38%   51.49%   +6.10%     
- Complexity     3192     7566    +4374     
============================================
  Files           696     1388     +692     
  Lines         26628    52198   +25570     
  Branches       2655     5733    +3078     
============================================
+ Hits          12085    26878   +14793     
- Misses        13542    23026    +9484     
- Partials       1001     2294    +1293     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@iPalash iPalash requested a review from Blazer-007 February 13, 2025 04:48
if (this.jobSummaryEvent.getJobState() != null && !this.jobSummaryEvent.getJobState().getState().isSuccess()) {
this.amrmClientAsync.unregisterApplicationMaster(FinalApplicationStatus.FAILED, this.jobSummaryEvent.getIssuesSummary(), null);
} else {
this.amrmClientAsync.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, StringUtils.defaultString(this.jobSummaryEvent.getIssuesSummary()), null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.jobSummaryEvent.getIssuesSummary() wouldn't be null, right? since getIssuesSummary() returns an empty string as default. if yes, it is fine to use StringUtils.defaultString but we should use for both statuses or not use it at all

TextStringBuilder sb = new TextStringBuilder();
try {
List<Issue> issues = this.getIssueRepository().getAll();
if (issues.size() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use issues.isEmpty()

LOGGER.error("Gobblin Yarn application failed for the following reason: " + applicationReport.getDiagnostics());
applicationFailed = true;
LOGGER.error("Gobblin Yarn application failed because of the following issues: " + applicationReport.getDiagnostics());
} else if (StringUtils.isNotBlank(applicationReport.getDiagnostics())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be removed as it's not useful to have diagnostics for success cases, these are mostly task failures which have already been retried

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be useful in case where no work units were generated as in that cases job always succeeds so it will be easier to know directly.

Comment on lines +389 to +392
synchronized (this.applicationDone) {
while (!this.applicationCompleted) {
try {
this.applicationDone.wait();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be simpler and cleaner to use CountDownLatch instead of explicit synchronization with synchronized, wait(), and notify()

throw new RuntimeException("Gobblin Yarn application failed");
}
} catch (InterruptedException ie) {
LOGGER.error("Interrupted while waiting for the Gobblin Yarn application to finish", ie);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw exception?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants