Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1863] Multi-Active Launch Job Related Issues #3727

Merged
merged 5 commits into from
Aug 1, 2023

Conversation

umustafi
Copy link
Contributor

@umustafi umustafi commented Jul 28, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Bugs

  • DagManager check leader status before addDag bc calling this method from non-leader hosts throws a NPE which may caused failed dag event to be emitted. Then those job executions are skipped since they are marked as failed.
  • also handle LAUNCH type events upon leader change and setting a new participant DagManager to be active. Failing to handle these events may be causing missed flow launches on any leader change or restart. 
    Refactoring Done
  • Create a static function in Orchestrator to emit a flow failed event if spec compilation fails. This function is shared by the between the DagManager and Orchestrator. It takes in the eventSubmitter as an argument to emit the event.
  • I abstracted all the checks in Orchestrator.orchestrate done to compile the flow and ensure an execution is allowed to be re-used by the DagActionStoreChangeMonitor when submitting the flow to the DagManager by calling submitFlowToDagManager.
  • Note that we skip spec compilation in multi-active mode to avoid two compilations since it's a memory intensive process and we can wait to do it before submitting to the DagManager.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    NA

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@umustafi umustafi changed the title [GOBBLIN-1863] DagManager checks leader status before adding dag to avoid NPE [GOBBLIN-1863] Multi-Active Launch Job Related Issues Jul 29, 2023
Comment on lines 388 to 395
public void submitFlowToDagManager(FlowSpec flowSpec)
throws IOException {
submitFlowToDagManager(flowSpec, specCompiler.compileFlow(flowSpec));
Dag<JobExecutionPlan> jobExecutionPlanDag = specCompiler.compileFlow(flowSpec);
if (jobExecutionPlanDag == null || jobExecutionPlanDag.isEmpty()) {
emitFlowCompilationFailedEvent(flowSpec, TimingEventUtils.getFlowMetadata(flowSpec));
return;
}
submitFlowToDagManager(flowSpec, jobExecutionPlanDag);
Copy link
Contributor

@Will-Lo Will-Lo Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm wondering why we don't directly call the orchestrate() function from the SpecMonitor instead of submitFlowToDagManager, because there are a few more checks done in that function, namely:

  1. Checking for another execution
  2. Ensuring that the specCompiler is ready before compilation (happens less these days with a file based compiler but still possible that there will be a race condition introduced where you try to compile a flow when the service starts up but the flowgraph is not finished loading into memory).

If this is not possible, we need to abstract these checks to another function and ensure that they're done before submitting to the dagmanager in any configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't call orchestrate function directly from the DagActionStoreMonitor because the chain of events is Scheduler -> Orchestrator -> FlowTriggerHandler (Multi-Active algorithm) -> MySQL -> Brooklin change stream -> DagActionStoreMonitor -> DagManager so we don't want the MA algorithm triggered again. However, I did end up abstracting all the checks done to compile the flow and ensure an execution is allowed to re-use when submitting the flow to the DagManager from the DagActionStoreMonitor

Comment on lines 479 to 481
public void handleLaunchFlowEvent(DagActionStore.DagAction action) {
if (this.specCompiler.isPresent()) {
FlowId flowId = new FlowId().setFlowGroup(action.getFlowGroup()).setFlowName(action.getFlowName());
Copy link
Contributor

@Will-Lo Will-Lo Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as the one above, wanted to double check that all the checks are being done before submitting these flows which the orchestrator is doing, otherwise there will be some bugs introduced that are guarded against. Is it possible to use a helper or a static function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-used helper function created in Orchestrator

Comment on lines +380 to +385
Optional<TimingEvent> flowCompileFailedTimer = eventSubmitter.transform(submitter ->
new TimingEvent(submitter, TimingEvent.FlowTimings.FLOW_COMPILE_FAILED));

if (flowCompileFailedTimer.isPresent()) {
flowCompileFailedTimer.get().stop(flowMetadata);
}
Copy link
Contributor

@phet phet Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this works (w/o being challenged by checked exceptions), it would be more idiomatic:

eventSubmitter.transform(...).ifPresent(timer -> timer.stop(fmd));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous response about Java Optional vs. Google Optional

Comment on lines 393 to 395
return;
}
submitFlowToDagManager(flowSpec, jobExecutionPlanDag);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find else MUCH preferable to the early return here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With lots of nesting it becomes hard to tell at which points we may have exited the code block, I'd prefer early return since I end up adding a few more checks that must pass before we submit to the dagManager.

* @param spec
* @param flowMetadata
*/
public void emitFlowCompilationFailedEvent(Spec spec, Map<String, String> flowMetadata) {
Copy link
Contributor

@phet phet Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no biggie, but naming-wise, I agree the timing is emitted... but what may be most noteworthy is that it inserts the METADATA_MESSAGE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed to populateFlowCompilationFailedEventMessage, what do you think?

* the process.
*/
public void handleLaunchFlowEvent(DagActionStore.DagAction action) {
if (this.specCompiler.isPresent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.specCompiler.ifPresent(theSpecCompiler -> {
  ...
});

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is using google library's Optional which does not have this method. I've generally seen us using this Optional version so will keep as is.

try {
URI flowUri = FlowSpec.Utils.createFlowSpecUri(flowId);
spec = (FlowSpec) flowCatalog.getSpecs(flowUri);
Optional<Dag<JobExecutionPlan>> optionalJobExecutionPlanDag = orchestrator.handleChecksBeforeExecution(spec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking since we need to pass the orchestrator to the dagmanager here which seems like an antipattern, it would be easier to have the orchestrator handle the dagmanager launch event on startup. What would the LOE be in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this function to the Orchestrator, does not actually change the code pattern here since there's only a bit of wrapper code (mainly to get the spec corresponding to the dagAction) around the Orchestrator helper funcs that it relies on. It's more clear to keep it here since the use case is DagManager specific

@codecov-commenter
Copy link

codecov-commenter commented Aug 1, 2023

Codecov Report

Merging #3727 (8b9be6e) into master (8198404) will decrease coverage by 0.38%.
Report is 1 commits behind head on master.
The diff coverage is 26.82%.

@@             Coverage Diff              @@
##             master    #3727      +/-   ##
============================================
- Coverage     47.45%   47.08%   -0.38%     
- Complexity     8687    10856    +2169     
============================================
  Files          1731     2144     +413     
  Lines         66736    84746   +18010     
  Branches       7219     9409    +2190     
============================================
+ Hits          31671    39902    +8231     
- Misses        32306    41222    +8916     
- Partials       2759     3622     +863     
Files Changed Coverage Δ
...org/apache/gobblin/runtime/api/DagActionStore.java 84.61% <0.00%> (-7.06%) ⬇️
...ervice/monitoring/DagActionStoreChangeMonitor.java 0.00% <0.00%> (ø)
...blin/service/modules/orchestration/DagManager.java 79.87% <12.00%> (-2.65%) ⬇️
...in/service/modules/orchestration/Orchestrator.java 47.36% <32.58%> (-4.23%) ⬇️
...blin/runtime/api/MysqlMultiActiveLeaseArbiter.java 78.84% <100.00%> (ø)

... and 426 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Will-Lo Will-Lo merged commit d13c327 into apache:master Aug 1, 2023
phet pushed a commit to phet/gobblin that referenced this pull request Aug 15, 2023
* DagManager checks leader status before adding dag to avoid NPE

* handle launch events after leader change in DagManager

* Refactor Orchestrator and DagManager

* remove unecessary specCompiler changes

* add docstring

---------

Co-authored-by: Urmi Mustafi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants