[GOBBLIN-1837] Implement multi-active, non blocking for leader host #3700
Will-Lo merged 11 commits into apache:master
phet left a comment
overall, very nice work urmi!
(I have a bit more to go, since this is a BIG PR, but here are my first thoughts)
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/DagActionStore.java
   */
- DagAction getDagAction(String flowGroup, String flowName, String flowExecutionId) throws IOException, SpecNotFoundException,
-     SQLException;
+ DagAction getDagAction(String flowGroup, String flowName, String flowExecutionId, DagActionValue dagActionValue)
There was a problem hiding this comment.
one thing I don't understand: a DagAction has exactly the four fields that are params to this method. if so, does this method just duplicate exists?
merely wondering: is there a use case for getting any and all actions related to a particular flow execution?
relatedly w/ deleteDagAction (above), couldn't that take just a single param of type DagAction?
There was a problem hiding this comment.
You're right: using getDagAction doesn't make sense anymore, since the action is part of the primary key. It may instead be useful to have getDagActions(flow identifiers) to get all pending actions associated with a flow, but we don't have an explicit use case at the moment, so I will remove this method.
Every method on the store now needs all the columns that make up the primary key, so we could pass a DagAction to any of these functions. Looking at how the functions are used, though, we would end up creating a new DagAction object, passing it to the function, and then unpacking those values anyway. I am not certain that changing the signature is that beneficial, unless we care more about encapsulating the idea that the PK is needed for all of these actions and that DagAction is the PK.
There was a problem hiding this comment.
for the delete case especially, I wondered whether we'd already have a DagAction on hand.
overall, and IMO unfortunately, we use very little abstraction throughout gobblin service. most emblematic is the regular use of (String, String, String) or (String, String, long) to specify a flow execution ID. an alternative impl, by contrast, might combine a FlowExecutionId w/ a DagActionValue to form a DagAction. that would be not only more succinct and self-documenting, but also more typesafe. it's from this general perspective that I prefer the signature:
deleteDagAction(DagAction)
to
deleteDagAction(String, String, String, DagActionValue)
There was a problem hiding this comment.
Changing to utilize DagAction in signature but leaving the other methods as is for now.
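To make the signature discussion above concrete, here is a minimal sketch of the typed alternative. Every name in it (FlowExecutionId, DagAction, InMemoryDagActionStore) is a hypothetical stand-in chosen for illustration, not Gobblin's actual classes, and the in-memory set only stands in for the MySQL-backed store.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// All names below are hypothetical stand-ins, not Gobblin's actual classes.
enum DagActionValue { KILL, RESUME }

/** Groups the three strings that identify one flow execution. */
final class FlowExecutionId {
  final String flowGroup;
  final String flowName;
  final String flowExecutionId;

  FlowExecutionId(String flowGroup, String flowName, String flowExecutionId) {
    this.flowGroup = Objects.requireNonNull(flowGroup);
    this.flowName = Objects.requireNonNull(flowName);
    this.flowExecutionId = Objects.requireNonNull(flowExecutionId);
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof FlowExecutionId)) return false;
    FlowExecutionId that = (FlowExecutionId) o;
    return flowGroup.equals(that.flowGroup)
        && flowName.equals(that.flowName)
        && flowExecutionId.equals(that.flowExecutionId);
  }

  @Override public int hashCode() {
    return Objects.hash(flowGroup, flowName, flowExecutionId);
  }
}

/** The full primary key: a flow execution plus the action requested on it. */
final class DagAction {
  final FlowExecutionId flowExecution;
  final DagActionValue value;

  DagAction(FlowExecutionId flowExecution, DagActionValue value) {
    this.flowExecution = Objects.requireNonNull(flowExecution);
    this.value = Objects.requireNonNull(value);
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof DagAction)) return false;
    DagAction that = (DagAction) o;
    return flowExecution.equals(that.flowExecution) && value == that.value;
  }

  @Override public int hashCode() {
    return Objects.hash(flowExecution, value);
  }
}

/** In-memory stand-in for the store; each method takes the whole key as one argument. */
final class InMemoryDagActionStore {
  private final Set<DagAction> actions = new HashSet<>();

  boolean addDagAction(DagAction action) { return actions.add(action); }
  boolean exists(DagAction action) { return actions.contains(action); }
  boolean deleteDagAction(DagAction action) { return actions.remove(action); }
}
```

With this shape a caller cannot swap flowGroup and flowName by accident at the store boundary, which is the type-safety point raised in the review.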
...n-runtime/src/main/java/org/apache/gobblin/runtime/dag_action_store/MysqlDagActionStore.java
RETRY,
CANCEL,
NEXT_HOP
There was a problem hiding this comment.
nit: I mentioned ADVANCE elsewhere, but NEXT_HOP is fine too. as for RETRY, I believe RESUME is the terminology we've adopted pretty widely--or do you find precedent for RETRY?
There was a problem hiding this comment.
In terms of the action taken, RETRY and RESUME work similarly, but we use them to describe different starting points. RETRY is invoked automatically by the DagManager if a flow fails and is configured to allow retries. RESUME is invoked manually by the user. It may be worth noting the distinction for logging purposes while treating the two cases the same when it comes to acting on them.
phet left a comment
whew! nice work overall, with a couple design points to discuss
Will-Lo left a comment
Looking good so far, few comments
Force-pushed from 3667771 to 82708a1
Codecov Report
@@ Coverage Diff @@
## master #3700 +/- ##
============================================
- Coverage 46.89% 46.82% -0.07%
- Complexity 10772 10801 +29
============================================
Files 2138 2141 +3
Lines 84139 84405 +266
Branches 9357 9383 +26
============================================
+ Hits 39456 39525 +69
- Misses 41078 41277 +199
+ Partials 3605 3603 -2
... and 25 files with indirect coverage changes
phet left a comment
I'm about 90% done w/ this PR... here are most of the comments to consider. I'll need to finish off a little later.
- this.dagActionStore.get().deleteDagAction(dagId.flowGroup, dagId.flowName, dagId.flowExecutionId);
+ this.dagActionStore.get().deleteDagAction(
+     new DagActionStore.DagAction(dagId.flowGroup, dagId.flowName, dagId.flowExecutionId,
+         DagActionStore.FlowActionType.KILL));
There was a problem hiding this comment.
how do we know for sure that this would be a FlowActionType.KILL? (may be worth a code comment)
There was a problem hiding this comment.
You're right, this depends on the caller; I did not notice that. I modified the method to take the FlowActionType as a parameter. As of now it should only have the values RESUME or KILL.
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/MySQLMultiActiveLeaseArbiter.java
+ "WHEN CURRENT_TIMESTAMP < (lease_acquisition_timestamp + linger) then 1"
+ "WHEN CURRENT_TIMESTAMP >= (lease_acquisition_timestamp + linger) then 2"
+ "ELSE 3 END as leaseValidityStatus, linger FROM %s, %s " + WHERE_CLAUSE_TO_MATCH_KEY;
There was a problem hiding this comment.
what about:
(lease_acquisition_timestamp + linger) < CURRENT_TIMESTAMP as isLeaseExpired
then you either have:
1 (TRUE) - expired
0 (FALSE) - not expired
NULL - no lease
(IMO easier to follow than SQL CASE)
There was a problem hiding this comment.
I tried using a boolean value for it, but if the column is NULL then the boolean returned is false, so it becomes hard to distinguish between leaseValid and noLease. I ended up having to specially define the no-lease case.
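The NULL problem described above comes from how JDBC reads booleans: ResultSet.getBoolean returns false when the column is SQL NULL, and ResultSet.wasNull reports whether the last column read was NULL. The sketch below shows one way to recover all three states from a nullable isLeaseExpired column. The LeaseValidity enum and the helper class are hypothetical names for illustration; this is not what the PR does, since the PR keeps the CASE expression.

```java
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical three-state result; the names are illustrative only.
enum LeaseValidity { NO_LEASE, VALID, EXPIRED }

final class LeaseValidityReader {
  private LeaseValidityReader() {}

  /** Maps a nullable "is expired" flag to three states; null models SQL NULL. */
  static LeaseValidity fromNullableFlag(Boolean isLeaseExpired) {
    if (isLeaseExpired == null) {
      return LeaseValidity.NO_LEASE;
    }
    return isLeaseExpired ? LeaseValidity.EXPIRED : LeaseValidity.VALID;
  }

  /** Reads the flag from a row, using wasNull() to tell NULL apart from false. */
  static LeaseValidity read(ResultSet rs, String column) throws SQLException {
    boolean expired = rs.getBoolean(column); // yields false when the column is SQL NULL
    Boolean flag = rs.wasNull() ? null : Boolean.valueOf(expired);
    return fromNullableFlag(flag);
  }
}
```

The trade-off is an extra wasNull() call per read versus an explicit third branch in the SQL; either way the no-lease case has to be spelled out somewhere.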
...rc/main/java/org/apache/gobblin/service/modules/orchestration/SchedulerLeaseAlgoHandler.java
jobProps.setProperty(ConfigurationKeys.SCHEDULER_REMINDER_EVENT_TIMESTAMP_MILLIS_KEY, String.valueOf(status.getReminderEventTimeMillis()));
jobProps.setProperty(ConfigurationKeys.SCHEDULER_NEW_EVENT_TIMESTAMP_MILLIS_KEY, String.valueOf(status.getReminderEventTimeMillis()));
JobKey key = new JobKey(flowAction.getFlowName(), flowAction.getFlowGroup());
Trigger trigger = this.jobScheduler.getTrigger(key, jobProps);
There was a problem hiding this comment.
sorry, I guess I'm unfamiliar: what's the meaning/nature of this Trigger we get from one scheduler and give to another?
(I'd love to avoid bringing in the dependency on JobScheduler if we can avoid it, and instead have this class depend only on the SchedulerService.)
There was a problem hiding this comment.
I added some comments to clarify. There is an existing function in JobScheduler named getTrigger (I will rename it to createTrigger) that we use to create a new Trigger for the job. The Trigger will fire after the lease should expire, and we pass it to the SchedulerService to schedule it.
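The timing behind "fire after the lease should expire" can be sketched independently of any scheduler. The helper below is a hypothetical illustration that only computes how long to wait, assuming the lease expires at acquisition time plus linger as in the SQL quoted earlier in this review; building the actual Trigger is left out.

```java
// Hypothetical helper; the parameter names are assumptions, not Gobblin's API.
final class ReminderTiming {
  private ReminderTiming() {}

  /**
   * Returns how many milliseconds to wait before re-checking a lease held by
   * another participant. The lease is taken to expire at
   * leaseAcquisitionMillis + lingerMillis; a lease that has already expired
   * yields 0 so that the reminder fires immediately.
   */
  static long reminderDelayMillis(long leaseAcquisitionMillis, long lingerMillis, long nowMillis) {
    long expiryMillis = leaseAcquisitionMillis + lingerMillis;
    return Math.max(0L, expiryMillis - nowMillis);
  }
}
```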
… logic from general lease handler
phet left a comment
this is taking shape quite nicely! mostly small comments... we're nearly there!
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/MysqlMultiActiveLeaseArbiter.java
@@ -199,23 +197,27 @@ public LeaseAttemptStatus tryAcquireLease(DagActionStore.DagAction flowAction, l
int leaseValidityStatus = resultSet.getInt(4);
There was a problem hiding this comment.
hopefully doesn't feel like overkill, but I'd abstract this by defining a static inner @Data class with an overloaded constructor (or static factory method) taking a ResultSet
There was a problem hiding this comment.
I don't see a large benefit from this, since the static class would have to encode these column retrievals anyway, but I made it a bit clearer by using the column name instead of the index to retrieve the values, so it's more readable.
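For reference, the suggestion above could look roughly like the sketch below: a small immutable holder with a static factory that reads a row by column name. The class name LeaseRow and the column types are assumptions; the column names are taken from the SQL quoted earlier in this review, and plain fields are used here instead of Lombok's @Data to keep the example self-contained.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

/** Hypothetical holder for one row of the lease query; column types are assumed. */
final class LeaseRow {
  final Timestamp leaseAcquisitionTimestamp; // null when no lease has been recorded
  final int leaseValidityStatus;
  final int linger;

  LeaseRow(Timestamp leaseAcquisitionTimestamp, int leaseValidityStatus, int linger) {
    this.leaseAcquisitionTimestamp = leaseAcquisitionTimestamp;
    this.leaseValidityStatus = leaseValidityStatus;
    this.linger = linger;
  }

  /** Reads the current row by column name, so the SELECT column order can change freely. */
  static LeaseRow fromResultSet(ResultSet rs) throws SQLException {
    return new LeaseRow(
        rs.getTimestamp("lease_acquisition_timestamp"),
        rs.getInt("leaseValidityStatus"),
        rs.getInt("linger"));
  }
}
```

As the reply notes, the column reads still have to live somewhere; the factory mainly keeps them in one place and gives the call site named fields.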
...lin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/Orchestrator.java
if (this.eventSubmitter.isPresent()) {
  new TimingEvent(this.eventSubmitter.get(), TimingEvent.FlowTimings.FLOW_FAILED).stop(flowMetadata);
There was a problem hiding this comment.
tip: ifPresent()
(note: this works easily and naturally... unless checked exceptions)
There was a problem hiding this comment.
This format is used in many other places in the Orchestrator so I will leave as is
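A short sketch of the ifPresent() tip and its checked-exception caveat, assuming java.util.Optional. That is an assumption about the field's type: isPresent() and get() also exist on Guava's Optional, which, as far as I know, has no ifPresent method. The Recorder class is a hypothetical stand-in for the event submitter.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class OptionalIfPresentDemo {
  private OptionalIfPresentDemo() {}

  /** Hypothetical stand-in for an event submitter. */
  static final class Recorder {
    final List<String> events = new ArrayList<>();

    void submit(String event) { events.add(event); }

    void submitChecked(String event) throws IOException { events.add(event); }
  }

  /** ifPresent replaces the isPresent()/get() pair when the body throws nothing checked. */
  static void reportFailure(Optional<Recorder> submitter) {
    submitter.ifPresent(s -> s.submit("FLOW_FAILED"));
  }

  /**
   * Consumer.accept declares no checked exceptions, so a lambda passed to
   * ifPresent cannot let IOException propagate; the explicit check is simpler here.
   */
  static void reportFailureChecked(Optional<Recorder> submitter) throws IOException {
    if (submitter.isPresent()) {
      submitter.get().submitChecked("FLOW_FAILED");
    }
  }
}
```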
Force-pushed from 4799679 to 859490a
Force-pushed from 859490a to 8e434cb
phet left a comment
absolutely excellent work--very nice job here!
* upstream/master:
  Fix bug with total count watermark whitelist (apache#3724)
  [GOBBLIN-1858] Fix logs relating to multi-active lease arbiter (apache#3720)
  [GOBBLIN-1838] Introduce total count based completion watermark (apache#3701)
  Correct num of failures (apache#3722)
  [GOBBLIN- 1856] Add flow trigger handler leasing metrics (apache#3717)
  [GOBBLIN-1857] Add override flag to force generate a job execution id based on gobbl… (apache#3719)
  [GOBBLIN-1855] Metadata writer tests do not work in isolation after upgrading to Iceberg 1.2.0 (apache#3718)
  Remove unused ORC writer code (apache#3710)
  [GOBBLIN-1853] Reduce # of Hive calls during schema related updates (apache#3716)
  [GOBBLIN-1851] Unit tests for MysqlMultiActiveLeaseArbiter with Single Participant (apache#3715)
  [GOBBLIN-1848] Add tags to dagmanager metrics for extensibility (apache#3712)
  [GOBBLIN-1849] Add Flow Group & Name to Job Config for Job Scheduler (apache#3713)
  [GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup (apache#3708)
  [GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time (apache#3704)
  [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched (apache#3711)
  [GOBBLIN-1842] Add timers to GobblinMCEWriter (apache#3703)
  [GOBBLIN-1844] Ignore workflows marked for deletion when calculating container count (apache#3709)
  [GOBBLIN-1846] Validate Multi-active Scheduler with Logs (apache#3707)
  [GOBBLIN-1845] Changes parallelstream to stream in DatasetsFinderFilteringDecorator to avoid classloader issues in spark (apache#3706)
  [GOBBLIN-1843] Utility for detecting non optional unions should convert dataset urn to hive compatible format (apache#3705)
  [GOBBLIN-1837] Implement multi-active, non blocking for leader host (apache#3700)
  [GOBBLIN-1835]Upgrade Iceberg Version from 0.11.1 to 1.2.0 (apache#3697)
  Update CHANGELOG to reflect changes in 0.17.0 Reserving 0.18.0 version for next release
  [GOBBLIN-1836] Ensuring Task Reliability: Handling Job Cancellation and Graceful Exits for Error-Free Completion (apache#3699)
  [GOBBLIN-1805] Check watermark for the most recent hour for quiet topics (apache#3698)
  [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail (apache#3687)
  [GOBBLIN-1823] Improving Container Calculation and Allocation Methodology (apache#3692)
  [GOBBLIN-1830] Improving Container Transition Tracking in Streaming Data Ingestion (apache#3693)
  [GOBBLIN-1833]Emit Completeness watermark information in snapshotCommitEvent (apache#3696)
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
This task implements a non-blocking, multi-active scheduler for each host. The algorithm is described in much greater detail in the Javadocs written for the classes below; please read them for clarification. This PR will NOT include metric emission or unit tests for validation. That will be done in a separate follow-up ticket. The work in this ticket includes:
New Classes
- MultiActiveLeaseArbiter, used to define a generic approach for the non-blocking, multi-participant system. It will be used for the Scheduler but can be extended in the future to DagManager and other modules of the system that we want to make multi-active
- A MultiActiveLeaseArbiter implementation called MysqlMultiActiveLeaseArbiter, which uses a MySQL store to resolve ownership of a flow event among multiple competing participants
- FlowTriggerHandler, to coordinate between hosts with enabled schedulers to respond to flow action events (only handling triggers to LAUNCH an event for now)
Modifications to Existing Classes
- DagActionStore schema and DagActionStoreChangeMonitor, to act upon new LAUNCH type events in addition to KILL/RESUME
- JobScheduler, to store the timestamp of the trigger event within its job properties
- Orchestrator logic, to trigger the event using the algorithm above if multi-active scheduler is enabled, and otherwise submit events directly to the DagManager after receiving a scheduler trigger
NOTE: because I'm updating the DagActionStore schema, this change will require manually altering the primary key of the table before deploying these changes. MySQL only creates/updates the table if the same table name does not exist.
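The lease arbitration idea described above can be illustrated with a small in-memory sketch. It is not the MySQL-backed implementation from this PR: the class, the status names, and the plain String event key are all assumptions made for illustration, and only the expiry rule (a lease expires once the current time reaches acquisition time plus linger) follows the SQL quoted in the review.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical status names; the PR's actual LeaseAttemptStatus types may differ.
enum LeaseAttempt { OBTAINED, LEASE_ALREADY_HELD, NO_LONGER_LEASING }

/** In-memory illustration of non-blocking lease arbitration among participants. */
final class InMemoryLeaseArbiter {
  private static final class Lease {
    final String owner;
    final long acquiredAtMillis;
    boolean completed;

    Lease(String owner, long acquiredAtMillis) {
      this.owner = owner;
      this.acquiredAtMillis = acquiredAtMillis;
    }
  }

  private final Map<String, Lease> leases = new HashMap<>();
  private final long lingerMillis;

  InMemoryLeaseArbiter(long lingerMillis) {
    this.lingerMillis = lingerMillis;
  }

  /** Never blocks: callers get an immediate answer and may set a reminder to retry later. */
  synchronized LeaseAttempt tryAcquireLease(String eventKey, String participant, long nowMillis) {
    Lease current = leases.get(eventKey);
    if (current != null && current.completed) {
      return LeaseAttempt.NO_LONGER_LEASING; // the event has already been handled
    }
    boolean expired = current != null && nowMillis >= current.acquiredAtMillis + lingerMillis;
    if (current == null || expired) {
      leases.put(eventKey, new Lease(participant, nowMillis));
      return LeaseAttempt.OBTAINED;
    }
    return LeaseAttempt.LEASE_ALREADY_HELD; // a valid lease exists; retry after it should expire
  }

  /** Marks the event handled; succeeds only for the participant currently holding the lease. */
  synchronized boolean recordLeaseSuccess(String eventKey, String participant) {
    Lease current = leases.get(eventKey);
    if (current == null || current.completed || !current.owner.equals(participant)) {
      return false;
    }
    current.completed = true;
    return true;
  }
}
```

A participant that receives LEASE_ALREADY_HELD would schedule a reminder for after the lease should expire and try again, so that an event whose owner died without completing it is eventually picked up by someone else.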
Tests
This PR is limited to the implementation; metrics, logging for validation, and unit tests will follow in a separate PR.
Commits