[GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time by Peiyingy · Pull Request #3704 · apache/gobblin

Peiyingy · 2023-06-14T18:01:19Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1840] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1840

Description

Here are some details about my PR, including screenshots (if applicable):

Problem Statement

Currently, the Helix replanner has a problem: several Azkaban jobs can be triggered at the same time, so replanning happens two or more times within a short time span. Creating a replanner is expensive; it consumes a lot of resources and time on both ZooKeeper and the Application Master.

Solution

We added a concurrent hash map that stores the create time for each job. Before rescheduling a workflow, we check the map and only proceed if the last replanning is older than the throttle timeout. The timeout defaults to one hour and is fully configurable. Throttling itself can also be turned off, which disables this early-return mechanism.
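
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `ReplanThrottleSketch` and the method `tryReplan` are hypothetical names, and an injected `java.time.Clock` is assumed so that the time source can be controlled in tests.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not the PR's code: remembers when each job was last
// replanned and rejects another replan inside the throttle window.
class ReplanThrottleSketch {
  private final ConcurrentHashMap<String, Instant> jobNameToLastReplanTime = new ConcurrentHashMap<>();
  private final Duration throttleTimeout;
  private final boolean isThrottleEnabled;
  private final Clock clock;

  ReplanThrottleSketch(Duration throttleTimeout, boolean isThrottleEnabled, Clock clock) {
    this.throttleTimeout = throttleTimeout;
    this.isThrottleEnabled = isThrottleEnabled;
    this.clock = clock;
  }

  // Returns true if the job may be replanned now and records the time;
  // returns false (the early return) if the last replan is too recent.
  synchronized boolean tryReplan(String jobName) {
    Instant now = clock.instant();
    Instant lastReplanTime = jobNameToLastReplanTime.get(jobName);
    boolean isWithinTimeout = lastReplanTime != null
        && Duration.between(lastReplanTime, now).compareTo(throttleTimeout) < 0;
    if (isThrottleEnabled && isWithinTimeout) {
      return false;
    }
    jobNameToLastReplanTime.put(jobName, now);
    return true;
  }
}
```

With a one-hour timeout, a second `tryReplan("job")` call made a few seconds after the first returns false, while a call for a different job name is unaffected.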

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

The unit tests are permutations of three variables: same or different workflow, time span, and whether throttling is enabled. The original testNewJobAndUpdate covers the same workflow, a long time period, and throttling enabled. The rest of the tests have descriptive names.
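
As a rough sketch of the expectation behind those permutations: an update should be skipped only when all three conditions hold at once. The class and method names below are hypothetical, and the rule is inferred from the description above rather than copied from the tests.

```java
// Hypothetical decision table for the test permutations; names are illustrative.
class ThrottleTestMatrix {
  // The update is scheduled unless it targets the same workflow, arrives
  // within the throttle timeout, and throttling is enabled.
  static boolean expectScheduled(boolean isSameWorkflow, boolean isWithinTimeout, boolean isThrottleEnabled) {
    return !(isSameWorkflow && isWithinTimeout && isThrottleEnabled);
  }

  public static void main(String[] args) {
    boolean[] values = {false, true};
    for (boolean same : values) {
      for (boolean within : values) {
        for (boolean enabled : values) {
          System.out.printf("sameWorkflow=%b withinTimeout=%b throttleEnabled=%b -> scheduled=%b%n",
              same, within, enabled, expectScheduled(same, within, enabled));
        }
      }
    }
  }
}
```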

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

…workflow if within configured time

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

umustafi

nice work adding unit tests! few suggestions :)

umustafi · 2023-06-15T03:20:04Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

  private boolean startServicesCompleted;
  private final long helixJobStopTimeoutMillis;
+  private final Duration throttleTimeoutDuration;
+  private ConcurrentHashMap<String, Instant> jobStartTimeMap;


jobUriToStartTimeMap? better if u can clarify what the string is. Also lets be consistent between START/CREATE time you mention in description. Why do we use Instant rather than Timestamp or Long (milliseconds)? The latter is typically what I see used in our code.

+1 to adjusting the map name. Pretty sure the key is the jobName, which refers to the Gobblin configuration job.name

Also just my opinion, but Instant (or I guess Timestamp) are less error prone to write code for than millis.

I don't have an opinion on changing the above duration to Millis long to fit the rest of the class. But Instant vs long is a big deal because long is hard to reason about. It often refers to epoch millis but you always have to add that epochmillis to the name of the map.

As for java.time.Instant vs java.sql.Timestamp, I've seen Instant used elsewhere. And personally haven't seen Timestamp used much. So clearly there are multiple pockets of usage. and either one makes sense probably.

I don't have that strong of a preference with Instant vs. Timestamp/Long, the latter are more common on GaaS side so I was initially surprised. More important to update the map name.

umustafi · 2023-06-15T03:22:06Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

  public void handleUpdateJobConfigArrival(UpdateJobConfigArrivalEvent updateJobArrival) {
    LOGGER.info("Received update for job configuration of job " + updateJobArrival.getJobName());
+    String jobName = updateJobArrival.getJobName();
+    boolean throttleEnabled = PropertiesUtils.getPropAsBoolean(updateJobArrival.getJobConfig(),


usually booleans are easily identified with is.... like isThrottleEnabled

does this default to false if config is not provided? if not provide a default value

Default is provided by GobblinClusterConfigurationKeys.DEFAULT_HELIX_JOB_SCHEDULING_THROTTLE_ENABLED_KEY in the config file, which is set to false. Should I add an additional default proof in this part?

but you are not using that value here right? You want to provide DEFAULT_HELIX_JOB_SCHEDULING_THROTTLE_ENABLED_KEY here so value is used

The signature of getPropAsBoolean is:

public static boolean getPropAsBoolean( @NotNull Properties properties, String key, String defaultValue )

so String.valueOf(GobblinClusterConfigurationKeys.DEFAULT_HELIX_JOB_SCHEDULING_THROTTLE_ENABLED_KEY) is passed as the default value and is used if GobblinClusterConfigurationKeys.HELIX_JOB_SCHEDULING_THROTTLE_ENABLED_KEY is not set
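
A minimal sketch of that default-value lookup, using only `java.util.Properties`. The helper below is a stand-in with the same parameter shape as the signature quoted above, not Gobblin's `PropertiesUtils`; the constant names are hypothetical, and the key string is the one that appears in the test code of this PR.

```java
import java.util.Properties;

// Stand-in sketch: shows how a string default is applied when the key is absent.
class ThrottleConfigSketch {
  // Hypothetical constants; the key string matches the raw key used in the tests.
  static final String THROTTLE_ENABLED_KEY = "helix.job.scheduling.throttle.enabled";
  static final boolean DEFAULT_THROTTLE_ENABLED = false;

  // Falls back to defaultValue when the key is not present in properties.
  static boolean getPropAsBoolean(Properties properties, String key, String defaultValue) {
    return Boolean.parseBoolean(properties.getProperty(key, defaultValue));
  }

  static boolean isThrottleEnabled(Properties jobConfig) {
    return getPropAsBoolean(jobConfig, THROTTLE_ENABLED_KEY, String.valueOf(DEFAULT_THROTTLE_ENABLED));
  }
}
```

Passing the default explicitly at the call site keeps the behavior visible to a reader even if the config file is changed later.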

umustafi · 2023-06-15T03:29:05Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+      throws Exception {
+    try (MockedStatic<Instant> mocked = mockStatic(Instant.class, CALLS_REAL_METHODS)) {
+      mocked.when(Instant::now).thenReturn(beginTime, shortPeriod);
+      HelixManager helixManager = HelixManagerFactory


can u reuse helixManager across tests?

The original approach was to reuse it across tests, but as the number of tests grew this caused the error "HelixManager (ZkClient) is not connected", so I changed it to a local variable to avoid the problem.

i see, let's add a comment to explain that in javadoc for this testing class and make a method to create HelixManager that you can reuse across test. You can also put comment there to explain why you changed to local variable so the knowledge is preserved for those updating tests in future.

umustafi · 2023-06-15T03:29:31Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+      jobScheduler.handleNewJobConfigArrival(newJobConfigArrivalEvent);
+      connectAndAssertWorkflowId(workflowIdSuffix1, newJobConfigArrivalEvent, helixManager);
+
+      properties1.setProperty(GobblinClusterConfigurationKeys.HELIX_JOB_SCHEDULING_THROTTLE_ENABLED_KEY, "true");


if there's no properties2, then just name this properties

umustafi · 2023-06-15T03:31:36Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+      HelixManager helixManager = HelixManagerFactory
+          .getZKHelixManager(helixClusterName, TestHelper.TEST_HELIX_INSTANCE_NAME, InstanceType.CONTROLLER,
+              zkConnectingString);
+      GobblinHelixJobScheduler jobScheduler = createJobScheduler(helixManager);


same here u can reuse perhaps, lots of repeated code. Let's try to DRY (do not repeat yourself)

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

homatthew

Some comments and replies to existing questions

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterConfigurationKeys.java

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

homatthew · 2023-06-15T15:21:39Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

  private boolean startServicesCompleted;
  private final long helixJobStopTimeoutMillis;
+  private final Duration throttleTimeoutDuration;
+  private ConcurrentHashMap<String, Instant> jobStartTimeMap;


+1 to adjusting the map name. Pretty sure the key is the jobName, which refers to the Gobblin configuration job.name

homatthew · 2023-06-15T15:26:22Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

+
+    if (throttleEnabled && this.jobStartTimeMap.containsKey(jobName)) {
+      Instant jobStartTime = this.jobStartTimeMap.get(jobName);
+      Duration workflowDuration = Duration.between(jobStartTime, Instant.now());


Maybe workflowRunningDuration is a more descriptive name. @ZihanLi58 can you help chime in here

homatthew · 2023-06-15T15:33:23Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

  private boolean startServicesCompleted;
  private final long helixJobStopTimeoutMillis;
+  private final Duration throttleTimeoutDuration;
+  private ConcurrentHashMap<String, Instant> jobStartTimeMap;


Also just my opinion, but Instant (or I guess Timestamp) are less error prone to write code for than millis.

I don't have an opinion on changing the above duration to Millis long to fit the rest of the class. But Instant vs long is a big deal because long is hard to reason about. It often refers to epoch millis but you always have to add that epochmillis to the name of the map.

As for java.time.Instant vs java.sql.Timestamp, I've seen Instant used elsewhere. And personally haven't seen Timestamp used much. So clearly there are multiple pockets of usage. and either one makes sense probably.

codecov-commenter · 2023-06-15T19:08:52Z

Codecov Report

Merging #3704 (2496212) into master (51a852d) will decrease coverage by 1.22%.
The diff coverage is 67.18%.

@@             Coverage Diff              @@
##             master    #3704      +/-   ##
============================================
- Coverage     46.97%   45.76%   -1.22%     
+ Complexity    10794     9402    -1392     
============================================
  Files          2138     1863     -275     
  Lines         84132    74423    -9709     
  Branches       9356     8305    -1051     
============================================
- Hits          39518    34057    -5461     
+ Misses        41015    37266    -3749     
+ Partials       3599     3100     -499

Impacted Files	Coverage Δ
...in/java/org/apache/gobblin/cluster/HelixUtils.java	`48.91% <50.00%> (+1.39%)`	⬆️
...ache/gobblin/cluster/GobblinHelixJobScheduler.java	`57.42% <55.81%> (+3.18%)`	⬆️
...bblin/cluster/GobblinClusterConfigurationKeys.java	`50.00% <100.00%> (+50.00%)`	⬆️
...ter/GobblinThrottlingHelixJobLauncherListener.java	`100.00% <100.00%> (ø)`

... and 312 files with indirect coverage changes

ZihanLi58

We do have a race condition here if we receive one update request while the last job has not been submitted successfully.
Can we consider refactoring the code a little and adding a lock there, to make sure that no two messages for the same workflow are handled at the same time?

umustafi

looks pretty good, one small comment

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

…g HelixManager as local variable

homatthew · 2023-06-26T21:30:12Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

        LOGGER.info("Scheduling job " + jobUri);
        scheduleJob(jobProps,
-                    new GobblinHelixJobLauncherListener(this.launcherMetrics));
+            listener);


nit: Does not need to be on a new line

homatthew · 2023-06-26T21:30:24Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

      } else {
-        LOGGER.info("No job schedule found, so running job " + jobUri);
+        LOGGER.info("No job schedule"
+            + " found, so running job " + jobUri);


nit: Does not need to be on a new line

homatthew · 2023-06-26T21:30:28Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

+            + " found, so running job " + jobUri);
        this.jobExecutor.execute(new NonScheduledJobRunner(jobProps,
-                                 new GobblinHelixJobLauncherListener(this.launcherMetrics)));
+            listener));


nit: Does not need to be on a new line

homatthew · 2023-06-26T21:30:43Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

-                                 new GobblinHelixJobLauncherListener(this.launcherMetrics)));
+            listener));
      }
+


nit: Does not need a new line

homatthew · 2023-06-26T21:40:25Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

+    String jobName = updateJobArrival.getJobName();
+
+    if (this.isThrottleEnabled &&
+        this.jobNameToNextSchedulableTime.getOrDefault(jobName, Instant.ofEpochMilli(0)).isAfter(clock.instant())) {


Nit: This line is a bit dense. And to indicate beginning of time, the documentation for Instant has Instant.MIN or Instant.EPOCH which should be more readable.

Also, intuitively it feels a little weird to read as "nextSchedulableTime is after current time". I feel it's more intuitive for it to be

"current time is before nextSchedulableTime" i.e.

clock.instant().isBefore(jobNameToNextSchedulableTime.getOrDefault(jobName, Instant.ofEpochMilli(0)))

or IMO even more readable

Instant nextSchedulableTime = jobNameToNextSchedulableTime.getOrDefault(jobName, Instant.MIN);
if (this.isThrottleEnabled && clock.instant().isBefore(nextSchedulableTime)) { ...

homatthew · 2023-06-26T22:15:11Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

  }

+  // Time span exceeds throttle timeout, within same workflow, throttle is enabled
+  // Job will be updated


Comments describing the method should be a java doc instead of a regular comment

homatthew · 2023-06-26T22:16:47Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

-    properties1.setProperty(GobblinClusterConfigurationKeys.CANCEL_RUNNING_JOB_ON_DELETE, "true");
+    GobblinHelixJobScheduler gobblinHelixJobScheduler;
+    if (isThrottleEnabled) {
+      gobblinHelixJobScheduler = new GobblinHelixJobScheduler(ConfigFactory.empty(), helixManager, java.util.Optional.empty(),


We can inject the clock regardless of if throttling is enabled. We'd never want to use UTC clock in a unit test IMO

homatthew · 2023-06-26T22:17:10Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+      gobblinHelixJobScheduler = new GobblinHelixJobScheduler(ConfigFactory.empty(), helixManager, java.util.Optional.empty(),
+          new EventBus(), appWorkDir, Lists.emptyList(), schedulerService, jobCatalog);
+    }
+    gobblinHelixJobScheduler.setThrottleEnabled(isThrottleEnabled);


nit: wouldn't we want to set this via config?

homatthew · 2023-06-26T22:18:19Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

-    String workFlowId = null;
+  private String getWorkflowID (NewJobConfigArrivalEvent newJobConfigArrivalEvent, HelixManager helixManager)
+      throws Exception {
+    // endTime is manually set time period that we allow HelixUtils to fetch workflowIdMap before timeout


Maybe better wording:

Poll Helix for up to 30 seconds until a workflow with a matching job name exists, then return that workflowID
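
The polling pattern in that wording can be sketched generically as below. This is an assumption-laden illustration: `PollSketch` and `pollUntilPresent` are hypothetical names, and the lookup is an arbitrary supplier rather than the Helix call used in the test.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: retries a lookup until it yields a value or a timeout elapses.
class PollSketch {
  static <T> Optional<T> pollUntilPresent(Supplier<Optional<T>> lookup, Duration timeout, Duration interval) {
    long deadlineNanos = System.nanoTime() + timeout.toNanos();
    while (true) {
      Optional<T> result = lookup.get();
      if (result.isPresent()) {
        return result;
      }
      if (System.nanoTime() - deadlineNanos >= 0) {
        return Optional.empty(); // timed out without finding a value
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop polling
        return Optional.empty();
      }
    }
  }
}
```

In the test this would correspond to a 30 second timeout with a 100 ms interval, with the supplier returning the workflow ID once a workflow for the job name exists.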

homatthew · 2023-06-26T22:19:55Z

gobblin-cluster/src/test/resources/mockito-extensions/org.mockito.plugins.MockMaker

@@ -0,0 +1 @@
+mock-maker-inline


I don't think we need this anymore since we are not mocking any static classes

homatthew

Some more suggestions wrt loggers

...ster/src/main/java/org/apache/gobblin/cluster/GobblinThrottlingHelixJobLauncherListener.java

homatthew

I see a race condition. The rest are pretty much nits. Great work!

homatthew · 2023-06-27T03:51:31Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

+    Instant nextSchedulableTime = jobNameToNextSchedulableTime.getOrDefault(jobName, Instant.MIN);
+    if (this.isThrottleEnabled && clock.instant().isBefore(nextSchedulableTime)) {
+      LOGGER.info("Replanning is skipped for job {}. Current time is "
+          + clock.instant() + " and the next schedulable time would be "


clock.instant() should be using the {} syntax. Same for the nextSchedulable time. And instead of getting the value from the map, use the nextSchedulableTime variable

homatthew · 2023-06-27T03:52:30Z

...ster/src/main/java/org/apache/gobblin/cluster/GobblinThrottlingHelixJobLauncherListener.java

+  private Clock clock;
+
+  public GobblinThrottlingHelixJobLauncherListener(GobblinHelixJobLauncherMetrics jobLauncherMetrics,
+      ConcurrentHashMap jobNameToNextSchedulableTime, Duration helixJobSchedulingThrottleTimeout, Clock clock) {


Shouldn't it be ConcurrentHashMap<String, Instant> instead of just ConcurrentHashMap?

homatthew · 2023-06-27T04:00:31Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

    LOGGER.info("Received update for job configuration of job " + updateJobArrival.getJobName());
+    String jobName = updateJobArrival.getJobName();
+
+    Instant nextSchedulableTime = jobNameToNextSchedulableTime.getOrDefault(jobName, Instant.MIN);


Random question, but is there a reason we use Instant.min as the default value here and Instant.EPOCH as the placeholder elsewhere?
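
For context on this question, a small sketch of how the two constants behave. For the isBefore comparison both act as a "never scheduled" default. One difference that may matter if the value is ever converted to a long: `Instant.MIN.toEpochMilli()` throws an ArithmeticException, while `Instant.EPOCH.toEpochMilli()` returns 0. The class and method names are hypothetical.

```java
import java.time.Instant;

// Hypothetical sketch comparing Instant.MIN and Instant.EPOCH as default values.
class SentinelSketch {
  // True when now is still before the next schedulable time, i.e. the update is throttled.
  static boolean isThrottled(Instant nextSchedulableTime, Instant now) {
    return now.isBefore(nextSchedulableTime);
  }

  // Reports whether the instant fits into epoch milliseconds without overflow.
  static boolean convertsToEpochMilli(Instant instant) {
    try {
      instant.toEpochMilli();
      return true;
    } catch (ArithmeticException e) {
      return false;
    }
  }
}
```

Using one constant consistently for both the default and the placeholder avoids having to reason about this difference at all.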

homatthew · 2023-06-27T04:04:29Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

-    final Properties properties1 =
-        GobblinHelixJobLauncherTest.generateJobProperties(this.baseConfig, "1", workflowIdSuffix1);
-    properties1.setProperty(GobblinClusterConfigurationKeys.CANCEL_RUNNING_JOB_ON_DELETE, "true");
+    Config helixJobSchedulerConfig = ConfigFactory.empty().withValue("helix.job.scheduling.throttle.enabled",


Use the GobblinClusterConfigurationKeys.HELIX_JOB_SCHEDULING_THROTTLE_ENABLED_KEY instead of the raw string value

homatthew · 2023-06-27T04:05:49Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+            zkConnectingString);
+    GobblinHelixJobScheduler jobScheduler = createJobScheduler(helixManager, isThrottleEnabled, mockClock);
+    final Properties properties =
+        GobblinHelixJobLauncherTest.generateJobProperties(


Nit: was this meant to be on a new line? Seems like it would fit fine on the line above

homatthew · 2023-06-27T04:09:39Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+    return newJobConfigArrivalEvent;
+  }
+
+  private void connectAndAssertWorkflowId(String expectedSuffix, NewJobConfigArrivalEvent newJobConfigArrivalEvent, HelixManager helixManager ) throws Exception {


Random question, but why do we use NewJobConfigArrivalEvent instead of just passing a string job name? There were some places in the code where we constructed a brand new NewJobConfigArrivalEvent just to pass it into this method.

homatthew · 2023-06-27T04:21:13Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java


  @Subscribe
-  public void handleDeleteJobConfigArrival(DeleteJobConfigArrivalEvent deleteJobArrival) throws InterruptedException {
+  public synchronized void handleDeleteJobConfigArrival(DeleteJobConfigArrivalEvent deleteJobArrival) throws InterruptedException {


Super minor nit. Not sure if it's even worth implementing:

Would we want to reset the Instant in the map to Instant.EPOCH if we delete a workflow? My understanding is that internally we don't use this delete job config method and only rely on update, so this wouldn't really affect our own use case.

I am not sure which behavior is more intuitive:

1. If I explicitly delete, I should be able to reschedule it and bypass the throttle time
2. Regardless of if I deleted the old flow, the throttle time should prevent resubmission

The current behavior is (2). And to make the behavior (1), we would:

1. Store the current time in the map,
2. Set the value in the map to Instant.EPOCH
3. If there is a job exception we reset the value back to the original value that was in the map

The delete operations are synchronous and the method is synchronized, so this approach would be thread safe

In handleUpdateJobConfigArrival, we call handleDeleteJobConfigArrival directly. So if you want to specifically reset it, remember to distinguish the two calls here (one is called by handleUpdateJobConfigArrival) and another is called directly.

Unless you change all the cancel APIs in our code to send a delete job message to trigger this method, then, in that case, resetting the timer can enable us to start a new job immediately, otherwise it does not make sense to achieve 1, as we don't explicitly delete anyway...

Since handleDeleteJobConfigArrival is a completely synchronous method, the synchronized method handleUpdateJobConfigArrival would just hold the lock while deleting and then proceed to handleNewJobConfigArrival. There would be no need to distinguish between the two since @Peiyingy needs to address the race condition described in https://github.com/apache/gobblin/pull/3704/files#r1243111712 by updating the map immediately in the newJobConfigArrival method.

Yeah we don't explicitly call delete so it's just semantics about which is more intuitive behavior if we ever use this in the future. Since this is purely hypothetical I don't want to waste effort changing the behavior to (1). I think we should just add a comment describing that deleting a workflow with throttling enabled means that the next schedulable time for the workflow will remain unchanged and you have to wait out the throttle timeout before being able to reschedule

ZihanLi58 · 2023-06-27T20:50:04Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java


  @Subscribe
-  public void handleDeleteJobConfigArrival(DeleteJobConfigArrivalEvent deleteJobArrival) throws InterruptedException {
+  public synchronized void handleDeleteJobConfigArrival(DeleteJobConfigArrivalEvent deleteJobArrival) throws InterruptedException {


In handleUpdateJobConfigArrival, we call handleDeleteJobConfigArrival directly. So if you want to specifically reset it, remember to distinguish the two calls here (one is called by handleUpdateJobConfigArrival) and another is called directly.

Unless you change all the cancel APIs in our code to send a delete job message to trigger this method, then, in that case, resetting the timer can enable us to start a new job immediately, otherwise it does not make sense to achieve 1, as we don't explicitly delete anyway...

ZihanLi58 · 2023-06-27T20:51:59Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java


  @Subscribe
-  public void handleUpdateJobConfigArrival(UpdateJobConfigArrivalEvent updateJobArrival) {
+  public synchronized void handleUpdateJobConfigArrival(UpdateJobConfigArrivalEvent updateJobArrival) {


@homatthew are we sure this change won't affect performance when those message-handling methods will be called frequently? (That's why initially I suggested having job level lock)

Summary of offline discussion:

What kind of throughput are we expecting with this job launcher? I.e. for fliptop I know the traffic is bursty but how is it bursty? What sort of magnitude are we talking about here?

We have 100k jobs submitted throughout the day, so around 1~2 per second? And cancel job can be triggered randomly but should be much less frequent

Since the only blocking operation in the critical section is the delete operation, and there are infrequent deletes (usually this takes seconds to complete), we can go ahead with the change and add fine-grained locking in the future if necessary
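
If fine-grained locking is needed later, one possible shape is a lock object per job name. This is only a sketch of the idea, not something this PR implements; `PerJobLockSketch` and `withJobLock` are hypothetical names, and the lock map is never cleaned up here, which a real implementation would need to address.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: handlers for different jobs do not block each other,
// while handlers for the same job name run one at a time.
class PerJobLockSketch {
  private final ConcurrentHashMap<String, Object> jobNameToLock = new ConcurrentHashMap<>();

  // Runs the handler while holding the lock that belongs to jobName.
  <T> T withJobLock(String jobName, Supplier<T> handler) {
    Object lock = jobNameToLock.computeIfAbsent(jobName, name -> new Object());
    synchronized (lock) {
      return handler.get();
    }
  }
}
```

Compared with a synchronized method, a slow delete for one job would then only delay other messages for that same job.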

homatthew · 2023-06-27T21:50:32Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

  public synchronized void handleUpdateJobConfigArrival(UpdateJobConfigArrivalEvent updateJobArrival) {
    LOGGER.info("Received update for job configuration of job " + updateJobArrival.getJobName());
-    String jobName = updateJobArrival.getJobName();
+    String jobUri = updateJobArrival.getJobName();


Let's not mix up the usage of job uri and job name. If you are gonna use job name (e.g. jobNameToNextSchedulableTime), then use the term job name everywhere. And if you are gonna use job uri, then change it for all of them

homatthew · 2023-06-27T21:52:55Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

@@ -357,19 +370,22 @@ public synchronized void handleNewJobConfigArrival(NewJobConfigArrivalEvent newJ
      }
    } catch (JobException je) {
      LOGGER.error("Failed to schedule or run job " + jobUri, je);


Update this log to say that you are resetting the clock

homatthew · 2023-06-27T21:56:48Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

  }

-  private void runWorkflowTest(Instant mockedTime, String jobSuffix,
+  private void runWorkflowTest(Duration mockedTime, String jobSuffix,


Does mockedTime still make sense? Seems like it now represents a step duration for incrementing the clock forward in time

homatthew · 2023-06-27T21:59:00Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+    AtomicReference<Instant> nextInstant = new AtomicReference<>(beginTime);
+    when(mockClock.instant()).thenAnswer(invocation -> {
+      Instant currentInstant = nextInstant.get();
+      nextInstant.set(currentInstant.plus(mockedTime));


I noticed that you're using AtomicReference. But you are not doing an atomic get and set, which basically defeats the point of what you're doing.

Did you mean to do something like getAndAccumulate?
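
To illustrate the atomic read-and-advance the comment is asking about, here is a sketch that uses `AtomicReference.getAndUpdate`, which returns the previous value and stores the updated one in a single atomic step. It subclasses `java.time.Clock` instead of using Mockito so that it is self-contained; `SteppingClock` is a hypothetical name.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical test clock: every call to instant() returns the stored value
// and advances it by a fixed step as one atomic operation.
class SteppingClock extends Clock {
  private final AtomicReference<Instant> nextInstant;
  private final Duration step;

  SteppingClock(Instant start, Duration step) {
    this.nextInstant = new AtomicReference<>(start);
    this.step = step;
  }

  @Override
  public Instant instant() {
    return nextInstant.getAndUpdate(current -> current.plus(step));
  }

  @Override
  public ZoneId getZone() {
    return ZoneOffset.UTC;
  }

  @Override
  public Clock withZone(ZoneId zone) {
    return this; // the zone is irrelevant for this sketch
  }
}
```

Because `getAndUpdate` takes a unary function, no placeholder argument is needed, unlike `getAndAccumulate`.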

homatthew · 2023-06-27T23:15:24Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

+   * Deleting a workflow with throttling enabled means that the next
+   * schedulable time for the workflow will remain unchanged.
+   * Note: In such case, it is required to wait until the throttle
+   * timeout period elapses before the workflow can be rescheduled.


Nice comment!

homatthew · 2023-06-27T23:19:28Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

        GobblinHelixJobScheduler.this.runJob(this.jobProps, this.jobListener);
      } catch (JobException je) {
-        LOGGER.error("Failed to run job " + this.jobProps.getProperty(ConfigurationKeys.JOB_NAME_KEY), je);
+        LOGGER.error("Failed to schedule or run job to run job " + this.jobProps.getProperty(ConfigurationKeys.JOB_NAME_KEY), je);


Typo / wording. schedule or run job to run job

homatthew · 2023-06-27T23:30:47Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

+    String assertUpdateWorkflowIdSuffix, boolean isThrottleEnabled, boolean isSameWorkflow) throws Exception {
+    Clock mockClock = Mockito.mock(Clock.class);
+    AtomicReference<Instant> nextInstant = new AtomicReference<>(beginTime);
+    when(mockClock.instant()).thenAnswer(invocation -> nextInstant.getAndAccumulate(nextInstant.get(), (currentInstant, x) -> currentInstant.plus(mockedPeriod)));


nextInstant.get() is not used and is just a placeholder right? Since it seems like just a placeholder value you can use something like null

homatthew · 2023-06-27T23:34:37Z

gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobSchedulerTest.java

-        break;
-      }
-      Thread.sleep(100);
+  private void runWorkflowTest(Duration mockedPeriod, String jobSuffix,


A java doc that describes what these variables are so that future people can use this method would be helpful.

Also, mockedPeriod is a bit of a weird name. Since you're now using it to represent the step amount each time clock.instant() is called.

Is this really necessary? In your original implementation it was just returning a final Instant defined at the beginning which was a bit easier to reason about. But now we sort of rely on how many times clock.instant() is called to know what the current time is
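
One alternative that avoids depending on how many times clock.instant() is called is a clock the test advances explicitly. The sketch below is an assumption about how that could look, not code from this PR; `SettableClock` is a hypothetical name.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical test clock: time only moves when the test calls advance().
class SettableClock extends Clock {
  private Instant now;

  SettableClock(Instant start) {
    this.now = start;
  }

  // Moves the clock forward by the given amount.
  synchronized void advance(Duration amount) {
    now = now.plus(amount);
  }

  @Override
  public synchronized Instant instant() {
    return now;
  }

  @Override
  public ZoneId getZone() {
    return ZoneOffset.UTC;
  }

  @Override
  public Clock withZone(ZoneId zone) {
    return this; // the zone is irrelevant for this sketch
  }
}
```

A test would submit the first job, call `advance` with a short or long duration, and then submit the update, so the current time is obvious at each step.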

homatthew

Great work

ZihanLi58 · 2023-06-28T20:35:13Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java

+      );
+      return;
+    }
+    nextSchedulableTime = clock.instant().plus(jobSchedulingThrottleTimeout);


Should we only add an entry to jobNameToNextSchedulableTime when the throttle is enabled? It is a hash map, so we can easily end up with a memory leak if we do not delete entries properly
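
A rough sketch of that suggestion: write to the map only when throttling is enabled. The pruning method is an extra assumption that was not discussed in this thread; it is included only to show one way stale entries could be dropped. All names other than jobNameToNextSchedulableTime are hypothetical.

```java
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keeps the map empty when throttling is disabled and
// lets callers drop entries that can no longer throttle anything.
class NextSchedulableTimesSketch {
  private final ConcurrentHashMap<String, Instant> jobNameToNextSchedulableTime = new ConcurrentHashMap<>();

  // Records an entry only when throttling is enabled.
  void record(String jobName, Instant nextSchedulableTime, boolean isThrottleEnabled) {
    if (isThrottleEnabled) {
      jobNameToNextSchedulableTime.put(jobName, nextSchedulableTime);
    }
  }

  // Removes entries whose next schedulable time is not after now.
  void pruneExpired(Instant now) {
    jobNameToNextSchedulableTime.values().removeIf(time -> !time.isAfter(now));
  }

  int size() {
    return jobNameToNextSchedulableTime.size();
  }
}
```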

ZihanLi58 · 2023-06-28T21:02:12Z

...ster/src/main/java/org/apache/gobblin/cluster/GobblinThrottlingHelixJobLauncherListener.java

+  }
+
+  @Override
+  public void onJobPrepare(JobContext jobContext)


For the same job, why do we update the schedulable time three times: once when we handle the message, once when we prepare the job, and once when the job starts? This will be confusing when reading the log.
If we have concerns about race conditions when we handle messages, can we update it only when handling a message?

…rottle is enabled

ZihanLi58

+1, great work!

* upstream/master:
- Fix bug with total count watermark whitelist (apache#3724)
- [GOBBLIN-1858] Fix logs relating to multi-active lease arbiter (apache#3720)
- [GOBBLIN-1838] Introduce total count based completion watermark (apache#3701)
- Correct num of failures (apache#3722)
- [GOBBLIN- 1856] Add flow trigger handler leasing metrics (apache#3717)
- [GOBBLIN-1857] Add override flag to force generate a job execution id based on gobbl… (apache#3719)
- [GOBBLIN-1855] Metadata writer tests do not work in isolation after upgrading to Iceberg 1.2.0 (apache#3718)
- Remove unused ORC writer code (apache#3710)
- [GOBBLIN-1853] Reduce # of Hive calls during schema related updates (apache#3716)
- [GOBBLIN-1851] Unit tests for MysqlMultiActiveLeaseArbiter with Single Participant (apache#3715)
- [GOBBLIN-1848] Add tags to dagmanager metrics for extensibility (apache#3712)
- [GOBBLIN-1849] Add Flow Group & Name to Job Config for Job Scheduler (apache#3713)
- [GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup (apache#3708)
- [GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time (apache#3704)
- [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched (apache#3711)
- [GOBBLIN-1842] Add timers to GobblinMCEWriter (apache#3703)
- [GOBBLIN-1844] Ignore workflows marked for deletion when calculating container count (apache#3709)
- [GOBBLIN-1846] Validate Multi-active Scheduler with Logs (apache#3707)
- [GOBBLIN-1845] Changes parallelstream to stream in DatasetsFinderFilteringDecorator to avoid classloader issues in spark (apache#3706)
- [GOBBLIN-1843] Utility for detecting non optional unions should convert dataset urn to hive compatible format (apache#3705)
- [GOBBLIN-1837] Implement multi-active, non blocking for leader host (apache#3700)
- [GOBBLIN-1835]Upgrade Iceberg Version from 0.11.1 to 1.2.0 (apache#3697)
- Update CHANGELOG to reflect changes in 0.17.0
- Reserving 0.18.0 version for next release
- [GOBBLIN-1836] Ensuring Task Reliability: Handling Job Cancellation and Graceful Exits for Error-Free Completion (apache#3699)
- [GOBBLIN-1805] Check watermark for the most recent hour for quiet topics (apache#3698)
- [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail (apache#3687)
- [GOBBLIN-1823] Improving Container Calculation and Allocation Methodology (apache#3692)
- [GOBBLIN-1830] Improving Container Transition Tracking in Streaming Data Ingestion (apache#3693)
- [GOBBLIN-1833]Emit Completeness watermark information in snapshotCommitEvent (apache#3696)

Peiying Ye added 5 commits June 12, 2023 19:25

[GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time

29f5c2d

[GOBBLIN-1840] Remove unnecessary files

45c87ff

[GOBBLIN-1840] Add config for throttleTimeoutDuration

df182ff

[GOBBLIN-1840] Clean up format and coding standard

3c552b7

[GOBBLIN-1840] Clean up format layout

88b9a02

homatthew reviewed Jun 14, 2023

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java (outdated)

Peiyingy changed the title from "Py helix scheduler throttle gobblin 1840" to "[GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time" on Jun 14, 2023

Peiyingy added 2 commits June 14, 2023 14:03

[GOBBLIN-1840] Clean up auto format

d75e522

[GOBBLIN-1840] Clear up empty space

4ddd7c5

umustafi reviewed Jun 15, 2023

homatthew reviewed Jun 15, 2023

[GOBBLIN-1840] Clarify naming standards and simplify repeated codes

0e7ba4d

ZihanLi58 requested changes Jun 15, 2023

umustafi reviewed Jun 15, 2023

Peiyingy force-pushed the py-helix-scheduler-throttle-GOBBLIN-1840 branch 2 times, most recently from 4fa777f to 3552fa3, on June 22, 2023 22:17

[GOBBLIN-1840] Add Javadoc on GobblinHelixJobSchedulerTest for setting HelixManager as local variable

c449a71

Peiyingy force-pushed the py-helix-scheduler-throttle-GOBBLIN-1840 branch 2 times, most recently from ebd8e67 to 258de1a, on June 22, 2023 23:02

Peiyingy added 2 commits June 22, 2023 18:17

[GOBBLIN-1840] Optimize imports and fix unit test errors

3160d8b

[GOBBLIN-1840] Rewrite log info and add Javadoc

ae2b58c

Peiyingy force-pushed the py-helix-scheduler-throttle-GOBBLIN-1840 branch from 258de1a to ae2b58c on June 23, 2023 01:18

[GOBBLIN-1840] Remove job status check

e6ea195

homatthew reviewed Jun 26, 2023

homatthew reviewed Jun 27, 2023

...ster/src/main/java/org/apache/gobblin/cluster/GobblinThrottlingHelixJobLauncherListener.java

...ster/src/main/java/org/apache/gobblin/cluster/GobblinThrottlingHelixJobLauncherListener.java (outdated)

Peiyingy added 2 commits June 26, 2023 19:08

[GOBBLIN-1840] Add log info and change config setting

6e8358c

[GOBBLIN-1840] Add @slf4j in GobblinThrottlingHelixJobLauncherListener

a15afd8

Peiyingy force-pushed the py-helix-scheduler-throttle-GOBBLIN-1840 branch from b7589ff to a15afd8 on June 27, 2023 02:09

homatthew suggested changes Jun 27, 2023

ZihanLi58 reviewed Jun 27, 2023

[GOBBLIN-1840] Fix race condition of handleNewJobConfigArrival

bbe4a0b

homatthew reviewed Jun 27, 2023

[GOBBLIN-1840] Improve mockClock mechanism

701881e

homatthew reviewed Jun 27, 2023

[GOBBLIN-1840] Address comments

cfdc115

Peiyingy force-pushed the py-helix-scheduler-throttle-GOBBLIN-1840 branch from 5e1a77d to cfdc115 on June 28, 2023 20:22

homatthew approved these changes Jun 28, 2023

ZihanLi58 requested changes Jun 28, 2023

Peiyingy added 3 commits June 28, 2023 17:38

[GOBBLIN-1840] Only put entry in jobNameToNextSchedulableTime when throttle is enabled

b598455

[GOBBLIN-1840] Remove extra schedulable time updates

32ea7ed

[GOBBLIN-1840] Fix checkstyle problems

2496212

ZihanLi58 approved these changes Jun 30, 2023

ZihanLi58 merged commit 1ecce5b into apache:master Jun 30, 2023

		@@ -0,0 +1 @@
		mock-maker-inline
		\ No newline at end of file

Conversation

Peiyingy commented Jun 14, 2023

umustafi left a comment

homatthew left a comment

codecov-commenter commented Jun 15, 2023 (edited)

Codecov Report

ZihanLi58 left a comment

umustafi left a comment

homatthew left a comment (edited)