Contributor

@kjmrknsn kjmrknsn commented Nov 27, 2017

What is this PR for?

The cron scheduler gets stuck easily when one of the cron jobs takes a long time or hangs.

I occasionally see the cron scheduler suddenly stop working. According to the thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker threads were waiting for their jobs to complete, so there was no thread left to launch a new job.

Here are the contents of the thread dump:

"DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7dbf0> (a java.lang.Object)

   Locked ownable synchronizers:
        - None

"DefaultQuartzScheduler_Worker-9" #75 prio=5 os_prio=0 tid=0x00007fb41d3b2000 nid=0x1b520 waiting on condition [0x00007fb3daff2000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7a470> (a java.lang.Object)

   Locked ownable synchronizers:
        - None

...

"DefaultQuartzScheduler_Worker-2" #68 prio=5 os_prio=0 tid=0x00007fb41d3c8800 nid=0x1b519 waiting on condition [0x00007fb3da473000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7a7b0> (a java.lang.Object)

   Locked ownable synchronizers:
        - None

"DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7dd90> (a java.lang.Object)

   Locked ownable synchronizers:
        - None

The thread dump above shows that all of the worker threads are stuck at https://github.com/apache/zeppelin/blob/v0.7.3/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Notebook.java#L889.
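The effect described above, where every worker is occupied and nothing is left to start a new job, can be sketched with a plain `java.util.concurrent` fixed-size pool. This is only an analogy under stated assumptions: it does not use Quartz's `SimpleThreadPool`, and the class and method names (`WorkerStarvationSketch`, `newJobStartsWhileWorkersBlocked`) are hypothetical. The `release` latch stands in for a condition such as `note.isTerminated()` that never becomes true.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Zeppelin or Quartz.
class WorkerStarvationSketch {

    // Returns true if a newly submitted job manages to run within waitMillis
    // while every worker in the pool is blocked by an earlier job.
    static boolean newJobStartsWhileWorkersBlocked(int workers, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch release = new CountDownLatch(1);        // never counted down while we wait
        CountDownLatch occupied = new CountDownLatch(workers); // signals that all workers are busy
        try {
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    occupied.countDown();
                    try {
                        release.await(); // the "stuck" job
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            occupied.await(); // make sure every worker is really blocked

            Future<?> newJob = pool.submit(() -> { });
            try {
                newJob.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no free worker picked the job up in time
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 10 matches the default Quartz worker count mentioned in this PR.
        System.out.println(newJobStartsWhileWorkersBlocked(10, 200)); // prints false
    }
}
```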

One way to reproduce this issue is to create a paragraph whose status is "READY" and set it to "disable run". The paragraph then stays in the "READY" state permanently, so note.isTerminated() never returns true.

To fix this issue, the following two improvements have been made in this PR:

  1. Remove the unnecessary while (!note.isTerminated()) { ... } block, because the execution of all of the paragraphs has already finished once note.runAll() returns.
  2. Skip the cron execution if there is a running or pending paragraph. This prevents the Zeppelin cron scheduler from getting stuck on a long-running paragraph whose execution time exceeds the cron interval.
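The two changes listed above can be sketched as follows. This is a minimal, self-contained model and not the actual Zeppelin code: `CronSkipSketch`, its nested `Note`, `Paragraph` and `Status` types, the `runEnabled` flag and `executeCron` are simplified stand-ins invented for illustration, and the semantics of `isTerminated()` are assumed from the description in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Zeppelin's Note and Paragraph; not the real classes.
class CronSkipSketch {
    enum Status { READY, PENDING, RUNNING, FINISHED }

    static class Paragraph {
        Status status;
        final boolean runEnabled; // false models the "disable run" setting

        Paragraph(Status status, boolean runEnabled) {
            this.status = status;
            this.runEnabled = runEnabled;
        }
    }

    static class Note {
        final List<Paragraph> paragraphs = new ArrayList<>();

        // Assumed semantics: terminated only when no paragraph is READY, PENDING or RUNNING.
        boolean isTerminated() {
            return paragraphs.stream().allMatch(p -> p.status == Status.FINISHED);
        }

        boolean isRunningOrPending() {
            return paragraphs.stream()
                    .anyMatch(p -> p.status == Status.RUNNING || p.status == Status.PENDING);
        }

        // Runs every enabled paragraph synchronously; disabled ones keep their status.
        void runAll() {
            for (Paragraph p : paragraphs) {
                if (p.runEnabled) {
                    p.status = Status.FINISHED;
                }
            }
        }
    }

    // Returns true if the note was run, false if this cron firing was skipped.
    static boolean executeCron(Note note) {
        if (note.isRunningOrPending()) {
            System.out.println("execution of the cron job is skipped because "
                    + "there is a running or pending paragraph");
            return false;
        }
        note.runAll();
        return true; // no polling loop on isTerminated() after runAll()
    }

    public static void main(String[] args) {
        Note note = new Note();
        note.paragraphs.add(new Paragraph(Status.READY, true));
        note.paragraphs.add(new Paragraph(Status.READY, false)); // "disable run"

        System.out.println(executeCron(note));   // true: nothing was running or pending
        System.out.println(note.isTerminated()); // false: a loop waiting on this would never exit

        note.paragraphs.get(0).status = Status.RUNNING; // e.g. a manual run in progress
        System.out.println(executeCron(note));   // false: this firing is skipped
    }
}
```

The second printed value is the point of change 1: with a "disable run" paragraph left in READY, a loop that waits for `isTerminated()` would hold its worker thread indefinitely.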

What type of PR is it?

[Bug]

Todos

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-3077

How should this be tested?

  • Tested manually.
    1. The cron scheduler does not get stuck if there is a paragraph whose status is "READY" and "disable run".
    2. The following message is printed to the log file when a cron job is launched while the previous cron job is still running.
      • execution of the cron job is skipped because there is a running or pending paragraph (note id: XXXXXXXXX)

Screenshots (if appropriate)

Questions:

  • Do the license files need an update? No.
  • Are there breaking changes for older versions? No.
  • Does this need documentation? Yes. This PR changes the behavior of the cron job so that it does not run if there is a running or pending paragraph. The documentation page docs/usage/other_features/cron_scheduler.md was therefore also added by this PR. Its layout is as follows:

[Screenshot: screen shot 2017-11-28 at 18 30 54]

@Leemoonsoo
Member

Leemoonsoo commented Nov 27, 2017

Thanks @kjmrknsn for taking care of this problem. LGTM.
If there were a unit test around here to make sure cron execution is skipped when a paragraph is running or pending, it would be even better. It's up to you whether to add a test here or address it in a future PR.

@kjmrknsn kjmrknsn force-pushed the ZEPPELIN-3077 branch 2 times, most recently from 4a0f4c2 to 52cbba6 Compare November 28, 2017 01:59
@kjmrknsn
Contributor Author

@Leemoonsoo Thanks for checking and for the suggestion. I added a test case to NotebookTest.java.

@kjmrknsn
Contributor Author

All of the CI tests passed: https://travis-ci.org/kjmrknsn/zeppelin/builds/308212514

if (note.isRunningOrPending()) {
logger.info("execution of the cron job is skipped because there is a running or pending " +
"paragraph (note id: {})", noteId);
return;
Member

so if someone is also running the notebook manually when the scheduler kicks off, would that block the scheduler?

Member

if so, this is a behavior change that is worthwhile to document

Contributor Author

so if someone is also running the notebook manually when the scheduler kicks off, would that block the scheduler?

Yes. I think that's OK, because a notebook that is already being run manually does not also need to be kicked off by the cron scheduler.

if so, this is a behavior change that is worthwhile to document

I'll add an explanation to the documentation.

Thanks.

Member

@felixcheung felixcheung left a comment

so if someone has a lot of notebooks scheduled and running (from such a schedule) at the same time, would it also cause Quartz to run out of threads?

Or would it be OK in that case, since they will just be queued up and hopefully some scheduled one will finish at some point?

@kjmrknsn
Contributor Author

@felixcheung

so if someone has a lot of notebooks scheduled and running (from such a schedule) at the same time, would it also cause Quartz to run out of threads?

Yes, but I think the jobs are queued and will be executed when one of the Quartz threads finishes its current job.

Zeppelin administrators can change the number of Quartz worker threads, which defaults to 10, by placing a quartz.properties file on the Zeppelin classpath.
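As an illustration of the setting mentioned above, such a quartz.properties file might look like the sketch below. The keys are Quartz's standard configuration property names; the thread count of 20 is an arbitrary example value, and the other lines simply restate what I believe are Quartz's shipped defaults. As far as I know, a quartz.properties found on the classpath is used in place of the defaults bundled in the Quartz jar rather than merged with them, which is why the other keys are repeated here. Please check the Quartz configuration reference for the version bundled with Zeppelin before relying on this.

```properties
# Sketch of a custom quartz.properties (values are illustrative assumptions)
org.quartz.scheduler.instanceName = DefaultQuartzScheduler

# Worker pool: raise threadCount above the default of 10
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 20
org.quartz.threadPool.threadPriority = 5

# In-memory job store, as in the default configuration
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore
org.quartz.jobStore.misfireThreshold = 60000
```

A larger pool only delays exhaustion if jobs never finish, so it complements the fix in this PR rather than replacing it.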

logger.error(e.toString(), e);
}
if (note.isRunningOrPending()) {
logger.info("execution of the cron job is skipped because there is a running or pending " +
Contributor

better to change it to warning

Contributor Author

@kjmrknsn kjmrknsn Nov 28, 2017

OK, thanks.

@kjmrknsn
Contributor Author

logger.info("execution ... "); has been changed to logger.warn("execution ... ");. Thanks.

@kjmrknsn
Contributor Author

@felixcheung I added the documentation page docs/usage/other_features/cron_scheduler.md to this PR. I also added the explanation about this documentation to the Questions: section of the top comment of this PR. Thanks.

@kjmrknsn
Contributor Author

kjmrknsn commented Dec 8, 2017

I'd appreciate it if a committer could review this PR.

Member

@felixcheung felixcheung left a comment

one minor comment, this looks reasonable, thank you!

any other comment?


When this checkbox is set to "on", the interpreters that are bound to the notebook are stopped automatically after the cron execution. This feature is useful if you want to release the interpreter resources after the cron execution.

> **Note**: A cron execution is skipped if one of the paragraphs is in a state of `RUNNING` or `PENDING` no matter whether it is executed automatically (i.e. by the cron scheduler) or manually.
Member

to be explicitly, let's add manually -> manually by an user opening this notebook

@kjmrknsn
Contributor Author

kjmrknsn commented Dec 8, 2017

@felixcheung Thanks for reviewing.

to be explicitly, let's add manually -> manually by an user opening this notebook

I fixed it. (I used the correct phrase, "a user".)

Thanks.

@felixcheung
Member

felixcheung commented Dec 8, 2017 via email

@kjmrknsn
Contributor Author

Travis CI turned green after merging #2703 into my branch: https://travis-ci.org/kjmrknsn/zeppelin/builds/315898589

I would be glad if this PR could be merged.

@Leemoonsoo
Member

Thanks @kjmrknsn. I'll merge to master if there is no further review!

@asfgit asfgit closed this in 888a05d Dec 20, 2017
@kjmrknsn kjmrknsn deleted the ZEPPELIN-3077 branch December 20, 2017 03:14
jithinchandranj pushed a commit to jithinchandranj/zeppelin that referenced this pull request Dec 20, 2017
…ron jobs takes long time or gets stuck

Author: Keiji Yoshida <[email protected]>

Closes apache#2687 from kjmrknsn/ZEPPELIN-3077 and squashes the following commits:

81e7218 [Keiji Yoshida] [ZEPPELIN-3077] Cron scheduler is easy to get stuck when one of the cron jobs takes long time or gets stuck