[ZEPPELIN-3077] Cron scheduler is easy to get stuck when one of the cron jobs takes long time or gets stuck #2687
@Leemoonsoo Thanks for checking and for your suggestion. I added a test case to
All of the CI tests passed: https://travis-ci.org/kjmrknsn/zeppelin/builds/308212514
    if (note.isRunningOrPending()) {
      logger.info("execution of the cron job is skipped because there is a running or pending " +
          "paragraph (note id: {})", noteId);
      return;
so if someone is also running the notebook manually when the scheduler kicks off, would that block the scheduler?
if so, this is a behavior change that is worthwhile to document
> so if someone is also running the notebook manually when the scheduler kicks off, would that block the scheduler?

Yes. I think that's OK because a notebook that is already being executed manually does not need to be kicked off by the cron scheduler as well.

> if so, this is a behavior change that is worthwhile to document

I'll add the explanation to the documentation.
Thanks.
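The guard discussed in this thread can be sketched in isolation. This is a minimal illustration, not Zeppelin's code: the `Note` class, the `Status` enum and `executeCron` below are hypothetical stand-ins, and only the method name `isRunningOrPending()` and the log message are taken from the diff.

```java
import java.util.Arrays;
import java.util.List;

public class CronSkipSketch {
    // Hypothetical paragraph states; only RUNNING and PENDING matter for the guard.
    enum Status { READY, PENDING, RUNNING, FINISHED }

    // Hypothetical stand-in for a note: an id plus the status of each paragraph.
    static final class Note {
        final String id;
        final List<Status> paragraphs;

        Note(String id, List<Status> paragraphs) {
            this.id = id;
            this.paragraphs = paragraphs;
        }

        // True if any paragraph is RUNNING or PENDING, whoever started it.
        boolean isRunningOrPending() {
            for (Status s : paragraphs) {
                if (s == Status.RUNNING || s == Status.PENDING) {
                    return true;
                }
            }
            return false;
        }
    }

    // Returns true if the cron run would start, false if it is skipped.
    static boolean executeCron(Note note) {
        if (note.isRunningOrPending()) {
            System.out.println("execution of the cron job is skipped because there is a "
                + "running or pending paragraph (note id: " + note.id + ")");
            return false;
        }
        // The real job would run all paragraphs of the note here.
        return true;
    }

    public static void main(String[] args) {
        Note idle = new Note("idle", Arrays.asList(Status.READY, Status.FINISHED));
        Note busy = new Note("busy", Arrays.asList(Status.FINISHED, Status.RUNNING));
        System.out.println(executeCron(idle));
        System.out.println(executeCron(busy));
    }
}
```

Because the check looks only at paragraph status, a manual run and a previous cron run are indistinguishable to it, which is the behavior change raised in the review.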
felixcheung
left a comment
so if someone has a lot of notebooks scheduled and running (from such schedules) at the same time, would it also cause quartz to run out of threads?
Or would it be ok in that case, since they will just be queued up and hopefully some scheduled one will finish at some point?
Yes, but I think jobs are queued and will be executed when one of the Quartz threads finishes its current job. Zeppelin administrators can change the number of Quartz threads, whose default value is 10, by deploying a quartz.properties file on the Zeppelin classpath.
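The queueing behavior described in this reply can be illustrated with a plain `java.util.concurrent` fixed-size pool. This is an analogy under stated assumptions, not Quartz itself: the class and method names are made up for the sketch, and Quartz's own handling of late triggers may differ in detail.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WorkerPoolSketch {
    // Fills a fixed pool with blocking jobs, submits one extra job, and reports
    // whether the extra job managed to start while every worker was still blocked.
    static boolean extraJobStartsWhileWorkersBlocked(int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch allBusy = new CountDownLatch(poolSize);
        CountDownLatch extraStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    allBusy.countDown();
                    release.await(); // stands in for a long-running or stuck job
                    return null;
                });
            }
            allBusy.await(); // every worker thread is now occupied
            Future<?> extra = pool.submit(() -> {
                extraStarted.countDown();
            });
            // The extra job sits in the queue because no worker is free.
            boolean startedEarly = extraStarted.await(200, TimeUnit.MILLISECONDS);
            release.countDown();               // let the blocked jobs finish
            extra.get(5, TimeUnit.SECONDS);    // the queued job now gets a worker
            return startedEarly;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(extraJobStartsWhileWorkersBlocked(2));
    }
}
```

The same reasoning explains the original bug: if the blocked jobs never finish, the queued job never runs, no matter how large the pool is.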
      logger.error(e.toString(), e);
    }
    if (note.isRunningOrPending()) {
      logger.info("execution of the cron job is skipped because there is a running or pending " +
better to change it to warning
OK, thanks.
@felixcheung I added the documentation page
I'd be glad if any committers would review this PR.
felixcheung
left a comment
one minor comment, this looks reasonable, thank you!
any other comment?
When this checkbox is set to "on", the interpreters which are binded to the notebook are stopped automatically after the cron execution. This feature is useful if you want to release the interpreter resources after the cron execution.

> **Note**: A cron execution is skipped if one of the paragraphs is in a state of `RUNNING` or `PENDING` no matter whether it is executed automatically (i.e. by the cron scheduler) or manually.
to be explicit, let's change "manually" -> "manually by a user opening this notebook"
@felixcheung Thanks for reviewing.
I fixed it. (I used a correct phrase of Thanks.
Ah yes ;)
Travis CI turned green after merging #2703 into my branch: https://travis-ci.org/kjmrknsn/zeppelin/builds/315898589 I would be glad if this PR could be merged.
Thanks @kjmrknsn. Merge to master if no further review!
…ron jobs takes long time or gets stuck

### What is this PR for?
The cron scheduler is easy to get stuck when one of the cron jobs takes long time or gets stuck.

I sometimes come across the issue that the cron scheduler stops working suddenly. According to the thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker threads were waiting for the job's completion and there was no thread to launch a new job.

Here are the contents of the thread dump:

```
"DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
    - locked <0x00000000c0a7dbf0> (a java.lang.Object)

   Locked ownable synchronizers:
    - None

"DefaultQuartzScheduler_Worker-9" #75 prio=5 os_prio=0 tid=0x00007fb41d3b2000 nid=0x1b520 waiting on condition [0x00007fb3daff2000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
    - locked <0x00000000c0a7a470> (a java.lang.Object)

   Locked ownable synchronizers:
    - None

...

"DefaultQuartzScheduler_Worker-2" #68 prio=5 os_prio=0 tid=0x00007fb41d3c8800 nid=0x1b519 waiting on condition [0x00007fb3da473000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
    - locked <0x00000000c0a7a7b0> (a java.lang.Object)

   Locked ownable synchronizers:
    - None

"DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
    - locked <0x00000000c0a7dd90> (a java.lang.Object)

   Locked ownable synchronizers:
    - None
```

The above thread dump says that all of the worker threads get stuck at https://github.com/apache/zeppelin/blob/v0.7.3/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Notebook.java#L889.

One way to reproduce this kind of issue is creating a paragraph whose status is "READY" and "disable run". That makes the paragraph status "READY" permanently and `note.isTerminated()` never turns to `true`.

To fix this issue, the following two improvements have been made in this PR:

1. Remove the unnecessary `while (!note.isTerminated()) { ... }` block because the execution of all of the paragraphs is finished after `note.runAll()`.
2. Skip the cron execution if there is a running or pending paragraph. That prevents the Zeppelin cron scheduler from getting stuck by the long running paragraph whose execution duration is greater than the cron execution cycle.

### What type of PR is it?
[Bug]

### Todos

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-3077

### How should this be tested?
* Tested manually.
  1. The cron scheduler does not get stuck if there is a paragraph whose status is "READY" and "disable run".
  2. The following message is printed to the log file when the cron job is launched while the previous cron job is still running.
     * `execution of the cron job is skipped because there is a running or pending paragraph (note id: XXXXXXXXX)`

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No.
* Is there breaking changes for older versions? No.
* Does this needs documentation? Yes. The behavior of the cron job was changed not to run if there is a running or pending paragraph by this PR. Thus, the documentation `docs/usage/other_features/cron_scheduler.md` was also added by this PR. Its layout is as follows:

<img width="711" alt="screen shot 2017-11-28 at 18 30 54" src="https://user-images.githubusercontent.com/31149688/33312407-20664e02-d46b-11e7-9715-9e2562d5e064.png">

Author: Keiji Yoshida <[email protected]>

Closes apache#2687 from kjmrknsn/ZEPPELIN-3077 and squashes the following commits:

81e7218 [Keiji Yoshida] [ZEPPELIN-3077] Cron scheduler is easy to get stuck when one of the cron jobs takes long time or gets stuck
What is this PR for?
The cron scheduler easily gets stuck when one of the cron jobs takes a long time or gets stuck.
I sometimes come across the issue that the cron scheduler suddenly stops working. According to the thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker threads were waiting for their job's completion and there was no thread left to launch a new job.
Here are the contents of the thread dump (reproduced in the merge commit message above):
The thread dump shows that all of the worker threads are stuck at https://github.com/apache/zeppelin/blob/v0.7.3/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Notebook.java#L889.
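One way to reproduce this kind of issue is to create a paragraph whose status is "READY" and which is set to "disable run". That keeps the paragraph status at "READY" permanently, so `note.isTerminated()` never turns to `true`.
To fix this issue, the following two improvements have been made in this PR:
1. Remove the unnecessary `while (!note.isTerminated()) { ... }` block because the execution of all of the paragraphs is finished after `note.runAll()`.
2. Skip the cron execution if there is a running or pending paragraph. That prevents the Zeppelin cron scheduler from getting stuck by the long running paragraph whose execution duration is greater than the cron execution cycle.
What type of PR is it?
[Bug]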
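The reproduction described under "What is this PR for?" can be sketched as follows. This is a simplified model, not Zeppelin's implementation: `Paragraph`, `runAll`, `isTerminated` and `pollUntilTerminated` are hypothetical, the sketch assumes a note counts as terminated only when every paragraph has finished, and the poll is bounded so that the example itself cannot hang.

```java
import java.util.Arrays;
import java.util.List;

public class PollingLoopSketch {
    enum Status { READY, PENDING, RUNNING, FINISHED }

    // Hypothetical paragraph: a paragraph with "disable run" is never executed.
    static final class Paragraph {
        Status status = Status.READY;
        final boolean enabled;

        Paragraph(boolean enabled) {
            this.enabled = enabled;
        }
    }

    // Runs every enabled paragraph to completion; disabled ones stay READY.
    static void runAll(List<Paragraph> note) {
        for (Paragraph p : note) {
            if (p.enabled) {
                p.status = Status.FINISHED;
            }
        }
    }

    // Assumption of this sketch: terminated means no paragraph is left unfinished.
    static boolean isTerminated(List<Paragraph> note) {
        for (Paragraph p : note) {
            if (p.status != Status.FINISHED) {
                return false;
            }
        }
        return true;
    }

    // Mimics the removed polling loop, but gives up after maxPolls checks.
    // Returns the number of polls needed, or -1 if the note never terminated.
    static int pollUntilTerminated(List<Paragraph> note, int maxPolls) {
        runAll(note);
        for (int polls = 0; polls < maxPolls; polls++) {
            if (isTerminated(note)) {
                return polls;
            }
            // The loop in the stack traces slept between checks; omitted here.
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Paragraph> healthy = Arrays.asList(new Paragraph(true), new Paragraph(true));
        List<Paragraph> withDisabled = Arrays.asList(new Paragraph(true), new Paragraph(false));
        System.out.println(pollUntilTerminated(healthy, 1000));      // prints 0
        System.out.println(pollUntilTerminated(withDisabled, 1000)); // prints -1
    }
}
```

Without the `maxPolls` bound, the second call would spin forever and hold its worker thread, which matches the stuck `DefaultQuartzScheduler_Worker` threads in the dump.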
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-3077
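How should this be tested?
* Tested manually.
  1. The cron scheduler does not get stuck if there is a paragraph whose status is "READY" and "disable run".
  2. The following message is printed to the log file when the cron job is launched while the previous cron job is still running.
     * `execution of the cron job is skipped because there is a running or pending paragraph (note id: XXXXXXXXX)`
Screenshots (if appropriate)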
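Questions:
* Does the licenses files need update? No.
* Is there breaking changes for older versions? No.
* Does this needs documentation? Yes. The behavior of the cron job was changed by this PR so that it does not run if there is a running or pending paragraph. Thus, the documentation `docs/usage/other_features/cron_scheduler.md` was also added by this PR. Its layout is as follows: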