@xiejiajun
Contributor

What is this PR for?

  • added timeout for getting Thrift client to avoid situations where the interpreter may not be restarted when the interpreter process exits unexpectedly

What type of PR is it?

  • Bug Fix

What is the Jira issue?

Questions:

  • Does the licenses files need update? NO
  • Is there breaking changes for older versions? NO
  • Does this needs documentation? NO

… interpreter may not be restarted when the interpreter process exits unexpectedly
@zjffdu
Contributor

zjffdu commented Mar 20, 2020

Thanks for the contribution @xiejiajun, could you let others know what kind of problem you are trying to resolve and how to reproduce this issue?

@xiejiajun
Contributor Author

> Thanks for the contribution @xiejiajun, could you let others know what kind of problem you are trying to resolve and how to reproduce this issue?

@zjffdu, when I manually kill the interpreter process and then try to restart the interpreter with the restart button on the web page, the interpreter occasionally fails to restart. The ZeppelinServer service must then be restarted to recover. I traced the source code and found that the call chain for closing the interpreter blocks in the RemoteInterpreterProcess.getClient method.

@zjffdu
Contributor

zjffdu commented Mar 21, 2020

Which interpreter do you use? Do you use the latest master branch? A related issue was resolved recently.

@xiejiajun
Contributor Author

> Which interpreter do you use? Do you use the latest master branch? A related issue was resolved recently.

@zjffdu Sorry, I have not been able to find a related issue. I am using the 0.8.2 branch, and the Spark interpreter often has this problem. The related code in the 0.9 branch has not changed, so the problem should exist there as well. Could you point me to the issue that was resolved recently? I'll check whether it is related to this one.

@zjffdu
Contributor

zjffdu commented Mar 21, 2020

@xiejiajun Sorry, I forgot to paste the issue: ZEPPELIN-4600
BTW, is it hard to reproduce this issue? You mentioned that it happens occasionally, so I am wondering how often it can be reproduced. Which mode do you use, yarn-client or yarn-cluster?

@xiejiajun
Contributor Author

> @xiejiajun Sorry, I forgot to paste the issue: ZEPPELIN-4600
> BTW, is it hard to reproduce this issue? You mentioned that it happens occasionally, so I am wondering how often it can be reproduced. Which mode do you use, yarn-client or yarn-cluster?

I am using yarn-client mode. Analyzing the call chain shows the following: after the local Spark driver process is terminated manually with kill -9, restarting the Spark interpreter waits indefinitely for an available Thrift client if the number of active Thrift clients in the ClientPool connected to that interpreter is greater than or equal to maxTotal (default value 8).
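
To illustrate the failure mode described above, here is a minimal sketch using only the JDK. It is not Zeppelin's ClientPool code: `BoundedClientPool`, `borrowBlocking`, `borrow`, and `release` are hypothetical names, and a `Semaphore` stands in for a pool capped at maxTotal clients. The assumption is that clients held against a killed process are never returned, so an untimed borrow would block forever, while a timed borrow gives the caller a chance to fail and proceed with the restart.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a bounded client pool; NOT Zeppelin's ClientPool.
final class BoundedClientPool {
    // One permit per client the pool is allowed to hand out (maxTotal).
    private final Semaphore permits;

    BoundedClientPool(int maxTotal) {
        this.permits = new Semaphore(maxTotal);
    }

    // Untimed borrow: blocks until a client is returned.
    // If every client is leaked, this call never returns.
    void borrowBlocking() throws InterruptedException {
        permits.acquire();
    }

    // Timed borrow: waits at most timeoutMillis, then reports failure
    // (returns false) instead of hanging the caller.
    boolean borrow(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    // Returns a client to the pool.
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedClientPool pool = new BoundedClientPool(8); // maxTotal = 8

        // Simulate 8 active clients that are never returned
        // (for example, because the remote process was killed).
        for (int i = 0; i < 8; i++) {
            pool.borrow(100);
        }

        // With a timeout the 9th borrow gives up; borrowBlocking() would hang here.
        boolean got = pool.borrow(100);
        System.out.println("9th borrow succeeded: " + got);
    }
}
```

With the timed variant, a restart path can treat a failed borrow as "process is gone" and continue with cleanup rather than waiting on a client that will never come back.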

@zjffdu
Contributor

zjffdu commented Mar 23, 2020

Thanks for the detailed explanation, PR LGTM

asfgit pushed a commit that referenced this pull request Mar 23, 2020
…exited unexpectedly

### What is this PR for?
- added timeout for getting Thrift client to avoid situations where the interpreter may not be restarted when the interpreter process exits unexpectedly

### What type of PR is it?
- Bug Fix

### What is the Jira issue?
- https://issues.apache.org/jira/browse/ZEPPELIN-4691

### Questions:
* Does the licenses files need update? NO
* Is there breaking changes for older versions? NO
* Does this needs documentation? NO

Author: xiejiajun <[email protected]>

Closes #3695 from xiejiajun/branch-0.9 and squashes the following commits:

9b3c744 [xiejiajun] added timeout for getting Thrift client to avoid situations where the interpreter may not be restarted when the interpreter process exits unexpectedly
@asfgit asfgit closed this in b0a26b4 Mar 23, 2020