-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes #4935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Conflicts: repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
|
Test build #28358 has started for PR 4935 at commit
|
|
/cc @pwendell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a class cannot be found at the driver, then Jetty will return a 404 status code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this still might be a subtly-wrong use of HttpUrlConnection; see
http://stackoverflow.com/a/39449/590203
http://www.tbray.org/ongoing/When/201x/2012/01/17/HttpURLConnection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, even this is still subtly-unsafe due to HttpUrlConnection's weird KeepAlive behavior:
https://plumbr.eu/blog/12-year-old-bug-in-jdk-still-out-there-leaking-memory
https://news.ycombinator.com/item?id=4931911
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The second cast is redundant?) Can this be mitigated with Connection: close, or by just making a HEAD request? I know, that might require libraries.
|
This actually turns out to be a much more subtle issue than I originally anticipated, since we might need to do other work to handle HttpUrlConnection's KeepAlive behavior: http://docs.oracle.com/javase/7/docs/technotes/guides/net/http-keepalive.html. |
|
Test build #28358 has finished for PR 4935 at commit
|
|
Test PASSed. |
|
I'm going to close this PR and re-open against master, since it looks like the conflicts shouldn't be too bad. |
|
I've opened an updated PR against the master branch at #4944 and added a regression test (which was slightly non-trivial to write). |
…g to load classes (master branch PR) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. This PR is an extended version of #4935 and adds a regression test. Author: Josh Rosen <[email protected]> Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits: e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream 961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail 7ee2261 [Josh Rosen] Add a failing regression test e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader (cherry picked from commit 7215aa7) Signed-off-by: Andrew Or <[email protected]>
…g to load classes (master branch PR) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. This PR is an extended version of #4935 and adds a regression test. Author: Josh Rosen <[email protected]> Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits: e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream 961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail 7ee2261 [Josh Rosen] Add a failing regression test e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader
…g to load classes (master branch PR) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. This PR is an extended version of apache#4935 and adds a regression test. Author: Josh Rosen <[email protected]> Closes apache#4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits: e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream 961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail 7ee2261 [Josh Rosen] Add a failing regression test e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader (cherry picked from commit 7215aa7) Signed-off-by: Andrew Or <[email protected]> Conflicts: repl/pom.xml repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See SPARK-6209 for more details, including a bug reproduction.
This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases.
This class was renamed and heavily-refactored in Spark 1.3+, so I'll open a separate PR to fix this issue there.