Skip to content

Conversation

@JoshRosen
Copy link
Contributor

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See SPARK-6209 for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases.

This class was renamed and heavily-refactored in Spark 1.3+, so I'll open a separate PR to fix this issue there.

Conflicts:
	repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
@SparkQA
Copy link

SparkQA commented Mar 6, 2015

Test build #28358 has started for PR 4935 at commit cc685a8.

  • This patch merges cleanly.

@JoshRosen
Copy link
Contributor Author

/cc @pwendell

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a class cannot be found at the driver, then Jetty will return a 404 status code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this still might be a subtly-wrong use of HttpUrlConnection; see

http://stackoverflow.com/a/39449/590203
http://www.tbray.org/ongoing/When/201x/2012/01/17/HttpURLConnection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, even this is still subtly-unsafe due to HttpUrlConnection's weird KeepAlive behavior:

https://plumbr.eu/blog/12-year-old-bug-in-jdk-still-out-there-leaking-memory
https://news.ycombinator.com/item?id=4931911

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The second cast is redundant?) Can this be mitigated with Connection: close, or by just making a HEAD request? I know, that might require libraries.

@JoshRosen
Copy link
Contributor Author

This actually turns out to be a much more subtle issue than I originally anticipated, since we might need to do other work to handle HttpUrlConnection's KeepAlive behavior: http://docs.oracle.com/javase/7/docs/technotes/guides/net/http-keepalive.html.

@SparkQA
Copy link

SparkQA commented Mar 7, 2015

Test build #28358 has finished for PR 4935 at commit cc685a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28358/
Test PASSed.

@JoshRosen
Copy link
Contributor Author

I'm going to close this PR and re-open against master, since it looks like the conflicts shouldn't be too bad.

@JoshRosen
Copy link
Contributor Author

I've opened an updated PR against the master branch at #4944 and added a regression test (which was slightly non-trivial to write).

@JoshRosen JoshRosen closed this Mar 9, 2015
asfgit pushed a commit that referenced this pull request Mar 24, 2015
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of #4935 and adds a regression test.

Author: Josh Rosen <[email protected]>

Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader

(cherry picked from commit 7215aa7)
Signed-off-by: Andrew Or <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 24, 2015
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of #4935 and adds a regression test.

Author: Josh Rosen <[email protected]>

Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader
@JoshRosen JoshRosen deleted the executorclassloader-leak branch March 24, 2015 22:17
JoshRosen added a commit to JoshRosen/spark that referenced this pull request Mar 24, 2015
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of apache#4935 and adds a regression test.

Author: Josh Rosen <[email protected]>

Closes apache#4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader

(cherry picked from commit 7215aa7)
Signed-off-by: Andrew Or <[email protected]>

Conflicts:
	repl/pom.xml
	repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants