@vanzin
Contributor

vanzin commented Nov 11, 2015

The code was using the wrong API to add data to the internal composite
buffer, causing buffers to leak in certain situations. Use the right
API and enhance the tests to catch memory leaks.

Also, avoid reusing the composite buffers when downstream handlers keep
references to them; this seems to cause a few different issues even though
the ref counting code seems to be correct, so instead pay the cost of copying
a few bytes when that situation happens.
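
To make the leak described above concrete, here is a minimal sketch of how a test can catch unreleased buffers. It is a toy model, not the patch itself: `ToyBuf` and `RefCountLeakSketch` are hypothetical names, `ToyBuf` is not netty's `ByteBuf`, and the "copying" path is only an assumption about what using the wrong API could look like, since the description does not name the API.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Toy reference-counted buffer. Hypothetical; this is NOT netty's ByteBuf. */
final class ToyBuf {
  private final byte[] data;
  private int refCnt = 1; // the creator holds the first reference

  ToyBuf(byte[] data) {
    this.data = data.clone();
  }

  int refCnt() {
    return refCnt;
  }

  byte[] bytes() {
    return data.clone();
  }

  void release() {
    if (refCnt == 0) {
      throw new IllegalStateException("buffer already released");
    }
    refCnt--;
  }
}

final class RefCountLeakSketch {
  /** Builds three small chunks, as if they had just arrived from the network. */
  static List<ToyBuf> makeChunks() {
    List<ToyBuf> chunks = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      chunks.add(new ToyBuf(new byte[] {(byte) i, (byte) (i + 1)}));
    }
    return chunks;
  }

  /** Leak check: counts buffers that still hold a reference. */
  static int leaked(List<ToyBuf> chunks) {
    int count = 0;
    for (ToyBuf chunk : chunks) {
      if (chunk.refCnt() > 0) {
        count++;
      }
    }
    return count;
  }

  /** Assumed buggy path: bytes are copied out, but the source chunk is never released. */
  static int runCopyingPath() {
    List<ToyBuf> chunks = makeChunks();
    ByteArrayOutputStream accumulated = new ByteArrayOutputStream();
    for (ToyBuf chunk : chunks) {
      byte[] bytes = chunk.bytes();
      accumulated.write(bytes, 0, bytes.length);
    }
    return leaked(chunks);
  }

  /** Ownership path: the accumulator adopts each chunk and releases it when done. */
  static int runAdoptingPath() {
    List<ToyBuf> chunks = makeChunks();
    List<ToyBuf> components = new ArrayList<>(chunks); // adopt the chunks, no copy
    for (ToyBuf component : components) {
      component.release();
    }
    return leaked(chunks);
  }

  public static void main(String[] args) {
    System.out.println("copying path leaked: " + runCopyingPath());
    System.out.println("adopting path leaked: " + runAdoptingPath());
  }
}
```

The point of the sketch is the shape of the check: a test that tracks every buffer it hands to the decoder and asserts that none still holds a reference afterwards will flag the copying path and pass the adopting one.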

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 11, 2015

Test build #45603 has finished for PR 9619 at commit ed0c1d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * public class JavaLBFGSExample
      * class LDA @Since("1.6.0") (
      * case class Metadata(
      * require(className == expectedClassName, s"Error loading metadata: Expected class name" +
      * class SlidingRDDPartition[T](val idx: Int, val prev: Partition, val tail: Seq[T], val offset: Int)
      * class SlidingRDD[T: ClassTag](@transient val parent: RDD[T], val windowSize: Int, val step: Int)

@SparkQA

SparkQA commented Nov 11, 2015

Test build #45608 has finished for PR 9619 at commit ed0c1d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This makes the frame decoder behave more like netty's ByteToMessageDecoder,
at the expense of copying some data in a few cases.
@SparkQA

SparkQA commented Nov 13, 2015

Test build #45779 has finished for PR 9619 at commit 8a7d194.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Nov 13, 2015

/cc @zsxwing @rxin

@SparkQA

SparkQA commented Nov 14, 2015

Test build #45909 has finished for PR 9619 at commit 180456c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Nov 16, 2015

retest this please

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45999 has finished for PR 9619 at commit 180456c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Marcelo Vanzin added 2 commits November 16, 2015 12:51
This test actually fails with java.lang.IndexOutOfBoundsException if the
fix in this patch set is disabled.
Member

@vanzin could you give a real case? Or is this just for correctness, even if downstream code in Spark doesn't use retain()?

Contributor Author

Actually Spark does use retain() when fetching shuffle blocks, and for some reason that causes problems. I think the real problem is somewhere in netty code, but this is the workaround the netty code itself uses (see ByteToMessageDecoder).

Member

Yeah. Just saw retain() in ChunkFetchSuccess.
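
For readers following the retain() discussion, here is a rough sketch of the copy-instead-of-reuse idea. It is a toy model under stated assumptions, not the code in this patch and not netty's implementation: `SharedBuf`, `CopyOnRetainSketch`, and `afterHandoff` are hypothetical names, and the only idea taken from the thread is "if a downstream handler still holds a reference, copy the leftover bytes into a fresh buffer rather than keep writing into the shared one".

```java
import java.util.Arrays;

/** Toy ref-counted buffer with a read index. Hypothetical; NOT netty's ByteBuf. */
final class SharedBuf {
  private final byte[] data;
  private int readerIndex = 0;
  private int refCnt = 1; // the decoder holds the first reference

  SharedBuf(byte[] data) {
    this.data = data.clone();
  }

  int refCnt() {
    return refCnt;
  }

  int readableBytes() {
    return data.length - readerIndex;
  }

  /** Called by a downstream handler that wants to keep the buffer alive. */
  SharedBuf retain() {
    refCnt++;
    return this;
  }

  void release() {
    if (refCnt == 0) {
      throw new IllegalStateException("buffer already released");
    }
    refCnt--;
  }

  /** Consumes n bytes, as if a frame had been decoded from them. */
  byte[] readBytes(int n) {
    byte[] out = Arrays.copyOfRange(data, readerIndex, readerIndex + n);
    readerIndex += n;
    return out;
  }

  /** Copies the bytes that have not been consumed yet. */
  byte[] remaining() {
    return Arrays.copyOfRange(data, readerIndex, data.length);
  }
}

final class CopyOnRetainSketch {
  /**
   * Decides which buffer the decoder keeps accumulating into after frames
   * have been handed downstream. If someone else retained the buffer, the
   * leftover bytes are copied and the decoder drops its own reference.
   */
  static SharedBuf afterHandoff(SharedBuf buf) {
    if (buf.refCnt() > 1) {
      SharedBuf fresh = new SharedBuf(buf.remaining());
      buf.release(); // downstream still owns its reference
      return fresh;
    }
    return buf; // nobody else holds it, so reuse is safe
  }

  public static void main(String[] args) {
    SharedBuf buf = new SharedBuf(new byte[] {1, 2, 3, 4, 5, 6});
    buf.readBytes(4); // a 4-byte frame was decoded
    buf.retain();     // downstream keeps a reference to it
    SharedBuf next = afterHandoff(buf);
    System.out.println("reused same buffer: " + (next == buf));
    System.out.println("bytes carried over: " + next.readableBytes());
  }
}
```

The trade-off matches the one stated in the description: the copy costs a few bytes of work in the retained case, and in exchange the decoder never mutates a buffer that another handler can still see.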

@zsxwing
Member

zsxwing commented Nov 16, 2015

LGTM pending tests.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #46015 has finished for PR 9619 at commit dcdfc31.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Nov 16, 2015

retest this please

@zsxwing
Member

zsxwing commented Nov 16, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/46015/

This is a bug that will be fixed in #9707

@SparkQA

SparkQA commented Nov 16, 2015

Test build #46019 has finished for PR 9619 at commit 7fe9617.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46024 has finished for PR 9619 at commit 7fe9617.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Nov 17, 2015

Merging this (master / 1.6).

asfgit pushed a commit that referenced this pull request Nov 17, 2015
The code was using the wrong API to add data to the internal composite
buffer, causing buffers to leak in certain situations. Use the right
API and enhance the tests to catch memory leaks.

Also, avoid reusing the composite buffers when downstream handlers keep
references to them; this seems to cause a few different issues even though
the ref counting code seems to be correct, so instead pay the cost of copying
a few bytes when that situation happens.

Author: Marcelo Vanzin <[email protected]>

Closes #9619 from vanzin/SPARK-11617.

(cherry picked from commit 540bf58)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in 540bf58 Nov 17, 2015
@vanzin vanzin deleted the SPARK-11617 branch November 19, 2015 23:42