
Conversation

@sryza (Contributor) commented Jan 6, 2015

"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
-the Tuning Spark page

This is a pain. It would be much nicer to expose simple functionality for understanding the memory footprint of a Java object.

@SparkQA commented Jan 6, 2015

Test build #25109 has started for PR 3913 at commit 29fa503.

  • This patch merges cleanly.

@pwendell (Contributor) commented Jan 6, 2015

That doc is very outdated - you can actually just look in the UI after caching some data; you don't need to visit the logs.

@SparkQA commented Jan 6, 2015

Test build #25109 has finished for PR 3913 at commit 29fa503.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25109/

@sryza (Contributor, Author) commented Jan 6, 2015

Ooh OK I'll update the doc. That's still a little cumbersome though for someone who just wants to see how much space an object takes up. Most of the recommendations on that page are at the micro level - tuning the memory taken up by a single object. It would be useful to have a way to determine this amount directly.

E.g. if I plan to have an RDD of some case class, I might want to see how much space each instance takes up. Then I can experiment quickly with things like flatter structures or removing fields and see if they improve the footprint.

Another thing is that users trying to tune applications commonly need to decide whether to broadcast side-data or load it as an RDD and join. It would be helpful to have a single function call, easy to use from the shell, to find out the size of the data you're about to broadcast.
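
For illustration, here is a rough sketch of the kind of one-liner being described, using the SizeEstimator.estimate method this PR ends up exposing (the side-data value and the size threshold below are made up):

    import org.apache.spark.util.SizeEstimator

    // Hypothetical side-data we are deciding whether to broadcast.
    val lookupTable: Map[String, Array[Double]] =
      (1 to 1000).map(i => i.toString -> Array.fill(10)(i.toDouble)).toMap

    // One call gives an estimate of the in-memory footprint, in bytes.
    val bytes = SizeEstimator.estimate(lookupTable)

    // Made-up threshold: broadcast if small enough, otherwise load as an RDD and join.
    if (bytes < 100L * 1024 * 1024) {
      val broadcasted = sc.broadcast(lookupTable)
    }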

@pwendell (Contributor)

Just not sure how useful it would be overall. For RDD data, it might be slightly misleading because of things like in-memory serialization. For broadcast objects, it would only work in the Scala shell because of the way serialization works in Python.

I'm also not totally sure how accurate our memory estimation is overall, and it may get less so if we add smarter caching for SchemaRDDs. Anyway, what would be helpful: could you walk through an example with a case class or something and show how accurate it is? That would be useful for understanding it better.

One thing we could do that would be more isolated is have a function in SparkContext called estimateSizeOf(object: Any), so that at least we don't expose the class location and names as APIs.
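
Purely for illustration, a minimal sketch of what such a wrapper might look like; this exact form was never merged (the final change exposes SizeEstimator itself instead):

    import org.apache.spark.util.SizeEstimator

    // Hypothetical: the narrower utility proposed above, shown as a standalone helper
    // rather than an actual SparkContext method.
    object SizeOfHelper {
      def estimateSizeOf(obj: Any): Long =
        SizeEstimator.estimate(obj.asInstanceOf[AnyRef])
    }

    // e.g. SizeOfHelper.estimateSizeOf(Map("a" -> 1, "b" -> 2))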

@srowen (Member) commented Feb 11, 2015

Should this just turn into a doc update PR then?

@sryza (Contributor, Author) commented Feb 11, 2015

Adding an estimateSizeOf method to SparkContext sounds reasonable to me.

I agree that there's not a great way to expose something like this for Python. But I don't think the zaniness of Python-JVM interaction means that we shouldn't expose useful functionality to pure-JVM apps.

> For RDD data, it might be slightly misleading because of things like in-memory serialization.

I think this is the kind of thing we can just document. Adding a separate estimateSerializedSizeOf method would be helpful as well.

> I'm also not totally sure how accurate our memory estimation is overall, and it may get less so if we add smarter caching for SchemaRDDs.

I've found it to be very accurate in my experiments. We rely on its accuracy for shuffle memory management and POJO caching, so to the extent that it's inaccurate we've got bigger problems.

@SparkQA commented Feb 11, 2015

Test build #27305 has started for PR 3913 at commit 62bac82.

  • This patch merges cleanly.

@shivaram (Contributor)

The size estimator should be pretty accurate for measuring the size of small Java objects. You could add a note that we do some approximation for large arrays (we sample some elements of the array and then extrapolate).
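
To make the sampling idea concrete, here is a toy sketch of sample-and-extrapolate over a large array. It is illustrative only and is not Spark's actual SizeEstimator internals; the sample size and the decision to ignore array header overhead are made up:

    import scala.util.Random
    import org.apache.spark.util.SizeEstimator

    // Toy sample-and-extrapolate: estimate a few elements, then scale up.
    def approxArraySize(arr: Array[AnyRef], sampleSize: Int = 100, seed: Long = 42L): Long = {
      if (arr.length <= sampleSize) {
        SizeEstimator.estimate(arr)
      } else {
        val rng = new Random(seed)
        val sampled = Array.fill(sampleSize)(arr(rng.nextInt(arr.length)))
        val avgElementSize = sampled.map(e => SizeEstimator.estimate(e)).sum / sampleSize
        // Scale the sampled per-element cost to the whole array.
        avgElementSize * arr.length
      }
    }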

@SparkQA commented Feb 11, 2015

Test build #27305 has finished for PR 3913 at commit 62bac82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27305/

@srowen (Member) commented Apr 15, 2015

I'd like to resolve this one way or the other. My hesitation is mostly about tacking on another method to SparkContext, developer API or no. Would it really be a better way to estimate actual RDD size, since that depends on other things (even the JVM, whether data is serialized, etc.)?

@squito (Contributor) commented Apr 15, 2015

I think this is a good change. Yes, you could cache an RDD and see its size, but think about what a pain that actually is if you wanted to do it programmatically. You'd need to register a Spark listener, wait for the appropriate events, and look at the sizes.

If it was easy to do this programmatically via an RDD, then I'd say this change isn't necessary. E.g., if you could do something like

    val (_, meta) = sc.parallelize(oneObject).cache().count()
    meta.getPartitionSizes

Then we wouldn't need to expose this. But that's an even bigger API change, and one that I would be far more nervous about (the code above is definitely not a viable alternative; there are lots of reasons it doesn't make sense).
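
For context on the listener-based route mentioned above, a rough sketch of the plumbing it takes today, assuming the DeveloperApi listener interfaces; event-bus timing and races are glossed over, and oneObject is a stand-in for whatever you want to measure:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Collect the in-memory size of RDD blocks cached during completed stages.
    class CachedSizeListener extends SparkListener {
      @volatile var cachedBytes: Long = 0L
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        cachedBytes = stageCompleted.stageInfo.rddInfos.map(_.memSize).sum
      }
    }

    val listener = new CachedSizeListener
    sc.addSparkListener(listener)

    val oneObject = Map("key" -> Array.fill(1000)(1.0))
    sc.parallelize(Seq(oneObject)).cache().count()
    // Once the stage-completed event has been processed, listener.cachedBytes
    // holds the cached size reported for the stage's RDDs.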

Contributor (inline review comment)

nit: this import isn't necessary anymore

@sryza (Contributor, Author) commented Apr 15, 2015

Again, one of the main uses is estimating the size of variables you're considering broadcasting. Another is experimenting with different representations - e.g. how much more efficient is declaring a custom class than just using a hash map? In these situations, putting the data into an RDD to estimate the size would be an inconvenience.
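
As a concrete version of that experiment, a small sketch comparing a flat case class against a map-based record (the record shape and field names are invented for illustration):

    import org.apache.spark.util.SizeEstimator

    case class Click(userId: Long, url: String, durationMs: Int)

    val asCaseClass = Click(42L, "https://example.com/page", 1234)
    val asMap: Map[String, Any] =
      Map("userId" -> 42L, "url" -> "https://example.com/page", "durationMs" -> 1234)

    // Compare the per-record footprint of the two representations directly.
    println(s"case class: ${SizeEstimator.estimate(asCaseClass)} bytes")
    println(s"map:        ${SizeEstimator.estimate(asMap)} bytes")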

@sryza force-pushed the sandy-spark-5112 branch from 62bac82 to 4aba898 on April 16, 2015 at 16:43
@SparkQA commented Apr 16, 2015

Test build #30423 has started for PR 3913 at commit 4aba898.

@srowen (Member) commented Apr 16, 2015

OK, I think my remaining comment is just: why route through SparkContext? Can this method be exposed where it is? It doesn't seem like it has to be tied to the context.

@sryza (Contributor, Author) commented Apr 16, 2015

@pwendell thought this would be preferable:

> One thing we could do that would be more isolated is have a function in SparkContext called estimateSizeOf(object: Any), so that at least we don't expose the class location and names as APIs.

@pwendell (Contributor)

The reason I proposed putting it in SparkContext is to avoid committing to the current namespace/package of that object and to just expose a narrower utility function off of SparkContext. Overall, our estimation code is likely to evolve in the future. For instance, we may want to have a nested package under util that deals with memory management stuff.

All that said, if we expose other utilities to users (can't remember now) under the util package, then I'd be okay to do that too if @srowen and @sryza think it's nicer.

In terms of exposing or not, I'm okay to expose it given the reasons here. But can we give some warning to set expectations for the user? This estimation can be really inaccurate because of the sampling and heuristics used internally. This is especially true if you have, say, a hashmap with skewed keys - it will only sample a small percentage of the keyspace and could miss hot keys.

So I'd just say it's an estimate of the in-memory size and uses sampling internally for complex objects. I think this is also the gist of @shivaram's suggestion.
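
As a sketch of that caveat (illustrative only; exact numbers will vary, and the point is just that a sampled estimate can diverge from the true footprint when values are heavily skewed):

    import org.apache.spark.util.SizeEstimator

    // One "hot" value is vastly larger than the rest of the map's values.
    val hotValue = "x" * 10000000
    val skewed: Map[Int, String] =
      (1 to 100000).map(i => i -> "small").toMap + (0 -> hotValue)

    // If the estimator samples only part of this structure, it may under-count the
    // hot value, so sanity-check the estimate against your own expectations.
    println(s"estimated: ${SizeEstimator.estimate(skewed)} bytes")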

@SparkQA commented Apr 16, 2015

Test build #30423 has finished for PR 3913 at commit 4aba898.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30423/

@srowen (Member) commented Apr 16, 2015

OK, how about just moving it up a level out of util, if that's generally not public, and making SizeEstimator public but a developer API? You have to commit to the method living somewhere, and I think that's no more complex than routing it through the context API.

@pwendell (Contributor)

Sure, sounds fine to me. We can have a static method for it.

@sryza force-pushed the sandy-spark-5112 branch from 4aba898 to 8adde39 on April 29, 2015 at 21:36
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@SparkQA commented Apr 29, 2015

Test build #31334 has started for PR 3913 at commit 8adde39.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA commented May 5, 2015

Test build #31864 has started for PR 3913 at commit 8d9e082.

@SparkQA commented May 5, 2015

Test build #31864 has finished for PR 3913 at commit 8d9e082.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31864/

asfgit pushed a commit that referenced this pull request May 5, 2015
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
-the Tuning Spark page

This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object.

Author: Sandy Ryza <[email protected]>

Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:

8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
e21c1f4 [Sandy Ryza] Remove unused import
798ab88 [Sandy Ryza] Update documentation and add to SparkContext
34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api

(cherry picked from commit 4222da6)
Signed-off-by: Sean Owen <[email protected]>
@asfgit closed this in 4222da6 on May 5, 2015
@rxin (Contributor) commented May 5, 2015

Actually I strongly oppose putting this in the top level package. We will end up with a lot of random util objects or top level classes. In my mind, most of the stuff we put in the top level class right now should not be there.

One way to do this is to have a public util package, and move everything in util into an internal util if we want to hide utils.

@srowen (Member) commented May 6, 2015

@rxin yeah, I hear you. It's not too late to move this. I struggle a bit with where to put this, then. So you mean move util to util.internal or something, and move public utilities into util? Makes sense to me.

There are probably a few other things that could move down out of org.apache.spark that are not already public. Like the HTTP server stuff for example?
And I have thought and heard a few times that it might be nice to break up Utils into a few logical units.

It would be a big invasive change to the code; that's the only thing. I can take a swing at it if anyone voices support for that kind of thing?

@rxin (Contributor) commented May 6, 2015

I think it's a good idea to do that. For now, why don't we move SizeEstimator back into util and rename the current util.SizeEstimator to util.SizeEstimator0?

For 1.5 we can move stuff around into utils or utils.internal (seems long ... maybe we can find better names too).

@srowen (Member) commented May 6, 2015

I think @pwendell's sentiment was that util shouldn't have public classes/objects -- what do you think, Patrick? As an interim measure?

@rxin (Contributor) commented May 6, 2015

But we already have a bunch of stuff in util that's public ...
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.package

@pwendell (Contributor) commented May 6, 2015

I'm okay to nest it under util, per Reynold's suggestion.


@srowen (Member) commented May 6, 2015

@sryza do you mind doing the honors as a part 2 for this issue?

@sryza (Contributor, Author) commented May 10, 2015

@pwendell @rxin if we're putting it in the same package, I don't really understand why we don't just have a single class SizeEstimator with a single public method, estimate. Are we worried that in the future someone might make other attributes of SizeEstimator public? In a SizeEstimator / SizeEstimator0 world, it would be nearly as easy to add something public to SizeEstimator. I don't see why we wouldn't trust ourselves to properly restrict access when a change proposing that comes around.

@rxin (Contributor) commented May 10, 2015

I'm fine putting them in the same class too, provided that everything else is marked as private (I think it already is, isn't it?).
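
For illustration, a minimal sketch of the shape being agreed on here: one public entry point, everything else private. This is not Spark's actual source, just the structure:

    import org.apache.spark.annotation.DeveloperApi

    // Sketch only: a single public method; internals stay private to the object.
    @DeveloperApi
    object SizeEstimatorSketch {
      def estimate(obj: AnyRef): Long = estimateGraph(obj)

      // Placeholder for the real reflective walk over the object graph.
      private def estimateGraph(obj: AnyRef): Long = 64L
    }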

@sryza (Contributor, Author) commented May 13, 2015

Is that OK with you as well, @pwendell?

@pwendell (Contributor)

Okay - I officially give in and will not comment further on this thread :P


mbautin pushed a commit to mbautin/spark that referenced this pull request May 18, 2015 (the same SPARK-5112 commit as above).

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015 (the same SPARK-5112 commit as above).

asfgit pushed a commit that referenced this pull request May 28, 2015
See comments on #3913

Author: Reynold Xin <[email protected]>

Closes #6471 from rxin/sizeestimator and squashes the following commits:

c057095 [Reynold Xin] Fixed import.
2da478b [Reynold Xin] Remove SizeEstimator from o.a.spark package.

(cherry picked from commit 0077af2)
Signed-off-by: Reynold Xin <[email protected]>
asfgit pushed a commit that referenced this pull request May 28, 2015 (the same #6471 change as above).

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (the same SPARK-5112 commit as above).

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (the same #6471 change as above).

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (the same SPARK-5112 commit as above).

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (the same #6471 change as above).