
Conversation

@sryza (Contributor) commented Jan 6, 2015

"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
-the Tuning Spark page

This is a pain. It would be much nicer to expose simple functionality for understanding the memory footprint of a Java object.

@SparkQA commented Jan 6, 2015

Test build #25109 has started for PR 3913 at commit 29fa503.

  • This patch merges cleanly.

@pwendell (Contributor) commented Jan 6, 2015

That doc is very outdated - you can actually just look in the UI after caching some data; you don't need to visit the logs.

@SparkQA commented Jan 6, 2015

Test build #25109 has finished for PR 3913 at commit 29fa503.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25109/

@sryza (Contributor, Author) commented Jan 6, 2015

Ooh OK I'll update the doc. That's still a little cumbersome though for someone who just wants to see how much space an object takes up. Most of the recommendations on that page are at the micro level - tuning the memory taken up by a single object. It would be useful to have a way to determine this amount directly.

E.g. if I plan to have an RDD of some case class, I might want to see how much space each instance takes up. Then I can experiment quickly with things like flatter structures or removing fields and see if they improve the footprint.

Another thing is that users trying to tune applications commonly need to decide whether to broadcast side-data or load it as an RDD and join. It would be helpful to have a single function call, easy to use from the shell, to find out the size of the data you're about to broadcast.
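
For illustration, here is a rough sketch of the kind of one-liner being described, using the SizeEstimator.estimate method this PR ends up exposing (the side-data value and the size threshold below are made up):

    import org.apache.spark.util.SizeEstimator

    // Hypothetical side-data we are deciding whether to broadcast.
    val lookupTable: Map[String, Array[Double]] =
      (1 to 1000).map(i => i.toString -> Array.fill(10)(i.toDouble)).toMap

    // One call gives an estimate of the in-memory footprint, in bytes.
    val bytes = SizeEstimator.estimate(lookupTable)

    // Made-up threshold: broadcast if small enough, otherwise load as an RDD and join.
    if (bytes < 100L * 1024 * 1024) {
      val broadcasted = sc.broadcast(lookupTable)
    }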

@pwendell (Contributor)

Just not sure how useful it would be overall. For RDD data, it might be slightly misleading because of things like in-memory serialization. For broadcast objects, it would only work in the Scala shell because of the way serialization works in Python.

I'm also not totally sure how accurate our memory estimation is overall, and it may get less so if we add smarter caching for SchemaRDDs. Anyway, what would be helpful: could you walk through an example with a case class or something and show how accurate it is? That would be useful for understanding it better.

One thing we could do that would be more isolated is have a function in SparkContext called estimateSizeOf(object: Any), so that at least we don't expose the class location and names as APIs.
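
Purely for illustration, a minimal sketch of what such a wrapper might look like; this exact form was never merged (the final change exposes SizeEstimator itself instead):

    import org.apache.spark.util.SizeEstimator

    // Hypothetical: the narrower utility proposed above, shown as a standalone helper
    // rather than an actual SparkContext method.
    object SizeOfHelper {
      def estimateSizeOf(obj: Any): Long =
        SizeEstimator.estimate(obj.asInstanceOf[AnyRef])
    }

    // e.g. SizeOfHelper.estimateSizeOf(Map("a" -> 1, "b" -> 2))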

@srowen (Member) commented Feb 11, 2015

Should this just turn into a doc update PR then?

@sryza (Contributor, Author) commented Feb 11, 2015

Adding an estimateSizeOf method to SparkContext sounds reasonable to me.

I agree that there's not a great way to expose something like this for Python. But I don't think the zaniness of Python-JVM interaction means that we shouldn't expose useful functionality to pure-JVM apps.

> For RDD data, it might be slightly misleading because of things like in-memory serialization.

I think this is the kind of thing we can just document. Adding a separate estimateSerializedSizeOf method would be helpful as well.

> I'm also not totally sure how accurate our memory estimation is overall, and it may get less so if we add smarter caching for SchemaRDDs.

I've found it to be very accurate in my experiments. We rely on its accuracy for shuffle memory management and POJO caching, so to the extent that it's inaccurate we've got bigger problems.

@SparkQA commented Feb 11, 2015

Test build #27305 has started for PR 3913 at commit 62bac82.

  • This patch merges cleanly.

@shivaram (Contributor)

The size estimator should be pretty accurate for measuring the size of small Java objects. You could add a note that we do some approximation for large arrays (we sample some elements of the array and then extrapolate).
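
To make the sampling idea concrete, here is a toy sketch of sample-and-extrapolate over a large array. It is illustrative only and is not Spark's actual SizeEstimator internals; the sample size and the decision to ignore array header overhead are made up:

    import scala.util.Random
    import org.apache.spark.util.SizeEstimator

    // Toy sample-and-extrapolate: estimate a few elements, then scale up.
    def approxArraySize(arr: Array[AnyRef], sampleSize: Int = 100, seed: Long = 42L): Long = {
      if (arr.length <= sampleSize) {
        SizeEstimator.estimate(arr)
      } else {
        val rng = new Random(seed)
        val sampled = Array.fill(sampleSize)(arr(rng.nextInt(arr.length)))
        val avgElementSize = sampled.map(e => SizeEstimator.estimate(e)).sum / sampleSize
        // Scale the sampled per-element cost to the whole array.
        avgElementSize * arr.length
      }
    }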

@SparkQA commented Feb 11, 2015

Test build #27305 has finished for PR 3913 at commit 62bac82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27305/

@srowen (Member) commented Apr 15, 2015

I'd like to resolve this one way or the other. My hesitation is mostly about tacking on another method to SparkContext, developer API or no. Would it really be a better way to estimate actual RDD size, since that depends on other things (even the JVM, whether data is serialized, etc.)?

@squito (Contributor) commented Apr 15, 2015

I think this is a good change. Yes, you could cache an RDD and see its size, but think about what a pain that actually is if you wanted to do it programmatically. You'd need to register a Spark listener, wait for the appropriate events, and look at the sizes.

If it was easy to do this programmatically via an RDD, then I'd say this change isn't necessary. E.g., if you could do something like

    val (_, meta) = sc.parallelize(oneObject).cache().count()
    meta.getPartitionSizes

Then we wouldn't need to expose this. But that's an even bigger API change, and one that I would be far more nervous about (the code above is definitely not a viable alternative; there are lots of reasons it doesn't make sense).
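
For context on the listener-based route mentioned above, a rough sketch of the plumbing it takes today, assuming the DeveloperApi listener interfaces; event-bus timing and races are glossed over, and oneObject is a stand-in for whatever you want to measure:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Collect the in-memory size of RDD blocks cached during completed stages.
    class CachedSizeListener extends SparkListener {
      @volatile var cachedBytes: Long = 0L
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        cachedBytes = stageCompleted.stageInfo.rddInfos.map(_.memSize).sum
      }
    }

    val listener = new CachedSizeListener
    sc.addSparkListener(listener)

    val oneObject = Map("key" -> Array.fill(1000)(1.0))
    sc.parallelize(Seq(oneObject)).cache().count()
    // Once the stage-completed event has been processed, listener.cachedBytes
    // holds the cached size reported for the stage's RDDs.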

Contributor (inline review comment)

nit: this import isn't necessary anymore

@sryza (Contributor, Author) commented Apr 15, 2015

Again, one of the main uses is estimating the size of variables you're considering broadcasting. Another is experimenting with different representations - e.g. how much more efficient is declaring a custom class than just using a hash map? In these situations, putting the data into an RDD to estimate the size would be an inconvenience.
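
As a concrete version of that experiment, a small sketch comparing a flat case class against a map-based record (the record shape and field names are invented for illustration):

    import org.apache.spark.util.SizeEstimator

    case class Click(userId: Long, url: String, durationMs: Int)

    val asCaseClass = Click(42L, "https://example.com/page", 1234)
    val asMap: Map[String, Any] =
      Map("userId" -> 42L, "url" -> "https://example.com/page", "durationMs" -> 1234)

    // Compare the per-record footprint of the two representations directly.
    println(s"case class: ${SizeEstimator.estimate(asCaseClass)} bytes")
    println(s"map:        ${SizeEstimator.estimate(asMap)} bytes")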

@sryza force-pushed the sandy-spark-5112 branch from 62bac82 to 4aba898 on April 16, 2015 at 16:43
@SparkQA commented Apr 16, 2015

Test build #30423 has started for PR 3913 at commit 4aba898.

@srowen (Member) commented Apr 16, 2015

OK, I think my remaining comment is just: why route through SparkContext? Can this method be exposed where it is? It doesn't seem like it has to be tied to the context.

@sryza (Contributor, Author) commented Apr 16, 2015

@pwendell thought this would be preferable:

> One thing we could do that would be more isolated is have a function in SparkContext called estimateSizeOf(object: Any), so that at least we don't expose the class location and names as APIs.

@pwendell (Contributor)

The reason I proposed putting it in SparkContext is to avoid committing to the current namespace/package of that object and to just expose a narrower utility function off of SparkContext. Overall, our estimation code is likely to evolve in the future. For instance, we may want to have a nested package under util that deals with memory management stuff.

All that said, if we expose other utilities to users (can't remember now) under the util package, then I'd be okay to do that too if @srowen and @sryza think it's nicer.

In terms of exposing or not, I'm okay to expose it given the reasons here. But can we give some warning to set expectations for the user? This estimation can be really inaccurate because of the sampling and heuristics used internally. This is especially true if you have, say, a hashmap with skewed keys - it will only sample a small percentage of the keyspace and could miss hot keys.

So I'd just say it's an estimate of the in-memory size and uses sampling internally for complex objects. I think this is also the gist of @shivaram's suggestion.
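
As a sketch of that caveat (illustrative only; exact numbers will vary, and the point is just that a sampled estimate can diverge from the true footprint when values are heavily skewed):

    import org.apache.spark.util.SizeEstimator

    // One "hot" value is vastly larger than the rest of the map's values.
    val hotValue = "x" * 10000000
    val skewed: Map[Int, String] =
      (1 to 100000).map(i => i -> "small").toMap + (0 -> hotValue)

    // If the estimator samples only part of this structure, it may under-count the
    // hot value, so sanity-check the estimate against your own expectations.
    println(s"estimated: ${SizeEstimator.estimate(skewed)} bytes")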

@SparkQA commented Apr 16, 2015

Test build #30423 has finished for PR 3913 at commit 4aba898.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30423/

@srowen (Member) commented Apr 16, 2015

OK, how about just moving it up a level out of util, if that's generally not public, and making SizeEstimator public but a developer API? You have to commit to the method living somewhere, and I think that's no more complex than routing it through the context API.

@pwendell (Contributor)

Sure, sounds fine to me. We can have a static method for it.

@sryza force-pushed the sandy-spark-5112 branch from 4aba898 to 8adde39 on April 29, 2015 at 21:36
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@SparkQA commented Apr 29, 2015

Test build #31334 has started for PR 3913 at commit 8adde39.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA commented May 5, 2015

Test build #31864 has started for PR 3913 at commit 8d9e082.

@SparkQA commented May 5, 2015

Test build #31864 has finished for PR 3913 at commit 8d9e082.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31864/

asfgit pushed a commit that referenced this pull request May 5, 2015
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
-the Tuning Spark page

This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object.

Author: Sandy Ryza <[email protected]>

Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:

8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
e21c1f4 [Sandy Ryza] Remove unused import
798ab88 [Sandy Ryza] Update documentation and add to SparkContext
34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api

(cherry picked from commit 4222da6)
Signed-off-by: Sean Owen <[email protected]>
@asfgit closed this in 4222da6 on May 5, 2015
@rxin (Contributor) commented May 5, 2015

Actually I strongly oppose putting this in the top level package. We will end up with a lot of random util objects or top level classes. In my mind, most of the stuff we put in the top level class right now should not be there.

One way to do this is to have a public util package, and move everything in util into an internal util if we want to hide utils.

@srowen (Member) commented May 6, 2015

@rxin yeah, I hear you. It's not too late to move this. I struggle a bit with where to put this, then. So you mean move util to util.internal or something, and move public utilities into util? Makes sense to me.

There are probably a few other things that could move down out of org.apache.spark that are not already public. Like the HTTP server stuff for example?
And I have thought and heard a few times that it might be nice to break up Utils into a few logical units.

It would be a big invasive change to the code; that's the only thing. I can take a swing at it if anyone voices support for that kind of thing?

@rxin (Contributor) commented May 6, 2015

I think it's a good idea to do that. For now, why don't we move SizeEstimator back into util and rename the current util.SizeEstimator to util.SizeEstimator0?

For 1.5 we can move stuff around into utils or utils.internal (seems long ... maybe we can find better names too).

@srowen (Member) commented May 6, 2015

I think @pwendell's sentiment was that util shouldn't have public classes/objects -- what do you think, Patrick? As an interim measure?

@rxin (Contributor) commented May 6, 2015

But we already have a bunch of stuff in util that's public ...
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.package

@pwendell (Contributor) commented May 6, 2015

I'm okay to nest it under util, per Reynold's suggestion.


@srowen (Member) commented May 6, 2015

@sryza do you mind doing the honors as a part 2 for this issue?

@sryza (Contributor, Author) commented May 10, 2015

@pwendell @rxin if we're putting it in the same package, I don't really understand why we don't just have a single class SizeEstimator with a single public method, estimate. Are we worried that in the future someone might make other attributes of SizeEstimator public? In a SizeEstimator / SizeEstimator0 world, it would be nearly as easy to add something public to SizeEstimator. I don't see why we wouldn't trust ourselves to properly restrict access when a change proposing that comes around.

@rxin (Contributor) commented May 10, 2015

I'm fine putting them in the same class too, provided that everything else is marked as private (I think it already is, isn't it?).
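
For illustration, a minimal sketch of the shape being agreed on here: one public entry point, everything else private. This is not Spark's actual source, just the structure:

    import org.apache.spark.annotation.DeveloperApi

    // Sketch only: a single public method; internals stay private to the object.
    @DeveloperApi
    object SizeEstimatorSketch {
      def estimate(obj: AnyRef): Long = estimateGraph(obj)

      // Placeholder for the real reflective walk over the object graph.
      private def estimateGraph(obj: AnyRef): Long = 64L
    }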

@sryza (Contributor, Author) commented May 13, 2015

Is that OK with you as well, @pwendell?

@pwendell (Contributor)

Okay - I officially give in and will not comment further on this thread :P


mbautin pushed a commit to mbautin/spark that referenced this pull request May 18, 2015 (the same SPARK-5112 commit as above).

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015 (the same SPARK-5112 commit as above).

asfgit pushed a commit that referenced this pull request May 28, 2015
See comments on #3913

Author: Reynold Xin <[email protected]>

Closes #6471 from rxin/sizeestimator and squashes the following commits:

c057095 [Reynold Xin] Fixed import.
2da478b [Reynold Xin] Remove SizeEstimator from o.a.spark package.

(cherry picked from commit 0077af2)
Signed-off-by: Reynold Xin <[email protected]>
asfgit pushed a commit that referenced this pull request May 28, 2015 (the same #6471 change as above).

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (the same SPARK-5112 commit as above).

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (the same #6471 change as above).

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (the same SPARK-5112 commit as above).

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (the same #6471 change as above).