SPARK-5112. Expose SizeEstimator as a developer api #3913
Conversation
Test build #25109 has started for PR 3913 at commit
That doc is very outdated: you can actually just look in the UI after caching some data; you don't need to visit the logs.
Test build #25109 has finished for PR 3913 at commit
Test PASSed.
Ooh, OK, I'll update the doc. That's still a little cumbersome, though, for someone who just wants to see how much space an object takes up. Most of the recommendations on that page are at the micro level - tuning the memory taken up by a single object. It would be useful to have a way to determine this amount directly. E.g., if I plan to have an RDD of some case class, I might want to see how much space each instance takes up. Then I can experiment quickly with things like flatter structures or removing fields and see whether they improve the footprint. Another thing is that users trying to tune applications commonly need to figure out whether they should broadcast side data or find a way to load it as an RDD and join. It would be helpful to have a single function call that's easy to use from the shell to find out the size of the data you're about to broadcast.
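For concreteness, here is a rough sketch of the kind of call being discussed, written against the SizeEstimator.estimate method in org.apache.spark.util that this PR ultimately exposes. The Click case class and the lookup map are made-up stand-ins for the two use cases above (per-record footprint of an RDD element, and a side-data structure you are considering broadcasting).

```scala
import org.apache.spark.util.SizeEstimator

// Hypothetical record type whose in-memory footprint we want to check.
case class Click(userId: Long, url: String, durationMs: Int)

val sample = Click(42L, "https://example.com/page", 1234)
// Footprint of a single instance; multiply by the expected record count
// to get a rough idea of an RDD's cached size.
println(s"One Click instance: ${SizeEstimator.estimate(sample)} bytes")

// Footprint of side data you are thinking of broadcasting.
val lookup: Map[Long, String] = (1L to 10000L).map(i => i -> s"name-$i").toMap
println(s"Lookup map: ${SizeEstimator.estimate(lookup)} bytes")
```

Both calls can be typed directly into the shell, which is the convenience being argued for here.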
Just not sure how useful it would be overall. For RDD data, it might be slightly misleading because of things like in-memory serialization. For broadcast objects in the shell, it would only work in the Scala shell, because of the way that serialization works in Python. I'm also not totally sure overall how accurate our memory estimation is, and it may get less so if we add smarter caching for SchemaRDDs. Anyway, it would be helpful if you could walk through an example with a case class or something and show how accurate it is; that would be useful to better understand. One thing we could do that would be more isolated is have a function in SparkContext called
Should this just turn into a doc update PR, then?
Adding an … I agree that there's not a great way to expose something like this for Python. But I don't think the zaniness of Python-JVM interaction means that we shouldn't expose useful functionality to pure-JVM apps.
I think this is the kind of thing we can just document. Adding a separate …
I've found it to be very accurate in my experiments. We rely on its accuracy for shuffle memory management and POJO caching, so to the extent that it's inaccurate, we've got bigger problems.
29fa503 to 62bac82 (force-pushed)
Test build #27305 has started for PR 3913 at commit
The size estimator should be pretty accurate for measuring the size of small Java objects. You could add a note that we do some approximation for large arrays (we sample some elements of the array and then extrapolate).
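This is not Spark's actual code, but a tiny sketch of the sample-and-extrapolate idea described above: rather than walking a huge array element by element, measure a handful of sampled elements and scale up. The sample size and threshold here are arbitrary.

```scala
import scala.util.Random
import org.apache.spark.util.SizeEstimator

// Illustration only - not Spark's implementation.
def approxArraySize(arr: Array[AnyRef], sampleSize: Int = 100): Long = {
  if (arr.length <= sampleSize) {
    // Small arrays: just measure directly.
    SizeEstimator.estimate(arr)
  } else {
    // Large arrays: sample a few elements, estimate their average size,
    // and extrapolate to the full array length.
    val rng = new Random(42)
    val sampled = Array.fill[AnyRef](sampleSize)(arr(rng.nextInt(arr.length)))
    val avgElementSize = SizeEstimator.estimate(sampled).toDouble / sampleSize
    (avgElementSize * arr.length).toLong
  }
}
```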
Test build #27305 has finished for PR 3913 at commit
Test PASSed.
I'd like to resolve this one way or the other. My hesitation is mostly about tacking on another method to SparkContext.
I think this is a good change. Yes, you could cache an RDD and see its size, but think about what a pain that actually is if you wanted to do it programmatically. You'd need to register a Spark listener, wait to get the appropriate events, and look at the sizes. If it were easy to do this programmatically via an RDD, then I'd say this change isn't necessary. E.g., if you could do something like … then we wouldn't need to expose this. But that's an even bigger API change, and one that I would be far more nervous about (the code above is definitely not a viable alternative; there are lots of reasons it doesn't make sense).
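To make the contrast concrete, here is a rough sketch of the two routes, assuming a spark-shell session (so `sc` is in scope). The listener route described above is more involved still; `sc.getRDDStorageInfo` stands in here as the simplest "cache it and go look" variant. The data and names are made up, and the two numbers won't match exactly, but both give a ballpark figure.

```scala
import org.apache.spark.util.SizeEstimator

val data = (1 to 100000).map(i => (i, s"value-$i"))

// Route 1: cache an RDD, force materialization, then dig the size
// out of the storage status.
val rdd = sc.parallelize(data).cache()
rdd.count()  // force the cache to fill
val cachedBytes = sc.getRDDStorageInfo.find(_.id == rdd.id).map(_.memSize)
println(s"Cached RDD memory size: $cachedBytes")

// Route 2: a single local call on the object itself.
println(s"SizeEstimator.estimate: ${SizeEstimator.estimate(data)} bytes")
```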
(Inline review comment on the diff) nit: this import isn't necessary anymore.
Again, one of the main uses is estimating the size of variables you're considering broadcasting. Another is experimenting with different representations - e.g., how much more efficient is declaring a custom class than just using a hash map? In these situations, putting the data into an RDD to estimate the size would be an inconvenience.
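A hypothetical comparison along those lines: the same record held as a case class versus as a generic hash map. The field names and values are made up; the point is only that the two estimates can be compared directly without building an RDD.

```scala
import scala.collection.mutable
import org.apache.spark.util.SizeEstimator

case class UserRecord(id: Long, name: String, score: Double)

val asCaseClass = UserRecord(1L, "alice", 0.97)
val asHashMap = mutable.HashMap[String, Any]("id" -> 1L, "name" -> "alice", "score" -> 0.97)

println(SizeEstimator.estimate(asCaseClass)) // a single object with three fields
println(SizeEstimator.estimate(asHashMap))   // entry objects, boxed values, and the backing table
```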
Test build #30423 has started for PR 3913 at commit
OK, I think my remaining comment is just: why route through SparkContext?
@pwendell thought this would be preferable: …
The reason I proposed to put it in SparkContext was to avoid committing to the current namespace/package of that object and just expose a narrower utility function off of SparkContext. Overall, our estimation code is likely to evolve in the future. For instance, we may want to have a nested package under util that deals with memory management stuff. All that said, if we expose other utilities to users (can't remember now) under the util package, then I'd be okay to do that too, if @srowen and @sryza think it's nicer. In terms of exposing or not, I'm okay to expose it given the reasons here. But can we give some warning to set expectations for the user? This estimation can be really inaccurate because of the sampling and heuristics used internally. This is especially true if you have, say, a hashmap with skewed keys - it will only sample a small percentage of the keyspace and could miss hot keys. So I'd just say it's an estimate of the in-memory size and that it uses sampling internally for complex objects. I think this is also the gist of @shivaram's suggestion.
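A hypothetical illustration of that skew caveat: a map where a few "hot" keys carry values far larger than the rest. Because complex objects are sampled internally, the estimate may be noticeably off depending on whether the hot entries happen to be sampled - so, as suggested above, treat the result as a rough estimate of in-memory size rather than an exact figure.

```scala
import scala.collection.mutable
import org.apache.spark.util.SizeEstimator

// Many small entries plus a handful of very large ("hot") ones.
val skewed = mutable.HashMap[String, Array[Byte]]()
(1 to 10000).foreach(i => skewed(s"key-$i") = Array.ofDim[Byte](16))
(1 to 5).foreach(i => skewed(s"hot-$i") = Array.ofDim[Byte](1024 * 1024))

println(s"Estimated in-memory size: ${SizeEstimator.estimate(skewed)} bytes (an estimate, not exact)")
```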
Test build #30423 has finished for PR 3913 at commit
Test PASSed.
OK, how about just moving it up a level out of util?
Sure, sounds fine to me. We can have a static method for it.
Build triggered.
Build started.
Test build #31334 has started for PR 3913 at commit
Merged build triggered.
Merged build started.
Test build #31864 has started for PR 3913 at commit
Test build #31864 has finished for PR 3913 at commit
Merged build finished. Test PASSed.
Test PASSed.
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD." -the Tuning Spark page This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object. Author: Sandy Ryza <[email protected]> Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits: 8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark 2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util" 93f4cd0 [Sandy Ryza] Move SizeEstimator out of util e21c1f4 [Sandy Ryza] Remove unused import 798ab88 [Sandy Ryza] Update documentation and add to SparkContext 34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api (cherry picked from commit 4222da6) Signed-off-by: Sean Owen <[email protected]>
|
Actually, I strongly oppose putting this in the top-level package. We will end up with a lot of random util objects or top-level classes. In my mind, most of the stuff we put in the top-level package right now should not be there. One way to do this is to have a public util package, and move everything in util into an internal util package if we want to hide utils.
@rxin yeah, I hear you. It's not too late to move this. I struggle a bit with where to put it, then. So you mean move SizeEstimator back into util? There are probably a few other things that could move down out of the top-level package. It would be a big, invasive change to the code; that's the only thing. I can take a swing at it if anyone voices support for that kind of thing?
I think it's a good idea to do that. For now, why don't we move SizeEstimator back into util and rename the current util.SizeEstimator to util.SizeEstimator0? For 1.5 we can move stuff around into utils or utils.internal (seems long ... maybe we can find better names too).
I think @pwendell's sentiment was that …
But we already have a bunch of stuff in util that's public ...
I'm okay to nest it under util, per Reynold's suggestion.
@sryza do you mind doing the honors as a part 2 for this issue?
@pwendell @rxin if we're putting it in the same package, I don't really understand why we don't just have a single class, SizeEstimator.
I'm fine putting them in the same class too, provided that everything else is marked as private (I think it already is, isn't it?)
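As a rough sketch of the shape being agreed on here - one object, a single public entry point annotated as a developer API, and everything else private - the following is a simplified stand-in, not the actual implementation; the internal method body is a placeholder.

```scala
import java.util.IdentityHashMap
import org.apache.spark.annotation.DeveloperApi

// Simplified stand-in: one public entry point, traversal machinery kept private.
object SizeEstimatorSketch {
  /** Estimate the number of bytes an object and everything it references occupy on the heap. */
  @DeveloperApi
  def estimate(obj: AnyRef): Long = estimate(obj, new IdentityHashMap[AnyRef, AnyRef])

  // Everything below stays private to the object.
  private def estimate(obj: AnyRef, visited: IdentityHashMap[AnyRef, AnyRef]): Long = {
    // ... reflectively walk fields, sample large arrays, sum sizes ...
    0L // placeholder
  }
}
```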
Is that OK with you as well, @pwendell?
Okay - I officially give in and will not comment further on this thread :P
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD." -the Tuning Spark page This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object. Author: Sandy Ryza <[email protected]> Closes apache#3913 from sryza/sandy-spark-5112 and squashes the following commits: 8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark 2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util" 93f4cd0 [Sandy Ryza] Move SizeEstimator out of util e21c1f4 [Sandy Ryza] Remove unused import 798ab88 [Sandy Ryza] Update documentation and add to SparkContext 34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD." -the Tuning Spark page This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object. Author: Sandy Ryza <[email protected]> Closes apache#3913 from sryza/sandy-spark-5112 and squashes the following commits: 8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark 2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util" 93f4cd0 [Sandy Ryza] Move SizeEstimator out of util e21c1f4 [Sandy Ryza] Remove unused import 798ab88 [Sandy Ryza] Update documentation and add to SparkContext 34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
See comments on #3913

Author: Reynold Xin <[email protected]>

Closes #6471 from rxin/sizeestimator and squashes the following commits:

c057095 [Reynold Xin] Fixed import.
2da478b [Reynold Xin] Remove SizeEstimator from o.a.spark package.

(cherry picked from commit 0077af2)
Signed-off-by: Reynold Xin <[email protected]>