-
Notifications
You must be signed in to change notification settings - Fork 29.3k
[SPARK-11710] Document new memory management model #9676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -88,9 +88,40 @@ than the "raw" data inside their fields. This is due to several reasons: | |
| but also pointers (typically 8 bytes each) to the next object in the list. | ||
| * Collections of primitive types often store them as "boxed" objects such as `java.lang.Integer`. | ||
|
|
||
| This section will discuss how to determine the memory usage of your objects, and how to improve | ||
| it -- either by changing your data structures, or by storing data in a serialized format. | ||
| We will then cover tuning Spark's cache size and the Java garbage collector. | ||
| This section will start with an overview of memory management in Spark, then discuss specific | ||
| strategies the user can take to make more efficient use of memory in her application. In | ||
| particular, we will describe how to determine the memory usage of your objects, and how to | ||
| improve it -- either by changing your data structures, or by storing data in a serialized | ||
| format. We will then cover tuning Spark's cache size and the Java garbage collector. | ||
|
|
||
| ## Memory Management Overview | ||
|
|
||
| Memory usage in Spark largely falls into one of two categories: execution and storage. | ||
| Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, | ||
| while storage memory refers to that used for caching and propagating internal data across the | ||
| cluster. In Spark, execution and storage share a unified region (M). When no execution memory is | ||
| used, storage can acquire all the available memory and vice versa. Execution may evict storage | ||
| if necessary, but only until total storage memory usage falls under a certain threshold (R). | ||
| Storage may not evict execution due to complexities in implementation. | ||
|
|
||
| This design ensures several desirable properties. First, applications that do not use caching | ||
| can use the entire space for execution, obviating unnecessary disk spills. Second, applications | ||
| that do use caching can reserve a minimum storage space (R) where their data blocks are immune | ||
| to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a | ||
| variety of workloads without requiring user expertise of how memory is divided internally. | ||
|
|
||
| Although there are two relevant configurations, the typical user should not need to adjust them | ||
| as the default values are applicable to most workloads: | ||
|
|
||
| * `spark.memory.fraction` expresses the size of `M` as a fraction of the total JVM heap space | ||
| (default 0.75). This sets aside memory for internal metadata, user data structures, and | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can it be explained why ".75" is the default value? What is the remaining .25% of memory allocated for?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the next sentence explains it. Maybe the wording's not clear |
||
| imprecise size estimation in the case of sparse, unusually large records. | ||
| * `spark.memory.storageFraction` expresses the size of `R` as a fraction of `M` (default 0.5). | ||
| This is the amount of storage memory immune to being evicted by execution. | ||
|
|
||
| For a more detailed description of the model, see the design doc attached to the associated | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i dont' think you'd want to reference a jira ticket, since it is mutable here ... i'd just remove this line. |
||
| [JIRA](http://issues.apache.org/jira/browse/SPARK-10000). | ||
|
|
||
|
|
||
| ## Determining Memory Consumption | ||
|
|
||
|
|
@@ -151,18 +182,6 @@ time spent GC. This can be done by adding `-verbose:gc -XX:+PrintGCDetails -XX:+ | |
| each time a garbage collection occurs. Note these logs will be on your cluster's worker nodes (in the `stdout` files in | ||
| their work directories), *not* on your driver program. | ||
|
|
||
| **Cache Size Tuning** | ||
|
|
||
| One important configuration parameter for GC is the amount of memory that should be used for caching RDDs. | ||
| By default, Spark uses 60% of the configured executor memory (`spark.executor.memory`) to | ||
| cache RDDs. This means that 40% of memory is available for any objects created during task execution. | ||
|
|
||
| In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of | ||
| memory, lowering this value will help reduce the memory consumption. To change this to, say, 50%, you can call | ||
| `conf.set("spark.storage.memoryFraction", "0.5")` on your SparkConf. Combined with the use of serialized caching, | ||
| using a smaller cache should be sufficient to mitigate most of the garbage collection problems. | ||
| In case you are interested in further tuning the Java GC, continue reading below. | ||
|
|
||
| **Advanced GC Tuning** | ||
|
|
||
| To further tune garbage collection, we first need to understand some basic information about memory management in the JVM: | ||
|
|
@@ -183,9 +202,9 @@ temporary objects created during task execution. Some steps which may be useful | |
| * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times for | ||
| before a task completes, it means that there isn't enough memory available for executing tasks. | ||
|
|
||
| * In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching. | ||
| This can be done using the `spark.storage.memoryFraction` property. It is better to cache fewer objects than to slow | ||
| down task execution! | ||
| * In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of | ||
| memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer | ||
| objects than to slow down task execution! | ||
|
|
||
| * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You | ||
| can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
her -> the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll replace this with
his/her