[SPARK-2652] [PySpark] Turning some default configs for PySpark #1568
Conversation
QA tests have started for PR 1568. This patch merges cleanly.
QA results for PR 1568:
python/pyspark/context.py (Outdated)
I've now merged #1051, so update this to do _conf.setIfMissing().
Also, you may want to move the "spark.rdd.compress" setting that that PR set into your map above.
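The suggestion above amounts to keeping a single map of defaults and applying each entry only when the user has not already configured it, which is what SparkConf.setIfMissing does. A minimal pure-Python sketch of that "set if missing" pattern (plain dicts here, illustrative only, not the actual SparkConf API):

```python
# Default configs proposed in this PR; applied only when the user
# has not already set the key, so explicit user settings always win.
DEFAULT_CONFIGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.serializer.objectStreamReset": "100",
    "spark.rdd.compress": "True",
}

def set_if_missing(conf, key, value):
    """Set key to value only if it is not already configured."""
    if key not in conf:
        conf[key] = value

def apply_defaults(user_conf):
    """Return a copy of user_conf with the defaults filled in."""
    conf = dict(user_conf)
    for key, value in DEFAULT_CONFIGS.items():
        set_if_missing(conf, key, value)
    return conf

# A user override takes precedence over the default:
merged = apply_defaults({"spark.rdd.compress": "False"})
```

Keeping all defaults in one map (rather than scattered setIfMissing calls) is the point of the review comment: a later duplicate line like the stray spark.rdd.compress one becomes impossible.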
@davies, you also need to remove the self._conf.setIfMissing("spark.rdd.compress", "true") line above. Otherwise it looks good.
Merged this, thanks.
Add several default configs for PySpark, related to serialization in the JVM:

spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.serializer.objectStreamReset = 100
spark.rdd.compress = True

This will help to reduce the memory usage during RDD.partitionBy().

Author: Davies Liu <[email protected]>

Closes apache#1568 from davies/conf and squashes the following commits:

cd316f1 [Davies Liu] remove duplicated line
f71a355 [Davies Liu] rebase to master, add spark.rdd.compress = True
8f63f45 [Davies Liu] Merge branch 'master' into conf
8bc9f08 [Davies Liu] fix unittest
c04a83d [Davies Liu] some default configs for PySpark
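For reference, the same three settings could also be applied by hand through conf/spark-defaults.conf, Spark's standard file-based configuration mechanism (values sketched here to mirror the PR's defaults; the patch instead sets them programmatically so that any user-supplied configuration still takes precedence):

```
spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.serializer.objectStreamReset  100
spark.rdd.compress                  true
```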