[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() #1978
Conversation
Jenkins, test this please.

QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit

I think we also need to add a license statement to the LICENSE file (like we've done with CloudPickle and Py4J).
python/pyspark/shuffle.py (outdated)

As of my new PR, this will need to be changed to "SPARK_LOCAL_DIRS" (plural).
Why do we need to use

In Python 2.6/7, heapq.merge() does not support
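For context on the limitation mentioned above: `heapq.merge()` only gained a `key=` parameter in Python 3.5, so on Python 2.6/2.7 a key-based merge has to be built by hand. A minimal sketch of the usual decorate/merge/undecorate workaround (the function name here is hypothetical, not Spark's):

```python
import heapq

def merge_with_key(iterables, key):
    """Merge iterables that are each sorted by key(), yielding items in
    key order, without relying on heapq.merge(key=...)."""
    def decorate(it, i):
        # Decorate each item as (key, stream_index, item): plain tuple
        # comparison then orders by key, and the stream index breaks ties
        # so the raw items never need to be comparable themselves.
        for x in it:
            yield (key(x), i, x)

    decorated = [decorate(it, i) for i, it in enumerate(iterables)]
    # Undecorate after merging.
    return (x for _, _, x in heapq.merge(*decorated))
```

Each input stream must already be sorted by the same key; the merge itself is lazy and uses O(number of streams) memory.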
change SPARK_LOCAL_DIR into SPARK_LOCAL_DIRS
cc @mateiz

QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit

Jenkins, retest this please.
Conflicts: python/pyspark/tests.py
QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit

QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit
Test failure is due to RAT complaining about some temporary files left behind by another test. There are a few ways to fix this:

I ssh'd into the worker and deleted that checkpoint directory, so maybe it will work now. Jenkins, retest this please.

Jenkins, retest this please.

QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit
Conflicts: python/pyspark/rdd.py python/pyspark/shuffle.py
Jenkins, test this please.

QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit
python/pyspark/shuffle.py (outdated)
Because there will be multiple Python worker processes running on the same node, if they all need to spill, it looks like they'll use the same directories in order here. Can you instead start each one at a random ID and then increment that to have it cycle through?
I'm not sure whether this can also affect the external hashing code, but if so, it would be good to fix that too (as a separate JIRA).
Basically I'm worried that everyone writes to disk1 first, then everyone writes to disk2, etc, and we only use one disk at a time.
Good catch; shuffling the directories randomly at the beginning would be better.
PS: Could you have a configurable policy for choosing local disks, such as using the first one as much as possible? It would be useful when one of the local disks is an SSD.
Yeah good question. We don't have that yet, but in the future we'll have support for multiple local storage levels.
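The fix discussed in this thread can be sketched as a rotator that starts each worker process at a random disk and then cycles through the rest, so concurrent workers spread their spill files instead of all writing to the first directory. All names below are hypothetical; the real logic lives in python/pyspark/shuffle.py:

```python
import os
import random

def local_dirs():
    # SPARK_LOCAL_DIRS (plural) may name several disks, comma-separated.
    return os.environ.get("SPARK_LOCAL_DIRS", "/tmp").split(",")

class DiskRotator(object):
    """Round-robin over the configured local dirs, starting at a random
    offset so that independent worker processes are unlikely to begin
    spilling to the same disk at the same time."""

    def __init__(self, dirs):
        self.dirs = list(dirs)
        # Random starting disk, chosen once per process/instance.
        self.index = random.randrange(len(self.dirs))

    def next_dir(self):
        d = self.dirs[self.index]
        self.index = (self.index + 1) % len(self.dirs)
        return d
```

With N disks and many workers, the random start spreads the first spill roughly evenly, and the cycling keeps later spills balanced within each worker.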
@mateiz I have addressed the above comments; it also fixes the same problem for the external merger. Please take another look, thanks.

QA tests have started for PR 1978 at commit

QA tests have finished for PR 1978 at commit
Have you tested that this actually spills any data? I guess it does because the bare Python interpreter already consumes more than 1 MB?
Looks pretty good, just added one question about the test.

@mateiz added checking for spilled bytes in tests.
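To illustrate what a "spilled bytes" check might look like, here is a toy sorter that counts the bytes it "spills" whenever its in-memory buffer exceeds a limit, so a test can assert that spilling actually happened. All names are hypothetical; the actual check lives in python/pyspark/tests.py:

```python
import heapq
import sys

class TrackingSorter(object):
    """Toy external-style sorter: when the buffer exceeds the memory
    limit, 'spill' it as a sorted run and record the spilled bytes."""

    def __init__(self, memory_limit_bytes):
        self.limit = memory_limit_bytes
        self.spilled_bytes = 0

    def sorted(self, items):
        runs, buf, used = [], [], 0
        for x in items:
            buf.append(x)
            used += sys.getsizeof(x)
            if used >= self.limit:
                runs.append(sorted(buf))   # stand-in for writing a run to disk
                self.spilled_bytes += used
                buf, used = [], 0
        runs.append(sorted(buf))           # final in-memory run
        return list(heapq.merge(*runs))
```

A test would then sort a dataset under a deliberately tiny limit and assert both that the output is sorted and that `spilled_bytes > 0`, proving the external path was exercised rather than the purely in-memory one.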
QA tests have started for PR 1978 at commit
|
|
QA tests have finished for PR 1978 at commit
|
|
Cool, thanks! Going to merge this. |
|
BTW can you send a PR for the randomizing change to branch-1.1? I don't think we'll add sorting in branch-1.1 since it's a new feature, but we can add that randomizing patch as a bug fix. Or do you think it won't matter much? |
Using external sort to support sorting large datasets in the reduce stage.

Author: Davies Liu <[email protected]>

Closes apache#1978 from davies/sort and squashes the following commits:

bbcd9ba [Davies Liu] check spilled bytes in tests
b125d2f [Davies Liu] add test for external sort in rdd
eae0176 [Davies Liu] choose different disks from different processes and instances
1f075ed [Davies Liu] Merge branch 'master' into sort
eb53ca6 [Davies Liu] Merge branch 'master' into sort
644abaf [Davies Liu] add license in LICENSE
19f7873 [Davies Liu] improve tests
55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
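For readers unfamiliar with the technique: an external sort sorts bounded batches in memory, spills each sorted run to disk, and then stream-merges the runs. A minimal self-contained sketch of the idea follows; this is not Spark's actual implementation, which adds memory tracking, batched serialization, and spreading runs across local disks:

```python
import heapq
import itertools
import pickle
import tempfile

def external_sort(iterator, key=None, batch=10000):
    """Sort an arbitrarily large iterator using bounded memory."""
    iterator = iter(iterator)
    files = []
    while True:
        # Sort one bounded batch in memory...
        chunk = sorted(itertools.islice(iterator, batch), key=key)
        if not chunk:
            break
        # ...and spill it to a temporary file as one sorted run.
        f = tempfile.TemporaryFile()
        for item in chunk:
            pickle.dump(item, f)
        f.seek(0)
        files.append(f)

    def read_run(f):
        try:
            while True:
                yield pickle.load(f)
        except EOFError:
            pass

    if key is None:
        return heapq.merge(*[read_run(f) for f in files])

    # Decorate with (key, run_index, item) since older heapq.merge()
    # has no key= parameter; the run index breaks ties between runs.
    def decorate(f, i):
        for x in read_run(f):
            yield (key(x), i, x)

    streams = [decorate(f, i) for i, f in enumerate(files)]
    return (x for _, _, x in heapq.merge(*streams))
```

Peak memory is roughly one batch plus one buffered record per run during the merge, which is what lets sortBy()/sortByKey() handle datasets larger than a worker's memory.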