Commit 092121e

davies authored and mateiz committed
[SPARK-3239] [PySpark] randomize the dirs for each process
This can avoid the IO contention during spilling, when you have multiple disks.

Author: Davies Liu <[email protected]>

Closes #2152 from davies/randomize and squashes the following commits:

a4863c4 [Davies Liu] randomize the dirs for each process
1 parent 8f8e2a4 commit 092121e

File tree

1 file changed: +4 −0 lines changed


python/pyspark/shuffle.py

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,7 @@
 import shutil
 import warnings
 import gc
+import random

 from pyspark.serializers import BatchedSerializer, PickleSerializer

@@ -216,6 +217,9 @@ def _get_dirs(self):
         """ Get all the directories """
         path = os.environ.get("SPARK_LOCAL_DIRS", "/tmp")
         dirs = path.split(",")
+        if len(dirs) > 1:
+            rnd = random.Random(os.getpid() + id(dirs))
+            random.shuffle(dirs, rnd.random)
         return [os.path.join(d, "python", str(os.getpid()), str(id(self)))
                 for d in dirs]
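The patched `_get_dirs` logic can be sketched as a standalone function. This is a simplified sketch: `get_spill_dirs` and its `instance_id` parameter are hypothetical names standing in for the method and its `id(self)` value, and it calls `rnd.shuffle(dirs)` directly, whereas the original passes `rnd.random` as the second argument to `random.shuffle` (a Python 2-era idiom; that argument was removed in Python 3.11).

```python
import os
import random


def get_spill_dirs(instance_id):
    """Return per-process spill directories, shuffled so that
    concurrent worker processes spread their spill IO across all
    configured disks instead of all starting on the same one."""
    path = os.environ.get("SPARK_LOCAL_DIRS", "/tmp")
    dirs = path.split(",")
    if len(dirs) > 1:
        # Seed with the pid so each worker process shuffles the
        # disk list differently, avoiding IO contention on one disk.
        rnd = random.Random(os.getpid() + instance_id)
        rnd.shuffle(dirs)
    return [os.path.join(d, "python", str(os.getpid()), str(instance_id))
            for d in dirs]
```

Note that only the *order* of the directories changes; every process still gets a subdirectory under each configured disk, keyed by its pid and instance id.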
