Closed
@@ -85,6 +85,7 @@ object PythonRunner {
// pass conf spark.pyspark.python to python process, the only way to pass info to
// python process is through environment variable.
sparkConf.get(PYSPARK_PYTHON).foreach(env.put("PYSPARK_PYTHON", _))
+sys.env.get("PYTHONHASHSEED").foreach(env.put("PYTHONHASHSEED", _))
builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
try {
val process = builder.start()
6 changes: 2 additions & 4 deletions python/pyspark/context.py
@@ -173,10 +173,8 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
if k.startswith("spark.executorEnv."):
varName = k[len("spark.executorEnv."):]
self.environment[varName] = v
-if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
-# disable randomness of hash of string in worker, if this is not
-# launched by spark-submit
-self.environment["PYTHONHASHSEED"] = "0"
+
+self.environment["PYTHONHASHSEED"] = os.environ.get("PYTHONHASHSEED", "0")
Member

Is there a reason for removing the Python sys.version check?

Contributor Author

PYTHONHASHSEED was introduced in Python 3.2.3, so setting it on versions before 3.2.3 does no harm.

Contributor

Is there any use in allowing the user to override the value?

Contributor

It's a bit of a stretch, but if a user has another Python application (say, a Flask app) that produces hashed data with a fixed seed, they might want to set this to the same value.
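
To make the override scenario concrete, here is a minimal sketch (not code from this PR) of why matching seeds matter: two separate interpreter processes agree on the hash of a string only when they are started with the same PYTHONHASHSEED. The helper names hash_in_subprocess and resolve_seed are hypothetical; resolve_seed only mirrors the os.environ.get(...) expression shown in the diff above. The sketch assumes Python 3.

```python
import os
import subprocess
import sys


def hash_in_subprocess(text, seed):
    """Return hash(text) as computed by a fresh interpreter started with the given seed."""
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["PYTHONHASHSEED"] = seed    # the only channel for passing the seed to the child
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


def resolve_seed(environ):
    """Hypothetical helper: keep the user's seed if one is set, otherwise fall back to "0"."""
    return environ.get("PYTHONHASHSEED", "0")


if __name__ == "__main__":
    seed = resolve_seed(os.environ)
    # Two independent processes with the same seed agree on the hash.
    print(hash_in_subprocess("spark", seed) == hash_in_subprocess("spark", seed))
```

With a fixed default of "0", every worker hashes strings identically; honoring an existing value lets a user align Spark with another process that was started with a different fixed seed.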


# Create the Java SparkContext through Py4J
self._jsc = jsc or self._initialize_context(self._conf._jconf)
3 changes: 2 additions & 1 deletion python/pyspark/rdd.py
@@ -68,7 +68,8 @@ def portable_hash(x):
>>> portable_hash((None, 1)) & 0xffffffff
219750521
"""
-if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
+
+if sys.version_info >= (3, 2, 3) and 'PYTHONHASHSEED' not in os.environ:
raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")

if x is None:
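
The switch from comparing sys.version as a string to comparing sys.version_info as a tuple can be illustrated with a small sketch. This is not the PR's code; needs_fixed_seed is a hypothetical helper that mirrors the guard in the hunk above so the two comparison styles can be tried on arbitrary version values.

```python
import os
import sys


def needs_fixed_seed(version_info, environ):
    """Hypothetical helper: True when the guard in portable_hash would raise."""
    return tuple(version_info[:3]) >= (3, 2, 3) and "PYTHONHASHSEED" not in environ


# String comparison is lexicographic, so a two-digit minor version sorts too low:
# "3.10.0" compares '1' against '3' at the third character and loses.
string_check = "3.10.0" >= "3.3"

# Tuple comparison is numeric, element by element, so 10 > 3 as expected.
tuple_check = (3, 10, 0) >= (3, 3)

if __name__ == "__main__":
    print(string_check, tuple_check)
    print(needs_fixed_seed(sys.version_info, os.environ))
```

The tuple form also allows the threshold to be expressed at patch level, such as (3, 2, 3), which a prefix-style string comparison cannot do reliably.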
@@ -813,6 +813,7 @@ private[spark] class Client(
sys.env.get(envname).foreach(env(envname) = _)
}
}
+sys.env.get("PYTHONHASHSEED").foreach(env.put("PYTHONHASHSEED", _))
}

sys.env.get(ENV_DIST_CLASSPATH).foreach { dcp =>