[SPARK-13330][PYSPARK] PYTHONHASHSEED is not propagated to python worker #11211
Conversation
|
Test build #51335 has finished for PR 11211 at commit
|
|
How do we handle this in Python 2? If we're running Python 2.x, do we currently propagate it? Also, how are we going to ensure that this change isn't accidentally rolled back? This seems subtle, so adding an explanatory comment into the source code near this line would make sense. |
|
PYTHONHASHSEED is set in the spark-submit script no matter what version of Python is used, but it is only set in the executor when the Python version is 3.3 or greater. PYTHONHASHSEED was introduced in Python 3.2.3 (https://docs.python.org/3.3/using/cmdline.html). I am not sure of the purpose of disabling hash randomization; I just feel that we can set PYTHONHASHSEED to 0 in all cases, since there seems to be no case where we want hash randomization enabled. It is also fine to set it in Python 2, because the variable was only introduced in 3.2.3. |
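To illustrate the underlying issue (a hypothetical sketch, not this PR's code): with hash randomization enabled, `hash()` of the same string differs between interpreter processes, which breaks any comparison of hashes computed on different workers. Pinning `PYTHONHASHSEED` makes separate processes agree:

```python
import os
import subprocess
import sys

def string_hash_in_subprocess(seed):
    """Compute hash('spark') in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('spark'))"], env=env)
    return int(out)

# With the same fixed seed, separate processes agree on the hash;
# with randomization, each process would typically produce a different value.
assert string_hash_in_subprocess("0") == string_hash_in_subprocess("0")
```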
|
I think that we need to disable randomness of the hash in order to be able to safely compare hashcodes which were generated by different Python processes / machines. Could you do a little detective work to look through the Git history and try to figure out whether this propagation ever worked or determine whether it was broken recently? If it used to work and was broken only in 1.6 then I think that there might be another root cause that we should find rather than patching around this one specific issue in Python. |
|
I verified it on spark-1.4.1 and spark-1.5.2; both of them have this issue, so I believe it has existed for a long time. Besides that, I found that PySpark-specific environment variables are not propagated to the driver in cluster mode. I created SPARK-13360 for that; after SPARK-13360 is resolved I will update this PR. |
|
I'm a bit concerned that the change implemented here won't be correct in case |
|
Right, it should be set as the same as driver rather than set it as 0 explicitly. |
|
Okay. Are you planning to update this PR to handle that case? |
|
I plan to do it after SPARK-13360, or I can do them together in this PR if that is OK |
|
@JoshRosen I updated the patch and also merged SPARK-13360 into this PR because they are related. Please help review it. Thanks |
|
Jenkins, please build it again. |
|
Test build #51835 has finished for PR 11211 at commit
|
python/pyspark/context.py
Outdated
is there a reason for removing the python sys.version check?
PYTHONHASHSEED was introduced in 3.2.3; it doesn't matter if we set it in versions before 3.2.3, since it is simply ignored there.
Is there any use in allowing the user to override the value?
It's a bit of a stretch, but if a user has another Python application that produces hashed data with a fixed seed (say, a Flask app), they might want to set this to the same value.
|
Ping @JoshRosen |
|
Now that your other patch has been merged, is this ready to update? |
|
Update the patch. |
|
Test build #54497 has finished for PR 11211 at commit
|
python/pyspark/rdd.py
Outdated
Good catch on the minimum supported version.
I'm always a little wary of string comparisons for these...
```python
>>> '3.10.0' > '3.2.3'
False
```
I would avoid the string comparison here. If you must, you can use distutils.version.StrictVersion:

```python
In [15]: StrictVersion('3.10.0') > StrictVersion('3.2.3')
Out[15]: True
```

but sys.version will break anyway on some Python distributions:

```python
In [16]: print(sys.version)
3.4.4 |Continuum Analytics, Inc.| (default, Jan 9 2016, 17:30:09)
[GCC 4.2.1 (Apple Inc. build 5577)]
```

You might want to use sys.version_info instead.
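A minimal sketch of the tuple-based check (illustrative; `supports_hash_seed` is a hypothetical helper name, not code from this PR):

```python
import sys

# Lexicographic string comparison misorders versions:
assert ('3.10.0' > '3.2.3') is False   # string-wise "3.1..." sorts before "3.2..."

# Comparing numeric tuples, as sys.version_info does, is safe:
assert (3, 10, 0) > (3, 2, 3)

def supports_hash_seed(version_info=sys.version_info):
    # PYTHONHASHSEED was introduced in CPython 3.2.3.
    return tuple(version_info[:3]) >= (3, 2, 3)
```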
We also use sys.version_info in the PySpark setup.py file now.
|
Quick clarification: would calling |
|
@JoshRosen Sorry for the late response. Yes, self._conf.setExecutorEnv(key, value) in context.py will propagate the variable to the executor, even in YARN mode |
|
Okay, so why don't we go ahead and use |
|
The term "executor" here may be a little misleading: it means the Python worker rather than the Spark executor. See PythonRDD.scala |
|
Test build #59134 has finished for PR 11211 at commit
|
|
For what it's worth, it seems like this has caused issues in the past too - https://issues.apache.org/jira/browse/SPARK-12100 |
|
Conflict is resolved. |
|
Test build #66590 has finished for PR 11211 at commit
|
|
LGTM. @JoshRosen do you have more thoughts on this? |
I'd rather do sys.env.get("PYTHONHASHSEED").foreach { ... } to avoid polluting the environment when not needed. (Also to avoid hardcoding the default in multiple places.)
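In Python terms, the equivalent of that Scala pattern would be to forward the variable only when the driver actually has it set (a sketch with a hypothetical helper name, assuming the worker environment is built as a plain dict):

```python
import os

def worker_hash_seed_env(driver_env=None):
    """Forward PYTHONHASHSEED to the worker only if the driver set it,
    leaving the worker environment untouched otherwise."""
    if driver_env is None:
        driver_env = os.environ
    env = {}
    seed = driver_env.get("PYTHONHASHSEED")
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    return env
```

This way the default lives in one place (the interpreter itself) instead of being hardcoded on both sides.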
Same comment regarding not polluting the environment.
|
(ping @zjffdu) |
|
gentle re-ping - is this something you have bandwidth to work on @zjffdu? |
|
Sorry for the late reply; I may come back to this issue later this week. |
|
Test build #73046 has finished for PR 11211 at commit
|
|
Test build #73049 has finished for PR 11211 at commit
|
|
ping @holdenk @HyukjinKwon - the PR is updated, please help review. Thanks |
vanzin
left a comment
LGTM, I'll let Holden take a last look.
|
Actually, one last thing before we merge this PR - would you be OK with updating the description to match the template that is now used? Just so it is consistent with everything. |
|
@holdenk description is updated. |
|
thanks @zjffdu , merged to master. |
## What changes were proposed in this pull request?

self.environment will be propagated to executor. Should set PYTHONHASHSEED as long as the python version is greater than 3.3

## How was this patch tested?

Manually tested it.

Author: Jeff Zhang &lt;[email protected]&gt;

Closes apache#11211 from zjffdu/SPARK-13330.
|
how do we resolve the problem? |