Conversation

@zjffdu (Contributor)

@zjffdu zjffdu commented Feb 16, 2016

What changes were proposed in this pull request?

self.environment will be propagated to the executor, so PYTHONHASHSEED should be set as long as the Python version is greater than 3.3.

How was this patch tested?

Manually tested it.
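For context (a stdlib-only sketch, not part of the original PR): the underlying problem is that each CPython 3.3+ process picks a random string-hash seed unless PYTHONHASHSEED pins it, so two worker processes with unpinned seeds generally disagree on hash(str). Pinning the seed makes hashes comparable across processes:

```python
import os
import subprocess
import sys

def hash_in_fresh_process(s, seed):
    # Spawn a fresh interpreter with PYTHONHASHSEED pinned and report hash(s).
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({s!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With a pinned seed, string hashes agree across processes -- this is what
# propagating PYTHONHASHSEED to every Python worker buys you.
assert hash_in_fresh_process("spark", "0") == hash_in_fresh_process("spark", "0")
```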

@SparkQA

SparkQA commented Feb 16, 2016

Test build #51335 has finished for PR 11211 at commit f459705.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor)

How do we handle this in Python 2? If we're running Python 2.x, do we currently propagate PYTHONHASHSEED to the worker?

Also, how are we going to ensure that this change isn't accidentally rolled back? This seems subtle, so adding an explanatory paragraph comment into the source code near this line would make sense.

@zjffdu (Contributor, Author)

zjffdu commented Feb 16, 2016

PYTHONHASHSEED is set in the spark-submit script regardless of the Python version, but it is only set on the executor when the Python version is greater than 3.3. PYTHONHASHSEED was introduced in Python 3.2.3 (https://docs.python.org/3.3/using/cmdline.html). I am not sure of the purpose of disabling hash randomization; I just feel that we can set PYTHONHASHSEED to 0 in all cases, since there seems to be no case where we want hash randomization enabled. It is also fine to set it in Python 2, because the variable only took effect starting with 3.2.3.

@JoshRosen (Contributor)

I think that we need to disable randomness of the hash in order to be able to safely compare hashcodes which were generated by different Python processes / machines.

Could you do a little detective work to look through the Git history and try to figure out whether this propagation ever worked or determine whether it was broken recently? If it used to work and was broken only in 1.6 then I think that there might be another root cause that we should find rather than patching around this one specific issue in Python.

@zjffdu (Contributor, Author)

zjffdu commented Feb 17, 2016

I verified it on spark-1.4.1 and spark-1.5.2, and both of them have this issue, so I believe it has existed for a long time. Besides that, I found that PySpark-specific environment variables are not propagated to the driver in cluster mode. I created SPARK-13360 for that; after SPARK-13360 is resolved I will update this PR.

@JoshRosen (Contributor)

I'm a bit concerned that the change implemented here won't be correct in case PYTHONHASHSEED is set to a non-zero value on the driver. In that case, wouldn't the driver and executors wind up with different hash seeds, leading to incomparable hash values?

@zjffdu (Contributor, Author)

zjffdu commented Feb 18, 2016

Right, it should be set to the same value as the driver's rather than explicitly set to 0.
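A minimal sketch of that idea (hypothetical helper, not the PR's actual code): forward whatever seed the driver has, and only fall back to a fixed default when none is set, so driver and workers always hash strings identically:

```python
def worker_env(driver_env):
    # Hypothetical helper: reuse the driver's PYTHONHASHSEED so driver and
    # workers produce comparable hashes; default to "0" only when unset.
    return {"PYTHONHASHSEED": driver_env.get("PYTHONHASHSEED", "0")}

assert worker_env({"PYTHONHASHSEED": "42"}) == {"PYTHONHASHSEED": "42"}
assert worker_env({}) == {"PYTHONHASHSEED": "0"}
```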

@JoshRosen (Contributor)

Okay. Are you planning to update this PR to handle that case?

@zjffdu (Contributor, Author)

zjffdu commented Feb 18, 2016

I plan to do it after SPARK-13360, or I can do them together in this PR if that is OK.

@zjffdu (Contributor, Author)

zjffdu commented Feb 24, 2016

@JoshRosen I updated the patch and also merged SPARK-13360 into this PR because they are related. Please help review it. Thanks

@zjffdu (Contributor, Author)

zjffdu commented Feb 24, 2016

Jenkins, please build it again.

@SparkQA

SparkQA commented Feb 24, 2016

Test build #51835 has finished for PR 11211 at commit 43dfd47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member (inline review comment)

Is there a reason for removing the Python sys.version check?

Contributor Author (inline review comment)

PYTHONHASHSEED was introduced in 3.2.3; it does no harm to set it in versions before 3.2.3, where it is simply ignored.

Contributor (inline review comment)

Is there any use in allowing the user to override the value?

Contributor (inline review comment)

It's a bit of a stretch, but if a user has another Python application that is producing hashed data with a fixed seed (say a flask app), a user might want to set this to the same value.

@zjffdu (Contributor, Author)

zjffdu commented Mar 8, 2016

Ping @JoshRosen

@JoshRosen (Contributor)

Now that your other patch has been merged, is this ready to update?

@zjffdu (Contributor, Author)

zjffdu commented Mar 30, 2016

Update the patch.

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54497 has finished for PR 11211 at commit e68899c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor (inline review comment)

Good catch on the minimum supported version.

Contributor (inline review comment)

I'm always a little wary of string comparisons for these...

>>> '3.10.0' > '3.2.3'
False

(inline review comment)

I would avoid the string comparison here. If you must, you can use distutils.version.StrictVersion

In [15]: StrictVersion('3.10.0') > StrictVersion('3.2.3')
Out[15]: True

but sys.version will break anyway on some Python distributions.

In [16]: print(sys.version)
3.4.4 |Continuum Analytics, Inc.| (default, Jan  9 2016, 17:30:09)
[GCC 4.2.1 (Apple Inc. build 5577)]

You might want to use sys.version_info instead.
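To illustrate the reviewers' point with a stdlib-only sketch: comparing version strings is lexicographic, so multi-digit components sort wrong, while sys.version_info is a tuple that compares numerically, component by component:

```python
import sys

# Lexicographic string comparison gets multi-digit components wrong:
# '1' < '2' character-wise, so '3.10.0' sorts before '3.2.3'.
assert ("3.10.0" > "3.2.3") is False
# Tuple comparison is numeric, component by component.
assert (3, 10, 0) > (3, 2, 3)

# The check the thread converges on, sketched with sys.version_info
# (a named tuple that compares cleanly against a plain tuple):
supports_pythonhashseed = sys.version_info >= (3, 2, 3)
```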

Contributor (inline review comment)

We also use sys.version_info in the PySpark setup.py file now.

@JoshRosen (Contributor)

Quick clarification: would calling self._conf.setExecutorEnv(key, value) in context.py be sufficient to propagate the variable to the executor, even in YARN mode?

@zjffdu (Contributor, Author)

zjffdu commented Apr 19, 2016

@JoshRosen Sorry for the late response. Yes, self._conf.setExecutorEnv(key, value) in context.py will propagate the variable to the executor, even in YARN mode.
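For readers without a Spark installation, here is a stdlib-only sketch (hypothetical TinyConf class, not Spark code) of what setExecutorEnv amounts to: it records the variable under the spark.executorEnv.* config namespace, which Spark then applies to each executor's environment, including under YARN:

```python
class TinyConf:
    """Hypothetical stand-in for SparkConf, for illustration only."""

    def __init__(self):
        self._settings = {}

    def setExecutorEnv(self, key, value):
        # Spark stores executor environment variables as ordinary config
        # entries under the "spark.executorEnv." prefix.
        self._settings[f"spark.executorEnv.{key}"] = value
        return self

conf = TinyConf().setExecutorEnv("PYTHONHASHSEED", "0")
assert conf._settings == {"spark.executorEnv.PYTHONHASHSEED": "0"}
```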

@JoshRosen (Contributor)

Okay, so why don't we go ahead and use setExecutorEnv then?

@zjffdu zjffdu changed the title [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to executor [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to python worker Apr 22, 2016
@zjffdu (Contributor, Author)

zjffdu commented Apr 22, 2016

"Executor" here may be a little misleading: it means the Python worker rather than the Spark executor.
The Python worker gets its env from PythonRDD, and PythonRDD gets the env from context.environment. (I have updated the title.)

See PythonRDD.scala

val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)

@SparkQA

SparkQA commented May 23, 2016

Test build #59134 has finished for PR 11211 at commit e68899c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@holdenk (Contributor)

holdenk commented Oct 8, 2016

For what it's worth, it seems like this has maybe caused issues in the past too - https://issues.apache.org/jira/browse/SPARK-12100

@zjffdu (Contributor, Author)

zjffdu commented Oct 9, 2016

Conflict is resolved.

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66590 has finished for PR 11211 at commit b761b85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

LGTM. @JoshRosen do you have more thought on this?

Contributor (inline review comment)

I'd rather do sys.env.get("PYTHONHASHSEED").foreach { ... } to avoid polluting the environment when not needed. (Also to avoid hardcoding the default in multiple places.)
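The suggestion above is Scala; sketched in Python with a hypothetical helper, it means copying PYTHONHASHSEED into the worker environment only when the source environment actually defines it, rather than always writing a hardcoded default:

```python
import os

def forward_hashseed(target_env, source_env=None):
    # Hypothetical helper: set PYTHONHASHSEED on the target only if the
    # source environment defines it; otherwise leave the target untouched,
    # so the environment is not polluted and no default is hardcoded here.
    source = os.environ if source_env is None else source_env
    seed = source.get("PYTHONHASHSEED")
    if seed is not None:
        target_env["PYTHONHASHSEED"] = seed
    return target_env

assert forward_hashseed({}, {"PYTHONHASHSEED": "7"}) == {"PYTHONHASHSEED": "7"}
assert forward_hashseed({}, {}) == {}  # nothing set when the source has no seed
```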

Contributor (inline review comment)

Same comment regarding not polluting the environment.

@HyukjinKwon (Member)

(ping @zjffdu)

@holdenk (Contributor)

holdenk commented Feb 13, 2017

gentle re-ping - is this something you have bandwidth to work on @zjffdu?

@zjffdu (Contributor, Author)

zjffdu commented Feb 14, 2017

Sorry for the late reply, I may come back to this issue later this week.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73046 has finished for PR 11211 at commit e352c36.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73049 has finished for PR 11211 at commit 0acbb18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor, Author)

zjffdu commented Feb 21, 2017

ping @holdenk @HyukjinKwon The PR is updated; please help review. Thanks

@vanzin (Contributor) left a comment

LGTM, I'll let Holden take a last look.

@holdenk (Contributor)

holdenk commented Feb 22, 2017

LGTM as well, thanks for taking a look @vanzin and thanks for taking the time on this @zjffdu & sticking with it.

@holdenk (Contributor)

holdenk commented Feb 22, 2017

Actually one last thing before we merge this PR - would you be ok with updating the description to match the template that is now used? Just that way it is consistent with everything.

@zjffdu (Contributor, Author)

zjffdu commented Feb 23, 2017

@holdenk description is updated.

@asfgit asfgit closed this in 330c3e3 Feb 24, 2017
@holdenk (Contributor)

holdenk commented Feb 24, 2017

thanks @zjffdu, merged to master.

Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
## What changes were proposed in this pull request?
self.environment will be propagated to executor. Should set PYTHONHASHSEED as long as the python version is greater than 3.3

## How was this patch tested?
Manually tested it.

Author: Jeff Zhang <[email protected]>

Closes apache#11211 from zjffdu/SPARK-13330.
@mxhellor

How do we resolve the problem?
