[SPARK-23698][Python] Resolve undefined names in Python 3 #20838
Conversation
Force-pushed bdcd740 to dc27035
dev/create-release/releaseutils.py (Outdated)
Shall we put two spaces before the inline comment?
and shall we do this with an if-else on the Python version, to be consistent with other places?
“Where practical, apply the Python porting best practice: Use feature detection instead of version detection.”
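For illustration, the two approaches the guideline contrasts look like this; this is a generic sketch, not code from the PR, and `text_type` is a hypothetical name:

```python
import sys

# Version detection: branch on the interpreter version.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 -- only defined on Python 2

# Feature detection: probe for the name itself and fall back.
try:
    text_type = unicode  # noqa: F821
except NameError:
    text_type = str  # Python 3: `unicode` no longer exists
```

Feature detection keeps working even on interpreters where the version string is unusual, which is why the porting guides prefer it.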
Shall we keep this consistent for now? There are a hell of a lot of things to fix in PySpark in that way.
Force-pushed f097649 to 16d60ce
ok to test
Test build #88332 has finished for PR 20838 at commit
@HyukjinKwon Was there something more to do on this one?
HyukjinKwon
left a comment
shall we wait for https://github.com/cloudpipe/cloudpickle/pull/163? I want to see more feedback.
felixcheung
left a comment
that cloudpickle PR seems independent of the changes here, though?
python/pyspark/cloudpickle.py (Outdated)
I think this one is actually related to cloudpickle's PR.
I was trying to exactly match this file to a specific version of cloudpickle (currently 0.4.3 - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3), so I thought we could wait for more feedback there.
At least, I was thinking that we should match it to https://github.com/cloudpipe/cloudpickle/tree/0.4.x. If that one is merged, I could backport that change into the 0.4.x branch, and then we could just copy that change here.
This was merged upstream.
BryanCutler
left a comment
Thanks @cclauss for the PR, +1 for syncing the cloudpickle change.
dev/create-release/releaseutils.py (Outdated)
could you just put this above:

```python
if sys.version > '3':
    unicode = str
```

and then just do:

```python
author = unidecode.unidecode(unicode(author)).strip()
```

I don't think you need to specify "UTF-8" because either way it will be a unicode object, but I'm not too sure how this conversion is supposed to work.
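As a runnable sketch of the shape Bryan is suggesting — with the stdlib unicodedata module standing in for the third-party unidecode library, so the accent handling here is only an approximation, and `ascii_author` is a hypothetical helper name:

```python
import sys
import unicodedata

if sys.version_info[0] >= 3:
    unicode = str  # Python 2 style alias, as in the suggestion above

def ascii_author(author):
    # Coerce to a unicode object first, then strip non-ASCII marks.
    # unicodedata.normalize + ascii/ignore stands in for unidecode here.
    normalized = unicodedata.normalize("NFKD", unicode(author))
    return normalized.encode("ascii", "ignore").decode("ascii").strip()

print(ascii_author("  José  "))  # -> Jose
```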
@BryanCutler I think we do need to specify it because it won't be a unicode type in Python 2; we get these from run_cmd, which calls Popen's communicate, which returns a regular string.
Yes... My other worry in Py2 would be if sys.setdefaultencoding() was set to something other than utf-8. That method was also, thankfully, dropped in Py3.
My thought was that we are first casting author to unicode already with unicode(author), and it doesn't really matter if it is "UTF-8" or not, because we then immediately decode it into ASCII with unidecode, which can handle it even if it wasn't "UTF-8", so the end result should be the same, I believe. It was just to clean up a little, so not a big deal either way. The way it is now replicates the old behavior, so it's probably safer.
The way it is now [...] it's probably safer.
Let's agree to leave this as is in this PR. The EOL of Python 2 is some 500 days away, so safe is better.
The cloudpickle change is now merged upstream. I would like to avoid @BryanCutler's suggestion above because Unicode is really touchy in Python 2 and a lot can change based on sys.setdefaultencoding(), so I would like to avoid assumptions and make as small modifications to the Unicode handling as possible.
Other issues / suggestions?
I'm preparing to backport and release cloudpickle. Please give me some more days.
ok to test
Test build #91607 has finished for PR 20838 at commit
any update?
Test build #91644 has finished for PR 20838 at commit
python/pyspark/streaming/dstream.py (Outdated)
Can you add a test for it? Seems only used once and shouldn't be difficult to add a test.
Adding a call to slice in tests.py should be enough to test this. I'm surprised we haven't caught this before, but I suppose this isn't a very frequently exercised code path.
python/pyspark/sql/conf.py (Outdated)
This is fine since we rely on short-circuiting but I guess it's fine if it complains.
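The short-circuiting being referred to can be illustrated with a generic sketch — `is_text` is a hypothetical helper, not the actual conf.py code:

```python
def is_text(value):
    # On Python 3 the name `unicode` is undefined, so flake8 flags it
    # (F821).  For str inputs the expression still works, because `or`
    # short-circuits before the second operand is ever evaluated.
    return isinstance(value, str) or isinstance(value, unicode)  # noqa: F821

print(is_text("spark"))   # short-circuits; `unicode` is never looked up
try:
    is_text(42)           # falls through to the undefined name
except NameError:
    print("NameError for non-str input on Python 3")
```

So the linter complaint is technically valid, but the flagged branch is unreachable for the inputs the code actually sees.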
Is there an issue here?
Nope
dev/merge_spark_pr.py (Outdated)
Does this script run with Python 3+ now?
It creates a new function called raw_input() that is identical to the builtin input().
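The shim described here can be sketched as follows — a minimal stand-in for what the PR does, not the exact lines from merge_spark_pr.py:

```python
try:
    raw_input  # noqa: F821 -- the builtin exists on Python 2
except NameError:
    raw_input = input  # Python 3: raw_input() was renamed to input()

# raw_input("prompt: ") can now be called unchanged on either version.
```

This is feature detection rather than a version check: the name itself is probed, and the alias is only created where it is missing.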
dev/create-release/releaseutils.py (Outdated)
Does this script work in Python 3+ now?
It creates a new function called raw_input() that is identical to the builtin input().
@cclauss, mind updating this PR?
Test build #93125 has finished for PR 20838 at commit
retest this please
Test build #93150 has finished for PR 20838 at commit
retest this please
Test build #93195 has finished for PR 20838 at commit
It was reverted because slice_test() was causing the build to fail.
Shall we fix it here?
Force-pushed c0fdb71 to 73a4fd2
Test build #94927 has finished for PR 20838 at commit
This is not working at all... I am wasting way too much time. 5+ months and 80+ comments for 12 lines of code is too much. I do not have the skills to solve the following undefined name 'long' in a satisfactory manner. If someone with more skills would be willing to take that one undefined name off my plate and solve it with a test in a separate PR, I would be grateful. I will study that PR carefully and can then proceed with the others that are in this PR. My recommended fix is at https://github.com/apache/spark/pull/20838/files#diff-6c576c52abc0624ccb6a2f45828dc6a7 and my proposed test (it is failing!) immediately follows it.
Hi @cclauss, sorry for the frustration. I looked into the test, and it was kind of a pain to get working right - which is probably why it wasn't done in the first place ;) Here are my modifications for test_slice:

```python
def test_slice(self):
    """Basic operation test for DStream.slice."""
    import datetime as dt
    self.ssc = StreamingContext(self.sc, 1.0)
    self.ssc.remember(4.0)
    input = [[1], [2], [3], [4]]
    stream = self.ssc.queueStream([self.sc.parallelize(d, 1) for d in input])
    time_vals = []

    def get_times(t, rdd):
        if rdd and len(time_vals) < len(input):
            time_vals.append(t)

    stream.foreachRDD(get_times)
    self.ssc.start()
    self.wait_for(time_vals, 4)
    begin_time = time_vals[0]

    def get_sliced(begin_delta, end_delta):
        begin = begin_time + dt.timedelta(seconds=begin_delta)
        end = begin_time + dt.timedelta(seconds=end_delta)
        rdds = stream.slice(begin, end)
        result_list = [rdd.collect() for rdd in rdds]
        return [r for result in result_list for r in result]

    self.assertEqual(set([1]), set(get_sliced(0, 0)))
    self.assertEqual(set([2, 3]), set(get_sliced(1, 2)))
    self.assertEqual(set([2, 3, 4]), set(get_sliced(1, 4)))
    self.assertEqual(set([1, 2, 3, 4]), set(get_sliced(0, 4)))
```

If you want to put that in, I have some time now and can help you get this merged, or if you prefer I can finish it up and still assign it to you.
Thanks massively for this. I doubt that I ever would have gotten to that on my own. Since this is a test, my proposal would be that you create a separate PR so that we are all assured that it passes in the current codebase. Once your PR has been merged, I can come back and finish this PR. Thanks again.
Or Bryan could open a PR on your branch? That would usually be easier for getting this PR through, just my 2c.
+1 for ^. We can credit multiple people now anyway.
Test build #95010 has finished for PR 20838 at commit
BryanCutler
left a comment
LGTM, just one minor change. Could you also update the PR title to include a [PYTHON] tag, and remove from the description the crossed-out line "Where practical..." and "Please review http://spark.apache.org/contributing.html before opening a pull request."?
```python
def test_slice(self):
    """Basic operation test for DStream.slice."""
    import datetime as dt
```
let's remove the import here since it is at the top already
Also, not your doing, but I noticed this spelling error on line 183: "DStream.faltMap" should be "DStream.flatMap". Would you mind changing that while we are here?
https://github.com/apache/spark/pull/20838/files#diff-ca4ec8dece48511b915cad6a801695c1R183
Test build #95045 has finished for PR 20838 at commit
BryanCutler
left a comment
LGTM. If you could just update the title and description as mentioned in #20838 (review), I think this is good to go.
merged to master, thanks @cclauss!
Prevent linters from raising undefined name '__version__' by initializing the variable before setting it via a call to exec(). This is the last remaining issue related to the work done in apache#20838.
What changes were proposed in this pull request?
Fix issues arising from the fact that the builtins file, long, raw_input(), unicode, xrange(), etc. were all removed in Python 3. Undefined names have the potential to raise NameError at runtime.
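A minimal sketch of the kind of compatibility shim this implies; the exact aliases vary per file, and `file` has no direct Python 3 alias (code using it generally needs a rewrite around open()/io):

```python
import sys

if sys.version_info[0] >= 3:
    # Builtins that existed on Python 2 but were removed in Python 3;
    # aliasing them keeps code written for both versions importable.
    long = int
    unicode = str
    xrange = range

print(long(7), unicode("a"), list(xrange(3)))
```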
How was this patch tested?
@holdenk
flake8 testing of https://github.com/apache/spark on Python 3.6.3
$ python3 -m flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics