[SPARK-23698][Python] Resolve undefined names in Python 3 #20838
Conversation
Force-pushed bdcd740 to dc27035
dev/create-release/releaseutils.py (Outdated)
Shall we put two spaces before the inline comment?
and shall we do this with an if-else on the Python version, to be consistent with other places?
“Where practical, apply the Python porting best practice: Use feature detection instead of version detection.”
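For illustration, the two approaches the guideline contrasts look like this; this is a generic sketch, not code from the PR, and `text_type` is a hypothetical name:

```python
import sys

# Version detection: branch on the interpreter version.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 -- only defined on Python 2

# Feature detection: probe for the name itself and fall back.
try:
    text_type = unicode  # noqa: F821
except NameError:
    text_type = str  # Python 3: `unicode` no longer exists
```

Feature detection keeps working even on interpreters where the version string is unusual, which is why the porting guides prefer it.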
Shall we keep this consistent for now? There are a hell of a lot of things to fix in PySpark in that way.
Force-pushed f097649 to 16d60ce
ok to test
Test build #88332 has finished for PR 20838 at commit
@HyukjinKwon Was there something more to do on this one?
HyukjinKwon
left a comment
shall we wait for https://github.com/cloudpipe/cloudpickle/pull/163? I want to see more feedback.
felixcheung
left a comment
that cloudpickle PR seems independent of the changes here, though?
python/pyspark/cloudpickle.py (Outdated)
I think this one is actually related to cloudpickle's PR.
I was trying to exactly match this file to a specific version of cloudpickle (currently 0.4.3 - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3), so I thought we could wait for more feedback there.
At least, I was thinking that we should match it to https://github.com/cloudpipe/cloudpickle/tree/0.4.x. If that one is merged, I could backport that change into the 0.4.x branch, and then we could just copy that change here.
This was merged upstream.
BryanCutler
left a comment
Thanks @cclauss for the PR, +1 for syncing the cloudpickle change.
dev/create-release/releaseutils.py (Outdated)
could you just put this above:

```python
if sys.version > '3':
    unicode = str
```

and then just do:

```python
author = unidecode.unidecode(unicode(author)).strip()
```

I don't think you need to specify "UTF-8" because either way it will be a unicode object, but I'm not too sure how this conversion is supposed to work.
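As a runnable sketch of the shape Bryan is suggesting — with the stdlib unicodedata module standing in for the third-party unidecode library, so the accent handling here is only an approximation, and `ascii_author` is a hypothetical helper name:

```python
import sys
import unicodedata

if sys.version_info[0] >= 3:
    unicode = str  # Python 2 style alias, as in the suggestion above

def ascii_author(author):
    # Coerce to a unicode object first, then strip non-ASCII marks.
    # unicodedata.normalize + ascii/ignore stands in for unidecode here.
    normalized = unicodedata.normalize("NFKD", unicode(author))
    return normalized.encode("ascii", "ignore").decode("ascii").strip()

print(ascii_author("  José  "))  # -> Jose
```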
@BryanCutler I think we do need to specify it because it won't be a unicode type in Python 2; we get these from run_cmd, which calls Popen's communicate, which returns a regular string.
Yes... My other worry in Py2 would be if sys.setdefaultencoding() was set to something other than utf-8. That method was also, thankfully, dropped in Py3.
My thought was that we are first casting author to unicode already with unicode(author), and it doesn't really matter if it is "UTF-8" or not, because we then immediately decode it into ASCII with unidecode, which can handle it even if it wasn't "UTF-8", so the end result should be the same, I believe. It was just to clean up a little, so not a big deal either way. The way it is now replicates the old behavior, so it's probably safer.
The way it is now [...] it's probably safer.
Let's agree to leave this as is in this PR. The EOL of Python 2 is some 500 days away, so safe is better.
The cloudpickle change is now merged upstream. I would like to avoid @BryanCutler's suggestion above because Unicode is really touchy in Python 2 and a lot can change based on sys.setdefaultencoding(), so I would like to avoid assumptions and make as small modifications to the Unicode handling as possible.
Other issues / suggestions?
I'm preparing to backport and release cloudpickle. Please give me some more days.
ok to test
Test build #91607 has finished for PR 20838 at commit
any update?
Test build #91644 has finished for PR 20838 at commit
python/pyspark/streaming/dstream.py (Outdated)
Can you add a test for it? Seems only used once and shouldn't be difficult to add a test.
Adding a call to slice in tests.py should be enough to test this. I'm surprised we haven't caught this before, but I suppose this isn't a very frequently exercised code path.
python/pyspark/sql/conf.py (Outdated)
This is fine since we rely on short-circuiting but I guess it's fine if it complains.
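The short-circuiting being referred to can be illustrated with a generic sketch — `is_text` is a hypothetical helper, not the actual conf.py code:

```python
def is_text(value):
    # On Python 3 the name `unicode` is undefined, so flake8 flags it
    # (F821).  For str inputs the expression still works, because `or`
    # short-circuits before the second operand is ever evaluated.
    return isinstance(value, str) or isinstance(value, unicode)  # noqa: F821

print(is_text("spark"))   # short-circuits; `unicode` is never looked up
try:
    is_text(42)           # falls through to the undefined name
except NameError:
    print("NameError for non-str input on Python 3")
```

So the linter complaint is technically valid, but the flagged branch is unreachable for the inputs the code actually sees.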
Is there an issue here?
Nope
dev/merge_spark_pr.py (Outdated)
Does this script run with Python 3+ now?
It creates a new function called raw_input() that is identical to the builtin input().
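The shim described here can be sketched as follows — a minimal stand-in for what the PR does, not the exact lines from merge_spark_pr.py:

```python
try:
    raw_input  # noqa: F821 -- the builtin exists on Python 2
except NameError:
    raw_input = input  # Python 3: raw_input() was renamed to input()

# raw_input("prompt: ") can now be called unchanged on either version.
```

This is feature detection rather than a version check: the name itself is probed, and the alias is only created where it is missing.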
dev/create-release/releaseutils.py (Outdated)
Does this script work in Python 3+ now?
It creates a new function called raw_input() that is identical to the builtin input().
@cclauss, mind updating this PR?
Test build #93125 has finished for PR 20838 at commit
retest this please
Test build #93150 has finished for PR 20838 at commit
retest this please
Test build #93195 has finished for PR 20838 at commit
It was reverted because slice_test() was causing the build to fail.
Shall we fix it here?
Force-pushed c0fdb71 to 73a4fd2
Test build #94927 has finished for PR 20838 at commit
This is not working at all... I am wasting way too much time. 5+ months and 80+ comments for 12 lines of code is too much. I do not have the skills to solve the following undefined name 'long' in a satisfactory manner. If someone with more skills would be willing to take that one undefined name off my plate and solve it with a test in a separate PR, I would be grateful. I will study that PR carefully and can then proceed with the others that are in this PR. My recommended fix is at https://github.com/apache/spark/pull/20838/files#diff-6c576c52abc0624ccb6a2f45828dc6a7 and my proposed test (it is failing!) immediately follows it.
Hi @cclauss, sorry for the frustration. I looked into the test, and it was kind of a pain to get working right - which is probably why it wasn't done in the first place ;) Here are my modifications for test_slice:

```python
def test_slice(self):
    """Basic operation test for DStream.slice."""
    import datetime as dt
    self.ssc = StreamingContext(self.sc, 1.0)
    self.ssc.remember(4.0)
    input = [[1], [2], [3], [4]]
    stream = self.ssc.queueStream([self.sc.parallelize(d, 1) for d in input])
    time_vals = []

    def get_times(t, rdd):
        if rdd and len(time_vals) < len(input):
            time_vals.append(t)

    stream.foreachRDD(get_times)
    self.ssc.start()
    self.wait_for(time_vals, 4)
    begin_time = time_vals[0]

    def get_sliced(begin_delta, end_delta):
        begin = begin_time + dt.timedelta(seconds=begin_delta)
        end = begin_time + dt.timedelta(seconds=end_delta)
        rdds = stream.slice(begin, end)
        result_list = [rdd.collect() for rdd in rdds]
        return [r for result in result_list for r in result]

    self.assertEqual(set([1]), set(get_sliced(0, 0)))
    self.assertEqual(set([2, 3]), set(get_sliced(1, 2)))
    self.assertEqual(set([2, 3, 4]), set(get_sliced(1, 4)))
    self.assertEqual(set([1, 2, 3, 4]), set(get_sliced(0, 4)))
```

If you want to put that in, I have some time now and can help you get this merged, or if you prefer I can finish it up and still assign it to you.
Thanks massively for this. I doubt that I ever would have gotten to that on my own. Since this is a test, my proposal would be that you create a separate PR so that we are all assured that it passes in the current codebase. Once your PR has been merged, I can come back and finish this PR. Thanks again.
Or Bryan could open a PR on your branch? That would usually be easier for getting this PR through, just my 2c.
+1 for ^. We can credit multiple people now anyway.
Test build #95010 has finished for PR 20838 at commit
BryanCutler
left a comment
LGTM, just one minor change. Could you also update the PR title to include a [PYTHON] tag, and remove from the description the crossed-out line "Where practical..." and "Please review http://spark.apache.org/contributing.html before opening a pull request."?
```python
def test_slice(self):
    """Basic operation test for DStream.slice."""
    import datetime as dt
```
let's remove the import here since it is at the top already
Also, not your doing, but I noticed this spelling error on line 183: "DStream.faltMap" should be "DStream.flatMap". Would you mind changing that while we are here?
https://github.com/apache/spark/pull/20838/files#diff-ca4ec8dece48511b915cad6a801695c1R183
Test build #95045 has finished for PR 20838 at commit
BryanCutler
left a comment
LGTM. If you could just update the title and description as mentioned in #20838 (review), I think this is good to go.
merged to master, thanks @cclauss!
Prevent linters from raising undefined name '__version__' by initializing the variable before setting it via a call to exec(). This is the last remaining issue related to the work done in apache#20838.
What changes were proposed in this pull request?
Fix issues arising from the fact that the builtins file, long, raw_input(), unicode, xrange(), etc. were all removed in Python 3. Undefined names have the potential to raise NameError at runtime.
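A minimal sketch of the kind of compatibility shim this implies; the exact aliases vary per file, and `file` has no direct Python 3 alias (code using it generally needs a rewrite around open()/io):

```python
import sys

if sys.version_info[0] >= 3:
    # Builtins that existed on Python 2 but were removed in Python 3;
    # aliasing them keeps code written for both versions importable.
    long = int
    unicode = str
    xrange = range

print(long(7), unicode("a"), list(xrange(3)))
```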
How was this patch tested?
@holdenk
flake8 testing of https://github.com/apache/spark on Python 3.6.3
$ python3 -m flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics