Skip to content

Conversation

@cclauss
Copy link

@cclauss cclauss commented Mar 15, 2018

What changes were proposed in this pull request?

Fix issues arising from the fact that builtins file, long, raw_input(), unicode, xrange(), etc. were all removed from Python 3. Undefined names have the potential to raise NameError at runtime.

How was this patch tested?

  • $ python2 -m flake8 . --count --select=E9,F82 --show-source --statistics
  • $ python3 -m flake8 . --count --select=E9,F82 --show-source --statistics

@holdenk

flake8 testing of https://github.com/apache/spark on Python 3.6.3

$ python3 -m flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
    result = raw_input("\n%s (y/n): " % prompt)
             ^
./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
    primary_author = raw_input(
                     ^
./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
    pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
               ^
./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
    jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
              ^
./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
    fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions)
                   ^
./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
            raw_assignee = raw_input(
                           ^
./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
    pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ")
             ^
./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
        result = raw_input("Would you like to use the modified title? (y/n): ")
                 ^
./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
    while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
          ^
./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
    response = raw_input("%s [y/n]: " % msg)
               ^
./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
        author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
                                     ^
./python/setup.py:37:11: F821 undefined name '__version__'
VERSION = __version__
          ^
./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
        dispatch[buffer] = save_buffer
                 ^
./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
        dispatch[file] = save_file
                 ^
./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
        if not isinstance(obj, str) and not isinstance(obj, unicode):
                                                            ^
./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
    intlike = (int, long)
                    ^
./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
        return self._sc._jvm.Time(long(timestamp * 1000))
                                  ^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 undefined name 'xrange'
for i in xrange(50):
         ^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 undefined name 'xrange'
    for j in xrange(5):
             ^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 undefined name 'xrange'
        for k in xrange(20022):
                 ^
20    F821 undefined name 'raw_input'
20

@cclauss cclauss force-pushed the fix-undefined-names branch from bdcd740 to dc27035 Compare March 15, 2018 23:20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put two spaces before lined comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and shall we do this with if-else of Python version to be consistent with other places?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Where practical, apply the Python porting best practice: Use feature detection instead of version detection.”

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we keep this consistent for now? There are a hell of a lot things to fix in PySpark in that way.

@cclauss cclauss force-pushed the fix-undefined-names branch 4 times, most recently from f097649 to 16d60ce Compare March 16, 2018 07:33
@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Mar 17, 2018

Test build #88332 has finished for PR 20838 at commit 16d60ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cclauss
Copy link
Author

cclauss commented Mar 23, 2018

@HyukjinKwon Was there something more to do on this one?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we wait for https://github.com/cloudpipe/cloudpickle/pull/163? I want to see more feedback.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that cloudpickle PR seems independent on the changes here though?

Copy link
Member

@HyukjinKwon HyukjinKwon Mar 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is actually related with cloudpickle's PR.

I was trying to (exactly) match this file to a specific version of cloudpickle (which is currently 0.4.3 - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3). So, I thought we could wait for more feedback there.

At least, I was thinking that we should match it to https://github.com/cloudpipe/cloudpickle/tree/0.4.x If that one is merged, I could backport that change into 0.4.x branch, and then we could just copy that change into here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was merged upstream.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cclauss for the PR, +1 for syncing the cloudpickle change.

Copy link
Member

@BryanCutler BryanCutler Mar 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you just put this above:

if sys.version > '3':
    unicode = str

and then just do?

author = unidecode.unidecode(unicode(author)).strip()

I don't think you need to specify "UTF-8" because either way it will be a unicode object, but I'm not too sure how this conversion is supposed to work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler I think we do need to specify it because it won't be unicode type in Python 2, we get these from run_cmd which call's Popen communicate which returns a regular string.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... My other worry in Py2 would be if sys.setdefaultencoding() was set to somthing other that utf-8. That method was also thankfully dropped in Py3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that we are first casting author this to unicode already with unicode(author) and it doesn't really matter if it is "UTF-8" or not because we then immediately decode it into ASCII with unidecode, which can handle it even it it wasn't "UTF-8", so the end result should be the same I believe. It was just to clean up a little, so not a big deal either way. The way it is now replicates the old behavior, so it's probably safer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it is now [...] it's probably safer.

Let's agree to leave this as is in this PR. EOL of Python 2 in 500 daze away so safe is better.

@cclauss
Copy link
Author

cclauss commented Mar 28, 2018

The cloudpickle change is now merged upstream. I would like to avoid @BryanCutler suggestion above because Unicode is really touchy in Python 2 and a lot can change based on sys.setdefaultencoding() so I would like to avoid assumptions and make as small modifications in Unicode as possible.

@cclauss
Copy link
Author

cclauss commented Apr 12, 2018

Other issues / suggestions?

@HyukjinKwon
Copy link
Member

im preparing to backport and release cloudpuckle. please give me some more days.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jun 9, 2018

Test build #91607 has finished for PR 20838 at commit 16d60ce.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

any update?

@SparkQA
Copy link

SparkQA commented Jun 11, 2018

Test build #91644 has finished for PR 20838 at commit fd4d922.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for it? Seems only used once and shouldn't be difficult to add a test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a call to slice in tests.py should be enough to test this. I'm surprised we haven't caught this before, but I suppose this isn't a very frequently exercised code path.

Copy link
Member

@HyukjinKwon HyukjinKwon Jun 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine since we rely on short-circuiting but I guess it's fine if it complains.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this script run with Python 3+ now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It creates a new function called raw_input() that is identical to the builtin input().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this script work in Python 3+ now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It creates a new function called raw_input() that is identical to the builtin input().

@HyukjinKwon
Copy link
Member

@cclauss, mind updating this PR?

@SparkQA
Copy link

SparkQA commented Jul 16, 2018

Test build #93125 has finished for PR 20838 at commit 2c4f15c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jul 17, 2018

Test build #93150 has finished for PR 20838 at commit 2c4f15c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jul 17, 2018

Test build #93195 has finished for PR 20838 at commit 2c4f15c.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cclauss
Copy link
Author

cclauss commented Aug 17, 2018

It was reverted because slice_test() was causing the build to fail.

@HyukjinKwon
Copy link
Member

Shell we fix it here?

@cclauss cclauss force-pushed the fix-undefined-names branch from c0fdb71 to 73a4fd2 Compare August 18, 2018 16:01
@SparkQA
Copy link

SparkQA commented Aug 18, 2018

Test build #94927 has finished for PR 20838 at commit 73a4fd2.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cclauss
Copy link
Author

cclauss commented Aug 19, 2018

This is not working at all... I am wasting way too much time. 5+ months and 80+ comments for 12 lines of code is

I do not have the skills to solve the following undefined name 'long' in a satisfactory manner:

./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
        return self._sc._jvm.Time(long(timestamp * 1000))
                                  ^

If someone with more skills would be willing to take that one undefined name off my plate and solve it with a test in a separate PR then I would be grateful. I will study that PR carefully and can then proceed with the others that are in this PR.

My recommended fix is at https://github.com/apache/spark/pull/20838/files#diff-6c576c52abc0624ccb6a2f45828dc6a7 and my proposed test (it is failing!) is immediately following.

@BryanCutler
Copy link
Member

BryanCutler commented Aug 21, 2018

Hi @cclauss , sorry for the frustration. I looked into the test, and it was kind of a pain to get it working right - which is probably why it wasn't done in the first place ;)

Here are my modifications for test_slice and it seems to pass py3 fine

def test_slice(self):
        """Basic operation test for DStream.slice."""
        import datetime as dt
        self.ssc = StreamingContext(self.sc, 1.0)
        self.ssc.remember(4.0)
        input = [[1], [2], [3], [4]]
        stream = self.ssc.queueStream([self.sc.parallelize(d, 1) for d in input])

        time_vals = []

        def get_times(t, rdd):
            if rdd and len(time_vals) < len(input):
                time_vals.append(t)

        stream.foreachRDD(get_times)

        self.ssc.start()
        self.wait_for(time_vals, 4)
        begin_time = time_vals[0]

        def get_sliced(begin_delta, end_delta):
            begin = begin_time + dt.timedelta(seconds=begin_delta)
            end = begin_time + dt.timedelta(seconds=end_delta)
            rdds = stream.slice(begin, end)
            result_list = [rdd.collect() for rdd in rdds]
            return [r for result in result_list for r in result]

        self.assertEqual(set([1]), set(get_sliced(0, 0)))
        self.assertEqual(set([2, 3]), set(get_sliced(1, 2)))
        self.assertEqual(set([2, 3, 4]), set(get_sliced(1, 4)))
        self.assertEqual(set([1, 2, 3, 4]), set(get_sliced(0, 4)))

If you want to put that in, I have some time now and can help you get this merged or if you prefer I can finish it up and still assign to you.

@cclauss
Copy link
Author

cclauss commented Aug 21, 2018

Thanks massively for this. I doubt that I ever would have gotten to that on my own. This is a test so my proposal would be that you create a separate PR so that we are all assured that it passes in the current codebase. Once your PR has been merged, I can come back and finish this PR. Thanks again.

@felixcheung
Copy link
Member

Or that Bryan opens a PR on your branch? that usually would be easier to get this PR through, just my 2c.

@HyukjinKwon
Copy link
Member

+1 for ^. We can credit to multiple persons now anyway.

@SparkQA
Copy link

SparkQA commented Aug 21, 2018

Test build #95010 has finished for PR 20838 at commit afd508e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one minor change. Could you also update the PR title to include a [PYTHON] tag, and from the description remove the line crossed out "Where practical..." and "Please review http://spark.apache.org/contributing.html before opening a pull request."?


def test_slice(self):
"""Basic operation test for DStream.slice."""
import datetime as dt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove the import here since it is at the top already

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not your doing but I noticed this spelling error on ln 183 "DStream.faltMap" should be "DStream.flatMap", would you mind changing that while we are here?
https://github.com/apache/spark/pull/20838/files#diff-ca4ec8dece48511b915cad6a801695c1R183

@SparkQA
Copy link

SparkQA commented Aug 21, 2018

Test build #95045 has finished for PR 20838 at commit 5b3658c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. If you could just update the title and description mentioned in #20838 (review) I think this is good to go

@cclauss cclauss changed the title [SPARK-23698] Resolve undefined names in Python 3 [SPARK-23698] [Python] Resolve undefined names in Python 3 Aug 22, 2018
@cclauss cclauss changed the title [SPARK-23698] [Python] Resolve undefined names in Python 3 [SPARK-23698][Python] Resolve undefined names in Python 3 Aug 22, 2018
@asfgit asfgit closed this in 71f38ac Aug 22, 2018
@BryanCutler
Copy link
Member

merged to master, thanks @cclauss !

@cclauss cclauss deleted the fix-undefined-names branch August 22, 2018 17:14
cclauss pushed a commit to cclauss/spark that referenced this pull request Aug 24, 2018
Prevent linters from raising __undefined name '\_\_version\_\_'__ by initializing the variable before setting it via a call to __exec()__.  This is the last remaining issue related to the work done in apache#20838
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants