Skip to content

Conversation

@advancedxy
Copy link
Contributor

What changes were proposed in this pull request?

The str of CapaturedException is now returned by str(self.desc) rather than repr(self.desc), which is more user-friendly. It also handles unicode under python2 specially.

Why are the changes needed?

This is an improvement, and makes exception more human readable in python side.

Does this PR introduce any user-facing change?

Before this pr, select 中文字段 throws exception something likes below:

Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n"

after this pr:

Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7;
'Project ['中文字段]
+- OneRowRelation

How was this patch

Add a new test to verify unicode are correctly converted and manual checks for thrown exceptions.

This pr's credits should go to @uncleGen and is based on #17267

@advancedxy
Copy link
Contributor Author

Ping @HyukjinKwon and @viirya.

I believe this should be ported before #18324, let's check this pr first.

import sys

if sys.version_info.major >= 3:
unicode = str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need this? you only use unicode when version major < 3, isn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks we don't.

I thought unicode should be available when running python3. But looks like isinstance is not evaluated, and it doesn't require unicode to be in the namespace.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Sep 18, 2019

Test build #110903 has finished for PR 25814 at commit 3c028e9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@advancedxy please fix the style. looks good otherwise.

desc = self.desc
# encode unicode instance for python2 for human readable description
if sys.version_info.major < 3 and isinstance(desc, unicode):
return str(desc.encode('utf-8'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this once we drop Python 2, which will be right away after Spark 3.0 release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@SparkQA
Copy link

SparkQA commented Sep 18, 2019

Test build #110910 has finished for PR 25814 at commit 9e4e7e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@advancedxy advancedxy deleted the python_exception_19926_and_21045 branch September 18, 2019 14:39
@tgravescs
Copy link
Contributor

I'm seeing an error after this went in with unicode:

starting flake8 test...
flake8 checks failed:
./python/pyspark/sql/utils.py:31:60: F821 undefined name 'unicode'
if sys.version_info.major < 3 and isinstance(desc, unicode):
^
1 F821 undefined name 'unicode'

Where is unicode supposed to come from?

@advancedxy
Copy link
Contributor Author

If it’s python2, unicode is built-in.
Otherwise, this part is never evaluated, thus irrelevant.

@tgravescs
Copy link
Contributor

hmm, then why am I getting the error? I'm simply running the dev/lint-python script, my default system version is python 3.7.3. it seems to be checking for unicode package still. Its not clear to me which version Jenkins is using. I see it install python 3.5.6 but that is way after the lint script runs

Have you tried this with python3?

@advancedxy
Copy link
Contributor Author

Have you tried this with python3?

Yes, I ran tests with python3.7 on my local machine, but not dev/lint-python.

Looks like flake cannot infer that unicode is never used for python3. The code is correct in the runtime.

Seems we don't run lint-python on the Jenkins? @HyukjinKwon can you help confirming that?

I'd like to address this problem in #25847, the fix is simple, just
add the following in the begin:

import py4j
import sys

if sys.version_info.major >= 3:
    unicode = str

Does that sound reasonable to you @HyukjinKwon ? Or I should submit a follow up pr for this problem? Also cc @viirya

@tgravescs
Copy link
Contributor

I would like to see this fixed quickly because dev/lint-python is run by run-tests.py which if you are running with python 3 fails and thus makes my CI fail. I have to assume the Apache Spark Jenkins is using python 2 at that point. If #25847 is going to be committed today I'm ok with it, otherwise I think it should be separate pr

@viirya
Copy link
Member

viirya commented Sep 19, 2019

I'm fine either way. Looks #25847 was approved. It might be merged soon.

@advancedxy
Copy link
Contributor Author

I have updated a new commit to #25847, hope it would be merged soon.

@viirya
Copy link
Member

viirya commented Sep 19, 2019

I will take a look, If @HyukjinKwon is unable to merge it due to time difference.

@tgravescs
Copy link
Contributor

thanks!

@viirya
Copy link
Member

viirya commented Sep 19, 2019

@tgravescs I have a question on #25847, and I think that @advancedxy can't response in quick due to time difference. I am not sure how many people are affected by this, but in order to unblock it, I am going to create a follow up PR to include the fix.

HyukjinKwon pushed a commit that referenced this pull request Sep 20, 2019
…from python execution in Python 2

### What changes were proposed in this pull request?

This PR allows non-ascii string as an exception message in Python 2 by explicitly en/decoding in case of `str` in Python 2.

### Why are the changes needed?

Previously PySpark will hang when the `UnicodeDecodeError` occurs and the real exception cannot be passed to the JVM side.

See the reproducer as below:

```python
def f():
    raise Exception("中")
spark = SparkSession.builder.master('local').getOrCreate()
spark.sparkContext.parallelize([1]).map(lambda x: f()).count()
```

### Does this PR introduce any user-facing change?

User may not observe hanging for the similar cases.

### How was this patch tested?

Added a new test and manually checking.

This pr is based on #18324, credits should also go to dataknocker.
To make lint-python happy for python3, it also includes a followup fix for #25814

Closes #25847 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants