
Commit b8100b5

itholic authored and HyukjinKwon committed
[SPARK-41586][PYTHON] Introduce pyspark.errors and error classes for PySpark
### What changes were proposed in this pull request?

This PR proposes to introduce `pyspark.errors` and error classes to unify and improve the errors generated by PySpark under a single path. To summarize, this PR includes the changes below:

- Add `python/pyspark/errors/error_classes.py` to support error classes for PySpark.
- Add `ErrorClassesReader` to manage `error_classes.py`.
- Add `PySparkException` to handle the errors generated by PySpark.
- Add `check_error` for error class testing.

This is an initial PR introducing an error framework for PySpark, to facilitate error management and provide better, more consistent error messages to users. While active work is being done on the [SQL side to improve error messages](https://issues.apache.org/jira/browse/SPARK-37935), so far there has been no corresponding effort for PySpark. This PR is intended to initiate that effort on the PySpark side.

Eventually, the error messages will be shown as below, for example:

- PySpark, `PySparkException` (thrown by the Python driver):

```python
>>> from pyspark.sql.functions import lit
>>> lit([df.id, df.id])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/utils.py", line 334, in wrapped
    return f(*args, **kwargs)
  File ".../spark/python/pyspark/sql/functions.py", line 176, in lit
    raise PySparkException(
pyspark.errors.exceptions.PySparkException: [COLUMN_IN_LIST] lit does not allow a column in a list.
```

- PySpark, `AnalysisException` (thrown on the JVM side and captured on the PySpark side):

```
>>> df.unpivot("id", [], "var", "val").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 3296, in unpivot
    jdf = self._jdf.unpivotWithSeq(jids, jvals, variableColumnName, valueColumnName)
  File ".../spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 209, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
'Unpivot ArraySeq(id#2L), ArraySeq(), var, [val]
+- LogicalRDD [id#2L, int#3L, double#4, str#5], false
```

- Spark, `AnalysisException`:

```scala
scala> df.select($"id").unpivot(Array($"id"), Array.empty, variableColumnName = "var", valueColumnName = "val")
org.apache.spark.sql.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
'Unpivot ArraySeq(id#0L), ArraySeq(), var, [val]
+- Project [id#0L]
   +- Range (0, 10, step=1, splits=Some(16))
```

**Next up** for this PR includes:

- Migrate more errors into `PySparkException` across all modules (e.g., Spark Connect, pandas API on Spark, ...).
- Migrate more error tests into error class tests by using `check_error`.
- Define more error classes in `error_classes.py`.
- Add documentation.

### Why are the changes needed?

Centralizing error messages and introducing identified error classes provides the following benefits:

- Errors are searchable via the unique class names and properly classified.
- Reduces the cost of future maintenance for PySpark errors.
- Provides consistent and actionable error messages to users.
- Facilitates translating error messages into different languages.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests and ran the existing static analysis tools (`dev/lint-python`).

Closes #39387 from itholic/SPARK-41586.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
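The `check_error` helper mentioned above lets tests assert on the structured error class and parameters rather than on the rendered message string. Its exact signature is not shown in this commit page, so the following is a hypothetical, self-contained sketch of the pattern; `DemoError` and this `check_error` signature are illustrative stand-ins, not the actual PySpark code:

```python
from typing import Dict, Optional


class DemoError(Exception):
    """Illustrative stand-in for PySparkException."""

    def __init__(self, error_class: str, message_parameters: Dict[str, str]):
        self.error_class = error_class
        self.message_parameters = message_parameters

    def getErrorClass(self) -> Optional[str]:
        return self.error_class

    def getMessageParameters(self) -> Optional[Dict[str, str]]:
        return self.message_parameters


def check_error(exception, error_class, message_parameters):
    # Compare the structured fields, not the free-form message text,
    # so tests stay stable when the wording of a template changes.
    assert exception.getErrorClass() == error_class
    assert exception.getMessageParameters() == message_parameters


exc = DemoError("COLUMN_IN_LIST", {"func_name": "lit"})
check_error(exc, "COLUMN_IN_LIST", {"func_name": "lit"})
```

Checking the class name and parameters (rather than the message) is what makes the errors "searchable and properly classified" as described above.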
1 parent 47068db commit b8100b5

File tree

16 files changed: +398 −3 lines

dev/pyproject.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -31,4 +31,4 @@ required-version = "22.6.0"
 line-length = 100
 target-version = ['py37']
 include = '\.pyi?$'
-extend-exclude = 'cloudpickle'
+extend-exclude = 'cloudpickle|error_classes.py'
```

dev/sparktestsupport/modules.py

Lines changed: 10 additions & 0 deletions

```diff
@@ -756,6 +756,16 @@ def __hash__(self):
     ],
 )
 
+pyspark_errors = Module(
+    name="pyspark-errors",
+    dependencies=[],
+    source_file_regexes=["python/pyspark/errors"],
+    python_test_goals=[
+        # unittests
+        "pyspark.errors.tests.test_errors",
+    ],
+)
+
 sparkr = Module(
     name="sparkr",
     dependencies=[hive, mllib],
```

dev/tox.ini

Lines changed: 1 addition & 0 deletions

```diff
@@ -30,6 +30,7 @@ per-file-ignores =
     # Examples contain some unused variables.
     examples/src/main/python/sql/datasource.py: F841,
     # Exclude * imports in test files
+    python/pyspark/errors/tests/*.py: F403,
     python/pyspark/ml/tests/*.py: F403,
     python/pyspark/mllib/tests/*.py: F403,
     python/pyspark/pandas/tests/*.py: F401 F403,
```

python/docs/source/reference/index.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -35,3 +35,4 @@ Pandas API on Spark follows the API specifications of latest pandas release.
    pyspark.mllib
    pyspark
    pyspark.resource
+   pyspark.errors
```
New file (presumably `python/docs/source/reference/pyspark.errors.rst`, matching the `pyspark.errors` toctree entry added above; the filename header was not captured on this page)

Lines changed: 29 additions & 0 deletions

```rst
..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..  http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.


======
Errors
======

.. currentmodule:: pyspark.errors

.. autosummary::
    :toctree: api/

    PySparkException.getErrorClass
    PySparkException.getMessageParameters
```

python/mypy.ini

Lines changed: 3 additions & 0 deletions

```diff
@@ -106,6 +106,9 @@ ignore_errors = True
 [mypy-pyspark.testing.*]
 ignore_errors = True
 
+[mypy-pyspark.errors.tests.*]
+ignore_errors = True
+
 ; Allow non-strict optional for pyspark.pandas
 
 [mypy-pyspark.pandas.*]
```

python/pyspark/errors/__init__.py

Lines changed: 26 additions & 0 deletions

```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
PySpark exceptions.
"""
from pyspark.errors.exceptions import PySparkException


__all__ = [
    "PySparkException",
]
```
python/pyspark/errors/error_classes.py (new file; filename inferred from the PR description above)

Lines changed: 30 additions & 0 deletions

```python
# (ASF Apache License 2.0 header, identical to the one in the file above)

import json


ERROR_CLASSES_JSON = """
{
  "COLUMN_IN_LIST": {
    "message": [
      "<func_name> does not allow a column in a list."
    ]
  }
}
"""

ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)
```
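The `<placeholder>` tokens in each message template are filled in from `message_parameters` by `ErrorClassesReader` (defined in `pyspark.errors.utils`, which this page does not show). A minimal, self-contained sketch of that resolve-and-substitute step, assuming each angle-bracket placeholder maps directly to a parameter key:

```python
import json

# Same shape as ERROR_CLASSES_MAP above.
ERROR_CLASSES_JSON = """
{
  "COLUMN_IN_LIST": {
    "message": [
      "<func_name> does not allow a column in a list."
    ]
  }
}
"""
ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)


def get_error_message(error_class: str, message_parameters: dict) -> str:
    """Resolve an error class to its message template and fill in the parameters."""
    # The "message" value is a list of lines; join them into one template string.
    template = "\n".join(ERROR_CLASSES_MAP[error_class]["message"])
    # Replace each <key> placeholder with its parameter value.
    for key, value in message_parameters.items():
        template = template.replace(f"<{key}>", value)
    return template


print(get_error_message("COLUMN_IN_LIST", {"func_name": "lit"}))
# lit does not allow a column in a list.
```

Keeping the templates in a single JSON document is what makes the messages easy to audit, classify, and translate, per the "Why are the changes needed?" section above.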
python/pyspark/errors/exceptions.py (new file; filename inferred from the import in `__init__.py` above)

Lines changed: 76 additions & 0 deletions

```python
# (ASF Apache License 2.0 header, identical to the one in the file above)

from typing import Dict, Optional, cast

from pyspark.errors.utils import ErrorClassesReader


class PySparkException(Exception):
    """
    Base Exception for handling errors generated from PySpark.
    """

    def __init__(
        self,
        message: Optional[str] = None,
        error_class: Optional[str] = None,
        message_parameters: Optional[Dict[str, str]] = None,
    ):
        # `message` vs `error_class` & `message_parameters` are mutually exclusive.
        assert (message is not None and (error_class is None and message_parameters is None)) or (
            message is None and (error_class is not None and message_parameters is not None)
        )

        self.error_reader = ErrorClassesReader()

        if message is None:
            self.message = self.error_reader.get_error_message(
                cast(str, error_class), cast(Dict[str, str], message_parameters)
            )
        else:
            self.message = message

        self.error_class = error_class
        self.message_parameters = message_parameters

    def getErrorClass(self) -> Optional[str]:
        """
        Returns the error class as a string.

        .. versionadded:: 3.4.0

        See Also
        --------
        :meth:`PySparkException.getMessageParameters`
        """
        return self.error_class

    def getMessageParameters(self) -> Optional[Dict[str, str]]:
        """
        Returns the message parameters as a dictionary.

        .. versionadded:: 3.4.0

        See Also
        --------
        :meth:`PySparkException.getErrorClass`
        """
        return self.message_parameters

    def __str__(self) -> str:
        return f"[{self.getErrorClass()}] {self.message}"
```
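Since `ErrorClassesReader` lives in `pyspark.errors.utils` (not shown on this page), here is a runnable sketch of how callers would use such a class, with a minimal stub reader and a simplified replica standing in for the real code:

```python
from typing import Dict, Optional


class StubErrorClassesReader:
    """Minimal stand-in for pyspark.errors.utils.ErrorClassesReader (not shown on this page)."""

    TEMPLATES = {"COLUMN_IN_LIST": "<func_name> does not allow a column in a list."}

    def get_error_message(self, error_class: str, message_parameters: Dict[str, str]) -> str:
        message = self.TEMPLATES[error_class]
        for key, value in message_parameters.items():
            message = message.replace(f"<{key}>", value)
        return message


class DemoPySparkException(Exception):
    """Simplified replica of the PySparkException above, wired to the stub reader."""

    def __init__(
        self,
        message: Optional[str] = None,
        error_class: Optional[str] = None,
        message_parameters: Optional[Dict[str, str]] = None,
    ):
        # Either a plain message, or an error class plus its parameters.
        if message is None:
            message = StubErrorClassesReader().get_error_message(
                error_class, message_parameters
            )
        self.message = message
        self.error_class = error_class
        self.message_parameters = message_parameters

    def getErrorClass(self) -> Optional[str]:
        return self.error_class

    def __str__(self) -> str:
        return f"[{self.getErrorClass()}] {self.message}"


try:
    raise DemoPySparkException(
        error_class="COLUMN_IN_LIST", message_parameters={"func_name": "lit"}
    )
except DemoPySparkException as e:
    print(e)  # [COLUMN_IN_LIST] lit does not allow a column in a list.
```

The rendered string matches the `PySparkException` traceback shown in the PR description, while the structured fields remain available for programmatic handling.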
New file (filename not captured on this page; it contains only the license header, so it is presumably a package `__init__.py` under `python/pyspark/errors/`)

Lines changed: 16 additions & 0 deletions

```python
# (ASF Apache License 2.0 header, identical to the one in the file above)
```
