Commit 8141c3e

[SPARK-23300][TESTS] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests
## What changes were proposed in this pull request?

This PR logs whether PyArrow and Pandas are installed, so we can tell up front whether the related tests will run or be skipped.

## How was this patch tested?

Manually tested. I don't have PyArrow installed in PyPy.

```bash
$ ./run-tests --python-executables=python3
```

```
...
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python3' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3' in 'pyspark-sql' module.
Starting test(python3): pyspark.mllib.tests
Starting test(python3): pyspark.sql.tests
Starting test(python3): pyspark.streaming.tests
Starting test(python3): pyspark.tests
```

```bash
$ ./run-tests --modules=pyspark-streaming
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-streaming']
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.streaming.util
Starting test(python2.7): pyspark.streaming.tests
Starting test(python2.7): pyspark.streaming.util
```

```bash
$ ./run-tests
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```bash
$ ./run-tests --modules=pyspark-sql --python-executables=pypy
```

```
...
Will test against the following Python executables: ['pypy']
Will test the following Python modules: ['pyspark-sql']
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.sql.catalog
Starting test(pypy): pyspark.sql.column
Starting test(pypy): pyspark.sql.conf
```

After some modification to produce other cases:

```bash
$ ./run-tests
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow 0.8.0 was found.
Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.20.2 was found.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.22.0 was found.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```bash
./run-tests-with-coverage
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Coverage is not installed in Python executable 'pypy' but 'COVERAGE_PROCESS_START' environment variable is set, exiting.
```

Author: hyukjinkwon <[email protected]>

Closes #20473 from HyukjinKwon/SPARK-23300.
1 parent a24c031 commit 8141c3e
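
For readers skimming the commit, the core mechanism is worth spelling out: the test runner cannot simply `import pyarrow` itself, because each Python executable under test ('python2.7', 'pypy', ...) has its own site-packages. The check therefore shells out to the target interpreter and reads the version back. Below is a minimal standalone sketch of that pattern (my illustration, not code from this commit; `probe_package_version` is a hypothetical helper):

```python
# Minimal sketch (not from the commit): ask another Python executable for a
# package's version by running a one-liner in a subprocess. Returns None if
# that interpreter cannot import the package or doesn't exist.
import subprocess


def probe_package_version(python_exec, package):
    try:
        out = subprocess.check_output(
            [python_exec, "-c", "import %s; print(%s.__version__)" % (package, package)],
            stderr=subprocess.DEVNULL,
            universal_newlines=True)
        return out.strip()
    except (subprocess.CalledProcessError, OSError):
        # Import failed, or the executable itself was not found.
        return None


# e.g. probe_package_version("pypy", "pyarrow") returns None on an interpreter
# without PyArrow, which corresponds to the "PyArrow was not found" log above.
```

The commit's `_check_dependencies` does essentially this via `subprocess_check_output` from `sparktestsupport.shellutils`, then compares the result against a minimum version before logging whether the features will be tested or skipped.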


python/run-tests.py

Lines changed: 68 additions & 5 deletions
```diff
@@ -31,15 +31,16 @@
     import Queue
 else:
     import queue as Queue
+from distutils.version import LooseVersion


 # Append `SPARK_HOME/dev` to the Python path so that we can import the sparktestsupport module
 sys.path.append(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../dev/"))


 from sparktestsupport import SPARK_HOME  # noqa (suppress pep8 warnings)
-from sparktestsupport.shellutils import which, subprocess_check_output, run_cmd  # noqa
-from sparktestsupport.modules import all_modules  # noqa
+from sparktestsupport.shellutils import which, subprocess_check_output  # noqa
+from sparktestsupport.modules import all_modules, pyspark_sql  # noqa


 python_modules = dict((m.name, m) for m in all_modules if m.python_test_goals if m.name != 'root')
@@ -151,6 +152,67 @@ def parse_opts():
     return opts


+def _check_dependencies(python_exec, modules_to_test):
+    if "COVERAGE_PROCESS_START" in os.environ:
+        # Make sure if coverage is installed.
+        try:
+            subprocess_check_output(
+                [python_exec, "-c", "import coverage"],
+                stderr=open(os.devnull, 'w'))
+        except:
+            print_red("Coverage is not installed in Python executable '%s' "
+                      "but 'COVERAGE_PROCESS_START' environment variable is set, "
+                      "exiting." % python_exec)
+            sys.exit(-1)
+
+    # If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and
+    # explicitly prints out. See SPARK-23300.
+    if pyspark_sql in modules_to_test:
+        # TODO(HyukjinKwon): Relocate and deduplicate these version specifications.
+        minimum_pyarrow_version = '0.8.0'
+        minimum_pandas_version = '0.19.2'
+
+        try:
+            pyarrow_version = subprocess_check_output(
+                [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
+                universal_newlines=True,
+                stderr=open(os.devnull, 'w')).strip()
+            if LooseVersion(pyarrow_version) >= LooseVersion(minimum_pyarrow_version):
+                LOGGER.info("Will test PyArrow related features against Python executable "
+                            "'%s' in '%s' module." % (python_exec, pyspark_sql.name))
+            else:
+                LOGGER.warning(
+                    "Will skip PyArrow related features against Python executable "
+                    "'%s' in '%s' module. PyArrow >= %s is required; however, PyArrow "
+                    "%s was found." % (
+                        python_exec, pyspark_sql.name, minimum_pyarrow_version, pyarrow_version))
+        except:
+            LOGGER.warning(
+                "Will skip PyArrow related features against Python executable "
+                "'%s' in '%s' module. PyArrow >= %s is required; however, PyArrow "
+                "was not found." % (python_exec, pyspark_sql.name, minimum_pyarrow_version))
+
+        try:
+            pandas_version = subprocess_check_output(
+                [python_exec, "-c", "import pandas; print(pandas.__version__)"],
+                universal_newlines=True,
+                stderr=open(os.devnull, 'w')).strip()
+            if LooseVersion(pandas_version) >= LooseVersion(minimum_pandas_version):
+                LOGGER.info("Will test Pandas related features against Python executable "
+                            "'%s' in '%s' module." % (python_exec, pyspark_sql.name))
+            else:
+                LOGGER.warning(
+                    "Will skip Pandas related features against Python executable "
+                    "'%s' in '%s' module. Pandas >= %s is required; however, Pandas "
+                    "%s was found." % (
+                        python_exec, pyspark_sql.name, minimum_pandas_version, pandas_version))
+        except:
+            LOGGER.warning(
+                "Will skip Pandas related features against Python executable "
+                "'%s' in '%s' module. Pandas >= %s is required; however, Pandas "
+                "was not found." % (python_exec, pyspark_sql.name, minimum_pandas_version))
+
+
 def main():
     opts = parse_opts()
     if (opts.verbose):
@@ -175,9 +237,10 @@ def main():

     task_queue = Queue.PriorityQueue()
     for python_exec in python_execs:
-        if "COVERAGE_PROCESS_START" in os.environ:
-            # Make sure if coverage is installed.
-            run_cmd([python_exec, "-c", "import coverage"])
+        # Check if the python executable has proper dependencies installed to run tests
+        # for given modules properly.
+        _check_dependencies(python_exec, modules_to_test)
+
         python_implementation = subprocess_check_output(
             [python_exec, "-c", "import platform; print(platform.python_implementation())"],
             universal_newlines=True).strip()
```

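A side note on the comparison the new code relies on: `distutils.version.LooseVersion` parses version strings into components and compares them numerically, which plain string comparison gets wrong once a component reaches two digits. A quick illustration (my example, not part of the commit; `distutils` has since been deprecated in favor of `packaging.version`, but it was the standard-library choice at the time):

```python
from distutils.version import LooseVersion

# Component-wise comparison: 0.10.0 is newer than 0.8.0, even though
# lexicographic string comparison says otherwise ('1' < '8').
assert LooseVersion("0.10.0") >= LooseVersion("0.8.0")
assert not ("0.10.0" >= "0.8.0")  # plain string comparison gets this wrong

# Mirrors the minimum-version gate in _check_dependencies:
minimum_pyarrow_version = "0.8.0"
for found in ["0.7.1", "0.8.0", "0.10.0"]:
    ok = LooseVersion(found) >= LooseVersion(minimum_pyarrow_version)
    print("PyArrow %s -> %s" % (found, "test" if ok else "skip"))
```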