[SPARK-1900 / 1918] PySpark on YARN is broken #853
Jars and files provided to spark-submit are treated as HDFS paths on YARN clusters, even if they exist locally. This is inconsistent across different modes. Instead, we should always treat the command line argument paths passed to spark-submit as local paths, unless otherwise specified.
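The rule proposed above can be sketched as a small Python helper (a hypothetical illustration of the intended behavior, not Spark's actual `Utils.resolveURIs` implementation): if the argument already carries a URI scheme, keep it as-is; otherwise, treat it as a path on the local filesystem.

```python
import os
from urllib.parse import urlparse

def resolve_uri(path: str) -> str:
    """Treat a spark-submit argument as a local path unless it already
    carries a URI scheme (hdfs:, file:, http:, ...)."""
    if urlparse(path).scheme:
        return path  # explicit scheme: leave the URI untouched
    # no scheme: default to the local filesystem
    return "file:" + os.path.abspath(path)

print(resolve_uri("hdfs://nn:8020/jars/app.jar"))  # hdfs://nn:8020/jars/app.jar
print(resolve_uri("/tmp/sheep.py"))                # file:/tmp/sheep.py
```

Note that this naive sketch would mis-handle a Windows path like `C:\sheep.py`, whose drive letter parses as a URI scheme; that is exactly the subtlety the later Windows-related commits in this PR deal with.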
Merged build finished. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15140/
Merged build finished. All automated tests passed.
Move this to Utils.
Note to self: also add tests for this
Merged build finished. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15145/
Jenkins, test this please
Merged build finished. All automated tests passed.
This is non-trivial because paths in Windows may contain backslashes,
and the drive may be misinterpreted as a URI scheme. This commit also
simplifies the logic of handling fragments ("#") in URIs.
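The drive-letter pitfall mentioned above can be seen with a plain URI parse, since a Windows drive letter satisfies the syntax of a URI scheme. Python's `urlparse` is used here only as a stand-in for any generic URI parser:

```python
from urllib.parse import urlparse

# A naive URI parse of a Windows path picks up the drive letter as a scheme:
windows_path = r"C:\Users\me\sheep.py"
print(urlparse(windows_path).scheme)  # 'c' -- the drive letter, not a real scheme

# whereas a proper file URI keeps its actual scheme:
print(urlparse("file:/C:/Users/me/sheep.py").scheme)  # 'file'
```

This is why resolving Windows paths needs an explicit drive-letter check rather than relying on scheme detection alone.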
Merged build finished. All automated tests passed.
Previously, we were not dealing with PYTHONPATH correctly. We added the full resolved URI (e.g. "file:/path/to/hello.py") to PYTHONPATH, but python does not understand URI scheme prefixes. We need to strip the prefix and add the plain path (e.g. "/path/to/hello.py") instead. Without this commit, --py-files does not work.

This is a non-trivial change, however, as it requires us to correct all python file paths before we add them to PYTHONPATH. We must still resolve the URIs of these paths so that they are added through `sc.addFile` properly.

Fun fact: before this commit, pyspark applications still worked with --py-files. This was a fluke, however, because in PythonRunner.scala we construct PYTHONPATH by concatenating strings with the ":" separator, so a file path "file:/path/to/hello.py" is interpreted as two separate entries, "file" and "/path/to/hello.py". It just so happens that the latter is also an absolute path, so python is still able to load the "hello" module entirely by chance.
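The ":" fluke and the fix described above can be sketched as follows (`strip_file_scheme` is a hypothetical helper illustrating the behavior, not Spark's actual code):

```python
# The concatenation fluke: PYTHONPATH entries are joined with ":", so a URI
# that slips onto PYTHONPATH is split back into two entries, and the second
# one happens to be a valid absolute path.
entry = "file:/path/to/hello.py"
print(entry.split(":"))  # ['file', '/path/to/hello.py']

# The fix is to strip the URI scheme and add a plain filesystem path instead:
def strip_file_scheme(uri: str) -> str:
    """Return the bare path for a file: URI; leave plain paths untouched."""
    return uri[len("file:"):] if uri.startswith("file:") else uri

print(strip_file_scheme("file:/path/to/hello.py"))  # /path/to/hello.py
print(strip_file_scheme("/plain/path.py"))          # /plain/path.py
```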
@tdas I have pushed a commit that corrects the way we set PYTHONPATH. In a nutshell, python does not understand URI schemes (e.g. `file:`), so we must strip the scheme before adding a path to PYTHONPATH. Unfortunately, this involves a fairly non-trivial change, because we also have to make sure that the provided python files exist locally, such that adding them to the PYTHONPATH is actually meaningful. Also, we had been adding the python file itself to the PYTHONPATH. This is incorrect and does not work on YARN; instead, we should be adding the python file's containing directory. This is a slightly invasive change, but much of the new code is tests for formatting the paths properly. The good news is that I have tested this locally, on a CDH5 cluster, and on Windows, and everything behaves as expected. More specifically, on each of these deploy modes, I ran a combination of spark-shell, spark-submit, and pyspark, with jars / python files referencing each other. However, I have not had the time to test this on standalone mode or an HDP cluster (especially with Hadoop 2.4). After these have been tested, I think this PR is ready for merge.
I have tested this with Hadoop 2.4 and Spark standalone as well, and it works. This was a very tricky bug fix that required testing all combinations of deployment modes (local, Spark standalone, yarn-client, yarn-cluster, Windows) and execution modes (jars, spark shell, python shell, python scripts). Thanks @andrewor14 for doing this and thanks to @mengxr for helping us out. I am merging this.
If I run the following on a YARN cluster

```
bin/spark-submit sheep.py --master yarn-client
```

it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:

```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```

However, this also fails. This time it is because python does not understand URI schemes. This PR fixes this by automatically resolving all paths passed as command line arguments to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it. Much of the code was originally written by @mengxr. Tested on YARN cluster. More tests pending.

Author: Andrew Or <[email protected]>

Closes #853 from andrewor14/submit-paths and squashes the following commits:

0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
3c36587 [Andrew Or] Improve error messages (minor)
854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
3bb0359 [Andrew Or] Update more comments (minor)
2a1f8a0 [Andrew Or] Update comments (minor)
6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
a68c4d1 [Andrew Or] Handle Windows python file path correctly
427a250 [Andrew Or] Resolve paths properly for Windows
a591a4a [Andrew Or] Update tests for resolving URIs
6c8621c [Andrew Or] Move resolveURIs to Utils
db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
f542dce [Andrew Or] Fix outdated tests
691c4ce [Andrew Or] Ignore special primary resource names
5342ac7 [Andrew Or] Add missing space in error message
02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly

(cherry picked from commit 5081a0a)
Signed-off-by: Tathagata Das <[email protected]>
We resolve relative paths to the local `file:/` system for `--jars` and `--files` in spark submit (#853). We should do the same for the history server.

Author: Andrew Or <[email protected]>

Closes #1280 from andrewor14/hist-serv-fix and squashes the following commits:

13ff406 [Andrew Or] Merge branch 'master' of github.com:apache/spark into hist-serv-fix
b393e17 [Andrew Or] Strip trailing "/" from logging directory
622a471 [Andrew Or] Fix test in EventLoggingListenerSuite
0e20f71 [Andrew Or] Shift responsibility of resolving paths up one level
b037c0c [Andrew Or] Use resolved paths for everything in history server
c7e36ee [Andrew Or] Resolve paths for event logging too
40e3933 [Andrew Or] Resolve history server file paths
### What changes were proposed in this pull request?

The main changes of this PR are as follows:

- Upgrade `org.scalatestplus:selenium` from `org.scalatestplus:selenium-3-141:3.2.10.0` to `org.scalatestplus:selenium-4-2:3.2.13.0`, and upgrade `selenium-java` from `3.141.59` to `4.2.2` and `htmlunit-driver` from `2.62.0` to `3.62.0`
- Upgrade `okio` from `1.14.0` to `1.15.0`, because both `selenium-java` and `kubernetes-client` depend on okio 1.15.0, so Maven's nearest-wins resolution has also changed from 1.14.0 to 1.15.0

### Why are the changes needed?

To use the same version as the other `org.scalatestplus` series dependencies. The release notes are as follows:

- https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.11.0-for-selenium-4.1
- https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.12.0-for-selenium-4.1
- https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.12.1-for-selenium-4.1
- https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.13.0-for-selenium-4.2

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Manual test:
  - ChromeUISeleniumSuite

```
build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.ui.ChromeUISeleniumSuite"
```

```
[info] ChromeUISeleniumSuite:
Starting ChromeDriver 105.0.5195.52 (412c95e518836d8a7d97250d62b29c2ae6a26a85-refs/branch-heads/5195{#853}) on port 53917
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
[info] - SPARK-31534: text for tooltip should be escaped (4 seconds, 447 milliseconds)
[info] - SPARK-31882: Link URL for Stage DAGs should not depend on paged table. (841 milliseconds)
[info] - SPARK-31886: Color barrier execution mode RDD correctly (297 milliseconds)
[info] - Search text for paged tables should not be saved (1 second, 676 milliseconds)
[info] Run completed in 11 seconds, 819 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 25 s, completed 2022-9-14 20:12:28
```

  - ChromeUIHistoryServerSuite

```
build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.deploy.history.ChromeUIHistoryServerSuite"
```

```
[info] ChromeUIHistoryServerSuite:
Starting ChromeDriver 105.0.5195.52 (412c95e518836d8a7d97250d62b29c2ae6a26a85-refs/branch-heads/5195{#853}) on port 58567
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
[info] - ajax rendered relative links are prefixed with uiRoot (spark.ui.proxyBase) (2 seconds, 416 milliseconds)
[info] Run completed in 8 seconds, 936 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 30 s, completed 2022-9-14 20:11:34
```

Closes #37868 from LuciferYang/SPARK-40397.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
…add UTs for RocksDB
### What changes were proposed in this pull request?
`ChromeUIHistoryServerSuite` currently tests only the LevelDB backend; this PR refactors its UTs to add UTs for RocksDB.
### Why are the changes needed?
Add UTs related to RocksDB.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass GA
- Manual test on Apple Silicon environment:
```
build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" "core/testOnly org.apache.spark.deploy.history.RocksBackendChromeUIHistoryServerSuite"
```
```
[info] RocksBackendChromeUIHistoryServerSuite:
Starting ChromeDriver 105.0.5195.52 (412c95e518836d8a7d97250d62b29c2ae6a26a85-refs/branch-heads/5195{#853}) on port 54402
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
[info] - ajax rendered relative links are prefixed with uiRoot (spark.ui.proxyBase) (5 seconds, 387 milliseconds)
[info] Run completed in 20 seconds, 838 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 118 s (01:58), completed 2022-9-15 10:30:53
```
Closes #37878 from LuciferYang/SPARK-40424.
Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… 3.2.14

### What changes were proposed in this pull request?

This PR aims to upgrade the `scalatest`-related test dependencies to 3.2.14:

- scalatest: upgrade scalatest to 3.2.14
- scalatestplus
  - scalacheck: upgrade to `scalacheck-1-17` 3.2.14.0
  - mockito: upgrade to `mockito-4-6` 3.2.14.0
  - selenium: upgrade to `selenium-4-4` 3.2.14.0, with `selenium-java` 4.4, `htmlunit-driver` 3.64.0, and `htmlunit` 2.64.0

### Why are the changes needed?

The release notes are as follows:

- scalatest: https://github.com/scalatest/scalatest/releases/tag/release-3.2.14
- scalatestplus
  - scalacheck-1-17: https://github.com/scalatest/scalatestplus-scalacheck/releases/tag/release-3.2.14.0-for-scalacheck-1.17
  - mockito-4-6: https://github.com/scalatest/scalatestplus-mockito/releases/tag/release-3.2.14.0-for-mockito-4.6
  - selenium-4-4: https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.14.0-for-selenium-4.4

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Manual test:

```
build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.deploy.history.RocksDBBackendChromeUIHistoryServerSuite"
```

```
[info] RocksDBBackendChromeUIHistoryServerSuite:
Starting ChromeDriver 105.0.5195.52 (412c95e518836d8a7d97250d62b29c2ae6a26a85-refs/branch-heads/5195{#853}) on port 58104
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
[info] - ajax rendered relative links are prefixed with uiRoot (spark.ui.proxyBase) (2 seconds, 216 milliseconds)
[info] Run completed in 7 seconds, 816 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 111 s (01:51), completed 2022-10-6 17:14:20
```

Closes #38128 from LuciferYang/scalatest-3214.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>