[SPARK-46668][DOCS] Parallelize Sphinx build of Python API docs #44680
Conversation
cc @itholic and @HyukjinKwon. This PR builds on #44012.
itholic left a comment:
LGTM when CI passes.
Nice fix.
Merged to master.
```diff
 # You can set these variables from the command line.
-SPHINXOPTS ?= "-W"
+SPHINXOPTS ?= "-W" "-j" "auto"
```
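Because the Makefile uses make's conditional assignment (`?=`), this default only applies when `SPHINXOPTS` is not already set, so it can still be overridden per run. For example (the worker count here is illustrative):

```sh
# Override the default Sphinx flags for a single build, e.g. limit to 4 workers:
$ make html SPHINXOPTS='-W -j 4'
# Or via the environment, which ?= also respects:
$ SPHINXOPTS='-W -j 4' make html
```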
Unfortunately, this breaks Python API doc generation on many-core machines, because this setting controls the number of parallel `SparkSubmit` invocations of PySpark. In addition, given that each PySpark session is currently launched with `local[*]`, this ends up with `N * N` `pyspark.daemon` processes.
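To put rough numbers on that (illustrative arithmetic, assuming `-j auto` resolves to the machine's core count and `local[*]` sizes itself the same way): on a 96-vCPU c7i.24xlarge, Sphinx starts ~96 workers, each launching a PySpark session with up to 96 local task slots, for a worst case of 96 × 96 = 9,216 `pyspark.daemon` processes. On a 4-core GitHub Actions runner, the same setting yields only 4 × 4 = 16.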
As of today, this setting only seems to work on low-core machines like GitHub Actions runners (4 cores). It breaks the Python documentation build even in an M3 Max environment, and it is worse on large EC2 machines (c7i.24xlarge). You can see the failure locally like this:
```
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ make html
...
java.lang.OutOfMemoryError: Java heap space
...
24/09/16 14:09:55 WARN PythonRunner: Incomplete task 7.0 in stage 30 (TID 177) interrupted: Attempting to kill Python Worker
...
make: *** [html] Error 2
```
> As of today, this setting only seems to work on low-core machines like GitHub Actions runners (4 cores).
It seems something about the build has changed since this PR was merged ~8 months ago. As stated in the PR description, I tested `auto` on a 16-core Intel machine and it worked fine. And I remember running the build many times to get a representative runtime to report.
> this ends up with `N * N` `pyspark.daemon` processes
Perhaps this was not the case back in January, for some reason. 🤷♂️
Here is a PR with a new default setting that keeps the speed of GitHub Actions CI. Users can also override this via `SPHINXOPTS`:
### What changes were proposed in this pull request?

This PR aims to limit `Sphinx` build parallelism to 4 by default, with the following goals:

- Preserve the same speed in the GitHub Actions environment.
- Prevent excessive `SparkSubmit` invocations on large machines like `c6i.24xlarge`.
- Still let users override the setting by providing `SPHINXOPTS`.

### Why are the changes needed?

The `Sphinx` parallelism feature was added on 2024-01-10 via:

- #44680

Unfortunately, this breaks Python API doc generation on large machines, because the setting controls the number of parallel `SparkSubmit` invocations of PySpark. In addition, given that each PySpark session is currently launched with `local[*]`, this ends up with `N * N` `pyspark.daemon` processes. In other words, as of today, the default setting, `auto`, only seems to work on low-core machines like GitHub Actions runners (4 cores). For example, it breaks the Python documentation build even in an M3 Max environment, and it is worse on large EC2 machines (c7i.24xlarge). You can see the failure locally like this:

```
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ make html
...
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
...
java.lang.OutOfMemoryError: Java heap space
...
24/09/16 14:09:55 WARN PythonRunner: Incomplete task 7.0 in stage 30 (TID 177) interrupted: Attempting to kill Python Worker
...
make: *** [html] Error 2
```

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Pass the CIs and manual tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48129 from dongjoon-hyun/SPARK-49680.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
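The corresponding Makefile change is not shown in this extract, but from the description it presumably amounts to something like the following sketch (not the verbatim diff):

```diff
 # You can set these variables from the command line.
-SPHINXOPTS ?= "-W" "-j" "auto"
+SPHINXOPTS ?= "-W" "-j" "4"
```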
### What changes were proposed in this pull request?

- Upgrade to Sphinx 4.5.0, which is the latest in the 4.x line and includes the fix for parallel builds on macOS.
- Enable parallel Sphinx workers to build the Python API docs.

I experimented with a few different values, and `auto` seems to work best. Configuring 4 workers seems to yield the same improvement as `auto`, suggesting parallelization beyond that is ineffective due to some sort of resource contention. But I left it as `auto` since that's more dynamic and should work better across varied environments.

On my 16-core Intel workstation, the runtime of `make html` was cut by ~60%. Here on CI, the "Run documentation build" step also seems to have improved, from the usual ~50 minutes down to ~31 minutes.
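For context, `-j` is simply passed through `SPHINXOPTS` to `sphinx-build`, which fans the build out across worker processes; `auto` means one worker per CPU core. The source and output paths below are illustrative:

```sh
# Equivalent direct invocations of sphinx-build:
$ sphinx-build -b html -j auto source build/html   # one worker per CPU core
$ sphinx-build -b html -j 4 source build/html      # exactly 4 workers
```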
### Why are the changes needed?
This saves developer time and CI time.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I manually built and reviewed the docs using:
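The exact command was elided here; a typical local run, reconstructed from the reproduction steps quoted earlier in this thread (not the author's verbatim command), would be:

```sh
# Build Spark first, then build the Python API docs with Sphinx
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ make html    # picks up SPHINXOPTS ?= "-W" "-j" "auto" from the Makefile
```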
### Was this patch authored or co-authored using generative AI tooling?
No.