Conversation

@nchammas (Contributor) commented Jan 11, 2024

What changes were proposed in this pull request?

Upgrade to Sphinx 4.5.0, which is the latest in the 4.x line and includes the fix for parallel builds on macOS.

Enable parallel Sphinx workers to build the Python API docs.

I experimented with a few different values, and `auto` seems to work best. Configuring 4 workers yields roughly the same improvement as `auto`, suggesting that parallelization beyond that point is limited by some sort of resource contention. But I left it as `auto` since that's more dynamic and should work better across varied environments.
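For reference, `-j` is Sphinx's own parallel-build option; a minimal sketch of passing it directly to `sphinx-build` (the paths here are illustrative):

```
# -j auto spawns one worker per CPU core; -j 4 would cap workers at 4
sphinx-build -b html -j auto source build/html
```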

On my 16-core Intel workstation, the runtime of make html was cut by ~60%.

```
# `make html` @ master
real    43m51.167s
user    41m43.526s
sys     0m39.651s

# `make html` with parallel workers
real    17m8.424s
user    174m42.051s
sys     5m8.824s
```

Here on CI, the "Run documentation build" step also seems to have improved, from the usual ~50 minutes down to ~31 minutes.

Why are the changes needed?

This saves developer time and CI time.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I manually built and reviewed the docs using:

```
SKIP_SCALADOC=1 SKIP_SQLDOC=1 SKIP_RDOC=1 time bundle exec jekyll build
```
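The Python API docs can also be built on their own, via the same entry point that appears later in this thread:

```
cd python/docs
make html
```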

Was this patch authored or co-authored using generative AI tooling?

No.

@nchammas (Contributor, Author)

cc @itholic and @HyukjinKwon. This PR builds on #44012.

@itholic (Contributor) left a comment

LGTM once CI passes.

Nice fix.

@HyukjinKwon (Member)

Merged to master.

@nchammas nchammas deleted the SPARK-46668-parallel-sphinx branch January 11, 2024 17:05

```diff
 # You can set these variables from the command line.
-SPHINXOPTS ?= "-W"
+SPHINXOPTS ?= "-W" "-j" "auto"
```
@dongjoon-hyun (Member) commented Sep 16, 2024

Unfortunately, this breaks Python API doc generation on many-core machines, because the worker count becomes the number of parallel `SparkSubmit` invocations of PySpark. In addition, given that each PySpark session is currently launched with `local[*]`, this ends up spawning `N * N` `pyspark.daemon` processes.

As of today, this setting only seems to work on low-core machines like GitHub Actions runners (4 cores). It breaks the Python documentation build even on an M3 Max, and the problem is worse on large EC2 machines (c7i.24xlarge). You can reproduce the failure locally like this:

```
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ make html
...
java.lang.OutOfMemoryError: Java heap space
...
24/09/16 14:09:55 WARN PythonRunner: Incomplete task 7.0 in stage 30 (TID 177) interrupted: Attempting to kill Python Worker
...
make: *** [html] Error 2
```
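A rough sketch of the blow-up, assuming `-j auto` resolves to the machine's core count `N`:

```
# N Sphinx workers, each launching PySpark with local[*] (N task slots),
# yields on the order of N * N pyspark.daemon processes.
# e.g. on a 96-vCPU c7i.24xlarge:
echo $((96 * 96))   # => 9216
```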

@nchammas (Contributor, Author)

> As of today, this setting only seems to work on low-core machines like GitHub Actions runners (4 cores).

Something about the build seems to have changed since this PR was merged ~8 months ago. As stated in the PR description, I tested `auto` on a 16-core Intel workstation and it worked fine. And I remember running the build many times to get a representative runtime to report.

> this ends up spawning `N * N` `pyspark.daemon` processes

Perhaps this was not the case back in January, for some reason. 🤷‍♂️

@dongjoon-hyun (Member)

Here is a PR with a new default setting that keeps the speed of the GitHub Actions CI. Users can still override it via SPHINXOPTS.
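A minimal sketch of the kind of default described (the exact change is in the commit below; the value `4` matches the default it names):

```
SPHINXOPTS ?= "-W" "-j" "4"
```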

dongjoon-hyun added a commit that referenced this pull request Sep 17, 2024
### What changes were proposed in this pull request?

This PR aims to limit `Sphinx` build parallelism to 4 by default, with the following goals.
- This preserves the same speed in the GitHub Actions environment.
- This prevents excessive `SparkSubmit` invocations on large machines like `c6i.24xlarge`.
- Users can still override the default by providing `SPHINXOPTS`.

### Why are the changes needed?

The `Sphinx` parallelism feature was added on 2024-01-10 via the following PR.

- #44680

However, this unfortunately breaks Python API doc generation on large machines, because the worker count becomes the number of parallel `SparkSubmit` invocations of PySpark. In addition, given that each PySpark session is launched with `local[*]`, this ends up spawning `N * N` `pyspark.daemon` processes.

In other words, as of today, the default setting of `auto` only seems to work on low-core machines like GitHub Actions runners (4 cores). It breaks the Python documentation build even on an M3 Max, and the problem is worse on large EC2 machines (c7i.24xlarge). You can reproduce the failure locally like this.

```
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ make html
...
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/09/16 17:04:38 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/09/16 17:04:39 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
...
java.lang.OutOfMemoryError: Java heap space
...
24/09/16 14:09:55 WARN PythonRunner: Incomplete task 7.0 in stage 30 (TID 177) interrupted: Attempting to kill Python Worker
...
make: *** [html] Error 2
```

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Pass the CIs and run manual tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48129 from dongjoon-hyun/SPARK-49680.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>