
Set --no-pre-install-wheels and PEX_MAX_INSTALL_JOBS for faster builds of internal pexes #20670

Merged · 3 commits · Apr 15, 2024

Conversation

@huonw (Contributor) commented Mar 14, 2024

This has all internal PEXes be built with two settings that improve performance: `--no-pre-install-wheels` and `PEX_MAX_INSTALL_JOBS`.

This is designed to be a performance improvement for any processing where Pants synthesises a PEX internally, like pants run path/to/script.py or pants test .... pex-tool/pex#2292 has benchmarks for the PEX tool itself.
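As a rough sketch of the shape of the change (a hypothetical helper, not the literal diff), an internal-only PEX build simply gains the extra flag:

```python
def internal_pex_extra_args(internal_only: bool) -> list[str]:
    """Hypothetical helper: extra `pex` CLI args for Pants-internal PEXes."""
    if not internal_only:
        return []
    # Internal PEXes never reach end users, so skip pre-installing wheels at build time;
    # they get installed (potentially in parallel) on first boot instead.
    return ["--no-pre-install-wheels"]


print(internal_pex_extra_args(internal_only=True))  # ['--no-pre-install-wheels']
```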

For benchmarks, I did some more purposeful ones with TensorFlow (PyTorch seems a bit awkward to set up and TensorFlow is still huge), using https://gist.github.com/huonw/0560f5aaa34630b68bfb7e0995e99285 . I did 3 runs each of two goals, with 2.21.0.dev4 and with PANTS_SOURCE pointing to this PR, and pulled the numbers out by finding the relevant log lines:

  • pants --no-local-cache --no-pantsd --named-caches-dir=$(mktemp -d) test example_test.py. This involves building 4 separate PEXes, partially in parallel and partially sequentially: requirements.pex, local_dists.pex, pytest.pex, and then pytest_runner.pex. The first and last are the interesting ones for this test.
  • pants --no-local-cache --no-pantsd --named-caches-dir=$(mktemp -d) run script.py. This just builds the requirements into script.pex.

(NB. these are potentially unrealistic in that they're running with all caching turned off or cleared, so they are truly a worst case. This means they're downloading the tensorflow wheels and all the others each time, which takes about 30s on my 100Mbit/s connection. Faster connections will thus see a higher ratio of benefit.)
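For reference, a rough Python rendering of the benchmark loop described above (an assumed sketch, not the linked gist; the reported numbers came from the relevant log lines rather than wall-clock timing):

```python
import subprocess
import tempfile
import time

GOALS = [["test", "example_test.py"], ["run", "script.py"]]

for goal in GOALS:
    for run in range(3):  # 3 runs each
        caches = tempfile.mkdtemp()  # fresh named caches, i.e. worst case
        start = time.monotonic()
        subprocess.run(
            ["pants", "--no-local-cache", "--no-pantsd", f"--named-caches-dir={caches}", *goal],
            check=True,
        )
        print(goal, run, f"{time.monotonic() - start:.1f}s")
```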

| goal | period | before (s) | after (s) |
| --- | --- | --- | --- |
| `run script.py` | building requirements | 74-82 | 49-52 |
| `test some_test.py` | building requirements | 67-71 | 30-36 |
| `test some_test.py` | building pytest runner | 8-9 | 17-18 |
| `test some_test.py` | total to start running tests | 76-80 | 53-58 |

I also did more ad hoc ones on a real-world work repo of mine, which doesn't use any of the big ML libraries, just running some basic goals once.

| goal | period | before (s) | after (s) |
| --- | --- | --- | --- |
| `pants export` on largest resolve | building requirements | 66 | 35 |
| `pants export` on largest resolve | total | 82 | 54 |
| "random" `pants test path/to/file.py` (1 attempt) | building requirements and pytest runner | 49 | 38 |

Two explicit questions for review:

  1. The --no-pre-install-wheels flag behaviour isn't explicitly tested... but maybe it should be, i.e. validate we're passing this flag for internal PEXes. Thoughts?
  2. Setting PEX_MAX_INSTALL_JOBS as an env var may result in fewer cache hits, e.g. pants test path/to/file.py may not be able to reuse caches from pants test :: (and vice versa) because the available concurrency is different, and thus this may be better done differently (see the toy fingerprint sketch after this list). Thoughts?
    • for instance, leave it as it was (the default 1)
    • set to 0 to have each PEX process do its own concurrency, e.g. based on number of CPU cores and wheels. Multiple processes in parallel risk overloading the machine since they don't coordinate, though.
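To illustrate the cache-hit concern in question 2, here is a toy fingerprint (purely illustrative, not the engine's actual hashing) showing why differing concurrency values split the cache:

```python
import hashlib
import json


def process_fingerprint(argv: list[str], env: dict[str, str]) -> str:
    """Toy stand-in for a process cache key that covers both argv and env."""
    payload = json.dumps([argv, sorted(env.items())]).encode()
    return hashlib.sha256(payload).hexdigest()


# The same pytest-runner invocation, with different available concurrency, hashes to
# different keys, so e.g. `pants test path/to/file.py` and `pants test ::` may not share
# cache entries.
key_low = process_fingerprint(["./pytest_runner.pex"], {"PEX_MAX_INSTALL_JOBS": "4"})
key_high = process_fingerprint(["./pytest_runner.pex"], {"PEX_MAX_INSTALL_JOBS": "16"})
assert key_low != key_high
```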

Fixes #15062


@huonw huonw changed the title Set --no-pre-install-wheels and PEX_MAX_INSTALL_JOBS for pants pexes Set --no-pre-install-wheels and PEX_MAX_INSTALL_JOBS for internal pexes Apr 3, 2024
@huonw (Contributor, Author) commented Apr 3, 2024

Splitting the local dists change out to #20746, since that seems like it might cause independent problems, so it's handy to have it separate for bisection.

huonw added a commit that referenced this pull request Apr 3, 2024
This marks all local dist PEXes as internal-only, removing the ability
for them to be anything but internal. This is almost true already,
except for PEXes built via `PexFromTargetsRequest`, where the local
dists PEX used for building the "real" PEX has the same internal status
as that real PEX. In this case, the local dists PEX still isn't surfaced
to users, so it's appropriate for that one to be internal too.

This will probably be slightly faster in isolation (building a
`pex_binary` that uses in-repo `python_distribution`s will be able to
just copy them around with less zip-file manipulation, more often, by
creating packed-layout PEXes). However, the real motivation is
unblocking #20670, where having this PEX built with
`--no-pre-install-wheels` (as internal-only PEXes will, by default) is
required to support downstream PEXes using that argument, at least until
pex-tool/pex#2299 is fixed.

NB. there's still a separate consideration of changing how local dists
are incorporated, which isn't changed or considered here:
pex-tool/pex#2392 (comment)
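As a minimal self-contained sketch of that change (hypothetical names, not the actual Pants API), the local-dists request stops inheriting the parent's internal status and is always internal:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PexRequestSketch:
    output_filename: str
    internal_only: bool


real_pex = PexRequestSketch("app.pex", internal_only=False)  # user-facing pex_binary
local_dists = replace(real_pex, output_filename="local_dists.pex", internal_only=True)
assert local_dists.internal_only  # always internal, regardless of the "real" PEX's status
```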
@huonw huonw marked this pull request as ready for review April 5, 2024 06:28
@huonw huonw requested review from cburroughs, kaos and benjyw April 5, 2024 06:28
@benjyw (Sponsor, Contributor) left a comment:

Nice wins!

@huonw huonw changed the title Set --no-pre-install-wheels and PEX_MAX_INSTALL_JOBS for internal pexes Set --no-pre-install-wheels and PEX_MAX_INSTALL_JOBS for faster builds of internal pexes Apr 6, 2024
@huonw huonw merged commit 00f5d69 into main Apr 15, 2024
24 checks passed
@huonw huonw deleted the huonw/15062-pre-install branch April 15, 2024 02:51
@@ -1157,6 +1163,9 @@ async def setup_pex_process(request: PexProcess, pex_environment: PexEnvironment
    complete_pex_env = pex_environment.in_sandbox(working_directory=request.working_directory)
    argv = complete_pex_env.create_argv(pex.name, *request.argv)
    env = {
        # Set this in case this PEX was built with --no-pre-install-wheels, and thus parallelising
        # the install on cold boot is handy.
        "PEX_MAX_INSTALL_JOBS": str(request.concurrency_available),
A reviewer (Contributor) commented on this diff:

I think this is what you were getting at in your notes, but as a rough estimate if I have 16 cores this will potentially thrash into 16 different cache entries, right?

@huonw (Contributor, Author) replied:

Yeah, in theory... depending on what else is running at the same time.

However, on closer examination prompted by your comment: I think I did this wrong:

  • concurrency_available is configuration telling the command runner how much concurrency would be useful for the process (e.g. "I'm processing 34 files, and could do so in parallel" => concurrency available = 34), not necessarily how much concurrency is actually allocated.
  • That allocation is communicated to the process by replacing the {pants_concurrency} template in its argv (to be used like ["some-process", ..., "--jobs={pants_concurrency}"] or similar).
  • This replacement currently only seems to work on the argv, not env vars (a toy rendering of the substitution follows this list).
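Before the actual engine code below, here is a toy Python rendering of that argv-only substitution (names assumed apart from the CONCURRENCY_TEMPLATE_RE pattern, which mirrors the Rust):

```python
import re

CONCURRENCY_TEMPLATE_RE = re.compile(r"\{pants_concurrency\}")


def apply_allocated_concurrency(argv: list[str], allocated: int) -> list[str]:
    """Replace the {pants_concurrency} placeholder in each argument with the allocation."""
    return [CONCURRENCY_TEMPLATE_RE.sub(str(allocated), arg) for arg in argv]


print(apply_allocated_concurrency(["some-process", "--jobs={pants_concurrency}"], 8))
# ['some-process', '--jobs=8']
```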

For my own reference, here's the relevant code (and its surroundings) showing the concurrency_available request flowing through to the replacement:

let semaphore_acquisition = self.sema.acquire(process.concurrency_available);
let permit = in_workunit!(
    "acquire_command_runner_slot",
    // TODO: The UI uses the presence of a blocked workunit below a parent as an indication that
    // the parent is blocked. If this workunit is filtered out, parents nodes which are waiting
    // for the semaphore will render, even though they are effectively idle.
    //
    // https://github.com/pantsbuild/pants/issues/14680 will likely allow for a more principled
    // solution to this problem, such as removing the mutable `blocking` flag, and then never
    // filtering blocked workunits at creation time, regardless of level.
    Level::Debug,
    |workunit| async move {
        let _blocking_token = workunit.blocking();
        semaphore_acquisition.await
    }
)
.await;

loop {
    let mut process = process.clone();
    let concurrency_available = permit.concurrency();
    log::debug!(
        "Running {} under semaphore with concurrency id: {}, and concurrency: {}",
        process.description,
        permit.concurrency_slot(),
        concurrency_available,
    );

    // TODO: Both of these templating cases should be implemented at the lowest possible level:
    // they might currently be applied above a cache.
    if let Some(ref execution_slot_env_var) = process.execution_slot_variable {
        process.env.insert(
            execution_slot_env_var.clone(),
            format!("{}", permit.concurrency_slot()),
        );
    }
    if process.concurrency_available > 0 {
        let concurrency = format!("{}", permit.concurrency());
        let mut matched = false;
        process.argv = std::mem::take(&mut process.argv)
            .into_iter()
            .map(
                |arg| match CONCURRENCY_TEMPLATE_RE.replace_all(&arg, &concurrency) {

So, I'm gonna revert this part of the change. Thanks.

#20795

huonw added a commit that referenced this pull request Apr 17, 2024
This fixes/removes the use of `PEX_MAX_INSTALL_JOBS` when running a PEX.
This was added in #20670 in an attempt to sync with
`--no-pre-install-wheels` and do more work in parallel when booting
"internal" PEXes... but it was implemented incorrectly. This would need
to be set to `PEX_MAX_INSTALL_JOBS={pants_concurrency}` or similar, and:

- that substitution isn't currently supported (only argv substitutions)
- it's somewhat unclear if we even want to do that at all, as it'll
result in more cache misses

Noticed in
#20670 (comment)
Successfully merging this pull request may close these issues:

  • Benchmark various defaults for PEX compression for internal PEXes