Collecting from cache is surprisingly slow #10467

bluenote10 · 2021-09-13T11:00:35Z

What's the problem this feature will solve?

Workflows that require a frequent re-creation of virtual envs from scratch are very slow, and a large amount of time is spend collecting wheels despite having all the artefacts in the local cache already.

To demonstrate, let's consider installing from this (pip-compiled) requirements.txt:

requirements.txt

#
# This file is autogenerated by pip-compile with python 3.8
# To update, run:
#
#    pip-compile --no-emit-index-url requirements.in
#
ansible==4.5.0
    # via -r requirements.in
ansible-core==2.11.4
    # via ansible
asgiref==3.4.1
    # via django
attrs==21.2.0
    # via pytest
backcall==0.2.0
    # via ipython
black==21.8b0
    # via -r requirements.in
bokeh==2.3.3
    # via -r requirements.in
brotli==1.0.9
    # via flask-compress
cashier==1.3
    # via -r requirements.in
certifi==2021.5.30
    # via requests
cffi==1.14.6
    # via cryptography
charset-normalizer==2.0.4
    # via requests
click==8.0.1
    # via
    #   -r requirements.in
    #   black
    #   flask
cloudpickle==2.0.0
    # via dask
cryptography==3.4.8
    # via ansible-core
cycler==0.10.0
    # via matplotlib
dash==2.0.0
    # via -r requirements.in
dash-core-components==2.0.0
    # via dash
dash-html-components==2.0.0
    # via dash
dash-table==5.0.0
    # via dash
dask==2021.9.0
    # via -r requirements.in
decorator==5.1.0
    # via ipython
django==3.2.7
    # via -r requirements.in
docker==5.0.2
    # via -r requirements.in
flake8==3.9.2
    # via -r requirements.in
flask==2.0.1
    # via
    #   -r requirements.in
    #   dash
    #   flask-compress
flask-compress==1.10.1
    # via dash
fsspec==2021.8.1
    # via dask
idna==3.2
    # via requests
importlib-resources==5.2.2
    # via yachalk
iniconfig==1.1.1
    # via pytest
ipython==7.27.0
    # via -r requirements.in
itsdangerous==2.0.1
    # via flask
jedi==0.18.0
    # via ipython
jinja2==3.0.1
    # via
    #   -r requirements.in
    #   ansible-core
    #   bokeh
    #   flask
joblib==1.0.1
    # via scikit-learn
kiwisolver==1.3.2
    # via matplotlib
locket==0.2.1
    # via partd
markupsafe==2.0.1
    # via jinja2
matplotlib==3.4.3
    # via -r requirements.in
matplotlib-inline==0.1.3
    # via ipython
mccabe==0.6.1
    # via flake8
mypy==0.910
    # via -r requirements.in
mypy-extensions==0.4.3
    # via
    #   black
    #   mypy
numpy==1.21.2
    # via
    #   -r requirements.in
    #   bokeh
    #   matplotlib
    #   pandas
    #   pyarrow
    #   scikit-learn
    #   scipy
packaging==21.0
    # via
    #   ansible-core
    #   bokeh
    #   dask
    #   pytest
pandas==1.3.3
    # via -r requirements.in
parso==0.8.2
    # via jedi
partd==1.2.0
    # via dask
pathspec==0.9.0
    # via black
pexpect==4.8.0
    # via ipython
pickleshare==0.7.5
    # via ipython
pillow==8.3.2
    # via
    #   bokeh
    #   matplotlib
platformdirs==2.3.0
    # via black
plotly==5.3.1
    # via dash
pluggy==1.0.0
    # via pytest
prompt-toolkit==3.0.20
    # via ipython
ptyprocess==0.7.0
    # via pexpect
py==1.10.0
    # via pytest
pyarrow==5.0.0
    # via -r requirements.in
pycodestyle==2.7.0
    # via flake8
pycparser==2.20
    # via cffi
pyflakes==2.3.1
    # via flake8
pygments==2.10.0
    # via ipython
pyparsing==2.4.7
    # via
    #   matplotlib
    #   packaging
pytest==6.2.5
    # via -r requirements.in
python-dateutil==2.8.2
    # via
    #   bokeh
    #   matplotlib
    #   pandas
pytz==2021.1
    # via
    #   django
    #   pandas
pyyaml==5.4.1
    # via
    #   -r requirements.in
    #   ansible-core
    #   bokeh
    #   dask
regex==2021.8.28
    # via black
requests==2.26.0
    # via
    #   -r requirements.in
    #   docker
resolvelib==0.5.5
    # via ansible-core
scikit-learn==0.24.2
    # via sklearn
scipy==1.7.1
    # via
    #   -r requirements.in
    #   scikit-learn
six==1.16.0
    # via
    #   cycler
    #   plotly
    #   python-dateutil
sklearn==0.0
    # via -r requirements.in
sqlitedict==1.7.0
    # via -r requirements.in
sqlparse==0.4.2
    # via django
tenacity==8.0.1
    # via plotly
threadpoolctl==2.2.0
    # via scikit-learn
toml==0.10.2
    # via
    #   mypy
    #   pytest
tomli==1.2.1
    # via black
toolz==0.11.1
    # via
    #   dask
    #   partd
tornado==6.1
    # via bokeh
traitlets==5.1.0
    # via
    #   ipython
    #   matplotlib-inline
typing-extensions==3.10.0.2
    # via
    #   black
    #   bokeh
    #   mypy
urllib3==1.26.6
    # via requests
wcwidth==0.2.5
    # via prompt-toolkit
websocket-client==1.2.1
    # via docker
werkzeug==2.0.1
    # via flask
yachalk==0.1.4
    # via -r requirements.in
zipp==3.5.0
    # via importlib-resources

Benchmark script for reproduction:

# cleanup tmp venv if it exists
rm -rf ./tmp_venv

# re-create tmp venv
virtualenv ./tmp_venv

# activate tmp venv
. ./tmp_venv/bin/activate

# install requiremts
time pip install -r requirements.txt --no-deps -i https://pypi.python.org/simple

Starting with the second execution of the script, pip can fully rely on its local cache. But even with 100% cache hits, the time it takes to run all the Collecting ... Using cached ... operations takes ~42 seconds. The total run time is ~92 seconds, so 45% of the time is spend just collecting from the cache. This time seems excessive. The disk is a fast SSD and the total sum of data to be collected should be < 1GB. So in terms of I/O it should be possible to collect the artefacts much faster from an SSD based cache.

In terms of raw I/O loading the wheels from an SSD based cache should be in the order of few seconds. Thus, bringing down the collection time could speed up venv creation in many cases by almost a factor of 2. This could e.g. significantly speed up CI pipelines that require creation of multiple similar venvs (in fact, venv creation is becoming an increasing bottleneck in complex CI pipelines for us).

Used pip version: 21.2.4

Describe the solution you'd like

Perhaps it is possible to revisit why collecting artefacts from the cache is so slow.

Alternative Solutions

Alternative solutions probably doesn't apply in case of performance improvements.

Additional context

All information given above?

Code of Conduct

I agree to follow the PSF Code of Conduct.

The text was updated successfully, but these errors were encountered:

uranusjr · 2021-09-13T16:10:52Z

so 45% of the time is spend just collecting from the cache.

That sounds about right to me TBH. If you do --no-dep, the installation process is basically just loading cache to memory, and write files from memory to destination, so 45% cache reading, 45% writing files, plus 10% misc overhead is not unreasonable. The percentage does not really mean anything here, only absolute time does.

bluenote10 · 2021-09-13T18:17:46Z

That sounds about right to me TBH. If you do --no-dep, the installation process is basically just loading cache to memory, and write files from memory to destination

I was hoping that --no-deps is that lightweight. But looking at the absolute numbers, there seems to be a significant overhead:

I can cp -ar the entire virtual env directory in ~4 seconds (for 1.5 GB).
I can cp -ar the entire pip cache directory in ~6 seconds (for 3.4 GB).
The pip install takes ~92 seconds, which feels > 10 times slower than expected judging based on the necessary I/O.

I'm wondering if there are some low hanging fruits to reduce that overhead? Perhaps my feature request should have been rather: Introduce optimized/fast mode for fully pip-compiled requirement lists. Does that make more sense?

jbylund · 2021-09-16T12:48:45Z

@bluenote10 thanks for the clear reproduction instructions, makes it so much easier to look at something like this.

I took a profile on my machine and found that it was spending 25s on 92 calls of _get_html_response. I'm guessing this is necessary to verify that there hasn't been a newer version of a given package published & that a previous version hasn't been yanked.

On the one hand this is a good opportunity to thread the http calls, because they're all independent. But... right now criteria resolution happens on a sort of incremental basis and that will need to be moved a little.

I applied approximately this diff and tried again. Total time for commands.run down from ~95s to ~65s and time for resolver.resolve down from ~38s to ~9s. Now I'm thinking the easiest approach might be to "pre-warm" an http cache in parallel and then not change the structure here?

--- a/src/pip/_vendor/resolvelib/resolvers.py
+++ b/src/pip/_vendor/resolvelib/resolvers.py
@@ -1,5 +1,6 @@
 import collections
 import operator
+import multiprocessing.dummy
 
 from .providers import AbstractResolver
 from .structs import DirectedGraph, IteratorMapping, build_iter_view
@@ -336,11 +337,16 @@ class Resolution(object):
 
         # Initialize the root state.
         self._states = [State(mapping=collections.OrderedDict(), criteria={})]
-        for r in requirements:
+
+        def inner(r):
             try:
                 self._add_to_criteria(self.state.criteria, r, parent=None)
             except RequirementsConflicted as e:
                 raise ResolutionImpossible(e.criterion.information)
+            return r
+        with multiprocessing.dummy.Pool() as tp:
+            for r_out in tp.imap_unordered(inner, requirements):
+                pass
 
         # The root state is saved as a sentinel so the first ever pin can have

jbylund · 2021-09-16T18:30:14Z

* I can `cp -ar` the entire pip cache directory in ~6 seconds (for 3.4 GB).

* The pip install takes ~92 seconds, which feels > 10 times slower than expected judging based on the necessary I/O.

So in the process of taking a profile I think I have a bit of an idea of where that time is going (and it happens that timings on my machine are close enough to yours to be comparable, ~90s in total on main, ~5-6s to copy files...).

Of that 90s, ~40s were spent fetching pages from pypi, that can be parallelized which decreases the wall clock time, but it still takes ~10s. ~50s are spent on "install_wheel", of which only ~5-6s is spent in shutil.copyfileobj. A huge fraction of the remainder under install_wheel is spent in py_compile.

pradyunsg · 2021-09-16T18:36:45Z

Please share the requirements.txt that you're experimenting with to reproduce this.

jbylund · 2021-09-16T18:41:23Z

Please share the requirements.txt that you're experimenting with to reproduce this.

Should be in the first comment in this issue.

bluenote10 · 2021-09-16T21:32:32Z

Great findings, thanks a lot for looking into it!

Of that 90s, ~40s were spent fetching pages from pypi, that can be parallelized which decreases the wall clock time, but it still takes ~10s.

That sounds indeed like a low hanging fruit, with a very welcome speedup 👍

I'm wondering if it would be feasible to store the http request results themselves in the cache as well for an extra speed-up. For dependency==x.y.z it would be possible to look in the cached http file for a possible match. If none is found, the http cache is invalidated, running a real http request. However, unversioned dependencies or <, <=, >=, > operators probably must never use the cache. Also there can be subtle issues, e.g. if version x.y.z is first published as a source tarball, and only later as a wheel. If in this case the http request caches the state where only the tarball is available, it would continue to install version x.y.z from source instead of the preferred wheel. Not sure if there is a good solution to this problem, may cause too much headache.

~50s are spent on "install_wheel", of which only ~5-6s is spent in shutil.copyfileobj. A huge fraction of the remainder under install_wheel is spent in py_compile.

Does that refer to the compilation of .py files to .pyc? So perhaps it would make sense to have an --dont-compile-pyc option? This particular venv contains 31048 .pyc files totalling a size of 280 MB. In fact this would also help with another issue of virtual envs becoming very large, which can be an issue when trying to create small-ish docker images. If it is indeed the .pyc creation and it could be skipped, it would mean a massive speed-up with a ~20% size reduction in this example. EDIT: Stupid me, this option already exists, just under the name --no-compile (always thought it refers to e.g. Cython), and it indeed gives the expected speed-up.

If I do the math correctly, the parallel requests combined with skipping .pyc creation would bring the time down to ~15 sec, which would be awesome.

yurj · 2023-06-08T13:00:31Z

using importlib.metadata in Python 3.11 (pip 22.1 and above) has worsened the situation:

Case: pip install Plone

On Python 3.10: first time 1m26, second time 0m12
On Python 3.11: first time 2m18, second time 3m42

@mauritsvanrees found that:

bin/pip install -U 'pip<22.1' and a rerun on 3.11 takes only 9 seconds.
The changelog for 22.1 says (https://pip.pypa.io/en/stable/news/#v22-1):
"Enable the importlib.metadata metadata implementation by default on Python 3.11 (or later). The environment variable _PIP_USE_IMPORTLIB_METADATA can still be used to enable the implementation on 3.10 and earlier, or disable it on 3.11 (by setting it to 0 or false)."

export _PIP_USE_IMPORTLIB_METADATA=0 and retry on 3.11 with latest pip/setuptools/wheel, and it takes 12 seconds.
On 3.10 with export _PIP_USE_IMPORTLIB_METADATA=1, and it slows down a lot: a retry takes 5 minutes

jbylund · 2023-06-08T13:52:51Z

I haven't profiled what you describe yet, but it sounds like a separate issue with a similar symptom. If others agree I'd propose re-naming this issue to be about parallelizing network requests and we could make a separate issue for what sounds like metadata parsing related?

yurj · 2023-06-08T14:17:01Z

I haven't profiled what you describe yet, but it sounds like a separate issue with a similar symptom. If others agree I'd propose re-naming this issue to be about parallelizing network requests and we could make a separate issue for what sounds like metadata parsing related?

I agree.

mauritsvanrees · 2023-06-08T16:34:02Z

Separate issue created here: #12079

On the off chance that this affects the current issue as well, you could try my small PR:
#12080
It probably won't fix the current. But maybe it points in the direction of a solution.

bluenote10 added S: needs triage Issues/PRs that need to be triaged type: feature request Request for a new feature labels Sep 13, 2021

jbylund mentioned this issue Sep 16, 2021

Populate find_all_candidates cache from threadpool #10480

Open

ichard26 added type: performance Commands take too long to run and removed type: feature request Request for a new feature S: needs triage Issues/PRs that need to be triaged labels Apr 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collecting from cache is surprisingly slow #10467

Collecting from cache is surprisingly slow #10467

bluenote10 commented Sep 13, 2021

uranusjr commented Sep 13, 2021 •

edited

Loading

bluenote10 commented Sep 13, 2021

jbylund commented Sep 16, 2021 •

edited

Loading

jbylund commented Sep 16, 2021

pradyunsg commented Sep 16, 2021

jbylund commented Sep 16, 2021

bluenote10 commented Sep 16, 2021 •

edited

Loading

yurj commented Jun 8, 2023

jbylund commented Jun 8, 2023

yurj commented Jun 8, 2023

mauritsvanrees commented Jun 8, 2023

Collecting from cache is surprisingly slow #10467

Collecting from cache is surprisingly slow #10467

Comments

bluenote10 commented Sep 13, 2021

What's the problem this feature will solve?

Describe the solution you'd like

Alternative Solutions

Additional context

Code of Conduct

uranusjr commented Sep 13, 2021 • edited Loading

bluenote10 commented Sep 13, 2021

jbylund commented Sep 16, 2021 • edited Loading

jbylund commented Sep 16, 2021

pradyunsg commented Sep 16, 2021

jbylund commented Sep 16, 2021

bluenote10 commented Sep 16, 2021 • edited Loading

yurj commented Jun 8, 2023

jbylund commented Jun 8, 2023

yurj commented Jun 8, 2023

mauritsvanrees commented Jun 8, 2023

uranusjr commented Sep 13, 2021 •

edited

Loading

jbylund commented Sep 16, 2021 •

edited

Loading

bluenote10 commented Sep 16, 2021 •

edited

Loading