Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add prefetching of index in PEP503 repositories #5442

Closed

Conversation

tgolsson
Copy link
Contributor

@tgolsson tgolsson commented Apr 12, 2022

Pull Request Check List

Resolves: #4885, partially

  • Added tests for changed code.
  • Updated documentation for changed code.

Description

This PR adds a new keyword indexed when defining a LegacyRepository (e.g. PEP503) source. The indexed keyword enables the use of a prefetched and cached index, which will limit the amount of unnecessary calls.

Currently, if one configures a secondary repository for Poetry it'll get queried for all dependencies, no matter whether it's default, primary, or secondary. For projects with lots of dependencies (transitive or direct) this leads to a lot of unnecessary calls to a host which can't serve the requested package. One such case is using GPU-based packages from https://download.pytorch.org/whl/, where most other packages should be served from Pypi. This leads to user confusion due to error messages; and takes a lot of time.

While update-time on a cold cache is dominated by downloading every possible GPU package; this PR changes noop poetry update time from 30-40 seconds to <5 seconds. In total, this repository has 89 dependencies (reported by poetry show).

Also, it removes all errors from querying subpages that don't exist.

This depends on a PR to poetry-core, and thus a release there: python-poetry/poetry-core#323

Timing

Poetry==1.2.0b1

===> multitime results
1: poetry update
            Mean        Std.Dev.    Min         Median      Max
real        36.818      2.927       32.470      36.176      44.032
user        3.478       0.040       3.408       3.480       3.568
sys         0.126       0.030       0.075       0.123       0.181

This PR

===> multitime results
1: poetry update
            Mean        Std.Dev.    Min         Median      Max
real        4.617       0.211       4.410       4.588       5.165
user        3.362       0.162       3.235       3.301       3.813
sys         0.097       0.027       0.059       0.099       0.151

Copy link
Member

@abn abn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get similar results by applying an functools.lru_cache on _get_page instead?

@tgolsson
Copy link
Contributor Author

@abn Nope! So the problem here is that if you ask pytorch.org where to get foobar, it's going to throw an error back at you. No manner of caching is going to change that -- except maybe cross-session, and we'd still have to get the error once. This asks pytorch.org "what do you have" and then we only ask for those things.

@abn
Copy link
Member

abn commented Apr 12, 2022

@tgolsson I should really read the descriptions fully before looking at code 🤣

@tgolsson
Copy link
Contributor Author

No worries! For some context for why I landed on this solution:

I did investigate alternative methods such as "explicit" sources, not querying secondary sources by default, etc. However, changing how sources are combined felt like a larger/potentially breaking or verbose change compared to relying on PEP503 behaviours. I do think it's weird that default/primary/secondary sources seemingly are mostly treated the same way, but again, anything changing that is a breaking change.

This is 100% backwards compatible, a pure speed-up, and purely opt-in. The only way to get "worse" results than today is to opt-in to indexing for a repository that doesn't have an index.

@lovesegfault
Copy link
Contributor

This is awesome!

@tgolsson
Copy link
Contributor Author

tgolsson commented May 3, 2022

@abn What's the release-process for poetry-core so the dependency constraint can be updated here?

@abn
Copy link
Member

abn commented May 4, 2022

@tgolsson I am aiming for a new core release this week; so once that lands we can rebase this.

@abn
Copy link
Member

abn commented May 24, 2022

This can now be rebased.

@tgolsson tgolsson force-pushed the ts/prefetch-legacy-repository branch from 0118da1 to 1437a69 Compare May 26, 2022 18:45
@tgolsson
Copy link
Contributor Author

@abn Rebased! I'm going to need to check how the docs render since that had changed quite a bit.

@abn abn added the area/docs Documentation issues/improvements label May 26, 2022
@github-actions
Copy link

github-actions bot commented May 26, 2022

Deploy preview for website ready!

✅ Preview
https://website-ds65cso7d-python-poetry.vercel.app

Built with commit a0c1846.
This pull request is being automatically deployed with vercel-action

@tgolsson
Copy link
Contributor Author

Feels like a natural flow:

image

@tgolsson
Copy link
Contributor Author

If anyone watching this PR has a known use-case for indexing; I'd love to know if it works for you! I've tested against torch and pypi, plus of course unit tests -- but I'm sure there are other cases out there that have... interesting configurations that I'm not dealing with correctly.

tgolsson and others added 2 commits May 26, 2022 21:07
The new indexed keyword introduced for pre-fetching legacy repositories was not known the Source class.
@Jinior
Copy link

Jinior commented Jun 3, 2022

I just tried out your branch and it is working great. I can finally install PyTorch using poetry. Thank you for the work.

I do however get an error when I try to add a repository to pyproject.toml using poetry source add if the pyproject.toml contains the indexed keyword.

I made a pull request into tgolsson:ts/prefetch-legacy-repository. As far as I can see that fixed the problem.

Error

Command ran: poetry source add mmcv https://download.openmmlab.com/mmcv/dist/cu115/torch1.11.0/index.html

TypeError

  Source.__init__() got an unexpected keyword argument 'indexed'

  at src/poetry/poetry.py:78 in <listcomp>
       74│         return self
       75│
       76│     def get_sources(self) -> list[Source]:
       77│         return [
    →  78│             Source(**source)
       79│             for source in self.pyproject.poetry_config.get("source", [])
       80│         ]
       81│

pyproject.toml

[tool.poetry]
name = "test"
version = "0.1.0"
description = ""
authors = ["[email protected]"]
readme = "README.md"

[tool.poetry.dependencies]
python = "~3.8"
numpy = "^1.22"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu115/"
default = false
secondary = true
indexed = true

docs/repositories.md Outdated Show resolved Hide resolved
@tgolsson
Copy link
Contributor Author

tgolsson commented Jun 3, 2022

Thanks @neersighted for the feedback, and @Jinior for testing and the PR!

@tgolsson
Copy link
Contributor Author

tgolsson commented Jun 7, 2022

@abn / @neersighted I'd missed the follow up to the convo but went ahead and did some restructuring; which I do think made it clearer! I don't think it's exactly what you asked for with a "SimpleIndexedRepositoryPage", though -- but I'm not sure I see how that would work.

@tgolsson
Copy link
Contributor Author

Please let me know if there's anything I can do to get this merged...

@vikigenius
Copy link

@tgolsson what do you mean when you say that this only partially resolves #4885

This should solve it right since you say

Also, it removes all errors from querying subpages that don't exist.

@tgolsson
Copy link
Contributor Author

@vikigenius It does solve it if users opt in. So for some users it's a perfect fix but otherwise it has no effect.

@strangemonad
Copy link

@tgolsson is this branch pinning the correct version of poetry core? I tried to run this against our private gitlab package registry and ran into the following

TypeError

  Link.__init__() got multiple values for argument 'requires_python'

  at ~/.local/pipx/venvs/poetry@pr5442/lib/python3.10/site-packages/poetry/repositories/link_sources/html.py:37 in links
       33│                 href = anchor.get("href")
       34│                 url = self.clean_link(urllib.parse.urljoin(self._url, href))
       35│                 pyrequire = anchor.get("data-requires-python")
       36│                 pyrequire = unescape(pyrequire) if pyrequire else None
    →  37│                 link = Link(url, self, requires_python=pyrequire)
       38│
       39│                 if link.ext not in self.SUPPORTED_FORMATS:
       40│                     continue

see the second argument, self, to Link.__init__ which clashes with requires_python? pipx installed poetry core 1.1.0b3 for me.

when I try to run with this branch without setting index=true I get a stack trace when poetry gets a 404 from pypi for a private package.

@neersighted neersighted added area/solver Related to the dependency resolver impact/changelog Requires a changelog entry labels Sep 5, 2022
@neersighted neersighted modified the milestones: 1.2, 1.3 Sep 5, 2022
@Secrus Secrus modified the milestones: 1.3, 1.4 Dec 12, 2022
@tgolsson
Copy link
Contributor Author

I'm abandoning this PR as we've adopted PDM instead :-)

@tgolsson tgolsson closed this Feb 13, 2023
@tgolsson tgolsson deleted the ts/prefetch-legacy-repository branch February 13, 2023 09:27
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/docs Documentation issues/improvements area/solver Related to the dependency resolver impact/changelog Requires a changelog entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

403 errors on custom repositories
8 participants