Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package source constraints are silently ignored when extras and transitive dependencies are involved #8614

Closed
4 tasks done
latk opened this issue Nov 2, 2023 · 3 comments · Fixed by #8835
Closed
4 tasks done
Labels
area/solver Related to the dependency resolver kind/bug Something isn't working as expected

Comments

@latk
Copy link

latk commented Nov 2, 2023

  • I am on the latest stable Poetry version, installed using a recommended method.
  • I have searched the issues of this repo and believe that this is not a duplicate.
  • I have consulted the FAQ and blog for any relevant entries or release notes.
  • If an exception occurs when executing a command, I executed it again in debug mode (-vvv option) and have included the output below.

Issue

We use Poetry to manage multiple private packages, and have recently started migrating to setting our sources to priority = "explicit" as encouraged by the Poetry docs. This has been working fine, until Poetry silently stopped installing one of our internal dependencies. I have not been able to pinpoint the issue in Poetry's puzzle/solver, but it seems this bug depends on three factors:

  • using sources with priority = "explicit" (but otherwise leaving the default PyPI source)
  • having transitive dependencies to a package that only exists in the private source
  • having these dependencies differ in required "extras"

I don't know of a way to quickly spin up a private package repository so I wasn't able to create a demo/test case, but I will walk us through all relevant details of the scenario:

We have three packages:

  • private-app depends on private-other and private-lib[extra]
  • private-other depends on private-lib (no extras!)
  • private-lib is a library with extras. This is the library that goes missing during the solving process.

The private-app pyproject.toml has the following dependency specification:

[[tool.poetry.source]]
name = "myrepo"
url = "https://myrepo.example/"
priority = "explicit"

[tool.poetry.dependencies]
python = "~3.11"
private-lib = { version = "^5", source = "myrepo", extras = ["extra"] }
private-other = { version = "^1.6.3", source = "myrepo" }

This worked fine while the source was configured as secondary = true, but setting it to priority = "explicit" and adding the source constraints on the deps started creating breakage.

While trying to debug this without having created a venv with poetry install, Poetry instead reported an error that no matching version for private-lib could be found. Unfortunately I can no longer repro because we have already mitigated the problem, and this seems depends on the unavailability of a suitable version in the private package sources? Sorry.

When running poetry update or poetry lock on an existing poetry-installed venv, things would "succeed", but create a poetry.lock that contained the following entry for the private-lib:

[[package]]
name = "private-lib"
version = "5.17.0"
description = ""
optional = false
python-versions = "*"
files = []

Nothing other than the name has been redacted here. There is no source, and files are empty. So Poetry knows about the dependency, but finds no installation candidates, and thus silently skips installation. This then broke our CI systems when trying to install dependencies from this lockfile.

I have captured a log for this session poetry -vvv update, minus all PyPI-based dependencies as they distract from the bug: https://gist.github.com/latk/958e918b2ce4d96956a901263d38e955#file-log-txt

Here is an excerpt of that already-condensed log highlighting how the private-lib is resolved:

   1: fact: private-app depends on private-lib (^5)
   1: derived: private-lib[extra] (>=5,<6)
Source (myrepo): 31 packages found for private-lib >=5,<6
   1: fact: private-other1 (1.6.3) depends on private-lib (*)
   1: selecting private-other1 (1.6.3)
   1: derived: private-lib
Source (PyPI): No packages found for private-lib
Source (PyPI): 0 packages found for private-lib *
Falling back to installed packages to discover metadata for private-lib
Found 1 compatible packages for private-lib
Source (PyPI): Getting info for private-lib (5.17.0) from PyPI
Falling back to installed packages to discover metadata for private-lib
Found 1 compatible packages for private-lib
   1: selecting private-lib (5.17.0)
   1: fact: private-lib (5.17.0) depends on private-lib (5.17.0)
   1: selecting private-lib[extra] (5.17.0)

Finding the necessary packages for the current system
Source (myrepo): 1 packages found for private-other1 >=1.6.3,<2.0.0
Source (myrepo): 1 packages found for private-other2 >=2.1.1,<3.0.0
Source (myrepo): 1 packages found for private-lib >=5,<6
Source (PyPI): 1 packages found for private-lib *
Source (myrepo): 1 packages found for private-lib *

No dependencies to install or update

Writing lock file

What we are seeing here is that when the private-other1 dependencies are resolved, the private-lib is looked up in PyPI rather than through its explicit source. Nothing is found there. Thus, Poetry falls back to looking at locally installed packages. Since the venv already exists, it is found, albeit without any installation candidates (since it's already installed). At the end, one private-lib package was found each for myrepo (correct) and PyPI (incorrect, this is the installed package). When writing the lockfile, it seems that the PyPI/local info is used, thus leading to the empty files = [] array.

Regardless of the problematic poetry.lock file being generated, this violates the expectations about top-level dependencies with a source option: if I specify a source, then I'm expecting that metadata about that package will only ever be retrieved from that source. That this logfile shows PyPI being contacted is arguably a security issue (but not actively exploitable, so whatever…).

Attempted workarounds

I have found some techniques to mitigate the observed problems, which might aid with debugging.

  • Explicitly updating that dependency via poetry update private-lib. This leads to a working poetry.lock file, but things will break again with the next poetry update.

  • Setting the private repository to priority = "supplemental". This "works" because when the PyPI lookup for private-lib fails, myrepo will be checked next, finding the correct package data. This avoids the problematic fallback to the installed version of the package. However, this would only mask the bug, and despite the explicit source = "myrepo" would not protect against dependency confusion attacks!

  • Stop using the extras in Python packages. This is what we are doing, at least until this Poetry bug is fixed.

  • Include the top-level dependency on the private-lib both with and without extras. I believe (see discussion below) that this primes Poetry's dependency solver cache to avoid the PyPI lookup.

    [tool.poetry.dependencies]
    python = "~3.11"
    private-lib = [
      { version = "^5", source = "myrepo" },
      { version = "^5", source = "myrepo", extras = ["extra"] },
    ]

    This doesn't seem to depend on the order of entries.

Interpretation and preliminary analysis

I've been poking a bit at the Poetry source code and now have a basic understanding of Poetry's dependency solver. I think the following is happening, but am not entirely sure:

When a dependency has extras, a virtual dependency is created which inherits the correct source:

# When adding dependency foo[extra] -> foo, preserve foo's source, if it's
# specified. This prevents us from trying to get foo from PyPI
# when user explicitly set repo for foo[extra].
if not new_dependency.source_name and dependency.source_name:
new_dependency.source_name = dependency.source_name

Thus, we actually have the following facts in our dependency graph:

  1. private-app depends on private-other with source = "myrepo"
  2. private-app depends on private-lib[extra] with source = "myrepo"
  3. private-other depends on private-lib
  4. private-lib[extra] depends on private-lib with source = "myrepo"

The 3rd fact cannot have a source, because sources are purely a Poetry feature, and not part of Python package metadata.

Instead, the source for a transitive dependency can be set by specifying it as a top-level dependency in the pyproject.toml. Later, dependency resolutions are resolved by name in the dependency cache. I think that happens here:

packages = self._dependency_cache.search_for(
dependency, self._solution.decision_level
)

def search_for(
self,
dependency: Dependency,
decision_level: int,
) -> list[DependencyPackage]:
key = (
dependency.complete_name,
dependency.source_type,
dependency.source_url,
dependency.source_reference,
dependency.source_subdirectory,
)
packages = self._search_for_cached(dependency, key)
if not self._cache[key] or self._cache[key][-1] is not packages:
self._cache[key].append(packages)
self._cached_dependencies_by_level[decision_level].append(key)
return packages

Since the complete_name is used as the key, private-lib[extra] has a different cache entry from private-lib, but that is of course necessary.

What is not obvious is that the contents of the cache are sensitive to the order in which dependencies are resolved:

  • If the above fact 3 is resolved before fact 4, we will see the problem explained in this issue, as the private-lib gets wrongly associated with the PyPI source (and this is inherited while resolving fact 4).
  • If fact 4 were to be resolved first, then everything would work and fact 3 would inherit the correct source. This is what I'm achieving in one of the workarounds by specifying both private-lib and private-lib[extra] in the Poetry dependencies.

I'm not sure how this could be solved. Maybe:

  • when a virtual dep[extras] -> dep dependency is created, ensure that dep is resolved immediately to prime the cache with the correct source. However, this seems fragile, and wouldn't be a systematic solution to this kind of problem.

  • keep an explicit name -> source mapping, which is used to look up the appropriate source for transitive dependencies. However, this would fail for configurations with constraints like:

    some-dep = [
      { ..., source = "repo-a", python = ">=3.6,<3.8" },
      { ..., source = "repo-b", python = ">=3.8" },
    ]
  • cache package data at the correct place. Currently, information about dependencies is cached in two places during resolution. First, the aforementioned DependencyCache. This will then eventually load data from the RepositoryPool, which will load data from the correct repository (in particular, the explicit myrepo if the source name is set to such):

    def find_packages(self, dependency: Dependency) -> list[Package]:
    repository_name = dependency.source_name
    if repository_name:
    return self.repository(repository_name).find_packages(dependency)
    packages: list[Package] = []
    for repo in self.repositories:
    if packages and self.get_priority(repo.name) is Priority.SUPPLEMENTAL:
    break
    packages += repo.find_packages(dependency)
    return packages

    Each repository in the pool might have its own cache (though this doesn't seem to prevent the repeated failing PyPI lookups visible in the logfile…).

    It seems to me that this problem might disappear if there were a cache around the RepositoryPool? After all, the repositories do not care about extras, so a top-level dependency on private-lib[extra] with source = "myrepo" would prime a repository pool cache for any subsequent private-lib lookup.

@latk latk added kind/bug Something isn't working as expected status/triage This issue needs to be triaged labels Nov 2, 2023
@lucemia
Copy link
Contributor

lucemia commented Dec 18, 2023

I've encountered the same issue and I'm actively working on finding a solution. @latk's suggestion of implementing caching within the RepositoryPool could help, but relying solely on execution order may not ensure safety.

In my view, a more robust approach to addressing this problem would involve allowing the 'local_config' to be passed to the 'RepositoryPool.' This way, the RepositoryPool can be informed about which packages have a specific source, as explicitly specified in the pyproject.toml.

self._locker = locker
self._config = config
self._pool = RepositoryPool(config=config)
self._plugin_manager: PluginManager | None = None
self._disable_cache = disable_cache

@radoering
Copy link
Member

Since the complete_name is used as the key, private-lib[extra] has a different cache entry from private-lib, but that is of course necessary.

It is kind of necessary, but actually I think it is not really necessary (at least not the way we do it now) because if the only difference between private-lib[extra] and private-lib is the extra, then both will result in the same packages.

when a virtual dep[extras] -> dep dependency is created, ensure that dep is resolved immediately to prime the cache with the correct source. However, this seems fragile, and wouldn't be a systematic solution to this kind of problem.

It's not necessary to resolve it but only to cache it immediately after. However, that's still not enough in case the transitive dependency requires a (different) extra so actually we have to make sure that the cache is more robust concerning extras.

In #8835 I adapted the dependency cache so that packages of dependencies are cached independent of requested extras. That should be quite robust.

keep an explicit name -> source mapping, which is used to look up the appropriate source for transitive dependencies. However, this would fail for configurations with constraints like:

Multiple constraints with different sources seems to be a quite strong argument against the mapping approach in the repository pool.

Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/solver Related to the dependency resolver kind/bug Something isn't working as expected
Projects
None yet
3 participants