Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trusted publishing: possible race condition related to GH repo move #16843

Open
webknjaz opened this issue Oct 7, 2024 · 10 comments
Open

Trusted publishing: possible race condition related to GH repo move #16843

webknjaz opened this issue Oct 7, 2024 · 10 comments

Comments

@webknjaz
Copy link
Member

webknjaz commented Oct 7, 2024

Describe the bug

So there was a repository initially created at https://github.com/Bluetooth-Devices/propcache with trusted publishing configured and working. We then moved it to https://github.com/aio-libs/propcache, and so I had to create a new trust link. The project also got moved to the PyPI org before the trusted publishing reconfiguration.

What happened was that after these moves and reconfig, the CI pipeline that worked before stopped being able to publish.

I strongly suspect a race condition somewhere in the logic for adding a new trusted publisher.
So let me start with the timeline:

Timestamp (UTC+2) PyPI Event GitHub Event
06 Oct 2024, 02:19:24 🔄 Project added to aio-libs organization
06 Oct 2024, 02:19:24 (unrelated) bdraco removed as project owner
06 Oct 2024, 02:19:25 🔄 The GH repo moved from Bluetooth-Devices (ID 109550163) to aio-libs (ID 7049303), the time is retrieved from the email notification I got from GitHub regarding this
06 Oct 2024, 03:01:33 Trusted publisher added
Publisher: GitHub
URL: https://github.com/aio-libs/propcache
Workflow: ci-cd.yml
06 Oct 2024, 03:01:38 Trusted publisher removed
Publisher: GitHub
URL: https://github.com/Bluetooth-Devices/propcache
Workflow: ci-cd.yml
06 Oct, 2024, 03:02:30 (unrelated) bdraco added as project maintainer
06 Oct 2024 04:19:17 ❌ An upload attempt started, it's the first attempt to publish post-migration (https://github.com/aio-libs/propcache/actions/runs/11197845167/job/31128986852#step:10:25)
06 Oct 2024 04:19:18 ❌ The upload attempt failed, claiming the trust link couldn't be found (https://github.com/aio-libs/propcache/actions/runs/11197845167/job/31128986852#step:10:26)
07 Oct 2024 14:19:55 ❌ An upload attempt started, it's the first run of the job within the workflow run (https://github.com/aio-libs/propcache/actions/runs/11214274927/job/31172158971#step:10:25)
07 Oct 2024 14:19:56 ❌ The upload attempt failed, claiming the trust link couldn't be found (https://github.com/aio-libs/propcache/actions/runs/11197845167/job/31128986852#step:10:26)
07 Oct 2024, 14:46:35 Trusted publisher removed
Publisher: GitHub
URL: https://github.com/aio-libs/propcache
Workflow: ci-cd.yml
07 Oct 2024, 14:47:17 Trusted publisher added
Publisher: GitHub
URL: https://github.com/aio-libs/propcache
Workflow: ci-cd.yml
07 Oct 2024 14:53:21 ✅ An upload attempt started, it's the second rerun of the job within the workflow run (https://github.com/aio-libs/propcache/actions/runs/11214274927/job/31173940770#step:10:25)
07 Oct 2024, 14:53:23 Short-lived API token created
07 Oct 2024 14:56:37 ✅ Publishing succeeded to upload all the dists (https://github.com/aio-libs/propcache/actions/runs/11214274927/job/31173940770#step:10:706)

Note

The log does not show this, but in all cased the GitHub Environment was set to pypi.
Perhaps, this should be improved.

Both failures look like this (except for the tag number, which is irrelevant):

Error: Trusted publishing exchange failure: 
Token request failed: the server refused the request for the following reasons:

* `invalid-publisher`: valid token, but no corresponding publisher (All lookup strategies exhausted)

This generally indicates a trusted publisher configuration error, but could
also indicate an internal error on GitHub or PyPI's part.


The claims rendered below are **for debugging purposes only**. You should **not**
use them to configure a trusted publisher unless they already match your expectations.

If a claim is not present in the claim set, then it is rendered as `MISSING`.

* `sub`: `repo:aio-libs/propcache:environment:pypi`
* `repository`: `aio-libs/propcache`
* `repository_owner`: `aio-libs`
* `repository_owner_id`: `7049303`
* `job_workflow_ref`: `aio-libs/propcache/.github/workflows/ci-cd.yml@refs/tags/v0.2.0`
* `ref`: `refs/tags/v0.2.0`

See https://docs.pypi.org/trusted-publishers/troubleshooting/ for more help.

The trusted publisher configs show up exactly the same before and after re-creation.

My observation is that when I was reconfiguring the trust, I first created the new trust link, and only then deleted the old one. The suspicion is that the old link's properties got somehow transferred to the new trust config, causing PyPI to still expect the old GH repo URL.

But later, when I deleted the trusted publishing right away and created a new one, those properties weren't inherited, and it became possible to look it up with what OIDC includes.

Can this be related to the resurrection prevention logic? Could you check this @woodruffw? Also, perhaps you @di could find anything in the platform logs that isn't surfaced in the common UI?

The above explanation is the only idea I came up with. And if that's the case, this is a bug.
I'm starting to think that something like this might be happening when several GH repos are configured in the same project. This is something to check. But beyond this, I don't know how to narrow this down to a smaller repro...

Expected behavior

Migrating projects, followed by proper reconfiguration, shouldn't let the trust end up in an inconsistent state.

To Reproduce

🤷‍♂️ See the ideas above.

My Platform

GHA

Additional context

A few CI jobs are linked in the table above.

@webknjaz webknjaz added requires triaging maintainers need to do initial inspection of issue bug 🐛 trusted-publishing labels Oct 7, 2024
@woodruffw
Copy link
Member

Thanks for the detailed timeline @webknjaz!

The suspicion is that the old link's properties got somehow transferred to the new trust config, causing PyPI to still expect the old GH repo URL.

Hmm, this part seems unlikely: there's no logic anywhere in Warehouse for transferring settings between different Trusted Publishers. That includes internal IDs, i.e. there's no copying of the user or repo IDs between publishers connected to the same project.

I'll double check the deletion logic, though, and see if there's anything funky there.

Another potential candidate here is inconsistency on GitHub's side during repo migrations: I'd be very surprised if it was the case, but it's possible that the initial TP registration for aio-libs/propcache was given a different repository_owner_id, which only became 7049303 (i.e. aio-libs) at some later point. I'll try and ping some GH folks to see if they have details/info on owner ID consistency/atomicity during migrations.

@woodruffw
Copy link
Member

I'll double check the deletion logic, though, and see if there's anything funky there.

Just checked, and we always delete trusted publishers by unique ID, meaning that there should be no confusion/cross-contamination between publishers at that stage:

@view_config(
request_method="POST",
request_param=DeletePublisherForm.__params__,
)
def delete_oidc_publisher(self):
if self.request.flags.disallow_oidc():
self.request.session.flash(
(
"Trusted publishing is temporarily disabled. "
"See https://pypi.org/help#admin-intervention for details."
),
queue="error",
)
return self.default_response
self.metrics.increment("warehouse.oidc.delete_publisher.attempt")
form = DeletePublisherForm(self.request.POST)
if form.validate():
publisher = self.request.db.get(OIDCPublisher, form.publisher_id.data)
# publisher will be `None` here if someone manually futzes with the form.
if publisher is None or publisher not in self.project.oidc_publishers:
self.request.session.flash(
"Invalid publisher for project",
queue="error",
)
return self.default_response
for user in self.project.users:
send_trusted_publisher_removed_email(
self.request,
user,
project_name=self.project.name,
publisher=publisher,
)
self.project.record_event(
tag=EventTag.Project.OIDCPublisherRemoved,
request=self.request,
additional={
"publisher": publisher.publisher_name,
"id": str(publisher.id),
"specifier": str(publisher),
"url": publisher.publisher_url(),
"submitted_by": self.request.user.username,
},
)
# We remove this publisher from the project's list of publishers
# and, if there are no projects left associated with the publisher,
# we delete it entirely.
self.project.oidc_publishers.remove(publisher)
if len(publisher.projects) == 0:
self.request.db.delete(publisher)
self.request.session.flash(
self.request._(
f"Removed trusted publisher for project {self.project.name!r}"
),
queue="success",
)
self.metrics.increment(
"warehouse.oidc.delete_publisher.ok",
tags=[f"publisher:{publisher.publisher_name}"],
)
return HTTPSeeOther(self.request.path)
return self.default_response

Could you check the project's event log on PyPI as well? I'm curious if it has any other hints/possible discrepancies.

@webknjaz
Copy link
Member Author

webknjaz commented Oct 7, 2024

Another potential candidate here is inconsistency on GitHub's side during repo migrations: I'd be very surprised if it was the case

That's a good point. I wouldn't be surprised, OTOH, since we're regularly observing GitHub's eventual consistency causing surprising effects in large projects like Ansible.

but it's possible that the initial TP registration for aio-libs/propcache was given a different repository_owner_id, which only became 7049303 (i.e. aio-libs) at some later point. I'll try and ping some GH folks to see if they have details/info on owner ID consistency/atomicity during migrations.

If the owner ID is important to check, perhaps Warehouse could detect such cases and send email notifications to the project owners when such changes happen? And maybe reflect that in the security log?

@webknjaz
Copy link
Member Author

webknjaz commented Oct 7, 2024

Could you check the project's event log on PyPI as well? I'm curious if it has any other hints/possible discrepancies.

I was starting at that to compose the table posted above. I didn't include events of users being removed/re-added because of the project to org migration, since TP is bound to projects, not accounts. The last event preceding the move to the PyPI org was the successful sdist upload via TP 3 days prior to the move. And right after the move, there's removal of @bdraco (due to the #13558 bug). And re-adding him as a maintainer after the first TP config attempt (so he'd still show up on the public project page).

I extended the table with the GH repo move time.

@webknjaz
Copy link
Member Author

webknjaz commented Oct 7, 2024

Just checked, and we always delete trusted publishers by unique ID, meaning that there should be no confusion/cross-contamination between publishers at that stage:

@view_config(
request_method="POST",
request_param=DeletePublisherForm.__params__,
)
def delete_oidc_publisher(self):
if self.request.flags.disallow_oidc():
self.request.session.flash(
(
"Trusted publishing is temporarily disabled. "
"See https://pypi.org/help#admin-intervention for details."
),
queue="error",
)
return self.default_response
self.metrics.increment("warehouse.oidc.delete_publisher.attempt")
form = DeletePublisherForm(self.request.POST)
if form.validate():
publisher = self.request.db.get(OIDCPublisher, form.publisher_id.data)
# publisher will be `None` here if someone manually futzes with the form.
if publisher is None or publisher not in self.project.oidc_publishers:
self.request.session.flash(
"Invalid publisher for project",
queue="error",
)
return self.default_response
for user in self.project.users:
send_trusted_publisher_removed_email(
self.request,
user,
project_name=self.project.name,
publisher=publisher,
)
self.project.record_event(
tag=EventTag.Project.OIDCPublisherRemoved,
request=self.request,
additional={
"publisher": publisher.publisher_name,
"id": str(publisher.id),
"specifier": str(publisher),
"url": publisher.publisher_url(),
"submitted_by": self.request.user.username,
},
)
# We remove this publisher from the project's list of publishers
# and, if there are no projects left associated with the publisher,
# we delete it entirely.
self.project.oidc_publishers.remove(publisher)
if len(publisher.projects) == 0:
self.request.db.delete(publisher)
self.request.session.flash(
self.request._(
f"Removed trusted publisher for project {self.project.name!r}"
),
queue="success",
)
self.metrics.increment(
"warehouse.oidc.delete_publisher.ok",
tags=[f"publisher:{publisher.publisher_name}"],
)
return HTTPSeeOther(self.request.path)
return self.default_response

That's the deletion code, though, my concern was about adding a new one while the previous TP exists:

@view_config(
request_method="POST",
request_param=GitHubPublisherForm.__params__,
)
def add_github_oidc_publisher(self):
if self.request.flags.disallow_oidc(AdminFlagValue.DISALLOW_GITHUB_OIDC):
self.request.session.flash(
self.request._(
"GitHub-based trusted publishing is temporarily disabled. "
"See https://pypi.org/help#admin-intervention for details."
),
queue="error",
)
return self.default_response
self.metrics.increment(
"warehouse.oidc.add_publisher.attempt", tags=["publisher:GitHub"]
)
try:
self._check_ratelimits()
except TooManyOIDCRegistrations as exc:
self.metrics.increment(
"warehouse.oidc.add_publisher.ratelimited", tags=["publisher:GitHub"]
)
return HTTPTooManyRequests(
self.request._(
"There have been too many attempted trusted publisher "
"registrations. Try again later."
),
retry_after=exc.resets_in.total_seconds(),
)
self._hit_ratelimits()
response = self.default_response
form = response["github_publisher_form"]
if not form.validate():
self.request.session.flash(
self.request._("The trusted publisher could not be registered"),
queue="error",
)
return response
# GitHub OIDC publishers are unique on the tuple of
# (repository_name, repository_owner, workflow_filename, environment),
# so we check for an already registered one before creating.
publisher = (
self.request.db.query(GitHubPublisher)
.filter(
GitHubPublisher.repository_name == form.repository.data,
GitHubPublisher.repository_owner == form.normalized_owner,
GitHubPublisher.workflow_filename == form.workflow_filename.data,
GitHubPublisher.environment == form.normalized_environment,
)
.one_or_none()
)
if publisher is None:
publisher = GitHubPublisher(
repository_name=form.repository.data,
repository_owner=form.normalized_owner,
repository_owner_id=form.owner_id,
workflow_filename=form.workflow_filename.data,
environment=form.normalized_environment,
)
self.request.db.add(publisher)
# Each project has a unique set of OIDC publishers; the same
# publisher can't be registered to the project more than once.
if publisher in self.project.oidc_publishers:
self.request.session.flash(
self.request._(
f"{publisher} is already registered with {self.project.name}"
),
queue="error",
)
return response
for user in self.project.users:
send_trusted_publisher_added_email(
self.request,
user,
project_name=self.project.name,
publisher=publisher,
)
self.project.oidc_publishers.append(publisher)
self.project.record_event(
tag=EventTag.Project.OIDCPublisherAdded,
request=self.request,
additional={
"publisher": publisher.publisher_name,
"id": str(publisher.id),
"specifier": str(publisher),
"url": publisher.publisher_url(),
"submitted_by": self.request.user.username,
},
)
self.request.session.flash(
f"Added {publisher} in {publisher.publisher_url()} to {self.project.name}",
queue="success",
)
self.metrics.increment(
"warehouse.oidc.add_publisher.ok", tags=["publisher:GitHub"]
)
return HTTPSeeOther(self.request.path)

@webknjaz
Copy link
Member Author

webknjaz commented Oct 7, 2024

Maybe, it'd be helpful to log

return response.json()
.

@webknjaz
Copy link
Member Author

webknjaz commented Oct 7, 2024

Or extend

self.project.record_event(
tag=EventTag.Project.OIDCPublisherAdded,
request=self.request,
additional={
"publisher": publisher.publisher_name,
"id": str(publisher.id),
"specifier": str(publisher),
"url": publisher.publisher_url(),
"submitted_by": self.request.user.username,
},
to mention org+repo ids.

@woodruffw
Copy link
Member

If the owner ID is important to check, perhaps Warehouse could detect such cases and send email notifications to the project owners when such changes happen? And maybe reflect that in the security log?

I'll think about it some more, but I think this will be difficult with the current architecture: lookups are currently abstracted/have a uniform interface across different TP providers, so we'd need to break that interface or add some kind of reporting knob to it in the GitHub case. But that may not be as bad as I'm thinking, especially if we don't do the reporting at the user level but instead at the Sentry level (since these should be very exceptional cases anyways).

to mention org+repo ids.

Yeah, logging those in their associated event makes sense to me -- I'll think a bit about how to best do that in a way that doesn't break the genericity of the current events.

@miketheman miketheman removed the requires triaging maintainers need to do initial inspection of issue label Oct 11, 2024
@woodruffw
Copy link
Member

I spent a bit more time trying to figure this out, and my best theory so far is that GitHub took the 'eventual' in 'eventually consistent' a little too literally here, and continued returning the old owner ID in their API ~40 minutes after the repo transfer. That's pretty much the only thing that could explain how the inconsistency happened here without any of the user-controlled TP configuration changing.

I'll look into adding some Sentry or similar alerting, to see if we can catch it happening again.

@webknjaz
Copy link
Member Author

I would expect that your guess has high probability of being true, based on my experience with GitHub API, Apps, platform in general. In big repos, we observe inconsist states of GitHub that don't clear for weeks sometimes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants