move remaining state to db, allow multiple build servers #1785

Merged: syphar merged 9 commits into rust-lang:master on Sep 4, 2022

Conversation

syphar (Member) commented Jul 30, 2022

This is a possible approach for #795, without #1011, but it includes the possibility of running multiple build servers.

It contains:

  • moving the queue lock into the database (see the sketch below the description)
  • moving the last-seen registry reference into the database, so we can recreate the registry watcher from scratch without any issues
  • an argument to run a registry watcher via start-registry-watcher; with multiple build servers we should still run only one registry watcher
  • the repository-stats-updater can be run as a scheduled ECS job (via database update-repository-fields), or optionally in the registry-watcher process (via --repository-stats-updater=enabled)
  • a separate build server (via start-build-server) which can be run multiple times; a queued crate will only be picked up by one of the servers

We can already run as many separate web servers as we want via start-web-server, of course with the proper connection-pool limits.
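For illustration, a minimal sketch of what the first two items can look like with a simple key/value table, using the synchronous postgres crate; the config table, its columns, and the key names are assumptions for this sketch, not necessarily the schema the PR introduces:

```rust
use postgres::Client;

/// Hypothetical key/value storage for the small bits of daemon state this
/// PR moves into the database (sketch only; table and key names are assumptions).
fn set_config(conn: &mut Client, name: &str, value: &str) -> Result<(), postgres::Error> {
    conn.execute(
        "INSERT INTO config (name, value) VALUES ($1, $2)
         ON CONFLICT (name) DO UPDATE SET value = EXCLUDED.value",
        &[&name, &value],
    )?;
    Ok(())
}

fn get_config(conn: &mut Client, name: &str) -> Result<Option<String>, postgres::Error> {
    Ok(conn
        .query_opt("SELECT value FROM config WHERE name = $1", &[&name])?
        .map(|row| row.get(0)))
}

/// The queue lock then becomes a database write that every build server
/// and web server sees, instead of per-process state.
fn lock_queue(conn: &mut Client) -> Result<(), postgres::Error> {
    set_config(conn, "queue_locked", "true")
}

/// The last seen registry reference survives recreating the watcher,
/// because it is read back from the database on startup.
fn set_last_seen_reference(conn: &mut Client, oid: &str) -> Result<(), postgres::Error> {
    set_config(conn, "last_seen_index_reference", oid)
}
```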

testing

Since test coverage for the build process and the queue is limited, I might have missed edge cases. I'm happy to fix any issues that come up.

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Jul 30, 2022
syphar (Member, Author) commented Jul 30, 2022

@jyn514 @Nemo157 it would be awesome to get feedback on this

Nemo157 (Member) commented Jul 30, 2022

At first skim it all looks great to me. I'll take a deeper look again later, and I'll have to read the postgres docs around the locking.

jyn514 (Member) left a comment

This is awesome ❤️ thank you!!

```rust
Err(err) => {
    log::error!(
        "queue locked because of invalid last_seen_index_reference \"{}\" in database: {}",
        value, err
    );
    self.lock()?;
    return Ok(None);
}
```
Member: Shouldn't this return an error instead of discarding it?

Member Author: I was coming from the watch_registry method, where I want to keep the registry watcher running, but locked, after this happens. We would then manually remove or fix the ref and unlock the queue.

But I can see that this could be confusing behavior for last_seen_reference, so I moved the logic into watch_registry.

```rust
// additionally set the reference in the database
// so this survives recreating the registry watcher
// server.
self.set_last_seen_reference(oid)?;
```
Member: It worries me that this isn't atomic. I worry that we'll:

  • update this in the local git repo
  • fail to update it in the database
  • end up building the same list of crates twice

I guess that's not the end of the world. But it would be nice to add some more error logging, and to update it in the database before updating the git repo.

Member Author: I changed the order of these: first setting the ref in the db, then in the repo.

What additional logging do you imagine? At the call-site of get_new_crates (now in watch_registry) all errors will be logged already.
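A rough sketch of that ordering (database first, then the local registry clone); the config table, the local ref name, and the surrounding function are assumptions for illustration:

```rust
use git2::{Oid, Repository};
use postgres::Client;

/// Sketch of the reordered update discussed above: persist the reference
/// in the database first, then move the ref in the local registry clone.
fn advance_last_seen_reference(
    conn: &mut Client,
    repo: &Repository,
    oid: Oid,
) -> anyhow::Result<()> {
    // 1. database first, so the value survives recreating the registry
    //    watcher even if the second step fails
    conn.execute(
        "INSERT INTO config (name, value) VALUES ('last_seen_index_reference', $1)
         ON CONFLICT (name) DO UPDATE SET value = EXCLUDED.value",
        &[&oid.to_string()],
    )?;

    // 2. then the local git repository (the ref name here is made up)
    repo.reference("refs/docsrs/last-seen", oid, true, "advance last seen reference")?;

    Ok(())
}
```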

```rust
debug!("Checking new crates");
match build_queue
    .get_new_crates(&index)
    .context("Failed to get new crates")
```
Member: Can you add a comment that there should only be one registry watcher running at a time? It confused me for a bit why there wasn't any locking.

In practice I think it should be ok - the worst that happens is we'll try and insert the same crates twice, and the second insert will be ignored by the database.

Member Author: I added a note
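For readers following along: the "ignored by the database" behavior above typically comes from a unique constraint on the queued crate plus ON CONFLICT DO NOTHING, roughly like this (queue table and column names are assumed for the sketch):

```rust
use postgres::Client;

/// Sketch: with a unique constraint on (name, version), queueing the same
/// crate twice is a no-op, so an accidental second registry watcher (or a
/// re-processed change) does no harm beyond wasted work.
fn add_crate_to_queue(
    conn: &mut Client,
    name: &str,
    version: &str,
    priority: i32,
) -> Result<(), postgres::Error> {
    conn.execute(
        "INSERT INTO queue (name, version, priority)
         VALUES ($1, $2, $3)
         ON CONFLICT (name, version) DO NOTHING",
        &[&name, &version, &priority],
    )?;
    Ok(())
}
```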

src/utils/queue_builder.rs (outdated review thread, resolved):
```rust
        }
        Err(e) => {
            report_error(&e.context("Failed to build crate from queue"));
        }
    }
}));
```

Member: There's a comment below that says "If we panic here something is really truly wrong and trying to handle the error won't help". That no longer seems true to me, since the database could be down or whatever. But I don't know how to handle the error. Maybe we should treat it the same as the rest of the errors: exit the build loop and report it?

Member Author: For a moment I thought you meant exiting = break, but I think you mean continue, like with the other errors.

I don't know which cases can lead to panics in the build. I remember we still had some places sprinkled with .unwrap or .expect around database methods, where trying again could be valid.

Checking this I also saw that the ? when checking the queue lock should be removed, to match the error handling of the other cases in that loop.

Member Author: I updated the logic, only reporting and continuing, also for panics.

Member: IIUC, if it's the crate causing the failure, the failure case here is just that we retry building the top crate from the queue continuously, every 60 seconds (because the panic bypasses incrementing the attempt counter), until someone manually removes that crate from the queue (which would have been the same resolution before). If it's something transient like the database being down, then we should eventually recover once it's working again and we attempt to build the crate.
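Putting the outcome of this thread together, a simplified sketch of the build loop's error handling: both Err results and panics are reported and the loop continues after a pause. The helper names mirror the discussion but are not the literal code in src/utils/queue_builder.rs:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::{thread, time::Duration};

/// Sketch of the agreed behavior: a failed or panicking build is reported
/// and the loop moves on instead of tearing down the build server.
fn build_loop(mut build_next_queue_package: impl FnMut() -> anyhow::Result<()>) {
    loop {
        match catch_unwind(AssertUnwindSafe(|| build_next_queue_package())) {
            // build finished (or the queue was empty/locked)
            Ok(Ok(())) => {}
            // normal error: report and continue with the next iteration
            Ok(Err(e)) => log::error!("failed to build crate from queue: {:?}", e),
            // panic: also just report and continue; note that this path
            // bypasses incrementing the attempt counter, as mentioned above
            Err(_panic) => log::error!("build thread panicked, continuing"),
        }
        thread::sleep(Duration::from_secs(60));
    }
}
```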

syphar (Member, Author) commented Jul 30, 2022

> At first skim it all looks great to me. I'll take a deeper look again later, and I'll have to read the postgres docs around the locking.

On top of the docs, this article is also helpful.

I ran some local tests (replacing the build with a sleep) and 3 build servers correctly picked up queued crates.
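For anyone without the Postgres background here: the usual way to let several build servers pull from one queue table without picking up the same crate is row-level locking with FOR UPDATE SKIP LOCKED, where each server locks the row it is building and concurrent servers skip locked rows. A minimal sketch with an assumed schema, not necessarily the exact query in this PR:

```rust
use postgres::Client;

/// Sketch of queue pickup for multiple build servers: the selected row
/// stays locked for the duration of the transaction, so only one server
/// works on a given queued crate at a time.
fn process_next_crate(
    conn: &mut Client,
    build: impl FnOnce(&str, &str) -> anyhow::Result<()>,
) -> anyhow::Result<()> {
    let mut transaction = conn.transaction()?;

    let row = transaction.query_opt(
        "SELECT id, name, version FROM queue
         ORDER BY priority
         LIMIT 1
         FOR UPDATE SKIP LOCKED",
        &[],
    )?;

    if let Some(row) = row {
        let id: i32 = row.get(0);
        build(row.get(1), row.get(2))?;
        // error handling (attempt counter, etc.) is omitted in this sketch
        transaction.execute("DELETE FROM queue WHERE id = $1", &[&id])?;
    }

    transaction.commit()?;
    Ok(())
}
```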

@syphar syphar self-assigned this Jul 30, 2022
src/utils/daemon.rs (outdated review thread, resolved)
src/utils/mod.rs (outdated review thread, resolved)
Nemo157 (Member) commented Jul 31, 2022

I threw 3 builders into the docker-compose setup and tested building some 20 crates locally; the queueing handled them all fine.

I wonder if we should record the hostname or something into builds too, so we can identify which worker built the docs if we need to investigate issues.
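That would be a small addition; a hedged sketch, assuming a build_server column on the builds table and the hostname crate (neither is spelled out in this thread):

```rust
use postgres::Client;

/// Sketch only: record which machine produced a build, so issues can be
/// traced back to a specific build server later.
fn record_build_host(conn: &mut Client, build_id: i32) -> anyhow::Result<()> {
    // `hostname::get()` is from the `hostname` crate (an assumption here)
    let host = hostname::get()?.to_string_lossy().into_owned();
    conn.execute(
        "UPDATE builds SET build_server = $1 WHERE id = $2",
        &[&host, &build_id],
    )?;
    Ok(())
}
```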

syphar (Member, Author) commented Jul 31, 2022

@jyn514 @Nemo157 I added new commits for the proposed changes, so this is ready for another review.

@syphar syphar removed their assignment Aug 3, 2022
syphar (Member, Author) commented Aug 8, 2022

@jyn514 @Nemo157 I'm aware this is a complex thing and reviews take time.

I'm happy to add any improvements and invest more time if that's what is needed :)

(Also, using this is blocked on infra anyway :) )

syphar (Member, Author) commented Aug 26, 2022

@Nemo157 @jyn514 ping?

Nemo157 (Member) left a comment

LGTM, and worked well in local testing (I just retested with the hostname changes).

@syphar syphar added the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Aug 31, 2022
syphar (Member, Author) commented Aug 31, 2022

I'll merge & deploy this separately after the current merges are live & safe, so I can watch & eventually revert better

@syphar syphar merged commit c4f9e90 into rust-lang:master Sep 4, 2022
@syphar syphar deleted the separate-build-server branch September 4, 2022 07:51
@github-actions github-actions bot removed the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Sep 4, 2022
@syphar syphar removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Sep 4, 2022