
Update the location and post-upload command to use the new repository. #26

Merged
nuclearsandwich merged 1 commit into master from new-bootstrap-repo
Oct 18, 2018

Conversation

@nuclearsandwich
Contributor

Not quite ready for merge, as I had to re-create the S3 bucket this publishes to due to an esoteric S3 limitation, but this configuration change will support the new aptly-backed bootstrap repository. It requires no changes to how the script is used.

Releasers should expect publishing packages to take a bit longer: after being indexed by aptly, the new package files and updated repository metadata are pushed to S3.

@nuclearsandwich self-assigned this Sep 13, 2018
@tfoote
Member

tfoote commented Sep 14, 2018

This looks reasonable and it's a nice and compact change. Writing the wrapper script is a good idea. I haven't had a chance to test this though.

@dirk-thomas
Member

Once this is ready to be merged, please let me know. I could do a patch release of catkin_pkg and try this branch to check that everything works.

@nuclearsandwich
Contributor Author

As @dirk-thomas pointed out in a separate discussion, the time it takes for aptly to intake, re-snapshot, and publish all our distributions is more than just "a little bit longer"; it's quite egregious.

Some possible solutions:

  1. Drop updates for out-of-service distributions.
    The current publishing script makes no assumptions about which distributions can be updated by the ros_release_python script, and as a result it re-publishes every distribution in the bootstrap repo. We could immediately stop updating older distributions that we aren't pushing Python tools to, and could also drop support for end-of-life distros like utopic, vivid, and wily in our release configs and strike those from the update list as well. It still wouldn't be on par with reprepro (reasons for this are below), but it maintains the backwards-compatible repository structure and keeps the safety and maintenance advantages of aptly.

  2. Publish the repository directly via a local web server instead of S3.
    Managing the repository in S3 takes a lot longer than publishing it locally. S3 gives us desirable advantages including reduced bandwidth costs and redundant storage and availability. However, aptly's S3 support does not seem as robust as @tfoote and I expected, given the number of errors @dirk-thomas encountered during the first production test. If we take this route, a follow-up task would be configuring a CDN to put in front of the host to regain some of the benefits of S3.

  3. Publish each distribution to a separate repository path.
    This would change the repo path per distribution to repos/ros_bootstrap/$codename and allow us to run the publish step for each repository independently. But it would require a breaking change for bootstrap repository consumers and would hugely increase the amount of configuration required, since each distribution would need its own import. I think this would also slow down the import_upstream process, as it means fetching slightly more data from more places.

  4. Fold all our python releases into one distribution.
    This is what aptly upstream does. The Python packages we produce are not different per distribution; the package published to trusty is the same as the one published to bionic. We could put these packages in a bootstrap distribution, or a distribution named for the oldest supported one (the aptly repo distribution is still wheezy), and only update and publish this distribution from ros_release_python.
    This would get us closest to "par" with reprepro's timing, but it costs us backwards compatibility and would require some reasonably heavy documentation updates. On top of that, we'd still need to maintain the other distribution repositories manually for other distribution-specific packages that get added to the bootstrap repository.

Of these, I'd recommend (1) first and both (1) and (2) second if we continue to have issues with S3 publishing being unreliable.
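A minimal sketch of how option (1) might look inside the wrapper script's publish loop. The distro lists here are illustrative, not the actual contents of the bootstrap repo:

```shell
# Sketch of option (1): skip end-of-life distributions in the publish
# loop. Distro names are stand-ins for the real configuration.
publish_supported() {
    all_distros="trusty utopic vivid wily xenial bionic"
    eol_distros=" utopic vivid wily "

    for distro in $all_distros; do
        case "$eol_distros" in
            *" $distro "*)
                echo "skipping EOL distro: $distro"
                continue ;;
        esac
        # The real script would run aptly snapshot/publish here.
        echo "would snapshot and publish: $distro"
    done
}
publish_supported
```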


Some extra info: One of the limitations we're dealing with is that aptly repos¹ cannot contain, as far as I am able to determine, multiple distributions. Instead we have one aptly repo for each distribution we support and we publish all of them to the same endpoint.

Publishing repositories to S3 is the time-intensive part of the process but it's also unfortunately the one part we cannot parallelize beyond aptly's internal parallelization without option (3) above which I don't recommend.

¹ Here the term "aptly repos" refers to aptly's internal repository structure. aptly is capable of producing a published repository where multiple distributions share the same package pool.
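Concretely, the per-distribution flow described above amounts to something like the following dry-run sketch. The repo and snapshot names are assumptions, not the actual script's naming scheme:

```shell
# One aptly repo per distribution, all published to the same endpoint.
# The stub makes the sketch runnable without aptly installed.
aptly() { echo "aptly $*"; }   # dry-run stub

snapshot_and_publish_all() {
    stamp=20181018
    for distro in trusty xenial bionic; do
        aptly repo add "ros-bootstrap-$distro" incoming/*.deb
        aptly snapshot create "ros-bootstrap-$distro-$stamp" from repo "ros-bootstrap-$distro"
        # Publishing is the slow, serial step: it runs once per distribution.
        aptly publish switch "$distro" "ros-bootstrap-$distro-$stamp"
    done
}
snapshot_and_publish_all
```

This is why publish time grows with the number of distributions: every iteration re-publishes a whole distribution even when only one package changed.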

@dirk-thomas
Member

dirk-thomas commented Sep 19, 2018

The current publishing script makes no assumptions about what distributions can be updated by the ros_release_python script and as a result it re-publishes every distribution in the bootstrap repo.

Each Python package defines its own list of targeted distros. We should try to minimize the work on the server to only the targeted distros (if possible). I didn't pay attention during the release of vcstool to whether it was "doing work" for distros not specified in the stdeb.cfg file.
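For reference, pulling the targeted distros out of a package's stdeb.cfg could look something like this. The sample file contents and the colon-style Suite key are assumptions about the package's config:

```shell
# Sketch: read the targeted distros from an stdeb.cfg so the server could
# limit snapshot/publish to just those. The sample file is a stand-in.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[DEFAULT]
Suite: xenial bionic stretch
EOF

suites=$(sed -n 's/^Suite:[[:space:]]*//p' "$cfg")
echo "targeted distros: $suites"
rm -f "$cfg"
```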

I don't think we should remove the possibility to update EOL distros altogether for performance reasons. In the case of rospkg we intentionally rolled out the support for wildcards to EOLed distros.

Publishing repositories to S3 is the time-intensive part of the process
Publish the repository directly via a local web server instead of S3.
a follow up task would be configuring a CDN to put in front of the host to regain some of the benefits of S3.

Afaik the bootstrap repository is not a user-facing repository. And even the buildfarm only pulls from it in import_upstream jobs, which are triggered infrequently / manually. So I don't see the advantage / need for a CDN / S3. (If we wanted to use one we could also leverage our existing one at OSU, but as said I don't see the need.)

Therefore I would suggest doing option (2) (using a local web server only) first and then measuring again where we stand. That would also solve the problem of not being able to get a directory listing atm.

@nuclearsandwich
Contributor Author

I didn't pay attention during the release of vcstool to whether it was "doing work" for distros not specified in the stdeb.cfg file.

Every distribution gets snapshotted and republished after including new packages in order to avoid the need to do out-of-band detection of which distributions have changed. We could try parsing out only the distributions we expect to change, but it adds to the brittleness of the wrapper script and increases the likelihood that what you think you see is not what you'll get.

I don't think we should remove the possibility to update EOL distros altogether for performance reasons. In the case of rospkg we intentionally rolled out the support for wildcards to EOLed distros.

It would still be doable, but not via ros_release_python exclusively. The process would entail accessing the host via ssh and running the aptly snapshot and publish commands manually for the affected EOL distros.
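That manual path might look like the following dry-run sketch. The repo, snapshot, and package names are assumptions; on the real host the commands would run directly after ssh-ing in:

```shell
# Dry-run sketch of updating a single EOL distro by hand.
aptly() { echo "aptly $*"; }   # stub; on the real host these run for real

manual_eol_update() {
    distro=wily   # an EOL distro not handled by ros_release_python
    aptly repo add "ros-bootstrap-$distro" python-rospkg_all.deb
    aptly snapshot create "ros-bootstrap-$distro-manual" from repo "ros-bootstrap-$distro"
    aptly publish switch "$distro" "ros-bootstrap-$distro-manual"
}
manual_eol_update
```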

So I don't see the advantage / need for a CDN / S3.

The advantages of S3 are mostly administrative. It requires less effort and cost to create and maintain recurring backups of the bootstrap repository contents, and it is slightly more redundantly available than our one repository host.

@tfoote
Member

tfoote commented Sep 19, 2018

Yeah, since this is only used for bootstrapping repositories I don't think that the CDN is necessary for this repository. It's only manually queried by our buildfarm and any other buildfarms. On our old host the load was undetectable on their 2nd-smallest instance (2nd smallest because we needed more disk space than their smallest offered).

So I would suggest trying (2), and we could certainly add some extra logic to skip unaffected repositories along the lines of (1), but I agree with Dirk that I'd rather not lose the ability to push to older repos.

For backup purposes, could we set up the aptly instance to still push to S3, but periodically and out of band, so that it doesn't slow down the developer upload experience?

@nuclearsandwich
Contributor Author

For backup purposes, could we set up the aptly instance to still push to S3, but periodically and out of band, so that it doesn't slow down the developer upload experience?

Yeah, this is what I was imagining. Setting it up to do so reliably is straightforward but not trivial work. I'll go ahead and reconfigure things to use a local web server and a locally published repository: option (2).
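An out-of-band backup could be as simple as periodically syncing the locally published repository tree to S3 (e.g. from cron). The bucket name and publish path below are assumptions, and the `aws` command is stubbed so this is only a dry-run sketch:

```shell
# Sketch: sync the locally published repo tree to S3 on a schedule so
# releases themselves aren't slowed down by S3.
aws() { echo "aws $*"; }   # dry-run stub; remove it to sync for real

backup_bootstrap_repo() {
    aws s3 sync /var/lib/aptly/public s3://ros-bootstrap-backup --delete
}
backup_bootstrap_repo
```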

@nuclearsandwich
Contributor Author

repos.ros.org is now set up with a locally published repo hosted by nginx. None of the config in this PR has changed. I'm planning to make a small Bloom release sometime this weekend as the next test.

@nuclearsandwich
Contributor Author

nuclearsandwich commented Oct 18, 2018

With much thanks to @dirk-thomas for hitting all the issues I seemed not to, I believe we have a relatively stable first round of aptly support. The issue this second round of testing surfaced was memory usage from aptly, which was saturating the nano EC2 instance hosting this repository. I've bumped it up, so we should now be quite comfortable memory-wise unless aptly is doing something pathological.

There are some improvements that could be made to the system which, while not "out of scope", are lower priority than other things I need to get to this month.

Uploading to the new repository is still significantly slower than it was.

Aptly has limitations that reprepro doesn't with regard to maintaining separate distributions. There are some pre-processing steps that might be eliminated with support from aptly (aptly-dev/aptly#757), but ultimately we're going to be O(n) on snapshot and publish operations, where n is the number of distributions the automated tooling supports pushing to.

The most dramatic improvement for minimal effort would come from dropping automated support for all or most end-of-life Ubuntu and Debian distributions, which would cut our n down significantly. It would still be possible for us to release into these distributions if necessary by bypassing the automated tooling.
Right now every new package causes a snapshot and publish of all distributions ever supported by the bootstrap repository.

The most impactful high-effort change would involve modifying the release script here to snapshot and publish the aptly repositories only after all new packages (source and binary, python2 and python3) are pushed and added to the aptly repo. This modification would have the best hope of getting us back to near the performance of the unified reprepro repository, at the cost of no longer being "agnostic" to the repository implementation on the other side of dput.
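The batching idea can be sketched as follows: add every package first, then snapshot and publish each distribution exactly once at the end. Package and repo names are illustrative, and the `aptly` command is stubbed for a dry run:

```shell
# Sketch: batch all package additions, then one snapshot/publish pass.
aptly() { echo "aptly $*"; }   # dry-run stub

batch_release() {
    # 1. Add all packages (src + bin, python2 + python3) without publishing.
    for deb in python-catkin-pkg python-catkin-pkg-modules \
               python3-catkin-pkg python3-catkin-pkg-modules; do
        aptly repo add ros-bootstrap-bionic "$deb.deb"
    done
    # 2. Snapshot and publish once, after all additions.
    aptly snapshot create ros-bootstrap-bionic-batch from repo ros-bootstrap-bionic
    aptly publish switch bionic ros-bootstrap-bionic-batch
}
batch_release
```

The point of the design is visible in the output: many `repo add` lines but only one `publish switch` per distribution, instead of one per uploaded package.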

No automatic recovery after upload failure

One of the advantages aptly has over reprepro is that rolling back an undesired change is extremely straightforward: as simple as switching the published snapshots back to older ones. But if the package upload fails partway through the process, it leaves the repository in a state that so far requires a significant amount of manual intervention to recover from before another release attempt can be made.

Some of this recovery effort could be mitigated with improved error handling in the publishing script to roll back the staging repository state. But each release atomically publishes multiple packages (catkin_pkg, for example, causes atomic publishes for python-catkin-pkg src, python-catkin-pkg bin, python-catkin-pkg-modules src, python-catkin-pkg-modules bin, python3-catkin-pkg src, python3-catkin-pkg bin, python3-catkin-pkg-modules src, and python3-catkin-pkg-modules bin), so if a later package fails to publish, the earlier ones are left behind even after rolling back to the previous snapshot. When the release script is re-run it will try to upload all packages again, and because we publish repositories one distribution at a time, this causes a conflict between the last uploaded version of a package and the newly uploaded one that must be resolved before recovery can succeed.

A way to avoid that would be to unpublish everything and only re-publish it all once every distribution has the same package set. The current behavior maximizes uptime of the bootstrap repo while sacrificing distribution consistency, but I think a few minutes where no packages are available during a release is better than a few minutes where distributions are inconsistent. That would also make retries after partial failures easier to manage.
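The happy-path rollback described above is a single aptly command per distribution: re-point the published distribution at an earlier snapshot. The snapshot name below is an assumption, and the command is stubbed as a dry run:

```shell
# Sketch of the rollback path: switch a published distribution back to a
# prior snapshot in one command.
aptly() { echo "aptly $*"; }   # dry-run stub

rollback_bionic() {
    aptly publish switch bionic ros-bootstrap-bionic-20181001
}
rollback_bionic
```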
