Skip to content

Added multiarch Teleport container images#15514

Closed
fheinecke wants to merge 94 commits into
masterfrom
fred/arm-container-images
Closed

Added multiarch Teleport container images#15514
fheinecke wants to merge 94 commits into
masterfrom
fred/arm-container-images

Conversation

@fheinecke
Copy link
Copy Markdown
Contributor

@fheinecke fheinecke commented Aug 13, 2022

The primary purpose of this PR is to add multiarch Teleport container images. This includes amd64, arm, and arm64 images under the $MAJOR_VERSION tag on each release.

Due to the complexity of how we handle building and publishing container images this PR includes a number of other changes:

  • Removed old/unneeded Makefile targets that pertain to building images.
  • Reduced the number of Dockerfiles that are involved with building images from four to one consistent, unified file.
  • Added multiarch support for Teleport lab, along with Teleport lab for enterprise/fips.
  • Moved all CI/CD related image building logic into dronegen and the new "unified" Dockerfile. Previously this logic was spread across two repos, two makefiles, a few dozen makefile targets, four Dockerfiles, Dronegen, and lots of manual entries in .drone.yml
  • Updated makefiles to use the new Dockerfile
  • Refactored several Dronegen functions that are now used in multiple pipelines
  • Refactored several makefile targets to support local archive building, deb building, and image building from source

Regarding multiarch container builds the following features have been added:

  • Separate arch-specific images/tags for each major version
  • Image manifest for each major version that "points" to each arch-specific image
  • Publishing to all three container repos (ECR staging, ECR production, Quay)
  • Highly parallelized/efficient container image pipelines that properly build a DAG for step dependencies, executing as many as possible at once
  • All images will continue to be rebuilt/update daily to ensure the latest upstream packages are installed, across the latest three major Teleport versions
  • Wrote dronegen so that it should be very easy to add/remove Dockerfiles (such as the Teleport k8s operator), ISAs, and container repositories

This PR still needs some testing however I'm opening it up now to get some code review in the meantime.

@github-actions
Copy link
Copy Markdown
Contributor

@fheinecke - this PR is large and will require admin approval to merge. Consider breaking it up into a series smaller changes.

@russjones russjones requested review from logand22 and removed request for tigrato August 13, 2022 15:21
@russjones
Copy link
Copy Markdown
Contributor

@logand22 Added you as reviewer for this PR because you've been making changes to how our container images are published.

@russjones russjones requested review from tcsc and removed request for jakule August 13, 2022 15:21
@tigrato
Copy link
Copy Markdown
Contributor

tigrato commented Aug 15, 2022

@fheinecke,
Are the updates in webassets and e submodules required?

@fheinecke
Copy link
Copy Markdown
Contributor Author

@fheinecke, Are the updates in webassets and e submodules required?

@tigrato There is a minor change required that can be merged after this PR (https://github.com/gravitational/teleport.e/compare/master...fred/arm-container-images, PR pending)

Copy link
Copy Markdown
Contributor

@logand22 logand22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is such a large PR, I'd like to see some type of passing test before finishing the review. From what I tested locally there are some broken parts of the local image building process.

Comment thread dronegen/common.go Outdated
}

// Note that tags are also valid here as a tag refers to a specific commit
func cloneRepoStep(checkoutPath, commit string) step {
Copy link
Copy Markdown
Contributor

@logand22 logand22 Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func cloneRepoCommands() []string {
...
}

and

func cloneRepoStep(checkoutPath, commit string) step {
...
}

seem to have some overlap, if they are both still required can you add comments or help distinguish between the two in someway?

Comment thread .drone.yml Outdated
Comment thread .drone.yml Outdated
Comment thread .drone.yml Outdated
- name: dockersock
path: /var/run
depends_on:
- Tag and push "public.ecr.aws/gravitational/teleport-ent:10-fips-amd64" to ECR
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do promotion steps need to depend on a pipeline that doesn't run within the same event?

Comment thread .drone.yml Outdated
Comment thread .drone.yml Outdated
Comment thread Makefile
@r0mant
Copy link
Copy Markdown
Collaborator

r0mant commented Sep 16, 2022

@tcsc @logand22 Can you guys re-review this?

@logand22 logand22 self-requested a review September 19, 2022 18:02
Copy link
Copy Markdown
Contributor

@logand22 logand22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to split this PR into smaller, more manageable chunks? There's quite a lot of changes happening here and I feel like I could provide better feedback if some of the goals of this PR were separated out.

I was under the impression that docker buildx could target more than one platform at a time removing the need for you to have multiple steps for the different architectures. Is there a reason to not do this?

Have you tested this on our drone servers? I have a concern about the pipeline size. Do you know if the use of depends_on causes drone to use multiple pods instead of multiple containers? Drone schedules each step of a pipeline in a different container within a single pod. This can cause issues with kubernetes scheduling since the pod requests resources equal to the resource requests * number of steps. This can cause pods not to be scheduled and just timeout.

Comment thread docs/postrelease.md Outdated
"strings"
)

// If you are working on a PR/testing changes to this file you should configure the following for Drone testing:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to test by just updating the version and triggering the drone pipeline against this version / branch?

This updated testing method seems complicated to use.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is complicated to use. I wrote this section to document how to test this repo without affecting production registries. Because this file touches so many important infra resource (Quay, two ECR instances) I decided to add some changes that would allow for full end to end testing without the risk of pushing a breaking change to those resources. For example, if I run this as currently committed then I will push a v10, v10.2, and v10.2.2 release to each of those three registries. By adding these vars we can isolate testing.

For what it's worth we will be adding a "staging" or "test" CI/CD pipeline tool instance when we replace drone or migrate it to AWS. A staging instance would remove the need for this section.

I absolutely can remove this section, and it is complicated. Wether this should be kept or not comes down to if we want to accept the risk of somebody (such as myself) pushing a build to production registries while testing a PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if I run this as currently committed then I will push a v10, v10.2, and v10.2.2 release to each of those three registries. By adding these vars we can isolate testing.

I guess I have a few questions regarding this statement.

  1. Does the current drone.yml pipeline allow you to push a v10, v10.2, v10.2.2 release?
  2. If not, what change in this PR allows it to happen and is it necessary?

For instance:

teleport/.drone.yml

Lines 1748 to 1751 in 4cdc4b6

if [ "$$VERSION" != "${DRONE_TAG##v}" ]; then
echo "Mismatch between Makefile version: $$VERSION and git tag: $DRONE_TAG"
exit 1
fi

Seems like a common protection to prevent a user from forgetting to update the Makefile versions.

Copy link
Copy Markdown
Contributor Author

@fheinecke fheinecke Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the current drone.yml pipeline allow you to push a v10, v10.2, v10.2.2 release?

Yes it does. All three will be pushed as a part of a promotion/cron job run.

Comment thread dronegen/container_images.go Outdated
Comment thread dronegen/container_images.go Outdated
PublicEcrRegion string = "us-east-1"
StagingEcrRegion string = "us-west-2"

LocalRegistry string = "drone-docker-registry:5000"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the local registry used for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker buildx will not use the normal docker image local registry with the docker-container driver. Running a registry in the context of a pipeline allows layers/images to be shared around steps without exporting everything as tarballs and passing the files around multiple steps via the workspace dir. Strictly speaking this can be removed/replaced but it makes the build, tag, and push steps much more messy.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, docker buildx doesn't use the local docker image ls when using the docker-container driver. And this driver is used because of docker-in-docker?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of educating myself on this issue I looked into the problem you are facing. Looks like docker-container driver does support the local registry but only when you are building for a single platform at a time. Running something like --platform linux/amd64,linux/arm64 doesn't work as you mention.

I have two conflicting comments I've already made. The first is not using docker registry. This can be done because you could possibly switch back to using --load instead of --push but you'll have to individually run buildx build for each platform.

Alternatively I mentioned leveraging --platform so that you can build them all at once in one command. This doesn't work with --load so you'll have to leverage --push with the registry like you currently do.

You currently leverage the registry and individual platform builds. I suggest you switch to one or the other. I tested this locally, but I could be overlooking something though so let me know.

Note: https://docs.docker.com/desktop/containerd/ Containerd image store beta fixes this issue entirely allowing you to leverage all platforms in one command and still not have to use the registry solution which seems to be the optimal path. This should come out soon.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, docker buildx doesn't use the local docker image ls when using the docker-container driver. And this driver is used because of docker-in-docker?

Yes that is correct, it is due entirely to the driver and is unrelated to DinD. One of the biggest places this causes and issue is when building teleport lab. Teleport lab depends on the base teleport image which is also built in this pipeline. Without using a registry there is no clean way to provide a multiarch BASE to teleport operator's dockerfile.

I can absolutely switch to using --platform with multiple arch/targets. The reason why I have not here is because (IMO) splitting the steps up makes it easier to diagnose and debug an issue when something goes wrong during a promotion. For example, if the teleport-ent:v10-arm build fails due to a change with what cross compiler package we're using, it's much easier to track down that issue than if teleport-env:v10 targeting arm, arm64, amd64 fails.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me! In that case it seems like you can remove the need for the registry then as building individually doesn't cause issues with loading into docker image.

Comment thread .drone.yml
Comment thread dronegen/container_images.go Outdated
Comment thread dronegen/container_images.go Outdated
majorVersionVarPath := path.Join(majorVersionVarDirectory, majorVersion)
return step{
Name: fmt.Sprintf("Find the latest available semver for %s", majorVersion),
Image: "golang:1.18",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we locked at this Go version, or should it be pulled out from the makefile at runtime?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which makefile are you referring to? We could pull it out of a file at runtime but I'm not sure what the benefit would be.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @tcsc means that we could configure this Go version outside of this file so that we don't have to manage it in as many places.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add that if we really want it in this PR but we are already doing this all over dronegen. I'd recommend doing this as a part of a new PR, and leaving it as is in this one.

Comment thread .drone.yml
@fheinecke
Copy link
Copy Markdown
Contributor Author

Since this is such a large PR, I'd like to see some type of passing test before finishing the review. From what I tested locally there are some broken parts of the local image building process.

100% agree - I'll be opening the first of several split PRs based upon this one in a few minutes.

@fheinecke
Copy link
Copy Markdown
Contributor Author

Closing in favor of #16688 which is a smaller subset of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants