Added multiarch Teleport container images #15514
@fheinecke - this PR is large and will require admin approval to merge. Consider breaking it up into a series of smaller changes.
@logand22 Added you as reviewer for this PR because you've been making changes to how our container images are published.
@fheinecke,
@tigrato There is a minor change required that can be merged after this PR (https://github.com/gravitational/teleport.e/compare/master...fred/arm-container-images, PR pending) |
logand22 left a comment:
Since this is such a large PR, I'd like to see some type of passing test before finishing the review. From what I tested locally there are some broken parts of the local image building process.
}
// Note that tags are also valid here as a tag refers to a specific commit
func cloneRepoStep(checkoutPath, commit string) step {
`func cloneRepoCommands() []string { ... }` and `func cloneRepoStep(checkoutPath, commit string) step { ... }` seem to have some overlap. If they are both still required, can you add comments or otherwise help distinguish between the two?
- name: dockersock
  path: /var/run
depends_on:
- Tag and push "public.ecr.aws/gravitational/teleport-ent:10-fips-amd64" to ECR
Do promotion steps need to depend on a pipeline that doesn't run within the same event?
Is it possible to split this PR into smaller, more manageable chunks? There are quite a lot of changes happening here, and I feel like I could provide better feedback if some of the goals of this PR were separated out.
I was under the impression that docker buildx could target more than one platform at a time, removing the need for multiple steps for the different architectures. Is there a reason not to do this?
Have you tested this on our Drone servers? I have a concern about the pipeline size. Do you know if the use of depends_on causes Drone to use multiple pods instead of multiple containers? Drone schedules each step of a pipeline in a different container within a single pod. This can cause issues with Kubernetes scheduling, since the pod requests resources equal to the per-step resource request multiplied by the number of steps. As a result, pods may never be scheduled and the pipeline simply times out.
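To put the scheduling concern in numbers, here is a minimal sketch of that multiplication. It assumes every step carries the same request; `podRequestMiB` is a made-up helper, and the step count and per-step figure are invented for illustration rather than measured from this pipeline.

```go
package main

import "fmt"

// podRequestMiB returns the memory a single pipeline pod would request if
// every step container in it requests perStepMiB. Both inputs are
// illustrative assumptions, not values read from Drone.
func podRequestMiB(steps, perStepMiB int) int {
	return steps * perStepMiB
}

func main() {
	// Hypothetical numbers: 60 steps at 512 MiB each.
	fmt.Printf("pod requests %d MiB\n", podRequestMiB(60, 512))
}
```

If the total exceeds what any single node can offer, the pod stays pending no matter how little each individual step needs.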
	"strings"
)
// If you are working on a PR/testing changes to this file you should configure the following for Drone testing:
Is there a reason not to test by just updating the version and triggering the drone pipeline against this version / branch?
This updated testing method seems complicated to use.
It is complicated to use. I wrote this section to document how to test this repo without affecting production registries. Because this file touches so many important infra resources (Quay, two ECR instances), I decided to add some changes that would allow for full end-to-end testing without the risk of pushing a breaking change to those resources. For example, if I run this as currently committed, then I will push a v10, v10.2, and v10.2.2 release to each of those three registries. By adding these vars we can isolate testing.
For what it's worth, we will be adding a "staging" or "test" CI/CD pipeline tool instance when we replace Drone or migrate it to AWS. A staging instance would remove the need for this section.
I absolutely can remove this section, and it is complicated. Whether this should be kept or not comes down to whether we want to accept the risk of somebody (such as myself) pushing a build to production registries while testing a PR.
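A rough sketch of the isolation idea, for illustration only. `registryFor`, its parameters, and the registry names are hypothetical and do not correspond to the variables this PR actually adds.

```go
package main

import "fmt"

// registryFor returns the registry an image should be pushed to.
// testingOverride is a hypothetical knob: when it is non-empty, every push is
// redirected to it so that production registries are never touched.
func registryFor(production, testingOverride string) string {
	if testingOverride != "" {
		return testingOverride
	}
	return production
}

func main() {
	// Registry names are placeholders, not the real production targets.
	fmt.Println(registryFor("registry.example.com/prod", ""))
	fmt.Println(registryFor("registry.example.com/prod", "registry.example.com/pr-test"))
}
```

Routing every push through a single function like this would keep the override in one place instead of spreading it across each build, tag, and push step.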
> For example, if I run this as currently committed then I will push a v10, v10.2, and v10.2.2 release to each of those three registries. By adding these vars we can isolate testing.
I guess I have a few questions regarding this statement.
- Does the current `drone.yml` pipeline allow you to push a `v10, v10.2, v10.2.2` release?
- If not, what change in this PR allows it to happen, and is it necessary?
For instance:
Lines 1748 to 1751 in 4cdc4b6
Seems like a common protection to prevent a user from forgetting to update the Makefile versions.
> Does the current drone.yml pipeline allow you to push a v10, v10.2, v10.2.2 release?
Yes it does. All three will be pushed as a part of a promotion/cron job run.
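For readers unfamiliar with the tagging scheme, the three tags are the major, major.minor, and full forms of one release version. A minimal sketch of that expansion follows; `imageTags` is an illustrative helper, not a function from dronegen, and it only handles plain `vMAJOR.MINOR.PATCH` input.

```go
package main

import (
	"fmt"
	"strings"
)

// imageTags expands a full version such as "v10.2.2" into the major,
// major.minor, and full tags discussed above. Pre-release or build suffixes
// are not handled in this sketch.
func imageTags(version string) ([]string, error) {
	parts := strings.Split(version, ".")
	if len(parts) != 3 || !strings.HasPrefix(version, "v") {
		return nil, fmt.Errorf("expected vMAJOR.MINOR.PATCH, got %q", version)
	}
	return []string{
		parts[0],
		parts[0] + "." + parts[1],
		version,
	}, nil
}

func main() {
	tags, err := imageTags("v10.2.2")
	if err != nil {
		panic(err)
	}
	fmt.Println(tags)
}
```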
PublicEcrRegion string = "us-east-1"
StagingEcrRegion string = "us-west-2"
LocalRegistry string = "drone-docker-registry:5000"
What is the local registry used for?
docker buildx will not use the normal local Docker image store with the docker-container driver. Running a registry in the context of a pipeline allows layers/images to be shared between steps without exporting everything as tarballs and passing the files between steps via the workspace dir. Strictly speaking, this can be removed/replaced, but doing so makes the build, tag, and push steps much messier.
Just to clarify, docker buildx doesn't use the local docker image ls when using the docker-container driver. And this driver is used because of docker-in-docker?
For the sake of educating myself on this issue, I looked into the problem you are facing. It looks like the docker-container driver does support loading into the local image store, but only when you are building for a single platform at a time. Running something like --platform linux/amd64,linux/arm64 doesn't work, as you mention.
I have already made two conflicting comments. The first is to not use a docker registry. This can be done because you could switch back to using --load instead of --push, but you'll have to run buildx build individually for each platform.
Alternatively, I mentioned leveraging --platform so that you can build them all at once in one command. This doesn't work with --load, so you'll have to use --push with the registry like you currently do.
You currently use both the registry and individual platform builds. I suggest you switch to one or the other. I tested this locally, but I could be overlooking something, so let me know.
Note: the containerd image store beta (https://docs.docker.com/desktop/containerd/) fixes this issue entirely, allowing you to build all platforms in one command without needing the registry solution, which seems to be the optimal path. This should come out soon.
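The two options in this comment can be sketched as argument lists for `docker buildx build`. This is only an illustration of the trade-off: `buildxArgs`, the image names, and the per-architecture tag suffix are hypothetical and are not how dronegen assembles its commands.

```go
package main

import (
	"fmt"
	"strings"
)

// buildxArgs assembles arguments for `docker buildx build`.
// With one platform the result can be loaded into the local image store
// (--load); with several it has to be pushed to a registry (--push).
func buildxArgs(tag string, platforms []string) []string {
	args := []string{"buildx", "build", "--platform", strings.Join(platforms, ","), "--tag", tag}
	if len(platforms) > 1 {
		args = append(args, "--push")
	} else {
		args = append(args, "--load")
	}
	return append(args, ".")
}

func main() {
	platforms := []string{"linux/amd64", "linux/arm64"}

	// Option 1: one invocation per platform, each loaded locally.
	for _, p := range platforms {
		tag := "example/teleport:dev-" + strings.TrimPrefix(p, "linux/")
		fmt.Println("docker", strings.Join(buildxArgs(tag, []string{p}), " "))
	}

	// Option 2: one invocation for all platforms, pushed to a registry.
	fmt.Println("docker", strings.Join(buildxArgs("registry.example.com/teleport:dev", platforms), " "))
}
```

Option 1 keeps failures isolated per architecture, while option 2 needs fewer steps but depends on a registry being reachable from the build.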
> Just to clarify, docker buildx doesn't use the local docker image ls when using the docker-container driver. And this driver is used because of docker-in-docker?
Yes, that is correct: it is due entirely to the driver and is unrelated to DinD. One of the biggest places this causes an issue is when building teleport lab. Teleport lab depends on the base teleport image, which is also built in this pipeline. Without using a registry there is no clean way to provide a multiarch BASE to teleport operator's Dockerfile.
I can absolutely switch to using --platform with multiple arch/targets. The reason I have not done so here is that (IMO) splitting the steps up makes it easier to diagnose and debug an issue when something goes wrong during a promotion. For example, if the teleport-ent:v10-arm build fails due to a change in which cross compiler package we're using, it's much easier to track down that issue than if teleport-ent:v10 targeting arm, arm64, and amd64 fails.
Sounds good to me! In that case, it seems like you can remove the need for the registry, as building individually doesn't cause issues with loading into the local Docker image store.
majorVersionVarPath := path.Join(majorVersionVarDirectory, majorVersion)
return step{
	Name: fmt.Sprintf("Find the latest available semver for %s", majorVersion),
	Image: "golang:1.18",
Are we locked at this Go version, or should it be pulled out from the makefile at runtime?
Which makefile are you referring to? We could pull it out of a file at runtime but I'm not sure what the benefit would be.
I think @tcsc means that we could configure this Go version outside of this file so that we don't have to manage it in as many places.
I can add that if we really want it in this PR but we are already doing this all over dronegen. I'd recommend doing this as a part of a new PR, and leaving it as is in this one.
100% agree - I'll be opening the first of several split PRs based upon this one in a few minutes.
Closing in favor of #16688 which is a smaller subset of this PR.
The primary purpose of this PR is to add multiarch Teleport container images. This includes amd64, arm, and arm64 images under the `$MAJOR_VERSION` tag on each release.
Due to the complexity of how we handle building and publishing container images, this PR includes a number of other changes:
.drone.yml
Regarding multiarch container builds, the following features have been added:
This PR still needs some testing; however, I'm opening it up now to get some code review in the meantime.