Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ARM64 to lambda-promtail drone build job #5354

Merged
merged 19 commits into from
Jun 16, 2022

Conversation

tlinhart
Copy link
Contributor

@tlinhart tlinhart commented Feb 9, 2022

What this PR does / why we need it:

Using ARM64 architecture for Lambda functions can lead to significant price/performance gain. For reference:

Up until now, the solution was to custom-build an ARM64 image for Promtail Lambda and push it to the private ECR registry. However, since the release of pull through cache repositories (https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-ecr-cache-repositories/) it's now much easier to keep the images from public registries in private ECR registry up to date. This is especially true when using IaC (e.g. Pulumi).

Which issue(s) this PR fixes:

Special notes for your reviewer:

Discussed on Slack with @cstyan (https://grafana.slack.com/archives/CEPJRLQNL/p1644258844007079).

I tried to build the ARM64 image and use it in production with success.

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@tlinhart tlinhart requested a review from a team as a code owner February 9, 2022 13:59
@CLAassistant
Copy link

CLAassistant commented Feb 9, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right, at least I don't see the drone job building lambda-promtail-arm64

Successfully tagged 25e0d225ae133d9faf0acf560f7c6273b24f4064:latest
+ /usr/local/bin/docker tag 25e0d225ae133d9faf0acf560f7c6273b24f4064 public.ecr.aws/grafana/lambda-promtail:main-17a1a17-amd64
+ /usr/local/bin/docker tag 25e0d225ae133d9faf0acf560f7c6273b24f4064 public.ecr.aws/grafana/lambda-promtail:main
+ /usr/local/bin/docker rmi 25e0d225ae133d9faf0acf560f7c6273b24f4064

I think we need further modifications to the jsonnet code to make the lambda_promtail stages look more like the other images we're building.

@tlinhart
Copy link
Contributor Author

tlinhart commented Feb 9, 2022

Hi @cstyan,
I assume this is because I modified just the drone.jsonnet without regenerating the drone.yml. Running make drone should do in my opinion, but I don't posses the DRONE_TOKEN so I can't sign the YAML file. However, I tried running the drone jsonnet --stream --format ... command from Makefile and it builds a correct drone.yml (I diffed it against the current drone.yml).
I tried to reuse as much from the current drone.jsonnet as possible considering the Promtail Lambda is the only image pushed to ECR Public.

@cstyan
Copy link
Contributor

cstyan commented Feb 9, 2022

🤦‍♂️ Ah right, if you give me push access to your fork I can push a change with the new generated yaml.

@tlinhart
Copy link
Contributor Author

tlinhart commented Feb 9, 2022

Sure, I've sent you an invitation.

@pull-request-size pull-request-size bot added size/M and removed size/S labels Feb 9, 2022
@cstyan
Copy link
Contributor

cstyan commented Feb 9, 2022

Lets see how this goes.

@cstyan
Copy link
Contributor

cstyan commented Feb 10, 2022

Okay, I think there's more that we need to do here. The docker image that runs to build the image on builds for the default architecture in the alpine image, that needs to be changed so that if we're doing the arm64 build we crosscompile for arm64.

@tlinhart
Copy link
Contributor Author

I'll take a look it at. Any advice on testing it locally? Is drone exec the way?

@tlinhart
Copy link
Contributor Author

Taking another look at it. The drone.yml seems to contain correct platform section and according to the documentation (https://docs.drone.io/pipeline/docker/syntax/platform/), this should lead to routing the build to the correct runner which, AFAIK, leads to trying to use the image with matching architecture. It seems like the problem is with the cstyan/ecr image not built for the ARM64 architecture?

@cstyan
Copy link
Contributor

cstyan commented Feb 10, 2022

Taking another look at it. The drone.yml seems to contain correct platform section and according to the documentation (https://docs.drone.io/pipeline/docker/syntax/platform/), this should lead to routing the build to the correct runner which, AFAIK, leads to trying to use the image with matching architecture. It seems like the problem is with the cstyan/ecr image not built for the ARM64 architecture?

cstyan/ecr is just a fork of drones existing ecr image build plugin that supports publishing to a public ecr repo. I think the issue is that the Dockerfile the job uses tools/lambda-promtail/Dockerfile doesn't attempt to cross compile the Go binary for arm64 and still builds the final lambda-promtail image into an alpine base image.

@cstyan
Copy link
Contributor

cstyan commented Feb 10, 2022

I'll take a look it at. Any advice on testing it locally? Is drone exec the way?

You can try running the drone pipeline locally, but I'd start with getting the dockerfile working to build an arm64 image that you confirm runs lambda-promtail as expected.

@tlinhart
Copy link
Contributor Author

I'll try to convince you. Compare manifests for plugins/docker image:

pasmen@nyx:~$ docker manifest inspect --verbose plugins/docker | jq '.[].Descriptor.platform'
{
  "architecture": "amd64",
  "os": "linux"
}
{
  "architecture": "arm64",
  "os": "linux",
  "variant": "v8"
}
{
  "architecture": "arm",
  "os": "linux",
  "variant": "v7"
}
{
  "architecture": "amd64",
  "os": "windows",
  "os.version": "10.0.17134.706"
}
{
  "architecture": "amd64",
  "os": "windows",
  "os.version": "10.0.17763.1457"
}
{
  "architecture": "amd64",
  "os": "windows",
  "os.version": "10.0.18362.295"
}
{
  "architecture": "amd64",
  "os": "windows",
  "os.version": "10.0.18363.1139"
}

and cstyan/ecr image:

pasmen@nyx:~$ docker manifest inspect --verbose cstyan/ecr | jq '.Descriptor.platform'
{
  "architecture": "amd64",
  "os": "linux"
}

It's obvious there's ARM64 variant for plugins/docker Docker image while there's no ARM64 variant for cstyan/ecr Docker image. To confirm this, I tried to spin up the t4g.small (ARM64) EC2 and run both Docker images. For plugins/docker I got:

ssm-user@ip-xxx-xxx-xxx-xxx:~$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock plugins/docker
+ /usr/local/bin/dockerd --data-root /var/lib/docker --host=unix:///var/run/docker.sock
Registry credentials or Docker config not provided. Guest mode enabled.
+ /usr/local/bin/docker version
Client:
 Version:           20.10.9
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc
 Built:             Mon Oct  4 16:03:36 2021
 OS/Arch:           linux/arm64
 Context:           default
 Experimental:      true
...

while for cstyan/ecr I got:

ssm-user@ip-xxx-xxx-xxx-xxx:~$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock cstyan/ecr
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
standard_init_linux.go:228: exec user process caused: exec format error

Exactly the same as reported by the Drone build:

latest: Pulling from cstyan/ecr
Digest: sha256:a9a6085f477aa3fd5e5a3e05406c61b30e4123560064e40c527ac4a89187bdd1
Status: Downloaded newer image for cstyan/ecr:latest
standard_init_linux.go:228: exec user process caused: exec format error

As I already stated in the PR description, I tried to actually build the ARM64 lambda-promtail Docker image using the Dockerfile from repository with success, both on x86-64 architecture using QEMU emulation and on ARM64 architecture (EC2 machine). So I presume there's nothing wrong with the Dockerfile nor the Drone job configuration.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Feb 10, 2022
@cstyan
Copy link
Contributor

cstyan commented Feb 10, 2022

I'm not sure how your build locally via the dockerfile to build the binary for arm64 works, the docker image only builds lambda-promtail for the default arch on the golang:1-alpine3.14 image which I assume is amd64.

I think we either need to modify the Dockerfile to optionally take an argument to change the Go build architecture to arm64, and have a second pipeline for that, or double check that the ECR image should work for arm64 and potentially publish another version of the image?

I honestly don't know how this all currently works for the other images we build that do support arm64.

@tlinhart
Copy link
Contributor Author

From the Drone documentation:

Use the platform section to configure the target operating system and architecture and routes the pipeline to the appropriate runner.

I modified the drone.jsonnet so that it prepares two pipelines in drone.yml, one for x86-64 and one for ARM64 (pretty much the same way it does for other tools). The ARM64 pipeline tries to pull the ARM64 version of cstyan/ecr image for building the lambda-promtail, but as I posted earlier, there's none so it pulls the only available, x86-64 variant, which leads to architecture mismatch as shown above.

I'm pretty sure the only thing needed is to build and publish the ARM64 variant of cstyan/ecr Docker image.

@tlinhart
Copy link
Contributor Author

Hi @cstyan,
I see your effort in #5403. When I look at the cstyan/ecr repository, I see there a Docker image with arm64 tag which itself is not enough, IMO. The Drone pipeline hasn't changed so it still tries to pull the latest tag, and that is a manifest for x86-64 architecture. I tried to clone your repository and build the Docker image the way I think is correct. Here's the exact steps I took:

git clone https://github.com/cstyan/drone-docker.git
cd drone-docker

export GOOS=linux
export CGO_ENABLED=0
export GO111MODULE=on

export GOARCH=amd64
go build -v -a -tags netgo -o release/linux/amd64/drone-ecr ./cmd/drone-ecr

export GOARCH=arm64
go build -v -a -tags netgo -o release/linux/arm64/drone-ecr ./cmd/drone-ecr

docker login -u tlinhart
docker build \
  --label org.label-schema.build-date=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --label org.label-schema.vcs-ref=$(git rev-parse --short HEAD) \
  --file docker/ecr/Dockerfile.linux.amd64 --tag tlinhart/drone-ecr-public:amd64 .
docker push tlinhart/drone-ecr-public:amd64

docker build \
  --label org.label-schema.build-date=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --label org.label-schema.vcs-ref=$(git rev-parse --short HEAD) \
  --file docker/ecr/Dockerfile.linux.arm64 --tag tlinhart/drone-ecr-public:arm64 .
docker push tlinhart/drone-ecr-public:arm64

docker manifest create tlinhart/drone-ecr-public:latest \
  --amend tlinhart/drone-ecr-public:amd64 \
  --amend tlinhart/drone-ecr-public:arm64
docker manifest push tlinhart/drone-ecr-public:latest

Now there's a Docker image tlinhart/drone-ecr-public with tags for both architectures and a latest tag with a manifest list for those architectures:

$ docker manifest inspect tlinhart/drone-ecr-public
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 3036,
         "digest": "sha256:bc0e850763dcee8866b87b1ad9d51073af6919fd6bf54c416522c60659485fa4",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 3036,
         "digest": "sha256:6dcc5c7a71e12854fac9875df4220fc3a7760fbcc26f593c0a41764a3f1fe4cd",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      }
   ]
}

I update the Drone pipeline to use this image to confirm it works, but it seems the build here is blocked (https://drone.grafana.net/grafana/loki/8807). If you could unblock it, we would see if that's the missing part or not.

The cross-compile way doesn't look the best to me, for couple of reasons:

  • It makes the build process for Promtail Lambda diverged from the rest of the tools.
  • It's a bit more complex that just using the correct runner.

Thus, in compliance with Occam's razor, I'd prefer this PR over the #5403 (granted this approarch works).

@tlinhart
Copy link
Contributor Author

As a sidenote, couple of days ago I create a PR to add ECR Public support to the official drone-docker plugins repository (see drone-plugins/drone-docker#357). If it gets merged, it might streamline things a bit more.

@tlinhart
Copy link
Contributor Author

Hi @cstyan, I just pushed an effort to add the manifest part for ECR images. I tried to mirror the other parts as much as possible, changing just .drone/drone.jsonnet. I'd now need if you could re-generate the .drone/drone.yml so that it's signed and the Drone can run. However, it would need to be tested as from what I understand, the manifest part is run only when pushed to the main branch. Can I get your help here?

@cstyan
Copy link
Contributor

cstyan commented Apr 13, 2022

Hey @tlinhart thanks again for all your awesome work here. There was a minor issue with the volumes definition in the ecr function, I made a small change to fix that and regenerated the drone.jsonnet, I also pushed a second commit that should allow the build to run even before a push to main. Lets see how that goes. On top of that, I had to merge master in and generate the jsonnet again, because merge conflicts 🤷‍♂️

@cstyan
Copy link
Contributor

cstyan commented Apr 13, 2022

okay @tlinhart it looks like maybe the way we're setting up the authentication is incorrect?

@tlinhart
Copy link
Contributor Author

@cstyan thanks for the corrections! Seems like I forgot to define the Docker volume in the pipeline (https://docs.drone.io/pipeline/docker/syntax/volumes/temporary/). I added it to the .drone/drone.jsonnet, may I ask you to regenerate the .drone/drone.yml to see if it works now?

@cstyan
Copy link
Contributor

cstyan commented Apr 18, 2022

@tlinhart it looks like we have a mismatch between the SHA used in the image tag when building and publishing the image, and the SHA used to try and find the images when creating the manifest. Unfortunately I haven't been able to track the source of that mismatch down, but I do have a few other things in progress right now. Let me know if you can find the issue, otherwise I can dig into this more later in the week.

@tlinhart
Copy link
Contributor Author

Hi @cstyan, I tried to simulate the process and I think the problem is only due to the fact that we try to create and push the manifest list from the fork/PR. If pushed to the main branch the problem would not occur. This is not 100%, though, so I'd need your confimation or test it somehow. Here's my observations:

  1. The latest commit to my fork is yours commit tlinhart@17d2382 with ID 17d2382. This corresponds to the tag(s) that plugins/manifest expects when creating the manifest list in the manifest-lambda-promtail step of the manifest-ecr pipeline:
time="2022-04-18T00:06:19Z" level=warning msg="Couldn't find or access image reference \"public.ecr.aws/grafana/lambda-
promtail:main-17d2382-amd64\". Skipping image."
time="2022-04-18T00:06:19Z" level=warning msg="Couldn't find or access image reference \"public.ecr.aws/grafana/lambda-
promtail:main-17d2382-arm64\". Skipping image."

This comes from build.tag used in the manifest tool spec file.

  1. The actual tag that the Lambda Promtail image is tagged with comes from the ./tools/image-tag script run in the image-tag step of the pipeline, which is run after the clone step. Now consider what the clone step does and how it differs for a push to fork/PR and for a push to main branch. For the push to a fork/PR, it first merges the PR head to main and the hash of the resulting merge commit becomes the hash used by image-tag. For the push to a main branch, this does not happen and thus the tags used by image-tag and by plugins/manifest match.

So, I'd say that everything will work just fine once merged to main. What do you think, does it make sense?

@tlinhart
Copy link
Contributor Author

tlinhart commented May 9, 2022

Hi @cstyan, just a quick ping if you could take a look at my last comment.

@tlinhart
Copy link
Contributor Author

tlinhart commented May 25, 2022

Hi @cstyan, is there anything I can do to make a progress?

@cstyan
Copy link
Contributor

cstyan commented May 27, 2022

Hey @tlinhart, sorry for the delay!

Just read over your explanation, it makes sense to me. I will run it by another person on the team who has some experience with our CI and image build setup next week just to confirm. It looks like the drone.yml file has some conflicts as well, I will resolve that.

@tlinhart
Copy link
Contributor Author

Great, thanks @cstyan!

@tlinhart
Copy link
Contributor Author

Hi @cstyan, could you please give it a try and confirm my assumptions?

@cstyan
Copy link
Contributor

cstyan commented Jun 15, 2022

I think we're probably good to go. I will regenerate the drone yaml file in the morning.

cstyan added 2 commits June 15, 2022 13:56
Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@owen-d owen-d merged commit 35cb40a into grafana:main Jun 16, 2022
@cstyan
Copy link
Contributor

cstyan commented Jul 5, 2022

@tlinhart everything looking good?

@tlinhart
Copy link
Contributor Author

tlinhart commented Jul 7, 2022

Hi @cstyan,
regarding the ARM64 architecture, everything is working great. There's one minor hiccup with that AWS Lambda service doesn't yet support multi-arch container images (see the discussion here). Thus, we can't simply use the tag main but have to stick with specific commit's tag (e.g. main-b3b3da6-arm64). Until AWS Lambda adds the support, it would probably be best to also generate tags for specific architectures like main-arm64. I'll consider opening another PR for this if you agree it's a good idea. I suppose that would be a minor change to the drone.jsonnet file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants