Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: gracefully handle an error retrieving auth info from AWS Secrets Manager #401

Merged
merged 3 commits into from
Aug 31, 2023

Conversation

trentm
Copy link
Member

@trentm trentm commented Aug 29, 2023

If there is an error retrieving an APM auth secret from AWS Secrets
Manager (attempted when ELASTIC_APM_SECRETS_MANAGER_API_KEY_ID or ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID
are provided), then the extension will now log a warning rather
than erroring out.

This allows the Lambda function to still work. Only the reporting of
APM server data will fail (with a 403). This was already the behavior
for an invalid ELASTIC_APM_SECRET_TOKEN.

testing notes

first showing the failure

I configured a Lambda fn, trentm-play-fn1, to use the current latest extension and Node.js APM layers. I configured a bogus secret ID for a secret token:

ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID=bogus-no-such-secret

then I invoked the function (via its Lambda URL). It fails:

% curl -i https://[REDACTED].lambda-url.us-west-2.on.aws/
HTTP/1.1 502 Bad Gateway
Date: Tue, 29 Aug 2023 17:53:09 GMT
Content-Type: application/json
Content-Length: 21
Connection: keep-alive
x-amzn-RequestId: bc6f78e2-2eba-4a73-8617-62b79e66bbd2
X-Amzn-Trace-Id: root=1-64ee3084-569b26440d0b3e2910b91698;sampled=0;lineage=c4ffcf56:0

Internal Server Error
The full Lambda invocation log shows the extension erroring out
2023-08-29T17:53:08.915000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 INIT_START Runtime Version: nodejs:18.v11	Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:2d38a43033930b7e89d977b60a6556a2e84bcc276ff59eab90d1327b252528ef
2023-08-29T17:53:09.118000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 2023/08/29 17:53:09 failed to create the app: failed loading APM Server Secret Token from Secrets Manager: failed to retrieve sercet value: operation error Secrets Manager: GetSecretValue, https response error StatusCode: 400, RequestID: 8332d4d2-7a37-4432-b406-4ad386beca92, api error AccessDeniedException: User: arn:aws:sts::627286350134:assumed-role/trentm-play-fn1-role-irhf0v93/trentm-play-fn1 is not authorized to perform: secretsmanager:GetSecretValue on resource: bogus-no-such-secret because no identity-based policy allows the secretsmanager:GetSecretValue action
2023-08-29T17:53:09.119000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 EXTENSION	Name: apm-lambda-extension	State: Started	Events: []
2023-08-29T17:53:09.498000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 2023/08/29 17:53:09 failed to create the app: failed loading APM Server Secret Token from Secrets Manager: failed to retrieve sercet value: operation error Secrets Manager: GetSecretValue, https response error StatusCode: 400, RequestID: f3f12739-850d-4c26-a716-234273c900d7, api error AccessDeniedException: User: arn:aws:sts::627286350134:assumed-role/trentm-play-fn1-role-irhf0v93/trentm-play-fn1 is not authorized to perform: secretsmanager:GetSecretValue on resource: bogus-no-such-secret because no identity-based policy allows the secretsmanager:GetSecretValue action
2023-08-29T17:53:09.499000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 EXTENSION	Name: apm-lambda-extension	State: Started	Events: []
2023-08-29T17:53:09.499000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 START RequestId: bc6f78e2-2eba-4a73-8617-62b79e66bbd2 Version: $LATEST
2023-08-29T17:53:09.501000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 RequestId: bc6f78e2-2eba-4a73-8617-62b79e66bbd2 Error: exit status 1
Extension.Crash
2023-08-29T17:53:09.501000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 END RequestId: bc6f78e2-2eba-4a73-8617-62b79e66bbd2
2023-08-29T17:53:09.501000+00:00 2023/08/29/[$LATEST]0ff2602d5bcd48db969ea5f6e20053d7 REPORT RequestId: bc6f78e2-2eba-4a73-8617-62b79e66bbd2	Duration: 216.76 ms	Billed Duration: 217 msMemory Size: 1769 MB	Max Memory Used: 20 MB

showing the fix

I updated this Lambda fn to use a dev build of this Lambda extension with these changes, published to arn:aws:lambda:us-west-2:627286350134:layer:trentm-play-apm-lambda-extension:10, and re-invoked. The Lambda function worked:

% curl -i https://[REDACTED].lambda-url.us-west-2.on.aws/
HTTP/1.1 200 OK
Date: Tue, 29 Aug 2023 18:27:05 GMT
Content-Type: application/json
Content-Length: 32
Connection: keep-alive
x-amzn-RequestId: 4324e313-d39e-409d-b8c3-82f91d9c8315
X-Amzn-Trace-Id: root=1-64ee3878-3f15f1742caddd824cd9d767;sampled=0;lineage=c4ffcf56:0

{"body":"hi from myPlayHandler"}

The extension now warns with this, granted very wordy, log message:

Could not load APM secret token from AWS Secrets Manager. Reporting APM data will likely fail. Is 'ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID=bogus-no-such-secret' correct? See https://www.elastic.co/guide/en/apm/lambda/current/aws-lambda-secrets-manager.html. Error message: failed to retrieve sercet value: operation error Secrets Manager: GetSecretValue, https response error StatusCode: 400, RequestID: 7fd623fc-fcbd-44b7-aeae-2b4e63c3168e, api error AccessDeniedException: User: arn:aws:sts::627286350134:assumed-role/trentm-play-fn1-role-irhf0v93/trentm-play-fn1 is not authorized to perform: secretsmanager:GetSecretValue on resource: bogus-no-such-secret because no identity-based policy allows the secretsmanager:GetSecretValue action

and then later data sending fails with a 403.

The trimmed Lambda invocation log showing the warnings
2023-08-29T18:27:04.671000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 INIT_START Runtime Version: nodejs:18.v11	Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:2d38a43033930b7e89d977b60a6556a2e84bcc276ff59eab90d1327b252528ef
2023-08-29T18:27:04.815000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 {"log.level":"warn","@timestamp":"2023-08-29T18:27:04.815Z","log.origin":{"function":"github.com/elastic/apm-aws-lambda/app.loadAWSOptions","file.name":"app/aws.go","file.line":52},"message":"Could not load APM secret token from AWS Secrets Manager. Reporting APM data will likely fail. Is 'ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID=bogus-no-such-secret' correct? See https://www.elastic.co/guide/en/apm/lambda/current/aws-lambda-secrets-manager.html. Error message: failed to retrieve sercet value: operation error Secrets Manager: GetSecretValue, https response error StatusCode: 400, RequestID: 7fd623fc-fcbd-44b7-aeae-2b4e63c3168e, api error AccessDeniedException: User: arn:aws:sts::627286350134:assumed-role/trentm-play-fn1-role-irhf0v93/trentm-play-fn1 is not authorized to perform: secretsmanager:GetSecretValue on resource: bogus-no-such-secret because no identity-based policy allows the secretsmanager:GetSecretValue action","ecs.version":"1.6.0"}
...
2023-08-29T18:27:05.471000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 {"log.level":"debug","@timestamp":"2023-08-29T18:27:05.471Z","log.origin":{"function":"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).PostToApmServer","file.name":"apmproxy/apmserver.go","file.line":169},"message":"Sending data chunk to APM server","ecs.version":"1.6.0"}
2023-08-29T18:27:05.737000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 {"log.level":"warn","@timestamp":"2023-08-29T18:27:05.737Z","log.origin":{"function":"github.com/elastic/apm-aws-lambda/apmproxy.logBodyErrors","file.name":"apmproxy/apmserver.go","file.line":233},"message":"failed to post data to APM server: response status: 403 Forbidden","ecs.version":"1.6.0"}
2023-08-29T18:27:05.737000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 {"log.level":"warn","@timestamp":"2023-08-29T18:27:05.737Z","log.origin":{"function":"github.com/elastic/apm-aws-lambda/apmproxy.logBodyErrors","file.name":"apmproxy/apmserver.go","file.line":235},"message":"document : message: unauthorized: anonymous access not permitted for agent \"nodejs\"","ecs.version":"1.6.0"}
...
2023-08-29T18:27:05.737000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 END RequestId: 4324e313-d39e-409d-b8c3-82f91d9c8315
2023-08-29T18:27:05.737000+00:00 2023/08/29/[$LATEST]5957f1fc2b894de6af6fb329c9ba1719 REPORT RequestId: 4324e313-d39e-409d-b8c3-82f91d9c8315	Duration: 295.92 ms	Billed Duration: 296 msMemory Size: 1769 MB	Max Memory Used: 118 MB	Init Duration: 769.69 ms

… Manager

If there is an error retrieving an APM auth secret from AWS Secrets
Manager (attempted when ELASTIC_APM_SECRETS_MANAGER_API_KEY_ID or ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID
are provided), then the extension will now *log a warning* rather
than erroring out.

This allows the Lambda function to still work. Only the reporting of
APM server data will fail (with a 403). This was already the behavior
for an invalid ELASTIC_APM_SECRET_TOKEN.
@trentm trentm requested a review from kruskall August 29, 2023 18:43
@trentm trentm self-assigned this Aug 29, 2023
@github-actions github-actions bot added the aws-λ-extension AWS Lambda Extension label Aug 29, 2023
@trentm
Copy link
Member Author

trentm commented Aug 29, 2023

cla/check

@kruskall
Copy link
Member

Changes look good but I'm not sure we want to go in this direction.

This allows the Lambda function to still work. Only the reporting of
APM server data will fail (with a 403). This was already the behavior
for an invalid ELASTIC_APM_SECRET_TOKEN.

I think the main difference is that this is not the responsibility of the extension. We can't know if the secre token is valid before sending data to APM.
On the other hand the secrets manager is configured in the aws client/sdk and has nothing to do with APM. If it fails to pull the secret then the initialization failed (IMO this is similar to providing an invalid url to the extension).

@trentm
Copy link
Member Author

trentm commented Aug 29, 2023

(IMO this is similar to providing an invalid url to the extension).

I disagree.

  1. Given a lambda configured with an invalid URL, e.g. ELASTIC_APM_LAMBDA_APM_SERVER=this-isn't-even-a-url, I think it is fine for the extension to error out as it does now (and hence break the Lambda function).
  2. However, given a Lambda configured with a valid URL that happens to not point to a working APM server, e.g. ELASTIC_APM_LAMBDA_APM_SERVER=https://example.com, I think the extension should function. It will just not be able to successfully send APM data. That's the current behaviour.

Given a lambda configured with ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID=this-is-a-valid-secret-id-but-the-secret-does-not-exist-currently, I think this is equivalent to the second case. It is a configuration that is syntactically valid. However, it is possible that the value does not work at runtime. APM data won't be able to flow, but I don't think this should break the user's Lambda function working.

A similar case is ELASTIC_APM_SECRETS_MANAGER_SECRET_TOKEN_ID=this-was-a-working-secret-id-until-some-SRE-accidentally-deleted-it. If Bob the SRE accidentally deletes a secret related to APM, this shouldn't break the invocation of Lambda that happened to already be configured with that secret just for APM.

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @trentm. I agree with your assessment: runtime errors should not prevent the Lambda from executing.

app/aws.go Outdated Show resolved Hide resolved
@axw axw merged commit fb6d2dc into elastic:main Aug 31, 2023
@trentm
Copy link
Member Author

trentm commented Aug 31, 2023

Thanks!

@trentm trentm deleted the trentm/graceful-secrets-manager-error branch February 1, 2024 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws-λ-extension AWS Lambda Extension
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants