- 
                Notifications
    You must be signed in to change notification settings 
- Fork 4.3k
fix(cloudwatch): metric period in AnomalyDetectionAlarm is not being respected #35319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(cloudwatch): metric period in AnomalyDetectionAlarm is not being respected #35319
Conversation
| * | ||
| * @default Duration.minutes(5) | ||
| */ | ||
| readonly period?: cdk.Duration; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's a bad pattern to add a property to a child interface when the same property exists on the parent, especially when the parent property is marked as deprecated. Maybe I could rename this to anomalyDetectionPeriod or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This review is outdated)
| const { period, ...alarmProps } = props; | ||
| super(scope, id, { | ||
| ...props, | ||
| ...alarmProps, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this, the period prop will get passed into Alarm constructor which led to me getting this error on deployment:
❌  TestCdkStack failed: ToolkitError: The stack named TestCdkStack failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Resource handler returned message: "Period should not be set if list of Metrics is set. (Service: CloudWatch, Status Code: 400, Request ID: <request id>) (SDK Attempt Count: 1)" (RequestToken: <request token>, HandlerErrorCode: GeneralServiceException) 
... which I believe is understandable since period has to be passed to the MathExpresion "metric", not the alarm itself.
✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.
| evaluationPeriods: 1, | ||
|  | ||
| // The period over which the anomaly detection band's statistics are applied (default: 5 minutes) | ||
| period: Duration.minutes(5), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be passing in period here. Period is an attribute of metric. Most (all) other CloudWatch alarms use the metric period as monitoring period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Math expressions have their own period attribute which overrides the period attributes of the metrics passed into it (which makes sense since you can pass multiple metrics with different period attributes into it).
I felt this was a way to transparently pass this period into the math expression without changing existing behaviour.
Maybe we can rename period here to anomalyDetectionPeriod.
Or I guess we could have the math expression pick up the monitoring period from the metric passed into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that confused me when I was trying to debug why the alarm wasn't picking up the period of my metric was that AnomalyDetectionAlarmProps through CreateAlarmOptionsBase  already had a period member, but it is marked as deprecated, and I can't even find it as an option in my version of cdk-lib for some reason. (I see you mentioned it here as well). If the base interface marks period as deprecated, we probably shouldn't reintroduce it.
It does seem the easier (and better) way would be to remove the period field from the MathExpression. If we do that, we also won't need this workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi folks, thank you for looking into this. I have proposed a fix here: https://github.com/aws/aws-cdk/pull/35319/files#r2394391941
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that confused me when I was trying to debug why the alarm wasn't picking up the period of my metric was that AnomalyDetectionAlarmProps through CreateAlarmOptionsBase already had a
periodmember, but it is marked as deprecated, and I can't even find it as an option in my version of cdk-lib for some reason. (I see you mentioned it here as well).
Hey @Miles123K
I believe this is the case because, as I've mentioned in my PR description, jsii removes deprecated properties during compilation
| ...props, | ||
| ...alarmProps, | ||
| comparisonOperator: props.comparisonOperator ?? ComparisonOperator.LESS_THAN_LOWER_OR_GREATER_THAN_UPPER_THRESHOLD, | ||
| metric: Metric.anomalyDetectionFor(props), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vivekashok1221 , I started working on fixing the bug before I saw this PR. The correct fix should be:
| metric: Metric.anomalyDetectionFor(props), | |
| metric: Metric.anomalyDetectionFor({ | |
| ...props, | |
| // AnomalyDetectionAlarmProps.period is deprecated - the guidance recommends encoding the period in the metric itself | |
| period: metricPeriod(props.metric), | |
| }), | 
This uses the period provided in the metric object, which is in line with the guidance:
| * @deprecated Use `metric.with({ period: ... })` to encode the period into the Metric object | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll try to implement your suggestion.
I wanted to confirm one thing: correcting the anomaly detection alarm to use the metric's period will change the alarm duration for everyone who has specified a period in the metric object passed into the alarm- it will switch from the current 5 minute default (set by math expression here) to match their metric period (This was why I added an explicit way to specify duration in AnomalyDetectionAlarm construct in the first place). Is this approach acceptable?
(I agree this approach is correct. I initially expected the alarm to pick up the metric's period anyway. However, I do have some concerns about changing existing behaviour for current users.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good callout, thanks.
I understand the risk here; that said, there are two considerations which make me comfortable with fixing the bug directly instead of adding the extra period property:
- While there may be customers who realised that the default was incorrectly set to 5 minutes despite their configuration (i.e. the behaviour is a bug), there may also be customers who did not realise this (i.e they still think the alarm is configured with the correct period) - fixing the period to match the metric will automatically fix the bug for them.
- The current (incorrect) default period of 5 minutes is not documented anywhere as intended behaviour (i.e. it is not a contract) - it is purely a result of buggy behaviour and requires users to actually read CDK code to find the MathExpression where it is set.
Also, ultimately we want the correct long-term solution in place. A user who might start with AnomalyDetection next week might be confused by the extra period property - the direct fix helps them (and all future users) to understand expected behaviour better.
Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM 👍 , I'll do as you've suggested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Let's update the commit + PR title to  | 
| Sure, I'll do that. | 
| 
 There's no specific policy. However, you do not have to rebase locally - you can just make your commit and Mergify will take care of squashing and merging (assuming no conflicts of course). | 
9bdb0f6    to
    8388ee0      
    Compare
  
    Fixes issue where anomaly detection band alarm's period always defaulted to 300 seconds regardless of the metric's period: Pass the metric's period to anomalyDetectionFor() to prevent MathExpression from using its default period.
        
          
                packages/aws-cdk-lib/aws-cloudwatch/test/alarm-anomaly-detection.test.ts
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This review is outdated)
| Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). | 
| This pull request has been removed from the queue for the following reason:  The merge conditions cannot be satisfied due to failing checks:You may have to fix your CI before adding the pull request to the queue again. | 
| @Mergifyio requeue | 
| 
 ❌ Command disallowed due to command restrictions in the Mergify configuration.
 | 
| I think the build job is failing in CI due to an OOM error. Rerunning the job by requeuing my PR might fix it. | 
| Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). | 
| This pull request has been removed from the queue for the following reason:  The pull request can't be updated. You should update or rebase your pull request manually. If you do, this pull request will automatically be requeued once the queue conditions match again. | 
| Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). | 
| This pull request has been removed from the queue for the following reason:  The pull request can't be updated. You should update or rebase your pull request manually. If you do, this pull request will automatically be requeued once the queue conditions match again. | 
Pull request has been modified.
| Mergify was failing because the PR could not be updated. Rebase from the UI wasn't working either so pushed a merge commit - hopefully this will get it done | 
| Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). | 
| Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). | 
| Comments on closed issues and PRs are hard for our team to see. | 
Fixes issue where anomaly detection band alarm's period defaulted to 300 seconds regardless of the metric's period. AnomalyDetectionAlarmProps didn't have a period property because it was deprecated in the parent interface and removed by jsii during compilation, causing the internal MathExpression to default to 300 seconds.
I am an Amazon employee (see commit email).
Issue
Should fix #34614 (the issue was closed as duplicate but I'm not sure the issue it was linked to is actually the same issue)
Reason for this change
Although I passed in the duration of 1 day as the
periodfor the alarm metric, the actual period for the evaluation of the anomaly detection band is overridden to 5 minutes (300 seconds).i.e,
cdk synthshows this warning:And, on deployment, I can see on the CloudWatch alarm dashboard that the period is 5 minutes.
Example alarm:
This happens because
AnomalyDetectionAlarmcreates an internalMathExpressionfor the anomaly detection band, and sinceAnomalyDetectionAlarmPropsdoesn't have a period property, no period gets passed to that math expression. The math expression then defaults to 300 seconds which overrides the period I've set on the metric used within it (math expression always overrides the periods of the metrics passed into it)I believe this is a bug in aws-cdk-lib where the metric's period isn't being respected by the Anomaly Detection Alarm.
If I try to include the
periodproperty inAnomalyDetectionAlarmProps, I get the error:Object literal may only specify known properties, and 'period' does not exist in type 'AnomalyDetectionAlarmProps'despiteAnomalyDetectionAlarmPropsextendingCreateAlarmOptionsBase, which has theperiodproperty.This is because the
periodproperty has been marked as deprecated sojsiiremoved that property during compilation, which resulted inAnomalyDetectionAlarmPropsnot receiving this property from the parent interface.Possible Fixes
periodproperty toAnomalyDetectionAlarmPropsmetricpassed into itThis PR implements approach 2 (see discussion)
Describe any new or updated permissions being added
N/A
Description of how you validated changes
Added two new unit tests
Added an integration test
Steps I took for testing
(listing it here to catch if I missed anything/did something wrong)
testCDKpackage with the example alarm I've provided above.aws-cdk-liband rannpx lerna run build --scope=aws-cdk-lib../aws/link-all.shintestCDKdirectory.npx cdk synthto make sure the generated file is correct.npx cdk deployand verified that the dashboard is displaying the correct duration.yarn testinaws-cdk/packages/aws-cdk-libnpx lerna run build --scope=@aws-cdk-testing/framework-integfollowed byyarn integ --update-on-failedChecklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license