-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise an alarm when any stages are in a Failed state. #6
Conversation
…or the number of failed stages, enabling consistent alarming on a 'failed pipeline'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a unit test?
lib/pipeline.ts
Outdated
.addResource(props.pipeline.pipelineArn) | ||
.addAction('codepipeline:GetPipelineState')); | ||
|
||
// Explicitly and pre-emptively create the function's log group because we need it for the metric filter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a race condition here between Lambda and CloudFormation. I remember @rix0rrr had to create a custom resource to implement something like that and I wonder if that's the case, if we should actually include this in the Lambda construct library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can address this by having the EventRule trigger depend on the LogGroup Resource. This ensures the log group is created before the first invocation of the function. Lambda lazily creates the log group on first invocation, so no more race condition. The only down-side I can see is that I have to drop down to L1 and rely on the findChild(‘Resource’) convention.
const logGroupResource = logGroup.findChild('Resource') as cdk.Resource;
const triggerResource = trigger.findChild('Resource') as cdk.Resource;
triggerResource.addDependency(logGroupResource);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess...
We should try to find a better way to allow defining dependencies between resources (aws/aws-cdk#95)
We have an issue to try and make log groups accessible in the CDK (aws/aws-cdk#667).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative is that I have the handler call putMetricData
directly instead of using a MetricFilter. Let me know what you think so I can wrap this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the metric filter route is the right one, just want to make sure we have a better path in the CDK for future generations to use it. Tactically, your solution for delivlib is perfectly fine.
… Fix style and FnConcat comments.
@@ -0,0 +1,29 @@ | |||
import AWS = require('aws-sdk'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the nit picking, but I'd call this "watcher-handler"
.length; | ||
} | ||
process.stdout.write(JSON.stringify({failedCount})); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write a simple unit test for this handler? (I know we don't have any UT for handlers in this library, yet, but let's raise the bar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on it .. it's a pain since we use nodeunit (deprecated) and most of what the handler does is create a client, check env variables, make a request and log to stdout - all of which are annoying to test, ha.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #5 - I don't think we need to use nodeunit here. Moving to jest?
lib/pipeline.ts
Outdated
@@ -6,6 +6,7 @@ import iam = require('@aws-cdk/aws-iam'); | |||
import sns = require('@aws-cdk/aws-sns'); | |||
import cdk = require('@aws-cdk/cdk'); | |||
import path = require('path'); | |||
import { PipelineWatcher } from './pipeline-watcher/watcher'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create an pipeline-watcher/index.ts
file and export * from './watcher'
, so here it will be:
import { PipelineWatcher } from './pipeline-watcher'
lib/pipeline-watcher/watcher.ts
Outdated
}); | ||
|
||
this.alarm = new cloudwatch.Alarm(this, 'Alarm', { | ||
alarmDescription: `Pipeline ${props.title || props.pipeline.pipelineName} has failed failed stages`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"failed failed"
// TODO: This creates a long namespace, better alternatives? | ||
const metricNamespace = `CodePipeline/${props.pipeline.pipelineName}`; | ||
|
||
new logs.MetricFilter(this, 'MetricFilter', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably use logGroup.extractMetric
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried, it doesn't support namespaces with a /
or token values. Cut an issue: aws/aws-cdk#1351
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this namespace is appropriate. Generally this would contain the service name, like AWS/SQS
or AWS/DynamoDB
. Variables go into Dimensions
. So in this case, I would expect it to be CDK/Delivlib
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't use dimensions in metric filters unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could instead make the metric name ugly instead of the namespace:
- Namespace:
CDK/Delivlib
- MetricName:
<pipeline-name>_FailedStages
lib/pipeline-watcher/watcher.ts
Outdated
|
||
const metricName = 'FailedStages'; | ||
// TODO: This creates a long namespace, better alternatives? | ||
const metricNamespace = `CodePipeline/${props.pipeline.pipelineName}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rix0rrr I've called it out in the code here. Any other suggestions for distinguishing individual pipeline's metrics from each other? I can't use dimensions in metric filters :(
* **pipeline**: concurrency limit ([#9](#9)) * **gh-pages-publisher**: force-push without history ([#7](#7)) * **pipeline**: send email notifications on any action failure ([#10](#10)) * **github-releases**: if changelog doesn't exist, don't include release notes ([#8](#8)) * **pipeline**: raise an alarm when any stages are in a Failed state ([#6](#6))
A function runs every minute and calls GetPipelineState for the provided pipeline's name, counts the number of failed stages and emits a JSON log { failedCount: }. A metric filter is then configured to track this value as a CloudWatch metric, and a corresponding alarm is set to fire when the maximim value of a single 5-minute interval is >= 1.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.