Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise an alarm when any stages are in a Failed state. #6

Merged
merged 7 commits into from
Dec 13, 2018

Conversation

sam-goodwin
Copy link
Contributor

A function runs every minute and calls GetPipelineState for the provided pipeline's name, counts the number of failed stages and emits a JSON log { failedCount: }. A metric filter is then configured to track this value as a CloudWatch metric, and a corresponding alarm is set to fire when the maximim value of a single 5-minute interval is >= 1.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Sam Goodwin added 2 commits December 12, 2018 17:11
…or the number of failed stages, enabling consistent alarming on a 'failed pipeline'.
Copy link
Contributor

@eladb eladb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a unit test?

lib/pipeline-watcher.ts Outdated Show resolved Hide resolved
lib/pipeline-watcher.ts Outdated Show resolved Hide resolved
lib/pipeline.ts Outdated Show resolved Hide resolved
lib/pipeline.ts Outdated
.addResource(props.pipeline.pipelineArn)
.addAction('codepipeline:GetPipelineState'));

// Explicitly and pre-emptively create the function's log group because we need it for the metric filter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a race condition here between Lambda and CloudFormation. I remember @rix0rrr had to create a custom resource to implement something like that and I wonder if that's the case, if we should actually include this in the Lambda construct library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can address this by having the EventRule trigger depend on the LogGroup Resource. This ensures the log group is created before the first invocation of the function. Lambda lazily creates the log group on first invocation, so no more race condition. The only down-side I can see is that I have to drop down to L1 and rely on the findChild(‘Resource’) convention.

const logGroupResource = logGroup.findChild('Resource') as cdk.Resource;
const triggerResource = trigger.findChild('Resource') as cdk.Resource;
triggerResource.addDependency(logGroupResource);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess...

We should try to find a better way to allow defining dependencies between resources (aws/aws-cdk#95)

We have an issue to try and make log groups accessible in the CDK (aws/aws-cdk#667).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is that I have the handler call putMetricData directly instead of using a MetricFilter. Let me know what you think so I can wrap this up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the metric filter route is the right one, just want to make sure we have a better path in the CDK for future generations to use it. Tactically, your solution for delivlib is perfectly fine.

lib/pipeline.ts Outdated Show resolved Hide resolved
lib/pipeline.ts Outdated Show resolved Hide resolved
@@ -0,0 +1,29 @@
import AWS = require('aws-sdk');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the nit picking, but I'd call this "watcher-handler"

.length;
}
process.stdout.write(JSON.stringify({failedCount}));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write a simple unit test for this handler? (I know we don't have any UT for handlers in this library, yet, but let's raise the bar)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on it .. it's a pain since we use nodeunit (deprecated) and most of what the handler does is create a client, check env variables, make a request and log to stdout - all of which are annoying to test, ha.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #5 - I don't think we need to use nodeunit here. Moving to jest?

lib/pipeline.ts Outdated
@@ -6,6 +6,7 @@ import iam = require('@aws-cdk/aws-iam');
import sns = require('@aws-cdk/aws-sns');
import cdk = require('@aws-cdk/cdk');
import path = require('path');
import { PipelineWatcher } from './pipeline-watcher/watcher';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create an pipeline-watcher/index.ts file and export * from './watcher', so here it will be:

import { PipelineWatcher } from './pipeline-watcher'

});

this.alarm = new cloudwatch.Alarm(this, 'Alarm', {
alarmDescription: `Pipeline ${props.title || props.pipeline.pipelineName} has failed failed stages`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"failed failed"

// TODO: This creates a long namespace, better alternatives?
const metricNamespace = `CodePipeline/${props.pipeline.pipelineName}`;

new logs.MetricFilter(this, 'MetricFilter', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably use logGroup.extractMetric here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried, it doesn't support namespaces with a / or token values. Cut an issue: aws/aws-cdk#1351

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this namespace is appropriate. Generally this would contain the service name, like AWS/SQS or AWS/DynamoDB. Variables go into Dimensions. So in this case, I would expect it to be CDK/Delivlib or similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use dimensions in metric filters unfortunately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could instead make the metric name ugly instead of the namespace:

  • Namespace: CDK/Delivlib
  • MetricName: <pipeline-name>_FailedStages

lib/pipeline-watcher/watcher.ts Show resolved Hide resolved
lib/pipeline-watcher/index.ts Outdated Show resolved Hide resolved

const metricName = 'FailedStages';
// TODO: This creates a long namespace, better alternatives?
const metricNamespace = `CodePipeline/${props.pipeline.pipelineName}`;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rix0rrr I've called it out in the code here. Any other suggestions for distinguishing individual pipeline's metrics from each other? I can't use dimensions in metric filters :(

@sam-goodwin sam-goodwin merged commit ce744c0 into master Dec 13, 2018
@eladb eladb mentioned this pull request Dec 13, 2018
RomainMuller pushed a commit that referenced this pull request Dec 13, 2018
* **pipeline**: concurrency limit ([#9](#9))
* **gh-pages-publisher**: force-push without history ([#7](#7))
* **pipeline**: send email notifications on any action failure ([#10](#10))
* **github-releases**: if changelog doesn't exist, don't include release notes ([#8](#8))
* **pipeline**: raise an alarm when any stages are in a Failed state ([#6](#6))
@sam-goodwin sam-goodwin deleted the samgood/alarming branch December 17, 2018 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants