Conversation
@nathantournant nathantournant commented Sep 30, 2021

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Please briefly describe your changes as well as the motivation behind them:

This PR creates a Notifier interface to send Kubernetes events to the Slack API in a useful format. It allows us to leverage Kubernetes event creation mechanisms (anti-spam, lifecycle integration, tooling, etc.) as a way to notify users about unexpected behaviors in their disruption, directly as a Slack DM.
Currently, users have little to no information when a disruption goes wrong in a controlled way (stuck in removal, no target found, etc.). With this PR, any warning event from the controller (which we generate ourselves and have total control over) will be sent to the user as a Slack DM, along with information about the disruption (where to find it, number of targets, etc.), so they quickly know they have to dig in and where to start.
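The Notifier pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the `Event`, `Notifier`, `ConsoleNotifier`, and `dispatch` names are stand-ins, and the "NOOP" console sink mirrors the minikube testing behavior mentioned below.

```go
package main

import "fmt"

// Event is a stand-in for a Kubernetes corev1.Event.
type Event struct {
	Type    string // e.g. "Warning" or "Normal"
	Reason  string
	Message string
}

// Notifier abstracts a destination for controller events (Slack DM, console, ...).
type Notifier interface {
	NotifyWarning(e Event) error
}

// ConsoleNotifier is a no-op-style sink: instead of calling Slack, it prints
// the warning to the console, prefixed so it can be grepped in the logs.
type ConsoleNotifier struct{}

func (ConsoleNotifier) NotifyWarning(e Event) error {
	fmt.Printf("NOOP: warning %s: %s\n", e.Reason, e.Message)
	return nil
}

// dispatch forwards only warning events to the notifier, matching the idea
// that users are notified about unexpected behaviors, not normal lifecycle.
func dispatch(n Notifier, e Event) error {
	if e.Type != "Warning" {
		return nil // normal events are not worth a DM
	}
	return n.NotifyWarning(e)
}

func main() {
	n := ConsoleNotifier{}
	_ = dispatch(n, Event{Type: "Warning", Reason: "NoTarget", Message: "no target found"})
}
```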

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • Starting disruptions with no target will generate a warning event. On a cluster, you will receive a Slack DM for it. On minikube, you can look at the logs and search for NOOP: warning to assert the warning was sent to the console.
    • Any configuration mistake will generate benign, easy-to-debug errors at controller launch (no token found, sink doesn't exist, Slack auth failed, etc.).
    • Test version will be deployed to plain1b.us1.staging.dog at the release of this PR so you can try it there. There will be a dedicated branch in chaos-k8s-ressources to observe the differences in deployment files.
    • locally.
    • as a canary deployment to a cluster.
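The fail-fast configuration checks mentioned above could look something like this sketch. The `NotifierConfig` type, its fields, and the error messages are assumptions for illustration; the real controller's configuration differs.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// NotifierConfig is a hypothetical stand-in for the controller's notifier
// configuration: which sink to use and the credentials it needs.
type NotifierConfig struct {
	Sink  string // e.g. "slack" or "noop"
	Token string // Slack API token, required for the slack sink
}

// Validate returns a benign, easy-to-debug error for each misconfiguration
// case (no token found, sink doesn't exist, ...), checked at launch.
func (c NotifierConfig) Validate() error {
	switch c.Sink {
	case "noop":
		return nil
	case "slack":
		if c.Token == "" {
			return errors.New("slack sink selected but no token found")
		}
		return nil
	default:
		return fmt.Errorf("sink %q doesn't exist", c.Sink)
	}
}

func main() {
	cfg := NotifierConfig{Sink: "slack"} // token intentionally missing
	if err := cfg.Validate(); err != nil {
		// surface the problem at controller launch instead of failing later
		fmt.Fprintln(os.Stderr, "notifier disabled:", err)
	}
}
```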

Diagrams

Notifier Class/Event Lifecycle Diagram:
Kubernetes Events Notifier Class/Lifecycle

Controller Events Lifecycle (not exhaustive):
Disruption Lifecycle

```go
	}

	if err = s.parseEventToNotifier(event, dis); err != nil {
		fmt.Println(err)
```
Contributor

Expected? It'll print an error on standard output with no formatting while I would expect more something handled by a logger here.

Member Author

@nathantournant nathantournant Oct 7, 2021

Definitely not, it slipped by me; thank you for noticing.
I would still like to log this error, though not here but probably upstream, in the NotifierSink interface.
I'll make another proposal.

edit: never mind, this is the NotifierSink interface; I'll add a logger there.

Member Author

Final word:
a logger has been added almost everywhere, to report errors from the NotifierSink and warnings from the SlackNotifier.

```go
		return
	}

func (s *NotifierSink) Create(event *corev1.Event) (*corev1.Event, error) {
```
Contributor

What happens when this function returns an error? Does it retry or stop here? We'll likely reach this case if the username stored in the disruption status is invalid (for instance, when a disruption is created by something other than a human, like a workflow).

Member Author

Nothing happens: the error is generated in the Slack notifier and gets back to the NotifierSink.Create function, which just absorbs it (the error from parseEventToNotifier is ignored, as returning it would actually generate an "Event Rejected" error log).

Member Author

Final implemented behavior:

  • if the username isn't an email address, nothing happens
  • if the username is an email address but no Slack user is found, a warning is logged
  • if the username is an email address and a Slack user is found, the notification is sent
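The three cases above amount to a small decision function, sketched below. This is purely illustrative: the real SlackNotifier looks users up through the Slack API, which is stubbed here as a map, and `notifyDecision` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// slackUsers stubs the Slack user lookup (email -> Slack user ID).
var slackUsers = map[string]string{"jane.doe@example.com": "U012345"}

// notifyDecision returns what the notifier does for a given username:
// "skip" (not an email), "warn" (email but no Slack user), or "send".
func notifyDecision(username string) string {
	if !strings.Contains(username, "@") {
		return "skip" // not an email address: nothing happens
	}
	if _, ok := slackUsers[username]; !ok {
		return "warn" // email but no Slack user found: a warning is logged
	}
	return "send" // Slack user found: the DM is sent
}

func main() {
	for _, u := range []string{"workflow-bot", "ghost@example.com", "jane.doe@example.com"} {
		fmt.Printf("%s -> %s\n", u, notifyDecision(u))
	}
}
```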

Contributor

@Devatoria Devatoria left a comment

LGTM, the last remaining thing would be to add a few lines about this feature in the features doc, maybe? https://github.com/DataDog/chaos-controller/blob/main/docs/features.md

@nikos912000
Contributor

This feature is very useful, thanks for that!

I was wondering if you are planning to add support for HTTP sinks. I implemented something similar in kube-monkey a while back, which works with any HTTP collectors, including Slack, since this can be treated as a simple REST call.

Contributor

@Devatoria Devatoria left a comment

Two small nits in the doc, otherwise LGTM.

@nathantournant nathantournant merged commit f02cd0a into main Oct 18, 2021
@nathantournant nathantournant deleted the nathan/notifier branch October 18, 2021 13:46
@nathantournant
Member Author

> This feature is very useful, thanks for that!
>
> I was wondering if you are planning to add support for HTTP sinks. I implemented something similar in kube-monkey a while back, which works with any HTTP collectors, including Slack, since this can be treated as a simple REST call.

Thank you for the positive feedback!
We'll consider adding HTTP webhook support sometime soon.
