Maintain datadog monitors via code #1673

boolafish · 2020-07-28T00:42:27Z

Why

We are having more and more alarms/metric/monitors in data dog now. Currently it is a messy manual process. Also, there are 2 possible ways to trigger an alarm. Could be via application code directly or via data dog monitor.

We have some alarm that is directly triggered in application code. However, it would be great if all alarm can be just in data dog and the application only need to emit metric. Currently for most cases we are emitting both metric and alarm via application code.

But even not doing the simplification, using some code as infra to maintain our current messy monitors might be already beneficial enough.

Note

One potential tool is terraform: https://www.terraform.io/docs/providers/datadog/r/monitor.html
Seems like supporting datadog too.

InoMurko · 2020-07-28T05:56:29Z

Just a note: you're using the word alarm, which has two definitions:

Application alarm that allows the application to be reactive.

Metric alarm.

boolafish · 2020-07-28T05:58:54Z

My intention is [alarm_word_placeholder] that would notify engineer to react on. I think datadog use the word monitor. Just I am used to use the word alarm previously 😅

unnawut · 2020-08-06T07:46:51Z

Just to expand Ino's comment a bit. The in-app alarms right now are mostly the ones that signals that the service is not in good shape to accept new transactions, e.g. :boot_in_progress, :ethereum_connection_error, :ethereum_stalled_sync, :invalid_fee_source, :main_supervisor_halted. And so when these ones are raised, new transactions are rejected immediately. So I think they'll have to remain in the application? And the rest of them, that needs human's help are already in datadog.

The only one that seems out of place is :system_memory_too_high which is being removed in #1678 (comment)

boolafish · 2020-08-07T05:43:17Z

If we want to simplify a bit and standardize everything, we can make those :boot_in_progress metric to DG as well. In app they still handles the original logic to react to those but instead of sending alarm directly, it sends metric instead. So we have maintained list of monitors/alarms in a single place to see what we have.

But....might not bring much benefit as we already have those code in place and is working fine. Just some burden to be able to see what are all the monitors we have.

unnawut · 2020-08-07T05:53:43Z

Ah I see. I think we can send an increment metric for each of the event above, then we can start to see their frequency patterns. The alarm would also be treated more properly by setting alarm threshold to > 0 for each metric. Right now they're sort of reported as discontinuous events.

But I concur that it's not high priority given other things we have in hand at the moment.

boolafish · 2020-08-07T05:55:50Z

lol my personally feeling is nobody would take this up at all hahaha unless gold team board become cleaner 😅

But can we pick this up in new services like chch v2 to be a standard approach for new service ?! @InoMurko @achiurizo

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Maintain datadog monitors via code #1673

Maintain datadog monitors via code #1673

boolafish commented Jul 28, 2020 •

edited

Loading

InoMurko commented Jul 28, 2020

boolafish commented Jul 28, 2020 •

edited

Loading

unnawut commented Aug 6, 2020

boolafish commented Aug 7, 2020

unnawut commented Aug 7, 2020 •

edited

Loading

boolafish commented Aug 7, 2020 •

edited

Loading

Maintain datadog monitors via code #1673

Maintain datadog monitors via code #1673

Comments

boolafish commented Jul 28, 2020 • edited Loading

Why

Note

InoMurko commented Jul 28, 2020

boolafish commented Jul 28, 2020 • edited Loading

unnawut commented Aug 6, 2020

boolafish commented Aug 7, 2020

unnawut commented Aug 7, 2020 • edited Loading

boolafish commented Aug 7, 2020 • edited Loading

boolafish commented Jul 28, 2020 •

edited

Loading

boolafish commented Jul 28, 2020 •

edited

Loading

unnawut commented Aug 7, 2020 •

edited

Loading

boolafish commented Aug 7, 2020 •

edited

Loading