Support new heartbeat 'state' fields by andrewvc · Pull Request #4023 · elastic/integrations

andrewvc · 2022-08-17T21:19:22Z

This is the mapping counterpart to elastic/beats#30632.

It adds support for the new `state.*` fields.

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.

Author's Checklist

[ ]

How to test this PR locally

This is the mapping counterpart to elastic/beats#30632. It adds support for the new `state.*` fields.

elasticmachine · 2022-08-17T21:19:24Z

Pinging @elastic/uptime (Team:Uptime)

elasticmachine · 2022-08-17T21:30:55Z

💚 Build Succeeded

Build stats

Start Time: 2022-10-04T14:52:25.497+0000
Duration: 14 min 9 sec

Test stats 🧪

Test	Results
Failed	0
Passed	6
Skipped	0
Total	6

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.

elasticmachine · 2022-08-18T12:23:54Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	100.0% (`0/0`)	💚
Files	100.0% (`0/0`)	💚 2.53
Classes	100.0% (`0/0`)	💚 2.53
Methods	33.333% (`6/18`)	👎 -56.907
Lines	100.0% (`0/0`)	💚 8.417
Conditionals	100.0% (`0/0`)	💚

Fixes #32163. The corresponding mapping changes for the synthetics package are in elastic/integrations#4023.

Adds a notion of state across checks, with flapping as a bonus. At a high level this PR does the following:

- Adds new root-level state fields.
- Enhances the ecserr package and types to make them more testable and usable.
- Refactors timeout, HTTP status, and could-not-connect errors to use the new ecserr package, which makes this PR/feature easier to test with lightweight monitors (these are the easiest types of errors to replicate).
- Adds support for the standard mage goIntegTest task, which CI already supports but which has so far been a no-op for heartbeat.
- Adds a notion of flapping states, in addition to up / down states.
- Automatically connects to ES to retrieve the last state value for the given monitor when a monitor first starts. This is necessary to continue the previous state across restarts of heartbeat.
- Replaces the add_observer_metadata processor with a new heartbeat.location global setting and a per-monitor location setting. This lets us set a location ID (which is then set to observer.name). See details below.

Note: flapping is currently disabled. Per the discussion in the review, it's a complex feature, so let's add it in a follow-up.

What are states, and how are they implemented here?

The main goal of this PR is to resolve #32163, which it does, but it also recognizes that the goal of grouping errors is a subset of the more general problem of grouping both up and down states. Grouping both is useful because it lets us see something like:

State	Duration	Reason
UP	18 hours
DOWN	30 minutes	status 400
UP	1 month

Hence the introduction of the various state.* fields, which group contiguous blocks of 'up' and 'down' states together. A sample of the state.* fields can be seen below:

    {
      "state": { // new state field in addition to existing monitor fields
        // globally unique ID for this state, this ID is sortable as a timestamp
        // to speed up aggregations. The format is id-timestampMsHex-serialHex
        // which is more compact than a UUID, and also chronologically sortable
        "id": "dummy-182a27ea210-2dc",
        // when this state first started, with this we can see when the first event in the
        // state occurred without having to retrieve that event
        "started_at": "2022-08-15T12:13:04.2721958-05:00",
        // number of milliseconds this state has been active for
        "duration_ms": 4655149,
        // number of checks that have occurred within this state
        // broken out by up/down. Flapping states will have non-zero values for both up/down
        "checks": 2290,
        "up": 0,
        "down": 2290,
        // status of the state, which can be 'up', 'down', or 'flapping'
        // usually identical to monitor.status except in the case of a flapping monitor
        "status": "down",
        // the last FLAPPING_THRESHOLD-1 checks, used to reconstruct flapping state
        // when resuming state from ES
        "flap_history": [],
        // The prior state, the nice thing about `state.ends` is that these states do not change
        // since they are complete, so they are easy to query / aggregate since the values are stable
        // in actual use these are only attached to events with `state.checks: 1` so they appear
        // exactly once
        "ends": {
          "started_at": "2022-08-15T12:12:57.1792082-05:00",
          "duration_ms": 5069,
          "status": "flap",
          "up": 3,
          "down": 1,
          "flap_history": null, // omitted on ends states since it's just dead-weight
          "id": "dummy-182a27e865b-2db",
          "checks": 4,
          "ends": null // we don't recurse end states
        }
      }
    }

Notes on location

The new location field can be set as follows:

    # globally
    heartbeat.location:
      id: "us-east-1a"
      geo:
        name: "US-East Coast"
        location: "44.123, 45.12345"

    heartbeat.monitors:
    - type: http
      id: my-monitor
      urls: "http://elastic.co"
      run_from:
        id: "us-east-1a"
        geo:
          name: "US-East Coast"
          location: "44.123, 45.12345"

Notes on flapping states

The new flapping state serves an important purpose: to reduce the cardinality of states for unstable sites. This is important for UX and UI reasons, since having large numbers of states to visualize in a list is a key thing we'd like to improve.

The flapping threshold in this PR is hard-coded to 7. This is the number of consecutive identical 'up' or 'down' checks the algorithm uses to determine whether a monitor is stable or not. As an example, if a monitor experiences 7 consecutive up checks followed by 7 consecutive down checks, it will be reflected as a single up state of 7 checks followed by a single down state of 7 checks. If, by contrast, there are 7 consecutive up checks followed by 6 consecutive down checks and then a single up check, there will be two consecutive states: up followed by flapping. If 6 consecutive down checks were to follow, the last of these new events would constitute a new down state following the flapping state, since the monitor would now be stable. Please see the unit tests for monitor states for more nuanced detail.

I don't think it makes sense to expose this to users yet, though we could in the future. It's a bit complex to explain, and I think this is a good starting point. We'll likely want to tweak this algorithm in the future, but that could be done in a follow-up. One concern I have is that it could take a while for monitors to recover if they run infrequently.

It should be noted that flapping checks start as a simple up or down check, but change into a flapping check if they see a different result before the flapping threshold is hit. So the most recent state.status is the only accurate value that should be used. It is also for this reason that two consecutive up or down states cannot happen, but multiple consecutive flapping states could happen if, after what looks like a recovery, instability occurs again. We may want to tweak this to allow for shorter stable states. Again, I think this could happen in a flapping follow-up.
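To make the `state.id` format from the sample concrete, here is a minimal Python sketch that splits an ID of the form id-timestampMsHex-serialHex into its parts. It is an illustration only, not heartbeat's code (which is written in Go): the names `StateID` and `parse_state_id` are hypothetical, and the sketch assumes the last two hyphen-separated fields are always the hex timestamp and hex serial, so that monitor IDs containing hyphens still parse.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class StateID:
    """Hypothetical container for the parts of a state.id value."""
    monitor_id: str
    started_ms: int  # milliseconds since the Unix epoch
    serial: int

    @property
    def started_at(self) -> datetime:
        # Integer millisecond arithmetic avoids float rounding.
        return EPOCH + timedelta(milliseconds=self.started_ms)


def parse_state_id(raw: str) -> StateID:
    # Split from the right (assumption): the monitor ID may itself contain
    # hyphens, but the timestamp and serial are always the last two fields.
    monitor_id, ts_hex, serial_hex = raw.rsplit("-", 2)
    return StateID(monitor_id, int(ts_hex, 16), int(serial_hex, 16))


sid = parse_state_id("dummy-182a27ea210-2dc")
print(sid.monitor_id, sid.started_at.isoformat(), sid.serial)
```

For the ID in the sample, the decoded timestamp is 17:13:04.272 UTC on 2022-08-15, which matches the sample's `started_at` of 12:13:04.272 at a -05:00 offset.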
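The core idea of grouping contiguous blocks of 'up' and 'down' checks into states can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the heartbeat implementation: `State` and `group_states` are hypothetical names, the sketch tracks only the `status`, `checks`, `up`, and `down` counters, and it deliberately leaves out flapping (which the description says is currently disabled), durations, and state resumption from ES.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class State:
    """Hypothetical, reduced version of the state.* counters."""
    status: str
    checks: int = 0
    up: int = 0
    down: int = 0


def group_states(statuses: Iterable[str]) -> List[State]:
    """Group a sequence of 'up'/'down' check results into contiguous states."""
    states: List[State] = []
    for status in statuses:
        if status not in ("up", "down"):
            raise ValueError(f"unexpected check status: {status!r}")
        # A new state begins whenever the status differs from the current one.
        if not states or states[-1].status != status:
            states.append(State(status=status))
        current = states[-1]
        current.checks += 1
        if status == "up":
            current.up += 1
        else:
            current.down += 1
    return states


for state in group_states(["up"] * 7 + ["down"] * 7):
    print(state)
```

With 7 up checks followed by 7 down checks, the sketch yields one up state of 7 checks and one down state of 7 checks, matching the first example in the flapping notes. Without a flapping threshold, every change of status opens a new state, which is the high-cardinality behavior on unstable sites that the flapping state is meant to reduce.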

shahzad31

LGTM !!

Support new heartbeat 'state' fields

7e5cf69

This is the mapping counterpart to elastic/beats#30632. It adds support for the new `state.*` fields.

andrewvc added enhancement New feature or request Team:obs-ux-infra_services Obs UX: Infra & Services team [elastic/obs-ux-infra_services-team] labels Aug 17, 2022

andrewvc self-assigned this Aug 17, 2022

andrewvc requested a review from a team as a code owner August 17, 2022 21:19

Add changelog link

cc21e6d

andrewvc mentioned this pull request Aug 17, 2022

[heartbeat] States and Improved Errors elastic/beats#30632

Merged

6 tasks

andrewvc added 3 commits August 17, 2022 17:52

Fix nesting of state.ends

dd34263

Format

ebecb8e

Omit top-level key

53552c4

Merge branch 'main' into add-states

e9b50a0

shahzad31 approved these changes Oct 4, 2022

View reviewed changes

andrewvc merged commit a691c64 into elastic:main Oct 4, 2022

andrewvc deleted the add-states branch October 4, 2022 17:25

andrewvc merged 6 commits into elastic:main from andrewvc:add-states

4 participants
