Support new heartbeat 'state' fields #4023
Merged
andrewvc merged 6 commits into elastic:main on Oct 4, 2022
This is the mapping counterpart to elastic/beats#30632. It adds support for the new `state.*` fields.
Pinging @elastic/uptime (Team:Uptime)
andrewvc added a commit to elastic/beats that referenced this pull request on Sep 13, 2022
Fixes #32163; the corresponding mapping changes for the synthetics package are in elastic/integrations#4023.

Adds a notion of state across checks, with flapping as a bonus. At a high level this PR does the following:

- Adds new root level state fields.
- Enhances the ecserr package and types to make them more testable and usable.
- Refactors timeout, http status, and could not connect errors to use the new ecserr package, to make testing this PR/feature easier with lightweight monitors (these are the easiest types of errors to replicate).
- Adds support for the standard mage goIntegTest task, already supported by CI, that thus far has been a noop for heartbeat.
- Adds a notion of flapping states, in addition to up / down states.
- Automatically connects to ES to retrieve the last state value for the given monitor when a monitor first starts. This is necessary to continue the previous state across restarts of heartbeat.
- Replaces the add_observer_metadata processor with a new heartbeat.location global setting and a per-monitor location setting. This lets us set a location ID (which is then set to observer.name). See details below.

Note: flapping is currently disabled. Per the discussion in the review, it's a complex feature, so let's add it in a follow-up.

What are states, and how are they implemented here?

The main goal of this PR is to resolve #32163, which it does, but it also recognizes that the goal of grouping errors is a subset of the more general problem of grouping both up and down states. Grouping both is useful because it lets us see something like:

| State | Duration   | Reason     |
|-------|------------|------------|
| UP    | 18 hours   |            |
| DOWN  | 30 minutes | status 400 |
| Up    | 1 month    |            |

Hence the introduction of the various state.* fields, which group contiguous blocks of 'up' and 'down' states together. A sample of the state.* fields can be seen below:

    {
      "state": { // new state field in addition to existing monitor fields
        // globally unique ID for this state, this ID is sortable as a timestamp
        // to speed up aggregations. The format is id-timestampMsHex-serialHex
        // which is more compact than a UUID, and also chronologically sortable
        "id": "dummy-182a27ea210-2dc",
        // when this state first started, with this we can see when the first event in the
        // state occurred without having to retrieve that event
        "started_at": "2022-08-15T12:13:04.2721958-05:00",
        // number of milliseconds this state has been active for
        "duration_ms": 4655149,
        // number of checks that have occurred within this state
        // broken out by up/down. Flapping states will have non-zero values for both up/down
        "checks": 2290,
        "up": 0,
        "down": 2290,
        // status of the state, which can be 'up', 'down', or 'flapping'
        // usually identical to monitor.status except in the case of a flapping monitor
        "status": "down",
        // the last FLAPPING_THRESHOLD-1 checks, used to reconstruct flapping state
        // when resuming state from ES
        "flap_history": [],
        // The prior state, the nice thing about `state.ends` is that these states do not change
        // since they are complete, so they are easy to query / aggregate since the values are stable
        // in actual use these are only attached to events with `state.checks: 1` so they appear
        // exactly once
        "ends": {
          "started_at": "2022-08-15T12:12:57.1792082-05:00",
          "duration_ms": 5069,
          "status": "flap",
          "up": 3,
          "down": 1,
          "flap_history": null, // omitted on ends states since it's just dead-weight
          "id": "dummy-182a27e865b-2db",
          "checks": 4,
          "ends": null // we don't recurse end states
        }
      }
    }

Notes on location

The new location field can be set as follows:

    # globally
    heartbeat.location:
      id: "us-east-1a"
      geo:
        name: "US-East Coast"
        location: "44.123, 45.12345"
    heartbeat.monitors:
    - type: http
      id: my-monitor
      urls: "http://elastic.co"
      run_from:
        id: "us-east-1a"
        geo:
          name: "US-East Coast"
          location: "44.123, 45.12345"

Notes on flapping states

The new flapping state serves an important purpose: to reduce the cardinality of states for unstable sites. This is important for UX and UI reasons, since having large numbers of states to visualize in a list is a key thing we'd like to improve.

The flapping threshold in this PR is hard coded to the number 7. This number is the number of consecutive identical 'up' or 'down' checks the algorithm uses to determine whether a monitor is stable or not. As an example, if a monitor experiences 7 consecutive up checks followed by 7 consecutive down checks, it will be reflected as a single up state of 7 checks followed by a single down state of 7 checks. If, by contrast, there are 7 consecutive up checks followed by 6 consecutive down checks and then a single up check, there will be two consecutive states: up followed by flapping. If 6 consecutive down checks were to follow, the last of these new events would constitute a new down state following the flapping state, since the monitor would now be stable. Please see the unit tests for monitor states for more nuanced detail.

I don't think it makes sense to expose this to users yet, though we could in the future. It's a bit complex to explain, and I think this is a good starting point. We'll likely want to tweak this algorithm in the future, but that could be done in a follow-up. One concern I have is that it could take a while for monitors to recover if they run infrequently.

It should be noted that flapping checks start as a simple up or down check, but change into a flapping check if they see a different result before the flapping threshold is hit. So the most recent state.status is the only accurate value that should be used. It is also for this reason that two consecutive up or down states cannot happen, but multiple consecutive flapping states could happen if, after what looks like a recovery, instability occurs again. We may want to tweak this to allow for shorter stable states. Again, I think this could happen in a flapping follow-up.
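To make the grouping of contiguous checks into states more concrete, here is a minimal Python sketch. It is not heartbeat's implementation (which is written in Go); the `State` and `StateTracker` names are hypothetical, flapping is deliberately left out since the commit message notes it is disabled, and the serial counter simply starts at 1. It only illustrates the `id-timestampMsHex-serialHex` ID format, the per-state counters, and attaching the completed prior state as `ends` on the first check of a new state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)


@dataclass
class State:
    """Hypothetical mirror of the state.* fields shown in the sample."""
    id: str
    started_at: datetime
    status: str  # "up" or "down"; flapping is omitted from this sketch
    duration_ms: int = 0
    checks: int = 0
    up: int = 0
    down: int = 0
    ends: Optional["State"] = None  # prior state, present only when checks == 1


class StateTracker:
    """Hypothetical helper that groups contiguous identical results."""

    def __init__(self, monitor_id: str) -> None:
        self.monitor_id = monitor_id
        self.serial = 0  # assumption: a simple per-tracker counter
        self.current: Optional[State] = None

    def _new_id(self, now: datetime) -> str:
        # Format described in the commit message: id-timestampMsHex-serialHex
        self.serial += 1
        ts_ms = (now - EPOCH) // ONE_MS
        return f"{self.monitor_id}-{ts_ms:x}-{self.serial:x}"

    def record(self, status: str, now: datetime) -> State:
        """Record one check result; `now` must be timezone-aware."""
        prev = self.current
        if prev is None or prev.status != status:
            if prev is not None:
                prev.ends = None  # end states are not nested
            # A different result starts a new state; the completed one
            # is attached as `ends` on this first event only.
            self.current = State(
                id=self._new_id(now), started_at=now, status=status, ends=prev
            )
        else:
            prev.ends = None  # later events in a state do not carry `ends`
        cur = self.current
        cur.checks += 1
        if status == "up":
            cur.up += 1
        else:
            cur.down += 1
        cur.duration_ms = (now - cur.started_at) // ONE_MS
        return cur
```

Calling `record("up", t)` repeatedly keeps extending the same state; the first `record("down", t)` afterwards returns a new state whose `ends` holds the finished up state, which matches the note above that `ends` appears only on events with `state.checks: 1`.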
chrisberkhout pushed a commit to elastic/beats that referenced this pull request on Jun 1, 2023