
[Synthetics] Detect agent-level issues in monitor integration health#261367

Merged
miguelmartin-elastic merged 116 commits into elastic:main from miguelmartin-elastic:feat/synthetics-agent-health-status-258541
Apr 13, 2026


@miguelmartin-elastic miguelmartin-elastic commented Apr 6, 2026

Blocked by #256738

Summary

Closes #258541. Follow-up to #256738.

Extends the monitor integration health API and UI to detect agent-level issues in private locations — specifically when no agents are enrolled or all agents are offline/unhealthy. These statuses are surfaced alongside the existing integration-level checks.

Screenshot: Monitors Management, "No agents enrolled" tooltip
Screenshot: Monitors Management, mixed issues (reset-fixable and non-reset-fixable)
Screenshot: Monitors Management, reset is applied only to monitors that have at least one reset-fixable issue
Screenshot: Monitor edit, the reset button is not shown if there are no reset-fixable issues
Screenshot: Private locations table, the reset button is not shown if none of the unhealthy monitors in that private location is reset-fixable
Screenshot: Private locations table, the reset button is shown if at least one of the unhealthy monitors in that private location is reset-fixable

New health statuses (server):

  • missing_agents — the agent policy exists but has zero active agents enrolled
  • unhealthy_agent — agents are enrolled but none are online

Agent status is fetched in batch via Fleet's getAgentStatusForAgentPolicy for all relevant agent policies. The check uses status.active (not status.all) so that unenrolled/deleted agents don't incorrectly count as enrolled.
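The distinction between the two new statuses can be sketched as a small pure function. This is only an illustration, not the code from this PR: the `AgentStatusSummary` interface and `classifyAgentHealth` are hypothetical names, and the `online` field is an assumption (the description only names `active` and `all`).

```typescript
// Hypothetical, trimmed-down summary of agent counts for one agent policy.
// `active` and `all` are named in the PR description; `online` is assumed.
interface AgentStatusSummary {
  all: number; // also counts unenrolled/deleted agents
  active: number; // agents currently enrolled
  online: number; // enrolled agents that are online
}

type AgentHealth = 'missing_agents' | 'unhealthy_agent' | 'healthy';

function classifyAgentHealth(status: AgentStatusSummary): AgentHealth {
  // Check `active`, not `all`: unenrolled agents still count towards `all`
  // and would otherwise make an empty policy look like it has agents.
  if (status.active === 0) {
    return 'missing_agents';
  }
  // Agents are enrolled, but none of them is online.
  if (status.online === 0) {
    return 'unhealthy_agent';
  }
  return 'healthy';
}
```

With this sketch, a policy whose only agents were unenrolled (`all: 2, active: 0`) is reported as `missing_agents` rather than `unhealthy_agent`.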

Priority order (most fundamental → least):
missing_location → missing_agent_policy → missing_package_policy → missing_agents → unhealthy_agent → healthy
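One way to read this ordering: when several issues could apply to a location, the most fundamental one is reported. The sketch below assumes that reading; `STATUS_PRIORITY` and `mostFundamental` are hypothetical names and are not taken from the PR.

```typescript
// Statuses listed from most fundamental to least, as in the description.
const STATUS_PRIORITY = [
  'missing_location',
  'missing_agent_policy',
  'missing_package_policy',
  'missing_agents',
  'unhealthy_agent',
  'healthy',
] as const;

type PrioritizedStatus = (typeof STATUS_PRIORITY)[number];

// Returns the highest-priority status among the detected ones,
// falling back to 'healthy' when nothing was detected.
function mostFundamental(detected: ReadonlyArray<PrioritizedStatus>): PrioritizedStatus {
  for (const status of STATUS_PRIORITY) {
    if (detected.includes(status)) {
      return status;
    }
  }
  return 'healthy';
}
```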

UI changes:

  • MissingAgentPolicy, MissingAgents, and UnhealthyAgent are classified as non-reset-fixable — the reset button is hidden when all unhealthy locations have agent-level issues
  • When a selection is mixed (some fixable, some not), the bulk reset modal shows a collapsible warning listing the skipped monitors
  • In the private locations table, the reset button is now a primary inline action to avoid a blank space when hidden
  • The edit monitor callout shows "Agent issue detected" as the title when agent-level issues are present
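The reset-button rules above can be sketched as follows. All names here (`LocationStatus`, `MonitorHealth`, `canReset`, `partitionForBulkReset`) are hypothetical and chosen for illustration. `missing_location` is left out because the description does not say how it is classified, and `missing_package_policy` is treated as fixable based on the test plan.

```typescript
type LocationStatus =
  | 'missing_agent_policy'
  | 'missing_package_policy'
  | 'missing_agents'
  | 'unhealthy_agent'
  | 'healthy';

// Agent-level issues that a reset cannot fix, per the description.
const NON_RESET_FIXABLE: ReadonlySet<LocationStatus> = new Set<LocationStatus>([
  'missing_agent_policy',
  'missing_agents',
  'unhealthy_agent',
]);

interface MonitorHealth {
  name: string;
  locationStatuses: LocationStatus[];
}

function isResetFixable(status: LocationStatus): boolean {
  return status !== 'healthy' && !NON_RESET_FIXABLE.has(status);
}

// The reset button is shown if at least one location has a fixable issue.
function canReset(monitor: MonitorHealth): boolean {
  return monitor.locationStatuses.some(isResetFixable);
}

// Splits a bulk selection into monitors to reset and monitors to list
// in the "skipped" warning.
function partitionForBulkReset(selection: MonitorHealth[]) {
  return {
    toReset: selection.filter(canReset),
    skipped: selection.filter((m) => !canReset(m)),
  };
}
```

Applied to the test plan below, selecting Mon A, Mon B, and Mon D would put Mon D in `skipped`, while Mon C (one fixable location, one not) would still be resettable.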

Reset API fix:
Before calling editMonitors, the reset API now pre-filters out locations whose agent policy no longer exists. This prevents AgentPolicyNotFoundError from bubbling up as a 500 when a monitor has both fixable and non-fixable locations.
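A minimal sketch of that pre-filter, assuming the set of existing agent policy IDs has already been fetched. `MonitorLocation` and `dropLocationsWithMissingPolicy` are hypothetical names; the actual implementation in the PR may differ.

```typescript
// Hypothetical shape: a monitor location and the agent policy it relies on.
interface MonitorLocation {
  id: string;
  agentPolicyId: string;
}

// Keeps only the locations whose agent policy still exists, so the edit
// step is never asked to touch a location backed by a deleted policy.
function dropLocationsWithMissingPolicy(
  locations: MonitorLocation[],
  existingPolicyIds: ReadonlySet<string>
): MonitorLocation[] {
  return locations.filter((loc) => existingPolicyIds.has(loc.agentPolicyId));
}
```

For a monitor such as Mon F in the test plan below (loc1 plus loc3, whose agent policy was deleted), only loc1 would be passed on to the edit step.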

Test plan

Prerequisites

  • A running Kibana with at least one private location that has an enrolled, online agent

Setup test monitors

Run ~/elastic/scripts/break_monitors.sh against your Kibana instance. It creates:

| Monitor | Locations | Expected status |
| --- | --- | --- |
| Mon A | loc1 (agent online) | missing_package_policy (fixable) |
| Mon B | loc1 (agent online) | missing_package_policy (fixable) |
| Mon C | loc1 + loc2 (no agents) | missing_package_policy on loc1 + missing_agents on loc2 |
| Mon D | loc2 (no agents) | missing_agents (not fixable) |
| Mon E | loc3 (deleted agent policy) | missing_agent_policy (not fixable) |
| Mon F | loc1 + loc3 | missing_package_policy on loc1 + missing_agent_policy on loc3 |

What to verify

Monitor list page (/app/synthetics/monitors):

  • Mon A and Mon B show the unhealthy badge and a "Reset monitor" button in the row actions
  • Mon D shows the unhealthy badge but no reset button
  • Mon C shows the unhealthy badge and a reset button (mixed: one location is fixable)
  • Selecting Mon A + Mon B + Mon D and clicking bulk reset opens the confirmation modal with a warning listing Mon D as skipped
  • Confirming the bulk reset fixes Mon A and Mon B (they become healthy after a few seconds)

Edit monitor page for Mon C:

  • The callout lists both locations with their respective status messages
  • The reset button is visible (because loc1 is fixable)
  • Clicking reset fixes loc1; loc2 remains missing_agents

Private locations settings page (/app/synthetics/settings/private-locations):

  • The location with enrolled agents shows a "Reset monitors" button when Mon A/B are broken
  • The no-agents location does not show the reset button (all issues are agent-level)
  • No blank space appears where the reset button would be on the no-agents location row

New health status messages:

  • missing_agents: "No Fleet agents are enrolled in the agent policy for this private location. Enroll an agent in Fleet to resolve this."
  • unhealthy_agent: "All Fleet agents for this private location are unhealthy or offline. Check the agent status in Fleet."

miguelmartin-elastic and others added 12 commits April 1, 2026 18:31
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h statuses

Both statuses are removed from the private location health API:
- AgentPolicyMismatch: scenario is practically impossible in normal usage; monitors where the package policy exists now report Healthy regardless of which agent policy it is attached to
- PackageNotInstalled: if the synthetics package is missing the entire app fails; surfacing it per-monitor adds noise without actionable value

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot added the author:actionable-obs label Apr 6, 2026
miguelmartin-elastic and others added 3 commits April 7, 2026 17:40
…ts check

Fleet marks deleted/unenrolled agents with status=unenrolled but still
counts them in `all`. Using `active` correctly identifies policies with
no currently-enrolled agents, triggering missing_agents instead of
unhealthy_agent in that case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@miguelmartin-elastic miguelmartin-elastic marked this pull request as ready for review April 8, 2026 11:01
@miguelmartin-elastic miguelmartin-elastic requested a review from a team as a code owner April 8, 2026 11:01

macroscopeapp Bot commented Apr 8, 2026

Approvability

Verdict: Would Approve

Adds new health detection capability for Fleet agent issues in Synthetics monitoring. While this introduces new runtime behavior (new health statuses and Fleet API calls), all changes are within the author's owned code, are additive/backward-compatible, and include comprehensive tests. The scope is self-contained within the health API.

No code changes detected at 174e407. Prior analysis still applies.

Macroscope would have approved this PR. A repo admin can enable approvability here.

@miguelmartin-elastic miguelmartin-elastic added the release_note:skip and backport:skip labels Apr 8, 2026
@elasticmachine

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| synthetics | 1.1MB | 1.1MB | +516.0B |


@shahzad31 shahzad31 left a comment


LGTM!

I think we need further work to make these errors a bit more prominent in the UI.

We likely also need to display them on the cards, and possibly as a top-level item when this happens for a significant number of monitors.


@miguelmartin-elastic miguelmartin-elastic merged commit a54fcf0 into elastic:main Apr 13, 2026
17 checks passed

Labels

  • author:actionable-obs (PRs authored by the actionable obs team)
  • backport:skip (This PR does not require backporting)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • v9.5.0


Development

Successfully merging this pull request may close these issues.

[Synthetics] Extend monitor integration health: detect missing and unhealthy agent states

4 participants