[Synthetics] Detect and display missing/corrupted Synthetics integrations in monitor UIs #256738

Merged
miguelmartin-elastic merged 115 commits into elastic:main from miguelmartin-elastic:synthetics/missing-integrations-ui
Apr 6, 2026

Conversation

@miguelmartin-elastic (Contributor) commented Mar 9, 2026

Release note

Synthetics monitor integration health detection and self-healing

Synthetics now automatically detects when private location monitors have broken Fleet integrations — such as deleted agent or package policies and missing locations — and surfaces per-location health status directly in the monitor management list, monitor edit page, and private locations settings. Users can reset affected monitors individually or in bulk to recreate the missing Fleet resources and restore monitoring.

Summary

Closes #256397
Closes #256398
Closes #256399

Adds a new Monitor Integration Health API for Synthetics that detects and reports unhealthy monitor/private-location configurations, and surfaces these health statuses in the UI with actionable per-location details and reset capabilities.

Problem

When a Synthetics monitor uses a private location, several Fleet-level resources (package policies, agent policies, the synthetics package itself) must be in place for the monitor to run. If any of these resources are missing or misconfigured, the monitor silently fails to collect data. Users have no clear signal about what went wrong or which location is affected.

Solution

Backend: MonitorIntegrationHealthApi service

A new MonitorIntegrationHealthApi service class evaluates the health of monitors across their private locations. It detects three distinct failure scenarios, evaluated in an intentional priority order (most fundamental issue first):

| Priority | Status | Description |
|---|---|---|
| 1 | `missing_location` | The monitor references a private location that no longer exists |
| 2 | `missing_agent_policy` | The agent policy referenced by the private location was deleted |
| 3 | `missing_package_policy` | The Fleet package policy for this monitor/location pair is missing |

Only the first matching status is reported per location, since the higher-priority issue is the root cause (e.g., if the agent policy is deleted, reporting missing_package_policy on its package policies would be misleading — the fix is the same regardless).
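The first-match evaluation can be sketched as follows. This is a minimal sketch: `checkLocation` and the `FleetState` lookup shape are illustrative, not the actual service internals.

```typescript
type LocationHealthStatus =
  | 'missing_location'
  | 'missing_agent_policy'
  | 'missing_package_policy'
  | 'healthy';

// Hypothetical inputs: sets/maps of IDs that currently exist in Fleet.
interface FleetState {
  privateLocationIds: Set<string>;
  agentPolicyIdByLocationId: Map<string, string>;
  existingAgentPolicyIds: Set<string>;
  existingPackagePolicyIds: Set<string>;
}

// Evaluate checks in priority order and return only the first failure,
// so a deleted agent policy is never misreported as a missing package policy.
function checkLocation(
  configId: string,
  locationId: string,
  fleet: FleetState
): LocationHealthStatus {
  if (!fleet.privateLocationIds.has(locationId)) {
    return 'missing_location';
  }
  const agentPolicyId = fleet.agentPolicyIdByLocationId.get(locationId);
  if (!agentPolicyId || !fleet.existingAgentPolicyIds.has(agentPolicyId)) {
    return 'missing_agent_policy';
  }
  if (!fleet.existingPackagePolicyIds.has(`${configId}-${locationId}`)) {
    return 'missing_package_policy';
  }
  return 'healthy';
}
```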

The service supports both new ({configId}-{locationId}) and legacy ({configId}-{locationId}-{spaceId}) policy ID formats using getPolicyIdFormatInfo, preventing false-positive missing_package_policy reports for monitors created before the space-agnostic migration.
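The dual-format check can be sketched like this. The real service uses `getPolicyIdFormatInfo`; the helper names and signatures below are illustrative only.

```typescript
// Produce both candidate package-policy IDs for a monitor/location pair so
// the health check accepts either format before flagging missing_package_policy.
function candidatePolicyIds(
  configId: string,
  locationId: string,
  spaceId: string
): string[] {
  return [
    `${configId}-${locationId}`, // current space-agnostic format
    `${configId}-${locationId}-${spaceId}`, // legacy pre-migration format
  ];
}

// A monitor's package policy "exists" if either ID format is found.
function packagePolicyExists(
  configId: string,
  locationId: string,
  spaceId: string,
  existingIds: Set<string>
): boolean {
  return candidatePolicyIds(configId, locationId, spaceId).some((id) =>
    existingIds.has(id)
  );
}
```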

Two new internal endpoints expose this:

  • POST /internal/synthetics/monitors/_health — Bulk health check (up to 500 monitor IDs)
  • GET /internal/synthetics/monitors/{monitorId}/_health — Single monitor health check (returns 404 if the monitor doesn't exist)

You can check the OAS spec here: api.yml

The service uses Promise.allSettled for partial error handling: if some monitors fail to load, the response includes both the successful health results and per-monitor errors with proper statusCode propagation.
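The partial-error pattern boils down to partitioning settled promises. A sketch, where `loadHealth` is a hypothetical per-monitor loader and the shapes are simplified from the actual response types:

```typescript
interface MonitorHealth {
  configId: string;
  isUnhealthy: boolean;
}
interface MonitorError {
  monitorId: string;
  statusCode: number;
  message: string;
}

// Run all health lookups, then split fulfilled results from rejections,
// propagating a statusCode from the rejection when one is present.
async function bulkHealth(
  monitorIds: string[],
  loadHealth: (id: string) => Promise<MonitorHealth>
): Promise<{ monitors: MonitorHealth[]; errors: MonitorError[] }> {
  const settled = await Promise.allSettled(monitorIds.map(loadHealth));
  const monitors: MonitorHealth[] = [];
  const errors: MonitorError[] = [];
  settled.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      monitors.push(result.value);
    } else {
      const reason = result.reason as { statusCode?: number; message?: string };
      errors.push({
        monitorId: monitorIds[i],
        statusCode: reason.statusCode ?? 500,
        message: reason.message ?? 'Unknown error',
      });
    }
  });
  return { monitors, errors };
}
```

Because `Promise.allSettled` never rejects, one failing monitor cannot abort the whole bulk response.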

Frontend: useMonitorIntegrationHealth hook + Redux slice

A new monitor_health Redux slice fetches health data from the bulk API. The saga uses debounce(50ms) to aggregate rapid dispatches from multiple hook instances into a single API call. The useMonitorIntegrationHealth hook provides:

  • isUnhealthy(configId) — boolean check for a specific monitor
  • getUnhealthyLocationStatuses(configId) — per-location details with translated reasons
  • getUnhealthyMonitorCountForLocation(locationId) — count of unhealthy monitors per location
  • getUnhealthyConfigIdsForLocation(locationId) — config IDs of unhealthy monitors at a location
  • resetMonitor(configId) — triggers a single monitor reset via the reset API
  • resetMonitors(configIds) — triggers a bulk monitor reset via the bulk reset API
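The hook's read-side selectors amount to simple lookups over the bulk response. A sketch as plain functions over a simplified response shape (the real hook reads this state from the Redux slice):

```typescript
interface LocationHealth {
  locationId: string;
  status: string;
  reason?: string;
}
interface MonitorHealthEntry {
  configId: string;
  isUnhealthy: boolean;
  locations: LocationHealth[];
}

// True if the given monitor reported any unhealthy location.
function isUnhealthy(entries: MonitorHealthEntry[], configId: string): boolean {
  return entries.some((e) => e.configId === configId && e.isUnhealthy);
}

// Count monitors with at least one non-healthy status at this location.
function unhealthyMonitorCountForLocation(
  entries: MonitorHealthEntry[],
  locationId: string
): number {
  return entries.filter((e) =>
    e.locations.some((l) => l.locationId === locationId && l.status !== 'healthy')
  ).length;
}
```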

UI changes

  • Monitor management list table: Warning icon with tooltip showing per-location health reasons
  • Monitor edit page: Warning callout listing affected locations with a "Reset monitor" button
  • Private locations settings: Badge showing count of unhealthy monitors per location, with a "Reset monitors" button per location
  • Bulk operations: Bulk reset action for selected unhealthy monitors (both in monitor list and private locations table)
  • Reset confirmation modal: ResetMonitorModal component with confirmation dialog, loading state, and success/error toast notifications
  • Toast notifications: Success and error feedback after individual and bulk reset operations

API response examples

Bulk endpoint (POST /internal/synthetics/monitors/_health):

{
  "monitors": [
    {
      "configId": "a04dd21...",
      "monitorName": "https://www.elastic.co",
      "isUnhealthy": false,
      "locations": [
        {
          "locationId": "80ad4fda...",
          "locationLabel": "My Private Location",
          "status": "healthy",
          "policyId": "a04dd21...-80ad4fda..."
        }
      ]
    },
    {
      "configId": "6249a8b4...",
      "monitorName": "Test Unhealthy Monitor",
      "isUnhealthy": true,
      "locations": [
        {
          "locationId": "80ad4fda...",
          "locationLabel": "My Private Location",
          "status": "missing_package_policy",
          "policyId": "6249a8b4...-80ad4fda...",
          "reason": "The Fleet package policy for this monitor/location pair does not exist."
        }
      ]
    }
  ],
  "errors": []
}

Single endpoint (GET /internal/synthetics/monitors/{monitorId}/_health):

{
  "configId": "6249a8b4...",
  "monitorName": "Test Unhealthy Monitor",
  "isUnhealthy": true,
  "locations": [
    {
      "locationId": "80ad4fda...",
      "locationLabel": "My Private Location",
      "status": "missing_package_policy",
      "policyId": "6249a8b4...-80ad4fda...",
      "reason": "The Fleet package policy for this monitor/location pair does not exist."
    }
  ]
}

Some screenshots may be outdated!

Screenshots

- Monitor List (unhealthy warning icon)
- Private Locations (unhealthy count badge)
- Monitor Edit (missing integration callout)

Test plan

You can use this script to create unhealthy monitors locally: break_monitors.sh. Just replace the location ID and Kibana URL.

Prerequisites

  1. Start Elasticsearch and Kibana locally
  2. Navigate to Observability > Synthetics > Settings > Private Locations and create a private location (requires a Fleet agent policy)
  3. Create at least one Synthetics monitor (HTTP or browser) assigned to the private location

Scenario 1: Verify healthy monitor

  • Open the monitor list page — the monitor should not show a warning icon
  • Call GET /internal/synthetics/monitors/{monitorId}/_health — all locations should be "status": "healthy"

Scenario 2: Simulate missing_package_policy

  • Go to Fleet > Agent policies, find the agent policy used by the private location
  • Expand the policy, find the synthetics package policy for your monitor, and delete it (use POST /api/fleet/package_policies/delete with force: true if needed)
  • Refresh the monitor list — a warning icon should appear next to the affected monitor
  • Hover the icon — tooltip should show the location name and reason ("The Fleet package policy for this monitor/location pair does not exist.")
  • Navigate to the monitor edit page — a warning callout should appear with per-location details and a "Reset monitor" button
  • Call the bulk API: POST /internal/synthetics/monitors/_health with {"monitorIds": ["<configId>"]} — response should show "status": "missing_package_policy" with a reason

Scenario 3: Simulate missing_agent_policy

  • Delete the agent policy referenced by the private location via Fleet API
  • Refresh the monitor list — warning icon should appear
  • Call the health API — response should show "status": "missing_agent_policy"

Scenario 4: Verify private locations settings page

  • Navigate to Synthetics > Settings > Private Locations
  • If any monitor is unhealthy, the location row should display a badge like "1 unhealthy"
  • The location should also show a "Reset monitors" button

Scenario 5: Reset a single monitor (edit page)

  • On the monitor edit page, click "Reset monitor" in the warning callout
  • Verify the success callout appears: "Monitor reset successfully"
  • Refresh the page — the warning callout should be gone (if the reset resolved the issue)

Scenario 6: Bulk reset monitors (monitor list)

  • Select multiple unhealthy monitors using the checkboxes in the monitor list
  • Click the "Reset N monitors" button in the bulk actions bar
  • A confirmation modal should appear listing the number of monitors to reset
  • Click "Reset" — verify a success toast appears ("N monitors reset successfully")
  • Refresh the page — the warning icons should be gone for reset monitors

Scenario 7: Reset monitors from private locations page

  • Navigate to Synthetics > Settings > Private Locations
  • Click "Reset monitors" on a location with unhealthy monitors
  • A confirmation modal should appear
  • Click "Reset" — verify a success toast appears
  • The unhealthy count badge should update

Scenario 8: Legacy policy ID format

  • If you have monitors created before the space-agnostic policy ID migration (with format {configId}-{locationId}-{spaceId}), verify they are not falsely reported as missing_package_policy
  • Call the health API for such a monitor — it should return "status": "healthy"

Scenario 9: Partial errors

  • Call the bulk API with a mix of valid and invalid monitor IDs: {"monitorIds": ["valid-id", "nonexistent-id"]}
  • Verify the response includes both a monitors entry for the valid ID and an errors entry with statusCode: 404 for the invalid one

Scenario 10: Verify the single endpoint 404 handling

  • Call GET /internal/synthetics/monitors/nonexistent/_health
  • Verify it returns HTTP 404 with a message

Risk assessment

Low — this is an additive feature. No existing API contracts or data models are modified. The new endpoints are internal-only and the UI changes are isolated to the Synthetics management pages.

github-actions bot added the author:actionable-obs (PRs authored by the actionable obs team) label Mar 9, 2026
@miguelmartin-elastic (Contributor, Author)

/ci

@miguelmartin-elastic (Contributor, Author)

/ci

@miguelmartin-elastic (Contributor, Author)

/ci

…h statuses

Both statuses are removed from the private location health API:
- AgentPolicyMismatch: scenario is practically impossible in normal usage; monitors where the package policy exists now report Healthy regardless of which agent policy it is attached to
- PackageNotInstalled: if the synthetics package is missing the entire app fails; surfacing it per-monitor adds noise without actionable value

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@miguelmartin-elastic (Contributor, Author)

@benakansara

Re: layout issue in the private locations table:

The root cause is that the Delete action uses a render-based approach in the EUI actions column, which EUI treats differently from standard type: 'icon' actions — they end up in separate DOM containers and stack vertically. Fixing it cleanly would require restructuring the actions column (separating the Delete modal from its trigger and reworking the isPrimary setup across all three actions). Since this is a pre-existing layout issue unrelated to the core changes in this PR, I'd prefer to address it in a follow-up to keep this one focused.

The rest of the comments have been addressed 😃

@benakansara (Contributor)

@miguelmartin-elastic Thanks for the explanation. On closer look, it seems the layout issue happens when there are more than 2 actions. Since this PR adds a third action, it overflows and creates the layout issue. Below is the screenshot from main. I think it's OK to restructure the Delete action from render to icon format and use the same pattern as the reset modal for the delete modal.

Screenshot 2026-04-01 at 9 15 44 PM

@miguelmartin-elastic (Contributor, Author)

> I think it's OK to restructure the Delete action from render to icon format and use the same pattern as the reset modal for the delete modal.

@benakansara fixed 🚀


@benakansara left a comment


LGTM! 🚀 Just one comment about using JSON.stringify

Comment on lines +69 to +70
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [dispatch, JSON.stringify(configIds)]);


do we need JSON.stringify?

Suggested change
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [dispatch, JSON.stringify(configIds)]);
}, [dispatch, configIds]);

useEffect(() => {
dispatch(updateManagementPageStateAction({ configIds }));
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [dispatch, JSON.stringify(configIds)]);
Contributor Author


Nope. Fixed now :)

Comment on lines +67 to +69
useEffect(() => {
dispatch(updateManagementPageStateAction({ configIds }));
}, [dispatch, configIds]);
Contributor


🟢 Low monitor_filters/use_filters.ts:67

The useEffect for configIds only dispatches to updateManagementPageStateAction but not setOverviewPageStateAction, while the useLogicalAndFor effect immediately after it dispatches to both. Since MonitorOverviewPageState extends MonitorFilterState which includes configIds, the overview page state won't receive configIds updates from URL params, causing the overview page filtering to be out of sync with the URL.

-  useEffect(() => {
-    dispatch(updateManagementPageStateAction({ configIds }));    
-  }, [dispatch, configIds]);
+  useEffect(() => {
+    dispatch(updateManagementPageStateAction({ configIds }));
+    dispatch(setOverviewPageStateAction({ configIds }));
+  }, [dispatch, configIds]);
🤖 Copy this AI Prompt to have your agent fix this:
In file x-pack/solutions/observability/plugins/synthetics/public/apps/synthetics/components/monitors_page/common/monitor_filters/use_filters.ts around lines 67-69:

The `useEffect` for `configIds` only dispatches to `updateManagementPageStateAction` but not `setOverviewPageStateAction`, while the `useLogicalAndFor` effect immediately after it dispatches to both. Since `MonitorOverviewPageState` extends `MonitorFilterState` which includes `configIds`, the overview page state won't receive `configIds` updates from URL params, causing the overview page filtering to be out of sync with the URL.

Evidence trail:
1. x-pack/solutions/observability/plugins/synthetics/public/apps/synthetics/components/monitors_page/common/monitor_filters/use_filters.ts lines 62-64: configIds effect only dispatches updateManagementPageStateAction
2. use_filters.ts lines 66-76: useLogicalAndFor effect dispatches to both setOverviewPageStateAction and updateManagementPageStateAction
3. x-pack/solutions/observability/plugins/synthetics/public/apps/synthetics/state/monitor_list/models.ts line 27: MonitorFilterState includes configIds
4. x-pack/solutions/observability/plugins/synthetics/public/apps/synthetics/state/overview/models.ts line 15: MonitorOverviewPageState extends MonitorFilterState
5. x-pack/solutions/observability/plugins/synthetics/public/apps/synthetics/utils/filters/filter_fields.ts line 29: getMonitorFilterFields() does NOT include configIds

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #2 / should call onSelectionChange on user selection

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

| id | before | after | diff |
|---|---|---|---|
| synthetics | 1264 | 1276 | +12 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
|---|---|---|---|
| synthetics | 1.1MB | 1.1MB | +13.3KB |

History

miguelmartin-elastic merged commit 7d637b7 into elastic:main Apr 6, 2026
18 checks passed
miguelmartin-elastic added a commit that referenced this pull request Apr 13, 2026
…261367)

Blocked by #256738 

## Summary

Closes #258541. Follow-up to #256738.

Extends the monitor integration health API and UI to detect agent-level
issues in private locations — specifically when no agents are enrolled
or all agents are offline/unhealthy. These statuses are surfaced
alongside the existing integration-level checks.

|   |   |
|---|---|
| Monitors Management - No agents enrolled tooltip | <img width="1552"
height="982" alt="image"
src="https://github.com/user-attachments/assets/9e1ab7fb-481c-48fa-8f22-e57e1f366474"
/> |
| Monitors Management - Mixed issues: reset-fixable and
non-reset-fixable | <img width="1552" height="982" alt="image"
src="https://github.com/user-attachments/assets/fab9becd-9db8-45fc-b0c1-e300c6cb4df7"
/> |
| Monitors Management: reset is applied only to monitors that have at
least one reset-fixable issue | <img width="1552" height="982"
alt="image"
src="https://github.com/user-attachments/assets/7f22413b-b45d-403b-8d55-ef7ae343b37b"
/> |
| Monitors Management: reset is applied only to monitors that have at
least one reset-fixable issue | <img width="1552" height="982"
alt="image"
src="https://github.com/user-attachments/assets/be4048fa-b576-4be2-be9d-341c7c50431d"
/> |
| Monitor edit: if there are no reset-fixable issues the reset button is
not shown | <img width="1552" height="982" alt="image"
src="https://github.com/user-attachments/assets/35d564a9-e10e-46e2-8ac8-69f68d2b5d00"
/> |
| Private locations table: if none of the unhealthy monitors is
reset-fixable in that private location, the reset button is not shown |
<img width="1552" height="982" alt="image"
src="https://github.com/user-attachments/assets/e1f4662a-9322-4ef6-bb7a-3cb969101f02"
/> |
| Private locations table: if at least one of the unhealthy monitors is
reset-fixable in that private location, the reset button is shown | <img
width="1552" height="982" alt="image"
src="https://github.com/user-attachments/assets/f5298fcb-c83f-4d85-910f-58017d7dead6"
/> |

**New health statuses (server):**
- `missing_agents` — the agent policy exists but has zero active agents
enrolled
- `unhealthy_agent` — agents are enrolled but none are online

Agent status is fetched in batch via Fleet's
`getAgentStatusForAgentPolicy` for all relevant agent policies. The
check uses `status.active` (not `status.all`) so that unenrolled/deleted
agents don't incorrectly count as enrolled.
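Under the assumption of a Fleet status summary exposing `active`, `online`, and `all` counts, the classification can be sketched as follows; the interface here is a simplified stand-in, not Fleet's exact response type.

```typescript
// Simplified per-policy agent status summary (subset of fields used here).
interface AgentStatusSummary {
  active: number; // enrolled, non-unenrolled agents
  online: number; // agents currently reporting healthy
  all: number; // includes unenrolled/deleted agents; intentionally unused
}

// Classify agent-level health for a private location. Using `active` rather
// than `all` means unenrolled or deleted agents never count as enrolled.
function agentLevelStatus(
  status: AgentStatusSummary
): 'missing_agents' | 'unhealthy_agent' | 'healthy' {
  if (status.active === 0) return 'missing_agents';
  if (status.online === 0) return 'unhealthy_agent';
  return 'healthy';
}
```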

**Priority order** (most fundamental → least):
`missing_location` → `missing_agent_policy` → `missing_package_policy` →
`missing_agents` → `unhealthy_agent` → `healthy`

**UI changes:**
- `MissingAgentPolicy`, `MissingAgents`, and `UnhealthyAgent` are
classified as **non-reset-fixable** — the reset button is hidden when
all unhealthy locations have agent-level issues
- When a selection is mixed (some fixable, some not), the bulk reset
modal shows a collapsible warning listing the skipped monitors
- In the private locations table, the reset button is now a primary
inline action to avoid a blank space when hidden
- The edit monitor callout shows "Agent issue detected" as the title
when agent-level issues are present

**Reset API fix:**
Before calling `editMonitors`, the reset API now pre-filters out
locations whose agent policy no longer exists. This prevents
`AgentPolicyNotFoundError` from bubbling up as a 500 when a monitor has
both fixable and non-fixable locations.
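A minimal sketch of that pre-filter, with illustrative names (`existingAgentPolicyIds` would come from a Fleet lookup; the actual reset API's types differ):

```typescript
interface MonitorLocation {
  locationId: string;
  agentPolicyId: string;
}

// Before resetting, drop locations whose agent policy is gone so the bulk
// edit doesn't fail with AgentPolicyNotFoundError for non-fixable locations.
function filterResettableLocations(
  locations: MonitorLocation[],
  existingAgentPolicyIds: Set<string>
): MonitorLocation[] {
  return locations.filter((l) => existingAgentPolicyIds.has(l.agentPolicyId));
}
```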

## Test plan

### Prerequisites

- A running Kibana with at least one private location that has an
enrolled, online agent

### Setup test monitors

Run
[`~/elastic/scripts/break_monitors.sh`](https://github.com/miguelmartin-elastic/kibana/blob/feat/synthetics-agent-health-status-258541/x-pack/solutions/observability/plugins/synthetics/server/services/monitor_integration_health_api.test.ts)
against your Kibana instance. It creates:

| Monitor | Locations | Expected status |
|---|---|---|
| Mon A | loc1 (agent online) | `missing_package_policy` (fixable) |
| Mon B | loc1 (agent online) | `missing_package_policy` (fixable) |
| Mon C | loc1 + loc2 (no agents) | `missing_package_policy` on loc1 + `missing_agents` on loc2 |
| Mon D | loc2 (no agents) | `missing_agents` (not fixable) |
| Mon E | loc3 (deleted agent policy) | `missing_agent_policy` (not fixable) |
| Mon F | loc1 + loc3 | `missing_package_policy` on loc1 + `missing_agent_policy` on loc3 |

### What to verify

**Monitor list page (`/app/synthetics/monitors`):**
- [ ] Mon A and Mon B show the unhealthy badge and a "Reset monitor"
button in the row actions
- [ ] Mon D shows the unhealthy badge but **no** reset button
- [ ] Mon C shows the unhealthy badge and a reset button (mixed: one
location is fixable)
- [ ] Selecting Mon A + Mon B + Mon D and clicking bulk reset opens the
confirmation modal with a warning listing Mon D as skipped
- [ ] Confirming the bulk reset fixes Mon A and Mon B (they become
healthy after a few seconds)

**Edit monitor page for Mon C:**
- [ ] The callout lists both locations with their respective status
messages
- [ ] The reset button is visible (because loc1 is fixable)
- [ ] Clicking reset fixes loc1; loc2 remains `missing_agents`

**Private locations settings page
(`/app/synthetics/settings/private-locations`):**
- [ ] The location with enrolled agents shows a "Reset monitors" button
when Mon A/B are broken
- [ ] The no-agents location does **not** show the reset button (all
issues are agent-level)
- [ ] No blank space appears where the reset button would be on the
no-agents location row

**New health status messages:**
- [ ] `missing_agents`: "No Fleet agents are enrolled in the agent
policy for this private location. Enroll an agent in Fleet to resolve
this."
- [ ] `unhealthy_agent`: "All Fleet agents for this private location are
unhealthy or offline. Check the agent status in Fleet."

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

- `author:actionable-obs`: PRs authored by the actionable obs team
- `backport:skip`: This PR does not require backporting
- `release_note:feature`: Makes this part of the condensed release notes
- `v9.4.0`
