[Security Data Experience] Add permission verifier background tasks #257516

Merged
Omolola-Akinleye merged 41 commits into elastic:main from Omolola-Akinleye:permission-verifier-background-tasks on Apr 6, 2026

Contributor

@Omolola-Akinleye Omolola-Akinleye commented Mar 12, 2026

Summary

The verify_permissions_task is a scheduled Fleet background task that validates cloud connector credentials by deploying a short-lived OTel-based verifier agent. It runs every 5 minutes and processes one connector at a time to avoid resource contention.

  • Remove the otel_verifier_logs_status_change_task and all references from plugin.ts — this mock-based status-update task is no longer needed; status will be handled differently when the real verification index is available.

  • Clean up unused cloud connector saved object fields (verification_id, verification_timestamp, verification_permissions) and their associated types (VerificationResultDocument, VerificationPermissionResult, PermissionStatus) that had no active writers.

  • Trim the SO model version 4 to only include the 3 actively used verification fields.

  • Add comprehensive unit tests for the permission verifier task (11 tests covering registration, scheduling, feature flag gating, eligibility filtering, policy template aggregation, connector status updates, failure handling, and TTL-based cleanup).

High-Level Overview: Permission Verifier Task
The verify_permissions_task is a scheduled Fleet background task (runs every 5 minutes) that validates cloud connector credentials by deploying an OTel-based verifier agent. It operates in three phases:

  1. Cleanup — Deletes any expired verifier agent policies whose TTL (5 min) has elapsed, ensuring at most one active verification deployment at a time.

  2. Pre-filter — Queries package policies with cloud_connector_id to build a map of connector ID to verification info (aggregated policy templates + package metadata). Only connectors with installed integrations are candidates.

  3. Verify — Picks the first eligible connector (based on 6 criteria: never verified, recently created/updated, due for re-verification, failed with cooldown expired, or no status set) and creates a single verifier agent policy containing all policy templates for that connector. On success, stamps verification_started_at; on failure, marks verification_status: failed.
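
The eligibility check in phase 3 could be sketched roughly as follows. This is a minimal illustration, not the task's actual code: only `verification_status`, `verification_started_at`, and the `pending` and `failed` statuses appear in this PR description. The `ConnectorLike` shape, the `success` status, the `created_at`/`updated_at` fields, and all three time thresholds are assumptions made for the example.

```typescript
// Hypothetical connector shape. Only verification_status and
// verification_started_at are named in the PR description.
interface ConnectorLike {
  id: string;
  created_at: string; // ISO timestamp (assumed field)
  updated_at: string; // ISO timestamp (assumed field)
  verification_status?: 'pending' | 'success' | 'failed'; // 'success' is assumed
  verification_started_at?: string;
}

// Assumed thresholds. The PR does not state the real values.
const RECENT_CHANGE_WINDOW_MS = 10 * 60 * 1000;
const REVERIFY_INTERVAL_MS = 24 * 60 * 60 * 1000;
const FAILURE_COOLDOWN_MS = 60 * 60 * 1000;

function isEligible(connector: ConnectorLike, now: number): boolean {
  // No status set, or never verified.
  if (!connector.verification_status || !connector.verification_started_at) {
    return true;
  }

  const startedAt = Date.parse(connector.verification_started_at);
  const sinceStart = now - startedAt;

  // Created or updated recently, after the last verification started.
  const lastChange = Math.max(
    Date.parse(connector.created_at),
    Date.parse(connector.updated_at)
  );
  if (lastChange > startedAt && now - lastChange <= RECENT_CHANGE_WINDOW_MS) {
    return true;
  }

  // Failed: eligible again only once the cooldown has expired.
  if (connector.verification_status === 'failed') {
    return sinceStart >= FAILURE_COOLDOWN_MS;
  }

  // Otherwise eligible only when due for re-verification.
  return sinceStart >= REVERIFY_INTERVAL_MS;
}

// Mirrors "picks the first eligible connector": stop at the first match.
function pickFirstEligible(
  connectors: ConnectorLike[],
  now: number
): ConnectorLike | undefined {
  return connectors.find((connector) => isEligible(connector, now));
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside the helper, keeps the check deterministic in unit tests.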

The verifier agent policy uses the verifier_otel integration package, which runs an OTel Collector with a custom Verifier receiver that checks the connector's cloud credentials against the required permissions for each policy template.

Verifying Cloud Connector Permissions with OTel Verifier Flow

sequenceDiagram
    participant TM as Task Manager
    participant VT as Verify Permissions Task
    participant SO as Saved Objects
    participant AP as Agent Policy Service
    participant Agent as OTel Verifier Agent
    participant ES as Elasticsearch

    TM->>VT: Run task (every 5 min)
    
    Note over VT: Phase 1 — Cleanup
    VT->>AP: List verifier policies (is_verifier: true)
    AP-->>VT: Active verifier policies
    loop Each expired policy (age > 5 min TTL)
        VT->>AP: deleteVerifierPolicy(policyId)
    end

    Note over VT: Gate check — one at a time
    VT->>AP: List verifier policies (is_verifier: true)
    AP-->>VT: Active policies
    alt Non-expired verifier exists
        VT-->>TM: Skip (deployment in flight)
    end

    Note over VT: Phase 2 — Pre-filter
    VT->>SO: Find package policies with cloud_connector_id
    SO-->>VT: Package policies
    Note over VT: Build map: connector → {policyTemplates, packageInfo}
    VT->>SO: Find cloud connectors by ID
    SO-->>VT: Cloud connectors

    Note over VT: Phase 3 — Verify first eligible connector
    loop Each connector (break after first eligible)
        alt Not eligible (recently verified, in backoff window, etc.)
            Note over VT: Skip
        else Eligible
            VT->>AP: createVerifierPolicy(connector, verificationInfo)
            AP->>AP: Create agent policy (is_verifier: true)
            AP->>AP: Create verifier_otel package policy
            AP->>AP: Deploy policy to agentless
            AP-->>VT: {policyId}
            
            Note over Agent: Agent enrolls and runs OTel Collector
            Agent->>Agent: Verifier receiver checks cloud permissions
            Agent->>ES: Write verification results to logs-verifier_otel.*
            
            VT->>SO: Update connector (status: pending, started_at: now)
        end
    end
    
    VT-->>TM: Task complete
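
The cleanup, gate check, and pre-filter steps in the diagram can be sketched with injected dependencies. This is an illustrative outline under stated assumptions: `Deps`, `VerifierPolicy`, `PackagePolicyLike`, `cleanupAndGate`, and `groupTemplatesByConnector` are hypothetical names, not Fleet's actual services or types. Only the 5 minute TTL, the `cloud_connector_id` field, and the `deleteVerifierPolicy` call name come from the description above.

```typescript
// Hypothetical stand-ins for the Fleet services shown in the diagram.
const VERIFIER_TTL_MS = 5 * 60 * 1000; // 5 min TTL, per the PR description

interface VerifierPolicy {
  id: string;
  createdAt: number; // epoch milliseconds (assumed representation)
}

interface PackagePolicyLike {
  cloud_connector_id?: string;
  policyTemplate: string; // assumed field name
}

interface Deps {
  listVerifierPolicies(): Promise<VerifierPolicy[]>;
  deleteVerifierPolicy(id: string): Promise<void>;
}

// Phase 1 plus the gate check: delete expired verifier policies, then list
// again and skip the run if a non-expired verifier policy still exists.
async function cleanupAndGate(deps: Deps, now: number): Promise<'proceed' | 'skip'> {
  const policies = await deps.listVerifierPolicies();
  for (const policy of policies) {
    if (now - policy.createdAt > VERIFIER_TTL_MS) {
      await deps.deleteVerifierPolicy(policy.id);
    }
  }
  const remaining = await deps.listVerifierPolicies();
  const inFlight = remaining.some((policy) => now - policy.createdAt <= VERIFIER_TTL_MS);
  return inFlight ? 'skip' : 'proceed';
}

// Phase 2: aggregate the distinct policy templates per connector ID.
// Package policies without a cloud_connector_id are ignored.
function groupTemplatesByConnector(
  packagePolicies: PackagePolicyLike[]
): Map<string, Set<string>> {
  const byConnector = new Map<string, Set<string>>();
  for (const packagePolicy of packagePolicies) {
    const connectorId = packagePolicy.cloud_connector_id;
    if (!connectorId) continue;
    const templates = byConnector.get(connectorId) ?? new Set<string>();
    templates.add(packagePolicy.policyTemplate);
    byConnector.set(connectorId, templates);
  }
  return byConnector;
}
```

Listing the verifier policies a second time after cleanup follows the diagram, where the gate check issues its own list call instead of reusing the cleanup result.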

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • Flaky Test Runner was used on any tests changed
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Identify risks

Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regression, or potential data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

@Omolola-Akinleye
Contributor Author

/ci

@Omolola-Akinleye Omolola-Akinleye added the ci:cloud-deploy (Create or update a Cloud deployment) and ci:project-deploy-security (Create a Security Serverless Project) labels Mar 12, 2026
@Omolola-Akinleye Omolola-Akinleye changed the title from "first pass at permission on verifier tasks" to "[Security Data Experience] Add permission verifier background tasks" Mar 13, 2026
@Omolola-Akinleye
Contributor Author

/ci

@Omolola-Akinleye Omolola-Akinleye marked this pull request as ready for review March 21, 2026 01:14
@Omolola-Akinleye Omolola-Akinleye requested review from a team as code owners March 21, 2026 01:14
@elastic elastic deleted a comment from elasticmachine Mar 21, 2026
Member

@florent-leborgne florent-leborgne left a comment

LGTM for docs

Contributor

@jloleysens jloleysens left a comment

Thanks for addressing my feedback!

const policyName = `Verifier-Agent-Policy-${connectorName}-${shortId}`;
const verificationId = uuidv4();

const agentPolicy = await this.create(
Contributor

🟡 Medium services/agent_policy.ts:2626

If ensureInstalledPackage, getPackageInfo, packagePolicyService.create, or deployPolicy throws after the agent policy is created at line 2626, the policy is orphaned because createVerifierPolicy does not delete it on failure. This contradicts the cleanup pattern in createWithPackagePolicies (lines 630-660), which properly deletes the agent policy when subsequent operations fail.

Evidence trail:
x-pack/platform/plugins/shared/fleet/server/services/agent_policy.ts:
- Line 2626: Agent policy created via `this.create(..., { skipDeploy: true })`
- Lines 2658-2662: `ensureInstalledPackage` called (no cleanup on failure)
- Lines 2669-2674: `getPackageInfo` called (no cleanup on failure)
- Lines 2728-2752: `packagePolicyService.create` in try-catch, but catch only logs and rethrows (no cleanup)
- Line 2758: `deployPolicy` called (no cleanup on failure)
- No cleanup/rollback logic anywhere in `createVerifierPolicy`

Contrast with lines 630-660 (`createWithPackagePolicies`):
- Lines 631-632: Catches errors
- Lines 645-653: Deletes created package policies
- Lines 654-656: Deletes agent policy via `this.delete()`
- Line 658: Re-throws error
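
One way the rollback requested in this comment could look is sketched below. It is not the code from `agent_policy.ts`, and it is not necessarily what the PR ended up doing: `PolicyOps`, `createWithRollback`, and `runFollowUpSteps` are hypothetical stand-ins that collapse the package install, package policy creation, and deploy steps into a single call that may throw.

```typescript
// Hypothetical, simplified stand-ins for the services named in the review.
interface PolicyOps {
  createAgentPolicy(name: string): Promise<{ id: string }>;
  deleteAgentPolicy(id: string): Promise<void>;
  // Represents every step after creation that may throw
  // (package install, package policy creation, deploy).
  runFollowUpSteps(policyId: string): Promise<void>;
}

async function createWithRollback(ops: PolicyOps, name: string): Promise<string> {
  const policy = await ops.createAgentPolicy(name);
  try {
    await ops.runFollowUpSteps(policy.id);
    return policy.id;
  } catch (err) {
    // Best-effort cleanup so the agent policy is not orphaned. A failure of
    // the cleanup itself is swallowed so the original error is the one rethrown.
    await ops.deleteAgentPolicy(policy.id).catch(() => undefined);
    throw err;
  }
}
```

Swallowing the cleanup error is a design choice in this sketch; a real implementation would probably log it as well.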

…rify_permissions_task.ts

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Contributor

@juliaElastic juliaElastic left a comment

LGTM

@Omolola-Akinleye Omolola-Akinleye added the v9.4.0 and release_note:feature (Makes this part of the condensed release notes) labels Apr 2, 2026
Contributor

@kc13greiner kc13greiner left a comment

LGTM!

@Omolola-Akinleye Omolola-Akinleye added the backport:skip (This PR does not require backporting) label Apr 3, 2026
@elasticmachine
Contributor

elasticmachine commented Apr 6, 2026

💔 Build Failed

Failed CI Steps

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1726 1732 +6

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
fleet 121 122 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 201.7KB 201.7KB +22.0B

Unknown metric groups

API count

id before after diff
fleet 1910 1916 +6

@Omolola-Akinleye Omolola-Akinleye merged commit bd82ce1 into elastic:main Apr 6, 2026
18 checks passed

Labels

backport:skip (This PR does not require backporting), ci:cloud-deploy (Create or update a Cloud deployment), ci:project-deploy-security (Create a Security Serverless Project), release_note:feature (Makes this part of the condensed release notes), Team:Fleet (Team label for Observability Data Collection Fleet team), v9.4.0

9 participants