[Security Data Experience] Add permission verifier background tasks #257516
…lola-Akinleye/kibana into permission-verifier-background-tasks
…tion_tests/ci_checks
jloleysens
left a comment
Thanks for addressing my feedback!
const policyName = `Verifier-Agent-Policy-${connectorName}-${shortId}`;
const verificationId = uuidv4();

const agentPolicy = await this.create(
🟡 Medium services/agent_policy.ts:2626
If ensureInstalledPackage, getPackageInfo, packagePolicyService.create, or deployPolicy throws after the agent policy is created at line 2626, the policy is orphaned because createVerifierPolicy does not delete it on failure. This contradicts the cleanup pattern in createWithPackagePolicies (lines 630-660), which properly deletes the agent policy when subsequent operations fail.
Evidence trail:
x-pack/platform/plugins/shared/fleet/server/services/agent_policy.ts:
- Line 2626: Agent policy created via `this.create(..., { skipDeploy: true })`
- Lines 2658-2662: `ensureInstalledPackage` called (no cleanup on failure)
- Lines 2669-2674: `getPackageInfo` called (no cleanup on failure)
- Lines 2728-2752: `packagePolicyService.create` in try-catch, but catch only logs and rethrows (no cleanup)
- Line 2758: `deployPolicy` called (no cleanup on failure)
- No cleanup/rollback logic anywhere in `createVerifierPolicy`
Contrast with lines 630-660 (`createWithPackagePolicies`):
- Lines 631-632: Catches errors
- Lines 645-653: Deletes created package policies
- Lines 654-656: Deletes agent policy via `this.delete()`
- Line 658: Re-throws error
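The rollback the comment asks for could look roughly like the sketch below. This is an illustration only, not the code in `agent_policy.ts`: the `VerifierDeps` interface and its method names are hypothetical stand-ins for `this.create`, the follow-up calls (`ensureInstalledPackage`, `getPackageInfo`, `packagePolicyService.create`, `deployPolicy`), and `this.delete`.

```typescript
// Hypothetical stand-ins for the Fleet services named in the review comment.
// The real signatures in agent_policy.ts are not shown here and may differ.
interface AgentPolicyLike {
  id: string;
}

interface VerifierDeps {
  createAgentPolicy(): Promise<AgentPolicyLike>;
  // Everything that runs after the agent policy exists (package install,
  // package policy creation, deploy). Any of these may throw.
  runFollowUpSteps(policy: AgentPolicyLike): Promise<void>;
  deleteAgentPolicy(id: string): Promise<void>;
  logWarn(message: string): void;
}

async function createVerifierPolicyWithRollback(
  deps: VerifierDeps
): Promise<AgentPolicyLike> {
  const policy = await deps.createAgentPolicy();
  try {
    await deps.runFollowUpSteps(policy);
    return policy;
  } catch (error) {
    // Best-effort cleanup so the agent policy is not orphaned.
    try {
      await deps.deleteAgentPolicy(policy.id);
    } catch (cleanupError) {
      deps.logWarn(
        `Failed to delete verifier policy ${policy.id}: ${String(cleanupError)}`
      );
    }
    // Surface the original failure, not the cleanup failure.
    throw error;
  }
}
```

Wrapping the cleanup in its own try/catch keeps a failed delete from masking the error that triggered the rollback.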
…rify_permissions_task.ts Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
💔 Build Failed
Failed CI Steps
Metrics [docs]
Public APIs missing comments
Public APIs missing exports
Page load bundle
Summary
The `verify_permissions_task` is a scheduled Fleet background task that validates cloud connector credentials by deploying a short-lived OTel-based verifier agent. It runs every 5 minutes and processes one connector at a time to avoid resource contention.
- Remove the `otel_verifier_logs_status_change_task` and all references from `plugin.ts`. This mock-based status-update task is no longer needed; status will be handled differently when the real verification index is available.
- Clean up unused cloud connector saved object fields (`verification_id`, `verification_timestamp`, `verification_permissions`) and their associated types (`VerificationResultDocument`, `VerificationPermissionResult`, `PermissionStatus`) that had no active writers.
- Trim the SO model version 4 to include only the 3 actively used verification fields.
- Add comprehensive unit tests for the permission verifier task (11 tests covering registration, scheduling, feature flag gating, eligibility filtering, policy template aggregation, connector status updates, failure handling, and TTL-based cleanup).
High-Level Overview: Permission Verifier Task
The verify_permissions_task is a scheduled Fleet background task (runs every 5 minutes) that validates cloud connector credentials by deploying an OTel-based verifier agent. It operates in three phases:
- Cleanup: Deletes any expired verifier agent policies whose TTL (5 min) has elapsed, ensuring at most one active verification deployment at a time.
- Pre-filter: Queries package policies with `cloud_connector_id` to build a map of connector ID to verification info (aggregated policy templates plus package metadata). Only connectors with installed integrations are candidates.
- Verify: Picks the first eligible connector (based on 6 criteria: never verified, recently created, recently updated, due for re-verification, failed with cooldown expired, or no status set) and creates a single verifier agent policy containing all policy templates for that connector. On success, it stamps `verification_started_at`; on failure, it sets `verification_status: failed`.
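The pre-filter aggregation can be pictured with a minimal sketch like the one below. The types and the `policyTemplate` field name are hypothetical simplifications rather than the real Fleet package policy shape, and taking package metadata from the first policy seen for a connector is an assumption of this sketch.

```typescript
// Hypothetical, simplified shapes; the real Fleet types carry many more fields.
interface PackagePolicyLike {
  cloud_connector_id?: string;
  package?: { name: string; version: string };
  policyTemplate?: string; // assumed field name, for illustration only
}

interface VerificationInfo {
  packageName: string;
  packageVersion: string;
  policyTemplates: string[];
}

// Groups package policies by connector ID and collects the distinct policy
// templates each connector needs verified.
function buildVerificationMap(
  policies: PackagePolicyLike[]
): Map<string, VerificationInfo> {
  const byConnector = new Map<string, VerificationInfo>();
  for (const policy of policies) {
    const connectorId = policy.cloud_connector_id;
    // Policies without a connector or package cannot be verified.
    if (!connectorId || !policy.package) continue;

    let info = byConnector.get(connectorId);
    if (!info) {
      // Assumption: package metadata comes from the first policy seen.
      info = {
        packageName: policy.package.name,
        packageVersion: policy.package.version,
        policyTemplates: [],
      };
      byConnector.set(connectorId, info);
    }
    const template = policy.policyTemplate;
    if (template && !info.policyTemplates.includes(template)) {
      info.policyTemplates.push(template);
    }
  }
  return byConnector;
}
```

Connectors that never appear in the resulting map have no installed integrations and are therefore not candidates.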
The verifier agent policy uses the verifier_otel integration package, which runs an OTel Collector with a custom Verifier receiver that checks the connector's cloud credentials against the required permissions for each policy template.
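As a rough illustration of the TTL and eligibility checks, consider the sketch below. Only the 5-minute TTL and the field names `verification_status` and `verification_started_at` come from the description; the `success` status value, the `EligibilityWindows` parameters, and the function names are hypothetical. The "recently created/updated" criterion is left out because the description does not say how recency is measured.

```typescript
const VERIFIER_TTL_MS = 5 * 60 * 1000; // 5-minute TTL, per the overview

// A verifier policy is expired once it is older than the TTL.
function isVerifierPolicyExpired(createdAtIso: string, nowMs: number): boolean {
  return nowMs - Date.parse(createdAtIso) > VERIFIER_TTL_MS;
}

// Hypothetical connector shape; the 'success' value is an assumption.
interface ConnectorLike {
  verification_status?: 'pending' | 'success' | 'failed';
  verification_started_at?: string;
}

// Hypothetical tuning knobs; the PR description does not state these values.
interface EligibilityWindows {
  reverifyAfterMs: number;
  failureCooldownMs: number;
}

function isEligibleForVerification(
  connector: ConnectorLike,
  windows: EligibilityWindows,
  nowMs: number
): boolean {
  // Never verified, or no status set.
  if (!connector.verification_status || !connector.verification_started_at) {
    return true;
  }
  const sinceStartMs = nowMs - Date.parse(connector.verification_started_at);
  if (connector.verification_status === 'failed') {
    // Failed: retry only after the cooldown has expired.
    return sinceStartMs >= windows.failureCooldownMs;
  }
  // Otherwise: eligible once due for re-verification.
  return sinceStartMs >= windows.reverifyAfterMs;
}
```

Passing `nowMs` explicitly instead of calling `Date.now()` inside the functions keeps both checks deterministic in unit tests.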
Verifying Cloud Connector Permissions with OTel Verifier Flow
sequenceDiagram
    participant TM as Task Manager
    participant VT as Verify Permissions Task
    participant SO as Saved Objects
    participant AP as Agent Policy Service
    participant Agent as OTel Verifier Agent
    participant ES as Elasticsearch
    TM->>VT: Run task (every 5 min)
    Note over VT: Phase 1 — Cleanup
    VT->>AP: List verifier policies (is_verifier: true)
    AP-->>VT: Active verifier policies
    loop Each expired policy (age > 5 min TTL)
        VT->>AP: deleteVerifierPolicy(policyId)
    end
    Note over VT: Gate check — one at a time
    VT->>AP: List verifier policies (is_verifier: true)
    AP-->>VT: Active policies
    alt Non-expired verifier exists
        VT-->>TM: Skip (deployment in flight)
    end
    Note over VT: Phase 2 — Pre-filter
    VT->>SO: Find package policies with cloud_connector_id
    SO-->>VT: Package policies
    Note over VT: Build map: connector → {policyTemplates, packageInfo}
    VT->>SO: Find cloud connectors by ID
    SO-->>VT: Cloud connectors
    Note over VT: Phase 3 — Verify first eligible connector
    loop Each connector (break after first eligible)
        alt Not eligible (recently verified, in backoff window, etc.)
            Note over VT: Skip
        else Eligible
            VT->>AP: createVerifierPolicy(connector, verificationInfo)
            AP->>AP: Create agent policy (is_verifier: true)
            AP->>AP: Create verifier_otel package policy
            AP->>AP: Deploy policy to agentless
            AP-->>VT: {policyId}
            Note over Agent: Agent enrolls and runs OTel Collector
            Agent->>Agent: Verifier receiver checks cloud permissions
            Agent->>ES: Write verification results to logs-verifier_otel.*
            VT->>SO: Update connector (status: pending, started_at: now)
        end
    end
    VT-->>TM: Task complete
Checklist
Check that the PR satisfies the following conditions.
Reviewers should verify that this PR satisfies this list as well.
`release_note:breaking` label should be applied in these situations.
`release_note:*` label is applied per the guidelines
`backport:*` labels.
Identify risks
Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regressions, or potential data loss.
Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.