
Conversation

Contributor

@juliaElastic juliaElastic commented Aug 16, 2022

Summary

Relates #141567

Changes for the agent bulk actions reassign, unenroll, upgrade, and update tags:

  • Check whether the bulk action targets more than 10k agents (or the configured batch size).
    • If not, execute synchronously as before.
    • If yes, start the action execution asynchronously and return from the API immediately.
      • On errors during batch execution, a task is started with Kibana Task Manager to retry up to 3 times. The retry task resumes from the last successful batch by passing the searchAfter parameter.
      • The UI uses a new /action_status API to report on the progress of actions. I moved the UI part out of this PR; it will be covered in another one in the form of a flyout.
      • The counter shows the number of agents acknowledged by Fleet Server compared to the total actioned.
      • An action can be in progress, completed, cancelled, or expired.
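The threshold check described above can be sketched as follows. This is a minimal sketch with a hypothetical function name; the actual PR routes this decision through the ActionRunner classes, and `SO_SEARCH_LIMIT` is the Fleet constant visible in the diff below.

```typescript
// Default batch size; mirrors Fleet's SO_SEARCH_LIMIT constant.
const SO_SEARCH_LIMIT = 10000;

interface BulkActionOptions {
  batchSize?: number;
}

// Decide whether a bulk action runs synchronously (small selections)
// or is kicked off asynchronously with Task Manager retries on error.
// Hypothetical helper name, for illustration only.
function chooseExecutionMode(
  totalAgents: number,
  options: BulkActionOptions = {}
): 'sync' | 'async' {
  const batchSize = options.batchSize ?? SO_SEARCH_LIMIT;
  return totalAgents <= batchSize ? 'sync' : 'async';
}
```

In the async case the API responds immediately, and the UI later polls `/action_status` to report progress.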

Pending work:

  • Add tests.
  • Update the OpenAPI spec.
  • Retry tasks are currently started when an error is caught during async execution; a Kibana crash/restart (where execution stops without the catch flow running) is not yet handled.
  • Handle unrecoverable errors, e.g. the point in time expiring. This can be an issue if the retry task runs with some delay; the current point in time timeout is set to 10m.
    • Added special error handling for the ES 404 error that is returned when the PIT has expired; we don't retry in that case.
  • Open question: how can the Fleet API determine whether the action failed? For that we would need a failed counter stored in the Fleet action indices.
    • Added a .fleet-actions-status index to save the failed state with an error message when Kibana processing fails.
    • Update Sept 8: moved the .fleet-actions-status index out of this PR to simplify; it will move to another PR. It probably also needs an Elasticsearch change to add the new index mapping.


  • Make the code simpler and more readable; the logic has become quite complex with the ActionRunner and BulkActionResolver (for Task Manager) classes.
    • Did some refactoring to make the parameters type-safe and easier to read.
  • The add/remove tags and force unenroll actions don't create an entry in the .fleet-actions index, so we currently have no info about their action status.
GET kbn:/api/fleet/agents/action_status

{
  "items": [
    {
        "actionId": "ffa5355d-41bd-4382-8d9d-1b93e502936b",
        "nbAgents": 100,
        "complete": true,
        "nbAgentsAck": 100,
        "startTime": "2022-08-23T11:49:43.439Z",
        "type": "POLICY_REASSIGN",
        "total": 100,
        "cancelled": false,
        "expired": false
    }
  ]
}


Testing with ESS (PR build docker image), 8 GB integration server memory:
https://admin.found.no/deployments/af21b782629cb35eacb6b66ae157d211

  • Reassign 20k agents: Kibana processing took 15s; Fleet Server processing took about 21m (the first 10k took 7m, then it slowed down for the second half).
  • Upgrade 20k agents: Kibana processing took 15s; Fleet Server processing had done 11k after 30m and took about 1h to complete all (the action expired after 30m).
  • Unenroll 20k agents: Kibana processing took 13s; Fleet Server processing took 20m.
  • Reassign 30k agents: Kibana processing took 21s; Fleet Server processing took 31m.

On average that is <10s per 10k agents for Kibana processing, and around 10m per 10k agents for Fleet Server processing.

Checklist

@juliaElastic juliaElastic added release_note:feature Makes this part of the condensed release notes ci:deploy-cloud v8.5.0 labels Aug 16, 2022
@juliaElastic juliaElastic self-assigned this Aug 16, 2022
@tylersmalley tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022
@juliaElastic juliaElastic added the Team:Fleet Team label for Observability Data Collection Fleet team label Aug 18, 2022
@juliaElastic juliaElastic force-pushed the feat/action-batch-async branch from 53d33e9 to 8a31e6c Compare August 18, 2022 15:12
skipSuccess
)
);
const batchSize = options.batchSize ?? SO_SEARCH_LIMIT;
Contributor Author

For testing locally with fewer than 10k agents enrolled, set a lower number here:

const batchSize = 1000; // options.batchSize ?? SO_SEARCH_LIMIT;

or set it via the API:

POST kbn:/api/fleet/agents/bulk_reassign
{
  "agents": " fleet-agents.policy_id : (\"e57948b0-1d55-11ed-85fe-e34c31cf865b\" or \"26594150-1d56-11ed-85fe-e34c31cf865b\")",
  "policy_id": "e57948b0-1d55-11ed-85fe-e34c31cf865b",
  "batchSize": 1000
}

@juliaElastic juliaElastic added ci:cloud-deploy Create or update a Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Aug 23, 2022
start_time: startTime ?? now,
minimum_execution_duration: MINIMUM_EXECUTION_DURATION_SECONDS,
expiration: moment(startTime)
.add(MINIMUM_EXECUTION_DURATION_SECONDS, 'seconds')
Contributor Author

The default expiration with upgrade "Immediately" is 30m, which is too short for larger agent selections, e.g. 20k. When I tried upgrading immediately, the action still completed for all agents despite expiring about halfway through. I am not sure whether it is expected behavior that Fleet Server continues to execute the action even after expiration.

Member

Yes, we could probably have a larger expiration and minimum execution duration here (these need to be the same) for the immediate scenario to work.

The default expiration with upgrade "Immediately" is 30m, which is too short for larger agent selections, e.g. 20k. When I tried upgrading immediately, the action still completed for all agents despite expiring about halfway through. I am not sure whether it is expected behavior that Fleet Server continues to execute the action even after expiration.

The action should be delivered immediately to the agent, but Fleet Server may have trouble getting all the agent acks within 30 minutes.

Contributor Author

@juliaElastic juliaElastic Aug 24, 2022

Is there any disadvantage to increasing the minimum expiration time to e.g. 2 hours? Updated.

Member

I think increasing the minimum expiration time to 2 hours will mean that every rolling upgrade shorter than 2 hours will not actually be rolling. That may be acceptable, or maybe we could set this to two hours or more only for the immediate scenario.

Contributor Author

@juliaElastic juliaElastic Sep 5, 2022

The previous logic for the rolling upgrade was to set the minimum duration to 30m and the expiration to the selected duration.
I updated the minimum to 2h (or the selected upgrade window if that is shorter, e.g. 1h).
Previously:

  // Perform a rolling upgrade
  if (upgradeDurationSeconds) {
    return {
      start_time: startTime ?? now,
      minimum_execution_duration: MINIMUM_EXECUTION_DURATION_SECONDS,
      expiration: moment(startTime ?? now)
        .add(upgradeDurationSeconds, 'seconds')
        .toISOString(),
    };
  }

Updated:

  // Perform a rolling upgrade
  if (upgradeDurationSeconds) {
    return {
      start_time: startTime ?? now,
      minimum_execution_duration: Math.min(
        MINIMUM_EXECUTION_DURATION_SECONDS,
        upgradeDurationSeconds
      ),
      expiration: moment(startTime ?? now)
        .add(upgradeDurationSeconds, 'seconds')
        .toISOString(),
    };
  }

@juliaElastic juliaElastic marked this pull request as ready for review August 23, 2022 12:02
@juliaElastic juliaElastic requested a review from a team as a code owner August 23, 2022 12:02
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

start_time?: string;
minimum_execution_duration?: number;
source_uri?: string;
total?: number;
Contributor Author

Added the total count that represents how many agents were actioned (selected by the user); this helps with status reporting in case something went wrong while creating the action documents in batches.


export interface ActionStatus {
actionId: string;
nbAgentsActionCreated: number;
Contributor Author

nbAgentsActionCreated represents how many agents are included in .fleet-actions documents; if this is less than nbAgentsActioned, it indicates something went wrong with Kibana batch processing.
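To make the relationship between the counters concrete, here is a minimal sketch; the interface and helper names are hypothetical and only illustrate how the status fields relate, they are not the PR's exact code.

```typescript
// Hypothetical counter shape illustrating the fields discussed above.
interface ActionCounters {
  nbAgentsActioned: number;      // agents the user selected
  nbAgentsActionCreated: number; // agents written into .fleet-actions documents
  nbAgentsAck: number;           // agents acknowledged by Fleet Server
}

// If fewer action documents were created than agents actioned,
// Kibana's batch processing did not finish cleanly.
function hasKibanaProcessingGap(c: ActionCounters): boolean {
  return c.nbAgentsActionCreated < c.nbAgentsActioned;
}

// The action is complete once every actioned agent has acked.
function isComplete(c: ActionCounters): boolean {
  return c.nbAgentsAck >= c.nbAgentsActioned;
}
```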


await new Promise((resolve, reject) => {
let attempts = 0;
const intervalId = setInterval(async () => {
Contributor Author

For testing the async execution of bulk actions, I had to wait on a Promise; otherwise the FTR shutdown starts and the action doesn't finish.
Added a 1s interval, checked up to 3 times, to verify that the action has completed.
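A minimal sketch of that polling wait, completing the truncated snippet above; the helper name is hypothetical and the real test checks completion via the Fleet APIs.

```typescript
// Poll `isDone` once per interval, up to `maxAttempts` times, then give up.
// Resolves true if the condition became true, false if attempts ran out.
async function waitForCondition(
  isDone: () => Promise<boolean>,
  maxAttempts = 3,
  intervalMs = 1000
): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    let attempts = 0;
    const intervalId = setInterval(async () => {
      attempts++;
      if (await isDone()) {
        clearInterval(intervalId);
        resolve(true);
      } else if (attempts >= maxAttempts) {
        clearInterval(intervalId);
        resolve(false);
      }
    }, intervalMs);
  });
}
```

Awaiting this in the test keeps the FTR process alive until the bulk action has finished (or the attempts are exhausted).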

@juliaElastic juliaElastic added ci:cloud-redeploy Always create a new Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Sep 8, 2022
Member

@kpollich kpollich left a comment

I haven't gotten a chance to run through every action type locally, but I spent yesterday + this morning walking through the code and things look good to me. Definitely more OOP than we have elsewhere in the Fleet codebase, but I think the patterns used here, with a base class implementing the bulk action processing needs, all make sense in this context.

I'll offer my approval, but I think @nchaulet should probably continue to review here as well.

@@ -0,0 +1,143 @@
/*
Member

Thanks for adding this script to help with testing and local dev setup.

},
{
range: {
expiration: { gte: now },
Member

I think removing the expiration condition from this query makes sense.

@nchaulet nchaulet self-requested a review September 9, 2022 13:35
* On errors, starts a task with Task Manager to retry max 3 times
* If the last batch was stored in state, retry continues from there (searchAfter)
*/
public async runActionAsyncWithRetry(): Promise<{ items: BulkActionResult[]; actionId: string }> {
Member

@nchaulet nchaulet Sep 9, 2022

That method seems like a good candidate for unit testing, and the tests would help document the expected behavior.

Contributor Author

Do you mean unit tested?
Yes, I am planning to add more tests after I complete the UI part.

@nchaulet
Member

nchaulet commented Sep 9, 2022

Maybe a dumb question, but I am wondering whether we considered using the task scheduler for every bulk action, not just for retries. It could make things easier to understand, and probably more scalable if Task Manager is ever hosted on a separate Kibana instance that only runs tasks. I do not like having a long-running task spawned from a request; it seems like something that can break.

@juliaElastic
Contributor Author

Maybe a dumb question, but I am wondering whether we considered using the task scheduler for every bulk action, not just for retries. It could make things easier to understand, and probably more scalable if Task Manager is ever hosted on a separate Kibana instance that only runs tasks. I do not like having a long-running task spawned from a request; it seems like something that can break.

Good question. This was the original plan; however, we revised it after feedback from the Alerting team, who are heavy users of Task Manager. They suggested avoiding long-running tasks because they might impact the schedule of other tasks. Also, depending on how many tasks are waiting to run, it might take some time for Task Manager to pick up a task. So I thought it better to start the execution immediately.

@juliaElastic juliaElastic removed the ci:cloud-redeploy Always create a new Cloud deployment label Sep 9, 2022
@juliaElastic juliaElastic mentioned this pull request Sep 9, 2022
9 tasks
@juliaElastic
Contributor Author

@elasticmachine merge upstream

@juliaElastic
Contributor Author

@nchaulet @joshdover are you okay if I merge this today?

@joshdover
Contributor

@juliaElastic Let me take a look at this now; there were just a few key things I wanted to review.

Comment on lines +114 to +115
return Object.values(
res.hits.hits.reduce((acc, hit) => {
Contributor

Optional nit: this combo of values + reduce could probably be simplified to a single flatMap call.

Contributor Author

Not quite: the result here is an object keyed by action id, not an array, so flatMap doesn't apply directly. I reused most of this logic from the existing current_upgrades API.

Comment on lines +20 to +41
let actions = await _getActions(esClient);
const cancelledActionIds = await _getCancelledActionId(esClient);

// Fetch acknowledged result for every action
actions = await pMap(
actions,
async (action) => {
const { count } = await esClient.count({
index: AGENT_ACTIONS_RESULTS_INDEX,
ignore_unavailable: true,
query: {
bool: {
must: [
{
term: {
action_id: action.actionId,
},
},
],
},
},
});
Contributor

I'm concerned about the number of queries here. We're fetching all actions, then doing a separate query for action results for every action. Instead I think we should try combining that into a single query and use aggregations to get the count value that we need.

Likely something like this (pseudo-ish code):

let actions = await _getActions(esClient);

let acks = await esClient.search({
  index: AGENT_ACTIONS_RESULTS_INDEX,
  query: {
    bool: {
      // There's some perf/caching advantages to using filter over must
      // See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context
      // A single terms clause matches any of the action ids (separate term
      // clauses in `filter` would be ANDed together and match nothing).
      filter: [{ terms: { action_id: actions.map((a) => a.id) } }],
    },
  },
  aggs: {
    ack_counts: { terms: { field: 'action_id' } },
  },
});

return actions.map((a) => {
  return {
    ...a,
    nbAgentsAck: acks.aggregations.ack_counts.buckets.find((b) => b.key === a.id)?.doc_count ?? 0,
    // other fields
  };
});

Contributor Author

Is it okay if I refactor this in the next PR about the Agent activity UI? I already had to make changes there to add more info.

Member

I'd be okay with refactoring this fetch operation all at once in the next PR. Let's try to land this one and move on to the UI.

}

async function _getActions(esClient: ElasticsearchClient) {
const res = await esClient.search<FleetServerAgentAction>({
Contributor

Do we need to search for all actions? Is there some way to filter on only relevant or recent ones? Similar question for the cancelled actions query.

Contributor Author

@juliaElastic juliaElastic Sep 12, 2022

We could add a query param to limit the results. Will change this in the next PR.

Comment on lines +67 to +68
* On errors, starts a task with Task Manager to retry max 3 times
* If the last batch was stored in state, retry continues from there (searchAfter)
Contributor

I wonder if we should be checkpointing the state regardless of whether or not there is an error. We need to consider the case where Kibana crashes or is otherwise shutdown. Ideally in this case there's always a task scheduled to confirm that the process was completed successfully. This may require some significant changes to this PR and if so, we could handle it in a follow up.
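The checkpointing idea above can be sketched roughly as follows; the shapes and names here are hypothetical (the PR stores searchAfter in the retry task's state only on errors), and this only illustrates persisting progress after every batch.

```typescript
// Hypothetical checkpoint stored in task state after every successful batch,
// so a scheduled task can resume from it if Kibana crashes mid-run.
interface BatchCheckpoint {
  actionId: string;
  searchAfter?: unknown[]; // sort values of the last processed agent document
  retryCount: number;
}

// Advance the checkpoint with the sort values of the batch just completed.
function advanceCheckpoint(
  prev: BatchCheckpoint,
  lastSortValues: unknown[]
): BatchCheckpoint {
  return { ...prev, searchAfter: lastSortValues };
}
```

A verification task scheduled up front could then compare the stored checkpoint against the expected total and finish (or resume) the run.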

Contributor Author

Yes, I'll tackle this separately.

@juliaElastic
Contributor Author

@elasticmachine merge upstream

@kibana-ci

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Fleet Cypress Tests / Enrollment token page "before all" hook for "Create new Token"

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 873 876 +3

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
fleet 865.8KB 866.4KB +580.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 105.7KB 105.8KB +106.0B
Unknown metric groups

API count

id before after diff
fleet 970 973 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit 38e74d7 into elastic:main Sep 12, 2022
@kibanamachine kibanamachine added the backport:skip This PR does not require backporting label Sep 12, 2022
@juliaElastic juliaElastic mentioned this pull request Sep 12, 2022
3 tasks
@juliaElastic juliaElastic mentioned this pull request Nov 8, 2022
2 tasks

Labels

backport:skip This PR does not require backporting
release_note:feature Makes this part of the condensed release notes
Team:Fleet Team label for Observability Data Collection Fleet team
v8.5.0
