
Conversation

Contributor

@juliaElastic juliaElastic commented Aug 16, 2022

Summary

Relates #141567

Changes for the agent bulk actions reassign, unenroll, upgrade, and update tags:

  • Check whether the bulk action targets more than 10k agents (or the configured batch size).
    • If not, execute synchronously as before.
    • If yes, start the action execution asynchronously and return from the API immediately.
      • On errors during batch execution, a task is started with Kibana Task Manager to retry up to 3 times. The retry task resumes from the last successful batch by passing the searchAfter parameter.
      • The UI uses a new /action_status API to report on the progress of actions. I moved the UI part out of this PR; it will be covered in another one in the form of a flyout.
      • The counter shows the number of agents acknowledged by Fleet Server compared to the total actioned.
      • An action can be in progress, completed, cancelled, or expired.
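The threshold check described above can be sketched as follows. This is a minimal sketch with a hypothetical function name; the actual PR routes this decision through the ActionRunner classes, and `SO_SEARCH_LIMIT` is the Fleet constant visible in the diff below.

```typescript
// Default batch size; mirrors Fleet's SO_SEARCH_LIMIT constant.
const SO_SEARCH_LIMIT = 10000;

interface BulkActionOptions {
  batchSize?: number;
}

// Decide whether a bulk action runs synchronously (small selections)
// or is kicked off asynchronously with Task Manager retries on error.
// Hypothetical helper name, for illustration only.
function chooseExecutionMode(
  totalAgents: number,
  options: BulkActionOptions = {}
): 'sync' | 'async' {
  const batchSize = options.batchSize ?? SO_SEARCH_LIMIT;
  return totalAgents <= batchSize ? 'sync' : 'async';
}
```

In the async case the API responds immediately, and the UI later polls `/action_status` to report progress.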

Pending work:

  • Add tests.
  • Update the OpenAPI spec.
  • Retry tasks are currently started when an error is caught during async execution; a Kibana crash/restart (where execution stops without the catch flow running) is not yet handled.
  • Handle unrecoverable errors, e.g. the point in time expiring. This can be an issue if the retry task runs with some delay; the current point in time timeout is set to 10m.
    • Added special error handling for the ES 404 error that is returned when the PIT has expired; we don't retry in that case.
  • Open question: how can the Fleet API determine whether the action failed? For that we would need a failed counter stored in the Fleet action indices.
    • Added a .fleet-actions-status index to save the failed state with an error message when Kibana processing fails.
    • Update Sept 8: moved the .fleet-actions-status index out of this PR to simplify; it will move to another PR. It probably also needs an Elasticsearch change to add the new index mapping.


  • Make the code simpler and more readable; the logic has become quite complex with the ActionRunner and BulkActionResolver (for Task Manager) classes.
    • Did some refactoring to make the parameters type-safe and easier to read.
  • The add/remove tags and force unenroll actions don't create an entry in the .fleet-actions index, so we currently have no info about their action status.
GET kbn:/api/fleet/agents/action_status

{
  "items": [
    {
        "actionId": "ffa5355d-41bd-4382-8d9d-1b93e502936b",
        "nbAgents": 100,
        "complete": true,
        "nbAgentsAck": 100,
        "startTime": "2022-08-23T11:49:43.439Z",
        "type": "POLICY_REASSIGN",
        "total": 100,
        "cancelled": false,
        "expired": false
    }
  ]
}


Testing with ESS (PR build docker image), 8 GB integration server memory:
https://admin.found.no/deployments/af21b782629cb35eacb6b66ae157d211

  • Reassign 20k agents: Kibana processing took 15s; Fleet Server processing took about 21m (the first 10k took 7m, then it slowed down for the second half).
  • Upgrade 20k agents: Kibana processing took 15s; Fleet Server processing had done 11k after 30m and took about 1h to complete all (the action expired after 30m).
  • Unenroll 20k agents: Kibana processing took 13s; Fleet Server processing took 20m.
  • Reassign 30k agents: Kibana processing took 21s; Fleet Server processing took 31m.

On average that is <10s per 10k agents for Kibana processing, and around 10m per 10k agents for Fleet Server processing.

Checklist

@juliaElastic juliaElastic added release_note:feature Makes this part of the condensed release notes ci:deploy-cloud v8.5.0 labels Aug 16, 2022
@juliaElastic juliaElastic self-assigned this Aug 16, 2022
@tylersmalley tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022
@juliaElastic juliaElastic added the Team:Fleet Team label for Observability Data Collection Fleet team label Aug 18, 2022
@juliaElastic juliaElastic force-pushed the feat/action-batch-async branch from 53d33e9 to 8a31e6c Compare August 18, 2022 15:12
skipSuccess
)
);
const batchSize = options.batchSize ?? SO_SEARCH_LIMIT;
Contributor Author

For testing locally with fewer than 10k agents enrolled, set a lower number here:

const batchSize = 1000; // options.batchSize ?? SO_SEARCH_LIMIT;

or set it via the API:

POST kbn:/api/fleet/agents/bulk_reassign
{
  "agents": " fleet-agents.policy_id : (\"e57948b0-1d55-11ed-85fe-e34c31cf865b\" or \"26594150-1d56-11ed-85fe-e34c31cf865b\")",
  "policy_id": "e57948b0-1d55-11ed-85fe-e34c31cf865b",
  "batchSize": 1000
}

@juliaElastic juliaElastic added ci:cloud-deploy Create or update a Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Aug 23, 2022
start_time: startTime ?? now,
minimum_execution_duration: MINIMUM_EXECUTION_DURATION_SECONDS,
expiration: moment(startTime)
.add(MINIMUM_EXECUTION_DURATION_SECONDS, 'seconds')
Contributor Author

The default expiration with upgrade "Immediately" is 30m, which is too short for larger agent selections, e.g. 20k. When I tried upgrading immediately, the action still completed for all agents despite expiring about halfway through. I am not sure whether it is expected behavior that Fleet Server continues to execute the action even after expiration.

Member

Yes, we could probably have a larger expiration and minimum execution duration here (these need to be the same) for the immediate scenario to work.

The default expiration with upgrade "Immediately" is 30m, which is too short for larger agent selections, e.g. 20k. When I tried upgrading immediately, the action still completed for all agents despite expiring about halfway through. I am not sure whether it is expected behavior that Fleet Server continues to execute the action even after expiration.

The action should be delivered immediately to the agent, but Fleet Server may have trouble getting all the agent acks within 30 minutes.

Contributor Author

@juliaElastic juliaElastic Aug 24, 2022

Is there any disadvantage to increasing the minimum expiration time to e.g. 2 hours? Updated.

Member

I think increasing the minimum expiration time to 2 hours will mean that every rolling upgrade shorter than 2 hours will not actually be rolling. That may be acceptable, or maybe we could set this to two hours or more only for the immediate scenario.

Contributor Author

@juliaElastic juliaElastic Sep 5, 2022

The previous logic for the rolling upgrade was to set the minimum duration to 30m and the expiration to the selected duration.
I updated the minimum to 2h (or the selected upgrade window if that is shorter, e.g. 1h).
Previously:

  // Perform a rolling upgrade
  if (upgradeDurationSeconds) {
    return {
      start_time: startTime ?? now,
      minimum_execution_duration: MINIMUM_EXECUTION_DURATION_SECONDS,
      expiration: moment(startTime ?? now)
        .add(upgradeDurationSeconds, 'seconds')
        .toISOString(),
    };
  }

Updated:

  // Perform a rolling upgrade
  if (upgradeDurationSeconds) {
    return {
      start_time: startTime ?? now,
      minimum_execution_duration: Math.min(
        MINIMUM_EXECUTION_DURATION_SECONDS,
        upgradeDurationSeconds
      ),
      expiration: moment(startTime ?? now)
        .add(upgradeDurationSeconds, 'seconds')
        .toISOString(),
    };
  }

@juliaElastic juliaElastic marked this pull request as ready for review August 23, 2022 12:02
@juliaElastic juliaElastic requested a review from a team as a code owner August 23, 2022 12:02
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

start_time?: string;
minimum_execution_duration?: number;
source_uri?: string;
total?: number;
Contributor Author

Added the total count that represents how many agents were actioned (selected by the user); this helps with status reporting in case something went wrong while creating the action documents in batches.


export interface ActionStatus {
actionId: string;
nbAgentsActionCreated: number;
Contributor Author

nbAgentsActionCreated represents how many agents are included in .fleet-actions documents; if this is less than nbAgentsActioned, it indicates something went wrong with Kibana batch processing.
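To make the relationship between the counters concrete, here is a minimal sketch; the interface and helper names are hypothetical and only illustrate how the status fields relate, they are not the PR's exact code.

```typescript
// Hypothetical counter shape illustrating the fields discussed above.
interface ActionCounters {
  nbAgentsActioned: number;      // agents the user selected
  nbAgentsActionCreated: number; // agents written into .fleet-actions documents
  nbAgentsAck: number;           // agents acknowledged by Fleet Server
}

// If fewer action documents were created than agents actioned,
// Kibana's batch processing did not finish cleanly.
function hasKibanaProcessingGap(c: ActionCounters): boolean {
  return c.nbAgentsActionCreated < c.nbAgentsActioned;
}

// The action is complete once every actioned agent has acked.
function isComplete(c: ActionCounters): boolean {
  return c.nbAgentsAck >= c.nbAgentsActioned;
}
```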


await new Promise((resolve, reject) => {
let attempts = 0;
const intervalId = setInterval(async () => {
Contributor Author

For testing the async execution of bulk actions, I had to wait on a Promise; otherwise the FTR shutdown starts and the action doesn't finish.
Added a 1s interval, checked up to 3 times, to verify that the action has completed.
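A minimal sketch of that polling wait, completing the truncated snippet above; the helper name is hypothetical and the real test checks completion via the Fleet APIs.

```typescript
// Poll `isDone` once per interval, up to `maxAttempts` times, then give up.
// Resolves true if the condition became true, false if attempts ran out.
async function waitForCondition(
  isDone: () => Promise<boolean>,
  maxAttempts = 3,
  intervalMs = 1000
): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    let attempts = 0;
    const intervalId = setInterval(async () => {
      attempts++;
      if (await isDone()) {
        clearInterval(intervalId);
        resolve(true);
      } else if (attempts >= maxAttempts) {
        clearInterval(intervalId);
        resolve(false);
      }
    }, intervalMs);
  });
}
```

Awaiting this in the test keeps the FTR process alive until the bulk action has finished (or the attempts are exhausted).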

@juliaElastic juliaElastic added ci:cloud-redeploy Always create a new Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Sep 8, 2022
Member

@kpollich kpollich left a comment

I haven't gotten a chance to run through every action type locally, but I spent yesterday + this morning walking through the code and things look good to me. Definitely more OOP than we have elsewhere in the Fleet codebase, but I think the patterns used here, with a base class implementing the bulk action processing needs, all make sense in this context.

I'll offer my approval, but I think @nchaulet should probably continue to review here as well.

@@ -0,0 +1,143 @@
/*
Member

Thanks for adding this script to help with testing and local dev setup.

},
{
range: {
expiration: { gte: now },
Member

I think removing the expiration condition from this query makes sense.

@nchaulet nchaulet self-requested a review September 9, 2022 13:35
* On errors, starts a task with Task Manager to retry max 3 times
* If the last batch was stored in state, retry continues from there (searchAfter)
*/
public async runActionAsyncWithRetry(): Promise<{ items: BulkActionResult[]; actionId: string }> {
Member

@nchaulet nchaulet Sep 9, 2022

That method seems like a good candidate for unit testing, and the tests would help document the expected behavior.

Contributor Author

Do you mean unit tested?
Yes, I am planning to add more tests after I complete the UI part.

@nchaulet
Member

nchaulet commented Sep 9, 2022

Maybe a dumb question, but I am wondering whether we considered using the task scheduler for every bulk action, not just for retries. It could make things easier to understand, and probably more scalable if Task Manager is ever hosted on a separate Kibana instance that only runs tasks. I do not like having a long-running task spawned from a request; it seems like something that can break.

@juliaElastic
Contributor Author

Maybe a dumb question, but I am wondering whether we considered using the task scheduler for every bulk action, not just for retries. It could make things easier to understand, and probably more scalable if Task Manager is ever hosted on a separate Kibana instance that only runs tasks. I do not like having a long-running task spawned from a request; it seems like something that can break.

Good question. This was the original plan; however, we revised it after feedback from the Alerting team, who are heavy users of Task Manager. They suggested avoiding long-running tasks because they might impact the schedule of other tasks. Also, depending on how many tasks are waiting to run, it might take some time for Task Manager to pick up a task. So I thought it better to start the execution immediately.

@juliaElastic juliaElastic removed the ci:cloud-redeploy Always create a new Cloud deployment label Sep 9, 2022
@juliaElastic juliaElastic mentioned this pull request Sep 9, 2022
9 tasks
@juliaElastic
Contributor Author

@elasticmachine merge upstream

@juliaElastic
Contributor Author

@nchaulet @joshdover are you okay if I merge this today?

@joshdover
Contributor

@juliaElastic Let me take a look at this now; there were just a few key things I wanted to review.

Comment on lines +114 to +115
return Object.values(
res.hits.hits.reduce((acc, hit) => {
Contributor

Optional nit: this combo of values + reduce could probably be simplified to a single flatMap call.

Contributor Author

Not quite: the result here is an object keyed by action id, not an array, so flatMap doesn't apply directly. I reused most of this logic from the existing current_upgrades API.

Comment on lines +20 to +41
let actions = await _getActions(esClient);
const cancelledActionIds = await _getCancelledActionId(esClient);

// Fetch acknowledged result for every action
actions = await pMap(
actions,
async (action) => {
const { count } = await esClient.count({
index: AGENT_ACTIONS_RESULTS_INDEX,
ignore_unavailable: true,
query: {
bool: {
must: [
{
term: {
action_id: action.actionId,
},
},
],
},
},
});
Contributor

I'm concerned about the number of queries here. We're fetching all actions, then doing a separate query for action results for every action. Instead I think we should try combining that into a single query and use aggregations to get the count value that we need.

Likely something like this (pseudo-ish code):

let actions = await _getActions(esClient);

let acks = await esClient.search({
  index: AGENT_ACTIONS_RESULTS_INDEX,
  query: {
    bool: {
      // There's some perf/caching advantages to using filter over must
      // See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context
      // A single terms clause matches any of the action ids (separate term
      // clauses in `filter` would be ANDed together and match nothing).
      filter: [{ terms: { action_id: actions.map((a) => a.id) } }],
    },
  },
  aggs: {
    ack_counts: { terms: { field: 'action_id' } },
  },
});

return actions.map((a) => {
  return {
    ...a,
    nbAgentsAck: acks.aggregations.ack_counts.buckets.find((b) => b.key === a.id)?.doc_count ?? 0,
    // other fields
  };
});

Contributor Author

Is it okay if I refactor this in the next PR about the Agent activity UI? I already had to make changes there to add more info.

Member

I'd be okay with refactoring this fetch operation all at once in the next PR. Let's try to land this one and move on to the UI.

}

async function _getActions(esClient: ElasticsearchClient) {
const res = await esClient.search<FleetServerAgentAction>({
Contributor

Do we need to search for all actions? Is there some way to filter on only relevant or recent ones? Similar question for the cancelled actions query.

Contributor Author

@juliaElastic juliaElastic Sep 12, 2022

We could add a query param to limit the results. Will change this in the next PR.

Comment on lines +67 to +68
* On errors, starts a task with Task Manager to retry max 3 times
* If the last batch was stored in state, retry continues from there (searchAfter)
Contributor

I wonder if we should be checkpointing the state regardless of whether or not there is an error. We need to consider the case where Kibana crashes or is otherwise shutdown. Ideally in this case there's always a task scheduled to confirm that the process was completed successfully. This may require some significant changes to this PR and if so, we could handle it in a follow up.
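The checkpointing idea above can be sketched roughly as follows; the shapes and names here are hypothetical (the PR stores searchAfter in the retry task's state only on errors), and this only illustrates persisting progress after every batch.

```typescript
// Hypothetical checkpoint stored in task state after every successful batch,
// so a scheduled task can resume from it if Kibana crashes mid-run.
interface BatchCheckpoint {
  actionId: string;
  searchAfter?: unknown[]; // sort values of the last processed agent document
  retryCount: number;
}

// Advance the checkpoint with the sort values of the batch just completed.
function advanceCheckpoint(
  prev: BatchCheckpoint,
  lastSortValues: unknown[]
): BatchCheckpoint {
  return { ...prev, searchAfter: lastSortValues };
}
```

A verification task scheduled up front could then compare the stored checkpoint against the expected total and finish (or resume) the run.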

Contributor Author

Yes, I'll tackle this separately.

@juliaElastic
Contributor Author

@elasticmachine merge upstream

@kibana-ci

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Fleet Cypress Tests / Enrollment token page "before all" hook for "Create new Token"

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 873 876 +3

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
fleet 865.8KB 866.4KB +580.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 105.7KB 105.8KB +106.0B
Unknown metric groups

API count

id before after diff
fleet 970 973 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit 38e74d7 into elastic:main Sep 12, 2022
@kibanamachine kibanamachine added the backport:skip This PR does not require backporting label Sep 12, 2022
@juliaElastic juliaElastic mentioned this pull request Sep 12, 2022
3 tasks
@juliaElastic juliaElastic mentioned this pull request Nov 8, 2022
2 tasks

Labels

backport:skip This PR does not require backporting
release_note:feature Makes this part of the condensed release notes
Team:Fleet Team label for Observability Data Collection Fleet team
v8.5.0
