Skip to content

[Security Solution][Alerts] improves IM rule memory usage and performance#149208

Merged
vitaliidm merged 33 commits intoelastic:mainfrom
vitaliidm:alerts/im-memory-and-performance
Feb 6, 2023
Merged

[Security Solution][Alerts] improves IM rule memory usage and performance#149208
vitaliidm merged 33 commits intoelastic:mainfrom
vitaliidm:alerts/im-memory-and-performance

Conversation

@vitaliidm
Copy link
Contributor

@vitaliidm vitaliidm commented Jan 19, 2023

Summary

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@elastic elastic deleted a comment from kibana-ci Jan 31, 2023
@vitaliidm vitaliidm marked this pull request as ready for review February 1, 2023 15:28
@vitaliidm vitaliidm requested a review from a team as a code owner February 1, 2023 15:28
@vitaliidm vitaliidm self-assigned this Feb 1, 2023
@vitaliidm vitaliidm added enhancement New value added to drive a business result release_note:skip Skip the PR/issue when compiling release notes Feature:Detection Alerts Security Solution Detection Alerts Feature Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Alerts Security Detection Alerts Area Team labels Feb 1, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

},
};

const threatResponse = await getThreatList({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we here specify per page here (9k for example)?
Because looks like we query only 1k threats, but there can be more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part basically replicates current behaviour in 8.6, where max of 1,000 threats is fetched before building the enrichment.

I don't know whether this was conscious decision to avoid loading large number of documents in memory(with potentially large number of fields depends on _source: [${threatIndicatorPath}.*, 'threat.feed.*'], in threat params) or just overlook.

There is some hard limit of 10,000 items per page though.
Do you think it would be safe to load 9,000 items in one request with no control on size of the document?

We do fetch 9,000 threats in one request on initial phase of the execution. But it returns only indicator mapping fields, which response is significantly smaller

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think before it was like:
const calculatedPerPage = perPage ?? INDICATOR_PER_PAGE;
And perPage was 9000 I think

So, it worked before for this amount, at least I don't remember any client SDH about that.

Right now we can potentially miss some threat indicators for matches, and some alerts can't be without matches at all.
In this trade-off, I lean more toward keeping 9k as a limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think before it was like:
const calculatedPerPage = perPage ?? INDICATOR_PER_PAGE;
And perPage was 9000 I think

yes, that's where this part is coming from:

We do fetch 9,000 threats in one request on initial phase of the execution. But it returns only indicator mapping fields, which response is significantly smaller

It fetches, 9,000 batch of threat documents, but only indicator mapping fields. So, the size of each document is small.

But, when it fetches threats to make enrichments, it uses default page size as 1,000 as perPage is undefined in that call. And it returns, the whole section of document, defined in query parameters

  _source: [${threatIndicatorPath}.*, 'threat.feed.*'],

for that particular call, which fetches threats for enrichments
So, this implementation is essentially kept in the same way as in 8.6.

While, I agree, it can miss a lot of potential threats my concern is around fetching of large number of documents.
Which could result:

  • large memory consumption
  • too large content of response that causes rule execution error. Corresponding issue

So, it worked before for this amount, at least I don't remember any client SDH about that.
It worked for 1,000 for threats first search

Although it seems, 9,000 threats should not cause the issue as in 150041

@marshallmain, what are you thoughts on this?

@nkhristinin nkhristinin added the ci:cloud-deploy Create or update a Cloud deployment label Feb 6, 2023
@nkhristinin
Copy link
Contributor

@elasticmachine merge upstream

@nkhristinin nkhristinin self-requested a review February 6, 2023 11:57
@vitaliidm vitaliidm added ci:cloud-redeploy Always create a new Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Feb 6, 2023
@vitaliidm vitaliidm enabled auto-merge (squash) February 6, 2023 13:14
@kibana-ci
Copy link

kibana-ci commented Feb 6, 2023

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @vitaliidm

Copy link
Contributor

@marshallmain marshallmain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple comments for follow up and longer term views on how we should handle threat match rules. The PR looks good to merge pending the comments in get_signals_map_from_threat_index.ts.

I still find the threat match code particularly tough to review due to the overall structure. The control flow is difficult to follow, names are not clear, and the overall design originally attempted to share the core searchAfterBulkCreate function with the query rule type, but there's now a ton of additional logic that was bolted on. I think splitting threat match rules away from searchAfterBulkCreate and making the logic independent of KQL rules would free us up to refactor more of the threat match rule type and make this a lot easier to follow. For event-first search, it seems to me that we won't need to make the extra queries in searchAfterBulkCreate at all since we already have the source events that searchAfterBulkCreate will fetch again.

@nkhristinin @vitaliidm I'd like to find some time after FF to discuss the code architecture and see if we can agree on a path forward.


while (
maxThreatsReachedMap.size < eventsCount &&
(threatList ? threatList?.hits.hits.length > 0 : true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition defaulting to true seems like an unnecessary risk where a bug in getThreatList could cause an infinite loop. Can we fetch the first threat list outside the loop and have the loop break if threatList is undefined?

Suggested change
(threatList ? threatList?.hits.hits.length > 0 : true)
threatList?.hits.hits.length > 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in #150677

decodedQuery: ThreatMatchNamedQuery | ThreatTermNamedQuery;
}) => {
const signalMatch = signalsQueryMap.get(signalId);
if (!signalMatch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if block looks redundant with the one on line 65 below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in #150677


export type SignalsQueryMap = Map<string, ThreatMatchNamedQuery[]>;

interface GetSignalsMatchesFromThreatIndexOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this interface name doesn't match the function name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in #150677

const values = Array.isArray(threatValue) ? threatValue : [threatValue];

values.forEach((value) => {
if (value && signalValueMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that signalValueMap is required for this function to work correctly when the query against the threatList uses a terms query. However, the type system doesn't enforce that constraint so it would be easy to miss and accidentally use a terms query without providing the signalValueMap.

We should continue to work on ways to simplify the threat match rule logic so it's easier to maintain moving forward, and use the TS compiler to enforce requirements like this whenever possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.
This part was adopted from the initial IM terms PR and integrated into this one.
The issue I see here, it's difficult to determine whether query has any terms clauses within it and whether they related to IM execution path. Also it can be determined only in a runtime environment, when all checks for index fields performed and it's clear terms query can be generated.

So, instead, I introduced parameter termsQueryAllowed, that would require from developer to set it to true when terms query can be used. Once its set, it will require signalValueMap to be passed as argument

});

// the third request return empty results
getThreatListMock.mockReturnValueOnce({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found tests that mock ES responses are often brittle, either failing when they shouldn't or not failing when they should. It's fine to leave these in, but if we make changes to the implementation and find that these tests require a lot of maintenance it may be easier to remove them or refactor them so they don't rely on mock responses as much, e.g. by extracting the ES calls from the encoding logic and testing the encoding logic with unit tests and ES calls would be covered by integration tests.

Copy link
Contributor Author

@vitaliidm vitaliidm Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for highlighting these concerns.

I share the opinion, it might break too easy if implementation of getSignalsQueryMapFromThreatIndex changes.

On the other hand, I think not all the cases might be possible to cover easily by functional tests. For example if we want to cover processing of multiple result pages, we would need to populate test index with thousands of documents. So, it might be easier to mock supposed ES response. Especially, given the fact it unlikely changes in future.

I haven't yet encountered issue with ES mocks you mentioned

I've found tests that mock ES responses are often brittle, either failing when they shouldn't or not failing when they should.

But will pay more attention in future to those kind of tests

@vitaliidm vitaliidm merged commit 2ff017e into elastic:main Feb 6, 2023
vitaliidm added a commit that referenced this pull request Mar 1, 2023
…0677)

## Summary

- addresses feedback from #149208
- typings for `getSignalsQueryMapFromThreatIndex`
- fixes interface name for `getSignalsQueryMapFromThreatIndex`
- small code refactorings

More details in comments of the initial PR

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Mar 1, 2023
…stic#150677)

## Summary

- addresses feedback from elastic#149208
- typings for `getSignalsQueryMapFromThreatIndex`
- fixes interface name for `getSignalsQueryMapFromThreatIndex`
- small code refactorings

More details in comments of the initial PR

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit c08cdc8)
kibanamachine referenced this pull request Mar 1, 2023
#150677) (#152426)

# Backport

This will backport the following commits from `main` to `8.7`:
- [[Security Solution][Alerts] addresses IM performance PR feedback
(#150677)](#150677)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Vitalii
Dmyterko","email":"92328789+vitaliidm@users.noreply.github.com"},"sourceCommit":{"committedDate":"2023-03-01T09:46:23Z","message":"[Security
Solution][Alerts] addresses IM performance PR feedback (#150677)\n\n##
Summary\r\n\r\n- addresses feedback from
https://github.com/elastic/kibana/pull/149208\r\n- typings for
`getSignalsQueryMapFromThreatIndex`\r\n- fixes interface name for
`getSignalsQueryMapFromThreatIndex`\r\n- small code
refactorings\r\n\r\nMore details in comments of the initial
PR\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine
<42973632+kibanamachine@users.noreply.github.com>","sha":"c08cdc8db633a02ac65d0940a26e5f5a86542d9b","branchLabelMapping":{"^v8.8.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:
SecuritySolution","Feature:Detection Alerts","Team:Detection
Alerts","backport:prev-minor","v8.7.0","v8.8.0"],"number":150677,"url":"https://github.com/elastic/kibana/pull/150677","mergeCommit":{"message":"[Security
Solution][Alerts] addresses IM performance PR feedback (#150677)\n\n##
Summary\r\n\r\n- addresses feedback from
https://github.com/elastic/kibana/pull/149208\r\n- typings for
`getSignalsQueryMapFromThreatIndex`\r\n- fixes interface name for
`getSignalsQueryMapFromThreatIndex`\r\n- small code
refactorings\r\n\r\nMore details in comments of the initial
PR\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine
<42973632+kibanamachine@users.noreply.github.com>","sha":"c08cdc8db633a02ac65d0940a26e5f5a86542d9b"}},"sourceBranch":"main","suggestedTargetBranches":["8.7"],"targetPullRequestStates":[{"branch":"8.7","label":"v8.7.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.8.0","labelRegex":"^v8.8.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/150677","number":150677,"mergeCommit":{"message":"[Security
Solution][Alerts] addresses IM performance PR feedback (#150677)\n\n##
Summary\r\n\r\n- addresses feedback from
https://github.com/elastic/kibana/pull/149208\r\n- typings for
`getSignalsQueryMapFromThreatIndex`\r\n- fixes interface name for
`getSignalsQueryMapFromThreatIndex`\r\n- small code
refactorings\r\n\r\nMore details in comments of the initial
PR\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine
<42973632+kibanamachine@users.noreply.github.com>","sha":"c08cdc8db633a02ac65d0940a26e5f5a86542d9b"}}]}]
BACKPORT-->

Co-authored-by: Vitalii Dmyterko <92328789+vitaliidm@users.noreply.github.com>
bmorelli25 pushed a commit to bmorelli25/kibana that referenced this pull request Mar 10, 2023
…stic#150677)

## Summary

- addresses feedback from elastic#149208
- typings for `getSignalsQueryMapFromThreatIndex`
- fixes interface name for `getSignalsQueryMapFromThreatIndex`
- small code refactorings

More details in comments of the initial PR

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
@vitaliidm vitaliidm deleted the alerts/im-memory-and-performance branch March 4, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting ci:cloud-redeploy Always create a new Cloud deployment enhancement New value added to drive a business result Feature:Detection Alerts Security Solution Detection Alerts Feature release_note:skip Skip the PR/issue when compiling release notes Team:Detection Alerts Security Detection Alerts Area Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.7.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants