[SLO] Fix issue where filters do not apply to overview stats#234218

Merged
baileycash-elastic merged 31 commits into elastic:main from baileycash-elastic:slo-233631
Oct 21, 2025

Conversation

@baileycash-elastic
Contributor

@baileycash-elastic baileycash-elastic commented Sep 5, 2025

Summary

Closes #233631

This PR fixes an inconsistency in the SLO overview stats: the alert and rule counts did not reflect the filtered SLO results, unlike the healthy and violated counts.

Demo

Screen.Recording.2025-09-10.at.4.54.44.PM.mov

@baileycash-elastic baileycash-elastic added release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. v9.2.0 labels Sep 5, 2025
@github-actions github-actions bot added the author:obs-ux-management PRs authored by the obs ux management team label Sep 5, 2025
@baileycash-elastic baileycash-elastic changed the title Forward SLO filters to SLO overview service Forward SLO filters to SLO overview stat service Sep 5, 2025
@baileycash-elastic
Contributor Author

/ci

@baileycash-elastic
Contributor Author

/ci

@baileycash-elastic baileycash-elastic marked this pull request as ready for review September 11, 2025 13:17
@baileycash-elastic baileycash-elastic requested a review from a team as a code owner September 11, 2025 13:17
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

…_slo_stats_overview.ts

Co-authored-by: Shahzad <shahzad31comp@gmail.com>
Contributor

@justinkambic justinkambic left a comment

Just a first-pass code review of the things that caught my eye right away. I'll come back to this later; I'm pressed for time and can't finish the review right now.

const filters = params.filters ?? '';
const kqlQuery = params?.kqlQuery ?? '';
const filters = params?.filters ?? '';
const parsedFilters = parseStringFilters(filters, this.logger);
Contributor

This parseStringFilters function is not great, because it reverts TypeScript to typeless, wild-west-style JS. But it already does a JSON.parse, which is what you do further down, so we don't need a separate JSON.parse(params?.filters).

Ideally there'd be some kind of type checking or at least casting to the expected values within parseStringFilters, but it's already used throughout the server code and is out of scope for refactoring in this PR.
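To illustrate the kind of type checking mentioned above, here is a minimal sketch of a parser that narrows the JSON.parse result instead of returning an untyped value. The names `ParsedFilters` and `parseFiltersSafely`, and the `filter` / `must_not` shape, are assumptions made for illustration; the real parseStringFilters implementation and the actual filter shape are not shown in this thread.

```typescript
// Hypothetical shape; the real filter structure is not shown in this thread.
interface ParsedFilters {
  filter: unknown[];
  must_not: unknown[];
}

// Returns a fresh empty result so callers never share array references.
function emptyFilters(): ParsedFilters {
  return { filter: [], must_not: [] };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Parses a JSON string and narrows the result instead of returning `any`.
// Malformed or unexpected input falls back to empty filters and is reported
// through the optional `onError` callback (a stand-in for a logger).
function parseFiltersSafely(
  raw: string,
  onError: (message: string) => void = () => {}
): ParsedFilters {
  if (raw.trim() === '') {
    return emptyFilters();
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    onError(`Failed to parse filters: ${String(e)}`);
    return emptyFilters();
  }
  if (!isRecord(parsed)) {
    onError('Filters must be a JSON object');
    return emptyFilters();
  }
  return {
    filter: Array.isArray(parsed.filter) ? parsed.filter : [],
    must_not: Array.isArray(parsed.must_not) ? parsed.must_not : [],
  };
}
```

With a helper like this, callers get a value whose fields are known to be arrays, and the individual clauses stay `unknown` until they are validated further.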

@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/saved-objects-service.html#_mappings

id before after diff
_data_stream_timestamp 1 - -1
_doc_count 1 - -1
_ignored_source 1 - -1
_index_mode 1 - -1
_inference_fields 1 - -1
_tier 1 - -1
apm-custom-dashboards 5 - -5
apm-server-schema 2 - -2
apm-service-group 5 - -5
application_usage_daily 2 - -2
config 2 - -2
config-global 2 - -2
coreMigrationVersion 1 - -1
created_at 1 - -1
created_by 1 - -1
entity-definition 9 - -9
entity-discovery-api-key 2 - -2
event_loop_delays_daily 2 - -2
favorites 4 - -4
file 11 - -11
file-upload-usage-collection-telemetry 3 - -3
fileShare 5 - -5
infra-custom-dashboards 4 - -4
infrastructure-monitoring-log-view 2 - -2
intercept_trigger_record 5 - -5
legacy-url-alias 7 - -7
managed 1 - -1
ml-job 6 - -6
ml-module 13 - -13
ml-trained-model 7 - -7
monitoring-telemetry 2 - -2
namespace 1 - -1
namespaces 1 - -1
observability-onboarding-state 2 - -2
originId 1 - -1
product-doc-install-status 7 - -7
references 4 - -4
sample-data-telemetry 3 - -3
security-ai-prompt 8 - -8
slo 11 - -11
space 5 - -5
synthetics-monitor 34 - -34
synthetics-monitor-multi-space 34 - -34
tag 4 - -4
type 1 - -1
typeMigrationVersion 1 - -1
ui-metric 2 - -2
updated_at 1 - -1
updated_by 1 - -1
upgrade-assistant-ml-upgrade-operation 3 - -3
upgrade-assistant-reindex-operation 3 - -3
uptime-synthetics-api-key 2 - -2
url 5 - -5
usage-counters 2 - -2
total -246

History

@baileycash-elastic baileycash-elastic removed request for a team October 14, 2025 21:14
@baileycash-elastic
Contributor Author

@PhilippeOberti ohhh yes 😅 it's fixed now

if (querySLOsForIds) {
  do {
    const sloIdCompositeQueryResponse = await this.scopedClusterClient.asCurrentUser.search({
      index: '.slo-observability.summary-*',
Contributor

I think this shouldn't be hardcoded. Use the return value from getSummaryIndices instead.

Contributor Author

should be good to go here
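For readers following along, the suggestion above (resolve the index patterns rather than hardcoding one) might be sketched roughly as follows. The `ResolveSummaryIndices` type is a hypothetical stand-in: the real signature and return value of getSummaryIndices are not shown in this thread, so the resolver is injected as a plain function here.

```typescript
// Hypothetical stand-in for a helper such as `getSummaryIndices`; its real
// signature is not shown in this thread.
type ResolveSummaryIndices = () => Promise<string[]>;

// Simplified request shape for illustration (not the full search request type).
interface SummarySearchRequest {
  index: string[];
  size: number;
}

// Builds the search request from resolved index patterns instead of a
// hardcoded string, so the query follows whatever the resolver returns.
async function buildSummarySearchRequest(
  resolveSummaryIndices: ResolveSummaryIndices
): Promise<SummarySearchRequest> {
  const indices = await resolveSummaryIndices();
  return {
    index: indices,
    size: 0, // only aggregations are needed, no hits
  };
}
```

Injecting the resolver also makes this easy to unit test with a stub that returns, for example, both a local and a remote-cluster pattern.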

Contributor

@justinkambic justinkambic left a comment

First of all, thank you for writing an excellent battery of unit tests, it made it very easy to experiment with recommended changes.

I have a few areas of this code I want to change. I think we can significantly improve the readability and reduce the number of vars we need to declare to get this work done. I may want to change a few things yet, let's see the output from this round. Check my recommendations and LMK if you disagree with any of the points.

If it is too hard to track what I'm looking for by looking at the comments in the PR diff, I pushed a version of your branch to my remote that includes a commit with everything I mentioned here.

/*
If we know there are no SLOs that match the provided filters, we can skip querying for rules and alerts
*/
const [rules, alerts] = await Promise.all(
Contributor

I think given how much is happening already within this function, it makes logical sense to pull this block out into a separate function we can call. Something like:

  private async fetchRulesAndAlerts({
    querySLOsForIds,
    sloRuleQueryKeys,
    ruleFilters,
    alertFilters,
  }: {
    querySLOsForIds: boolean;
    sloRuleQueryKeys: string[];
    ruleFilters?: KueryNode;
    alertFilters?: QueryDslQueryContainer[];
  }) {
    return await Promise.all(
      querySLOsForIds && sloRuleQueryKeys.length === 0
        ? [
            {
              total: 0,
            },
            {
              activeAlertCount: 0,
              recoveredAlertCount: 0,
            },
          ]
        : [
            this.rulesClient.find({
              options: {
                ruleTypeIds: SLO_RULE_TYPE_IDS,
                consumers: [
                  AlertConsumers.SLO,
                  AlertConsumers.ALERTS,
                  AlertConsumers.OBSERVABILITY,
                ],
                ...(ruleFilters ? { filter: ruleFilters } : {}),
              },
            }),

            this.racClient.getAlertSummary({
              ruleTypeIds: SLO_RULE_TYPE_IDS,
              consumers: [AlertConsumers.SLO, AlertConsumers.ALERTS, AlertConsumers.OBSERVABILITY],
              gte: moment().subtract(24, 'hours').toISOString(),
              lte: moment().toISOString(),
              ...(alertFilters?.length
                ? {
                    filter: alertFilters,
                  }
                : {}),
            }),
          ]
    );
  }

The call signature will remain the same.

Comment on lines +126 to +148
if (buckets && buckets.length > 0) {
  alertFilterTerms = alertFilterTerms.concat(
    ...buckets.map((bucket) => {
      sloRuleQueryKeys.push(bucket.key.sloId);
      return {
        bool: {
          must: [
            { term: { 'kibana.alert.rule.parameters.sloId': bucket.key.sloId } },
            ...(instanceIdIncluded
              ? [
                  {
                    term: {
                      'kibana.alert.instance.id': bucket.key.sloInstanceId,
                    },
                  },
                ]
              : []),
          ],
        },
      };
    })
  );
}
Contributor

I think it is worth extracting this to its own processing function.

  private processSloQueryBuckets(
    buckets: Array<{ key: { sloId: string; sloInstanceId: string } }>,
    sloRuleQueryKeys: string[],
    instanceId?: string
  ): QueryDslQueryContainer[] {
    return buckets.map((bucket) => {
      // alternatively, return `bucket.key.sloId` and add it to the `sloRuleQueryKeys` in the loop,
      // this way we get rid of one of the params here
      sloRuleQueryKeys.push(bucket.key.sloId);
      return {
        bool: {
          must: [
            { term: { 'kibana.alert.rule.parameters.sloId': bucket.key.sloId } },
            ...(instanceId
              ? [
                  {
                    term: {
                      'kibana.alert.instance.id': bucket.key.sloInstanceId,
                    },
                  },
                ]
              : []),
          ],
        },
      };
    });
  }

Then the code in your loop becomes much simpler to understand:

          afterKey = this.getAfterKey(sloIdCompositeQueryResponse.aggregations?.sloIds);

          const buckets = (
            sloIdCompositeQueryResponse.aggregations?.sloIds as {
              buckets?: Array<{ key: { sloId: string; sloInstanceId: string } }>;
            }
          )?.buckets;
          if (buckets) {
            alertFilterTerms.push(
              ...this.processSloQueryBuckets(buckets, sloRuleQueryKeys, instanceId)
            );
          }
        } while (afterKey);
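The alternative mentioned in the code comment above (returning the SLO IDs instead of pushing them into a shared array) could look something like the sketch below. `mapSloQueryBuckets`, `SloBucket`, and the simplified query types are hypothetical names used only for illustration; the real code would use `QueryDslQueryContainer`.

```typescript
interface SloBucket {
  key: { sloId: string; sloInstanceId: string };
}

// Simplified stand-ins for `QueryDslQueryContainer`.
type TermClause = { term: Record<string, string> };
type BoolMustQuery = { bool: { must: TermClause[] } };

// Pure variant: no array is mutated; both results are returned to the caller.
function mapSloQueryBuckets(
  buckets: SloBucket[],
  instanceId?: string
): { sloIds: string[]; alertFilterTerms: BoolMustQuery[] } {
  const sloIds = buckets.map((bucket) => bucket.key.sloId);
  const alertFilterTerms = buckets.map(
    (bucket): BoolMustQuery => ({
      bool: {
        must: [
          { term: { 'kibana.alert.rule.parameters.sloId': bucket.key.sloId } },
          // Only constrain on the instance when an instance ID filter was given.
          ...(instanceId
            ? [{ term: { 'kibana.alert.instance.id': bucket.key.sloInstanceId } }]
            : []),
        ],
      },
    })
  );
  return { sloIds, alertFilterTerms };
}
```

Inside the loop, the caller would then append both results itself, for example `sloRuleQueryKeys.push(...result.sloIds)` and `alertFilterTerms.push(...result.alertFilterTerms)`, which removes one parameter from the helper.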

You could even parallelize this by making it async and dumping the promises in an async queue that lives outside the loop so you don't delay subsequent calls to the DB. Then, further down when it's time to use sloRuleQueryKeys and alertFilterTerms you can just await the promise and pick up all the items in the array. I didn't write this code because it's probably over-optimizing.
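As a rough sketch of that idea, under the assumption that bucket processing is made async: each page's processing is started without being awaited, so the next page request is not delayed, and all results are collected once the loop ends. Note that the page requests themselves still run sequentially, because each one needs the previous response's after key. `Page`, `FetchPage`, and `collectWithDeferredProcessing` are hypothetical names, not part of the PR.

```typescript
interface Page<T> {
  items: T[];
  afterKey?: string;
}

type FetchPage<T> = (afterKey?: string) => Promise<Page<T>>;

async function collectWithDeferredProcessing<T, R>(
  fetchPage: FetchPage<T>,
  processItems: (items: T[]) => Promise<R[]>
): Promise<R[]> {
  // The "queue" that lives outside the loop.
  const pending: Array<Promise<R[]>> = [];
  let afterKey: string | undefined;

  do {
    const page = await fetchPage(afterKey);
    // Start processing but do not await it here.
    const task = processItems(page.items);
    // Mark the promise as handled so an early rejection is not reported as
    // unhandled; the error still surfaces through Promise.all below.
    task.catch(() => {});
    pending.push(task);
    afterKey = page.afterKey;
  } while (afterKey);

  // Promise.all preserves the order in which pages were fetched.
  const results = await Promise.all(pending);
  return ([] as R[]).concat(...results);
}
```

Whether this is worth the added complexity depends on how expensive the per-page processing is; for a cheap synchronous map it is likely unnecessary, as the comment above already notes.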

import { getSummaryIndices, getSloSettings } from './slo_settings';
import { getElasticsearchQueryOrThrow, parseStringFilters } from './transform_generators';

const ES_PAGESIZE_LIMIT = 5000;
Contributor

Why did we pick 5000 for the bucket limit for the query? If there's no reason we may want to reduce this to 1000, per the docs for the composite agg:

If all composite buckets should be retrieved it is preferable to use a small size (100 or 1000 for instance) and then use the after parameter to retrieve the next results.

Given this won't change any aspect of the implementation, if there's no real reason for the 5k limit we may want to reduce this to 1k.
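For illustration, a composite aggregation that uses a 1000-bucket page size together with the `after` parameter might be built along these lines. The constant name, the builder function, and the field names `slo.id` and `slo.instanceId` are assumptions; the actual query in the PR is not shown in this thread.

```typescript
// Hypothetical constant; the PR under review uses ES_PAGESIZE_LIMIT = 5000.
const COMPOSITE_PAGE_SIZE = 1000;

type CompositeKey = { sloId: string; sloInstanceId: string };

interface CompositeAgg {
  size: number;
  sources: Array<Record<string, { terms: { field: string } }>>;
  after?: CompositeKey;
}

// Builds the `aggs` section for one page. Pass the `after_key` from the
// previous response to request the next page; omit it for the first page.
function buildSloIdCompositeAgg(after?: CompositeKey): { sloIds: { composite: CompositeAgg } } {
  return {
    sloIds: {
      composite: {
        size: COMPOSITE_PAGE_SIZE,
        sources: [
          // Field names are assumed for this sketch.
          { sloId: { terms: { field: 'slo.id' } } },
          { sloInstanceId: { terms: { field: 'slo.instanceId' } } },
        ],
        ...(after ? { after } : {}),
      },
    },
  };
}
```

Because the loop already pages until no after key is returned, lowering the page size only changes how many round trips are made, not the final set of buckets.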

Comment on lines +76 to +79
let alertFilters: QueryDslQueryContainer[] = [];
let alertFilterTerms: QueryDslQueryContainer[] = [];
let afterKey: AggregationsAggregate | undefined;
let ruleFilters: KueryNode | undefined;
Contributor

With the changes I recommended below it becomes possible to have only afterKey declared as a let, and we can delay the declaration of alertFilters and ruleFilters until after the loop finishes.

    const alertFilterTerms: QueryDslQueryContainer[] = [];
    let afterKey: AggregationsAggregate | undefined;

Comment on lines +151 to +167
    const resultNodes = nodeBuilder.or(
      sloRuleQueryKeys.map((sloId) => nodeBuilder.is(`alert.attributes.params.sloId`, sloId))
    );

    ruleFilters = resultNodes;
    alertFilters = [
      {
        bool: {
          should: [...alertFilterTerms],
        },
      },
    ];
  }
} catch (error) {
  this.logger.error(`Error querying SLOs for IDs: ${error}`);
  throw error;
}
Contributor

I recommend pulling these blocks out of the loop. It will simplify the code, and you're already adding these values to a list. Include a default short-circuit in each declaration for when no values are retrieved, and we can keep this closer to the top level of the function.

    const ruleFilters: KueryNode | undefined =
      sloRuleQueryKeys.length > 0
        ? nodeBuilder.or(
            sloRuleQueryKeys.map((sloId) => nodeBuilder.is(`alert.attributes.params.sloId`, sloId))
          )
        : undefined;
    const alertFilters =
      alertFilterTerms.length > 0
        ? [
            {
              bool: {
                should: [...alertFilterTerms],
              },
            },
          ]
        : [];

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #57 / Fleet packages test Automatic agent upgrades should take agents marked but not ready for retry into account but not upgrade them

Metrics [docs]

✅ unchanged

History

Contributor

@justinkambic justinkambic left a comment

LGTM, tested this against main and the fix appears to work


@kibanamachine
Contributor

Starting backport for target branches: 8.19, 9.1, 9.2

https://github.com/elastic/kibana/actions/runs/18694598688

@kibanamachine
Contributor

💔 Some backports could not be created

Status Branch Result
8.19 Backport failed because of merge conflicts

You might need to backport the following PRs to 8.19:
- [SLO] Use internal es client for fetching remote cluster info !! (#224870)
9.1 Backport failed because of merge conflicts
9.2

Note: Successful backport PRs will be merged automatically after passing CI.

Manual backport

To create the backport manually run:

node scripts/backport --pr 234218

Questions ?

Please refer to the Backport tool documentation


Labels

author:obs-ux-management PRs authored by the obs ux management team backport:version Backport to applied version labels release_note:skip Skip the PR/issue when compiling release notes Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. v9.2.1 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SLO page - Burn rate summary does not take filter into account

8 participants