
[Obs Agent] add observability get_trace_change_points tool#247810

Merged
arturoliduena merged 20 commits into elastic:main from arturoliduena:obs-agent-418-trace-change-points-tool
Jan 16, 2026

Conversation

@arturoliduena
Contributor

@arturoliduena arturoliduena commented Jan 5, 2026

Closes https://github.com/elastic/obs-ai-team/issues/418

Summary

Add observability get_trace_change_points tool

Get trace change points tool

tool id: observability.get_trace_change_points
This tool analyzes traces to detect statistically significant change points in latency, throughput, and failure rate, grouped by a field (default: 'service.name'). Example group-by fields:
Service level: 'service.name', 'service.environment', 'service.version'
Transaction level: 'transaction.name', 'transaction.type'
Infrastructure level: 'host.name', 'container.id', 'kubernetes.pod.name'

Trace metrics:

  • Latency: avg/p95/p99 response time.
  • Throughput: requests per minute.
  • Failure rate: percentage of failed transactions.
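
As a rough illustration of the throughput and failure-rate definitions above, here is a minimal sketch. The `TransactionBucket` shape and both function names are hypothetical and not taken from the PR, and the failure rate is assumed to use all transactions in the bucket as the denominator.

```typescript
// Hypothetical shape for one time bucket of transactions (not from the PR).
interface TransactionBucket {
  totalCount: number; // all transactions observed in the bucket
  failedCount: number; // transactions with a failed outcome
  bucketSizeMinutes: number; // width of the bucket in minutes
}

// Throughput: requests per minute.
function throughputPerMinute(bucket: TransactionBucket): number {
  return bucket.totalCount / bucket.bucketSizeMinutes;
}

// Failure rate: percentage of failed transactions.
// Returns 0 for an empty bucket to avoid dividing by zero.
function failureRatePercent(bucket: TransactionBucket): number {
  if (bucket.totalCount === 0) {
    return 0;
  }
  return (bucket.failedCount / bucket.totalCount) * 100;
}

const exampleBucket: TransactionBucket = {
  totalCount: 120,
  failedCount: 30,
  bucketSizeMinutes: 2,
};

console.log(throughputPerMinute(exampleBucket)); // 60
console.log(failureRatePercent(exampleBucket)); // 25
```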

Supports optional KQL filtering
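
To make the summary more concrete, the following sketch builds an aggregation body that groups by a field, buckets a latency metric over time, and runs Elasticsearch's `change_point` pipeline aggregation over those buckets. This is an assumption about how such a tool could be wired, not the PR's actual query: `buildChangePointAggs`, the `fixedInterval` parameter, and the aggregation names are hypothetical, and KQL filtering is left out.

```typescript
// Hypothetical options; only the 'service.name' default comes from the summary above.
interface ChangePointAggOptions {
  groupBy?: string;
  fixedInterval: string; // e.g. '1m'
  durationField: string; // e.g. 'transaction.duration.us'
}

// Builds an aggregation body: terms (group by) -> date_histogram -> avg latency,
// plus a sibling change_point aggregation that reads the per-bucket latency.
function buildChangePointAggs({
  groupBy = 'service.name',
  fixedInterval,
  durationField,
}: ChangePointAggOptions) {
  return {
    groups: {
      terms: { field: groupBy },
      aggs: {
        time_series: {
          date_histogram: { field: '@timestamp', fixed_interval: fixedInterval },
          aggs: {
            latency: { avg: { field: durationField } },
          },
        },
        latency_change_point: {
          // buckets_path points at the latency metric inside the histogram buckets.
          change_point: { buckets_path: 'time_series>latency' },
        },
      },
    },
  };
}

const aggs = buildChangePointAggs({
  fixedInterval: '1m',
  durationField: 'transaction.duration.us',
});
console.log(JSON.stringify(aggs, null, 2));
```

Keeping the change-point aggregation inside the terms bucket means each group (for example, each service) gets its own change-point result.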

@arturoliduena arturoliduena requested review from a team as code owners January 5, 2026 12:35
@arturoliduena arturoliduena added the release_note:feature (Makes this part of the condensed release notes), backport:version (Backport to applied version labels), v9.3.0, and Team:obs-ai (Observability AI team) labels Jan 5, 2026
@elasticmachine
Contributor

Pinging @elastic/obs-ai-team (Team:obs-ai)

@sorenlouv
Member

sorenlouv commented Jan 5, 2026

I also asked this in the issue: do we need this tool, when we have get_metric_change_points added in #242423?

The only reason for having a tool dedicated to traces is if it makes it (much) easier for the LLM to retrieve change points for trace metrics, aka RED metrics (latency, throughput, error rate).

Can you list how someone would do that using the get_metric_change_points and get_trace_change_points respectively?

@sorenlouv
Member

sorenlouv commented Jan 6, 2026

@arturoliduena Please take a look at @viduni94's PR in #247474. Specifically look at how getTraceMetrics is implemented. getPreferredDocumentSource and apmEventClient are used to query the right metric sets. getDurationFieldForTransactions and getOutcomeAggregation are used to query the right fields.

You should also read this comment to get a good understanding of the APM data model.

@arturoliduena arturoliduena force-pushed the obs-agent-418-trace-change-points-tool branch from 09208a9 to a276773 Compare January 8, 2026 16:12
@arturoliduena arturoliduena requested a review from a team as a code owner January 8, 2026 16:12
@botelastic botelastic bot added the Team:obs-presentation Focus: APM UI, Infra UI, Hosts UI, Universal Profiling, Obs Overview and left Navigation label Jan 8, 2026
@elasticmachine
Contributor

Pinging @elastic/obs-presentation-team (Team:obs-presentation)

Comment on lines +104 to +106
avg: {
  field: durationField,
},
Member

You should also support p95 and p99

Contributor Author

@arturoliduena arturoliduena Jan 9, 2026


Comment on lines +131 to +150
latency:
  // Avoid unsupported aggregation on downsampled index, see example error:
  // "reason": {
  //   "type": "unsupported_aggregation_on_downsampled_index",
  //   "reason": "Field [transaction.duration.summary] of type [aggregate_metric_double] is not supported for aggregation [percentiles]"
  // }
  durationField !== 'transaction.duration.summary' &&
  (latencyType === 'p95' || latencyType === 'p99')
    ? {
        percentiles: {
          field: durationField,
          percents: [Number(`${latencyType.split('p')[1]}.0`)],
          keyed: true,
        },
      }
    : {
        avg: {
          field: durationField,
        },
      },
Member

you can't calculate percentiles on the summary field. You need histogram or numeric latency field for that.
We already have this logic elsewhere. Can you re-use that?

Contributor Author

I couldn't find existing logic that checks whether durationField is transaction.duration.summary and therefore can't use the percentiles aggregation. I did find some reusable code, getLatencyAggregationType and getLatencyAggregation, that makes the code simpler:

const latencyAggregationType = getLatencyAggregationType(latencyType);

 durationField === TRANSACTION_DURATION_SUMMARY
    ? {
        avg: {
          field: durationField,
        },
      }
    : getLatencyAggregation(latencyAggregationType, durationField)

Comment on lines +137 to +143
durationField === TRANSACTION_DURATION_SUMMARY
  ? {
      avg: {
        field: durationField,
      },
    }
  : getLatencyAggregation(latencyAggregationType, durationField).latency,
Member

This seems backwards.
Currently the logic is: if summary field is being used, calculate average regardless of what user requested.
Correct logic: if user requested percentiles, use histogram or numeric field.

Contributor Author

It was that way before, but I changed the order so I could reuse getLatencyAggregation and getLatencyAggregationType. Isn't it the same, though? durationField can only be transaction.duration.summary, transaction.duration.histogram, or transaction.duration.us.
can't calculate percentiles for summary field, so only avg.
and getLatencyAggregation resolves the rest:

export function getLatencyAggregation(
  latencyAggregationType: LatencyAggregationType,
  field: string
) {
  return {
    latency: {
      ...(latencyAggregationType === LatencyAggregationType.avg
        ? { avg: { field } }
        : {
            percentiles: {
              field,
              percents: [latencyAggregationType === LatencyAggregationType.p95 ? 95 : 99],
            },
          }),
    },
  };
}
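
For readers following the thread, here is a small self-contained sketch of what the helper quoted above returns for each aggregation type. The enum and function are reproduced from the snippets in this conversation so the example runs on its own; the field names passed in are just the ones mentioned in this discussion. It also shows why the call site reads `.latency` from the result.

```typescript
// Reproduced from the snippets quoted in this conversation.
enum LatencyAggregationType {
  avg = 'avg',
  p99 = 'p99',
  p95 = 'p95',
}

function getLatencyAggregation(latencyAggregationType: LatencyAggregationType, field: string) {
  return {
    latency: {
      ...(latencyAggregationType === LatencyAggregationType.avg
        ? { avg: { field } }
        : {
            percentiles: {
              field,
              percents: [latencyAggregationType === LatencyAggregationType.p95 ? 95 : 99],
            },
          }),
    },
  };
}

// avg produces a plain avg aggregation under the `latency` key.
const avgAgg = getLatencyAggregation(LatencyAggregationType.avg, 'transaction.duration.us');
console.log(JSON.stringify(avgAgg));
// {"latency":{"avg":{"field":"transaction.duration.us"}}}

// p95 and p99 produce a percentiles aggregation with a single percent.
const p95Agg = getLatencyAggregation(LatencyAggregationType.p95, 'transaction.duration.histogram');
console.log(JSON.stringify(p95Agg));
// {"latency":{"percentiles":{"field":"transaction.duration.histogram","percents":[95]}}}
```

Because the result is wrapped in a `latency` key, it can either be spread into an `aggs` object or unwrapped with `.latency` when the caller supplies its own key.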

Member

can't calculate percentiles for summary field, so only avg. and getLatencyAggregation resolves the rest:

I don't understand what this means.

Contributor Author

We can’t calculate percentiles on the transaction.duration.summary field. So, when using the summary field, the only supported aggregation is avg, right?

durationField === TRANSACTION_DURATION_SUMMARY
    ? {
        avg: {
          field: durationField,
        },
      }

The other latency fields, transaction.duration.histogram and transaction.duration.us, support both avg and percentiles. It's handled by getLatencyAggregation.

Member

We can’t calculate percentiles on the transaction.duration.summary field. So, when using the summary field, the only supported aggregation is avg, right?

Yes, so if the user/LLM requests percentiles, you should not use the summary field.

Contributor Author

got it, instead of defaulting to avg, we should use one of the other fields, change added here: c20786f

Comment on lines +107 to +110
const useDurationSummaryField =
  hasDurationSummaryField &&
  latencyAggregationType !== LatencyAggregationType.p95 &&
  latencyAggregationType !== LatencyAggregationType.p99;
Member

Yes, this is the way to go :)

Member

only nit: you could shorten this like:

Suggested change
- const useDurationSummaryField =
-   hasDurationSummaryField &&
-   latencyAggregationType !== LatencyAggregationType.p95 &&
-   latencyAggregationType !== LatencyAggregationType.p99;
+ const useDurationSummaryField =
+   hasDurationSummaryField &&
+   latencyAggregationType === LatencyAggregationType.avg

But the other might be more clear
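
The outcome of this thread can be sketched as a small field-selection function: use the summary field only when it exists and the caller asked for avg, otherwise fall back to a field that supports percentiles. This is a simplified sketch, not the merged code. `selectDurationField` and the `usesTransactionMetrics` flag are hypothetical; the three field names are the ones listed earlier in this conversation.

```typescript
enum LatencyAggregationType {
  avg = 'avg',
  p95 = 'p95',
  p99 = 'p99',
}

const TRANSACTION_DURATION_SUMMARY = 'transaction.duration.summary';
const TRANSACTION_DURATION_HISTOGRAM = 'transaction.duration.histogram';
const TRANSACTION_DURATION_US = 'transaction.duration.us';

interface DurationFieldOptions {
  hasDurationSummaryField: boolean;
  // Hypothetical flag: true when querying pre-aggregated transaction metrics,
  // false when querying raw transaction events.
  usesTransactionMetrics: boolean;
  latencyAggregationType: LatencyAggregationType;
}

function selectDurationField(opts: DurationFieldOptions): string {
  // Percentiles cannot be computed on the summary field, so it is only
  // eligible when the requested aggregation is avg.
  const useDurationSummaryField =
    opts.hasDurationSummaryField && opts.latencyAggregationType === LatencyAggregationType.avg;

  if (useDurationSummaryField) {
    return TRANSACTION_DURATION_SUMMARY;
  }
  return opts.usesTransactionMetrics ? TRANSACTION_DURATION_HISTOGRAM : TRANSACTION_DURATION_US;
}

console.log(
  selectDurationField({
    hasDurationSummaryField: true,
    usesTransactionMetrics: true,
    latencyAggregationType: LatencyAggregationType.p95,
  })
); // transaction.duration.histogram
```

With exactly three enum members, the `=== avg` check and the `!== p95 && !== p99` check select the same cases; they would diverge only if another aggregation type were added later.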

});

await apmSynthtraceEsClient.index([traceStream]);
}
Member

@sorenlouv sorenlouv Jan 13, 2026

Please move this to src/platform/packages/shared/kbn-synthtrace/src/scenarios/agent_builder/tools. It should also be available for CLI usage like:

export default createCliScenario(({ range, clients: { logsEsClient } }) =>
  generateTraceChangePointsData({ range, logsEsClient })
);

Contributor Author

@arturoliduena arturoliduena Jan 16, 2026

@sorenlouv
Member

@arturoliduena can you hold off merging this until @viduni94 has merged #248513? Then you should be able to move the implementation of the get_trace_metrics tool to observability_agent_builder plugin

@arturoliduena arturoliduena changed the title [Obs Agent] add observability get_trace_change_points tool [Obs Agent] add observability get_trace_change_points tool Jan 13, 2026
Contributor

@rmyz rmyz left a comment


presentation changes LGTM

@viduni94
Contributor

@arturoliduena can you hold off merging this until @viduni94 has merged #248513? Then you should be able to move the implementation of the get_trace_metrics tool to observability_agent_builder plugin

@arturoliduena My PR is merged.

@arturoliduena arturoliduena enabled auto-merge (squash) January 15, 2026 09:54
@elasticmachine
Contributor

elasticmachine commented Jan 15, 2026

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #81 / discover/esql discover esql controls when adding an ES|QL panel with controls in dashboards and exploring it in discover should retain the controls and their state
  • [job] [logs] FTR Configs #80 / Endpoint plugin spaces support "after all" hook in "Endpoint plugin spaces support"
  • [job] [logs] FTR Configs #80 / Endpoint plugin spaces support "after all" hook in "Endpoint plugin spaces support"
  • [job] [logs] FTR Configs #80 / Endpoint plugin spaces support "after all" hook in "Endpoint plugin spaces support"
  • [job] [logs] FTR Configs #80 / Endpoint plugin spaces support "before all" hook in "Endpoint plugin spaces support"
  • [job] [logs] FTR Configs #80 / Endpoint plugin spaces support "before all" hook in "Endpoint plugin spaces support"
  • [job] [logs] FTR Configs #80 / Endpoint plugin spaces support "before all" hook in "Endpoint plugin spaces support"

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
apm 2089 2090 +1

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
apmDataAccess 86 89 +3
Unknown metric groups

API count

id before after diff
apmDataAccess 86 89 +3

History

Comment on lines +8 to +18
export enum LatencyAggregationType {
  avg = 'avg',
  p99 = 'p99',
  p95 = 'p95',
}

export const getLatencyAggregationType = (
  latencyAggregationType: string | null | undefined
): LatencyAggregationType => {
  return (latencyAggregationType ?? LatencyAggregationType.avg) as LatencyAggregationType;
};
Member

@sorenlouv sorenlouv Jan 15, 2026

If it's not too much work, please remove so they are not duplicated

export enum LatencyAggregationType {
  avg = 'avg',
  p99 = 'p99',
  p95 = 'p95',
}
export const latencyAggregationTypeRt = t.union([
  t.literal(LatencyAggregationType.avg),
  t.literal(LatencyAggregationType.p95),
  t.literal(LatencyAggregationType.p99),
]);
export const getLatencyAggregationType = (
  latencyAggregationType: string | null | undefined
): LatencyAggregationType => {
  return (latencyAggregationType ?? LatencyAggregationType.avg) as LatencyAggregationType;
};

Member

Also ok to do that in a follow-up if that's easier. Might cause a bunch of changes in APM

Contributor Author

removed: ad0be75


import { LatencyAggregationType } from '../../../../common/latency_aggregation_types';

export function getLatencyAggregation(
Member

Same

export function getLatencyAggregation(
  latencyAggregationType: LatencyAggregationType,
  field: string
) {
  return {
    latency: {
      ...(latencyAggregationType === LatencyAggregationType.avg
        ? { avg: { field } }
        : {
            percentiles: {
              field,
              percents: [latencyAggregationType === LatencyAggregationType.p95 ? 95 : 99],
            },
          }),
    },
  };
}

Contributor Author

removed: ad0be75

arturoliduena and others added 2 commits January 16, 2026 09:56
…lder/server/tools/get_trace_change_points/README.md

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
@arturoliduena arturoliduena enabled auto-merge (squash) January 16, 2026 09:54
@arturoliduena arturoliduena merged commit 5cfdbab into elastic:main Jan 16, 2026
13 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 9.3

https://github.com/elastic/kibana/actions/runs/21065163474

@kibanamachine
Contributor

💔 All backports failed

Status Branch Result
9.3 Backport failed because of merge conflicts

You might need to backport the following PRs to 9.3:
- [Obs AI] Replace get_data_sources with get_index_info tool (#248234)
- [Obs AI] Extend get_services tool and add get_trace_metrics tool (#247474)

Manual backport

To create the backport manually run:

node scripts/backport --pr 247810

Questions ?

Please refer to the Backport tool documentation

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Jan 19, 2026
@kibanamachine
Contributor

Friendly reminder: It looks like this PR hasn’t been backported yet.
To create backports automatically, add a backport:* label; to prevent reminders, add the backport:skip label.
You can also create backports manually by running node scripts/backport --pr 247810 locally.
cc: @arturoliduena


@arturoliduena arturoliduena added backport:skip This PR does not require backporting and removed backport missing Added to PRs automatically when the are determined to be missing a backport. backport:version Backport to applied version labels labels Jan 22, 2026

Labels

backport:skip This PR does not require backporting release_note:feature Makes this part of the condensed release notes Team:obs-ai Observability AI team Team:obs-presentation Focus: APM UI, Infra UI, Hosts UI, Universal Profiling, Obs Overview and left Navigation v9.4.0

10 participants