[Obs Agent] add observability get_trace_change_points tool#247810
[Obs Agent] add observability get_trace_change_points tool#247810arturoliduena merged 20 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/obs-ai-team (Team:obs-ai) |
|
I also asked this in the issue: do we need this tool, when we have The only reason for having a tool dedicate to traces is if it makes it (much) easier for the LLM to retrieve change points for trace metrics aka red metrics (latency, throughout, error rate). Can you list how someone would do that using the |
|
@arturoliduena Please take a look at @viduni94's PR in #247474. Specifically look at how You should also read this comment to get a good understanding of the APM data model. |
...servability/plugins/observability_agent_builder/server/tools/get_trace_change_points/tool.ts
Outdated
Show resolved
Hide resolved
...servability/plugins/observability_agent_builder/server/tools/get_trace_change_points/tool.ts
Outdated
Show resolved
Hide resolved
09208a9 to
a276773
Compare
|
Pinging @elastic/obs-presentation-team (Team:obs-presentation) |
...ck/solutions/observability/plugins/apm/server/agent_builder/tools/get_trace_metrics/index.ts
Outdated
Show resolved
Hide resolved
| avg: { | ||
| field: durationField, | ||
| }, |
There was a problem hiding this comment.
You should also support p95 and p99
There was a problem hiding this comment.
Done, updated here add support for avg|p95|p99 latency agg type
| latency: | ||
| // Avoid unsupported aggregation on downsampled index, see example error: | ||
| // "reason": { | ||
| // "type": "unsupported_aggregation_on_downsampled_index", | ||
| // "reason": "Field [transaction.duration.summary] of type [aggregate_metric_double] is not supported for aggregation [percentiles]" | ||
| // } | ||
| durationField !== 'transaction.duration.summary' && | ||
| (latencyType === 'p95' || latencyType === 'p99') | ||
| ? { | ||
| percentiles: { | ||
| field: durationField, | ||
| percents: [Number(`${latencyType.split('p')[1]}.0`)], | ||
| keyed: true, | ||
| }, | ||
| } | ||
| : { | ||
| avg: { | ||
| field: durationField, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
you can't calculate percentiles on the summary field. You need histogram or numeric latency field for that.
We already have this logic elsewhere. Can you re-use that?
There was a problem hiding this comment.
couldn't find the logic that checks if dorationField is transaction.duration.summary can't use percentiles aggregation. I found some reusable code getLatencyAggregationType, getLatencyAggregation to make the code simpler:
const latencyAggregationType = getLatencyAggregationType(latencyType);
durationField === TRANSACTION_DURATION_SUMMARY
? {
avg: {
field: durationField,
},
}
: getLatencyAggregation(latencyAggregationType, durationField)| durationField === TRANSACTION_DURATION_SUMMARY | ||
| ? { | ||
| avg: { | ||
| field: durationField, | ||
| }, | ||
| } | ||
| : getLatencyAggregation(latencyAggregationType, durationField).latency, |
There was a problem hiding this comment.
This seems backwards.
Currently the logic is: if summary field is being used, calculate average regardless of what user requested.
Correct logic: if user requested percentiles, use histogram or numeric field.
There was a problem hiding this comment.
it was that way before, but to reuse getLatencyAggregation and getLatencyAggregationType I changed the order, but isn't it the same? durationField can only be transaction.duration.summary, transaction.duration.histogram, or transaction.duration.us.
can't calculate percentiles for summary field, so only avg.
and getLatencyAggregation resolves the rest:
export function getLatencyAggregation(
latencyAggregationType: LatencyAggregationType,
field: string
) {
return {
latency: {
...(latencyAggregationType === LatencyAggregationType.avg
? { avg: { field } }
: {
percentiles: {
field,
percents: [latencyAggregationType === LatencyAggregationType.p95 ? 95 : 99],
},
}),
},
};
}There was a problem hiding this comment.
can't calculate percentiles for summary field, so only avg. and getLatencyAggregation resolves the rest:
I don't understand what this means.
There was a problem hiding this comment.
We can’t calculate percentiles on thetransaction.duration.summary field. So, when using the summary field, the only supported aggregation is avg, right?
durationField === TRANSACTION_DURATION_SUMMARY
? {
avg: {
field: durationField,
},
}The other latency fields, transaction.duration.histogram and transaction.duration.us, support both avg and percentiles. It's handled by getLatencyAggregation.
There was a problem hiding this comment.
We can’t calculate percentiles on the
transaction.duration.summaryfield. So, when using the summary field, the only supported aggregation isavg, right?
Yes, so if the user/LLM requests percentiles, you should not use the summary field.
There was a problem hiding this comment.
got it, instead of defaulting to avg, we should use one of the other fields, change added here: c20786f
| const useDurationSummaryField = | ||
| hasDurationSummaryField && | ||
| latencyAggregationType !== LatencyAggregationType.p95 && | ||
| latencyAggregationType !== LatencyAggregationType.p99; |
There was a problem hiding this comment.
Yes, this is the way to go :)
There was a problem hiding this comment.
only nit: you could shorten this like:
| const useDurationSummaryField = | |
| hasDurationSummaryField && | |
| latencyAggregationType !== LatencyAggregationType.p95 && | |
| latencyAggregationType !== LatencyAggregationType.p99; | |
| const useDurationSummaryField = | |
| hasDurationSummaryField && | |
| latencyAggregationType === LatencyAggregationType.avg |
But the other might be more clear
| }); | ||
|
|
||
| await apmSynthtraceEsClient.index([traceStream]); | ||
| } |
There was a problem hiding this comment.
Please move this to src/platform/packages/shared/kbn-synthtrace/src/scenarios/agent_builder/tools. It should also be available for CLI usage like:
export default createCliScenario(({ range, clients: { logsEsClient } }) =>
generateTraceChangePointsData({ range, logsEsClient })
);There was a problem hiding this comment.
Will do it in a follow-up issue: https://github.com/elastic/obs-ai-team/issues/467
|
@arturoliduena can you hold off merging this until @viduni94 has merged #248513? Then you should be able to move the implementation of the |
get_trace_change_points tool
get_trace_change_points tool
@arturoliduena My PR is merged. |
💔 Build Failed
Failed CI StepsTest Failures
Metrics [docs]Module Count
Public APIs missing comments
History
|
| export enum LatencyAggregationType { | ||
| avg = 'avg', | ||
| p99 = 'p99', | ||
| p95 = 'p95', | ||
| } | ||
|
|
||
| export const getLatencyAggregationType = ( | ||
| latencyAggregationType: string | null | undefined | ||
| ): LatencyAggregationType => { | ||
| return (latencyAggregationType ?? LatencyAggregationType.avg) as LatencyAggregationType; | ||
| }; |
There was a problem hiding this comment.
If it's not too much work, please remove so they are not duplicated
There was a problem hiding this comment.
Also ok to do that in a follow-up if that's easier. Might cause a bunch of changes in APM
|
|
||
| import { LatencyAggregationType } from '../../../../common/latency_aggregation_types'; | ||
|
|
||
| export function getLatencyAggregation( |
There was a problem hiding this comment.
...rvability/plugins/observability_agent_builder/server/tools/get_trace_change_points/README.md
Outdated
Show resolved
Hide resolved
...vability/plugins/observability_agent_builder/server/tools/get_trace_change_points/handler.ts
Outdated
Show resolved
Hide resolved
...vability/plugins/observability_agent_builder/server/tools/get_trace_change_points/handler.ts
Outdated
Show resolved
Hide resolved
…lder/server/tools/get_trace_change_points/README.md Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
|
Starting backport for target branches: 9.3 https://github.com/elastic/kibana/actions/runs/21065163474 |
💔 All backports failed
Manual backportTo create the backport manually run: Questions ?Please refer to the Backport tool documentation |
|
Friendly reminder: Looks like this PR hasn’t been backported yet. |
|
Friendly reminder: Looks like this PR hasn’t been backported yet. |
1 similar comment
|
Friendly reminder: Looks like this PR hasn’t been backported yet. |
Closes https://github.com/elastic/obs-ai-team/issues/418
Summary
Add observability
get_trace_change_pointstoolGet trace change points tool
tool id:
observability.get_trace_change_pointsThis tool analyzes traces to detect statistically significant change points in latency, throughput, and failure rate group by field example (default 'service.name'):
Service level: 'service.name', 'service.environment', 'service.version'
Transaction level: 'transaction.name', 'transaction.type'
Infrastructure level: 'host.name', 'container.id', 'kubernetes.pod.name'
Trace metrics:
Supports optional KQL filtering