Skip to content

fix: make telemetry more resilient to missing data#2807

Closed
cdoern wants to merge 1 commit intollamastack:mainfrom
cdoern:resilient-telemetry
Closed

fix: make telemetry more resilient to missing data#2807
cdoern wants to merge 1 commit intollamastack:mainfrom
cdoern:resilient-telemetry

Conversation

@cdoern
Copy link
Collaborator

@cdoern cdoern commented Jul 18, 2025

What does this PR do?

it seems sometimes row is missing some data like root_span_id, omit these entries and log a warning

example failure: https://github.com/meta-llama/llama-stack/actions/runs/16368394907/job/46250869773

telemetry tests should now pass

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 18, 2025
Copy link
Contributor

@ashwinb ashwinb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leseb
Copy link
Collaborator

leseb commented Jul 18, 2025

@cdoern can you share a failure example?

it seems sometimes `row` is missing some data like `root_span_id`, omit these entries and log a warning

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern cdoern force-pushed the resilient-telemetry branch from ddf7323 to 0360e0d Compare July 18, 2025 11:52
@cdoern
Copy link
Collaborator Author

cdoern commented Jul 18, 2025

@leseb : https://github.com/meta-llama/llama-stack/actions/runs/16368394907/job/46250869773 on the main branch, I can link this in the PR

Copy link
Collaborator

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than working around the error, can we investigate why root_span_id is missing in the first place? Thanks

@bbrowning
Copy link
Collaborator

Yes we should figure out why it's failing, as this should be deterministic. The only valid case where we'd be missing a root_span_id is when we're part of a distributed trace started outside of Llama Stack server with the client specifying the right trace propagation headers.

With that span, root_span_id is NOT strictly required and validation requiring it is probably incorrect. See #2494 where some of this was fixed before, but I would expect this to either pass or fail consistently vs flake in this way.

@ehhuang
Copy link
Contributor

ehhuang commented Jul 18, 2025

Rather than working around the error, can we investigate why root_span_id is missing in the first place? Thanks

+1 I can disable the test again in the meantime

@ehhuang
Copy link
Contributor

ehhuang commented Jul 18, 2025

#2814

@cdoern cdoern marked this pull request as draft July 18, 2025 16:32
@cdoern cdoern closed this Aug 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants