gen-ai: fix score.label note wording #3527

Closed

Nick-heo-eg wants to merge 7 commits into open-telemetry:main from Nick-heo-eg:genai-judgment-boundary

Conversation

@Nick-heo-eg commented Mar 10, 2026

Summary

This PR introduces a minimal attribute to record the categorical result of an evaluation in a vendor-neutral way.

Motivation

Evaluation systems often produce categorical results (e.g. pass/fail, safe/unsafe) that are not captured by existing score attributes.

Relationship to Existing Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `score.value` | continuous numeric | numeric measurement returned by the evaluator |
| `score.label` | string | evaluator-specific, human-readable interpretation of the score |
| `outcome` | discrete classification | categorical result produced by the evaluator |

outcome is not derived from score and may exist independently. score may also exist without outcome.
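For concreteness, a minimal sketch of how such an attribute could be declared in model/gen-ai/registry.yaml; the group id, enum members, and brief text below are illustrative placeholders, not proposed final wording:

```yaml
groups:
  - id: registry.gen_ai.evaluation
    type: attribute_group
    display_name: GenAI Evaluation Attributes
    brief: Attributes describing the result of a GenAI evaluation.
    attributes:
      - id: gen_ai.evaluation.outcome
        stability: development
        type:
          members:
            - id: pass
              value: "pass"
              stability: development
              brief: The evaluated content passed the evaluation.
            - id: fail
              value: "fail"
              stability: development
              brief: The evaluated content failed the evaluation.
        brief: >
          Categorical result produced directly by the evaluator,
          independent of any numeric score.
```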

@Nick-heo-eg (Author)

Reopened this as a new PR because the previous one (#3336) hit a CI issue after a force push: the merge base recorded by GitHub no longer existed, causing the changed-files step in `check-changes-ownership` to fail (`fatal: bad object`). This PR contains the same changes but with a fresh merge base so CI can run correctly.
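For context, the usual way this failure mode is avoided is to fetch full history in the checkout step so a merge base is always resolvable locally. The `fetch-depth` option of actions/checkout is real, though whether this repo's `check-changes-ownership` workflow can simply adopt it is an assumption on my part:

```yaml
# Sketch: a shallow clone (fetch-depth: 1, the default) can lose the
# merge base after a force push; fetching full history avoids the
# `fatal: bad object` failure when diffing changed files.
steps:
  - uses: actions/checkout@v4
    with:
      fetch-depth: 0  # full history instead of the default shallow clone
```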

@Nick-heo-eg (Author)

It looks like the get-changed-files action is failing due to a missing merge base (likely caused by the earlier force push). Let me know if I should refresh the branch or push an update.

@trask (Member) commented Mar 11, 2026

try rebase and squash down to a single commit

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 45d1777 to 9b4b176 on March 11, 2026 02:28
@Nick-heo-eg (Author)

> try rebase and squash down to a single commit

Done. Rebased and squashed to a single commit. Please let me know if anything else is needed.

lmolkova moved this from Untriaged to Awaiting codeowners approval in Semantic Conventions Triage on Mar 17, 2026
@github-actions (bot)

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

github-actions bot added the Stale label on Mar 27, 2026
@Nick-heo-eg (Author)

Just checking in, happy to update or simplify the proposal if needed.

One quick question: would you prefer keeping only gen_ai.evaluation.outcome and dropping multiple_outcomes for a minimal version?

@lmolkova (Member) left a comment

It would be great to summarize the problem this PR tries to solve in a short sentence using simple words. I think #3244 does not provide enough context to understand this.

Comment thread model/gen-ai/registry.yaml Outdated
Comment thread model/gen-ai/registry.yaml Outdated
@Nick-heo-eg (Author)

Thanks for the detailed feedback; this is very helpful.

When an AI agent’s action is evaluated and blocked, there’s currently no standard attribute to record that outcome. This PR adds one.

Regarding the difference from score: score represents a numeric value, while outcome records the categorical result of the evaluation (e.g. allowed or blocked). An evaluation might have a score but no outcome, or an outcome but no score.

Regarding multiple outcomes: you’re right. I’ll remove gen_ai.evaluation.multiple_outcomes and rely on emitting separate gen_ai.evaluation.result events when needed.

I can update the PR to simplify the description and reduce the scope to only gen_ai.evaluation.outcome.

github-actions bot removed the Stale label on Mar 28, 2026
Nick-heo-eg force-pushed the genai-judgment-boundary branch from 9b4b176 to 691ff36 on March 28, 2026 14:26
@Nick-heo-eg (Author)

Updated the PR: removed gen_ai.evaluation.multiple_outcomes and clarified the problem statement.

@lmolkova (Member)

Evaluations are not intended to block or allow agent responses. Do you mean guardrails? If so, we don't model them yet, but you might be interested to look at #3233.

@Nick-heo-eg (Author)

Thanks for calling this out.

You’re right that evaluation itself does not imply execution control like allow/block. My intent here is to represent the categorical result of an evaluation (e.g. pass/fail, safe/unsafe), not to model guardrails or enforcement decisions.

I’ll update the wording to remove "blocked" and align the description with evaluation semantics only.

Comment thread model/gen-ai/registry.yaml
Nick-heo-eg force-pushed the genai-judgment-boundary branch from 10875aa to f8e32c0 on April 3, 2026 05:38
@Nick-heo-eg (Author)

It looks like the CI failure is unrelated to this PR.

The failure seems to come from pull_request_target workflows, which run using the base repository's workflow and do not check out fork PR heads. This prevents actions like changed-files from computing diffs correctly.
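As a general illustration (not necessarily what this repo's workflow should do), a sketch of how a pull_request_target workflow can still diff against the fork's head; `ref` and `fetch-depth` are real actions/checkout options, but applying them here is my assumption:

```yaml
# Sketch: under pull_request_target, actions/checkout defaults to the
# BASE branch, so diff-based steps never see the fork's changes.
# Checking out the PR head explicitly fixes the diff, but it runs
# untrusted fork code with base-repo permissions, so it needs care.
steps:
  - uses: actions/checkout@v4
    with:
      ref: ${{ github.event.pull_request.head.sha }}  # PR head, not base
      fetch-depth: 0  # full history so the merge base with main exists
```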

I've reverted workflow-related changes to keep this PR focused on the semantic change.

Happy to help address the CI behavior separately if needed.

@lmolkova (Member) commented Apr 3, 2026

@Nick-heo-eg

> You’re right that evaluation itself does not imply execution control like allow/block. My intent here is to represent the categorical result of an evaluation (e.g. pass/fail, safe/unsafe), not to model guardrails or enforcement decisions.
>
> I’ll update the wording to remove "blocked" and align the description with evaluation semantics only.

this is exactly what gen_ai.evaluation.score.label represents. How's outcome different from it?

@Nick-heo-eg (Author) commented Apr 3, 2026

@lmolkova They are related, but serve different roles.

gen_ai.evaluation.score.label is tied to a numeric score and represents an interpretation of score.value (e.g., applying a threshold).

gen_ai.evaluation.outcome represents a categorical result produced directly by an evaluator, without requiring a numeric score.

In short:

  • score.label → derived from a continuous measurement
  • outcome → directly emitted categorical result

This allows modeling evaluators that only produce categorical outputs (e.g., pass/fail, safe/unsafe) without introducing an artificial score.

One practical distinction is that not all evaluators produce a numeric score.

For example, LLM-as-judge or human review often return only categorical results (e.g., pass/fail, approved/rejected) without any underlying score.

In those cases, using score.label would either require introducing an artificial score.value or create a misleading association with a non-existent numeric measurement.

outcome allows capturing the evaluator’s native result directly, without implying a score where none exists.
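For illustration, a hypothetical pair of gen_ai.evaluation.result events sketched as YAML attribute maps; gen_ai.evaluation.outcome is the attribute proposed here, not an existing convention, and the values are made up:

```yaml
# Threshold-based evaluator: numeric score plus a derived label.
- event_name: gen_ai.evaluation.result
  attributes:
    gen_ai.evaluation.score.value: 0.87
    gen_ai.evaluation.score.label: relevant   # interpretation of 0.87

# Human review: categorical decision only, no numeric score.
- event_name: gen_ai.evaluation.result
  attributes:
    gen_ai.evaluation.outcome: approved       # proposed attribute
```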

@lmolkova (Member) commented Apr 3, 2026

> gen_ai.evaluation.score.label is tied to a numeric score and represents an interpretation of score.value (e.g., applying a threshold).

where did this interpretation come from? It's not in https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/gen-ai-events.md#event-gen_aievaluationresult

Introducing a new attribute when an existing one could apply requires clear justification. Are there evaluation frameworks that distinguish between these two and support both? Please share some research and mappings. What are these definitions grounded in?

@Nick-heo-eg (Author)

One practical example is a human review step that returns only a categorical decision such as "approved" or "rejected" without any associated numeric score.

In this case, using score.label would require introducing an artificial score.value, while outcome captures the result directly as produced.

Nick-heo-eg requested review from a team as code owners on April 4, 2026 11:48
Nick-heo-eg force-pushed the genai-judgment-boundary branch 2 times, most recently from 11bdd60 to d266f18 on April 4, 2026 15:47
@lmolkova (Member) left a comment

Please remove unrelated changes and address remaining comments. Thanks

Comment thread docs/internal/ci_failure_patterns.md Outdated
Comment thread logs/ci_debug_trace_3527.yaml Outdated
Comment thread model/faas/events.yaml Outdated
@Nick-heo-eg Nick-heo-eg force-pushed the genai-judgment-boundary branch from d266f18 to 6b02079 Compare April 4, 2026 16:44
@Nick-heo-eg (Author) commented Apr 4, 2026

Done.

  • Removed unrelated changes (docs/internal, logs, tools)
  • Restored model/faas/events.yaml
  • Addressed all review comments

The PR now only contains the intended semantic changes.

@lmolkova (Member) left a comment

Please address all open comments

Comment thread .chloggen/genai-judgment-boundary.yaml Outdated
Comment thread docs/gen-ai/gen-ai-events.md Outdated
Nick-heo-eg changed the title from "gen-ai: add evaluation outcome attributes" to "gen-ai: extend score.label for non-numeric evaluation results" on Apr 4, 2026
@Nick-heo-eg (Author)

Updated changelog note, PR title, and removed the example section to reflect the actual change.

@Nick-heo-eg (Author)

> Please address all open comments

All comments addressed.

@lmolkova (Member) commented Apr 4, 2026

@Nick-heo-eg can you please provide an AI use disclosure in compliance with https://github.com/open-telemetry/community/blob/main/policies/genai.md?

  1. how much of the PR content is generated by AI
  2. did AI auto-reply to comments and apply changes? To what extent?

Comment thread model/gen-ai/registry.yaml Outdated
@Nick-heo-eg (Author)

Thanks for asking.

I used Claude as an assistive tool for suggesting changes and drafting responses. I reviewed, adjusted, and approved every change before submission, and I’m fully responsible for the PR.

Also, good catch on the regression; I’m restoring the intended text now.

@lmolkova (Member) commented Apr 4, 2026

@Nick-heo-eg the last commit did not restore the proper text suggested before.
Also, the PR description is out of date.

Notice that your excessive use of AI resulted in 70+ comments on this and #3336 PRs for a trivial change and made us both very unproductive. If you consider contributing to semantic conventions again, please use proper human judgement. AI is not yet at the point it can replace it.

Nick-heo-eg changed the title from "gen-ai: extend score.label for non-numeric evaluation results" to "gen-ai: fix score.label note wording" on Apr 4, 2026
@Nick-heo-eg (Author)

Closing this PR in favor of a clean follow-up PR with the exact suggested wording.

Nick-heo-eg closed this on Apr 5, 2026

Labels

enhancement (New feature or request)

3 participants