gen-ai: fix score.label note wording #3527
Nick-heo-eg wants to merge 7 commits into open-telemetry:main
Conversation
Reopened this as a new PR because the previous one (#3336) hit a CI issue after a force push: the merge base recorded by GitHub no longer existed, causing the get-changed-files check to fail.

It looks like the get-changed-files action is failing due to a missing merge base (likely caused by the earlier force push). Let me know if I should refresh the branch or push an update.
try rebase and squash down to a single commit |
force-pushed 45d1777 to 9b4b176
Done. Rebased and squashed to a single commit. Please let me know if anything else is needed.

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.
Just checking in, happy to update or simplify the proposal if needed. One quick question: would you prefer keeping only `gen_ai.evaluation.outcome`?
Thanks for the detailed feedback, this is very helpful. When an AI agent’s action is evaluated and blocked, there’s currently no standard attribute to record that outcome; this PR adds one.

Regarding the difference from score: score represents a numeric value, while outcome records the categorical result of the evaluation (e.g. allowed or blocked). An evaluation might have a score but no outcome, or an outcome but no score.

Regarding multiple outcomes: you’re right. I’ll remove gen_ai.evaluation.multiple_outcomes and rely on emitting separate gen_ai.evaluation.result events when needed. I can update the PR to simplify the description and reduce the scope to only gen_ai.evaluation.outcome.
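To make that independence concrete, here is a minimal sketch using plain Python dicts to stand in for `gen_ai.evaluation.result` event attributes. The `score.*` attributes exist in the current conventions; `gen_ai.evaluation.outcome` is the attribute proposed in this PR, and the evaluation names are invented for illustration.

```python
# Numeric score, no categorical outcome (e.g. a relevance metric).
score_only = {
    "gen_ai.evaluation.name": "relevance",          # illustrative name
    "gen_ai.evaluation.score.value": 0.87,          # existing attribute
}

# Categorical outcome, no numeric score (e.g. a pass/fail judge).
outcome_only = {
    "gen_ai.evaluation.name": "safety",
    "gen_ai.evaluation.outcome": "pass",            # proposed attribute
}

# Both can coexist when an evaluator produces both.
both = {
    "gen_ai.evaluation.name": "toxicity",
    "gen_ai.evaluation.score.value": 0.12,
    "gen_ai.evaluation.outcome": "safe",
}
```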
force-pushed 9b4b176 to 691ff36
Updated the PR: removed gen_ai.evaluation.multiple_outcomes and clarified the problem statement.
Evaluations are not intended to block or allow agent responses. Do you mean guardrails? If so, we don't model them yet, but you might be interested in looking at #3233.
Thanks for calling this out. You’re right that evaluation itself does not imply execution control like allow/block. My intent here is to represent the categorical result of an evaluation (e.g. pass/fail, safe/unsafe), not to model guardrails or enforcement decisions. I’ll update the wording to remove "blocked" and align the description with evaluation semantics only.
force-pushed 10875aa to f8e32c0
It looks like the CI failure is unrelated to this PR. I've reverted workflow-related changes to keep this PR focused on the semantic change. Happy to help address the CI behavior separately if needed.
this is exactly what `gen_ai.evaluation.score.label` is for
@lmolkova They are related, but serve different roles.

In short: this allows modeling evaluators that only produce categorical outputs (e.g., pass/fail, safe/unsafe) without introducing an artificial score. One practical distinction is that not all evaluators produce a numeric score. For example, LLM-as-judge or human review often return only categorical results (e.g., pass/fail, approved/rejected) without any underlying score. In those cases, using `gen_ai.evaluation.score.label` would imply a numeric score that does not exist.
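As a hedged sketch of that case: `llm_judge` below is a hypothetical stand-in for an LLM-as-judge call, and the point is only that its output is categorical, leaving nothing to record in `gen_ai.evaluation.score.value`.

```python
def llm_judge(response_text: str) -> str:
    """Hypothetical LLM-as-judge stand-in: returns a categorical verdict only."""
    return "fail" if "unsafe" in response_text.lower() else "pass"

def evaluation_event_attributes(response_text: str) -> dict:
    """Build gen_ai.evaluation.result event attributes for a categorical verdict."""
    return {
        "gen_ai.evaluation.name": "llm_judge_safety",            # illustrative name
        "gen_ai.evaluation.outcome": llm_judge(response_text),   # proposed attribute
    }

print(evaluation_event_attributes("The capital of France is Paris."))
# {'gen_ai.evaluation.name': 'llm_judge_safety', 'gen_ai.evaluation.outcome': 'pass'}
```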
Where did this interpretation come from? It's not in https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/gen-ai-events.md#event-gen_aievaluationresult. Introducing a new attribute when an existing one could apply requires clear justification. Are there evaluation frameworks that distinguish between these two and support both? Please share some research and mappings. What are these definitions grounded in?
One practical example is a human review step that returns only a categorical decision such as "approved" or "rejected" without any associated numeric score. In this case, using `gen_ai.evaluation.score.label` would imply a score that was never produced, while the proposed `gen_ai.evaluation.outcome` records the categorical decision directly.
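A minimal sketch contrasting the two modelings under discussion, again as plain dicts of event attributes. Whether `score.label` may appear without `score.value` is exactly the point of contention, so both variants are illustrative rather than normative, and the evaluation name is made up.

```python
# Existing attribute: reading "approved" as a score label, even though
# no numeric score was produced by the human reviewer.
human_review_with_score_label = {
    "gen_ai.evaluation.name": "human_review",        # illustrative name
    "gen_ai.evaluation.score.label": "approved",     # existing attribute
}

# Proposed attribute: recording the categorical decision directly.
human_review_with_outcome = {
    "gen_ai.evaluation.name": "human_review",
    "gen_ai.evaluation.outcome": "approved",         # proposed attribute
}
```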
force-pushed 11bdd60 to d266f18
lmolkova left a comment:
Please remove unrelated changes and address remaining comments. Thanks
force-pushed d266f18 to 6b02079
Done. The PR now only contains the intended semantic changes.
lmolkova left a comment:
Please address all open comments
Updated changelog note, PR title, and removed the example section to reflect the actual change. |
All comments addressed. |
@Nick-heo-eg can you please provide an AI use disclosure in compliance with https://github.com/open-telemetry/community/blob/main/policies/genai.md?
Thanks for asking. I used Claude as an assistive tool for suggesting changes and drafting responses. I reviewed, adjusted, and approved every change before submission, and I’m fully responsible for the PR. Also, good catch on the regression; I’m restoring the intended text now.
@Nick-heo-eg the last commit did not restore the proper text suggested before. Notice that your excessive use of AI resulted in 70+ comments on this PR and #3336 for a trivial change and made us both very unproductive. If you consider contributing to semantic conventions again, please use proper human judgement. AI is not yet at the point where it can replace it.
Closing this PR in favor of a clean follow-up PR with the exact suggested wording. |
Summary
This PR introduces a minimal attribute to record the categorical result of an evaluation in a vendor-neutral way.
Motivation
Evaluation systems often produce categorical results (e.g. pass/fail, safe/unsafe) that are not captured by existing score attributes.
Relationship to Existing Attributes
`outcome` is not derived from `score.value` or `score.label` and may exist independently; `score` may also exist without `outcome`.
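A minimal usage sketch, assuming the `opentelemetry-api` package. The conventions define `gen_ai.evaluation.result` as an event, so attaching the attributes to a span here is purely illustrative of how the proposed attribute would sit next to an existing one.

```python
from opentelemetry import trace

tracer = trace.get_tracer("evaluation-example")

# Attach both an existing score attribute and the proposed outcome for
# illustration; an evaluation may carry either one independently.
with tracer.start_as_current_span("evaluate_response") as span:
    span.set_attribute("gen_ai.evaluation.score.value", 0.9)   # existing
    span.set_attribute("gen_ai.evaluation.outcome", "pass")    # proposed
```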