gen-ai: fix score.label note wording by Nick-heo-eg · Pull Request #3527 · open-telemetry/semantic-conventions

Nick-heo-eg · 2026-03-10T02:37:08Z

Summary

This PR introduces a minimal attribute to record the categorical result of an evaluation in a vendor-neutral way.

Motivation

Evaluation systems often produce categorical results (e.g. pass/fail, safe/unsafe) that are not captured by existing score attributes.

Relationship to Existing Attributes

Attribute	Type	Description
`score.value`	continuous numeric	numeric measurement returned by the evaluator
`score.label`	string	evaluator-specific human-readable interpretation of the score
`outcome`	discrete classification	categorical result produced by the evaluator

outcome is not derived from score and may exist independently. score may also exist without outcome.

Nick-heo-eg · 2026-03-10T02:46:09Z

Reopened this as a new PR because the previous one (#3336) hit a CI issue after a force push: the merge base recorded by GitHub no longer existed, causing the changed-files step in check-changes-ownership to fail (fatal: bad object). This PR contains the same changes but with a fresh merge base so CI can run correctly.

Nick-heo-eg · 2026-03-11T02:10:56Z

It looks like the get-changed-files action is failing due to a missing merge base (likely caused by the earlier force push). Let me know if I should refresh the branch or push an update.

trask · 2026-03-11T02:18:54Z

try rebase and squash down to a single commit

Nick-heo-eg · 2026-03-12T11:11:15Z

try rebase and squash down to a single commit

Done. Rebased and squashed to a single commit. Please let me know if anything else is needed.

github-actions · 2026-03-27T04:08:07Z

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

Nick-heo-eg · 2026-03-27T09:28:28Z

Just checking in, happy to update or simplify the proposal if needed.

One quick question: would you prefer keeping only gen_ai.evaluation.outcome and dropping multiple_outcomes for a minimal version?

lmolkova

It would be great to summarize the problem this PR tries to solve in a short sentence using simple words. I think #3244 does not provide enough context to understand this

Nick-heo-eg · 2026-03-27T13:06:36Z

Thanks for the detailed feedback, this is very helpful.

When an AI agent’s action is evaluated and blocked, there’s currently no standard attribute to record that outcome. This PR adds one.

Regarding the difference from score: score represents a numeric value, while outcome records the categorical result of the evaluation (e.g. allowed or blocked). An evaluation might have a score but no outcome, or an outcome but no score.

Regarding multiple outcomes: you’re right. I’ll remove gen_ai.evaluation.multiple_outcomes and rely on emitting separate gen_ai.evaluation.result events when needed.

I can update the PR to simplify the description and reduce the scope to only gen_ai.evaluation.outcome.

Nick-heo-eg · 2026-03-28T14:26:55Z

Updated the PR: removed gen_ai.evaluation.multiple_outcomes and clarified the problem statement.

lmolkova · 2026-03-30T13:22:28Z

Evaluations are not intended to block or allow agent responses. Do you mean guardrails? If so, we don't model them yet, but you might be interested to look at the #3233

Nick-heo-eg · 2026-03-30T13:33:35Z

Thanks for calling this out.

You’re right that evaluation itself does not imply execution control like allow/block. My intent here is to represent the categorical result of an evaluation (e.g. pass/fail, safe/unsafe), not to model guardrails or enforcement decisions.

I’ll update the wording to remove "blocked" and align the description with evaluation semantics only.

Nick-heo-eg · 2026-04-03T05:38:56Z

It looks like the CI failure is unrelated to this PR.

The failure seems to come from pull_request_target workflows, which run using the base repository's workflow and do not checkout fork PR heads. This prevents actions like changed-files from computing diffs correctly.

I've reverted workflow-related changes to keep this PR focused on the semantic change.

Happy to help address the CI behavior separately if needed.

lmolkova · 2026-04-03T05:43:17Z

@Nick-heo-eg

You’re right that evaluation itself does not imply execution control like allow/block. My intent here is to represent the categorical result of an evaluation (e.g. pass/fail, safe/unsafe), not to model guardrails or enforcement decisions.

I’ll update the wording to remove "blocked" and align the description with evaluation semantics only.

this is exactly what gen_ai.evaluation.score.label represents. How's outcome different from it?

Nick-heo-eg · 2026-04-03T05:57:53Z

@lmolkova They are related, but serve different roles.

gen_ai.evaluation.score.label is tied to a numeric score and represents an interpretation of score.value (e.g., applying a threshold).

gen_ai.evaluation.outcome represents a categorical result produced directly by an evaluator, without requiring a numeric score.

In short:

score.label → derived from a continuous measurement
outcome → directly emitted categorical result

This allows modeling evaluators that only produce categorical outputs (e.g., pass/fail, safe/unsafe) without introducing an artificial score.

One practical distinction is that not all evaluators produce a numeric score.

For example, LLM-as-judge or human review often return only categorical results (e.g., pass/fail, approved/rejected) without any underlying score.

In those cases, using score.label would either require introducing an artificial score.value or create a misleading association with a non-existent numeric measurement.

outcome allows capturing the evaluator’s native result directly, without implying a score where none exists.

lmolkova · 2026-04-03T06:12:55Z

gen_ai.evaluation.score.label is tied to a numeric score and represents an interpretation of score.value (e.g., applying a threshold).

where did this interpretation came from? It's not in the https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/gen-ai-events.md#event-gen_aievaluationresult

Introducing a new attribute when an existing one could apply requires clear justification. Are there evaluation frameworks that distinguish between these two and support both? Please share some research and mappings. What are these definitions grounded in?

Nick-heo-eg · 2026-04-03T06:14:05Z

One practical example is a human review step that returns only a categorical decision such as "approved" or "rejected" without any associated numeric score.

In this case, using score.label would require introducing an artificial score.value, while outcome captures the result directly as produced.

lmolkova

Please remove unrelated changes and address remaining comments. Thanks

Nick-heo-eg · 2026-04-04T16:49:46Z

Done.

Removed unrelated changes (docs/internal, logs, tools)
Restored model/faas/events.yaml
Addressed all review comments

The PR now only contains the intended semantic changes.

lmolkova

Please address all open comments

Nick-heo-eg · 2026-04-04T17:36:41Z

Updated changelog note, PR title, and removed the example section to reflect the actual change.

Nick-heo-eg · 2026-04-04T17:59:27Z

Please address all open comments

All comments addressed.

lmolkova · 2026-04-04T18:00:55Z

@Nick-heo-eg can you please provide AI use disclosure in compliance with https://github.com/open-telemetry/community/blob/main/policies/genai.md?

how much of the PR content is generated by AI
did AI auto-reply to comments and applied changes? To what extent?

Nick-heo-eg · 2026-04-04T18:41:00Z

Thanks for asking.

I used Claude as an assistive tool for suggesting changes and drafting responses. I reviewed, adjusted, and approved every change before submission, and I’m fully responsible for the PR.

Also, good catch on the regression I’m restoring the intended text now.

lmolkova · 2026-04-04T18:55:03Z

@Nick-heo-eg the last commit did not restore the proper text suggested before.
Also the pr description is out of date.

Notice that your excessive use of AI resulted in 70+ comments on this and #3336 PRs for a trivial change and made us both very unproductive. If you consider contributing to semantic conventions again, please use proper human judgement. AI is not yet at the point it can replace it.

Nick-heo-eg · 2026-04-05T04:03:56Z

Closing this PR in favor of a clean follow-up PR with the exact suggested wording.

Nick-heo-eg requested review from a team as code owners March 10, 2026 02:37

github-project-automation Bot added this to Semantic Conventions Triage Mar 10, 2026

github-project-automation Bot moved this to Untriaged in Semantic Conventions Triage Mar 10, 2026

github-actions Bot added the enhancement New feature or request label Mar 10, 2026

Nick-heo-eg mentioned this pull request Mar 10, 2026

gen-ai: add judgment boundary attributes to evaluation result #3336

Closed

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 45d1777 to 9b4b176 Compare March 11, 2026 02:28

lmolkova moved this from Untriaged to Awaiting codeowners approval in Semantic Conventions Triage Mar 17, 2026

github-actions Bot added the Stale label Mar 27, 2026

lmolkova reviewed Mar 27, 2026

View reviewed changes

Comment thread model/gen-ai/registry.yaml Outdated

Comment thread model/gen-ai/registry.yaml Outdated

github-actions Bot removed the Stale label Mar 28, 2026

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 9b4b176 to 691ff36 Compare March 28, 2026 14:26

Nick-heo-eg commented Apr 2, 2026

View reviewed changes

Comment thread model/gen-ai/registry.yaml

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 10875aa to f8e32c0 Compare April 3, 2026 05:38

Nick-heo-eg requested review from a team as code owners April 4, 2026 11:48

Nick-heo-eg force-pushed the genai-judgment-boundary branch 2 times, most recently from 11bdd60 to d266f18 Compare April 4, 2026 15:47

lmolkova reviewed Apr 4, 2026

View reviewed changes

Comment thread docs/internal/ci_failure_patterns.md Outdated

Comment thread logs/ci_debug_trace_3527.yaml Outdated

Comment thread model/faas/events.yaml Outdated

gen-ai: remove evaluation outcome and align with score-based model

6b02079

Nick-heo-eg force-pushed the genai-judgment-boundary branch from d266f18 to 6b02079 Compare April 4, 2026 16:44

lmolkova reviewed Apr 4, 2026

View reviewed changes

Comment thread .chloggen/genai-judgment-boundary.yaml Outdated

Comment thread docs/gen-ai/gen-ai-events.md Outdated

gen-ai: extend score.label for non-numeric evaluation results

8bcd2a6

Nick-heo-eg changed the title ~~gen-ai: add evaluation outcome attributes~~ gen-ai: extend score.label for non-numeric evaluation results Apr 4, 2026

gen-ai: remove score interpretation phrasing from score.label note

e2e77ad

fix: remove broken example anchor from TOC

fa5c82e

lmolkova reviewed Apr 4, 2026

View reviewed changes

Comment thread model/gen-ai/registry.yaml Outdated

fix: restore score.label human-readable label wording

a95d334

fix: restore score.label note to reviewer-approved wording

6d942da

Nick-heo-eg changed the title ~~gen-ai: extend score.label for non-numeric evaluation results~~ gen-ai: fix score.label note wording Apr 4, 2026

fix: sync score.label note wording across docs with registry

e9f6d1c

Nick-heo-eg closed this Apr 5, 2026

Conversation

Nick-heo-eg commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Relationship to Existing Attributes

Uh oh!

Nick-heo-eg commented Mar 10, 2026

Uh oh!

Nick-heo-eg commented Mar 11, 2026

Uh oh!

trask commented Mar 11, 2026

Uh oh!

Nick-heo-eg commented Mar 12, 2026

Uh oh!

github-actions Bot commented Mar 27, 2026

Uh oh!

Nick-heo-eg commented Mar 27, 2026

Uh oh!

lmolkova left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Nick-heo-eg commented Mar 27, 2026

Uh oh!

Nick-heo-eg commented Mar 28, 2026

Uh oh!

lmolkova commented Mar 30, 2026

Uh oh!

Nick-heo-eg commented Mar 30, 2026

Uh oh!

Uh oh!

Nick-heo-eg commented Apr 3, 2026

Uh oh!

lmolkova commented Apr 3, 2026

Uh oh!

Nick-heo-eg commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lmolkova commented Apr 3, 2026

Uh oh!

Nick-heo-eg commented Apr 3, 2026

Uh oh!

lmolkova left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Nick-heo-eg commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lmolkova left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Nick-heo-eg commented Apr 4, 2026

Uh oh!

Nick-heo-eg commented Apr 4, 2026

Uh oh!

lmolkova commented Apr 4, 2026

Uh oh!

Uh oh!

Nick-heo-eg commented Apr 4, 2026

Uh oh!

lmolkova commented Apr 4, 2026

Uh oh!

Nick-heo-eg commented Apr 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Nick-heo-eg commented Mar 10, 2026 •

edited

Loading

Nick-heo-eg commented Apr 3, 2026 •

edited

Loading

Nick-heo-eg commented Apr 4, 2026 •

edited

Loading