gen-ai: add judgment boundary attributes to evaluation result by Nick-heo-eg · Pull Request #3336 · open-telemetry/semantic-conventions

Nick-heo-eg · 2026-01-26T13:37:41Z

Summary

This PR adds two minimal attributes to the existing gen_ai.evaluation.result event to enable standard telemetry recording of evaluation outcomes in GenAI pipelines:

gen_ai.evaluation.outcome — the final outcome label assigned by the evaluator (e.g. pass, fail, allow, block)
gen_ai.evaluation.multiple_outcomes — whether the evaluation assessed multiple outcome categories simultaneously

These attributes intentionally capture minimal evaluation metadata to remain vendor-neutral and avoid introducing lifecycle or decision-model semantics.

The design follows the direction discussed in open-telemetry/semantic-conventions-genai#72 and does not introduce new events or namespaces.

New Attributes

Attribute	Type	Stability	Description	Examples
`gen_ai.evaluation.outcome`	string	experimental	The evaluation outcome label assigned by the evaluator.	`pass`, `fail`, `allow`, `block`
`gen_ai.evaluation.multiple_outcomes`	boolean	experimental	Indicates whether the evaluation process assessed multiple outcome categories or labels.	`true`

Both attributes are added as Recommended to the gen_ai.evaluation.result event.

Motivation

GenAI evaluation pipelines (safety classifiers, quality scorers, guardrail systems) produce outcome labels as part of their standard output. Currently, the GenAI semantic conventions provide no standard attribute to record these outcome labels in telemetry.

Without a standardized attribute, each instrumentation library uses ad-hoc span attributes or event bodies, making cross-vendor analysis and alerting impossible.

These two attributes provide the minimal metadata needed to record evaluation outcomes in a vendor-neutral way, scoped to what is universally available across evaluation APIs and frameworks.

Relationship to Existing Evaluation Attributes

The gen_ai.evaluation namespace already defines the following attributes (stability: development):

Attribute	Purpose
`gen_ai.evaluation.name`	Name of the evaluation metric (e.g. `Relevance`, `IntentResolution`)
`gen_ai.evaluation.score.value`	Numeric score returned by the evaluator
`gen_ai.evaluation.score.label`	Human-readable interpretation of the score (evaluator-specific)
`gen_ai.evaluation.explanation`	Free-form explanation for the score

These are complementary, not overlapping:

score.value / score.label capture how the evaluator scored the output — a numeric signal and its evaluator-specific label.
outcome captures the final categorical decision — the action-oriented result of the evaluation (pass/fail/allow/block), which may derive from a score threshold, a classifier, or a multi-step pipeline.
multiple_outcomes provides context when outcome is one of several categories evaluated simultaneously (e.g. a safety classifier that scores toxicity, bias, and PII at once).

In a typical LLM evaluation workflow:

evaluator runs → score.value=0.85, score.label="relevant"
                → outcome="pass"          ← threshold applied
                → multiple_outcomes=true  ← multiple categories were scored

The existing attributes describe evaluation measurement; the new attributes describe evaluation result selection.

Mapping to Existing APIs

API / Framework	Maps to `gen_ai.evaluation.outcome`	Maps to `gen_ai.evaluation.multiple_outcomes`
OpenAI Evals API (`eval_run_result.result`)	`"pass"` / `"fail"` per sample	`true` when multiple criteria are evaluated per sample
Anthropic model-graded eval (`result` field)	`"pass"` / `"fail"`	`true` for multi-metric evaluations
Azure AI Evaluation SDK (`evaluator_result`)	categorical label (e.g. `"Very good"`, `"Blocked"`)	`true` when composite evaluator runs
deepeval / ragas / RAGAS	per-metric pass/fail	`true` when test case checks multiple metrics
Custom safety classifiers	`"allow"` / `"block"`	`true` for multi-label classifiers

Example

Span attributes for a safety evaluation that assessed multiple categories:

{
  "name": "gen_ai.evaluation.result",
  "attributes": {
    "gen_ai.evaluation.name": "ContentSafety",
    "gen_ai.evaluation.score.value": 0.85,
    "gen_ai.evaluation.score.label": "pass",
    "gen_ai.evaluation.outcome": "pass",
    "gen_ai.evaluation.multiple_outcomes": true
  }
}

…t intent Adds two concrete JSON examples demonstrating judgment boundary attribute usage: - Content safety pre-execution check with automatic decision - Cost boundary evaluation with human escalation Clarifies that judgment boundary attributes are intended for event-level auditability and post-hoc inspection rather than high-cardinality metric aggregation.

…nd query illustration 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

The previous OTLP example used a numeric value (2) which was inconsistent with the boolean type declaration. Int better represents the actual semantic: how many paths were considered, not merely whether more than one existed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- selected_path: "blocked" -> "block" to match examples ["allow", "block", "escalate"] - alternatives_evaluated examples: [1, 2, 3] -> [2] for conciseness 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…cused

…adata Remove judgment namespace and lifecycle-related fields. Keep two experimental attributes aligned with existing evaluation semantics.

Nick-heo-eg · 2026-03-10T00:53:23Z

/easycla

…entions - Change gen_ai.evaluation.outcome and gen_ai.evaluation.multiple_outcomes stability from experimental to development - Add requirement_level: recommended to both refs in gen_ai.evaluation.result event

Nick-heo-eg · 2026-03-10T01:16:28Z

/easycla

trask · 2026-03-10T02:05:03Z

A couple of questions:

How does gen_ai.evaluation.outcome differ from the existing score.label? They're the same type with the same example values (pass, fail) — when would an instrumentor set one but not the other?
What's the concrete use case for gen_ai.evaluation.multiple_outcomes? It doesn't appear to map to a field in the listed APIs, and multiple gen_ai.evaluation.result events on the same span already signal that multiple evaluations happened.

Nick-heo-eg · 2026-03-10T03:18:06Z

A couple of questions:

How does gen_ai.evaluation.outcome differ from the existing score.label? They're the same type with the same example values (pass, fail) — when would an instrumentor set one but not the other?

What's the concrete use case for gen_ai.evaluation.multiple_outcomes? It doesn't appear to map to a field in the listed APIs, and multiple gen_ai.evaluation.result events on the same span already signal that multiple evaluations happened.

Thanks for the questions!

Regarding gen_ai.evaluation.outcome vs score.label:

score.label is the human-readable label associated with a numeric score.value. It represents the categorical interpretation of a score (for example, a score produced by an evaluation framework might map to labels such as "relevant" or "not_relevant"). In this sense, it is a companion attribute to score.value.

gen_ai.evaluation.outcome, on the other hand, represents the categorical result assigned by the evaluator independently of any numeric score. It is useful when an evaluation produces a direct categorical decision such as pass, fail, allow, or block, without necessarily producing a numeric score.

In practice, both attributes may be set on the same event when an evaluation framework produces both a numeric score (with score.value and score.label) and a separate categorical evaluation outcome. When only a categorical result exists without a numeric score, gen_ai.evaluation.outcome is the appropriate attribute.

Regarding gen_ai.evaluation.multiple_outcomes:

This attribute signals that the evaluation process assessed multiple outcome categories in a single evaluation run. For example, a safety classifier may simultaneously evaluate categories such as toxicity, bias, and PII exposure.

It allows telemetry consumers to understand that a single gen_ai.evaluation.result event represents a multi-category evaluation rather than a single-criterion result.

This is distinct from emitting multiple gen_ai.evaluation.result events on the same span, which would indicate multiple independent evaluation runs.

Also, this PR was closed due to a CI merge-base issue after a force push, so the same changes have been reopened in #3527 where CI can run with a fresh merge base.

github-actions · 2026-05-05T16:42:51Z

This PR contains changes to area(s) that do not have an active SIG/project and will be auto-closed:

faas
aws
db
gen-ai
hardware

Such changes may be rejected or put on hold until a new SIG/project is established.

Please refer to the Semantic Convention Areas
document to see the current active SIGs and also to learn how to kick start a new one.

Nick-heo-eg requested review from a team as code owners January 26, 2026 13:37

github-project-automation Bot added this to Semantic Conventions Triage Jan 26, 2026

github-project-automation Bot moved this to Untriaged in Semantic Conventions Triage Jan 26, 2026

lmolkova moved this from Untriaged to Awaiting codeowners approval in Semantic Conventions Triage Jan 26, 2026

lmolkova added the area:gen-ai label Jan 26, 2026

github-actions Bot added the Stale label Feb 16, 2026

lmolkova moved this from Awaiting codeowners approval to Needs More Approval in Semantic Conventions Triage Feb 16, 2026

lmolkova moved this from Needs More Approval to Blocked in Semantic Conventions Triage Feb 16, 2026

lmolkova moved this from Blocked to Awaiting codeowners approval in Semantic Conventions Triage Feb 16, 2026

Nick-heo-eg requested a review from a team as a code owner February 18, 2026 19:21

github-actions Bot removed the Stale label Feb 19, 2026

github-actions Bot added Stale enhancement New feature or request labels Mar 9, 2026

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 42e8627 to 85e6879 Compare March 10, 2026 00:06

trask reviewed Mar 10, 2026

View reviewed changes

Comment thread model/gen-ai/events.yaml

Comment thread model/gen-ai/registry.yaml Outdated

Comment thread model/gen-ai/registry.yaml Outdated

Comment thread .chloggen/genai-judgment-boundary.yaml

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 27d9439 to a57aa0a Compare March 10, 2026 00:34

Nick-heo-eg and others added 10 commits March 10, 2026 09:44

chore: trigger EasyCLA recheck with verified email

2e3e695

docs: link to Judgment Boundary manifest

4183440

docs(genai): clarify decision evaluation modeling with OTLP example a…

10d45b2

…nd query illustration 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

docs(genai): remove unrelated positioning section to keep PR scope fo…

e6eee35

…cused

refactor(genai): reduce judgment attributes to minimal evaluation met…

f24f852

…adata Remove judgment namespace and lifecycle-related fields. Keep two experimental attributes aligned with existing evaluation semantics.

chore: regenerate docs and add changelog entry

f2ea9d8

chore: regenerate registry docs to include new evaluation attributes

294379f

Nick-heo-eg force-pushed the genai-judgment-boundary branch from a57aa0a to 95a3f78 Compare March 10, 2026 00:46

fix(gen-ai): add requirement_level and align stability with OTel conv…

0f5f5f1

…entions - Change gen_ai.evaluation.outcome and gen_ai.evaluation.multiple_outcomes stability from experimental to development - Add requirement_level: recommended to both refs in gen_ai.evaluation.result event

Nick-heo-eg force-pushed the genai-judgment-boundary branch from 95a3f78 to 0f5f5f1 Compare March 10, 2026 01:16

trask removed the Stale label Mar 10, 2026

Nick-heo-eg added 3 commits March 10, 2026 11:09

chore: trigger CI rerun

2a0a9c3

chore: trigger CI rerun after force-push

3bd57d6

fix(ci): allow full history for changed-files detection

02e92b7

Nick-heo-eg closed this Mar 10, 2026

Nick-heo-eg reopened this Mar 10, 2026

Nick-heo-eg closed this Mar 10, 2026

Nick-heo-eg mentioned this pull request Mar 10, 2026

gen-ai: fix score.label note wording #3527

Closed

github-actions Bot added the triage:rejected:declined label May 5, 2026

Conversation

Nick-heo-eg commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New Attributes

Motivation

Relationship to Existing Evaluation Attributes

Mapping to Existing APIs

Example

Related

Uh oh!

Nick-heo-eg commented Jan 26, 2026

Uh oh!

Nick-heo-eg commented Feb 1, 2026

Uh oh!

github-actions Bot commented Feb 16, 2026

Uh oh!

Nick-heo-eg commented Feb 18, 2026

Uh oh!

Nick-heo-eg commented Feb 19, 2026

Revision Note

Uh oh!

Nick-heo-eg commented Feb 22, 2026

Uh oh!

github-actions Bot commented Mar 9, 2026

Uh oh!

Nick-heo-eg commented Mar 9, 2026

Uh oh!

trask commented Mar 9, 2026

Uh oh!

Nick-heo-eg commented Mar 9, 2026

Uh oh!

trask commented Mar 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

linux-foundation-easycla Bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Nick-heo-eg commented Mar 10, 2026

Uh oh!

Nick-heo-eg commented Mar 10, 2026

Uh oh!

trask commented Mar 10, 2026

Uh oh!

Nick-heo-eg commented Mar 10, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Nick-heo-eg commented Jan 26, 2026 •

edited

Loading

linux-foundation-easycla Bot commented Mar 10, 2026 •

edited

Loading