
gen-ai: add judgment boundary attributes to evaluation result #3336

Closed
Nick-heo-eg wants to merge 15 commits into open-telemetry:main from Nick-heo-eg:genai-judgment-boundary

Conversation


@Nick-heo-eg Nick-heo-eg commented Jan 26, 2026

Summary

This PR adds two minimal attributes to the existing gen_ai.evaluation.result event to enable standard telemetry recording of evaluation outcomes in GenAI pipelines:

  • gen_ai.evaluation.outcome — the final outcome label assigned by the evaluator (e.g. pass, fail, allow, block)
  • gen_ai.evaluation.multiple_outcomes — whether the evaluation assessed multiple outcome categories simultaneously

These attributes intentionally capture minimal evaluation metadata to remain vendor-neutral and avoid introducing lifecycle or decision-model semantics.

The design follows the direction discussed in open-telemetry/semantic-conventions-genai#72 and does not introduce new events or namespaces.


New Attributes

| Attribute | Type | Stability | Description | Examples |
|---|---|---|---|---|
| `gen_ai.evaluation.outcome` | string | experimental | The evaluation outcome label assigned by the evaluator. | `pass`, `fail`, `allow`, `block` |
| `gen_ai.evaluation.multiple_outcomes` | boolean | experimental | Indicates whether the evaluation process assessed multiple outcome categories or labels. | `true` |

Both attributes are added as Recommended to the gen_ai.evaluation.result event.


Motivation

GenAI evaluation pipelines (safety classifiers, quality scorers, guardrail systems) produce outcome labels as part of their standard output. Currently, the GenAI semantic conventions provide no standard attribute to record these outcome labels in telemetry.

Without a standardized attribute, each instrumentation library uses ad-hoc span attributes or event bodies, making cross-vendor analysis and alerting effectively impossible.

These two attributes provide the minimal metadata needed to record evaluation outcomes in a vendor-neutral way, scoped to what is universally available across evaluation APIs and frameworks.


Relationship to Existing Evaluation Attributes

The gen_ai.evaluation namespace already defines the following attributes (stability: development):

| Attribute | Purpose |
|---|---|
| `gen_ai.evaluation.name` | Name of the evaluation metric (e.g. Relevance, IntentResolution) |
| `gen_ai.evaluation.score.value` | Numeric score returned by the evaluator |
| `gen_ai.evaluation.score.label` | Human-readable interpretation of the score (evaluator-specific) |
| `gen_ai.evaluation.explanation` | Free-form explanation for the score |

These are complementary, not overlapping:

  • score.value / score.label capture how the evaluator scored the output — a numeric signal and its evaluator-specific label.
  • outcome captures the final categorical decision — the action-oriented result of the evaluation (pass/fail/allow/block), which may derive from a score threshold, a classifier, or a multi-step pipeline.
  • multiple_outcomes provides context when outcome is one of several categories evaluated simultaneously (e.g. a safety classifier that scores toxicity, bias, and PII at once).

In a typical LLM evaluation workflow:

```text
evaluator runs → score.value=0.85, score.label="relevant"
               → outcome="pass"          ← threshold applied
               → multiple_outcomes=true  ← multiple categories were scored
```
The existing attributes describe evaluation measurement; the new attributes describe evaluation result selection.
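
To make the measurement/selection split concrete, here is a minimal, non-normative Python sketch; the `Relevance` metric name and the 0.7 threshold are illustrative assumptions, not part of the proposal:

```python
# Non-normative sketch: the evaluator's score is the measurement;
# gen_ai.evaluation.outcome is selected from it by applying a threshold.
# The "Relevance" name and 0.7 threshold are illustrative assumptions.

def to_result_attributes(score: float, label: str, threshold: float = 0.7) -> dict:
    """Build gen_ai.evaluation.result attributes from an evaluator's score."""
    return {
        "gen_ai.evaluation.name": "Relevance",
        "gen_ai.evaluation.score.value": score,   # measurement
        "gen_ai.evaluation.score.label": label,   # evaluator-specific interpretation
        "gen_ai.evaluation.outcome": "pass" if score >= threshold else "fail",  # result selection
    }

print(to_result_attributes(0.85, "relevant"))
```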


Mapping to Existing APIs

| API / Framework | Maps to `gen_ai.evaluation.outcome` | Maps to `gen_ai.evaluation.multiple_outcomes` |
|---|---|---|
| OpenAI Evals API (`eval_run_result.result`) | `"pass"` / `"fail"` per sample | `true` when multiple criteria are evaluated per sample |
| Anthropic model-graded eval (`result` field) | `"pass"` / `"fail"` | `true` for multi-metric evaluations |
| Azure AI Evaluation SDK (`evaluator_result`) | categorical label (e.g. "Very good", "Blocked") | `true` when a composite evaluator runs |
| deepeval / ragas | per-metric pass/fail | `true` when a test case checks multiple metrics |
| Custom safety classifiers | `"allow"` / `"block"` | `true` for multi-label classifiers |
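
As an illustration of the mapping (the `raw` dict below is hypothetical and does not reproduce any vendor's actual response schema), an instrumentation library might normalize a framework's per-sample result like this:

```python
# Hypothetical sketch: normalizing one framework's per-sample result into the
# proposed attributes. The `raw` dict shape is illustrative only.
raw = {
    "result": "pass",                          # per-sample outcome label
    "criteria": ["toxicity", "bias", "pii"],   # categories assessed in one run
}

attributes = {
    "gen_ai.evaluation.outcome": raw["result"],
    "gen_ai.evaluation.multiple_outcomes": len(raw["criteria"]) > 1,
}
print(attributes)  # {'gen_ai.evaluation.outcome': 'pass', 'gen_ai.evaluation.multiple_outcomes': True}
```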

Example

Attributes on a gen_ai.evaluation.result event for a safety evaluation that assessed multiple categories:

```json
{
  "name": "gen_ai.evaluation.result",
  "attributes": {
    "gen_ai.evaluation.name": "ContentSafety",
    "gen_ai.evaluation.score.value": 0.85,
    "gen_ai.evaluation.score.label": "pass",
    "gen_ai.evaluation.outcome": "pass",
    "gen_ai.evaluation.multiple_outcomes": true
  }
}
```
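
A minimal emission sketch in Python, assuming the event is recorded as a span event (implementations may use the OpenTelemetry Events API instead; the tracer name and span name are illustrative):

```python
# Non-normative sketch: record the evaluation result as a span event.
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.evaluation")

with tracer.start_as_current_span("evaluate_output") as span:
    span.add_event(
        "gen_ai.evaluation.result",
        attributes={
            "gen_ai.evaluation.name": "ContentSafety",
            "gen_ai.evaluation.score.value": 0.85,
            "gen_ai.evaluation.score.label": "pass",
            "gen_ai.evaluation.outcome": "pass",
            "gen_ai.evaluation.multiple_outcomes": True,
        },
    )
```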

Related

@Nick-heo-eg
Author

Note: This PR reintroduces the same changes as #3297, which was closed during repository cleanup.
The contents are unchanged; this PR exists to restore the proposal with a clean fork/PR state.

@lmolkova lmolkova moved this from Untriaged to Awaiting codeowners approval in Semantic Conventions Triage Jan 26, 2026
@Nick-heo-eg
Author

For additional background (non-normative, informational only), see:
https://github.com/Nick-heo-eg/decision-only-observability

This external reference does not propose new standards or semantic conventions,
and is intended only to clarify how existing OpenTelemetry concepts can be applied
to decision-only (non-executed) outcomes.

@github-actions

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

@github-actions github-actions Bot added the Stale label Feb 16, 2026
@lmolkova lmolkova moved this from Awaiting codeowners approval to Needs More Approval in Semantic Conventions Triage Feb 16, 2026
@lmolkova lmolkova moved this from Needs More Approval to Blocked in Semantic Conventions Triage Feb 16, 2026
@lmolkova lmolkova moved this from Blocked to Awaiting codeowners approval in Semantic Conventions Triage Feb 16, 2026
@Nick-heo-eg Nick-heo-eg requested a review from a team as a code owner February 18, 2026 19:21
@Nick-heo-eg
Author

For clarity: the recent modeling adjustments only make existing evaluation outcomes queryable and consistent with attribute definitions, without introducing new lifecycle semantics or policy assumptions.

@github-actions github-actions Bot removed the Stale label Feb 19, 2026
@Nick-heo-eg
Author

Revision Note

This revision narrows the proposal to two minimal, vendor-neutral evaluation metadata attributes.

It does not introduce lifecycle stages, decision models, or control semantics.
It only surfaces evaluation metadata already computed by implementations.

The goal is cross-implementation stability and improved queryability of evaluation results.

@Nick-heo-eg
Author

@lmolkova

Would this reduced scope align with the current GenAI semantic convention goals?

We’ve narrowed the PR to two minimal attributes focused strictly on evaluation outcome observability. It no longer introduces lifecycle stages, decision models, or policy semantics; it adds only optional attributes to record evaluation outcomes.

Happy to adjust further if additional scope reduction would be helpful.


github-actions Bot commented Mar 9, 2026

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

@github-actions github-actions Bot added the Stale and enhancement (New feature or request) labels Mar 9, 2026
@Nick-heo-eg
Author

@lmolkova @trask

I pushed fixes after the previous CI failure
(docs regenerated with weaver + changelog added).

Could you please approve the workflow run again?

Thank you!

trask (Member) commented Mar 9, 2026

hi @Nick-heo-eg!

can you update the PR description? currently it doesn't match the attributes added

can you also add links and explain how these two new attributes map to existing APIs? Thanks!

@Nick-heo-eg
Author

> hi @Nick-heo-eg!
>
> can you update the PR description? currently it doesn't match the attributes added
>
> can you also add links and explain how these two new attributes map to existing APIs? Thanks!

Thanks for the review. I updated the PR description to match the current attributes and added links plus a mapping section explaining how the attributes relate to existing GenAI evaluation attributes and APIs.

trask (Member) commented Mar 10, 2026

Thanks! can you also add links in the "Mapping to Existing APIs" tables wherever possible to make it easy for reviewers to verify?

@Nick-heo-eg Nick-heo-eg force-pushed the genai-judgment-boundary branch from 42e8627 to 85e6879 Compare March 10, 2026 00:06
@Nick-heo-eg Nick-heo-eg force-pushed the genai-judgment-boundary branch from 27d9439 to a57aa0a Compare March 10, 2026 00:34

linux-foundation-easycla Bot commented Mar 10, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

This commit adds attributes to the gen_ai.evaluation.result event to support traceability of decision boundaries where multiple alternative outcomes were evaluated.

Implements discussion from open-telemetry/semantic-conventions#3244

Nick-heo-eg and others added 10 commits March 10, 2026 09:44. Commit message excerpts (titles truncated by the page):

  • …t intent: Adds two concrete JSON examples demonstrating judgment boundary attribute usage (a content safety pre-execution check with an automatic decision, and a cost boundary evaluation with human escalation). Clarifies that judgment boundary attributes are intended for event-level auditability and post-hoc inspection rather than high-cardinality metric aggregation.
  • …nd query illustration (🤖 Generated with [Claude Code](https://claude.com/claude-code); Co-Authored-By: Claude <noreply@anthropic.com>)
  • The previous OTLP example used a numeric value (2), which was inconsistent with the boolean type declaration. An int better represents the actual semantic: how many paths were considered, not merely whether more than one existed. (Generated with Claude Code)
  • selected_path: "blocked" -> "block" to match examples ["allow", "block", "escalate"]; alternatives_evaluated examples: [1, 2, 3] -> [2] for conciseness. (Generated with Claude Code)
  • …adata: Removes the judgment namespace and lifecycle-related fields; keeps two experimental attributes aligned with existing evaluation semantics.
@Nick-heo-eg Nick-heo-eg force-pushed the genai-judgment-boundary branch from a57aa0a to 95a3f78 Compare March 10, 2026 00:46
@Nick-heo-eg
Author

/easycla

Commit message excerpt (title truncated): …entions: Changes gen_ai.evaluation.outcome and gen_ai.evaluation.multiple_outcomes stability from experimental to development; adds requirement_level: recommended to both refs in the gen_ai.evaluation.result event.
@Nick-heo-eg Nick-heo-eg force-pushed the genai-judgment-boundary branch from 95a3f78 to 0f5f5f1 Compare March 10, 2026 01:16
@Nick-heo-eg
Author

/easycla

@trask trask removed the Stale label Mar 10, 2026
trask (Member) commented Mar 10, 2026

A couple of questions:

  • How does gen_ai.evaluation.outcome differ from the existing score.label? They're the same type with the same example values (pass, fail) — when would an instrumentor set one but not the other?
  • What's the concrete use case for gen_ai.evaluation.multiple_outcomes? It doesn't appear to map to a field in the listed APIs, and multiple gen_ai.evaluation.result events on the same span already signal that multiple evaluations happened.

@Nick-heo-eg
Author

> A couple of questions:
>
>   • How does gen_ai.evaluation.outcome differ from the existing score.label? They're the same type with the same example values (pass, fail) — when would an instrumentor set one but not the other?
>
>   • What's the concrete use case for gen_ai.evaluation.multiple_outcomes? It doesn't appear to map to a field in the listed APIs, and multiple gen_ai.evaluation.result events on the same span already signal that multiple evaluations happened.

Thanks for the questions!

Regarding gen_ai.evaluation.outcome vs score.label:

score.label is the human-readable label associated with a numeric score.value. It represents the categorical interpretation of a score (for example, a score produced by an evaluation framework might map to labels such as "relevant" or "not_relevant"). In this sense, it is a companion attribute to score.value.

gen_ai.evaluation.outcome, on the other hand, represents the categorical result assigned by the evaluator independently of any numeric score. It is useful when an evaluation produces a direct categorical decision such as pass, fail, allow, or block, without necessarily producing a numeric score.

In practice, both attributes may be set on the same event when an evaluation framework produces both a numeric score (with score.value and score.label) and a separate categorical evaluation outcome. When only a categorical result exists without a numeric score, gen_ai.evaluation.outcome is the appropriate attribute.

Regarding gen_ai.evaluation.multiple_outcomes:

This attribute signals that the evaluation process assessed multiple outcome categories in a single evaluation run. For example, a safety classifier may simultaneously evaluate categories such as toxicity, bias, and PII exposure.

It allows telemetry consumers to understand that a single gen_ai.evaluation.result event represents a multi-category evaluation rather than a single-criterion result.

This is distinct from emitting multiple gen_ai.evaluation.result events on the same span, which would indicate multiple independent evaluation runs.
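
To illustrate the distinction, here are two illustrative attribute sets (all values hypothetical):

```python
# Illustrative only: when each attribute applies.

# Evaluator produces a numeric score plus a separate categorical decision:
scored = {
    "gen_ai.evaluation.score.value": 0.85,
    "gen_ai.evaluation.score.label": "relevant",  # interpretation of the score
    "gen_ai.evaluation.outcome": "pass",          # decision after thresholding
}

# Evaluator produces only a categorical decision, with no numeric score:
categorical_only = {
    "gen_ai.evaluation.outcome": "block",
    "gen_ai.evaluation.multiple_outcomes": True,  # e.g. toxicity, bias, PII in one run
}
```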

Also, this PR was closed due to a CI merge-base issue after a force push, so the same changes have been reopened in #3527 where CI can run with a fresh merge base.


github-actions Bot commented May 5, 2026

This PR contains changes to area(s) that do not have an active SIG/project and will be auto-closed:

  • faas
  • aws
  • db
  • gen-ai
  • hardware

Such changes may be rejected or put on hold until a new SIG/project is established.

Please refer to the Semantic Convention Areas
document to see the current active SIGs and also to learn how to kick start a new one.
