gen-ai: add judgment boundary attributes to evaluation result #3336
Nick-heo-eg wants to merge 15 commits into open-telemetry:main
Conversation
Note: This PR reintroduces the same changes as #3297, which was closed during repository cleanup.
For additional background (non-normative, informational only), see the linked external reference; it does not propose new standards or semantic conventions.
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.
For clarity: the recent modeling adjustments only make existing evaluation outcomes queryable and consistent with attribute definitions, without introducing new lifecycle semantics or policy assumptions.
Revision Note

This revision narrows the proposal to two minimal, vendor-neutral evaluation metadata attributes. It does not introduce lifecycle stages, decision models, or control semantics. The goal is cross-implementation stability and improved queryability of evaluation results.
Would this reduced scope align with the current GenAI semantic convention goals? We've narrowed the PR to two minimal attributes focused strictly on execution outcome observability. It no longer introduces lifecycle stages, decision models, or policy semantics; it adds only optional attributes to record execution outcomes. Happy to adjust further if additional scope reduction would be helpful.
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.
hi @Nick-heo-eg! Can you update the PR description? Currently it doesn't match the attributes added. Can you also add links and explain how these two new attributes map to existing APIs? Thanks!
Thanks for the review. I updated the PR description to match the current attributes and added links plus a mapping section explaining how the attributes relate to existing GenAI evaluation attributes and APIs. |
Thanks! Can you also add links in the "Mapping to Existing APIs" tables wherever possible, to make it easy for reviewers to verify?
Force-pushed 42e8627 to 85e6879
Force-pushed 27d9439 to a57aa0a
This commit adds attributes to the gen_ai.evaluation.result event to support traceability of decision boundaries where multiple alternative outcomes were evaluated. Implements discussion from open-telemetry/semantic-conventions#3244
…t intent

Adds two concrete JSON examples demonstrating judgment boundary attribute usage:

- Content safety pre-execution check with automatic decision
- Cost boundary evaluation with human escalation

Clarifies that judgment boundary attributes are intended for event-level auditability and post-hoc inspection rather than high-cardinality metric aggregation.
…nd query illustration

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The previous OTLP example used a numeric value (2), which was inconsistent with the boolean type declaration. An int better represents the actual semantic: how many paths were considered, not merely whether more than one existed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- selected_path: "blocked" -> "block" to match examples ["allow", "block", "escalate"]
- alternatives_evaluated examples: [1, 2, 3] -> [2] for conciseness

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…adata

Remove judgment namespace and lifecycle-related fields. Keep two experimental attributes aligned with existing evaluation semantics.
Force-pushed a57aa0a to 95a3f78
/easycla
…entions

- Change `gen_ai.evaluation.outcome` and `gen_ai.evaluation.multiple_outcomes` stability from experimental to development
- Add `requirement_level: recommended` to both refs in the `gen_ai.evaluation.result` event
Force-pushed 95a3f78 to 0f5f5f1
/easycla |
A couple of questions:
Thanks for the questions!

Regarding the overlap with the score attributes: in practice, both attributes may be set on the same event when an evaluation framework produces both a numeric score (with `gen_ai.evaluation.score.label`) and a final categorical outcome recorded in `gen_ai.evaluation.outcome`.

Regarding `gen_ai.evaluation.multiple_outcomes`: this attribute signals that the evaluation process assessed multiple outcome categories in a single evaluation run. For example, a safety classifier may simultaneously evaluate categories such as toxicity, bias, and PII exposure. It allows telemetry consumers to understand that a single `gen_ai.evaluation.result` event covers a multi-category evaluation. This is distinct from emitting multiple `gen_ai.evaluation.result` events, one per category.

Also, this PR was closed due to a CI merge-base issue after a force push, so the same changes have been reopened in #3527 where CI can run with a fresh merge base.
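To make the single-event vs. per-category distinction concrete, here is a minimal sketch, assuming the result is attached as a span event via the stable OpenTelemetry Python tracing API; the tracer name, category names, and outcome values are illustrative, not prescribed by this PR:

```python
from opentelemetry import trace

tracer = trace.get_tracer("example-instrumentation")

with tracer.start_as_current_span("evaluate_response") as span:
    # Shape A: one event summarizing a multi-category safety evaluation.
    # multiple_outcomes=True tells consumers this single event covers
    # several categories (e.g. toxicity, bias, PII) evaluated in one run.
    span.add_event(
        "gen_ai.evaluation.result",
        attributes={
            "gen_ai.evaluation.name": "ContentSafety",
            "gen_ai.evaluation.outcome": "block",
            "gen_ai.evaluation.multiple_outcomes": True,
        },
    )

    # Shape B: one event per category, when per-category detail is needed.
    for category, outcome in [("Toxicity", "block"), ("Bias", "pass"), ("PII", "pass")]:
        span.add_event(
            "gen_ai.evaluation.result",
            attributes={
                "gen_ai.evaluation.name": category,
                "gen_ai.evaluation.outcome": outcome,
                "gen_ai.evaluation.multiple_outcomes": False,
            },
        )
```

Either shape stays queryable; `multiple_outcomes` simply tells a consumer which shape a given event represents.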
This PR contains changes to area(s) that do not have an active SIG/project and will be auto-closed:
Such changes may be rejected or put on hold until a new SIG/project is established. Please refer to the Semantic Convention Areas.
Summary
This PR adds two minimal attributes to the existing `gen_ai.evaluation.result` event to enable standard telemetry recording of evaluation outcomes in GenAI pipelines:

- `gen_ai.evaluation.outcome`: the final outcome label assigned by the evaluator (e.g. `pass`, `fail`, `allow`, `block`)
- `gen_ai.evaluation.multiple_outcomes`: whether the evaluation assessed multiple outcome categories simultaneously

These attributes intentionally capture minimal evaluation metadata to remain vendor-neutral and avoid introducing lifecycle or decision-model semantics.
The design follows the direction discussed in open-telemetry/semantic-conventions-genai#72 and does not introduce new events or namespaces.
New Attributes

| Attribute | Type | Example values |
|-----------|------|----------------|
| `gen_ai.evaluation.outcome` | string | `pass`, `fail`, `allow`, `block` |
| `gen_ai.evaluation.multiple_outcomes` | boolean | `true` |

Both attributes are added as `Recommended` to the `gen_ai.evaluation.result` event.
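For illustration only, a hypothetical Python sketch of how an instrumentation library might expose and use these attribute names; the constant names are assumptions in the style of the `opentelemetry-semantic-conventions` package and are not part of any release:

```python
# Hypothetical constants for the attributes proposed in this PR;
# not part of any released semantic-conventions package.
GEN_AI_EVALUATION_OUTCOME = "gen_ai.evaluation.outcome"
GEN_AI_EVALUATION_MULTIPLE_OUTCOMES = "gen_ai.evaluation.multiple_outcomes"

# Attaching both Recommended attributes to a gen_ai.evaluation.result event payload.
attributes = {
    GEN_AI_EVALUATION_OUTCOME: "pass",          # string: final outcome label
    GEN_AI_EVALUATION_MULTIPLE_OUTCOMES: True,  # boolean: multi-category evaluation
}
```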
Motivation
GenAI evaluation pipelines (safety classifiers, quality scorers, guardrail systems) produce outcome labels as part of their standard output. Currently, the GenAI semantic conventions provide no standard attribute to record these outcome labels in telemetry.
Without a standardized attribute, each instrumentation library uses ad-hoc span attributes or event bodies, making cross-vendor analysis and alerting impossible.
These two attributes provide the minimal metadata needed to record evaluation outcomes in a vendor-neutral way, scoped to what is universally available across evaluation APIs and frameworks.
Relationship to Existing Evaluation Attributes
The `gen_ai.evaluation` namespace already defines the following attributes (stability: `development`):

- `gen_ai.evaluation.name` (e.g. `Relevance`, `IntentResolution`)
- `gen_ai.evaluation.score.value`
- `gen_ai.evaluation.score.label`
- `gen_ai.evaluation.explanation`

These are complementary, not overlapping:
- `score.value` / `score.label` capture how the evaluator scored the output: a numeric signal and its evaluator-specific label.
- `outcome` captures the final categorical decision: the action-oriented result of the evaluation (pass/fail/allow/block), which may derive from a score threshold, a classifier, or a multi-step pipeline.
- `multiple_outcomes` provides context when `outcome` is one of several categories evaluated simultaneously (e.g. a safety classifier that scores toxicity, bias, and PII at once).

In a typical LLM evaluation workflow:
The existing attributes describe evaluation measurement; the new attributes describe evaluation result selection.
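To illustrate that split, a minimal hypothetical sketch: the `derive_outcome` helper and its 0.7 threshold are assumptions for illustration, not part of the conventions.

```python
def derive_outcome(score: float, threshold: float = 0.7) -> str:
    """Map a numeric evaluation score (measurement) to a categorical
    outcome (result selection). The threshold is evaluator-specific."""
    return "pass" if score >= threshold else "fail"

# score.value / score.label describe the measurement...
attributes = {
    "gen_ai.evaluation.score.value": 0.85,
    "gen_ai.evaluation.score.label": "pass",
    # ...while outcome records the selected result derived from it.
    "gen_ai.evaluation.outcome": derive_outcome(0.85),
}
```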
Mapping to Existing APIs
| API / framework (result field) | `gen_ai.evaluation.outcome` | `gen_ai.evaluation.multiple_outcomes` |
|---|---|---|
| (`eval_run_result.result`) | `"pass"` / `"fail"` per sample | `true` when multiple criteria are evaluated per sample |
| (`result` field) | `"pass"` / `"fail"` | `true` for multi-metric evaluations |
| (`evaluator_result`) | e.g. `"Very good"`, `"Blocked"` | `true` when a composite evaluator runs |
| | | `true` when a test case checks multiple metrics |
| | `"allow"` / `"block"` | `true` for multi-label classifiers |
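As a rough, non-normative illustration of such a mapping, the sketch below translates a generic framework result into the two proposed attributes; the `framework_result` shape and the `map_framework_result` helper are hypothetical, since each API in the table above differs:

```python
def map_framework_result(framework_result: dict) -> dict:
    """Translate a generic evaluation-framework result into the proposed
    attributes. The input shape is hypothetical; adapt per framework."""
    criteria = framework_result.get("criteria", [])
    return {
        # e.g. "pass"/"fail" from a result field, or "allow"/"block" from a guardrail.
        "gen_ai.evaluation.outcome": framework_result["result"],
        # True when more than one criterion/category was evaluated in this run.
        "gen_ai.evaluation.multiple_outcomes": len(criteria) > 1,
    }

print(map_framework_result({"result": "pass", "criteria": ["toxicity", "bias"]}))
# {'gen_ai.evaluation.outcome': 'pass', 'gen_ai.evaluation.multiple_outcomes': True}
```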
Example

Span attributes for a safety evaluation that assessed multiple categories:
{ "name": "gen_ai.evaluation.result", "attributes": { "gen_ai.evaluation.name": "ContentSafety", "gen_ai.evaluation.score.value": 0.85, "gen_ai.evaluation.score.label": "pass", "gen_ai.evaluation.outcome": "pass", "gen_ai.evaluation.multiple_outcomes": true } }Related