4 changes: 4 additions & 0 deletions .chloggen/genai-judgment-boundary.yaml
@@ -0,0 +1,4 @@
change_type: enhancement
component: gen_ai
note: Extend `gen_ai.evaluation.score.label` to support non-numeric categorical evaluation results when no numeric score is produced.
issues: [3336]
2 changes: 1 addition & 1 deletion docs/gen-ai/gen-ai-events.md
@@ -260,7 +260,7 @@ This event captures the result of evaluating GenAI output for quality, accuracy,
the canonical name of exception that occurred, or another low-cardinality error identifier.
Instrumentations SHOULD document the list of errors they report.

**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable label for the evaluation result. It MAY be used alongside `score.value` to interpret a numeric score, or on its own to capture categorical results (e.g. pass/fail, safe/unsafe) when the evaluator produces no numeric score. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
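The two documented uses of the label can be sketched as attribute sets. This is a minimal illustration, not part of the convention; the metric names (`relevance`, `safety`) and the `gen_ai.evaluation.name` key are assumptions chosen for the example:

```python
# Hypothetical sketch of two evaluation results, shown as plain
# attribute dictionaries rather than any specific SDK API.

# Case 1: a numeric score, with a label that interprets it.
scored_result = {
    "gen_ai.evaluation.name": "relevance",   # assumed metric name
    "gen_ai.evaluation.score.value": 1,
    "gen_ai.evaluation.score.label": "relevant",
}

# Case 2: a purely categorical result; the evaluator produces
# no numeric score, so only the label is recorded.
categorical_result = {
    "gen_ai.evaluation.name": "safety",      # assumed metric name
    "gen_ai.evaluation.score.label": "safe",
}

def label_of(result: dict) -> str:
    """The label is present in both cases; the numeric score is optional."""
    return result["gen_ai.evaluation.score.label"]
```

Note that consumers interpreting `score.label` alone (case 2) must not assume a `score.value` is present.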

**[3] `gen_ai.response.id`:** The unique identifier assigned to the specific
completion being evaluated. This attribute helps correlate the evaluation
2 changes: 1 addition & 1 deletion docs/registry/attributes/gen-ai.md
@@ -65,7 +65,7 @@ This document defines the attributes used to describe telemetry in the context o

**[1] `gen_ai.data_source.id`:** Data sources are used by AI agents and RAG applications to store grounding data. A data source may be an external database, object store, document collection, website, or any other storage system used by the GenAI agent or application. The `gen_ai.data_source.id` SHOULD match the identifier used by the GenAI system rather than a name specific to the external storage, such as a database or object store. Semantic conventions referencing `gen_ai.data_source.id` MAY also leverage additional attributes, such as `db.*`, to further identify and describe the data source.

**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable label for the evaluation result. It MAY be used alongside `score.value` to interpret a numeric score, or on its own to capture categorical results (e.g. pass/fail, safe/unsafe) when the evaluator produces no numeric score. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.

**[3] `gen_ai.input.messages`:** Instrumentations MUST follow [Input messages JSON schema](/docs/gen-ai/gen-ai-input-messages.json).
When the attribute is recorded on events, it MUST be recorded in structured
7 changes: 5 additions & 2 deletions model/gen-ai/registry.yaml
@@ -650,8 +650,11 @@ groups:
brief: Human readable label for evaluation.
examples: ["relevant", "not_relevant", "correct", "incorrect", "pass", "fail"]
note: >
This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator.
For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator.
This attribute provides a human-readable label for the evaluation result.
It MAY be used alongside `score.value` to interpret a numeric score, or on its own
to capture categorical results (e.g. pass/fail, safe/unsafe) when the evaluator
produces no numeric score.
The label SHOULD have low cardinality.
Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
- id: gen_ai.evaluation.explanation