4 changes: 4 additions & 0 deletions .chloggen/genai-judgment-boundary.yaml
@@ -0,0 +1,4 @@
change_type: enhancement
component: gen_ai
note: Extend `gen_ai.evaluation.score.label` to support non-numeric categorical evaluation results when no numeric score is produced.
issues: [3336]
2 changes: 1 addition & 1 deletion docs/gen-ai/gen-ai-events.md
@@ -260,7 +260,7 @@ This event captures the result of evaluating GenAI output for quality, accuracy,
the canonical name of exception that occurred, or another low-cardinality error identifier.
Instrumentations SHOULD document the list of errors they report.

**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable label for the evaluation result. It MAY be used alongside `score.value` to interpret a numeric score, or on its own to capture categorical results (e.g. pass/fail, safe/unsafe) when the evaluator produces no numeric score. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
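The two documented uses of the label can be sketched as attribute sets. This is a minimal illustration, not part of the convention; the metric names (`relevance`, `safety`) and the `gen_ai.evaluation.name` key are assumptions chosen for the example:

```python
# Hypothetical sketch of two evaluation results, shown as plain
# attribute dictionaries rather than any specific SDK API.

# Case 1: a numeric score, with a label that interprets it.
scored_result = {
    "gen_ai.evaluation.name": "relevance",   # assumed metric name
    "gen_ai.evaluation.score.value": 1,
    "gen_ai.evaluation.score.label": "relevant",
}

# Case 2: a purely categorical result; the evaluator produces
# no numeric score, so only the label is recorded.
categorical_result = {
    "gen_ai.evaluation.name": "safety",      # assumed metric name
    "gen_ai.evaluation.score.label": "safe",
}

def label_of(result: dict) -> str:
    """The label is present in both cases; the numeric score is optional."""
    return result["gen_ai.evaluation.score.label"]
```

Note that consumers interpreting `score.label` alone (case 2) must not assume a `score.value` is present.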

**[3] `gen_ai.response.id`:** The unique identifier assigned to the specific
completion being evaluated. This attribute helps correlate the evaluation
2 changes: 1 addition & 1 deletion docs/registry/attributes/gen-ai.md
@@ -65,7 +65,7 @@ This document defines the attributes used to describe telemetry in the context o

**[1] `gen_ai.data_source.id`:** Data sources are used by AI agents and RAG applications to store grounding data. A data source may be an external database, object store, document collection, website, or any other storage system used by the GenAI agent or application. The `gen_ai.data_source.id` SHOULD match the identifier used by the GenAI system rather than a name specific to the external storage, such as a database or object store. Semantic conventions referencing `gen_ai.data_source.id` MAY also leverage additional attributes, such as `db.*`, to further identify and describe the data source.

**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable label for the evaluation result. It MAY be used alongside `score.value` to interpret a numeric score, or on its own to capture categorical results (e.g. pass/fail, safe/unsafe) when the evaluator produces no numeric score. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.

**[3] `gen_ai.input.messages`:** Instrumentations MUST follow [Input messages JSON schema](/docs/gen-ai/gen-ai-input-messages.json).
When the attribute is recorded on events, it MUST be recorded in structured
7 changes: 5 additions & 2 deletions model/gen-ai/registry.yaml
@@ -650,8 +650,11 @@ groups:
brief: Human readable label for evaluation.
examples: ["relevant", "not_relevant", "correct", "incorrect", "pass", "fail"]
note: >
This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator.
For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator.
This attribute provides a human-readable label for the evaluation result.
It MAY be used alongside `score.value` to interpret a numeric score, or on its own
to capture categorical results (e.g. pass/fail, safe/unsafe) when the evaluator
produces no numeric score.
The label SHOULD have low cardinality.
Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
- id: gen_ai.evaluation.explanation