diff --git a/.chloggen/genai-judgment-boundary.yaml b/.chloggen/genai-judgment-boundary.yaml
new file mode 100644
index 0000000000..18a2874503
--- /dev/null
+++ b/.chloggen/genai-judgment-boundary.yaml
@@ -0,0 +1,4 @@
+change_type: enhancement
+component: gen_ai
+note: Extend `gen_ai.evaluation.score.label` to support non-numeric categorical evaluation results when no numeric score is produced.
+issues: [3336]
diff --git a/docs/gen-ai/gen-ai-events.md b/docs/gen-ai/gen-ai-events.md
index bb2c5c7cc1..b49ba4c40a 100644
--- a/docs/gen-ai/gen-ai-events.md
+++ b/docs/gen-ai/gen-ai-events.md
@@ -260,7 +260,7 @@ This event captures the result of evaluating GenAI output for quality, accuracy,
 the canonical name of exception that occurred, or another low-cardinality error identifier.
 Instrumentations SHOULD document the list of errors they report.
 
-**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
+**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable label for the evaluation result. It MAY be used alongside `score.value` to interpret a numeric score, or on its own to capture non-numeric categorical results (e.g. pass/fail, safe/unsafe) when no numeric score is produced. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
 
 **[3] `gen_ai.response.id`:** The unique identifier assigned to the specific completion being evaluated.
 This attribute helps correlate the evaluation
diff --git a/docs/registry/attributes/gen-ai.md b/docs/registry/attributes/gen-ai.md
index b7b4f0926d..449ae6f1b9 100644
--- a/docs/registry/attributes/gen-ai.md
+++ b/docs/registry/attributes/gen-ai.md
@@ -65,7 +65,7 @@ This document defines the attributes used to describe telemetry in the context o
 **[1] `gen_ai.data_source.id`:** Data sources are used by AI agents and RAG applications to store grounding data. A data source may be an external database, object store, document collection, website, or any other storage system used by the GenAI agent or application. The `gen_ai.data_source.id` SHOULD match the identifier used by the GenAI system rather than a name specific to the external storage, such as a database or object store. Semantic conventions referencing `gen_ai.data_source.id` MAY also leverage additional attributes, such as `db.*`, to further identify and describe the data source.
 
-**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
+**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable label for the evaluation result. It MAY be used alongside `score.value` to interpret a numeric score, or on its own to capture non-numeric categorical results (e.g. pass/fail, safe/unsafe) when no numeric score is produced. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
 **[3] `gen_ai.input.messages`:** Instrumentations MUST follow [Input messages JSON schema](/docs/gen-ai/gen-ai-input-messages.json). When the attribute is recorded on events, it MUST be recorded in structured
diff --git a/model/gen-ai/registry.yaml b/model/gen-ai/registry.yaml
index ddd2408b27..d1c66c4a46 100644
--- a/model/gen-ai/registry.yaml
+++ b/model/gen-ai/registry.yaml
@@ -650,8 +650,9 @@ groups:
       brief: Human readable label for evaluation.
       examples: ["relevant", "not_relevant", "correct", "incorrect", "pass", "fail"]
       note: >
-        This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator.
-        For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator.
+        This attribute provides a human-readable label for the evaluation result. It MAY be used alongside
+        `score.value` to interpret a numeric score, or on its own to capture non-numeric categorical
+        results (e.g. pass/fail, safe/unsafe) when no numeric score is produced.
         The label SHOULD have low cardinality.
         Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
   - id: gen_ai.evaluation.explanation
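The two shapes of evaluation result that this change distinguishes can be sketched as plain attribute maps. This is a minimal illustrative sketch, not part of the diff; the metric names (`relevance`, `safety`) and the surrounding event-emission plumbing are assumed, while the attribute keys follow the `gen_ai` registry:

```python
# Hypothetical evaluation results illustrating the extended semantics
# of `gen_ai.evaluation.score.label` (metric names are assumptions).

# Case 1: a numeric score, with a label that interprets it.
scored_result = {
    "gen_ai.evaluation.name": "relevance",
    "gen_ai.evaluation.score.value": 1.0,
    "gen_ai.evaluation.score.label": "relevant",
}

# Case 2: a purely categorical result. The evaluator produces no
# numeric score, so only the label is recorded.
categorical_result = {
    "gen_ai.evaluation.name": "safety",
    "gen_ai.evaluation.score.label": "safe",
}

# Before this change, a consumer could assume `score.label` always
# accompanies `score.value`; after it, the label may stand alone.
has_numeric_score = "gen_ai.evaluation.score.value" in categorical_result
```

Under this reading, consumers that aggregate over `score.value` would skip results like `categorical_result` rather than treat a missing score as an error.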