2 changes: 1 addition & 1 deletion sdk/ai/azure-ai-projects/assets.json
@@ -2,5 +2,5 @@
"AssetsRepo": "Azure/azure-sdk-assets",
"AssetsRepoPrefixPath": "python",
"TagPrefix": "python/ai/azure-ai-projects",
"Tag": "python/ai/azure-ai-projects_7cddb7d06f"
"Tag": "python/ai/azure-ai-projects_212aab4d9b"
}
105 changes: 105 additions & 0 deletions sdk/ai/azure-ai-projects/samples/evaluations/README.md
@@ -0,0 +1,105 @@
# Azure AI Projects - Evaluation Samples

This folder contains samples demonstrating how to use Azure AI Foundry's evaluation capabilities with the `azure-ai-projects` SDK.

## Prerequisites

Before running any sample:

```bash
pip install "azure-ai-projects>=2.0.0b1" python-dotenv
```

Set these environment variables (a minimal connection sketch follows the list):
- `AZURE_AI_PROJECT_ENDPOINT` - Your Azure AI Project endpoint (e.g., `https://<account>.services.ai.azure.com/api/projects/<project>`)
- `AZURE_AI_MODEL_DEPLOYMENT_NAME` - The model deployment name (e.g., `gpt-4o-mini`)
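
Most samples follow the same connection pattern: load the variables with `python-dotenv`, authenticate with `DefaultAzureCredential`, and open an `AIProjectClient` together with the project's OpenAI client. A minimal sketch (the final print is only illustrative):

```python
import os

from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Pick up AZURE_AI_PROJECT_ENDPOINT and AZURE_AI_MODEL_DEPLOYMENT_NAME from a local .env file, if present.
load_dotenv()

endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"]

with (
    DefaultAzureCredential() as credential,
    AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
    project_client.get_openai_client() as openai_client,
):
    # project_client exposes datasets, insights, etc.; openai_client exposes the evals API.
    print(f"Connected to {endpoint} (model deployment: {model_deployment_name})")
```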

## Sample Index

### Getting Started

| Sample | Description |
|--------|-------------|
| [sample_evaluations_builtin_with_inline_data.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_inline_data.py) | Basic evaluation with built-in evaluators using inline data |
| [sample_eval_catalog.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_eval_catalog.py) | Browse and use evaluators from the evaluation catalog |

### Agent Evaluation

| Sample | Description |
|--------|-------------|
| [sample_agent_evaluation.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_agent_evaluation.py) | Evaluate an agent's responses |
| [sample_agent_response_evaluation.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_agent_response_evaluation.py) | Evaluate agent response quality |
| [sample_agent_response_evaluation_with_function_tool.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_agent_response_evaluation_with_function_tool.py) | Evaluate agent with function tools |
| [sample_model_evaluation.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_model_evaluation.py) | Evaluate model responses directly |

### Evaluator Types

| Sample | Description |
|--------|-------------|
| [sample_evaluations_graders.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_graders.py) | OpenAI graders: label_model, text_similarity, string_check, score_model (a minimal label_model sketch follows this table) |
| [sample_evaluations_ai_assisted.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_ai_assisted.py) | AI-assisted evaluators: Similarity, ROUGE, METEOR, GLEU, F1, BLEU |
| [sample_eval_catalog_code_based_evaluators.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_eval_catalog_code_based_evaluators.py) | Code-based evaluators from the catalog |
| [sample_eval_catalog_prompt_based_evaluators.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_eval_catalog_prompt_based_evaluators.py) | Prompt-based evaluators from the catalog |
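
As orientation for the grader samples, a `label_model` testing criterion is essentially a labeling prompt plus the set of labels that count as passing. The sketch below is illustrative only: the criterion name, labels, prompt, and model name are made up, and the `{{item.query}}` template assumes a data schema with a `query` field; see the sample for the exact shape expected by the service.

```python
from openai.types.eval_create_params import TestingCriterionLabelModel

# Hypothetical sentiment-labeling criterion; set model to your
# AZURE_AI_MODEL_DEPLOYMENT_NAME and adjust the template to your data schema.
label_grader = TestingCriterionLabelModel(
    type="label_model",
    name="sentiment_label",
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": "Classify the sentiment of the user's text."},
        {"role": "user", "content": "{{item.query}}"},
    ],
    labels=["positive", "neutral", "negative"],
    passing_labels=["positive", "neutral"],
)

# Pass it when creating an evaluation, e.g.:
# eval_object = openai_client.evals.create(
#     name="sentiment-eval",
#     data_source_config=...,      # see the data source sketch under "Running a Sample"
#     testing_criteria=[label_grader],
# )
```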

### Insights & Analysis

| Sample | Description |
|--------|-------------|
| [sample_evaluation_compare_insight.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluation_compare_insight.py) | Compare evaluation runs and generate insights |
| [sample_evaluation_cluster_insight.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluation_cluster_insight.py) | Generate cluster insights from evaluation runs |

### Red Team Evaluations

| Sample | Description |
|--------|-------------|
| [sample_redteam_evaluations.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_redteam_evaluations.py) | Security and safety evaluations using red team techniques |

### Agentic Evaluators

Located in the [agentic_evaluators](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators) subfolder:

| Sample | Description |
|--------|-------------|
| [sample_coherence.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_coherence.py) | Evaluate response coherence |
| [sample_fluency.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_fluency.py) | Evaluate response fluency |
| [sample_groundedness.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_groundedness.py) | Evaluate response groundedness |
| [sample_relevance.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_relevance.py) | Evaluate response relevance |
| [sample_intent_resolution.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_intent_resolution.py) | Evaluate intent resolution |
| [sample_response_completeness.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_response_completeness.py) | Evaluate response completeness |
| [sample_task_adherence.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_task_adherence.py) | Evaluate task adherence |
| [sample_task_completion.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_task_completion.py) | Evaluate task completion |
| [sample_task_navigation_efficiency.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_task_navigation_efficiency.py) | Evaluate navigation efficiency |
| [sample_tool_call_accuracy.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_call_accuracy.py) | Evaluate tool call accuracy |
| [sample_tool_call_success.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_call_success.py) | Evaluate tool call success |
| [sample_tool_input_accuracy.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_input_accuracy.py) | Evaluate tool input accuracy |
| [sample_tool_output_utilization.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_output_utilization.py) | Evaluate tool output utilization |
| [sample_tool_selection.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_selection.py) | Evaluate tool selection |
| [sample_generic_agentic_evaluator](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_generic_agentic_evaluator) | Generic agentic evaluator example |

### Advanced Samples

These samples require additional setup or Azure services:

| Sample | Description | Requirements |
|--------|-------------|--------------|
| [sample_evaluations_builtin_with_dataset_id.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_dataset_id.py) | Use uploaded dataset for evaluation | Azure Blob Storage |
| [sample_evaluations_builtin_with_traces.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_traces.py) | Evaluate against Application Insights traces | Azure Application Insights |
| [sample_scheduled_evaluations.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_scheduled_evaluations.py) | Schedule recurring evaluations | RBAC setup |
| [sample_continuous_evaluation_rule.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_continuous_evaluation_rule.py) | Set up continuous evaluation rules | Manual RBAC in Azure Portal |
| [sample_evaluations_score_model_grader_with_image.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_score_model_grader_with_image.py) | Evaluate with image data | Image file |
| [sample_evaluations_builtin_with_inline_data_oai.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_inline_data_oai.py) | Use OpenAI client directly | OpenAI SDK |

## Running a Sample

```bash
# Set environment variables
export AZURE_AI_PROJECT_ENDPOINT="https://your-account.services.ai.azure.com/api/projects/your-project"
export AZURE_AI_MODEL_DEPLOYMENT_NAME="gpt-4o-mini"

# Run a sample
python sample_evaluations_builtin_with_inline_data.py
```
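
Most evaluation samples share the same flow: describe the data schema, create an evaluation with one or more testing criteria, start a run (here with inline JSONL data), and poll until the run finishes. The condensed sketch below is illustrative; the names, the single data row, and the `string_check` criterion are placeholders rather than code from any one sample.

```python
import os
import time

from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.eval_create_params import DataSourceConfigCustom
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileContent,
    SourceFileContentContent,
)

load_dotenv()

with (
    DefaultAzureCredential() as credential,
    AIProjectClient(endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"], credential=credential) as project_client,
    project_client.get_openai_client() as openai_client,
):
    # 1. Describe the shape of each data row.
    data_source_config = DataSourceConfigCustom(
        type="custom",
        item_schema={
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        include_sample_schema=True,
    )

    # 2. Create the evaluation with at least one testing criterion
    #    (a simple string check here; see the samples for model-based graders).
    eval_object = openai_client.evals.create(
        name="quickstart-eval",
        data_source_config=data_source_config,
        testing_criteria=[
            {
                "type": "string_check",
                "name": "mentions_programming",
                "input": "{{item.query}}",
                "operation": "ilike",
                "reference": "programming",
            }
        ],
    )

    # 3. Start a run against inline JSONL data.
    eval_run = openai_client.evals.runs.create(
        eval_id=eval_object.id,
        name="quickstart-run",
        data_source=CreateEvalJSONLRunDataSourceParam(
            type="jsonl",
            source=SourceFileContent(
                type="file_content",
                content=[SourceFileContentContent(item={"query": "I love programming!"})],
            ),
        ),
    )

    # 4. Poll until the run reaches a terminal state, then clean up.
    while eval_run.status not in ("completed", "failed", "canceled"):
        time.sleep(5)
        eval_run = openai_client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
    print(f"Run finished with status: {eval_run.status}")

    openai_client.evals.delete(eval_id=eval_object.id)
```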

## Learn More

- [Azure AI Foundry Documentation](https://learn.microsoft.com/azure/ai-studio/)
@@ -29,8 +29,6 @@

import os
import time
import json
import tempfile
from typing import Union
from pprint import pprint
from dotenv import load_dotenv
@@ -39,7 +37,11 @@
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.eval_create_params import DataSourceConfigCustom, TestingCriterionLabelModel
from openai.types.evals.create_eval_jsonl_run_data_source_param import CreateEvalJSONLRunDataSourceParam, SourceFileID
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
CreateEvalJSONLRunDataSourceParam,
SourceFileContent,
SourceFileContentContent,
)
from openai.types.evals.run_create_response import RunCreateResponse
from openai.types.evals.run_retrieve_response import RunRetrieveResponse

@@ -85,36 +87,23 @@
)
print(f"Evaluation created (id: {eval_object.id}, name: {eval_object.name})")

# Create and upload JSONL data as a dataset
eval_data = [
{"item": {"query": "I love programming!"}},
{"item": {"query": "I hate bugs."}},
{"item": {"query": "The weather is nice today."}},
{"item": {"query": "This is the worst movie ever."}},
{"item": {"query": "Python is an amazing language."}},
]

with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
for item in eval_data:
f.write(json.dumps(item) + "\n")
temp_file_path = f.name

dataset = project_client.datasets.upload_file(
name="sentiment-eval-data",
version=str(int(time.time())),
file_path=temp_file_path,
)
os.unlink(temp_file_path)
print(f"Dataset created (id: {dataset.id}, name: {dataset.name}, version: {dataset.version})")

if not dataset.id:
raise ValueError("Dataset ID is None")

# Create an eval run using the uploaded dataset
# Create an eval run using inline data
eval_run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="Eval Run",
data_source=CreateEvalJSONLRunDataSourceParam(source=SourceFileID(id=dataset.id, type="file_id"), type="jsonl"),
name="Eval Run with Inline Data",
data_source=CreateEvalJSONLRunDataSourceParam(
type="jsonl",
source=SourceFileContent(
type="file_content",
content=[
SourceFileContentContent(item={"query": "I love programming!"}),
SourceFileContentContent(item={"query": "I hate bugs."}),
SourceFileContentContent(item={"query": "The weather is nice today."}),
SourceFileContentContent(item={"query": "This is the worst movie ever."}),
SourceFileContentContent(item={"query": "Python is an amazing language."}),
],
),
),
)
print(f"Evaluation run created (id: {eval_run.id})")

@@ -142,20 +131,19 @@
print(f"Started insight generation (id: {clusterInsight.id})")

while clusterInsight.state not in [OperationState.SUCCEEDED, OperationState.FAILED]:
print(f"Waiting for insight to be generated...")
print("Waiting for insight to be generated...")
clusterInsight = project_client.insights.get(id=clusterInsight.id)
print(f"Insight status: {clusterInsight.state}")
time.sleep(5)

if clusterInsight.state == OperationState.SUCCEEDED:
print("\n✓ Cluster insights generated successfully!")
pprint(clusterInsight)
else:
print("\n✗ Cluster insight generation failed.")

else:
print("\n✗ Evaluation run failed. Cannot generate cluster insights.")

project_client.datasets.delete(name=dataset.name, version=dataset.version)
print("Dataset deleted")

openai_client.evals.delete(eval_id=eval_object.id)
print("Evaluation deleted")
sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_ai_assisted.py
@@ -7,7 +7,8 @@
"""
DESCRIPTION:
Given an AIProjectClient, this sample demonstrates how to use the synchronous
`openai.evals.*` methods to create, get and list evaluation and and eval runs.
`openai.evals.*` methods to create, get and list evaluation and eval runs
with AI-assisted evaluators (Similarity, ROUGE, METEOR, GLEU, F1, BLEU).

USAGE:
python sample_evaluations_ai_assisted.py
@@ -20,54 +21,34 @@
1) AZURE_AI_PROJECT_ENDPOINT - Required. The Azure AI Project endpoint, as found in the overview page of your
Microsoft Foundry project. It has the form: https://<account_name>.services.ai.azure.com/api/projects/<project_name>.
2) AZURE_AI_MODEL_DEPLOYMENT_NAME - Required. The name of the model deployment to use for evaluation.
3) DATASET_NAME - Optional. The name of the Dataset to create and use in this sample.
4) DATASET_VERSION - Optional. The version of the Dataset to create and use in this sample.
5) DATA_FOLDER - Optional. The folder path where the data files for upload are located.
"""

import os

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
DatasetVersion,
)

import time
from pprint import pprint
from openai.types.evals.create_eval_jsonl_run_data_source_param import CreateEvalJSONLRunDataSourceParam, SourceFileID
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
CreateEvalJSONLRunDataSourceParam,
SourceFileContent,
SourceFileContentContent,
)
from openai.types.eval_create_params import DataSourceConfigCustom
from dotenv import load_dotenv
from datetime import datetime


load_dotenv()

endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]

model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "")
dataset_name = os.environ.get("DATASET_NAME", "")
dataset_version = os.environ.get("DATASET_VERSION", "1")

# Construct the paths to the data folder and data file used in this sample
script_dir = os.path.dirname(os.path.abspath(__file__))
data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
data_file = os.path.join(data_folder, "sample_data_evaluation.jsonl")

with (
DefaultAzureCredential() as credential,
AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
project_client.get_openai_client() as client,
):

print("Upload a single file and create a new Dataset to reference the file.")
dataset: DatasetVersion = project_client.datasets.upload_file(
name=dataset_name or f"eval-data-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S_UTC')}",
version=dataset_version,
file_path=data_file,
)
pprint(dataset)

data_source_config = DataSourceConfigCustom(
{
"type": "custom",
@@ -133,9 +114,9 @@
},
]

print("Creating evaluation")
print("Creating evaluation with AI-assisted evaluators")
eval_object = client.evals.create(
name="ai assisted evaluators test",
name="AI assisted evaluators test",
data_source_config=data_source_config,
testing_criteria=testing_criteria, # type: ignore
)
@@ -146,13 +127,42 @@
print("Evaluation Response:")
pprint(eval_object_response)

print("Creating evaluation run")
print("Creating evaluation run with inline data")
eval_run_object = client.evals.runs.create(
eval_id=eval_object.id,
name="dataset",
metadata={"team": "eval-exp", "scenario": "notifications-v1"},
name="inline_data_ai_assisted_run",
metadata={"team": "eval-exp", "scenario": "ai-assisted-inline-v1"},
data_source=CreateEvalJSONLRunDataSourceParam(
source=SourceFileID(id=dataset.id or "", type="file_id"), type="jsonl"
type="jsonl",
source=SourceFileContent(
type="file_content",
content=[
SourceFileContentContent(
item={
"response": "The capital of France is Paris, which is also known as the City of Light.",
"ground_truth": "Paris is the capital of France.",
}
),
SourceFileContentContent(
item={
"response": "Python is a high-level programming language known for its simplicity and readability.",
"ground_truth": "Python is a popular programming language that is easy to learn.",
}
),
SourceFileContentContent(
item={
"response": "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
"ground_truth": "Machine learning allows computers to learn from data without being explicitly programmed.",
}
),
SourceFileContentContent(
item={
"response": "The sun rises in the east and sets in the west due to Earth's rotation.",
"ground_truth": "The sun appears to rise in the east and set in the west because of Earth's rotation.",
}
),
],
),
),
)
print(f"Eval Run created")
@@ -174,8 +184,5 @@
time.sleep(5)
print("Waiting for evaluation run to complete...")

project_client.datasets.delete(name=dataset.name, version=dataset.version)
print("Dataset deleted")

client.evals.delete(eval_id=eval_object.id)
print("Evaluation deleted")