Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion sdk/ai/azure-ai-projects/assets.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,5 @@
"AssetsRepo": "Azure/azure-sdk-assets",
"AssetsRepoPrefixPath": "python",
"TagPrefix": "python/ai/azure-ai-projects",
"Tag": "python/ai/azure-ai-projects_7cddb7d06f"
"Tag": "python/ai/azure-ai-projects_212aab4d9b"
}
105 changes: 105 additions & 0 deletions sdk/ai/azure-ai-projects/samples/evaluations/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Azure AI Projects - Evaluation Samples

This folder contains samples demonstrating how to use Azure AI Foundry's evaluation capabilities with the `azure-ai-projects` SDK.

## Prerequisites

Before running any sample:

```bash
pip install "azure-ai-projects>=2.0.0b1" python-dotenv
```

Set these environment variables:
- `AZURE_AI_PROJECT_ENDPOINT` - Your Azure AI Project endpoint (e.g., `https://<account>.services.ai.azure.com/api/projects/<project>`)
- `AZURE_AI_MODEL_DEPLOYMENT_NAME` - The model deployment name (e.g., `gpt-4o-mini`)

## Sample Index

### Getting Started

| Sample | Description |
|--------|-------------|
| [sample_evaluations_builtin_with_inline_data.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_inline_data.py) | Basic evaluation with built-in evaluators using inline data |
| [sample_eval_catalog.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_eval_catalog.py) | Browse and use evaluators from the evaluation catalog |

### Agent Evaluation

| Sample | Description |
|--------|-------------|
| [sample_agent_evaluation.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_agent_evaluation.py) | Evaluate an agent's responses |
| [sample_agent_response_evaluation.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_agent_response_evaluation.py) | Evaluate agent response quality |
| [sample_agent_response_evaluation_with_function_tool.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_agent_response_evaluation_with_function_tool.py) | Evaluate agent with function tools |
| [sample_model_evaluation.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_model_evaluation.py) | Evaluate model responses directly |

### Evaluator Types

| Sample | Description |
|--------|-------------|
| [sample_evaluations_graders.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_graders.py) | OpenAI graders: label_model, text_similarity, string_check, score_model |
| [sample_evaluations_ai_assisted.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_ai_assisted.py) | AI-assisted evaluators: Similarity, ROUGE, METEOR, GLEU, F1, BLEU |
| [sample_eval_catalog_code_based_evaluators.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_eval_catalog_code_based_evaluators.py) | Code-based evaluators from the catalog |
| [sample_eval_catalog_prompt_based_evaluators.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_eval_catalog_prompt_based_evaluators.py) | Prompt-based evaluators from the catalog |

### Insights & Analysis

| Sample | Description |
|--------|-------------|
| [sample_evaluation_compare_insight.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluation_compare_insight.py) | Compare evaluation runs and generate insights |
| [sample_evaluation_cluster_insight.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluation_cluster_insight.py) | Generate cluster insights from evaluation runs |

### Red Team Evaluations

| Sample | Description |
|--------|-------------|
| [sample_redteam_evaluations.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_redteam_evaluations.py) | Security and safety evaluations using red team techniques |

### Agentic Evaluators

Located in the [agentic_evaluators](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators) subfolder:

| Sample | Description |
|--------|-------------|
| [sample_coherence.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_coherence.py) | Evaluate response coherence |
| [sample_fluency.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_fluency.py) | Evaluate response fluency |
| [sample_groundedness.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_groundedness.py) | Evaluate response groundedness |
| [sample_relevance.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_relevance.py) | Evaluate response relevance |
| [sample_intent_resolution.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_intent_resolution.py) | Evaluate intent resolution |
| [sample_response_completeness.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_response_completeness.py) | Evaluate response completeness |
| [sample_task_adherence.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_task_adherence.py) | Evaluate task adherence |
| [sample_task_completion.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_task_completion.py) | Evaluate task completion |
| [sample_task_navigation_efficiency.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_task_navigation_efficiency.py) | Evaluate navigation efficiency |
| [sample_tool_call_accuracy.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_call_accuracy.py) | Evaluate tool call accuracy |
| [sample_tool_call_success.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_call_success.py) | Evaluate tool call success |
| [sample_tool_input_accuracy.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_input_accuracy.py) | Evaluate tool input accuracy |
| [sample_tool_output_utilization.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_output_utilization.py) | Evaluate tool output utilization |
| [sample_tool_selection.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_tool_selection.py) | Evaluate tool selection |
| [sample_generic_agentic_evaluator](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/ai/azure-ai-projects/samples/evaluations/agentic_evaluators/sample_generic_agentic_evaluator) | Generic agentic evaluator example |

### Advanced Samples

These samples require additional setup or Azure services:

| Sample | Description | Requirements |
|--------|-------------|--------------|
| [sample_evaluations_builtin_with_dataset_id.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_dataset_id.py) | Use uploaded dataset for evaluation | Azure Blob Storage |
| [sample_evaluations_builtin_with_traces.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_traces.py) | Evaluate against Application Insights traces | Azure Application Insights |
| [sample_scheduled_evaluations.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_scheduled_evaluations.py) | Schedule recurring evaluations | RBAC setup |
| [sample_continuous_evaluation_rule.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_continuous_evaluation_rule.py) | Set up continuous evaluation rules | Manual RBAC in Azure Portal |
| [sample_evaluations_score_model_grader_with_image.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_score_model_grader_with_image.py) | Evaluate with image data | Image file |
| [sample_evaluations_builtin_with_inline_data_oai.py](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/samples/evaluations/sample_evaluations_builtin_with_inline_data_oai.py) | Use OpenAI client directly | OpenAI SDK |

## Running a Sample

```bash
# Set environment variables
export AZURE_AI_PROJECT_ENDPOINT="https://your-account.services.ai.azure.com/api/projects/your-project"
export AZURE_AI_MODEL_DEPLOYMENT_NAME="gpt-4o-mini"

# Run a sample
python sample_evaluations_builtin_with_inline_data.py
```

## Learn More

- [Azure AI Foundry Documentation](https://learn.microsoft.com/azure/ai-studio/)
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,6 @@

import os
import time
import json
import tempfile
from typing import Union
from pprint import pprint
from dotenv import load_dotenv
Expand All @@ -39,7 +37,11 @@
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.eval_create_params import DataSourceConfigCustom, TestingCriterionLabelModel
from openai.types.evals.create_eval_jsonl_run_data_source_param import CreateEvalJSONLRunDataSourceParam, SourceFileID
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
CreateEvalJSONLRunDataSourceParam,
SourceFileContent,
SourceFileContentContent,
)
from openai.types.evals.run_create_response import RunCreateResponse
from openai.types.evals.run_retrieve_response import RunRetrieveResponse

Expand Down Expand Up @@ -85,36 +87,23 @@
)
print(f"Evaluation created (id: {eval_object.id}, name: {eval_object.name})")

# Create and upload JSONL data as a dataset
eval_data = [
{"item": {"query": "I love programming!"}},
{"item": {"query": "I hate bugs."}},
{"item": {"query": "The weather is nice today."}},
{"item": {"query": "This is the worst movie ever."}},
{"item": {"query": "Python is an amazing language."}},
]

with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
for item in eval_data:
f.write(json.dumps(item) + "\n")
temp_file_path = f.name

dataset = project_client.datasets.upload_file(
name="sentiment-eval-data",
version=str(int(time.time())),
file_path=temp_file_path,
)
os.unlink(temp_file_path)
print(f"Dataset created (id: {dataset.id}, name: {dataset.name}, version: {dataset.version})")

if not dataset.id:
raise ValueError("Dataset ID is None")

# Create an eval run using the uploaded dataset
# Create an eval run using inline data
eval_run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="Eval Run",
data_source=CreateEvalJSONLRunDataSourceParam(source=SourceFileID(id=dataset.id, type="file_id"), type="jsonl"),
name="Eval Run with Inline Data",
data_source=CreateEvalJSONLRunDataSourceParam(
type="jsonl",
source=SourceFileContent(
type="file_content",
content=[
SourceFileContentContent(item={"query": "I love programming!"}),
SourceFileContentContent(item={"query": "I hate bugs."}),
SourceFileContentContent(item={"query": "The weather is nice today."}),
SourceFileContentContent(item={"query": "This is the worst movie ever."}),
SourceFileContentContent(item={"query": "Python is an amazing language."}),
],
),
),
)
print(f"Evaluation run created (id: {eval_run.id})")

Expand Down Expand Up @@ -142,20 +131,19 @@
print(f"Started insight generation (id: {clusterInsight.id})")

while clusterInsight.state not in [OperationState.SUCCEEDED, OperationState.FAILED]:
print(f"Waiting for insight to be generated...")
print("Waiting for insight to be generated...")
clusterInsight = project_client.insights.get(id=clusterInsight.id)
print(f"Insight status: {clusterInsight.state}")
time.sleep(5)

if clusterInsight.state == OperationState.SUCCEEDED:
print("\n✓ Cluster insights generated successfully!")
pprint(clusterInsight)
else:
print("\n✗ Cluster insight generation failed.")

else:
print("\n✗ Evaluation run failed. Cannot generate cluster insights.")

project_client.datasets.delete(name=dataset.name, version=dataset.version)
print("Dataset deleted")

openai_client.evals.delete(eval_id=eval_object.id)
print("Evaluation deleted")
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,8 @@
"""
DESCRIPTION:
Given an AIProjectClient, this sample demonstrates how to use the synchronous
`openai.evals.*` methods to create, get and list evaluation and and eval runs.
`openai.evals.*` methods to create, get and list evaluation and eval runs
with AI-assisted evaluators (Similarity, ROUGE, METEOR, GLEU, F1, BLEU).

USAGE:
python sample_evaluations_ai_assisted.py
Expand All @@ -20,54 +21,34 @@
1) AZURE_AI_PROJECT_ENDPOINT - Required. The Azure AI Project endpoint, as found in the overview page of your
Microsoft Foundry project. It has the form: https://<account_name>.services.ai.azure.com/api/projects/<project_name>.
2) AZURE_AI_MODEL_DEPLOYMENT_NAME - Required. The name of the model deployment to use for evaluation.
3) DATASET_NAME - Optional. The name of the Dataset to create and use in this sample.
4) DATASET_VERSION - Optional. The version of the Dataset to create and use in this sample.
5) DATA_FOLDER - Optional. The folder path where the data files for upload are located.
"""

import os

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
DatasetVersion,
)

import time
from pprint import pprint
from openai.types.evals.create_eval_jsonl_run_data_source_param import CreateEvalJSONLRunDataSourceParam, SourceFileID
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
CreateEvalJSONLRunDataSourceParam,
SourceFileContent,
SourceFileContentContent,
)
from openai.types.eval_create_params import DataSourceConfigCustom
from dotenv import load_dotenv
from datetime import datetime


load_dotenv()

endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]

model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "")
dataset_name = os.environ.get("DATASET_NAME", "")
dataset_version = os.environ.get("DATASET_VERSION", "1")

# Construct the paths to the data folder and data file used in this sample
script_dir = os.path.dirname(os.path.abspath(__file__))
data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
data_file = os.path.join(data_folder, "sample_data_evaluation.jsonl")

with (
DefaultAzureCredential() as credential,
AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
project_client.get_openai_client() as client,
):

print("Upload a single file and create a new Dataset to reference the file.")
dataset: DatasetVersion = project_client.datasets.upload_file(
name=dataset_name or f"eval-data-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S_UTC')}",
version=dataset_version,
file_path=data_file,
)
pprint(dataset)

data_source_config = DataSourceConfigCustom(
{
"type": "custom",
Expand Down Expand Up @@ -133,9 +114,9 @@
},
]

print("Creating evaluation")
print("Creating evaluation with AI-assisted evaluators")
eval_object = client.evals.create(
name="ai assisted evaluators test",
name="AI assisted evaluators test",
data_source_config=data_source_config,
testing_criteria=testing_criteria, # type: ignore
)
Expand All @@ -146,13 +127,42 @@
print("Evaluation Response:")
pprint(eval_object_response)

print("Creating evaluation run")
print("Creating evaluation run with inline data")
eval_run_object = client.evals.runs.create(
eval_id=eval_object.id,
name="dataset",
metadata={"team": "eval-exp", "scenario": "notifications-v1"},
name="inline_data_ai_assisted_run",
metadata={"team": "eval-exp", "scenario": "ai-assisted-inline-v1"},
data_source=CreateEvalJSONLRunDataSourceParam(
source=SourceFileID(id=dataset.id or "", type="file_id"), type="jsonl"
type="jsonl",
source=SourceFileContent(
type="file_content",
content=[
SourceFileContentContent(
item={
"response": "The capital of France is Paris, which is also known as the City of Light.",
"ground_truth": "Paris is the capital of France.",
}
),
SourceFileContentContent(
item={
"response": "Python is a high-level programming language known for its simplicity and readability.",
"ground_truth": "Python is a popular programming language that is easy to learn.",
}
),
SourceFileContentContent(
item={
"response": "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
"ground_truth": "Machine learning allows computers to learn from data without being explicitly programmed.",
}
),
SourceFileContentContent(
item={
"response": "The sun rises in the east and sets in the west due to Earth's rotation.",
"ground_truth": "The sun appears to rise in the east and set in the west because of Earth's rotation.",
}
),
],
),
),
)
print(f"Eval Run created")
Expand All @@ -174,8 +184,5 @@
time.sleep(5)
print("Waiting for evaluation run to complete...")

project_client.datasets.delete(name=dataset.name, version=dataset.version)
print("Dataset deleted")

client.evals.delete(eval_id=eval_object.id)
print("Evaluation deleted")
Loading
Loading