Merged
Changes from 4 commits
35 changes: 0 additions & 35 deletions examples/alert_triage_agent/.env_example

This file was deleted.

115 changes: 93 additions & 22 deletions examples/alert_triage_agent/README.md
@@ -30,6 +30,10 @@ This example demonstrates how to build an intelligent alert triage system using
- [5. Root Cause Categorization](#5-root-cause-categorization)
- [6. Report Generation](#6-report-generation)
- [7. Analyst Review](#7-analyst-review)
- [Understanding the config](#understanding-the-config)
- [Functions](#functions)
- [Workflow](#workflow)
- [LLMs](#llms)
- [Installation and setup](#installation-and-setup)
- [Install this workflow](#install-this-workflow)
- [Set up environment variables](#set-up-environment-variables)
@@ -141,6 +145,79 @@ The triage agent may call one or more of the following tools based on the alert
#### 7. Analyst Review
- The final report is presented to an Analyst for review, action, or escalation.

### Understanding the config

#### Functions

Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in test mode, using mocked data for simulation.

Example:

```yaml
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: true
```

* `_type`: The name of the tool (matching the name registered in the tool's Python file).
* `llm_name`: The LLM the tool uses to reason over the raw data it fetches.
* `test_mode`: If `true`, the tool uses predefined mock results for offline testing.

Some entries, like `telemetry_metrics_analysis_agent`, are sub-agents that coordinate multiple tools:

```yaml
telemetry_metrics_analysis_agent:
_type: telemetry_metrics_analysis_agent
tool_names:
- telemetry_metrics_host_heartbeat_check
- telemetry_metrics_host_performance_check
llm_name: telemetry_metrics_analysis_agent_llm
```
#### Workflow

The `workflow` section configures the primary triage agent and how it executes.

```yaml
workflow:
_type: alert_triage_agent
tool_names:
- hardware_check
- ...
llm_name: ata_agent_llm
test_mode: true
test_data_path: ...
benign_fallback_data_path: ...
test_output_path: ...
```

* `_type`: The name of the agent (matching the agent's name in `register.py`).
* `tool_names`: List of tools (from the `functions` section) used in the triage process.
* `llm_name`: Main LLM used by the agent for reasoning, tool-calling, and report generation.
* `test_mode`: Enables test execution using predefined input/output instead of real systems.
* `test_data_path`: CSV file containing test alerts and their corresponding mocked tool responses.
* `benign_fallback_data_path`: JSON file with baseline healthy system responses for tools not explicitly mocked.
* `test_output_path`: Output CSV file path where the agent writes triage results. The generated report for each processed alert is stored in a new `output` column.

#### LLMs

The `llms` section defines the available LLMs for various parts of the system.

Example:

```yaml
ata_agent_llm:
_type: nim
model_name: meta/llama-3.3-70b-instruct
temperature: 0.2
max_tokens: 2048
```

* `_type`: Backend type (e.g., `nim` for NVIDIA Inference Microservice).
* `model_name`: LLM model name.
* `temperature`, `top_p`, `max_tokens`: LLM generation parameters (passed directly into the API).

Each tool or agent can use a dedicated LLM tailored for its task.
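For instance, a lighter-weight model can serve as the dedicated reasoning LLM referenced by the tools in the `functions` section via `llm_name`. A minimal sketch (the model name and generation parameters here are illustrative, not prescriptive):

```yaml
tool_reasoning_llm:
  _type: nim
  model_name: meta/llama-3.1-8b-instruct
  temperature: 0.0
  max_tokens: 1024
```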

## Installation and setup

@@ -155,22 +232,9 @@ uv pip install -e ./examples/alert_triage_agent
```

### Set up environment variables
In addition to the `NVIDIA_API_KEY` required by AIQ Toolkit, the following environment variables are required for this example. A sample `.env` file is provided in the [.env_example](.env_example) file:

- `TEST_MODE`: Set to "true" to run the agent in test mode which processes alerts using test data instead of live systems
- `MAINTENANCE_STATIC_DATA_PATH`: Path to CSV file containing static maintenance window data
- `TEST_DATA_RELATIVE_FILEPATH`: Main source of test data in CSV format, containing alerts and their simulated environments to process
- Contains alerts and their corresponding simulated environments represented by mocked tool return values
- When the agent queries tools to understand the alert environment, it receives these pre-configured synthetic responses
- Enables testing in a controlled environment without connecting to real hosts/systems
- `TEST_BENIGN_DATA_RELATIVE_FILEPATH`: JSON file containing baseline/normal system behavior data
- Provides fallback responses for tools not explicitly mocked in the main test data
- Contains "benign" or normal system state data that tools should return when not part of the simulated issue
- Used when the agent queries tools beyond those relevant to the test case, simulating healthy parts of the system
- `TEST_OUTPUT_RELATIVE_FILEPATH`: Path where test results will be saved in CSV format


To load the environment variables from your .env file, run:
As mentioned in the Install Guide, an `NVIDIA_API_KEY` environment variable is required to run AIQ Toolkit.

If you have your key in a `.env` file, use the following command to load it:
```bash
export $(grep -v '^#' .env | xargs)
```
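If you do not have a `.env` file yet, a minimal one for this example needs only the API key (the value below is a placeholder):

```bash
NVIDIA_API_KEY=your-api-key-here
```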
@@ -188,18 +252,24 @@ In live mode, each tool used by the triage agent connects to real systems to col
To run the agent live, follow these steps:

1. **Configure all tools with real environment details**

By default, the agent includes placeholder values for API endpoints, host IP addresses, credentials, and other access parameters. You must:
- Replace these placeholders with the actual values specific to your systems
- Ensure the agent has access permissions to query APIs or connect to hosts
- Test each tool in isolation to confirm it works end-to-end

2. **Add custom tools if needed**

If your environment includes unique systems or data sources, you can define new tools or modify existing ones. This allows your triage agent to pull in the most relevant data for your alerts and infrastructure.

3. **Disable test mode**
Set the environment variable `TEST_MODE=false` to ensure the agent uses real data instead of synthetic test datasets.

Set `test_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of synthetic test datasets.

You can also selectively keep some tools in test mode by leaving their `test_mode` set to `true`, for more granular testing.
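For example, a sketch of a partially live setup (illustrative only; the tool entries mirror the ones in the example config):

```yaml
functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
    test_mode: false   # queries the real IPMI endpoint
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
    test_mode: true    # still answered from mocked test data
```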

4. **Run the agent with a real alert**

Provide a live alert in JSON format and invoke the agent using:

```bash
Expand Down Expand Up @@ -306,9 +376,10 @@ Test mode lets you evaluate the triage agent in a controlled, offline environmen

To run in test mode:
1. **Set required environment variables**
Follow the instructions in the [Set up environment variables](#set-up-environment-variables) section to make sure all variables are set. (`TEST_MODE=true`)

2. **How it works**
Make sure `test_mode: true` is set in both the `workflow` section and the individual tool entries in the `functions` section of your config file (see the [Understanding the config](#understanding-the-config) section).

1. **How it works**
- The **main test CSV** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
- The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's test data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.

@@ -321,12 +392,12 @@ Follow the instructions in the [Set up environment variables](#set-up-environmen
Note: The `--input` value is ignored in test mode.

The agent will:
- Load alerts from the test dataset `TEST_DATA_RELATIVE_FILEPATH`
- Load alerts from the test dataset specified in `test_data_path` in the workflow config
- Simulate an investigation using predefined tool results
- Iterate through all the alerts in the dataset
- Save reports as a new column in a copy of the test CSV file to the path specified in `TEST_OUTPUT_RELATIVE_FILEPATH`
- Save reports as a new column in a copy of the test CSV file to the path specified in `test_output_path` in the workflow config

4. **Understanding the output**
2. **Understanding the output**

The output file will contain a new column named `output`, which includes the markdown report generated by the agent for each data point (i.e., each row in the CSV). Navigate to that rightmost `output` column to view the report for each test entry.
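To read the reports programmatically instead of scrolling through the CSV, a minimal sketch (assumes `pandas` is installed and the default `test_output_path` from the example config):

```python
import pandas as pd

# Path matches test_output_path in the example config
df = pd.read_csv(".tmp/aiq/examples/alert_triage_agent/output/test_output.csv")

# Print the markdown report generated for the first test alert
print(df.loc[0, "output"])
```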

@@ -20,21 +20,27 @@ functions:
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: true
host_performance_check:
_type: host_performance_check
llm_name: tool_reasoning_llm
test_mode: true
monitoring_process_check:
_type: monitoring_process_check
llm_name: tool_reasoning_llm
test_mode: true
network_connectivity_check:
_type: network_connectivity_check
llm_name: tool_reasoning_llm
test_mode: true
telemetry_metrics_host_heartbeat_check:
_type: telemetry_metrics_host_heartbeat_check
llm_name: tool_reasoning_llm
test_mode: true
telemetry_metrics_host_performance_check:
_type: telemetry_metrics_host_performance_check
llm_name: tool_reasoning_llm
test_mode: true
telemetry_metrics_analysis_agent:
_type: telemetry_metrics_analysis_agent
tool_names:
@@ -44,6 +50,7 @@ functions:
maintenance_check:
_type: maintenance_check
llm_name: maintenance_check_llm
static_data_path: examples/alert_triage_agent/data/maintenance_static_dataset.csv
categorizer:
_type: categorizer
llm_name: categorizer_llm
@@ -57,6 +64,11 @@ workflow:
- network_connectivity_check
- telemetry_metrics_analysis_agent
llm_name: ata_agent_llm
test_mode: true
# The below paths are only used if test_mode is true
test_data_path: examples/alert_triage_agent/data/test_data.csv
benign_fallback_data_path: examples/alert_triage_agent/data/benign_fallback_test_data.json
test_output_path: .tmp/aiq/examples/alert_triage_agent/output/test_output.csv

llms:
ata_agent_llm:
@@ -64,24 +64,24 @@ class HardwareCheckToolConfig(FunctionBaseConfig, name="hardware_check"):
"hardware degradation, and anomalies that could explain alerts. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")


@register_function(config_type=HardwareCheckToolConfig)
async def hardware_check_tool(config: HardwareCheckToolConfig, builder: Builder):

async def _arun(host_id: str) -> str:
is_test_mode = utils.is_test_mode()
utils.log_header("Hardware Status Checker")

try:
if not is_test_mode:
if not config.test_mode:
ip = "ipmi_ip" # Replace with your actual IPMI IP address
user = "ipmi_user" # Replace with your actual username
pwd = "ipmi_password" # Replace with your actual password
monitoring_data = _get_ipmi_monitor_data(ip, user, pwd)
else:
# In test mode, load test data from CSV file
df = utils.load_test_data()
df = utils.get_test_data()

# Get IPMI data from test data, falling back to static data if needed
monitoring_data = utils.load_column_or_static(
Expand Down
@@ -32,6 +32,7 @@ class HostPerformanceCheckToolConfig(FunctionBaseConfig, name="host_performance_
"and hardware I/O usage details for a given host. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")


async def _run_ansible_playbook_for_host_performance_check(config: HostPerformanceCheckToolConfig,
@@ -110,11 +111,10 @@ async def _parse_stdout_lines(config, builder, stdout_lines):
async def host_performance_check_tool(config: HostPerformanceCheckToolConfig, builder: Builder):

async def _arun(host_id: str) -> str:
is_test_mode = utils.is_test_mode()
utils.log_header("Host Performance Analyzer")

try:
if not is_test_mode:
if not config.test_mode:
# In production mode, use actual Ansible connection details
# Replace placeholder values with connection info from configuration
ansible_host = "your.host.example.name" # Input your target host
@@ -132,7 +132,7 @@ async def _arun(host_id: str) -> str:
ansible_private_key_path=ansible_private_key_path)
else:
# In test mode, load performance data from test dataset
df = utils.load_test_data()
df = utils.get_test_data()

# Get CPU metrics from test data, falling back to static data if needed
data_top_cpu = utils.load_column_or_static(df=df,
@@ -42,6 +42,10 @@ class MaintenanceCheckToolConfig(FunctionBaseConfig, name="maintenance_check"):
"if the alert can be deprioritized."),
description="Description of the tool for the agent.")
llm_name: LLMRef
static_data_path: str | None = Field(
default="examples/alert_triage_agent/data/maintenance_static_dataset.csv",
description=(
"Path to the static maintenance data CSV file. If not provided, the tool will not check for maintenance."))


def _load_maintenance_data(path: str) -> pd.DataFrame:
@@ -188,6 +192,8 @@ async def maintenance_check(config: MaintenanceCheckToolConfig, builder: Builder
# Set up LLM
llm = await builder.get_llm(config.llm_name, wrapper_type=LLMFrameworkEnum.LANGCHAIN)

maintenance_data_path = config.static_data_path

async def _arun(input_message: str) -> str:
# NOTE: This is just an example implementation of maintenance status checking using a CSV file.
# Users should implement their own maintenance check logic specific to their environment
@@ -196,18 +202,15 @@ async def _arun(input_message: str) -> str:

utils.log_header("Maintenance Checker")

maintenance_data_path = os.getenv("MAINTENANCE_STATIC_DATA_PATH")
if not maintenance_data_path:
utils.logger.info("No maintenance data path provided, skipping maintenance check")
return NO_ONGOING_MAINTENANCE_STR # the triage agent will run as usual

filepath = os.path.join(os.path.abspath(os.path.dirname(__file__)), maintenance_data_path)
if not os.path.exists(filepath):
utils.logger.info("Maintenance data file does not exist: %s. Skipping maintenance check.", filepath)
if not os.path.exists(maintenance_data_path):
utils.logger.info("Maintenance data file does not exist: %s. Skipping maintenance check.",
maintenance_data_path)
return NO_ONGOING_MAINTENANCE_STR # the triage agent will run as usual

maintenance_df = _load_maintenance_data(filepath)

alert = _parse_alert_data(input_message)
if alert is None:
utils.logger.info("Failed to parse alert from input message, skipping maintenance check")
@@ -226,6 +229,7 @@ async def _arun(input_message: str) -> str:
utils.logger.error("Failed to parse alert time from input message: %s, skipping maintenance check", e)
return NO_ONGOING_MAINTENANCE_STR

maintenance_df = _load_maintenance_data(maintenance_data_path)
maintenance_info = _get_active_maintenance(maintenance_df, host, alert_time)
if not maintenance_info:
utils.logger.info("Host: [%s] is NOT under maintenance according to the maintenance database", host)
@@ -31,6 +31,7 @@ class MonitoringProcessCheckToolConfig(FunctionBaseConfig, name="monitoring_proc
"on a target host by executing system commands. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")


async def _run_ansible_playbook_for_monitor_process_check(ansible_host: str,
@@ -70,10 +71,8 @@ async def _run_ansible_playbook_for_monitor_process_check(ansible_host: str,
async def monitoring_process_check_tool(config: MonitoringProcessCheckToolConfig, builder: Builder):

async def _arun(host_id: str) -> str:
is_test_mode = utils.is_test_mode()

try:
if not is_test_mode:
if not config.test_mode:
# In production mode, use actual Ansible connection details
# Replace placeholder values with connection info from configuration
ansible_host = "your.host.example.name" # Input your target host
@@ -89,7 +88,7 @@ async def _arun(host_id: str) -> str:
output_for_prompt = f"`ps` and `top` result:{output}"
else:
# In test mode, load performance data from test dataset
df = utils.load_test_data()
df = utils.get_test_data()

# Load process status data from ps command output
ps_data = utils.load_column_or_static(df=df,