50 changes: 25 additions & 25 deletions examples/alert_triage_agent/README.md
Expand Up @@ -41,7 +41,7 @@ This example demonstrates how to build an intelligent alert triage system using
- [Running in a live environment](#running-in-a-live-environment)
- [Note on credentials and access](#note-on-credentials-and-access)
- [Running live with a HTTP server listening for alerts](#running-live-with-a-http-server-listening-for-alerts)
- [Running in test mode](#running-in-test-mode)
- [Running in offline mode](#running-in-offline-mode)


## Use case description
Expand Down Expand Up @@ -149,20 +149,20 @@ The triage agent may call one or more of the following tools based on the alert

#### Functions

Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in test mode, using mocked data for simulation.
Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in offline mode, using mocked data for simulation.

Example:

```yaml
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
```

* `_type`: Identifies the name of the tool (matching the names in the tools' Python files).
* `llm_name`: LLM used to support the tool’s reasoning of the raw fetched data.
* `test_mode`: If `true`, the tool uses predefined mock results for offline testing.
* `offline_mode`: If `true`, the tool uses predefined mock results for offline testing.
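
As a rough illustration of what the flag controls, the sketch below shows a tool that returns a canned result when `offline_mode` is true and otherwise calls a live collector. `ToolConfig`, `run_tool`, `fake_live`, and `MOCK_RESULT` are hypothetical names used only for this example; the real tools use the toolkit's own config classes and data loaders.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical canned response standing in for the predefined mock results.
MOCK_RESULT = "All sensors nominal (mock data)"


@dataclass
class ToolConfig:
    """Illustrative stand-in for one entry in the `functions` section."""
    llm_name: str
    offline_mode: bool = True  # assumed default for this sketch


def run_tool(config: ToolConfig, host_id: str, fetch_live: Callable[[str], str]) -> str:
    """Return mock data in offline mode, otherwise query the live system."""
    if config.offline_mode:
        return MOCK_RESULT
    return fetch_live(host_id)


def fake_live(host_id: str) -> str:
    """Placeholder for a real collector such as an IPMI or Ansible call."""
    return f"live data for {host_id}"


print(run_tool(ToolConfig(llm_name="tool_reasoning_llm"), "host-1", fake_live))
print(run_tool(ToolConfig(llm_name="tool_reasoning_llm", offline_mode=False), "host-1", fake_live))
```

Keeping the branch inside the tool means the agent calls the tool the same way in both modes; only the data source changes.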

Some entries, like `telemetry_metrics_analysis_agent`, are sub-agents that coordinate multiple tools:

Expand All @@ -185,19 +185,19 @@ workflow:
- hardware_check
- ...
llm_name: ata_agent_llm
test_mode: true
test_data_path: ...
offline_mode: true
offline_data_path: ...
benign_fallback_data_path: ...
test_output_path: ...
offline_output_path: ...
```

* `_type`: The name of the agent (matching the agent's name in `register.py`).
* `tool_names`: List of tools (from the `functions` section) used in the triage process.
* `llm_name`: Main LLM used by the agent for reasoning, tool-calling, and report generation.
* `test_mode`: Enables test execution using predefined input/output instead of real systems.
* `test_data_path`: CSV file containing test alerts and their corresponding mocked tool responses.
* `offline_mode`: Enables offline execution using predefined input/output instead of real systems.
* `offline_data_path`: CSV file containing offline test alerts and their corresponding mocked tool responses.
* `benign_fallback_data_path`: JSON file with baseline healthy system responses for tools not explicitly mocked.
* `test_output_path`: Output CSV file path where the agent writes triage results. Each processed alert adds a new `output` column with the generated report.
* `offline_output_path`: Output CSV file path where the agent writes triage results. Each processed alert adds a new `output` column with the generated report.

#### LLMs

Expand Down Expand Up @@ -240,7 +240,7 @@ export $(grep -v '^#' .env | xargs)
```

## Example Usage
You can run the agent in [test mode](#running-in-test-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Test mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.
You can run the agent in [offline mode](#running-in-offline-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Offline mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.
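
Which mode applies is decided by the `offline_mode` flags in the config, which are set on the workflow and on each tool. The sketch below reports which entries set the flag explicitly, which can help spot a config that mixes offline and live tools. `explicit_offline_flags` is a hypothetical helper, and the dictionary is a hand-written stand-in for a parsed config file, not the toolkit's loader.

```python
from typing import Any, Dict


def explicit_offline_flags(config: Dict[str, Any]) -> Dict[str, bool]:
    """Map each entry that sets `offline_mode` to its value.

    Entries that omit the key are left out, because their effective value
    depends on the default defined by the tool itself.
    """
    flags: Dict[str, bool] = {}
    workflow = config.get("workflow", {})
    if "offline_mode" in workflow:
        flags["workflow"] = bool(workflow["offline_mode"])
    for name, entry in config.get("functions", {}).items():
        if isinstance(entry, dict) and "offline_mode" in entry:
            flags[name] = bool(entry["offline_mode"])
    return flags


# Hand-written example: one tool stays offline while the rest run live.
example_config = {
    "functions": {
        "hardware_check": {"_type": "hardware_check", "offline_mode": True},
        "network_connectivity_check": {"_type": "network_connectivity_check", "offline_mode": False},
        "telemetry_metrics_analysis_agent": {"_type": "telemetry_metrics_analysis_agent"},
    },
    "workflow": {"offline_mode": False},
}

for name, is_offline in explicit_offline_flags(example_config).items():
    print(f"{name}: {'offline' if is_offline else 'live'}")
```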

### Running in a live environment
In live mode, each tool used by the triage agent connects to real systems to collect data. These systems can include:
Expand All @@ -262,11 +262,11 @@ To run the agent live, follow these steps:

If your environment includes unique systems or data sources, you can define new tools or modify existing ones. This allows your triage agent to pull in the most relevant data for your alerts and infrastructure.

3. **Disable test mode**
3. **Disable offline mode**

Set `test_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of synthetic test datasets.
Set `offline_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of offline datasets.

You can also selectively keep some tools in test mode by leaving their `test_mode: true` for more granular testing.
You can also selectively keep some tools in offline mode by leaving their `offline_mode: true` for more granular testing.

4. **Run the agent with a real alert**

Expand Down Expand Up @@ -371,31 +371,31 @@ To use this mode, first ensure you have configured your live environment as desc

You can monitor the progress of the triage process through these logs and the generated reports.

### Running in test mode
Test mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, ideal for development, debugging, and tuning.
### Running in offline mode
Offline mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, ideal for development, debugging, and tuning.
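
The replay can be pictured with a small sketch: each CSV row carries an alert plus mocked tool responses, a benign fallback supplies healthy responses for tools the row does not mock, and a report is written to a new `output` column. The column names, the data, and the helpers `mocked_response` and `replay` are assumptions made for illustration; this is not the agent's actual implementation.

```python
import csv
import io

# Hypothetical offline dataset: one row per alert, one column per mocked tool.
OFFLINE_CSV = """alert_id,hardware_check,network_connectivity_check
a1,fan failure detected,
a2,,packet loss 40%
"""

# Hypothetical benign fallback: healthy responses for tools a row does not mock.
BENIGN_FALLBACK = {
    "hardware_check": "hardware healthy",
    "network_connectivity_check": "network reachable",
}


def mocked_response(row, tool_name):
    """Prefer the alert-specific value; otherwise use the benign fallback."""
    value = (row.get(tool_name) or "").strip()
    return value if value else BENIGN_FALLBACK[tool_name]


def replay(csv_text):
    """Process every alert and return the rows with an added `output` column."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        findings = [f"{tool}: {mocked_response(row, tool)}" for tool in BENIGN_FALLBACK]
        # A real agent would have an LLM write the report; this only joins the findings.
        results.append({**row, "output": "; ".join(findings)})
    return results


for result in replay(OFFLINE_CSV):
    print(result["alert_id"], "->", result["output"])
```

Because the fallback only fills gaps, the alert-specific values remain the one abnormal signal in each simulated investigation.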

To run in test mode:
To run in offline mode:
1. **Enable offline mode in the config**

Make sure `test_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see [Understanding the config](#understanding-the-config) section).
Make sure `offline_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see [Understanding the config](#understanding-the-config) section).

1. **How it works**
- The **main test CSV** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
- The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's test data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
- The **main CSV offline dataset** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
- The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.

3. **Run the agent in test mode**
3. **Run the agent in offline mode**

Run the agent with:
```bash
aiq run --config_file=examples/alert_triage_agent/configs/config_test_mode.yml --input "test_mode"
aiq run --config_file=examples/alert_triage_agent/configs/config_offline_mode.yml --input "offline_mode"
```
Note: The `--input` value is ignored in test mode.
Note: The `--input` value is ignored in offline mode.

The agent will:
- Load alerts from the test dataset specified in `test_data_path` in the workflow config
- Load alerts from the offline dataset specified in `offline_data_path` in the workflow config
- Simulate an investigation using predefined tool results
- Iterate through all the alerts in the dataset
- Save reports as a new column in a copy of the test CSV file to the path specified in `test_output_path` in the workflow config
- Save reports as a new column in a copy of the offline CSV file to the path specified in `offline_output_path` in the workflow config

2. **Understanding the output**

Expand Up @@ -20,27 +20,27 @@ functions:
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: false
offline_mode: false
host_performance_check:
_type: host_performance_check
llm_name: tool_reasoning_llm
test_mode: false
offline_mode: false
monitoring_process_check:
_type: monitoring_process_check
llm_name: tool_reasoning_llm
test_mode: false
offline_mode: false
network_connectivity_check:
_type: network_connectivity_check
llm_name: tool_reasoning_llm
test_mode: false
offline_mode: false
telemetry_metrics_host_heartbeat_check:
_type: telemetry_metrics_host_heartbeat_check
llm_name: tool_reasoning_llm
test_mode: false
offline_mode: false
telemetry_metrics_host_performance_check:
_type: telemetry_metrics_host_performance_check
llm_name: tool_reasoning_llm
test_mode: false
offline_mode: false
telemetry_metrics_analysis_agent:
_type: telemetry_metrics_analysis_agent
tool_names:
Expand All @@ -64,11 +64,11 @@ workflow:
- network_connectivity_check
- telemetry_metrics_analysis_agent
llm_name: ata_agent_llm
test_mode: false
# The below paths are only used if test_mode is true
test_data_path: null
offline_mode: false
# The below paths are only used if offline_mode is true
offline_data_path: null
benign_fallback_data_path: null
test_output_path: null
offline_output_path: null

llms:
ata_agent_llm:
Expand Up @@ -20,28 +20,28 @@ functions:
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
host_performance_check:
_type: host_performance_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
monitoring_process_check:
_type: monitoring_process_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
network_connectivity_check:
_type: network_connectivity_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
telemetry_metrics_host_heartbeat_check:
_type: telemetry_metrics_host_heartbeat_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
telemetry_metrics_host_performance_check:
_type: telemetry_metrics_host_performance_check
llm_name: tool_reasoning_llm
test_mode: true
offline_mode: true
metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
telemetry_metrics_analysis_agent:
_type: telemetry_metrics_analysis_agent
Expand All @@ -66,11 +66,11 @@ workflow:
- network_connectivity_check
- telemetry_metrics_analysis_agent
llm_name: ata_agent_llm
test_mode: true
# The below paths are only used if test_mode is true
test_data_path: examples/alert_triage_agent/data/test_data.csv
benign_fallback_data_path: examples/alert_triage_agent/data/benign_fallback_test_data.json
test_output_path: .tmp/aiq/examples/alert_triage_agent/output/test_output.csv
offline_mode: true
# The below paths are only used if offline_mode is true
offline_data_path: examples/alert_triage_agent/data/offline_data.csv
benign_fallback_data_path: examples/alert_triage_agent/data/benign_fallback_offline_data.json
offline_output_path: .tmp/aiq/examples/alert_triage_agent/output/offline_output.csv

llms:
ata_agent_llm:
Expand Up @@ -33,7 +33,7 @@ class HardwareCheckToolConfig(FunctionBaseConfig, name="hardware_check"):
"hardware degradation, and anomalies that could explain alerts. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")
offline_mode: bool = Field(default=True, description="Whether to run in offline mode")


def _get_ipmi_monitor_data(ip_address, username, password):
Expand Down Expand Up @@ -74,14 +74,14 @@ async def _arun(host_id: str) -> str:
utils.log_header("Hardware Status Checker")

try:
if not config.test_mode:
if not config.offline_mode:
ip = "ipmi_ip" # Replace with your actual IPMI IP address
user = "ipmi_user" # Replace with your actual username
pwd = "ipmi_password" # Replace with your actual password
monitoring_data = _get_ipmi_monitor_data(ip, user, pwd)
else:
# In test mode, load test data from CSV file
df = utils.get_test_data()
# In offline mode, load test data from CSV file
df = utils.get_offline_data()

# Get IPMI data from test data, falling back to static data if needed
monitoring_data = utils.load_column_or_static(
Expand Up @@ -32,7 +32,7 @@ class HostPerformanceCheckToolConfig(FunctionBaseConfig, name="host_performance_
"and hardware I/O usage details for a given host. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")
offline_mode: bool = Field(default=True, description="Whether to run in offline mode")


async def _run_ansible_playbook_for_host_performance_check(config: HostPerformanceCheckToolConfig,
Expand Down Expand Up @@ -113,7 +113,7 @@ async def _arun(host_id: str) -> str:
utils.log_header("Host Performance Analyzer")

try:
if not config.test_mode:
if not config.offline_mode:
# In production mode, use actual Ansible connection details
# Replace placeholder values with connection info from configuration
ansible_host = "your.host.example.name" # Input your target host
Expand All @@ -130,8 +130,8 @@ async def _arun(host_id: str) -> str:
ansible_port=ansible_port,
ansible_private_key_path=ansible_private_key_path)
else:
# In test mode, load performance data from test dataset
df = utils.get_test_data()
# In offline mode, load performance data from test dataset
df = utils.get_offline_data()

# Get CPU metrics from test data, falling back to static data if needed
data_top_cpu = utils.load_column_or_static(df=df,
Expand Up @@ -31,7 +31,7 @@ class MonitoringProcessCheckToolConfig(FunctionBaseConfig, name="monitoring_proc
"on a target host by executing system commands. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")
offline_mode: bool = Field(default=True, description="Whether to run in offline mode")


async def _run_ansible_playbook_for_monitor_process_check(ansible_host: str,
Expand Down Expand Up @@ -72,7 +72,7 @@ async def monitoring_process_check_tool(config: MonitoringProcessCheckToolConfig

async def _arun(host_id: str) -> str:
try:
if not config.test_mode:
if not config.offline_mode:
# In production mode, use actual Ansible connection details
# Replace placeholder values with connection info from configuration
ansible_host = "your.host.example.name" # Input your target host
Expand All @@ -87,8 +87,8 @@ async def _arun(host_id: str) -> str:
ansible_private_key_path=ansible_private_key_path)
output_for_prompt = f"`ps` and `top` result:{output}"
else:
# In test mode, load performance data from test dataset
df = utils.get_test_data()
# In offline mode, load performance data from test dataset
df = utils.get_offline_data()

# Load process status data from ps command output
ps_data = utils.load_column_or_static(df=df,
Expand Up @@ -34,7 +34,7 @@ class NetworkConnectivityCheckToolConfig(FunctionBaseConfig, name="network_conne
"Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")
offline_mode: bool = Field(default=True, description="Whether to run in offline mode")


def _check_service_banner(host: str, port: int = 80, connect_timeout: float = 10, read_timeout: float = 10) -> str:
Expand Down Expand Up @@ -72,7 +72,7 @@ async def _arun(host_id: str) -> str:
utils.log_header("Network Connectivity Tester")

try:
if not config.test_mode:
if not config.offline_mode:
# NOTE: The ping and telnet commands below are example implementations of network connectivity checking.
# Users should implement their own network connectivity check logic specific to their environment
# and infrastructure setup.
Expand All @@ -91,7 +91,7 @@ async def _arun(host_id: str) -> str:

else:
# Load test data
df = utils.get_test_data()
df = utils.get_offline_data()

# Get ping data from test data, falling back to static data if needed
ping_data = utils.load_column_or_static(df=df,