Merged
Changes from 4 commits
35 changes: 0 additions & 35 deletions examples/alert_triage_agent/.env_example

This file was deleted.

115 changes: 93 additions & 22 deletions examples/alert_triage_agent/README.md
@@ -30,6 +30,10 @@ This example demonstrates how to build an intelligent alert triage system using
- [5. Root Cause Categorization](#5-root-cause-categorization)
- [6. Report Generation](#6-report-generation)
- [7. Analyst Review](#7-analyst-review)
- [Understanding the config](#understanding-the-config)
- [Functions](#functions)
- [Workflow](#workflow)
- [LLMs](#llms)
- [Installation and setup](#installation-and-setup)
- [Install this workflow](#install-this-workflow)
- [Set up environment variables](#set-up-environment-variables)
@@ -141,6 +145,79 @@ The triage agent may call one or more of the following tools based on the alert
#### 7. Analyst Review
- The final report is presented to an Analyst for review, action, or escalation.

### Understanding the config

#### Functions

Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in test mode, using mocked data for simulation.

Example:

```yaml
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: true
```

* `_type`: The name of the tool (matching the name registered in the tool's Python file).
* `llm_name`: The LLM the tool uses to reason over the raw data it fetches.
* `test_mode`: If `true`, the tool uses predefined mock results for offline testing.

Some entries, like `telemetry_metrics_analysis_agent`, are sub-agents that coordinate multiple tools:

```yaml
telemetry_metrics_analysis_agent:
_type: telemetry_metrics_analysis_agent
tool_names:
- telemetry_metrics_host_heartbeat_check
- telemetry_metrics_host_performance_check
llm_name: telemetry_metrics_analysis_agent_llm
```
#### Workflow

The `workflow` section configures the primary triage agent and how it executes.

```yaml
workflow:
_type: alert_triage_agent
tool_names:
- hardware_check
- ...
llm_name: ata_agent_llm
test_mode: true
test_data_path: ...
benign_fallback_data_path: ...
test_output_path: ...
```

* `_type`: The name of the agent (matching the agent's name in `register.py`).
* `tool_names`: List of tools (from the `functions` section) used in the triage process.
* `llm_name`: Main LLM used by the agent for reasoning, tool-calling, and report generation.
* `test_mode`: Enables test execution using predefined input/output instead of real systems.
* `test_data_path`: CSV file containing test alerts and their corresponding mocked tool responses.
* `benign_fallback_data_path`: JSON file with baseline healthy system responses for tools not explicitly mocked.
* `test_output_path`: Output CSV file path where the agent writes triage results. The generated report for each processed alert is stored in a new `output` column.

#### LLMs

The `llms` section defines the available LLMs for various parts of the system.

Example:

```yaml
ata_agent_llm:
_type: nim
model_name: meta/llama-3.3-70b-instruct
temperature: 0.2
max_tokens: 2048
```

* `_type`: Backend type (e.g., `nim` for NVIDIA Inference Microservice).
* `model_name`: LLM model name.
* `temperature`, `top_p`, `max_tokens`: LLM generation parameters (passed directly into the API).

Each tool or agent can use a dedicated LLM tailored for its task.
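For instance, a lighter-weight model can serve as the dedicated reasoning LLM referenced by the tools in the `functions` section via `llm_name`. A minimal sketch (the model name and generation parameters here are illustrative, not prescriptive):

```yaml
tool_reasoning_llm:
  _type: nim
  model_name: meta/llama-3.1-8b-instruct
  temperature: 0.0
  max_tokens: 1024
```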

## Installation and setup

@@ -155,22 +232,9 @@ uv pip install -e ./examples/alert_triage_agent
```

### Set up environment variables
In addition to the `NVIDIA_API_KEY` required by AIQ Toolkit, the following environment variables are required for this example. A sample `.env` file is provided in the [.env_example](.env_example) file:

- `TEST_MODE`: Set to "true" to run the agent in test mode which processes alerts using test data instead of live systems
- `MAINTENANCE_STATIC_DATA_PATH`: Path to CSV file containing static maintenance window data
- `TEST_DATA_RELATIVE_FILEPATH`: Main source of test data in CSV format, containing alerts and their simulated environments to process
- Contains alerts and their corresponding simulated environments represented by mocked tool return values
- When the agent queries tools to understand the alert environment, it receives these pre-configured synthetic responses
- Enables testing in a controlled environment without connecting to real hosts/systems
- `TEST_BENIGN_DATA_RELATIVE_FILEPATH`: JSON file containing baseline/normal system behavior data
- Provides fallback responses for tools not explicitly mocked in the main test data
- Contains "benign" or normal system state data that tools should return when not part of the simulated issue
- Used when the agent queries tools beyond those relevant to the test case, simulating healthy parts of the system
- `TEST_OUTPUT_RELATIVE_FILEPATH`: Path where test results will be saved in CSV format


To load the environment variables from your .env file, run:
As mentioned in the Install Guide, an `NVIDIA_API_KEY` environment variable is required to run AIQ Toolkit.

If you have your key in a `.env` file, use the following command to load it:
```bash
export $(grep -v '^#' .env | xargs)
```
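If you do not have a `.env` file yet, a minimal one for this example needs only the API key (the value below is a placeholder):

```bash
NVIDIA_API_KEY=your-api-key-here
```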
@@ -188,18 +252,24 @@ In live mode, each tool used by the triage agent connects to real systems to col
To run the agent live, follow these steps:

1. **Configure all tools with real environment details**

By default, the agent includes placeholder values for API endpoints, host IP addresses, credentials, and other access parameters. You must:
- Replace these placeholders with the actual values specific to your systems
- Ensure the agent has access permissions to query APIs or connect to hosts
- Test each tool in isolation to confirm it works end-to-end

2. **Add custom tools if needed**

If your environment includes unique systems or data sources, you can define new tools or modify existing ones. This allows your triage agent to pull in the most relevant data for your alerts and infrastructure.

3. **Disable test mode**
Set the environment variable `TEST_MODE=false` to ensure the agent uses real data instead of synthetic test datasets.

Set `test_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of synthetic test datasets.

You can also selectively keep some tools in test mode by leaving their `test_mode` set to `true`, for more granular testing.
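For example, a sketch of a partially live setup (illustrative only; the tool entries mirror the ones in the example config):

```yaml
functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
    test_mode: false   # queries the real IPMI endpoint
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
    test_mode: true    # still answered from mocked test data
```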

4. **Run the agent with a real alert**

Provide a live alert in JSON format and invoke the agent using:

```bash
Expand Down Expand Up @@ -306,9 +376,10 @@ Test mode lets you evaluate the triage agent in a controlled, offline environmen

To run in test mode:
1. **Set required environment variables**
Follow the instructions in the [Set up environment variables](#set-up-environment-variables) section to make sure all variables are set. (`TEST_MODE=true`)

2. **How it works**
Make sure `test_mode: true` is set in both the `workflow` section and the individual tool entries in the `functions` section of your config file (see the [Understanding the config](#understanding-the-config) section).

1. **How it works**
- The **main test CSV** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
- The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's test data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.

@@ -321,12 +392,12 @@ Follow the instructions in the [Set up environment variables](#set-up-environmen
Note: The `--input` value is ignored in test mode.

The agent will:
- Load alerts from the test dataset `TEST_DATA_RELATIVE_FILEPATH`
- Load alerts from the test dataset specified in `test_data_path` in the workflow config
- Simulate an investigation using predefined tool results
- Iterate through all the alerts in the dataset
- Save reports as a new column in a copy of the test CSV file to the path specified in `TEST_OUTPUT_RELATIVE_FILEPATH`
- Save reports as a new column in a copy of the test CSV file to the path specified in `test_output_path` in the workflow config

4. **Understanding the output**
2. **Understanding the output**

The output file will contain a new column named `output`, which includes the markdown report generated by the agent for each data point (i.e., each row in the CSV). Navigate to that rightmost `output` column to view the report for each test entry.
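To read the reports programmatically instead of scrolling through the CSV, a minimal sketch (assumes `pandas` is installed and the default `test_output_path` from the example config):

```python
import pandas as pd

# Path matches test_output_path in the example config
df = pd.read_csv(".tmp/aiq/examples/alert_triage_agent/output/test_output.csv")

# Print the markdown report generated for the first test alert
print(df.loc[0, "output"])
```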

@@ -20,21 +20,27 @@ functions:
hardware_check:
_type: hardware_check
llm_name: tool_reasoning_llm
test_mode: true
host_performance_check:
_type: host_performance_check
llm_name: tool_reasoning_llm
test_mode: true
monitoring_process_check:
_type: monitoring_process_check
llm_name: tool_reasoning_llm
test_mode: true
network_connectivity_check:
_type: network_connectivity_check
llm_name: tool_reasoning_llm
test_mode: true
telemetry_metrics_host_heartbeat_check:
_type: telemetry_metrics_host_heartbeat_check
llm_name: tool_reasoning_llm
test_mode: true
telemetry_metrics_host_performance_check:
_type: telemetry_metrics_host_performance_check
llm_name: tool_reasoning_llm
test_mode: true
telemetry_metrics_analysis_agent:
_type: telemetry_metrics_analysis_agent
tool_names:
@@ -44,6 +50,7 @@ functions:
maintenance_check:
_type: maintenance_check
llm_name: maintenance_check_llm
static_data_path: examples/alert_triage_agent/data/maintenance_static_dataset.csv
categorizer:
_type: categorizer
llm_name: categorizer_llm
@@ -57,6 +64,11 @@ workflow:
- network_connectivity_check
- telemetry_metrics_analysis_agent
llm_name: ata_agent_llm
test_mode: true
# The below paths are only used if test_mode is true
test_data_path: examples/alert_triage_agent/data/test_data.csv
benign_fallback_data_path: examples/alert_triage_agent/data/benign_fallback_test_data.json
test_output_path: .tmp/aiq/examples/alert_triage_agent/output/test_output.csv

llms:
ata_agent_llm:
@@ -64,24 +64,24 @@ class HardwareCheckToolConfig(FunctionBaseConfig, name="hardware_check"):
"hardware degradation, and anomalies that could explain alerts. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")


@register_function(config_type=HardwareCheckToolConfig)
async def hardware_check_tool(config: HardwareCheckToolConfig, builder: Builder):

async def _arun(host_id: str) -> str:
is_test_mode = utils.is_test_mode()
utils.log_header("Hardware Status Checker")

try:
if not is_test_mode:
if not config.test_mode:
ip = "ipmi_ip" # Replace with your actual IPMI IP address
user = "ipmi_user" # Replace with your actual username
pwd = "ipmi_password" # Replace with your actual password
monitoring_data = _get_ipmi_monitor_data(ip, user, pwd)
else:
# In test mode, load test data from CSV file
df = utils.load_test_data()
df = utils.get_test_data()

# Get IPMI data from test data, falling back to static data if needed
monitoring_data = utils.load_column_or_static(
Expand Down
@@ -32,6 +32,7 @@ class HostPerformanceCheckToolConfig(FunctionBaseConfig, name="host_performance_
"and hardware I/O usage details for a given host. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")


async def _run_ansible_playbook_for_host_performance_check(config: HostPerformanceCheckToolConfig,
@@ -110,11 +111,10 @@ async def _parse_stdout_lines(config, builder, stdout_lines):
async def host_performance_check_tool(config: HostPerformanceCheckToolConfig, builder: Builder):

async def _arun(host_id: str) -> str:
is_test_mode = utils.is_test_mode()
utils.log_header("Host Performance Analyzer")

try:
if not is_test_mode:
if not config.test_mode:
# In production mode, use actual Ansible connection details
# Replace placeholder values with connection info from configuration
ansible_host = "your.host.example.name" # Input your target host
@@ -132,7 +132,7 @@ async def _arun(host_id: str) -> str:
ansible_private_key_path=ansible_private_key_path)
else:
# In test mode, load performance data from test dataset
df = utils.load_test_data()
df = utils.get_test_data()

# Get CPU metrics from test data, falling back to static data if needed
data_top_cpu = utils.load_column_or_static(df=df,
@@ -42,6 +42,10 @@ class MaintenanceCheckToolConfig(FunctionBaseConfig, name="maintenance_check"):
"if the alert can be deprioritized."),
description="Description of the tool for the agent.")
llm_name: LLMRef
static_data_path: str | None = Field(
default="examples/alert_triage_agent/data/maintenance_static_dataset.csv",
description=(
"Path to the static maintenance data CSV file. If not provided, the tool will not check for maintenance."))


def _load_maintenance_data(path: str) -> pd.DataFrame:
@@ -188,6 +192,8 @@ async def maintenance_check(config: MaintenanceCheckToolConfig, builder: Builder
# Set up LLM
llm = await builder.get_llm(config.llm_name, wrapper_type=LLMFrameworkEnum.LANGCHAIN)

maintenance_data_path = config.static_data_path

async def _arun(input_message: str) -> str:
# NOTE: This is just an example implementation of maintenance status checking using a CSV file.
# Users should implement their own maintenance check logic specific to their environment
@@ -196,18 +202,15 @@ async def _arun(input_message: str) -> str:

utils.log_header("Maintenance Checker")

maintenance_data_path = os.getenv("MAINTENANCE_STATIC_DATA_PATH")
if not maintenance_data_path:
utils.logger.info("No maintenance data path provided, skipping maintenance check")
return NO_ONGOING_MAINTENANCE_STR # the triage agent will run as usual

filepath = os.path.join(os.path.abspath(os.path.dirname(__file__)), maintenance_data_path)
if not os.path.exists(filepath):
utils.logger.info("Maintenance data file does not exist: %s. Skipping maintenance check.", filepath)
if not os.path.exists(maintenance_data_path):
utils.logger.info("Maintenance data file does not exist: %s. Skipping maintenance check.",
maintenance_data_path)
return NO_ONGOING_MAINTENANCE_STR # the triage agent will run as usual

maintenance_df = _load_maintenance_data(filepath)

alert = _parse_alert_data(input_message)
if alert is None:
utils.logger.info("Failed to parse alert from input message, skipping maintenance check")
@@ -226,6 +229,7 @@ async def _arun(input_message: str) -> str:
utils.logger.error("Failed to parse alert time from input message: %s, skipping maintenance check", e)
return NO_ONGOING_MAINTENANCE_STR

maintenance_df = _load_maintenance_data(maintenance_data_path)
maintenance_info = _get_active_maintenance(maintenance_df, host, alert_time)
if not maintenance_info:
utils.logger.info("Host: [%s] is NOT under maintenance according to the maintenance database", host)
@@ -31,6 +31,7 @@ class MonitoringProcessCheckToolConfig(FunctionBaseConfig, name="monitoring_proc
"on a target host by executing system commands. Args: host_id: str"),
description="Description of the tool for the agent.")
llm_name: LLMRef
test_mode: bool = Field(default=True, description="Whether to run in test mode")


async def _run_ansible_playbook_for_monitor_process_check(ansible_host: str,
@@ -70,10 +71,8 @@ async def _run_ansible_playbook_for_monitor_process_check(ansible_host: str,
async def monitoring_process_check_tool(config: MonitoringProcessCheckToolConfig, builder: Builder):

async def _arun(host_id: str) -> str:
is_test_mode = utils.is_test_mode()

try:
if not is_test_mode:
if not config.test_mode:
# In production mode, use actual Ansible connection details
# Replace placeholder values with connection info from configuration
ansible_host = "your.host.example.name" # Input your target host
@@ -89,7 +88,7 @@ async def _arun(host_id: str) -> str:
output_for_prompt = f"`ps` and `top` result:{output}"
else:
# In test mode, load performance data from test dataset
df = utils.load_test_data()
df = utils.get_test_data()

# Load process status data from ps command output
ps_data = utils.load_column_or_static(df=df,