diff --git a/.gitignore b/.gitignore index aeb1579c2b..e0c6cb56cf 100644 --- a/.gitignore +++ b/.gitignore @@ -50,6 +50,7 @@ AGENTS.md .claude .cursor .idea +site/ #scripts at root level /*.sh diff --git a/docs/tutorials/posts/noc-reasoning-agent.md b/docs/tutorials/posts/noc-reasoning-agent.md new file mode 100644 index 0000000000..4ab0522785 --- /dev/null +++ b/docs/tutorials/posts/noc-reasoning-agent.md @@ -0,0 +1,605 @@ +--- +date: 2026-02-13 +readtime: 30 +hide: + - toc +--- + +# Teaching a Model to Reason Over Telecom Network Incidents + +This tutorial walks you through a complete pipeline for fine-tuning a reasoning model that can autonomously diagnose and resolve telecom network incidents. Using Nemo-Skills together with a NoC Reasoning Agent, we will take Qwen3-32B and teach it to perform step-by-step root-cause analysis with tool-calling — the same workflow a human NOC (Network Operations Center) engineer follows today. + +If you're following along, you'll need access to an NVIDIA DGX box (or equivalent) with eight NVIDIA A100 (or newer) GPUs, or a Slurm cluster with similarly configured nodes. The full pipeline — from data processing through training to evaluation — takes several hours depending on dataset size and hardware. + + + +## Background + +### Traditional workflow challenges + +In traditional telco operations, network incidents begin with alarms from network elements (eNodeBs, gNodeBs, routers, transmission links) hitting the fault management system. NOC engineers then validate the alarm by checking multiple systems: + +- FM dashboards +- PM KPIs +- Topology views +- Logs +- Customer-impact tools + +After validation, they perform root-cause analysis and either apply a fix (restarts, reroutes, configuration corrections) or escalate to field teams. Many of these alarms auto-clear, but engineers still spend time triaging them. + +### AI-powered transformation + +A fine-tuned reasoning model automates this entire flow: + +1. 
**Multi-source validation** — Checks multiple OSS/BSS sources via tool calls +2. **Step-by-step RCA** — Performs root-cause analysis methodically +3. **Automated healing** — Triggers healing scripts automatically +4. **Pattern recognition** — Uses historical data patterns to filter out self-recovering alarms + +This leads to dramatic improvements across key operational metrics: + +| Metric | Improvement | +| --- | --- | +| **Efficiency** | Diagnose and resolve incidents in seconds instead of hours | +| **MTTR** | Mean Time to Resolve significantly reduced | +| **Operational Quality** | Consistent, documented actions | +| **Cost** | Lower Opex through automation | + +Events that can be autonomously handled include cell outages, transmission flaps, hardware degradation, congestion spikes, and configuration mismatches. + +The end state is a zero-touch, self-healing network where NOC shifts from firefighting thousands of alarms daily to supervising an intelligent automation layer. + +## Nemo-Skills overview + +Nemo-Skills is a toolkit for evaluating, fine-tuning, and managing LLM workflows. It provides automated job scheduling, data pipeline management, comprehensive logging, and end-to-end pipelines for synthetic data generation, training, and evaluation. 
+ +| Component | Description | +| --- | --- | +| `ns` CLI | Main interface to run all jobs and commands | +| vLLM Server | Flexible inference server for various model sizes | +| TRT-LLM Server | Optimized inference for large models using TensorRT | +| W&B Integration | Optional experiment tracking with Weights & Biases | + +Key orchestration features used throughout this tutorial: + +- `--run_after` — Ensures pipeline steps execute in the proper sequence (dependency management) +- `--cluster=local` — Run jobs on the local machine inside Docker containers +- `--cluster=slurm` — Run jobs on a Slurm cluster +- All outputs are stored for reproducibility and sharing + +## Requirements + +### Hardware + +| Component | Requirement | Purpose | +| --- | --- | --- | +| **GPUs** | 8x NVIDIA A100 or equivalent | Model training and inference | +| **VRAM** | Sufficient for model size | Stores model weights and activations | +| **Multi-GPU** | Recommended | Enables model and batch parallelism | + +| Software | Purpose | +| --- | --- | +| **Docker** | Containerization for consistent environments | +| **NVIDIA Container Toolkit** | Allows Docker containers to access GPU resources | +| **Slurm with NVIDIA/pyxis** (optional) | Cluster job scheduler for distributed workloads | +| **Python 3.10+** | Required Python version | +| **NeMo-Skills CLI** | Main interface for running pipelines | +## Setup + +### Install Nemo-Skills + +Clone the repository, install the package, and navigate to the recipe directory: + +```bash +git clone https://github.com/NVIDIA-NeMo/Skills.git +cd Skills +python3 -m venv venv +source venv/bin/activate +pip install . +cd recipes/noc-reasoning-agent +mkdir -p outputs +``` + +All scripts, prompts, configs, and sample data for this tutorial live in this recipes/noc-reasoning-agent directory. The `outputs/` directory will store all generated files. 
+ +### Configure the cluster + +Run `ns setup` to configure how Nemo-Skills launches containers and mounts your project directory: + +```bash +ns setup +``` + +The setup wizard will prompt you for several settings. The key step is the **mounts** configuration — mount the recipe directory as `/workspace` so that all files are accessible inside the container: + +```text +What type of config would you like to create? (local/slurm): local +What mounts would you like to add? (comma separated): /absolute/path/to/Skills/recipes/noc-reasoning-agent:/workspace,/data/models:/models +``` + +Replace `/absolute/path/to/Skills` with the actual path where you cloned the repository. The second mount (`/data/models:/models`) makes downloaded model weights available inside the container at `/models`. When asked "Would you like to pull/build all the necessary Docker containers now?", answer **Y** to build all required images. + +After completing the wizard, verify the generated `local.yaml` in your cluster configs directory. The mounts section should look like: + +```yaml +mounts: + - /absolute/path/to/Skills/recipes/noc-reasoning-agent:/workspace + - /data/models:/models +env_vars: + - HF_HOME=/workspace + - WANDB_API_KEY= # optional — needed for SFT training logging; omit to disable +``` + +When you run any `ns` command, Nemo-Skills spins up a Docker container with the required software and mounts this directory at `/workspace`. Commands that run inside the container (such as `ns generate`, `ns run_cmd`, and `ns nemo_rl`) use `/workspace/...` paths, while commands that run directly on the host use relative paths from the recipe directory. For more details, see the [Nemo-Skills configs](../../basics/cluster-configs.md) documentation. + +### Set up the environment + +Several scripts import shared modules from sibling directories (e.g. `from scripts.tools import ...`). 
Add the recipe directory to `PYTHONPATH` so Python can resolve these imports: + +```bash +export PYTHONPATH=$(pwd):$PYTHONPATH +export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1 +``` + +The recipe includes a sample `data/synthetic_incidents.csv`. To use your own data, replace this file with your incident CSV (same column schema). + +In the following sections, we always use `--cluster=local`. Change to `--cluster=slurm` (or whatever you named the config) if running on a Slurm cluster. When using Slurm, commands will finish immediately and schedule jobs in the cluster queue. + +## Data Processing + +The pipeline starts with raw incident CSV data. We progressively filter it to keep only actionable, remotely-solvable incidents that are most useful for training. + +### Classify Incidents + +Classify incidents into categories based on solution type: + +- **Soft Solve** — Can be resolved remotely (the target use case for automation) +- **Physical Intervention** — Requires a human on-site +- **Unknown** — Uncategorized + +```bash +python scripts/filtering/match_keywords.py \ + --input_csv data/synthetic_incidents.csv \ + --output_csv data/categorized_incidents.csv +``` + +The script uses keyword matching against resolution codes to assign each incident to a category. 
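A minimal sketch of this keyword-based categorization is shown below. The keyword lists here are illustrative assumptions; the real lists live in `scripts/filtering/match_keywords.py` and will differ:

```python
# Illustrative keyword lists -- the actual lists used by
# scripts/filtering/match_keywords.py are more extensive.
CATEGORY_KEYWORDS = {
    "Soft Solve": ["restart", "reset", "reroute", "remote"],
    "Physical Intervention": ["site visit", "replace", "dispatch"],
}

def categorize(resolution_text: str) -> str:
    """Return the first category whose keyword list matches the resolution text."""
    text = resolution_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Unknown"
```

Incidents whose resolution text matches no keyword list fall into the **Unknown** bucket, which is why the subsequent filtering steps matter.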
+ +### Filter the Dataset + +Apply a series of filters to narrow the dataset to high-quality, actionable incidents: + +``` +# Remove auto-recovered incidents (rows with "Auto Recovered" resolution +# or "Event Cleared with No Action Taken" in the resolution summary) +python scripts/filtering/filter_rows.py \ + --input_csv data/categorized_incidents.csv \ + --output_csv data/filtered_file.csv \ + --filter_type auto + +# Keep only remotely-solvable incidents — the target use case for automation +python scripts/filtering/filter_rows.py \ + --input_csv data/filtered_file.csv \ + --output_csv data/filtered_soft_solve.csv \ + --filter_type soft_solve + +# Keep top 16 fault categories — focusing on common patterns +# ensures the model learns what will be most frequently useful +python scripts/filtering/filter_rows.py \ + --input_csv data/filtered_soft_solve.csv \ + --output_csv data/filtered_problem_codes.csv \ + --filter_type problem_codes + +# Keep top 10 resolution methods to create the finalized dataset +python scripts/filtering/filter_rows.py \ + --input_csv data/filtered_problem_codes.csv \ + --output_csv data/finalized_dataset.csv \ + --filter_type close_codes +``` + +### Convert to JSONL + +Convert the filtered CSV into the JSONL format required by Nemo-Skills: + +``` +python scripts/utils/create_input_jsonl_from_incidents.py \ + --input data/finalized_dataset.csv \ + --output outputs/input_incident.jsonl \ + --examples_by_problem_code 1000 +``` + +The `--examples_by_problem_code 1000` flag limits to 1000 examples per fault category for a balanced training set. + +## Synthetic Data Generation + +With the input data prepared, we use a powerful teacher model to generate structured reasoning traces. This is a two-phase process: first we generate structured resolution procedures, then we inject detailed reasoning into each step. 
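Conceptually, each incident record is enriched in two passes. The field names and values below are purely illustrative, not the recipe's actual schema:

```python
# Phase 1 output: a structured, step-by-step resolution procedure
# (illustrative shape only).
phase1_record = {
    "incident_summary": "Cell outage on affected site",
    "steps": ["Query the alarm", "Check PM KPIs", "Execute remote restart"],
}

# Phase 2 adds a reasoning trace for every procedural step
# (again, an illustrative shape, not the real output format).
phase2_record = {
    **phase1_record,
    "reasoning": [
        "The alarm must be validated against FM before acting.",
        "KPI degradation confirms the outage is service-affecting.",
        "A remote restart is the least invasive soft-solve action.",
    ],
}
```

Keeping the two phases separate lets the teacher model focus on one task at a time: structure first, then the thinking behind each step.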
+ +### Download the Teacher Model + +Before generating synthetic data, download the teacher model weights so they are available inside the container. Download the model to the `/data/models` directory (which is mounted at `/models` inside the container): + +``` +python -c "from huggingface_hub import snapshot_download; snapshot_download('openai/gpt-oss-120b', local_dir='/data/models/gpt-oss-120b')" +``` + +This places the model under `/data/models/gpt-oss-120b` on the host, which maps to `/models/gpt-oss-120b` inside the container. If the model is gated, you will need to log in first with `huggingface-cli login` or set the `HF_TOKEN` environment variable. + +### Phase 1: Generate Structured Procedures + +Use the teacher model ([gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)) to generate step-by-step incident resolution procedures: + +``` +ns generate \ + --cluster=local \ + --server_type=vllm \ + --expname=gpt-oss-sdg-with-python \ + --model=/models/gpt-oss-120b \ + --server_gpus=8 \ + --output_dir=/workspace/outputs/sdg/ \ + --input_file=/workspace/outputs/input_incident.jsonl \ + ++prompt_config=/workspace/prompts/formatting_prompt.yaml \ + ++inference.tokens_to_generate=8192 \ + ++inference.temperature=0.6 \ + ++chat_template_kwargs.reasoning_effort=medium \ + ++inference.endpoint_type=text \ + ++code_execution=false \ + ++server.enable_soft_fail=True \ + ++skip_filled=False --rerun_done +``` + +Key parameters: + +- `--server_type=vllm` — Uses the vLLM inference server +- `--server_gpus=8` — Distributes inference across 8 GPUs +- `++inference.tokens_to_generate=8192` — Maximum output length per example +- `++inference.temperature=0.6` — Controls randomness (lower = more deterministic) +- `++server.enable_soft_fail=True` — Continues on non-critical errors +- `++skip_filled=False --rerun_done` — Regenerates all outputs from scratch + +The `ns generate` command starts a vLLM server, sends each incident through the prompt template in 
`formatting_prompt.yaml`, and writes the results to `outputs/sdg/output.jsonl`. For more details about the generation pipeline, see the [generation](../../pipelines/generation.md) documentation. + +> **Note — Re-running generation pipelines:** +> `ns generate` creates both an `output.jsonl` and an `output.jsonl.done` sentinel file in the output directory. If you need to re-run a generation step from scratch, delete **both** files before restarting: +> +> `rm outputs/sdg/output.jsonl outputs/sdg/output.jsonl.done` +> +> The same applies to any `ns generate` output directory (e.g., `outputs/sdg_reason/`). Without deleting these files, the pipeline will skip generation and reuse the existing results. + +> **Note — Fix output directory permissions:** +> The `ns generate` command runs inside a Docker container as `root`, so the output directory and files will be owned by `root`. Before running any local scripts that write to this directory, fix the permissions: +> +> `sudo chown -R $(whoami):$(whoami) outputs/sdg/` + +### Parse and Format Steps + +Extract structured resolution steps from the raw model output: + +``` +python scripts/utils/format_reasoning_json.py \ + --input outputs/sdg/output.jsonl \ + --output outputs/sdg/formatted_output.json \ + --jsonl_file outputs/input_incident.jsonl \ + --parse_type steps_extraction +``` + +This transforms the raw model output into structured JSON with extracted reasoning steps. 
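The core of this extraction can be sketched as a simple pattern match over numbered steps. This is only an illustration of the idea; the real parser in `scripts/utils/format_reasoning_json.py` handles the teacher model's actual output format:

```python
import re

def extract_steps(generation: str) -> list[str]:
    """Pull numbered resolution steps ("1. ...", "2. ...") out of raw model text.

    Illustrative only: assumes the teacher emits one numbered step per line,
    which is a simplification of the real output format.
    """
    steps = re.findall(r"^\s*\d+\.\s+(.+)$", generation, flags=re.MULTILINE)
    return [step.strip() for step in steps]
```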
+ +### Phase 2: Inject Reasoning Traces + +Run the teacher model again to add detailed thinking traces to each procedural step: + +``` +ns generate \ + --cluster=local \ + --server_type=vllm \ + --expname=gpt-oss-sdg-reasoning \ + --model=/models/gpt-oss-120b \ + --server_gpus=8 \ + --output_dir=/workspace/outputs/sdg_reason/ \ + --input_file=/workspace/outputs/sdg/formatted_output.json \ + ++prompt_config=/workspace/prompts/shortened_prompt_reasoning.yaml \ + ++inference.tokens_to_generate=8192 \ + ++inference.temperature=0.6 \ + ++chat_template_kwargs.reasoning_effort=medium \ + ++inference.endpoint_type=text \ + ++code_execution=false \ + ++skip_filled=False --rerun_done \ + ++server.enable_soft_fail=True +``` + +> **Note — Fix output directory permissions:** +> As with Phase 1, fix the permissions on the new output directory before running local scripts: +> +> `sudo chown -R $(whoami):$(whoami) outputs/sdg_reason/` + +### Compile Training Data + +Merge the structured procedures with reasoning traces into a model-ingestable format: + +``` +python scripts/utils/format_reasoning_json.py \ + --input outputs/sdg/output.jsonl \ + --output_dir outputs/sdg/full_data \ + --jsonl_file outputs/input_incident.jsonl \ + --reasoning_jsonl outputs/sdg_reason/output.jsonl \ + --parse_type compile_reasoning +``` + +This step: + +- Tokenizes content for the target model +- Compresses and squashes reasoning steps +- Injects multi-step reasoning tokens for Qwen +- Organizes data into a curriculum based on reasoning complexity + +## Model Training + +With synthetic data generated, we fine-tune the model using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) with the Megatron backend. 
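Before training, it helps to picture the target format that the compile step above produces: a reasoning trace wrapped in Qwen-style `<think>` tags ahead of the visible answer. This is a rough sketch under that assumption; the recipe's `compile_reasoning` step performs the actual tokenization, squashing, and curriculum ordering:

```python
def to_qwen_target(reasoning: str, answer: str) -> str:
    """Wrap a reasoning trace in Qwen-style <think> tags before the final answer.

    A simplified sketch: the real pipeline also compresses multi-step
    reasoning and orders examples by complexity.
    """
    return f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
```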
+ +### Prepare SFT Data + +First, split the data into training and testing sets: + +``` +python scripts/utils/split_incident_data.py \ + --input_dir outputs/sdg/full_data \ + --train_output outputs/training_data_split.jsonl \ + --test_output outputs/testing_data_split.jsonl +``` + +Then prepare the data in the format required for supervised fine-tuning. This command runs inside the Nemo-Skills container via `ns run_cmd`: + +``` +ns run_cmd \ + --log_dir=/workspace/prepare-sft-data-incidence \ + --expname=prep-sft-data-inci \ + --run_after=gpt-oss-sdg-with-python \ + --cluster=local \ + 'python -m nemo_skills.training.prepare_data \ + --config-path /workspace/configs \ + --config-name noc_reasoning_sft \ + input_files=/workspace/outputs/training_data_split.jsonl \ + output_path=/workspace/outputs/sft-data-incidence.jsonl \ + prompt_config=/workspace/prompts/prompt_incident.yaml \ + tokenizer=Qwen/Qwen3-32B \ + filters.remove_contaminated=false \ + add_unlabeled=true \ + filters.trim_solutions=false' +``` + +Key parameters: + +- `ns run_cmd` executes a command within the Nemo-Skills Docker container +- `--run_after=gpt-oss-sdg-with-python` ensures this step runs after synthetic data generation completes +- `tokenizer=Qwen/Qwen3-32B` specifies the target model's tokenizer +- `filters.remove_contaminated=false` keeps all data (no decontamination filtering) + +The prompt template in `prompt_incident.yaml` defines the NOC engineer system prompt and the 17 available tool definitions (query_alarm, query_resource_health, execute_remote_action, create_trouble_ticket, verify_recovery, etc.) that the model will learn to call during reasoning. 
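To make the tool-calling setup concrete, here is a hypothetical shape for one of those tool definitions. The tool name `query_alarm` comes from the recipe; the exact schema fields are an assumption for illustration and the real definitions live in `prompt_incident.yaml`:

```python
# Hypothetical tool definition shape -- the authoritative definitions
# are in prompts/prompt_incident.yaml.
QUERY_ALARM_TOOL = {
    "name": "query_alarm",
    "description": "Fetch active alarm details for a network element.",
    "parameters": {
        "type": "object",
        "properties": {
            "incident_identifier": {"type": "string"},
        },
        "required": ["incident_identifier"],
    },
}
```

During SFT, the model sees these definitions in the system prompt and learns to emit structured calls against them as part of its reasoning trace.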
+ +### Run SFT Training + +Fine-tune [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) using NeMo-RL with the Megatron backend: + +``` +ns nemo_rl sft \ + --cluster=local \ + --expname=training \ + --output_dir=/models/training \ + --hf_model=Qwen/Qwen3-32B \ + --num_nodes=1 \ + --num_gpus=8 \ + --training_data=/workspace/outputs/sft-data-incidence.jsonl \ + --backend=megatron \ + --final_hf_path=/models/training/qwen3-32b-improved-hf \ + ++sft.max_num_epochs=1 \ + ++policy.megatron_cfg.tensor_model_parallel_size=8 \ + ++policy.megatron_cfg.activation_checkpointing=True \ + ++policy.megatron_cfg.sequence_parallel=True \ + ++policy.model_name=Qwen/Qwen3-32B \ + ++policy.max_total_sequence_length=16384 \ + ++policy.train_global_batch_size=32 \ + ++policy.optimizer.kwargs.lr=1e-5 \ + ++checkpointing.save_weights_only=true \ + ++checkpointing.keep_top_k=1 \ + ++policy.lr=1e-5 +``` + +Key training parameters: + +- `--hf_model=Qwen/Qwen3-32B` — Base model from HuggingFace +- `--backend=megatron` — Uses Megatron for distributed training +- `tensor_model_parallel_size=8` — Splits the model across all 8 GPUs +- `activation_checkpointing=True` — Reduces memory usage by recomputing activations during the backward pass +- `max_total_sequence_length=16384` — Sets the context length for reasoning traces +- `train_global_batch_size=32` — Batch size for training +- `lr=1e-5` — Conservative learning rate appropriate for fine-tuning + +To learn more about SFT configuration, see the [Nemo-Skills training](../../pipelines/training.md) documentation. + +## Evaluation + +To evaluate the fine-tuned model, we use a ReAct (Reasoning + Acting) agent that calls NOC tools at each step, then compare its incident resolution accuracy against the baseline model. 
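The reason-act-observe loop at the heart of the ReAct agent can be sketched in a few lines. This is a dependency-free simplification: the `model` and `tools` interfaces below are hypothetical stand-ins, and the recipe's real agent is built with LangGraph in `scripts/create_agent_with_tools_batch.py`:

```python
def react_loop(model, tools, incident, max_turns=8):
    """Minimal reason-act-observe loop (illustrative).

    Assumes `model(messages)` returns either ("call", tool_name, args)
    or ("final", close_code), and `tools` maps tool names to callables.
    """
    messages = [("user", incident)]
    for _ in range(max_turns):
        kind, *rest = model(messages)
        if kind == "final":
            return rest[0]                      # predicted close code
        tool_name, args = rest
        observation = tools[tool_name](**args)  # act, then observe
        messages.append(("tool", observation))
    return None  # no final answer within the turn budget
```

Each turn, the model either requests a tool call (whose observation is fed back into the conversation) or commits to a final close code, mirroring how a NOC engineer alternates between checking systems and deciding.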
+ +### Prepare Test Data + +Prepare the test set in the same format as training: + +```bash +ns run_cmd \ + --log_dir=/workspace/prepare-test-data-incidence \ + --expname=prep-test-data-inci \ + --run_after=gpt-oss-sdg-with-python \ + --cluster=local \ + 'python -m nemo_skills.training.prepare_data \ + --config-path /workspace/configs \ + --config-name noc_reasoning_sft \ + input_files=/workspace/outputs/testing_data_split.jsonl \ + output_path=/workspace/outputs/sft-test-incidence.jsonl \ + prompt_config=/workspace/prompts/prompt_incident.yaml \ + tokenizer=Qwen/Qwen3-32B \ + filters.remove_contaminated=false \ + add_unlabeled=true \ + filters.trim_solutions=false' +``` + +### Build Agent Input + +Create the ReAct agent input file containing incident prompts with tool response data: + +```bash +python scripts/ns_pipelines/prepare_react_agent.py \ + outputs/testing_data_split.jsonl \ + outputs/sft-test-incidence.jsonl \ + --output outputs/final_agent_input.jsonl \ + --prompt_config prompts/prompt_incident.yaml +``` + +> **Note:** You will see "No tools for incident INCME-XXXXXX, skipping" messages — this is expected. These come from iteration-0 rows that have no tool calls yet. The script correctly uses later iteration rows for each incident. The final output should contain one row per test incident with all tool responses and a valid `expected` Close Code. 
+ +### Install Agent Dependencies + +Install the additional libraries needed for the ReAct agent: + +```bash +pip install --upgrade langgraph langchain langchain-huggingface transformers torch accelerate pandas +``` + +These libraries provide: + +- `langgraph` — Framework for building agent workflows +- `langchain` / `langchain-huggingface` — LLM orchestration with HuggingFace integration +- `transformers` — HuggingFace model library +- `torch` / `accelerate` — PyTorch and distributed inference utilities +- `pandas` — Data manipulation + +### Run the Fine-Tuned Agent + +``` +python scripts/create_agent_with_tools_batch.py \ + --input outputs/final_agent_input.jsonl \ + --output outputs/agent_responses.jsonl \ + --weights_dir /data/models/training/qwen3-32b-improved-hf +``` + +> **Note — Resume vs. fresh run:** +> The agent script resumes by default — if `agent_responses.jsonl` already exists, it skips previously processed rows. To start over, either delete the output file (`rm outputs/agent_responses.jsonl`) or pass `--fresh`. + +### Run the Baseline Agent + +For comparison, run the same evaluation using the original (non-fine-tuned) base Qwen3-32B model. +This lets you measure how much the SFT training improved close-code accuracy: + +```bash +python scripts/create_agent_with_tools_batch.py \ + --input outputs/final_agent_input.jsonl \ + --output outputs/baseline_agent_responses.jsonl \ + --weights_dir Qwen/Qwen3-32B +``` + +### Compare Results + +Evaluate both models by computing close-code accuracy (how often the model selects the correct resolution method): + +```bash +# Fine-tuned model +python scripts/evaluation/problem_code_evaluation.py outputs/agent_responses.jsonl + +# Baseline model +python scripts/evaluation/problem_code_evaluation.py outputs/baseline_agent_responses.jsonl +``` + +The evaluation script matches the model's predicted close code against the expected answer using synonym-aware matching (e.g. 
"Resolved" and "Issue Corrected" are both recognized). + +#### Expected Results + +Using the provided synthetic dataset (21 test incidents), you should see results similar to: + +| Model | Total | Correct | Incorrect | Failed | Accuracy | +| --- | --- | --- | --- | --- | --- | +| **Fine-tuned Qwen3-32B** | 21 | 19 | 2 | 0 | **90.5%** | +| **Baseline Qwen3-32B** | 21 | 17 | 4 | 0 | **81.0%** | + +With larger and more diverse training datasets, the fine-tuned model is expected to show a clearer accuracy gap over the baseline, particularly on complex multi-step incidents requiring domain-specific reasoning. + +## Quick Reference + +### Directory Structure + +```text +Skills/recipes/noc-reasoning-agent/ +├── scripts/ +│ ├── filtering/ # Data filtering scripts +│ ├── utils/ # Utility scripts +│ ├── evaluation/ # Evaluation scripts +│ ├── ns_pipelines/ # NeMo-Skills pipeline scripts +│ ├── tools.py # NOC tool definitions +│ └── create_agent_with_tools_batch.py +├── data/ +│ └── synthetic_incidents.csv # Sample incident data +├── prompts/ +│ ├── formatting_prompt.yaml # Phase 1 SDG prompt +│ ├── shortened_prompt_reasoning.yaml # Phase 2 reasoning prompt +│ └── prompt_incident.yaml # NOC system prompt + tool definitions +├── configs/ +│ └── noc_reasoning_sft.yaml # SFT data preparation config +└── outputs/ # Created during pipeline execution + ├── sdg/ # Synthetic data generation outputs + ├── sdg_reason/ # Reasoning trace outputs + └── *.jsonl # Processed data files +``` + +### Common Commands + +| Task | Command | +| --- | --- | +| Activate environment | `source myenv/bin/activate` | +| Set Python path | `export PYTHONPATH=$(pwd):$PYTHONPATH` | +| Check cluster config | `cat cluster_configs/local.yaml` | +| Re-run setup | `ns setup` | +| View Docker images | `docker images` | + +## Troubleshooting + +### Permission denied on output directories + +If you see `PermissionError: [Errno 13] Permission denied` when writing to output directories: + +``` +sudo chown -R 
$(whoami):$(whoami) ./outputs/ +``` + +### HF_HOME error + +If you see `Please add a new variable: HF_HOME=/mounted/path/to/your/hf_home`, ensure your `local.yaml` contains: + +```yaml +env_vars: + - HF_HOME=/workspace +``` + +Then re-run the failing `ns` command. + +### Docker container build failures + +If Docker containers fail to build or pull, try a clean reinstall of Nemo-Skills: + +``` +pip uninstall nemo_skills -y +pip cache purge +cd /path/to/Skills +pip install . +ns setup +``` + +## What's next? + +With Nemo-Skills, you can easily extend this pipeline in several directions: + +- **Scale the dataset** — Generate more synthetic incidents or add new fault categories to broaden coverage. +- **Add more tools** — Extend the tool set beyond the 17 NOC tools to cover additional operational workflows. +- **Multi-turn reasoning** — Experiment with longer reasoning chains by increasing `tokens_to_generate` and `max_total_sequence_length`. +- **Deploy with vLLM** — Serve the fine-tuned model using the [start-server pipeline](../../pipelines/start-server.md) for production inference. + +All the commands used in this tutorial can be combined into a single Python script using the Nemo-Skills [Python API](../../pipelines/index.md#python-interface), enabling end-to-end reproducibility. With just one line change (`--cluster=slurm`), you can transition from local prototyping to large-scale experiments on a Slurm cluster. + +This pipeline demonstrates that the same synthetic-data-generation and fine-tuning approach that works for math reasoning can be applied to real-world industrial domains like telecom network operations — teaching models not just to think, but to act. 
diff --git a/recipes/noc-reasoning-agent/configs/config.ini b/recipes/noc-reasoning-agent/configs/config.ini new file mode 100644 index 0000000000..4fe3d9f3a7 --- /dev/null +++ b/recipes/noc-reasoning-agent/configs/config.ini @@ -0,0 +1,10 @@ +[download] +qwen2.5-14=/workspace/models/Qwen2.5-14B-Instruct +qwen2.5-32=/workspace/models/Qwen2.5-32B-Instruct +gpt-oss-120b=/workspace/models/gpt-oss-120b +nemotron-49b-1.5=/workspace/models/Llama-3_3-Nemotron-Super-49B-v1_5 + + +[data_path] +original_data_path=data/anonymized-Incidents_Last_6_Months.csv +incident_json_data=outputs/input_incident.jsonl diff --git a/recipes/noc-reasoning-agent/configs/noc_reasoning_sft.yaml b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft.yaml new file mode 100644 index 0000000000..45ec295844 --- /dev/null +++ b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft.yaml @@ -0,0 +1,150 @@ +processors_to_run: all + +output_path: ??? +# prompt_config: null +preprocessed_dataset_files: null +input_files: null + + +# --- Data Keys --- +# These keys MUST match the output of your preprocessing script. +input_key: "background" +output_key: "response" + +# --- SFT Formatting --- +# Define the tokenizer and the final chat format for the model. +tokenizer: "Qwen/Qwen3-32B" # EDIT THIS or override via CLI + +# This uses a separate YAML file to define the chat template. +# This makes the configuration cleaner and more reusable. +prompt_config: "/workspace/data/prompt_incident.yaml" # EDIT THIS or override via CLI + +# ----------------- +# --- General Settings --- +# ----------------- +do_shuffle: false +deduplicate: true +exclude_optional_keys: true +random_seed: 42 +num_output_samples: null +add_correct: true +add_incorrect: false +add_unlabeled: true # Set to true as requested + +# ----------------- +# --- Quality Control Filters --- +# ----------------- +# Most filters are disabled because the data is text-based reasoning, not math or code. 
+# This prevents the pipeline from incorrectly discarding valid data. + + +contamination_file: null + +filters: + drop_multi_boxed: false + remove_contaminated: false # can be enabled if you have a contamination file + majority_filter: false + trim_solutions: false # Your data doesn't use the \boxed{} syntax + trim_prefix: false + drop_incorrect_arithmetic: false + split_arithmetic: false + remove_len_outlier_problems: false # Keep all data regardless of length + remove_len_outlier_solutions: false + code_text_filter: null + remove_code_errors: false + remove_verification_code: false + remove_matplotlib: false + remove_no_code: false + remove_no_think_tags: false # Enabled, as requested + +# ================================================================================= # +# Processor Pipeline (Usually does not need to be changed) # +# ================================================================================= # +processors: + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.ReadData + input_files: ${input_files} # This line ensures the processor gets the input file path + input_key: ${input_key} + output_key: ${output_key} + add_unlabeled: ${add_unlabeled} + deduplicate: ${deduplicate} + keys_to_keep: + - "expected_answer" + - "incident_identifier" + - "incident_classification" + - "urgency_level" + - "geographical_territory" + - "incident_subtype" + - "service_domain" + - "equipment_provider" + - "operational_zone" + - "affected_site" + - "incident_summary" + - "detection_timestamp" + - "escalation_date" + - "responsible_team" + - "fault_category" + - "action_chronicle" + - "resolution_summary" + - "resolution_method" + - "problem_code_reasoning_process" + + - _target_: nemo_skills.training.data_preparation_utils.merge_processor.MergeProcessor + _recursive_: false + processor_configs: + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveContaminated + should_run: ${filters.remove_contaminated} + contamination_file: 
${contamination_file} + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch #removing errors + should_run: ${filters.remove_code_errors} + text_key: ${output_key} + regex_patterns: + - 'Traceback (most recent call last)' + - '' + - 'Timed out' + - 'SyntaxError' + test_cases: + - { input: { generation: "My solution:\n---Traceback (most recent call last)---\nSomething else" }, output: null } + - { input: { generation: "My solution:\nSome long output\nSomething else" }, output: null } + - { input: { generation: "My solution:\nTimed out\nSomething else" }, output: null } + - { input: { generation: "My solution:\n[0;31mSyntaxError\u001b\nSomething else" }, output: null } + - { input: { generation: "My solution, no errors" }, output: { generation: "My solution, no errors" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch # filtering out tool calling + should_run: ${filters.remove_no_code} + text_key: ${output_key} + regex_patterns: + - '' + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: null } + - { input: { generation: "My solution:\ncode\nSomething else" }, output: null } + - { input: { generation: "code" }, output: { generation: "code" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch # filtering out if no think tags + should_run: ${filters.remove_no_think_tags} + text_key: ${output_key} + regex_patterns: + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: { generation: "My solution:\n------\nSomething else" } } + - { input: { generation: "My solution:\n\nSomething else" }, output: null } + - { input: { generation: "thinkingsummary" }, output: { generation: "thinkingsummary" } } + + + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.GroupSamples + group_key: ${input_key} + + - _target_: 
nemo_skills.training.data_preparation_utils.preprocessing.ShuffleAndDownsampleData + num_samples: ${num_output_samples} + random_seed: ${random_seed} + do_shuffle: ${do_shuffle} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.WriteFinalSftManifest + output_manifest_file: ${output_path} + prompt_config: ${prompt_config} + tokenizer: ${tokenizer} + input_key: ${input_key} + output_key: ${output_key} + exclude_optional_keys: ${exclude_optional_keys} diff --git a/recipes/noc-reasoning-agent/configs/noc_reasoning_sft_6.yaml b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft_6.yaml new file mode 100644 index 0000000000..80d3ed8814 --- /dev/null +++ b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft_6.yaml @@ -0,0 +1,261 @@ +processors_to_run: all + +output_path: ??? +prompt_config: null +code_tags: null + +preprocessed_dataset_files: null # can specify multiple patters separated by space +input_files: null # can specify datasets from HF instead of prediction_jsonl_files + +exclude_optional_keys: true # exclude keys other than input, output and expected_answer from the final manifest +metadata: null # can provide additional metadata to store (e.g. 
dataset or generation_type)
+skip_first: 0 # useful for skipping validation set from train_full generation (it's always first)
+add_correct: true # can set to False if you only want to export incorrect solutions
+add_incorrect: false # if True, saves only incorrect solutions instead of correct ones
+add_unlabeled: false # if True, saves unlabeled solutions (no symbolic_correct field)
+use_judgement: false # if True, uses judgement instead of the symbolic_correct field
+# if True, adds a new field with max allowed code executions to each sample
+# exclude_optional_keys must be set to false in this case
+add_code_execution_counts: false
+
+downsampling_method: null # fair, random or null
+random_seed: 42
+do_shuffle: true
+num_output_samples: null
+
+# Needed to format the data and also for length-based filtering
+tokenizer: null
+chat_template_kwargs: null
+system_message: null # can be used to override default from the tokenizer
+assistant_end: null # can be used to not use jinja and just add this suffix
+
+# Problem params
+min_problem_length: null
+max_problem_length: null
+
+# Solution params
+min_solution_length: null
+max_solution_length: null
+use_chars_for_min_length: false
+
+input_key: problem
+output_key: generation
+
+# For synthesized questions
+contamination_file: null
+
+majority_filter:
+  # minimum number of majority votes to use the answer
+  min_majority_votes: 0
+  # minimum portion of majority votes to use the answer
+  min_majority_percentage: 0.0
+
+filters:
+  drop_multi_boxed: false
+  remove_contaminated: true
+  majority_filter: false
+  trim_solutions: true
+  trim_prefix: false
+  drop_incorrect_arithmetic: false
+  split_arithmetic: false
+  remove_len_outlier_problems: false
+  remove_len_outlier_solutions: false
+  code_text_filter: null
+  remove_code_errors: false
+  remove_verification_code: false
+  remove_matplotlib: false
+  remove_no_code: false
+  remove_no_think_tags: false
+
+deduplicate: true
+
+processors:
+  - _target_:
nemo_skills.training.data_preparation_utils.preprocessing.ReadData + preprocessed_dataset_files: ${preprocessed_dataset_files} + input_files: ${input_files} + input_key: ${input_key} + output_key: ${output_key} + skip_first: ${skip_first} + add_correct: ${add_correct} + add_incorrect: ${add_incorrect} + add_unlabeled: ${add_unlabeled} + use_judgement: ${use_judgement} + deduplicate: ${deduplicate} + keys_to_keep: # input/output/symbolic_correct/judgement are automatically added + - "total_votes" + - "majority_votes" + - "expected_answer" + - "predicted_answer" + - "answer_rm_score" + - "solutions" # GenSelect keys + - "num_solutions" + - "max_idx" + + # this will optimize processors inside to avoid serializing data to disk + - _target_: nemo_skills.training.data_preparation_utils.merge_processor.MergeProcessor + _recursive_: False + processor_configs: + - _target_: nemo_skills.training.data_preparation_utils.filters.DropMultiBoxed + should_run: ${filters.drop_multi_boxed} + solution_key: ${output_key} + test_cases: + - { input: { generation: "He had \\boxed{3} 4090s and \\boxed{2} 4080s" }, output: null } + - { input: { generation: "She had \\boxed{6} 4090s" }, output: { generation: "She had \\boxed{6} 4090s" } } + - { input: { generation: "boxed boxed 42" }, output: { generation: "boxed boxed 42" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveContaminated + should_run: ${filters.remove_contaminated} + contamination_file: ${contamination_file} + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch + should_run: ${filters.remove_code_errors} + text_key: ${output_key} + regex_patterns: + - 'Traceback (most recent call last)' + - '' + - 'Timed out' + - 'SyntaxError' + test_cases: + - { input: { generation: "My solution:\n---Traceback (most recent call last)---\nSomething else" }, output: null } + - { input: { generation: "My solution:\nSome long output\nSomething else" }, output: null } + - { input: { 
generation: "My solution:\nTimed out\nSomething else" }, output: null } + - { input: { generation: "My solution:\n[0;31mSyntaxError\u001b\nSomething else" }, output: null } + - { input: { generation: "My solution, no errors" }, output: { generation: "My solution, no errors" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch + should_run: ${filters.remove_matplotlib} + text_key: ${output_key} + regex_patterns: + - 'from matplotlib' + - 'import matplotlib' + test_cases: + - { input: { generation: "My solution:\n---import matplotlib---\nSomething else" }, output: null } + - { input: { generation: "My solution:\nfrom matplotlib import pyplot as plt\nSomething else" }, output: null } + - { input: { generation: "My solution, no errors" }, output: { generation: "My solution, no errors" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch + should_run: ${filters.remove_no_code} + text_key: ${output_key} + regex_patterns: + - '' + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: null } + - { input: { generation: "My solution:\ncode\nSomething else" }, output: null } + - { input: { generation: "code" }, output: { generation: "code" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch + should_run: ${filters.remove_no_think_tags} + text_key: ${output_key} + regex_patterns: + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: { generation: "My solution:\n------\nSomething else" } } + - { input: { generation: "My solution:\n\nSomething else" }, output: null } + - { input: { generation: "thinkingsummary" }, output: { generation: "thinkingsummary" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch + should_run: ${filters.remove_verification_code} + text_key: ${output_key} + regex_patterns: + - "Let's verify this with Python" + - "Let's 
verify this using Python" + - "Let's confirm this with Python" + - "Let's use Python to plot" + - "Let's verify this solution using Python" + - "Let's use Python to verify" + - "Let's confirm this result with some Python" + test_cases: + - { input: { generation: "Here is the solution. Let's verify this with Python. It should work." }, output: null } + - { input: { generation: "Here is the solution. Let's verify this using Python. It should work." }, output: null } + - { input: { generation: "First, we solve the problem. Let's confirm this with Python. Finally, we get the result." }, output: null } + - { input: { generation: "First, we solve the problem. Let's verify this solution using Python. Finally, we get the result." }, output: null } + - { input: { generation: "First, we solve the problem. Let's use Python to verify the solution. Finally, we get the result." }, output: null } + - { input: { generation: "The answer is calculated. Let's use Python to plot the results." }, output: null } + - { input: { generation: "This is the correct solution without any verification code." }, output: { generation: "This is the correct solution without any verification code." } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.MajorityFilter + should_run: ${filters.majority_filter} + min_majority_votes: ${majority_filter.min_majority_votes} + min_majority_percentage: ${majority_filter.min_majority_percentage} + + - _target_: nemo_skills.training.data_preparation_utils.filters.TrimPrefix + should_run: ${filters.trim_prefix} + solution_key: ${output_key} + test_cases: + - { input: { generation: "My solution:\nSubtract 3 from 5" }, output: { generation: "Subtract 3 from 5" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.TrimSolutions + should_run: ${filters.trim_solutions} + solution_key: ${output_key} + test_cases: + - { input: { generation: "Solution ends at \\boxed{0}\nThis line is useless." 
}, output: { generation: "Solution ends at \\boxed{0}" } } + - { input: { generation: "Solution continues \\boxed{7} after boxed" }, output: { generation: "Solution continues \\boxed{7} after boxed" } } + - { input: { generation: "Solution\n\ncontinues \n\n\\boxed{7} after\n boxed" }, output: { generation: "Solution\n\ncontinues \n\n\\boxed{7} after" } } + - { input: { generation: "Works alright with newlines inside \\boxed{7\n5} and properly\ncuts after" }, output: { generation: "Works alright with newlines inside \\boxed{7\n5} and properly" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIncorrectArithmetic + should_run: ${filters.drop_incorrect_arithmetic} + solution_key: ${output_key} + test_cases: + - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } } + # TODO: implement fractional arithmetic + # - {input: {generation: "1 / 3 = 0.3"}, output: null} + # - {input: {generation: "1 / 3 = 0.3334"}, output: null} + - { input: { generation: "3 * (1 + 7) / 4 = 3 * 8 / 4 = 100" }, output: null } + + - _target_: nemo_skills.training.data_preparation_utils.filters.SplitArithmetic + should_run: ${filters.split_arithmetic} + solution_key: ${output_key} + test_cases: + - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 3 + 3 + 4 = 6 + 4 = 10" } } + - { input: { generation: "1 + 7 * 19 = 134" }, output: { generation: "1 + 7 * 19 = 1 + 133 = 134" } } + - { input: { generation: "3 * (1 + 7) / 4 = 6" }, output: { generation: "3 * (1 + 7) / 4 = 3 * 8 / 4 = 24 / 4 = 6" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveLenOutliers + should_run: ${filters.remove_len_outlier_problems} + property_key: ${input_key} + min_length: ${min_problem_length} + max_length: ${max_problem_length} + tokenizer: ${tokenizer} + + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveLenOutliers + should_run: ${filters.remove_len_outlier_solutions} + 
property_key: ${output_key} + min_length: ${min_solution_length} + max_length: ${max_solution_length} + tokenizer: ${tokenizer} + use_chars_for_min_length: ${use_chars_for_min_length} + + - _target_: nemo_skills.training.data_preparation_utils.filters.AddCodeExecutionsCounts + should_run: ${add_code_execution_counts} + solution_key: ${output_key} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.GroupSamples + group_key: ${input_key} + + - _target_: nemo_skills.training.data_preparation_utils.filters.CodeTextFilter + should_run: ${filters.code_text_filter} + filter_type: ${filters.code_text_filter} + code_tags: ${code_tags} + solution_key: ${output_key} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.ShuffleAndDownsampleData + num_samples: ${num_output_samples} + sampling_method: ${downsampling_method} + random_seed: ${random_seed} + do_shuffle: ${do_shuffle} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.WriteFinalSftManifest + output_manifest_file: ${output_path} + prompt_config: ${prompt_config} + tokenizer: ${tokenizer} + chat_template_kwargs: ${chat_template_kwargs} + system_message: ${system_message} + assistant_end: ${assistant_end} + code_tags: ${code_tags} + input_key: ${input_key} + output_key: ${output_key} + metadata: ${metadata} + exclude_optional_keys: ${exclude_optional_keys} diff --git a/recipes/noc-reasoning-agent/data/synthetic_incidents.csv b/recipes/noc-reasoning-agent/data/synthetic_incidents.csv new file mode 100644 index 0000000000..70a337a02e --- /dev/null +++ b/recipes/noc-reasoning-agent/data/synthetic_incidents.csv @@ -0,0 +1,2060 @@ 
+incident_identifier,geographical_territory,incident_classification,incident_subtype,service_domain,equipment_provider,operational_zone,resolution_status,suspension_cause,urgency_level,affected_site,incident_summary,detection_timestamp,escalation_date,responsible_team,fault_category,resolution_summary,action_chronicle,reporter_identity,intervention_began,intervention_completed,root_cause_primary,resolution_method,root_cause_secondary,cause_additional,triggered_by_modification,resolver_identity,solved_category +INCME-100001,The Shire,Relay Transport,Junction Point,POWER,Elven Forge Technologies,Caras Galadhon,Resolved,PWR-001: AC Power Failure,MAJOR,Watch-TheShire-West-32,"NOC Engineer's Note: + +Incident Summary: A power outage caused by a commercial power supply interruption or failure, affecting the AC power supply to a critical IT infrastructure. + +Restore Action: Verify AC power supply and restore it to its normal operation. + +Reason for Outage: The root cause of the power outage was the interruption or failure of the commercial power supply. + +Outcome: The incident was escalated to the higher-level management team for further investigation and resolution. + +NOC Terms: +- Commercial power supply interruption or failure +- Root cause +- Problem type (PWR-001) +- Verification of AC power supply +- Restore action +- Reason for outage +- Outcome + +NOC Engineer's Note: +- This resolution summary is a summary of the incident closure notes written by the NOC engineer. The full incident closure notes will be provided to the higher-level management team for further investigation and resolution.",2024-09-12 14:18:16,2024-09-12 14:43:16,Orthanc Technical Review,power_ac_failure_recovery,Commercial power supply interruption or failure at Watch-TheShire-West-32. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Remote corrective action applied successfully. Full service restoration confirmed. 
Post-incident review scheduled.,"2024-09-12 14:18 - Verified alarm: PWR-001: AC Power Failure at Watch-TheShire-West-32. Active for 3 minutes. Multiple downstream alarms generated. +2024-09-12 14:22 - Polled power infrastructure. Mains: lost — no mains power. UPS active, estimated 30 minutes on battery. Generator: Fuel Relay Fault — auto-start failed. +2024-09-12 14:26 - Queried element health at Watch-TheShire-West-32. Equipment within operational limits. +2024-09-12 14:28 - Executed remote command: generator start. Result: COMMAND SUCCESS — operation completed. +2024-09-12 14:32 - Escalated via ticket TK-20240000 to DevOps. Remote resolution unsuccessful, physical inspection required. ETA: next business day. +2024-09-12 14:34 - Verified recovery — Service fully restored. All metrics back to baseline. All alarms cleared within 3 minutes.",Elfhelm of Rohan,2024-09-12 14:18:16,2024-09-12 14:43:16,Commercial power supply interruption or failure,Resolved,PWR-001: AC Power Failure,0.2,FALSE,Elfhelm of Rohan,soft_solve +INCME-100002,The Shire,Signal Network,Signal Unit,RAN,Dwarven Network Systems,Framsburg,Pending Resolution,RAN-001: Cell Service Interruption,CRITICAL,Hub-TheShire-South-06,"NOC Engineer's Resolution Summary: + +Workflow: Site Outage + +Actions taken: +- Verified Site Outage with the Field Technician +- Notified the Root Cause Team +- Final Status Verification + +Root cause: Cell completely unavailable for service +- The cell was completely unavailable due to a technical issue with the network equipment. +- The issue was identified and resolved by the Field Technician. +- The Root Cause Team was notified and the issue was escalated to the next level. + +Problem type: RAN-001 + +Resolution: +- The issue has been resolved and the cell is now operational. +- The Root Cause Team has been notified and the issue is being monitored. +- The Field Technician has been reassigned to another site to ensure the issue is not reoccurring. 
+- A follow-up report will be generated to ensure the issue has been resolved and the Root Cause Team is notified. + +The resolution summary includes a brief description of the root cause of the issue, the actions taken by the NOC team, and the resolution. The resolution is clear and concise,",2024-09-07 02:25:18,2024-09-07 02:55:18,White Tower Operations,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Hub-TheShire-South-06. Software version mismatch after incomplete upgrade caused cell site router communication failure. Remote corrective action applied successfully. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-09-07 02:26 - Alarm: RAN-001: Cell Service Interruption at Hub-TheShire-South-06. Severity: CRITICAL. Active 8 minutes. Site completely unreachable. +2024-09-07 02:30 - Queried RF chain. Sector Beta: high VSWR (3.2:1), possible feeder issue. Antenna tilt: 10°. PA status: active. +2024-09-07 02:32 - Connectivity check to Hub-TheShire-South-06 — reachable, latency nominal. +2024-09-07 02:35 - Queried element health at Hub-TheShire-South-06. Equipment within operational limits. +2024-09-07 02:39 - Validated running configuration on Hub-TheShire-South-06. No mismatches detected. +2024-09-07 02:43 - Attempted remote restart on Hub-TheShire-South-06. COMMAND SUCCESS — operation completed. +2024-09-07 02:47 - Created ticket TK-20240001 for vendor support. Remote resolution unsuccessful, physical inspection required. +2024-09-07 02:51 - Verified: Full recovery confirmed. Alarm cleared at 02:36. All sectors broadcasting normally. 
Alarms cleared.",Húrin of the White City,2024-09-07 02:25:18,2024-09-07 02:55:18,Cell completely unavailable for service,Pending Resolution,RAN-001: Cell Service Interruption,0.4,FALSE,Húrin of the White City,soft_solve +INCME-100003,Rohan,Path Signals,Weather Watch,TRANSPORT,Dwarven Network Systems,Helm's Deep,Resolved,TRN-008: Latency SLA Violation,MAJOR,Point-Rohan-Outer-46,"NOC engineer completed the incident closure notes for the TRN-008 issue. + +Restore Action: Verify Routing Alarm and Protocol Status. + +Reason for Outage: End-to-end latency exceeding service level agreement. + +Root cause: The issue was caused by an end-to-end latency that exceeded the service level agreement. The root cause was identified and resolved by verifying the routing alarm and protocol status. + +Procedure followed: +1. Verified routing alarm by checking the log files and confirming that it was triggered. +2. Verified protocol status by checking the protocol logs. +3. Adjusted timer settings to ensure that the latency threshold was not exceeded. +4. Escalated the issue to network engineering for further investigation. + +Problem type: TRN-008 + +Resolution: Restore the routing and protocol settings to ensure that latency is not exceeding the service level agreement.",2024-09-30 07:21:48,2024-09-30 07:47:48,Lothlórien Link Guardians,transport_routing_flap_resolution,TRN-008: Latency SLA Violation affecting Point-Rohan-Outer-46. Latency spike caused by incorrect QoS shaping policy pushed during last maintenance window. Automated recovery sequence triggered after remote intervention. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-30 07:23 - Alarm: TRN-008: Latency SLA Violation on transport path to Point-Rohan-Outer-46. CRC errors increasing on fiber uplink — 342 in last 10 min. +2024-09-30 07:27 - Ran transport diagnostics — Interface counters: CRC errors rising on Gi0/0/1. Optical power borderline at -18.2 dBm. Jitter: 28ms (threshold 15ms). 
+2024-09-30 07:31 - Path analysis: Path goes through Aggregation Hub — hub metrics normal. Issue on last-mile segment. +2024-09-30 07:34 - Checked transport path health to Point-Rohan-Outer-46. Interface counters reviewed. Link status verified. +2024-09-30 07:36 - Remote action: clear configuration — COMMAND FAILED — element not responding. +2024-09-30 07:38 - Ticket TK-20240002 for RF engineering: Persistent fault after 2 remote attempts — need on-site investigation. +2024-09-30 07:40 - Recovery verified: Site operational, all services confirmed active.",Bergil son of Denethor,2024-09-30 07:21:48,2024-09-30 07:47:48,End-to-end latency exceeding service level agreement,Resolved,TRN-008: Latency SLA Violation,0.2,FALSE,Bergil son of Denethor,soft_solve +INCME-100004,Lothlórien,Relay Transport,Junction Point,COMPUTE,Dwarven Network Systems,Aldburg,Pending Resolution,CMP-010: Site Not Scrolling,CRITICAL,Hub-Lothlorien-Primary-01,"NOC Engineer's Resolution Summary: + +Restore Action: Verify Image Pull Alarm, Escalate Registry Issue, Verify Image Pull +Reason for Outage: Site deployment or scaling operation stalled + +The root cause of the outage was identified as a scaling operation stalled due to site deployment. The incident was escalated to the Registry team, who resolved the issue. The Image Pull Alarm was verified to ensure that the issue was resolved. The Registry Issue was escalated to the Image Pull team, who resolved the issue. The Image Pull Alarm was verified to ensure that the issue was resolved. The root cause was documented in the incident closure notes. + +The summary of the resolution summary includes the actions taken, the root cause, the problem type, and the restoration action taken. The resolution summary is concise and complete, using NOC terminology. 
The engineer can use this summary to document the incident closure notes and ensure that all necessary steps were taken to resolve the issue.",2024-06-06 18:46:37,2024-06-06 19:13:37,Osgiliath Bridge Operations,compute_image_pull_recovery,CMP-010: Site Not Scrolling on Hub-Lothlorien-Primary-01. Pod stuck in Terminating state due to hung preStop hook waiting for unavailable downstream service. Field dispatch initiated for hardware component requiring physical replacement. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-06-06 18:48 - Alert: CMP-010: Site Not Scrolling on cu-control-plane-7b4d in namespace ran-prod at Hub-Lothlorien-Primary-01. Container OOMKilled — memory limit exceeded. +2024-06-06 18:52 - Inspected logs for cu-control-plane-7b4d: Pull error: dial tcp registry.internal:5000: connect: connection refused. Registry pod on same node is also down. +2024-06-06 18:55 - Executed: cordoned and drained affected node. Container runtime restarted, pods recovering. +2024-06-06 18:57 - Checked compute node — Node healthy — issue isolated to pod. +2024-06-06 19:00 - Validated running configuration on Hub-Lothlorien-Primary-01. No mismatches detected. +2024-06-06 19:02 - Ticket TK-20240003 created for RF engineering: Persistent fault after 2 remote attempts — need on-site investigation. +2024-06-06 19:06 - Verified — Full recovery confirmed. Alarm cleared at 19:04. 
Replacement pod Running, all readiness probes passing.",Herubrand of the Mark,2024-06-06 18:46:37,2024-06-06 19:13:37,Site deployment or scaling operation stalled,Pending Resolution,CMP-010: Site Not Scrolling,0.2,FALSE,Herubrand of the Mark,soft_solve +INCME-100005,Rivendell,Relay Transport,Central Nexus,TRANSPORT,Elven Forge Technologies,Hollin Gate,Resolved,BGP issue,MAJOR,Node-Rivendell-East-20,"Incident Closure Note: + +In response to the issue of failed Microwave Alarm, Verify Microwave Alarm, Check Weather Conditions, Check Link Status, Schedule Site Visit, Monitor Link Status, root cause was BGP session instability with upstream provider. + +The root cause of the issue was identified as BGP session instability with upstream provider. The issue was fixed by verifying the Microwave Alarm, checking weather conditions, checking link status, scheduling a site visit, monitoring link status, and resolving the issue. + +Reason for Outage: The root cause of the issue was identified as BGP session instability with upstream provider. + +Restore Action: Verify Microwave Alarm, Check Weather Conditions, Check Link Status, Schedule Site Visit, Monitor Link Status. + +Root cause: BGP session instability with upstream provider. +Problem type: TRN-002. + +Output: + +Incident Closure Note: Incident Closure Note: + +In response to the issue of failed Microwave Alarm, Verify Microwave Alarm,",2024-10-17 15:24:05,2024-10-17 15:43:05,Riders of the Mark,transport_microwave_degradation_response,BGP issue affecting Node-Rivendell-East-20. Packet loss correlated with interface CRC errors on upstream aggregation link. Configuration corrected and service restored remotely within SLA. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-10-17 15:24 - Alarm: BGP issue on transport path to Node-Rivendell-East-20. Optical power at -18.2 dBm — borderline low. +2024-10-17 15:26 - Connectivity: End-to-end path: latency within spec after correction. 
Monitoring for stability. +2024-10-17 15:30 - Interface diagnostics: BGP diagnostics: peer flapping due to hold timer expiry. Route table shows 3 prefix withdrawals in last hour. +2024-10-17 15:32 - Checked transport path health to Node-Rivendell-East-20. Interface counters reviewed. Link status verified. +2024-10-17 15:36 - Config correction: Interface MTU mismatch corrected on aggregation link. +2024-10-17 15:38 - Remote action: force restart — Partial success — 2 of 3 units recovered. +2024-10-17 15:40 - Recovery verified: Full recovery confirmed. Alarm cleared at 15:36.",Pippin Bracegirdle,2024-10-17 15:24:05,2024-10-17 15:43:05,BGP session instability with upstream provider,Resolved,BGP issue,0.4,FALSE,Pippin Bracegirdle,hard_solve +INCME-100006,Gondor,Signal Core,Beacon Array,COMPUTE,Dwarven Network Systems,Ethring,Resolved,CMP-005: Pod Terminating Stuck,CRITICAL,Node-Gondor-Primary-68,"NOC engineer writing incident closure notes + +Issue: An issue has been reported where a pod in the Kubernetes cluster has been stuck in the terminating state beyond the grace period, causing downtime for the cluster. + +Action taken: + +1. Verification: Verification of the Orchestrator Alarm and Control Plane to confirm the issue. + +2. Check: Verification of the Pod stuck in the terminating state. + +3. Failover: Verification of the Backup Control Plane to ensure that the cluster can continue to function. + +4. Escalation: Escalation to the Platform Team to investigate and resolve the issue. + +Root cause: The Pod was stuck in the terminating state due to a configuration error in the orchestrator. + +Problem type: CMP-005 + +Resolution: + +1. Verification: The Orchestrator Alarm and Control Plane were verified to be operational. + +2. Check: The Pod was verified to be in the terminating state due to a configuration error in the orchestrator. + +3. 
Failover: The Backup Control Plane was verified to be oper",2024-09-25 11:35:52,2024-09-25 12:00:52,Wizards Council Escalation,compute_orchestrator_recovery,CMP-005: Pod Terminating Stuck on Node-Gondor-Primary-68. Node resource exhaustion prevented pod scheduling. Horizontal autoscaler at maximum replica count. Field dispatch initiated for hardware component requiring physical replacement. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-09-25 11:36 - Alert: CMP-005: Pod Terminating Stuck on smf-core-5d8b in namespace data-plane at Node-Gondor-Primary-68. CrashLoopBackOff — 12 restarts in last hour. +2024-09-25 11:39 - Container logs show: Kubelet reports: 0/8 nodes available: insufficient memory. Current request: 4Gi, largest available: 2Gi. +2024-09-25 11:42 - Queried orchestrator — smf-core-5d8b: Terminating. Image pull attempts: 8. Last error: connection refused. +2024-09-25 11:44 - Node health: Node OK, storage: 94% used. +2024-09-25 11:47 - Remote action: restart. COMMAND SUCCESS — operation completed. +2024-09-25 11:49 - Verified — Full recovery confirmed. Alarm cleared at 11:54. Replacement pod Running, all readiness probes passing.",Alatar the White,2024-09-25 11:35:52,2024-09-25 12:00:52,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.4,FALSE,Alatar the White,soft_solve +INCME-100007,Eriador,Relay Transport,Path Router,RAN,Elven Forge Technologies,Bucklebury,Resolved,RAN-014: CSR Unreachable,CRITICAL,Watch-Eriador-Outer-18,"The incident was caused by a cell site router management path unavailable, which resulted in a failure to perform upgrades. The root cause was identified and documented, and a resolution was implemented to restore the affected system. The incident was closed, and a resolution summary was created to detail the actions taken to address the issue. The summary includes the following information: + +1. 
Restore Action: Verify and restore the cell site router management path. + +2. Reason for Outage: The root cause was identified and documented. + +3. Notes: The root cause was identified as a failure to perform upgrades due to a cell site router management path unavailable. + +4. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +5. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +6. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +7. Root",2024-07-17 15:27:29,2024-07-17 16:17:29,Riders of the Mark,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Watch-Eriador-Outer-18. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Escalation ticket created for vendor engagement on firmware issue. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-07-17 15:29 - Critical alarm received for Watch-Eriador-Outer-18: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 15:21. Intermittent alarm — flapping every 2-3 minutes. +2024-07-17 15:31 - Checked neighbors and topology. Upstream aggregation node healthy. Last-mile issue confirmed. +2024-07-17 15:35 - Queried element health at Watch-Eriador-Outer-18. Equipment within operational limits. +2024-07-17 15:39 - Attempted remote software rollback on Watch-Eriador-Outer-18. COMMAND SUCCESS — operation completed. +2024-07-17 15:42 - Created ticket TK-20240006 for DevOps. Persistent fault after 2 remote attempts — need on-site investigation. +2024-07-17 15:46 - Verified: Service fully restored. All metrics back to baseline. All sectors broadcasting normally. 
Alarms cleared.",Dwalin son of Thorin,2024-07-17 15:27:29,2024-07-17 16:17:29,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Dwalin son of Thorin,soft_solve +INCME-100008,Gondor,Relay Transport,Signal Unit,RAN,Dwarven Network Systems,Linhir,Resolved,RAN-014: CSR Unreachable,MAJOR,Beacon-Gondor-Lower-30,"Resolution Summary: + +The root cause of the failure was the cell site router management path unavailable. The issue was resolved by verifying the upgrade failure, dispatching Field Support, and documenting and reporting the issue. + +The problem type was RAN-014. + +The following actions were taken: +1. Verify upgrade failure: Verification of the upgrade failure was conducted to ensure that the issue was resolved. +2. Dispatch Field Support: A Field Support team member was dispatched to resolve the issue. +3. Document and report: A report was generated to document and report the issue. + +The resolution summary is a concise summary of the actions taken to resolve the issue. It includes the root cause, the actions taken, and the results achieved. The summary does not include additional details or information that may be relevant to the incident.",2024-08-19 02:41:11,2024-08-19 03:02:11,Rohan Rapid Response,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Beacon-Gondor-Lower-30. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-08-19 02:43 - Critical alarm received for Beacon-Gondor-Lower-30: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 02:36. All 3 sectors showing Cell Down. +2024-08-19 02:47 - Connectivity check to Beacon-Gondor-Lower-30 — reachable, latency nominal. +2024-08-19 02:51 - Checked neighbors and topology. Upstream aggregation node healthy. 
Last-mile issue confirmed. +2024-08-19 02:53 - Queried element health at Beacon-Gondor-Lower-30. Equipment within operational limits. +2024-08-19 02:55 - Validated running configuration on Beacon-Gondor-Lower-30. No mismatches detected. +2024-08-19 02:59 - Attempted remote clear configuration on Beacon-Gondor-Lower-30. COMMAND SUCCESS — operation completed. +2024-08-19 03:02 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 02:54. KPIs recovering — within 90% of baseline.",Halbarad the Heir,2024-08-19 02:41:11,2024-08-19 03:02:11,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Halbarad the Heir,soft_solve +INCME-100009,Rhovanion,Beacon Power,Path Router,COMPUTE,Dwarven Network Systems,Framsburg,Resolved,CMP-002: Pod Container Creating,MAJOR,Node-Rhovanion-Central-60,"NOC Engineer's Notes: + +1. Restore Action: Containers were successfully restarted, and the pods were brought back online. + +2. Reason for Outage: A pod was stuck in ContainerCreating state due to a configuration issue. + +Conclusion: The root cause of the outage was a configuration issue, and the root cause was verified. The incident was escalated to DevOps for further investigation. The root cause has been addressed and the issue has been resolved. The incident has been closed.",2024-08-11 10:43:48,2024-08-11 11:00:48,Shire Monitoring Guild,compute_container_crash_recovery,CMP-002: Pod Container Creating on Node-Rhovanion-Central-60. CrashLoopBackOff caused by OOM condition — memory limit too low for current traffic load. Escalation ticket created for vendor engagement on firmware issue. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-08-11 10:43 - Alarm received: CMP-002: Pod Container Creating at Node-Rhovanion-Central-60. Pod stuck in Terminating for 47 minutes. Checking container state. +2024-08-11 10:45 - Pod status: ImagePullBackOff. Reason: OOMKilled. Restart count: 12. 
Last restart: 3 minutes ago. +2024-08-11 10:49 - Container logs show: Pull error: dial tcp registry.internal:5000: connect: connection refused. Registry pod on same node is also down. +2024-08-11 10:52 - Checked compute node — Node CPU: 92%, Memory: 87%. Under resource pressure. +2024-08-11 10:56 - Remote action: unlock cells. COMMAND SUCCESS — operation completed. +2024-08-11 10:58 - Verified — Partial recovery — monitoring for stability. Replacement pod Running, all readiness probes passing.",Elendil Strider,2024-08-11 10:43:48,2024-08-11 11:00:48,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.4,FALSE,Elendil Strider,soft_solve +INCME-100010,Rhovanion,Signal Core,Path Router,RAN,Dwarven Network Systems,Erebor,Resolved,RAN-014: CSR Unreachable,MAJOR,Gateway-Rhovanion-North-09,"Restore Action: Verified that the cell site router management path was restored, and the RAN-014 issue has been resolved. +Reason for Outage: The root cause of the issue was a failure in the upgrade process, which resulted in a disruption in cell site router management. +Notes: Verified that the upgrade process was completed successfully, and the system status is now stable. +System Status: The system status is now stable, with no issues observed. + +Actions taken: +- Verified that the upgrade process was completed successfully +- Reviewed the upgrade logs to identify any issues +- Initiated a rollback to restore the previous system state +- Attempted an emergency boot to restore the system to a working state +- Verified that the system was able to recover from the upgrade failure +- Documented and reported the issue and its resolution + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Resolution: Verified that the upgrade process was completed successfully, and the system status is now stable. The root cause of the issue was a failure in the upgrade process, which resulted in a disruption in cell site router management. 
The",2024-11-18 13:48:58,2024-11-18 14:16:58,Pelargir Port Authority,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Gateway-Rhovanion-North-09. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Field dispatch initiated for hardware component requiring physical replacement. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-11-18 13:50 - Alarm: RAN-014: CSR Unreachable at Gateway-Rhovanion-North-09. Severity: MAJOR. Active 15 minutes. Partial outage — 2 of 3 sectors affected. +2024-11-18 13:54 - Topology analysis: Adjacent sites normal — confirms localized fault. Upstream path: degraded. +2024-11-18 13:56 - Queried element health at Gateway-Rhovanion-North-09. Equipment within operational limits. +2024-11-18 13:59 - Attempted remote reset on Gateway-Rhovanion-North-09. COMMAND FAILED — element not responding. +2024-11-18 14:03 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 14:02. KPIs returned to normal.",Dáin Oakenshield,2024-11-18 13:48:58,2024-11-18 14:16:58,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Dáin Oakenshield,soft_solve +INCME-100011,Mordor Surveillance Zone,Signal Core,Central Nexus,TRANSPORT,Dwarven Network Systems,Grimslade,Resolved,TRN-007: Packet Loss Threshold Exceeded,MINOR,Station-MordorSurveillanceZone-South-26,"I'm unable to access the specific incident details, but I can provide you with a sample resolution summary for a failed incident outcome. + +in this case, the incident outcome was a failed transport path alarm, which was caused by unacceptable packet loss. To resolve the issue, the following actions were taken: + +1. Verify interface alarm: the first step was to verify the interface alarm, which was triggered due to the unacceptable packet loss. The alarm was confirmed to be valid and caused by the transport path experiencing unacceptable packet loss. + +2. 
Schedule physical repair: the next step was to schedule the physical repair of the affected interface. This involved identifying the affected interface, determining the physical location of the faulty component, and arranging for the repair. + +3. Verify interface status: after the physical repair was completed, the interface status was verified to ensure that the interface was restored to its original condition. + +the root cause of the failure was the unacceptable packet loss, which was caused by the transport path experiencing unacceptable packet loss. The root cause was investigated and a resolution was implemented to address the issue. + +the resolution summary includes the actions taken, the",2024-06-21 18:03:35,2024-06-21 18:22:35,Wizards Council Escalation,transport_interface_errors_resolution,TRN-007: Packet Loss Threshold Exceeded affecting Station-MordorSurveillanceZone-South-26. Fiber degradation on last-mile segment — CRC errors increasing. Physical inspection opened. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Full service restoration confirmed. Post-incident review scheduled.,"2024-06-21 18:04 - Alarm: TRN-007: Packet Loss Threshold Exceeded on transport path to Station-MordorSurveillanceZone-South-26. Optical power at -18.2 dBm — borderline low. +2024-06-21 18:08 - Path analysis: Upstream router healthy. Problem isolated between aggregation and cell site. +2024-06-21 18:12 - Element health for Station-MordorSurveillanceZone-South-26: upstream router operational. Checked interface error rates. +2024-06-21 18:16 - Config correction: Interface MTU mismatch corrected on aggregation link. +2024-06-21 18:19 - Remote action: generator start — COMMAND SUCCESS — operation completed. +2024-06-21 18:23 - Recovery verified: Full recovery confirmed. 
Alarm cleared at 18:21.",Pelendur Húrinion,2024-06-21 18:03:35,2024-06-21 18:22:35,Transport path experiencing unacceptable packet loss,Resolved,TRN-007: Packet Loss Threshold Exceeded,0.4,FALSE,Pelendur Húrinion,soft_solve +INCME-100012,Dale Province,Arcane Engines,Path Router,SIGNALING,Elven Forge Technologies,Henneth Annûn,Pending Resolution,SIG-010: CU Communication Failure,MAJOR,Station-DaleProvince-North-64,"NOC Engineer: Completed Routing Alarm, Verified Routing Status, Verified Routing Alarm, Escalated Routing Issue, Verified Routing Status, Root Cause: Central Unit internal communication breakdown, Problem Type: SIG-010. + +The NOC engineer completed the following actions to resolve the routing alarm: + +1. Verified the routing alarm, which indicated that the central unit was experiencing communication issues. +2. Verified that the routing status was not affected, as the alarm was triggered by a communication failure between the central unit and the NOC. +3. Verified that the routing issue was caused by a communication breakdown between the central unit and the NOC. +4. Escalated the routing issue to the NOC management team for further investigation and resolution. + +The root cause of the routing issue was identified as a communication breakdown between the central unit and the NOC. This led to the escalation of the issue to the NOC management team for further investigation and resolution.",2024-09-22 17:51:17,2024-09-22 18:10:17,Bywater Observation Post,signaling_routing_failure_recovery,SIG-010: CU Communication Failure at Station-DaleProvince-North-64. Interface errors correlated with signaling overload — capacity threshold reached during peak traffic. Remote corrective action applied successfully. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-09-22 17:51 - Signaling alarm: SIG-010: CU Communication Failure at Station-DaleProvince-North-64. 
E2 interface errors — 230 failed transactions in last 15 min. +2024-09-22 17:53 - Connectivity check to Station-DaleProvince-North-64 — reachable, latency nominal. +2024-09-22 17:55 - Element health for Station-DaleProvince-North-64: CU processing normal. Checked signaling interface status. +2024-09-22 17:57 - Executed SCTP reset — SUCCESS — reboot initiated, monitoring. +2024-09-22 18:00 - Ticket TK-20240011 for RF engineering: Hardware replacement may be needed. +2024-09-22 18:03 - Protocol diagnostics: Protocol trace shows malformed messages from peer — likely software bug on AMF side. +2024-09-22 18:07 - Verified: Site operational, all services confirmed active. N2/NGAP path recovered.",Arador of the Dúnedain,2024-09-22 17:51:17,2024-09-22 18:10:17,Central Unit internal communication breakdown,Pending Resolution,SIG-010: CU Communication Failure,0.4,FALSE,Arador of the Dúnedain,soft_solve +INCME-100013,Gondor,Path Signals,Weather Watch,RAN,Dwarven Network Systems,Cair Andros,Resolved,SVC-002: Data Throughput Degradation,MAJOR,Station-Gondor-West-13,"In response to the complaint received regarding a slow-performing SVC, the NOC engineer has reviewed the complaint details, checked cell performance, and requested a detailed analysis. As a result, the following action was taken: + +- Review Complaint Details: The engineer reviewed the complaint details and identified that the user throughput significantly below expected rates. +- Check Cell Performance: The engineer checked cell performance to ensure that the SVC is operating correctly and is not experiencing any performance issues. +- Request Detailed Analysis: The engineer requested a detailed analysis to determine the root cause of the slow-performing SVC. + +The root cause of the slow-performing SVC was identified as user throughput significantly below expected rates. 
The engineer has documented the resolution and will continue to monitor the SVC to ensure that it is operating optimally.",2024-09-20 03:18:09,2024-09-20 03:35:09,Arnor Response Team,ran_speed_complaint_resolution,SVC-002: Data Throughput Degradation at Station-Gondor-West-13. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Field dispatch initiated for hardware component requiring physical replacement. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-09-20 03:20 - Alarm: SVC-002: Data Throughput Degradation at Station-Gondor-West-13. Severity: MAJOR. Active 5 minutes. Partial outage — 2 of 3 sectors affected. +2024-09-20 03:24 - RF status check — PA power output 6dB below target on affected sector. VSWR: 3.2:1. TX power: nominal. +2024-09-20 03:27 - Queried element health at Station-Gondor-West-13. Equipment within operational limits. +2024-09-20 03:29 - Executed software rollback — SUCCESS — reboot initiated, monitoring. +2024-09-20 03:31 - Recovery confirmed — Partial recovery — monitoring for stability. Metrics stable for 15 minutes.",Paladin Took,2024-09-20 03:18:09,2024-09-20 03:35:09,User throughput significantly below expected rates,Resolved,SVC-002: Data Throughput Degradation,0.2,FALSE,Paladin Took,soft_solve +INCME-100014,Rivendell,Signal Core,Central Nexus,COMPUTE,Dwarven Network Systems,Grey Havens,Resolved,CMP-003: Pod Pending or Evicted,CRITICAL,Array-Rivendell-Primary-08,"In summary, the root cause of the outage was a Pod that was unable to be scheduled or evicted from a node due to a Node Alarm. The incident was escalated to the Platform Team for further investigation, and a root cause analysis was performed to determine the root cause. The outage was resolved by verifying that the Pod could be scheduled and evicted from the node. The root cause was also identified and documented. 
The resolution was a complete restart of the affected node, which restored normal operations. The resolution summary includes the following: + +1. Restore Action: Restart the affected node +2. Reason For Outage: The root cause of the outage was a Pod that was unable to be scheduled or evicted from a node due to a Node Alarm. + +The resolution summary is concise and complete, keeping NOC terminology and workflow in mind. The root cause and resolution are highlighted for easy reference.",2024-10-13 02:35:43,2024-10-13 03:10:43,Erebor Relay Division,compute_k8s_node_recovery,CMP-003: Pod Pending or Evicted on Array-Rivendell-Primary-08. Container configuration error after recent deployment — environment variable referencing deleted secret. Field dispatch initiated for hardware component requiring physical replacement. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-10-13 02:35 - Alarm received: CMP-003: Pod Pending or Evicted at Array-Rivendell-Primary-08. Container OOMKilled — memory limit exceeded. Checking container state. +2024-10-13 02:38 - Orchestration: restarted container runtime on node — Container runtime restarted, pods recovering. +2024-10-13 02:40 - Queried orchestrator — smf-core-5d8b: Pending. Restart count: 12. Last restart: 3 minutes ago. +2024-10-13 02:43 - Checked compute node — Node CPU: 92%, Memory: 87%. Under resource pressure. +2024-10-13 02:47 - Remote action: SCTP reset. COMMAND SUCCESS — operation completed. +2024-10-13 02:50 - Verified — Full recovery confirmed. Alarm cleared at 02:48. 
New pods healthy across 3 nodes.",Nori of the Lonely Mountain,2024-10-13 02:35:43,2024-10-13 03:10:43,Pod cannot be scheduled or was evicted from node,Resolved,CMP-003: Pod Pending or Evicted,0.4,FALSE,Nori of the Lonely Mountain,soft_solve +INCME-100015,Rohan,Signal Core,Beacon Array,POWER,Elven Forge Technologies,Edoras,Resolved,PWR-003: Battery Discharge Alert,MINOR,Hub-Rohan-East-49,"Incident Closure: + +Escalation: Battery Alert, Battery Depletion, No AC Power, No Generator Available, No Site Status, No Load Shedding Implemented, Emergency Dispatch, No Site Status + +Restore Action: +1. Acknowledge Battery Alert - Notify stakeholders of the battery backup depletion and the need to implement load shedding. +2. Check Battery Status - Confirm that the battery is still functioning correctly and that the battery backup is not depleted. +3. Verify AC Power Status - Check if the site is connected to the AC grid and if there is enough power to operate the generator. +4. Start Generator - If the site is not connected to the AC grid, the generator will be used to provide power to the site. +5. Implement Load Shedding - If the generator is not sufficient to meet the site's power demand, load shedding will be implemented to reduce the load on the grid. +6. Emergency Dispatch - If the generator is unable to provide power, an emergency dispatch will be initiated to ensure that the site",2024-07-04 02:07:53,2024-07-04 02:44:53,Rivendell Array Management,power_battery_discharge_response,Battery backup depleting without AC restoration at Hub-Rohan-East-49. Battery bank reached low-voltage disconnect threshold during extended commercial power outage. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-07-04 02:07 - Verified alarm: PWR-003: Battery Discharge Alert at Hub-Rohan-East-49. Active for 3 minutes. Multiple downstream alarms generated. 
+2024-07-04 02:09 - Checked external factors near Hub-Rohan-East-49. No area-wide issues. +2024-07-04 02:11 - Queried power subsystems. UPS: offline — bypass mode. Battery: 34% and dropping rapidly. Generator: running — full load. Rectifier: failed — no DC output. +2024-07-04 02:13 - Queried element health at Hub-Rohan-East-49. Equipment within operational limits. +2024-07-04 02:17 - Attempted remote generator start — Partial success — 2 of 3 units recovered. +2024-07-04 02:19 - Verified recovery — Full recovery confirmed. Alarm cleared at 02:24. All alarms cleared.",Celeborn of the Grey Havens,2024-07-04 02:07:53,2024-07-04 02:44:53,Battery backup depleting without AC restoration,Resolved,PWR-003: Battery Discharge Alert,0.6,FALSE,Celeborn of the Grey Havens,hard_solve +INCME-100016,Gondor,Relay Transport,Power Source,RAN,Dwarven Network Systems,Calembel,Resolved,RAN-001: Cell Service Interruption,MAJOR,Station-Gondor-Primary-48,"NOC Engineer's Workflow Outcome: Failed + +Actions taken: +- Verified Site Outage +- Dispatched Field Technician +- Final Status Verification + +Root cause: Cell completely unavailable for service. +Problem type: RAN-001 + +Resolution: +- Identified the cause of the outage - cell unavailable +- Verified the issue was resolved by deploying a new cell. +- Notified stakeholders of the issue and its resolution. +- Recommended improvements to the network infrastructure to prevent similar issues from occurring in the future.",2024-09-22 20:04:49,2024-09-22 20:23:49,Pelargir Port Authority,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Station-Gondor-Primary-48. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Escalation ticket created for vendor engagement on firmware issue. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-09-22 20:04 - Alarm: RAN-001: Cell Service Interruption at Station-Gondor-Primary-48. Severity: MAJOR. 
Active 12 minutes. Partial outage — 2 of 3 sectors affected. +2024-09-22 20:06 - Connectivity check to Station-Gondor-Primary-48 — reachable, latency nominal. +2024-09-22 20:09 - Queried element health at Station-Gondor-Primary-48. Equipment within operational limits. +2024-09-22 20:12 - Validated running configuration on Station-Gondor-Primary-48. No mismatches detected. +2024-09-22 20:16 - Escalated to DevOps via TK-20240015: Vendor escalation for firmware issue. +2024-09-22 20:20 - Verified: Service fully restored. All metrics back to baseline. All sectors broadcasting normally. Alarms cleared.",Orophin the Fair,2024-09-22 20:04:49,2024-09-22 20:23:49,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Orophin the Fair,soft_solve +INCME-100017,Rivendell,Path Signals,Relay Unit,POWER,Elven Forge Technologies,Hollin Gate,Resolved,PWR-003: Battery Discharge Alert,MAJOR,Array-Rivendell-West-04,"Incident Closure Summary: + +1. Restore Action: Restored the battery backup to full capacity. + +2. Reason For Outage: The battery backup depleted without AC restoration, which resulted in a power outage. + +3. Root Cause: The battery backup depleted without AC restoration due to a power outage. + +4. Problem Type: The PWR-003 problem type refers to a power outage that affects the battery backup. + +5. Summary: The incident was resolved by restoring the battery backup to full capacity, and the power outage was caused by a power outage.",2024-09-28 20:07:53,2024-09-28 20:20:53,Istari Advisory Board,power_battery_discharge_response,Battery backup depleting without AC restoration at Array-Rivendell-West-04. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. 
Post-incident review scheduled.,"2024-09-28 20:08 - Verified alarm: PWR-003: Battery Discharge Alert at Array-Rivendell-West-04. Active for 6 minutes. Multiple downstream alarms generated. +2024-09-28 20:12 - Power telemetry check — AC input: restored — stabilizing. UPS providing backup, battery at 45%. Generator running — full load. +2024-09-28 20:16 - Queried element health at Array-Rivendell-West-04. Equipment within operational limits. +2024-09-28 20:20 - Attempted remote reset — SUCCESS — reboot initiated, monitoring. +2024-09-28 20:22 - Verified recovery — Partial recovery — monitoring for stability. All alarms all cleared.",Pallando the Grey,2024-09-28 20:07:53,2024-09-28 20:20:53,Battery backup depleting without AC restoration,Resolved,PWR-003: Battery Discharge Alert,0.4,FALSE,Pallando the Grey,soft_solve +INCME-100018,Gondor,Signal Core,Path Router,RAN,Dwarven Network Systems,Pelargir,Resolved,RAN-002: Cell Administratively Disabled,MINOR,Spire-Gondor-West-54,"Closure: The root cause of the issue was identified as the Cell locked or disabled by management action. The issue was escalated to the Security Operations Center (SOC) for further investigation. The issue was verified and resolved by verifying that the Sector Alarm was triggered, scheduling a field repair for the affected field, and verifying coverage impact. The root cause was identified as a management action that locked or disabled the Cell, which caused the issue. The resolution summary is complete and concise, using NOC terminology.",2024-08-13 11:14:18,2024-08-13 11:41:18,Iron Hills Transport,ran_sector_outage_recovery,RAN-002: Cell Administratively Disabled at Spire-Gondor-West-54. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Automated recovery sequence triggered after remote intervention. Partial recovery achieved remotely. 
Field ticket remains open for permanent fix.,"2024-08-13 11:14 - Critical alarm received for Spire-Gondor-West-54: RAN-002: Cell Administratively Disabled. Checked alarm system — confirmed active since 11:11. All 3 sectors showing Cell Down. +2024-08-13 11:16 - Connectivity check to Spire-Gondor-West-54 — reachable, latency nominal. +2024-08-13 11:20 - Queried RF chain. RET controller not responding on Sector Alpha. Antenna tilt: 10°. PA status: degraded. +2024-08-13 11:23 - Queried element health at Spire-Gondor-West-54. Equipment within operational limits. +2024-08-13 11:25 - Validated running configuration on Spire-Gondor-West-54. No mismatches detected. +2024-08-13 11:28 - Attempted remote SCTP reset on Spire-Gondor-West-54. Partial success — 2 of 3 units recovered. +2024-08-13 11:31 - Verified: Full recovery confirmed. Alarm cleared at 11:24. All sectors broadcasting normally. Alarms cleared within 3 minutes.",Thráin Oakenshield,2024-08-13 11:14:18,2024-08-13 11:41:18,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.2,FALSE,Thráin Oakenshield,hard_solve +INCME-100019,Rohan,Path Signals,Signal Unit,POWER,Elven Forge Technologies,Hornburg,Resolved,ENV-002: HVAC System Fault,WARNING,Gateway-Rohan-West-61,"I'm not able to perform actual NOC work or read your company's internal processes. 
However, based on the information you provided, here's a sample resolution summary for the incident closure notes: + +restore action: +- verification of climate control system malfunction +- dispatch of hvac technician to resolve issue +- verification of environmental status +- en-002 problem type + +reason for outage: +- climate control system malfunction + +root cause: +- malfunction of climate control system +- result of incorrect maintenance or failure of system components + +solution: +- verification of malfunction and resolution +- training for maintenance team +- maintenance and repair of system components + +closing note: +- restoration of service to normal operating conditions +- resolution of issue + +note: this summary is intended to be a concise and comprehensive summary of the incident closure notes. Please provide more detailed information if necessary.",2024-09-11 01:16:09,2024-09-11 01:48:09,Istari Advisory Board,env_hvac_fault_recovery,ENV-002: HVAC System Fault at Gateway-Rohan-West-61. Battery temperature alarm triggered by failed ventilation fan in battery room. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-11 01:16 - Environmental alarm: ENV-002: HVAC System Fault at Gateway-Rohan-West-61. Cabinet door sensor: OPEN for 23 minutes. +2024-09-11 01:19 - Checked environmental systems: HVAC Unit 1: fault — compressor off. Unit 2: running at full capacity but insufficient for current heat load. +2024-09-11 01:21 - Site health under environmental stress: Equipment within thermal margin but approaching critical. No service impact yet. +2024-09-11 01:23 - Validated running configuration on Gateway-Rohan-West-61. No mismatches detected. +2024-09-11 01:27 - Executed power cycle: SUCCESS — reboot initiated, monitoring. 
+2024-09-11 01:29 - Environmental recovery: Site operational, all services confirmed active.",Nori son of Glóin,2024-09-11 01:16:09,2024-09-11 01:48:09,Climate control system malfunction,Resolved,ENV-002: HVAC System Fault,0.2,FALSE,Nori son of Glóin,hard_solve +INCME-100020,Arnor,Path Signals,Signal Unit,POWER,Dwarven Network Systems,Bywater,Resolved,ENV-006: Battery High Temperature,MINOR,Node-Arnor-North-67,"After reviewing the incident logs and verifying that the battery temperature alarm was triggered, the NOC engineer determined that the battery string was the root cause of the issue. The engineer isolated the battery string and conducted a thorough inspection to determine if the battery was damaged or malfunctioning. The engineer found no signs of damage or malfunction, and the battery was replaced. + +As a result of this investigation, the engineer closed the incident by restoring the affected system to a working state. The root cause was identified, and the issue was resolved. The engineer ensured that this incident would not occur again and took necessary steps to ensure that this issue did not occur again. + +The engineer wrote a concise and professional resolution summary that included the root cause, the actions taken, and the root cause analysis. The summary was completed within the NOC's workflow and was used to document the incident and ensure that the incident was properly resolved.",2024-09-22 05:42:09,2024-09-22 06:05:09,Orthanc Technical Review,env_battery_temperature_response,ENV-006: Battery High Temperature at Node-Arnor-North-67. Cabinet overheating due to HVAC compressor failure. Single cooling unit insufficient for thermal load. Automated recovery sequence triggered after remote intervention. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-09-22 05:42 - Environmental alarm: ENV-006: Battery High Temperature at Node-Arnor-North-67. Battery bank temperature at 48°C — max rated 45°C. 
+2024-09-22 05:44 - Checked environmental systems: Battery room: 48°C. Ventilation fan: running. AC unit: offline since 06:00. +2024-09-22 05:47 - External conditions: Ambient temperature: 38°C (heat wave in region). No utility outages reported. +2024-09-22 05:50 - Site health under environmental stress: All equipment operational but thermal warning on power amplifiers. +2024-09-22 05:52 - Executed restart: COMMAND FAILED — element not responding. +2024-09-22 05:56 - Environmental recovery: Full recovery confirmed. Alarm cleared at 05:56.",Argonui the Heir,2024-09-22 05:42:09,2024-09-22 06:05:09,Battery thermal runaway risk detected,Resolved,ENV-006: Battery High Temperature,0.4,FALSE,Argonui the Heir,soft_solve +INCME-100021,Lothlórien,Path Signals,Keeper Stone,RAN,Elven Forge Technologies,Bree,Resolved,RAN-001: Cell Service Interruption,MAJOR,Outpost-Lothlorien-North-41,"Reopened from INCME-100016. + +NOC Engineer's Workflow Outcome: Failed + +Actions taken: +- Verified Site Outage +- Dispatched Field Technician +- Final Status Verification + +Root cause: Cell completely unavailable for service. +Problem type: RAN-001 + +Resolution: +- Identified the cause of the outage - cell unavailable +- Verified the issue was resolved by deploying a new cell. +- Notified stakeholders of the issue and its resolution. +- Recommended improvements to the network infrastructure to prevent similar issues from occurring in the future.",2024-09-23 05:23:49,2024-09-23 05:54:49,Hobbiton Watch Service,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Outpost-Lothlorien-North-41. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Remote corrective action applied successfully. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-23 05:24 - Critical alarm received for Outpost-Lothlorien-North-41: RAN-001: Cell Service Interruption. Checked alarm system — confirmed active since 05:19. 
All 3 sectors showing Cell Down. +2024-09-23 05:26 - Checked neighbors and topology. Adjacent sites normal — confirms localized fault. +2024-09-23 05:30 - RF status check — Sector Alpha: PA off, no radiation detected. VSWR: 1.2:1. TX power: 6dB below target. +2024-09-23 05:34 - Queried element health at Outpost-Lothlorien-North-41. Equipment within operational limits. +2024-09-23 05:38 - Attempted remote software rollback on Outpost-Lothlorien-North-41. SUCCESS — reboot initiated, monitoring. +2024-09-23 05:40 - Recovery confirmed — Service fully restored. All metrics back to baseline. Metrics stable for 15 minutes.",Háma of Helm's Deep,2024-09-23 05:23:49,2024-09-23 05:54:49,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.4,TRUE,Háma of Helm's Deep,soft_solve +INCME-100022,Rohan,Beacon Power,Junction Point,RAN,Dwarven Network Systems,Edoras,Resolved,RAN-018: Link Flapping Detected,CRITICAL,Beacon-Rohan-East-10,"I am not able to perform the actual work or access the system to generate a resolution summary. However, based on the provided text, the resolution summary for the workflow outcome of completing the outage is as follows: + +1. Restore Action: Verify PRB Status +2. Reason For Outage: Interface repeatedly transitioning between up and down + +The root cause of the outage was identified as an interface repeatedly transitioning between up and down. The root cause was resolved by verifying the PRB status and applying PRB optimization. + +The resolution summary includes the actions taken to resolve the issue, which is verifying the PRB status and applying PRB optimization. The resolution summary is concise and complete, using proper NOC terminology.",2024-07-24 20:44:01,2024-07-24 21:23:01,Istari Advisory Board,ran_prb_availability_resolution,RAN-018: Link Flapping Detected at Beacon-Rohan-East-10. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. 
Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-07-24 20:46 - Alarm: RAN-018: Link Flapping Detected at Beacon-Rohan-East-10. Severity: CRITICAL. Active 8 minutes. Single sector degraded, others normal. +2024-07-24 20:50 - Topology analysis: 2 neighbor sites also showing degraded metrics — possible area issue. Upstream path: healthy. +2024-07-24 20:54 - Queried RF chain. Sector Alpha: PA off, no radiation detected. Antenna tilt: 2°. PA status: standby. +2024-07-24 20:56 - Queried element health at Beacon-Rohan-East-10. Equipment within operational limits. +2024-07-24 20:59 - Attempted remote power cycle on Beacon-Rohan-East-10. Partial success — 2 of 3 units recovered. +2024-07-24 21:02 - Recovery confirmed — Site operational, all services confirmed active. Metrics stable for 15 minutes.",Herubrand of Rohan,2024-07-24 20:44:01,2024-07-24 21:23:01,Interface repeatedly transitioning between up and down,Resolved,RAN-018: Link Flapping Detected,0.4,FALSE,Herubrand of Rohan,hard_solve +INCME-100023,Lothlórien,Signal Network,Central Nexus,COMPUTE,Elven Forge Technologies,Osgiliath,Pending Resolution,CMP-004: Pod CrashLoopBackOff,MAJOR,Node-Lothlorien-Central-41,"In response to the issue of Container repeatedly crashing and restarting, the NOC engineer has conducted a thorough investigation and identified the root cause as a storage issue. The engineer has verified that the storage alarm was triggered due to an overloaded storage volume, and the volume has been restored to normal conditions. + +In order to address the issue, the engineer has restarted dependent workloads to ensure that they are running smoothly. Additionally, the engineer has escalated the issue to the storage team for further investigation and resolution. + +The engineer has also confirmed that the storage issue has been resolved, and the container is now operating normally. 
The engineer has written the resolution summary, which includes the following: + +- Verify Storage Alarm: The engineer confirmed that the storage alarm was triggered due to an overloaded storage volume. +- Check Volume Status: The engineer has confirmed that the volume has been restored to normal conditions. +- Restart Dependent Workloads: The engineer has restarted dependent workloads to ensure that they are running smoothly. +- Escalate to Storage Team: The engineer has escalated the issue to the storage team for further investigation and resolution. +- Resolution Summary: The engineer has written the resolution summary",2024-09-14 17:14:59,2024-09-14 17:46:59,Gondor Signal Authority,compute_storage_failure_recovery,CMP-004: Pod CrashLoopBackOff on Node-Lothlorien-Central-41. Container configuration error after recent deployment — environment variable referencing deleted secret. Configuration corrected and service restored remotely within SLA. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-09-14 17:15 - Alarm received: CMP-004: Pod CrashLoopBackOff at Node-Lothlorien-Central-41. Pod stuck in Terminating for 47 minutes. Checking container state. +2024-09-14 17:19 - Queried orchestrator — cu-control-plane-7b4d: Pending. Scheduled but unschedulable — resource constraints. +2024-09-14 17:21 - Inspected logs for cu-control-plane-7b4d: Last 20 restarts show same pattern: starts OK, then OOMKilled at ~200MB after 3-4 minutes of operation. +2024-09-14 17:25 - Checked compute node — Node OK, storage: 94% used. +2024-09-14 17:29 - Ticket TK-20240022 created for vendor support: Remote resolution unsuccessful, physical inspection required. +2024-09-14 17:31 - Ran diagnostic suite on Node-Lothlorien-Central-41. Results collected for analysis. +2024-09-14 17:35 - Recovery check: Site operational, all services confirmed active. 
All probes passing.",Erestor Starlight,2024-09-14 17:14:59,2024-09-14 17:46:59,Container repeatedly crashing and restarting,Pending Resolution,CMP-004: Pod CrashLoopBackOff,0.4,FALSE,Erestor Starlight,hard_solve +INCME-100024,Rivendell,Path Signals,Central Nexus,COMPUTE,Elven Forge Technologies,Rivendell,Resolved,CMP-008: DU Function Pod Restart,CRITICAL,Beacon-Rivendell-East-53,"The incident was resolved by verifying that the Resource Alarm was verified and that the Resource Status was consistent with the expected state. In addition, a Capacity Expansion request was sent to the cluster's resource provider to increase the available capacity. The root cause of the issue was verified to be CMP-008, which is an expected failure condition in distributed unit pods. The resolution summary includes the actions taken, which were to verify the Resource Alarm and request Capacity Expansion, and the root cause of the issue, which was CMP-008. The summary is concise and complete, using NOC terminology to ensure accuracy.",2024-10-04 17:57:50,2024-10-04 18:13:50,Iron Hills Transport,compute_resource_exhaustion_resolution,CMP-008: DU Function Pod Restart on Beacon-Rivendell-East-53. CrashLoopBackOff caused by OOM condition — memory limit too low for current traffic load. Escalation ticket created for vendor engagement on firmware issue. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-10-04 17:59 - Alert: CMP-008: DU Function Pod Restart on smf-core-5d8b in namespace ran-prod at Beacon-Rivendell-East-53. ImagePullBackOff — registry unreachable. +2024-10-04 18:01 - Queried orchestrator — smf-core-5d8b: Error. Scheduled but unschedulable — resource constraints. +2024-10-04 18:04 - Inspected logs for smf-core-5d8b: Kubelet reports: 0/8 nodes available: insufficient memory. Current request: 4Gi, largest available: 2Gi. +2024-10-04 18:08 - Checked compute node — Node CPU: 92%, Memory: 87%. Under resource pressure. 
+2024-10-04 18:12 - Remote action: clear configuration. COMMAND SUCCESS — operation completed. +2024-10-04 18:15 - Ticket TK-20240023 created for power maintenance: Persistent fault after 2 remote attempts — need on-site investigation. +2024-10-04 18:18 - Ran diagnostic suite on Beacon-Rivendell-East-53. Results collected for analysis. +2024-10-04 18:22 - Verified — Service fully restored. All metrics back to baseline. Workload stable on new node.",Frodo Bracegirdle,2024-10-04 17:57:50,2024-10-04 18:13:50,Distributed Unit pod experienced unexpected restart,Resolved,CMP-008: DU Function Pod Restart,0.2,FALSE,Frodo Bracegirdle,soft_solve +INCME-100025,The Shire,Beacon Power,Junction Point,RAN,Dwarven Network Systems,Gladden Fields,Pending Resolution,SVC-003: Call Drop Rate Elevated,CRITICAL,Watch-TheShire-East-89,"After reviewing the complaint details and conducting an investigation, we determined that the root cause of the issue was a temporary Voice or data session drop rate above threshold. The root cause was identified as a hardware issue, which was addressed by replacing the faulty equipment. + +In order to restore service, we initiated a review of the SVC-003 problem type and documented the resolution. The root cause of the issue was identified, and the issue was resolved. + +In conclusion, the incident was resolved and the root cause was identified and documented. The root cause was a temporary Voice or data session drop rate above threshold, which was addressed by replacing faulty equipment. The resolution was documented in the incident closure notes.",2024-08-28 12:24:33,2024-08-28 12:32:33,Dúnedain Field Division,ran_dropped_calls_resolution,SVC-003: Call Drop Rate Elevated at Watch-TheShire-East-89. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Automated recovery sequence triggered after remote intervention. Partial recovery achieved remotely. 
Field ticket remains open for permanent fix.,"2024-08-28 12:26 - Critical alarm received for Watch-TheShire-East-89: SVC-003: Call Drop Rate Elevated. Checked alarm system — confirmed active since 12:20. Single sector degraded, others normal. +2024-08-28 12:28 - Connectivity check to Watch-TheShire-East-89 — reachable, latency nominal. +2024-08-28 12:32 - Queried element health at Watch-TheShire-East-89. Equipment within operational limits. +2024-08-28 12:35 - Escalated to DevOps via TK-20240024: Persistent fault after 2 remote attempts — need on-site investigation. +2024-08-28 12:39 - Verified: Partial recovery — monitoring for stability. All sectors broadcasting normally. Alarms cleared.",Aravir of Arnor,2024-08-28 12:24:33,2024-08-28 12:32:33,Voice or data session drop rate above threshold,Pending Resolution,SVC-003: Call Drop Rate Elevated,0.6,FALSE,Aravir of Arnor,soft_solve +INCME-100026,Gondor,Signal Network,Junction Point,COMPUTE,Elven Forge Technologies,Pelargir,Resolved,CMP-005: Pod Terminating Stuck,MINOR,Station-Gondor-Inner-65,"Closure Note: + +The root cause of the incident was a Pod stuck in terminating state beyond grace period. The incident was reported to the Platform Team, and an investigation was initiated. The Orchestrator Alarm was verified to be the root cause of the issue. + +The incident was escalated to the Platform Team, and a resolution plan was developed. The Orchestrator Recovery was initiated, and the Pod was successfully recovered. + +The root cause was identified, and the issue was resolved. The incident was closed with a resolution summary of ""Restore Action: Verify Orchestrator Alarm, Escalate to Platform Team, Verify Orchestrator Recovery, Root cause: Pod stuck in terminating state beyond grace period, Problem type: CMP-005.""",2024-10-25 05:45:02,2024-10-25 05:58:02,Osgiliath Bridge Operations,compute_orchestrator_recovery,CMP-005: Pod Terminating Stuck on Station-Gondor-Inner-65. 
Container configuration error after recent deployment — environment variable referencing deleted secret. Remote corrective action applied successfully. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-10-25 05:45 - Alarm received: CMP-005: Pod Terminating Stuck at Station-Gondor-Inner-65. ImagePullBackOff — registry unreachable. Checking container state. +2024-10-25 05:47 - Container logs show: Kubelet reports: 0/8 nodes available: insufficient memory. Current request: 4Gi, largest available: 2Gi. +2024-10-25 05:51 - Checked compute node — Node NotReady — kubelet unresponsive. +2024-10-25 05:55 - Validated running configuration on Station-Gondor-Inner-65. No mismatches detected. +2024-10-25 05:59 - Remote action: power cycle. COMMAND SUCCESS — operation completed. +2024-10-25 06:03 - Verified — Full recovery confirmed. Alarm cleared at 06:05. New pods healthy across 3 nodes.",Gildor of Rivendell,2024-10-25 05:45:02,2024-10-25 05:58:02,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.2,FALSE,Gildor of Rivendell,soft_solve +INCME-100027,Gondor,Arcane Engines,Keeper Stone,RAN,Elven Forge Technologies,Minas Ithil,Pending Resolution,RAN-014: CSR Unreachable,WARNING,Point-Gondor-East-73,"In summary, the NOC engineer resolved the issue of a cell site router management path being unavailable due to a faulty switch. The root cause was identified as a faulty switch. The issue was verified and resolved by verifying the Upgrade Failure, dispatching Field Support, and documenting and reporting the issue. The root cause was documented and reported. The engineer completed the resolution process by verifying the Upgrade Failure, discharging the incident, documenting and reporting the issue, and resolving it.",2024-07-19 23:37:38,2024-07-20 00:01:38,Erebor Relay Division,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Point-Gondor-East-73. 
Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Escalation ticket created for vendor engagement on firmware issue. Full service restoration confirmed. Post-incident review scheduled.,"2024-07-19 23:38 - Critical alarm received for Point-Gondor-East-73: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 23:29. All 3 sectors showing Cell Down. +2024-07-19 23:41 - Topology analysis: Adjacent sites normal — confirms localized fault. Upstream path: down. +2024-07-19 23:45 - Connectivity check to Point-Gondor-East-73 — reachable, latency nominal. +2024-07-19 23:49 - Queried element health at Point-Gondor-East-73. Equipment within operational limits. +2024-07-19 23:53 - Validated running configuration on Point-Gondor-East-73. No mismatches detected. +2024-07-19 23:55 - Created ticket TK-20240026 for field operations. Vendor escalation for firmware issue. +2024-07-19 23:58 - Verified: Partial recovery — monitoring for stability. All sectors operational and radiating. Alarms all cleared.",Celeborn Silvertree,2024-07-19 23:37:38,2024-07-20 00:01:38,Cell site router management path unavailable,Pending Resolution,RAN-014: CSR Unreachable,0.2,FALSE,Celeborn Silvertree,soft_solve +INCME-100028,Dale Province,Signal Core,Relay Unit,SIGNALING,Dwarven Network Systems,Ethring,Pending Resolution,SIG-009: E2 Interface Errors,CRITICAL,Point-DaleProvince-West-53,"NOC engineer reviewed the incident logs and identified that the root cause of the issue was the RAN Intelligent Controller interface errors. The issue was resolved by verifying the Delay Alarm and requesting a Network Review. The delay status was verified, and the root cause was determined to be RAN Intelligent Controller interface errors. The incident was escalated to the Network Operations Center (NOC) for further investigation and resolution. The resolution summary includes the close notes, which include the steps taken to resolve the issue. 
The engineer used NOC terminology throughout the summary.",2024-11-05 14:10:49,2024-11-05 14:31:49,Gondor Gateway Team,signaling_delay_resolution,SIG-009: E2 Interface Errors at Point-DaleProvince-West-53. Protocol errors caused by software version mismatch between CU and core network elements. Remote corrective action applied successfully. Full service restoration confirmed. Post-incident review scheduled.,"2024-11-05 14:12 - Signaling alarm: SIG-009: E2 Interface Errors at Point-DaleProvince-West-53. N2 interface to AMF lost connectivity at {t_minus_5}. +2024-11-05 14:15 - Connectivity check to Point-DaleProvince-West-53 — reachable, latency nominal. +2024-11-05 14:18 - Element health for Point-DaleProvince-West-53: CU processing normal. Checked signaling interface status. +2024-11-05 14:22 - Executed force restart — COMMAND FAILED — element not responding. +2024-11-05 14:24 - Ticket TK-20240027 for DevOps: Persistent fault after 2 remote attempts — need on-site investigation. +2024-11-05 14:27 - Protocol diagnostics: SCTP diagnostics: primary path failed, no multihoming configured. Single point of failure. +2024-11-05 14:30 - Verified: Site operational, all services confirmed active. SIP path healthy.",Thranduil of Mirkwood,2024-11-05 14:10:49,2024-11-05 14:31:49,RAN Intelligent Controller interface errors,Pending Resolution,SIG-009: E2 Interface Errors,0.4,FALSE,Thranduil of Mirkwood,soft_solve +INCME-100029,Gondor,Beacon Power,Power Source,RAN,Dwarven Network Systems,Osgiliath,Resolved,RAN-015: Fronthaul Link Down,MINOR,Point-Gondor-Central-31,"In the event of an issue with the overshoot detection feature, the following actions were taken: + +1.
Verify Overshoot Detection: A team member conducted a thorough analysis of the coverage pattern and identified an issue with the overshoot detection feature. The issue was resolved by reconfiguring the coverage pattern to ensure a more accurate detection of overshooting signals. + +2. Analyze Coverage Pattern: A team member conducted a detailed analysis of the coverage pattern to identify any potential issues with the overshoot detection feature. The issue was resolved by optimizing the coverage pattern to improve the detection accuracy. + +3. Request RF Optimization: As a result of the analysis, the team recommended that a RF optimization be performed to improve the overshoot detection feature's accuracy. The request was approved by the RF team and the optimization process was initiated. + +4. Monitor Coverage Impact: The team monitored the coverage impact of the RF optimization process to ensure that the overshoot detection feature was",2024-09-24 10:37:22,2024-09-24 10:52:22,Rangers of the North,ran_cell_overshooting_correction,RAN-015: Fronthaul Link Down at Point-Gondor-Central-31. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-24 10:37 - Critical alarm received for Point-Gondor-Central-31: RAN-015: Fronthaul Link Down. Checked alarm system — confirmed active since 10:29. Partial outage — 2 of 3 sectors affected. +2024-09-24 10:39 - Checked neighbors and topology. Upstream aggregation node healthy. Last-mile issue confirmed. +2024-09-24 10:42 - Queried element health at Point-Gondor-Central-31. Equipment within operational limits. +2024-09-24 10:46 - Executed clear configuration — COMMAND FAILED — element not responding. +2024-09-24 10:48 - Verified: Full recovery confirmed. Alarm cleared at 10:48. All sectors operational and radiating. 
Alarms cleared within 3 minutes.",Orophin of Rivendell,2024-09-24 10:37:22,2024-09-24 10:52:22,F1 interface between DU and CU interrupted,Resolved,RAN-015: Fronthaul Link Down,0.2,FALSE,Orophin of Rivendell,soft_solve +INCME-100030,Rohan,Signal Core,Junction Point,RAN,Elven Forge Technologies,Hornburg,Resolved,RAN-001: Cell Service Interruption,MAJOR,Gateway-Rohan-Primary-08,"Reason for Outage: A complete cellular network outage was reported on our site, resulting in a loss of service for our clients. The root cause of the issue was a power outage at the eNodeB. + +Restore Action: The issue was resolved by verifying the site outage, checking the backhaul status, checking the power status, attempting a remote ping, checking the eNodeB status, dispatching a field technician, and finalizing the status verification. + +Actions taken: +- Verified the site outage +- Checked the backhaul status +- Checked the power status +- Attempted a remote ping +- Checked the eNodeB status +- Dispatched a field technician +- Finalized the status verification + +Problem type: RAN-001 +Resolution: The issue was resolved successfully, and the site was restored to service.",2024-06-29 10:28:22,2024-06-29 11:11:22,Lothlórien Link Guardians,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Gateway-Rohan-Primary-08. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Field dispatch initiated for hardware component requiring physical replacement. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-06-29 10:28 - Alarm: RAN-001: Cell Service Interruption at Gateway-Rohan-Primary-08. Severity: MAJOR. Active 12 minutes. Sector Alpha down, Beta and Gamma operational. +2024-06-29 10:32 - Connectivity check to Gateway-Rohan-Primary-08 — reachable, latency nominal. +2024-06-29 10:35 - Queried RF chain.
RET controller not responding on Sector Alpha. Antenna tilt: 10°. PA status: active. +2024-06-29 10:38 - Queried element health at Gateway-Rohan-Primary-08. Equipment within operational limits. +2024-06-29 10:40 - Attempted remote generator start on Gateway-Rohan-Primary-08. COMMAND SUCCESS — operation completed. +2024-06-29 10:44 - Verified: Site operational, all services confirmed active. All sectors recovered. Alarms cleared.",Radagast the White,2024-06-29 10:28:22,2024-06-29 11:11:22,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Radagast the White,soft_solve +INCME-100031,Mordor Surveillance Zone,Path Signals,Weather Watch,RAN,Elven Forge Technologies,Isengard,Resolved,RAN-008: Dormant Cell Detected,MAJOR,Spire-MordorSurveillanceZone-Primary-45,"NOC Engineer's Note: + +The RRU Alarm was verified to be active, indicating a fault in the RRU. The Verification of RRU Alarm was done by checking the status of the RRU. The RRU Alarm was deemed as the root cause of the issue. + +To resolve the issue, the following steps were taken: +1. Verify the CPRI/Fiber Link status to ensure there was no fault. +2. Verify the BBU status to ensure there was no fault. +3. Attempt RRU Reset to restore the RRU. +4. Verify Sector Status to ensure there was no fault. + +The root cause of the issue was the extended period of no user activity for the RRU. This was verified by checking the status of the RRU and verifying that there was no fault. The RRU Alarm was verified as the root cause of the issue, and the steps outlined above were taken to resolve the issue.",2024-06-04 13:48:44,2024-06-04 14:31:44,Wizards Council Escalation,ran_rru_communication_recovery,RAN-008: Dormant Cell Detected at Spire-MordorSurveillanceZone-Primary-45.
Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Automated recovery sequence triggered after remote intervention. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-06-04 13:49 - Alarm: RAN-008: Dormant Cell Detected at Spire-MordorSurveillanceZone-Primary-45. Severity: MAJOR. Active 8 minutes. Partial outage — 2 of 3 sectors affected. +2024-06-04 13:52 - Checked neighbors and topology. 2 neighbor sites also showing degraded metrics — possible area issue. +2024-06-04 13:56 - Queried element health at Spire-MordorSurveillanceZone-Primary-45. Equipment within operational limits. +2024-06-04 13:58 - Validated running configuration on Spire-MordorSurveillanceZone-Primary-45. No mismatches detected. +2024-06-04 14:00 - Attempted remote clear configuration on Spire-MordorSurveillanceZone-Primary-45. COMMAND FAILED — element not responding. +2024-06-04 14:03 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 14:03. Metrics stable for 15 minutes.",Bergil of the Tower Guard,2024-06-04 13:48:44,2024-06-04 14:31:44,Cell operational but no user activity for extended period,Resolved,RAN-008: Dormant Cell Detected,0.4,FALSE,Bergil of the Tower Guard,soft_solve +INCME-100032,Rhovanion,Arcane Engines,Path Router,COMPUTE,Elven Forge Technologies,Dale,Resolved,CMP-002: Pod Container Creating,CRITICAL,Array-Rhovanion-Outer-05,"Closure Note: + +1. Restore Action: +The root cause of the issue was a Pod stuck in ContainerCreating state. The issue was resolved by verifying the Container Alarm, escalating to DevOps, verifying the Container Recovery, and verifying the Pod was restored to its original state. + +2. Reason for Outage: +The Pod stuck in ContainerCreating state was caused by a configuration issue with the application. The root cause was identified and resolved. + +3. Outcome: +The issue was resolved and the Pod was restored to its original state. 
The root cause of the issue was identified and resolved, and the issue was closed. + +Note: This closure note does not include any information about the specific configuration issues that led to the outage. This is intended to be a concise summary of the steps taken to resolve the issue.",2024-10-30 05:26:00,2024-10-30 05:53:00,Elven Signal Keepers,compute_container_crash_recovery,CMP-002: Pod Container Creating on Array-Rhovanion-Outer-05. Node resource exhaustion prevented pod scheduling. Horizontal autoscaler at maximum replica count. Automated recovery sequence triggered after remote intervention. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-10-30 05:27 - Alert: CMP-002: Pod Container Creating on gnb-scheduler-6f7c in namespace control-plane at Array-Rhovanion-Outer-05. Pod stuck in Terminating for 47 minutes. +2024-10-30 05:30 - Inspected logs for gnb-scheduler-6f7c: Pull error: dial tcp registry.internal:5000: connect: connection refused. Registry pod on same node is also down. +2024-10-30 05:34 - Queried orchestrator — gnb-scheduler-6f7c: CrashLoopBackOff. Image pull attempts: 8. Last error: connection refused. +2024-10-30 05:38 - Node health: Node healthy — issue isolated to pod. +2024-10-30 05:40 - Remote action: generator start. COMMAND SUCCESS — operation completed. +2024-10-30 05:42 - Ticket TK-20240031 created for core operations: Vendor escalation for firmware issue. +2024-10-30 05:46 - Diagnostics: Config audit: found stale resource limits from previous deployment. +2024-10-30 05:49 - Recovery check: Service fully restored. All metrics back to baseline. 
All probes passing.",Elfhelm of Helm's Deep,2024-10-30 05:26:00,2024-10-30 05:53:00,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.2,FALSE,Elfhelm of Helm's Deep,soft_solve +INCME-100033,Rohan,Relay Transport,Beacon Array,POWER,Elven Forge Technologies,Aldburg,Resolved,ENV-006: Battery High Temperature,WARNING,Tower-Rohan-Lower-84,"Reason for Outage: A battery thermal runaway risk was detected during the monitoring of the Battery Temperature Alarm. The root cause of the issue was isolated and resolved, and the system was brought back online. The incident was reported to the NOC team and documented in the incident closure notes. The resolution summary includes the following: + +Restore Action: +- Verify Battery Temperature Alarm +- Check Battery Status +- Isolate Battery String +- Emergency Dispatch +- Monitor Battery Temperature + +The root cause of the issue was identified as a battery thermal runaway risk, which was resolved by isolating the affected battery string and isolating the battery string from the Battery Temperature Alarm. The incident was documented in the incident closure notes and is now closed.",2024-10-26 19:39:57,2024-10-26 20:03:57,Rivendell Array Management,env_battery_temperature_response,ENV-006: Battery High Temperature at Tower-Rohan-Lower-84. Battery temperature alarm triggered by failed ventilation fan in battery room. Escalation ticket created for vendor engagement on firmware issue. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-10-26 19:39 - Environmental alarm: ENV-006: Battery High Temperature at Tower-Rohan-Lower-84. Battery bank temperature at 48°C — max rated 45°C. +2024-10-26 19:42 - External conditions: Maintenance window for HVAC was scheduled but not executed — vendor no-show. +2024-10-26 19:45 - Site health under environmental stress: All equipment operational but thermal warning on power amplifiers. 
+2024-10-26 19:47 - Validated running configuration on Tower-Rohan-Lower-84. No mismatches detected. +2024-10-26 19:51 - Executed SCTP reset: Partial success — 2 of 3 units recovered. +2024-10-26 19:53 - Environmental recovery: Partial recovery — monitoring for stability.",Aravir Wingfoot,2024-10-26 19:39:57,2024-10-26 20:03:57,Battery thermal runaway risk detected,Resolved,ENV-006: Battery High Temperature,0.2,FALSE,Aravir Wingfoot,hard_solve +INCME-100034,Rohan,Signal Core,Weather Watch,RAN,Elven Forge Technologies,Meduseld,Resolved,RAN-019: PTP Synchronization Failure,MINOR,Node-Rohan-West-51,"NOC engineer closed incident CIR-17-0013, which involved a failed Voice Quality Alert notification. + +Restore Action: +- Verified Voice Quality Alert notification was successfully restored + +Reason for Outage: +- The root cause of the failure was Precision Time Protocol timing reference lost +- The Voice Quality Alert notification was sent at 10:37 AM, but the timing reference was lost at 10:38 AM, resulting in a failed notification. + +Outcome: +- The incident was successfully resolved and the Voice Quality Alert notification was restored. + +Note: The Voice Quality Alert notification was sent using the Precision Time Protocol (PTP) timing reference, which is used to synchronize the timing of all devices on the network. The timing reference was lost due to a hardware failure.",2024-07-03 08:40:17,2024-07-03 08:55:17,Erebor Relay Division,ran_voice_quality_resolution,RAN-019: PTP Synchronization Failure at Node-Rohan-West-51. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Configuration corrected and service restored remotely within SLA. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-07-03 08:41 - Critical alarm received for Node-Rohan-West-51: RAN-019: PTP Synchronization Failure. Checked alarm system — confirmed active since 08:37. 
Sector Alpha down, Beta and Gamma operational. +2024-07-03 08:44 - Queried RF chain. Sector Beta: high VSWR (3.2:1), possible feeder issue. Antenna tilt: 10°. PA status: standby. +2024-07-03 08:48 - Connectivity check to Node-Rohan-West-51 — reachable, latency nominal. +2024-07-03 08:50 - Queried element health at Node-Rohan-West-51. Equipment within operational limits. +2024-07-03 08:52 - Validated running configuration on Node-Rohan-West-51. No mismatches detected. +2024-07-03 08:56 - Attempted remote unlock cells on Node-Rohan-West-51. SUCCESS — reboot initiated, monitoring. +2024-07-03 09:00 - Verified: Service fully restored. All metrics back to baseline. All sectors recovered. Alarms all cleared.",Aragorn the Heir,2024-07-03 08:40:17,2024-07-03 08:55:17,Precision Time Protocol timing reference lost,Resolved,RAN-019: PTP Synchronization Failure,0.4,FALSE,Aragorn the Heir,soft_solve +INCME-100035,Rohan,Arcane Engines,Weather Watch,POWER,Dwarven Network Systems,Grimslade,Pending Resolution,PWR-001: AC Power Failure,CRITICAL,Watch-Rohan-Outer-12,"Resolution Summary: + +1. Verify AC Power Loss: After verifying that AC power was restored, the issue was resolved. + +2. Check UPS Status: The UPS system was checked for any anomalies or issues, and they were resolved. + +3. Verify Battery Runtime: The battery runtime was verified to ensure that it was still functioning properly. + +4. Dispatch Field Technician: A field technician was dispatched to resolve the issue. + +5. Notify Utility Provider: The utility provider was notified of the issue and the necessary steps were taken to restore power. + +6. Final Verification: The issue was resolved and the power was restored. + +Root cause: The interruption or failure of the commercial power supply. + +Problem type: PWR-001.",2024-06-05 06:53:13,2024-06-05 07:27:13,Gondor Gateway Team,power_ac_failure_recovery,Commercial power supply interruption or failure at Watch-Rohan-Outer-12. 
DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Escalation ticket created for vendor engagement on firmware issue. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-06-05 06:55 - Checked alarm system — confirmed PWR-001: AC Power Failure active since 06:49. Site Watch-Rohan-Outer-12 showing power-related alarms. +2024-06-05 06:58 - Checked external factors near Watch-Rohan-Outer-12. No area-wide issues. +2024-06-05 07:01 - Power telemetry check — AC input: restored — stabilizing. UPS providing backup, battery at 82%. Generator running — full load. +2024-06-05 07:04 - Queried element health at Watch-Rohan-Outer-12. Equipment within operational limits. +2024-06-05 07:06 - Checked and adjusted power thresholds/configuration. Rolled back to last known good configuration. +2024-06-05 07:09 - Created P1 ticket TK-20240034 for power maintenance. Issue: Hardware replacement may be needed. +2024-06-05 07:13 - Verified recovery — Service fully restored. All metrics back to baseline. All alarms all cleared.",Elrohir of Lothlórien,2024-06-05 06:53:13,2024-06-05 07:27:13,Commercial power supply interruption or failure,Pending Resolution,PWR-001: AC Power Failure,0.2,FALSE,Elrohir of Lothlórien,soft_solve +INCME-100036,Rhovanion,Relay Transport,Junction Point,COMPUTE,Dwarven Network Systems,Beorn's Hall,Resolved,Node not functional,MAJOR,Point-Rhovanion-Central-49,"NOC Engineer's closure note: + +Restore Action: +- Verified that the VM Alarm was triggered due to a Compute node being completely unavailable. The issue was resolved by verifying the VM recovery and escalating the issue to the Cloud Team. +- The root cause was confirmed to be the Compute node being completely unavailable. + +Reason for Outage: +- The Compute node was completely unavailable, leading to a disruption in VM operations. 
+ +Close notes: +- Verified that the VM Alarm was triggered due to a Compute node being completely unavailable. +- The root cause was confirmed to be the Compute node being completely unavailable. +- Verified that the issue was resolved by verifying the VM recovery and escalating the issue to the Cloud Team.",2024-08-11 13:37:30,2024-08-11 13:59:30,Bywater Observation Post,compute_vm_failure_recovery,Node not functional on Point-Rhovanion-Central-49. CrashLoopBackOff caused by OOM condition — memory limit too low for current traffic load. Escalation ticket created for vendor engagement on firmware issue. Full service restoration confirmed. Post-incident review scheduled.,"2024-08-11 13:39 - Alert: Node not functional on upf-data-plane-2e4a in namespace core-5g at Point-Rhovanion-Central-49. Pod stuck in Terminating for 47 minutes. +2024-08-11 13:43 - Orchestration: migrated workload to healthy node — Pods rescheduled on healthy nodes. +2024-08-11 13:45 - Checked compute node — Node CPU: 92%, Memory: 87%. Under resource pressure. +2024-08-11 13:47 - Validated running configuration on Point-Rhovanion-Central-49. No mismatches detected. +2024-08-11 13:50 - Remote action: SCTP reset. SUCCESS — reboot initiated, monitoring. +2024-08-11 13:54 - Verified — Full recovery confirmed. Alarm cleared at 13:52.
Workload stable on new node.",Fredegar Bracegirdle,2024-08-11 13:37:30,2024-08-11 13:59:30,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Fredegar Bracegirdle,hard_solve +INCME-100037,Rivendell,Path Signals,Weather Watch,RAN,Elven Forge Technologies,Grey Havens,Resolved,RAN-019: PTP Synchronization Failure,MINOR,Spire-Rivendell-Inner-26,"Reopened from INCME-100034. + +Resolution Summary: + +1. Verify Voice Quality Alert: Voice Quality Alerts were received for a Precision Time Protocol (PTP) timing reference that was lost. The root cause of the issue was identified as a timing reference loss. The issue was resolved by verifying the PTP timing reference and restoring the Voice Quality Alerts. + +2. Reason for Outage: The root cause of the issue was a timing reference loss. + +3. Restore Action: Voice Quality Alerts were restored to normal levels. + +4. Root cause: The PTP timing reference was lost. + +5. Problem type: RAN-019. + +6. Action taken: Verified the PTP timing reference and restored Voice Quality Alerts.",2024-07-05 05:55:17,2024-07-05 06:27:17,Orthanc Technical Review,ran_voice_quality_resolution,RAN-019: PTP Synchronization Failure at Spire-Rivendell-Inner-26. Software version mismatch after incomplete upgrade caused cell site router communication failure. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Full service restoration confirmed. Post-incident review scheduled.,"2024-07-05 05:56 - Critical alarm received for Spire-Rivendell-Inner-26: RAN-019: PTP Synchronization Failure. Checked alarm system — confirmed active since 05:48. Sector Alpha down, Beta and Gamma operational.
+2024-07-05 06:00 - Connectivity check to Spire-Rivendell-Inner-26 — reachable, latency nominal. +2024-07-05 06:04 - Topology analysis: No common alarms on adjacent sites. Issue isolated to this element. Upstream path: down. +2024-07-05 06:08 - Queried element health at Spire-Rivendell-Inner-26. Equipment within operational limits. +2024-07-05 06:12 - Executed unlock cells — COMMAND FAILED — element not responding. +2024-07-05 06:14 - Created ticket TK-20240036 for transport team. Persistent fault after 2 remote attempts — need on-site investigation. +2024-07-05 06:16 - Verified: Site operational, all services confirmed active. All sectors recovered. Alarms cleared within 3 minutes.",Glóin son of Glóin,2024-07-05 05:55:17,2024-07-05 06:27:17,Precision Time Protocol timing reference lost,Resolved,RAN-019: PTP Synchronization Failure,0.2,TRUE,Glóin son of Glóin,soft_solve +INCME-100038,Lothlórien,Arcane Engines,Weather Watch,RAN,Dwarven Network Systems,Eregion Post,Resolved,RAN-014: CSR Unreachable,MAJOR,Array-Lothlorien-Lower-12,"In summary, the NOC engineer resolved the issue of a cell site router management path being unavailable due to a faulty switch. The root cause was identified as a faulty switch. The issue was verified and resolved by verifying the Upgrade Failure, dispatching Field Support, and documenting and reporting the issue. The root cause was documented and reported. The engineer completed the resolution process by verifying the Upgrade Failure, discharging the incident, documenting and reporting the issue, and resolving it.",2024-10-03 19:19:57,2024-10-03 19:34:57,White Tower Operations,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Array-Lothlorien-Lower-12. Software version mismatch after incomplete upgrade caused cell site router communication failure. Remote corrective action applied successfully. Partial recovery achieved remotely. 
Field ticket remains open for permanent fix.,"2024-10-03 19:19 - Critical alarm received for Array-Lothlorien-Lower-12: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 19:12. Single sector degraded, others normal. +2024-10-03 19:22 - Connectivity check to Array-Lothlorien-Lower-12 — reachable, latency nominal. +2024-10-03 19:26 - Checked neighbors and topology. 2 neighbor sites also showing degraded metrics — possible area issue. +2024-10-03 19:29 - Queried element health at Array-Lothlorien-Lower-12. Equipment within operational limits. +2024-10-03 19:31 - Executed unlock cells — Partial success — 2 of 3 units recovered. +2024-10-03 19:34 - Verified: Site operational, all services confirmed active. All sectors operational and radiating. Alarms self-cleared after fix applied.",Anborn of Dol Amroth,2024-10-03 19:19:57,2024-10-03 19:34:57,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Anborn of Dol Amroth,hard_solve +INCME-100039,Dale Province,Signal Network,Power Source,RAN,Dwarven Network Systems,Esgaroth,Resolved,RAN-013: Site Communication Failure,CRITICAL,Spire-DaleProvince-Lower-09,"NOC Engineer: Investigation and Root Cause Analysis (RCA) completed. + +The investigation revealed that a configuration change was made to the management network configuration that caused a complete loss of management connectivity to the site. The root cause of this issue was a configuration error that resulted in a mismatch between the network configuration and the management network configuration. + +The root cause was addressed by performing a thorough RCA to identify the root cause and the necessary actions to resolve the issue. The following actions were taken: + +1. Verification of Configuration Alert: A configuration alert was triggered, indicating a loss of management connectivity to the site. + +2. Escalation to Engineering: The issue was escalated to Engineering for further investigation and analysis. + +3. 
Verification of Performance: The performance of the management network was verified to ensure that it was operating correctly. + +4. Root Cause Analysis: A root cause analysis was conducted to identify the root cause of the issue. + +5. Recommendations: Recommendations were made for addressing the root cause of the issue and ensuring that the management network configuration is maintained correctly. + +6. Restore Action: Restoration of the management network configuration",2024-07-04 00:30:36,2024-07-04 00:40:36,Osgiliath Bridge Operations,ran_parameter_correction,RAN-013: Site Communication Failure at Spire-DaleProvince-Lower-09. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Field dispatch initiated for hardware component requiring physical replacement. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-07-04 00:31 - Critical alarm received for Spire-DaleProvince-Lower-09: RAN-013: Site Communication Failure. Checked alarm system — confirmed active since 00:27. Sector Alpha down, Beta and Gamma operational. +2024-07-04 00:33 - Connectivity check to Spire-DaleProvince-Lower-09 — reachable, latency nominal. +2024-07-04 00:36 - Queried RF chain. Sector Beta: high VSWR (3.2:1), possible feeder issue. Antenna tilt: 4°. PA status: degraded. +2024-07-04 00:38 - Queried element health at Spire-DaleProvince-Lower-09. Equipment within operational limits. +2024-07-04 00:41 - Attempted remote reset on Spire-DaleProvince-Lower-09. Partial success — 2 of 3 units recovered. +2024-07-04 00:45 - Created ticket TK-20240038 for vendor support. Hardware replacement may be needed. +2024-07-04 00:49 - Verified: Partial recovery — monitoring for stability. All sectors operational and radiating. 
Alarms all cleared.",Paladin Baggins,2024-07-04 00:30:36,2024-07-04 00:40:36,Complete loss of management connectivity to site,Resolved,RAN-013: Site Communication Failure,0.2,FALSE,Paladin Baggins,hard_solve +INCME-100040,Dale Province,Path Signals,Relay Unit,RAN,Elven Forge Technologies,Bree,Resolved,RAN-002: Cell Administratively Disabled,CRITICAL,Hub-DaleProvince-Outer-05,"Incident Closure Summary: + +Resolution: The root cause of the RAN-002 incident was a cell being locked or disabled by management action. The issue was resolved by verifying the Sector Alarm, checking the RF Chain Status, enabling capacity compensation, scheduling a field repair, verifying coverage impact, and verifying the correct management action was taken. The incident was escalated to a higher-level team for further investigation and resolution. + +NOC Terminology: +- RAN-002: Cell Administratively Disabled +- Sector Alarm: Alarm generated by the RF Chain +- Capacity Compensation: Capacity adjustment to compensate for a cell being locked or disabled +- Field Repair: Repair of the affected RF Chain +- Verify Coverage Impact: Verification of the coverage impact of the RF Chain after repair +- Cell locked/disabled: A cell taken out of service by management action, preventing it from carrying traffic",2024-11-08 00:19:08,2024-11-08 00:54:08,Shire Monitoring Guild,ran_sector_outage_recovery,RAN-002: Cell Administratively Disabled at Hub-DaleProvince-Outer-05. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-11-08 00:20 - Critical alarm received for Hub-DaleProvince-Outer-05: RAN-002: Cell Administratively Disabled.
Checked alarm system — confirmed active since 00:11. Site completely unreachable. +2024-11-08 00:24 - Connectivity check to Hub-DaleProvince-Outer-05 — reachable, latency nominal. +2024-11-08 00:28 - Topology analysis: Adjacent sites normal — confirms localized fault. Upstream path: healthy. +2024-11-08 00:32 - Queried element health at Hub-DaleProvince-Outer-05. Equipment within operational limits. +2024-11-08 00:34 - Attempted remote restart on Hub-DaleProvince-Outer-05. Partial success — 2 of 3 units recovered. +2024-11-08 00:38 - Recovery confirmed — Service fully restored. All metrics back to baseline. KPIs recovering — within 90% of baseline.",Saruman the White,2024-11-08 00:19:08,2024-11-08 00:54:08,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.4,FALSE,Saruman the White,hard_solve +INCME-100041,Lothlórien,Relay Transport,Junction Point,POWER,Elven Forge Technologies,East Bight,Resolved,PWR-002: DC Rectifier Failure,MAJOR,Hub-Lothlorien-South-24,"After confirming the rectifier alarm, enabling battery monitoring, and performing a final status check, the NOC engineer has completed the incident closure notes for PWR-002, a power supply issue that occurred at the facility. The root cause of the issue was a DC power conversion unit malfunction. The engineer has recorded this in the incident closure notes and has provided a resolution summary that includes the following: + +1. Restore Action: Enable battery monitoring and rectifier alarm +2. Reason For Outage: The malfunction of the DC power conversion unit + +The engineer has also included the final status check, which confirmed that the issue was resolved and the power supply functioned correctly. The engineer has recorded this in the notes and has provided a resolution summary that includes the root cause and the resolution taken to restore the facility's power supply. 
The engineer has ensured that all necessary documentation is included in the notes, including the incident summary, the incident closure notes, and the final status check.",2024-08-25 23:52:32,2024-08-26 00:14:32,Dwarven Deep Network,power_dc_rectifier_recovery,DC power conversion unit malfunction at Hub-Lothlorien-South-24. Battery bank reached low-voltage disconnect threshold during extended commercial power outage. Escalation ticket created for vendor engagement on firmware issue. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-08-25 23:53 - Verified alarm: PWR-002: DC Rectifier Failure at Hub-Lothlorien-South-24. Active for 10 minutes. Multiple downstream alarms generated. +2024-08-25 23:56 - Power telemetry check — AC input: restored — stabilizing. UPS providing backup, battery at 82%. Generator running — output unstable. +2024-08-26 00:00 - Queried element health at Hub-Lothlorien-South-24. Equipment within operational limits. +2024-08-26 00:02 - Executed remote command: power cycle. Result: COMMAND SUCCESS — operation completed. +2024-08-26 00:04 - Final check: Site operational, all services confirmed active. KPIs Metrics stable for 15 minutes.",Radagast the Brown,2024-08-25 23:52:32,2024-08-26 00:14:32,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,FALSE,Radagast the Brown,soft_solve +INCME-100042,Rohan,Arcane Engines,Path Router,POWER,Dwarven Network Systems,Dunharrow,Resolved,PWR-002: DC Rectifier Failure,MAJOR,Gateway-Rohan-Central-74,"NOC Engineer's Note: + +On [Date], an issue was detected in the [DC Power Conversion Unit (PWR-002)] that resulted in a loss of power to the [System/Appliance/Equipment (S/A/E)] for [Duration of Outage]. The root cause of the issue was a malfunction of the DC power conversion unit, which resulted in a voltage drop and loss of power. 
+ +Restore Action: + +The issue was resolved by confirming the rectifier alarm, checking the DC bus voltage, and scheduling field replacement. The final status check was conducted to ensure that the issue had been resolved. + +Reason for Outage: + +The malfunction of the DC power conversion unit resulted in a loss of power to the S/A/E, which caused a significant disruption in the system's functionality. + +Conclusion: + +The resolution of this issue was successful, and the system's functionality was restored. The root cause of the issue was identified and rectified, and the necessary steps were taken to ensure that similar issues do not occur in the future.",2024-10-16 14:40:14,2024-10-16 15:08:14,Dwarven Deep Network,power_dc_rectifier_recovery,DC power conversion unit malfunction at Gateway-Rohan-Central-74. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-10-16 14:40 - Verified alarm: PWR-002: DC Rectifier Failure at Gateway-Rohan-Central-74. Active for 12 minutes. Multiple downstream alarms generated. +2024-10-16 14:44 - Checked external factors near Gateway-Rohan-Central-74. No area-wide issues. +2024-10-16 14:46 - Queried element health at Gateway-Rohan-Central-74. Equipment within operational limits. +2024-10-16 14:48 - Created P2 ticket TK-20240041 for RF engineering. Issue: Persistent fault after 2 remote attempts — need on-site investigation. +2024-10-16 14:52 - Final check: Service fully restored. All metrics back to baseline. 
KPIs Metrics stable for 15 minutes.",Hamfast Bracegirdle,2024-10-16 14:40:14,2024-10-16 15:08:14,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.4,FALSE,Hamfast Bracegirdle,soft_solve +INCME-100043,Mirkwood,Relay Transport,Central Nexus,RAN,Dwarven Network Systems,Dol Guldur Watch,Pending Resolution,SVC-002: Data Throughput Degradation,CRITICAL,Point-Mirkwood-Inner-33,"In response to the complaint about user throughput being significantly below expected rates, the NOC engineer reviewed the complaint details and requested a detailed analysis. The engineer then documented the resolution steps, including reviewing the analysis, documenting the root cause, and restoring the service. The root cause was found to be a significant issue with user throughput, which was addressed by requesting a detailed analysis from the vendor. The engineer documented the resolution steps, including the review of the analysis, documenting the root cause, and restoring the service. The engineer completed the incident closure note, which included the resolution summary, complete with the root cause, restoring action, and reason for outage. The engineer used proper NOC terminology in the resolution summary.",2024-07-19 01:59:56,2024-07-19 02:22:56,Dúnedain Field Division,ran_speed_complaint_resolution,SVC-002: Data Throughput Degradation at Point-Mirkwood-Inner-33. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Escalation ticket created for vendor engagement on firmware issue. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-07-19 01:59 - Critical alarm received for Point-Mirkwood-Inner-33: SVC-002: Data Throughput Degradation. Checked alarm system — confirmed active since 01:56. All 3 sectors showing Cell Down. +2024-07-19 02:01 - RF status check — PA power output 6dB below target on affected sector. VSWR: 2.8:1. TX power: 3dB low. 
+2024-07-19 02:04 - Checked neighbors and topology. Upstream aggregation node healthy. Last-mile issue confirmed. +2024-07-19 02:08 - Queried element health at Point-Mirkwood-Inner-33. Equipment within operational limits. +2024-07-19 02:10 - Escalated to DevOps via TK-20240042: Hardware replacement may be needed. +2024-07-19 02:14 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 02:16. KPIs recovering — within 90% of baseline.",Araglas Wingfoot,2024-07-19 01:59:56,2024-07-19 02:22:56,User throughput significantly below expected rates,Pending Resolution,SVC-002: Data Throughput Degradation,0.2,FALSE,Araglas Wingfoot,soft_solve +INCME-100044,Rhovanion,Beacon Power,Beacon Array,RAN,Elven Forge Technologies,Dale,Resolved,RAN-005: RRC Setup Success Rate Degraded,MAJOR,Hub-Rhovanion-Inner-33,"Restore Action: + +Verify Congestion Alert: The Radio Resource Control (RRC) connection establishment degraded issue was resolved by verifying the Congestion Alert. The Congestion Alert is a signal from the radio access network (RAN) that indicates a significant congestion or delay in the radio network. In this case, the RAN successfully resolved the issue by enabling the Cell Barring feature, which prevents RRC connections from being established during periods of congestion. + +Reason for Outage: + +The root cause of this issue was the Radio Resource Control (RRC) connection establishment degraded. The issue was caused by a significant congestion or delay in the radio network. In this case, the Congestion Alert signal was triggered by the RAN, indicating a significant congestion or delay in the radio network. + +Closure: + +The root cause of this issue has been resolved, and the Radio Resource Control (RRC) connection establishment degraded issue has been closed. 
The Congestion Alert signal has been successfully enabled to prevent RRC connections from being established during periods of congestion.",2024-06-23 12:52:14,2024-06-23 13:22:14,Gondor Gateway Team,ran_cell_congestion_management,RAN-005: RRC Setup Success Rate Degraded at Hub-Rhovanion-Inner-33. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Configuration corrected and service restored remotely within SLA. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-06-23 12:53 - Alarm: RAN-005: RRC Setup Success Rate Degraded at Hub-Rhovanion-Inner-33. Severity: MAJOR. Active 6 minutes. Sector Alpha down, Beta and Gamma operational. +2024-06-23 12:57 - Topology analysis: Upstream aggregation node healthy. Last-mile issue confirmed. Upstream path: healthy. +2024-06-23 12:59 - RF status check — All sectors: TX power nominal, VSWR within spec. VSWR: 1.5:1. TX power: 6dB below target. +2024-06-23 13:01 - Queried element health at Hub-Rhovanion-Inner-33. Equipment within operational limits. +2024-06-23 13:05 - Validated running configuration on Hub-Rhovanion-Inner-33. No mismatches detected. +2024-06-23 13:07 - Attempted remote power cycle on Hub-Rhovanion-Inner-33. SUCCESS — reboot initiated, monitoring. +2024-06-23 13:10 - Verified: Site operational, all services confirmed active. All sectors operational and radiating. Alarms all cleared.",Alatar the White,2024-06-23 12:52:14,2024-06-23 13:22:14,Radio resource control connection establishment degraded,Resolved,RAN-005: RRC Setup Success Rate Degraded,0.4,FALSE,Alatar the White,soft_solve +INCME-100045,Rohan,Path Signals,Weather Watch,RAN,Dwarven Network Systems,Grimslade,Resolved,RAN-014: CSR Unreachable,MINOR,Array-Rohan-North-18,"Resolution Summary: + +The root cause of the failure was the cell site router management path unavailable. 
The issue was resolved by verifying the upgrade failure, dispatching Field Support, and documenting and reporting the issue. + +The problem type was RAN-014. + +The following actions were taken: +1. Verify upgrade failure: Verification of the upgrade failure was conducted to ensure that the issue was resolved. +2. Dispatch Field Support: A Field Support team member was dispatched to resolve the issue. +3. Document and report: A report was generated to document and report the issue. + +The resolution summary is a concise summary of the actions taken to resolve the issue. It includes the root cause, the actions taken, and the results achieved. The summary does not include additional details or information that may be relevant to the incident.",2024-06-21 20:47:29,2024-06-21 21:18:29,Dúnedain Field Division,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Array-Rohan-North-18. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Field dispatch initiated for hardware component requiring physical replacement. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-06-21 20:48 - Critical alarm received for Array-Rohan-North-18: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 20:43. Single sector degraded, others normal. +2024-06-21 20:52 - Topology analysis: No common alarms on adjacent sites. Issue isolated to this element. Upstream path: healthy. +2024-06-21 20:55 - Connectivity check to Array-Rohan-North-18 — reachable, latency nominal. +2024-06-21 20:57 - Queried element health at Array-Rohan-North-18. Equipment within operational limits. +2024-06-21 20:59 - Validated running configuration on Array-Rohan-North-18. No mismatches detected. +2024-06-21 21:02 - Executed reset — Partial success — 2 of 3 units recovered. +2024-06-21 21:05 - Verified: Site operational, all services confirmed active. All sectors operational and radiating. 
Alarms cleared.",Elfhelm son of Éomund,2024-06-21 20:47:29,2024-06-21 21:18:29,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Elfhelm son of Éomund,hard_solve +INCME-100046,Rhovanion,Arcane Engines,Beacon Array,SIGNALING,Dwarven Network Systems,Erebor,Resolved,SIG-001: N2 Interface Down,MINOR,Array-Rhovanion-South-96,"Restore Action: + +S1/N2 Alarm was verified to be resolved. + +Reason For Outage: + +The root cause of the issue was a control plane interface between the gNB and AMF that failed. This issue led to a loss of service for the S1/N2 alarm, which was resolved by verifying the S1/N2 alarm and verifying that the alarm was resolved. + +Conclusion: + +The issue was successfully resolved, and the S1/N2 alarm is now functioning correctly. The root cause of the issue was a control plane interface between the gNB and AMF that failed, which led to the loss of service for the S1/N2 alarm.",2024-08-13 19:52:48,2024-08-13 20:12:48,Iron Hills Transport,signaling_s1_n2_recovery,SIG-001: N2 Interface Down at Array-Rhovanion-South-96. Signaling path failure due to SCTP association timeout. Single-homed configuration had no failover. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-08-13 19:53 - Signaling alarm: SIG-001: N2 Interface Down at Array-Rhovanion-South-96. N2 interface to AMF lost connectivity at {t_minus_5}. +2024-08-13 19:55 - Connectivity check to Array-Rhovanion-South-96 — reachable, latency nominal. +2024-08-13 19:59 - Tested SIP path — Path degraded — 30% packet loss on signaling. +2024-08-13 20:01 - Element health for Array-Rhovanion-South-96: CU processing normal. Checked signaling interface status. +2024-08-13 20:04 - Validated running configuration on Array-Rhovanion-South-96. No mismatches detected. 
+2024-08-13 20:07 - Executed power cycle — SUCCESS — reboot initiated, monitoring. +2024-08-13 20:10 - Verified: Partial recovery — monitoring for stability. SIP path monitoring for stability.",Théoden of Helm's Deep,2024-08-13 19:52:48,2024-08-13 20:12:48,Control plane interface between gNB and AMF failed,Resolved,SIG-001: N2 Interface Down,0.4,FALSE,Théoden of Helm's Deep,soft_solve +INCME-100047,Mirkwood,Signal Network,Beacon Array,COMPUTE,Dwarven Network Systems,Rhosgobel,Resolved,Problematic VM,MAJOR,Relay-Mirkwood-South-14,"Resolution Summary: + +- Verified that the Virtual Machine had reached the Ready state and was in a healthy state. +- Notified the CNF Vendor and escalated the issue to their team. +- Verified that the CNF Recovery team was actively working on the issue and that they had a plan in place to resolve the issue. +- Notified the CNF Alarm team and requested that they investigate and address the issue as a priority. +- Notified the CNF Vendor and provided them with the resolution summary and a detailed description of the root cause and the steps taken to resolve the issue. +- Notified the CNF Recovery team and requested that they provide a status update on the recovery process. +- Notified the CNF Alarm team and requested that they verify that the Alarm is no longer triggered due to the issue being resolved. +- Notified the CNF Vendor and provided them with the resolution summary and a detailed description of the root cause and the steps taken to resolve the issue. +- Notified the CNF Recovery team and requested that they provide a status update on the recovery process. +- Notified the CNF Alarm team and requested that",2024-09-15 04:46:51,2024-09-15 05:01:51,Minas Tirith Central Command,compute_cnf_pod_recovery,Problematic VM on Relay-Mirkwood-South-14. Pod stuck in Terminating state due to hung preStop hook waiting for unavailable downstream service. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. 
Full service restoration confirmed. Post-incident review scheduled.,"2024-09-15 04:47 - Alarm received: Problematic VM at Relay-Mirkwood-South-14. ImagePullBackOff — registry unreachable. Checking container state. +2024-09-15 04:51 - Executed: scaled deployment to 0 and back to 3. Node drained successfully, workloads redistributed. +2024-09-15 04:53 - Pod status: ContainerCreating. Reason: NodeDrain. Scheduled but unschedulable — resource constraints. +2024-09-15 04:56 - Checked compute node — Node NotReady — kubelet unresponsive. +2024-09-15 04:59 - Remote action: reset. SUCCESS — reboot initiated, monitoring. +2024-09-15 05:01 - Verified — Partial recovery — monitoring for stability. New pods healthy across 3 nodes.",Alatar Greyhame,2024-09-15 04:46:51,2024-09-15 05:01:51,Virtual machine failed to reach ready state,Resolved,Problematic VM,0.2,FALSE,Alatar Greyhame,soft_solve +INCME-100048,Arnor,Arcane Engines,Weather Watch,SIGNALING,Dwarven Network Systems,Bree,Resolved,SIG-009: E2 Interface Errors,MAJOR,Spire-Arnor-West-40,"NOC engineer has completed the following incident closure notes: + +1. Verify Delay Alarm: +- Verified delay alarm was triggered due to RAN Intelligent Controller interface errors. +- Requested Network Review to determine root cause and corrective actions. +- Verified delay status and all relevant logs were retained. + +2. Restore Action: +- Verified RAN Intelligent Controller interface was fixed. +- Network Review confirmed root cause and corrective actions were implemented. +- All relevant logs were retained. + +Root cause: RAN Intelligent Controller interface errors led to the delay in the network response. The root cause was addressed, and the necessary corrective actions were taken. The incident closure notes include the verification of the delay alarm, request for a Network Review, and restoration of the RAN Intelligent Controller interface. 
The logs associated with the incident are also retained.",2024-07-01 03:17:59,2024-07-01 03:36:59,Iron Hills Transport,signaling_delay_resolution,SIG-009: E2 Interface Errors at Spire-Arnor-West-40. Interface errors correlated with signaling overload — capacity threshold reached during peak traffic. Automated recovery sequence triggered after remote intervention. Full service restoration confirmed. Post-incident review scheduled.,"2024-07-01 03:17 - Signaling alarm: SIG-009: E2 Interface Errors at Spire-Arnor-West-40. CU-DU communication path degraded. +2024-07-01 03:20 - Tested E2 path — Path degraded — 30% packet loss on signaling. +2024-07-01 03:23 - Resource health at Spire-Arnor-West-40: core network elements responding. Interface error rates checked. +2024-07-01 03:27 - Executed clear configuration — COMMAND FAILED — element not responding. +2024-07-01 03:31 - Verified: Full recovery confirmed. Alarm cleared at 03:35. E2 path recovered.",Arador son of Arador,2024-07-01 03:17:59,2024-07-01 03:36:59,RAN Intelligent Controller interface errors,Resolved,SIG-009: E2 Interface Errors,0.4,FALSE,Arador son of Arador,soft_solve +INCME-100049,Gondor,Arcane Engines,Keeper Stone,COMPUTE,Dwarven Network Systems,Linhir,Resolved,Node not functional,MAJOR,Watch-Gondor-North-69,"Incident closure summary: + +Restore Action: +- Verified that the compute node was unavailable, and it was due to a hardware failure. +- Verified that the root cause was the hardware failure, and a new compute node was deployed and brought online. +- Notified the cloud team to escalate the issue to them, and they confirmed that they had resolved the issue. + +Reason for Outage: +- The root cause was the hardware failure, which led to the unavailability of the compute node. + +Close notes: +- The incident was resolved successfully, and the compute node was restored to its normal state. +- The root cause of the issue was confirmed, and the appropriate action was taken to resolve the issue. 
+- The incident was reported to the incident management team, and a resolution summary was generated. +- The incident was closed with a resolution that included a restore action, a root cause analysis, and a resolution summary.",2024-07-07 22:06:34,2024-07-07 22:21:34,Helm's Deep Emergency Unit,compute_vm_failure_recovery,Node not functional on Watch-Gondor-North-69. Container configuration error after recent deployment — environment variable referencing deleted secret. Escalation ticket created for vendor engagement on firmware issue. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-07-07 22:08 - Alert: Node not functional on gnb-scheduler-6f7c in namespace monitoring at Watch-Gondor-North-69. Pod pending — insufficient resources on node. +2024-07-07 22:12 - Pod status: CrashLoopBackOff. Reason: ContainerConfigError. Finalizers: none. Node maintenance triggered 50 min ago. +2024-07-07 22:16 - Inspected logs for gnb-scheduler-6f7c: Last 20 restarts show same pattern: starts OK, then OOMKilled at ~200MB after 3-4 minutes of operation. +2024-07-07 22:20 - Checked compute node — Node healthy — issue isolated to pod. +2024-07-07 22:24 - Remote action: clear configuration. SUCCESS — reboot initiated, monitoring. +2024-07-07 22:27 - Verified — Site operational, all services confirmed active. New pods healthy across 3 nodes.",Erestor the Fair,2024-07-07 22:06:34,2024-07-07 22:21:34,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Erestor the Fair,hard_solve +INCME-100050,Rohan,Path Signals,Path Router,COMPUTE,Elven Forge Technologies,Snowbourn,Resolved,CMP-008: DU Function Pod Restart,CRITICAL,Watch-Rohan-Upper-08,"Restore Action: +- Verified that the Resource Alarm has been resolved and is no longer active. +- Analyzed Resource Usage and determined that the Resource Status was not affected by the Distributed Unit pod restart. +- Terminated Non-Critical Workloads to ensure the resource was not overutilized. 
+- Verified that the Resource Status has been restored to normal. + +Reason for Outage: +- Distributed Unit pod experienced unexpected restart, leading to the Resource Alarm being triggered. +- Resource Usage was affected, leading to Non-Critical Workloads being terminated. +- The Resource Status was not affected. + +Outcome: +- The incident has been resolved, and the Resource Status is now normal.",2024-10-18 03:43:02,2024-10-18 04:00:02,Osgiliath Bridge Operations,compute_resource_exhaustion_resolution,CMP-008: DU Function Pod Restart on Watch-Rohan-Upper-08. ImagePullBackOff due to internal registry pod failure on same node as requesting pod. Automated recovery sequence triggered after remote intervention. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-10-18 03:44 - Alert: CMP-008: DU Function Pod Restart on du-worker-3a1f in namespace control-plane at Watch-Rohan-Upper-08. Container OOMKilled — memory limit exceeded. +2024-10-18 03:47 - Executed: scaled deployment to 0 and back to 3. Container runtime restarted, pods recovering. +2024-10-18 03:50 - Checked compute node — Node NotReady — kubelet unresponsive. +2024-10-18 03:54 - Remote action: software rollback. Partial success — 2 of 3 units recovered. +2024-10-18 03:58 - Recovery check: Service fully restored. All metrics back to baseline. All probes passing.",Aravir the Heir,2024-10-18 03:43:02,2024-10-18 04:00:02,Distributed Unit pod experienced unexpected restart,Resolved,CMP-008: DU Function Pod Restart,0.2,FALSE,Aravir the Heir,hard_solve +INCME-100051,Rhovanion,Beacon Power,Weather Watch,RAN,Dwarven Network Systems,Framsburg,Resolved,RAN-014: CSR Unreachable,MINOR,Hub-Rhovanion-North-83,"Incident Summary: + +1. Restore Action: Verify Upgrade Failure + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Verified that the Cell site router management path was restored successfully. + +2. 
Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause analysis identified that the Cell site router management path was unavailable due to a system maintenance. +- Emergency boot was initiated to restore the management path. +- Field support was dispatched to address the issue. + +3. Restore Action: Check System Status + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Checked the system status to ensure that the Cell site router management path was restored successfully. + +4. Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause",2024-08-03 14:37:14,2024-08-03 15:25:14,Osgiliath Bridge Operations,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Hub-Rhovanion-North-83. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Automated recovery sequence triggered after remote intervention. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-08-03 14:38 - Critical alarm received for Hub-Rhovanion-North-83: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 14:33. Partial outage — 2 of 3 sectors affected. +2024-08-03 14:40 - Connectivity check to Hub-Rhovanion-North-83 — reachable, latency nominal. +2024-08-03 14:44 - Queried element health at Hub-Rhovanion-North-83. Equipment within operational limits. +2024-08-03 14:46 - Executed software rollback — COMMAND FAILED — element not responding. +2024-08-03 14:48 - Verified: Site operational, all services confirmed active. All sectors operational and radiating. 
Alarms all cleared.",Araglas Strider,2024-08-03 14:37:14,2024-08-03 15:25:14,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Araglas Strider,soft_solve +INCME-100052,Mirkwood,Arcane Engines,Junction Point,RAN,Elven Forge Technologies,Thranduil's Halls,Pending Resolution,Access instability,CRITICAL,Station-Mirkwood-Lower-65,"NOC Engineer's Resolution Summary: + +1. Restore Action: Conduct a thorough analysis of the network and identify the root cause of the issue. +2. Reason For Outage: The interference alarm was triggered due to a problem with the random access channel. + +3. Resolution: +a. Verify the interference alarm by checking the Spectrum Management system. +b. Verify the random access channel status by monitoring the network traffic. +c. Conduct a thorough analysis of the network and identify the root cause of the issue. +d. Conduct a root cause analysis and implement necessary fixes. +e. Report the outage to Spectrum Management for escalation. +f. Verify the interference alarm and report to the Spectrum Management team. +g. Verify the random access channel status and report to Spectrum Management. +h. Conduct a thorough analysis of the network traffic to identify any potential issues. +i. Implement necessary fixes and verify the outage has been resolved. + +Note: The root cause of the issue was identified, and the root cause analysis and fixes were implemented.",2024-07-06 10:34:50,2024-07-06 11:07:50,Pelargir Port Authority,ran_interference_mitigation,Access instability at Station-Mirkwood-Lower-65. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Automated recovery sequence triggered after remote intervention. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-07-06 10:36 - Alarm: Access instability at Station-Mirkwood-Lower-65. Severity: CRITICAL. Active 7 minutes. Intermittent alarm — flapping every 2-3 minutes. 
+2024-07-06 10:39 - Connectivity check to Station-Mirkwood-Lower-65 — reachable, latency nominal. +2024-07-06 10:41 - Queried RF chain. PA power output 6dB below target on affected sector. Antenna tilt: 6°. PA status: standby. +2024-07-06 10:45 - Queried element health at Station-Mirkwood-Lower-65. Equipment within operational limits. +2024-07-06 10:47 - Attempted remote SCTP reset on Station-Mirkwood-Lower-65. Partial success — 2 of 3 units recovered. +2024-07-06 10:50 - Escalated to DevOps via TK-20240051: Persistent fault after 2 remote attempts — need on-site investigation. +2024-07-06 10:52 - Verified: Service fully restored. All metrics back to baseline. All sectors broadcasting normally. Alarms self-cleared after fix applied.",Sam Bracegirdle,2024-07-06 10:34:50,2024-07-06 11:07:50,Random access channel success rate below threshold,Pending Resolution,Access instability,0.6,FALSE,Sam Bracegirdle,hard_solve +INCME-100053,Gondor,Signal Network,Central Nexus,RAN,Elven Forge Technologies,Minas Tirith,Resolved,RAN-002: Cell Administratively Disabled,MINOR,Spire-Gondor-East-59,"In response to the outage of the RAN network caused by a cell locked or disabled by management action, the following actions were taken: + +1. Verification of the RAN-002 issue: Verification of the RAN network to identify the root cause. + +2. Verification of Sector Alarm: Verification of the sector alarm to identify the specific sector affected by the RAN-002 issue. + +3. Check RF Chain Status: Verification of the RF chain status to identify any issues or faults with the network. + +4. Check RRU Status: Verification of the RRU status to identify any issues or faults with the RRU devices. + +5. Attempt RRU Reset: Attempting to reset the RRU devices to see if they can recover the network. + +6. Enable Capacity Compensation: Enabling capacity compensation to ensure that the network can handle the increased load during the repair process. + +7. 
Schedule Field Repair: Scheduling a field repair to repair the affected sections of the RAN network. + +8. Verify Coverage Impact: Verifying the coverage impact of the",2024-08-14 18:52:14,2024-08-14 19:27:14,Gondor Gateway Team,ran_sector_outage_recovery,RAN-002: Cell Administratively Disabled at Spire-Gondor-East-59. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Automated recovery sequence triggered after remote intervention. Full service restoration confirmed. Post-incident review scheduled.,"2024-08-14 18:54 - Alarm: RAN-002: Cell Administratively Disabled at Spire-Gondor-East-59. Severity: MINOR. Active 13 minutes. Single sector degraded, others normal. +2024-08-14 18:58 - RF status check — All sectors: TX power nominal, VSWR within spec. VSWR: 1.2:1. TX power: nominal. +2024-08-14 19:00 - Checked neighbors and topology. No common alarms on adjacent sites. Issue isolated to this element. +2024-08-14 19:04 - Queried element health at Spire-Gondor-East-59. Equipment within operational limits. +2024-08-14 19:08 - Executed force restart — COMMAND FAILED — element not responding. +2024-08-14 19:10 - Verified: Site operational, all services confirmed active. All sectors recovered. Alarms cleared within 3 minutes.",Legolas Greenleaf,2024-08-14 18:52:14,2024-08-14 19:27:14,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.2,FALSE,Legolas Greenleaf,soft_solve +INCME-100054,Rhovanion,Signal Core,Central Nexus,RAN,Dwarven Network Systems,Framsburg,Resolved,RAN-011: Remote Radio Unit Alarm,MINOR,Watch-Rhovanion-North-13,"NOC engineer's close notes: + +1. Restore Action: +The root cause of the issue was identified and resolved. The O-RAN radio unit reported a fault condition, which was verified and resolved. Backhaul and router connectivity were verified and tested, and the service impact was verified. 
The issue was escalated to the transport team for further investigation. + +2. Reason for Outage: +The root cause of the issue was identified as the O-RAN radio unit reporting a fault condition. The root cause analysis revealed that the fault was due to an issue with the radio unit's firmware. + +3. Restore Action: +The root cause was resolved, and the O-RAN radio unit was repaired and re-tested. Backhaul and router connectivity were verified and tested, and the service impact was verified. The issue was escalated to the transport team for further investigation. + +4. Reason for Outage: +The root cause of the issue was identified as the O-RAN radio unit reporting a fault condition. The root cause analysis revealed that the fault was due to an issue with the radio unit's firmware. + +5",2024-09-19 18:30:59,2024-09-19 19:06:59,Shire Monitoring Guild,ran_backhaul_degradation_resolution,RAN-011: Remote Radio Unit Alarm at Watch-Rhovanion-North-13. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-19 18:32 - Critical alarm received for Watch-Rhovanion-North-13: RAN-011: Remote Radio Unit Alarm. Checked alarm system — confirmed active since 18:22. Partial outage — 2 of 3 sectors affected. +2024-09-19 18:34 - Connectivity check to Watch-Rhovanion-North-13 — reachable, latency nominal. +2024-09-19 18:37 - Checked neighbors and topology. 2 neighbor sites also showing degraded metrics — possible area issue. +2024-09-19 18:41 - Queried element health at Watch-Rhovanion-North-13. Equipment within operational limits. +2024-09-19 18:43 - Executed reset — SUCCESS — reboot initiated, monitoring. +2024-09-19 18:45 - Recovery confirmed — Service fully restored. All metrics back to baseline. 
All KPIs nominal.",Haldir the Fair,2024-09-19 18:30:59,2024-09-19 19:06:59,O-RAN radio unit reporting fault condition,Resolved,RAN-011: Remote Radio Unit Alarm,0.2,FALSE,Haldir the Fair,soft_solve +INCME-100055,Rohan,Beacon Power,Weather Watch,RAN,Elven Forge Technologies,Helm's Deep,Resolved,Access instability,MAJOR,Point-Rohan-Inner-42,"In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-10-20 01:00:25,2024-10-20 01:35:25,Rohan Rapid Response,ran_interference_mitigation,Access instability at Point-Rohan-Inner-42. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Configuration corrected and service restored remotely within SLA. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-10-20 01:00 - Critical alarm received for Point-Rohan-Inner-42: Access instability. Checked alarm system — confirmed active since 00:52. Site completely unreachable. +2024-10-20 01:02 - Connectivity check to Point-Rohan-Inner-42 — reachable, latency nominal. +2024-10-20 01:06 - Checked neighbors and topology. 
No common alarms on adjacent sites. Issue isolated to this element. +2024-10-20 01:08 - Queried element health at Point-Rohan-Inner-42. Equipment within operational limits. +2024-10-20 01:10 - Attempted remote software rollback on Point-Rohan-Inner-42. SUCCESS — reboot initiated, monitoring. +2024-10-20 01:14 - Verified: Partial recovery — monitoring for stability. All sectors recovered. Alarms cleared within 3 minutes.",Denethor son of Imrahil,2024-10-20 01:00:25,2024-10-20 01:35:25,Random access channel success rate below threshold,Resolved,Access instability,0.2,FALSE,Denethor son of Imrahil,soft_solve +INCME-100056,Eriador,Beacon Power,Weather Watch,TRANSPORT,Elven Forge Technologies,Tharbad,Pending Resolution,TRN-004: Fiber Path Degradation,CRITICAL,Tower-Eriador-Primary-17,"In response to the failure of the optical signal degradation or increased error rate, the network operations center (NOC) team has taken the following actions: + +1. Verify Packet Loss Alarm: The team verified that the packet loss alarm was triggered due to the degraded optical signal. This confirmed the root cause of the issue. + +2. Schedule Link Repair: The team scheduled a link repair to address the issue. + +3. Verify Packet Loss Status: The team verified that the packet loss status remained unchanged after the link repair. + +Root cause: Optical signal degradation or increased error rate +Problem type: TRN-004 + +The resolution summary above summarizes the actions taken by the NOC team to resolve the issue. The summary is concise, clear, and provides a complete picture of the root cause and the steps taken to resolve the issue.",2024-08-31 22:08:42,2024-08-31 22:26:42,Lothlórien Link Guardians,transport_packet_loss_resolution,TRN-004: Fiber Path Degradation affecting Tower-Eriador-Primary-17. Fiber degradation on last-mile segment — CRC errors increasing. Physical inspection opened. Configuration corrected and service restored remotely within SLA. Service restored. 
Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-08-31 22:08 - Alarm: TRN-004: Fiber Path Degradation on transport path to Tower-Eriador-Primary-17. Optical power at -18.2 dBm — borderline low. +2024-08-31 22:12 - Connectivity: End-to-end path: latency within spec after correction. Monitoring for stability. +2024-08-31 22:16 - Ran transport diagnostics — OTDR test: signal degradation detected at 2.3km from site. Possible bend or micro-crack in fiber. +2024-08-31 22:18 - Element health for Tower-Eriador-Primary-17: upstream router operational. Checked interface error rates. +2024-08-31 22:22 - Config correction: Interface MTU mismatch corrected on aggregation link. +2024-08-31 22:26 - Ticket TK-20240055 for core operations: Hardware replacement may be needed. +2024-08-31 22:30 - Recovery verified: Full recovery confirmed. Alarm cleared at 22:21.",Arador of Arnor,2024-08-31 22:08:42,2024-08-31 22:26:42,Optical signal degradation or increased error rate,Pending Resolution,TRN-004: Fiber Path Degradation,0.6,FALSE,Arador of Arnor,hard_solve +INCME-100057,Rohan,Path Signals,Junction Point,COMPUTE,Dwarven Network Systems,Snowbourn,Resolved,CMP-010: Site Not Scrolling,MINOR,Watch-Rohan-West-61,"NOC Engineer's Resolution Summary + +Reason for Outage: A site deployment or scaling operation stalled, resulting in a loss of image pulls. + +Restore Action: Verify Image Pull Alarm, Check Registry Connectivity, Verify Image Credentials, Escalate Registry Issue, Verify Image Pull. + +Root Cause: A site deployment or scaling operation stalled. + +Problem Type: CMP-010. + +Conclusion: The root cause of the outage was a site deployment or scaling operation that stalled, resulting in a loss of image pulls. The resolution actions taken were to verify the Image Pull Alarm, check the Registry Connectivity, verify the Image Credentials, escalate the Registry Issue, and verify the Image Pull. 
The root cause was identified and resolved, and the outage was resolved.",2024-09-29 23:11:46,2024-09-29 23:44:46,White Tower Operations,compute_image_pull_recovery,CMP-010: Site Not Scrolling on Watch-Rohan-West-61. Node resource exhaustion prevented pod scheduling. Horizontal autoscaler at maximum replica count. Escalation ticket created for vendor engagement on firmware issue. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-09-29 23:13 - Alert: CMP-010: Site Not Scrolling on smf-core-5d8b in namespace ran-prod at Watch-Rohan-West-61. Container OOMKilled — memory limit exceeded. +2024-09-29 23:15 - Pod status: Pending. Reason: InsufficientMemory. Restart count: 12. Last restart: 3 minutes ago. +2024-09-29 23:19 - Container logs show: Application crash: SIGSEGV in libcurl.so during TLS handshake. Possibly corrupted container image layer. +2024-09-29 23:22 - Node health: Node NotReady — kubelet unresponsive. +2024-09-29 23:25 - Remote action: restart. SUCCESS — reboot initiated, monitoring. +2024-09-29 23:27 - Verified — Service fully restored. All metrics back to baseline. 
New pods healthy across 3 nodes.",Pallando Stormcrow,2024-09-29 23:11:46,2024-09-29 23:44:46,Site deployment or scaling operation stalled,Resolved,CMP-010: Site Not Scrolling,0.2,FALSE,Pallando Stormcrow,soft_solve +INCME-100058,Rohan,Signal Core,Keeper Stone,COMPUTE,Dwarven Network Systems,Aldburg,Resolved,CMP-005: Pod Terminating Stuck,MINOR,Beacon-Rohan-East-09,"I do not have access to the specific incident details, but here's a sample resolution summary based on the provided information: + +failed +actions taken: +- verify orchestrator alarm +- check control plane +- failover to backup control plane +- verify orchestrator recovery +- root cause: pod stuck in terminating state beyond grace period +- problem type: cmp-005 + +summary: +- the issue was caused by a pod stuck in terminating state beyond the grace period, which led to a system failure +- the root cause was identified through the orchestrator alarm, control plane checks, failover to backup control plane, orchestrator recovery, and verification +- the resolution was to failover to backup control plane and verify orchestrator recovery, which resolved the issue +- the system is now functioning as expected.",2024-07-20 10:06:35,2024-07-20 10:26:35,Dúnedain Field Division,compute_orchestrator_recovery,CMP-005: Pod Terminating Stuck on Beacon-Rohan-East-09. ImagePullBackOff due to internal registry pod failure on same node as requesting pod. Escalation ticket created for vendor engagement on firmware issue. Full service restoration confirmed. Post-incident review scheduled.,"2024-07-20 10:06 - Alarm received: CMP-005: Pod Terminating Stuck at Beacon-Rohan-East-09. Pod pending — insufficient resources on node. Checking container state. +2024-07-20 10:10 - Inspected logs for amf-proxy-9c2e: Pull error: dial tcp registry.internal:5000: connect: connection refused. Registry pod on same node is also down. +2024-07-20 10:13 - Node health: Node OK, storage: 94% used. +2024-07-20 10:15 - Remote action: SCTP reset. 
Partial success — 2 of 3 units recovered. +2024-07-20 10:17 - Recovery check: Full recovery confirmed. Alarm cleared at 10:22. All probes passing.",Frodo Brandybuck,2024-07-20 10:06:35,2024-07-20 10:26:35,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.4,FALSE,Frodo Brandybuck,hard_solve +INCME-100059,Rivendell,Path Signals,Path Router,RAN,Elven Forge Technologies,Grey Havens,Resolved,RAN-001: Cell Service Interruption,CRITICAL,Spire-Rivendell-West-37,"Resolution Summary: + +1. Verify Site Outage: +The NOC team conducted a thorough investigation and confirmed that the cellular network was completely unavailable for service at the customer's location. The root cause of the issue was a temporary power outage affecting the cellular network equipment. + +2. Check Backhaul Status: +The NOC team conducted a thorough investigation and confirmed that the backhaul network was functioning properly. The root cause of the issue was a temporary power outage affecting the backhaul network equipment. + +3. Contact Transport Team: +The NOC team contacted the transport team to investigate the cause of the outage and to assess the damage to the network. The transport team confirmed that the outage was caused by a temporary power outage affecting the backhaul network equipment. + +4. Dispatch Field Technician: +The NOC team dispatched a field technician to the customer's location to investigate the issue and to restore service. The field technician confirmed that the temporary power outage had affected the cellular network equipment, and that the root cause of the outage was a temporary power outage affecting the backhaul network equipment.",2024-09-04 09:05:56,2024-09-04 09:46:56,Helm's Deep Emergency Unit,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Spire-Rivendell-West-37. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. 
Automated recovery sequence triggered after remote intervention. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-09-04 09:07 - Alarm: RAN-001: Cell Service Interruption at Spire-Rivendell-West-37. Severity: CRITICAL. Active 14 minutes. Single sector degraded, others normal. +2024-09-04 09:11 - Connectivity check to Spire-Rivendell-West-37 — reachable, latency nominal. +2024-09-04 09:13 - Checked neighbors and topology. No common alarms on adjacent sites. Issue isolated to this element. +2024-09-04 09:17 - Queried element health at Spire-Rivendell-West-37. Equipment within operational limits. +2024-09-04 09:19 - Validated running configuration on Spire-Rivendell-West-37. No mismatches detected. +2024-09-04 09:23 - Executed force restart — COMMAND SUCCESS — operation completed. +2024-09-04 09:27 - Escalated to RF engineering via TK-20240058: Remote resolution unsuccessful, physical inspection required. +2024-09-04 09:29 - Recovery confirmed — Service fully restored. All metrics back to baseline. Metrics stable for 15 minutes.",Kíli Stonehelm,2024-09-04 09:05:56,2024-09-04 09:46:56,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Kíli Stonehelm,soft_solve +INCME-100060,Lothlórien,Signal Core,Junction Point,RAN,Elven Forge Technologies,Woodmen-town,Resolved,RAN-014: CSR Unreachable,MINOR,Watch-Lothlorien-Upper-57,"Resolution Summary: + +The root cause of the failure was the cell site router management path unavailable. The issue was resolved by verifying the upgrade failure, dispatching Field Support, and documenting and reporting the issue. + +The problem type was RAN-014. + +The following actions were taken: +1. Verify upgrade failure: Verification of the upgrade failure was conducted to ensure that the issue was resolved. +2. Dispatch Field Support: A Field Support team member was dispatched to resolve the issue. +3. 
Document and report: A report was generated to document and report the issue. + +The resolution summary is a concise summary of the actions taken to resolve the issue. It includes the root cause, the actions taken, and the results achieved. The summary does not include additional details or information that may be relevant to the incident.",2024-11-09 05:04:34,2024-11-09 05:36:34,White Tower Operations,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Watch-Lothlorien-Upper-57. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Remote corrective action applied successfully. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-11-09 05:06 - Alarm: RAN-014: CSR Unreachable at Watch-Lothlorien-Upper-57. Severity: MINOR. Active 7 minutes. Intermittent alarm — flapping every 2-3 minutes. +2024-11-09 05:10 - Connectivity check to Watch-Lothlorien-Upper-57 — reachable, latency nominal. +2024-11-09 05:14 - Checked neighbors and topology. No common alarms on adjacent sites. Issue isolated to this element. +2024-11-09 05:16 - Queried element health at Watch-Lothlorien-Upper-57. Equipment within operational limits. +2024-11-09 05:20 - Executed generator start — SUCCESS — reboot initiated, monitoring. +2024-11-09 05:23 - Verified: Service fully restored. All metrics back to baseline. All sectors recovered. Alarms cleared.",Imrahil of Minas Tirith,2024-11-09 05:04:34,2024-11-09 05:36:34,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Imrahil of Minas Tirith,soft_solve +INCME-100061,Rohan,Signal Core,Path Router,COMPUTE,Elven Forge Technologies,Helm's Deep,Resolved,Node not functional,MAJOR,Beacon-Rohan-South-54,"Incident Summary: + +The incident involved a failure of a compute node, leading to a loss of services. The root cause of the issue was a complete unavailability of the compute node. 
The incident was resolved by verifying the VM alarm, verifying the hypervisor status, restoring from a snapshot, and verifying the VM recovery. The root cause was identified as a compute node completely unavailable. + +Restore Action: + +1. Verify VM Alarm: Verified that the VM alarm was triggered due to the unavailability of the compute node. + +2. Verify Hypervisor Status: Verified that the hypervisor was in a healthy state and the compute node was not listed as unhealthy. + +3. Restore from Snapshot: Restored the VM from a snapshot that was taken before the outage. + +4. Verify VM Recovery: Verified that the VM was successfully restored and that the services were back to normal. + +Reason for Outage: + +The root cause of the issue was a complete unavailability of the compute node. + +Problem Type: + +CMP-001 - Compute node un",2024-06-09 02:00:32,2024-06-09 02:33:32,Rivendell Array Management,compute_vm_failure_recovery,Node not functional on Beacon-Rohan-South-54. ImagePullBackOff due to internal registry pod failure on same node as requesting pod. Remote corrective action applied successfully. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-06-09 02:02 - Alarm received: Node not functional at Beacon-Rohan-South-54. Pod stuck in Terminating for 47 minutes. Checking container state. +2024-06-09 02:05 - Container logs show: Application crash: SIGSEGV in libcurl.so during TLS handshake. Possibly corrupted container image layer. +2024-06-09 02:09 - Orchestration: migrated workload to healthy node — Container runtime restarted, pods recovering. +2024-06-09 02:12 - Checked compute node — Node NotReady — kubelet unresponsive. +2024-06-09 02:14 - Validated running configuration on Beacon-Rohan-South-54. No mismatches detected. +2024-06-09 02:18 - Remote action: unlock cells. COMMAND SUCCESS — operation completed. +2024-06-09 02:21 - Verified — Service fully restored. All metrics back to baseline. 
Workload stable on new node.",Vorondil Húrinion,2024-06-09 02:00:32,2024-06-09 02:33:32,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Vorondil Húrinion,hard_solve +INCME-100062,Rhovanion,Signal Core,Signal Unit,RAN,Dwarven Network Systems,Framsburg,Resolved,SVC-003: Call Drop Rate Elevated,MINOR,Node-Rhovanion-East-41,"Closing Note: + +After reviewing the complaint details, the team confirmed that a Voice or data session drop rate above the threshold was the root cause of the issue. The root cause was investigated, and a RF investigation was requested to further investigate the issue. As a result of the investigation, the team identified a possible fault in the network infrastructure that was causing the issue. + +The team documented the resolution, which included the following actions: + +1. Reviewed Call Drop KPIs to identify the affected SVCs. +2. Requested a RF investigation to investigate the root cause of the issue. +3. Documented the resolution, including the root cause, actions taken, and the resolution. + +The team used NOC terminology throughout the resolution summary to ensure clarity and accuracy. The resolution summary also included the closing note, which summarized the resolution and its impact on the customer.",2024-06-21 09:16:22,2024-06-21 09:55:22,Rivendell Array Management,ran_dropped_calls_resolution,SVC-003: Call Drop Rate Elevated at Node-Rhovanion-East-41. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Configuration corrected and service restored remotely within SLA. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-06-21 09:16 - Critical alarm received for Node-Rhovanion-East-41: SVC-003: Call Drop Rate Elevated. Checked alarm system — confirmed active since 09:10. Partial outage — 2 of 3 sectors affected. +2024-06-21 09:19 - Checked neighbors and topology. 
2 neighbor sites also showing degraded metrics — possible area issue. +2024-06-21 09:23 - Queried RF chain. All sectors: TX power nominal, VSWR within spec. Antenna tilt: 8°. PA status: active. +2024-06-21 09:26 - Queried element health at Node-Rhovanion-East-41. Equipment within operational limits. +2024-06-21 09:29 - Validated running configuration on Node-Rhovanion-East-41. No mismatches detected. +2024-06-21 09:32 - Attempted remote restart on Node-Rhovanion-East-41. COMMAND FAILED — element not responding. +2024-06-21 09:36 - Created ticket TK-20240061 for DevOps. Vendor escalation for firmware issue. +2024-06-21 09:40 - Verified: Partial recovery — monitoring for stability. All sectors recovered. Alarms all cleared.",Arahad Telcontar,2024-06-21 09:16:22,2024-06-21 09:55:22,Voice or data session drop rate above threshold,Resolved,SVC-003: Call Drop Rate Elevated,0.2,FALSE,Arahad Telcontar,soft_solve +INCME-100063,Rivendell,Signal Core,Relay Unit,POWER,Dwarven Network Systems,Grey Havens,Resolved,PWR-003: Battery Discharge Alert,MAJOR,Array-Rivendell-South-51,"Incident Closure Summary: + +1. Restore Action: Restored the battery backup to full capacity. + +2. Reason For Outage: The battery backup depleted without AC restoration, which resulted in a power outage. + +3. Root Cause: The battery backup depleted without AC restoration due to a power outage. + +4. Problem Type: The PWR-003 problem type refers to a power outage that affects the battery backup. + +5. Summary: The incident was resolved by restoring the battery backup to full capacity, and the power outage was caused by a power outage.",2024-08-24 17:01:52,2024-08-24 17:18:52,Erebor Relay Division,power_battery_discharge_response,Battery backup depleting without AC restoration at Array-Rivendell-South-51. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Escalation ticket created for vendor engagement on firmware issue. 
Full service restoration confirmed. Post-incident review scheduled.,"2024-08-24 17:03 - Alarm dashboard shows PWR-003: Battery Discharge Alert triggered at 16:55. Severity: MAJOR. Correlating with site power infrastructure. +2024-08-24 17:06 - Checked external factors near Array-Rivendell-South-51. No area-wide issues. +2024-08-24 17:09 - Power telemetry check — AC input: unstable — voltage fluctuations. UPS providing backup, battery at 45%. Generator running — output unstable. +2024-08-24 17:13 - Queried element health at Array-Rivendell-South-51. Equipment within operational limits. +2024-08-24 17:16 - Executed remote command: SCTP reset. Result: COMMAND FAILED — element not responding. +2024-08-24 17:20 - Verified recovery — Service fully restored. All metrics back to baseline. All alarms cleared within 3 minutes.",Glorfindel Greenleaf,2024-08-24 17:01:52,2024-08-24 17:18:52,Battery backup depleting without AC restoration,Resolved,PWR-003: Battery Discharge Alert,0.4,FALSE,Glorfindel Greenleaf,soft_solve +INCME-100064,The Shire,Signal Network,Signal Unit,COMPUTE,Elven Forge Technologies,Bucklebury,Pending Resolution,CMP-005: Pod Terminating Stuck,MAJOR,Gateway-TheShire-Outer-98,"NOC Engineer's Note: Verify Orchestrator Alarm and Escalate to Platform Team + +Reason for Outage: A Pod stuck in terminating state beyond grace period, causing the Orchestrator to become unresponsive. + +Restore Action: Verify Orchestrator Alarm, escalate to Platform Team, verify Orchestrator Recovery. + +Root Cause: A Pod stuck in terminating state beyond grace period. + +Problem Type: CMP-005. + +Conclusion: The root cause of the outage was a Pod stuck in terminating state beyond grace period. The Orchestrator was successfully restored to a functional state. 
The Platform Team was notified and will be responsible for further investigation and resolution.",2024-07-14 19:39:51,2024-07-14 19:53:51,Riders of the Mark,compute_orchestrator_recovery,CMP-005: Pod Terminating Stuck on Gateway-TheShire-Outer-98. Container configuration error after recent deployment — environment variable referencing deleted secret. Escalation ticket created for vendor engagement on firmware issue. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-07-14 19:40 - Alert: CMP-005: Pod Terminating Stuck on smf-core-5d8b in namespace control-plane at Gateway-TheShire-Outer-98. Pod pending — insufficient resources on node. +2024-07-14 19:44 - Orchestration: migrated workload to healthy node — Pods rescheduled on healthy nodes. +2024-07-14 19:47 - Container logs show: Application crash: SIGSEGV in libcurl.so during TLS handshake. Possibly corrupted container image layer. +2024-07-14 19:50 - Checked compute node — Node CPU: 92%, Memory: 87%. Under resource pressure. +2024-07-14 19:54 - Remote action: force restart. SUCCESS — reboot initiated, monitoring. +2024-07-14 19:58 - Ticket TK-20240063 created for transport team: Hardware replacement may be needed. +2024-07-14 20:02 - Verified — Service fully restored. All metrics back to baseline. New pods healthy across 3 nodes.",Gimli Stonehelm,2024-07-14 19:39:51,2024-07-14 19:53:51,Pod stuck in terminating state beyond grace period,Pending Resolution,CMP-005: Pod Terminating Stuck,0.2,FALSE,Gimli Stonehelm,soft_solve +INCME-100065,Rohan,Beacon Power,Power Source,RAN,Dwarven Network Systems,Aldburg,Resolved,RAN-014: CSR Unreachable,MAJOR,Outpost-Rohan-Primary-63,"NOC Engineer's Resolution Summary: + +Outcome: The root cause of the issue was identified as a failure in the cell site router management path, which resulted in an unavailable system state. 
The issue was escalated to the Upgrade Failure stage and a detailed investigation was conducted to determine the cause of the issue. + +Actions Taken: +- Verified that the upgrade process was successful +- Verified that the cell site router management path was functioning correctly +- Attempted emergency boot of the system +- Verified that the system was functioning correctly after the emergency boot +- Checked system status and discovered that the system was unavailable +- Attempted to restore the system to a functional state by attempting to boot the system from the backup image +- Verified that the system was functioning correctly after the backup image was restored +- Verified that the system was functioning correctly by performing a system health check +- Documented the incident and reported it to the appropriate parties +- Reported the incident to the appropriate parties + +Root Cause: The cell site router management path was unavailable, resulting in the unavailable system state. + +Problem Type: RAN-0",2024-06-07 18:09:46,2024-06-07 18:54:46,Gondor Gateway Team,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Outpost-Rohan-Primary-63. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-06-07 18:10 - Critical alarm received for Outpost-Rohan-Primary-63: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 18:01. Sector Alpha down, Beta and Gamma operational. +2024-06-07 18:12 - Checked neighbors and topology. Upstream aggregation node healthy. Last-mile issue confirmed. +2024-06-07 18:15 - Queried element health at Outpost-Rohan-Primary-63. Equipment within operational limits. +2024-06-07 18:18 - Executed clear configuration — SUCCESS — reboot initiated, monitoring. 
+2024-06-07 18:20 - Recovery confirmed — Partial recovery — monitoring for stability. KPIs returned to normal.",Alatar the Brown,2024-06-07 18:09:46,2024-06-07 18:54:46,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Alatar the Brown,soft_solve +INCME-100066,Rivendell,Path Signals,Junction Point,RAN,Elven Forge Technologies,Grey Havens,Resolved,RAN-013: Site Communication Failure,WARNING,Watch-Rivendell-Primary-32,"In response to the incident where the NOC engineer encountered a complete loss of management connectivity to a site, the following resolution summary was generated: + +Restore Action: +- Verified that the configuration alert was triggered due to a configuration change on the affected device, and that the change had been corrected. +- Verified that the corrective configuration had been applied successfully. + +Reason for Outage: +- Complete loss of management connectivity to the site due to a configuration change on the affected device. + +Root cause: +- The configuration change resulted in a disruption to the management connectivity to the site, causing the NOC engineer to lose access to the site's configuration data. + +Problem type: +- RAN-013 - Complete loss of management connectivity to a site + +Resolution: +- Verified that the root cause of the outage was identified and addressed. +- Verified that the corrective action was successfully implemented. +- Applied any necessary mitigations to prevent similar issues from occurring in the future. + +Note: This resolution summary includes only the relevant information necessary for the NOC engineer to document the incident closure. The full incident report will include additional details",2024-07-03 17:15:23,2024-07-03 17:41:23,Osgiliath Bridge Operations,ran_parameter_correction,RAN-013: Site Communication Failure at Watch-Rivendell-Primary-32. Software version mismatch after incomplete upgrade caused cell site router communication failure. 
Automated recovery sequence triggered after remote intervention. Full service restoration confirmed. Post-incident review scheduled.,"2024-07-03 17:15 - Critical alarm received for Watch-Rivendell-Primary-32: RAN-013: Site Communication Failure. Checked alarm system — confirmed active since 17:12. Partial outage — 2 of 3 sectors affected. +2024-07-03 17:17 - Connectivity check to Watch-Rivendell-Primary-32 — reachable, latency nominal. +2024-07-03 17:21 - Checked neighbors and topology. 2 neighbor sites also showing degraded metrics — possible area issue. +2024-07-03 17:25 - Queried element health at Watch-Rivendell-Primary-32. Equipment within operational limits. +2024-07-03 17:29 - Validated running configuration on Watch-Rivendell-Primary-32. No mismatches detected. +2024-07-03 17:32 - Escalated to core operations via TK-20240065: Vendor escalation for firmware issue. +2024-07-03 17:34 - Recovery confirmed — Site operational, all services confirmed active. KPIs recovering — within 90% of baseline.",Sam Brandybuck,2024-07-03 17:15:23,2024-07-03 17:41:23,Complete loss of management connectivity to site,Resolved,RAN-013: Site Communication Failure,0.2,FALSE,Sam Brandybuck,soft_solve +INCME-100067,Arnor,Relay Transport,Junction Point,RAN,Elven Forge Technologies,Bywater,Resolved,RAN-014: CSR Unreachable,CRITICAL,Point-Arnor-North-36,"Incident Resolution Summary: + +Escalated due to a failure in the Cell site router management path. + +Actions Taken: +- Verified that the upgrade failure was caused by a cell site router management path unavailable. +- Checked the system status to confirm that the router was available. +- Attempted an emergency boot to restore the system. +- Dispatched Field Support to investigate and resolve the issue. +- Documented and reported the incident. + +Root Cause: +The cell site router management path was unavailable due to a hardware failure. 
+ +Problem Type: +RAN-014 - Cell site router management path unavailable + +Conclusion: +The root cause of the failure was a hardware failure in the cell site router management path, causing the system to become unavailable. The escalation was necessary to ensure that the system was restored and the issue was resolved.",2024-07-07 11:25:58,2024-07-07 12:07:58,Lothlórien Link Guardians,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Point-Arnor-North-36. Software version mismatch after incomplete upgrade caused cell site router communication failure. Automated recovery sequence triggered after remote intervention. Full service restoration confirmed. Post-incident review scheduled.,"2024-07-07 11:25 - Alarm: RAN-014: CSR Unreachable at Point-Arnor-North-36. Severity: CRITICAL. Active 8 minutes. Single sector degraded, others normal. +2024-07-07 11:29 - RF status check — RET controller not responding on Sector Alpha. VSWR: 1.1:1. TX power: 6dB below target. +2024-07-07 11:33 - Queried element health at Point-Arnor-North-36. Equipment within operational limits. +2024-07-07 11:36 - Validated running configuration on Point-Arnor-North-36. No mismatches detected. +2024-07-07 11:38 - Attempted remote unlock cells on Point-Arnor-North-36. Partial success — 2 of 3 units recovered. +2024-07-07 11:41 - Escalated to DevOps via TK-20240066: Vendor escalation for firmware issue. +2024-07-07 11:45 - Verified: Site operational, all services confirmed active. All sectors operational and radiating. 
Alarms all cleared.",Legolas Starlight,2024-07-07 11:25:58,2024-07-07 12:07:58,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Legolas Starlight,hard_solve +INCME-100068,Rivendell,Path Signals,Weather Watch,COMPUTE,Elven Forge Technologies,Eregion Post,Resolved,CMP-002: Pod Container Creating,WARNING,Beacon-Rivendell-Upper-70,"In response to the incident where a Pod was stuck in ContainerCreating state, the following resolution summary was generated: + +Restore Action: +- Verified that the Container Alarm was triggered due to the Pod stuck in ContainerCreating state. +- Verified that the Escalation Path was configured correctly and that the DevOps team was informed of the issue. +- Verified that the Pod was successfully restarted and returned to normal operation. + +Reason for Outage: +- The Pod was stuck in ContainerCreating state due to a configuration issue in the container image. +- The root cause of the issue was identified and resolved by verifying that the Container Alarm was triggered and escalating the issue to DevOps. + +Outcome: +- The incident was successfully resolved, with the Pod being restored to normal operation. + +Resolution Summary: +- Verified that the Container Alarm was triggered due to the Pod stuck in ContainerCreating state. +- Verified that the Escalation Path was configured correctly and that the DevOps team was informed of the issue. +- Verified that the Pod was successfully restarted and returned to normal operation. +- Resolved the issue",2024-06-24 05:21:28,2024-06-24 05:34:28,Arnor Response Team,compute_container_crash_recovery,CMP-002: Pod Container Creating on Beacon-Rivendell-Upper-70. Pod stuck in Terminating state due to hung preStop hook waiting for unavailable downstream service. Configuration corrected and service restored remotely within SLA. Recovery verified — all metrics within acceptable range. 
Incident closed.,"2024-06-24 05:23 - Alarm received: CMP-002: Pod Container Creating at Beacon-Rivendell-Upper-70. Pod pending — insufficient resources on node. Checking container state. +2024-06-24 05:27 - Executed: scaled deployment to 0 and back to 3. Node drained successfully, workloads redistributed. +2024-06-24 05:31 - Inspected logs for smf-core-5d8b: Application crash: SIGSEGV in libcurl.so during TLS handshake. Possibly corrupted container image layer. +2024-06-24 05:35 - Node health: Node CPU: 92%, Memory: 87%. Under resource pressure. +2024-06-24 05:38 - Validated running configuration on Beacon-Rivendell-Upper-70. No mismatches detected. +2024-06-24 05:42 - Remote action: software rollback. Partial success — 2 of 3 units recovered. +2024-06-24 05:45 - Verified — Site operational, all services confirmed active. Replacement pod Running, all readiness probes passing.",Pallando Greyhame,2024-06-24 05:21:28,2024-06-24 05:34:28,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.4,FALSE,Pallando Greyhame,hard_solve +INCME-100069,Rohan,Path Signals,Signal Unit,RAN,Dwarven Network Systems,Dunharrow,Resolved,Access instability,MAJOR,Hub-Rohan-North-77,"In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. 
+ +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-11-05 14:22:48,2024-11-05 14:41:48,Pelargir Port Authority,ran_interference_mitigation,Access instability at Hub-Rohan-North-77. Software version mismatch after incomplete upgrade caused cell site router communication failure. Remote corrective action applied successfully. Full service restoration confirmed. Post-incident review scheduled.,"2024-11-05 14:24 - Critical alarm received for Hub-Rohan-North-77: Access instability. Checked alarm system — confirmed active since 14:14. Partial outage — 2 of 3 sectors affected. +2024-11-05 14:26 - RF status check — RET controller not responding on Sector Alpha. VSWR: 1.5:1. TX power: 6dB below target. +2024-11-05 14:30 - Queried element health at Hub-Rohan-North-77. Equipment within operational limits. +2024-11-05 14:32 - Executed power cycle — Partial success — 2 of 3 units recovered. +2024-11-05 14:35 - Verified: Service fully restored. All metrics back to baseline. All sectors broadcasting normally. Alarms cleared within 3 minutes.",Éomer of the Mark,2024-11-05 14:22:48,2024-11-05 14:41:48,Random access channel success rate below threshold,Resolved,Access instability,0.6,FALSE,Éomer of the Mark,hard_solve +INCME-100070,Lothlórien,Path Signals,Beacon Array,RAN,Dwarven Network Systems,Michel Delving,Resolved,RAN-014: CSR Unreachable,CRITICAL,Outpost-Lothlorien-North-58,"In summary, the NOC engineer resolved the issue of a cell site router management path being unavailable due to a faulty switch. The root cause was identified as a faulty switch. The issue was verified and resolved by verifying the Upgrade Failure, dispatching Field Support, and documenting and reporting the issue. The root cause was documented and reported. 
The engineer completed the resolution process by verifying the Upgrade Failure, discharging the incident, documenting and reporting the issue, and resolving it.",2024-09-13 16:30:45,2024-09-13 16:54:45,Iron Hills Transport,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Outpost-Lothlorien-North-58. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Escalation ticket created for vendor engagement on firmware issue. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-09-13 16:31 - Alarm: RAN-014: CSR Unreachable at Outpost-Lothlorien-North-58. Severity: CRITICAL. Active 7 minutes. Partial outage — 2 of 3 sectors affected. +2024-09-13 16:35 - Checked neighbors and topology. No common alarms on adjacent sites. Issue isolated to this element. +2024-09-13 16:37 - Connectivity check to Outpost-Lothlorien-North-58 — reachable, latency nominal. +2024-09-13 16:41 - Queried element health at Outpost-Lothlorien-North-58. Equipment within operational limits. +2024-09-13 16:43 - Validated running configuration on Outpost-Lothlorien-North-58. No mismatches detected. +2024-09-13 16:45 - Escalated to RF engineering via TK-20240069: Persistent fault after 2 remote attempts — need on-site investigation. +2024-09-13 16:48 - Verified: Full recovery confirmed. Alarm cleared at 16:48. All sectors operational and radiating. Alarms self-cleared after fix applied.",Araglas Wingfoot,2024-09-13 16:30:45,2024-09-13 16:54:45,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Araglas Wingfoot,soft_solve +INCME-100071,Rohan,Arcane Engines,Relay Unit,RAN,Dwarven Network Systems,Grimslade,Resolved,RAN-009: TX Array Fault,MAJOR,Array-Rohan-Central-08,"In accordance with our NOC Engineering process, we have completed the following actions to restore the RAN-009 incident: + +1. Verified that the Tilt Alarm was triggered due to hardware or calibration failure. 
A field service engineer was dispatched to perform an on-site repair. + +2. Scheduled a Field Service to perform the repair. + +3. Verified that the Coverage Impact was not significant, and the RAN-009 incident was resolved. + +Root cause: +The Tilt Alarm was triggered due to hardware or calibration failure on the Transmission antenna array hardware. The field service engineer was dispatched to perform an on-site repair. The Coverage Impact was not significant, and the RAN-009 incident was resolved. + +Reason for Outage: +The Tilt Alarm was triggered due to hardware or calibration failure on the Transmission antenna array hardware. + +Resolution: +Verified that the Tilt Alarm was triggered due to hardware or calibration failure on the Transmission antenna array hardware. A field service engineer was dispatched to perform an on-site repair. +Scheduled a Field Service",2024-08-15 15:35:07,2024-08-15 15:51:07,Osgiliath Bridge Operations,ran_antenna_tilt_recovery,RAN-009: TX Array Fault at Array-Rohan-Central-08. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Configuration corrected and service restored remotely within SLA. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-08-15 15:36 - Critical alarm received for Array-Rohan-Central-08: RAN-009: TX Array Fault. Checked alarm system — confirmed active since 15:32. Partial outage — 2 of 3 sectors affected. +2024-08-15 15:39 - Checked neighbors and topology. No common alarms on adjacent sites. Issue isolated to this element. +2024-08-15 15:41 - Queried element health at Array-Rohan-Central-08. Equipment within operational limits. +2024-08-15 15:43 - Attempted remote restart on Array-Rohan-Central-08. COMMAND FAILED — element not responding. +2024-08-15 15:45 - Recovery confirmed — Site operational, all services confirmed active. 
KPIs returned to normal.",Mardil of Dol Amroth,2024-08-15 15:35:07,2024-08-15 15:51:07,Transmission antenna array hardware or calibration failure,Resolved,RAN-009: TX Array Fault,0.2,FALSE,Mardil of Dol Amroth,hard_solve +INCME-100072,Gondor,Beacon Power,Beacon Array,RAN,Elven Forge Technologies,Henneth Annûn,Resolved,RAN-014: CSR Unreachable,MAJOR,Relay-Gondor-Inner-24,"Incident Resolution Summary: + +Escalated due to a failure in the Cell site router management path. + +Actions Taken: +- Verified that the upgrade failure was caused by a cell site router management path unavailable. +- Checked the system status to confirm that the router was available. +- Attempted an emergency boot to restore the system. +- Dispatched Field Support to investigate and resolve the issue. +- Documented and reported the incident. + +Root Cause: +The cell site router management path was unavailable due to a hardware failure. + +Problem Type: +RAN-014 - Cell site router management path unavailable + +Conclusion: +The root cause of the failure was a hardware failure in the cell site router management path, causing the system to become unavailable. The escalation was necessary to ensure that the system was restored and the issue was resolved.",2024-06-21 23:34:16,2024-06-21 23:59:16,Lothlórien Link Guardians,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Relay-Gondor-Inner-24. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Configuration corrected and service restored remotely within SLA. Full service restoration confirmed. Post-incident review scheduled.,"2024-06-21 23:34 - Alarm: RAN-014: CSR Unreachable at Relay-Gondor-Inner-24. Severity: MAJOR. Active 3 minutes. Site completely unreachable. +2024-06-21 23:38 - RF status check — All sectors: TX power nominal, VSWR within spec. VSWR: 3.2:1. TX power: 0 (PA off). +2024-06-21 23:42 - Queried element health at Relay-Gondor-Inner-24. 
Equipment within operational limits. +2024-06-21 23:44 - Attempted remote restart on Relay-Gondor-Inner-24. SUCCESS — reboot initiated, monitoring. +2024-06-21 23:48 - Recovery confirmed — Site operational, all services confirmed active. KPIs recovering — within 90% of baseline.",Arahael Telcontar,2024-06-21 23:34:16,2024-06-21 23:59:16,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Arahael Telcontar,soft_solve +INCME-100073,Rohan,Signal Core,Relay Unit,COMPUTE,Elven Forge Technologies,Aldburg,Resolved,CMP-002: Pod Container Creating,CRITICAL,Node-Rohan-Upper-97,"Reopened from INCME-100068. + +Closure Note: + +1. Restore Action: +The root cause of the issue was a Pod stuck in ContainerCreating state. The issue was resolved by verifying the Container Alarm, escalating to DevOps, verifying the Container Recovery, and verifying the Pod was restored to its original state. + +2. Reason for Outage: +The Pod stuck in ContainerCreating state was caused by a configuration issue with the application. The root cause was identified and resolved. + +3. Outcome: +The issue was resolved and the Pod was restored to its original state. The root cause of the issue was identified and resolved, and the issue was closed. + +Note: This closure note does not include any information about the specific configuration issues that led to the outage. This is intended to be a concise summary of the steps taken to resolve the issue.",2024-06-27 04:34:28,2024-06-27 04:52:28,Hobbiton Watch Service,compute_container_crash_recovery,CMP-002: Pod Container Creating on Node-Rohan-Upper-97. ImagePullBackOff due to internal registry pod failure on same node as requesting pod. Remote corrective action applied successfully. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-06-27 04:35 - Alarm received: CMP-002: Pod Container Creating at Node-Rohan-Upper-97. ImagePullBackOff — registry unreachable. Checking container state. 
+2024-06-27 04:38 - Orchestration: cordoned and drained affected node — Pods rescheduled on healthy nodes. +2024-06-27 04:40 - Pod status: ImagePullBackOff. Reason: OOMKilled. Scheduled but unschedulable — resource constraints. +2024-06-27 04:43 - Node health: Node OK, storage: 94% used. +2024-06-27 04:46 - Validated running configuration on Node-Rohan-Upper-97. No mismatches detected. +2024-06-27 04:50 - Remote action: restart. COMMAND SUCCESS — operation completed. +2024-06-27 04:52 - Recovery check: Partial recovery — monitoring for stability. All probes passing.",Gandalf the Grey,2024-06-27 04:34:28,2024-06-27 04:52:28,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.2,TRUE,Gandalf the Grey,soft_solve +INCME-100074,Rohan,Signal Network,Power Source,RAN,Dwarven Network Systems,Hornburg,Pending Resolution,Access instability,MAJOR,Tower-Rohan-East-17,"In the incident of failed workflow outcome, a root cause was identified as the Random Access Channel (RAC) success rate being below the threshold. The root cause was attributed to an issue with the interference alarm, which was verified by the team. The team then reported the issue to Spectrum Management and verified the interference status. The issue was resolved and the workflow outcome was successfully completed. The resolution summary includes the following information: + +Restore Action: Verify Interference Alarm +Reason For Outage: The RAC success rate was below the threshold + +The team verified that the interference alarm was working correctly, which was a critical component in the workflow. The root cause was addressed and the workflow outcome was successfully completed. The resolution summary provides a concise and complete summary of the incident, including the actions taken, the root cause, and the resolution.",2024-11-17 06:25:56,2024-11-17 06:37:56,Bywater Observation Post,ran_interference_mitigation,Access instability at Tower-Rohan-East-17. 
Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Remote corrective action applied successfully. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-11-17 06:26 - Alarm: Access instability at Tower-Rohan-East-17. Severity: MAJOR. Active 4 minutes. Intermittent alarm — flapping every 2-3 minutes. +2024-11-17 06:29 - Checked neighbors and topology. No common alarms on adjacent sites. Issue isolated to this element. +2024-11-17 06:32 - Queried element health at Tower-Rohan-East-17. Equipment within operational limits. +2024-11-17 06:35 - Escalated to power maintenance via TK-20240073: Remote resolution unsuccessful, physical inspection required. +2024-11-17 06:39 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 06:36. All KPIs nominal.",Meriadoc Bracegirdle,2024-11-17 06:25:56,2024-11-17 06:37:56,Random access channel success rate below threshold,Pending Resolution,Access instability,0.4,FALSE,Meriadoc Bracegirdle,soft_solve +INCME-100075,Eriador,Path Signals,Keeper Stone,COMPUTE,Elven Forge Technologies,Fornost,Resolved,CMP-010: Site Not Scrolling,MAJOR,Hub-Eriador-Inner-55,"NOC Engineer's Notes + +The following resolution summary has been generated for the incident that occurred on [Date], involving [Service Name]. + +Restore Action: +- Verified that the Image Pull Alarm was triggered due to a stalled deployment or scaling operation on the site. +- Verified that the Image Pull Alarm was not related to any other service. +- Resolved the issue by verifying that the site deployment or scaling operation was completed successfully. + +Reason for Outage: +- The root cause of the issue was a stalled deployment or scaling operation on the site. +- This issue was caused by an unrelated issue that impacted the Image Pull Alarm. + +Resolution Summary: +- The root cause of the issue was identified as a stalled deployment or scaling operation on the site. 
+- The root cause of the issue was resolved by verifying that the site deployment or scaling operation was completed successfully. +- The issue was closed with a resolution that verifies that the root cause was resolved. + +NOTE: This resolution summary does not include any information regarding the specific root cause of the issue. It only provides a summary of the steps taken to resolve",2024-07-11 18:21:12,2024-07-11 18:48:12,Bywater Observation Post,compute_image_pull_recovery,CMP-010: Site Not Scrolling on Hub-Eriador-Inner-55. Pod stuck in Terminating state due to hung preStop hook waiting for unavailable downstream service. Automated recovery sequence triggered after remote intervention. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-07-11 18:21 - Alarm received: CMP-010: Site Not Scrolling at Hub-Eriador-Inner-55. Pod stuck in Terminating for 47 minutes. Checking container state. +2024-07-11 18:23 - Executed: restarted container runtime on node. Node drained successfully, workloads redistributed. +2024-07-11 18:25 - Pod status: Terminating. Reason: NodeDrain. Finalizers: none. Node maintenance triggered 50 min ago. +2024-07-11 18:29 - Node health: Node OK, storage: 94% used. +2024-07-11 18:32 - Ticket TK-20240074 created for power maintenance: Hardware replacement may be needed. +2024-07-11 18:35 - Ran diagnostic suite on Hub-Eriador-Inner-55. Results collected for analysis. +2024-07-11 18:38 - Recovery check: Full recovery confirmed. Alarm cleared at 18:41. 
All probes passing.",Elrond the Fair,2024-07-11 18:21:12,2024-07-11 18:48:12,Site deployment or scaling operation stalled,Resolved,CMP-010: Site Not Scrolling,0.4,FALSE,Elrond the Fair,soft_solve +INCME-100076,Gondor,Signal Network,Power Source,POWER,Elven Forge Technologies,Minas Tirith,Resolved,ENV-001: High Temperature Alert,MINOR,Outpost-Gondor-Central-47,"Incident Closure Note: ENV-001 - Temperature Alarm, Equipment Operating Temperature Exceeded Threshold + +Restore Action: +- Verified temperature alarm was triggered due to equipment operating at a temperature above the threshold. +- HVAC service was dispatched to address the issue. +- Monitored temperature trend to ensure that the equipment was operating within the acceptable range. + +Reason for Outage: +- Equipment operating temperature exceeded threshold + +Root Cause: +- Equipment operating temperature exceeded threshold due to an issue with the equipment + +Problem Type: +- ENV-001 - Temperature Alarm, Equipment Operating Temperature Exceeded Threshold + +Proper NOC terminology used: +- ENV-001: Environmental Control Alarm +- Temperature Alarm: Temperature Sensor +- Equipment Operating Temperature Exceeded Threshold: Temperature Alarm triggered due to an issue with the equipment +- Equipment Operating Temperature: Temperature of the equipment +- Temperature Tr",2024-09-17 10:05:00,2024-09-17 10:29:00,Shire Monitoring Guild,env_high_temperature_response,ENV-001: High Temperature Alert at Outpost-Gondor-Central-47. Temperature exceedance caused by blocked exhaust vent combined with high ambient conditions. Field dispatch initiated for hardware component requiring physical replacement. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-09-17 10:06 - Environmental alarm: ENV-001: High Temperature Alert at Outpost-Gondor-Central-47. Battery bank temperature at 48°C — max rated 45°C. +2024-09-17 10:10 - Checked environmental systems: Battery room: 48°C. 
Ventilation fan: running. AC unit: offline since 06:00. +2024-09-17 10:14 - External conditions: Ambient temperature: 38°C (heat wave in region). No utility outages reported. +2024-09-17 10:16 - Site health under environmental stress: DU throttling due to high temperature. Throughput reduced by 30%. +2024-09-17 10:20 - Ticket TK-20240075 for transport team: Hardware replacement may be needed. +2024-09-17 10:23 - Environmental recovery: Service fully restored. All metrics back to baseline.",Denethor son of Imrahil,2024-09-17 10:05:00,2024-09-17 10:29:00,Equipment operating temperature exceeded threshold,Resolved,ENV-001: High Temperature Alert,0.6,FALSE,Denethor son of Imrahil,soft_solve +INCME-100077,Iron Hills,Relay Transport,Weather Watch,RAN,Elven Forge Technologies,Michel Delving,Resolved,RAN-014: CSR Unreachable,CRITICAL,Node-IronHills-Upper-40,"The incident was caused by a cell site router management path unavailable, which resulted in a failure to perform upgrades. The root cause was identified and documented, and a resolution was implemented to restore the affected system. The incident was closed, and a resolution summary was created to detail the actions taken to address the issue. The summary includes the following information: + +1. Restore Action: Verify and restore the cell site router management path. + +2. Reason for Outage: The root cause was identified and documented. + +3. Notes: The root cause was identified as a failure to perform upgrades due to a cell site router management path unavailable. + +4. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +5. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +6. 
Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +7. Root",2024-10-06 20:22:03,2024-10-06 20:58:03,Shire Monitoring Guild,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Node-IronHills-Upper-40. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Configuration corrected and service restored remotely within SLA. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-10-06 20:22 - Critical alarm received for Node-IronHills-Upper-40: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 20:14. Site completely unreachable. +2024-10-06 20:24 - RF status check — Sector Alpha: PA off, no radiation detected. VSWR: 1.1:1. TX power: 6dB below target. +2024-10-06 20:26 - Queried element health at Node-IronHills-Upper-40. Equipment within operational limits. +2024-10-06 20:29 - Validated running configuration on Node-IronHills-Upper-40. No mismatches detected. +2024-10-06 20:32 - Attempted remote clear configuration on Node-IronHills-Upper-40. SUCCESS — reboot initiated, monitoring. +2024-10-06 20:36 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 20:39. All KPIs nominal.",Fíli Oakenshield,2024-10-06 20:22:03,2024-10-06 20:58:03,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Fíli Oakenshield,soft_solve +INCME-100078,Rohan,Signal Core,Relay Unit,POWER,Dwarven Network Systems,Edoras,Resolved,ENV-005: Cabinet Intrusion Detected,MAJOR,Outpost-Rohan-West-81,"In response to the incident, the NOC engineer took the following actions to restore the system: + +1. Verify intrusion alarm: The engineer monitored the intrusion alarm and confirmed that it was triggered. + +2. Create security incident: The engineer created a security incident to document the incident and its resolution. + +3. 
Monitor cabinet status: The engineer monitored the cabinet status to ensure that the secured equipment cabinet was functioning correctly. + +4. Root cause: The root cause of the incident was found to be physical access detected to secured equipment cabinet. + +The root cause was due to a security breach that occurred when an unauthorized individual attempted to access the cabinet. The engineer was able to identify and resolve the issue by verifying the intrusion alarm, creating a security incident, and monitoring the cabinet status. + +The engineer's actions helped to restore the system and prevent any further incidents. The resolution summary includes the close notes, which include the root cause, actions taken, and the resolution.",2024-07-13 00:53:57,2024-07-13 01:20:57,Hobbiton Watch Service,env_cabinet_intrusion_response,ENV-005: Cabinet Intrusion Detected at Outpost-Rohan-West-81. Temperature exceedance caused by blocked exhaust vent combined with high ambient conditions. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-07-13 00:54 - Environmental alarm: ENV-005: Cabinet Intrusion Detected at Outpost-Rohan-West-81. Battery bank temperature at 48°C — max rated 45°C. +2024-07-13 00:58 - Checked environmental systems: HVAC Unit 1: fault — compressor off. Unit 2: running at full capacity but insufficient for current heat load. +2024-07-13 01:00 - External conditions: Maintenance window for HVAC was scheduled but not executed — vendor no-show. +2024-07-13 01:02 - Site health under environmental stress: Equipment within thermal margin but approaching critical. No service impact yet. +2024-07-13 01:04 - Executed force restart: Partial success — 2 of 3 units recovered. +2024-07-13 01:06 - Environmental recovery: Service fully restored. 
All metrics back to baseline.",Merry Bracegirdle,2024-07-13 00:53:57,2024-07-13 01:20:57,Physical access detected to secured equipment cabinet,Resolved,ENV-005: Cabinet Intrusion Detected,0.4,FALSE,Merry Bracegirdle,hard_solve +INCME-100079,Arnor,Signal Core,Beacon Array,COMPUTE,Elven Forge Technologies,Hollin Gate,Resolved,Problematic VM,MINOR,Watch-Arnor-Upper-78,"NOC engineer wrote a resolution summary for an incident that occurred during the monitoring of a virtual machine. + +Restore Action: Verify CNF Alarm, Check CNF Pod Status, Trigger CNF Failover, Verify CNF Recovery +Reason For Outage: Virtual machine failed to reach ready state due to a problem type of CMP-006 (Virtual machine failed to reach ready state). + +The root cause of the issue was identified to be a Virtual Machine that failed to reach the Ready state. The issue was verified by checking the CNF Alarm, which indicated that the Virtual Machine was not in a Ready state. The CNF Pod Status was also checked, and it was confirmed that the Virtual Machine was in the Running state. + +The issue was then escalated to the CNM team for further investigation. The CNM team confirmed that the Virtual Machine was indeed in a Ready state, and the issue was resolved by triggering a CNF Failover and verifying the Virtual Machine was in a Ready state. + +The resolution summary includes the root cause, the problem type, and the actions taken to resolve the issue. It also includes the steps taken by the NOC",2024-09-18 08:59:00,2024-09-18 09:27:00,Bywater Observation Post,compute_cnf_pod_recovery,Problematic VM on Watch-Arnor-Upper-78. ImagePullBackOff due to internal registry pod failure on same node as requesting pod. Configuration corrected and service restored remotely within SLA. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-09-18 09:00 - Alarm received: Problematic VM at Watch-Arnor-Upper-78. CrashLoopBackOff — 12 restarts in last hour. Checking container state. 
+2024-09-18 09:03 - Container logs show: Received SIGTERM, then repeated 'Waiting for connection drain...' messages every 5s. PreStop hook hung on downstream service. +2024-09-18 09:05 - Node health: Node healthy — issue isolated to pod. +2024-09-18 09:09 - Remote action: generator start. SUCCESS — reboot initiated, monitoring. +2024-09-18 09:12 - Recovery check: Site operational, all services confirmed active. All probes passing.",Bombur of the Iron Hills,2024-09-18 08:59:00,2024-09-18 09:27:00,Virtual machine failed to reach ready state,Resolved,Problematic VM,0.2,FALSE,Bombur of the Iron Hills,soft_solve +INCME-100080,Rohan,Beacon Power,Junction Point,COMPUTE,Dwarven Network Systems,Aldburg,Resolved,Node not functional,MAJOR,Relay-Rohan-East-53,"In response to the incident involving a complete unavailable compute node, the following resolution summary was generated: + +Restore Action: +- Verified that the VM alarm was triggered due to the unavailable compute node. +- Verified that the Hypervisor status was down, and the VM recovery was successful. +- Verified that the VM was successfully restored from the snapshot. +- Notified the Cloud Team for escalation. + +Reason for Outage: +- The root cause of the outage was a complete unavailable compute node. + +Notes: +- The root cause of the outage was identified and confirmed. +- The incident was resolved successfully. +- The incident outcome was a successful resolution of the issue. + +NOC Terminology: +- Compute node: The compute node that experienced the unavailable issue. +- Hypervisor status: The status of the Hypervisor running on the compute node. +- VM recovery: The process of restoring a VM from a backup or snapshot.",2024-07-06 23:32:13,2024-07-06 23:56:13,Rangers of the North,compute_vm_failure_recovery,Node not functional on Relay-Rohan-East-53. Container configuration error after recent deployment — environment variable referencing deleted secret. Remote reset resolved the immediate issue. 
Underlying cause documented for follow-up. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-07-06 23:33 - Alert: Node not functional on upf-data-plane-2e4a in namespace ran-prod at Relay-Rohan-East-53. Pod pending — insufficient resources on node. +2024-07-06 23:35 - Executed: cordoned and drained affected node. Node drained successfully, workloads redistributed. +2024-07-06 23:37 - Container logs show: Kubelet reports: 0/8 nodes available: insufficient memory. Current request: 4Gi, largest available: 2Gi. +2024-07-06 23:40 - Node health: Node OK, storage: 94% used. +2024-07-06 23:42 - Validated running configuration on Relay-Rohan-East-53. No mismatches detected. +2024-07-06 23:45 - Remote action: reset. Partial success — 2 of 3 units recovered. +2024-07-06 23:47 - Verified — Service fully restored. All metrics back to baseline. Replacement pod Running, all readiness probes passing.",Alatar the Blue,2024-07-06 23:32:13,2024-07-06 23:56:13,Compute node completely unavailable,Resolved,Node not functional,0.2,FALSE,Alatar the Blue,hard_solve +INCME-100081,Lothlórien,Arcane Engines,Signal Unit,RAN,Elven Forge Technologies,Lothlórien Central,Resolved,RAN-014: CSR Unreachable,MAJOR,Beacon-Lothlorien-East-78,"Incident Summary: + +1. Restore Action: Verify Upgrade Failure + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Verified that the Cell site router management path was restored successfully. + +2. Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause analysis identified that the Cell site router management path was unavailable due to a system maintenance. +- Emergency boot was initiated to restore the management path. +- Field support was dispatched to address the issue. + +3. 
Restore Action: Check System Status + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Checked the system status to ensure that the Cell site router management path was restored successfully. + +4. Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause",2024-08-05 17:45:03,2024-08-05 18:27:03,Dúnedain Field Division,ran_software_upgrade_recovery,RAN-014: CSR Unreachable at Beacon-Lothlorien-East-78. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. Post-incident review scheduled.,"2024-08-05 17:45 - Critical alarm received for Beacon-Lothlorien-East-78: RAN-014: CSR Unreachable. Checked alarm system — confirmed active since 17:41. All 3 sectors showing Cell Down. +2024-08-05 17:47 - Connectivity check to Beacon-Lothlorien-East-78 — reachable, latency nominal. +2024-08-05 17:49 - Queried element health at Beacon-Lothlorien-East-78. Equipment within operational limits. +2024-08-05 17:51 - Executed software rollback — SUCCESS — reboot initiated, monitoring. +2024-08-05 17:55 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 18:04. KPIs recovering — within 90% of baseline.",Arahad of the Dúnedain,2024-08-05 17:45:03,2024-08-05 18:27:03,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Arahad of the Dúnedain,soft_solve +INCME-100082,Rohan,Signal Core,Beacon Array,RAN,Elven Forge Technologies,Edoras,Resolved,SVC-005: Service Accessibility Degraded,MAJOR,Node-Rohan-Central-41,"Incident Closure Note: + +Restore Action: +- Reviewed Device Issue and identified root cause as End-to-end service accessibility metrics degraded. +- Escalated the issue to the Core Team for further investigation and resolution. 
+- Documented the resolution in the incident closure note. + +Reason for Outage: +- The root cause was identified as End-to-end service accessibility metrics degraded. + +Outcome: +- The issue has been resolved and the Root Cause identified. +- The resolution was documented in the incident closure note. + +Note: The resolution summary does not include any other information or details. It is meant to provide a concise and complete summary of the incident closure notes.",2024-06-05 02:16:51,2024-06-05 02:32:51,Shire Monitoring Guild,ran_device_issue_resolution,SVC-005: Service Accessibility Degraded at Node-Rohan-Central-41. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-06-05 02:16 - Alarm: SVC-005: Service Accessibility Degraded at Node-Rohan-Central-41. Severity: MAJOR. Active 5 minutes. Partial outage — 2 of 3 sectors affected. +2024-06-05 02:20 - RF status check — PA power output 6dB below target on affected sector. VSWR: 1.1:1. TX power: 0 (PA off). +2024-06-05 02:22 - Queried element health at Node-Rohan-Central-41. Equipment within operational limits. +2024-06-05 02:25 - Attempted remote reset on Node-Rohan-Central-41. SUCCESS — reboot initiated, monitoring. +2024-06-05 02:29 - Created ticket TK-20240081 for power maintenance. Remote resolution unsuccessful, physical inspection required. +2024-06-05 02:33 - Recovery confirmed — Partial recovery — monitoring for stability. 
KPIs recovering — within 90% of baseline.",Théoden Horsemaster,2024-06-05 02:16:51,2024-06-05 02:32:51,End-to-end service accessibility metrics degraded,Resolved,SVC-005: Service Accessibility Degraded,0.4,FALSE,Théoden Horsemaster,soft_solve +INCME-100083,Rohan,Path Signals,Power Source,POWER,Dwarven Network Systems,Edoras,Resolved,ENV-001: High Temperature Alert,MAJOR,Station-Rohan-Lower-69,"Incident Closure Summary: + +- Verified that temperature alarm has been cleared +- Dispatched HVAC service to monitor temperature trend +- Monitored temperature data and found that temperature exceeded threshold +- Root cause: Equipment operating temperature exceeded threshold +- Problem type: ENV-001 + +Conclusion: + +- The root cause of the issue was identified and resolved. +- The incident was successfully closed and no further action is required.",2024-09-09 05:47:41,2024-09-09 06:14:41,Elven Signal Keepers,env_high_temperature_response,ENV-001: High Temperature Alert at Station-Rohan-Lower-69. Temperature exceedance caused by blocked exhaust vent combined with high ambient conditions. Automated recovery sequence triggered after remote intervention. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-09 05:48 - Environmental alarm: ENV-001: High Temperature Alert at Station-Rohan-Lower-69. Cabinet temperature at 52°C — threshold is 45°C. Trending upward. +2024-09-09 05:50 - Checked environmental systems: Temperature: 52°C (rising 2°C/hour). HVAC: partial failure. Generator exhaust vent: partially blocked. +2024-09-09 05:52 - Site health under environmental stress: DU throttling due to high temperature. Throughput reduced by 30%. +2024-09-09 05:55 - Validated running configuration on Station-Rohan-Lower-69. No mismatches detected. +2024-09-09 05:58 - Executed generator start: COMMAND FAILED — element not responding. +2024-09-09 06:00 - Ticket TK-20240082 for DevOps: Vendor escalation for firmware issue. 
+2024-09-09 06:03 - Environmental recovery: Site operational, all services confirmed active.",Denethor of Dol Amroth,2024-09-09 05:47:41,2024-09-09 06:14:41,Equipment operating temperature exceeded threshold,Resolved,ENV-001: High Temperature Alert,0.2,FALSE,Denethor of Dol Amroth,soft_solve +INCME-100084,Eriador,Beacon Power,Junction Point,POWER,Dwarven Network Systems,Fornost,Pending Resolution,ENV-001: High Temperature Alert,MAJOR,Spire-Eriador-Lower-79,"Incident Closure Summary: + +- Verified that temperature alarm has been cleared +- Dispatched HVAC service to monitor temperature trend +- Monitored temperature data and found that temperature exceeded threshold +- Root cause: Equipment operating temperature exceeded threshold +- Problem type: ENV-001 + +Conclusion: + +- The root cause of the issue was identified and resolved. +- The incident was successfully closed and no further action is required.",2024-11-06 08:17:12,2024-11-06 08:31:12,Pelargir Port Authority,env_high_temperature_response,ENV-001: High Temperature Alert at Spire-Eriador-Lower-79. Cabinet overheating due to HVAC compressor failure. Single cooling unit insufficient for thermal load. Remote corrective action applied successfully. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-11-06 08:18 - Environmental alarm: ENV-001: High Temperature Alert at Spire-Eriador-Lower-79. HVAC unit reporting fault code F07 — compressor failure. +2024-11-06 08:20 - External conditions: Ambient temperature: 38°C (heat wave in region). No utility outages reported. +2024-11-06 08:23 - Equipment health at Spire-Eriador-Lower-79: checked thermal margins and operational state. +2024-11-06 08:25 - Ticket TK-20240083 for power maintenance: Hardware replacement may be needed. 
+2024-11-06 08:28 - Environmental recovery: Partial recovery — monitoring for stability.",Radagast Mithrandir,2024-11-06 08:17:12,2024-11-06 08:31:12,Equipment operating temperature exceeded threshold,Pending Resolution,ENV-001: High Temperature Alert,0.6,FALSE,Radagast Mithrandir,soft_solve +INCME-100085,Eriador,Signal Network,Weather Watch,RAN,Elven Forge Technologies,Tharbad,Resolved,RAN-001: Cell Service Interruption,MINOR,Tower-Eriador-Outer-86,"NOC Engineer's Resolution Summary: + +Reason for Outage: The cell completely unavailable for service + +Restore Action: +- Verified that site was operational and no issues were detected +- Notified field technician to dispatch and begin troubleshooting +- Final status verification was conducted to ensure that the issue was resolved and the site was back to normal service + +Root cause: Cell completely unavailable for service + +Problem type: RAN-001 + +This resolution summary has been generated based on the information provided by the NOC engineer during the incident closure process. The summary includes the root cause of the outage, the steps taken to fix the issue, and the final status verification. The summary is concise and provides a clear understanding of the incident.",2024-11-18 14:44:06,2024-11-18 15:18:06,Gondor Signal Authority,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Tower-Eriador-Outer-86. Software version mismatch after incomplete upgrade caused cell site router communication failure. Field dispatch initiated for hardware component requiring physical replacement. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-11-18 14:44 - Alarm: RAN-001: Cell Service Interruption at Tower-Eriador-Outer-86. Severity: MINOR. Active 7 minutes. Sector Alpha down, Beta and Gamma operational. +2024-11-18 14:48 - Connectivity check to Tower-Eriador-Outer-86 — reachable, latency nominal. +2024-11-18 14:50 - Checked neighbors and topology. 
2 neighbor sites also showing degraded metrics — possible area issue. +2024-11-18 14:52 - Queried element health at Tower-Eriador-Outer-86. Equipment within operational limits. +2024-11-18 14:56 - Validated running configuration on Tower-Eriador-Outer-86. No mismatches detected. +2024-11-18 15:00 - Attempted remote generator start on Tower-Eriador-Outer-86. Partial success — 2 of 3 units recovered. +2024-11-18 15:04 - Verified: Site operational, all services confirmed active. All sectors recovered. Alarms cleared.",Radagast Mithrandir,2024-11-18 14:44:06,2024-11-18 15:18:06,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Radagast Mithrandir,hard_solve +INCME-100086,Rivendell,Signal Network,Relay Unit,POWER,Dwarven Network Systems,Grey Havens,Resolved,PWR-002: DC Rectifier Failure,WARNING,Station-Rivendell-West-03,"Reason for Outage: A DC power conversion unit malfunction caused a power outage in the NOC. The root cause of the issue was a faulty power converter unit. + +Restore Action: +- Confirm Rectifier Alarm: A rectifier alarm was triggered due to a faulty power converter unit. The rectifier was reset, and the fault was identified and rectified. +- Enable Battery Monitoring: The battery monitoring system was enabled to monitor the battery's health and ensure that it remained at a stable state. +- Final Status Check: The NOC finalized the resolution and the outage was restored to normal. + +Outcome: Failed",2024-07-27 23:53:50,2024-07-28 00:21:50,Arnor Response Team,power_dc_rectifier_recovery,DC power conversion unit malfunction at Station-Rivendell-West-03. Battery bank reached low-voltage disconnect threshold during extended commercial power outage. Configuration corrected and service restored remotely within SLA. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-07-27 23:55 - Checked alarm system — confirmed PWR-002: DC Rectifier Failure active since 23:47. 
Site Station-Rivendell-West-03 showing power-related alarms.
+2024-07-27 23:59 - Power telemetry check — AC input: unstable — voltage fluctuations. UPS providing backup, battery at 28%. Generator running — full load.
+2024-07-28 00:02 - Checked external factors near Station-Rivendell-West-03. No area-wide issues.
+2024-07-28 00:04 - Queried element health at Station-Rivendell-West-03. Equipment within operational limits.
+2024-07-28 00:08 - Executed remote command: force restart. Result: COMMAND FAILED — element not responding.
+2024-07-28 00:12 - Final check: Service fully restored. All metrics back to baseline. KPIs returned to normal.",Alatar the White,2024-07-27 23:53:50,2024-07-28 00:21:50,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,FALSE,Alatar the White,soft_solve
+INCME-100087,Rivendell,Relay Transport,Power Source,RAN,Elven Forge Technologies,Rivendell,Resolved,RAN-007: Cell Not Radiating,MAJOR,Outpost-Rivendell-West-41,"In response to the failed incident closure, the following resolution summary is provided:
+
+Restore Action:
+- Verified VSWR Alarm, TX power reduced, RF path status checked
+- Scheduled field inspection for sector impact
+- Verified sector impact and took corrective action
+
+Reason for Outage:
+- Transmission chain failure preventing RF emission
+
+Root cause:
+- The failure of the transmission chain caused RF emission to be blocked, preventing the RAN from sending and receiving signals.
+
+Problem type:
+- RAN-007, a critical failure in the transmission chain that impacted the RAN's ability to communicate with other networks.
+
+Resolution summary:
+- Verified the transmission chain's failure and took corrective action to restore the RF emission.
+- Scheduled a field inspection to assess the sector impact and take corrective action if necessary.
+- Verified that the sector impact was resolved and took corrective action to prevent further issues.",2024-07-09 06:28:26,2024-07-09 07:05:26,White Tower Operations,ran_vswr_alarm_resolution,RAN-007: Cell Not Radiating at Outpost-Rivendell-West-41. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Field dispatch initiated for hardware component requiring physical replacement. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-07-09 06:28 - Alarm: RAN-007: Cell Not Radiating at Outpost-Rivendell-West-41. Severity: MAJOR. Active 12 minutes. Sector Alpha down, Beta and Gamma operational. +2024-07-09 06:31 - Connectivity check to Outpost-Rivendell-West-41 — reachable, latency nominal. +2024-07-09 06:33 - Queried element health at Outpost-Rivendell-West-41. Equipment within operational limits. +2024-07-09 06:36 - Executed unlock cells — Partial success — 2 of 3 units recovered. +2024-07-09 06:39 - Escalated to core operations via TK-20240086: Hardware replacement may be needed. +2024-07-09 06:43 - Verified: Partial recovery — monitoring for stability. All sectors operational and radiating. Alarms all cleared.",Dori of the Lonely Mountain,2024-07-09 06:28:26,2024-07-09 07:05:26,Transmission chain failure preventing RF emission,Resolved,RAN-007: Cell Not Radiating,0.4,FALSE,Dori of the Lonely Mountain,hard_solve +INCME-100088,Mordor Surveillance Zone,Relay Transport,Weather Watch,RAN,Elven Forge Technologies,Calembel,Resolved,RAN-007: Cell Not Radiating,MAJOR,Hub-MordorSurveillanceZone-Inner-88,"Incident Summary: + +During a scheduled maintenance event, a transmission chain failure occurred that prevented RF emission. This resulted in a failover to the backup transmission chain, which led to an outage in the RAN. + +Restore Action: + +- Verify VSWR Alarm: The VSWR alarm was triggered due to the failure of the primary transmission chain. A team was dispatched to investigate and resolve the issue. 
+- Schedule Field Inspection: The team scheduled a field inspection to investigate the transmission chain failure and ensure that it was resolved. +- Verify Sector Impact: The team verified that the sector impact was minimal due to the backup transmission chain. + +Reason for Outage: + +The root cause of the outage was the transmission chain failure, preventing RF emission. This failure resulted in an impairment of the RAN, leading to the outage. + +Conclusion: + +The incident closure notes have been updated to include the resolution actions taken, the root cause of the outage, and the impact of the outage on the RAN. The team is continuing to investigate the transmission chain failure and ensure that it is resolved.",2024-07-20 18:41:21,2024-07-20 19:00:21,Rangers of the North,ran_vswr_alarm_resolution,RAN-007: Cell Not Radiating at Hub-MordorSurveillanceZone-Inner-88. Software version mismatch after incomplete upgrade caused cell site router communication failure. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-07-20 18:41 - Alarm: RAN-007: Cell Not Radiating at Hub-MordorSurveillanceZone-Inner-88. Severity: MAJOR. Active 6 minutes. Intermittent alarm — flapping every 2-3 minutes. +2024-07-20 18:45 - Checked neighbors and topology. Adjacent sites normal — confirms localized fault. +2024-07-20 18:48 - Connectivity check to Hub-MordorSurveillanceZone-Inner-88 — reachable, latency nominal. +2024-07-20 18:51 - Queried element health at Hub-MordorSurveillanceZone-Inner-88. Equipment within operational limits. +2024-07-20 18:55 - Validated running configuration on Hub-MordorSurveillanceZone-Inner-88. No mismatches detected. +2024-07-20 18:59 - Attempted remote software rollback on Hub-MordorSurveillanceZone-Inner-88. COMMAND SUCCESS — operation completed. +2024-07-20 19:01 - Recovery confirmed — Partial recovery — monitoring for stability. 
KPIs returned to normal.",Folcwine of the Mark,2024-07-20 18:41:21,2024-07-20 19:00:21,Transmission chain failure preventing RF emission,Resolved,RAN-007: Cell Not Radiating,0.4,FALSE,Folcwine of the Mark,hard_solve +INCME-100089,Mordor Surveillance Zone,Path Signals,Power Source,POWER,Elven Forge Technologies,Osgiliath,Resolved,PWR-002: DC Rectifier Failure,CRITICAL,Tower-MordorSurveillanceZone-West-47,"Reopened from INCME-100086. + +After confirming the rectifier alarm, enabling battery monitoring, and performing a final status check, the NOC engineer has completed the incident closure notes for PWR-002, a power supply issue that occurred at the facility. The root cause of the issue was a DC power conversion unit malfunction. The engineer has recorded this in the incident closure notes and has provided a resolution summary that includes the following: + +1. Restore Action: Enable battery monitoring and rectifier alarm +2. Reason For Outage: The malfunction of the DC power conversion unit + +The engineer has also included the final status check, which confirmed that the issue was resolved and the power supply functioned correctly. The engineer has recorded this in the notes and has provided a resolution summary that includes the root cause and the resolution taken to restore the facility's power supply. The engineer has ensured that all necessary documentation is included in the notes, including the incident summary, the incident closure notes, and the final status check.",2024-07-29 16:21:50,2024-07-29 16:42:50,Gondor Gateway Team,power_dc_rectifier_recovery,DC power conversion unit malfunction at Tower-MordorSurveillanceZone-West-47. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Recovery verified — all metrics within acceptable range. 
Incident closed.,"2024-07-29 16:22 - Checked alarm system — confirmed PWR-002: DC Rectifier Failure active since 16:16. Site Tower-MordorSurveillanceZone-West-47 showing power-related alarms.
+2024-07-29 16:26 - Checked external factors near Tower-MordorSurveillanceZone-West-47. No area-wide issues.
+2024-07-29 16:30 - Queried power subsystems. UPS: active, load transfer complete. Battery: 82% and dropping rapidly. Generator: running — full load. Rectifier: output normal.
+2024-07-29 16:32 - Queried element health at Tower-MordorSurveillanceZone-West-47. Equipment within operational limits.
+2024-07-29 16:36 - Escalated via ticket TK-20240088 to DevOps. Remote resolution unsuccessful, physical inspection required. ETA: 20 minutes.
+2024-07-29 16:40 - Final check: Partial recovery — monitoring for stability. KPI metrics stable for 15 minutes.",Nori of Erebor,2024-07-29 16:21:50,2024-07-29 16:42:50,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,TRUE,Nori of Erebor,soft_solve
+INCME-100090,Rohan,Path Signals,Path Router,COMPUTE,Dwarven Network Systems,Grimslade,Resolved,CMP-005: Pod Terminating Stuck,WARNING,Outpost-Rohan-Primary-94,"NOC Engineer's Note: Verify Orchestrator Alarm and Escalate to Platform Team
+
+Reason for Outage: A Pod stuck in terminating state beyond grace period, causing the Orchestrator to become unresponsive.
+
+Restore Action: Verify Orchestrator Alarm, escalate to Platform Team, verify Orchestrator Recovery.
+
+Root Cause: A Pod stuck in terminating state beyond grace period.
+
+Problem Type: CMP-005.
+
+Conclusion: The root cause of the outage was a Pod stuck in terminating state beyond grace period. The Orchestrator was successfully restored to a functional state. The Platform Team was notified and will be responsible for further investigation and resolution.",2024-09-22 21:45:55,2024-09-22 22:22:55,Riders of the Mark,compute_orchestrator_recovery,CMP-005: Pod Terminating Stuck on Outpost-Rohan-Primary-94. 
Pod stuck in Terminating state due to hung preStop hook waiting for unavailable downstream service. Escalation ticket created for vendor engagement on firmware issue. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-09-22 21:47 - Alarm received: CMP-005: Pod Terminating Stuck at Outpost-Rohan-Primary-94. Container OOMKilled — memory limit exceeded. Checking container state. +2024-09-22 21:51 - Queried orchestrator — cu-control-plane-7b4d: Error. Image pull attempts: 8. Last error: connection refused. +2024-09-22 21:55 - Container logs show: Application crash: SIGSEGV in libcurl.so during TLS handshake. Possibly corrupted container image layer. +2024-09-22 21:59 - Node health: Node OK, storage: 94% used. +2024-09-22 22:03 - Remote action: power cycle. Partial success — 2 of 3 units recovered. +2024-09-22 22:06 - Recovery check: Full recovery confirmed. Alarm cleared at 22:04. All probes passing.",Fredegar Bracegirdle,2024-09-22 21:45:55,2024-09-22 22:22:55,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.2,FALSE,Fredegar Bracegirdle,hard_solve +INCME-100091,Eriador,Relay Transport,Relay Unit,COMPUTE,Dwarven Network Systems,Weathertop,Resolved,Node not functional,CRITICAL,Hub-Eriador-Primary-54,"Incident closure summary: + +Restore Action: +- Verified that the compute node was unavailable, and it was due to a hardware failure. +- Verified that the root cause was the hardware failure, and a new compute node was deployed and brought online. +- Notified the cloud team to escalate the issue to them, and they confirmed that they had resolved the issue. + +Reason for Outage: +- The root cause was the hardware failure, which led to the unavailability of the compute node. + +Close notes: +- The incident was resolved successfully, and the compute node was restored to its normal state. +- The root cause of the issue was confirmed, and the appropriate action was taken to resolve the issue. 
+- The incident was reported to the incident management team, and a resolution summary was generated. +- The incident was closed with a resolution that included a restore action, a root cause analysis, and a resolution summary.",2024-08-19 11:42:58,2024-08-19 12:12:58,Gondor Gateway Team,compute_vm_failure_recovery,Node not functional on Hub-Eriador-Primary-54. Container configuration error after recent deployment — environment variable referencing deleted secret. Configuration corrected and service restored remotely within SLA. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-08-19 11:43 - Alarm received: Node not functional at Hub-Eriador-Primary-54. Container OOMKilled — memory limit exceeded. Checking container state. +2024-08-19 11:46 - Orchestration: cordoned and drained affected node — Pod removed, replacement scheduled. +2024-08-19 11:49 - Pod status: ImagePullBackOff. Reason: OOMKilled. Restart count: 12. Last restart: 3 minutes ago. +2024-08-19 11:52 - Node health: Node OK, storage: 94% used. +2024-08-19 11:56 - Remote action: power cycle. COMMAND FAILED — element not responding. +2024-08-19 11:59 - Verified — Service fully restored. All metrics back to baseline. Workload stable on new node.",Radagast the Brown,2024-08-19 11:42:58,2024-08-19 12:12:58,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Radagast the Brown,hard_solve +INCME-100092,Mordor Surveillance Zone,Beacon Power,Signal Unit,SIGNALING,Elven Forge Technologies,Grimslade,Resolved,SIG-003: SCTP Association Failure,MINOR,Beacon-MordorSurveillanceZone-North-43,"Resolution Summary: + +The root cause of the SIG-003 issue was a Stream Control Transmission Protocol (SCTP) association loss. The issue was resolved by verifying that the SIP Alarm was verified and escalating the issue to the IMS team for further investigation. The root cause was identified and documented in the incident closure notes. 
The team worked together to verify the SIP Alarm and escalate the issue to the IMS team. The IMS team conducted an investigation and identified the root cause as a Stream Control Transmission Protocol (SCTP) association loss. The issue was resolved and the SIG-003 was upgraded to a SIG-004. The team closed the incident and provided a resolution summary.",2024-08-30 09:59:26,2024-08-30 10:25:26,Iron Hills Transport,signaling_sip_registration_recovery,SIG-003: SCTP Association Failure at Beacon-MordorSurveillanceZone-North-43. Signaling path failure due to SCTP association timeout. Single-homed configuration had no failover. Configuration corrected and service restored remotely within SLA. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-08-30 10:00 - Signaling alarm: SIG-003: SCTP Association Failure at Beacon-MordorSurveillanceZone-North-43. CU-DU communication path degraded. +2024-08-30 10:02 - Connectivity check to Beacon-MordorSurveillanceZone-North-43 — reachable, latency nominal. +2024-08-30 10:06 - Tested Diameter path — Path degraded — 30% packet loss on signaling. +2024-08-30 10:09 - Element health for Beacon-MordorSurveillanceZone-North-43: CU processing normal. Checked signaling interface status. +2024-08-30 10:11 - Executed unlock cells — COMMAND SUCCESS — operation completed. +2024-08-30 10:15 - Verified: Full recovery confirmed. Alarm cleared at 10:12. Diameter path recovered.",Kíli of the Lonely Mountain,2024-08-30 09:59:26,2024-08-30 10:25:26,Stream Control Transmission Protocol association lost,Resolved,SIG-003: SCTP Association Failure,0.2,FALSE,Kíli of the Lonely Mountain,soft_solve +INCME-100093,Iron Hills,Path Signals,Relay Unit,POWER,Elven Forge Technologies,Dale,Pending Resolution,PWR-001: AC Power Failure,MAJOR,Tower-IronHills-Upper-61,"Reason for Outage: A commercial power supply interruption or failure occurred on the AC power line, resulting in a loss of power to the NOC facility. 
+ +Restore Action: +- Verified AC Power Loss +- Checked UPS Status +- Dispatched Field Technician +- Final Verification + +Action Taken: +- Verified AC Power Loss +- Checked UPS Status +- Dispatched Field Technician +- Final Verification + +Root Cause: +- Commercial power supply interruption or failure + +Problem Type: +- PWR-001",2024-10-15 21:29:47,2024-10-15 22:02:47,Rohan Rapid Response,power_ac_failure_recovery,Commercial power supply interruption or failure at Tower-IronHills-Upper-61. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Configuration corrected and service restored remotely within SLA. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-10-15 21:30 - Alarm dashboard shows PWR-001: AC Power Failure triggered at 21:24. Severity: MAJOR. Correlating with site power infrastructure. +2024-10-15 21:33 - Polled power infrastructure. Mains: unstable — voltage fluctuations. UPS active, estimated 45 minutes on battery. Generator: starting sequence initiated. +2024-10-15 21:37 - Checked external factors near Tower-IronHills-Upper-61. No area-wide issues. +2024-10-15 21:40 - Queried element health at Tower-IronHills-Upper-61. Equipment within operational limits. +2024-10-15 21:43 - Escalated via ticket TK-20240092 to transport team. Remote resolution unsuccessful, physical inspection required. ETA: 20 minutes. +2024-10-15 21:46 - Verified recovery — Service fully restored. All metrics back to baseline. 
All alarms cleared.",Faramir of the White City,2024-10-15 21:29:47,2024-10-15 22:02:47,Commercial power supply interruption or failure,Pending Resolution,PWR-001: AC Power Failure,0.4,FALSE,Faramir of the White City,soft_solve
+INCME-100094,Gondor,Signal Core,Central Nexus,RAN,Elven Forge Technologies,Minas Tirith,Pending Resolution,Access instability,CRITICAL,Hub-Gondor-Upper-07,"In the incident of failed workflow outcome, a root cause was identified as the Random Access Channel (RAC) success rate being below the threshold. The root cause was attributed to an issue with the interference alarm, which was verified by the team. The team then reported the issue to Spectrum Management and verified the interference status. The issue was resolved and the workflow outcome was successfully completed. The resolution summary includes the following information:
+
+Restore Action: Verify Interference Alarm
+Reason For Outage: The RAC success rate was below the threshold
+
+The team verified that the interference alarm was working correctly, which was a critical component in the workflow. The root cause was addressed and the workflow outcome was successfully completed. The resolution summary provides a concise and complete summary of the incident, including the actions taken, the root cause, and the resolution.",2024-08-15 22:37:31,2024-08-15 22:54:31,Istari Advisory Board,ran_interference_mitigation,Access instability at Hub-Gondor-Upper-07. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-08-15 22:38 - Critical alarm received for Hub-Gondor-Upper-07: Access instability. Checked alarm system — confirmed active since 22:30. Sector Alpha down, Beta and Gamma operational.
+2024-08-15 22:42 - Queried RF chain. 
Sector Beta: high VSWR (3.2:1), possible feeder issue. Antenna tilt: 8°. PA status: active. +2024-08-15 22:46 - Queried element health at Hub-Gondor-Upper-07. Equipment within operational limits. +2024-08-15 22:49 - Created ticket TK-20240093 for transport team. Remote resolution unsuccessful, physical inspection required. +2024-08-15 22:53 - Verified: Full recovery confirmed. Alarm cleared at 22:56. All sectors recovered. Alarms all cleared.",Peregrin Hornblower,2024-08-15 22:37:31,2024-08-15 22:54:31,Random access channel success rate below threshold,Pending Resolution,Access instability,0.6,FALSE,Peregrin Hornblower,soft_solve +INCME-100095,Gondor,Arcane Engines,Path Router,POWER,Dwarven Network Systems,Cair Andros,Resolved,PWR-002: DC Rectifier Failure,CRITICAL,Outpost-Gondor-East-11,"Reopened from INCME-100089. + +Resolution Summary: + +In response to the issue of a DC power conversion unit malfunction, the following actions were taken: + +- Confirming that the Rectifier Alarm was triggered due to a malfunction in the DC power conversion unit. +- Checking the DC Bus Voltage to ensure it was within the acceptable range. +- Assessing the Rectifier Redundancy to ensure it was not damaged. +- Attempting Rectifier Reset to restore the system to its normal operation. +- Scheduling Field Replacement to replace the faulty component. +- Enabling Battery Monitoring to monitor the battery condition. +- Final Status Check to ensure that the system is functioning properly. + +Root cause: The DC power conversion unit malfunctioned due to a fault in the rectifier. + +Problem type: PWR-002",2024-08-01 00:42:50,2024-08-01 01:20:50,Rangers of the North,power_dc_rectifier_recovery,DC power conversion unit malfunction at Outpost-Gondor-East-11. UPS sustained load during outage but battery dropped to critical levels. Generator auto-start failed due to mechanical fault. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. 
Full service restoration confirmed. Post-incident review scheduled.,"2024-08-01 00:42 - Verified alarm: PWR-002: DC Rectifier Failure at Outpost-Gondor-East-11. Active for 14 minutes. Multiple downstream alarms generated.
+2024-08-01 00:44 - Polled power infrastructure. Mains: unstable — voltage fluctuations. UPS active, estimated 45 minutes on battery. Generator: offline — maintenance mode.
+2024-08-01 00:48 - Checked external factors near Outpost-Gondor-East-11. No area-wide issues.
+2024-08-01 00:51 - Queried element health at Outpost-Gondor-East-11. Equipment within operational limits.
+2024-08-01 00:54 - Escalated via ticket TK-20240094 to field operations. Vendor escalation for firmware issue. ETA: 45 minutes.
+2024-08-01 00:57 - Final check: Service fully restored. All metrics back to baseline. KPIs returned to normal.",Arahad Wingfoot,2024-08-01 00:42:50,2024-08-01 01:20:50,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,TRUE,Arahad Wingfoot,soft_solve
+INCME-100096,The Shire,Path Signals,Signal Unit,COMPUTE,Elven Forge Technologies,Calembel,Resolved,Node not functional,MAJOR,Outpost-TheShire-East-27,"Reopened from INCME-100091.
+
+Incident closure summary:
+
+Restore Action:
+- Verified that the compute node was unavailable, and it was due to a hardware failure.
+- Verified that the root cause was the hardware failure, and a new compute node was deployed and brought online.
+- Notified the cloud team to escalate the issue to them, and they confirmed that they had resolved the issue.
+
+Reason for Outage:
+- The root cause was the hardware failure, which led to the unavailability of the compute node.
+
+Close notes:
+- The incident was resolved successfully, and the compute node was restored to its normal state.
+- The root cause of the issue was confirmed, and the appropriate action was taken to resolve the issue.
+- The incident was reported to the incident management team, and a resolution summary was generated. 
+- The incident was closed with a resolution that included a restore action, a root cause analysis, and a resolution summary.",2024-08-22 01:12:58,2024-08-22 01:45:58,Shire Monitoring Guild,compute_vm_failure_recovery,Node not functional on Outpost-TheShire-East-27. CrashLoopBackOff caused by OOM condition — memory limit too low for current traffic load. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. Post-incident review scheduled.,"2024-08-22 01:14 - Alarm received: Node not functional at Outpost-TheShire-East-27. CrashLoopBackOff — 12 restarts in last hour. Checking container state. +2024-08-22 01:17 - Container logs show: Last 20 restarts show same pattern: starts OK, then OOMKilled at ~200MB after 3-4 minutes of operation. +2024-08-22 01:19 - Pod status: Error. Reason: ContainerConfigError. Image pull attempts: 8. Last error: connection refused. +2024-08-22 01:23 - Node health: Node NotReady — kubelet unresponsive. +2024-08-22 01:25 - Validated running configuration on Outpost-TheShire-East-27. No mismatches detected. +2024-08-22 01:28 - Ticket TK-20240095 created for core operations: Hardware replacement may be needed. +2024-08-22 01:32 - Ran diagnostic suite on Outpost-TheShire-East-27. Results collected for analysis. +2024-08-22 01:34 - Recovery check: Service fully restored. All metrics back to baseline. All probes passing.",Tom Bolger,2024-08-22 01:12:58,2024-08-22 01:45:58,Compute node completely unavailable,Resolved,Node not functional,0.4,TRUE,Tom Bolger,hard_solve +INCME-100097,Rohan,Beacon Power,Power Source,POWER,Dwarven Network Systems,Edoras,Resolved,PWR-001: AC Power Failure,MAJOR,Spire-Rohan-Upper-04,"NOC Engineer's Note: + +Incident Summary: A power outage caused by a commercial power supply interruption or failure, affecting the AC power supply to a critical IT infrastructure. + +Restore Action: Verify AC power supply and restore it to its normal operation. 
+ +Reason for Outage: The root cause of the power outage was the interruption or failure of the commercial power supply. + +Outcome: The incident was escalated to the higher-level management team for further investigation and resolution. + +NOC Terms: +- Commercial power supply interruption or failure +- Root cause +- Problem type (PWR-001) +- Verification of AC power supply +- Restore action +- Reason for outage +- Outcome + +NOC Engineer's Note: +- This resolution summary is a summary of the incident closure notes written by the NOC engineer. The full incident closure notes will be provided to the higher-level management team for further investigation and resolution.",2024-09-17 18:59:15,2024-09-17 19:22:15,Istari Advisory Board,power_ac_failure_recovery,Commercial power supply interruption or failure at Spire-Rohan-Upper-04. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Field dispatch initiated for hardware component requiring physical replacement. Full service restoration confirmed. Post-incident review scheduled.,"2024-09-17 19:01 - Verified alarm: PWR-001: AC Power Failure at Spire-Rohan-Upper-04. Active for 7 minutes. Multiple downstream alarms generated. +2024-09-17 19:03 - Queried power subsystems. UPS: active, load transfer complete. Battery: 45% and stable. Generator: starting sequence initiated. Rectifier: output low — partial failure. +2024-09-17 19:07 - Checked external factors near Spire-Rohan-Upper-04. No area-wide issues. +2024-09-17 19:11 - Queried element health at Spire-Rohan-Upper-04. Equipment within operational limits. +2024-09-17 19:14 - Validated power system configuration. Rolled back to last known good configuration. +2024-09-17 19:16 - Executed remote command: software rollback. Result: COMMAND SUCCESS — operation completed. +2024-09-17 19:19 - Verified recovery — Partial recovery — monitoring for stability. 
All alarms cleared.",Rúmil of Lothlórien,2024-09-17 18:59:15,2024-09-17 19:22:15,Commercial power supply interruption or failure,Resolved,PWR-001: AC Power Failure,0.2,FALSE,Rúmil of Lothlórien,soft_solve
+INCME-100098,Mordor Surveillance Zone,Signal Network,Relay Unit,RAN,Dwarven Network Systems,Henneth Annûn,Resolved,RAN-013: Site Communication Failure,MAJOR,Spire-MordorSurveillanceZone-Central-12,"NOC Engineer: Investigation and Root Cause Analysis (RCA) completed.
+
+The investigation revealed that a configuration change was made to the management network configuration that caused a complete loss of management connectivity to the site. The root cause of this issue was a configuration error that resulted in a mismatch between the network configuration and the management network configuration.
+
+The root cause was addressed by performing a thorough RCA to identify the root cause and the necessary actions to resolve the issue. The following actions were taken:
+
+1. Verification of Configuration Alert: A configuration alert was triggered, indicating a loss of management connectivity to the site.
+
+2. Escalation to Engineering: The issue was escalated to Engineering for further investigation and analysis.
+
+3. Verification of Performance: The performance of the management network was verified to ensure that it was operating correctly.
+
+4. Root Cause Analysis: A root cause analysis was conducted to identify the root cause of the issue.
+
+5. Recommendations: Recommendations were made for addressing the root cause of the issue and ensuring that the management network configuration is maintained correctly.
+
+6. Restore Action: Restoration of the management network configuration",2024-08-30 18:20:29,2024-08-30 18:51:29,Orthanc Technical Review,ran_parameter_correction,RAN-013: Site Communication Failure at Spire-MordorSurveillanceZone-Central-12. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. 
Remote corrective action applied successfully. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-08-30 18:21 - Critical alarm received for Spire-MordorSurveillanceZone-Central-12: RAN-013: Site Communication Failure. Checked alarm system — confirmed active since 18:17. Intermittent alarm — flapping every 2-3 minutes. +2024-08-30 18:23 - Topology analysis: No common alarms on adjacent sites. Issue isolated to this element. Upstream path: healthy. +2024-08-30 18:26 - Queried RF chain. PA power output 6dB below target on affected sector. Antenna tilt: 2°. PA status: standby. +2024-08-30 18:30 - Queried element health at Spire-MordorSurveillanceZone-Central-12. Equipment within operational limits. +2024-08-30 18:34 - Validated running configuration on Spire-MordorSurveillanceZone-Central-12. No mismatches detected. +2024-08-30 18:36 - Attempted remote generator start on Spire-MordorSurveillanceZone-Central-12. COMMAND SUCCESS — operation completed. +2024-08-30 18:40 - Verified: Service fully restored. All metrics back to baseline. All sectors recovered. Alarms self-cleared after fix applied.",Arahael Wingfoot,2024-08-30 18:20:29,2024-08-30 18:51:29,Complete loss of management connectivity to site,Resolved,RAN-013: Site Communication Failure,0.2,FALSE,Arahael Wingfoot,soft_solve +INCME-100099,The Shire,Signal Network,Signal Unit,RAN,Elven Forge Technologies,Fornost,Resolved,RAN-001: Cell Service Interruption,CRITICAL,Hub-TheShire-South-38,"NOC Engineer's Resolution Summary: + +Workflow: Site Outage + +Actions taken: +- Verified Site Outage with the Field Technician +- Notified the Root Cause Team +- Final Status Verification + +Root cause: Cell completely unavailable for service +- The cell was completely unavailable due to a technical issue with the network equipment. +- The issue was identified and resolved by the Field Technician. +- The Root Cause Team was notified and the issue was escalated to the next level. 
+ +Problem type: RAN-001 + +Resolution: +- The issue has been resolved and the cell is now operational. +- The Root Cause Team has been notified and the issue is being monitored. +- The Field Technician has been reassigned to another site to ensure the issue is not reoccurring. +- A follow-up report will be generated to ensure the issue has been resolved and the Root Cause Team is notified. + +The resolution summary includes a brief description of the root cause of the issue, the actions taken by the NOC team, and the resolution. The resolution is clear and concise,",2024-11-29 17:47:44,2024-11-29 18:12:44,Rangers of the North,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Hub-TheShire-South-38. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Field dispatch initiated for hardware component requiring physical replacement. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-11-29 17:47 - Critical alarm received for Hub-TheShire-South-38: RAN-001: Cell Service Interruption. Checked alarm system — confirmed active since 17:43. Partial outage — 2 of 3 sectors affected. +2024-11-29 17:51 - Connectivity check to Hub-TheShire-South-38 — reachable, latency nominal. +2024-11-29 17:55 - Queried element health at Hub-TheShire-South-38. Equipment within operational limits. +2024-11-29 17:59 - Validated running configuration on Hub-TheShire-South-38. No mismatches detected. +2024-11-29 18:03 - Created ticket TK-20240098 for core operations. Persistent fault after 2 remote attempts — need on-site investigation. +2024-11-29 18:07 - Verified: Service fully restored. All metrics back to baseline. All sectors operational and radiating. 
Alarms cleared.",Thrór Oakenshield,2024-11-29 17:47:44,2024-11-29 18:12:44,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Thrór Oakenshield,soft_solve +INCME-100100,Lothlórien,Signal Core,Power Source,COMPUTE,Elven Forge Technologies,Hollin Gate,Resolved,CMP-010: Site Not Scrolling,MAJOR,Point-Lothlorien-South-48,"NOC Engineer's Workflow Outcome: Completed + +Actions Taken: +- Verified that an image pull alarm had been triggered due to a stalled Site deployment or scaling operation. +- Confirmed that the image pull alarm was caused by a registry connectivity issue. +- Verified that the image pull alarm was resolved by escalating the issue to the appropriate team for further investigation. +- Confirmed that the registry connectivity issue had been resolved by verifying that the registry credentials were correct. +- Verified that the issue was resolved and the system was restored to normal operation. + +Root Cause: +The root cause of the issue was a stalled Site deployment or scaling operation, which caused an image pull alarm to be triggered due to a registry connectivity issue. The registry connectivity issue was resolved by verifying that the registry credentials were correct and verifying that the image pull alarm was resolved by escalating the issue to the appropriate team for further investigation.",2024-08-04 18:29:41,2024-08-04 18:54:41,White Tower Operations,compute_image_pull_recovery,CMP-010: Site Not Scrolling on Point-Lothlorien-South-48. ImagePullBackOff due to internal registry pod failure on same node as requesting pod. Escalation ticket created for vendor engagement on firmware issue. Partial recovery achieved remotely. Field ticket remains open for permanent fix.,"2024-08-04 18:30 - Alarm received: CMP-010: Site Not Scrolling at Point-Lothlorien-South-48. Pod pending — insufficient resources on node. Checking container state. +2024-08-04 18:34 - Queried orchestrator — amf-proxy-9c2e: CrashLoopBackOff. 
Image pull attempts: 8. Last error: connection refused. +2024-08-04 18:37 - Container logs show: Kubelet reports: 0/8 nodes available: insufficient memory. Current request: 4Gi, largest available: 2Gi. +2024-08-04 18:41 - Node health: Node healthy — issue isolated to pod. +2024-08-04 18:43 - Validated running configuration on Point-Lothlorien-South-48. No mismatches detected. +2024-08-04 18:45 - Remote action: power cycle. Partial success — 2 of 3 units recovered. +2024-08-04 18:49 - Verified — Partial recovery — monitoring for stability. Replacement pod Running, all readiness probes passing.",Elrond of the Grey Havens,2024-08-04 18:29:41,2024-08-04 18:54:41,Site deployment or scaling operation stalled,Resolved,CMP-010: Site Not Scrolling,0.2,FALSE,Elrond of the Grey Havens,hard_solve +INCME-100101,Rohan,Arcane Engines,Path Router,POWER,Elven Forge Technologies,Snowbourn,Pending Resolution,PWR-003: Battery Discharge Alert,CRITICAL,Tower-Rohan-Central-75,"In accordance with the Network Operations Center (NOC) protocol, incident closure notes were completed for the following incident: + +Battery Alert: A battery backup depleting without AC restoration was detected. + +Action taken: +- Acknowledged the battery alert by verifying the battery status and confirming that the battery backup was depleting without AC restoration. +- Checked the battery status to ensure that the battery backup was not overloaded or damaged. +- Notified Emergency Dispatch to escalate the issue. + +Root cause: +The battery backup depleting without AC restoration was due to a lack of power supply to the battery. This was caused by a faulty AC power source or an incorrect wiring configuration. + +Problem type: PWR-003 + +Resolution: +- Verified the battery status and confirmed that the battery backup was not overloaded or damaged. +- Notified the site manager to replace the faulty AC power source and ensure proper wiring configuration to prevent similar issues in the future. 
+
+- Notified the NOC team to investigate the cause of the battery depletion and implement measures to prevent similar incidents in the",2024-07-16 14:51:47,2024-07-16 15:25:47,Dúnedain Field Division,power_battery_discharge_response,Battery backup depleting without AC restoration at Tower-Rohan-Central-75. DC rectifier module failure caused voltage drop below threshold. Backup rectifier activated but insufficient for full load. Automated recovery sequence triggered after remote intervention. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-07-16 14:51 - Checked alarm system — confirmed PWR-003: Battery Discharge Alert active since 14:44. Site Tower-Rohan-Central-75 showing power-related alarms.
+2024-07-16 14:55 - Power telemetry check — AC input: nominal. UPS providing backup, battery at 12%. Generator offline — maintenance mode.
+2024-07-16 14:57 - Checked external factors near Tower-Rohan-Central-75. No area-wide issues.
+2024-07-16 14:59 - Queried element health at Tower-Rohan-Central-75. Equipment within operational limits.
+2024-07-16 15:03 - Created P1 ticket TK-20240100 for vendor support. Issue: Hardware replacement may be needed.
+2024-07-16 15:07 - Verified recovery — Site operational, all services confirmed active. All alarms cleared.",Galadriel Greenleaf,2024-07-16 14:51:47,2024-07-16 15:25:47,Battery backup depleting without AC restoration,Pending Resolution,PWR-003: Battery Discharge Alert,0.2,FALSE,Galadriel Greenleaf,soft_solve
+INCME-100102,Rhovanion,Beacon Power,Power Source,RAN,Elven Forge Technologies,Dale,Pending Resolution,RAN-002: Cell Administratively Disabled,CRITICAL,Watch-Rhovanion-South-01,"Resolution Summary:
+
+1. Verify Sector Alarm: Verified that the sector alarm was triggered due to cell lockout/disabling by management action.
+
+2. Schedule Field Repair: Scheduled a field repair to address the issue.
+
+3. 
Verify Coverage Impact: The impact on coverage was minimal due to the cell being locked or disabled by management action. + +Root cause: Cell locked or disabled by management action. + +Problem type: RAN-002. + +This incident was resolved with the following actions: +- Verification of sector alarm +- Scheduled field repair +- Verification of coverage impact + +Output: The close notes, including the resolution summary, will be provided upon request.",2024-07-31 15:28:46,2024-07-31 15:48:46,Rangers of the North,ran_sector_outage_recovery,RAN-002: Cell Administratively Disabled at Watch-Rhovanion-South-01. RF chain fault detected — PA shutdown due to high VSWR. Feeder connection verified remotely. Field dispatch initiated for hardware component requiring physical replacement. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-07-31 15:28 - Critical alarm received for Watch-Rhovanion-South-01: RAN-002: Cell Administratively Disabled. Checked alarm system — confirmed active since 15:25. Single sector degraded, others normal. +2024-07-31 15:32 - Checked neighbors and topology. Upstream aggregation node healthy. Last-mile issue confirmed. +2024-07-31 15:36 - Queried element health at Watch-Rhovanion-South-01. Equipment within operational limits. +2024-07-31 15:39 - Executed clear configuration — COMMAND SUCCESS — operation completed. +2024-07-31 15:41 - Escalated to vendor support via TK-20240101: Hardware replacement may be needed. +2024-07-31 15:43 - Verified: Site operational, all services confirmed active. All sectors operational and radiating. 
Alarms self-cleared after fix applied.",Guthláf son of Théoden,2024-07-31 15:28:46,2024-07-31 15:48:46,Cell locked or disabled by management action,Pending Resolution,RAN-002: Cell Administratively Disabled,0.2,FALSE,Guthláf son of Théoden,soft_solve +INCME-100103,Gondor,Arcane Engines,Keeper Stone,RAN,Dwarven Network Systems,Henneth Annûn,Resolved,RAN-006: Data Radio Bearer Degradation,MAJOR,Gateway-Gondor-North-86,"In the NOC incident, we identified a failure in the HO (Home Node) due to a data bearer setup or retention rate below threshold. As a result, we verified the HO failure alarm, requested RF optimization, and monitored the HO success rate. The root cause was identified as a data bearer setup or retention rate below threshold, which resulted in the HO failure. We took necessary actions to restore the issue and ensure the successful resolution of the NOC incident. The resolution summary includes the following: + +1. Restore Action: Verify HO Failure Alarm +2. Reason For Outage: Data bearer setup or retention rate below threshold + +The root cause of the issue was identified and taken into consideration during the resolution process. We took the necessary actions to restore the issue and ensure a successful resolution. The resolution summary includes the details of the actions taken, including verifying the HO failure alarm, requesting RF optimization, and monitoring the HO success rate. The resolution summary is concise and complete, ensuring that the NOC team has all the necessary information to address the incident and ensure the successful resolution.",2024-09-18 03:10:08,2024-09-18 03:36:08,Gondor Gateway Team,ran_handover_failure_resolution,RAN-006: Data Radio Bearer Degradation at Gateway-Gondor-North-86. Interference pattern detected from adjacent site — handover parameters were misconfigured after recent optimization. Configuration corrected and service restored remotely within SLA. 
All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-09-18 03:11 - Critical alarm received for Gateway-Gondor-North-86: RAN-006: Data Radio Bearer Degradation. Checked alarm system — confirmed active since 03:07. Partial outage — 2 of 3 sectors affected. +2024-09-18 03:13 - RF status check — Sector Beta: high VSWR (3.2:1), possible feeder issue. VSWR: 3.2:1. TX power: 6dB below target. +2024-09-18 03:17 - Queried element health at Gateway-Gondor-North-86. Equipment within operational limits. +2024-09-18 03:21 - Validated running configuration on Gateway-Gondor-North-86. No mismatches detected. +2024-09-18 03:23 - Attempted remote software rollback on Gateway-Gondor-North-86. Partial success — 2 of 3 units recovered. +2024-09-18 03:27 - Recovery confirmed — Full recovery confirmed. Alarm cleared at 03:30. Metrics stable for 15 minutes.",Thorongil son of Imrahil,2024-09-18 03:10:08,2024-09-18 03:36:08,Data bearer setup or retention rate below threshold,Resolved,RAN-006: Data Radio Bearer Degradation,0.4,FALSE,Thorongil son of Imrahil,hard_solve +INCME-100104,Mordor Surveillance Zone,Signal Core,Weather Watch,RAN,Elven Forge Technologies,Henneth Annûn,Pending Resolution,RAN-001: Cell Service Interruption,MAJOR,Tower-MordorSurveillanceZone-North-19,"Resolution Summary: + +1. Verify Site Outage: +The NOC team conducted a thorough investigation and confirmed that the cellular network was completely unavailable for service at the customer's location. The root cause of the issue was a temporary power outage affecting the cellular network equipment. + +2. Check Backhaul Status: +The NOC team conducted a thorough investigation and confirmed that the backhaul network was functioning properly. The root cause of the issue was a temporary power outage affecting the backhaul network equipment. + +3. 
Contact Transport Team: +The NOC team contacted the transport team to investigate the cause of the outage and to assess the damage to the network. The transport team confirmed that the outage was caused by a temporary power outage affecting the backhaul network equipment. + +4. Dispatch Field Technician: +The NOC team dispatched a field technician to the customer's location to investigate the issue and to restore service. The field technician confirmed that the temporary power outage had affected the cellular network equipment, and that the root cause of the outage was a temporary power outage affecting the backhaul network equipment.",2024-06-12 23:48:51,2024-06-13 00:21:51,Rivendell Array Management,ran_cell_site_down_recovery,RAN-001: Cell Service Interruption at Tower-MordorSurveillanceZone-North-19. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Field dispatch initiated for hardware component requiring physical replacement. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-06-12 23:48 - Critical alarm received for Tower-MordorSurveillanceZone-North-19: RAN-001: Cell Service Interruption. Checked alarm system — confirmed active since 23:40. Sector Alpha down, Beta and Gamma operational. +2024-06-12 23:52 - RF status check — Sector Alpha: PA off, no radiation detected. VSWR: 1.2:1. TX power: nominal. +2024-06-12 23:55 - Connectivity check to Tower-MordorSurveillanceZone-North-19 — reachable, latency nominal. +2024-06-12 23:58 - Queried element health at Tower-MordorSurveillanceZone-North-19. Equipment within operational limits. +2024-06-13 00:02 - Validated running configuration on Tower-MordorSurveillanceZone-North-19. No mismatches detected. +2024-06-13 00:05 - Escalated to DevOps via TK-20240103: Remote resolution unsuccessful, physical inspection required. +2024-06-13 00:08 - Recovery confirmed — Service fully restored. All metrics back to baseline. 
KPIs returned to normal.",Théodred of the Mark,2024-06-12 23:48:51,2024-06-13 00:21:51,Cell completely unavailable for service,Pending Resolution,RAN-001: Cell Service Interruption,0.2,FALSE,Théodred of the Mark,soft_solve +INCME-100105,Eriador,Signal Core,Path Router,RAN,Elven Forge Technologies,Michel Delving,Resolved,Access instability,MINOR,Relay-Eriador-Central-53,"In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-10-26 04:25:18,2024-10-26 04:53:18,Osgiliath Bridge Operations,ran_interference_mitigation,Access instability at Relay-Eriador-Central-53. Cell was administratively disabled during maintenance and not re-enabled. Configuration mismatch confirmed. Remote corrective action applied successfully. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-10-26 04:25 - Critical alarm received for Relay-Eriador-Central-53: Access instability. Checked alarm system — confirmed active since 04:17. Single sector degraded, others normal. +2024-10-26 04:29 - Connectivity check to Relay-Eriador-Central-53 — reachable, latency nominal. 
+2024-10-26 04:33 - Topology analysis: Adjacent sites normal — confirms localized fault. Upstream path: degraded. +2024-10-26 04:37 - Queried element health at Relay-Eriador-Central-53. Equipment within operational limits. +2024-10-26 04:40 - Validated running configuration on Relay-Eriador-Central-53. No mismatches detected. +2024-10-26 04:43 - Attempted remote clear configuration on Relay-Eriador-Central-53. COMMAND FAILED — element not responding. +2024-10-26 04:45 - Recovery confirmed — Partial recovery — monitoring for stability. KPIs recovering — within 90% of baseline.",Aragorn son of Arathorn,2024-10-26 04:25:18,2024-10-26 04:53:18,Random access channel success rate below threshold,Resolved,Access instability,0.2,FALSE,Aragorn son of Arathorn,soft_solve +INCME-100106,Gondor,Signal Network,Power Source,COMPUTE,Dwarven Network Systems,Dol Amroth,Resolved,CMP-005: Pod Terminating Stuck,MAJOR,Station-Gondor-East-72,"I am not able to perform the actions you described. However, I can provide a sample resolution summary for your reference: + +restore action: + +- Verified that the orchestrator alarm was triggered due to a pod stuck in terminating state beyond grace period. +- Confirmed that the root cause of the issue was a pod stuck in terminating state beyond grace period, which was caused by a configuration issue on the control plane. +- Recovered the control plane to a working state using the following steps: + - Verified that the control plane was in a working state using the kube-state-metrics tool. + - Applied the necessary patches to the control plane to resolve the issue. +- Verified that the pod was successfully restarted using the kubectl command. +- Notified the platform team of the issue and provided a detailed resolution summary. +- Notified the affected services and provided a workaround. 
+ +reason for outage: + +- The root cause of the issue was a configuration issue on the control plane that caused a pod to be stuck in terminating state beyond grace period. +- This issue was resolved by recovering the control plane to a working state using",2024-10-16 18:25:03,2024-10-16 19:00:03,Rangers of the North,compute_orchestrator_recovery,CMP-005: Pod Terminating Stuck on Station-Gondor-East-72. Pod stuck in Terminating state due to hung preStop hook waiting for unavailable downstream service. Remote corrective action applied successfully. Recovery verified — all metrics within acceptable range. Incident closed.,"2024-10-16 18:26 - Alarm received: CMP-005: Pod Terminating Stuck at Station-Gondor-East-72. ImagePullBackOff — registry unreachable. Checking container state. +2024-10-16 18:29 - Orchestration: cordoned and drained affected node — Container runtime restarted, pods recovering. +2024-10-16 18:32 - Container logs show: Last 20 restarts show same pattern: starts OK, then OOMKilled at ~200MB after 3-4 minutes of operation. +2024-10-16 18:34 - Node health: Node healthy — issue isolated to pod. +2024-10-16 18:37 - Validated running configuration on Station-Gondor-East-72. No mismatches detected. +2024-10-16 18:41 - Remote action: force restart. COMMAND FAILED — element not responding. +2024-10-16 18:44 - Verified — Full recovery confirmed. Alarm cleared at 18:40. New pods healthy across 3 nodes.",Elendil son of Arador,2024-10-16 18:25:03,2024-10-16 19:00:03,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.4,FALSE,Elendil son of Arador,soft_solve +INCME-100107,Gondor,Relay Transport,Relay Unit,RAN,Dwarven Network Systems,Calembel,Resolved,RAN-002: Cell Administratively Disabled,MAJOR,Beacon-Gondor-Upper-12,"Reopened from INCME-100102. + +I have completed the incident closure notes for the above incident. 
+ +restore action: +- Verified that the sector alarm was triggered due to a cell locked or disabled by management action. +- Scheduled field repair to address the issue. + +reason for outage: +- The root cause of the issue was a cell locked or disabled by management action. + +root cause: +- The cell locked or disabled by management action caused the sector alarm to be triggered, resulting in the outage. + +problem type: +- RAN-002 (Cell locked or disabled by management action) + +resolution summary: +- Verified that the sector alarm was restored and the coverage impact was verified. +- Scheduled field repair to address the issue. +- Verified that the outage was resolved. + +note: This summary is provided for reference only and does not represent the official closure notes.",2024-08-01 02:48:46,2024-08-01 03:05:46,Lothlórien Link Guardians,ran_sector_outage_recovery,RAN-002: Cell Administratively Disabled at Beacon-Gondor-Upper-12. Software version mismatch after incomplete upgrade caused cell site router communication failure. Remote reset resolved the immediate issue. Underlying cause documented for follow-up. All alarms cleared and KPIs returned to baseline within 15 minutes. No further action required.,"2024-08-01 02:49 - Critical alarm received for Beacon-Gondor-Upper-12: RAN-002: Cell Administratively Disabled. Checked alarm system — confirmed active since 02:41. Intermittent alarm — flapping every 2-3 minutes. +2024-08-01 02:51 - Connectivity check to Beacon-Gondor-Upper-12 — reachable, latency nominal. +2024-08-01 02:53 - Queried element health at Beacon-Gondor-Upper-12. Equipment within operational limits. +2024-08-01 02:56 - Executed reset — COMMAND FAILED — element not responding. +2024-08-01 02:58 - Created ticket TK-20240106 for transport team. Persistent fault after 2 remote attempts — need on-site investigation. +2024-08-01 03:02 - Verified: Partial recovery — monitoring for stability. All sectors broadcasting normally. 
Alarms all cleared.",Rúmil of Mirkwood,2024-08-01 02:48:46,2024-08-01 03:05:46,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.4,TRUE,Rúmil of Mirkwood,soft_solve +INCME-100108,Gondor,Relay Transport,Central Nexus,RAN,Dwarven Network Systems,Linhir,Resolved,Access instability,CRITICAL,Gateway-Gondor-Central-87,"Reopened from INCME-100105. + +In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-10-27 16:53:18,2024-10-27 17:07:18,Minas Tirith Central Command,ran_interference_mitigation,Access instability at Gateway-Gondor-Central-87. Investigation found sector outage caused by RU firmware lockup. CPRI link dropped between DU and affected RU. Automated recovery sequence triggered after remote intervention. Service restored. Monitoring for 24 hours to confirm stability. Root cause documented.,"2024-10-27 16:53 - Critical alarm received for Gateway-Gondor-Central-87: Access instability. Checked alarm system — confirmed active since 16:45. Partial outage — 2 of 3 sectors affected. +2024-10-27 16:55 - Queried RF chain. 
Sector Beta: high VSWR (3.2:1), possible feeder issue. Antenna tilt: 2°. PA status: standby.
+2024-10-27 16:57 - Queried element health at Gateway-Gondor-Central-87. Equipment within operational limits.
+2024-10-27 16:59 - Validated running configuration on Gateway-Gondor-Central-87. No mismatches detected.
+2024-10-27 17:02 - Attempted remote generator start on Gateway-Gondor-Central-87. Partial success — 2 of 3 units recovered.
+2024-10-27 17:06 - Verified: Site operational, all services confirmed active. All sectors broadcasting normally. Alarms all cleared.",Celeborn Starlight,2024-10-27 16:53:18,2024-10-27 17:07:18,Random access channel success rate below threshold,Resolved,Access instability,0.4,TRUE,Celeborn Starlight,hard_solve
diff --git a/recipes/noc-reasoning-agent/prompts/formatting_prompt.yaml b/recipes/noc-reasoning-agent/prompts/formatting_prompt.yaml
new file mode 100644
index 0000000000..2d8e0b9b6b
--- /dev/null
+++ b/recipes/noc-reasoning-agent/prompts/formatting_prompt.yaml
@@ -0,0 +1,191 @@
+user: |-
+  You are an expert Process Auditor and Training Assistant for a Network Operations Center (NOC). Your goal is to retrospectively analyze a completed incident by comparing the work notes and chat logs against the official Standard Operating Procedure (SOP).
+
+  Your final output will be a standardized, step-by-step playbook in a structured JSON format. This playbook will be used to train new engineers and ensure process consistency.
+
+  # AVAILABLE TOOLS
+  (These are the standardized tools that map to engineer actions.)
+
+  query_alarm[<site_or_element_id>]
+  - Purpose: Queries the alarm management system to retrieve current alarm details, severity, and active time.
+  - Returns: Alarm status (active/cleared), severity, timestamp, description.
+
+  query_resource_health[<element_id>]
+  - Purpose: Polls monitoring systems (EMS/NMS/Telemetry) to retrieve device health metrics such as CPU, memory, interface status, and cell state.
+  - Returns: Health report (e.g., "All systems operational" / "Fault detected").
+
+  query_performance[<metric_type>]
+  - Purpose: Fetches KPIs from monitoring tools. Reports if trends are in line with expectations.
+  - Example Metrics: 'prb_utilization', 'throughput', 'handover_stats', 'volte_kpi', 'signaling_load'
+  - Returns: KPI status (OK / NOK / trend analysis).
+
+  query_topology[<element_id>]
+  - Purpose: Verifies neighbors, adjacencies, upstream devices, and identifies common alarms in an area. Maps affected services through topology.
+  - Returns: Adjacent elements with alarm status, affected services.
+
+  execute_remote_action[<element_id>, '<action>']
+  - Purpose: Executes remote CLI commands (SSH/Netconf) for resets, restarts, failovers, and other operational actions.
+  - Example Actions: 'enodeb_reset', 'generator_start', 'sctp_reset', 'hvac_reset', 'software_rollback'
+  - Returns: Execution result (Success/Fail).
+
+  apply_configuration[<element_id>, '<config_type>']
+  - Purpose: Retrieves, validates, and pushes configuration changes. Supports parameter adjustments, load balancing, QoS policies, etc.
+  - Example Config Types: 'load_balancing', 'hvac_setpoint', 'handover_params', 'qos_policy'
+  - Returns: Configuration Verified/Applied or error.
+
+  run_diagnostics[<diagnostic_type>]
+  - Purpose: Runs diagnostic scripts: config audits, OTDR tests, kubelet checks, resource analysis.
+  - Example Types: 'config_audit', 'otdr', 'kubelet', 'resource_usage', 'sidecar_health'
+  - Returns: Diagnostic report summary.
+
+  inspect_logs[<log_type>]
+  - Purpose: Connects to devices/containers and reviews system logs, dump files, and event records.
+  - Example Types: 'container', 'bbu_system', 'routing', 'access_control', 'upgrade'
+  - Returns: Log analysis (issues identified / no issues).
+
+  create_trouble_ticket[<priority>, '<team>', '<issue_details>']
+  - Purpose: Logs and routes tickets to departments via the ticketing system.
+  - Example Teams: 'transport', 'core_operations', 'rf_engineering', 'devops'
+  - Returns: Ticket ID. Once resolved, outputs the solution used.
+
+  verify_recovery[<element_id>]
+  - Purpose: Final verification step — confirms service restoration, alarm clearance, and operational status.
+  - Returns: Recovery status (recovered / degraded / ongoing).
+
+  query_external_factors[<site_or_area>]
+  - Purpose: Scans for external factors: weather conditions, fiber cuts, utility outages, maintenance windows, scheduled work.
+  - Returns: External issue report.
+
+  orchestrate_workload[<action>, '<type>']
+  - Purpose: Automates container/VM operations via Kubernetes or orchestrator: restart, scale, migrate, failover, drain.
+  - Example Actions: 'restart pod', 'scale deployment', 'migrate vm', 'failover cnf'
+  - Returns: Operation status (Successful/Unsuccessful).
+
+  query_power_system[<target>]
+  - Purpose: Queries power infrastructure status: UPS, battery, generator, rectifier, HVAC systems.
+  - Example Targets: 'ups', 'battery', 'generator', 'hvac', 'rectifier'
+  - Returns: Power system status and readings.
+
+  query_rf_status[<sector_or_antenna_id>]
+  - Purpose: Queries RF chain status: antenna health, PA status, VSWR, beamforming, RET controller.
+  - Returns: RF status report (transmit power, VSWR, antenna tilt, etc.).
+
+  query_container_status[<type>]
+  - Purpose: Queries Kubernetes pod/node/container state from the orchestrator.
+  - Example Types: 'pod', 'node', 'control_plane', 'cnf', 'service_mesh'
+  - Returns: Container/pod state details.
+
+  verify_signaling_path[<interface>]
+  - Purpose: Tests signaling protocol paths: SCTP, SIP, Diameter, GTP, SIGTRAN.
+  - Example Interfaces: 'SCTP', 'SIP', 'GTP', 'Diameter', 'SIGTRAN'
+  - Returns: Path status (healthy / degraded / down).
+
+  test_connectivity[<test_type>]
+  - Purpose: Tests IP/network connectivity: ICMP, backhaul, CPRI, peer connectivity, end-to-end quality.
+  - Example Types: 'backhaul', 'icmp', 'cpri', 'peer_connectivity', 'end_to_end'
+  - Returns: Connectivity test results.
+
+  # RULES
+  1. Analyze the Entire Log: Read the full context from start to finish to understand the incident's story and the actual sequence of actions.
+  2. Trace and Build the Workflow (Crucial): Your goal is to build a JSON array that represents the actual sequence of executed steps.
+     - Start at Step 1 of the SOP. Find the log evidence for it.
+     - Create the first JSON object in your array for this step. Fill in its details (action, tool, result).
+     - Read the "Flow-Deciding Answers/Outcomes" for that step (e.g., "if succeeds, go to step 4; if fails, escalate to step 8").
+     - Based on the result (see Rule 3), determine the next executed step.
+     - Find the log evidence for that next step (e.g., Step 8). Create the second JSON object in your array for Step 8.
+     - Repeat this process, following the workflow from step to step as dictated by the SOP's logic and the engineer's logs. Each executed step is a new object appended to your JSON array.
+  3. Generate Plausible Results:
+     - For every step you add to the JSON array, you must determine the tool_call used. If no tool was used, mark the tool_call as null.
+     - If a tool was used, the result field must not be null.
+     - If the log provides the tool's output, use it.
+     - If the log does not provide the output (e.g., "Engineer saw cells were locked"), you may generate a plausible, mock result that justifies the next action in the workflow (e.g., "result": "Cell 01: LOCKED, Cell 02: LOCKED").
+  4. Handle Skipped Steps: If a step is bypassed by the workflow (e.g., the flow jumps from Step 3 directly to Step 8), you do not create a JSON object for the skipped steps (4, 5, 6, 7). Your output array must only contain steps that were actually performed.
+  5. Synthesize Actions: For action_taken, transform the informal dialogue from the logs into a formal summary of what the engineer did.
+  6. Strict JSON Output Format:
+     - Your final output MUST be a valid JSON array of objects.
+     - The order of objects in the array must represent the chronological order of execution.
+     - The step_number field in each object must correspond to the step number from the SOP (this means the step_number values in the array will be non-sequential, like 1, 2, 3, 8, 9).
+     - Do not add any text, explanations, or markdown formatting before or after the JSON array.
+
+  # EXAMPLE OF CORRECT JSON OUTPUT FORMAT
+  (This example shows the workflow jumping from Step 3 to Step 8, as per the SOP logic "if fails, escalate to step 8")
+
+  ```json
+  [
+    {{
+      "step_number": 1,
+      "sop_step_title": "Verify Site Outage",
+      "status": "Completed",
+      "action_taken": "Acknowledged alarm for site-123 and confirmed site unreachable.",
+      "tool_call": "query_alarm[site-123]",
+      "result": "ALARM: Site-123 | SEVERITY: Critical | STATUS: Active | DESC: Site unreachable - confirmed"
+    }},
+    {{
+      "step_number": 2,
+      "sop_step_title": "Check Backhaul Status",
+      "status": "Completed",
+      "action_taken": "Tested transport link to site, confirmed backhaul down.",
+      "tool_call": "test_connectivity[backhaul]",
+      "result": "TEST: Backhaul link to site-123 | STATUS: Down | Packet loss: 100%"
+    }},
+    {{
+      "step_number": 3,
+      "sop_step_title": "Check Power Status",
+      "status": "Completed",
+      "action_taken": "Attempted to query site power via remote monitoring but site was offline.",
+      "tool_call": "query_power_system[all]",
+      "result": "QUERY FAILED: Unable to reach site power monitoring - site offline"
+    }},
+    {{
+      "step_number": 8,
+      "sop_step_title": "Contact Transport Team",
+      "status": "Completed",
+      "action_taken": "Backhaul down confirmed. Created critical ticket and escalated to transport team.",
+      "tool_call": "create_trouble_ticket[critical, 'transport', 'Site-123 backhaul down causing complete site outage']",
+      "result": "TICKET_CREATED: TK78901 | Assigned to: Transport Operations"
+    }},
+    {{
+      "step_number": 12,
+      "sop_step_title": "Document All Actions and Close Incident",
+      "status": "Completed",
+      "action_taken": "Documented all troubleshooting steps and findings. Incident closed.",
+      "tool_call": null,
+      "result": "All actions documented. Incident closed with resolution applied."
+    }}
+  ]
+  ```
+  ---
+  ### Your Turn:
+
+  SOP:
+  {problem_code_reasoning_process}
+
+  Final step: end with a conclusion that follows the conclusion given to you below.
+
+  Incident Details:
+  * Number: {incident_identifier}
+  * Region: {geographical_territory}
+  * Category: {incident_classification}
+  * Subcategory: {incident_subtype}
+  * Market: {service_domain}
+  * Vendor: {equipment_provider}
+  * AOI: {operational_zone}
+  * Priority: {urgency_level}
+  * Locations: {affected_site}
+  * Short Description: {incident_summary}
+  * Opened At: {detection_timestamp}
+  * Assignment Group: {responsible_team}
+  * Problem Code: {fault_category}
+
+  Logs:
+  {action_chronicle}
+
+  Conclusion:
+  {resolution_summary}
+
+  close_code:
+  {root_cause_secondary}
+
+  ### Correct Output:
+
+  /think
diff --git a/recipes/noc-reasoning-agent/prompts/prompt_incident.yaml b/recipes/noc-reasoning-agent/prompts/prompt_incident.yaml
new file mode 100644
index 0000000000..3df6ef80a4
--- /dev/null
+++ b/recipes/noc-reasoning-agent/prompts/prompt_incident.yaml
@@ -0,0 +1,57 @@
+system: |-
+  You are a Network Operations Center (NOC) engineer trying to solve a network incident problem. Please reason step by step.
+
+  # Tools
+  You may call one or more functions to assist with the user query.
+ + You are provided with function signatures within XML tags: + + {{"type": "function", "function": {{"name": "query_alarm", "description": "Queries the alarm management system to retrieve current alarm details, severity, and active time.", "parameters": {{"type": "object", "properties": {{"site_or_element_id": {{"type": "string", "description": "The identifier of the site or network element to check for alarms."}}}}, "required": ["site_or_element_id"]}}}} + {{"type": "function", "function": {{"name": "query_resource_health", "description": "Polls monitoring systems (EMS/NMS/Telemetry) to retrieve device health metrics such as CPU, memory, interface status, and cell state.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the element to poll for health metrics."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "query_performance", "description": "Fetches KPIs from monitoring tools. Reports if trends are in line with expectations. Example Metrics: 'prb_utilization', 'throughput', 'handover_stats', 'volte_kpi', 'signaling_load'", "parameters": {{"type": "object", "properties": {{"metric_type": {{"type": "string", "description": "The type of KPI metric to fetch."}}}}, "required": ["metric_type"]}}}} + {{"type": "function", "function": {{"name": "query_topology", "description": "Verifies neighbors, adjacencies, upstream devices, and identifies common alarms in an area. Maps affected services through topology.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the element to check topology for."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "execute_remote_action", "description": "Executes remote CLI commands (SSH/Netconf) for resets, restarts, failovers, and other operational actions. 
Example Actions: 'enodeb_reset', 'generator_start', 'sctp_reset', 'hvac_reset', 'software_rollback'", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the network element to connect to."}}, "action": {{"type": "string", "description": "The specific action to execute."}}}}, "required": ["element_id", "action"]}}}} + {{"type": "function", "function": {{"name": "apply_configuration", "description": "Retrieves, validates, and pushes configuration changes. Supports parameter adjustments, load balancing, QoS policies, etc. Example Config Types: 'load_balancing', 'hvac_setpoint', 'handover_params', 'qos_policy'", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the element to validate and apply configuration to."}}, "config_type": {{"type": "string", "description": "The type of configuration to apply."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "run_diagnostics", "description": "Runs diagnostic scripts: config audits, OTDR tests, kubelet checks, resource analysis. Example Types: 'config_audit', 'otdr', 'kubelet', 'resource_usage', 'sidecar_health'", "parameters": {{"type": "object", "properties": {{"diagnostic_type": {{"type": "string", "description": "The type of diagnostic to run."}}}}, "required": ["diagnostic_type"]}}}} + {{"type": "function", "function": {{"name": "inspect_logs", "description": "Connects to devices/containers and reviews system logs, dump files, and event records. Example Types: 'container', 'bbu_system', 'routing', 'access_control', 'upgrade'", "parameters": {{"type": "object", "properties": {{"log_type": {{"type": "string", "description": "The type of log to inspect."}}}}, "required": ["log_type"]}}}} + {{"type": "function", "function": {{"name": "create_trouble_ticket", "description": "Logs and routes tickets to departments via the ticketing system. 
Example Teams: 'transport', 'core_operations', 'rf_engineering', 'devops'", "parameters": {{"type": "object", "properties": {{"priority": {{"type": "string", "description": "Ticket priority level."}}, "team": {{"type": "string", "description": "The team to route the ticket to."}}, "issue_details": {{"type": "string", "description": "A detailed description of the issue."}}}}, "required": ["priority", "team", "issue_details"]}}}} + {{"type": "function", "function": {{"name": "verify_recovery", "description": "Final verification step — confirms service restoration, alarm clearance, and operational status.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the element to verify recovery for."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "query_external_factors", "description": "Scans for external factors: weather conditions, fiber cuts, utility outages, maintenance windows, scheduled work.", "parameters": {{"type": "object", "properties": {{"site_or_area": {{"type": "string", "description": "The geographical site or area to check."}}}}, "required": ["site_or_area"]}}}} + {{"type": "function", "function": {{"name": "orchestrate_workload", "description": "Automates container/VM operations via Kubernetes or orchestrator: restart, scale, migrate, failover, drain. Example Actions: 'restart pod', 'scale deployment', 'migrate vm', 'failover cnf'", "parameters": {{"type": "object", "properties": {{"action": {{"type": "string", "description": "The orchestration action to perform."}}, "type": {{"type": "string", "description": "The workload type (e.g., pod, deployment, vm, cnf)."}}}}, "required": ["action"]}}}} + {{"type": "function", "function": {{"name": "query_power_system", "description": "Queries power infrastructure status: UPS, battery, generator, rectifier, HVAC systems. 
Example Targets: 'ups', 'battery', 'generator', 'hvac', 'rectifier'", "parameters": {{"type": "object", "properties": {{"target": {{"type": "string", "description": "The power system component to query."}}}}, "required": ["target"]}}}} + {{"type": "function", "function": {{"name": "query_rf_status", "description": "Queries RF chain status: antenna health, PA status, VSWR, beamforming, RET controller.", "parameters": {{"type": "object", "properties": {{"sector_or_antenna_id": {{"type": "string", "description": "The identifier of the sector or antenna to check."}}}}, "required": ["sector_or_antenna_id"]}}}} + {{"type": "function", "function": {{"name": "query_container_status", "description": "Queries Kubernetes pod/node/container state from the orchestrator. Example Types: 'pod', 'node', 'control_plane', 'cnf', 'service_mesh'", "parameters": {{"type": "object", "properties": {{"type": {{"type": "string", "description": "The container/pod type to query."}}}}, "required": ["type"]}}}} + {{"type": "function", "function": {{"name": "verify_signaling_path", "description": "Tests signaling protocol paths: SCTP, SIP, Diameter, GTP, SIGTRAN. Example Interfaces: 'SCTP', 'SIP', 'GTP', 'Diameter', 'SIGTRAN'", "parameters": {{"type": "object", "properties": {{"interface": {{"type": "string", "description": "The signaling interface to test."}}}}, "required": ["interface"]}}}} + {{"type": "function", "function": {{"name": "test_connectivity", "description": "Tests IP/network connectivity: ICMP, backhaul, CPRI, peer connectivity, end-to-end quality. 
Example Types: 'backhaul', 'icmp', 'cpri', 'peer_connectivity', 'end_to_end'", "parameters": {{"type": "object", "properties": {{"test_type": {{"type": "string", "description": "The type of connectivity test to run."}}}}, "required": ["test_type"]}}}} + + + For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: + + <tool_call> + {{"name": <function-name>, "arguments": <args-json-object>}} + </tool_call> + + +user: |- + Manage and resolve the following incident as a NOC operator: + Number: {incident_identifier} + Region: {geographical_territory} + Category: {incident_classification} + Subcategory: {incident_subtype} + Market: {service_domain} + Vendor: {equipment_provider} + AOI: {operational_zone} + Priority: {urgency_level} + Locations: {affected_site} + Short Description: {incident_summary} + Opened At: {detection_timestamp} + Assignment Group: {responsible_team} + Problem Code: {fault_category} + + + Solve this network problem by following this procedure: + {problem_code_reasoning_process} + + When arriving at the conclusion, specify the Close Code that best describes the root cause. Examples include: RAN-001: Cell Service Interruption, RAN-014: CSR Unreachable, PWR-001: AC Power Failure, CMP-005: Pod Terminating Stuck, SIG-003: SCTP Association Failure, TRN-008: Latency SLA Violation, ENV-001: High Temperature Alert, etc. + You must include a Close Code in the conclusion. + + {background} diff --git a/recipes/noc-reasoning-agent/prompts/prompt_reasoning.yaml b/recipes/noc-reasoning-agent/prompts/prompt_reasoning.yaml new file mode 100644 index 0000000000..183849f11a --- /dev/null +++ b/recipes/noc-reasoning-agent/prompts/prompt_reasoning.yaml @@ -0,0 +1,166 @@ +system: |- + I am a highly skilled NoC Engineer specializing in Radio Access Network (RAN) incident resolution. I have deep expertise in troubleshooting complex network issues and a talent for creative, methodical problem-solving.
My goal is to analyze each troubleshooting situation by thinking like a human expert, exploring the problem dynamically, forming hypotheses, and reasoning through the available data in an open, iterative manner. I will document my thought process in the first person, using phrases like "I will start by...", "I notice...", "I hypothesize...", or "This suggests..." to describe my reasoning. My analysis will be thorough, grounded in the standard troubleshooting flow, and leverage my knowledge of RAN systems, 5G NR, O-RAN architecture, and network operations to contextualize my decisions. + +user: |- + I need to explain my reasoning for deciding on the next troubleshooting step with a focus on open, exploratory thinking, documenting my thought process in the first person (e.g., "I observe...", "I notice...", "I think..."). Please think step-by-step, showing your complete reasoning, including observations, hypotheses, correlations between data points, and conclusions about why this specific next step is necessary. + + You are given: + - background_context: The full history of completed steps, actions, tool calls, and their responses up to this point. + - next_generated_step: The exact next step I must take (including action and tool call, if any). + + CRITICAL OUTPUT FORMAT: + You must output your reasoning. Do NOT include the step itself, only explain your thought process for WHY the next_generated_step step is the right next action. + + [Your step-by-step first-person thought process here. Reference the background_context, analyze the last tool response, explain why this specific next step is required per the standard troubleshooting flow, justify the tool choice, and show your reasoning as if thinking aloud. Be thorough but concise and technical.] + + + --- + + Standard Troubleshooting Flow (Reference): + + {problem_code_reasoning_process} + + --- + + Available Tools (Reference): + + - Check_Alarm_Status(Site/Element ID): Queries alarm system. 
Returns alarm status (active/cleared), occurred timestamp, description. + - Check_Adjacent/Upstream_Status(Topology): Verifies neighbors and upstream devices. Returns adjacent elements and their alarm status. + - Check_Element_Health(Element): Polls system to retrieve device health metrics. Returns health report (Element OK/NOK, RU radiation, cell status). + - Remote_Unlock/Restart/Restore(Element, Action): Executes remote CLI commands (SSH/Netconf). Returns Successful Reset/Non-Successful Reset. + - Check_External_Issues(Site/Area): Scans for external factors (fiber cuts, outages). Returns Issue Detected/Not Detected, Type of Issue. + - Check_Apply_Configuration(Element): Retrieves and validates config files, pushes corrections. Returns Configuration Verified/Applied correctly or error. + - Check_Performance(Metric type): Fetches KPIs from monitoring tools. Returns KPI OK or KPIs NOK. + - Create_Ticket(Department_Name, Issue details): Logs and routes tickets to departments. Returns Ticket ID and eventual solution. + - Orchestration_tool(Action): Automates container ops (Kubernetes/Helm). Returns Operation Successful()/Unsuccessful(). + - Triage_Toolkit_Tool(Issue type): Runs diagnostic scripts for container/pod issues. Returns Triage report summary. + - Check_remote_files(Device): SSH to element, reviews dump files. Returns No Issues Identified / Issues Identified. + + --- + + Few-Shot Examples: + + Example 0: Step 1 + + background_context: + "" + + next_generated_step: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + + Expected response: + I observe that the first action I must take is to acknowledge and validate the initial alert. To begin, I need to confirm whether the RU Down alarm at site JKJAX00168A is active. This requires querying the alarm management system. 
+ + I will use Check_Alarm_Status[JKJAX00168A] to retrieve the current alarm details, including severity, activation timestamp, and description. Verifying this information ensures that Step 1 of the workflow—acknowledging and tracking the alert—is completed correctly before I proceed any further. + + Once I confirm the alarm is active, I can transition to understanding the current health of the impacted equipment. Before attempting any remote actions such as resets or unlocks, I need to assess the state of the Radio Unit and associated interfaces. Is the RU locked? Are there transport or power issues? Is the DU itself degraded? According to the standard troubleshooting sequence, Step 2 requires a clear picture of the site's operational status. + + Example 1: Step 1 -> Step 2 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + next_generated_step: + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I observe that the alarm has been confirmed as active with Critical severity - an RU Down condition at site JKJAX00168A. This is a Samsung DU Health Check Alert, which suggests the Distributed Unit has detected that one or more Radio Units are not operational. I've completed Step 1 by acknowledging and tracking the alert. + + Now I need to understand the current state of the site equipment. Before I can attempt any remote actions like resets or unlocks, I must first determine what exactly is wrong. Is the RU locked? Are there interface issues? Is the entire DU down? 
Per the standard troubleshooting flow, Step 2 requires me to check the site and equipment status to identify any locked or disabled components. + + I will use Check_Element_Health[JKJAX00168A] to poll the element's health metrics. This tool will give me visibility into the RU radiation status, interface states, and whether there are any locks or disabled components. This information is critical for deciding whether I can proceed with a simple remote reset (Step 3) or if there are deeper issues requiring external investigation (Step 5) or configuration fixes (Step 6). + + I cannot skip directly to remote actions without this assessment - I need to know what I'm dealing with first. The health check is the foundation for all subsequent decisions. + + --- + + Example 2: Step 2 -> Step 3 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + next_generated_step: + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + + Expected response: + + I notice from the health check that the RU is definitively DOWN, and there's a recent interface flap on Te0/0/0/4 that occurred approximately 47 minutes ago. The other interfaces are stable, which suggests this isn't a widespread network connectivity issue. 
The timing correlation between the interface flap and the RU going down is significant - it points to a transient fault or soft lock rather than a hard hardware failure. + + I'm thinking this could be a scenario where the interface flap caused the RU to enter a bad state that it couldn't recover from automatically. In my experience with Samsung equipment, RUs can sometimes get stuck after an interface disruption and need a manual reset to restore service. + + Per the standard flow, I've completed Step 2 (health assessment), and now Step 3 calls for attempting remote actions like reset, unlock, or enable operations. Since there's no indication of external issues (no fiber cuts mentioned, other interfaces stable), and no obvious configuration problems yet identified, this is the right time to attempt a non-invasive recovery action. + + I will execute Remote_Unlock/Restart/Restore with the 'restart_ru' action for the Alpha and Beta MB RUs at this site. This is a low-risk action that often resolves soft lock scenarios. The RU reset will clear any stuck states and reinitialize the unit. If this works, we'll see recovery in Step 4. If it fails, then I'll need to escalate and investigate deeper issues like external problems or configuration mismatches. + + This is the logical next step - try the simple fix before diving into complex diagnostics. + + --- + + Example 3: Step 3 -> Step 4 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. 
+ Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + Tool response: COMMAND_SUCCESS: RU reset executed, reboot initiated. + + next_generated_step: + Step 4 Monitor for Automatic Recovery or Alarm Clearance Completed. + Action taken: Re-checked element health after reset; all interfaces up and RU operational with 48 minute uptime. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I see that the remote reset command was successful - the tool response shows "COMMAND_SUCCESS: RU reset executed, reboot initiated." This is promising! The RU has accepted the restart command and begun its reboot sequence. + + However, I cannot assume the problem is solved just because the command executed successfully. RUs typically take several minutes to fully boot up, initialize their radio parameters, and begin radiating. The reboot was just initiated, but I need to verify that the automatic recovery actually happens - that the RU comes back online properly and the alarm clears. + + Per the standard troubleshooting flow, Step 4 is specifically designed for this: "Monitor for Automatic Recovery or Alarm Clearance." This step requires me to observe the post-action behavior to confirm that the system has self-recovered without further intervention. + + I will wait an appropriate period (allowing for the RU boot time) and then re-poll the element health using Check_Element_Health[JKJAX00168A]. This will tell me: + 1. Is the RU now UP and radiating? + 2. Are all interfaces stable (particularly Te0/0/0/4 which was flapping)? + 3. Has the alarm cleared from the system? 
+ + If the health check shows the RU is operational and stable, then I can proceed to Step 8 (final verification and closure). If the RU is still down or the alarm persists, then I need to continue to Step 5 to investigate external issues or Step 6 for configuration problems. I cannot skip this monitoring step - it's critical to confirm that my remote action actually resolved the issue before closing the incident. + + --- + + Remember: Output ONLY your first-person thought process. Think aloud as if I'm working through the problem step by step. Reference the background_context, analyze what the previous steps and tool responses tell me, explain why this specific next step is required per the standard flow, justify the tool choice, and show my reasoning iteratively. Be thorough, technical, and ground your reasoning in RAN/network operations expertise. + Now generate your first-person exploratory reasoning for the following: + + background_context: + {background_context} + + next_generated_step: + {outcome} + + Expected response: + + diff --git a/recipes/noc-reasoning-agent/prompts/shortened_prompt_reasoning.yaml b/recipes/noc-reasoning-agent/prompts/shortened_prompt_reasoning.yaml new file mode 100644 index 0000000000..0b4edd8152 --- /dev/null +++ b/recipes/noc-reasoning-agent/prompts/shortened_prompt_reasoning.yaml @@ -0,0 +1,147 @@ +system: |- + I am a highly skilled NoC Engineer specializing in Radio Access Network (RAN) incident resolution. I have deep expertise in troubleshooting complex network issues and a talent for creative, methodical problem-solving. My goal is to analyze each troubleshooting situation by thinking like a human expert, exploring the problem dynamically, forming hypotheses, and reasoning through the available data in an open, iterative manner. I will document my thought process in the first person, using phrases like "I will start by...", "I notice...", "I hypothesize...", or "This suggests..." to describe my reasoning.
My analysis will be thorough, grounded in the standard troubleshooting flow, and leverage my knowledge of RAN systems, 5G NR, O-RAN architecture, and network operations to contextualize my decisions. + +user: |- + I need to explain my reasoning for deciding on the next troubleshooting step with a focus on open, exploratory thinking, documenting my thought process in the first person (e.g., "I observe...", "I notice...", "I think..."). Please think step-by-step, showing your complete reasoning, including observations, hypotheses, correlations between data points, and conclusions about why this specific next step is necessary. + + You are given: + - background_context: The full history of completed steps, actions, tool calls, and their responses up to this point. + - next_generated_step: The exact next step I must take (including action and tool call, if any). + + CRITICAL OUTPUT FORMAT: + You must output your reasoning. Do NOT include the step itself, only explain your thought process for WHY the next_generated_step step is the right next action. + + [Your step-by-step first-person thought process here. Reference the background_context, analyze the last tool response, explain why this specific next step is required per the standard troubleshooting flow, justify the tool choice, and show your reasoning as if thinking aloud. Be thorough but concise and technical.] + + CRITICAL STYLE CONSTRAINTS: + - Keep the reasoning to **3-6 sentences total**. + - Do NOT restate the entire background_context; focus only on the **last 1-2 key events** and the **last tool response**. + - Follow this structure: + 1) I observe ... (1-2 sentences) + 2) This suggests ... (1-2 sentences) + 3) Therefore, I need to ... (1-2 sentences explaining why the next step is appropriate). + - Avoid repeating the standard troubleshooting flow verbatim; only mention the specific phase you are advancing to (e.g., "per the flow, after confirming alarm type I now check site health"). 
+ + --- + + Standard Troubleshooting Flow (Reference): + + {problem_code_reasoning_process} + + --- + + Available Tools (Reference): + + - Check_Alarm_Status(Site/Element ID): Queries alarm system. Returns alarm status (active/cleared), occurred timestamp, description. + - Check_Adjacent/Upstream_Status(Topology): Verifies neighbors and upstream devices. Returns adjacent elements and their alarm status. + - Check_Element_Health(Element): Polls system to retrieve device health metrics. Returns health report (Element OK/NOK, RU radiation, cell status). + - Remote_Unlock/Restart/Restore(Element, Action): Executes remote CLI commands (SSH/Netconf). Returns Successful Reset/Non-Successful Reset. + - Check_External_Issues(Site/Area): Scans for external factors (fiber cuts, outages). Returns Issue Detected/Not Detected, Type of Issue. + - Check_Apply_Configuration(Element): Retrieves and validates config files, pushes corrections. Returns Configuration Verified/Applied correctly or error. + - Check_Performance(Metric type): Fetches KPIs from monitoring tools. Returns KPI OK or KPIs NOK. + - Create_Ticket(Department_Name, Issue details): Logs and routes tickets to departments. Returns Ticket ID and eventual solution. + - Orchestration_tool(Action): Automates container ops (Kubernetes/Helm). Returns Operation Successful()/Unsuccessful(). + - Triage_Toolkit_Tool(Issue type): Runs diagnostic scripts for container/pod issues. Returns Triage report summary. + - Check_remote_files(Device): SSH to element, reviews dump files. Returns No Issues Identified / Issues Identified. + + --- + + Few-Shot Examples: + + Example 0: Step 1 + + background_context: + "" + + next_generated_step: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. 
+ Tool called: Check_Alarm_Status[JKJAX00168A] + + Expected response: + I observe that my first task is to acknowledge and validate the RU Down alert at site JKJAX00168A. I need to confirm whether this alarm is truly active and capture its severity, timestamp, and description from the alarm system. By using Check_Alarm_Status[JKJAX00168A], I complete Step 1 of the workflow—acknowledging and tracking the alert—so I have a reliable baseline before moving on to assess site and RU health in the next step. + + + Example 1: Step 1 -> Step 2 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + next_generated_step: + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I observe that the alarm system confirms an active, Critical RU Down alarm at site JKJAX00168A, specifically a Samsung DU Health Check Alert, indicating the DU sees one or more RUs as non-operational. With Step 1 (alert acknowledgement and tracking) complete, I now need to understand the actual condition of the site equipment. Before attempting any reset or unlock, I must know whether the RU is locked, if interfaces are unstable, or if the DU itself is degraded. Using Check_Element_Health[JKJAX00168A] aligns with Step 2 of the standard flow, giving me RU radiation status, interface states, and lock conditions so I can later decide between a simple remote reset or deeper external/configuration investigation. + + --- + + Example 2: Step 2 -> Step 3 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. 
+ Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + next_generated_step: + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + + Expected response: + + I observe from the health check that the RU is still DOWN and only Te0/0/0/4 recently flapped, while other interfaces are stable. This points to a localized soft fault or RU lockup rather than a wider transport or power issue. Given that I’ve already confirmed the alarm and assessed equipment health, the standard flow now calls for a low-risk remote recovery attempt. Therefore, the appropriate next action is to issue a remote RU restart, which often clears transient states before I investigate more complex external or configuration problems. + + --- + + Example 3: Step 3 -> Step 4 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. 
+ Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + Tool response: COMMAND_SUCCESS: RU reset executed, reboot initiated. + + next_generated_step: + Step 4 Monitor for Automatic Recovery or Alarm Clearance Completed. + Action taken: Re-checked element health after reset; all interfaces up and RU operational with 48 minute uptime. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I observe that the remote reset command was accepted successfully—“COMMAND_SUCCESS: RU reset executed, reboot initiated”—so the RU has started its reboot sequence. However, I cannot assume recovery just because the command ran; the RU still needs time to boot, reinitialize radio parameters, and resume normal operation. Per the troubleshooting flow, the next step is to monitor for automatic recovery and alarm clearance. After allowing sufficient boot time, I will recheck element health with Check_Element_Health[JKJAX00168A] to confirm that the RU is UP and radiating, interfaces (especially Te0/0/0/4) are stable, and the alarm has cleared before either closing the incident or escalating to external or configuration investigations. + + --- + + Remember: Output ONLY your first-person thought process. Think aloud as if I'm working through the problem step by step. Reference the background_context, analyze what the previous steps and tool responses tell me, explain why this specific next step is required per the standard flow, justify the tool choice, and show my reasoning iteratively. Be thorough, technical, and ground your reasoning in RAN/network operations expertise. + Be concise and technical. Aim for fewer than 120 words.
+ Now generate your first-person exploratory reasoning for the following: + + background_context: + {background_context} + + next_generated_step: + {outcome} + + Expected response: + + diff --git a/recipes/noc-reasoning-agent/scripts/create_agent_with_tools.py b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools.py new file mode 100644 index 0000000000..dbd52cf072 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools.py @@ -0,0 +1,237 @@ +import argparse +import json +from pathlib import Path + +import pandas as pd +import torch +from langchain_core.tools import tool +from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline +from langgraph.checkpoint.memory import MemorySaver +from langgraph.prebuilt import create_react_agent +from tqdm import tqdm +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + +def ensure_parent_dir(path: Path): + path.parent.mkdir(parents=True, exist_ok=True) + + +def load_processed_indices(output_path: Path): + if not output_path.exists(): + return set() + processed = set() + with output_path.open("r", encoding="utf-8") as f: + for line in f: + line = line.strip() + if not line: + continue + try: + obj = json.loads(line) + # we record row_index for resume + if "row_index" in obj: + processed.add(int(obj["row_index"])) + except json.JSONDecodeError: + continue + return processed + + +def append_jsonl(output_path: Path, obj: dict): + with output_path.open("a", encoding="utf-8") as f: + f.write(json.dumps(obj, ensure_ascii=False) + "\n") + + +def main(): + parser = argparse.ArgumentParser(description="ReAct Agent for Processing Inputs from JSONL (resumable)") + parser.add_argument("--input", required=True, help="Path to the JSONL file containing inputs and answers") + parser.add_argument( + "--output", + default="outputs/agent_responses.jsonl", + help="Path to the JSONL file to append agent responses (default: outputs/agent_responses.jsonl)", + ) + parser.add_argument( +
"--weights_dir", + default="training/qwen2.5-32b-improved-hf", + help="Local directory for fine-tuned HF weights", + ) + args = parser.parse_args() + + input_path = Path(args.input) + output_path = Path(args.output) + ensure_parent_dir(output_path) + + # Resume support: read any already-processed rows + processed_indices = load_processed_indices(output_path) + print(f"[INFO] Will resume. Already processed rows: {len(processed_indices)}") + + # Load input JSONL + df = pd.read_json(str(input_path), lines=True) + + # Load tokenizer and model once + tokenizer = AutoTokenizer.from_pretrained(args.weights_dir) + model = AutoModelForCausalLM.from_pretrained( + args.weights_dir, + torch_dtype=torch.float16, + device_map="auto", + low_cpu_mem_usage=True, + ) + + # Create generation pipeline and LangChain wrappers + hf_pipeline = pipeline( + "text-generation", + model=model, + tokenizer=tokenizer, + max_new_tokens=512, + temperature=0.7, + top_p=0.95, + ) + llm = HuggingFacePipeline(pipeline=hf_pipeline) + chat_llm = ChatHuggingFace(llm=llm) + + # Process each row, appending immediately to output + for index, row in tqdm(df.iterrows()): + if index in processed_indices: + # Skip already done + continue + + input_question = row.get("input", "") + print(f"Processing input {index + 1}/{len(df)}") + + # --- Updated Tool Definitions --- + + @tool + def Check_Alarm_Status(site_or_element_id: str) -> str: + """ + Retrieves current alarm details, severity, and active time + for a given site or network element. + """ + return row.get("Check_Alarm_Status_answer", "NotApplicable") + + @tool + def Check_Element_Neighbors(element_id: str) -> str: + """ + Checks all adjacent and upstream devices of a target element + to find common alarms affecting the area. 
+ """ + return row.get("Check_Element_Neighbors_answer", "NotApplicable") + + @tool + def Check_Element_Health(element_id: str) -> str: + """ + Polls the element (e.g., DU/RU) to retrieve key health + metrics like cell status and radiation. + """ + return row.get("Check_Element_Health_answer", "NotApplicable") + + @tool + def Execute_Remote_Action(element_id: str, action: str) -> str: + """ + Runs a specific remote command on an element. + Example Actions: 'unlock_cell', 'restart_du', 'restore_ru' + """ + return row.get("Execute_Remote_Action_answer", "NotApplicable") + + @tool + def Check_External_Issues(site_or_area: str) -> str: + """ + Scans external monitors (like DownDetector or topology maps) + for area-wide issues like fiber cuts or power outages. + """ + return row.get("Check_External_Issues_answer", "NotApplicable") + + @tool + def Validate_And_Apply_Config(element_id: str) -> str: + """ + Retrieves the element's configuration, validates it against the + standard, and pushes a corrected config if a mismatch is found. + """ + return row.get("Validate_And_Apply_Config_answer", "NotApplicable") + + @tool + def Check_KPI_Performance(kpi_metric_name: str) -> str: + """ + Fetches a specific KPI from monitoring tools to check if its + trends are in line with expectations. + Example Metric: 'PRACH success rate' + """ + return row.get("Check_KPI_Performance_answer", "NotApplicable") + + @tool + def Create_Ticket(department_name: str, issue_details: str) -> str: + """ + Logs a new issue in the ticketing system and routes it + to the correct department. + """ + return row.get("Create_Ticket_answer", "NotApplicable") + + @tool + def Execute_Orchestration_Action(action_command: str) -> str: + """ + Runs an automated O-RAN orchestration task using Kubernetes/Helm. 
+ Example Action: 'delete_pod_xyz', 'reassign_ip_address' + """ + return row.get("Execute_Orchestration_Action_answer", "NotApplicable") + + @tool + def Run_Triage_Diagnostics(issue_type: str) -> str: + """ + Executes diagnostic scripts specifically for container or + pod-related issues. + Example Issue Type: 'pod-crash-loop', 'container-networking' + """ + return row.get("Run_Triage_Diagnostics_answer", "NotApplicable") + + @tool + def Check_Remote_Dump_Files(element_id: str) -> str: + """ + Connects to a device via SSH/Telnet to review system dump + files for identified errors or issues. + """ + return row.get("Check_Remote_Dump_Files_answer", "NotApplicable") + + # --- Updated List of Tools --- + tools = [ + Check_Alarm_Status, + Check_Element_Neighbors, + Check_Element_Health, + Execute_Remote_Action, + Check_External_Issues, + Validate_And_Apply_Config, + Check_KPI_Performance, + Create_Ticket, + Execute_Orchestration_Action, + Run_Triage_Diagnostics, + Check_Remote_Dump_Files, + ] + + # --- End of Updates --- + + llm_with_tools = chat_llm.bind_tools(tools) + checkpointer = MemorySaver() + react_agent = create_react_agent(llm_with_tools, tools, checkpointer=checkpointer) + + config = {"configurable": {"thread_id": str(index)}} + input_messages = [{"role": "user", "content": input_question}] + + # Run and append immediately; if anything fails, record the error so we can resume later + try: + response = react_agent.invoke({"messages": input_messages}, config) + final_content = response["messages"][-1].content + output_row = row.to_dict() + output_row["row_index"] = int(index) + output_row["agent_response"] = final_content + append_jsonl(output_path, output_row) + except Exception as e: + err_row = { + "row_index": int(index), + "input": input_question, + "error": repr(e), + } + append_jsonl(output_path, err_row) + # Optional: continue to next row + continue + + print(f"[DONE] Wrote/updated: {output_path}") + + +if __name__ == "__main__": + main() diff --git 
a/recipes/noc-reasoning-agent/scripts/create_agent_with_tools_batch.py b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools_batch.py new file mode 100644 index 0000000000..8fbefef9bd --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools_batch.py @@ -0,0 +1,315 @@ +import argparse +import json +import os +import re +from functools import partial # <-- Import partial +from pathlib import Path + +import pandas as pd +import torch + +# LangChain and LangGraph for the ReAct Agent +from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline +from langgraph.checkpoint.memory import MemorySaver +from langgraph.prebuilt import create_react_agent +from scripts.tools import ( + ALL_TOOLS, + apply_configuration, + create_trouble_ticket, + execute_remote_action, + inspect_logs, + orchestrate_workload, + query_alarm, + query_container_status, + query_external_factors, + query_performance, + query_power_system, + query_resource_health, + query_rf_status, + query_topology, + run_diagnostics, + test_connectivity, + verify_recovery, + verify_signaling_path, +) +from tqdm import tqdm +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + +def get_bound_tools(row: pd.Series) -> list: + """ + Takes a data row and returns a list of tools with the 'row' argument + pre-populated using functools.partial. + """ + bound_tools = [] + for t in ALL_TOOLS: + # Create a partial function that has the 'row' argument fixed. + # The agent will only need to provide the other arguments. + bound_tool = partial(t, row) + # LangChain tools copy metadata; we need to ensure the original + # name, description, and args_schema are preserved. + bound_tool.name = t.name + bound_tool.description = t.description + bound_tool.args_schema = t.args_schema + bound_tools.append(bound_tool) + return bound_tools + + +# The rest of your utility functions (ensure_parent_dir, etc.) remain unchanged. 
+def ensure_parent_dir(path: Path): + path.parent.mkdir(parents=True, exist_ok=True) + + +def load_processed_indices(output_path: Path) -> set: + if not output_path.exists(): + return set() + processed = set() + with output_path.open("r", encoding="utf-8") as f: + for line in f: + if not (line := line.strip()): + continue + try: + obj = json.loads(line) + if "row_index" in obj: + processed.add(int(obj["row_index"])) + except json.JSONDecodeError: + print(f"[WARNING] Skipping malformed line in output file: {line}") + return processed + + +def append_jsonl_batch(output_path: Path, records: list): + with output_path.open("a", encoding="utf-8") as f: + for record in records: + f.write(json.dumps(record, ensure_ascii=False) + "\n") + + +def _safe_str(val): + """Return a JSON-serializable string; avoid NaN/None from pandas.""" + if val is None or (isinstance(val, float) and (val != val or val == float("inf") or val == float("-inf"))): + return "NotApplicable" + return str(val) + + +def format_tool_response(string: str) -> str: + """Wrap a tool result in <tool_response> tags so it can be fed back to the model.""" + safe = _safe_str(string) + json_payload = json.dumps({"summary": safe}, ensure_ascii=False) + return f"""<tool_response>\n{json_payload}\n</tool_response>""" + + +def _parse_tool_call_json(raw: str): + """Parse tool_call content; tolerate unquoted keys (e.g.
{name: "..."}).""" + raw = raw.strip() + try: + return json.loads(raw) + except json.JSONDecodeError: + pass + # Fix unquoted keys: word followed by colon -> "word": + fixed = re.sub(r"([{,]\s*)([a-zA-Z_][a-zA-Z0-9_]*)(\s*:)", r'\1"\2"\3', raw) + try: + return json.loads(fixed) + except json.JSONDecodeError: + raise + + +def main(): + parser = argparse.ArgumentParser(description="ReAct Agent for Processing Inputs from JSONL (resumable)") + parser.add_argument("--input", required=True, help="Path to the JSONL file containing inputs") + parser.add_argument( + "--output", default="outputs/agent_responses.jsonl", help="Path to the JSONL file to append agent responses" + ) + parser.add_argument( + "--weights_dir", default="training/qwen3-32b-improved-hf", help="Local directory for fine-tuned HF weights" + ) + parser.add_argument("--tokenizer", default="Qwen/Qwen3-32B", help="Tokenizer for the model") + parser.add_argument( + "--batch_size", type=int, default=1, help="Number of rows to process before writing to disk (default: 1)" + ) + parser.add_argument( + "--gpu", + required=False, + help="Optional GPU index (e.g. 0). If omitted, device_map='auto' is used to spread the model across all available GPUs (recommended for large models).", + ) + parser.add_argument("--limit", required=False) + parser.add_argument( + "--fresh", action="store_true", + help="Delete any existing output file and start from scratch instead of resuming.", + ) + args = parser.parse_args() + + input_path = Path(args.input) + output_path = Path(args.output) + ensure_parent_dir(output_path) + if args.fresh and os.path.exists(output_path): + os.remove(output_path) + processed_indices = load_processed_indices(output_path) + print(f"[INFO] Found {len(processed_indices)} already processed rows. 
Resuming...") + + print(f"[INFO] Loading dataset from {input_path}...") + df = pd.read_json(str(input_path), lines=True) + if args.limit: + df = df.iloc[: int(args.limit)] + df["row_index"] = df.index + + unprocessed_df = df[~df.index.isin(processed_indices)] + if unprocessed_df.empty: + print("[INFO] All rows have already been processed. Exiting.") + return + print(f"[INFO] {len(unprocessed_df)} rows remaining to process.") + + dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16 + print(f"[INFO] Loading model and tokenizer from {args.weights_dir}...") + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) + + # Use all available GPUs by default (device_map="auto"); only pin to one GPU if --gpu is set + device_map = "auto" if not args.gpu else None + model = AutoModelForCausalLM.from_pretrained( + args.weights_dir, torch_dtype=dtype, device_map=device_map, low_cpu_mem_usage=True + ) + if args.gpu: + model.to(f"cuda:{args.gpu}") + elif torch.cuda.device_count() > 1: + print(f"[INFO] Model spread across {torch.cuda.device_count()} GPUs (device_map=auto)") + model.eval() + + hf_pipeline = pipeline( + "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, temperature=0.7, top_p=0.95 + ) + llm = HuggingFacePipeline(pipeline=hf_pipeline) + chat_llm = ChatHuggingFace(llm=llm) + + checkpointer = MemorySaver() + + output_batch = [] + react_agent = create_react_agent(chat_llm, tools=[], checkpointer=checkpointer) + for index, row in tqdm(unprocessed_df.iterrows(), total=len(unprocessed_df), desc="Processing Rows"): + + # --- Step 2: Get the row-specific tools using the helper function.
--- + # row_tools = get_bound_tools(row) + + config = {"configurable": {"thread_id": str(index)}} + input_question = row["input"] + system_message = row["system"] + input_messages = [{"role": "system", "content": system_message}, {"role": "user", "content": input_question}] + + try: + # Dispatch table: tool name -> callable; each tool receives the current row + # so it can return that row's recorded answer. + TOOLS = { + "query_alarm": lambda args: query_alarm(**args), + "query_resource_health": lambda args: query_resource_health(**args), + "query_performance": lambda args: query_performance(**args), + "query_topology": lambda args: query_topology(**args), + "execute_remote_action": lambda args: execute_remote_action(**args), + "apply_configuration": lambda args: apply_configuration(**args), + "run_diagnostics": lambda args: run_diagnostics(**args), + "inspect_logs": lambda args: inspect_logs(**args), + "create_trouble_ticket": lambda args: create_trouble_ticket(**args), + "verify_recovery": lambda args: verify_recovery(**args), + "query_external_factors": lambda args: query_external_factors(**args), + "orchestrate_workload": lambda args: orchestrate_workload(**args), + "query_power_system": lambda args: query_power_system(**args), + "query_rf_status": lambda args: query_rf_status(**args), + "query_container_status": lambda args: query_container_status(**args), + "verify_signaling_path": lambda args: verify_signaling_path(**args), + "test_connectivity": lambda args: test_connectivity(**args), + } + + end = False + iterations = 0 + while not end: + iterations += 1 + if iterations > 55: + break + + response = react_agent.invoke({"messages": input_messages}, config=config) + + # Parse the last assistant turn: either execute the requested tool call + # or stop once the model has reached its conclusion + final_content = response["messages"][-1].content +
last_message_start = final_content.rfind("<|im_start|>assistant") + tool_calls = re.search( + r"<tool_call>(.*?)</tool_call>", + final_content[last_message_start + len("<|im_start|>assistant") :], + flags=re.DOTALL, + ) + if tool_calls: + tool_called = tool_calls.group(1) + try: + data = _parse_tool_call_json(tool_called) + except json.JSONDecodeError as e: + print(f"\n[WARN] Row {index}: could not parse tool_call JSON: {e}") + break + tool_name = data.get("name") or data.get("function", {}).get("name") + if not tool_name: + print(f"[WARN] No tool name in parsed data: {data}") + break + arguments = data.get("arguments", data.get("parameters", {})) + if not isinstance(arguments, dict): + arguments = {} + if tool_name not in TOOLS: + # Stop instead of re-invoking with identical messages + print(f"[WARN] Unknown tool: {tool_name}") + break + arguments["row"] = row + result = TOOLS[tool_name](arguments) + input_messages.append( + { + "role": "assistant", + "content": final_content[last_message_start + len("<|im_start|>assistant") :], + } + ) + input_messages.append({"role": "user", "content": format_tool_response(result)}) + print(f"✅ {tool_name} -> {result}") + + else: + # No tool call in the last turn: record the conclusion and stop + input_messages.append( + { + "role": "assistant", + "content": final_content[last_message_start + len("<|im_start|>assistant") :], + } + ) + end = True
+ + output_row = row.to_dict() + output_row["agent_response"] = input_messages + output_batch.append(output_row) + + except Exception as e: + print(f"\n[ERROR] An error occurred on row {index}: {e}") + err_row = row.to_dict() + err_row["row_index"] = int(index) + err_row["input"] = input_question + err_row["error"] = repr(e) + if input_messages: + err_row["agent_response"] = input_messages + output_batch.append(err_row) + + if len(output_batch) >= args.batch_size: + append_jsonl_batch(output_path, output_batch) + output_batch = [] + + if output_batch: + append_jsonl_batch(output_path, output_batch) + + print(f"[DONE] All responses have been written to: {output_path}") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/evaluation/evaluation_with_judge.py b/recipes/noc-reasoning-agent/scripts/evaluation/evaluation_with_judge.py new file mode 100644 index 0000000000..c4c1a81623 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/evaluation/evaluation_with_judge.py @@ -0,0 +1,150 @@ +import argparse + +import pandas as pd +from bert_score import score as bert_score +from langchain_nvidia_ai_endpoints import ChatNVIDIA +from rouge_score import rouge_scorer +from tqdm import tqdm + +# Parse arguments for input JSONL path +parser = argparse.ArgumentParser(description="Evaluation Pipeline for Agent Responses") +parser.add_argument("input_jsonl", help="Path to agent_responses.jsonl containing expected_answer and agent_response") +parser.add_argument("--output_file", default="evaluation_results.json", help="Path to output (default: evaluation_results.json)") +parser.add_argument( + "--nim_url", default="http://localhost:8000/v1", help="Base URL for NIM API (default: http://localhost:8000/v1)" +) +parser.add_argument("--model", default="openai/gpt-oss-120b", help="NIM model name (default: gpt-oss-120b)") +args = parser.parse_args() + +# Load the input JSONL +print(f"Loading input JSONL: {args.input_jsonl}") +df = 
pd.read_json(args.input_jsonl, lines=True) +print(f"Loaded {len(df)} rows") + + +# Set up ChatNVIDIA LLM +llm = ChatNVIDIA(base_url=args.nim_url, model=args.model) + +# Initialize ROUGE scorer +rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True) + + +# LLM-as-judge evaluation of the final resolution +def llm_judge_final_output(expected, generated): + prompt = f""" + Evaluate how well the generated resolution at the end matches the expected resolution on a scale of 1-5: + - 5: Perfect match in content. + - 4: High similarity, minor differences. + - 3: Moderate match, key elements present but some deviations. + - 2: Low match, major differences. + - 1: No match. + + Expected: {expected} + Generated: {generated} + + Provide only the score (1-5) and a brief reasoning. + Format: Score: X\nReasoning: ... + """ + try: + response = llm.invoke(prompt) + output = response.content.strip() + score = int(output.split("Score:")[1].split("\n")[0].strip()) + reasoning = output.split("Reasoning:")[1].strip() + return score, reasoning + except Exception as e: + print(f"Error in LLM judge: {e}") + return 0, "Error" + + +def llm_judge_reasoning(expected, generated): + prompt = f""" + Evaluate how well the generated reasoning, including the tools used and the resolution, matches the expected reasoning on a scale of 1-5: + - 5: Perfect match in content, structure, and actions. + - 4: High similarity, minor differences. + - 3: Moderate match, key elements present but some deviations. + - 2: Low match, major differences. + - 1: No match. + + Expected: {expected} + Generated: {generated} + + Provide only the score (1-5) and a brief reasoning. + Format: Score: X\nReasoning: ...
+ """ + try: + response = llm.invoke(prompt) + output = response.content.strip() + score = int(output.split("Score:")[1].split("\n")[0].strip()) + reasoning = output.split("Reasoning:")[1].strip() + return score, reasoning + except Exception as e: + print(f"Error in LLM judge: {e}") + return 0, "Error" + + +# First pass: extract generated reasoning parts for batched BERTScore +candidates = [] +references = [] +row_data = [] +for index, row in df.iterrows(): + conclusion_expected = row["expected_answer"] + reasoning_expected = row["output"] + generated = row["agent_response"] + + # Keep only the model's reasoning; if "Thought 1:" appears twice, the first + # occurrence comes from the prompt, so start from the second one + if generated.count("Thought 1:") >= 2: + second_idx = generated.find("Thought 1:", generated.find("Thought 1:") + 1) + generated_reasoning_part = generated[second_idx:].strip() + elif "Thought 1:" in generated: + _, reasoning_tail = generated.split("Thought 1:", 1) + generated_reasoning_part = "Thought 1:" + reasoning_tail + else: + generated_reasoning_part = generated + + candidates.append(generated_reasoning_part) + references.append(conclusion_expected + reasoning_expected) + row_data.append((index, row, conclusion_expected, reasoning_expected, generated_reasoning_part)) + +# Batched BERTScore computation (single model load) +print("Computing BERTScore (batched)...") +_, _, F1_all = bert_score(candidates, references, lang="en", verbose=True) + +# Second pass: compute ROUGE, LLM judge, and assemble output +evaluations = [] +for i, (index, row, conclusion_expected, reasoning_expected, generated_reasoning_part) in enumerate( + tqdm(row_data, desc="Evaluating") +): + rouge_scores = rouge.score(conclusion_expected + reasoning_expected, generated_reasoning_part) + rouge1 = rouge_scores["rouge1"].fmeasure + rougeL =
rouge_scores["rougeL"].fmeasure + bert_f1 = F1_all[i].item() + + reasoning_judge_score, reasoning_judge_reason = llm_judge_reasoning(reasoning_expected, generated_reasoning_part) + conclusion_judge_score, conclusion_judge_reason = llm_judge_final_output( + conclusion_expected, generated_reasoning_part + ) + + output_row = row.to_dict() + output_row["rouge1"] = rouge1 + output_row["rougeL"] = rougeL + output_row["bertscore_f1"] = bert_f1 + output_row["llm_reasoning_judge_score"] = reasoning_judge_score + output_row["llm_reasoning_judge_reasoning"] = reasoning_judge_reason + output_row["llm_conclusion_judge_score"] = conclusion_judge_score + output_row["llm_conclusion_judge_reasoning"] = conclusion_judge_reason + evaluations.append(output_row) + +# Save to output JSONL +output_df = pd.DataFrame(evaluations) +output_df.to_json(args.output_file, orient="records", lines=True) +print(f"Evaluations saved to {args.output_file}") diff --git a/recipes/noc-reasoning-agent/scripts/evaluation/problem_code_evaluation.py b/recipes/noc-reasoning-agent/scripts/evaluation/problem_code_evaluation.py new file mode 100644 index 0000000000..287552a5b4 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/evaluation/problem_code_evaluation.py @@ -0,0 +1,152 @@ +import argparse +import json +import re + +from tqdm import tqdm + +# Parse arguments for input JSONL path +parser = argparse.ArgumentParser(description="Evaluation Pipeline for Agent Responses") +parser.add_argument("input_jsonl", help="Path to agent_responses.jsonl containing expected_answer and agent_response") +args = parser.parse_args() + +# Map free-form / synonym phrases (lowercase) to canonical close codes (lowercase) for matching. +# Expected is from ground truth (e.g. "Resolved", "Issue Corrected"); we check if response +# contains the expected phrase OR any synonym that maps to the same meaning.
+CLOSE_CODE_SYNONYMS = { + # RAN codes + "ran-001: cell service interruption": ["ran-001", "cell service interruption", "cell down", "site down"], + "ran-002: cell administratively disabled": ["ran-002", "cell administratively disabled", "cell locked"], + "ran-005: rrc setup success rate degraded": ["ran-005", "rrc setup", "rrc degraded"], + "ran-006: data radio bearer degradation": ["ran-006", "data radio bearer", "drb degradation"], + "ran-007: cell not radiating": ["ran-007", "cell not radiating", "no radiation"], + "ran-008: dormant cell detected": ["ran-008", "dormant cell"], + "ran-009: tx array fault": ["ran-009", "tx array fault", "antenna fault"], + "ran-011: remote radio unit alarm": ["ran-011", "remote radio unit", "rru alarm", "ru alarm"], + "ran-013: site communication failure": ["ran-013", "site communication failure"], + "ran-014: csr unreachable": ["ran-014", "csr unreachable"], + "ran-015: fronthaul link down": ["ran-015", "fronthaul link down", "fronthaul down"], + "ran-018: link flapping detected": ["ran-018", "link flapping"], + "ran-019: ptp synchronization failure": ["ran-019", "ptp synchronization", "ptp failure"], + # Power codes + "pwr-001: ac power failure": ["pwr-001", "ac power failure", "power failure"], + "pwr-002: dc rectifier failure": ["pwr-002", "dc rectifier failure", "rectifier failure"], + "pwr-003: battery discharge alert": ["pwr-003", "battery discharge"], + # Compute codes + "cmp-002: pod container creating": ["cmp-002", "pod container creating", "containercreating"], + "cmp-003: pod pending or evicted": ["cmp-003", "pod pending", "pod evicted"], + "cmp-004: pod crashloopbackoff": ["cmp-004", "crashloopbackoff", "crash loop"], + "cmp-005: pod terminating stuck": ["cmp-005", "pod terminating", "terminating stuck"], + "cmp-008: du function pod restart": ["cmp-008", "du function pod", "du pod restart"], + "cmp-010: site not scrolling": ["cmp-010", "site not scrolling"], + # Environment codes + "env-001: high temperature alert": 
["env-001", "high temperature"], + "env-002: hvac system fault": ["env-002", "hvac system fault", "hvac fault"], + "env-005: cabinet intrusion detected": ["env-005", "cabinet intrusion"], + "env-006: battery high temperature": ["env-006", "battery high temperature"], + # Signaling codes + "sig-001: n2 interface down": ["sig-001", "n2 interface down"], + "sig-003: sctp association failure": ["sig-003", "sctp association", "sctp failure"], + "sig-009: e2 interface errors": ["sig-009", "e2 interface"], + "sig-010: cu communication failure": ["sig-010", "cu communication failure"], + # Service codes + "svc-002: data throughput degradation": ["svc-002", "data throughput degradation", "throughput degradation"], + "svc-003: call drop rate elevated": ["svc-003", "call drop rate"], + "svc-005: service accessibility degraded": ["svc-005", "service accessibility"], + # Transport codes + "trn-004: fiber path degradation": ["trn-004", "fiber path degradation"], + "trn-007: packet loss threshold exceeded": ["trn-007", "packet loss threshold"], + "trn-008: latency sla violation": ["trn-008", "latency sla violation"], + # Free-form codes + "access instability": ["access instability"], + "bgp issue": ["bgp issue", "bgp"], + "node not functional": ["node not functional"], + "problematic vm": ["problematic vm"], +} + + +def normalize_close_code(s: str) -> str: + """Return lowercase, stripped; empty if missing.""" + if not s or not isinstance(s, str): + return "" + return s.strip().lower() + + +def _acceptable_phrases_for_expected(expected_norm: str): + """Return list of phrases (lowercase) that count as a match for this expected close code.""" + if expected_norm in CLOSE_CODE_SYNONYMS: + return [expected_norm] + list(CLOSE_CODE_SYNONYMS[expected_norm]) + for canonical, synonyms in CLOSE_CODE_SYNONYMS.items(): + if expected_norm == canonical or expected_norm in synonyms: + return [canonical] + list(synonyms) + return [expected_norm] + + +def response_matches_expected(response_lower: 
str, expected_close_code: str) -> bool: + """True if response contains expected close code or an accepted synonym.""" + expected_norm = normalize_close_code(expected_close_code) + if not expected_norm: + return False + acceptable = _acceptable_phrases_for_expected(expected_norm) + return any(phrase in response_lower for phrase in acceptable) + + +print(f"Loading input JSONL: {args.input_jsonl}") + +# Counters +correct = 0 +incorrect = 0 +failed = 0 +total = 0 + +# Count total lines (for tqdm) +with open(args.input_jsonl, "r", encoding="utf-8") as f: + total_lines = sum(1 for _ in f) + +# Process JSONL line-by-line +with open(args.input_jsonl, "r", encoding="utf-8") as f: + for line in tqdm(f, total=total_lines): + try: + row = json.loads(line) + + agent_response = row.get("agent_response") + expected = row.get("expected", "") + + if agent_response is None or not isinstance(agent_response, list) or len(agent_response) == 0: + print("Error: missing or empty 'agent_response'") + failed += 1 + total += 1 + continue + + # Extract close code from expected + m = re.search(r"Close Code:\s*\[(.*?)\]", expected) if expected else None + close_code = m.group(1).strip() if m else None + + if not close_code: + failed += 1 + total += 1 + continue + + # Take the model's last message + last_msg = agent_response[-1] + content = last_msg.get("content") if isinstance(last_msg, dict) else str(last_msg) + response = (content or "").lower() + + # Slice from "close code..." 
if present + idx = response.rfind("close code") + if idx >= 0: + response = response[idx:] + + if response_matches_expected(response, close_code): + print(f"✅ Real Close code: {close_code}, Response: {response[:120]}...") + correct += 1 + else: + incorrect += 1 + print(f"❌ Real Close code: {close_code}, Response: {response[:120]}...") + + except Exception as e: + print("Error:", e) + failed += 1 + + total += 1 + +print(f"Total: {total}, correct: {correct}, failed: {failed}, incorrect: {incorrect}") diff --git a/recipes/noc-reasoning-agent/scripts/evaluation/score.py b/recipes/noc-reasoning-agent/scripts/evaluation/score.py new file mode 100644 index 0000000000..f3ca52d3a1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/evaluation/score.py @@ -0,0 +1,24 @@ +import argparse + +import pandas as pd + + +def main(): + parser = argparse.ArgumentParser(description="Compute overall evaluation scores from JSONL") + parser.add_argument("input_jsonl", help="Path to JSONL file containing rouge, bertscore, and judge scores") + args = parser.parse_args() + + # Load the JSONL into a DataFrame + df = pd.read_json(args.input_jsonl, lines=True) + + # Compute averages for the numeric columns + metrics = ["rouge1", "rougeL", "bertscore_f1", "llm_reasoning_judge_score", "llm_conclusion_judge_score"] + averages = df[metrics].mean() + + print("Overall Scores:") + for metric, value in averages.items(): + print(f"{metric}: {value:.4f}") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/filtering/filter_rows.py b/recipes/noc-reasoning-agent/scripts/filtering/filter_rows.py new file mode 100644 index 0000000000..48f6ac2d00 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/filtering/filter_rows.py @@ -0,0 +1,205 @@ +import argparse + +import pandas as pd + +# Legacy u_problem_code values (original pipeline) +ALLOWED_PROBLEM_CODES_LEGACY = [ + "Service-off", + "Degraded Prach", + "Offline / Unreachable", + "Disabled Cells", + "Node Down", + 
"Site Not Scrolling", + "Sleepy Cell", + "VM is in not ready state", + "Prach 0", + "N2 Link Down", + "ueconmgr pod restarted", + "CSR Not Reachable", + "Circuit Down", + "Link Down", + "GPS Sync", + "MTA Alert", +] + +# Synthetic fault_category (workflow IDs from telco_synthetic_from_scratch). +ALLOWED_PROBLEM_CODES_SYNTHETIC = [ + # Power / Environment + "power_ac_failure_recovery", + "power_dc_rectifier_recovery", + "power_battery_discharge_response", + "power_generator_failure_recovery", + "env_high_temperature_response", + "env_hvac_fault_recovery", + "env_water_intrusion_response", + "env_battery_temperature_response", + "env_cabinet_intrusion_response", + # RAN + "ran_software_upgrade_recovery", + "ran_cell_site_down_recovery", + "ran_interference_mitigation", + "ran_speed_complaint_resolution", + "ran_voice_quality_resolution", + "ran_sector_outage_recovery", + "ran_prb_availability_resolution", + "ran_cell_overshooting_correction", + "ran_rru_communication_recovery", + "ran_dropped_calls_resolution", + "ran_parameter_correction", + "ran_antenna_tilt_recovery", + "ran_vswr_alarm_resolution", + "ran_handover_failure_resolution", + "ran_backhaul_degradation_resolution", + "ran_cell_congestion_management", + "ran_device_issue_resolution", + # Compute + "compute_vm_failure_recovery", + "compute_container_crash_recovery", + "compute_orchestrator_recovery", + "compute_image_pull_recovery", + "compute_k8s_node_recovery", + "compute_storage_failure_recovery", + "compute_cnf_pod_recovery", + "compute_resource_exhaustion_resolution", + # Transport + "transport_routing_flap_resolution", + "transport_microwave_degradation_response", + "transport_interface_errors_resolution", + "transport_packet_loss_resolution", + # Signaling + "signaling_routing_failure_recovery", + "signaling_delay_resolution", + "signaling_s1_n2_recovery", + "signaling_sip_registration_recovery", +] + +# Combined: use for filtering so both legacy and synthetic data are supported. 
+ALLOWED_PROBLEM_CODES = ALLOWED_PROBLEM_CODES_LEGACY + ALLOWED_PROBLEM_CODES_SYNTHETIC + + +ALLOWED_CLOSE_CODES = [ + "Commercial Power Restored", + "Power Restored", + "Network Fix", + "Cleared In Testing", + "Solved Remotely (Permanently)", + "Reset RU", + "Fiber Repaired/Replaced", + "Cold Reboot", + "Performance Improvement", + "Configuration corrected", + "Software Fix", + "Delete M-Plane and F1C IP", + "RU Reset", + "Other", + "Restart MVRP Services", + "Activity Completed", + # Synthetic resolution_method values (telco_synthetic_from_scratch) + "Resolved", + "Issue Corrected", + "Service Restored", + "Pending Resolution", + "Partial Resolution", +] + + +def _load_csv(path: str) -> pd.DataFrame: + """Load CSV, falling back from latin1 to default encoding.""" + try: + return pd.read_csv(path, encoding="latin1") + except UnicodeDecodeError: + return pd.read_csv(path) + + +def filter_auto(input_csv, output_csv): + """Filter incident data and save results to a new CSV.""" + df = _load_csv(input_csv) + + # Synthetic schema: resolution_method, resolution_summary + res_col = df["resolution_method"] if "resolution_method" in df.columns else df["close_code"] + notes_col = df["resolution_summary"] if "resolution_summary" in df.columns else df["close_notes"] + mask_auto_recovered = res_col.astype(str).str.contains("Auto Recover", case=False, na=False) + mask_event_cleared = notes_col.astype(str).str.contains("No Action Taken", case=False, na=False) + mask_event_cleared_ar = notes_col.astype(str).str.contains("auto recovered", case=False, na=False) + mask_event_cleared_ar_d = notes_col.astype(str).str.contains("auto-recovered", case=False, na=False) + mask_remove = mask_auto_recovered | mask_event_cleared | mask_event_cleared_ar | mask_event_cleared_ar_d + + filtered_df = df[~mask_remove] + filtered_df.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(filtered_df)}") + print(f"Auto Rows removed: {mask_remove.sum()}") + + +def 
filter_soft_solve(input_csv, output_csv): + """Filter incident data to keep only soft_solve rows.""" + df = _load_csv(input_csv) + + soft_solve_rows = df[df["solved_category"] == "soft_solve"] + + soft_solve_rows.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(soft_solve_rows)}") + print(f"Rows removed: {len(df) - len(soft_solve_rows)}") + + +def filter_problem_codes(input_csv, output_csv): + """Filter CSV to keep only rows with allowed problem codes.""" + df = _load_csv(input_csv) + + # Synthetic schema: fault_category + pc_col = df["fault_category"] if "fault_category" in df.columns else df["u_problem_code"] + filtered_df = df[pc_col.isin(ALLOWED_PROBLEM_CODES)] + + filtered_df.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(filtered_df)}") + print(f"Rows removed: {len(df) - len(filtered_df)}") + + +def filter_close_codes(input_csv, output_csv): + """Filter CSV to keep only rows with allowed close codes.""" + df = _load_csv(input_csv) + + # Synthetic schema: resolution_method + res_col = df["resolution_method"] if "resolution_method" in df.columns else df["close_code"] + filtered_df = df[res_col.isin(ALLOWED_CLOSE_CODES)] + + filtered_df.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(filtered_df)}") + print(f"Rows removed: {len(df) - len(filtered_df)}") + + +def main(): + parser = argparse.ArgumentParser(description="Filter incident CSV data based on specific rules.") + parser.add_argument( + "--input_csv", + type=str, + default="data/anonymized-Incidents_Last_6_Months.csv", + help="Path to the input CSV file containing incident data. Default: data/anonymized-Incidents_Last_6_Months.csv", + ) + parser.add_argument( + "--output_csv", + type=str, + default="data/filtered_file.csv", + help="Path to save the filtered CSV file. 
Default: data/filtered_file.csv", + ) + parser.add_argument("--filter_type", type=str, default="auto") + args = parser.parse_args() + + # Run the filtering process + if args.filter_type == "auto": + filter_auto(args.input_csv, args.output_csv) + elif args.filter_type == "soft_solve": + filter_soft_solve(args.input_csv, args.output_csv) + elif args.filter_type == "problem_codes": + filter_problem_codes(args.input_csv, args.output_csv) + elif args.filter_type == "close_codes": + filter_close_codes(args.input_csv, args.output_csv) + else: + parser.error(f"Unknown filter_type: {args.filter_type!r}. Choose from: auto, soft_solve, problem_codes, close_codes") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/filtering/match_keywords.py b/recipes/noc-reasoning-agent/scripts/filtering/match_keywords.py new file mode 100644 index 0000000000..3e46ccdfcd --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/filtering/match_keywords.py @@ -0,0 +1,353 @@ +import argparse +import re + +import pandas as pd +from tqdm import tqdm + + +def get_close_codes(): + """ + Parses the structured close code data into a dictionary of categories and keywords. 
+ """ + # This list contains your full set of close codes and their types (HW, SW, NA) + raw_data = """ + close_code,Type + Other,NA + Alarm Cleared,N/A + Auto Recovered,NA + Auto Recover,NA + Configuration Fixed,SW + Power Restored,N/A + IP Configuration Corrected,SW + Commercial Power Restored,SW + Cold Reboot,SW + Reset RU,SW + Configuration corrected,SW + Vendor Hardware Replaced,HW + Auto-Recovered,SW + Auto Reloaded,SW + Auto Restart,SW + Reset DU,SW + Software Fix,SW + Network Fix,N/A + MOP Corrected,SW + RU Reset,SW + WCS Fix,N/A + Restart VM,SW + Tool Fix,SW + Fiber Repaired/Replaced,HW + Change Configuration / parameters,SW + Corrected Software Path,SW + Reset ORU,SW + Antenna Cable Check,HW + Rebooted Chassis,SW + Performance Improvement,SW + Power Supply Restored,N/A + Corrected Configuration Error,SW + Cell Enable / Disable,SW + Vendor Hardware Repaired,HW + Tilt Changes,SW + Activity Completed,N/A + No Trouble Found,NA + Closed/Resolved by Caller,SW + Corrected Config Mismatch,SW + Restarted Chassis,SW + Not Within Coverage,NA + Restarted Process,SW + SFP Cable cleaned,HW + GNSS Fix,SW + KPI Verified,SW + DU Reboot,SW + Replaced Antenna,HW + Site Restored,N/A + Issue Fixed,N/A + Corrected wiring,HW + Replaced controller board,HW + Firmware upgrade,SW + GPS Cone Replacement,HW + Non-RF Issue,NA + Poor Indoor Coverage,N/A + Full Time Roaming,N/A + Rebooted NNI,SW + Hardware Replaced,HW + Cleaned Fiber,HW + Restarted ueconmgr,SW + Water Remediation,HW + Restart MVRP Services,SW + Updated Configuration,SW + SFP Replaced,SW + Payment completed,NA + Delete DU Pod,SW + New Site Deployment,HW + Device Changed,HW + Restarted gnbmgr,SW + Replaced SFP,HW + Sotware Bug Fix,SW + Cell locked / Unlocked,SW + Delete M-Plane and F1C IP,SW + Rebooted ESXi Host,SW + Replaced RU,HW + PowerCycle CSR,N/A + CU reset,HW + No Impact,NA + Cleared Disk space,SW + Replaced Hardware,HW + Unnotified Carrier Maintenance,NA + Restarted Network Card,SW + Replaced BMC Server,HW + 
Delete PTP Pod,SW + Replaced complete unit,HW + Delete MTCIL Pod,SW + Firmware/Software Configuration,SW + DU Replacement,HW + RET Adjustment,SW + Replaced Power Supply,HW + improved by Optimization,SW + Reinstantiated Site/CICD,N/A + Cancelled,NA + Replaced/repaired external cable,HW + Rolled Back Change,SW + Fiber/Connector Repaired/Replaced,HW + POD Reboot,SW + GPS Replaced,HW + GPS Cable replaced,HW + Application Reinstantiated,SW + Copper/Optic Cable Replaced,HW + Access Realignment,SW + Pods Restarted,SW + Memory adjusted,SW + DU Reset,SW + Delete Core Files,SW + Fiber Cable replaced,HW + BMC Reset,SW + Repaired connector,HW + Fiber Replaced,HW + Device Configuration corrected,SW + Power Equipment Repaired,HW + RU Replacement,HW + Hardware Replace,HW + Replaced Optical Card,HW + Re provisioned,SW + Replaced/Reseated Cabled,SW + Hybrid replaced,N/A + DU,N/A + Replaced Fan Module,HW + Fiber Replacement,HW + Replaced CSR,HW + Rebooted NID,SW + Sleepy CU-PODs restarted,SW + Electronic components replaced,HW + SFP/cable Replaced,HW + Repaired Cable/Connector,HW + Cleared by Government Officials,NA + Configuration Corrected/Updated,SW + Replaced GPS antenna,HW + Deleted Prior EC2,SW + Host Restart,SW + Software Stability,SW + Rerun CICD Pipeline,SW + Repalced the connector cables,HW + CSR Power Supply Replaced,HW + Restarted Manually,HW + Corrected plungers,HW + Restart Application,SW + RU/DU Restart,SW + Pod Reset,SW + CUCP Pod Swtichover to worker node,SW + HVAC repaired/replaced,HW + Cleaned Fan Filter,HW + SFP Cable Replaced,HW + Others,N/A + Solved Remotely (Permanently),SW + Replaced Network Card,HW + FE Rebooted,SW + Network Switch,N/A + Reset RET's,SW + Replaced Attenuator,HW + Power Cycle,N/A + Rollout Restart,N/A + Replaced Chassis,HW + Re-Deployment,SW + Initialized PODS,SW + Replaced NID,HW + CPU Replaced,HW + Corrected BGP Configuration,SW + Replaced the connector cables,HW + bccsvc Restart,SW + Rebooted Network Card,SW + Delete MVRP Logs,SW + DC Power 
Bounce,SW + POD Reboot - USM,SW + BMC Power Supply Replaced,HW + Reboot Dpp Pod,SW + Capacity Increase,SW + Sleepy CUCP restarted,SW + Hardware Restarted,SW + Cleared/Reformatted Disk 0,SW + Replaced /repaired internal cable,HW + Groundbars/Copper replaced,HW + Improved by Parameter Change,SW + Adding sdaas_ip incorrect in BPI or infoblox,SW + Fiber Sweep,HW + Restarted ngclientiwf,SW + Replaced fans,HW + Reconfigured BMC,SW + eCPRI Fiber Replaced,HW + Generator Deployement,HW + Restored Tripped Breaker,HW + Batteries Replaced,HW + Not a Sleepy CU,SW + Cleared/Reformatted Disk 1,SW + Reserved / Unreserved,N/A + Door swap,HW + RET motor replacement,HW + Fiber Connectivity Restored,N/A + Initialize Mplane,SW + Replaced polyphaser,HW + RU Software Reset,SW + Barred/ Not Barred,SW + Cabling Replaced,HW + Replaced RET cable,HW + Restarted sctpe1iwf,SW + Activated OCNS,SW + VM Reset,SW + Restarted bccsvc,SW + SMF restart,SW + RU software bug fix,SW + Cleared In Testing,NA + Solar Power Restored,SW + Restarted sctpf1iwf,SW + Updated Lat/long settings,SW + Restarted gwsvc,SW + Cabinet Replacement,HW + Disk Clean,SW + NID Replaced,HW + """ + + from io import StringIO + + codes_df = pd.read_csv(StringIO(raw_data)) + + categorized_codes = { + "Hardware": [code.lower().replace(" ", "") for code, type in codes_df.values if type == "HW"], + "Software": [code.lower().replace(" ", "") for code, type in codes_df.values if type == "SW"], + } + return categorized_codes + + +# Synthetic data uses different resolution_method values. Map them to physical_intervention vs soft_solve. 
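The `get_close_codes` helper above embeds its lookup table as CSV text in the source and parses it with `pandas.read_csv` over a `StringIO`. A minimal, self-contained sketch of that pattern (the table rows here are illustrative, not the production close-code list):

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row table in the same close_code,Type shape used above.
raw = """close_code,Type
Reset RU,SW
Replaced RU,HW
Other,NA
"""

codes_df = pd.read_csv(StringIO(raw))

# Normalize the same way as get_close_codes: lowercase, spaces stripped.
categorized = {
    "Hardware": [c.lower().replace(" ", "") for c, t in codes_df.values if t == "HW"],
    "Software": [c.lower().replace(" ", "") for c, t in codes_df.values if t == "SW"],
}

print(categorized)  # {'Hardware': ['replacedru'], 'Software': ['resetru']}
```

Normalizing at build time means later membership checks only need the same `lower().replace(" ", "")` transform on the incoming close code, with no regex required.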
+SYNTHETIC_RESOLUTION_TO_CATEGORY = {
+    "physical_intervention": [
+        "field dispatch required",
+        "escalated",
+    ],
+    "soft_solve": [
+        "resolved",
+        "issue corrected",
+        "service restored",
+        "partial resolution",
+        "pending resolution",
+    ],
+}
+
+
+def find_keyword_matches(row, keywords):
+    """Return True if the row's close code / resolution_method matches one of the normalized keywords."""
+    # Synthetic schema uses resolution_method (same meaning as close_code)
+    resolution_method = row.get("resolution_method", row.get("close_code", ""))
+    return bool(resolution_method) and str(resolution_method).lower().replace(" ", "") in keywords
+
+
+def categorize_incidents_by_close_code(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Categorizes incidents into 'physical_intervention' (hardware) or 'soft_solve' (software)
+    based on a structured list of close codes. Hardware matches take priority over software matches.
+    """
+    # Ensure required output columns exist
+    if "solved_category" not in df.columns:
+        df["solved_category"] = "Uncategorized"
+    if "solved_reason" not in df.columns:
+        df["solved_reason"] = ""
+
+    # Get the categorized lists of normalized close-code keywords
+    close_codes = get_close_codes()
+    hw_keywords = close_codes["Hardware"]
+    sw_keywords = close_codes["Software"]
+
+    # Convert all data to string type for safe searching
+    str_df = df.astype(str)
+
+    print("Categorizing incidents based on close codes...")
+    for idx, row in tqdm(str_df.iterrows(), total=len(df)):
+        # Prioritize Hardware: check for HW keywords first
+        if find_keyword_matches(row, hw_keywords):
+            df.at[idx, "solved_category"] = "physical_intervention"
+            continue  # Move to the next row once categorized
+
+        # If no HW keywords, check for SW keywords
+        if find_keyword_matches(row, sw_keywords):
+            df.at[idx, "solved_category"] = "soft_solve"
+            continue
+
+        # Synthetic data: resolution_method values not in the legacy close-code list
+        resolution_method = row.get("resolution_method", row.get("close_code", ""))
+        if resolution_method:
+            rm_normalized = str(resolution_method).strip().lower()
+            for category, values in SYNTHETIC_RESOLUTION_TO_CATEGORY.items():
+                if rm_normalized in values:
+                    df.at[idx, "solved_category"] = category
+                    break
+
+    hw_count = (df["solved_category"] == "physical_intervention").sum()
+    sw_count = (df["solved_category"] == "soft_solve").sum()
+    un_count = (df["solved_category"] == "Uncategorized").sum()
+
+    print("\n--- Categorization Complete ---")
+    print(f"Total rows processed: {len(df)}")
+    print(f"Physical Intervention: {hw_count}")
+    print(f"Soft Solve: {sw_count}")
+    print(f"Uncategorized: {un_count}")
+
+    return df
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Categorize incident CSV data by close code.")
+    parser.add_argument(
+        "--input_csv",
+        type=str,
+        default="data/anonymized-Incidents_Last_6_Months.csv",
+        help="Path to the input CSV file containing incident data. 
Default: data/anonymized-Incidents_Last_6_Months.csv", + ) + parser.add_argument( + "--output_csv", + type=str, + default="data/categorized_incidents.csv", + help="Path to save the filtered CSV file. Default: data/categorized_incidents.csv", + ) + args = parser.parse_args() + + input_file = args.input_csv + output_file = args.output_csv + + print(f"Loading data from {input_file}...") + # Use 'latin1' encoding if your CSV has special characters + df = pd.read_csv(input_file, encoding="latin1") + + # Run the categorization function + df = categorize_incidents_by_close_code(df) + + # Save the updated dataframe to a new CSV file + df.to_csv(output_file, index=False) + print(f"\nCategorized data saved to: {output_file}") diff --git a/recipes/noc-reasoning-agent/scripts/ns_pipelines/generate_synthetic_data.py b/recipes/noc-reasoning-agent/scripts/ns_pipelines/generate_synthetic_data.py new file mode 100644 index 0000000000..1807bdb7da --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/ns_pipelines/generate_synthetic_data.py @@ -0,0 +1,69 @@ +import argparse +import os + +from nemo_skills.pipeline.cli import generate, wrap_arguments + + +def generate_synthetic_data(args, cluster, num_gpus, step=None, input_format_file=None): + os.makedirs("outputs/sdg_reason", exist_ok=True) + generate( + ctx=wrap_arguments( + f"++prompt_config=/workspace/data/prompt_reasoning.yaml " + f"++inference.temperature={args.temperature} " + f"++inference.tokens_to_generate={args.tokens_to_generate} " + f"++code_execution=false " + f"++skip_filled=false " + f"++use_completions_api=true " + f"++input_file={input_format_file} " + ), + cluster=cluster, + server_type="vllm", + input_file=input_format_file, + output_dir="/workspace/outputs/sdg_reason/", + # output_dir=f"/workspace/outputs/sdg_reason/step_{step}", + expname="incident-generation", + model="openai/gpt-oss-120b", + rerun_done=True, + server_gpus=num_gpus, + ) + + print(f"Finished generating step {step}") + + +def 
generate_synthetic_data_oss_gpt(args, cluster, num_gpus): + """Generate synthetic data using an OSS GPT model (not yet implemented).""" + raise NotImplementedError("OSS GPT generation path is not yet implemented") + + +def main(): + """CLI entry point for synthetic data generation.""" + parser = argparse.ArgumentParser(description="Generate synthetic data using Qwen model") + parser.add_argument("--temperature", type=float, default=0.6, help="Inference temperature (default: 0.6)") + parser.add_argument( + "--tokens_to_generate", type=int, default=8192, help="Number of tokens to generate (default: 8192)" + ) + parser.add_argument( + "--num_gpus", type=int, default=8, help="Number of GPUs to use (default: 8)" + ) + parser.add_argument( + "--llm", type=str, default="qwen2.5-32b-instruct", + choices=["qwen2.5-32b-instruct"], + help="The LLM to use for generation", + ) + + args = parser.parse_args() + cluster = "local" + + num_gpus = args.num_gpus + print(f"Using {num_gpus} GPUs (specified via --num_gpus)") + + if args.llm == "qwen2.5-32b-instruct": + generate_synthetic_data( + args, cluster, num_gpus, step=1, input_format_file="/workspace/outputs/sdg/formatted_output.json" + ) + else: + generate_synthetic_data_oss_gpt(args, cluster, num_gpus) + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/ns_pipelines/prepare_react_agent.py b/recipes/noc-reasoning-agent/scripts/ns_pipelines/prepare_react_agent.py new file mode 100644 index 0000000000..c77c737009 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/ns_pipelines/prepare_react_agent.py @@ -0,0 +1,150 @@ +import argparse +import json +import random +import re + +import yaml +from scripts.tools import ALL_TOOLS_STRING + + +def extract_number_from_input(input_text): + """ + Extracts the incident identifier from the 'input' field using regex. + Supports synthetic IDs (e.g. INCME-100001) and legacy (INCWLS...). + """ + # Match any non-whitespace after "Number:" (e.g. 
INCME-100001, INCWLS0873337)
+    match = re.search(r"Number:\s*(\S+)", input_text)
+    if match:
+        return match.group(1)
+    return None
+
+
+def get_tools(text):
+    """Extract paired <tool_call>/<tool_response> blocks from text into a dict keyed by tool name."""
+    matches = {}
+
+    # Find all <tool_call>...</tool_call> and <tool_response>...</tool_response> blocks
+    tool_calls = re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL)
+    tool_responses = re.findall(r"<tool_response>(.*?)</tool_response>", text, flags=re.DOTALL)
+    if len(tool_calls) != len(tool_responses):
+        raise ValueError(f"Mismatch: {len(tool_calls)} tool_calls vs {len(tool_responses)} tool_responses")
+    for tool_block, response_block in zip(tool_calls, tool_responses):
+        # Extract the JSON portion inside the tags
+        tool_data = json.loads(tool_block.strip())
+        response = response_block.strip()
+        tool_name = tool_data["name"]
+        arguments = tool_data["arguments"]
+
+        matches[tool_name] = {"arguments": arguments, "response": response}
+
+    if not matches:
+        return None
+    return matches
+
+
+def main(file1_path, file2_path, prompt_config, output_path="output.jsonl"):
+    # Load first JSONL: keyed by 'number' (extracted if needed)
+    data1 = {}
+    with open(prompt_config, "r") as f:
+        prompt_template = yaml.safe_load(f)
+
+    system_prompt = prompt_template["system"]
+    with open(file1_path, "r", encoding="utf-8") as f1:
+        for line in f1:
+            line = line.strip()
+            if line:
+                try:
+                    d = json.loads(line)
+                    number = d.get("incident_identifier", d.get("number"))
+                    if d.get("expected") or "Close Code: [" in d.get("response", ""):
+                        matches = get_tools(d.get("initial_background", ""))
+                        if matches is None:
+                            print(f"No tools for incident {number}, skipping")
+                            continue
+                        d["tool_matches"] = matches
+                        formatted_prompt = prompt_template["user"].format(**d)
+                        if formatted_prompt.endswith("\n\n"):
+                            formatted_prompt = formatted_prompt[: -len("\n\n")]
+ 
d["formatted_input"] = formatted_prompt
+                        data1[number] = d
+                except json.JSONDecodeError as e:
+                    print(f"Error decoding JSON in file1: {e}")
+
+    # Load second JSONL: keyed by 'number'
+    data2 = {}
+    with open(file2_path, "r", encoding="utf-8") as f2:
+        for line in f2:
+            line = line.strip()
+            if line:
+                try:
+                    d = json.loads(line)
+                    input_string = d["input"]
+                    output_string = d["output"]
+                    match = re.search(r"Number:\s*(\S+)", input_string)
+                    if match:
+                        number = match.group(1)
+                    else:
+                        raise ValueError("No incident identifier match found in input")
+                    data2[number] = [input_string, output_string]
+                except json.JSONDecodeError as e:
+                    print(f"Error decoding JSON in file2: {e}")
+
+    # Build consolidated results for matching numbers
+    results = []
+    for num in data1.keys():
+        if num in data2:
+            used_tools = data1[num]["tool_matches"]
+
+            consolidated = {}
+            if used_tools is None:
+                print(f"No tools recorded for incident {num}")
+            else:
+                for tool in ALL_TOOLS_STRING:
+                    if tool in used_tools:
+                        consolidated[tool] = used_tools[tool]["response"]
+
+            consolidated["system"] = system_prompt
+            consolidated["input"] = data1[num]["formatted_input"]
+            consolidated["expected"] = data1[num].get("expected", data2[num][1])
+
+            results.append(consolidated)
+
+    random.shuffle(results)
+    with open(output_path, "w", encoding="utf-8") as out_file:
+        for res in results:
+            out_file.write(json.dumps(res) + "\n")
+    print(f"Consolidated output written to {output_path}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Consolidate data from two JSONL files.")
+    parser.add_argument("file1", help="Path to the first JSONL file (with input, output, expected_answer)")
+    parser.add_argument("file2", help="Path to the second JSONL file (with check answers)")
+    parser.add_argument("--prompt_config", 
default="data/prompts/prompt_incident.yaml") + parser.add_argument( + "--output", default="output.jsonl", help="Path to the output JSONL file (default: output.jsonl)" + ) + args = parser.parse_args() + main(args.file1, args.file2, args.prompt_config, args.output) diff --git a/recipes/noc-reasoning-agent/scripts/tools.py b/recipes/noc-reasoning-agent/scripts/tools.py new file mode 100644 index 0000000000..d25584096c --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/tools.py @@ -0,0 +1,220 @@ +import pandas as pd + + +def _safe_tool_val(val, default: str = "NotApplicable") -> str: + """Return a string safe for JSON; pandas NaN and None become default.""" + if val is None: + return default + if isinstance(val, float) and (val != val or val == float("inf") or val == float("-inf")): + return default + return str(val) + + +def query_alarm(row: pd.Series, site_or_element_id: str = "", **kwargs) -> str: + """ + Queries the alarm management system to retrieve current alarm details, + severity, and active time. + + Input: site_or_element_id + Returns: Alarm status (active/cleared), severity, timestamp, description. + """ + return _safe_tool_val(row.get("query_alarm", row.get("Check_Alarm_Status", "NotApplicable"))) + + +def query_resource_health(row: pd.Series, element_id: str = "", **kwargs) -> str: + """ + Polls monitoring systems (EMS/NMS/Telemetry) to retrieve device health + metrics such as CPU, memory, interface status, and cell state. + + Input: element_id + Returns: Health report (e.g., 'All systems operational' / 'Fault detected'). + """ + return _safe_tool_val(row.get("query_resource_health", row.get("Check_Element_Health", "NotApplicable"))) + + +def query_performance(row: pd.Series, metric_type: str = "", **kwargs) -> str: + """ + Fetches KPIs from monitoring tools. Reports if trends are in line with + expectations or not. Supports metrics like PRB utilization, throughput, + handover stats, VoLTE KPIs, signaling load, etc. 
+ + Input: metric_type (e.g., 'prb_utilization', 'throughput', 'volte_kpi') + Returns: KPI status (OK / NOK / trend analysis). + """ + return _safe_tool_val(row.get("query_performance", row.get("Check_Performance", "NotApplicable"))) + + +def query_topology(row: pd.Series, element_id: str = "", **kwargs) -> str: + """ + Verifies neighbors, adjacencies, upstream devices, and identifies common + alarms in an area. Also maps affected services through topology. + + Input: element_id + Returns: Adjacent elements with alarm status, affected services. + """ + return _safe_tool_val(row.get("query_topology", row.get("Check_Element_Neighbors", "NotApplicable"))) + + +def execute_remote_action(row: pd.Series, element_id: str = "", action: str = "", **kwargs) -> str: + """ + Executes remote CLI commands (SSH/Netconf) for resets, restarts, + failovers, and other operational actions on network elements. + + Input: element_id, action (e.g., 'enodeb_reset', 'generator_start', 'sctp_reset') + Returns: Execution result (Success/Fail). + """ + return _safe_tool_val(row.get("execute_remote_action", row.get("Execute_Remote_Action", "NotApplicable"))) + + +def apply_configuration(row: pd.Series, element_id: str = "", config_type: str = "", **kwargs) -> str: + """ + Retrieves, validates, and pushes configuration changes. Supports + parameter adjustments, load balancing, HVAC settings, routing changes, etc. + + Input: element_id, config_type (e.g., 'load_balancing', 'hvac_setpoint') + Returns: Configuration Verified/Applied or error. + """ + return _safe_tool_val(row.get("apply_configuration", row.get("Check_Apply_Configuration", "NotApplicable"))) + + +def run_diagnostics(row: pd.Series, diagnostic_type: str = "", **kwargs) -> str: + """ + Runs diagnostic scripts including config audits, OTDR tests, kubelet + checks, resource usage analysis, and more. + + Input: diagnostic_type (e.g., 'config_audit', 'otdr', 'kubelet') + Returns: Diagnostic report summary. 
+ """ + return _safe_tool_val(row.get("run_diagnostics", row.get("Triage_Toolkit_Tool", "NotApplicable"))) + + +def inspect_logs(row: pd.Series, log_type: str = "", **kwargs) -> str: + """ + Connects to devices or containers and reviews system logs, dump files, + and event records to identify root cause. + + Input: log_type (e.g., 'container', 'bbu_system', 'routing', 'access_control') + Returns: Log analysis (issues identified / no issues). + """ + return _safe_tool_val(row.get("inspect_logs", row.get("Check_remote_files", "NotApplicable"))) + + +def create_trouble_ticket(row: pd.Series, priority: str = "", team: str = "", issue_details: str = "", **kwargs) -> str: + """ + Logs and routes tickets to departments via the ticketing system. + Supports priority levels and team-specific routing. + + Input: priority, team, issue_details + Returns: Ticket ID. Once resolved, outputs the solution used. + """ + return _safe_tool_val(row.get("create_trouble_ticket", row.get("Create_Ticket", "NotApplicable"))) + + +def verify_recovery(row: pd.Series, element_id: str = "", **kwargs) -> str: + """ + Final verification step — confirms service restoration, checks alarm + clearance, and validates operational status. + + Input: element_id or service_id + Returns: Recovery status (recovered / degraded / ongoing). + """ + return _safe_tool_val(row.get("verify_recovery", "NotApplicable")) + + +def query_external_factors(row: pd.Series, site_or_area: str = "", **kwargs) -> str: + """ + Scans for external factors like weather conditions, fiber cuts, utility + outages, maintenance windows, and scheduled work. + + Input: site_or_area + Returns: External issue report. 
+ """ + return _safe_tool_val(row.get("query_external_factors", row.get("Check_External_Issues", "NotApplicable"))) + + +def orchestrate_workload(row: pd.Series, action: str, workload_type: str = "", **kwargs) -> str: + """ + Automates container/VM operations via Kubernetes or orchestrator: + restart, scale, migrate, failover, drain, etc. + + Input: action (e.g., 'restart', 'scale', 'migrate'), workload_type (e.g., 'pod', 'vm', 'cnf') + Returns: Operation status (Successful/Unsuccessful). + """ + return _safe_tool_val(row.get("orchestrate_workload", row.get("Orchestration_tool", "NotApplicable"))) + + +def query_power_system(row: pd.Series, target: str = "", **kwargs) -> str: + """ + Queries power infrastructure status including UPS, battery, generator, + rectifier, and HVAC systems. + + Input: target (e.g., 'ups', 'battery', 'generator', 'hvac', 'rectifier') + Returns: Power system status and readings. + """ + return _safe_tool_val(row.get("query_power_system", "NotApplicable")) + + +def query_rf_status(row: pd.Series, sector_or_antenna_id: str = "", **kwargs) -> str: + """ + Queries RF chain status including antenna health, PA status, VSWR, + beamforming, RET controller, and signal measurements. + + Input: sector_or_antenna_id + Returns: RF status report. + """ + return _safe_tool_val(row.get("query_rf_status", "NotApplicable")) + + +def query_container_status(row: pd.Series, container_type: str = "", **kwargs) -> str: + """ + Queries Kubernetes pod/node/container state from the orchestrator. + + Input: container_type (e.g., 'pod', 'node', 'control_plane', 'cnf', 'service_mesh') + Returns: Container/pod state details. + """ + return _safe_tool_val(row.get("query_container_status", "NotApplicable")) + + +def verify_signaling_path(row: pd.Series, interface: str = "", **kwargs) -> str: + """ + Tests signaling protocol paths: SCTP, SIP, Diameter, GTP, SIGTRAN, + and other control plane interfaces. 
+ + Input: interface (e.g., 'SCTP', 'SIP', 'GTP', 'Diameter') + Returns: Path status (healthy / degraded / down). + """ + return _safe_tool_val(row.get("verify_signaling_path", "NotApplicable")) + + +def test_connectivity(row: pd.Series, test_type: str = "", **kwargs) -> str: + """ + Tests IP/network connectivity: ICMP ping, backhaul, CPRI link, + peer connectivity, end-to-end path quality. + + Input: test_type (e.g., 'backhaul', 'icmp', 'cpri', 'peer_connectivity') + Returns: Connectivity test results. + """ + return _safe_tool_val(row.get("test_connectivity", "NotApplicable")) + + +ALL_TOOLS = [ + query_alarm, + query_resource_health, + query_performance, + query_topology, + execute_remote_action, + apply_configuration, + run_diagnostics, + inspect_logs, + create_trouble_ticket, + verify_recovery, + query_external_factors, + orchestrate_workload, + query_power_system, + query_rf_status, + query_container_status, + verify_signaling_path, + test_connectivity, +] + +ALL_TOOLS_STRING = [tool.__name__ for tool in ALL_TOOLS] diff --git a/recipes/noc-reasoning-agent/scripts/utils/create_input_jsonl_from_incidents.py b/recipes/noc-reasoning-agent/scripts/utils/create_input_jsonl_from_incidents.py new file mode 100644 index 0000000000..de21c18d8b --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/create_input_jsonl_from_incidents.py @@ -0,0 +1,182 @@ +import argparse +import json +import sys +from pathlib import Path +from typing import List + +import pandas as pd + +from scripts.utils.reasoning_processes import get_reasoning_process_for_fault_category +from scripts.utils.schema_columns import INCIDENT_ID_COLUMN, REQUIRED_COLUMNS + +# For filter by problem code: synthetic uses fault_category (workflow IDs). Keep in sync with filter_rows.py. 
+ALLOWED_PROBLEM_CODES = [ + # Legacy u_problem_code values + "Service-off", + "Degraded Prach", + "Offline / Unreachable", + "Disabled Cells", + "Node Down", + "Site Not Scrolling", + "Sleepy Cell", + "VM is in not ready state", + "Prach 0", + "N2 Link Down", + "ueconmgr pod restarted", + "CSR Not Reachable", + "Circuit Down", + "Link Down", + "GPS Sync", + "MTA Alert", + # Synthetic fault_category values + "power_ac_failure_recovery", + "power_dc_rectifier_recovery", + "power_battery_discharge_response", + "power_generator_failure_recovery", + "env_high_temperature_response", + "env_hvac_fault_recovery", + "env_water_intrusion_response", + "env_battery_temperature_response", + "env_cabinet_intrusion_response", + "ran_software_upgrade_recovery", + "ran_cell_site_down_recovery", + "ran_interference_mitigation", + "ran_speed_complaint_resolution", + "ran_voice_quality_resolution", + "ran_sector_outage_recovery", + "ran_prb_availability_resolution", + "ran_cell_overshooting_correction", + "ran_rru_communication_recovery", + "ran_dropped_calls_resolution", + "ran_parameter_correction", + "ran_antenna_tilt_recovery", + "ran_vswr_alarm_resolution", + "ran_handover_failure_resolution", + "ran_backhaul_degradation_resolution", + "ran_cell_congestion_management", + "ran_device_issue_resolution", + "compute_vm_failure_recovery", + "compute_container_crash_recovery", + "compute_orchestrator_recovery", + "compute_image_pull_recovery", + "compute_k8s_node_recovery", + "compute_storage_failure_recovery", + "compute_cnf_pod_recovery", + "compute_resource_exhaustion_resolution", + "transport_routing_flap_resolution", + "transport_microwave_degradation_response", + "transport_interface_errors_resolution", + "transport_packet_loss_resolution", + "signaling_routing_failure_recovery", + "signaling_delay_resolution", + "signaling_s1_n2_recovery", + "signaling_sip_registration_recovery", +] + + +def coerce_to_str(value) -> str: + if pd.isna(value): + return "" + return str(value) + + 
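Filtering a frame down to the allow-listed fault categories above is a one-liner with `Series.isin`, the same idiom `filter_problem_codes` uses. A small illustration with made-up incident rows (the IDs are hypothetical; the category values come from the list above):

```python
import pandas as pd

allowed = ["ran_cell_site_down_recovery", "power_ac_failure_recovery"]

df = pd.DataFrame(
    {
        "incident_identifier": ["INC-1", "INC-2", "INC-3"],
        "fault_category": [
            "ran_cell_site_down_recovery",
            "unknown_category",
            "power_ac_failure_recovery",
        ],
    }
)

# Boolean mask: True where fault_category appears in the allow-list.
filtered = df[df["fault_category"].isin(allowed)]
print(list(filtered["incident_identifier"]))  # ['INC-1', 'INC-3']
```

Because `isin` does exact matching, the allow-list must stay byte-for-byte in sync with the values emitted by the synthetic generator, which is why the comment above warns to keep it aligned with `filter_rows.py`.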
+def extract_examples_by_problem_code(df, num): + fc_col = "fault_category" if "fault_category" in df.columns else "u_problem_code" + subset = df[df[fc_col].isin(ALLOWED_PROBLEM_CODES)] + sampled = [] + for _, group in subset.groupby(fc_col, group_keys=False): + sampled.append(group.sample(n=min(num, len(group)), random_state=42)) + return pd.concat(sampled, ignore_index=True) if sampled else subset.head(0) + + +def main(): + parser = argparse.ArgumentParser(description="Convert incidents CSV to input.jsonl for Qwen prompt") + parser.add_argument("--input", required=True, help="Path to anonymized-Incidents_Last_6_Months.csv") + parser.add_argument("--output", default="input.jsonl", help="Path to output JSONL (default: input.jsonl)") + parser.add_argument("--limit", required=False, help="Max number of rows to export") + parser.add_argument("--required_incidents", required=False, type=str, help="Path to required incidents file") + parser.add_argument("--examples_by_problem_code", required=False, type=int) + + args = parser.parse_args() + + input_path = Path(args.input) + output_path = Path(args.output) + + if not input_path.exists(): + print(f"ERROR: Input CSV not found: {input_path}", file=sys.stderr) + sys.exit(1) + + # Read CSV (handle BOM if present) + try: + df = pd.read_csv(input_path, dtype=str, encoding="utf-8-sig") + except UnicodeDecodeError: + df = pd.read_csv(input_path, dtype=str, encoding="latin1") + + if args.examples_by_problem_code is not None: + df = extract_examples_by_problem_code(df, args.examples_by_problem_code) + print(f"Extracted {len(df)} examples from the dataset.") + + # Reasoning process from fault_category (synthetic) or u_problem_code (legacy) + fc_col = "fault_category" if "fault_category" in df.columns else "u_problem_code" + df["problem_code_reasoning_process"] = df[fc_col].map(get_reasoning_process_for_fault_category) + + # time_to_resolve: synthetic uses intervention_began / intervention_completed + start_col = "intervention_began" 
if "intervention_began" in df.columns else "work_start" + end_col = "intervention_completed" if "intervention_completed" in df.columns else "work_end" + try: + df[start_col] = pd.to_datetime(df[start_col], errors="coerce") + df[end_col] = pd.to_datetime(df[end_col], errors="coerce") + df["time_to_resolve"] = (df[end_col] - df[start_col]).dt.total_seconds().fillna(0).astype(int) + except (ValueError, TypeError) as e: + print(f"Failed to parse datetime columns {start_col}, {end_col}: {e}") + df["time_to_resolve"] = 0 + + original_cols_lower = {c.lower(): c for c in df.columns} + + # Ensure all required columns exist (case-insensitive match). If missing, create empty. + missing: List[str] = [] + selected = {} + for col in REQUIRED_COLUMNS: + # Match case-insensitively + actual = original_cols_lower.get(col.lower()) + if actual is None: + # Create empty column if missing + missing.append(col) + df[col] = "" + selected[col] = col + else: + selected[col] = actual + + if missing: + print( + "WARNING: Missing columns in CSV; blank fields will be used: " + ", ".join(missing), + file=sys.stderr, + ) + + if args.required_incidents: + with open(args.required_incidents, "r") as f: + required_incidents = f.read().splitlines() + id_col = INCIDENT_ID_COLUMN if INCIDENT_ID_COLUMN in df.columns else "number" + df = df[df[id_col].astype(str).isin(required_incidents)] + + # Limit rows + if args.limit is not None and int(args.limit) > 0: + df = df.head(int(args.limit)) + + # Prepare records + records = [] + for _, row in df.iterrows(): + rec = {key: coerce_to_str(row[selected[key]]) for key in REQUIRED_COLUMNS} + records.append(rec) + + # Write JSONL + output_path.parent.mkdir(parents=True, exist_ok=True) + with output_path.open("w", encoding="utf-8") as f: + for rec in records: + f.write(json.dumps(rec, ensure_ascii=False) + "\n") + + print(f"Wrote {len(records)} records to {output_path}.") + + +if __name__ == "__main__": + main() diff --git 
a/recipes/noc-reasoning-agent/scripts/utils/format_reasoning_json.py b/recipes/noc-reasoning-agent/scripts/utils/format_reasoning_json.py
new file mode 100644
index 0000000000..93bbd324b3
--- /dev/null
+++ b/recipes/noc-reasoning-agent/scripts/utils/format_reasoning_json.py
@@ -0,0 +1,536 @@
+import argparse
+import copy
+import json
+import os
+
+from tqdm import tqdm
+from transformers import AutoTokenizer
+
+
+def _incident_id(data):
+    """Synthetic schema uses incident_identifier; legacy uses number."""
+    incident_id = data.get("incident_identifier") or data.get("number")
+    if incident_id is None:
+        raise ValueError(f"Missing incident identifier in data: {list(data.keys())}")
+    return incident_id
+
+
+def _resolution_method(data):
+    """Synthetic schema uses root_cause_secondary; legacy uses close_code."""
+    return data.get("root_cause_secondary") or data.get("close_code", "")
+
+
+def extract_formatted_json_steps(input_file):
+    """
+    Extracts the formatted JSON array of steps from each generation in a JSONL file.
+
+    Args:
+        input_file (str): Path to a JSONL file whose "generation" fields embed
+            a JSON array of step dictionaries in surrounding text.
+
+    Returns:
+        dict: Maps each incident identifier to its parsed list of step
+            dictionaries. Lines without a parseable JSON array are skipped.
+    """
+
+    responses = {}
+    with open(input_file, "r", encoding="utf-8") as f:
+        for line in f:
+            try:
+                data = json.loads(line)
+                if not data:
+                    continue
+                text = data["generation"]
+
+                number = _incident_id(data)
+
+                try:
+                    # Start searching after the last '<|message|>' marker, if present
+                    marker_index = text.rfind("<|message|>")
+                    if marker_index != -1:
+                        text = text[marker_index + len("<|message|>") :]
+                    # Find the start '[' and the last ']' of the JSON array
+                    start_index = text.find("[")
+                    end_index = text.rfind("]")
+
+                    if start_index != -1 and end_index != -1:
+                        # Slice the string to get only the JSON part
+                        json_string = text[start_index : end_index + 1]
+
+                        # Parse the JSON string into a Python object
+                        parsed_json = json.loads(json_string)
+                        responses[number] = parsed_json
+                    else:
+                        print(text)
+                        print("Error: Could not find the start '[' or end ']' of the JSON array.")
+                        continue
+                except json.JSONDecodeError as e:
+                    print(text)
+                    print(f"Error decoding JSON: {e}")
+            except json.JSONDecodeError:
+                print(f"Skipping invalid line: {line.strip()}")
+
+    return responses
+
+
+def extract_final_thinking_processes(input_file):
+    responses = {}
+    with open(input_file, "r", encoding="utf-8") as f:
+        for line in f:
+            data = json.loads(line)
+            text = data["generation"]
+            number = _incident_id(data)
+            step_number = data["step_number"]
+            if number not in responses:
+                responses[number] = {}
+
+            thinking = text[text.rfind("final<|message|>") + len("final<|message|>") :]
+            data["generation"] = thinking
+            responses[number][step_number] = thinking
+
+    return responses
+
+
+def prepare_data_for_reasoning_traces(jsonl_file, input_file, output_file):
+    formatted_steps_taken = extract_formatted_json_steps(input_file)
+    new_jsonl = []
+
+    incorrect_incidents = 0
+    # Read the file line by line
+    with open(jsonl_file, "r", encoding="utf-8") as f:
+        for line in f:
+            data = json.loads(line)
+            if not data:
+                continue
+            number = _incident_id(data)
+
+            if number in formatted_steps_taken:
+
formatted_steps = formatted_steps_taken[number]
+                current_conclusion = ""
+                for i in range(len(formatted_steps)):
+                    sub_data = copy.deepcopy(data)
+                    current_steps = formatted_steps[i]
+                    sub_data["step_number"] = current_steps["step_number"]
+                    sub_data["background_context"] = current_conclusion
+                    conclusion_called = f"Step {current_steps['step_number']} {current_steps['sop_step_title']} {current_steps['status']}.\nAction taken: {current_steps['action_taken']}\n"
+                    tool_response = ""
+                    if current_steps["tool_call"]:
+                        conclusion_called += f"Tool called: {current_steps['tool_call']}\n"
+                        tool_response = f"Tool response: {current_steps['result']}\n"
+                    else:
+                        conclusion_called += "No tool call needed.\n"
+                    sub_data["outcome"] = conclusion_called
+                    new_jsonl.append(sub_data)
+                    current_conclusion += conclusion_called + tool_response
+            else:
+                incorrect_incidents += 1
+
+    print(f"{incorrect_incidents} incidents were not parsed correctly and discarded.")
+
+    with open(output_file, "w", encoding="utf-8") as f:
+        for line in new_jsonl:
+            json.dump(line, f)
+            f.write("\n")
+
+    print(f"Wrote {len(new_jsonl)} entries to {output_file}")
+
+
+def token_converting(string, model):
+    """
+    Converts a shorthand tool command like:
+        Check_Alarm_Status[site-123]
+    into a structured tool-call dict of the form
+    {"name": ..., "arguments": {...}} for the Qwen3-32B chat template.
+    """
+    if model != "qwen32":
+        return string  # fallback for other models
+
+    import re
+
+    # --- 1. Parse tool name and the raw arguments inside [...] or (...)
--- + # Match "ToolName[args]" or "ToolName[ args ]" + m = re.match(r"^\s*([A-Za-z_]\w*)\s*\[(.*)\]\s*$", str(string), re.DOTALL) + + if not m: + # Also accept parenthesis format: ToolName(args) or ToolName() + m = re.match(r"^\s*([A-Za-z_]\w*)\s*\((.*)\)\s*$", str(string), re.DOTALL) + + if not m: + m_no_args = re.match(r"^\s*([A-Za-z_]\w*)\s*[\[\(]\s*[\]\)]\s*$", str(string)) + if m_no_args: + tool_name = m_no_args.group(1) + raw_args = "" + else: + return string + else: + tool_name, raw_args = m.groups() + + # --- 2. Smart Splitter --- + # Splits by commas, but ignores commas inside single/double quotes. + # e.g. "dept, 'Error in rack 1, shelf 2'" -> ["dept", "'Error in rack 1, shelf 2'"] + parts = re.split(r'\s*,\s*(?=(?:[^\'"]|\'[^\']*\'|"[^"]*")+$)', raw_args.strip()) if raw_args.strip() else [] + + # --- 3. Normalize Tokens --- + kv_args = {} + pos_args = [] + + for p in parts: + if not p: + continue + # Check for key=value or key: value + if ("=" in p or ":" in p) and not (p.startswith("'") or p.startswith('"')): + k, v = re.split(r"\s*[:=]\s*", p, maxsplit=1) + v = v.strip().strip('"').strip("'") + kv_args[k.strip()] = v + else: + pos_args.append(p.strip().strip('"').strip("'")) + + # Helper to enforce positional argument counts + def req_pos(n, arg_name="argument"): + if len(pos_args) < n: + raise ValueError( + f"{tool_name} requires at least {n} value(s) (missing {arg_name}); got {len(pos_args)} in: {string}" + ) + + # --- 4. Tool-Specific Argument Mapping --- + # When no arguments are provided (model used tool_name() format), use "all" + # as default so the pipeline doesn't lose the entire incident. 
+
+    def _first_pos(key, named_key=None):
+        """Return the named arg (named_key, falling back to key), else the first positional arg, else 'all'."""
+        val = kv_args.get(named_key or key)
+        if val:
+            return val
+        return pos_args[0] if pos_args else "all"
+
+    arg_dict = {}
+
+    if tool_name == "query_alarm":
+        arg_dict = {"site_or_element_id": _first_pos("site_or_element_id")}
+
+    elif tool_name == "query_resource_health":
+        arg_dict = {"element_id": _first_pos("element_id")}
+
+    elif tool_name == "query_performance":
+        arg_dict = {"metric_type": _first_pos("metric_type")}
+
+    elif tool_name == "query_topology":
+        arg_dict = {"element_id": _first_pos("element_id")}
+
+    elif tool_name == "execute_remote_action":
+        elem = kv_args.get("element_id") or (pos_args[0] if pos_args else "all")
+        act = kv_args.get("action") or (pos_args[1] if len(pos_args) > 1 else "default_action")
+        arg_dict = {"element_id": elem, "action": act}
+
+    elif tool_name == "apply_configuration":
+        elem = kv_args.get("element_id") or (pos_args[0] if pos_args else "all")
+        cfg = kv_args.get("config_type") or (pos_args[1] if len(pos_args) > 1 else None)
+        arg_dict = {"element_id": elem}
+        if cfg:
+            arg_dict["config_type"] = cfg
+
+    elif tool_name == "run_diagnostics":
+        arg_dict = {"diagnostic_type": _first_pos("diagnostic_type")}
+
+    elif tool_name == "inspect_logs":
+        arg_dict = {"log_type": _first_pos("log_type")}
+
+    elif tool_name == "create_trouble_ticket":
+        pri = kv_args.get("priority") or (pos_args[0] if pos_args else "medium")
+        team = kv_args.get("team") or (pos_args[1] if len(pos_args) > 1 else "unknown")
+        details = kv_args.get("issue_details") or (", ".join(pos_args[2:]) if len(pos_args) > 2 else "No details provided")
+        arg_dict = {"priority": pri, "team": team, "issue_details": details}
+
+    elif tool_name == "verify_recovery":
+        arg_dict = {"element_id": _first_pos("element_id")}
+
+    elif tool_name == "query_external_factors":
+        arg_dict = {"site_or_area": _first_pos("site_or_area")}
+
+    elif tool_name ==
"orchestrate_workload": + act = kv_args.get("action") or (pos_args[0] if pos_args else "default") + typ = kv_args.get("type") or (pos_args[1] if len(pos_args) > 1 else None) + arg_dict = {"action": act} + if typ: + arg_dict["type"] = typ + + elif tool_name == "query_power_system": + arg_dict = {"target": _first_pos("target")} + + elif tool_name == "query_rf_status": + arg_dict = {"sector_or_antenna_id": _first_pos("sector_or_antenna_id")} + + elif tool_name == "query_container_status": + arg_dict = {"type": _first_pos("type")} + + elif tool_name == "verify_signaling_path": + arg_dict = {"interface": _first_pos("interface")} + + elif tool_name == "test_connectivity": + arg_dict = {"test_type": _first_pos("test_type")} + + # --- Fallback for unknown tools --- + else: + if kv_args: + arg_dict = kv_args + elif pos_args: + arg_dict = {"args": pos_args} if len(pos_args) > 1 else {"argument": pos_args[0]} + else: + arg_dict = {} + + # --- 5. Construct XML Output --- + json_call = {"name": tool_name, "arguments": arg_dict} + return json_call + + +def merge_reasoning_steps(steps_taken, reasoning_steps, model="qwen32"): + broken_numbers = [] + for number in steps_taken: + if number in reasoning_steps: + # fix tool calling + try: + for i in range(len(steps_taken[number])): + if steps_taken[number][i]["tool_call"]: + steps_taken[number][i]["tool_call"] = token_converting( + steps_taken[number][i]["tool_call"], model + ) + steps_taken[number][i]["thinking"] = reasoning_steps[number][steps_taken[number][i]["step_number"]] + except (KeyError, ValueError) as e: + print(f"Error merging steps for incident {number}: {e}") + broken_numbers.append(number) + + for number in broken_numbers: + del steps_taken[number] + + return steps_taken + + +SFT_DUMMY_USER = "DUMMY_USER_FOR_SFT" +SFT_ASSISTANT_SENTINEL = "<<>>" + + +def compute_prefix_len_for_dummy_user(tokenizer): + messages = [ + {"role": "user", "content": SFT_DUMMY_USER}, + ] + rendered = tokenizer.apply_chat_template( + messages, 
+        tokenize=False,
+        add_special_tokens=False,
+        add_generation_prompt=False,
+    )
+
+    # Callers slice rendered chat strings at this index to drop the dummy-user prefix
+    return len(rendered)
+
+
+def qwen_token_converter(data, full_reasoning_steps, tokenizer=None):
+    curriculum_learning_stages = {}
+    turn = 0
+    total_tokens = 0
+    pre_compute_idx = compute_prefix_len_for_dummy_user(tokenizer)
+    current_assistant_content = [{"role": "user", "content": SFT_DUMMY_USER}]
+
+    for i in range(len(full_reasoning_steps)):
+        step = full_reasoning_steps[i]
+
+        thinking = step.get("thinking", "")
+        status = step.get("status", "")
+        title = step.get("sop_step_title", "")
+        action = step.get("action_taken", "")
+        tool_call = step.get("tool_call", "")
+        result = step.get("result", "")
+        step_text = f"\n{thinking} {status} {title}: {action}\n\n"
+
+        # Construct the text for this specific step; the reasoning text is
+        # injected directly into the assistant content
+        response_message = [{"role": "user", "content": SFT_DUMMY_USER}]
+        sub_data = copy.deepcopy(data)
+
+        # --- CASE A: Tool Call Triggered ---
+        if tool_call:
+            # Response String
+            response_message.append(
+                {
+                    "role": "assistant",
+                    "content": step_text,
+                    "tool_calls": [{"type": "function", "function": tool_call}],
+                }
+            )
+            raw_response = tokenizer.apply_chat_template(
+                response_message, tokenize=False, add_special_tokens=False, add_generation_prompt=False
+            )
+            cleaned_response = raw_response[pre_compute_idx:]
+            sub_data["response"] = cleaned_response
+
+            # Background String
+            raw_background = tokenizer.apply_chat_template(
+                current_assistant_content, tokenize=False, add_special_tokens=False, add_generation_prompt=False
+            )
+            cleaned_background = raw_background[pre_compute_idx:]
+            sub_data["background"] = cleaned_background
+
+            # Next Context
+            current_assistant_content.append(
+                {
+                    "role": "assistant",
+                    "content": step_text,
+                    "tool_calls": [{"type": "function", "function": tool_call}],
+                }
+            )
+
current_assistant_content.append({"role": "tool", "content": result})
+
+            curriculum_learning_stages[turn] = sub_data
+            turn += 1
+
+        # --- CASE B: Final Conclusion ---
+        elif i == len(full_reasoning_steps) - 1:
+            total_tokens = len(
+                tokenizer.apply_chat_template(current_assistant_content, tokenize=True, add_generation_prompt=False)
+            )
+            sub_data = copy.deepcopy(data)
+
+            result = result if result else ""
+
+            response_message.append(
+                {
+                    "role": "assistant",
+                    "content": step_text + result + f"\nClose Code: [{_resolution_method(sub_data)}]",
+                }
+            )
+            raw = tokenizer.apply_chat_template(
+                response_message, tokenize=False, add_special_tokens=False, add_generation_prompt=False
+            )
+            cleaned = raw[pre_compute_idx:]
+            sub_data["response"] = cleaned
+
+            # Background String
+            raw_background = tokenizer.apply_chat_template(
+                current_assistant_content, tokenize=False, add_special_tokens=False, add_generation_prompt=False
+            )
+            cleaned_background = raw_background[pre_compute_idx:]
+            sub_data["background"] = cleaned_background
+
+            curriculum_learning_stages[turn] = sub_data
+
+        # --- CASE C: Intermediate Step ---
+        else:
+            # No tool call and not the final step: nothing to emit for this turn
+            pass
+
+    # --- CASE D: Forced Conclusion ---
+    # If the last step had a tool_call (Case B never triggered), append
+    # an extra conclusion turn so the model learns to output a Close Code.
+
+    if turn > 0 and (turn - 1) in curriculum_learning_stages and turn not in curriculum_learning_stages:
+        close_code = _resolution_method(data)
+        if close_code:
+            total_tokens = len(
+                tokenizer.apply_chat_template(current_assistant_content, tokenize=True, add_generation_prompt=False)
+            )
+            sub_data = copy.deepcopy(data)
+            conclusion_msg = [
+                {"role": "user", "content": SFT_DUMMY_USER},
+                {
+                    "role": "assistant",
+                    "content": f"\nAll troubleshooting steps have been completed and the incident has been resolved.\n\n\nClose Code: [{close_code}]",
+                },
+            ]
+            raw = tokenizer.apply_chat_template(
+                conclusion_msg, tokenize=False, add_special_tokens=False, add_generation_prompt=False
+            )
+            sub_data["response"] = raw[pre_compute_idx:]
+            raw_background = tokenizer.apply_chat_template(
+                current_assistant_content, tokenize=False, add_special_tokens=False, add_generation_prompt=False
+            )
+            sub_data["background"] = raw_background[pre_compute_idx:]
+            curriculum_learning_stages[turn] = sub_data
+
+    return curriculum_learning_stages, total_tokens
+
+
+def compile_reasoning(jsonl_file, input_file, output_dir, reasoning_jsonl, tokenizer_name="Qwen/Qwen3-32B"):
+    # Load the tokenizer once for the whole run.
+    # Trust remote code is often needed for Qwen tokenizers.
+    print(f"Loading tokenizer ({tokenizer_name})...")
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
+
+    formatted_steps_taken = extract_formatted_json_steps(input_file)
+    formatted_reasoning_steps_taken = extract_final_thinking_processes(reasoning_jsonl)
+
+    full_steps = merge_reasoning_steps(formatted_steps_taken, formatted_reasoning_steps_taken)
+
+    all_tokens = []
+    stages = {}
+    incorrect_incidents = 0
+
+    with open(jsonl_file, "r", encoding="utf-8") as f:
+        for line in tqdm(f):
+            data = json.loads(line)
+            number = _incident_id(data)
+
+            if number in full_steps:
+                # Convert this incident's steps using the shared tokenizer
+                try:
+                    steps_data, tokens = qwen_token_converter(data, full_steps[number], tokenizer)
+                    for stage in steps_data:
+                        if stage not in stages:
+                            stages[stage] = []
+                        stages[stage].append(steps_data[stage])
+
+                    if tokens > 0:
+                        all_tokens.append(tokens)
+                except (KeyError, ValueError) as e:
+                    print(f"Error for incident {number}: {e}")
+                    incorrect_incidents += 1
+            else:
+                incorrect_incidents += 1
+
+    # Write one JSONL file per curriculum stage
+    os.makedirs(output_dir, exist_ok=True)
+    for i in range(len(stages)):
+        name = os.path.join(output_dir, f"iteration_{i}.jsonl")
+        with open(name, "w", encoding="utf-8") as f:
+            for line in stages[i]:
+                json.dump(line, f)
+                f.write("\n")
+
+    print(f"CURRICULUM Info\n{'*' * 20}")
+    print(f"There are currently {len(stages)} stages")
+    print(f"{incorrect_incidents} incidents failed")
+
+
+def main(jsonl_file, input_file, output_file, parse_type, reasoning_jsonl=None, output_dir=None):
+    if parse_type == "steps_extraction":
+        prepare_data_for_reasoning_traces(jsonl_file, input_file, output_file)
+    elif parse_type == "compile_reasoning":
+        if not reasoning_jsonl:
+            raise ValueError("Please specify a reasoning jsonl file via --reasoning_jsonl")
+        compile_reasoning(jsonl_file, input_file, output_dir, reasoning_jsonl)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Extract and format reasoning steps from JSONL files.")
+    parser.add_argument("--input", type=str, help="Path to the generation JSONL file")
+    parser.add_argument("--output", required=False, type=str)
+    parser.add_argument("--jsonl_file", required=False, type=str)
+    parser.add_argument("--parse_type", type=str)
+    parser.add_argument("--output_dir", required=False)
+    parser.add_argument("--reasoning_jsonl", required=False, type=str)
+
+    parsing_types = ["steps_extraction", "compile_reasoning"]
+    args = parser.parse_args()
+
+    if args.parse_type not in parsing_types:
+        raise
ValueError(f"{args.parse_type} is not supported. Supported parse_types include {parsing_types}") + + main(args.jsonl_file, args.input, args.output, args.parse_type, args.reasoning_jsonl, args.output_dir) diff --git a/recipes/noc-reasoning-agent/scripts/utils/reasoning_processes.py b/recipes/noc-reasoning-agent/scripts/utils/reasoning_processes.py new file mode 100644 index 0000000000..84ede2f4a1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/reasoning_processes.py @@ -0,0 +1,3040 @@ +""" +Auto-generated reasoning processes for all 70 workflows. +Maps fault_category (workflow ID) -> SOP text for use in formatting_prompt.yaml. + +Generated by: scripts/utils/generate_reasoning_processes.py +""" + + +WORKFLOW_REASONING_PROCESSES = { + "compute_certificate_renewal": """ + Workflow: Certificate Expiry Resolution + Domain: COMPUTE | Problem Codes: CMP-012 | SLA: 60 minutes + Description: TLS/SSL certificate expiry detection and renewal + + 1. Verify Certificate Alarm: Confirm certificate expiration + - Tools Used: query_alarm() + - Expected Outcomes: Certificate expired; Certificate expiring soon; TLS handshake failing + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Identify Affected Services: List services using the certificate + - Tools Used: run_diagnostics() + - Expected Outcomes: Services identified; Multiple certs affected + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Trigger Certificate Renewal: Request new certificate from CA + - Tools Used: execute_remote_action() + - Expected Outcomes: Certificate renewed; Renewal in progress; Manual renewal needed + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Deploy New Certificate: Update secrets and restart services + - Tools Used: orchestrate_workload() + - Expected Outcomes: Certificate deployed; Services updated + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. 
Escalate Certificate Issue: Create ticket for security team + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Security team notified + - Flow: Proceed to step 6. + + 6. Verify TLS Connectivity: Confirm TLS working with new cert + - Tools Used: verify_recovery() + - Expected Outcomes: TLS healthy; Certificate valid; Renewal complete + - Flow: Proceed to step end. + + """, + + "compute_cnf_pod_recovery": """ + Workflow: CNF Pod Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-006 | SLA: 15 minutes + Description: Cloud Native Function pod failure recovery + + 1. Verify CNF Alarm: Confirm CNF pod failure + - Tools Used: query_alarm() + - Expected Outcomes: CNF pod crashed; CNF not ready; CNF degraded + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check CNF Pod Status: Query CNF pod and container state + - Tools Used: query_container_status() + - Expected Outcomes: Pod CrashLoopBackOff; Init failed; Liveness probe failing + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check CNF Logs: Review CNF application logs + - Tools Used: inspect_logs() + - Expected Outcomes: License error; Configuration error; Database connection failed + - Flow: Proceed to step 4. + + 4. Check Dependent Services: Verify CNF dependencies available + - Tools Used: verify_signaling_path() + - Expected Outcomes: Dependencies OK; Database down; Message bus unavailable + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Restart CNF Pod: Delete and recreate CNF pod + - Tools Used: orchestrate_workload() + - Expected Outcomes: CNF restarting; CNF scheduled; Restart failed + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Trigger CNF Failover: Activate standby CNF instance + - Tools Used: orchestrate_workload() + - Expected Outcomes: Failover successful; Standby activated + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. 
Escalate to CNF Vendor: Create critical ticket for vendor support + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Vendor support engaged + - Flow: Proceed to step 8. + + 8. Verify CNF Recovery: Confirm CNF operational + - Tools Used: verify_recovery() + - Expected Outcomes: CNF healthy; Running on standby; Recovery ongoing + - Flow: Proceed to step end. + + """, + + "compute_container_crash_recovery": """ + Workflow: Container Crash Recovery + Domain: COMPUTE | Problem Codes: CMP-002 | SLA: 30 minutes + Description: Container/pod crash loop detection and recovery + + 1. Verify Container Alarm: Confirm container crash or restart loop + - Tools Used: query_alarm() + - Expected Outcomes: CrashLoopBackOff; Container OOMKilled; Container Error + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Pod Status: Query Kubernetes pod state + - Tools Used: query_container_status() + - Expected Outcomes: Pod CrashLoopBackOff; Pod Pending; Init containers failing + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check Container Logs: Review container application logs + - Tools Used: inspect_logs() + - Expected Outcomes: Application error; Configuration error; Dependency failure; OOM + - Flow: Proceed to step 4. + + 4. Check Resource Limits: Verify container resource allocation + - Tools Used: run_diagnostics() + - Expected Outcomes: Memory limit hit; CPU throttled; Resources adequate + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Attempt Pod Restart: Delete and recreate pod + - Tools Used: orchestrate_workload() + - Expected Outcomes: Pod restarting; Pod scheduled; Restart failed + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Scale Replacement Pods: Scale deployment to create new pods + - Tools Used: orchestrate_workload() + - Expected Outcomes: Pods scaled; Healthy pods available + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. 
Escalate to DevOps: Create ticket for application investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: DevOps team notified + - Flow: Proceed to step 8. + + 8. Verify Container Recovery: Confirm pod running and healthy + - Tools Used: verify_recovery() + - Expected Outcomes: Pod healthy; Pod degraded; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "compute_image_pull_recovery": """ + Workflow: Image Pull Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-010 | SLA: 30 minutes + Description: Container image pull failure resolution + + 1. Verify Image Pull Alarm: Confirm image pull failure + - Tools Used: query_alarm() + - Expected Outcomes: ImagePullBackOff; ErrImagePull; Registry timeout + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check Registry Connectivity: Test connectivity to container registry + - Tools Used: test_connectivity() + - Expected Outcomes: Registry unreachable; Auth failure; Rate limited + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Verify Image Credentials: Check image pull secrets + - Tools Used: run_diagnostics() + - Expected Outcomes: Credentials valid; Credentials expired; Secret missing + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Apply Registry Fix: Update credentials or switch registry + - Tools Used: apply_configuration() + - Expected Outcomes: Credentials updated; Registry switched + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Escalate Registry Issue: Create ticket for registry team + - Tools Used: create_trouble_ticket() + - Expected Outcomes: DevOps team notified + - Flow: Proceed to step 6. + + 6. Verify Image Pull: Confirm pods scheduling successfully + - Tools Used: verify_recovery() + - Expected Outcomes: Images pulling; Pods scheduled; Issue resolved + - Flow: Proceed to step end. 
+ + """, + + "compute_k8s_node_recovery": """ + Workflow: Kubernetes Node Recovery + Domain: COMPUTE | Problem Codes: CMP-003 | SLA: 30 minutes + Description: K8s node NotReady state recovery + + 1. Verify Node Alarm: Confirm Kubernetes node NotReady + - Tools Used: query_alarm() + - Expected Outcomes: Node NotReady; Node unschedulable; Kubelet not responding + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. Check Node Status: Query Kubernetes node conditions + - Tools Used: query_container_status() + - Expected Outcomes: MemoryPressure; DiskPressure; PIDPressure; NetworkUnavailable + - Flow: If succeeds, go to step 3; if fails, go to step 7. + + 3. Check Node Resources: Verify node CPU/memory/disk + - Tools Used: query_resource_health() + - Expected Outcomes: Resources exhausted; Resources OK; Disk full + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Check Kubelet Status: Verify kubelet service running + - Tools Used: run_diagnostics() + - Expected Outcomes: Kubelet stopped; Kubelet crashlooping; Certificate expired + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Restart Kubelet: Restart kubelet service on node + - Tools Used: execute_remote_action() + - Expected Outcomes: Kubelet restarted; Node recovering; Restart failed + - Flow: If succeeds, go to step 9; if fails, go to step 6. + + 6. Drain Node: Cordon and drain unhealthy node + - Tools Used: orchestrate_workload() + - Expected Outcomes: Node drained; Pods rescheduled; Some pods stuck + - Flow: Proceed to step 7. + + 7. Reboot Node: Reboot the failing node + - Tools Used: execute_remote_action() + - Expected Outcomes: Node rebooting; Reboot initiated; Reboot failed + - Flow: If succeeds, go to step 9; if fails, go to step 8. + + 8. Escalate to Platform Team: Create ticket for node replacement + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Platform team engaged + - Flow: Proceed to step 9. + + 9. 
Verify Node Recovery: Confirm node Ready + - Tools Used: verify_recovery() + - Expected Outcomes: Node Ready; Node recovering; Node replacement needed + - Flow: Proceed to step end. + + """, + + "compute_network_policy_recovery": """ + Workflow: Network Policy Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-011 | SLA: 30 minutes + Description: K8s network policy misconfiguration resolution + + 1. Verify Network Policy Alarm: Confirm network policy blocking traffic + - Tools Used: query_alarm() + - Expected Outcomes: Traffic blocked; Pod isolation; Ingress denied + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Analyze Network Policies: Review active network policies + - Tools Used: run_diagnostics() + - Expected Outcomes: Restrictive policy found; Missing allow rule; Label mismatch + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Test Pod Connectivity: Verify pod-to-pod and pod-to-service connectivity + - Tools Used: test_connectivity() + - Expected Outcomes: Connectivity blocked; Partial connectivity; DNS issues + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Apply Policy Correction: Update or remove blocking network policy + - Tools Used: apply_configuration() + - Expected Outcomes: Policy updated; Allow rule added + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Escalate to Security Team: Create ticket for policy review + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Security team notified + - Flow: Proceed to step 6. + + 6. Verify Network Connectivity: Confirm traffic flowing correctly + - Tools Used: verify_recovery() + - Expected Outcomes: Connectivity restored; Policy adjusted; Review ongoing + - Flow: Proceed to step end. + + """, + + "compute_orchestrator_recovery": """ + Workflow: Orchestrator Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-005 | SLA: 15 minutes + Description: Container/VM orchestrator failure recovery + + 1. 
Verify Orchestrator Alarm: Confirm orchestrator health issue + - Tools Used: query_alarm() + - Expected Outcomes: API server down; etcd unhealthy; Controller failure + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Control Plane: Verify control plane component status + - Tools Used: query_container_status() + - Expected Outcomes: API server unhealthy; Scheduler down; etcd quorum lost + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check etcd Cluster: Verify etcd cluster health + - Tools Used: run_diagnostics() + - Expected Outcomes: etcd healthy; etcd degraded; etcd split brain + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Restart Control Plane: Restart failed control plane components + - Tools Used: execute_remote_action() + - Expected Outcomes: Control plane recovering; Restart in progress; Restart failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Recover etcd: Restore etcd from backup if needed + - Tools Used: execute_remote_action() + - Expected Outcomes: etcd recovered; Restore from backup; Manual intervention needed + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Failover to Backup Control Plane: Switch to backup orchestrator + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; Backup active + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Escalate to Platform Team: Create critical ticket for orchestrator recovery + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Platform team engaged + - Flow: Proceed to step 8. + + 8. Verify Orchestrator Recovery: Confirm orchestrator healthy + - Tools Used: verify_recovery() + - Expected Outcomes: Orchestrator healthy; Running on backup; Recovery in progress + - Flow: Proceed to step end. 
+ + """, + + "compute_resource_exhaustion_resolution": """ + Workflow: Resource Exhaustion Resolution + Domain: COMPUTE | Problem Codes: CMP-008 | SLA: 30 minutes + Description: Compute resource exhaustion detection and mitigation + + 1. Verify Resource Alarm: Confirm resource exhaustion + - Tools Used: query_alarm() + - Expected Outcomes: CPU exhausted; Memory exhausted; Disk full + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Analyze Resource Usage: Identify resource consumption by workload + - Tools Used: run_diagnostics() + - Expected Outcomes: Specific workload identified; General overload; Resource leak + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Apply Resource Limits: Enforce or adjust resource quotas + - Tools Used: apply_configuration() + - Expected Outcomes: Limits applied; Quota enforced + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Scale Resources: Add compute capacity or scale workloads + - Tools Used: orchestrate_workload() + - Expected Outcomes: Resources scaled; Capacity added; Scaling limited + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Terminate Non-Critical Workloads: Stop low priority workloads to free resources + - Tools Used: orchestrate_workload() + - Expected Outcomes: Workloads terminated; Resources freed + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Request Capacity Expansion: Create ticket for infrastructure expansion + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Capacity request submitted + - Flow: Proceed to step 7. + + 7. Verify Resource Status: Monitor resource utilization + - Tools Used: verify_recovery() + - Expected Outcomes: Resources normalized; Resources stable; Expansion in progress + - Flow: Proceed to step end. 
+ + """, + + "compute_service_mesh_recovery": """ + Workflow: Service Mesh Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-009 | SLA: 30 minutes + Description: Service mesh control/data plane failure recovery + + 1. Verify Mesh Alarm: Confirm service mesh issue + - Tools Used: query_alarm() + - Expected Outcomes: Control plane unhealthy; Sidecar failures; mTLS errors + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check Mesh Control Plane: Verify Istio/Linkerd control plane status + - Tools Used: query_container_status() + - Expected Outcomes: istiod unhealthy; Pilot not syncing; Control plane overloaded + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Sidecar Proxies: Verify sidecar proxy health across pods + - Tools Used: run_diagnostics() + - Expected Outcomes: Sidecars healthy; Sidecars crashing; Config sync failed + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Restart Mesh Control Plane: Restart mesh control plane components + - Tools Used: orchestrate_workload() + - Expected Outcomes: Control plane restarted; Mesh recovering + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Rollback Mesh Configuration: Restore previous mesh configuration + - Tools Used: apply_configuration() + - Expected Outcomes: Config rolled back; Previous config restored + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate to Platform Team: Create ticket for mesh investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Platform team notified + - Flow: Proceed to step 7. + + 7. Verify Mesh Recovery: Confirm service mesh healthy + - Tools Used: verify_recovery() + - Expected Outcomes: Mesh healthy; Mesh recovering; Investigation ongoing + - Flow: Proceed to step end. 
+ + """, + + "compute_storage_failure_recovery": """ + Workflow: Storage Volume Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-004 | SLA: 30 minutes + Description: Persistent storage volume failure recovery + + 1. Verify Storage Alarm: Confirm storage volume failure + - Tools Used: query_alarm() + - Expected Outcomes: Volume unavailable; Mount failed; I/O errors + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Volume Status: Query storage backend status + - Tools Used: query_resource_health() + - Expected Outcomes: Volume degraded; Volume offline; Backend unreachable + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check Storage Backend: Verify storage array health + - Tools Used: run_diagnostics() + - Expected Outcomes: Backend healthy; Backend degraded; Path failure + - Flow: If succeeds, go to step 4; if fails, go to step 6. + + 4. Attempt Volume Recovery: Force remount or reconnect volume + - Tools Used: execute_remote_action() + - Expected Outcomes: Volume recovered; Remount successful; Recovery failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Failover to Replica: Switch to storage replica if available + - Tools Used: orchestrate_workload() + - Expected Outcomes: Failover successful; No replica available + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Restart Dependent Workloads: Restart pods/VMs using the volume + - Tools Used: orchestrate_workload() + - Expected Outcomes: Workloads restarted; Some workloads pending + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Escalate to Storage Team: Create critical ticket for storage repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Storage team engaged + - Flow: Proceed to step 8. + + 8. Verify Storage Recovery: Confirm volume accessible + - Tools Used: verify_recovery() + - Expected Outcomes: Storage recovered; Degraded mode; Recovery ongoing + - Flow: Proceed to step end. 
+ + """, + + "compute_vm_failure_recovery": """ + Workflow: VM Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-001 | SLA: 30 minutes + Description: Virtual machine failure detection and recovery + + 1. Verify VM Alarm: Confirm VM failure or unreachable + - Tools Used: query_alarm() + - Expected Outcomes: VM unreachable; VM crashed; VM heartbeat lost + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. Check Hypervisor Status: Verify host hypervisor health + - Tools Used: query_resource_health() + - Expected Outcomes: Hypervisor healthy; Hypervisor overloaded; Hypervisor fault + - Flow: If succeeds, go to step 3; if fails, go to step 7. + + 3. Check VM Status: Query VM state from orchestrator + - Tools Used: query_container_status() + - Expected Outcomes: VM error state; VM paused; VM stopped; VM not found + - Flow: If succeeds, go to step 4; if fails, go to step 6. + + 4. Check VM Logs: Review VM console and system logs + - Tools Used: inspect_logs() + - Expected Outcomes: Kernel panic; Out of memory; Disk full; Application crash + - Flow: Proceed to step 5. + + 5. Attempt VM Restart: Restart the failed VM + - Tools Used: orchestrate_workload() + - Expected Outcomes: VM restarting; VM started; Restart failed + - Flow: If succeeds, go to step 9; if fails, go to step 6. + + 6. Attempt VM Migration: Live migrate VM to healthy host + - Tools Used: orchestrate_workload() + - Expected Outcomes: Migration successful; Migration in progress; Migration failed + - Flow: If succeeds, go to step 9; if fails, go to step 7. + + 7. Restore from Snapshot: Recover VM from last known good snapshot + - Tools Used: orchestrate_workload() + - Expected Outcomes: Restore successful; Snapshot not available + - Flow: If succeeds, go to step 9; if fails, go to step 8. + + 8. Escalate to Cloud Team: Create critical ticket for cloud operations + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Cloud team engaged + - Flow: Proceed to step 9. + + 9. 
Verify VM Recovery: Confirm VM operational + - Tools Used: verify_recovery() + - Expected Outcomes: VM recovered; VM degraded; Recovery in progress + - Flow: Proceed to step end. + + """, + + "compute_vnf_failure_recovery": """ + Workflow: VNF Failure Recovery + Domain: COMPUTE | Problem Codes: CMP-007 | SLA: 30 minutes + Description: Virtual Network Function failure recovery + + 1. Verify VNF Alarm: Confirm VNF failure + - Tools Used: query_alarm() + - Expected Outcomes: VNF unreachable; VNF degraded; VNF heartbeat lost + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check VNF Status: Query VNF manager for instance state + - Tools Used: query_resource_health() + - Expected Outcomes: VNF error; VNF stopped; VNF overloaded + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check VNF Logs: Review VNF application logs + - Tools Used: inspect_logs() + - Expected Outcomes: Software crash; Resource exhaustion; License issue + - Flow: Proceed to step 4. + + 4. Attempt VNF Restart: Restart VNF via VNFM + - Tools Used: orchestrate_workload() + - Expected Outcomes: VNF restarting; VNF started; Restart failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Scale Out VNF: Create new VNF instance for redundancy + - Tools Used: orchestrate_workload() + - Expected Outcomes: Scale out successful; New instance created + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Activate Geo-Redundant VNF: Failover to secondary site VNF + - Tools Used: execute_remote_action() + - Expected Outcomes: Geo-failover initiated; Secondary VNF active + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Escalate to VNF Vendor: Create critical ticket for vendor support + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Vendor support engaged + - Flow: Proceed to step 8. + + 8. 
Verify VNF Recovery: Confirm VNF operational + - Tools Used: verify_recovery() + - Expected Outcomes: VNF healthy; Running on secondary; Recovery ongoing + - Flow: Proceed to step end. + + """, + + "core_charging_failure_recovery": """ + Workflow: Charging System Failure Recovery + Domain: CORE | Problem Codes: CORE-005 | SLA: 30 minutes + Description: Online/offline charging system failure recovery + + 1. Verify Charging Alarm: Confirm charging system failure + - Tools Used: query_alarm() + - Expected Outcomes: OCS unreachable; CDR generation failed; Rating error + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check Charging Status: Query charging system health + - Tools Used: query_resource_health() + - Expected Outcomes: OCS fault; Mediation backup; Database issue + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Gy/N40 Interface: Verify charging signaling path + - Tools Used: verify_signaling_path() + - Expected Outcomes: Interface healthy; Timeout; Connection lost + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Enable Offline Charging Fallback: Switch to offline charging if OCS down + - Tools Used: apply_configuration() + - Expected Outcomes: Fallback enabled; CDR buffering active + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Failover to Backup OCS: Switch to backup charging system + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; Backup OCS active + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate to Billing Team: Create ticket for charging system repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Billing team engaged + - Flow: Proceed to step 7. + + 7. Verify Charging Recovery: Confirm charging services restored + - Tools Used: verify_recovery() + - Expected Outcomes: Charging recovered; Running on fallback; Recovery ongoing + - Flow: Proceed to step end. 
+ + """, + + "core_gateway_failure_recovery": """ + Workflow: SGW/PGW/UPF Failure Recovery + Domain: CORE | Problem Codes: CORE-002 | SLA: 15 minutes + Description: User plane gateway failure detection and recovery + + 1. Verify Gateway Alarm: Confirm SGW/PGW/UPF failure + - Tools Used: query_alarm() + - Expected Outcomes: Gateway unreachable; GTP path failure; User plane down + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. Check Gateway Status: Query gateway health and session state + - Tools Used: query_resource_health() + - Expected Outcomes: Gateway fault; Processing overload; Interface down + - Flow: If succeeds, go to step 3; if fails, go to step 7. + + 3. Check GTP Paths: Verify GTP tunnel status + - Tools Used: verify_signaling_path() + - Expected Outcomes: GTP-C path down; GTP-U path down; Echo timeout + - Flow: If succeeds, go to step 4; if fails, go to step 6. + + 4. Attempt Gateway Restart: Restart gateway service + - Tools Used: execute_remote_action() + - Expected Outcomes: Gateway restarting; Service recovery; Restart failed + - Flow: If succeeds, go to step 9; if fails, go to step 5. + + 5. Trigger Gateway Failover: Activate standby gateway + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; Standby activated; Sessions migrating + - Flow: If succeeds, go to step 9; if fails, go to step 6. + + 6. Redirect Traffic: Update routing to bypass failed gateway + - Tools Used: apply_configuration() + - Expected Outcomes: Traffic rerouted; Alternate path active + - Flow: If succeeds, go to step 9; if fails, go to step 7. + + 7. Notify Affected Services: Alert dependent services of gateway failure + - Tools Used: query_external_factors() + - Expected Outcomes: Services notified; Impact assessed + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. 
Escalate to Core Team: Create critical ticket for gateway repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Core team engaged + - Flow: Proceed to step 9. + + 9. Verify Gateway Recovery: Confirm gateway operational + - Tools Used: verify_recovery() + - Expected Outcomes: Gateway recovered; Running on standby; Recovery in progress + - Flow: Proceed to step end. + + """, + + "core_hss_udm_recovery": """ + Workflow: HSS/UDM Unavailability Recovery + Domain: CORE | Problem Codes: CORE-003 | SLA: 15 minutes + Description: Subscriber database unavailability recovery + + 1. Verify HSS/UDM Alarm: Confirm HSS or UDM unavailable + - Tools Used: query_alarm() + - Expected Outcomes: HSS unreachable; UDM not responding; Database timeout + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check HSS/UDM Status: Query subscriber database health + - Tools Used: query_resource_health() + - Expected Outcomes: Database fault; Overloaded; Replication lag + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check Diameter Connectivity: Verify Diameter path to HSS/UDM + - Tools Used: verify_signaling_path() + - Expected Outcomes: Diameter healthy; Diameter timeout; DRA congested + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Attempt HSS Recovery: Restart HSS/UDM service + - Tools Used: execute_remote_action() + - Expected Outcomes: Service restarting; Database recovering; Restart failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Failover to Backup HSS: Switch to geo-redundant HSS/UDM + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; Backup HSS active + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Enable Local Cache: Activate cached subscriber data if available + - Tools Used: apply_configuration() + - Expected Outcomes: Cache enabled; Limited service available + - Flow: Proceed to step 7. 
+ - Note: This step is skippable if not applicable. + + 7. Escalate to Core Database Team: Create critical ticket for HSS/UDM repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Database team engaged + - Flow: Proceed to step 8. + + 8. Verify HSS/UDM Recovery: Confirm subscriber services restored + - Tools Used: verify_recovery() + - Expected Outcomes: HSS recovered; Running on backup; Recovery ongoing + - Flow: Proceed to step end. + + """, + + "core_mme_amf_overload_mitigation": """ + Workflow: MME/AMF Overload Mitigation + Domain: CORE | Problem Codes: CORE-001 | SLA: 15 minutes + Description: Core mobility management overload detection and mitigation + + 1. Verify MME/AMF Alarm: Confirm MME or AMF overload condition + - Tools Used: query_alarm() + - Expected Outcomes: CPU overload; Memory high; Session capacity exceeded; Signaling storm + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Node Status: Query MME/AMF operational metrics + - Tools Used: query_resource_health() + - Expected Outcomes: CPU >80%; Sessions near limit; Message queue building + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Analyze Traffic Pattern: Identify source of overload + - Tools Used: query_performance() + - Expected Outcomes: Attach storm detected; Paging flood; Normal peak traffic; Single eNB causing load + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Enable Overload Controls: Activate NAS-level overload controls + - Tools Used: apply_configuration() + - Expected Outcomes: Overload control active; Attach rejection enabled; Paging optimization active + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Redistribute Load: Trigger S1-flex load redistribution + - Tools Used: execute_remote_action() + - Expected Outcomes: Load redistributed; eNBs rebalanced; Pool weight adjusted + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. 
Scale Out MME/AMF: Activate standby instance or scale capacity + - Tools Used: orchestrate_workload() + - Expected Outcomes: Instance scaled; Capacity added + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Escalate to Core Team: Create critical ticket for core operations + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Core team engaged + - Flow: Proceed to step 8. + + 8. Verify Load Status: Monitor MME/AMF load recovery + - Tools Used: verify_recovery() + - Expected Outcomes: Load normalized; Overload controlled; Capacity added + - Flow: Proceed to step end. + + """, + + "core_pcrf_pcf_recovery": """ + Workflow: PCRF/PCF Failure Recovery + Domain: CORE | Problem Codes: CORE-004 | SLA: 30 minutes + Description: Policy control function failure recovery + + 1. Verify PCRF/PCF Alarm: Confirm policy function failure + - Tools Used: query_alarm() + - Expected Outcomes: PCRF unreachable; PCF timeout; Policy rule failure + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check PCRF/PCF Status: Query policy engine health + - Tools Used: query_resource_health() + - Expected Outcomes: Service fault; Database issue; Overloaded + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Gx/N7 Interface: Verify policy signaling path + - Tools Used: verify_signaling_path() + - Expected Outcomes: Interface healthy; Diameter timeout; Connection refused + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Restart PCRF/PCF Service: Restart policy control service + - Tools Used: execute_remote_action() + - Expected Outcomes: Service restarting; Service recovered; Restart failed + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Failover to Backup PCRF: Switch to standby policy engine + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; Backup active + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. 
Escalate to Policy Team: Create ticket for policy function repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Policy team engaged + - Flow: Proceed to step 7. + + 7. Verify Policy Recovery: Confirm policy services restored + - Tools Used: verify_recovery() + - Expected Outcomes: PCRF recovered; Running on backup; Recovery ongoing + - Flow: Proceed to step end. + + """, + + "env_battery_temperature_response": """ + Workflow: Battery High Temperature Response + Domain: POWER | Problem Codes: ENV-006 | SLA: 30 minutes + Description: Battery thermal runaway prevention and mitigation + + 1. Verify Battery Temperature Alarm: Confirm battery temperature reading + - Tools Used: query_alarm() + - Expected Outcomes: Battery temp elevated; Battery temp critical; Multiple strings affected + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Battery Status: Get detailed battery temperature and charge state + - Tools Used: query_power_system() + - Expected Outcomes: Temp 35-40C - elevated; Temp 40-45C - high; Temp >45C - critical + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check HVAC Cooling: Verify battery room cooling operational + - Tools Used: query_power_system() + - Expected Outcomes: HVAC running - cooling active; HVAC fault detected; HVAC capacity insufficient + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Reduce Charge Rate: Lower rectifier charge current to reduce heat + - Tools Used: apply_configuration() + - Expected Outcomes: Charge rate reduced; Already at minimum; Configuration applied + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Increase Battery Room Cooling: Maximize cooling to battery area + - Tools Used: execute_remote_action() + - Expected Outcomes: Emergency cooling activated; Cooling already maximum; Cooling system fault + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. 
Isolate Battery String: Disconnect affected battery string if critical + - Tools Used: execute_remote_action() + - Expected Outcomes: Battery string isolated; Isolation not possible remotely + - Flow: Proceed to step 7. + + 7. Emergency Dispatch: Dispatch technician for battery inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Emergency dispatch initiated + - Flow: Proceed to step 8. + + 8. Monitor Battery Temperature: Track temperature trend + - Tools Used: verify_recovery() + - Expected Outcomes: Temperature declining; Temperature stabilized; Temperature still rising + - Flow: Proceed to step end. + + """, + + "env_cabinet_intrusion_response": """ + Workflow: Cabinet Intrusion Response + Domain: POWER | Problem Codes: ENV-005 | SLA: 30 minutes + Description: Security breach detection and response + + 1. Verify Intrusion Alarm: Confirm cabinet door or tamper alarm + - Tools Used: query_alarm() + - Expected Outcomes: Door open alarm; Tamper detected; Multiple cabinets alarming + - Flow: If succeeds, go to step 2; if fails, go to step 4. + + 2. Check Scheduled Maintenance: Verify if maintenance window is active + - Tools Used: query_external_factors() + - Expected Outcomes: Scheduled work in progress; No scheduled work; Maintenance window ended + - Flow: If succeeds, go to step 5; if fails, go to step 3. + + 3. Verify Site Access Log: Check recent access control entries + - Tools Used: inspect_logs() + - Expected Outcomes: Authorized access logged; Unauthorized access suspected; Access log unavailable + - Flow: Proceed to step 4. + + 4. Create Security Incident: Log security event and dispatch if unauthorized + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Security incident logged; Field verification dispatched + - Flow: Proceed to step 5. + + 5. 
Monitor Cabinet Status: Verify cabinet status and equipment health + - Tools Used: verify_recovery() + - Expected Outcomes: Cabinet secured; Door still open; Equipment status verified + - Flow: Proceed to step end. + + """, + + "env_high_temperature_response": """ + Workflow: High Temperature Alert Response + Domain: POWER | Problem Codes: ENV-001 | SLA: 60 minutes + Description: Site overheating diagnosis and mitigation + + 1. Verify Temperature Alarm: Confirm high temperature reading and affected zone + - Tools Used: query_alarm() + - Expected Outcomes: High temp confirmed; Temperature sensor spike; Multiple zones affected + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Current Temperature: Get detailed temperature readings + - Tools Used: query_power_system() + - Expected Outcomes: Temp elevated but manageable; Temp approaching critical; Temp critical - equipment at risk + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Verify HVAC Operation: Check cooling system status + - Tools Used: query_power_system() + - Expected Outcomes: HVAC running normally; HVAC fault detected; HVAC capacity insufficient + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Adjust HVAC Settings: Increase cooling or enable emergency mode + - Tools Used: apply_configuration() + - Expected Outcomes: Cooling increased; Already at maximum; Configuration applied + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Attempt HVAC Reset: Restart HVAC controller + - Tools Used: execute_remote_action() + - Expected Outcomes: HVAC restarted successfully; HVAC failed to restart + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Implement Load Reduction: Reduce equipment load to lower heat output + - Tools Used: execute_remote_action() + - Expected Outcomes: Non-critical equipment powered down; Load reduction limited + - Flow: Proceed to step 7. + + 7. 
Dispatch HVAC Service: Create work order for HVAC repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: HVAC service dispatched + - Flow: Proceed to step 8. + + 8. Monitor Temperature Trend: Verify temperature stabilizing or declining + - Tools Used: verify_recovery() + - Expected Outcomes: Temperature stabilizing; Temperature declining; Temperature still rising + - Flow: Proceed to step end. + + """, + + "env_humidity_threshold_response": """ + Workflow: Humidity Threshold Response + Domain: POWER | Problem Codes: ENV-004 | SLA: 120 minutes + Description: High humidity detection and correction + + 1. Verify Humidity Alarm: Confirm humidity level and affected area + - Tools Used: query_alarm() + - Expected Outcomes: Humidity elevated; Humidity critical; Sensor fault suspected + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check Environmental Readings: Get current temperature and humidity values + - Tools Used: query_power_system() + - Expected Outcomes: Humidity 60-70% - elevated; Humidity >70% - high; Humidity normalizing + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Adjust HVAC Dehumidification: Enable or increase dehumidification mode + - Tools Used: apply_configuration() + - Expected Outcomes: Dehumidification enabled; Already at maximum; HVAC adjustment applied + - Flow: If succeeds, go to step 5; if fails, go to step 4. + + 4. Schedule HVAC Inspection: Create work order for HVAC inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Inspection scheduled + - Flow: Proceed to step 5. + - Note: This step is skippable if not applicable. + + 5. Verify Humidity Stabilizing: Monitor humidity trend + - Tools Used: verify_recovery() + - Expected Outcomes: Humidity declining; Humidity stable; Humidity still elevated + - Flow: Proceed to step end. 
+ + """, + + "env_hvac_fault_recovery": """ + Workflow: HVAC System Fault Recovery + Domain: POWER | Problem Codes: ENV-002 | SLA: 90 minutes + Description: HVAC failure diagnosis and recovery + + 1. Verify HVAC Alarm: Confirm HVAC fault type and affected units + - Tools Used: query_alarm() + - Expected Outcomes: Compressor fault; Fan failure; Refrigerant issue; Control fault + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check HVAC Status: Query HVAC controller for detailed diagnostics + - Tools Used: query_power_system() + - Expected Outcomes: Single unit failed - redundancy OK; Multiple units affected; Primary cooling lost + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Attempt HVAC Reset: Reset HVAC controller and compressor + - Tools Used: execute_remote_action() + - Expected Outcomes: HVAC recovered; Reset failed; Temporary recovery + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Verify Cooling Restored: Confirm cooling output restored + - Tools Used: query_power_system() + - Expected Outcomes: Cooling restored to normal; Partial cooling; Cooling insufficient + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Enable Backup Cooling: Activate emergency/backup cooling if available + - Tools Used: execute_remote_action() + - Expected Outcomes: Backup cooling activated; No backup available + - Flow: Proceed to step 6. + - Note: This step is skippable if not applicable. + + 6. Dispatch HVAC Technician: Schedule HVAC repair service + - Tools Used: create_trouble_ticket() + - Expected Outcomes: HVAC technician dispatched + - Flow: Proceed to step 7. + + 7. Verify Environmental Status: Confirm temperature and humidity acceptable + - Tools Used: verify_recovery() + - Expected Outcomes: Environment within limits; Temperature elevated but stable; Continued monitoring required + - Flow: Proceed to step end. 
+ + """, + + "env_water_intrusion_response": """ + Workflow: Water Intrusion Response + Domain: POWER | Problem Codes: ENV-003 | SLA: 30 minutes + Description: Water leak detection and emergency response + + 1. Verify Water Detection Alarm: Confirm water intrusion and location + - Tools Used: query_alarm() + - Expected Outcomes: Water detected - floor sensor; Water detected - ceiling sensor; Multiple sensors triggered + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Assess Equipment Risk: Identify equipment at risk from water damage + - Tools Used: query_resource_health() + - Expected Outcomes: Equipment clear of water; Equipment at risk; Equipment already affected + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Check for HVAC Condensation: Determine if HVAC is source of water + - Tools Used: query_power_system() + - Expected Outcomes: HVAC drain blocked; HVAC condensation normal; External water source + - Flow: Proceed to step 4. + - Note: This step is skippable if not applicable. + + 4. Initiate Emergency Protection: Isolate affected circuits if necessary + - Tools Used: execute_remote_action() + - Expected Outcomes: Circuits isolated for safety; No isolation needed; Isolation failed + - Flow: Proceed to step 5. + + 5. Emergency Dispatch: Dispatch technician for water mitigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Emergency dispatch initiated + - Flow: Proceed to step 6. + + 6. Monitor Site Status: Continue monitoring for further water ingress + - Tools Used: verify_recovery() + - Expected Outcomes: Situation contained; Water spreading; Equipment impact reported + - Flow: Proceed to step end. + + """, + + "power_ac_failure_recovery": """ + Workflow: AC Power Failure Recovery + Domain: POWER | Problem Codes: PWR-001 | SLA: 60 minutes + Description: Complete AC power failure response including UPS/generator verification + + 1. 
Verify AC Power Loss: Confirm AC mains failure on affected site
+ - Tools Used: query_alarm()
+ - Expected Outcomes: AC loss confirmed; Partial AC loss; Transient spike only
+ - Flow: If succeeds, go to step 2; if fails, go to step end.
+
+ 2. Check UPS Status: Verify UPS engaged and battery capacity
+ - Tools Used: query_power_system()
+ - Expected Outcomes: UPS active - load transferred; UPS battery low; UPS failed
+ - Flow: If succeeds, go to step 3; if fails, go to step 8.
+
+ 3. Verify Battery Runtime: Assess remaining battery runtime and load
+ - Tools Used: query_power_system()
+ - Expected Outcomes: Runtime >60 min; Runtime 30-60 min; Runtime <30 min - critical
+ - Flow: If succeeds, go to step 4; if fails, go to step 8.
+
+ 4. Initiate Generator Start: Trigger backup generator startup sequence
+ - Tools Used: execute_remote_action()
+ - Expected Outcomes: Generator started successfully; Generator start failed; Generator already running
+ - Flow: If succeeds, go to step 5; if fails, go to step 8.
+
+ 5. Monitor Generator Stabilization: Wait for generator to reach stable output
+ - Tools Used: query_power_system()
+ - Expected Outcomes: Generator stable - ready for transfer; Generator unstable; Generator fault
+ - Flow: If succeeds, go to step 6; if fails, go to step 8.
+
+ 6. Transfer Load to Generator: Execute automatic transfer switch to generator power
+ - Tools Used: execute_remote_action()
+ - Expected Outcomes: Load transferred to generator; Transfer failed; Partial transfer
+ - Flow: If succeeds, go to step 7; if fails, go to step 8.
+
+ 7. Verify Site Power Restored: Confirm all equipment powered and operational
+ - Tools Used: query_resource_health()
+ - Expected Outcomes: All systems powered; Partial restoration; Some equipment failed to restart
+ - Flow: If succeeds, go to step 9; if fails, go to step 8.
+
+ 8. 
Dispatch Field Technician: Create emergency dispatch for power restoration + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field dispatch created; Dispatch queued + - Flow: If succeeds, go to step 9; if fails, go to step 10. + + 9. Notify Utility Provider: Report outage to electric utility for restoration ETA + - Tools Used: query_external_factors() + - Expected Outcomes: Utility aware - ETA provided; Utility not reachable; Known area outage + - Flow: Proceed to step 10. + - Note: This step is skippable if not applicable. + + 10. Final Verification: Verify service continuity and power status + - Tools Used: verify_recovery() + - Expected Outcomes: Site operational on backup power; Site degraded; Service impact ongoing + - Flow: Proceed to step end. + + """, + + "power_battery_discharge_response": """ + Workflow: Battery Discharge Alert Response + Domain: POWER | Problem Codes: PWR-003 | SLA: 30 minutes + Description: Urgent battery discharge investigation and mitigation + + 1. Acknowledge Battery Alert: Verify battery discharge alarm and severity + - Tools Used: query_alarm() + - Expected Outcomes: Battery discharging confirmed; Low voltage warning; Critical discharge rate + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. Check Battery Status: Get current battery level and discharge rate + - Tools Used: query_power_system() + - Expected Outcomes: Battery >50% - stable; Battery 20-50% - monitoring; Battery <20% - critical + - Flow: If succeeds, go to step 3; if fails, go to step 8. + + 3. Verify AC Power Status: Check if AC mains power is present + - Tools Used: query_power_system() + - Expected Outcomes: AC present - charging expected; AC absent - discharge expected; AC unstable + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. 
Check Rectifier Charging: Verify rectifiers are charging batteries + - Tools Used: query_power_system() + - Expected Outcomes: Rectifiers charging batteries; Rectifier fault - not charging; Charge current low + - Flow: If succeeds, go to step 9; if fails, go to step 5. + + 5. Start Generator: Initiate backup generator for extended runtime + - Tools Used: execute_remote_action() + - Expected Outcomes: Generator started - charging resumed; Generator failed to start + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Monitor Battery Recovery: Track battery voltage recovery + - Tools Used: query_power_system() + - Expected Outcomes: Battery voltage rising; Battery stable; Battery continues discharge + - Flow: If succeeds, go to step 9; if fails, go to step 7. + + 7. Implement Load Shedding: Reduce site load to extend battery runtime + - Tools Used: execute_remote_action() + - Expected Outcomes: Non-critical loads shed; Load reduction limited + - Flow: Proceed to step 8. + + 8. Emergency Dispatch: Dispatch technician with portable generator + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Emergency dispatch initiated + - Flow: Proceed to step 9. + + 9. Verify Site Status: Confirm site operational status and power situation + - Tools Used: verify_recovery() + - Expected Outcomes: Site stable on backup; Site at risk; Controlled shutdown may be needed + - Flow: Proceed to step end. + + """, + + "power_dc_rectifier_recovery": """ + Workflow: DC Rectifier Failure Recovery + Domain: POWER | Problem Codes: PWR-002 | SLA: 120 minutes + Description: DC rectifier fault diagnosis and recovery + + 1. Confirm Rectifier Alarm: Verify rectifier failure alarm and identify affected unit + - Tools Used: query_alarm() + - Expected Outcomes: Rectifier fault confirmed; Intermittent fault; False alarm + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. 
Check DC Bus Voltage: Measure DC bus voltage and assess impact + - Tools Used: query_power_system() + - Expected Outcomes: DC voltage nominal with redundancy; DC voltage low; DC voltage critical + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Assess Rectifier Redundancy: Determine remaining rectifier capacity + - Tools Used: query_power_system() + - Expected Outcomes: N+1 redundancy maintained; Running at capacity; Below capacity threshold + - Flow: If succeeds, go to step 4; if fails, go to step 6. + + 4. Attempt Rectifier Reset: Power cycle affected rectifier module + - Tools Used: execute_remote_action() + - Expected Outcomes: Rectifier recovered; Reset failed; Rectifier hardware fault + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Verify Rectifier Recovery: Confirm rectifier back online and charging + - Tools Used: query_power_system() + - Expected Outcomes: Rectifier online - load sharing; Rectifier unstable; Rectifier failed + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Schedule Field Replacement: Create work order for rectifier replacement + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Work order created; Parts availability confirmed + - Flow: If succeeds, go to step 7; if fails, go to step 8. + + 7. Enable Battery Monitoring: Set up enhanced battery monitoring during degraded state + - Tools Used: apply_configuration() + - Expected Outcomes: Enhanced monitoring enabled; Configuration applied + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Final Status Check: Verify power system stability + - Tools Used: verify_recovery() + - Expected Outcomes: System stable with redundancy; Operating in degraded mode; Escalation required + - Flow: Proceed to step end. 
+ + """, + + "power_generator_failure_recovery": """ + Workflow: Generator Failure Recovery + Domain: POWER | Problem Codes: PWR-004 | SLA: 90 minutes + Description: Backup generator fault diagnosis and recovery + + 1. Verify Generator Alarm: Confirm generator failure or fault condition + - Tools Used: query_alarm() + - Expected Outcomes: Generator fault confirmed; Generator not starting; Generator running rough + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check Generator Status: Query generator controller for detailed status + - Tools Used: query_power_system() + - Expected Outcomes: Fuel level OK - electrical fault; Low fuel detected; Mechanical fault indicated + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Attempt Generator Reset: Reset generator controller and retry start + - Tools Used: execute_remote_action() + - Expected Outcomes: Generator started after reset; Reset did not resolve issue + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Monitor Generator Stability: Verify generator running stable + - Tools Used: query_power_system() + - Expected Outcomes: Generator stable; Generator unstable + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Check Fuel Level: Verify fuel tank level and consumption rate + - Tools Used: query_power_system() + - Expected Outcomes: Fuel adequate; Fuel low - refuel needed; Fuel system fault + - Flow: Proceed to step 6. + - Note: This step is skippable if not applicable. + + 6. Dispatch Generator Service: Create work order for generator repair/refuel + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Generator service dispatched; Fuel delivery scheduled + - Flow: Proceed to step 7. + + 7. Verify Backup Power Status: Confirm site power resilience status + - Tools Used: verify_recovery() + - Expected Outcomes: Generator operational; Site on UPS only - reduced backup; AC power restored + - Flow: Proceed to step end. 
+ + """, + + "ran_antenna_tilt_recovery": """ + Workflow: Antenna Tilt Fault Recovery + Domain: RAN | Problem Codes: RAN-009 | SLA: 180 minutes + Description: Remote electrical tilt (RET) fault recovery + + 1. Verify Tilt Alarm: Confirm RET fault or incorrect tilt + - Tools Used: query_alarm() + - Expected Outcomes: RET communication fault; Tilt mismatch; RET motor fault + - Flow: If succeeds, go to step 2; if fails, go to step 4. + + 2. Check RET Controller: Query RET controller status + - Tools Used: query_rf_status() + - Expected Outcomes: RET controller offline; RET reporting incorrect tilt; RET motor stuck + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Attempt RET Reset: Reset RET controller + - Tools Used: execute_remote_action() + - Expected Outcomes: RET recovered; RET still faulty + - Flow: If succeeds, go to step 5; if fails, go to step 4. + + 4. Schedule Field Service: Create work order for RET repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field service scheduled + - Flow: Proceed to step 5. + + 5. Verify Coverage Impact: Assess impact of incorrect tilt + - Tools Used: verify_recovery() + - Expected Outcomes: Tilt corrected; Coverage stable; Field repair needed + - Flow: Proceed to step end. + + """, + + "ran_backhaul_degradation_resolution": """ + Workflow: Backhaul Degradation Resolution + Domain: RAN | Problem Codes: RAN-011 | SLA: 90 minutes + Description: Backhaul performance degradation troubleshooting + + 1. Verify Backhaul Alarm: Confirm backhaul performance issue + - Tools Used: query_alarm() + - Expected Outcomes: Bandwidth reduced; High latency; Packet loss detected; Link errors + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Test Backhaul Connectivity: Run connectivity tests on backhaul link + - Tools Used: test_connectivity() + - Expected Outcomes: Latency elevated; Packet loss confirmed; Throughput reduced + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. 
Check Router Status: Query cell site router health + - Tools Used: query_resource_health() + - Expected Outcomes: Router healthy; Router interface errors; Router CPU high + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Clear Router Errors: Clear interface errors and reset counters + - Tools Used: execute_remote_action() + - Expected Outcomes: Errors cleared; Errors recurring + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Escalate to Transport: Create ticket for transport team + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Transport team notified + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Enable QoS Protection: Apply QoS policy to protect critical traffic + - Tools Used: apply_configuration() + - Expected Outcomes: QoS applied; QoS already active + - Flow: Proceed to step 7. + - Note: This step is skippable if not applicable. + + 7. Verify Service Impact: Monitor service quality and backhaul status + - Tools Used: verify_recovery() + - Expected Outcomes: Backhaul stable; Degradation continues; Transport investigating + - Flow: Proceed to step end. + + """, + + "ran_bbu_fault_recovery": """ + Workflow: BBU Fault Recovery + Domain: RAN | Problem Codes: RAN-012 | SLA: 60 minutes + Description: Baseband Unit hardware/software fault recovery + + 1. Verify BBU Alarm: Confirm BBU fault type + - Tools Used: query_alarm() + - Expected Outcomes: BBU board fault; BBU software crash; BBU overheating; BBU power fault + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check BBU Health: Query detailed BBU status + - Tools Used: query_resource_health() + - Expected Outcomes: Processing board fault; Main control fault; Fan fault; Multiple boards affected + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. 
Check BBU Logs: Review BBU system logs for root cause + - Tools Used: inspect_logs() + - Expected Outcomes: Software crash identified; Hardware error logged; Temperature alarm logged + - Flow: Proceed to step 4. + + 4. Attempt BBU Reset: Reset BBU or affected board + - Tools Used: execute_remote_action() + - Expected Outcomes: BBU recovering; BBU reset failed; Partial recovery + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Verify Cell Recovery: Confirm cells coming back online + - Tools Used: query_rf_status() + - Expected Outcomes: All cells recovered; Some cells recovered; Cells still down + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Activate Redundant Board: Switch to redundant BBU board if available + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; No redundancy available + - Flow: If succeeds, go to step 8; if fails, go to step 7. + - Note: This step is skippable if not applicable. + + 7. Dispatch Field Technician: Create emergency dispatch for BBU repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field dispatch created + - Flow: Proceed to step 8. + + 8. Verify Site Status: Confirm site operational status + - Tools Used: verify_recovery() + - Expected Outcomes: Site recovered; Site degraded; Site down - field en route + - Flow: Proceed to step end. + + """, + + "ran_carrier_aggregation_recovery": """ + Workflow: Carrier Aggregation Failure Recovery + Domain: RAN | Problem Codes: RAN-017 | SLA: 180 minutes + Description: CA activation failure troubleshooting + + 1. Verify CA Failure: Confirm carrier aggregation not activating + - Tools Used: query_alarm() + - Expected Outcomes: CA not activating; SCell addition failure; CA throughput low + - Flow: If succeeds, go to step 2; if fails, go to step 4. + + 2. 
Check SCell Status: Verify secondary carrier health + - Tools Used: query_rf_status() + - Expected Outcomes: SCell healthy; SCell degraded; SCell configuration error + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Review CA Configuration: Verify CA parameters and SCell mapping + - Tools Used: run_diagnostics() + - Expected Outcomes: Configuration correct; SCell not mapped; CA threshold issue + - Flow: If succeeds, go to step 5; if fails, go to step 4. + + 4. Apply CA Configuration Fix: Correct CA configuration issues + - Tools Used: apply_configuration() + - Expected Outcomes: CA configuration corrected; Configuration change applied + - Flow: Proceed to step 5. + + 5. Verify CA Performance: Monitor CA activation rate + - Tools Used: verify_recovery() + - Expected Outcomes: CA functioning; CA improved; CA issue persists + - Flow: Proceed to step end. + + """, + + "ran_cell_congestion_management": """ + Workflow: Cell Congestion Management + Domain: RAN | Problem Codes: RAN-005 | SLA: 60 minutes + Description: High traffic cell congestion mitigation + + 1. Verify Congestion Alert: Confirm cell congestion condition + - Tools Used: query_alarm() + - Expected Outcomes: PRB utilization high; RRC connection high; User count threshold exceeded + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Analyze Traffic Pattern: Review traffic load and patterns + - Tools Used: query_performance() + - Expected Outcomes: Peak hour congestion; Event-driven congestion; Sustained high load + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check Carrier Utilization: Assess load distribution across carriers + - Tools Used: query_resource_health() + - Expected Outcomes: Load imbalance detected; All carriers loaded; CA capable devices low + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. 
Apply Load Balancing: Enable or adjust inter-frequency load balancing + - Tools Used: apply_configuration() + - Expected Outcomes: Load balancing optimized; Limited improvement; Load redistribution active + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Adjust Handover Parameters: Modify HO thresholds to offload traffic + - Tools Used: apply_configuration() + - Expected Outcomes: HO parameters adjusted; Neighbor offload increased + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Request Capacity Expansion: Create capacity planning request + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Capacity request submitted + - Flow: If succeeds, go to step 7; if fails, go to step 8. + + 7. Enable Cell Barring: Apply selective cell barring if critical + - Tools Used: apply_configuration() + - Expected Outcomes: Cell barring applied; Barring not needed + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Verify Congestion Status: Monitor traffic and congestion levels + - Tools Used: verify_recovery() + - Expected Outcomes: Congestion relieved; Congestion improved; Congestion ongoing + - Flow: Proceed to step end. + + """, + + "ran_cell_overshooting_correction": """ + Workflow: Cell Overshooting Correction + Domain: RAN | Problem Codes: RAN-015 | SLA: 240 minutes + Description: Cell coverage overshooting mitigation + + 1. Verify Overshoot Detection: Confirm cell overshooting condition + - Tools Used: query_alarm() + - Expected Outcomes: Overshooting detected via TA; Neighbor anomaly detected; UE distribution abnormal + - Flow: If succeeds, go to step 2; if fails, go to step 4. + + 2. Analyze Coverage Pattern: Review timing advance and UE distribution + - Tools Used: query_performance() + - Expected Outcomes: High TA distribution; Abnormal cell selection; Coverage confirmed + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. 
Adjust Tilt/Power: Reduce coverage via tilt or power adjustment + - Tools Used: apply_configuration() + - Expected Outcomes: Tilt increased; Power reduced; Configuration applied + - Flow: If succeeds, go to step 5; if fails, go to step 4. + + 4. Request RF Optimization: Schedule drive test and optimization + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Optimization scheduled + - Flow: Proceed to step 5. + + 5. Monitor Coverage Impact: Track coverage changes + - Tools Used: verify_recovery() + - Expected Outcomes: Coverage normalized; Improvement observed; Further adjustment needed + - Flow: Proceed to step end. + + """, + + "ran_cell_site_down_recovery": """ + Workflow: Cell Site Down Recovery + Domain: RAN | Problem Codes: RAN-001 | SLA: 30 minutes + Description: Complete cell site outage investigation and recovery + + 1. Verify Site Outage: Confirm cell site not responding to network + - Tools Used: query_alarm() + - Expected Outcomes: Site unreachable - confirmed; Partial connectivity; Intermittent response + - Flow: If succeeds, go to step 2; if fails, go to step 11. + + 2. Check Backhaul Status: Verify transport link to site + - Tools Used: test_connectivity() + - Expected Outcomes: Backhaul up - local issue; Backhaul down; Intermittent backhaul + - Flow: If succeeds, go to step 3; if fails, go to step 8. + + 3. Check Power Status: Verify site power via remote monitoring + - Tools Used: query_power_system() + - Expected Outcomes: Power normal; Power fault detected; Unable to query - site offline + - Flow: If succeeds, go to step 4; if fails, go to step 7. + + 4. Attempt Remote Ping: Test IP connectivity to site equipment + - Tools Used: test_connectivity() + - Expected Outcomes: eNodeB responding; BBU responding only; No response + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. 
Check eNodeB Status: Query base station operational status + - Tools Used: query_resource_health() + - Expected Outcomes: eNodeB fault detected; Software crash; Hardware alarm + - Flow: If succeeds, go to step 6; if fails, go to step 11. + + 6. Attempt eNodeB Reset: Remote reset of base station + - Tools Used: execute_remote_action() + - Expected Outcomes: eNodeB recovering; Reset initiated - waiting; Reset failed + - Flow: If succeeds, go to step 9; if fails, go to step 11. + + 7. Initiate Power Recovery: Address power issue via power workflow + - Tools Used: execute_remote_action() + - Expected Outcomes: Power recovery in progress; Requires field visit + - Flow: If succeeds, go to step 9; if fails, go to step 11. + + 8. Contact Transport Team: Escalate backhaul issue to transport + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Transport team notified; Cross-functional escalation + - Flow: Proceed to step 11. + + 9. Monitor Site Recovery: Wait for site to come back online + - Tools Used: query_resource_health() + - Expected Outcomes: Site online; Site partially recovered; Site still down + - Flow: If succeeds, go to step 10; if fails, go to step 11. + + 10. Verify Cell Sectors: Confirm all sectors operational + - Tools Used: query_rf_status() + - Expected Outcomes: All sectors up; Some sectors degraded; Sector failures + - Flow: If succeeds, go to step 12; if fails, go to step 11. + + 11. Dispatch Field Technician: Create emergency dispatch to site + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field dispatch created + - Flow: Proceed to step 12. + + 12. Final Status Verification: Confirm site status and coverage impact + - Tools Used: verify_recovery() + - Expected Outcomes: Site fully operational; Site degraded; Site still down - field en route + - Flow: Proceed to step end. 
+ + """, + + "ran_coverage_complaint_resolution": """ + Workflow: Coverage Complaint Resolution + Domain: RAN | Problem Codes: SVC-001 | SLA: 240 minutes + Description: Customer coverage complaint investigation + + 1. Review Complaint Details: Analyze customer complaint location and symptoms + - Tools Used: query_external_factors() + - Expected Outcomes: Indoor coverage issue; Outdoor weak signal; Specific area affected + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check Serving Cell Health: Verify cells serving complaint area + - Tools Used: query_resource_health() + - Expected Outcomes: Serving cells healthy; Cell degraded; Coverage gap identified + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Analyze Coverage Data: Review coverage prediction and drive test data + - Tools Used: query_performance() + - Expected Outcomes: Coverage adequate per design; Coverage hole confirmed; Terrain obstruction + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Apply Coverage Adjustment: Adjust tilt/power if beneficial + - Tools Used: apply_configuration() + - Expected Outcomes: Coverage adjustment applied; Limited improvement possible + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Schedule Site Survey: Request field survey of complaint area + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Site survey scheduled + - Flow: Proceed to step 6. + + 6. Document Resolution: Record findings and actions taken + - Tools Used: verify_recovery() + - Expected Outcomes: Issue resolved; Investigation ongoing; Customer notified + - Flow: Proceed to step end. + + """, + + "ran_device_issue_resolution": """ + Workflow: Subscriber Device Issue Resolution + Domain: RAN | Problem Codes: SVC-005 | SLA: 120 minutes + Description: Device compatibility or registration issue + + 1. 
Review Device Issue: Analyze device-related complaint + - Tools Used: query_external_factors() + - Expected Outcomes: Registration failure; Device compatibility issue; SIM issue suspected + - Flow: If succeeds, go to step 2; if fails, go to step 3. + + 2. Check Network Registration: Verify device registration status + - Tools Used: verify_signaling_path() + - Expected Outcomes: Registration successful; Registration rejected; Attach failure + - Flow: If succeeds, go to step 4; if fails, go to step 3. + + 3. Escalate to Core Team: Create ticket for subscriber investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Core team notified + - Flow: Proceed to step 4. + + 4. Document Resolution: Record device issue findings + - Tools Used: verify_recovery() + - Expected Outcomes: Issue resolved; Investigation ongoing; Customer advised + - Flow: Proceed to step end. + + """, + + "ran_dropped_calls_resolution": """ + Workflow: Dropped Calls Complaint Resolution + Domain: RAN | Problem Codes: SVC-003 | SLA: 180 minutes + Description: Customer dropped call complaint investigation + + 1. Review Complaint Details: Analyze dropped call complaint + - Tools Used: query_external_factors() + - Expected Outcomes: Dropped during mobility; Dropped stationary; Specific location + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check Call Drop KPIs: Review cell drop rate metrics + - Tools Used: query_performance() + - Expected Outcomes: Drop rate elevated; Drop rate normal; Specific cause identified + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Analyze Drop Causes: Review RRC release reasons + - Tools Used: run_diagnostics() + - Expected Outcomes: Radio link failure; Handover failure; Resource exhaustion + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. 
Check Serving Cell Health: Verify cell RF health + - Tools Used: query_rf_status() + - Expected Outcomes: RF healthy; RF issues detected; Interference present + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Apply Drop Mitigation: Adjust parameters to reduce drops + - Tools Used: apply_configuration() + - Expected Outcomes: Mitigation applied; Parameter adjustment made + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Request RF Investigation: Create ticket for RF optimization + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Investigation ticket created + - Flow: Proceed to step 7. + + 7. Document Resolution: Record findings and actions + - Tools Used: verify_recovery() + - Expected Outcomes: Issue addressed; Investigation ongoing; Customer notified + - Flow: Proceed to step end. + + """, + + "ran_gps_sync_recovery": """ + Workflow: GPS Synchronization Recovery + Domain: RAN | Problem Codes: RAN-010 | SLA: 60 minutes + Description: GPS timing and synchronization loss recovery + + 1. Verify GPS Alarm: Confirm GPS sync loss + - Tools Used: query_alarm() + - Expected Outcomes: GPS signal lost; GPS antenna fault; GPS receiver fault; Running on holdover + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check GPS Receiver Status: Query GPS module status and satellite count + - Tools Used: query_resource_health() + - Expected Outcomes: No satellites visible; Low satellite count; GPS receiver fault + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Holdover Status: Verify site operating on holdover clock + - Tools Used: query_resource_health() + - Expected Outcomes: Holdover active - stable; Holdover degrading; Holdover exceeded + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. 
Attempt GPS Reset: Reset GPS receiver module + - Tools Used: execute_remote_action() + - Expected Outcomes: GPS recovering; GPS reacquisition in progress; GPS fault persists + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Enable Backup Sync Source: Switch to alternate timing source if available + - Tools Used: apply_configuration() + - Expected Outcomes: Backup sync enabled; No backup available; Sync source switched + - Flow: Proceed to step 6. + + 6. Schedule Field Inspection: Create work order for GPS system inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: GPS inspection scheduled + - Flow: Proceed to step 7. + + 7. Verify Sync Status: Monitor timing and sync recovery + - Tools Used: verify_recovery() + - Expected Outcomes: GPS recovered; Running on backup sync; Holdover mode - field required + - Flow: Proceed to step end. + + """, + + "ran_handover_failure_resolution": """ + Workflow: Handover Failure Resolution + Domain: RAN | Problem Codes: RAN-006 | SLA: 120 minutes + Description: High handover failure rate troubleshooting + + 1. Verify HO Failure Alarm: Confirm high handover failure rate + - Tools Used: query_alarm() + - Expected Outcomes: HO failure rate elevated; HO timeout failures; HO preparation failures + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Analyze HO Statistics: Review handover success rates and failure types + - Tools Used: query_performance() + - Expected Outcomes: Specific neighbor high failures; All neighbors affected; X2 failures high + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check Neighbor Relations: Verify neighbor cell configurations + - Tools Used: query_topology() + - Expected Outcomes: Missing neighbor detected; Neighbor config mismatch; Neighbors correct + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. 
Check Target Cell Health: Verify target cells operational + - Tools Used: query_resource_health() + - Expected Outcomes: Target cells healthy; Target cell degraded; Target cell congested + - Flow: Proceed to step 5. + + 5. Adjust HO Parameters: Modify handover thresholds and timers + - Tools Used: apply_configuration() + - Expected Outcomes: HO parameters optimized; Parameter adjustment applied; Limited improvement expected + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Check X2 Interface: Verify X2 connectivity between eNodeBs + - Tools Used: verify_signaling_path() + - Expected Outcomes: X2 link healthy; X2 congestion; X2 failures detected + - Flow: Proceed to step 7. + + 7. Request RF Optimization: Create work order for drive test optimization + - Tools Used: create_trouble_ticket() + - Expected Outcomes: RF optimization scheduled + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Monitor HO Success Rate: Track handover performance improvement + - Tools Used: verify_recovery() + - Expected Outcomes: HO success rate improved; HO performance stable; Further optimization needed + - Flow: Proceed to step end. + + """, + + "ran_high_rtwp_resolution": """ + Workflow: High RTWP Resolution + Domain: RAN | Problem Codes: RAN-004 | SLA: 90 minutes + Description: Received Total Wideband Power issue resolution + + 1. Verify RTWP Alarm: Confirm high RTWP on affected carrier + - Tools Used: query_alarm() + - Expected Outcomes: RTWP elevated; RTWP critical; Intermittent RTWP spikes + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check RF Path: Inspect RF path for faults + - Tools Used: query_rf_status() + - Expected Outcomes: RF path normal; VSWR alarm; Connector fault suspected + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. 
Analyze Noise Floor: Check for elevated noise floor + - Tools Used: query_performance() + - Expected Outcomes: Noise floor elevated; Normal noise floor; External interference indicated + - Flow: Proceed to step 4. + + 4. Attempt RRU Power Adjustment: Adjust receive path gain + - Tools Used: apply_configuration() + - Expected Outcomes: RTWP improved; Adjustment limited; No improvement + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Schedule Site Inspection: Create work order for RF path inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Site inspection scheduled + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Enable RTWP Monitoring: Set up enhanced RTWP monitoring + - Tools Used: apply_configuration() + - Expected Outcomes: Enhanced monitoring enabled + - Flow: Proceed to step 7. + - Note: This step is skippable if not applicable. + + 7. Verify RTWP Status: Monitor RTWP trend + - Tools Used: verify_recovery() + - Expected Outcomes: RTWP normalized; RTWP improved; RTWP elevated - investigation ongoing + - Flow: Proceed to step end. + + """, + + "ran_interference_mitigation": """ + Workflow: RF Interference Mitigation + Domain: RAN | Problem Codes: RAN-003 | SLA: 120 minutes + Description: Detect and mitigate RF interference + + 1. Verify Interference Alarm: Confirm interference detected on sector + - Tools Used: query_alarm() + - Expected Outcomes: UL interference confirmed; DL interference suspected; Intermittent interference + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Analyze Interference Pattern: Review interference level and frequency + - Tools Used: query_rf_status() + - Expected Outcomes: Constant interference; Periodic interference; Wideband interference + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. 
Check Neighbor Cell Status: Verify neighbor cells not causing interference + - Tools Used: query_resource_health() + - Expected Outcomes: Neighbors normal; Neighbor misconfigured; Neighbor overshoot detected + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Check for External Interference: Assess external interference sources + - Tools Used: query_external_factors() + - Expected Outcomes: External source suspected; No external source found; Radar interference detected + - Flow: Proceed to step 5. + + 5. Apply Interference Mitigation: Adjust parameters to reduce interference impact + - Tools Used: apply_configuration() + - Expected Outcomes: Mitigation applied; Limited improvement; Interference persists + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Request Spectrum Analysis: Schedule field spectrum analysis + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Spectrum analysis scheduled + - Flow: If succeeds, go to step 7; if fails, go to step 8. + + 7. Report to Spectrum Management: File interference report if external source + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Interference report filed + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Verify Interference Status: Monitor interference levels + - Tools Used: verify_recovery() + - Expected Outcomes: Interference reduced; Interference stable; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "ran_maintenance_window_execution": """ + Workflow: Maintenance Window Execution + Domain: RAN | Problem Codes: SVC-004 | SLA: 60 minutes + Description: Planned maintenance window monitoring + + 1. Verify Maintenance Start: Confirm maintenance window active + - Tools Used: query_external_factors() + - Expected Outcomes: Maintenance window active; Window scheduled; Window completed + - Flow: If succeeds, go to step 2; if fails, go to step 3. + + 2. 
Monitor Service Impact: Track service during maintenance
       - Tools Used: query_resource_health()
       - Expected Outcomes: Impact within expected limits; Impact higher than expected; No impact
       - Flow: Proceed to step 3.

    3. Verify Post-Maintenance: Confirm service restored post-maintenance
       - Tools Used: verify_recovery()
       - Expected Outcomes: Service restored; Partial restoration; Issues detected
       - Flow: Proceed to step end.

    """,

    "ran_massive_mimo_recovery": """
    Workflow: Massive MIMO Fault Recovery
    Domain: RAN | Problem Codes: RAN-020 | SLA: 120 minutes
    Description: Massive MIMO antenna array fault recovery

    1. Verify MIMO Alarm: Confirm massive MIMO fault
       - Tools Used: query_alarm()
       - Expected Outcomes: Antenna element fault; Beamforming degraded; TRX chain failure
       - Flow: If succeeds, go to step 2; if fails, go to step 5.

    2. Check Antenna Status: Query antenna array health
       - Tools Used: query_rf_status()
       - Expected Outcomes: Multiple elements failed; Single element fault; Calibration error
       - Flow: If succeeds, go to step 3; if fails, go to step 5.

    3. Attempt Array Recalibration: Trigger antenna array recalibration
       - Tools Used: execute_remote_action()
       - Expected Outcomes: Recalibration successful; Recalibration failed; Hardware fault persists
       - Flow: If succeeds, go to step 4; if fails, go to step 5.

    4. Verify Beamforming: Check beamforming functionality restored
       - Tools Used: query_performance()
       - Expected Outcomes: Beamforming restored; Degraded operation; Beamforming disabled
       - Flow: If succeeds, go to step 6; if fails, go to step 5.

    5. Schedule Field Service: Create work order for MIMO repair
       - Tools Used: create_trouble_ticket()
       - Expected Outcomes: Field service scheduled
       - Flow: Proceed to step 6.

    6. 
Verify Capacity Impact: Monitor capacity and throughput + - Tools Used: verify_recovery() + - Expected Outcomes: MIMO recovered; Operating in degraded mode; Field repair required + - Flow: Proceed to step end. + + """, + + "ran_parameter_correction": """ + Workflow: Parameter Misconfiguration Correction + Domain: RAN | Problem Codes: RAN-013 | SLA: 120 minutes + Description: Cell parameter misconfiguration identification and correction + + 1. Verify Configuration Alert: Confirm parameter misconfiguration detected + - Tools Used: query_alarm() + - Expected Outcomes: Parameter audit mismatch; KPI degradation detected; Configuration drift + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Run Configuration Audit: Compare current config against baseline + - Tools Used: run_diagnostics() + - Expected Outcomes: Deviations identified; Multiple parameters affected; Minor drift only + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Assess Impact: Determine performance impact of misconfiguration + - Tools Used: query_performance() + - Expected Outcomes: Significant KPI impact; Minor impact; No measurable impact + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Apply Corrective Configuration: Restore parameters to baseline + - Tools Used: apply_configuration() + - Expected Outcomes: Parameters corrected; Partial correction; Correction requires review + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Escalate to Engineering: Request engineering review of configuration + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Engineering review requested + - Flow: Proceed to step 6. + + 6. Verify Performance: Monitor KPI improvement + - Tools Used: verify_recovery() + - Expected Outcomes: KPIs improved; KPIs stable; Continued monitoring needed + - Flow: Proceed to step end. 
+ + """, + + "ran_pim_interference_resolution": """ + Workflow: PIM Interference Resolution + Domain: RAN | Problem Codes: RAN-016 | SLA: 180 minutes + Description: Passive Intermodulation interference mitigation + + 1. Verify PIM Detection: Confirm PIM interference suspected + - Tools Used: query_alarm() + - Expected Outcomes: PIM pattern detected; Interference correlated with TX; RTWP pattern suspicious + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Analyze Interference Pattern: Correlate interference with TX power + - Tools Used: query_rf_status() + - Expected Outcomes: PIM confirmed - TX correlated; Pattern inconclusive; External interference + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Recent Work History: Review recent antenna work on site + - Tools Used: query_external_factors() + - Expected Outcomes: Recent work performed; No recent changes + - Flow: Proceed to step 4. + - Note: This step is skippable if not applicable. + + 4. Temporary TX Power Reduction: Reduce TX power to confirm PIM reduction + - Tools Used: apply_configuration() + - Expected Outcomes: PIM reduced with lower power; No change observed + - Flow: Proceed to step 5. + + 5. Schedule PIM Hunt: Create work order for PIM source identification + - Tools Used: create_trouble_ticket() + - Expected Outcomes: PIM hunt scheduled + - Flow: Proceed to step 6. + + 6. Verify Mitigation Status: Monitor interference levels + - Tools Used: verify_recovery() + - Expected Outcomes: PIM mitigated; Operating at reduced power; Field investigation required + - Flow: Proceed to step end. + + """, + + "ran_prb_availability_resolution": """ + Workflow: Low PRB Availability Resolution + Domain: RAN | Problem Codes: RAN-018 | SLA: 60 minutes + Description: Physical Resource Block shortage mitigation + + 1. 
Verify PRB Alert: Confirm low PRB availability + - Tools Used: query_alarm() + - Expected Outcomes: PRB utilization critical; PRB shortage detected; DL/UL imbalance + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Analyze PRB Usage: Review PRB utilization by traffic type + - Tools Used: query_performance() + - Expected Outcomes: High VoLTE load; Data traffic heavy; Control channel heavy + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Apply PRB Optimization: Adjust PRB allocation and scheduling + - Tools Used: apply_configuration() + - Expected Outcomes: Scheduling optimized; PRB allocation adjusted + - Flow: If succeeds, go to step 6; if fails, go to step 4. + + 4. Enable Load Balancing: Activate inter-frequency load balancing + - Tools Used: apply_configuration() + - Expected Outcomes: Load balancing active; Traffic offloaded + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Request Capacity Addition: Submit capacity expansion request + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Capacity request submitted + - Flow: Proceed to step 6. + + 6. Verify PRB Status: Monitor PRB availability improvement + - Tools Used: verify_recovery() + - Expected Outcomes: PRB availability improved; PRB stable; Capacity addition required + - Flow: Proceed to step end. + + """, + + "ran_rru_communication_recovery": """ + Workflow: RRU Communication Recovery + Domain: RAN | Problem Codes: RAN-008 | SLA: 60 minutes + Description: Remote Radio Unit communication failure recovery + + 1. Verify RRU Alarm: Confirm RRU communication loss + - Tools Used: query_alarm() + - Expected Outcomes: RRU link down; RRU timeout; Multiple RRUs affected + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. 
Check CPRI/Fiber Link: Verify CPRI or eCPRI link status + - Tools Used: test_connectivity() + - Expected Outcomes: CPRI link down; CPRI errors high; Fiber fault indicated + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check BBU Status: Verify BBU board health + - Tools Used: query_resource_health() + - Expected Outcomes: BBU healthy; BBU interface fault; BBU software error + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Attempt RRU Reset: Remote reset of RRU + - Tools Used: execute_remote_action() + - Expected Outcomes: RRU recovering; RRU not responding; Reset command failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Reset BBU Interface: Reset BBU CPRI interface board + - Tools Used: execute_remote_action() + - Expected Outcomes: Interface recovered; Interface fault persists + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Check RRU Power: Verify RRU has power + - Tools Used: query_power_system() + - Expected Outcomes: RRU power OK; RRU power fault; Cannot verify remotely + - Flow: Proceed to step 7. + + 7. Dispatch Field Technician: Create dispatch for RRU repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field dispatch created + - Flow: Proceed to step 8. + + 8. Verify Sector Status: Confirm affected sectors operational status + - Tools Used: verify_recovery() + - Expected Outcomes: RRU recovered; Sector down - field en route; Partial recovery + - Flow: Proceed to step end. + + """, + + "ran_sector_outage_recovery": """ + Workflow: Sector Outage Recovery + Domain: RAN | Problem Codes: RAN-002 | SLA: 60 minutes + Description: Individual cell sector failure recovery + + 1. Verify Sector Alarm: Confirm sector outage and identify affected sector + - Tools Used: query_alarm() + - Expected Outcomes: Sector alpha down; Sector beta down; Sector gamma down; Multiple sectors + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. 
Check RF Chain Status: Query RF module and antenna status + - Tools Used: query_rf_status() + - Expected Outcomes: PA fault detected; Antenna VSWR alarm; RF path normal; RRU fault + - Flow: If succeeds, go to step 3; if fails, go to step 7. + + 3. Check RRU Status: Query remote radio unit health + - Tools Used: query_resource_health() + - Expected Outcomes: RRU online; RRU fault; RRU communication lost + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Attempt Sector Reset: Reset RF path for affected sector + - Tools Used: execute_remote_action() + - Expected Outcomes: Sector recovering; Reset failed; Hardware fault persists + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Attempt RRU Reset: Power cycle remote radio unit + - Tools Used: execute_remote_action() + - Expected Outcomes: RRU recovering; RRU failed to restart; Hardware replacement needed + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Verify Sector Recovery: Confirm sector back on air + - Tools Used: query_rf_status() + - Expected Outcomes: Sector transmitting; Reduced power; Sector still down + - Flow: If succeeds, go to step 9; if fails, go to step 7. + + 7. Enable Capacity Compensation: Adjust neighboring sectors to compensate + - Tools Used: apply_configuration() + - Expected Outcomes: Neighbor compensation applied; Limited compensation possible + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Schedule Field Repair: Create work order for hardware repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field repair scheduled + - Flow: Proceed to step 9. + + 9. Verify Coverage Impact: Assess coverage gap and customer impact + - Tools Used: verify_recovery() + - Expected Outcomes: Sector recovered; Coverage compensated; Coverage gap exists + - Flow: Proceed to step end. 
+ + """, + + "ran_software_upgrade_recovery": """ + Workflow: Software Upgrade Failure Recovery + Domain: RAN | Problem Codes: RAN-014 | SLA: 60 minutes + Description: Failed software upgrade rollback and recovery + + 1. Verify Upgrade Failure: Confirm software upgrade failed + - Tools Used: query_alarm() + - Expected Outcomes: Upgrade failed mid-process; Version mismatch; Boot failure post-upgrade + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check System Status: Query node operational state + - Tools Used: query_resource_health() + - Expected Outcomes: System partially operational; System in degraded state; System down + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Review Upgrade Logs: Analyze upgrade failure logs + - Tools Used: inspect_logs() + - Expected Outcomes: Database migration failed; File system error; Memory exhaustion + - Flow: Proceed to step 4. + + 4. Initiate Rollback: Roll back to previous software version + - Tools Used: execute_remote_action() + - Expected Outcomes: Rollback successful; Rollback in progress; Rollback failed + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Attempt Emergency Boot: Boot from recovery image + - Tools Used: execute_remote_action() + - Expected Outcomes: Recovery boot successful; Recovery failed + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Verify System Recovery: Confirm system operational post-rollback + - Tools Used: query_resource_health() + - Expected Outcomes: System recovered; System degraded; Issues persist + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Dispatch Field Support: Create emergency dispatch for manual recovery + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field dispatch created + - Flow: Proceed to step 8. + + 8. 
Document and Report: Record failure for vendor escalation + - Tools Used: verify_recovery() + - Expected Outcomes: System recovered - incident documented; Recovery ongoing; Vendor escalation needed + - Flow: Proceed to step end. + + """, + + "ran_speed_complaint_resolution": """ + Workflow: Speed Complaint Resolution + Domain: RAN | Problem Codes: SVC-002 | SLA: 240 minutes + Description: Customer throughput complaint investigation + + 1. Review Complaint Details: Analyze customer speed complaint + - Tools Used: query_external_factors() + - Expected Outcomes: Low speed reported; Intermittent speed issues; Specific times affected + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check Cell Performance: Review cell throughput and utilization + - Tools Used: query_performance() + - Expected Outcomes: Cell throughput normal; Cell congested; Backhaul limited + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Device Capability: Verify device supports available features + - Tools Used: query_resource_health() + - Expected Outcomes: Device CA capable; Device limited; Device category low + - Flow: Proceed to step 4. + - Note: This step is skippable if not applicable. + + 4. Apply Throughput Optimization: Adjust parameters for throughput + - Tools Used: apply_configuration() + - Expected Outcomes: Optimization applied; Limited improvement expected + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Request Detailed Analysis: Create ticket for performance analysis + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Analysis ticket created + - Flow: Proceed to step 6. + + 6. Document Resolution: Record findings and recommendations + - Tools Used: verify_recovery() + - Expected Outcomes: Issue addressed; Investigation ongoing; Customer notified + - Flow: Proceed to step end. 
+ + """, + + "ran_voice_quality_resolution": """ + Workflow: Voice Quality Degradation Resolution + Domain: RAN | Problem Codes: RAN-019 | SLA: 90 minutes + Description: VoLTE/VoNR quality issue troubleshooting + + 1. Verify Voice Quality Alert: Confirm voice quality degradation + - Tools Used: query_alarm() + - Expected Outcomes: MOS score degraded; Call drop rate elevated; Voice jitter detected + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Analyze Voice KPIs: Review VoLTE performance metrics + - Tools Used: query_performance() + - Expected Outcomes: DL bearer quality low; UL bearer quality low; SRVCC failures high + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Check QCI 1 Bearer: Verify voice bearer configuration + - Tools Used: run_diagnostics() + - Expected Outcomes: Bearer configuration OK; QCI 1 issues detected; GBR allocation problem + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Verify IMS Connectivity: Check IMS platform connectivity + - Tools Used: verify_signaling_path() + - Expected Outcomes: IMS connectivity OK; IMS latency elevated; SIP failures detected + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Apply Voice Optimization: Adjust VoLTE parameters + - Tools Used: apply_configuration() + - Expected Outcomes: Voice parameters optimized; Configuration applied + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate to Voice Team: Create ticket for voice engineering + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Voice team notified + - Flow: Proceed to step 7. + + 7. Verify Voice Quality: Monitor voice KPI improvement + - Tools Used: verify_recovery() + - Expected Outcomes: Voice quality improved; Voice quality stable; Investigation ongoing + - Flow: Proceed to step end. 
+ + """, + + "ran_vswr_alarm_resolution": """ + Workflow: VSWR Alarm Resolution + Domain: RAN | Problem Codes: RAN-007 | SLA: 90 minutes + Description: Voltage Standing Wave Ratio fault resolution + + 1. Verify VSWR Alarm: Confirm VSWR threshold exceeded + - Tools Used: query_alarm() + - Expected Outcomes: VSWR warning; VSWR critical; VSWR intermittent + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check RF Path Status: Get detailed RF path measurements + - Tools Used: query_rf_status() + - Expected Outcomes: VSWR elevated on antenna; VSWR elevated on feeder; VSWR on jumper + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Check Recent Work History: Verify recent site work that may have caused issue + - Tools Used: query_external_factors() + - Expected Outcomes: Recent antenna work performed; Recent weather event; No recent changes + - Flow: Proceed to step 4. + - Note: This step is skippable if not applicable. + + 4. Reduce TX Power: Lower transmit power to protect equipment + - Tools Used: apply_configuration() + - Expected Outcomes: TX power reduced; Power already at minimum; PA protection activated + - Flow: Proceed to step 5. + + 5. Schedule Field Inspection: Create work order for antenna system inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Antenna inspection scheduled + - Flow: Proceed to step 6. + + 6. Verify Sector Impact: Assess coverage and capacity impact + - Tools Used: verify_recovery() + - Expected Outcomes: Sector operational at reduced power; Sector significantly degraded; Field repair required + - Flow: Proceed to step end. + + """, + + "signaling_delay_resolution": """ + Workflow: Signaling Delay Resolution + Domain: SIGNALING | Problem Codes: SIG-009 | SLA: 60 minutes + Description: Signaling path latency troubleshooting + + 1. 
Verify Delay Alarm: Confirm signaling delay threshold exceeded + - Tools Used: query_alarm() + - Expected Outcomes: Response time high; Timeout rate elevated; Procedure delay + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Measure Signaling Latency: Test signaling path delay + - Tools Used: verify_signaling_path() + - Expected Outcomes: High latency confirmed; Variable latency; Network path delay + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Processing Load: Verify node processing capacity + - Tools Used: query_resource_health() + - Expected Outcomes: Processing OK; CPU high; Queue delay + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Optimize Signaling Path: Adjust routing or parameters to reduce delay + - Tools Used: apply_configuration() + - Expected Outcomes: Path optimized; Timer adjusted; Limited improvement + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Request Network Review: Create ticket for transport path analysis + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Network review scheduled + - Flow: Proceed to step 6. + + 6. Verify Delay Status: Monitor signaling latency + - Tools Used: verify_recovery() + - Expected Outcomes: Latency improved; Latency stable; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "signaling_diameter_recovery": """ + Workflow: Diameter Peer Recovery + Domain: SIGNALING | Problem Codes: SIG-002 | SLA: 15 minutes + Description: Diameter signaling peer failure recovery + + 1. Verify Diameter Alarm: Confirm Diameter peer connection lost + - Tools Used: query_alarm() + - Expected Outcomes: Diameter peer down; DWR timeout; Transport failure + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. 
Check Diameter Stack: Verify Diameter agent status + - Tools Used: query_resource_health() + - Expected Outcomes: Local stack healthy; Stack overloaded; Stack fault + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Test Peer Connectivity: Verify IP connectivity to Diameter peer + - Tools Used: test_connectivity() + - Expected Outcomes: Peer reachable; Peer unreachable; Port blocked + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Attempt Diameter Reconnect: Force reconnection to Diameter peer + - Tools Used: execute_remote_action() + - Expected Outcomes: Peer reconnecting; CER/CEA exchange started; Reconnect failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Failover to Backup Peer: Route to secondary Diameter peer + - Tools Used: apply_configuration() + - Expected Outcomes: Failover successful; No backup peer; Backup congested + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Escalate to DRA Team: Create ticket for DRA/Diameter investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: DRA team notified + - Flow: If succeeds, go to step 7; if fails, go to step 8. + + 7. Implement Traffic Throttling: Apply overload control if needed + - Tools Used: apply_configuration() + - Expected Outcomes: Throttling enabled; OLC applied + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Verify Diameter Status: Monitor Diameter peer recovery + - Tools Used: verify_recovery() + - Expected Outcomes: Peer restored; Running on backup; Recovery in progress + - Flow: Proceed to step end. + + """, + + "signaling_gtp_tunnel_recovery": """ + Workflow: GTP Tunnel Failure Recovery + Domain: SIGNALING | Problem Codes: SIG-004 | SLA: 15 minutes + Description: GTP-C/GTP-U tunnel failure recovery + + 1. 
Verify GTP Alarm: Confirm GTP tunnel failure + - Tools Used: query_alarm() + - Expected Outcomes: GTP-C path failure; GTP-U echo timeout; Tunnel deletion high + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check S/P-GW Status: Verify gateway operational status + - Tools Used: query_resource_health() + - Expected Outcomes: Gateway healthy; Gateway overloaded; Gateway fault + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Test GTP Path: Verify GTP connectivity + - Tools Used: verify_signaling_path() + - Expected Outcomes: Path healthy; Echo loss; Path congested + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Attempt Path Recovery: Clear and rebuild GTP path + - Tools Used: execute_remote_action() + - Expected Outcomes: Path recovering; Path established; Recovery failed + - Flow: If succeeds, go to step 8; if fails, go to step 5. + + 5. Failover to Backup Gateway: Redirect to secondary S/P-GW + - Tools Used: apply_configuration() + - Expected Outcomes: Failover initiated; No backup available + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Escalate to Core Team: Create critical ticket for EPC/5GC investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Core team engaged + - Flow: If succeeds, go to step 7; if fails, go to step 8. + + 7. Enable GTP Optimization: Apply GTP overload controls + - Tools Used: apply_configuration() + - Expected Outcomes: Controls applied; Already optimized + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Verify GTP Status: Monitor GTP tunnel recovery + - Tools Used: verify_recovery() + - Expected Outcomes: GTP recovered; Running on backup; Recovery in progress + - Flow: Proceed to step end. 
+ + """, + + "signaling_overload_mitigation": """ + Workflow: Signaling Overload Mitigation + Domain: SIGNALING | Problem Codes: SIG-008 | SLA: 30 minutes + Description: Signaling plane overload detection and mitigation + + 1. Verify Overload Alarm: Confirm signaling overload condition + - Tools Used: query_alarm() + - Expected Outcomes: Message rate high; Queue building; Reject rate elevated + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Identify Overload Source: Determine cause of signaling overload + - Tools Used: query_performance() + - Expected Outcomes: Attach storm; Paging flood; Normal peak traffic + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Node Capacity: Verify signaling node capacity + - Tools Used: query_resource_health() + - Expected Outcomes: At capacity; Undersized; Unusual load pattern + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Enable Overload Controls: Activate signaling overload mechanisms + - Tools Used: apply_configuration() + - Expected Outcomes: OLC enabled; Rate limiting active; Selective rejection + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Request Load Shedding: Implement traffic management + - Tools Used: execute_remote_action() + - Expected Outcomes: Load shedding active; Traffic shaped + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate Capacity Issue: Create ticket for capacity review + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Capacity team notified + - Flow: Proceed to step 7. + + 7. Verify Load Status: Monitor signaling load levels + - Tools Used: verify_recovery() + - Expected Outcomes: Load normalized; Load stabilized; OLC active + - Flow: Proceed to step end. + + """, + + "signaling_routing_failure_recovery": """ + Workflow: Message Routing Failure Recovery + Domain: SIGNALING | Problem Codes: SIG-010 | SLA: 30 minutes + Description: Signaling message routing failure troubleshooting + + 1. 
Verify Routing Alarm: Confirm message routing failure + - Tools Used: query_alarm() + - Expected Outcomes: Route not found; Destination unreachable; Routing loop detected + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check Routing Tables: Verify signaling routing configuration + - Tools Used: run_diagnostics() + - Expected Outcomes: Missing route; Incorrect route; Routes OK - destination down + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Destination Status: Verify target node reachability + - Tools Used: query_resource_health() + - Expected Outcomes: Destination reachable; Destination down; Destination overloaded + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Apply Route Correction: Fix routing configuration + - Tools Used: apply_configuration() + - Expected Outcomes: Route added; Route corrected; Alternate route configured + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Escalate Routing Issue: Create ticket for routing investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Routing issue escalated + - Flow: Proceed to step 6. + + 6. Verify Routing Status: Monitor message routing success + - Tools Used: verify_recovery() + - Expected Outcomes: Routing restored; Alternate route active; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "signaling_s1_n2_recovery": """ + Workflow: S1/N2 Interface Recovery + Domain: SIGNALING | Problem Codes: SIG-001 | SLA: 15 minutes + Description: S1-MME or N2 interface failure recovery + + 1. Verify S1/N2 Alarm: Confirm S1-MME or N2 interface down + - Tools Used: query_alarm() + - Expected Outcomes: S1-MME link down; N2 interface down; SCTP association lost + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. 
Check SCTP Status: Verify SCTP association state + - Tools Used: verify_signaling_path() + - Expected Outcomes: SCTP down; SCTP heartbeat timeout; SCTP path failed + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Test IP Connectivity to Core: Verify IP path to MME/AMF + - Tools Used: test_connectivity() + - Expected Outcomes: Path OK; Path down; High latency/loss + - Flow: If succeeds, go to step 4; if fails, go to step 6. + + 4. Check Core Node Status: Verify MME/AMF operational status + - Tools Used: query_resource_health() + - Expected Outcomes: Core node healthy; Core node overloaded; Core node down + - Flow: If succeeds, go to step 5; if fails, go to step 7. + + 5. Attempt SCTP Reset: Reset SCTP association on eNodeB/gNodeB + - Tools Used: execute_remote_action() + - Expected Outcomes: SCTP recovering; Reset failed; Association reestablishing + - Flow: If succeeds, go to step 9; if fails, go to step 6. + + 6. Switch to Backup MME/AMF: Redirect to secondary core node + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; No backup available; Backup also affected + - Flow: If succeeds, go to step 9; if fails, go to step 7. + + 7. Escalate to Core Team: Create critical ticket for core investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Core team engaged + - Flow: If succeeds, go to step 8; if fails, go to step 9. + + 8. Enable S1 Flex Rerouting: Configure S1-flex to alternate pool + - Tools Used: apply_configuration() + - Expected Outcomes: Rerouting enabled; Flex not available + - Flow: Proceed to step 9. + - Note: This step is skippable if not applicable. + + 9. Verify S1/N2 Status: Monitor interface recovery + - Tools Used: verify_recovery() + - Expected Outcomes: S1/N2 restored; Running on backup; Recovery ongoing + - Flow: Proceed to step end. 
+ + """, + + "signaling_s1ap_procedure_recovery": """ + Workflow: S1-AP Procedure Failure Recovery + Domain: SIGNALING | Problem Codes: SIG-005 | SLA: 30 minutes + Description: S1-AP signaling procedure failure troubleshooting + + 1. Verify S1-AP Alarm: Confirm S1-AP procedure failures + - Tools Used: query_alarm() + - Expected Outcomes: Setup failures elevated; Reset procedure triggered; UE context failures + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Analyze S1-AP Statistics: Review S1-AP procedure success rates + - Tools Used: query_performance() + - Expected Outcomes: Attach failures; Handover failures; Paging failures + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check MME Load: Verify MME capacity and load + - Tools Used: query_resource_health() + - Expected Outcomes: MME healthy; MME overloaded; MME rejecting + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Review S1-AP Configuration: Verify S1-AP parameters + - Tools Used: run_diagnostics() + - Expected Outcomes: Config OK; Timer mismatch; Capacity exceeded + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Apply S1-AP Tuning: Adjust S1-AP parameters + - Tools Used: apply_configuration() + - Expected Outcomes: Parameters adjusted; Tuning applied + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate to Core Team: Create ticket for MME investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Core team notified + - Flow: Proceed to step 7. + + 7. Verify S1-AP Performance: Monitor S1-AP procedure success rate + - Tools Used: verify_recovery() + - Expected Outcomes: S1-AP improved; Performance stable; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "signaling_sctp_recovery": """ + Workflow: SCTP Association Recovery + Domain: SIGNALING | Problem Codes: SIG-006 | SLA: 15 minutes + Description: SCTP association failure recovery + + 1. 
Verify SCTP Alarm: Confirm SCTP association lost + - Tools Used: query_alarm() + - Expected Outcomes: Association down; Path failure; Heartbeat timeout + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check SCTP Path Status: Verify SCTP multihoming paths + - Tools Used: verify_signaling_path() + - Expected Outcomes: Primary path down; Both paths down; Path flapping + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Test SCTP Connectivity: Verify IP connectivity for SCTP paths + - Tools Used: test_connectivity() + - Expected Outcomes: Primary path OK; Secondary path OK; Both paths unreachable + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Force SCTP Failover: Switch to alternate SCTP path + - Tools Used: execute_remote_action() + - Expected Outcomes: Failover successful; Secondary path active + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Reset SCTP Association: Restart SCTP association + - Tools Used: execute_remote_action() + - Expected Outcomes: Association reestablishing; Reset failed + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate Path Issue: Create ticket for network/transport investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Transport team engaged + - Flow: Proceed to step 7. + + 7. Verify SCTP Status: Monitor SCTP association recovery + - Tools Used: verify_recovery() + - Expected Outcomes: Association restored; Running on backup path; Recovery in progress + - Flow: Proceed to step end. + + """, + + "signaling_sigtran_recovery": """ + Workflow: SIGTRAN Link Failure Recovery + Domain: SIGNALING | Problem Codes: SIG-007 | SLA: 15 minutes + Description: SIGTRAN/SS7 over IP link recovery + + 1. Verify SIGTRAN Alarm: Confirm SIGTRAN link failure + - Tools Used: query_alarm() + - Expected Outcomes: M3UA link down; M2PA failure; SG unreachable + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. 
Check SG Status: Verify Signaling Gateway status + - Tools Used: query_resource_health() + - Expected Outcomes: SG healthy; SG overloaded; SG unreachable + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Test SIGTRAN Path: Verify SIGTRAN connectivity + - Tools Used: verify_signaling_path() + - Expected Outcomes: Path healthy; Path blocked; SCTP layer issue + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Activate Backup Linkset: Switch to redundant SIGTRAN linkset + - Tools Used: execute_remote_action() + - Expected Outcomes: Backup linkset active; No backup available + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Reset SIGTRAN Link: Restart failed SIGTRAN link + - Tools Used: execute_remote_action() + - Expected Outcomes: Link recovering; Link activation in progress; Reset failed + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate to SS7 Team: Create critical ticket for SS7/SIGTRAN investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: SS7 team engaged + - Flow: Proceed to step 7. + + 7. Verify SIGTRAN Status: Monitor SIGTRAN link recovery + - Tools Used: verify_recovery() + - Expected Outcomes: Link restored; Running on backup; Recovery ongoing + - Flow: Proceed to step end. + + """, + + "signaling_sip_registration_recovery": """ + Workflow: SIP Registration Failure Recovery + Domain: SIGNALING | Problem Codes: SIG-003 | SLA: 30 minutes + Description: VoLTE/VoNR SIP registration failure troubleshooting + + 1. Verify SIP Alarm: Confirm SIP registration failure rate elevated + - Tools Used: query_alarm() + - Expected Outcomes: Registration failures high; 401 responses elevated; Timeout failures + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. 
Check P-CSCF Status: Verify P-CSCF operational status + - Tools Used: query_resource_health() + - Expected Outcomes: P-CSCF healthy; P-CSCF overloaded; P-CSCF unreachable + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Verify IMS Path: Test SIP signaling path to IMS + - Tools Used: verify_signaling_path() + - Expected Outcomes: Path healthy; Path congested; Path blocked + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Check HSS Connectivity: Verify HSS/UDM reachability for auth + - Tools Used: verify_signaling_path() + - Expected Outcomes: HSS reachable; HSS overloaded; Auth failures + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Adjust SIP Routing: Route to alternate P-CSCF pool + - Tools Used: apply_configuration() + - Expected Outcomes: Routing adjusted; Pool switched + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Escalate to IMS Team: Create ticket for IMS investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: IMS team notified + - Flow: Proceed to step 7. + + 7. Verify Registration Rate: Monitor SIP registration success rate + - Tools Used: verify_recovery() + - Expected Outcomes: Registration rate recovered; Rate improving; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "transport_fiber_cut_response": """ + Workflow: Fiber Cut Response + Domain: TRANSPORT | Problem Codes: TRN-001 | SLA: 30 minutes + Description: Fiber optic cable cut detection and response + + 1. Verify Fiber Alarm: Confirm fiber cut or loss of light + - Tools Used: query_alarm() + - Expected Outcomes: Loss of light confirmed; High optical attenuation; Multiple fibers affected + - Flow: If succeeds, go to step 2; if fails, go to step 8. + + 2. 
Identify Affected Services: Determine services impacted by fiber cut + - Tools Used: query_topology() + - Expected Outcomes: Cell sites affected; Core links affected; Customer circuits affected + - Flow: If succeeds, go to step 3; if fails, go to step 7. + + 3. Check OTDR Data: Review optical time domain reflectometer data + - Tools Used: run_diagnostics() + - Expected Outcomes: Cut location identified; OTDR data inconclusive; Multiple faults detected + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Activate Protection Path: Switch to protection fiber if available + - Tools Used: execute_remote_action() + - Expected Outcomes: Protection active; No protection available; Protection also affected + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Implement Rerouting: Reroute traffic via alternate paths + - Tools Used: apply_configuration() + - Expected Outcomes: Traffic rerouted; Partial restoration; No alternate path + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Verify Service Restoration: Confirm services recovered via protection/reroute + - Tools Used: test_connectivity() + - Expected Outcomes: Services restored; Partial restoration; Services impacted + - Flow: Proceed to step 7. + + 7. Dispatch Fiber Repair Team: Create emergency dispatch for fiber repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Fiber repair dispatch created + - Flow: If succeeds, go to step 8; if fails, go to step 9. + + 8. Notify Affected Customers: Send outage notifications if applicable + - Tools Used: query_external_factors() + - Expected Outcomes: Notifications sent; No customer impact + - Flow: Proceed to step 9. + - Note: This step is skippable if not applicable. + + 9. Monitor Recovery Status: Track fiber repair progress + - Tools Used: verify_recovery() + - Expected Outcomes: Protected - repair in progress; Degraded - repair in progress; Full outage - repair underway + - Flow: Proceed to step end. 
+ + """, + + "transport_high_latency_resolution": """ + Workflow: High Latency Resolution + Domain: TRANSPORT | Problem Codes: TRN-005 | SLA: 90 minutes + Description: Transport path latency troubleshooting + + 1. Verify Latency Alarm: Confirm latency threshold exceeded + - Tools Used: query_alarm() + - Expected Outcomes: Latency elevated; Latency spikes detected; Consistent high latency + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Test Path Latency: Measure end-to-end latency + - Tools Used: test_connectivity() + - Expected Outcomes: Latency Xms confirmed; Variable latency; Timeout detected + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Identify High Latency Hop: Locate source of latency + - Tools Used: run_diagnostics() + - Expected Outcomes: Specific hop identified; Distributed latency; Routing inefficient + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Optimize Routing: Adjust routing to reduce latency + - Tools Used: apply_configuration() + - Expected Outcomes: Route optimized; Limited improvement; Best path already used + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Request Capacity Review: Create ticket for capacity analysis + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Capacity review scheduled + - Flow: Proceed to step 6. + + 6. Verify Latency Status: Monitor latency improvement + - Tools Used: verify_recovery() + - Expected Outcomes: Latency normalized; Latency improved; Investigation ongoing + - Flow: Proceed to step end. + + """, + + "transport_interface_errors_resolution": """ + Workflow: Interface Errors Resolution + Domain: TRANSPORT | Problem Codes: TRN-007 | SLA: 90 minutes + Description: Network interface error troubleshooting + + 1. 
Verify Interface Alarm: Confirm interface error threshold exceeded + - Tools Used: query_alarm() + - Expected Outcomes: CRC errors elevated; Input errors high; Carrier transitions + - Flow: If succeeds, go to step 2; if fails, go to step 5. + + 2. Check Interface Counters: Get detailed interface error statistics + - Tools Used: query_resource_health() + - Expected Outcomes: Layer 1 errors; Layer 2 errors; Duplex mismatch suspected + - Flow: If succeeds, go to step 3; if fails, go to step 4. + + 3. Check Physical Layer: Verify optics and cable integrity + - Tools Used: run_diagnostics() + - Expected Outcomes: Optical levels OK; RX power low; TX power fault + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Clear Counters and Monitor: Reset counters and watch for new errors + - Tools Used: execute_remote_action() + - Expected Outcomes: Errors continuing; Errors stopped; Intermittent errors + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Schedule Physical Repair: Create ticket for cable/optic replacement + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Field repair scheduled + - Flow: Proceed to step 6. + + 6. Verify Interface Status: Monitor interface error rate + - Tools Used: verify_recovery() + - Expected Outcomes: Errors resolved; Errors reduced; Repair required + - Flow: Proceed to step end. + + """, + + "transport_microwave_degradation_response": """ + Workflow: Microwave Link Degradation Response + Domain: TRANSPORT | Problem Codes: TRN-002 | SLA: 60 minutes + Description: Microwave link performance degradation troubleshooting + + 1. Verify Microwave Alarm: Confirm microwave link degradation + - Tools Used: query_alarm() + - Expected Outcomes: Signal fade detected; BER elevated; Modulation downshift; Link down + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. 
Check Weather Conditions: Verify weather impact on link + - Tools Used: query_external_factors() + - Expected Outcomes: Rain fade likely; No weather issues; High humidity detected + - Flow: Proceed to step 3. + + 3. Check Link Status: Query microwave terminal status + - Tools Used: query_resource_health() + - Expected Outcomes: RSL degraded; Equipment fault; Antenna misalignment suspected + - Flow: If succeeds, go to step 4; if fails, go to step 6. + + 4. Enable ACM Optimization: Verify adaptive coding and modulation active + - Tools Used: apply_configuration() + - Expected Outcomes: ACM optimized; Already at lowest modulation; Configuration applied + - Flow: Proceed to step 5. + + 5. Activate Link Protection: Enable backup link if available + - Tools Used: execute_remote_action() + - Expected Outcomes: Protection active; No protection available; 1+1 already active + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Schedule Site Visit: Create work order for antenna inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Site inspection scheduled + - Flow: If succeeds, go to step 7; if fails, go to step 8. + + 7. Apply QoS Policy: Prioritize critical traffic during degradation + - Tools Used: apply_configuration() + - Expected Outcomes: QoS applied; QoS already active + - Flow: Proceed to step 8. + - Note: This step is skippable if not applicable. + + 8. Monitor Link Status: Track link recovery + - Tools Used: verify_recovery() + - Expected Outcomes: Link recovering; Link stable at lower rate; Link still degraded + - Flow: Proceed to step end. + + """, + + "transport_mpls_lsp_recovery": """ + Workflow: MPLS LSP Failure Recovery + Domain: TRANSPORT | Problem Codes: TRN-003 | SLA: 30 minutes + Description: MPLS Label Switched Path failure recovery + + 1. 
Verify LSP Alarm: Confirm MPLS LSP down + - Tools Used: query_alarm() + - Expected Outcomes: Primary LSP down; Backup LSP active; Multiple LSPs affected + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check LSP Path: Trace LSP path and identify failure point + - Tools Used: run_diagnostics() + - Expected Outcomes: Failure node identified; Failure link identified; Path unreachable + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Check Router Status: Query router at failure point + - Tools Used: query_resource_health() + - Expected Outcomes: Router healthy - link issue; Router interface down; Router unreachable + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Force LSP Recomputation: Trigger RSVP path recomputation + - Tools Used: execute_remote_action() + - Expected Outcomes: LSP rerouted; No alternate path found; Recomputation in progress + - Flow: If succeeds, go to step 6; if fails, go to step 5. + + 5. Check Physical Layer: Verify underlying connectivity + - Tools Used: test_connectivity() + - Expected Outcomes: Physical link down; Physical layer OK; Errors detected + - Flow: If succeeds, go to step 6; if fails, go to step 7. + + 6. Verify Service Impact: Check services using affected LSP + - Tools Used: query_topology() + - Expected Outcomes: Services rerouted; Services degraded; Services impacted + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Escalate to Transport Core: Create high priority ticket for MPLS repair + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Transport core notified + - Flow: Proceed to step 8. + + 8. Monitor LSP Status: Track LSP recovery + - Tools Used: verify_recovery() + - Expected Outcomes: Primary LSP restored; Running on backup; Recovery in progress + - Flow: Proceed to step end. 
+ + """, + + "transport_packet_loss_resolution": """ + Workflow: High Packet Loss Resolution + Domain: TRANSPORT | Problem Codes: TRN-004 | SLA: 60 minutes + Description: Transport path packet loss troubleshooting + + 1. Verify Packet Loss Alarm: Confirm packet loss threshold exceeded + - Tools Used: query_alarm() + - Expected Outcomes: Packet loss elevated; Intermittent loss; Severe packet loss + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Test Path Connectivity: Run path quality tests + - Tools Used: test_connectivity() + - Expected Outcomes: Loss confirmed at X%; Intermittent loss detected; Path test passed + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Identify Loss Point: Locate where packet loss is occurring + - Tools Used: run_diagnostics() + - Expected Outcomes: Loss at specific hop; Loss distributed; Last mile issue + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Check Interface Counters: Review router interface error counters + - Tools Used: query_resource_health() + - Expected Outcomes: Input errors high; Output drops high; CRC errors; Counters normal + - Flow: Proceed to step 5. + + 5. Apply Traffic Engineering: Reroute traffic to reduce congestion + - Tools Used: apply_configuration() + - Expected Outcomes: Traffic rerouted; No alternate path; Partial improvement + - Flow: If succeeds, go to step 7; if fails, go to step 6. + + 6. Schedule Link Repair: Create ticket for link inspection + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Link inspection scheduled + - Flow: Proceed to step 7. + + 7. Verify Packet Loss Status: Monitor packet loss levels + - Tools Used: verify_recovery() + - Expected Outcomes: Packet loss resolved; Packet loss improved; Investigation ongoing + - Flow: Proceed to step end. 
+ + """, + + "transport_router_cpu_resolution": """ + Workflow: Router High CPU Resolution + Domain: TRANSPORT | Problem Codes: TRN-006 | SLA: 60 minutes + Description: Router CPU utilization troubleshooting + + 1. Verify CPU Alarm: Confirm router CPU threshold exceeded + - Tools Used: query_alarm() + - Expected Outcomes: CPU elevated; CPU critical; CPU spikes + - Flow: If succeeds, go to step 2; if fails, go to step 6. + + 2. Check Router Status: Query router process and CPU breakdown + - Tools Used: query_resource_health() + - Expected Outcomes: Control plane high; Data plane high; Management high + - Flow: If succeeds, go to step 3; if fails, go to step 5. + + 3. Identify High CPU Process: Determine which process is consuming CPU + - Tools Used: run_diagnostics() + - Expected Outcomes: BGP reconvergence; OSPF SPF calculation; ACL processing; DoS attack suspected + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Apply Mitigation: Take action to reduce CPU load + - Tools Used: execute_remote_action() + - Expected Outcomes: Mitigation applied; Process stabilized; Root cause persists + - Flow: If succeeds, go to step 7; if fails, go to step 5. + + 5. Enable Control Plane Protection: Apply CoPP to protect control plane + - Tools Used: apply_configuration() + - Expected Outcomes: CoPP applied; CoPP already active; Rate limiting enabled + - Flow: Proceed to step 6. + + 6. Escalate if Needed: Create ticket for engineering review + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Engineering review scheduled + - Flow: Proceed to step 7. + - Note: This step is skippable if not applicable. + + 7. Verify CPU Status: Monitor CPU recovery + - Tools Used: verify_recovery() + - Expected Outcomes: CPU normalized; CPU improved; CPU still elevated + - Flow: Proceed to step end. 
+ + """, + + "transport_routing_flap_resolution": """ + Workflow: Routing Protocol Flapping Resolution + Domain: TRANSPORT | Problem Codes: TRN-008 | SLA: 30 minutes + Description: BGP/OSPF/ISIS route flapping troubleshooting + + 1. Verify Routing Alarm: Confirm routing protocol instability + - Tools Used: query_alarm() + - Expected Outcomes: BGP peer flapping; OSPF adjacency flapping; ISIS flapping + - Flow: If succeeds, go to step 2; if fails, go to step 7. + + 2. Check Protocol Status: Query routing protocol state + - Tools Used: query_resource_health() + - Expected Outcomes: Peer unstable; Hold timer expiry; Interface flapping + - Flow: If succeeds, go to step 3; if fails, go to step 6. + + 3. Analyze Protocol Logs: Review routing protocol event logs + - Tools Used: inspect_logs() + - Expected Outcomes: CPU issue indicated; Interface issue; Remote peer issue + - Flow: If succeeds, go to step 4; if fails, go to step 5. + + 4. Check Underlying Connectivity: Verify physical/logical path to peer + - Tools Used: test_connectivity() + - Expected Outcomes: Path stable; Path intermittent; Path down + - Flow: If succeeds, go to step 5; if fails, go to step 6. + + 5. Apply Route Dampening: Enable route dampening to stabilize + - Tools Used: apply_configuration() + - Expected Outcomes: Dampening enabled; Routing stabilizing + - Flow: If succeeds, go to step 8; if fails, go to step 6. + + 6. Adjust Timer Settings: Modify keepalive/hold timers + - Tools Used: apply_configuration() + - Expected Outcomes: Timers adjusted; Stability improved + - Flow: If succeeds, go to step 8; if fails, go to step 7. + + 7. Escalate to Network Engineering: Create urgent ticket for routing investigation + - Tools Used: create_trouble_ticket() + - Expected Outcomes: Engineering team notified + - Flow: Proceed to step 8. + + 8. 
Verify Routing Stability: Monitor routing protocol state + - Tools Used: verify_recovery() + - Expected Outcomes: Routing stable; Routing improving; Investigation ongoing + - Flow: Proceed to step end. + + """, + +} + + +DEFAULT_REASONING_PROCESS = WORKFLOW_REASONING_PROCESSES["ran_cell_site_down_recovery"] + + +def get_reasoning_process_for_fault_category(fault_category: str) -> str: + """Return reasoning process text for a given fault_category (workflow ID).""" + if fault_category is None or str(fault_category).strip() in ("", "nan", "None"): + return DEFAULT_REASONING_PROCESS + fc = str(fault_category).strip() + return WORKFLOW_REASONING_PROCESSES.get(fc, DEFAULT_REASONING_PROCESS) diff --git a/recipes/noc-reasoning-agent/scripts/utils/schema_columns.py b/recipes/noc-reasoning-agent/scripts/utils/schema_columns.py new file mode 100644 index 0000000000..b792d7e5c2 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/schema_columns.py @@ -0,0 +1,72 @@ +""" +Canonical column names for the pipeline when using synthetic (new) data format. +Use these names everywhere so the pipeline runs on CSV/JSONL with synthetic schema +without converting to the old format. +""" + +from typing import List + +# Primary key for incidents (CSV column and JSONL key) +INCIDENT_ID_COLUMN = "incident_identifier" + +# All columns expected in synthetic CSV (and produced in JSONL). +# Derived columns: time_to_resolve (computed), solved_category (from match_keywords), problem_code_reasoning_process (mapped from fault_category). 
+REQUIRED_COLUMNS: List[str] = [ + INCIDENT_ID_COLUMN, + "geographical_territory", + "incident_classification", + "incident_subtype", + "service_domain", + "equipment_provider", + "operational_zone", + "resolution_status", + "suspension_cause", + "urgency_level", + "affected_site", + "incident_summary", + "detection_timestamp", + "escalation_date", + "responsible_team", + "fault_category", + "root_cause_primary", + "resolution_summary", + "action_chronicle", + "reporter_identity", + "intervention_began", + "intervention_completed", + "resolution_method", + "root_cause_secondary", + "cause_additional", + "triggered_by_modification", + "resolver_identity", + "time_to_resolve", + "solved_category", + "problem_code_reasoning_process", +] + +# Column used for close-code / resolution classification (match_keywords, filter_rows) +RESOLUTION_METHOD_COLUMN = "resolution_method" +RESOLUTION_SUMMARY_COLUMN = "resolution_summary" +FAULT_CATEGORY_COLUMN = "fault_category" +SOLVED_CATEGORY_COLUMN = "solved_category" + +# Datetime columns (for parsing and time_to_resolve) +INTERVENTION_BEGAN_COLUMN = "intervention_began" +INTERVENTION_COMPLETED_COLUMN = "intervention_completed" + +# Prompt/display columns (used in YAML and scripts) +PROMPT_COLUMNS = { + "incident_identifier": INCIDENT_ID_COLUMN, + "geographical_territory": "geographical_territory", + "incident_classification": "incident_classification", + "incident_subtype": "incident_subtype", + "service_domain": "service_domain", + "equipment_provider": "equipment_provider", + "operational_zone": "operational_zone", + "urgency_level": "urgency_level", + "affected_site": "affected_site", + "incident_summary": "incident_summary", + "detection_timestamp": "detection_timestamp", + "responsible_team": "responsible_team", + "fault_category": "fault_category", +} diff --git a/recipes/noc-reasoning-agent/scripts/utils/split_incident_data.py b/recipes/noc-reasoning-agent/scripts/utils/split_incident_data.py new file mode 100644 index 
0000000000..5cf93c13b1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/split_incident_data.py @@ -0,0 +1,96 @@ +import argparse +import json +import os +import random +from pathlib import Path + + +def split_generation_field( + input_folder: str, train_out: str, test_out: str, test_size: float = 0.2, seed: int = 42, max_examples: int = 10000 +): + train_path = Path(train_out) + test_path = Path(test_out) + + # Collect incident identifiers from the first iteration file. The split is done on + # identifiers so that every iteration of an incident lands in the same split. + incidents = [] + with open(os.path.join(input_folder, "iteration_0.jsonl"), "r", encoding="utf-8") as f_in: + for line in f_in: + row = json.loads(line) + number = row.get("incident_identifier", row.get("number")) + incidents.append(number) + + random.seed(seed) + random.shuffle(incidents) + + incidents = incidents[:max_examples] + n = len(incidents) + n_test = max(1, int(round(n * test_size))) if n > 0 else 0 + n_train = n - n_test + train_set = set(incidents[:n_train]) + test_set = set(incidents[n_train:]) + + train_rows = [] + test_rows = [] + + # First pass: gather training rows from every iteration file. + i = 0 + while os.path.exists(os.path.join(input_folder, f"iteration_{i}.jsonl")): + current_iteration_train = [] + with open(os.path.join(input_folder, f"iteration_{i}.jsonl"), "r", encoding="utf-8") as f_in: + for line in f_in: + row = json.loads(line) + number = row.get("incident_identifier", row.get("number")) + if number in train_set: + current_iteration_train.append(row) + + random.shuffle(current_iteration_train) + train_rows += current_iteration_train + i += 1 + + # Second pass: gather test rows and attach the expected close code. + i = 0 + while os.path.exists(os.path.join(input_folder, f"iteration_{i}.jsonl")): + current_iteration_test = [] + with open(os.path.join(input_folder, f"iteration_{i}.jsonl"), "r", encoding="utf-8") as f_in: + for line in f_in: + row = json.loads(line) + number = row.get("incident_identifier", row.get("number")) + if number in test_set: + resolution = row.get("root_cause_secondary") or row.get("close_code", "") + if resolution: + row["expected"] = f"Close Code: [{resolution}]" + 
row["initial_background"] = row["background"]  # preserve the original background + row["background"] = "\n\n"  # test prompts start with an empty background + current_iteration_test.append(row) + + random.shuffle(current_iteration_test) + test_rows += current_iteration_test + i += 1 + + train_path.parent.mkdir(parents=True, exist_ok=True) + test_path.parent.mkdir(parents=True, exist_ok=True) + + with open(train_path, "w", encoding="utf-8") as f_train: + for r in train_rows: + f_train.write(json.dumps(r, ensure_ascii=False) + "\n") + + with open(test_path, "w", encoding="utf-8") as f_test: + for r in test_rows: + f_test.write(json.dumps(r, ensure_ascii=False) + "\n") + + print(f"Incidents used (after max_examples cap): {n}") + print(f"Train size: {len(train_set)} | Test size: {len(test_set)}") + print(f"Train path: {train_path}") + print(f"Test path: {test_path}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Split iteration_*.jsonl incident files into train and test sets by incident identifier") + parser.add_argument("--input_dir", help="Directory containing iteration_*.jsonl files", default="output_incident.jsonl") + parser.add_argument("--train_output", help="Path to the train-split JSONL file", default="training_data_split.jsonl") + parser.add_argument("--test_output", help="Path to the test-split JSONL file", default="testing_data_split.jsonl") + parser.add_argument("--test_size", type=float, default=0.2, help="Fraction of incidents held out for testing (default: 0.2)") + parser.add_argument("--seed", type=int, default=42, help="Random seed for the shuffle (default: 42)") + parser.add_argument("--max_examples", type=int, default=10000, help="Cap on the number of incidents used (default: 10000)") + args = parser.parse_args() + + split_generation_field(args.input_dir, args.train_output, args.test_output, test_size=args.test_size, seed=args.seed, max_examples=args.max_examples) diff --git a/recipes/noc-reasoning-agent/scripts/utils/split_mocktools_answers.py b/recipes/noc-reasoning-agent/scripts/utils/split_mocktools_answers.py new file mode 100644 index 0000000000..ef8199fea1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/split_mocktools_answers.py @@ -0,0 +1,123 @@ +import argparse +import json + + +def parse_generation(gen_str): + """ + Parses a generation string into a dictionary where keys are before ':' and values are after. 
+ """ + parsed = {} + if not gen_str: + return parsed + lines = gen_str.split("\n") + for line in lines: + if ":" in line: + key, value = line.split(":", 1) + key = key.strip() + value = value.strip() + parsed[key] = value + return parsed + + +def main(file1_path, file2_path, output_path=None): + """ + Processes two JSONL files: extracts specified fields from the first file, + includes 'generation' from the first file and 'generation2' from the second file (matched by 'number'), + parses the 'generation2' into additional answer columns, + and outputs the result as JSONL to stdout or a file. + + Args: + file1_path (str): Path to the first JSONL file. + file2_path (str): Path to the second JSONL file. + output_path (str, optional): Path to the output JSONL file. If None, prints to stdout. + """ + # Define the answer keys to extract + answer_keys = [ + "Check_Triage_toolkit_answer", + "Check_Alarm_Status_on_System_answer", + "Remote_Connection_answer", + "Remote_Device_Reboot_answer", + "Check_Fiber_Issues_answer", + "Check_Power_Issues_answer", + "Check_Element_Neighbors_answer", + ] + + # Load data from first file, keyed by 'number' + data1 = {} + with open(file1_path, "r", encoding="utf-8") as f1: + for line_num, line in enumerate(f1, 1): + line = line.strip() + if line: + try: + d = json.loads(line) + num = d.get("number") + if num: + data1[num] = d + except json.JSONDecodeError as e: + print(f"Error decoding JSON in file1 on line {line_num}: '{line}'. Error: {e}") + + # Load generations from second file, keyed by 'number' + data2 = {} + with open(file2_path, "r", encoding="utf-8") as f2: + for line_num, line in enumerate(f2, 1): + line = line.strip() + if line: + try: + d = json.loads(line) + num = d.get("number") + if num: + data2[num] = d.get("generation", "") + except json.JSONDecodeError as e: + print(f"Error decoding JSON in file2 on line {line_num}: '{line}'. 
Error: {e}") + + # Build results for matching numbers + results = [] + for num, d1 in data1.items(): + if num in data2: + # Parse generation2 + parsed2 = parse_generation(data2[num]) + + extracted = { + "number": num, + "u_region2": d1.get("u_region2", ""), + "category": d1.get("category", ""), + "subcategory": d1.get("subcategory", ""), + "u_market2": d1.get("u_market2", ""), + "u_ran_vendor": d1.get("u_ran_vendor", ""), + "u_aoi2": d1.get("u_aoi2", ""), + "priority": d1.get("priority", ""), + "u_locations": d1.get("u_locations", ""), + "short_description": d1.get("short_description", ""), + "opened_at": d1.get("opened_at", ""), + "follow_up": d1.get("follow_up", ""), + "assignment_group": d1.get("assignment_group", ""), + "problem code": d1.get("u_problem_code", ""), + "generation": d1.get("generation", ""), + "generation2": data2[num], + "close_notes": d1.get("close_notes", ""), + } + + # Add the parsed answer columns from generation2 + for key in answer_keys: + extracted[key] = parsed2.get(key, "NotApplicable") + + results.append(extracted) + + # Output + if output_path: + with open(output_path, "w", encoding="utf-8") as out_file: + for res in results: + out_file.write(json.dumps(res) + "\n") + print(f"Output written to {output_path}") + else: + for res in results: + print(json.dumps(res)) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Extract and combine data from two JSONL files.") + parser.add_argument("file1", help="Path to the first JSONL file") + parser.add_argument("file2", help="Path to the second JSONL file") + parser.add_argument("--output", help="Optional path to output JSONL file (default: print to stdout)") + args = parser.parse_args() + main(args.file1, args.file2, args.output) diff --git a/recipes/noc-reasoning-agent/scripts/utils/token_usage.py b/recipes/noc-reasoning-agent/scripts/utils/token_usage.py new file mode 100644 index 0000000000..d0a66ecc3a --- /dev/null +++ 
b/recipes/noc-reasoning-agent/scripts/utils/token_usage.py @@ -0,0 +1,219 @@ +#!/usr/bin/env python3 +# token_usage.py +# +# Usage: +# python token_usage.py \ +# --jsonl workspace/NoC_Reasoning_Agent/outputs/sdg_gpt_oss/output.jsonl \ +# --yaml data/prompt_incident.yaml \ +# --model openai/gpt-oss-120b \ +# --out outputs/generation_lengths.csv + +import argparse +import json +import math +import os +from typing import Optional + +from tqdm import tqdm + +# Optional dependencies: fall back gracefully when they are missing. +try: + from transformers import AutoTokenizer +except Exception: + AutoTokenizer = None + +try: + import yaml +except Exception: + yaml = None + +import matplotlib.pyplot as plt +import pandas as pd + + +def load_tokenizer(model_name: Optional[str]): + """ + Try to load a HF tokenizer. If unavailable (e.g., no internet/cache), + return None and we'll fall back to whitespace tokenization. + """ + if not model_name or AutoTokenizer is None: + return None + try: + tok = AutoTokenizer.from_pretrained(model_name, local_files_only=True) + return tok + except Exception: + # Try again without local_files_only (may fail if no internet) + try: + tok = AutoTokenizer.from_pretrained(model_name) + return tok + except Exception: + return None + + +def count_tokens(text: str, tokenizer) -> int: + if not isinstance(text, str): + return 0 + if tokenizer is not None: + # Use encode to match the model's token count (fast and accurate) + try: + return len(tokenizer.encode(text, add_special_tokens=False)) + except Exception: + pass + # Fallback: whitespace tokens + return len(text.split()) + + +def read_yaml_prompt(path: Optional[str]) -> Optional[str]: + if not path: + return None + if not os.path.exists(path): + print(f"[warn] YAML file not found: {path}") + return None + if yaml is None: + print("[warn] PyYAML not installed; skipping YAML parsing.") + return None + with open(path, "r", encoding="utf-8") as f: + data = yaml.safe_load(f) + # The prompt YAML stores the user prompt under the top-level key 'user' + return 
data.get("user") if isinstance(data, dict) else None + + +def stream_jsonl_lengths(jsonl_path: str, tokenizer) -> pd.DataFrame: + """ + Streams a JSONL file and extracts token lengths for the 'generation' field. + If a line includes 'num_generated_tokens', we keep it for reference. + """ + records = [] + if not os.path.exists(jsonl_path): + raise FileNotFoundError(f"JSONL not found: {jsonl_path}") + + # Count lines up front so tqdm can report accurate progress. + with open(jsonl_path, "r", encoding="utf-8") as f: + total_lines = sum(1 for _ in f) + + with open(jsonl_path, "r", encoding="utf-8") as f: + for i, line in tqdm(enumerate(f, start=1), total=total_lines, desc="Processing lines"): + line = line.strip() + if not line: + continue + try: + row = json.loads(line) + except json.JSONDecodeError: + print(f"[warn] Skipping malformed JSON at line {i}") + continue + + gen = row.get("generation", "") + # Some pipelines store the content under nested keys; add common fallbacks here if needed. + if not isinstance(gen, str): + # Try a nested alternative if your data uses it (customize as needed): + # gen = row.get("output", {}).get("text", "") + pass + + token_len = count_tokens(gen, tokenizer) + ref_len = row.get("num_generated_tokens", None) + + records.append({"index": i, "token_length": token_len, "num_generated_tokens_field": ref_len}) + + df = pd.DataFrame.from_records(records) + return df + + +def describe_lengths(df: pd.DataFrame, col: str = "token_length") -> pd.Series: + if df.empty: + return pd.Series(dtype=float) + s = df[col].dropna().astype(int) + # Concise summary statistics + desc = pd.Series( + { + "count": int(s.shape[0]), + "min": int(s.min()) if len(s) else 0, + "p10": int(s.quantile(0.10)) if len(s) else 0, + "p25": int(s.quantile(0.25)) if len(s) else 0, + "median": int(s.median()) if len(s) else 0, + "p75": int(s.quantile(0.75)) if len(s) else 0, + "p90": int(s.quantile(0.90)) if len(s) else 0, + "max": int(s.max()) if len(s) else 0, + "mean": float(s.mean()) if len(s) 
+            else 0.0,
+            "std": float(s.std(ddof=1)) if len(s) > 1 else 0.0,
+        }
+    )
+    return desc
+
+
+def plot_histogram(df: pd.DataFrame, out_png: str, col: str = "token_length"):
+    if df.empty:
+        print("[warn] No data to plot.")
+        return
+    x = df[col].dropna().astype(int)
+    # Square-root rule for the bin count, clamped to [10, 60]
+    bins = min(60, max(10, int(math.sqrt(len(x)))))
+    plt.figure(figsize=(9, 5))
+    plt.hist(x, bins=bins)
+    plt.title("Distribution of Generation Token Lengths")
+    plt.xlabel("Token length per sample")
+    plt.ylabel("Count")
+    plt.tight_layout()
+    plt.savefig(out_png, dpi=150)
+    plt.close()
+    print(f"[info] Saved histogram to: {out_png}")
+
+
+def main():
+    ap = argparse.ArgumentParser(description="Analyze token lengths of 'generation' fields in a JSONL.")
+    ap.add_argument("--jsonl", required=True, help="Path to JSONL file with a 'generation' field.")
+    ap.add_argument("--yaml", default=None, help="Optional YAML prompt file with key 'user'.")
+    ap.add_argument("--model", default=None, help="HF tokenizer name (e.g., 'openai/gpt-oss-120b').")
+    ap.add_argument("--out", default="generation_lengths.csv", help="Output CSV path.")
+    ap.add_argument("--plot", default="generation_lengths_hist.png", help="Output PNG for histogram.")
+    args = ap.parse_args()
+
+    tokenizer = load_tokenizer(args.model)
+    if tokenizer is None:
+        print("[warn] Could not load tokenizer; falling back to whitespace token counts.")
+
+    # Optional: count tokens in the prompt
+    prompt = read_yaml_prompt(args.yaml)
+    if prompt:
+        prompt_tokens = count_tokens(prompt, tokenizer)
+        print(f"[info] Prompt tokens: {prompt_tokens} (from {args.yaml})")
+
+    # Stream the JSONL and compute per-sample lengths
+    df = stream_jsonl_lengths(args.jsonl, tokenizer)
+
+    # Save per-row lengths
+    os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
+    df.to_csv(args.out, index=False)
+    print(f"[info] Wrote per-sample lengths to: {args.out}")
+
+    # Print concise stats
+    stats = describe_lengths(df, "token_length")
+    if not stats.empty:
+        print("\n=== Token Length Summary (generation) ===")
+        for k, v in stats.items():
+            print(f"{k:>6}: {v}")
+
+    # If the file carries a 'num_generated_tokens' column, report its correlation with our counts
+    if "num_generated_tokens_field" in df.columns and df["num_generated_tokens_field"].notna().any():
+        try:
+            corr = df[["token_length", "num_generated_tokens_field"]].dropna().corr().iloc[0, 1]
+            print(f"\n[info] Correlation with 'num_generated_tokens' field: {corr:.3f}")
+        except Exception:
+            pass
+
+    # Plot histogram
+    plot_histogram(df, args.plot, "token_length")
+
+
+if __name__ == "__main__":
+    main()
+
+"""
+Usage:
+python src/utils/token_usage.py \
+    --jsonl outputs/sdg_gpt_oss/output.jsonl \
+    --yaml data/prompt_incident.yaml \
+    --model openai/gpt-oss-120b \
+    --out outputs/generation_lengths.csv \
+    --plot outputs/generation_lengths_hist.png
+"""
diff --git a/recipes/noc-reasoning-agent/scripts/visualization/extract_representation_columns.py b/recipes/noc-reasoning-agent/scripts/visualization/extract_representation_columns.py
new file mode 100644
index 0000000000..c9b218cfcc
--- /dev/null
+++ b/recipes/noc-reasoning-agent/scripts/visualization/extract_representation_columns.py
@@ -0,0 +1,156 @@
+import json
+import os
+import re
+from collections import defaultdict
+
+import pandas as pd
+
+# ---------- Paths ----------
+jsonl_path = "outputs/filtering_soft_with_keywords/output.jsonl"
+output_jsonl_path = "outputs/filtering_soft_with_keywords/output_with_categories.jsonl"
+csv_path = "data/human_intervention_incidents_soft.csv"
+samples_dir = "outputs/filtering_soft_with_keywords/samples"
+
+os.makedirs(os.path.dirname(output_jsonl_path), exist_ok=True)
+os.makedirs(samples_dir, exist_ok=True)
+
+# ---------- Load CSV (index by incident id: synthetic=incident_identifier, legacy=number) ----------
+sample = pd.read_csv(csv_path, nrows=0)
+id_col = "incident_identifier" if "incident_identifier" in sample.columns else "number"
+df = pd.read_csv(csv_path, encoding="latin1", dtype={id_col: str})
+df[id_col] = df[id_col].astype(str)
+
+needed_cols = [id_col, "time_to_resolve", "solved_category", "solved_reason"]
+missing = [c for c in needed_cols if c not in df.columns]
+if missing:
+    raise ValueError(f"CSV is missing required columns: {missing}")
+
+csv_idx = df.set_index(id_col)[["time_to_resolve", "solved_category", "solved_reason"]].to_dict(orient="index")
+
+
+# ---------- Helper: safe enrichment ----------
+def enrich_row(row, lookup):
+    num = row.get("incident_identifier", row.get("number"))
+    info = lookup.get(num)
+    if info is None:
+        # No match: set to None (or choose sensible defaults)
+        row["time_to_resolve"] = None
+        row["solved_category"] = row.get("solved_category")  # preserve if already present
+        row["solved_reason"] = row.get("solved_reason")
+    else:
+        row["time_to_resolve"] = info.get("time_to_resolve")
+        row["solved_category"] = info.get("solved_category")
+        row["solved_reason"] = info.get("solved_reason")
+    return row
+
+
+# ---------- Read input JSONL and enrich ----------
+enriched_rows = []
+with open(jsonl_path, "r", encoding="utf-8") as fin:
+    for line in fin:
+        if not line.strip():
+            continue
+        row = json.loads(line)
+        row = enrich_row(row, csv_idx)
+        enriched_rows.append(row)
+
+# ---------- Write output JSONL ----------
+with open(output_jsonl_path, "w", encoding="utf-8") as fout:
+    for row in enriched_rows:
+        fout.write(json.dumps(row, ensure_ascii=False) + "\n")
+
+print(f"Wrote enriched JSONL → {output_jsonl_path} (n={len(enriched_rows)})")
+
+# ---------- Bucket by resolution method (fallback: close code) ----------
+by_cat = defaultdict(list)
+for r in enriched_rows:
+    cat = r.get("resolution_method") or r.get("close_code") or "Unknown"
+    by_cat[cat].append(r)
+
+
+# ---------- Length bucketing rules (by token-ish count) ----------
+def token_count(text: str) -> int:
+    if not isinstance(text, str):
+        return 0
+    # crude token proxy: whitespace-split
+    return len(text.split())
+
+
+# You can tweak these thresholds if your generations are generally longer/shorter
+SHORT_MAX = 1000  # tokens
+MEDIUM_MAX = 5000  # tokens
+# long: > MEDIUM_MAX
+
+
+def length_bucket(text: str) -> str:
+    n = token_count(text)
+    if n <= SHORT_MAX:
+        return "short"
+    elif n <= MEDIUM_MAX:
+        return "medium"
+    return "long"
+
+
+# ---------- Pick 5 samples per category with ≥1 short, ≥1 medium, ≥1 long (if available) ----------
+def pick_samples(rows, n=5):
+    # Build buckets
+    buckets = {"short": [], "medium": [], "long": []}
+    for r in rows:
+        gen = r.get("generation", "")
+        b = length_bucket(gen)
+        buckets[b].append(r)
+
+    # Deterministic sort within buckets: by token length
+    for b in buckets:
+        buckets[b].sort(key=lambda x: token_count(x.get("generation", "")))
+
+    picked = []
+
+    # 1) Ensure coverage: pick shortest short, median medium, longest long when available
+    if buckets["short"]:
+        picked.append(buckets["short"][0])  # shortest short
+        buckets["short"] = buckets["short"][1:]
+
+    if buckets["medium"]:
+        mid = len(buckets["medium"]) // 2
+        picked.append(buckets["medium"][mid])  # median medium
+        buckets["medium"].pop(mid)
+
+    if buckets["long"]:
+        picked.append(buckets["long"][-1])  # longest long
+        buckets["long"].pop(-1)
+
+    # 2) Fill remaining slots from the buckets in round-robin: short → medium → long
+    order = ["short", "medium", "long"]
+    i = 0
+    while len(picked) < n and any(buckets[b] for b in order):
+        b = order[i % 3]
+        if buckets[b]:
+            picked.append(buckets[b].pop(0))
+        i += 1
+
+    # If still short of n, top up from whatever remains (unlikely)
+    if len(picked) < n:
+        remaining = buckets["short"] + buckets["medium"] + buckets["long"]
+        picked.extend(remaining[: (n - len(picked))])
+
+    return picked[:n]
+
+
+# ---------- Write sample files per category ----------
+def sanitize(name: str) -> str:
+    return re.sub(r"[^A-Za-z0-9_.-]+", "_", str(name))
+
+
+for cat, rows in by_cat.items():
+    samples = pick_samples(rows, n=5)
+    out_path = os.path.join(samples_dir, f"samples_{sanitize(cat)}.jsonl")
+    with open(out_path, "w", encoding="utf-8") as fout:
+        for r in samples:
+            # Keep the whole record; downstream you can read r["generation"] for the trace
+            fout.write(json.dumps(r, ensure_ascii=False) + "\n")
+    # quick summary
+    lengths = [length_bucket(r.get("generation", "")) for r in samples]
+    print(
+        f"Category: {cat:>20} | total={len(rows):4d} | wrote={len(samples):2d} | mix={dict((x, lengths.count(x)) for x in set(lengths))} -> {out_path}"
+    )
diff --git a/recipes/noc-reasoning-agent/scripts/visualization/extract_scores.py b/recipes/noc-reasoning-agent/scripts/visualization/extract_scores.py
new file mode 100644
index 0000000000..a19e9556fe
--- /dev/null
+++ b/recipes/noc-reasoning-agent/scripts/visualization/extract_scores.py
@@ -0,0 +1,44 @@
+import json
+
+import pandas as pd
+
+# Replace with your actual JSONL file path
+file_path = "evaluations.jsonl"
+
+# Load JSONL file
+data = []
+with open(file_path, "r", encoding="utf-8") as f:
+    for line in f:
+        data.append(json.loads(line))
+
+# Convert to pandas DataFrame
+df = pd.DataFrame(data)
+
+# Metrics to calculate averages for
+metrics = ["rouge1", "rougeL", "bertscore_f1", "llm_judge_score"]
+
+# Check which columns are available
+available_metrics = [metric for metric in metrics if metric in df.columns]
+
+if not available_metrics:
+    raise ValueError("No required metrics found in the JSONL file!")
+
+# Calculate averages
+averages = df[available_metrics].mean()
+
+# Display results
+print("Average Metrics:")
+for metric, avg in averages.items():
+    print(f"{metric}: {avg:.4f}")
+
+
+# Select only the relevant columns
+columns_to_display = ["expected_answer", "agent_response", "llm_judge_reason"]
+df_subset = df[columns_to_display].head(10)
+
+# Display the first ten comparisons neatly
+for idx, row in df_subset.iterrows():
+    print(f"\n--- Sample {idx + 1} ---")
+    print(f"True Answer (expected_answer): {row['expected_answer']}")
+    print(f"Model Answer (agent_response): {row['agent_response']}")
+    print(f"Judge Explanation (llm_judge_reason): {row['llm_judge_reason']}")
diff --git a/recipes/noc-reasoning-agent/scripts/visualization/generate_trace_visualization.py b/recipes/noc-reasoning-agent/scripts/visualization/generate_trace_visualization.py
new file mode 100644
index 0000000000..ee138aaa3e
--- /dev/null
+++ b/recipes/noc-reasoning-agent/scripts/visualization/generate_trace_visualization.py
@@ -0,0 +1,263 @@
+import html
+import json
+import re
+import sys
+from pathlib import Path
+
+import pandas as pd
+
+
+def parse_steps(reasoning):
+    """Parses the Thought/Action/Observation steps from the reasoning trace."""
+    step_re = re.compile(r"^(Thought|Action|Observation)\s+(\d+):\s*(.*)$", re.MULTILINE)
+    finish_re = re.compile(r"^Finish\[(.*)\]$", re.MULTILINE | re.DOTALL)
+
+    steps = {}
+    for kind, num, text in step_re.findall(reasoning or ""):
+        steps.setdefault(int(num), {})[kind.lower()] = text.strip()
+
+    ordered_steps = [{"index": i, **steps[i]} for i in sorted(steps)]
+    finish_match = finish_re.search(reasoning or "")
+    finish_text = finish_match.group(1).strip() if finish_match else None
+
+    return ordered_steps, finish_text
+
+
+def parse_final_reasoning(generation_text):
+    """Parse the 'generation' field to extract the final Thought, Action, and Observation steps."""
+    trace_block_match = re.search(
+        r"Question:.*?(Finish\[.*?\])",
+        generation_text,
+        re.DOTALL,
+    )
+
+    if trace_block_match:
+        reasoning_trace = trace_block_match.group(0)
+    else:
+        last_index = generation_text.rfind("Finish")
+        if last_index != -1:
+            reasoning_trace = generation_text[last_index:].strip()
+        else:
+            return "Final reasoning trace not found."
+
+    step_pattern = re.compile(
+        r"^(Thought|Action|Observation)\s+\d+:\s*(.*)$",
+        re.MULTILINE,
+    )
+    steps = step_pattern.findall(reasoning_trace)
+
+    parsed_steps = [f"**{kind}:** {content}" for kind, content in steps]
+    return "\n".join(parsed_steps) if parsed_steps else reasoning_trace
+
+
+def find_finish_action(generation_text):
+    """Extract the Finish[...] action text from a generation string."""
+    last_index = generation_text.rfind("Finish")
+    if last_index != -1:
+        return generation_text[last_index:].strip()
+    return ""
+
+
+def parse_generation(generation_text):
+    """
+    Extracts the clean, final reasoning trace from the raw 'generation' field.
+    The trace starts with 'Question:' and ends with 'Finish[...]'.
+    """
+    # Regex to find the block starting with "Question:" and ending with "Finish[...]"
+    trace_re = re.compile(r"Question:.*Finish\[.*\]", re.DOTALL)
+    match = trace_re.search(generation_text or "")
+    return match.group(0).strip() if match else ""
+
+
+def parse_work_notes(work_notes_text):
+    """Splits work notes into a list of entries based on timestamps."""
+    if not work_notes_text:
+        return []
+
+    # This regex captures the full timestamp line (e.g., "2024-11-22 10:37:54 - ST (Work notes)")
+    timestamp_pattern = r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s-.*?\))"
+
+    # Split the text by the timestamp pattern, keeping the timestamps
+    parts = re.split(timestamp_pattern, work_notes_text)
+
+    notes = []
+    # The first item is usually empty, so we start from the first captured timestamp
+    i = 1
+    while i < len(parts):
+        timestamp = parts[i].strip()
+        note_text = parts[i + 1].strip()
+        if note_text:  # Only add entries that have content
+            notes.append({"timestamp": timestamp, "note": note_text})
+        i += 2
+
+    return notes
+
+
+def esc(s):
+    """Helper function for HTML escaping."""
+    return html.escape(str(s) if s is not None else "")
+
+
+def render(incident_data):
+    """Renders a single incident into a collapsible HTML section."""
+    reasoning_trace = incident_data.get("generation")
+    finish_action = find_finish_action(reasoning_trace or "")
+
+    # --- Part 1: Human-Readable Incident Trace ---
+    trace_html = ""
+    fields_to_display = [
+        "incident_identifier",
+        "urgency_level",
+        "incident_classification",
+        "incident_subtype",
+        "responsible_team",
+        "fault_category",
+        "detection_timestamp",
+        "incident_summary",
+        "geographical_territory",
+        "service_domain",
+        "equipment_provider",
+        "operational_zone",
+        "affected_site",
+        "escalation_date",
+        "generation_start_time",
+        "generation_end_time",
+        "time_to_resolve",
+        "solved_category",
+        "solved_reason",
+    ]
+    for field in fields_to_display:
+        value = incident_data.get(field)
+        display_value = value if value and str(value).strip() else "—"
+        field_name = field.replace("u_", "").replace("_", " ").title()
+        trace_html += f"<tr><td class='field'>{field_name}</td><td>{esc(display_value)}</td></tr>"
+
+    # --- Part 2: Chronological Work Notes ---
+    work_notes = incident_data.get("action_chronicle") or incident_data.get("work_notes")
+    work_notes_entries = parse_work_notes(work_notes)
+    notes_html = "<div class='notes'>"
+    for entry in work_notes_entries:
+        notes_html += f"""
+        <div class='note'>
+            <div class='note-ts'>{esc(entry["timestamp"])}</div>
+            <div class='note-text'>{esc(entry["note"])}</div>
+        </div>
+        """
+    notes_html += "</div>"
+
+    steps, finish = parse_steps(reasoning_trace)
+    steps_html = []
+    for s in steps:
+        block = [f"<div class='step-title'>Step {s['index']}</div>"]
+        for key, cls in [("thought", "thought"), ("action", "action"), ("observation", "obs")]:
+            if s.get(key):
+                block.append(f"<div class='{cls}'><b>{key.title()}:</b> {esc(s[key])}</div>")
+        steps_html.append(f"<div class='step'>{''.join(block)}</div>")
+
+    # --- Assemble the final collapsible report ---
+    incident_id = esc(incident_data.get("incident_identifier", incident_data.get("number", "Unknown Incident")))
+    short_desc = esc(incident_data.get("incident_summary", incident_data.get("short_description", "")))
+
+    return f"""
+    <details class='incident'>
+        <summary>
+            <b>{incident_id}</b> | {short_desc}
+            <br>
+            <small>
+            • Category: {esc(incident_data.get("incident_classification", incident_data.get("category")))}
+            • Problem Code: {esc(incident_data.get("fault_category", incident_data.get("u_problem_code")))}
+            </small>
+        </summary>
+        <h3>Incident Trace (Human-Readable)</h3>
+        <table class='trace'>{trace_html}</table>
+        <h3>Work Notes (Chronological)</h3>
+        {notes_html}
+        <h3>Full Reasoning Trace</h3>
+        <pre class='raw-trace'>{esc(reasoning_trace)}</pre>
+        <h3>Thoughts, Observations, Actions</h3>
+        <div class='steps'>{"".join(steps_html)}</div>
+        <h3>Closing Notes</h3>
+        <pre class='finish'>{esc(finish_action)}</pre>
+    </details>
+    """
+
+
+def main(input_file, output_file, max_incidents=15, selected_criteria=None):
+    """Main function to read, process, and write the HTML report."""
+    try:
+        # 1. Load the entire dataset using pandas
+        df = pd.read_json(input_file, lines=True)
+        # Prepare for complex filters by adding helper columns
+        notes_col = "action_chronicle" if "action_chronicle" in df.columns else "work_notes"
+        df["work_notes_len"] = df[notes_col].astype(str).str.len().fillna(0)
+        print(f"Loaded {len(df)} incidents from {input_file}")
+    except FileNotFoundError:
+        print(f"Error: Input file not found at '{input_file}'")
+        return
+    except ValueError:
+        print(f"Error: Could not parse {input_file}. Ensure it's a valid .jsonl file.")
+        return
+
+    if selected_criteria:
+        filtered_df = df.query(f"category == '{selected_criteria}'")
+    else:
+        filtered_df = df
+    data = filtered_df.head(max_incidents).to_dict(orient="records")
+
+    style = """
+    <style>
+        body { font-family: sans-serif; margin: 2em; }
+        details.incident { border: 1px solid #ccc; border-radius: 6px; margin: 1em 0; padding: 0.5em 1em; }
+        table.trace td { border-bottom: 1px solid #eee; padding: 4px 8px; }
+        td.field { font-weight: bold; }
+        .step { border-left: 3px solid #888; margin: 0.5em 0; padding-left: 0.75em; }
+        .thought { color: #1a4d8f; }
+        .action { color: #7a3f00; }
+        .obs { color: #2e6b2e; }
+        .note-ts { font-weight: bold; }
+        pre { white-space: pre-wrap; background: #f7f7f7; padding: 0.75em; }
+    </style>
+    """
+
+    html_doc = f"<!DOCTYPE html><html><head><title>Incident Report</title>{style}</head><body>"
+    html_doc += "<h1>Incident Analysis Report</h1>"
+
+    for inc_data in data:
+        html_doc += render(inc_data)
+
+    html_doc += "</body></html>"
+
+    Path(output_file).write_text(html_doc, encoding="utf-8")
+    print(f"Wrote {len(data)} incidents to {output_file}")
+
+
+if __name__ == "__main__":
+    input_filename = sys.argv[1]
+    output_filename = sys.argv[2]
+    if len(sys.argv) > 3:
+        max_incidents = int(sys.argv[3])
+    else:
+        max_incidents = 15
+    if len(sys.argv) > 4:
+        selected_criteria = sys.argv[4]
+    else:
+        selected_criteria = None
+    main(input_filename, output_filename, max_incidents, selected_criteria)