Merged
8 changes: 4 additions & 4 deletions .github/workflows/ci-failure-monitor.yml
@@ -6,9 +6,9 @@ on:
workflow_dispatch:
inputs:
limit:
description: 'Number of workflow runs to analyze'
description: 'Number of workflow runs to analyze (across all workflows)'
required: false
default: '300'
default: '800'
type: string
threshold:
description: 'Alert threshold for consecutive failures'
@@ -51,8 +51,8 @@ jobs:
cd scripts/ci_monitor
python ci_failures_analysis.py \
--token $GITHUB_TOKEN \
--limit ${{ inputs.limit || '300' }} \
--threshold ${{ inputs.threshold || '2' }} \
--limit ${{ inputs.limit || '800' }} \
--threshold ${{ inputs.threshold || '4' }} \
--output ci_failure_analysis_$(date +%Y%m%d_%H%M%S).json

- name: Upload Analysis Results
31 changes: 31 additions & 0 deletions scripts/ci_monitor/README.md
@@ -40,8 +40,11 @@ A comprehensive toolkit to analyze CI failures and performance trends for the SG
### Failures Analyzer (`ci_failures_analysis.py`)
- **Consecutive Failure Tracking**: Identify jobs currently failing
- **Runner Health Monitoring**: Track runner failure rates and identify problematic infrastructure
- **Multi-Workflow Support**: Monitors PR Test (Nvidia), PR Test (AMD), and PR Test (Xeon) workflows
- **Queue Time Tracking**: Monitor average and P90 queue times per runner type
- **Alert System**: Automatic alerts for consecutive failures and runner problems
- **Instance Tracking**: Monitor specific runner instances for targeted remediation
- **Slack Notifications**: Send condensed alerts to Slack (top 3 jobs/runners by consecutive failures and failure rates)
- **GitHub Integration**: Generate comprehensive summaries with actionable recommendations
- **JSON Export**: Export detailed analysis data for further processing

@@ -160,6 +163,33 @@ python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 300 --threshold 2
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 500 --threshold 3
```

#### Monitored Workflows

The Failures Analyzer monitors the following workflows:

- **PR Test** - Nvidia GPU tests (self-hosted runners: 1-gpu-runner, 4-gpu-h100-runner, etc.)
- **PR Test (AMD)** - AMD GPU tests (AMD-specific runners)
- **PR Test (Xeon)** - Intel Xeon CPU tests (Xeon-specific runners)

All three workflows are analyzed together, with runner statistics tracked separately by runner type.

#### Slack Notifications

The Failures Analyzer can send condensed alerts to Slack. See [SLACK_SETUP.md](SLACK_SETUP.md) for complete setup instructions.

**What gets sent:**
- Top 3 jobs with consecutive failures
- Top 3 runners with consecutive failures
- Top 3 jobs with highest total failure rate
- Top 3 runners with highest total failure rate
- Queue time summary

```bash
# Send Slack notification from analysis JSON
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
python slack_notifier.py --json ci_failure_analysis.json
```
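The "top 3" selection described above amounts to a sort-and-slice. A minimal sketch, assuming an illustrative `consecutive_failures` field rather than the notifier's actual schema:

```python
# Minimal sketch of the "top 3 by consecutive failures" selection.
# The "consecutive_failures" key is an illustrative assumption, not
# necessarily the field name used by slack_notifier.py.
def top_three(entries, key="consecutive_failures"):
    """Return the three entries with the highest value for `key`."""
    return sorted(entries, key=lambda e: e[key], reverse=True)[:3]
```

The same helper would cover the failure-rate lists by passing a different key.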

#### Understanding the Output

The script generates a **2-section report**:
@@ -170,6 +200,7 @@ The script generates a **2-section report**:

**Section 2: Runner Health Analysis**
- Shows which runners have high failure rates
- Includes queue time metrics (average and P90)
- Helps identify infrastructure vs code issues
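The average and P90 queue-time metrics mentioned above can be sketched as follows; the `runner_type` and `queue_seconds` field names are assumptions for illustration, not the analyzer's actual schema:

```python
# Hedged sketch: average and P90 (nearest-rank) queue times per runner type.
# Field names ("runner_type", "queue_seconds") are illustrative assumptions,
# not the actual JSON schema produced by ci_failures_analysis.py.
import math
from collections import defaultdict

def queue_time_summary(runs):
    """Return {runner_type: (avg_seconds, p90_seconds)}."""
    by_runner = defaultdict(list)
    for run in runs:
        by_runner[run["runner_type"]].append(run["queue_seconds"])
    summary = {}
    for runner, times in by_runner.items():
        times.sort()
        avg = sum(times) / len(times)
        # Nearest-rank P90: smallest value covering 90% of the samples.
        idx = max(0, math.ceil(0.9 * len(times)) - 1)
        summary[runner] = (avg, times[idx])
    return summary
```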

#### Alert Types