
Roo Code Evaluation & Benchmarking System

A comprehensive benchmarking platform that evaluates AI coding agents using real-world GitHub issues from SWE-bench, integrated with the Roo SPARC methodology for structured, secure, and measurable software engineering workflows.

The Roo SPARC system transforms SWE-bench from a simple dataset into a complete evaluation framework that measures not just correctness, but also efficiency, security, and methodology adherence across thousands of real GitHub issues.


🎯 Overview

SWE-bench provides thousands of real GitHub issues with ground-truth solutions and unit tests. The Roo SPARC system enhances this with:

  • Structured Methodology: SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) workflow
  • Multi-Modal Evaluation: Specialized AI modes for different coding tasks (debugging, testing, security, etc.)
  • Comprehensive Metrics: Steps, cost, time, complexity, and correctness tracking
  • Security-First Approach: No hardcoded secrets, modular design, secure task isolation
  • Database-Driven Workflow: SQLite integration for task management and analytics

✨ Key Features

πŸ—οΈ SPARC Methodology Integration

  • Specification Mode: Requirements analysis and edge case identification
  • Pseudocode Mode: High-level logic design with TDD anchors
  • Architecture Mode: Modular system design and component boundaries
  • Refinement Mode: Implementation with testing and security reviews
  • Completion Mode: Integration, documentation, and final validation

🎯 Specialized AI Modes

  • 🧠 Auto-Coder: Clean, modular code implementation
  • 🧪 Tester (TDD): Test-driven development and coverage
  • 🪲 Debugger: Runtime bug analysis and error resolution
  • 🛡️ Security Reviewer: Vulnerability assessment and secure coding
  • 📚 Documentation Writer: Comprehensive technical documentation
  • 🔗 System Integrator: Component integration and cohesion
  • 🎯 Benchmark Orchestrator: SWE-bench evaluation management

📊 Advanced Analytics

  • Step Tracking: Detailed execution logs with timestamps
  • Complexity Analysis: Task categorization (simple/medium/complex)
  • Performance Metrics: Success rates, efficiency patterns, cost analysis
  • Security Compliance: Secret exposure prevention, modular boundaries
  • Repository Statistics: Per-project performance insights

🔒 Security & Compliance

  • Zero Hardcoded Secrets: Environment abstraction required (see the sketch after this list)
  • Modular Design: Files limited to 500 lines maximum
  • Isolated Execution: Task-specific workspaces
  • Solution Security: No solution exposure during active problem solving
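
As an illustration of the zero-hardcoded-secrets rule, the sketch below reads credentials from the environment and fails loudly when they are missing. The variable name OPENROUTER_API_KEY is a hypothetical example, not a key this project requires.

# config_example.py - minimal sketch of environment abstraction for secrets
import os

def get_required_env(name: str) -> str:
    # Fail loudly instead of falling back to a hardcoded default value.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Hypothetical variable name, used for illustration only.
api_key = get_required_env("OPENROUTER_API_KEY")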

🎯 Benefits

For AI Researchers

  • Standardized Evaluation: Consistent methodology across experiments
  • Comprehensive Metrics: Beyond simple pass/fail to include efficiency and quality
  • Real-World Validation: Actual GitHub issues, not synthetic problems
  • Reproducible Results: Detailed execution logs and structured workflows

For Development Teams

  • Code Quality Assessment: Security, modularity, and maintainability metrics
  • Methodology Validation: SPARC workflow effectiveness measurement
  • Performance Optimization: Identify bottlenecks and improvement opportunities
  • Compliance Tracking: Ensure adherence to coding standards and security practices

For Platform Providers

  • Benchmark Comparisons: Standardized evaluation across different AI systems
  • Cost Analysis: Resource utilization and efficiency metrics
  • Quality Assurance: Automated validation of AI-generated solutions
  • Continuous Improvement: Data-driven enhancement of AI capabilities

📈 Evaluation Metrics

Core Performance Indicators

| Metric      | Description                    | Goal                   |
|-------------|--------------------------------|------------------------|
| Correctness | Unit test pass rate            | Functional accuracy    |
| Steps       | Number of execution steps      | Efficiency measurement |
| Time        | Wall-clock completion time     | Performance assessment |
| Cost        | Token usage and API costs      | Resource efficiency    |
| Complexity  | Step-based task categorization | Difficulty analysis    |
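
The Complexity metric is derived from a task's recorded step count. A minimal sketch of that categorization is shown below; the thresholds are illustrative assumptions, not the project's documented cut-offs.

# complexity_example.py - sketch of step-based task categorization
# Thresholds below are assumptions for illustration only.
def categorize(step_count: int) -> str:
    if step_count <= 5:
        return "simple"
    if step_count <= 15:
        return "medium"
    return "complex"

assert categorize(3) == "simple"
assert categorize(12) == "medium"
assert categorize(40) == "complex"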

Advanced Analytics

  • Repository Performance: Success rates by codebase
  • Mode Effectiveness: Performance comparison across AI modes
  • Solution Quality: Code quality and maintainability metrics
  • Security Compliance: Adherence to secure coding practices
  • Methodology Adherence: SPARC workflow compliance

🚀 Quick Start

1. Environment Setup

# Clone the repository
git clone https://github.com/agenticsorg/sparc-bench.git
cd sparc-bench

# Set up Python environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Database Initialization

# Load SWE-bench dataset into SQLite
cd swe-bench-sqlite/scripts
python load_swe_bench_to_sqlite.py

# Verify database setup
python benchmark_db_helper.py summary

3. Run Your First Benchmark

# Get an available task
python benchmark_db_helper.py get_task

# Start task execution
python benchmark_db_helper.py start_task <instance_id>

# Monitor progress and analyze results
python benchmark_db_helper.py step_analytics

πŸ—οΈ System Architecture

sparc-bench/
├─ swe-bench-sqlite/          # Database and task management
│  ├─ databases/              # SQLite databases (lite & full)
│  ├─ scripts/                # Database utilities and helpers
│  └─ README.md               # Database documentation
├─ swe-bench-workspace/       # Active task execution
│  ├─ active/                 # Isolated task workspaces
│  ├─ results/                # Completion results and reports
│  └─ config/                 # Configuration and environment
├─ .roo/                      # Roo SPARC mode definitions
│  ├─ rules-benchmark/        # Benchmark orchestrator rules
│  └─ rules-code/             # Code editing guidelines
├─ .roomodes                  # Mode configurations and instructions
└─ plans/                     # Architecture and planning docs
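
To illustrate the task isolation reflected in swe-bench-workspace/active/, the sketch below gives each task its own scratch directory. The helper function and layout details are assumptions for illustration, not part of the shipped scripts.

# workspace_example.py - sketch of per-task workspace isolation
from pathlib import Path

WORKSPACE_ROOT = Path("swe-bench-workspace")   # matches the layout shown above

def create_task_workspace(instance_id: str) -> Path:
    # Each SWE-bench task gets its own isolated directory under active/.
    task_dir = WORKSPACE_ROOT / "active" / instance_id
    task_dir.mkdir(parents=True, exist_ok=True)
    return task_dir

print(create_task_workspace("django__django-11099"))  # illustrative instance id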

🎯 Benchmark Orchestrator Workflow

Phase 1: Secure Task Selection

# Get task without solution exposure
python benchmark_db_helper.py get_task

Phase 2: Structured Execution

# Start task with timing
python benchmark_db_helper.py start_task <instance_id>

# Log execution steps
python benchmark_db_helper.py log_step <instance_id> "Step description"

Phase 3: Completion & Analysis

# Mark completion
python benchmark_db_helper.py update_status <instance_id> completed "Success details"

# Analyze results
python benchmark_db_helper.py task_details <instance_id>

# Reveal solution (post-completion only)
python benchmark_db_helper.py get_solution <instance_id>
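
The three phases can also be driven programmatically. The sketch below shells out to the same benchmark_db_helper.py subcommands documented above; it assumes get_task prints JSON (as in the jq example further down) and omits error handling and the agent's actual solving work.

# run_one_task.py - sketch of driving the documented CLI phases from Python
import json
import subprocess

def helper(*args: str) -> str:
    # Run a benchmark_db_helper.py subcommand and return its stdout.
    result = subprocess.run(["python", "benchmark_db_helper.py", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

# Phase 1: secure task selection (no solution exposure)
task = json.loads(helper("get_task"))          # assumes JSON output
instance_id = task["instance_id"]

# Phase 2: structured execution with step logging
helper("start_task", instance_id)
helper("log_step", instance_id, "Analyzing problem statement")
# ... the coding agent does the actual work here ...

# Phase 3: completion and analysis
helper("update_status", instance_id, "completed", "Success details")
print(helper("task_details", instance_id))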

🔧 Advanced Configuration

Custom Mode Creation

Define specialized modes in .roomodes for specific evaluation scenarios:

customModes:
  - slug: custom-evaluator
    name: 🎯 Custom Evaluator
    roleDefinition: Your custom evaluation logic
    customInstructions: Specific instructions for your use case

Database Management

  • Full Dataset: 2,294 real GitHub issues
  • Lite Dataset: 300 curated issues for faster evaluation
  • Custom Datasets: Load your own evaluation sets

Performance Tuning

  • Batch Processing: Parallel task execution
  • Resource Limits: Memory and time constraints
  • Quality Gates: Automated quality checks (sketched below)
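
As one example of an automated quality gate, the sketch below enforces two rules stated elsewhere in this README: source files stay under 500 lines and no obvious hardcoded secrets appear in code. The secret pattern is illustrative, not exhaustive.

# quality_gate_example.py - sketch of 500-line and hardcoded-secret checks
import re
import sys
from pathlib import Path

MAX_LINES = 500
SECRET_PATTERN = re.compile(r"""(api[_-]?key|secret|token)\s*=\s*['"][^'"]+['"]""", re.I)

def check(path: Path) -> list[str]:
    problems = []
    lines = path.read_text(errors="ignore").splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{path}: {len(lines)} lines (limit {MAX_LINES})")
    for number, line in enumerate(lines, 1):
        if SECRET_PATTERN.search(line):
            problems.append(f"{path}:{number}: possible hardcoded secret")
    return problems

issues = [problem for file in Path(".").rglob("*.py") for problem in check(file)]
print("\n".join(issues) or "Quality gate passed")
sys.exit(1 if issues else 0)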

📊 Analytics Dashboard

Real-Time Monitoring

# Overall progress
python benchmark_db_helper.py summary

# Repository-specific insights
python benchmark_db_helper.py repo_stats

# Step complexity analysis
python benchmark_db_helper.py step_analytics

Data Export

All results are stored in structured SQLite format for:

  • Custom analysis and visualization (see the export sketch after this list)
  • Integration with external monitoring tools
  • Historical trend analysis
  • Performance regression detection
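
A minimal export sketch using only the Python standard library is shown below. The database path, table name, and column names are assumptions; inspect the real schema first (for example with .schema in the sqlite3 shell).

# export_example.py - sketch of exporting benchmark results to CSV
# Table and column names below are assumptions, not the actual schema.
import csv
import sqlite3

DB_PATH = "swe-bench-sqlite/databases/swe_bench_lite.db"   # assumed filename

connection = sqlite3.connect(DB_PATH)
connection.row_factory = sqlite3.Row
rows = connection.execute("SELECT * FROM tasks ORDER BY instance_id").fetchall()
connection.close()

with open("results.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    if rows:
        writer.writerow(rows[0].keys())
        writer.writerows([tuple(row) for row in rows])

print(f"Exported {len(rows)} rows to results.csv")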


πŸ” Example Evaluation Run

# 1. Initialize evaluation environment
cd swe-bench-sqlite/scripts
python benchmark_db_helper.py summary

# 2. Select and start a task
TASK_ID=$(python benchmark_db_helper.py get_task | jq -r '.instance_id')
python benchmark_db_helper.py start_task $TASK_ID

# 3. Execute with step tracking
python benchmark_db_helper.py log_step $TASK_ID "Analyzing problem statement"
python benchmark_db_helper.py log_step $TASK_ID "Implementing solution"
python benchmark_db_helper.py log_step $TASK_ID "Running tests and validation"

# 4. Complete and analyze
python benchmark_db_helper.py update_status $TASK_ID completed "Solution verified"
python benchmark_db_helper.py task_details $TASK_ID

🤝 Contributing

Development Workflow

  1. Fork the repository
  2. Create feature branch following SPARC methodology
  3. Implement with step tracking and security compliance
  4. Run evaluation suite
  5. Submit pull request with benchmark results

Guidelines

  • Modular Design: Keep files under 500 lines
  • Security First: No hardcoded secrets or credentials
  • Test Coverage: Include comprehensive test suites
  • Documentation: Update README and mode definitions

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


The Roo SPARC Coding Evaluation & Benchmark System transforms software engineering evaluation from simple correctness checking into comprehensive methodology assessment, providing the insights needed to build more effective, secure, and maintainable AI coding systems.

Created by rUv - Bridging the gap between AI capability and real-world software engineering excellence.
