A comprehensive benchmarking platform that evaluates AI coding agents using real-world GitHub issues from SWE-bench, integrated with the Roo SPARC methodology for structured, secure, and measurable software engineering workflows.
The Roo SPARC system transforms SWE-bench from a simple dataset into a complete evaluation framework that measures not just correctness, but also efficiency, security, and methodology adherence across thousands of real GitHub issues.
SWE-bench provides thousands of real GitHub issues with ground-truth solutions and unit tests. The Roo SPARC system enhances this with:
- Structured Methodology: SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) workflow
- Multi-Modal Evaluation: Specialized AI modes for different coding tasks (debugging, testing, security, etc.)
- Comprehensive Metrics: Steps, cost, time, complexity, and correctness tracking
- Security-First Approach: No hardcoded secrets, modular design, secure task isolation
- Database-Driven Workflow: SQLite integration for task management and analytics
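As a minimal illustration of the database-driven workflow, the sketch below pulls one unstarted task from the SQLite database. The database path, table, and column names (`swe_bench_lite.db`, `tasks`, `status`, `instance_id`) are assumptions made for illustration; consult the project's database documentation for the actual schema.

```python
import sqlite3

# Minimal sketch of the database-driven workflow.
# NOTE: the database path, table, and column names below are assumptions
# for illustration, not the project's documented schema.
DB_PATH = "swe-bench-sqlite/databases/swe_bench_lite.db"

def fetch_pending_task(db_path: str = DB_PATH):
    """Return one task that has not been started yet, or None."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT instance_id, repo, problem_statement "
            "FROM tasks WHERE status = 'pending' LIMIT 1"
        ).fetchone()
    return dict(row) if row else None

if __name__ == "__main__":
    task = fetch_pending_task()
    print(task["instance_id"] if task else "no pending tasks")
```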
- Specification Mode: Requirements analysis and edge case identification
- Pseudocode Mode: High-level logic design with TDD anchors
- Architecture Mode: Modular system design and component boundaries
- Refinement Mode: Implementation with testing and security reviews
- Completion Mode: Integration, documentation, and final validation
- 🧠 Auto-Coder: Clean, modular code implementation
- 🧪 Tester (TDD): Test-driven development and coverage
- 🪲 Debugger: Runtime bug analysis and error resolution
- 🛡️ Security Reviewer: Vulnerability assessment and secure coding
- 📚 Documentation Writer: Comprehensive technical documentation
- 🔗 System Integrator: Component integration and cohesion
- 🎯 Benchmark Orchestrator: SWE-bench evaluation management
- Step Tracking: Detailed execution logs with timestamps
- Complexity Analysis: Task categorization (simple/medium/complex)
- Performance Metrics: Success rates, efficiency patterns, cost analysis
- Security Compliance: Secret exposure prevention, modular boundaries
- Repository Statistics: Per-project performance insights
- Zero Hardcoded Secrets: Environment abstraction required (see the sketch after this list)
- Modular Design: Files limited to 500 lines maximum
- Isolated Execution: Task-specific workspaces
- Solution Security: No solution exposure during active problem solving
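A minimal sketch of the zero-hardcoded-secrets rule: credentials are read from the environment and execution fails fast when they are missing. The `GITHUB_TOKEN` variable name is a hypothetical example, not a variable the project requires.

```python
import os

# Read credentials from the environment instead of hardcoding them.
# GITHUB_TOKEN is a hypothetical variable name used for illustration.
def get_required_env(name: str) -> str:
    """Return an environment variable or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

github_token = get_required_env("GITHUB_TOKEN")  # never committed to the repo
```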
- Standardized Evaluation: Consistent methodology across experiments
- Comprehensive Metrics: Beyond simple pass/fail to include efficiency and quality
- Real-World Validation: Actual GitHub issues, not synthetic problems
- Reproducible Results: Detailed execution logs and structured workflows
- Code Quality Assessment: Security, modularity, and maintainability metrics
- Methodology Validation: SPARC workflow effectiveness measurement
- Performance Optimization: Identify bottlenecks and improvement opportunities
- Compliance Tracking: Ensure adherence to coding standards and security practices
- Benchmark Comparisons: Standardized evaluation across different AI systems
- Cost Analysis: Resource utilization and efficiency metrics
- Quality Assurance: Automated validation of AI-generated solutions
- Continuous Improvement: Data-driven enhancement of AI capabilities
| Metric | Description | Goal |
|--------|-------------|------|
| Correctness | Unit test pass rate | Functional accuracy |
| Steps | Number of execution steps | Efficiency measurement |
| Time | Wall-clock completion time | Performance assessment |
| Cost | Token usage and API costs | Resource efficiency |
| Complexity | Step-based task categorization | Difficulty analysis |
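To make the Complexity metric concrete, the sketch below categorizes a task by its recorded step count. The thresholds are illustrative assumptions, not the cutoffs the benchmark actually uses.

```python
# Illustrative sketch of step-based complexity categorization.
# The thresholds below are assumptions, not the benchmark's actual cutoffs.
def categorize_complexity(step_count: int) -> str:
    if step_count <= 5:
        return "simple"
    if step_count <= 15:
        return "medium"
    return "complex"

assert categorize_complexity(3) == "simple"
assert categorize_complexity(12) == "medium"
assert categorize_complexity(40) == "complex"
```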
- Repository Performance: Success rates by codebase
- Mode Effectiveness: Performance comparison across AI modes
- Solution Quality: Code quality and maintainability metrics
- Security Compliance: Adherence to secure coding practices
- Methodology Adherence: SPARC workflow compliance
```bash
# Clone the repository
git clone https://github.com/agenticsorg/sparc-bench.git
cd sparc-bench

# Set up Python environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Load SWE-bench dataset into SQLite
cd swe-bench-sqlite/scripts
python load_swe_bench_to_sqlite.py

# Verify database setup
python benchmark_db_helper.py summary

# Get an available task
python benchmark_db_helper.py get_task

# Start task execution
python benchmark_db_helper.py start_task <instance_id>

# Monitor progress and analyze results
python benchmark_db_helper.py step_analytics
```
```
sparc-bench/
├── swe-bench-sqlite/        # Database and task management
│   ├── databases/           # SQLite databases (lite & full)
│   ├── scripts/             # Database utilities and helpers
│   └── README.md            # Database documentation
├── swe-bench-workspace/     # Active task execution
│   ├── active/              # Isolated task workspaces
│   ├── results/             # Completion results and reports
│   └── config/              # Configuration and environment
├── .roo/                    # Roo SPARC mode definitions
│   ├── rules-benchmark/     # Benchmark orchestrator rules
│   └── rules-code/          # Code editing guidelines
├── .roomodes                # Mode configurations and instructions
└── plans/                   # Architecture and planning docs
```
```bash
# Get task without solution exposure
python benchmark_db_helper.py get_task

# Start task with timing
python benchmark_db_helper.py start_task <instance_id>

# Log execution steps
python benchmark_db_helper.py log_step <instance_id> "Step description"

# Mark completion
python benchmark_db_helper.py update_status <instance_id> completed "Success details"

# Analyze results
python benchmark_db_helper.py task_details <instance_id>

# Reveal solution (post-completion only)
python benchmark_db_helper.py get_solution <instance_id>
```
Define specialized modes in .roomodes for specific evaluation scenarios:

```yaml
customModes:
  - slug: custom-evaluator
    name: 🎯 Custom Evaluator
    roleDefinition: Your custom evaluation logic
    customInstructions: Specific instructions for your use case
```
- Full Dataset: 2,294 real GitHub issues
- Lite Dataset: 300 curated issues for faster evaluation (see the loading sketch below)
- Custom Datasets: Load your own evaluation sets
- Batch Processing: Parallel task execution
- Resource Limits: Memory and time constraints
- Quality Gates: Automated quality checks
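For quick inspection of the underlying data outside the SQLite workflow, the lite dataset can typically be pulled from the Hugging Face Hub. The sketch below assumes the `datasets` library and the `princeton-nlp/SWE-bench_Lite` dataset ID with its usual `instance_id`, `repo`, and `problem_statement` fields.

```python
# Sketch: browse SWE-bench Lite directly from the Hugging Face Hub.
# Assumes `pip install datasets` and the princeton-nlp/SWE-bench_Lite dataset ID.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(f"{len(lite)} instances loaded")

example = lite[0]
print(example["instance_id"], example["repo"])
print(example["problem_statement"][:200])
```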
```bash
# Overall progress
python benchmark_db_helper.py summary

# Repository-specific insights
python benchmark_db_helper.py repo_stats

# Step complexity analysis
python benchmark_db_helper.py step_analytics
```
All results are stored in structured SQLite format for:
- Custom analysis and visualization
- Integration with external monitoring tools
- Historical trend analysis
- Performance regression detection
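As one example of custom analysis against the stored results, the sketch below computes per-repository success rates. The `tasks` table and its `repo`/`status` columns are illustrative assumptions about the schema (matching the earlier sketch), not the project's documented tables.

```python
import sqlite3

# Sketch: per-repository success rates from the results database.
# Table and column names (tasks, repo, status) are illustrative assumptions.
DB_PATH = "swe-bench-sqlite/databases/swe_bench_lite.db"

query = """
SELECT repo,
       COUNT(*) AS attempted,
       SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed,
       ROUND(100.0 * SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END)
             / COUNT(*), 1) AS success_rate_pct
FROM tasks
GROUP BY repo
ORDER BY success_rate_pct DESC
"""

with sqlite3.connect(DB_PATH) as conn:
    for repo, attempted, completed, rate in conn.execute(query):
        print(f"{repo}: {completed}/{attempted} ({rate}%)")
```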
```bash
# 1. Initialize evaluation environment
cd swe-bench-sqlite/scripts
python benchmark_db_helper.py summary

# 2. Select and start a task
TASK_ID=$(python benchmark_db_helper.py get_task | jq -r '.instance_id')
python benchmark_db_helper.py start_task $TASK_ID

# 3. Execute with step tracking
python benchmark_db_helper.py log_step $TASK_ID "Analyzing problem statement"
python benchmark_db_helper.py log_step $TASK_ID "Implementing solution"
python benchmark_db_helper.py log_step $TASK_ID "Running tests and validation"

# 4. Complete and analyze
python benchmark_db_helper.py update_status $TASK_ID completed "Solution verified"
python benchmark_db_helper.py task_details $TASK_ID
```
- Fork the repository
- Create feature branch following SPARC methodology
- Implement with step tracking and security compliance
- Run evaluation suite
- Submit pull request with benchmark results
- Modular Design: Keep files under 500 lines (see the check sketch after this list)
- Security First: No hardcoded secrets or credentials
- Test Coverage: Include comprehensive test suites
- Documentation: Update README and mode definitions
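As a rough aid for the modular-design guideline, the sketch below flags Python files that exceed the 500-line limit. The scanned directory, file pattern, and exit behavior are illustrative assumptions, not an official project check.

```python
import sys
from pathlib import Path

# Sketch: flag source files that exceed the 500-line modularity limit.
# The scanned directory and file pattern are illustrative assumptions.
MAX_LINES = 500

def oversized_files(root: str = ".", pattern: str = "*.py"):
    """Yield (path, line_count) for files longer than MAX_LINES."""
    for path in Path(root).rglob(pattern):
        count = sum(1 for _ in path.open("r", encoding="utf-8", errors="ignore"))
        if count > MAX_LINES:
            yield path, count

if __name__ == "__main__":
    violations = list(oversized_files())
    for path, count in violations:
        print(f"{path}: {count} lines (limit {MAX_LINES})")
    sys.exit(1 if violations else 0)
```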
- SWE-bench Official: https://www.swebench.com/
- Dataset Documentation: SWE-bench Datasets Guide
- GitHub Repository: SWE-bench on GitHub
- Research Papers: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
This project is licensed under the MIT License - see the LICENSE file for details.
The Roo SPARC Coding Evaluation & Benchmark System transforms software engineering evaluation from simple correctness checking into comprehensive methodology assessment, providing the insights needed to build more effective, secure, and maintainable AI coding systems.
Created by rUv - Bridging the gap between AI capability and real-world software engineering excellence.