This is a complete RAG (Retrieval-Augmented Generation) chatbot application built specifically for teaching GenAI testing concepts. The application intentionally contains common production issues that students will discover during the testing exercises.
Frontend (HTML/CSS/JS) → Flask Backend → RAG Pipeline → Cohere API
↓
ChromaDB Vector Store
↑
Knowledge Base Documents
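In code terms, the pipeline above boils down to embed, retrieve, generate. The sketch below is a simplified illustration only, assuming the `cohere` Python SDK and a persistent ChromaDB collection; the collection name, model names, and prompt are placeholders, and the real implementation lives in `app/rag_pipeline.py` (which uses LangChain).

```python
# Simplified illustration of the request flow -- the real pipeline in
# app/rag_pipeline.py is built with LangChain and may differ in detail.
import cohere
import chromadb

co = cohere.Client("YOUR_COHERE_API_KEY")                       # from .env in the real app
chroma = chromadb.PersistentClient(path="data/chroma_db")
collection = chroma.get_or_create_collection("knowledge_base")  # collection name is illustrative

def answer(question: str) -> str:
    # 1. Embed the user question
    query_emb = co.embed(
        texts=[question],
        model="embed-english-v3.0",     # model choice is an assumption
        input_type="search_query",
    ).embeddings[0]

    # 2. Retrieve the most similar document chunks from ChromaDB
    hits = collection.query(query_embeddings=[query_emb], n_results=5)
    context = "\n\n".join(hits["documents"][0])

    # 3. Generate a grounded answer with the Command model
    reply = co.chat(
        model="command-r",              # model choice is an assumption
        message=f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    )
    return reply.text
```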
- Python 3.8+
- Cohere API key (get one at https://cohere.com/)
- 2GB+ RAM for vector database
- Windows PowerShell (for Windows users)
- Clone and Navigate
  cd "c:\Users\jpayne\Documents\Training\Notebooks for ML classes\TestingAITutorial"
- Create Virtual Environment
  python -m venv training-env
  training-env\Scripts\activate        # Windows
  # source training-env/bin/activate   # macOS/Linux
- Install Dependencies
  pip install -r requirements.txt
- Configure Environment
  copy .env.example .env
  # Edit .env and add your Cohere API key
- Run Application
  python run.py
- Access Application
  - Open http://localhost:5000
  - Chat interface should load
  - Try: "What are the key challenges in testing GenAI applications?"
For easy access to all testing capabilities, use the wrapper script for the virtual environment:
python run_demo_with_venv.py
Or use the quick launcher:
python launch.py
These provide menu-driven interfaces to:
- Run optimization experiments
- Execute regression testing
- Launch interactive demos
- Run unit tests
- Start the Flask application
- Access documentation
TestingAITutorial/
├── app/
│ ├── __init__.py # Python package initialization
│ ├── main.py # Flask application and API endpoints
│ ├── rag_pipeline.py # RAG implementation with Cohere + ChromaDB
│ └── utils.py # Utility functions and helpers
├── static/
│ ├── css/
│ │ └── style.css # Professional styling with animations
│ └── js/
│ └── chat.js # Interactive chat interface logic
├── templates/
│ └── index.html # Main chat interface template
├── data/
│ ├── documents/ # Knowledge base documents (GenAI testing content)
│ │ ├── genai_testing_guide.md
│ │ ├── faq_genai_testing.md
│ │ ├── production_best_practices.md
│ │ └── evaluation_metrics.md
│ └── chroma_db/ # Vector database storage (auto-created)
├── tests/
│ ├── test_rag_pipeline.py # Core RAG pipeline unit tests
│ ├── test_regression_framework.py # Regression testing framework tests
│ └── evaluation_framework.py # Advanced evaluation tools
├── experiments/ # Educational optimization experiments
│ ├── __init__.py # Package initialization
│ ├── run_experiments.py # Master experiment runner with menu interface
│ ├── chunking_experiments.py # Document chunking strategy testing
│ ├── embedding_experiments.py # Embedding model comparison testing
│ ├── generation_experiments.py # Response generation parameter tuning
│ ├── retrieval_experiments.py # Document retrieval optimization
│ ├── system_optimization_experiments.py # End-to-end system optimization
│ └── README.md # Experiments package documentation
├── regression_testing/ # Production-ready regression testing
│ ├── __init__.py # Package initialization
│ ├── regression_testing.py # Comprehensive regression testing framework
│ ├── demo_regression_testing.py # Interactive regression testing demo
│ ├── config.json # Configurable testing thresholds and settings
│ ├── results/ # Auto-generated test results (created at runtime)
│ └── README.md # Regression testing package documentation
├── docs/ # All project documentation
│ ├── EXPERIMENTS_README.md # Detailed experiment documentation
│ ├── REGRESSION_TESTING_README.md # Regression testing guide
│ ├── PROJECT_PLAN.md # Detailed implementation plan
│ ├── IMPLEMENTATION_CHECKLIST.md # Step-by-step checklist
│ └── STUDENT_GUIDE.md # Complete student tutorial
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
├── test_environment.py # Environment validation script
├── run.py # Application entry point
├── run_demo_with_venv.py # Virtual environment wrapper for demos
├── launch.py # Interactive launcher menu
└── README.md # This file (main project overview)
- Cohere Integration: Uses Cohere's embeddings and Command model
- ChromaDB Vector Store: Local, persistent document storage
- LangChain Framework: Professional RAG implementation
- Source Attribution: Shows retrieved documents and similarity scores
- Real-time Performance Metrics: Response times and statistics
- Modern UI: Clean, responsive design with animations
- Real-time Chat: WebSocket-like experience with fetch API
- Message History: Persistent chat sessions
- Loading States: Typing indicators and progress feedback
- Statistics Dashboard: System health and performance metrics
- Mobile Responsive: Works on all device sizes
- Import Issues Fixed: All test files now properly handle Python imports
- Virtual Environment Integration: run_demo_with_venv.py script for seamless execution
- Unit Tests: Individual component testing (tests/test_rag_pipeline.py)
- Integration Tests: End-to-end API testing
- Regression Testing: Gold standard answer comparison with pass/fail thresholds (regression_testing.py)
- Quality Metrics: Response relevance and accuracy scoring
- Performance Tests: Load testing and concurrent request handling
- Robustness Tests: Edge cases and adversarial inputs
- Evaluation Framework: Automated quality assessment tools (tests/evaluation_framework.py)
- Optimization Experiments: 5 systematic testing approaches for parameter tuning:
  - Chunking Strategy Testing (chunking_experiments.py)
  - Embedding Model Comparison (embedding_experiments.py)
  - Generation Parameter Tuning (generation_experiments.py)
  - Retrieval Strategy Optimization (retrieval_experiments.py)
  - End-to-End System Optimization (system_optimization_experiments.py)
- Interactive Testing Framework: Menu-driven experiment runner (run_experiments.py)
- Semantic Similarity Analysis: Meaning-based response comparison
- Quality Gates: Automated deployment readiness assessment
- Evaluation Framework: Automated quality assessment tools
- GenAI Testing Guide: Comprehensive testing strategies
- FAQ: Common questions about GenAI testing
- Best Practices: Production deployment guidelines
- Evaluation Metrics: Detailed metric explanations
- Real-world Examples: Practical testing scenarios
This tutorial provides multiple testing methodologies to comprehensively evaluate GenAI systems:
Purpose: Systematic parameter optimization and component analysis
Approach: Interactive menu-driven experiments
Files: *_experiments.py files for each component
Learn: How different parameters affect system performance
Purpose: Production-ready quality assurance with gold standards
Approach: Compare responses to expert-curated correct answers
Metrics: Semantic similarity, keyword matching, quality gates
Learn: Automated pass/fail criteria and deployment readiness
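As a concrete illustration, a gold-standard regression case for this kind of framework might look like the dictionary below. The field names and thresholds are illustrative only; see regression_testing/config.json and the regression_testing package documentation for the actual settings.

```python
# Illustrative gold-standard case -- not the actual schema used by
# regression_testing/config.json.
gold_case = {
    "id": "hallucination-definition",
    "query": "What is hallucination in GenAI?",
    "gold_answer": "Hallucination is when the model generates content not supported by its sources.",
    "required_keywords": ["hallucination", "sources"],
    "min_semantic_similarity": 0.75,  # pass/fail threshold
    "critical": True,                 # failures here block deployment
}
```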
Purpose: Component-level validation and API testing
Approach: Traditional pytest-based testing
Coverage: Pipeline initialization, API endpoints, core functionality
Learn: Standard software testing practices for AI systems
Purpose: Advanced quality assessment and metrics collection
Approach: Multi-dimensional response evaluation
Features: Custom scoring, consistency analysis, performance profiling
Learn: How to measure AI system quality beyond simple accuracy
Purpose: Hands-on learning and concept demonstration
Approach: Guided tutorials with real-time feedback
Features: Live testing, framework validation, educational explanations
Learn: Testing concepts through practical application
Main chat interface
Process chat messages
{
"message": "What are the key challenges in testing GenAI?"
}
Response:
{
"response": "The key challenges in testing GenAI applications include...",
"sources": [
{
"content": "GenAI testing requires...",
"metadata": {"source": "genai_testing_guide.md"},
"similarity": 0.87
}
],
"response_time": 1.234,
"retrieval_time": 0.456,
"generation_time": 0.778,
"status": "success"
}
System health check
{
"status": "healthy",
"rag_pipeline": "initialized",
"cohere_api_key": "configured"
}
Performance statistics
{
"queries_processed": 42,
"average_response_time": 1.23,
"documents_loaded": 156,
"error_rate": 0.02
}
# Required
COHERE_API_KEY=your_cohere_api_key_here
# Optional
FLASK_ENV=development
FLASK_DEBUG=True
FLASK_PORT=5000
CHUNK_SIZE=1000
CHUNK_OVERLAP=100
MAX_RETRIEVAL_DOCS=5
SIMILARITY_THRESHOLD=0.7
LOG_LEVEL=INFO

| Parameter | Description | Default | Impact |
|---|---|---|---|
| CHUNK_SIZE | Document chunk size for embeddings | 1000 | Affects retrieval granularity |
| CHUNK_OVERLAP | Overlap between chunks | 100 | Prevents information loss |
| MAX_RETRIEVAL_DOCS | Number of documents retrieved | 5 | Balances context vs. performance |
| SIMILARITY_THRESHOLD | Minimum similarity for retrieval | 0.7 | Filters irrelevant documents |
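To make the link between .env and the pipeline concrete, here is a small, illustrative sketch of how these settings could be loaded at startup. It assumes the python-dotenv package; the actual loading code lives in app/ and may differ.

```python
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the project root

COHERE_API_KEY = os.getenv("COHERE_API_KEY")  # required
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "100"))
MAX_RETRIEVAL_DOCS = int(os.getenv("MAX_RETRIEVAL_DOCS", "5"))
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.7"))

if not COHERE_API_KEY:
    raise RuntimeError("COHERE_API_KEY not found - copy .env.example to .env and set it")
```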
This application includes several intentional issues that represent common problems in production GenAI systems:
- Suboptimal Chunk Size: Chunks may be too large for effective retrieval
- Low Similarity Thresholds: May retrieve irrelevant documents
- Outdated Embedding Model: Using older Cohere model instead of latest
- Prompt Engineering: Prompt may not effectively prevent hallucination
- Temperature Settings: May be too high, causing inconsistency
- Max Token Limits: May cut off responses prematurely
- Inefficient Processing: Some operations may be unnecessarily slow
- Memory Usage: Vector database operations may not be optimized
- Concurrent Handling: May not scale well under load
- Response Consistency: Similar queries may get very different responses (see the consistency-check sketch after this list)
- Source Attribution: Attribution accuracy may be questionable
- Edge Case Handling: May not gracefully handle unusual inputs
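One way to probe the consistency issue above is to send the same question to the running app several times and compare the answers. The following is a minimal, illustrative sketch rather than part of the project's test suite: it assumes the `requests` package, the default port, and the /api/chat request/response shape shown in the API examples, and the word-overlap metric is only a crude stand-in for the semantic similarity used by the real evaluation framework.

```python
import requests

QUESTION = "What is hallucination in GenAI?"
URL = "http://localhost:5000/api/chat"

# Ask the same question three times
answers = []
for _ in range(3):
    r = requests.post(URL, json={"message": QUESTION}, timeout=60)
    r.raise_for_status()
    answers.append(r.json()["response"])

# Crude lexical overlap as a stand-in for semantic similarity
def overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

scores = [overlap(answers[i], answers[j])
          for i in range(len(answers)) for j in range(i + 1, len(answers))]
print(f"Mean pairwise word overlap: {sum(scores) / len(scores):.2f}")
```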
- Response Relevance: Semantic similarity to expected answers
- Faithfulness: Grounding in retrieved documents
- Consistency: Similar responses to similar queries
- Completeness: Adequate depth and coverage
- Behavioral Analysis: Compare responses across similar queries
- Code Review: Examine parameters and configuration
- Performance Profiling: Measure component timing
- Edge Case Testing: Unusual inputs and adversarial queries
- Automated Test Suite: tests/test_rag_pipeline.py
- Evaluation Framework: tests/evaluation_framework.py
- Performance Benchmarks: Built-in timing and statistics
- Quality Scorers: Response quality assessment functions (a minimal example follows this list)
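As an illustration of what such a quality scorer can look like, here is a minimal, hypothetical sketch that scores a response by embedding-space similarity to a gold-standard answer. It is not the implementation in tests/evaluation_framework.py; the Cohere client, the embed-english-v3.0 model, and the 0.7 threshold are assumptions made for the example.

```python
import cohere
import numpy as np

co = cohere.Client("YOUR_COHERE_API_KEY")

def semantic_similarity(response: str, gold_answer: str) -> float:
    # Embed both texts and return their cosine similarity
    embs = co.embed(
        texts=[response, gold_answer],
        model="embed-english-v3.0",
        input_type="clustering",
    ).embeddings
    a, b = np.asarray(embs[0]), np.asarray(embs[1])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy usage: pass if the generated answer stays close to the gold standard
score = semantic_similarity(
    "RAG grounds answers in retrieved source documents.",
    "Retrieval-augmented generation keeps outputs grounded in source documents.",
)
print("PASS" if score >= 0.7 else "FAIL", round(score, 3))
```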
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is hallucination in GenAI?"}'# Install pytest if not included
pip install pytest
# Run all tests (with proper imports fixed)
python tests/test_rag_pipeline.py
python tests/test_regression_framework.py
# Or using pytest
python -m pytest tests/test_rag_pipeline.py -v
python -m pytest tests/test_regression_framework.py -v
# Run evaluation framework
python tests/evaluation_framework.py

# Interactive experiment menu
python -m experiments.run_experiments
# Individual experiments
python -m experiments.chunking_experiments
python -m experiments.embedding_experiments
python -m experiments.generation_experiments
python -m experiments.retrieval_experiments
python -m experiments.system_optimization_experiments

# Full regression test suite
python regression_testing/regression_testing.py
# Interactive demo (recommended - uses correct virtual environment)
python run_demo_with_venv.py
# Then select option 4 for framework validation
# Or run demo directly
python regression_testing/demo_regression_testing.py
# Quick regression test
python regression_testing/regression_testing.py --quick

from tests.evaluation_framework import EvaluationFramework
from app.rag_pipeline import RAGPipeline
pipeline = RAGPipeline()
evaluator = EvaluationFramework(pipeline)
# Run comprehensive evaluation
results = evaluator.evaluate_response_quality([
{"query": "What is GenAI testing?", "expected_topics": ["testing", "genai"]}
])
print(f"Average Quality Score: {results['average_quality_score']}")from regression_testing.regression_testing import RegressionTestFramework
# Create framework
framework = RegressionTestFramework()
# Run tests programmatically
results = framework.run_regression_tests(save_results=True)
# Check quality gate
gate_passed = (
results['summary']['pass_rate'] >= 0.8 and
results['summary']['critical_failures'] == 0
)
print(f"Quality Gate: {'PASSED' if gate_passed else 'FAILED'}")- Explore the chat interface and basic functionality
- Run simple API tests with curl
- Use the provided test suite to understand testing concepts
- Run experiments to see optimization effects
- Analyze the evaluation framework results
- Run regression tests and understand quality metrics
- Write custom test cases for specific scenarios
- Investigate performance bottlenecks
- Examine retrieval quality and source attribution
- Discover and document intentional issues using experiments
- Propose and implement fixes in the actual system
- Create custom regression test cases with gold standards
- Develop custom evaluation metrics and quality gates
- Design production monitoring strategies
"ModuleNotFoundError: No module named 'app' or 'regression_testing'"
- Import issues have been FIXED in the test files
- Tests now properly add the project root to Python path
- Use python tests/test_rag_pipeline.py or python tests/test_regression_framework.py
"KeyError: 'failed_tests' or 'avg_response_length'"
- These issues have been FIXED in the test framework
- Test data now includes all required keys for proper execution
"Import cohere could not be resolved"
- Ensure the virtual environment is activated: training-env\Scripts\activate
- Run pip install -r requirements.txt
- Use run_demo_with_venv.py for automatic virtual environment handling
"tf-keras compatibility issues with Keras 3"
- tf-keras has been added to requirements.txt
- Virtual environment should install tf-keras>=2.15.0 automatically
- This resolves Keras 3 compatibility issues in the regression testing framework
"COHERE_API_KEY not found"
- Copy .env.example to .env
- Add your Cohere API key to the .env file
"ChromaDB initialization failed"
- Ensure you have write permissions in the project directory
- Delete the data/chroma_db/ folder and restart if corrupted
"Flask app won't start"
- Check that port 5000 is available
- Set FLASK_PORT=5001 in .env to use a different port
Slow first response
- First query initializes the vector database (expected delay)
- Subsequent queries should be faster
High memory usage
- ChromaDB loads embeddings into memory
- Reduce document collection size if needed
Poor response quality
- This may be intentional! Part of the learning exercise
- Check if you're discovering the planted issues correctly
Off-topic responses
- Test the system's domain boundaries
- Document cases where it should vs. shouldn't know answers
This application provides hands-on experience with:
- Non-deterministic Testing: Dealing with probabilistic outputs
- Quality vs. Performance Trade-offs: Balancing response quality and speed
- Evaluation Metrics: Understanding different ways to measure success
- Production Readiness: What it takes to deploy GenAI systems
- Hallucination Detection: Identifying when AI generates false information
- Bias Testing: Checking for unfair or inappropriate responses
- Edge Case Handling: System behavior with unusual inputs
- Adversarial Robustness: Resistance to malicious inputs
- Latency Optimization: Making responses faster
- Scalability Testing: Handling multiple concurrent users
- Resource Management: Efficient use of CPU, memory, and API calls
- Monitoring and Alerting: Detecting issues in production
- Multi-dimensional Evaluation: Beyond simple accuracy metrics
- Consistency Testing: Ensuring reliable behavior
- Regression Detection: Catching quality degradation
- User Experience Focus: Testing from the user's perspective
This is an educational project. If you find additional issues or have suggestions for improvements:
- Document your findings clearly
- Propose educational value of the change
- Consider impact on learning objectives
- Share with the instructor or class
This project is created for educational purposes. Use freely for learning and teaching GenAI testing concepts.
Happy Testing! 🧪🤖
Remember: The goal isn't just to build GenAI applications, but to build ones that are reliable, safe, and provide genuine value to users. Testing is how we ensure that promise is kept.