Designed and implemented a complete data pipeline for solar panel performance monitoring using real-time web scraping, data cleaning (bronze/silver/gold layers), and interactive dashboards.
🔧 Technologies:
Python, Selenium, BeautifulSoup, pandas, dbt, Airflow, Databricks, PostgreSQL, Plotly/Dash
✨ Key Features:
- 📅 Automated daily data extraction from pvoutput.org
- 🧹 Data transformation into bronze → silver → gold layers using dbt
- ⏱️ Scheduled workflows and job orchestration via Airflow
- 📊 Interactive dashboards to visualize:
  - System efficiency
  - Power generation trends
  - Anomalies and system health
- 🔁 Modular design with support for multiple solar systems (multi-SID)
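
As a rough illustration of the bronze → silver → gold layering listed above, here is a minimal plain-Python sketch. It is not the project's dbt models: the field names (`generated_kwh`, `system_size_kw`), the sample rows, and the helper functions `to_silver` and `to_gold` are all hypothetical and chosen only to show the idea. Bronze holds raw scraped strings, silver holds typed and de-duplicated records, and gold holds per-SID aggregates ready for a dashboard.

```python
from collections import defaultdict
from datetime import date

# Hypothetical bronze rows: raw strings as a scraper might emit them.
# Field names and values are illustrative, not taken from the project.
BRONZE = [
    {"sid": "1001", "date": "2024-06-01", "generated_kwh": "21.4", "system_size_kw": "5.0"},
    {"sid": "1001", "date": "2024-06-02", "generated_kwh": "", "system_size_kw": "5.0"},      # missing reading
    {"sid": "2002", "date": "2024-06-01", "generated_kwh": "30.0", "system_size_kw": "8.0"},
    {"sid": "2002", "date": "2024-06-01", "generated_kwh": "30.0", "system_size_kw": "8.0"},  # duplicate
]


def to_silver(bronze_rows):
    """Cast types, drop rows that fail to parse, and de-duplicate on (sid, date)."""
    seen = set()
    silver = []
    for row in bronze_rows:
        try:
            record = {
                "sid": int(row["sid"]),
                "date": date.fromisoformat(row["date"]),
                "generated_kwh": float(row["generated_kwh"]),
                "system_size_kw": float(row["system_size_kw"]),
            }
        except (KeyError, ValueError):
            continue  # unparsable or incomplete row stays in bronze only
        if record["system_size_kw"] <= 0:
            continue  # a non-positive system size cannot be normalised
        key = (record["sid"], record["date"])
        if key in seen:
            continue  # keep the first occurrence of each (sid, date)
        seen.add(key)
        silver.append(record)
    return silver


def to_gold(silver_rows):
    """Aggregate per SID: day count, total energy, mean yield in kWh per kW."""
    grouped = defaultdict(list)
    for record in silver_rows:
        grouped[record["sid"]].append(record)
    gold = {}
    for sid, rows in grouped.items():
        yields = [r["generated_kwh"] / r["system_size_kw"] for r in rows]
        gold[sid] = {
            "days": len(rows),
            "total_kwh": sum(r["generated_kwh"] for r in rows),
            "mean_kwh_per_kw": sum(yields) / len(yields),
        }
    return gold


if __name__ == "__main__":
    for sid, stats in to_gold(to_silver(BRONZE)).items():
        print(sid, stats)
```

Keying the de-duplication on `(sid, date)` is one possible choice that keeps the same logic usable across multiple systems; the real pipeline may use a different grain.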
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Leveraged PySpark in Databricks to aggregate multi-site solar energy data across 30,000+ records, enabling system-wide performance benchmarking, anomaly detection, and cross-SID comparisons. The aggregated data is stored as Delta tables for efficient consumption by the downstream dashboards.
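
The cross-SID comparison and anomaly detection described above could look roughly like the sketch below. It uses only the Python standard library as a stand-in for the PySpark job, so it shows the logic rather than the Databricks implementation. The sample values, the function names `rank_sites` and `flag_anomalies`, and the rule "flag a day whose yield falls below half of that system's median" are assumptions made for illustration.

```python
from statistics import median

# Hypothetical daily yield per system in kWh per kW, keyed by SID.
SAMPLE = {
    1001: [("2024-06-01", 4.2), ("2024-06-02", 4.0), ("2024-06-03", 1.1), ("2024-06-04", 4.4)],
    2002: [("2024-06-01", 3.6), ("2024-06-02", 3.8), ("2024-06-03", 3.7)],
}


def rank_sites(daily_yield):
    """Return SIDs ordered by median daily yield, best first."""
    return sorted(
        daily_yield,
        key=lambda sid: median(value for _, value in daily_yield[sid]),
        reverse=True,
    )


def flag_anomalies(daily_yield, threshold=0.5):
    """Flag days whose yield is below `threshold` times the system's own median.

    The 0.5 default is an illustrative assumption, not a tuned value.
    """
    flagged = {}
    for sid, series in daily_yield.items():
        baseline = median(value for _, value in series)
        flagged[sid] = [day for day, value in series if value < threshold * baseline]
    return flagged


if __name__ == "__main__":
    print(rank_sites(SAMPLE))
    print(flag_anomalies(SAMPLE))
```

Comparing each system against its own median, rather than against a global average, keeps systems of different size or orientation from being flagged simply for being different. Both functions expect every SID to have at least one reading, since `median` raises an error on empty input.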