A lightweight, real-time web-based monitoring interface for SLURM job queues on HPC clusters. Features automatic array job aggregation, visual progress tracking, and live updates.
- Real-time Monitoring: Auto-refreshes every 30 seconds with 5-second server-side caching
- Array Job Aggregation: Automatically groups array tasks with visual statistics
- Progress Tracking: Visual progress bars based on runtime vs. time limits
- Advanced Filtering: Search by job ID, user, or name; filter by status and user
- Responsive Design: Works on desktop and mobile devices
- Lightweight: Single Python service with no external data stores (no SQLite, Redis, etc.); only Flask is required
- Secure: Runs on login nodes with systemd integration
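The progress figure is derived from a job's elapsed runtime relative to its time limit. A minimal sketch of that calculation (the helper names `parse_hms` and `progress_pct` are illustrative, not taken from the service):

```python
def parse_hms(t: str) -> int:
    """Convert a SLURM time string (H:M:S, optionally D-H:M:S) to seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:          # pad M:S or S forms up to H:M:S
        parts.insert(0, 0)
    h, m, s = parts
    return days * 86400 + h * 3600 + m * 60 + s

def progress_pct(elapsed: str, limit: str) -> int:
    """Percentage of the time limit consumed, clamped to 0-100."""
    total = parse_hms(limit)
    if total == 0:
        return 0
    return min(100, parse_hms(elapsed) * 100 // total)

print(progress_pct("02:45:30", "08:00:00"))  # → 34
```

This matches the `"progress": 34` value in the sample API response below: 2h45m30s is about 34% of an 8-hour limit.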
```bash
# Clone the repository
git clone <repository-url>
cd squeue_web

# Install Python dependencies
pip3 install flask --user

# Run in debug mode (port 8080)
python3 run_debug.py

# Access at http://localhost:8080
```
```bash
# Run as root for port 80 deployment
sudo ./install.sh

# The service will be available at http://<hostname>/
```
```
┌──────────────────────┐
│      Login Node      │
│ ┌──────────────────┐ │
│ │  SLURM Monitor   │ │
│ │      Daemon      │ │
│ │ ┌──────────────┐ │ │
│ │ │  Web Server  ├─┼─┼──► Port 80/8080
│ │ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │
│ │ │    SLURM     │ │ │
│ │ │   Commands   │ │ │
│ │ └──────────────┘ │ │
│ └──────────────────┘ │
└──────────────────────┘
```
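There is no agent on the compute nodes: the daemon simply shells out to `squeue` on the login node and parses the pipe-delimited output. A rough sketch of that path, assuming the five-field format used by the test command later in this README (function names are illustrative, not from `slurm-monitor.py`):

```python
import subprocess

SQUEUE_FMT = "%i|%P|%j|%u|%T"  # job id, partition, name, user, state
FIELDS = ("id", "partition", "name", "user", "status")

def parse_squeue_output(text: str) -> list[dict]:
    """Turn pipe-delimited squeue lines into a list of job dicts."""
    jobs = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) == len(FIELDS):
            jobs.append(dict(zip(FIELDS, parts)))
    return jobs

def fetch_jobs() -> list[dict]:
    """Run squeue; return [] if it is unavailable or fails."""
    try:
        out = subprocess.run(
            ["squeue", "--noheader", "--format=" + SQUEUE_FMT],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # Debug mode falls back to mock data in this case
        return []
    return parse_squeue_output(out.stdout)
```

The empty-list fallback mirrors the debug-mode behavior described below ("Falls back to mock data if SLURM unavailable").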
| Endpoint | Method | Description | Response |
|---|---|---|---|
| `/` | GET | Web interface | HTML |
| `/api/jobs` | GET | Job list with array aggregation | JSON array |
| `/api/stats` | GET | Cluster statistics | JSON object |
| `/api/users` | GET | Unique user list | JSON array |
```json
[
  {
    "id": "1234567",
    "name": "protein_fold",
    "user": "jsmith",
    "status": "RUNNING",
    "partition": "gpu",
    "nodes": "node001",
    "cpus": "32",
    "time": "02:45:30",
    "timeLimit": "08:00:00",
    "progress": 34
  },
  {
    "id": "1234568",
    "baseId": "1234568",
    "isArray": true,
    "arraySize": 100,
    "arrayStats": {
      "completed": 45,
      "running": 10,
      "pending": 40,
      "failed": 5
    },
    "arrayTasks": [...]
  }
]
```
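`squeue` reports array tasks with IDs of the form `<baseId>_<index>`, so aggregation amounts to grouping on the base ID and counting states. A sketch of the idea, producing the shape shown above (the actual service logic may differ, and real `squeue` output also compresses pending tasks into ranges like `1234568_[41-100]`, which this sketch ignores):

```python
from collections import Counter, defaultdict

def aggregate_arrays(jobs: list[dict]) -> list[dict]:
    """Group jobs whose IDs look like '<base>_<index>' into one array
    entry with per-state counts; pass plain jobs through unchanged."""
    arrays = defaultdict(list)
    result = []
    for job in jobs:
        base, sep, idx = job["id"].partition("_")
        if sep and idx.isdigit():
            arrays[base].append(job)
        else:
            result.append(job)
    for base, tasks in arrays.items():
        states = Counter(t["status"] for t in tasks)
        result.append({
            "id": base,
            "baseId": base,
            "isArray": True,
            "arraySize": len(tasks),
            "arrayStats": {
                "completed": states.get("COMPLETED", 0),
                "running": states.get("RUNNING", 0),
                "pending": states.get("PENDING", 0),
                "failed": states.get("FAILED", 0),
            },
            "arrayTasks": tasks,
        })
    return result
```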
```json
{
  "total_jobs": 660,
  "running": 500,
  "pending": 160,
  "arrays": 2
}
```
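The stats can be derived entirely from the aggregated job list. A sketch matching the field names in the sample above (the exact counting rules, e.g. how array tasks contribute to the totals, are an assumption):

```python
def cluster_stats(jobs: list[dict]) -> dict:
    """Summarise an aggregated job list into the /api/stats shape."""
    arrays = [j for j in jobs if j.get("isArray")]
    plain = [j for j in jobs if not j.get("isArray")]
    running = sum(1 for j in plain if j.get("status") == "RUNNING")
    pending = sum(1 for j in plain if j.get("status") == "PENDING")
    for a in arrays:
        # Count individual array tasks, not the aggregated entry
        running += a["arrayStats"]["running"]
        pending += a["arrayStats"]["pending"]
    total = len(plain) + sum(a["arraySize"] for a in arrays)
    return {
        "total_jobs": total,
        "running": running,
        "pending": pending,
        "arrays": len(arrays),
    }
```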
```
squeue_web/
├── README.md              # This file
├── index.html             # Web interface (production)
├── mock.html              # Original mockup with sample data
├── slurm-monitor.py       # Main Python service
├── run_debug.py           # Debug mode runner (port 8080)
├── test_local.py          # Test script with mock SLURM commands
├── install.sh             # Installation script
├── slurm-monitor.service  # Systemd service file
├── requirements.txt       # Python dependencies
├── Architecture.md        # System architecture details
└── OBJECTIVE.md           # Original requirements
```
- Production: Port 80 (requires root or CAP_NET_BIND_SERVICE)
- Development: Port 8080 (no special privileges required)
The service uses a 5-second TTL cache to reduce SLURM command frequency:
```python
cache = SimpleCache(ttl=5)  # 5-second cache
```
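A TTL cache like this is only a few lines; a sketch of the idea, assuming the real class in `slurm-monitor.py` may differ in detail:

```python
import time

class SimpleCache:
    """Tiny TTL cache: recompute the value only after `ttl` seconds."""

    def __init__(self, ttl: float = 5):
        self.ttl = ttl
        self._value = None
        self._stamp = None  # monotonic time of the last refresh

    def get(self, compute):
        """Return the cached value, calling `compute()` on miss or expiry."""
        now = time.monotonic()
        if self._stamp is None or now - self._stamp >= self.ttl:
            self._value = compute()
            self._stamp = now
        return self._value
```

With this shape, every request handler calls `cache.get(fetch_jobs)`, so at most one `squeue` invocation happens per 5-second window no matter how many browsers are polling.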
The web interface refreshes every 30 seconds:
```javascript
setInterval(refreshData, 30000); // 30-second refresh
```
```bash
# Run with mock SLURM data
python3 test_local.py

# Run in debug mode with real SLURM
python3 run_debug.py
```
- Verbose logging to console
- Hot reload on code changes
- Falls back to mock data if SLURM unavailable
- Detailed API request/response logging
- Edit files as needed
- Test in debug mode: `python3 run_debug.py`
- The server auto-reloads on changes
- Check logs for debugging information
- Python 3.6+
- SLURM commands available (`squeue`)
- Root access (for port 80) or appropriate capabilities
- Run the installer as root: `sudo ./install.sh`
- Verify the service is running: `systemctl status slurm-monitor`
- View logs if needed: `journalctl -u slurm-monitor -f`
```bash
# Start/stop/restart the service
sudo systemctl start slurm-monitor
sudo systemctl stop slurm-monitor
sudo systemctl restart slurm-monitor

# Enable/disable autostart
sudo systemctl enable slurm-monitor
sudo systemctl disable slurm-monitor

# View service logs
sudo journalctl -u slurm-monitor -n 50
sudo tail -f /var/log/slurm-monitor/daemon.log
```
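The commands above assume a unit roughly like the following. This is an illustrative sketch only, not the shipped `slurm-monitor.service`; the install path and hardening options are assumptions:

```ini
[Unit]
Description=SLURM queue web monitor
After=network.target

[Service]
# Path is illustrative; install.sh may place the script elsewhere
ExecStart=/usr/bin/python3 /opt/squeue_web/slurm-monitor.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```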
- Port 80 already in use:
  - Check what's using it: `sudo lsof -i :80`
  - Either stop the conflicting service or use port 8080
- SLURM commands not found:
  - Ensure you're on a node with SLURM client tools
  - Check that PATH includes the SLURM binaries
- Permission denied on port 80:
  - Run with sudo: `sudo python3 slurm-monitor.py`
  - Or use debug mode: `python3 run_debug.py` (port 8080)
- No jobs showing:
  - Check SLURM is accessible: `squeue`
  - Check the API endpoint: `curl http://localhost:8080/api/jobs`
  - Review logs for errors
```bash
# Test SLURM connectivity
squeue --format="%i|%P|%j|%u|%T" --noheader

# Test API endpoints
curl http://localhost:8080/api/jobs
curl http://localhost:8080/api/stats
curl http://localhost:8080/api/users

# Check service logs
sudo journalctl -u slurm-monitor -n 100

# Monitor in real-time
sudo journalctl -u slurm-monitor -f
```
- Response Time: < 500ms (cached), < 2s (uncached)
- Memory Usage: < 100MB typical
- CPU Usage: < 5% average
- Scalability: Tested with 10,000+ jobs
- Runs on secure login nodes only
- No authentication (relies on network security)
- Read-only access to SLURM queue
- No job modification capabilities
- Recommended: Deploy behind institutional firewall
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly in debug mode
- Submit a pull request
MIT License - See LICENSE file for details
Javier Castellanos ([email protected])
- Built for HPC clusters running SLURM
- Inspired by the need for better queue visualization
- Designed for simplicity and reliability