Skip to content

javierbq/slurm-queue-monitor

Repository files navigation

SLURM Queue Monitor

A lightweight, real-time web-based monitoring interface for SLURM job queues on HPC clusters. Features automatic array job aggregation, visual progress tracking, and live updates.

SLURM Queue Monitor Python Flask License

Features

  • πŸš€ Real-time Monitoring: Auto-refreshes every 30 seconds with 5-second server-side caching
  • πŸ“Š Array Job Aggregation: Automatically groups array tasks with visual statistics
  • πŸ“ˆ Progress Tracking: Visual progress bars based on runtime vs time limits
  • πŸ” Advanced Filtering: Search by job ID, user, or name; filter by status and user
  • πŸ“± Responsive Design: Works on desktop and mobile devices
  • ⚑ Lightweight: Single Python service with no external dependencies (SQLite, Redis, etc.)
  • πŸ”’ Secure: Runs on login nodes with systemd integration

Quick Start

Development Mode (No Root Required)

# Clone the repository
git clone <repository-url>
cd squeue_web

# Install Python dependencies
pip3 install flask --user

# Run in debug mode (port 8080)
python3 run_debug.py

# Access at http://localhost:8080

Production Installation (Root Required)

# Run as root for port 80 deployment
sudo ./install.sh

# The service will be available at http://<hostname>/

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Login Node         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  SLURM Monitor β”‚  β”‚
β”‚  β”‚    Daemon      β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚
β”‚  β”‚  β”‚ Web Serverβ”‚ ──┼──► Port 80/8080
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚
β”‚  β”‚  β”‚  SLURM   β”‚  β”‚  β”‚
β”‚  β”‚  β”‚ Commands β”‚  β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

API Endpoints

Endpoint Method Description Response
/ GET Web interface HTML
/api/jobs GET Job list with array aggregation JSON array
/api/stats GET Cluster statistics JSON object
/api/users GET Unique user list JSON array

Example API Responses

/api/jobs

[
  {
    "id": "1234567",
    "name": "protein_fold",
    "user": "jsmith",
    "status": "RUNNING",
    "partition": "gpu",
    "nodes": "node001",
    "cpus": "32",
    "time": "02:45:30",
    "timeLimit": "08:00:00",
    "progress": 34
  },
  {
    "id": "1234568",
    "baseId": "1234568",
    "isArray": true,
    "arraySize": 100,
    "arrayStats": {
      "completed": 45,
      "running": 10,
      "pending": 40,
      "failed": 5
    },
    "arrayTasks": [...]
  }
]

/api/stats

{
  "total_jobs": 660,
  "running": 500,
  "pending": 160,
  "arrays": 2
}

File Structure

squeue_web/
β”œβ”€β”€ README.md              # This file
β”œβ”€β”€ index.html            # Web interface (production)
β”œβ”€β”€ mock.html             # Original mockup with sample data
β”œβ”€β”€ slurm-monitor.py      # Main Python service
β”œβ”€β”€ run_debug.py          # Debug mode runner (port 8080)
β”œβ”€β”€ test_local.py         # Test script with mock SLURM commands
β”œβ”€β”€ install.sh            # Installation script
β”œβ”€β”€ slurm-monitor.service # Systemd service file
β”œβ”€β”€ requirements.txt      # Python dependencies
β”œβ”€β”€ Architecture.md       # System architecture details
└── OBJECTIVE.md          # Original requirements

Configuration

Port Configuration

  • Production: Port 80 (requires root or CAP_NET_BIND_SERVICE)
  • Development: Port 8080 (no special privileges required)

Cache Settings

The service uses a 5-second TTL cache to reduce SLURM command frequency:

cache = SimpleCache(ttl=5)  # 5-second cache

Auto-refresh

The web interface refreshes every 30 seconds:

setInterval(refreshData, 30000);  // 30-second refresh

Development

Running Tests

# Run with mock SLURM data
python3 test_local.py

# Run in debug mode with real SLURM
python3 run_debug.py

Debug Mode Features

  • Verbose logging to console
  • Hot reload on code changes
  • Falls back to mock data if SLURM unavailable
  • Detailed API request/response logging

Making Changes

  1. Edit files as needed
  2. Test in debug mode: python3 run_debug.py
  3. The server auto-reloads on changes
  4. Check logs for debugging information

Production Deployment

Prerequisites

  • Python 3.6+
  • SLURM commands available (squeue)
  • Root access (for port 80) or appropriate capabilities

Installation Steps

  1. Run the installer as root:

    sudo ./install.sh
  2. Verify the service is running:

    systemctl status slurm-monitor
  3. View logs if needed:

    journalctl -u slurm-monitor -f

Service Management

# Start/stop/restart the service
sudo systemctl start slurm-monitor
sudo systemctl stop slurm-monitor
sudo systemctl restart slurm-monitor

# Enable/disable autostart
sudo systemctl enable slurm-monitor
sudo systemctl disable slurm-monitor

# View service logs
sudo journalctl -u slurm-monitor -n 50
sudo tail -f /var/log/slurm-monitor/daemon.log

Troubleshooting

Common Issues

  1. Port 80 already in use:

    • Check what's using it: sudo lsof -i :80
    • Either stop the conflicting service or use port 8080
  2. SLURM commands not found:

    • Ensure you're on a node with SLURM client tools
    • Check PATH includes SLURM binaries
  3. Permission denied on port 80:

    • Run with sudo: sudo python3 slurm-monitor.py
    • Or use debug mode: python3 run_debug.py (port 8080)
  4. No jobs showing:

    • Check SLURM is accessible: squeue
    • Check API endpoint: curl http://localhost:8080/api/jobs
    • Review logs for errors

Debug Commands

# Test SLURM connectivity
squeue --format="%i|%P|%j|%u|%T" --noheader

# Test API endpoints
curl http://localhost:8080/api/jobs
curl http://localhost:8080/api/stats
curl http://localhost:8080/api/users

# Check service logs
sudo journalctl -u slurm-monitor -n 100

# Monitor in real-time
sudo journalctl -u slurm-monitor -f

Performance

  • Response Time: < 500ms (cached), < 2s (uncached)
  • Memory Usage: < 100MB typical
  • CPU Usage: < 5% average
  • Scalability: Tested with 10,000+ jobs

Security Considerations

  • Runs on secure login nodes only
  • No authentication (relies on network security)
  • Read-only access to SLURM queue
  • No job modification capabilities
  • Recommended: Deploy behind institutional firewall

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly in debug mode
  5. Submit a pull request

License

MIT License - See LICENSE file for details

Author

Javier Castellanos ([email protected])

Acknowledgments

  • Built for HPC clusters running SLURM
  • Inspired by the need for better queue visualization
  • Designed for simplicity and reliability

About

Lightweight web-based SLURM queue monitor with array job aggregation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published