@cshiels-ie
Contributor

  • Introduced a new metrics service for Prometheus integration, including counters and histograms for tracking tool executions, HTTP requests, and errors.
  • Updated the logger to record metrics for tool access, including execution duration and success/error status.
  • Added configuration options for enabling metrics in the YAML configuration file.
  • Enhanced the README with instructions on enabling and accessing Prometheus metrics.
  • Updated the main application to conditionally expose a metrics endpoint based on configuration.

AI Generated Details/Description:

Add Prometheus Metrics Support for Observability

Summary

This PR adds Prometheus metrics support to the AAP MCP Server so that the service can be monitored in production.

Changes

New Features

1. Metrics Service (src/metrics.ts)

  • Created a dedicated MetricsService class using prom-client
  • Collects default Node.js metrics (CPU, memory, GC, event loop lag)
  • Implements custom MCP-specific metrics with proper labels

2. Available Metrics

HTTP Metrics:

  • http_requests_total - Counter for all HTTP requests (labeled by method, route, status_code)
  • http_request_duration_seconds - Histogram of request duration

MCP Tool Metrics:

  • mcp_tool_executions_total - Counter for tool executions (labeled by tool_name, service, category, status)
  • mcp_tool_execution_duration_seconds - Histogram of tool execution duration (labeled by tool_name, service, category)
  • mcp_tool_errors_total - Counter for tool errors (labeled by tool_name, service, category, error_type)
  • mcp_active_tools - Gauge of currently active tools per service
  • mcp_active_sessions - Gauge of active MCP sessions

API Call Metrics:

  • mcp_api_calls_total - Counter for AAP API calls (labeled by service, endpoint, method, status_code)

System Metrics:

  • Standard Node.js process metrics (CPU, memory, GC, etc.)

3. Integration Points

  • Logger Integration: src/logger.ts now records metrics for every tool execution
  • HTTP Middleware: Tracks all incoming HTTP requests automatically (when metrics enabled)
  • Tool Execution: Captures timing, success/error status, and category information
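
The tool-execution integration can be pictured as a thin wrapper that times a call and reports the outcome. This is a minimal sketch, not the code from src/logger.ts: the `ToolMetricsRecorder` interface, the `withToolMetrics` name, and the injectable clock are all hypothetical and exist only for illustration.

```typescript
// Hypothetical label set, mirroring the labels listed for the tool metrics.
type ToolLabels = { tool_name: string; service: string; category: string };

// Hypothetical recorder interface; the real MetricsService API may differ.
interface ToolMetricsRecorder {
  recordToolExecution(
    labels: ToolLabels,
    status: "success" | "error",
    durationSeconds: number,
  ): void;
}

// Runs a tool, then reports its duration and outcome to the recorder.
// The clock is injectable so the wrapper can be exercised deterministically.
async function withToolMetrics<T>(
  recorder: ToolMetricsRecorder,
  labels: ToolLabels,
  run: () => Promise<T>,
  nowMs: () => number = Date.now,
): Promise<T> {
  const start = nowMs();
  let status: "success" | "error" = "success";
  try {
    return await run();
  } catch (err) {
    status = "error";
    throw err; // re-throw so callers still see the failure
  } finally {
    // Recorded on both paths, so failed calls are timed as well.
    recorder.recordToolExecution(labels, status, (nowMs() - start) / 1000);
  }
}
```

Recording in `finally` keeps the success and error paths symmetric. `Date.now` is used here for simplicity; a monotonic clock would likely be a better fit for measuring durations in a real service.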

4. Configuration

Metrics can be enabled/disabled via configuration:

# aap-mcp.yaml
enable_metrics: true

Or via environment variable:

export ENABLE_METRICS=true
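
The description does not say how the YAML setting and the environment variable interact. The sketch below shows one plausible resolution rule. Everything in it is an assumption: the `resolveMetricsEnabled` helper is hypothetical, the environment variable is assumed to take precedence, and only the string "true" is assumed to enable metrics. The default of `false` matches the statement that metrics are disabled by default.

```typescript
// Environment passed in explicitly so the helper stays a pure function.
type Env = Record<string, string | undefined>;

// Hypothetical helper: the name, precedence, and parsing rule are assumptions.
function resolveMetricsEnabled(
  yamlValue: boolean | undefined,
  env: Env,
): boolean {
  const raw = env["ENABLE_METRICS"];
  if (raw !== undefined && raw.trim() !== "") {
    // Assumed rule: only "true" (in any letter case) enables metrics.
    return raw.trim().toLowerCase() === "true";
  }
  // Fall back to the YAML value; metrics stay off when neither source is set.
  return yamlValue ?? false;
}
```

In a Node.js process this would be called as `resolveMetricsEnabled(config.enable_metrics, process.env)`, where `config` stands for the parsed aap-mcp.yaml contents.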

5. Endpoints

  • GET /metrics - Prometheus-formatted metrics endpoint (only available when metrics are enabled)
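
For readers unfamiliar with what "Prometheus-formatted" means, the sketch below renders a labeled counter in the Prometheus text exposition format. In the PR itself this output comes from prom-client; the `renderCounter` and `escapeLabelValue` functions here are hypothetical, dependency-free stand-ins meant only to show the shape of the response body. The help text is assumed to contain no backslashes or newlines.

```typescript
type Labels = Record<string, string>;

// Escapes a label value as the text format requires:
// backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Renders one counter family: HELP and TYPE lines, then one line per sample.
function renderCounter(
  name: string,
  help: string,
  samples: Array<{ labels: Labels; value: number }>,
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const pairs = Object.keys(sample.labels)
      .map((key) => `${key}="${escapeLabelValue(sample.labels[key])}"`)
      .join(",");
    lines.push(
      pairs ? `${name}{${pairs}} ${sample.value}` : `${name} ${sample.value}`,
    );
  }
  return lines.join("\n") + "\n";
}
```

Called with `mcp_tool_executions_total` and a single sample labeled `status="success"`, this produces a HELP line, a TYPE line, and a sample line such as `mcp_tool_executions_total{status="success"} 3`.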

Key Enhancements

  • Category Tracking: Added getCategoryForTool() helper function to automatically determine which category a tool belongs to (e.g., job_management, inventory_management, etc.)
  • Automatic Labeling: All tool metrics now include category labels for better filtering and dashboards
  • Minimal Overhead When Disabled: When metrics are disabled, the /metrics endpoint is not exposed and HTTP requests are not tracked, so overhead is minimal
  • Production-Ready: Uses industry-standard Prometheus client with proper histogram buckets
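
The description names getCategoryForTool() but does not show its signature or where the mapping comes from. The sketch below is one hypothetical way such a lookup could work. The category table, the tool names in it, the second parameter, and the "uncategorized" fallback are all assumptions made for illustration.

```typescript
// Hypothetical category table; the tool names are made up for this example.
const EXAMPLE_CATEGORIES: Record<string, string[]> = {
  job_management: ["jobs_list", "jobs_launch"],
  inventory_management: ["inventories_list", "hosts_list"],
};

// Returns the first category whose tool list contains the given tool name.
function getCategoryForTool(
  toolName: string,
  categories: Record<string, string[]> = EXAMPLE_CATEGORIES,
): string {
  for (const category of Object.keys(categories)) {
    if (categories[category].indexOf(toolName) !== -1) {
      return category;
    }
  }
  // Assumed fallback so that every metric still carries a category label.
  return "uncategorized";
}
```

Returning a fixed fallback rather than `undefined` keeps the label set of each metric consistent, which Prometheus client libraries generally expect.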

Usage

Enable Metrics

# aap-mcp.yaml
enable_metrics: true

Access Metrics

curl http://localhost:3000/metrics

Configure Prometheus

# prometheus.yml
scrape_configs:
  - job_name: 'aap-mcp-server'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'

Example Queries

# Request rate per second
rate(http_requests_total[5m])

# Tool execution success rate by category
sum by (category) (rate(mcp_tool_executions_total{status="success"}[5m]))
  / sum by (category) (rate(mcp_tool_executions_total[5m]))

# 95th percentile tool execution time
histogram_quantile(0.95, rate(mcp_tool_execution_duration_seconds_bucket[5m]))

# Errors per minute in the job_management category (one series per tool and error type)
rate(mcp_tool_errors_total{category="job_management"}[1m]) * 60

Benefits

  1. Observability: Full visibility into service health and performance
  2. Alerting: Set up alerts on error rates, latency, or resource usage
  3. Debugging: Identify slow tools, error patterns, and bottlenecks
  4. Capacity Planning: Track resource usage trends over time
  5. Category Insights: Monitor performance and errors by functional category

Breaking Changes

None. Metrics are opt-in and disabled by default.

Testing

  • Metrics endpoint responds with valid Prometheus format
  • Metrics update correctly after tool executions
  • HTTP request metrics track all endpoints
  • Category labels correctly identify tool categories
  • No impact when metrics are disabled

Documentation

Updated README.md with Prometheus metrics section including:

  • Configuration instructions
  • Available metrics list
  • Prometheus integration examples
  • Example queries

Dependencies

  • Added prom-client@^15.1.3 for Prometheus metrics collection

Future Enhancements

  • Grafana dashboard templates
  • Custom alerting rules
  • Metrics for session lifecycle
  • Per-user metrics (if needed)

@goneri
Contributor

goneri commented Oct 31, 2025

You also need to import the package-lock.json (this should address the CI errors 🤞 ).

@goneri goneri merged commit 764778e into ansible:main Nov 3, 2025
5 checks passed
@goneri
Contributor

goneri commented Nov 3, 2025

thank you @cshiels-ie
