Merged
83 changes: 68 additions & 15 deletions README.md

Large diffs are not rendered by default.

247 changes: 247 additions & 0 deletions docs/OBSERVABILITY.md
@@ -0,0 +1,247 @@
# Observability Toolset

This toolset provides tools for querying OpenShift cluster observability data including Prometheus metrics and Alertmanager alerts.

## Tools

### prometheus_query

Execute instant PromQL queries against the cluster's Thanos Querier.

**Parameters:**
- `query` (required) - PromQL query string
- `time` (optional) - Evaluation timestamp (RFC3339, Unix timestamp, or relative like `-5m`, `now`)

**Example:**
```
Query: up{job="apiserver"}
```
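As a rough illustration of what an instant query looks like on the wire, the sketch below builds a request URL for the Prometheus HTTP API `/api/v1/query` endpoint. The function name `instantQueryURL` and the host `thanos-querier.example.com` are hypothetical; how the toolset actually assembles its requests is not shown in this document.

```go
package main

import (
	"fmt"
	"net/url"
)

// instantQueryURL builds an instant-query URL for the Prometheus HTTP API.
// baseURL is a placeholder for the discovered Thanos Querier address;
// evalTime is optional and is omitted from the URL when empty.
func instantQueryURL(baseURL, query, evalTime string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = "/api/v1/query"

	params := url.Values{}
	params.Set("query", query)
	if evalTime != "" {
		params.Set("time", evalTime)
	}
	u.RawQuery = params.Encode() // percent-encodes braces, quotes and '='
	return u.String(), nil
}

func main() {
	s, err := instantQueryURL("https://thanos-querier.example.com", `up{job="apiserver"}`, "")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The PromQL string must be URL-encoded because it contains characters such as `{`, `"` and `=` that are not safe in a query string.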

### prometheus_query_range

Execute range PromQL queries for time-series data.

**Parameters:**
- `query` (required) - PromQL query string
- `start` (required) - Start time (RFC3339, Unix timestamp, or relative like `-1h`)
- `end` (required) - End time (RFC3339, Unix timestamp, or relative like `now`)
- `step` (optional) - Query resolution step (default: `1m`)

**Example:**
```
Query: rate(container_cpu_usage_seconds_total[5m])
Start: -1h
End: now
Step: 1m
```
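The choice of `step` determines how many samples each series returns. The sketch below works through that arithmetic for the example above (one hour at a one-minute step). The helper `rangePoints` is hypothetical, and the 11,000-point cap is the limit upstream Prometheus applies to range queries; whether this toolset enforces the same cap is an assumption.

```go
package main

import "fmt"

// maxPointsPerSeries mirrors the per-series cap in upstream Prometheus.
// Assumption: the toolset itself may or may not apply this limit.
const maxPointsPerSeries = 11000

// rangePoints returns how many evaluation timestamps a range query covers:
// start, start+step, ... up to and including end (all in Unix seconds).
func rangePoints(startUnix, endUnix, stepSeconds int64) (int64, error) {
	if stepSeconds <= 0 {
		return 0, fmt.Errorf("step must be positive, got %d", stepSeconds)
	}
	if endUnix < startUnix {
		return 0, fmt.Errorf("end %d is before start %d", endUnix, startUnix)
	}
	points := (endUnix-startUnix)/stepSeconds + 1
	if points > maxPointsPerSeries {
		return points, fmt.Errorf("%d points exceeds the %d-point cap; use a larger step", points, maxPointsPerSeries)
	}
	return points, nil
}

func main() {
	end := int64(1705312800) // 2024-01-15T10:00:00Z
	start := end - 3600      // one hour earlier, i.e. Start: -1h
	n, err := rangePoints(start, end, 60)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 61
}
```

A one-hour window at `1m` resolution therefore yields 61 samples per series, so a wider window usually calls for a proportionally larger step.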

### alertmanager_alerts

Query alerts from the cluster's Alertmanager.

**Parameters:**
- `active` (optional) - Include active alerts (default: true)
- `silenced` (optional) - Include silenced alerts (default: false)
- `inhibited` (optional) - Include inhibited alerts (default: false)
- `filter` (optional) - Label matcher in PromQL-style syntax (e.g., `alertname="Watchdog"`)

**Example:**
```
Active: true
Filter: severity="critical"
```
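The parameters above line up with the `active`, `silenced`, `inhibited` and `filter` query parameters of the Alertmanager v2 `/api/v2/alerts` endpoint. The sketch below shows one way such a request path could be assembled; the `AlertQuery` type and its methods are hypothetical, and the one-to-one mapping from tool parameters to query parameters is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// AlertQuery is a hypothetical holder for the tool's parameters.
type AlertQuery struct {
	Active    bool
	Silenced  bool
	Inhibited bool
	Filter    string // a single matcher such as severity="critical"
}

// defaultAlertQuery applies the documented defaults:
// active=true, silenced=false, inhibited=false.
func defaultAlertQuery() AlertQuery {
	return AlertQuery{Active: true}
}

// RequestPath renders the query as an Alertmanager v2 request path.
func (q AlertQuery) RequestPath() string {
	params := url.Values{}
	params.Set("active", strconv.FormatBool(q.Active))
	params.Set("silenced", strconv.FormatBool(q.Silenced))
	params.Set("inhibited", strconv.FormatBool(q.Inhibited))
	if q.Filter != "" {
		params.Add("filter", q.Filter)
	}
	return "/api/v2/alerts?" + params.Encode() // keys are emitted in sorted order
}

func main() {
	q := defaultAlertQuery()
	q.Filter = `severity="critical"`
	fmt.Println(q.RequestPath())
}
```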

## Enable the Observability Toolset

### Option 1: Command Line

```bash
kubernetes-mcp-server --toolsets core,config,helm,observability
```

### Option 2: Configuration File

```toml
toolsets = ["core", "config", "helm", "observability"]
```

### Option 3: MCP Client Configuration

```json
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server@latest", "--toolsets", "core,config,helm,observability"]
    }
  }
}
```

## Configuration

The observability toolset supports optional configuration via the config file:

```toml
[observability]
# Custom monitoring namespace (default: "openshift-monitoring")
monitoring_namespace = "custom-monitoring"
```

| Option | Default | Description |
|--------|---------|-------------|
| `monitoring_namespace` | `openshift-monitoring` | Namespace where Prometheus and Alertmanager routes are located |
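The default in the table can be pictured as a simple fallback. The sketch below is illustrative only: the `ObservabilityConfig` type, its `toml` tag and the `Namespace` method are hypothetical names, not the toolset's actual configuration code.

```go
package main

import "fmt"

// defaultMonitoringNamespace is the documented default.
const defaultMonitoringNamespace = "openshift-monitoring"

// ObservabilityConfig is a hypothetical mirror of the [observability] table.
type ObservabilityConfig struct {
	MonitoringNamespace string `toml:"monitoring_namespace"`
}

// Namespace returns the configured namespace, or the default when unset.
func (c ObservabilityConfig) Namespace() string {
	if c.MonitoringNamespace == "" {
		return defaultMonitoringNamespace
	}
	return c.MonitoringNamespace
}

func main() {
	fmt.Println(ObservabilityConfig{}.Namespace())
	fmt.Println(ObservabilityConfig{MonitoringNamespace: "custom-monitoring"}.Namespace())
}
```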

## Prerequisites

The observability tools require:

1. **OpenShift cluster** - These tools are designed for OpenShift and rely on OpenShift-specific routes
2. **Monitoring stack enabled** - The cluster must have the monitoring stack deployed (default in OpenShift)
3. **Proper RBAC** - The user or service account must have permission to:
   - Read routes in the monitoring namespace (`openshift-monitoring` by default)
   - Access the Thanos Querier and Alertmanager APIs

## How It Works

### Route Discovery

The tools automatically discover the Prometheus (Thanos Querier) and Alertmanager endpoints by reading OpenShift routes in the monitoring namespace (`openshift-monitoring` unless `monitoring_namespace` is configured):

- **Thanos Querier**: the `thanos-querier` route
- **Alertmanager**: the `alertmanager-main` route
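To make route discovery concrete, the sketch below derives a base URL from a Route object's `spec.host` and `spec.tls` fields, such as the JSON printed by `oc get route thanos-querier -n openshift-monitoring -o json`. The `route` struct and `endpointFromRoute` function are hypothetical, and the sample host is invented; the toolset's real discovery code is not shown in this document.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// route holds only the Route fields this sketch needs.
type route struct {
	Spec struct {
		Host string           `json:"host"`
		TLS  *json.RawMessage `json:"tls"` // nil when the route has no TLS section
	} `json:"spec"`
}

// endpointFromRoute turns a Route JSON document into a base URL.
func endpointFromRoute(routeJSON []byte) (string, error) {
	var r route
	if err := json.Unmarshal(routeJSON, &r); err != nil {
		return "", fmt.Errorf("decode route: %w", err)
	}
	if r.Spec.Host == "" {
		return "", errors.New("route has no spec.host")
	}
	scheme := "http"
	if r.Spec.TLS != nil {
		scheme = "https"
	}
	return scheme + "://" + r.Spec.Host, nil
}

func main() {
	// Invented sample data for illustration.
	sample := []byte(`{"spec":{"host":"thanos-querier.apps.example.com","tls":{"termination":"reencrypt"}}}`)
	endpoint, err := endpointFromRoute(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(endpoint)
}
```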

### Authentication

The tools use the bearer token from your Kubernetes configuration to authenticate with the monitoring endpoints. This is the same credential used to access the cluster.
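In HTTP terms, bearer-token authentication means sending an `Authorization: Bearer <token>` header with each request. The sketch below shows that step in isolation; `newMonitoringRequest`, the URL and the token value are placeholders, and reading the token from the kubeconfig is left out.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// newMonitoringRequest builds a GET request carrying the bearer token.
// The token is assumed to have been read from the Kubernetes configuration.
func newMonitoringRequest(endpoint, token string) (*http.Request, error) {
	if token == "" {
		return nil, errors.New("no bearer token available")
	}
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, err := newMonitoringRequest(
		"https://thanos-querier.example.com/api/v1/query?query=up",
		"example-token", // placeholder, not a real credential
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```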

### Relative Time Support

Time parameters support multiple formats:

| Format | Example | Description |
|--------|---------|-------------|
| RFC3339 | `2024-01-15T10:00:00Z` | Absolute timestamp |
| Unix | `1705312800` | Unix timestamp in seconds |
| Relative | `-10m`, `-1h`, `-1d` | Relative to current time |
| Keyword | `now` | Current time |
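The four formats in the table could be resolved roughly as follows. This is a sketch, not the toolset's parser: `parseTimeParam` and `parseOffset` are hypothetical names, and the `d` suffix is handled by hand because Go's `time.ParseDuration` does not accept days.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)

// parseOffset parses the part after the leading "-", e.g. "10m", "1h", "1d".
func parseOffset(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		days, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil {
			return 0, fmt.Errorf("invalid day offset %q", s)
		}
		return time.Duration(days) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s)
}

// parseTimeParam resolves a time parameter against the supplied "now".
func parseTimeParam(s string, now time.Time) (time.Time, error) {
	s = strings.TrimSpace(s)
	switch {
	case s == "now":
		return now, nil
	case strings.HasPrefix(s, "-"):
		d, err := parseOffset(s[1:])
		if err != nil || d < 0 {
			return time.Time{}, fmt.Errorf("invalid relative time %q", s)
		}
		return now.Add(-d), nil
	}
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		return t, nil
	}
	if secs, err := strconv.ParseInt(s, 10, 64); err == nil {
		return time.Unix(secs, 0).UTC(), nil
	}
	return time.Time{}, fmt.Errorf("unrecognized time %q", s)
}

func main() {
	for _, in := range []string{"now", "-10m", "-1d", "2024-01-15T10:00:00Z", "1705312800"} {
		t, err := parseTimeParam(in, exampleNow)
		if err != nil {
			panic(err)
		}
		fmt.Println(in, "=>", t.Unix())
	}
}
```

The table's two absolute examples, `2024-01-15T10:00:00Z` and `1705312800`, denote the same instant.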

## Security Considerations

### Allowed Prometheus Endpoints

Only read-only Prometheus API endpoints are allowed:
- `/api/v1/query` - Instant queries
- `/api/v1/query_range` - Range queries
- `/api/v1/series` - Series metadata
- `/api/v1/labels` - Label names
- `/api/v1/label/<name>/values` - Label values

Administrative endpoints (like `/api/v1/admin/*`) are blocked.
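An allowlist of this kind can be sketched as an exact-match set plus one pattern for label values. The names below (`allowedPrometheusPaths`, `isAllowedPrometheusPath`) are hypothetical; the toolset's actual check may be implemented differently. The path is normalized first so that `..` segments cannot be used to reach a blocked endpoint.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// allowedPrometheusPaths lists the read-only endpoints with fixed paths.
var allowedPrometheusPaths = map[string]bool{
	"/api/v1/query":       true,
	"/api/v1/query_range": true,
	"/api/v1/series":      true,
	"/api/v1/labels":      true,
}

// labelValuesPath matches /api/v1/label/<name>/values.
var labelValuesPath = regexp.MustCompile(`^/api/v1/label/[^/]+/values$`)

// isAllowedPrometheusPath reports whether p, after normalization,
// is one of the allowed read-only endpoints.
func isAllowedPrometheusPath(p string) bool {
	clean := path.Clean(p) // resolves "." and ".." and drops trailing slashes
	return allowedPrometheusPaths[clean] || labelValuesPath.MatchString(clean)
}

func main() {
	for _, p := range []string{
		"/api/v1/query",
		"/api/v1/label/job/values",
		"/api/v1/admin/tsdb/snapshot",
		"/api/v1/query/../admin/tsdb/snapshot",
	} {
		fmt.Println(p, isAllowedPrometheusPath(p))
	}
}
```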

### Allowed Alertmanager Endpoints

Only alert and silence listing endpoints are allowed:
- `/api/v2/alerts` - List alerts
- `/api/v2/silences` - List silences
- `/api/v1/alerts` - Legacy alert endpoint

### Query Limits

- Maximum query length: 10,000 characters
- Maximum response size: 10 MB
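Limits like these are typically enforced before sending the query and while reading the response. The sketch below shows one possible shape; `validateQuery` and `readLimited` are hypothetical helpers, and interpreting 10 MB as 10 × 1024 × 1024 bytes is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

const (
	maxQueryLength  = 10000            // characters
	maxResponseSize = 10 * 1024 * 1024 // bytes; assumes 10 MB means 10 MiB
)

// validateQuery rejects empty queries and queries over the length limit.
func validateQuery(q string) error {
	if q == "" {
		return fmt.Errorf("query must not be empty")
	}
	if n := utf8.RuneCountInString(q); n > maxQueryLength {
		return fmt.Errorf("query is %d characters, limit is %d", n, maxQueryLength)
	}
	return nil
}

// readLimited reads at most limit bytes and fails if the body is larger.
// Reading limit+1 bytes distinguishes "exactly at the limit" from "over it".
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, fmt.Errorf("response exceeds %d bytes", limit)
	}
	return data, nil
}

func main() {
	fmt.Println(validateQuery(`up{job="apiserver"}`))
	body, err := readLimited(strings.NewReader(`{"status":"success"}`), maxResponseSize)
	fmt.Println(string(body), err)
}
```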

## Common Use Cases

### Cluster Health

**Check if all API servers are up:**
```
Query: up{job="apiserver"}
```

**API server request latency (99th percentile):**
```
Query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
```

### Node and Pod Metrics

**Node CPU usage percentage:**
```
Query: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Pods in CrashLoopBackOff:**
```
Query: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
```

**Container memory usage by namespace:**
```
Query: sum by(namespace) (container_memory_working_set_bytes{container!=""})
```

### Alerting

**Get all firing critical alerts:**
```
Tool: alertmanager_alerts
Active: true
Filter: severity="critical"
```

**Count alerts by severity:**
```
Query: count by(severity) (ALERTS{alertstate="firing"})
```

### Network

**Network receive rate by pod:**
```
Query: sum by(namespace, pod) (rate(container_network_receive_bytes_total[5m]))
Start: -1h
End: now
Step: 1m
```

### etcd Health

**etcd leader changes:**
```
Query: increase(etcd_server_leader_changes_seen_total[1h])
```

**etcd disk sync duration:**
```
Query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
```

## Troubleshooting

### "failed to get route" Error

The monitoring routes may not exist, or the user may lack permission to read them. List the routes to check:
```bash
oc get routes -n openshift-monitoring
```

### "no bearer token available" Error

Ensure your kubeconfig has a valid token:
```bash
oc whoami
oc get pods -n openshift-monitoring
```

### Empty Results from Prometheus

Verify the query works in the OpenShift console:
1. Go to **Observe** > **Metrics**
2. Enter your PromQL query
3. Check for results

### TLS Certificate Errors

The tools set `InsecureSkipVerify` when connecting to the monitoring routes, so server certificates are not verified. Strict TLS verification is not available by default and would require additional configuration.
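For readers weighing the trade-off, the sketch below contrasts the two modes using Go's standard `crypto/tls` package: skipping verification when no CA bundle is supplied, and verifying against a supplied bundle otherwise. The `tlsConfigFor` helper and the idea of passing a CA bundle are hypothetical; the toolset does not currently expose such an option.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// tlsConfigFor returns a TLS config that skips verification when caPEM is
// empty (the behavior described above) and otherwise verifies the server
// certificate against the supplied PEM-encoded CA bundle.
func tlsConfigFor(caPEM []byte) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if len(caPEM) == 0 {
		cfg.InsecureSkipVerify = true // no verification: vulnerable to interception
		return cfg, nil
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	cfg.RootCAs = pool
	return cfg, nil
}

func main() {
	cfg, err := tlsConfigFor(nil)
	if err != nil {
		panic(err)
	}
	client := &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{TLSClientConfig: cfg},
	}
	fmt.Println(client.Timeout, cfg.InsecureSkipVerify)
}
```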
1 change: 1 addition & 0 deletions internal/tools/update-readme/main.go
@@ -18,6 +18,7 @@ import (
 	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/helm"
 	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/kiali"
 	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/kubevirt"
+	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/observability"
 )

 type OpenShift struct{}
2 changes: 1 addition & 1 deletion pkg/config/config_default.go
@@ -9,7 +9,7 @@ import (
 func Default() *StaticConfig {
 	defaultConfig := StaticConfig{
 		ListOutput: "table",
-		Toolsets: []string{"core", "config", "helm"},
+		Toolsets: []string{"core", "config", "helm", "observability"},
 	}
 	overrides := defaultOverrides()
 	mergedConfig := mergeConfig(defaultConfig, overrides)
12 changes: 6 additions & 6 deletions pkg/config/config_test.go
@@ -247,8 +247,8 @@ func (s *ConfigSuite) TestReadConfigValidPreservesDefaultsForMissingFields() {
 		s.Equalf("table", config.ListOutput, "Expected ListOutput to be table, got %s", config.ListOutput)
 	})
 	s.Run("toolsets defaulted correctly", func() {
-		s.Require().Lenf(config.Toolsets, 3, "Expected 3 toolsets, got %d", len(config.Toolsets))
-		for _, toolset := range []string{"core", "config", "helm"} {
+		s.Require().Lenf(config.Toolsets, 4, "Expected 4 toolsets, got %d", len(config.Toolsets))
+		for _, toolset := range []string{"core", "config", "helm", "observability"} {
 			s.Containsf(config.Toolsets, toolset, "Expected toolsets to contain %s", toolset)
 		}
 	})
@@ -568,7 +568,7 @@ func (s *ConfigSuite) TestStandaloneConfigDirPreservesDefaults() {
 	s.Run("preserves default values", func() {
 		s.Equal("9999", config.Port, "port should be from drop-in")
 		s.Equal("table", config.ListOutput, "list_output should be default")
-		s.Equal([]string{"core", "config", "helm"}, config.Toolsets, "toolsets should be default")
+		s.Equal([]string{"core", "config", "helm", "observability"}, config.Toolsets, "toolsets should be default")
 	})
 }

@@ -585,7 +585,7 @@ func (s *ConfigSuite) TestStandaloneConfigDirEmpty() {

 	s.Run("returns defaults for empty directory", func() {
 		s.Equal("table", config.ListOutput, "list_output should be default")
-		s.Equal([]string{"core", "config", "helm"}, config.Toolsets, "toolsets should be default")
+		s.Equal([]string{"core", "config", "helm", "observability"}, config.Toolsets, "toolsets should be default")
 	})
 }

@@ -914,7 +914,7 @@ func (s *ConfigSuite) TestBothConfigAndConfigDirEmpty() {

 	s.Run("returns default configuration", func() {
 		s.Equal("table", config.ListOutput)
-		s.Equal([]string{"core", "config", "helm"}, config.Toolsets)
+		s.Equal([]string{"core", "config", "helm", "observability"}, config.Toolsets)
 		s.Equal(0, config.LogLevel)
 	})
 }
@@ -1034,7 +1034,7 @@ func (s *ConfigSuite) TestEmptyConfigFile() {
 		s.Equal("9999", config.Port, "port should be from drop-in")
 		// Defaults should still be applied for unset values
 		s.Equal("table", config.ListOutput, "list_output should be default")
-		s.Equal([]string{"core", "config", "helm"}, config.Toolsets, "toolsets should be default")
+		s.Equal([]string{"core", "config", "helm", "observability"}, config.Toolsets, "toolsets should be default")
 	})
 }

1 change: 1 addition & 0 deletions pkg/mcp/modules.go
@@ -6,4 +6,5 @@ import (
 	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/helm"
 	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/kiali"
 	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/kubevirt"
+	_ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/observability"
 )