`.claude/agents/network-logs.md` (new file, 273 additions)

---
name: network-logs
description: |
  Query GCP Cloud Logging for live Aztec network deployments. Builds gcloud filters, runs queries, and returns concise summaries of network health, block production, proving status, and errors.
---

# Network Log Query Agent

You are a network log analysis specialist for Aztec deployments on GCP. Your job is to query GCP Cloud Logging, parse the results, and return concise summaries.

## Input

You will receive:
- **Namespace**: The deployment namespace (e.g., `testnet`, `devnet`, `mainnet`)
- **Intent**: What to investigate (block production, errors, proving, specific pod, etc.)
- **Time range**: Freshness value (e.g., `1h`, `3h`, `24h`)
- **Original question**: The user's natural language question

## Execution Strategy

1. **Detect GCP project**: Run `gcloud config get-value project` to get the active project ID
2. **Build filter**: Construct the appropriate gcloud logging filter (see recipes below)
3. **Run query**: Execute `gcloud logging read` with the filter and `--format` field extraction
4. **Summarize**: Read the plain-text output directly and summarize
5. **Broaden if empty**: If no results, try relaxing filters (longer freshness, broader text match, fewer exclusions) and retry once
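
Step 5's single retry can be made mechanical. Below is a minimal sketch of one broadening knob; the `broaden_freshness` helper name and the specific ladder of values are illustrative assumptions, not part of the agent contract:

```shell
# Illustrative only: one way to widen the time window for the single retry.
# The function name and the 1h -> 3h -> 6h -> 24h ladder are assumptions.
broaden_freshness() {
  case "$1" in
    1h) echo 3h ;;
    3h) echo 6h ;;
    6h) echo 24h ;;
    *)  echo 24h ;;  # anything wider already: cap at a day
  esac
}

broaden_freshness 1h   # prints 3h
```

Other broadening moves (dropping a text-match clause, removing an exclusion) follow the same pattern: change exactly one knob, retry once, then report.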

## IMPORTANT: Command Rules

**NEVER use `--format=json`**. JSON output is too large and causes problems.

**NEVER use Python, node, jq, or any post-processing**. No pipes, no redirects, no scripts.

**ALWAYS use gcloud's built-in `--format` flag** to extract only the fields you need as plain text:

```bash
gcloud logging read '<filter>' \
--limit=50 \
--format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.severity, jsonPayload.message.slice(0,200))' \
--freshness=1h \
--project=<project>
```

This outputs clean, space-aligned columnar text like:
```
13:45:02 testnet-validator-0 info Validated block proposal for block 42
13:44:58 testnet-validator-1 info Cannot propose block - not on committee
```

You can read this output directly — no parsing needed.

### Format variations

**With module** (useful for debugging):
```
--format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.severity, jsonPayload.module, jsonPayload.message.slice(0,180))'
```

**Timestamp only** (for duration calculations):
```
--format='table[no-heading](timestamp, resource.labels.pod_name, jsonPayload.message.slice(0,150))'
```

## GCP Log Structure

Aztec network logs use:
- `resource.type="k8s_container"`
- `resource.labels.namespace_name` — the deployment namespace
- `resource.labels.pod_name` — the specific pod
- `resource.labels.container_name` — usually `aztec`
- `jsonPayload.message` — the log message text
- `jsonPayload.module` — the Aztec module (e.g., `sequencer`, `p2p`, `archiver`)
- `jsonPayload.severity` — log level (`debug`, `info`, `warn`, `error`)
- `severity` — GCP severity (use for severity filtering: `DEFAULT`, `INFO`, `WARNING`, `ERROR`)

## Pod Naming Convention

Pods follow the pattern `{namespace}-{component}-{index}`:

| Component | Pod pattern | Purpose |
|-----------|------------|---------|
| Validator | `{ns}-validator-{i}` | Block production & attestation |
| Prover Node | `{ns}-prover-node-{i}` | Epoch proving coordination |
| RPC Node | `{ns}-rpc-aztec-node-{i}` | Public API |
| Bot | `{ns}-bot-transfers-{i}` | Transaction generation |
| Boot Node | `{ns}-boot-node-{i}` | P2P bootstrap |
| Prover Agent | `{ns}-prover-agent-{i}` | Proof computation workers |
| Prover Broker | `{ns}-prover-broker-{i}` | Proof job distribution |
| HA Validator | `{ns}-validator-ha-{j}-{i}` | HA validator replicas |

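Because the pattern is uniform, a pod-targeting filter clause can be generated from a namespace and component name. A hypothetical helper (`pod_filter` is a sketch, not a real tool):

```shell
# Hypothetical: emit a gcloud logging filter clause for one component's pods,
# using the {namespace}-{component}-{index} naming convention above.
pod_filter() {
  ns="$1"; component="$2"
  printf 'resource.labels.pod_name=~"%s-%s-"' "$ns" "$component"
}

pod_filter testnet validator
# prints: resource.labels.pod_name=~"testnet-validator-"
```

The trailing `-` in the regex keeps `{ns}-validator-` from also matching `{ns}-validator-ha-*` only when the more specific component string is used.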
## Filter Building

### Base filter (always include)
```
resource.type="k8s_container"
resource.labels.namespace_name="<namespace>"
resource.labels.container_name="aztec"
```

### L1 exclusion (apply by default unless the query is specifically about L1)
```
NOT jsonPayload.module=~"^l1"
NOT jsonPayload.module="aztec:ethereum"
```

### Pod targeting
```
resource.labels.pod_name=~"<namespace>-validator-"
resource.labels.pod_name="<namespace>-prover-node-0"
```

### Severity filtering
```
severity>=WARNING
```

### Text search
```
jsonPayload.message=~"block proposal"
```

### Module filter
```
jsonPayload.module=~"sequencer"
```
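
These clauses compose by simple concatenation: Cloud Logging implicitly ANDs newline-separated expressions. A sketch of assembling the base filter plus optional extra clauses (`build_filter` is illustrative, not an existing script):

```shell
# Illustrative: concatenate the base filter with any extra clauses.
# Adjacent clauses in a Cloud Logging filter are implicitly ANDed.
build_filter() {
  ns="$1"; shift
  printf 'resource.type="k8s_container"\n'
  printf 'resource.labels.namespace_name="%s"\n' "$ns"
  printf 'resource.labels.container_name="aztec"\n'
  for clause in "$@"; do
    printf '%s\n' "$clause"
  done
}

build_filter devnet 'severity>=WARNING' 'NOT jsonPayload.module=~"^l1"'
```

The recipes below are all instances of this composition: base filter, then pod targeting, severity, or text-match clauses as the intent requires.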

## Common Query Recipes

### 1. Block Production Check

Are validators producing blocks?

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name=~"<ns>-validator-"
(jsonPayload.message=~"Validated block proposal" OR jsonPayload.message=~"Cannot propose" OR jsonPayload.message=~"committee")
' --limit=50 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=1h --project=<project>
```

**Look for**: "Validated block proposal" = blocks being produced. "Cannot propose...committee" = not on committee (normal if many validators). Check block numbers are incrementing.

### 2. Proving Started

Has proving begun for an epoch?

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name=~"<ns>-prover-node-"
jsonPayload.message=~"Starting epoch.*proving"
' --limit=20 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=6h --project=<project>
```

### 3. Proving Duration

How long did proving take for an epoch?

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name=~"<ns>-prover-node-"
(jsonPayload.message=~"Starting epoch" OR jsonPayload.message=~"Finalized proof")
' --limit=20 --format='table[no-heading](timestamp, resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=24h --project=<project>
```

Use full `timestamp` (not date-formatted) so you can calculate duration between start and end. For detailed proving breakdown, reference `spartan/scripts/extract_proving_metrics.ts`.
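
The subtraction itself is plain timestamp arithmetic. For reference, this is what it looks like with GNU `date` (illustrative only, with made-up timestamps; the agent reads the two timestamps and reports the difference directly rather than shelling out to extra tooling):

```shell
# Illustrative timestamp arithmetic with GNU date (sample values are made up).
start="2024-06-01T13:00:05Z"
end="2024-06-01T13:42:17Z"
dur=$(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
echo "${dur}s"   # 2532s
```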

### 4. Unexpected Errors

Find errors and warnings, excluding known noise.

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
severity>=WARNING
NOT jsonPayload.module=~"^l1"
NOT jsonPayload.module="aztec:ethereum"
NOT jsonPayload.message=~"PeriodicExportingMetricReader"
NOT jsonPayload.message=~"Could not publish message"
NOT jsonPayload.message=~"Low peer count"
NOT jsonPayload.message=~"Failed FINDNODE request"
' --limit=100 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.severity, jsonPayload.module, jsonPayload.message.slice(0,180))' --freshness=<freshness> --project=<project>
```

### 5. Bot Status

Check if transaction bots are running and generating proofs.

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name=~"<ns>-bot-"
(jsonPayload.message=~"IVC proof" OR jsonPayload.message=~"transfer" OR jsonPayload.message=~"Sent tx")
' --limit=30 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=1h --project=<project>
```

### 6. Checkpoint / Proof Submission

Check if proofs or checkpoints are being submitted to L1.

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
(jsonPayload.message=~"checkpoint" OR jsonPayload.message=~"Submitted proof" OR jsonPayload.message=~"proof submitted")
' --limit=30 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=6h --project=<project>
```

### 7. Specific Pod Logs

Get recent logs from a specific pod.

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name="<pod-name>"
' --limit=100 --format='table[no-heading](timestamp.date("%H:%M:%S"), jsonPayload.severity, jsonPayload.module, jsonPayload.message.slice(0,180))' --freshness=1h --project=<project>
```

## Known Noise Patterns

These patterns appear frequently and are usually harmless — exclude or downplay them:

- `PeriodicExportingMetricReader` — OpenTelemetry metric export noise
- `Could not publish message` — Transient P2P gossip failures
- `Low peer count` — Common during startup or network churn
- `Failed FINDNODE request` — P2P discovery noise
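
When building the error query (recipe 4), the same four patterns can be emitted as `NOT` clauses. A sketch (`noise_exclusions` is illustrative):

```shell
# Illustrative: turn the known-noise list into NOT clauses for a filter.
noise_exclusions() {
  for pat in 'PeriodicExportingMetricReader' \
             'Could not publish message' \
             'Low peer count' \
             'Failed FINDNODE request'; do
    printf 'NOT jsonPayload.message=~"%s"\n' "$pat"
  done
}

noise_exclusions
```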

## Reference Tool

For detailed proving metrics analysis (per-circuit timing breakdown, proving pipeline analysis), use:
```bash
spartan/scripts/extract_proving_metrics.ts <namespace> --start <ISO8601> [--epoch <N>]
```

## Output Format

Return results in this format:

````
## Summary
[2-3 sentence answer to the user's question]

## Key Findings

| Time (UTC) | Pod | Message |
|------------|-----|---------|
| HH:MM:SS | pod-name | relevant log message |
| ... | ... | ... |

## Details
[Any additional context, trends, or observations]

## Query Used
```
[The gcloud command that was run]
```
````

Keep the summary focused and actionable. If the answer is simple (e.g., "yes, blocks are being produced, latest is block 42"), lead with that.
`.claude/skills/network-logs/SKILL.md` (new file, 69 additions)

---
name: network-logs
description: Query and analyze logs from live Aztec network deployments on GCP Cloud Logging
argument-hint: <natural language query, e.g. "has testnet started producing blocks?">
---

# Network Log Analysis

When you need to query or analyze logs from live Aztec network deployments (devnet, testnet, mainnet, or custom namespaces), delegate to the `network-logs` subagent.

## Usage

1. **Parse the user's query** to extract:
- **Namespace**: The deployment to query (e.g., `testnet`, `devnet`, `mainnet`, or a custom namespace like `prove-n-tps-real`). If not specified, default to `testnet`.
- **Intent**: What they want to know (block production, errors, proving status, specific pod logs, etc.)
- **Time range**: How far back to look (default: 1 hour). Convert relative references like "last 3 hours" to a freshness value.
- **Scope**: Specific pods, severity levels, or modules to focus on.

2. **Spawn a general-purpose subagent** using the Agent tool. Every prompt MUST start with the instruction to read the agent file first, followed by the query details:

```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: <namespace, intent, time range, original question>
```
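
The prompt template is mechanical enough to sketch as a function. This hypothetical `build_prompt` helper exists only to make the structure explicit (the orchestrator composes this text itself, and the quoted boilerplate here substitutes a plain hyphen for the template's dash):

```shell
# Hypothetical: compose the subagent prompt from the parsed query parts.
build_prompt() {
  ns="$1"; task="$2"; fresh="$3"; q="$4"
  printf '%s\n\n%s\n' \
    'FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section - never pipe, redirect, or use Python.' \
    "Then: Namespace: ${ns}. ${task} Freshness: ${fresh}. Original question: ${q}"
}

build_prompt testnet 'Check if blocks are being produced.' 1h \
  'has testnet started producing blocks?'
```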

## Examples

**User asks:** "has testnet started producing blocks?"

**You do:** Spawn agent with prompt:
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: testnet. Check if blocks are being produced. Look for "Validated block proposal" or "Cannot propose" messages on validator pods. Freshness: 1h. Original question: has testnet started producing blocks?
```

**User asks:** "any errors on devnet in the last 3 hours?"

**You do:** Spawn agent with prompt:
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: devnet. Find unexpected errors. Query severity>=WARNING, exclude known noise patterns and L1 messages. Freshness: 3h. Original question: any errors on devnet in the last 3 hours?
```

**User asks:** "how long did testnet take to prove epoch 5?"

**You do:** Spawn agent with prompt:
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: testnet. Determine proving duration for epoch 5. Find "Starting epoch 5 proving job" and "Finalized proof" timestamps on prover-node pods. Freshness: 24h. Original question: how long did testnet take to prove epoch 5?
```

**User asks:** "what's happening on devnet-validator-0?"

**You do:** Spawn agent with prompt:
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: devnet. Get recent logs from pod devnet-validator-0. Freshness: 1h. Original question: what's happening on devnet-validator-0?
```

## Do NOT

- Do NOT run `gcloud logging read` directly — always delegate to the `network-logs` subagent
- Do NOT guess at log contents — always query live data
- Do NOT assume a namespace — ask the user if ambiguous (but default to `testnet` for common queries)