Merged
80 changes: 72 additions & 8 deletions .claude/agents/network-logs.md
@@ -13,7 +13,7 @@ You are a network log analysis specialist for Aztec deployments on GCP. Your job
You will receive:
- **Namespace**: The deployment namespace (e.g., `testnet`, `devnet`, `mainnet`)
- **Intent**: What to investigate (block production, errors, proving, specific pod, etc.)
- **Time range**: Freshness value (e.g., `1h`, `3h`, `24h`)
- **Time range**: Freshness value (e.g., `10m`, `3h`, `24h`) — default is `10m` for real-time queries
- **Original question**: The user's natural language question

## Execution Strategy
@@ -36,7 +36,7 @@ You will receive:
gcloud logging read '<filter>' \
--limit=50 \
--format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.severity, jsonPayload.message.slice(0,200))' \
--freshness=1h \
--freshness=10m \
--project=<project>
```

@@ -48,6 +48,8 @@ This outputs clean tab-separated text like:

You can read this output directly — no parsing needed.

**Tip**: When searching for tx hashes or other long identifiers, use `.slice(0,300)` instead of `.slice(0,200)` to avoid truncating the relevant data.

### Format variations

**With module** (useful for debugging):
@@ -81,12 +83,16 @@ Pods follow the pattern `{namespace}-{component}-{index}`:
| Validator | `{ns}-validator-{i}` | Block production & attestation |
| Prover Node | `{ns}-prover-node-{i}` | Epoch proving coordination |
| RPC Node | `{ns}-rpc-aztec-node-{i}` | Public API |
| Bot | `{ns}-bot-transfers-{i}` | Transaction generation |
| Bot | `{ns}-bot-{type}-{i}` | Transaction generation (types: transfers, swaps, etc.) |
| Boot Node | `{ns}-boot-node-{i}` | P2P bootstrap |
| Prover Agent | `{ns}-prover-agent-{i}` | Proof computation workers |
| Prover Broker | `{ns}-prover-broker-{i}` | Proof job distribution |
| HA Validator | `{ns}-validator-ha-{j}-{i}` | HA validator replicas |

## Deployment-Specific Notes

- **next-net** redeploys every morning at ~4am UTC. Always use timestamp range filters (not `--freshness`) when querying next-net for a specific date, and expect logs to only cover a single instance of the network.

## Filter Building

### Base filter (always include)
@@ -108,6 +114,13 @@ resource.labels.pod_name=~"<namespace>-validator-"
resource.labels.pod_name="<namespace>-prover-node-0"
```

### Timestamp ranges (for historical queries)
When querying specific past dates instead of recent logs, use timestamp filters **instead of** `--freshness` (they are mutually exclusive):
```
timestamp>="2026-03-11T00:00:00Z"
timestamp<="2026-03-12T00:00:00Z"
```
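
As a rough sketch for someone preparing such a filter by hand (not something the agent is meant to run), the two bounds can be generated from a start day and the following day. The function name `timestamp_range` is hypothetical and does not appear in the agent file; it only prints filter text and does not call `gcloud`.

```shell
# Hypothetical helper (an assumption, not part of the agent file):
# print the two timestamp filter lines for a UTC date range.
timestamp_range() {
  start_day="$1"   # first day to include, YYYY-MM-DD
  end_day="$2"     # upper bound (typically the following day), YYYY-MM-DD
  printf 'timestamp>="%sT00:00:00Z"\n' "$start_day"
  printf 'timestamp<="%sT00:00:00Z"\n' "$end_day"
}

# Reproduces the two filter lines shown in the example above.
timestamp_range 2026-03-11 2026-03-12
```

The printed lines are meant to be pasted into the filter body of a `gcloud logging read` command that omits `--freshness`.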

### Severity filtering
```
severity>=WARNING
@@ -135,11 +148,11 @@ gcloud logging read '
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name=~"<ns>-validator-"
(jsonPayload.message=~"Validated block proposal" OR jsonPayload.message=~"Cannot propose" OR jsonPayload.message=~"committee")
' --limit=50 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=1h --project=<project>
(jsonPayload.message=~"Validated block proposal" OR jsonPayload.message=~"Built block" OR jsonPayload.message=~"Cannot propose" OR jsonPayload.message=~"Published checkpoint")
' --limit=50 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=10m --project=<project>
```

**Look for**: "Validated block proposal" = blocks being produced. "Cannot propose...committee" = not on committee (normal if many validators). Check block numbers are incrementing.
**Look for**: "Validated block proposal" = blocks being produced. "Built block N ... with X txs" = shows tx count per block (0 = empty). "Published checkpoint" = checkpoints landing on L1. "Cannot propose...committee" = not on committee (normal if many validators). Check block numbers are incrementing. **Note**: The `pod_name=~"<ns>-validator-"` filter also matches HA validator pods (e.g., `validator-ha-1-1`) — expect both regular and HA validators in results.
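
If only the regular validators are of interest, one possible approach is to add a `NOT` clause for the HA pod prefix, in the same style as the `NOT jsonPayload.message=~...` exclusions used elsewhere in this file. The sketch below only prints the two filter lines; `validator_filter` is a hypothetical name, and the `-validator-ha-` prefix is taken from the HA Validator naming pattern in the pod table.

```shell
# Hypothetical helper (an assumption, not part of the agent file):
# print pod_name filter lines that match regular validators but skip HA ones.
validator_filter() {
  ns="$1"   # deployment namespace, e.g. testnet
  printf 'resource.labels.pod_name=~"%s-validator-"\n' "$ns"
  printf 'NOT resource.labels.pod_name=~"%s-validator-ha-"\n' "$ns"
}

validator_filter testnet
```

Dropping the second line restores the default behaviour described above, where both regular and HA validators appear in the results.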

### 2. Proving Started

@@ -187,6 +200,9 @@ gcloud logging read '
NOT jsonPayload.message=~"Could not publish message"
NOT jsonPayload.message=~"Low peer count"
NOT jsonPayload.message=~"Failed FINDNODE request"
NOT jsonPayload.message=~"No active peers"
NOT jsonPayload.message=~"Not enough txs"
NOT jsonPayload.message=~"StateView contract not found"
' --limit=100 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.severity, jsonPayload.module, jsonPayload.message.slice(0,180))' --freshness=<freshness> --project=<project>
```

@@ -201,7 +217,7 @@ gcloud logging read '
resource.labels.container_name="aztec"
resource.labels.pod_name=~"<ns>-bot-"
(jsonPayload.message=~"IVC proof" OR jsonPayload.message=~"transfer" OR jsonPayload.message=~"Sent tx")
' --limit=30 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=1h --project=<project>
' --limit=30 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=10m --project=<project>
```

### 6. Checkpoint / Proof Submission
@@ -227,9 +243,54 @@ gcloud logging read '
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
resource.labels.pod_name="<pod-name>"
' --limit=100 --format='table[no-heading](timestamp.date("%H:%M:%S"), jsonPayload.severity, jsonPayload.module, jsonPayload.message.slice(0,180))' --freshness=1h --project=<project>
' --limit=100 --format='table[no-heading](timestamp.date("%H:%M:%S"), jsonPayload.severity, jsonPayload.module, jsonPayload.message.slice(0,180))' --freshness=10m --project=<project>
```

### 8. Transaction Debugging

Trace a specific transaction by hash. Use the first 8-16 hex characters to search, and `.slice(0,300)` to avoid truncating hashes in output.

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
jsonPayload.message=~"<first 8-16 hex chars of tx hash>"
' --limit=50 --format='table[no-heading](timestamp, resource.labels.pod_name, jsonPayload.module, jsonPayload.message.slice(0,300))' --freshness=24h --project=<project>
```

**Investigation steps**: Check which pod received the tx (RPC node vs validators). Look for "Received tx", "Added tx", "dropped", "rejected", "invalid", "revert". If only the RPC node has it, the tx wasn't propagated via P2P. Cross-reference with block production to see if blocks were empty during that period.
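
A minimal sketch of deriving the search substring from a full hash, assuming the hash may carry a `0x` prefix and that 8 characters are enough to be unique in the queried window. `tx_prefix` is a hypothetical helper name, intended for a human preparing the query rather than for the agent.

```shell
# Hypothetical helper (an assumption, not part of the agent file):
# strip an optional 0x prefix and keep the first 8 hex characters.
tx_prefix() {
  hash="${1#0x}"              # drop a leading 0x if present
  printf '%.8s\n' "$hash"     # precision .8 truncates the string to 8 chars
}

tx_prefix 0x24e837d401e5251cc523ac272c0401bed57d36bd6f26eb2a89167109efe05c2d
# prints 24e837d4
```

If 8 characters return unrelated matches, a longer prefix (up to the 16 suggested above) narrows the search.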

### 9. Chain Health / Stability

Check for chain pruning, L1 publish failures, and proposal validation issues.

```bash
gcloud logging read '
resource.type="k8s_container"
resource.labels.namespace_name="<ns>"
resource.labels.container_name="aztec"
(jsonPayload.message=~"Chain pruned" OR jsonPayload.message=~"Failed to publish" OR jsonPayload.message=~"L1 tx timed out" OR jsonPayload.message=~"proposal validation failed")
' --limit=50 --format='table[no-heading](timestamp.date("%H:%M:%S"), resource.labels.pod_name, jsonPayload.message.slice(0,200))' --freshness=10m --project=<project>
```

**Look for**: Repeated chain pruning = L1 publishing pipeline issues. "L1 tx timed out" = Ethereum congestion or gas issues. "proposal validation failed" = block proposal rejected by peers.

### 10. Network Status Overview

For general "status" or "health" queries, run these three queries **in parallel** to get a comprehensive picture:

1. **Block production** — use Recipe 1 (Block Production Check)
2. **Errors** — use Recipe 4 (Unexpected Errors)
3. **Proving** — use Recipe 3 (Proving Duration) with `--freshness=1h`

Then synthesize into a single status report covering:
- **Block production**: Are blocks being built? Latest block number/slot? How many validators participating?
- **Proving**: What epoch was last proved? How long did it take?
- **Warnings**: Any notable errors or warnings (excluding known noise)?

This is the most common query pattern — prefer this composite approach over individual queries when the user asks for general status.

## Known Noise Patterns

These patterns appear frequently and are usually harmless — exclude or downplay them:
@@ -238,6 +299,9 @@ These patterns appear frequently and are usually harmless — exclude or downpla
- `Could not publish message` — Transient P2P gossip failures
- `Low peer count` — Common during startup or network churn
- `Failed FINDNODE request` — P2P discovery noise
- `No active peers to send requests to` — P2P reqresp on isolated nodes (e.g., blob-sink)
- `Not enough txs to build block` — Normal when transaction volume is low
- `StateView contract not found` — Price oracle warning; Uniswap V4 StateView only exists on mainnet, so all other networks emit this. Safe to ignore unless namespace is `mainnet`

## Reference Tool

17 changes: 13 additions & 4 deletions .claude/skills/network-logs/SKILL.md
@@ -13,10 +13,10 @@ When you need to query or analyze logs from live Aztec network deployments (devn
1. **Parse the user's query** to extract:
- **Namespace**: The deployment to query (e.g., `testnet`, `devnet`, `mainnet`, or a custom namespace like `prove-n-tps-real`). If not specified, default to `testnet`.
- **Intent**: What they want to know (block production, errors, proving status, specific pod logs, etc.)
- **Time range**: How far back to look (default: 1 hour). Convert relative references like "last 3 hours" to a freshness value.
- **Time range**: How far back to look (default: 10 minutes). For relative durations ("last 3 hours"), convert to a freshness value. For **specific days** ("March 11th", "yesterday"), convert to timestamp range filters: `timestamp>="YYYY-MM-DDT00:00:00Z" timestamp<="YYYY-MM-DDT23:59:59Z"`. Use the current date to resolve relative day references such as "yesterday".
- **Scope**: Specific pods, severity levels, or modules to focus on.

2. **Spawn a general-purpose subagent** using the Agent tool. Every prompt MUST start with the instruction to read the agent file first, followed by the query details:
2. **Spawn a `network-logs` subagent** using the Agent tool with `subagent_type: network-logs`. Every prompt MUST start with the instruction to read the agent file first, followed by the query details:

```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.
@@ -32,7 +32,7 @@ Then: <namespace, intent, time range, original question>
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: testnet. Check if blocks are being produced. Look for "Validated block proposal" or "Cannot propose" messages on validator pods. Freshness: 1h. Original question: has testnet started producing blocks?
Then: Namespace: testnet. Check if blocks are being produced. Look for "Validated block proposal" or "Cannot propose" messages on validator pods. Freshness: 10m. Original question: has testnet started producing blocks?
```

**User asks:** "any errors on devnet in the last 3 hours?"
@@ -59,7 +59,16 @@ Then: Namespace: testnet. Determine proving duration for epoch 5. Find "Starting
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: devnet. Get recent logs from pod devnet-validator-0. Freshness: 1h. Original question: what's happening on devnet-validator-0?
Then: Namespace: devnet. Get recent logs from pod devnet-validator-0. Freshness: 10m. Original question: what's happening on devnet-validator-0?
```

**User asks:** "why couldn't next-net process tx 0x24e837d4... on March 11th?"

**You do:** Spawn agent with prompt:
```
FIRST: Read the file .claude/agents/network-logs.md for full instructions on how to query GCP logs. Follow ALL rules in that file, especially the "IMPORTANT: Command Rules" section — never pipe, redirect, or use Python.

Then: Namespace: next-net. Debug why tx 0x24e837d401e5251cc523ac272c0401bed57d36bd6f26eb2a89167109efe05c2d could not be processed. Search for the hash substring "24e837d4" in logs, then trace: was it received? By which pod? Did it propagate to validators? Was it included in a block? Any errors? Use timestamp range: timestamp>="2026-03-11T00:00:00Z" timestamp<="2026-03-12T00:00:00Z". Original question: why couldn't next-net process this tx?
```

## Do NOT