Skip to content

fix(router): high subscription loads causing deadlocks#2223

Merged
endigma merged 8 commits intomainfrom
jesse/eng-8055-high-subscription-load-causes-deadlocks
Sep 24, 2025
Merged

fix(router): high subscription loads causing deadlocks#2223
endigma merged 8 commits intomainfrom
jesse/eng-8055-high-subscription-load-causes-deadlocks

Conversation

@endigma
Copy link
Copy Markdown
Member

@endigma endigma commented Sep 17, 2025

  • chore: add pyroscope support permanantly to the router
  • chore: add websocket super load script
  • chore: use wip go-tools c9d5e84c

Summary by CodeRabbit

  • New Features

    • Optional continuous profiling via Pyroscope (new CLI flag); mutually exclusive with pprof and CPU/memory profiling.
    • Docker Compose adds a Pyroscope service and Grafana now installs the Pyroscope app plugin.
  • Tests

    • Added a k6 WebSocket superload benchmark for high-throughput GraphQL subscriptions.
  • Chores

    • Updated Go dependencies, including GraphQL tooling and Pyroscope client packages.

Checklist

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 17, 2025

Walkthrough

Adds a Pyroscope service and Grafana plugin to docker-compose, updates Go module versions and adds pyroscope-go, integrates Pyroscope continuous profiling into the router binary via a --pyroscope-addr flag, and adds a K6 WebSocket superload benchmark script.

Changes

Cohort / File(s) Summary
Docker compose services
docker-compose.yml
Added pyroscope service (image: grafana/pyroscope:latest, ports: ['4040:4040'], restart: unless-stopped, networks: ['primary'], profiles: ['debug']). Updated Grafana GF_INSTALL_PLUGINS to include grafana-pyroscope-app.
Router binary
router/cmd/main.go
Added --pyroscope-addr flag and Pyroscope integration: configures runtime profiling rates, initializes pyroscope client with ApplicationName wundergraph.cosmo.router, hostname tag, explicit ProfileTypes (CPU, AllocObjects, AllocSpace, InuseObjects, InuseSpace, Goroutines, MutexCount, MutexDuration, BlockCount, BlockDuration), logs startup errors, defers Stop(), and disallows concurrent pprof or CPU/mem profiling when Pyroscope is enabled.
Go modules
router/go.mod, router-tests/go.mod
Bumped github.com/wundergraph/graphql-go-tools/v2 from v2.0.0-rc.226 to v2.0.0-rc.227. Added github.com/grafana/pyroscope-go v1.2.7 (direct) and github.com/grafana/pyroscope-go/godeltaprof v0.1.9 (indirect) in router/go.mod.
Benchmarks / load testing
router/bench-ws-superload.js
New K6 WebSocket benchmark: ramps to large VU counts, connects to ws://localhost:3002/graphql using graphql-transport-ws, sends connection_init and subscribe for a high-frequency subscription, asserts 101 upgrade, logs messages, and schedules a ~20s cancel (references an outer-scope socket variable).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The title claims a bug fix for deadlocks under high subscription load, but the changeset primarily adds Pyroscope profiling, a docker-compose service, a k6 WebSocket superload script, and dependency bumps; there is no obvious code change that actually fixes router deadlocks. While the changes are related to investigating or reproducing the issue, they do not match the author’s stated "fix" in the title and therefore misrepresent the main change. This makes the title misleading relative to the diff. Either update the PR title to reflect the actual changes (for example: "chore(router): add Pyroscope profiling and k6 websocket superload script to investigate subscription deadlocks") or include the actual bugfix and tests that resolve the deadlock before labeling it as a "fix".
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a2cf54 and a09f7dc.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • docker-compose.yml (1 hunks)
  • router-tests/go.mod (1 hunks)
  • router/bench-ws-superload.js (1 hunks)
  • router/cmd/main.go (4 hunks)
  • router/go.mod (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/go.mod
  • docker-compose.yml
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
PR: wundergraph/cosmo#2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/go.mod
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: Based on direct user experience from endigma (PR author), pyroscope-go's Config.ServerAddress field accepts simple host:port format (e.g., "localhost:4040") in practice, even though the official documentation primarily shows full URL examples with schemes (e.g., "http://localhost:4040"). The library likely handles URL normalization internally.

Applied to files:

  • router/cmd/main.go
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: pyroscope-go's Config.ServerAddress field accepts both full URLs (e.g., "http://localhost:4040") and simple host:port format (e.g., "localhost:4040"). The library handles URL normalization internally, so both formats work correctly.

Applied to files:

  • router/cmd/main.go
📚 Learning: 2025-09-19T09:36:56.302Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/bench-ws-superload.js:36-37
Timestamp: 2025-09-19T09:36:56.302Z
Learning: The router/bench-ws-superload.js load testing script is used for generating high websocket loads for debugging and stress testing (such as deadlock detection), not for collecting accurate performance metrics. Performance results from this script are not used for measurement purposes, so logging overhead optimizations are not necessary.

Applied to files:

  • router/bench-ws-superload.js
📚 Learning: 2025-09-19T09:37:06.313Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/bench-ws-superload.js:72-82
Timestamp: 2025-09-19T09:37:06.313Z
Learning: The router/bench-ws-superload.js load testing script is used for testing purposes where performance results are not critical - it serves its purpose regardless of logging slowdowns or performance metrics accuracy.

Applied to files:

  • router/bench-ws-superload.js
🧬 Code graph analysis (1)
router/cmd/main.go (1)
router/pkg/profile/profile.go (1)
  • Start (104-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-router
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (8)
router-tests/go.mod (1)

30-30: LGTM! Verify alignment across remaining modules.

The graphql-go-tools/v2 version bump to v2.0.0-rc.227 is consistent with the router module. However, check if demo/go.mod and aws-lambda-router/go.mod also need the same update.

router/bench-ws-superload.js (1)

1-86: WebSocket load testing script for debugging high subscription loads.

Based on your earlier feedback, this script serves debugging and stress testing purposes rather than performance measurement, so logging overhead and timing precision are not concerns.

router/cmd/main.go (6)

11-11: LGTM! Required imports for Pyroscope integration.

The runtime import enables mutex and block profile rate configuration, while pyroscope-go provides continuous profiling capabilities.

Also applies to: 15-15


32-32: Pyroscope address flag implementation looks good.

The flag accepts both URL formats (http://localhost:4040) and simple host:port (localhost:4040) as confirmed by your testing experience.


34-36: LGTM! Profile path flags properly defined.

The cpuprofile and memprofile flags are correctly declared and will integrate with the existing profiling infrastructure.


99-105: Excellent conflict prevention logic.

The guards properly prevent runtime crashes from profiler conflicts:

  • Line 100: Prevents simultaneous pprof server and Pyroscope (different purposes)
  • Lines 103-105: Prevents CPU/memory profile file generation while Pyroscope's continuous profiler is active

118-150: Comprehensive Pyroscope configuration with proper error handling.

The implementation includes:

  • Runtime profile configuration (mutex and block rates)
  • Component-scoped logging
  • Complete profile type coverage
  • Error handling with graceful degradation
  • Proper cleanup with deferred Stop()

125-143: Verify pyroscope-go versions include security fixes for CVE-2024-53382 and CVE-2025-22870.

  • router/go.mod: github.com/grafana/pyroscope-go v1.2.7 (line 71) and github.com/grafana/pyroscope-go/godeltaprof v0.1.9 // indirect (line 118).
  • Confirm these releases include the CVE fixes in upstream release notes/Grafana advisories; if not, upgrade to a patched release and update go.mod.

Comment @coderabbitai help to get the list of available commands and usage tips.

@endigma endigma force-pushed the jesse/eng-8055-high-subscription-load-causes-deadlocks branch from dcd40a9 to a0176eb Compare September 17, 2025 12:12
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Sep 17, 2025

❌ Internal Query Planner CI checks failed

The Internal Query Planner CI checks failed in the celestial repository, and this is going to stop the merge of this PR.
If you are part of the WunderGraph organization, you can see the PR with more details.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (6)
docker-compose.yml (2)

99-99: Pin Grafana plugin versions to avoid surprise upgrades.

GF_INSTALL_PLUGINS without versions fetches latest on every pull; pin with @Version to keep environments reproducible.

Example:

-      - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource,grafana-pyroscope-app
+      - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource@<version>,grafana-pyroscope-app@<version>

107-116: Avoid :latest for Pyroscope; consider persisting data.

Use a fixed image tag and (optionally) a volume if you want profiles to survive restarts.

-  pyroscope:
-    image: grafana/pyroscope:latest
+  pyroscope:
+    image: grafana/pyroscope:1.x.y
     ports:
       - '4040:4040'
     restart: unless-stopped
     networks:
       - primary
     profiles:
       - debug
+    # optional, if you want persistence
+    # volumes:
+    #   - pyroscope:/var/lib/pyroscope
router/cmd/main.go (3)

32-36: Clarify --pyroscope-addr help text and expected format.

The example “:4040” is not a valid URL for pyroscope-go. Prefer a full URL in help text.

- pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Address to use for pyroscope continuous profiling. e.g. :4040 for localhost:4040")
+ pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Pyroscope server URL, e.g. http://localhost:4040")

113-114: Nit: it’s a client profiler, not a server.

Small log wording tweak.

-       logger.Info("starting pyroscope server")
+       logger.Info("starting pyroscope profiler client")

109-111: Consider lower mutex/block profile rates by default.

5 can be non-trivial overhead; 1 is a safer default. Make it configurable if you need higher fidelity.

-       runtime.SetMutexProfileFraction(5)
-       runtime.SetBlockProfileRate(5)
+       runtime.SetMutexProfileFraction(1)
+       runtime.SetBlockProfileRate(1)
router/bench-ws-superload.js (1)

8-13: Ramp shape is very aggressive.

Instant 10k VUs will cause connection storms and false negatives. Consider a gradual ramp and a soak stage.

 export const options = {
   stages: [
-    { duration: '0s', target: 10000 },
-    { duration: '5m', target: 10000 },
+    { duration: '1m', target: 1000 },
+    { duration: '4m', target: 10000 },
+    { duration: '5m', target: 10000 },
   ],
 };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9ff64d and a0176eb.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • docker-compose.yml (1 hunks)
  • router-tests/go.mod (1 hunks)
  • router/bench-ws-superload.js (1 hunks)
  • router/cmd/main.go (3 hunks)
  • router/go.mod (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
PR: wundergraph/cosmo#2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/go.mod
  • router/go.mod
🧬 Code graph analysis (1)
router/cmd/main.go (1)
router/pkg/profile/profile.go (1)
  • Start (104-123)
🔇 Additional comments (2)
router/go.mod (2)

34-34: graphql-go-tools bump matches tests.

Consistent with router-tests; no concerns.


71-71: Pyroscope deps added appropriately.

Direct pyroscope-go and indirect godeltaprof make sense for continuous profiling. Main code should handle lifecycle (see main.go comments).

Confirm these versions are the intended minimums across environments (CI/build images). If you pin elsewhere, align here too.

Also applies to: 118-118

Comment thread router-tests/go.mod Outdated
Comment thread router/bench-ws-superload.js
Comment thread router/bench-ws-superload.js
Comment thread router/cmd/main.go Outdated
Comment thread router/cmd/main.go Outdated
Comment thread router/cmd/main.go Outdated
@endigma endigma force-pushed the jesse/eng-8055-high-subscription-load-causes-deadlocks branch from 6dc20c5 to 5211b9d Compare September 17, 2025 13:01
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
router/cmd/main.go (1)

110-147: Gate file-based CPU profiling when Pyroscope is enabled to avoid conflicts.

If --cpuprofile is set alongside --pyroscope-addr, runtime/pprof CPU profiling will conflict with Pyroscope’s CPU profiler. Disable the file CPU profile in that case, but keep memprofile. This was raised earlier and still applies.

Apply this diff:

- // Start profiling if flags are set
- profiler := profile.Start(baseLogger, *cpuProfilePath, *memProfilePath)
- defer profiler.Finish()
+ // Start file-based profiling (avoid CPU conflict when Pyroscope is enabled)
+ cpuPath := *cpuProfilePath
+ if *pyroscopeAddr != "" && cpuPath != "" {
+   baseLogger.Warn("Ignoring --cpuprofile because --pyroscope-addr is enabled; use Pyroscope pprof to capture CPU", zap.String("cpuprofile", cpuPath))
+   cpuPath = ""
+ }
+ profiler := profile.Start(baseLogger, cpuPath, *memProfilePath)
+ defer profiler.Finish()

Small polish in the Pyroscope block:

  • Use os.Hostname() with env fallback for the tag.
  • Tweak log message wording.

Apply this diff:

- logger := baseLogger.With(zap.String("component", "pyroscope"))
- logger.Info("starting pyroscope server")
+ logger := baseLogger.With(zap.String("component", "pyroscope"))
+ logger.Info("Starting Pyroscope agent")
+ host, _ := os.Hostname()
+ if host == "" {
+   host = os.Getenv("HOSTNAME")
+ }
@@
-   Tags:            map[string]string{"hostname": os.Getenv("HOSTNAME")},
+   Tags:            map[string]string{"hostname": host},
🧹 Nitpick comments (3)
router/cmd/main.go (3)

34-36: Nit: make profiling flag descriptions precise.

They don’t run from “program start”; they start when profile.Start is called.

Apply this diff:

- memProfilePath = flag.String("memprofile", "", "Path to write memory profile. Memory is a snapshot taken at the time the program exits")
- cpuProfilePath = flag.String("cpuprofile", "", "Path to write cpu profile. CPU is measured from when the program starts until the program exits")
+ memProfilePath = flag.String("memprofile", "", "Path to write memory profile (heap snapshot on shutdown)")
+ cpuProfilePath = flag.String("cpuprofile", "", "Path to write CPU profile (from profiler start until shutdown)")

99-102: Tighten error message to reference flags explicitly.

Apply this diff:

- log.Fatal("Cannot use pprof and pyroscope at the same time")
+ log.Fatal("Cannot enable both --pprof-addr and --pyroscope-addr; they conflict")

114-119: Operational note: make mutex/block profiling rates configurable.

Fixed rates (5) add overhead at scale. Consider flags/env to tune SetMutexProfileFraction and SetBlockProfileRate per environment.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0176eb and 5211b9d.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • docker-compose.yml (1 hunks)
  • router-tests/go.mod (1 hunks)
  • router/bench-ws-superload.js (1 hunks)
  • router/cmd/main.go (4 hunks)
  • router/go.mod (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • docker-compose.yml
  • router/bench-ws-superload.js
  • router-tests/go.mod
  • router/go.mod
🧰 Additional context used
🧬 Code graph analysis (1)
router/cmd/main.go (1)
router/pkg/profile/profile.go (1)
  • Start (104-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/cmd/main.go (2)

11-11: LGTM: runtime import is appropriate for mutex/block profiling.


15-15: LGTM: pyroscope-go integration import.

Comment thread router/cmd/main.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5211b9d and 87ae77e.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • router-tests/go.mod (1 hunks)
  • router/go.mod (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/go.mod
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
PR: wundergraph/cosmo#2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router/go.mod
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/go.mod (2)

71-71: Pyroscope client dependency looks good

Direct pin to github.com/grafana/pyroscope-go v1.2.7 is reasonable for the new profiling flag. Thanks for keeping mark3labs/mcp-go at v0.36.0 per prior learnings.


118-118: OK to keep indirect godeltaprof

github.com/grafana/pyroscope-go/godeltaprof v0.1.9 as an indirect dep is expected with pyroscope-go; no action needed.

Comment thread router/go.mod Outdated
@endigma endigma force-pushed the jesse/eng-8055-high-subscription-load-causes-deadlocks branch 3 times, most recently from 7df01ad to fa915ba Compare September 23, 2025 08:39
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
router/cmd/main.go (3)

32-36: Clarify --pyroscope-addr help text

The current example "e.g. :4040 for localhost:4040" is confusing (":4040" is not a valid remote server address). Suggest clearer examples.

- pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Address to use for pyroscope continuous profiling. e.g. :4040 for localhost:4040")
+ pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Pyroscope server address for continuous profiling, e.g. http://localhost:4040 or localhost:4040")

120-122: Adjust log wording: it's a profiler/agent, not a server

Minor copy fix for clarity.

- logger.Info("starting pyroscope server")
+ logger.Info("starting pyroscope profiler")

127-128: Use os.Hostname() with env fallback for the hostname tag

HOSTNAME may be unset in some environments. Prefer os.Hostname() with a fallback.

-            Tags:            map[string]string{"hostname": os.Getenv("HOSTNAME")},
+            Tags: map[string]string{
+                "hostname": func() string {
+                    if h, err := os.Hostname(); err == nil && h != "" {
+                        return h
+                    }
+                    if v := os.Getenv("HOSTNAME"); v != "" {
+                        return v
+                    }
+                    return "unknown"
+                }(),
+            },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bbf8070 and fa915ba.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • docker-compose.yml (1 hunks)
  • router-tests/go.mod (1 hunks)
  • router/bench-ws-superload.js (1 hunks)
  • router/cmd/main.go (4 hunks)
  • router/go.mod (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router/go.mod
  • docker-compose.yml
  • router/bench-ws-superload.js
  • router-tests/go.mod
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: Based on direct user experience from endigma (PR author), pyroscope-go's Config.ServerAddress field accepts simple host:port format (e.g., "localhost:4040") in practice, even though the official documentation primarily shows full URL examples with schemes (e.g., "http://localhost:4040"). The library likely handles URL normalization internally.

Applied to files:

  • router/cmd/main.go
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: pyroscope-go's Config.ServerAddress field accepts both full URLs (e.g., "http://localhost:4040") and simple host:port format (e.g., "localhost:4040"). The library handles URL normalization internally, so both formats work correctly.

Applied to files:

  • router/cmd/main.go
🧬 Code graph analysis (1)
router/cmd/main.go (1)
router/pkg/profile/profile.go (1)
  • Start (104-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
router/cmd/main.go (4)

11-11: Import of runtime is appropriate

Needed for SetMutexProfileFraction and SetBlockProfileRate.


15-15: Pyroscope client import looks good

Required for the profiling integration.


114-119: Confirm mutex/block profiling rates

SetMutexProfileFraction(5) and SetBlockProfileRate(5) add overhead at scale. Confirm these values are intentional for production loads, or consider making them configurable.


145-147: Graceful shutdown of Pyroscope is handled correctly

Capturing the handle and deferring Stop() is correct.

Comment thread router/cmd/main.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
router/cmd/main.go (4)

32-36: Flag help: clarify accepted Pyroscope address formats

Given the library accepts both URL and host:port, consider mentioning both in the help text.

Apply this diff:

- pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Address to use for pyroscope continuous profiling. e.g. http://localhost:4040")
+ pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Address to use for pyroscope continuous profiling, e.g. http://localhost:4040 or localhost:4040")

99-105: Avoid hard-failing when Pyroscope is combined with pprof or --cpuprofile; warn and continue

Fatal-exiting makes local troubleshooting harder. Pyroscope only conflicts with CPU profiling; pprof server (heap/goroutine/etc.) and file heap snapshots can still be useful.

Apply these diffs:

-	if *pprofListenAddr != "" && *pyroscopeAddr != "" {
-		baseLogger.Fatal("Cannot use pprof and pyroscope at the same time")
-	}
+	if *pprofListenAddr != "" && *pyroscopeAddr != "" {
+		baseLogger.Warn("pprof and pyroscope enabled together; /debug/pprof/profile will be unavailable while Pyroscope CPU profiler runs")
+	}
-	if *pyroscopeAddr != "" && (*cpuProfilePath != "" || *memProfilePath != "") {
-		baseLogger.Fatal("Cannot use --cpuprofile or --memprofile while Pyroscope is enabled")
-	}
+	// Pyroscope conflicts with CPU profiling only; allow memprofile.
+	if *pyroscopeAddr != "" && *cpuProfilePath != "" {
+		baseLogger.Warn("Ignoring --cpuprofile because Pyroscope is enabled; cannot start Go CPU profiler twice",
+			zap.String("cpuprofile", *cpuProfilePath),
+		)
+	}

Then, update the file-based profiling start to honor the guard (outside the selected lines; place near the current profile.Start):

// Start file-based profiling if flags are set (independent of Pyroscope)
{
    cpuPath := *cpuProfilePath
    if *pyroscopeAddr != "" && cpuPath != "" {
        cpuPath = "" // already warned above
    }
    if cpuPath != "" || *memProfilePath != "" {
        profiler := profile.Start(baseLogger, cpuPath, *memProfilePath)
        defer profiler.Finish()
    }
}

118-123: Restore global profiling rates on shutdown

SetMutexProfileFraction and SetBlockProfileRate are global; restore previous values to avoid side-effects if this code evolves or is embedded.

Apply this diff:

 	if *pyroscopeAddr != "" {
 		// These 2 lines are only required if you're using mutex or block profiling
 		// Read the explanation below for how to set these rates:
-		runtime.SetMutexProfileFraction(5)
-		runtime.SetBlockProfileRate(5)
+		prevMutex := runtime.SetMutexProfileFraction(5)
+		prevBlock := runtime.SetBlockProfileRate(5)
+		defer func() {
+			runtime.SetMutexProfileFraction(prevMutex)
+			runtime.SetBlockProfileRate(prevBlock)
+		}()

127-133: Enrich tags: robust hostname and useful service metadata

Fallback to os.Hostname and add service metadata for better grouping in Pyroscope.

Apply this diff:

 		pyro, err := pyroscope.Start(pyroscope.Config{
 			ApplicationName: "wundergraph.cosmo.router",
 			ServerAddress:   *pyroscopeAddr,
 			Logger:          logger.Sugar(),
-			Tags:            map[string]string{"hostname": os.Getenv("HOSTNAME")},
+			Tags: map[string]string{
+				"hostname": func() string {
+					if h, err := os.Hostname(); err == nil && h != "" {
+						return h
+					}
+					if v := os.Getenv("HOSTNAME"); v != "" {
+						return v
+					}
+					return "unknown"
+				}(),
+				"service":         "@wundergraph/router",
+				"service_version": core.Version,
+			},
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15d8ec7 and 1b6b463.

📒 Files selected for processing (1)
  • router/cmd/main.go (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: Based on direct user experience from endigma (PR author), pyroscope-go's Config.ServerAddress field accepts simple host:port format (e.g., "localhost:4040") in practice, even though the official documentation primarily shows full URL examples with schemes (e.g., "http://localhost:4040"). The library likely handles URL normalization internally.

Applied to files:

  • router/cmd/main.go
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: pyroscope-go's Config.ServerAddress field accepts both full URLs (e.g., "http://localhost:4040") and simple host:port format (e.g., "localhost:4040"). The library handles URL normalization internally, so both formats work correctly.

Applied to files:

  • router/cmd/main.go
🧬 Code graph analysis (1)
router/cmd/main.go (1)
router/pkg/profile/profile.go (1)
  • Start (104-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/cmd/main.go (2)

11-15: LGTM: imports for runtime and pyroscope are appropriate

Needed for mutex/block profiling and Pyroscope integration.


149-151: LGTM: Stop Pyroscope on shutdown

Deferred Stop ensures profiles are flushed.

Comment thread router/cmd/main.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
router/cmd/main.go (5)

32-32: Clarify flag help: accept URL or host:port

Given our learnings that both formats work, make the help reflect it.

-	pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Address to use for pyroscope continuous profiling. e.g. http://localhost:4040")
+	pyroscopeAddr   = flag.String("pyroscope-addr", os.Getenv("PYROSCOPE_ADDR"), "Pyroscope server address (URL or host:port). e.g. http://localhost:4040 or localhost:4040")

99-101: Consider allowing pprof alongside Pyroscope (non-CPU endpoints) or proxy via Pyroscope’s pprof handler

Fatal exit is strict and blocks heap/goroutine profiles via pprof when Pyroscope is enabled. Prefer:

  • Allow pprof server while Pyroscope runs (CPU profile endpoint will naturally error “already in use”), or
  • Mount github.com/grafana/pyroscope-go/http/pprof handlers to serve CPU profiles without conflict.

If you want to soften behavior, change Fatal to Warn to keep the process running.

-	if *pprofListenAddr != "" && *pyroscopeAddr != "" {
-		baseLogger.Fatal("Cannot use pprof and pyroscope at the same time")
-	}
+	if *pprofListenAddr != "" && *pyroscopeAddr != "" {
+		baseLogger.Warn("pprof and Pyroscope enabled together; CPU pprof endpoint will be unavailable/conflict")
+	}

Follow-up (optional): integrate pyroscope-go/http/pprof into your pprof server to proxy CPU profiles while Pyroscope is active.


103-105: Permit memprofile with Pyroscope; gate only cpuprofile

Memprofile doesn’t conflict with Pyroscope’s CPU profiler. Keep memprofile working; ignore only cpuprofile when Pyroscope is enabled.

-	if *pyroscopeAddr != "" && (*cpuProfilePath != "" || *memProfilePath != "") {
-		baseLogger.Fatal("Cannot use --cpuprofile or --memprofile while Pyroscope is enabled")
-	}
+	// Ignore only CPU file profiling when Pyroscope is enabled; memprofile is safe.
+	if *pyroscopeAddr != "" && *cpuProfilePath != "" {
+		baseLogger.Warn("Ignoring --cpuprofile because Pyroscope is enabled; the Go CPU profiler cannot be started twice",
+			zap.String("cpuprofile", *cpuProfilePath),
+		)
+		*cpuProfilePath = ""
+	}

Also adjust the profile.Start call to respect the cleared CPU path:

// Start file-based profiling if flags are set
profiler := profile.Start(baseLogger, *cpuProfilePath, *memProfilePath)
defer profiler.Finish()

122-130: Improve hostname tag robustness (use os.Hostname with env fallback)

Avoid empty hostname when HOSTNAME isn’t set.

-			Tags:            map[string]string{"hostname": os.Getenv("HOSTNAME")},
+			Tags: map[string]string{
+				"hostname": func() string {
+					if h, err := os.Hostname(); err == nil && h != "" {
+						return h
+					}
+					return os.Getenv("HOSTNAME")
+				}(),
+			},

125-146: Include server address in start error log

Helps diagnostics when the address is wrong/unreachable.

-		if err != nil {
-			logger.Error("failed to start pyroscope", zap.Error(err))
-		}
+		if err != nil {
+			logger.Error("failed to start pyroscope",
+				zap.Error(err),
+				zap.String("addr", *pyroscopeAddr),
+			)
+		}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 19aab01 and 3a2cf54.

📒 Files selected for processing (1)
  • router/cmd/main.go (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: Based on direct user experience from endigma (PR author), pyroscope-go's Config.ServerAddress field accepts simple host:port format (e.g., "localhost:4040") in practice, even though the official documentation primarily shows full URL examples with schemes (e.g., "http://localhost:4040"). The library likely handles URL normalization internally.

Applied to files:

  • router/cmd/main.go
📚 Learning: 2025-09-19T15:08:02.998Z
Learnt from: endigma
PR: wundergraph/cosmo#2223
File: router/cmd/main.go:32-33
Timestamp: 2025-09-19T15:08:02.998Z
Learning: pyroscope-go's Config.ServerAddress field accepts both full URLs (e.g., "http://localhost:4040") and simple host:port format (e.g., "localhost:4040"). The library handles URL normalization internally, so both formats work correctly.

Applied to files:

  • router/cmd/main.go
🧬 Code graph analysis (1)
router/cmd/main.go (1)
router/pkg/profile/profile.go (1)
  • Start (104-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (5)
router/cmd/main.go (5)

11-11: Import of runtime is appropriate

Needed for SetMutexProfileFraction and SetBlockProfileRate.


15-15: Pyroscope dependency import looks good

Matches the usage below.


34-36: Profiling flags: LGTM

Names and help strings are clear.


118-121: Mutex/block profiling rates are set correctly

Enabling these only when Pyroscope is active is appropriate.


147-150: Defer Stop on non-nil profiler: good

Ensures flush on shutdown.

@endigma endigma force-pushed the jesse/eng-8055-high-subscription-load-causes-deadlocks branch from 3a2cf54 to a09f7dc Compare September 23, 2025 15:57
@endigma endigma merged commit ff8bfa3 into main Sep 24, 2025
70 of 86 checks passed
@endigma endigma deleted the jesse/eng-8055-high-subscription-load-causes-deadlocks branch September 24, 2025 11:44
@Noroth Noroth mentioned this pull request Sep 30, 2025
5 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Oct 2, 2025
5 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Feb 11, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants