fix: improve latency metrics reliability#2540

Closed
alepane21 wants to merge 9 commits into main from
ale/eng-8915-router-router_http_request_duration-and-other-request

Conversation

@alepane21
Contributor

@alepane21 alepane21 commented Feb 20, 2026

When a query plan executed more than one subgraph fetch in parallel, the timing of the longest fetch was applied to every subgraph's metric.
Also, after the last engine release was merged, a test became flaky.

Summary by CodeRabbit

  • Tests

    • Added a parallelized integration test validating Prometheus metrics for parallel subgraph request durations, histogram counts/sums, and relative timing between subgraphs.
    • Made an existing error-handling test assertion JSON-aware for more robust response comparisons.
  • Bug Fixes

    • Improved latency reporting by using per-subgraph fetch timing when available, producing more accurate metrics for success and error paths.

Checklist

@coderabbitai
Contributor

coderabbitai Bot commented Feb 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f4baa8 and 3928ae1.

📒 Files selected for processing (1)
  • router-tests/error_handling_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/error_handling_test.go

Walkthrough

Adds a Prometheus integration test for parallel subgraph durations, adjusts engine loader hooks to use per‑subgraph fetch timing for logging and metrics, and makes one test assertion JSON‑structure aware.

Changes

Cohort / File(s) Summary
Prometheus test
router-tests/prometheus_parallel_subgraph_metrics_test.go
New parallelized integration test that registers a manual Prometheus reader/registry, configures subgraph delays, issues a GraphQL request, and asserts router_http_request_duration_milliseconds histogram samples and sums per wg_subgraph_name.
Metrics calculation change
router/core/engine_loader_hooks.go
Introduce subgraphFetchLatency (prefer per‑fetch FetchTiming when present) and use it for logging and MeasureLatency calls in both success and error paths, replacing previous use of total latency.
Test assertion robustness
router-tests/error_handling_test.go
Replace strict string equality with require.JSONEq for a GraphQL response assertion to compare JSON structures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning (docstring coverage is 0.00%, below the required threshold of 80.00%; resolution: write docstrings for the functions missing them to satisfy the coverage threshold)
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed (the title 'fix: improve latency metrics reliability' accurately describes the main change: fixing how latency metrics are measured for parallel subgraph requests by using fetch-specific timings instead of total latency)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Feb 20, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-4bbd0249196980e434e7ae7a108ca50eb09c4929-nonroot

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
router-tests/prometheus_parallel_subgraph_metrics_test.go (1)

71-73: Consider documenting the tolerance values.

The magic numbers (250ms tolerance, 400ms minimum gap) work but their rationale isn't immediately clear. A brief comment would help future maintainers understand the expected timing bounds.

📝 Suggested documentation
 		employeesDurationMs := employeesHistogram.GetSampleSum()
 		productsDurationMs := productsHistogram.GetSampleSum()

+		// Products should complete close to productsDelay (within 250ms tolerance for test overhead)
 		require.Greater(t, productsDurationMs, float64(productsDelay.Milliseconds()-250))
+		// Employees (no delay) should complete much faster than Products
 		require.Less(t, employeesDurationMs, float64(productsDelay.Milliseconds()/2))
+		// Verify meaningful separation between parallel subgraph latencies
 		require.Greater(t, productsDurationMs-employeesDurationMs, 400.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/prometheus_parallel_subgraph_metrics_test.go` around lines 71 -
73, The assertions in the test using magic numbers (the 250ms tolerance and
400ms minimum gap) lack explanation; add a brief comment above the three
assertions explaining why those tolerances were chosen (e.g., expected
scheduling/jitter buffer, processing overhead, and intended concurrency gap
between products and employees requests) and reference the variables used:
productsDurationMs, employeesDurationMs, and productsDelay.Milliseconds() so
future maintainers understand the relationship and how the values were derived.

@codecov

codecov Bot commented Feb 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.22%. Comparing base (43aa77e) to head (3928ae1).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2540   +/-   ##
=======================================
  Coverage   62.21%   62.22%           
=======================================
  Files         241      241           
  Lines       25499    25503    +4     
=======================================
+ Hits        15864    15868    +4     
- Misses       8297     8298    +1     
+ Partials     1338     1337    -1     
Files with missing lines Coverage Δ
router/core/engine_loader_hooks.go 89.22% <100.00%> (+0.26%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@alepane21 alepane21 changed the title refactor(router): improve latency metrics handling in engine loader h… refactor(router): improve latency metrics affidability Feb 23, 2026
@alepane21 alepane21 marked this pull request as ready for review February 23, 2026 15:59
@alepane21 alepane21 changed the title refactor(router): improve latency metrics affidability fix: improve latency metrics affidability Feb 23, 2026
@alepane21 alepane21 changed the title fix: improve latency metrics affidability fix: improve latency metrics dependability Feb 24, 2026
@alepane21 alepane21 changed the title fix: improve latency metrics dependability fix: improve latency metrics reliability Feb 24, 2026
alepane21 and others added 3 commits February 24, 2026 11:43
…request_duration-and-other-request' into ale/eng-8915-router-router_http_request_duration-and-other-request
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router/core/engine_loader_hooks.go`:
- Around line 177-187: The current code sets subgraphFetchLatency from
ctx.Value(rcontext.FetchTimingKey) and assigns it to
exprCtx.Subgraph.Request.ClientTrace.FetchDuration only when a fetchTiming
exists, but after the fallback path (when subgraphFetchLatency = latency) the
FetchDuration remains zero; update the fallback branch so that after assigning
subgraphFetchLatency = latency you also set
exprCtx.Subgraph.Request.ClientTrace.FetchDuration = subgraphFetchLatency (i.e.,
ensure exprCtx.Subgraph.Request.ClientTrace.FetchDuration is always assigned to
subgraphFetchLatency whether it came from fetchTiming or the latency fallback).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e050ea and 8f4baa8.

📒 Files selected for processing (1)
  • router/core/engine_loader_hooks.go

Comment thread router-tests/prometheus_parallel_subgraph_metrics_test.go
employeesDurationMs := employeesHistogram.GetSampleSum()
productsDurationMs := productsHistogram.GetSampleSum()

require.Greater(t, productsDurationMs, float64(productsDelay.Milliseconds()-250))
Contributor

That's a source of flaky tests.

@alepane21 alepane21 closed this Feb 24, 2026
@alepane21
Contributor Author

This is not the issue: OnFinished should be called right after the subgraph fetch completes!
