fix: improve latency metrics reliability#2540

Closed
alepane21 wants to merge 9 commits into main from
ale/eng-8915-router-router_http_request_duration-and-other-request

Conversation

@alepane21
Contributor

@alepane21 alepane21 commented Feb 20, 2026

When a query plan executed more than one subgraph fetch in parallel, the timing of the longest fetch was applied to every subgraph's metric.
Also, after the last engine release was merged, a test became flaky.

Summary by CodeRabbit

  • Tests

    • Added a parallelized integration test validating Prometheus metrics for parallel subgraph request durations, histogram counts/sums, and relative timing between subgraphs.
    • Made an existing error-handling test assertion JSON-aware for more robust response comparisons.
  • Bug Fixes

    • Improved latency reporting by using per-subgraph fetch timing when available, producing more accurate metrics for success and error paths.

Checklist

@coderabbitai
Contributor

coderabbitai Bot commented Feb 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f4baa8 and 3928ae1.

📒 Files selected for processing (1)
  • router-tests/error_handling_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/error_handling_test.go

Walkthrough

Adds a Prometheus integration test for parallel subgraph durations, adjusts engine loader hooks to use per‑subgraph fetch timing for logging and metrics, and makes one test assertion JSON‑structure aware.

Changes

Cohort / File(s) Summary
Prometheus test
router-tests/prometheus_parallel_subgraph_metrics_test.go
New parallelized integration test that registers a manual Prometheus reader/registry, configures subgraph delays, issues a GraphQL request, and asserts router_http_request_duration_milliseconds histogram samples and sums per wg_subgraph_name.
Metrics calculation change
router/core/engine_loader_hooks.go
Introduce subgraphFetchLatency (prefer per‑fetch FetchTiming when present) and use it for logging and MeasureLatency calls in both success and error paths, replacing previous use of total latency.
Test assertion robustness
router-tests/error_handling_test.go
Replace strict string equality with require.JSONEq for a GraphQL response assertion to compare JSON structures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning (docstring coverage is 0.00%, below the required threshold of 80.00%; resolution: write docstrings for the functions missing them to satisfy the coverage threshold)
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed (the title 'fix: improve latency metrics reliability' accurately describes the main change: fixing how latency metrics are measured for parallel subgraph requests by using fetch-specific timings instead of total latency)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Feb 20, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-4bbd0249196980e434e7ae7a108ca50eb09c4929-nonroot

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
router-tests/prometheus_parallel_subgraph_metrics_test.go (1)

71-73: Consider documenting the tolerance values.

The magic numbers (250ms tolerance, 400ms minimum gap) work but their rationale isn't immediately clear. A brief comment would help future maintainers understand the expected timing bounds.

📝 Suggested documentation
 		employeesDurationMs := employeesHistogram.GetSampleSum()
 		productsDurationMs := productsHistogram.GetSampleSum()

+		// Products should complete close to productsDelay (within 250ms tolerance for test overhead)
 		require.Greater(t, productsDurationMs, float64(productsDelay.Milliseconds()-250))
+		// Employees (no delay) should complete much faster than Products
 		require.Less(t, employeesDurationMs, float64(productsDelay.Milliseconds()/2))
+		// Verify meaningful separation between parallel subgraph latencies
 		require.Greater(t, productsDurationMs-employeesDurationMs, 400.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/prometheus_parallel_subgraph_metrics_test.go` around lines 71 -
73, The assertions in the test using magic numbers (the 250ms tolerance and
400ms minimum gap) lack explanation; add a brief comment above the three
assertions explaining why those tolerances were chosen (e.g., expected
scheduling/jitter buffer, processing overhead, and intended concurrency gap
between products and employees requests) and reference the variables used:
productsDurationMs, employeesDurationMs, and productsDelay.Milliseconds() so
future maintainers understand the relationship and how the values were derived.

@codecov

codecov Bot commented Feb 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.22%. Comparing base (43aa77e) to head (3928ae1).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2540   +/-   ##
=======================================
  Coverage   62.21%   62.22%           
=======================================
  Files         241      241           
  Lines       25499    25503    +4     
=======================================
+ Hits        15864    15868    +4     
- Misses       8297     8298    +1     
+ Partials     1338     1337    -1     
Files with missing lines Coverage Δ
router/core/engine_loader_hooks.go 89.22% <100.00%> (+0.26%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@alepane21 alepane21 changed the title refactor(router): improve latency metrics handling in engine loader h… refactor(router): improve latency metrics affidability Feb 23, 2026
@alepane21 alepane21 marked this pull request as ready for review February 23, 2026 15:59
@alepane21 alepane21 changed the title refactor(router): improve latency metrics affidability fix: improve latency metrics affidability Feb 23, 2026
@alepane21 alepane21 changed the title fix: improve latency metrics affidability fix: improve latency metrics dependability Feb 24, 2026
@alepane21 alepane21 changed the title fix: improve latency metrics dependability fix: improve latency metrics reliability Feb 24, 2026
alepane21 and others added 3 commits February 24, 2026 11:43
…request_duration-and-other-request' into ale/eng-8915-router-router_http_request_duration-and-other-request
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router/core/engine_loader_hooks.go`:
- Around line 177-187: The current code sets subgraphFetchLatency from
ctx.Value(rcontext.FetchTimingKey) and assigns it to
exprCtx.Subgraph.Request.ClientTrace.FetchDuration only when a fetchTiming
exists, but after the fallback path (when subgraphFetchLatency = latency) the
FetchDuration remains zero; update the fallback branch so that after assigning
subgraphFetchLatency = latency you also set
exprCtx.Subgraph.Request.ClientTrace.FetchDuration = subgraphFetchLatency (i.e.,
ensure exprCtx.Subgraph.Request.ClientTrace.FetchDuration is always assigned to
subgraphFetchLatency whether it came from fetchTiming or the latency fallback).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e050ea and 8f4baa8.

📒 Files selected for processing (1)
  • router/core/engine_loader_hooks.go

Comment thread router-tests/prometheus_parallel_subgraph_metrics_test.go
employeesDurationMs := employeesHistogram.GetSampleSum()
productsDurationMs := productsHistogram.GetSampleSum()

require.Greater(t, productsDurationMs, float64(productsDelay.Milliseconds()-250))
Contributor

That's a source of flaky tests.

@alepane21 alepane21 closed this Feb 24, 2026
@alepane21
Contributor Author

This is not the issue: OnFinished should be called right after the subgraph fetch completes!
