
fix(federation): fixed a QP panic saying "would create a cycle" #5797

Merged
goto-bus-stop merged 3 commits into dev from duckki/router-546-part-2 on Aug 12, 2024


@duckki
Contributor

duckki commented Aug 9, 2024

The merge_fetches_to_same_subgraph_and_same_inputs function must merge descendant nodes into ancestor nodes, not the other way around, to avoid creating cycles.

This PR fixes that by topologically sorting the fetch dependency graph's nodes before collecting the nodes to merge, so merges are always performed in topological order.
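A minimal, std-only Rust sketch of the idea, assuming a simplified graph in which each node carries a single merge key standing in for "same subgraph and same inputs". `Graph`, `toposort`, and `collect_merges` are hypothetical names for illustration, not the router's actual types or API, and the real merge step performs checks this sketch omits. Visiting nodes in topological order makes the first node seen for a key the merge target, so a node is never merged into one of its own descendants.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical, simplified fetch dependency graph. Each node carries a merge
/// key standing in for "same subgraph and same inputs"; edges are (parent, child).
struct Graph {
    keys: Vec<&'static str>,
    edges: Vec<(usize, usize)>,
}

/// Kahn's algorithm: returns the nodes in topological order (parents before
/// children), or None if the graph contains a cycle.
fn toposort(g: &Graph) -> Option<Vec<usize>> {
    let n = g.keys.len();
    let mut indegree = vec![0usize; n];
    let mut children: Vec<Vec<usize>> = vec![Vec::new(); n];
    for &(parent, child) in &g.edges {
        indegree[child] += 1;
        children[parent].push(child);
    }
    // Start from the roots (nodes with no incoming edges).
    let mut queue: VecDeque<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(node) = queue.pop_front() {
        order.push(node);
        for &child in &children[node] {
            indegree[child] -= 1;
            if indegree[child] == 0 {
                queue.push_back(child);
            }
        }
    }
    // Any node never reached lies on, or downstream of, a cycle.
    if order.len() == n {
        Some(order)
    } else {
        None
    }
}

/// Collects (target, source) pairs meaning "merge `source` into `target`".
/// Because nodes are visited in topological order, the target (the first node
/// seen for a key) is never a descendant of a source merged into it.
fn collect_merges(g: &Graph) -> Option<Vec<(usize, usize)>> {
    let mut first_by_key: HashMap<&str, usize> = HashMap::new();
    let mut merges = Vec::new();
    for node in toposort(g)? {
        match first_by_key.get(g.keys[node]).copied() {
            Some(target) => merges.push((target, node)),
            None => {
                first_by_key.insert(g.keys[node], node);
            }
        }
    }
    Some(merges)
}
```

For example, with edges 2 → 1 → 0 and nodes 0 and 2 sharing a key, iterating by node index would pick node 0 (the descendant) as the target, whereas `collect_merges` returns `Some(vec![(2, 0)])`, merging node 0 into its ancestor, node 2.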


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible[^1]
  • Documentation[^2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

@github-actions
Contributor

github-actions bot commented Aug 9, 2024

@duckki, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf

router-perf bot commented Aug 9, 2024

CI performance tests

  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

Comment on lines -1454 to +1456
- for node_index in self.graph.node_indices() {
+ let sorted_nodes = petgraph::algo::toposort(&self.graph, None)
+     .map_err(|_| FederationError::internal("Failed to sort nodes due to cycle(s)"))?;
+ for node_index in sorted_nodes {
Member


non-blocking comment: what's the performance cost of doing a topological sort here, do you know?

Member


I tried a few queries with known-big fetch dependency graphs, and this function doesn't really show up in a flame graph. I think that's because it only runs once, as the very last step, when the graph has already been reduced as much as it can be.

Member


excellent excellent!!

Contributor Author


Thanks for that analysis. I didn't observe any noticeable slowdown during corpus comparison runs, either.
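The reviewed hunk treats a cycle as an internal error, since the fetch dependency graph is expected to be acyclic by this point. A hedged std-only sketch of that pattern follows, using a DFS-based topological sort rather than petgraph; `Cycle`, `InternalError`, and `sorted_nodes` are hypothetical stand-ins for petgraph's `Cycle` error and `FederationError::internal`, not their real definitions. Like Kahn's algorithm, a DFS sort visits each node and edge once (O(V + E)), which is consistent with the sort not standing out in a flame graph.

```rust
/// Hypothetical stand-in for petgraph's `Cycle` error: a node found on a cycle.
#[derive(Debug, PartialEq)]
struct Cycle(usize);

/// Hypothetical stand-in for `FederationError::internal(...)`.
#[derive(Debug, PartialEq)]
struct InternalError(String);

#[derive(Clone, Copy)]
enum Mark {
    Unvisited,
    InProgress,
    Done,
}

/// Depth-first visit; pushes `node` after all of its descendants (post-order).
/// Recursion keeps the sketch short but could overflow on very deep graphs.
fn visit(
    node: usize,
    children: &[Vec<usize>],
    marks: &mut [Mark],
    post: &mut Vec<usize>,
) -> Result<(), Cycle> {
    match marks[node] {
        Mark::Done => return Ok(()),
        // Reaching a node still on the DFS stack means a back edge, i.e. a cycle.
        Mark::InProgress => return Err(Cycle(node)),
        Mark::Unvisited => {}
    }
    marks[node] = Mark::InProgress;
    for &child in &children[node] {
        visit(child, children, marks, post)?;
    }
    marks[node] = Mark::Done;
    post.push(node);
    Ok(())
}

/// Returns the nodes in topological order, or the first cycle found.
fn toposort(n: usize, edges: &[(usize, usize)]) -> Result<Vec<usize>, Cycle> {
    let mut children: Vec<Vec<usize>> = vec![Vec::new(); n];
    for &(parent, child) in edges {
        children[parent].push(child);
    }
    let mut marks = vec![Mark::Unvisited; n];
    let mut post = Vec::with_capacity(n);
    for node in 0..n {
        visit(node, &children, &mut marks, &mut post)?;
    }
    // Reverse post-order puts every parent before its children.
    post.reverse();
    Ok(post)
}

/// Mirrors the hunk's error mapping: a cycle is an invariant violation.
fn sorted_nodes(n: usize, edges: &[(usize, usize)]) -> Result<Vec<usize>, InternalError> {
    toposort(n, edges)
        .map_err(|_| InternalError("Failed to sort nodes due to cycle(s)".to_string()))
}
```

For edges 2 → 1 → 0, `sorted_nodes(3, &[(2, 1), (1, 0)])` returns `Ok(vec![2, 1, 0])`; adding a back edge such as 0 → 2 makes it return the internal error instead.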

goto-bus-stop merged commit b62865e into dev on Aug 12, 2024
goto-bus-stop deleted the duckki/router-546-part-2 branch on August 12, 2024 09:30
duckki added a commit that referenced this pull request Aug 13, 2024
duckki added a commit that referenced this pull request Aug 13, 2024

3 participants