Skip to content

feat(router): add a timeout for on_receive_event hooks#2329

Merged
dkorittki merged 17 commits intotopic/streams-v1from
dominik/eng-8459-add-a-timeout-for-onreceiveevent-hooks
Nov 13, 2025
Merged

feat(router): add a timeout for on_receive_event hooks#2329
dkorittki merged 17 commits intotopic/streams-v1from
dominik/eng-8459-add-a-timeout-for-onreceiveevent-hooks

Conversation

@dkorittki
Copy link
Copy Markdown
Contributor

@dkorittki dkorittki commented Nov 12, 2025

Checklist

Motivation and Context

I use an synonym example here to demonstrate what this change is all about, which hopefully accurately describes everything.

The Great Ice Cream Delivery Crisis: A Timeout Tale

Imagine an ice cream factory (Kafka) that's churning out delicious frozen treats (events) faster than you can say "brain freeze." These ice creams need to get to hungry customers (subscription clients) via our fleet of deliveryman (the router's pubsub system).

Each deliveryman has to pass through a traffic light (a users potentially long running hook) on their way to the customer. Usually it's green and they zoom right through. But sometimes? That light turns red. Really, REALLY red. Like, "stuck-for-5-minutes-questioning-your-life-choices" red.

Let's assume you have 3 deliveryman and 3 customers. The first two deliveryman make it over the green light, the third one has to wait in front of the red light (hook is blocking for whatever reason). In the old implementation the two deliveryman will wait for the third deliveryman before picking up new ice-cream (events) from the ice cream factory (Kafka). This makes the ice cream delivery as slow as the weakest of the three.

Solution idea

After this change the first two deliveryman will not wait for long (until timeout exceeds) for their third colleague before they will pick up new ice cream (new events from broker), antisocially but efficiently so.

Why even use a timeout

I need to break out of the ice cream example for a bit here.

The reason to use a timeout instead of simply moving on immediately when a semaphore slot is free is because I try to honor event ordering and only break it if we exceed a timeout. We cannot stop a running goroutine, we can only abandon it. But it will still eventually progress and update a clients subscription. This could mean an event #2 could make it fast to a subscriber than event #1, if the hook processing it takes too long. With a timeout waiting for long running hooks we can somewhat mitigate this to a degree.

The timeout is configurable via config: events.subscription_hooks.on_receive_events.handler_timeout

Always honor semaphore limit

If all deliveryman are stuck at the traffic light then we will not pick up new ice cream from the factory. That was a problem before this change and is still one after this change. But the only way to make it work is to spawn new deliveryman even if the limit (semaphore limit) is already exceeded. I did not opt for this and instead always honor the semaphore limit. There will never be more hook goroutines, ÄH deliveryman, than the configured semaphore limit. Else we would potentially run into a situation, where we have no control over the amount of routines running at a time on the router. If the limit is too low, a user can set it higher via configuration events.subscription_hooks.on_receive_events.max_concurrent_handlers.

Let the hook developers know their time is over

Okay that sounds more dramatic than it really is. What I mean by that is we need to let hook developers know that their hook execution must stop now. We as the pubsub system cannot stop the routine from the outside, and have no control of the inside of the hook. But we can let the hook know to stop voluntarily via timeout contexts. The hooks context now has ctx.Context() which is timed with the routers timeout mechanism. Once we stop waiting for the routine, the ctx.Context() is also cancelled. This gives hook developers the option to stop their routines and avoid running abandoned in the background.

Technical changes

  • Semaphore is now global for each call to Update() on the event subscription updater, allowing free slots to be used for the next batch of events
  • Hook configuration types have been streamlined, which allows us to easily add config parameters to hook behaviour in the future
  • I also removed error logging when a hook returns an error, leaving this up to the hook developer (unrelated, but a small change)
  • I also added some context to loggers like a component key for better logging (unrelated, but small change)
  • Now we either wait for all updater routines to finish or a configurable timeout, whatever happens first. Then we will close the Update() function and receive the next batch of events.
  • Pass a timeout context to the hooks via ctx.Context()
  • Config parameter change: events.subscription_hooks.on_receive_events.max_concurrent_handlers (new path)
  • New Config parameter: events.subscription_hooks.on_receive_events.handler_timeout

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced handler timeout configuration for subscription event processing (5-second default).
    • Refined concurrent event handler configuration structure for improved management.
  • Bug Fixes

    • Enhanced out-of-order event delivery handling with timeout protection during processing delays.
  • Tests

    • Expanded test coverage for timeout-based event delivery scenarios.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 12, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR restructures the subscription hooks system by replacing flat hook slices with nested hook structs, introduces a new timeout configuration for event receive handlers alongside concurrency controls, and implements semaphore-based concurrency limiting with deadline-based context in the subscription event updater to manage handler execution timing and ordering.

Changes

Cohort / File(s) Summary
Config structure changes
router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/config/fixtures/full.yaml, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json
Replaced scalar field MaxConcurrentEventReceiveHandlers with nested object OnReceiveEvents containing MaxConcurrentHandlers and HandlerTimeout (new field with 5s default). Alters config parsing and merging structure across YAML, JSON schema, and test fixtures.
Hook struct definitions
router/pkg/pubsub/datasource/hooks.go
Refactored Hooks type from flat slices to nested structs: SubscriptionOnStart, OnPublishEvents, and OnReceiveEvents now wrapped as SubscriptionOnStartHooks, OnPublishEventsHooks, and OnReceiveEventsHooks respectively. New OnReceiveEventsHooks type includes MaxConcurrentHandlers and Timeout fields.
Core router wiring
router/core/router.go, router/core/router_config.go, router/core/factoryresolver.go
Updated subscription hooks initialization and configuration wiring to use nested .handlers fields and map config values to timeout/concurrency settings. Refactored module initialization to construct hook structs and propagate concurrency/timeout parameters through PubSub configuration.
Event updater with concurrency control
router/pkg/pubsub/datasource/subscription_event_updater.go, router/pkg/pubsub/datasource/subscription_event_updater_test.go
Introduced semaphore-based concurrency control and deadline-based context for batch processing. Removed error deduplication logic. Constructor NewSubscriptionEventUpdater now handles concurrency limit computation. updateSubscription() signature simplified; timeout handling logs warnings on out-of-order event delivery.
PubSub provider hook handling
router/pkg/pubsub/datasource/pubsubprovider.go, router/pkg/pubsub/datasource/pubsubprovider_test.go
Updated hook iteration to access .Handlers field within wrapped structs. All test initializations changed from direct slices to wrapped structs (e.g., OnPublishEventsHooks{ Handlers: [...] }).
Subscription datasource hooks
router/pkg/pubsub/datasource/subscription_datasource.go, router/pkg/pubsub/datasource/subscription_datasource_test.go
Updated hook iteration to use .Handlers field. Introduced contextualized logger with component identifiers and provider details for hook execution. Test access patterns updated to reference nested hook structures.
Hook context and logging enrichment
router/core/subscriptions_modules.go
Added Context() context.Context method to StreamReceiveEventHandlerContext interface. Enhanced hook contexts with component-specific loggers including provider metadata. Propagated context field into pubSubStreamReceiveEventHookContext.
Subscription test scenarios
router-tests/modules/stream_receive_test.go
Removed two error deduplication test scenarios. Introduced new test "Test timeout mechanism allows out-of-order event delivery" validating delayed event processing with timeout/hook execution verification. Updated configuration to use OnReceiveEventsConfiguration with nested fields. Updated imports to include encoding/json.
Configuration example
router/demo.config.yaml
Added new subscription_hooks.on_receive_events block with max_concurrent_handlers: 97 and handler_timeout: 8800.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • Concurrency control logic (subscription_event_updater.go): Semaphore-based slotting with deadline-based context and timeout handling for out-of-order event delivery—verify correctness of goroutine lifecycle, resource cleanup, and panic recovery.
  • Public API changes across hook types: Ensure all consuming code paths correctly access .Handlers field and that the wrapped structs integrate seamlessly with serialization/configuration merging.
  • Hook struct propagation: Trace the threading of MaxConcurrentHandlers and Timeout from config through factoryresolver.go into PubSub OnReceiveEventsHooks to confirm parameter flow consistency.
  • Context propagation: Verify that StreamReceiveEventHandlerContext.Context() is consistently implemented and used throughout hook execution paths.
  • Logging enrichment: Confirm component identifiers and provider metadata are correctly attached to all hook contexts without introducing side effects.

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.30% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main change: adding a timeout feature for on_receive_event hooks, which aligns with the primary objective and the substantial configuration and implementation changes throughout the codebase.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dkorittki dkorittki changed the title Dominik/eng 8459 add a timeout for onreceiveevent hooks feat(router): add a timeout for on_receive_event hooks Nov 12, 2025
@dkorittki dkorittki marked this pull request as ready for review November 12, 2025 22:38
@dkorittki
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 12, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 12, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-74f86f6a7fdf462652c917be695afa6bf5b1ed54-nonroot

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
router/pkg/pubsub/datasource/subscription_datasource.go (1)

76-85: Bug: returning the wrong error variable; unmarshal failures get swallowed.

If SubscriptionEventConfiguration(input) fails, the code returns err (likely nil) instead of errConf, masking the error and continuing.

Apply this minimal fix and unmarshal once:

- for _, fn := range s.hooks.SubscriptionOnStart.Handlers {
-   conf, errConf := s.SubscriptionEventConfiguration(input)
-   if errConf != nil {
-     return err
-   }
-   err = fn(ctx, conf, s.eventBuilder)
+ conf, errConf := s.SubscriptionEventConfiguration(input)
+ if errConf != nil {
+   return errConf
+ }
+ for _, fn := range s.hooks.SubscriptionOnStart.Handlers {
+   err = fn(ctx, conf, s.eventBuilder)
    if err != nil {
      return err
    }
  }
router/pkg/pubsub/datasource/subscription_event_updater.go (1)

103-120: Stop clobbering hook errors

Inside updateSubscription the loop assigns err = hooks[i](...) every iteration. If an earlier hook returns an error and a later hook succeeds, the final assignment overwrites the failure and the subscription stays open. That is a functional regression—any hook signalling an error should still force the close path. Capture the first non-nil error (and stop invoking the remaining hooks) so we preserve the contract.

-	var err error
-	for i := range hooks {
-		events, err = hooks[i](ctx, s.subscriptionEventConfiguration, s.eventBuilder, events)
+	var err error
+	for i := range hooks {
+		var hookErr error
+		events, hookErr = hooks[i](ctx, s.subscriptionEventConfiguration, s.eventBuilder, events)
 		events = slices.DeleteFunc(events, func(event StreamEvent) bool {
 			return event == nil
 		})
+		if hookErr != nil {
+			err = hookErr
+			break
+		}
 	}
🧹 Nitpick comments (2)
router/pkg/pubsub/datasource/pubsubprovider.go (1)

44-62: OnPublishEvents .Handlers migration looks correct.

Loop and length checks now target .Handlers; behavior preserved.

Consider guarding p.Logger in error paths or defaulting to zap.NewNop() in NewPubSubProvider to avoid a possible nil deref if a nil logger is ever passed.

Also applies to: 93-105

router/pkg/config/config.schema.json (1)

2313-2330: Add duration validation for handler_timeout to match schema conventions.

Many duration fields here use either "duration" with min or "format": "go-duration". handler_timeout currently lacks validation.

Apply this diff:

             "handler_timeout": {
               "type": "string",
               "description": "The amount of time that OnReceiveEvents handlers can run in total for a single batch of events. Specify as a duration string (e.g., '5s', '1m', '500ms').",
-              "default": "5s"
+              "default": "5s",
+              "duration": {
+                "minimum": "1ms"
+              }
             }

Optional: if "0" should be disallowed (router defaults to 5s on 0), keep the minimum > 0s as above; if "0" should disable timeout, document it explicitly and accept "0s".

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 405f64c and d0d4ef2.

📒 Files selected for processing (18)
  • router-tests/modules/stream_receive_test.go (3 hunks)
  • router/core/factoryresolver.go (3 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (1 hunks)
  • router/core/subscriptions_modules.go (6 hunks)
  • router/demo.config.yaml (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/fixtures/full.yaml (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/pubsub/datasource/hooks.go (2 hunks)
  • router/pkg/pubsub/datasource/pubsubprovider.go (2 hunks)
  • router/pkg/pubsub/datasource/pubsubprovider_test.go (12 hunks)
  • router/pkg/pubsub/datasource/subscription_datasource.go (2 hunks)
  • router/pkg/pubsub/datasource/subscription_datasource_test.go (7 hunks)
  • router/pkg/pubsub/datasource/subscription_event_updater.go (4 hunks)
  • router/pkg/pubsub/datasource/subscription_event_updater_test.go (18 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/pkg/config/config.go
  • router/core/router.go
  • router/pkg/pubsub/datasource/subscription_event_updater.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.

Applied to files:

  • router/core/router_config.go
  • router/pkg/pubsub/datasource/subscription_event_updater_test.go
  • router/core/router.go
  • router-tests/modules/stream_receive_test.go
  • router/pkg/pubsub/datasource/subscription_event_updater.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/pubsub/datasource/subscription_event_updater_test.go
  • router-tests/modules/stream_receive_test.go
📚 Learning: 2025-07-29T08:19:55.720Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2088
File: demo/pkg/subgraphs/projects/src/main_test.go:0-0
Timestamp: 2025-07-29T08:19:55.720Z
Learning: In Go testing, t.Fatal, t.FailNow, t.Skip* and similar methods cannot be called from goroutines other than the main test goroutine, as they will cause a runtime panic. Use channels, t.Error*, or other synchronization mechanisms to communicate errors from goroutines back to the main test thread.

Applied to files:

  • router-tests/modules/stream_receive_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
🔇 Additional comments (7)
router/pkg/config/testdata/config_defaults.json (1)

300-304: Defaults align with router wiring.

MaxConcurrentHandlers=100 and HandlerTimeout serialized to ns are consistent with code defaults.

router/pkg/pubsub/datasource/subscription_datasource.go (1)

49-56: Good: contextualized logger for updater.

Adds clear fields (component/provider_id/provider_type/field_name) and passes per-request logger.

router/pkg/config/testdata/config_full.json (1)

646-649: OnReceiveEvents fixture looks consistent.

Names and default values match schema and router defaults.

router/pkg/config/fixtures/full.yaml (1)

334-336: Fixture matches new schema (names and types).

String duration and concurrency key look correct.

router/core/router.go (3)

256-262: Sane defaults for onReceiveEvents.

Zero-value guard sets maxConcurrentHandlers=100 and timeout=5s.


686-696: Updated hook registration to new handler slices.

Migration to .handlers is correct for onStart/onPublishEvents/onReceiveEvents.


2145-2147: Config wiring maps to new nested fields.

onReceiveEvents.{MaxConcurrentHandlers,HandlerTimeout} correctly populate runtime config.

Comment thread router-tests/modules/stream_receive_test.go Outdated
Comment thread router/demo.config.yaml Outdated
Comment thread router/pkg/pubsub/datasource/subscription_event_updater.go
Copy link
Copy Markdown
Contributor

@alepane21 alepane21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you did here, good work! I found some small issues.

Comment thread router/core/subscriptions_modules.go Outdated
Comment thread router/core/subscriptions_modules.go Outdated
Comment thread router/core/subscriptions_modules.go
Comment thread router/pkg/config/config.go
Comment thread router/pkg/pubsub/datasource/subscription_event_updater.go Outdated
Comment thread router/pkg/pubsub/datasource/subscription_event_updater.go Outdated
Comment thread router/pkg/pubsub/datasource/subscription_event_updater.go
Comment thread router/pkg/pubsub/datasource/subscription_event_updater.go Outdated
Comment thread router/pkg/pubsub/datasource/subscription_event_updater.go Outdated
@dkorittki dkorittki merged commit b3bc7af into topic/streams-v1 Nov 13, 2025
69 of 75 checks passed
@dkorittki dkorittki deleted the dominik/eng-8459-add-a-timeout-for-onreceiveevent-hooks branch November 13, 2025 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants