feat: enhance federation-sdk with metrics tracking and new utility functions #331

Open
dhulke wants to merge 4 commits into feat/tracing from feat/metrics

@dhulke dhulke commented Feb 2, 2026

Summary by CodeRabbit

  • New Features

    • Prometheus metrics support added to the federation SDK for events, messages, rooms, and processing durations.
    • New metrics helpers to normalize origins, bucket counts, and simplify event-type labels for lower-cardinality telemetry.
    • Metrics publicly exposed and initializable for custom registries.
  • Improvements

    • Integrated tracing and timing around federation event handling and transactions.
    • Optional per-handler error hook to observe/handle handler exceptions without changing behavior.

coderabbitai bot commented Feb 2, 2026

Walkthrough

Adds Prometheus-based metrics, helper utilities, and public re-exports to the federation SDK; instruments event emission and transaction processing with metrics, timing, tracing, and an optional per-handler exception hook.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Dependency | packages/federation-sdk/package.json | Added runtime dependency prom-client (^15.1.3). |
| Public API | packages/federation-sdk/src/index.ts | Re-exported federationMetrics, initMetrics, helper functions, and EventHandlerExceptionHandler type. |
| Metrics implementation | packages/federation-sdk/src/metrics/index.ts | New registry-aware metrics module: initMetrics, lazy get-or-create metrics, and federationMetrics getters for counters and summaries. |
| Metrics helpers | packages/federation-sdk/src/metrics/helpers.ts | New utilities: origin extraction from Matrix IDs, message-type determination, event-type labeling, and PDU/EDU bucketing functions. |
| Event emitter instrumentation | packages/federation-sdk/src/services/event-emitter.service.ts | Wrapped handlers with tracing/timing, added metrics for processed/failed events, introduced EventHandlerExceptionHandler (onError) propagated through subscribe/on/once, and added event-specific metric emission. |
| Transaction processing instrumentation | packages/federation-sdk/src/services/event.service.ts | Instrumented processIncomingTransaction with federationTransactionProcessDuration timer, used PDU/EDU buckets and origin labels, moved validation into try/finally to ensure timer cleanup. |
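
The try/finally arrangement mentioned for processIncomingTransaction can be sketched roughly as below. This is a dependency-free illustration under stated assumptions, not the PR's code: `startTimer`, `processTransaction`, and the `observe` callback are hypothetical names (prom-client's `Summary` offers a comparable `startTimer()` that returns an end function), and the real method is presumably asynchronous.

```typescript
type Labels = Record<string, string>;
type Observe = (labels: Labels, seconds: number) => void;

// Hypothetical stand-in for a metric timer: returns an `end` function that
// reports the elapsed seconds together with the labels passed at completion.
function startTimer(observe: Observe): (labels: Labels) => void {
	const startedAt = Date.now();
	return (labels) => observe(labels, (Date.now() - startedAt) / 1000);
}

// Validation sits inside `try`, so `finally` stops the timer even when
// validation or processing throws. The same shape works with `await` in `try`.
function processTransaction(origin: string, pdus: unknown[], observe: Observe): number {
	const end = startTimer(observe);
	try {
		if (pdus.length === 0) {
			throw new Error('empty transaction');
		}
		return pdus.length;
	} finally {
		end({ origin });
	}
}
```

Keeping the `end` call in `finally` means a rejected transaction is still observed, which avoids timers that never report.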

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant EventEmitter as EventEmitterService
    participant Handler
    participant EventService
    participant Metrics as PrometheusRegistry

    Client->>EventEmitter: emit(event, data)
    EventEmitter->>Handler: createTracedHandler(handler, onError)
    alt active span exists
        EventEmitter->>Handler: run inside traced span (tracing + timing)
    else
        EventEmitter->>Handler: run handler directly (timed)
    end
    Handler->>EventService: processIncomingTransaction / other processing
    EventService->>Metrics: start federationTransactionProcessDuration timer
    EventService-->>Metrics: end timer (labels: origin, pdu_bucket, edu_bucket)
    alt handler success
        EventEmitter->>Metrics: increment federationEventsProcessed (labels)
    else handler throws
        EventEmitter->>Metrics: increment federationEventsFailed (labels, error_type)
        EventEmitter->>Handler: call onError(error,event,data) [optional]
    end
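
The success and failure branches of the diagram can be approximated with a small wrapper. This is a sketch only: `wrapHandler`, `Counters`, and the synchronous handler signature are hypothetical simplifications, tracing is omitted, and rethrowing after the hook is an assumption made here so that the wrapper does not change what the caller observes.

```typescript
type Handler<T> = (data: T) => void;
type OnError<T> = (error: unknown, event: string, data: T) => void;

// Plain counters standing in for federationEventsProcessed / federationEventsFailed.
interface Counters {
	processed: number;
	failed: number;
}

// Counts outcomes, reports failures to the optional hook, then rethrows
// (assumption) so the wrapped handler fails the same way the original would.
function wrapHandler<T>(
	event: string,
	handler: Handler<T>,
	counters: Counters,
	onError?: OnError<T>,
): Handler<T> {
	return (data) => {
		try {
			handler(data);
			counters.processed += 1;
		} catch (error) {
			counters.failed += 1;
			onError?.(error, event, data);
			throw error;
		}
	};
}
```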

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • ggazzo
  • sampaiodiego

Poem

🐰 I hopped through code with timers bright,
Buckets and labels snug and tight,
Metrics hum where events were sown,
Handlers traced, origins known,
A cheerful hop — federation's light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: introducing metrics tracking and new utility functions to the federation-sdk. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50%, which is sufficient. The required threshold is 80.00%. |

@dhulke dhulke changed the base branch from main to feat/tracing February 2, 2026 01:21

codecov-commenter commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 7.01220% with 305 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.35%. Comparing base (c9dbc51) to head (942f67b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...deration-sdk/src/services/event-emitter.service.ts | 4.44% | 129 Missing ⚠️ |
| packages/federation-sdk/src/metrics/index.ts | 6.08% | 108 Missing ⚠️ |
| packages/federation-sdk/src/metrics/helpers.ts | 12.00% | 44 Missing ⚠️ |
| ...kages/federation-sdk/src/services/event.service.ts | 7.69% | 24 Missing ⚠️ |

Additional details and impacted files
@@               Coverage Diff                @@
##           feat/tracing     #331      +/-   ##
================================================
- Coverage         50.97%   50.35%   -0.63%     
================================================
  Files                98      100       +2     
  Lines             13966    14244     +278     
================================================
+ Hits               7119     7172      +53     
- Misses             6847     7072     +225     

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/federation-sdk/src/metrics/index.ts`:
- Around line 16-25: getOrCreateMetric currently checks the global prom-client
registry (client.register.getSingleMetric) which mismatches metrics created with
the custom registry variable and leads to duplicate registrations; update
getOrCreateMetric to query the same custom registry (use
registry.getSingleMetric(name)) so the existence check and creation happen
against the same registry (ensure the createFn still registers the metric with
that same registry).
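
A rough sketch of the suggested fix, checking and creating against the same registry, is shown below. To stay dependency-free it uses `FakeRegistry`, a hypothetical stand-in that mirrors only the `getSingleMetric` and `registerMetric` methods of prom-client's `Registry`; the `getOrCreateMetric` signature here is likewise an assumption and may differ from the one in the PR.

```typescript
interface Metric {
	name: string;
}

// Hypothetical stand-in for a prom-client Registry; rejects duplicate names.
class FakeRegistry {
	private readonly metrics = new Map<string, Metric>();

	getSingleMetric(name: string): Metric | undefined {
		return this.metrics.get(name);
	}

	registerMetric(metric: Metric): void {
		if (this.metrics.has(metric.name)) {
			throw new Error(`A metric with the name ${metric.name} has already been registered.`);
		}
		this.metrics.set(metric.name, metric);
	}
}

// Look up and create against the same registry, so a second call finds the
// metric created by the first instead of attempting a duplicate registration.
function getOrCreateMetric<M extends Metric>(
	registry: FakeRegistry,
	name: string,
	createFn: () => M,
): M {
	const existing = registry.getSingleMetric(name);
	if (existing) {
		return existing as M;
	}
	const created = createFn();
	registry.registerMetric(created);
	return created;
}
```

With prom-client itself, a metric constructed with `registers: [registry]` registers itself on that registry, so the explicit `registerMetric` call above belongs to the stand-in rather than to the expected real code.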
🧹 Nitpick comments (2)
packages/federation-sdk/src/metrics/helpers.ts (1)

5-15: Origin extraction may be inaccurate for IDs with port numbers.

The split(':').pop() approach works for standard Matrix IDs like !room:matrix.org, but if the server domain includes a port (e.g., !room:matrix.org:8448), this would return only 8448 instead of matrix.org:8448.

This is an edge case since most production Matrix servers don't include ports in their IDs, but worth noting for metrics accuracy.

♻️ Alternative implementation for port-aware extraction
 export function extractOriginFromMatrixRoomId(roomId: string): string {
-	return roomId.split(':').pop() || 'unknown';
+	const colonIndex = roomId.indexOf(':');
+	return colonIndex !== -1 ? roomId.slice(colonIndex + 1) : 'unknown';
 }

 export function extractOriginFromMatrixUserId(userId: string): string {
-	return userId.split(':').pop() || 'unknown';
+	const colonIndex = userId.indexOf(':');
+	return colonIndex !== -1 ? userId.slice(colonIndex + 1) : 'unknown';
 }
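
The difference between the two approaches can be checked with a short, self-contained comparison. The function names `originByLastColon` and `originByFirstColon` are hypothetical and exist only for this illustration.

```typescript
// Behaviour of the original helpers: keeps only the text after the last colon.
function originByLastColon(id: string): string {
	return id.split(':').pop() || 'unknown';
}

// Port-aware variant from the suggestion: keeps everything after the first colon.
function originByFirstColon(id: string): string {
	const colonIndex = id.indexOf(':');
	return colonIndex !== -1 ? id.slice(colonIndex + 1) : 'unknown';
}

console.log(originByLastColon('!room:matrix.org:8448')); // 8448
console.log(originByFirstColon('!room:matrix.org:8448')); // matrix.org:8448
console.log(originByFirstColon('@alice:matrix.org')); // matrix.org
```

One side effect worth noting: for an ID that contains no colon at all, the `split(':').pop()` form returns the whole input rather than 'unknown', whereas the `indexOf` form returns 'unknown'.
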
packages/federation-sdk/src/services/event-emitter.service.ts (1)

154-168: Duration metric may measure handler time, not actual room join operation time.

federationRoomJoinDuration is observed here with durationSeconds measuring the event handler execution time. However, the metric name ("Time to join a federated room") suggests it should measure the complete room join operation, which likely occurs in a different service layer (e.g., during invite acceptance).

This could lead to misleading metrics where the "room join duration" is actually just the membership event handler processing time.

Consider either:

  1. Renaming the metric to clarify it measures handler processing time, or
  2. Moving the room join duration observation to where the actual join operation occurs (similar to how federationTransactionProcessDuration is measured in event.service.ts)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/federation-sdk/src/services/event-emitter.service.ts`:
- Around line 185-199: The metric federationMetrics.federationRoomJoinDuration
is misleading (it implies end-to-end join latency) but currently measures
handler execution time between the handler start and recordEventSpecificMetrics;
update its semantic to match actual measurement by either renaming the metric
(e.g., to federationRoomJoinProcessingDuration) or updating its help text to
something like "Time to process membership join event (handler execution time)";
locate the metric definition (where federationMetrics.federationRoomJoinDuration
is created) and change the metric name/help string and any uses (e.g., the
observe call in event-emitter.service.ts, and ensure any dashboards/labels
expecting the old name are adjusted) so the name/help and all references
consistently reflect that this is processing time, not end-to-end join latency.
🧹 Nitpick comments (1)
packages/federation-sdk/src/services/event-emitter.service.ts (1)

72-76: Casting assumptions may be fragile for varying event structures.

The code assumes all event data has an optional event property. While this works for most HomeserverEventSignatures events (like homeserver.matrix.message), some events like homeserver.ping or homeserver.matrix.typing don't have a nested event object. The current implementation handles this gracefully with optional chaining, but the type casting could be more explicit.

The code works correctly because recordEventSpecificMetrics only processes specific event types that do have the nested structure, but consider adding a brief comment explaining this assumption for maintainability.
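
As an alternative to a comment, the assumption could be made explicit with a type guard. The sketch below is hypothetical: `NestedEvent`, its fields, and `hasNestedEvent` are illustrative names and do not reflect the actual HomeserverEventSignatures types.

```typescript
// Illustrative shape only; the real nested event carries more fields.
interface NestedEvent {
	type: string;
	sender: string;
}

// Narrows an unknown payload to one that carries a nested `event` object,
// so payloads without it (for example a ping) are skipped without a cast.
function hasNestedEvent(data: unknown): data is { event: NestedEvent } {
	if (typeof data !== 'object' || data === null) return false;
	const event = (data as { event?: unknown }).event;
	if (typeof event !== 'object' || event === null) return false;
	const candidate = event as { type?: unknown; sender?: unknown };
	return typeof candidate.type === 'string' && typeof candidate.sender === 'string';
}
```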

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/federation-sdk/src/metrics/index.ts`:
- Around line 87-100: The federationTransactionProcessDuration metric currently
uses high-cardinality numeric labels pdu_count and edu_count which will explode
time series; update the metric created in the
federationTransactionProcessDuration getter (via getOrCreateMetric and the
'rocketchat_federation_transaction_process_duration_seconds' Summary) to avoid
raw numeric labels: either remove pdu_count and edu_count from labelNames and
record counts via a separate counter/gauge, or replace them with bucketed labels
(e.g., pdu_count_bucket, edu_count_bucket with values like "1-10","11-50","51+")
and ensure recording code uses bucketCount(pduCount)/bucketCount(eduCount) when
observing; keep origin label if needed.
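
A bucketing helper along these lines might look like the sketch below. The boundaries are assumptions chosen to give non-overlapping ranges; the `bucketCount` in the PR may use different ones, and `transactionLabels` is a hypothetical helper added only to show how the label set stays small.

```typescript
// Maps a raw count to one of a few fixed label values so the number of
// time series stays bounded regardless of transaction size.
function bucketCount(count: number): string {
	if (count <= 0) return '0';
	if (count <= 10) return '1-10';
	if (count <= 50) return '11-50';
	return '51+';
}

// Hypothetical label builder using the bucketed values instead of raw counts.
function transactionLabels(origin: string, pduCount: number, eduCount: number): Record<string, string> {
	return {
		origin,
		pdu_bucket: bucketCount(pduCount),
		edu_bucket: bucketCount(eduCount),
	};
}

console.log(transactionLabels('matrix.org', 3, 120));
// { origin: 'matrix.org', pdu_bucket: '1-10', edu_bucket: '51+' }
```

With four possible values per bucket label, the two count labels contribute at most 16 combinations per origin, instead of one series per distinct pair of counts.
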
🧹 Nitpick comments (1)
packages/federation-sdk/src/metrics/index.ts (1)

5-7: Document that initMetrics must be called before any metric access.

If metrics are accessed before initMetrics is called, they'll be registered on client.register. After initMetrics switches the registry, subsequent accesses will create duplicate metrics on the new registry. Consider adding a JSDoc comment clarifying the expected initialization order.

📝 Suggested documentation
+/**
+ * Initializes the metrics module with a custom registry.
+ * Must be called before any metric access to ensure all metrics
+ * are registered to the same registry.
+ */
 export function initMetrics(opts: { registry: Registry }) {
 	registry = opts.registry;
 }
