feat: enhance federation-sdk with metrics tracking and new utility functions#331
feat: enhance federation-sdk with metrics tracking and new utility functions#331dhulke wants to merge 4 commits intofeat/tracingfrom
Conversation
WalkthroughAdds Prometheus-based metrics, helper utilities, and public re-exports to the federation SDK; instruments event emission and transaction processing with metrics, timing, tracing, and an optional per-handler exception hook. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant EventEmitter as EventEmitterService
participant Handler
participant EventService
participant Metrics as PrometheusRegistry
Client->>EventEmitter: emit(event, data)
EventEmitter->>Handler: createTracedHandler(handler, onError)
alt active span exists
EventEmitter->>Handler: run inside traced span (tracing + timing)
else
EventEmitter->>Handler: run handler directly (timed)
end
Handler->>EventService: processIncomingTransaction / other processing
EventService->>Metrics: start federationTransactionProcessDuration timer
EventService-->>Metrics: end timer (labels: origin, pdu_bucket, edu_bucket)
alt handler success
EventEmitter->>Metrics: increment federationEventsProcessed (labels)
else handler throws
EventEmitter->>Metrics: increment federationEventsFailed (labels, error_type)
EventEmitter->>Handler: call onError(error,event,data) [optional]
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## feat/tracing #331 +/- ##
================================================
- Coverage 50.97% 50.35% -0.63%
================================================
Files 98 100 +2
Lines 13966 14244 +278
================================================
+ Hits 7119 7172 +53
- Misses 6847 7072 +225 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/federation-sdk/src/metrics/index.ts`:
- Around line 16-25: getOrCreateMetric currently checks the global prom-client
registry (client.register.getSingleMetric) which mismatches metrics created with
the custom registry variable and leads to duplicate registrations; update
getOrCreateMetric to query the same custom registry (use
registry.getSingleMetric(name)) so the existence check and creation happen
against the same registry (ensure the createFn still registers the metric with
that same registry).
🧹 Nitpick comments (2)
packages/federation-sdk/src/metrics/helpers.ts (1)
5-15: Origin extraction may be inaccurate for IDs with port numbers.The
split(':').pop()approach works for standard Matrix IDs like!room:matrix.org, but if the server domain includes a port (e.g.,!room:matrix.org:8448), this would return only8448instead ofmatrix.org:8448.This is an edge case since most production Matrix servers don't include ports in their IDs, but worth noting for metrics accuracy.
♻️ Alternative implementation for port-aware extraction
export function extractOriginFromMatrixRoomId(roomId: string): string { - return roomId.split(':').pop() || 'unknown'; + const colonIndex = roomId.indexOf(':'); + return colonIndex !== -1 ? roomId.slice(colonIndex + 1) : 'unknown'; } export function extractOriginFromMatrixUserId(userId: string): string { - return userId.split(':').pop() || 'unknown'; + const colonIndex = userId.indexOf(':'); + return colonIndex !== -1 ? userId.slice(colonIndex + 1) : 'unknown'; }packages/federation-sdk/src/services/event-emitter.service.ts (1)
154-168: Duration metric may measure handler time, not actual room join operation time.
federationRoomJoinDurationis observed here withdurationSecondsmeasuring the event handler execution time. However, the metric name ("Time to join a federated room") suggests it should measure the complete room join operation, which likely occurs in a different service layer (e.g., during invite acceptance).This could lead to misleading metrics where the "room join duration" is actually just the membership event handler processing time.
Consider either:
- Renaming the metric to clarify it measures handler processing time, or
- Moving the room join duration observation to where the actual join operation occurs (similar to how
federationTransactionProcessDurationis measured inevent.service.ts)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/federation-sdk/src/services/event-emitter.service.ts`:
- Around line 185-199: The metric federationMetrics.federationRoomJoinDuration
is misleading (it implies end-to-end join latency) but currently measures
handler execution time between the handler start and recordEventSpecificMetrics;
update its semantic to match actual measurement by either renaming the metric
(e.g., to federationRoomJoinProcessingDuration) or updating its help text to
something like "Time to process membership join event (handler execution time)";
locate the metric definition (where federationMetrics.federationRoomJoinDuration
is created) and change the metric name/help string and any uses (e.g., the
observe call in event-emitter.service.ts, and ensure any dashboards/labels
expecting the old name are adjusted) so the name/help and all references
consistently reflect that this is processing time, not end-to-end join latency.
🧹 Nitpick comments (1)
packages/federation-sdk/src/services/event-emitter.service.ts (1)
72-76: Casting assumptions may be fragile for varying event structures.The code assumes all event data has an optional
eventproperty. While this works for mostHomeserverEventSignaturesevents (likehomeserver.matrix.message), some events likehomeserver.pingorhomeserver.matrix.typingdon't have a nestedeventobject. The current implementation handles this gracefully with optional chaining, but the type casting could be more explicit.The code works correctly because
recordEventSpecificMetricsonly processes specific event types that do have the nested structure, but consider adding a brief comment explaining this assumption for maintainability.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/federation-sdk/src/metrics/index.ts`:
- Around line 87-100: The federationTransactionProcessDuration metric currently
uses high-cardinality numeric labels pdu_count and edu_count which will explode
time series; update the metric created in the
federationTransactionProcessDuration getter (via getOrCreateMetric and the
'rocketchat_federation_transaction_process_duration_seconds' Summary) to avoid
raw numeric labels: either remove pdu_count and edu_count from labelNames and
record counts via a separate counter/gauge, or replace them with bucketed labels
(e.g., pdu_count_bucket, edu_count_bucket with values like "1-10","11-50","50+")
and ensure recording code uses bucketCount(pduCount)/bucketCount(eduCount) when
observing; keep origin label if needed.
🧹 Nitpick comments (1)
packages/federation-sdk/src/metrics/index.ts (1)
5-7: Document thatinitMetricsmust be called before any metric access.If metrics are accessed before
initMetricsis called, they'll be registered onclient.register. AfterinitMetricsswitches the registry, subsequent accesses will create duplicate metrics on the new registry. Consider adding a JSDoc comment clarifying the expected initialization order.📝 Suggested documentation
+/** + * Initializes the metrics module with a custom registry. + * Must be called before any metric access to ensure all metrics + * are registered to the same registry. + */ export function initMetrics(opts: { registry: Registry }) { registry = opts.registry; }
Summary by CodeRabbit
New Features
Improvements