
feat: collect federation outgoing metrics #38459

Open
dhulke wants to merge 13 commits into feat/federation-tracing2 from feat/metrics

Conversation

@dhulke
Contributor

@dhulke dhulke commented Feb 1, 2026

This PR depends on the new homeserver SDK version introduced in RocketChat/homeserver#331, which implements error handlers in the event subscription.
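
As a rough illustration of what "error handlers in the event subscription" can look like, here is a minimal sketch of a typed emitter whose `on` method accepts an optional error callback. Everything in it (`TypedEmitter`, `ExampleEventSignatures`, the event names) is hypothetical and written for this note; it is not the homeserver SDK's API, and it uses synchronous handlers for brevity.

```typescript
// Hypothetical event map for illustration; the PR uses HomeserverEventSignatures from the SDK.
type ExampleEventSignatures = {
	'example.message': { roomId: string; body: string };
	'example.ping': { ts: number };
};

type Handler<T> = (payload: T) => void;
type ErrorHandler = (err: unknown) => void;
type Listener = { handler: Handler<unknown>; onError: ErrorHandler | undefined };

class TypedEmitter<E extends Record<string, unknown>> {
	private listeners: Record<string, Listener[]> = {};

	// Register a handler plus an optional callback that receives anything the handler throws.
	on<K extends keyof E & string>(event: K, handler: Handler<E[K]>, onError?: ErrorHandler): void {
		const list = this.listeners[event] ?? [];
		list.push({ handler: handler as Handler<unknown>, onError });
		this.listeners[event] = list;
	}

	emit<K extends keyof E & string>(event: K, payload: E[K]): void {
		for (const { handler, onError } of this.listeners[event] ?? []) {
			try {
				handler(payload);
			} catch (err) {
				// Failures go to the listener's error callback; without one they propagate.
				if (!onError) {
					throw err;
				}
				onError(err);
			}
		}
	}
}
```

With this shape, a handler body no longer needs its own try/catch: the payload type is inferred from the event name, and logging or metric increments can live in the error callback.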

Proposed changes (including videos or screenshots)

Issue(s)

Steps to test or reproduce

Further comments

Summary by CodeRabbit

  • New Features

    • Added observability metrics and monitoring for federation operations, including event tracking, message sends, room creation, and performance duration measurements.
  • Chores

    • Enhanced error handling and logging for federation activities.
    • Improved type safety for federation event handlers.
    • Added dependency for metrics support.

@dionisio-bot
Contributor

dionisio-bot bot commented Feb 1, 2026

Looks like this PR is not ready to merge, because of the following issues:

  • This PR is missing the 'stat: QA assured' label
  • This PR is missing the required milestone or project

Please fix the issues and try again

If you have any trouble, please check the PR guidelines

@changeset-bot

changeset-bot bot commented Feb 1, 2026

⚠️ No Changeset found

Latest commit: 3b199ef

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Contributor

coderabbitai bot commented Feb 1, 2026

Walkthrough

This PR adds comprehensive Prometheus metrics instrumentation to the federation-matrix package. It introduces a new prom-client dependency, creates a metrics helpers module with typed metric definitions, and integrates extensive instrumentation throughout FederationMatrix.ts. Event handlers across multiple files are refactored to use typed event signatures and consistent error-handling callbacks instead of try/catch blocks.
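
To make the "lazy-initialized metrics" idea concrete, here is a minimal sketch. It uses a small in-memory `LabeledCounter` as a stand-in so that it runs without dependencies; in the PR the metrics are prom-client objects. The accessor name `getFederationMetrics`, the metric name strings, the `OutgoingMessage` shape, and the classification rule are all assumptions made for illustration; only `federationEventsProcessed`, `federatedInvitesSent`, and `determineOutgoingMessageType` are names taken from this PR.

```typescript
type Labels = Record<string, string>;

// In-memory stand-in for a labeled counter so the sketch runs without prom-client.
class LabeledCounter {
	private counts: Record<string, number> = {};

	constructor(readonly name: string) {}

	// Build a key that does not depend on the order in which labels were written.
	private static key(labels: Labels): string {
		return Object.keys(labels)
			.sort()
			.map((k) => `${k}=${labels[k]}`)
			.join(',');
	}

	inc(labels: Labels, value = 1): void {
		const key = LabeledCounter.key(labels);
		this.counts[key] = (this.counts[key] ?? 0) + value;
	}

	get(labels: Labels): number {
		return this.counts[LabeledCounter.key(labels)] ?? 0;
	}
}

type FederationMetrics = {
	federationEventsProcessed: LabeledCounter;
	federatedInvitesSent: LabeledCounter;
};

let cached: FederationMetrics | undefined;

// Metrics are created on first access and reused afterwards.
function getFederationMetrics(): FederationMetrics {
	if (!cached) {
		cached = {
			federationEventsProcessed: new LabeledCounter('federation_events_processed_total'),
			federatedInvitesSent: new LabeledCounter('federated_invites_sent_total'),
		};
	}
	return cached;
}

// Hypothetical message shape; the real type in the PR may differ.
type OutgoingMessage = { msg?: string; files?: Array<{ name: string }> };

// Assumed rule: any attached file makes the message file-based.
function determineOutgoingMessageType(message: OutgoingMessage): 'text' | 'file' {
	return message.files !== undefined && message.files.length > 0 ? 'file' : 'text';
}
```

Creating the metrics on first access rather than at import time means that importing the module has no side effects, which is one plausible reason for the lazy design.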

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependencies: ee/packages/federation-matrix/package.json | Added prom-client v^15.1.3 as a runtime dependency for Prometheus metrics support. |
| Metrics Infrastructure: ee/packages/federation-matrix/src/helpers/metricsHelpers.ts | New module introducing federationMetrics with lazy-initialized Prometheus metrics (counters and summaries) for tracking federation events, message sends, room creation, invites, reactions, and operation durations. Includes determineOutgoingMessageType utility to classify messages as text or file-based. |
| Core Federation Logic: ee/packages/federation-matrix/src/FederationMatrix.ts | Integrated extensive metrics instrumentation and timing measurements across room creation, direct message rooms, message sending, redactions, invitations, memberships, reactions, typing, and updates. Added startTimer/endTimer pairs in try/finally blocks and success/failure metrics increments throughout. Enhanced error handling with federationEventsFailed metrics in catch blocks. |
| Event Handler Typings & Error Handling: ee/packages/federation-matrix/src/events/edu.ts, src/events/member.ts, src/events/message.ts, src/events/ping.ts, src/events/reaction.ts, src/events/room.ts | Updated event handlers to use typed event signatures from HomeserverEventSignatures for improved type safety. Replaced try/catch blocks with dedicated error callbacks passed to event listener registration. Minor structural improvements including early returns and consistent error logging patterns. |
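
The startTimer/endTimer pattern in try/finally described for FederationMatrix.ts can be sketched roughly as follows. `Counter` and `DurationSummary` are in-memory stand-ins written for this note (the clock is injected so the behavior is deterministic), and `instrument` is a hypothetical helper; the PR applies the pattern inline in each method rather than through a wrapper. The label names and the `error_type` expression mirror the ones visible in the review excerpt in this conversation.

```typescript
type Labels = Record<string, string>;

// Stand-in for a labeled counter.
class Counter {
	private counts: Record<string, number> = {};

	private static key(labels: Labels): string {
		return JSON.stringify(Object.keys(labels).sort().map((k) => [k, labels[k]]));
	}

	inc(labels: Labels): void {
		const key = Counter.key(labels);
		this.counts[key] = (this.counts[key] ?? 0) + 1;
	}

	get(labels: Labels): number {
		return this.counts[Counter.key(labels)] ?? 0;
	}
}

// Stand-in for a duration summary: startTimer() returns a function that records the elapsed time.
class DurationSummary {
	readonly observations: number[] = [];

	constructor(private readonly now: () => number) {}

	startTimer(): () => void {
		const start = this.now();
		return () => {
			this.observations.push(this.now() - start);
		};
	}
}

type Metrics = { processed: Counter; failed: Counter; duration: DurationSummary };

// Hypothetical wrapper: count success or failure, and always record the duration.
function instrument<T>(metrics: Metrics, eventType: string, operation: () => T): T {
	const endTimer = metrics.duration.startTimer();
	try {
		const result = operation();
		metrics.processed.inc({ event_type: eventType, direction: 'outgoing' });
		return result;
	} catch (err) {
		metrics.failed.inc({
			event_type: eventType,
			direction: 'outgoing',
			error_type: err instanceof Error ? err.constructor.name : 'Unknown',
		});
		throw err;
	} finally {
		// Runs on both paths, so failed operations are timed too.
		endTimer();
	}
}
```

Because the timer is stopped in `finally`, the duration metric covers failed operations as well as successful ones, while the two counters keep the outcomes separate.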

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

stat: ready to merge, stat: QA assured

Suggested reviewers

  • ggazzo
  • sampaiodiego
  • ricardogarim

Poem

🐇 Prometheus watches with metrics so bright,
Federation flows tracked day and night,
Counters and timers, errors caught well,
Observability stories the data will tell!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main objective of the changeset: adding federation outgoing metrics collection across multiple files and modules. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@dhulke dhulke changed the base branch from develop to feat/federation-tracing2 February 1, 2026 16:18
@codecov

codecov bot commented Feb 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.41%. Comparing base (e83f8e2) to head (d7860c5).

Additional details and impacted files

Impacted file tree graph

@@                     Coverage Diff                      @@
##           feat/federation-tracing2   #38459      +/-   ##
============================================================
- Coverage                     70.41%   70.41%   -0.01%     
============================================================
  Files                          3161     3161              
  Lines                        110151   110151              
  Branches                      19862    19891      +29     
============================================================
- Hits                          77565    77562       -3     
- Misses                        30557    30560       +3     
  Partials                       2029     2029              
| Flag | Coverage Δ |
| --- | --- |
| e2e | 60.33% <ø> (-0.03%) ⬇️ |
| e2e-api | 47.79% <ø> (-0.02%) ⬇️ |
| unit | 71.44% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@github-actions
Contributor

github-actions bot commented Feb 1, 2026

📦 Docker Image Size Report

➡️ Changes

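| Service | Current | Baseline | Change | Percent |
| --- | --- | --- | --- | --- |
| sum of all images | 0B | 0B | 0B | |
| account-service | 0B | 0B | 0B | |
| authorization-service | 0B | 0B | 0B | |
| ddp-streamer-service | 0B | 0B | 0B | |
| omnichannel-transcript-service | 0B | 0B | 0B | |
| presence-service | 0B | 0B | 0B | |
| queue-worker-service | 0B | 0B | 0B | |
| rocketchat | 0B | 0B | 0B | |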

📊 Historical Trend

---
config:
  theme: "dark"
  xyChart:
    width: 900
    height: 400
---
xychart
  title "Image Size Evolution by Service (Last 30 Days + This PR)"
  x-axis ["11/18 22:53", "11/19 23:02", "11/21 16:49", "11/24 17:34", "11/27 22:32", "11/28 19:05", "12/01 23:01", "12/02 21:57", "12/03 21:00", "12/04 18:17", "12/05 21:56", "12/08 20:15", "12/09 22:17", "12/10 23:26", "12/11 21:56", "12/12 22:45", "12/13 01:34", "12/15 22:31", "12/16 22:18", "12/17 21:04", "12/18 23:12", "12/19 23:27", "12/20 21:03", "12/22 18:54", "12/23 16:16", "12/24 19:38", "12/25 17:51", "12/26 13:18", "12/29 19:01", "12/30 20:52", "02/02 01:40 (PR)"]
  y-axis "Size (GB)" 0 --> 0.5
  line "account-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.00]
  line "authorization-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.00]
  line "ddp-streamer-service" [0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.00]
  line "omnichannel-transcript-service" [0.14, 0.14, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.00]
  line "presence-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.00]
  line "queue-worker-service" [0.14, 0.14, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.00]
  line "rocketchat" [0.35, 0.35, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.34, 0.00]

Statistics (last 30 days):

  • 📊 Average: 1.5GiB
  • ⬇️ Minimum: 1.4GiB
  • ⬆️ Maximum: 1.6GiB
  • 🎯 Current PR: 0B
ℹ️ About this report

This report compares Docker image sizes from this build against the develop baseline.

  • Tag: pr-38459
  • Baseline: develop
  • Timestamp: 2026-02-02 01:40:55 UTC
  • Historical data points: 30

Updated: Mon, 02 Feb 2026 01:40:56 GMT

@dhulke dhulke changed the title from "Feat/metrics" to "feat: collect federation outgoing metrics" Feb 2, 2026
@dhulke dhulke marked this pull request as ready for review February 2, 2026 01:31
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 6 files

…g type signatures and improving error handling
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ee/packages/federation-matrix/src/FederationMatrix.ts (1)

515-548: ⚠️ Potential issue | 🟡 Minor

Metric counter increments for skipped invites.

federatedInvitesSent is incremented by the total number of usernames even when invites are skipped. Specifically, when the inviter is native federated (lines 533-535), no invite is sent, but the counter still increments at line 547. Move the increment into the callback so that only invites actually sent are counted.

Suggested fix
 await Promise.all(
 	matrixUsersUsername.map(async (username) => {
 		if (validateFederatedUsername(username)) {
-			return federationSDK.inviteUserToRoom(
+			await federationSDK.inviteUserToRoom(
 				userIdSchema.parse(username),
 				roomIdSchema.parse(room.federation.mrid),
 				userIdSchema.parse(inviterUserId),
 			);
+			federationMetrics.federatedInvitesSent.inc({ room_type: room.t });
+			return;
 		}
 
 		// if inviter is an external user it means we receive the invite from the endpoint
 		// since we accept from there we can skip accepting here
 		if (isUserNativeFederated(inviter)) {
 			this.logger.debug('Inviter is native federated, skip accept invite');
 			return;
 		}
 
-		return federationSDK.inviteUserToRoom(
+		await federationSDK.inviteUserToRoom(
 			userIdSchema.parse(`@${username}:${this.serverName}`),
 			roomIdSchema.parse(room.federation.mrid),
 			userIdSchema.parse(inviterUserId),
 		);
+		federationMetrics.federatedInvitesSent.inc({ room_type: room.t });
 	}),
 );
 
-// Increment invite counter for each user invited
-federationMetrics.federatedInvitesSent.inc({ room_type: room.t }, matrixUsersUsername.length);
 federationMetrics.federationEventsProcessed.inc({ event_type: 'membership', direction: 'outgoing' });
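
To see why counting by list length can overcount, consider this simplified, synchronous sketch of the branching above. The `Invitee` type, `sendInvites`, and the `send` callback are hypothetical names introduced for this note; the real code awaits `federationSDK.inviteUserToRoom` inside `Promise.all`.

```typescript
// Hypothetical, simplified model of the invite loop.
type Invitee = { username: string; federated: boolean };

// Returns how many invites were actually sent.
function sendInvites(
	invitees: Invitee[],
	inviterIsNativeFederated: boolean,
	send: (username: string) => void,
): number {
	let sent = 0;
	for (const invitee of invitees) {
		// Local usernames are skipped when the inviter is an external (native federated) user.
		if (!invitee.federated && inviterIsNativeFederated) {
			continue;
		}
		send(invitee.username);
		sent += 1;
	}
	return sent;
}

const invitees: Invitee[] = [
	{ username: '@alice:remote.example', federated: true },
	{ username: 'bob', federated: false },
	{ username: 'carol', federated: false },
];

const delivered: string[] = [];
const sentCount = sendInvites(invitees, true, (username) => {
	delivered.push(username);
});

// Counting by list length would report 3; only 1 invite was actually sent.
const countedByLength = invitees.length;
```

Incrementing inside the branch that sends, as the suggested fix does, keeps the metric equal to `sentCount` rather than `countedByLength`.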
🤖 Fix all issues with AI agents
In `@ee/packages/federation-matrix/src/FederationMatrix.ts`:
- Around line 1049-1104: In handleInvite, remove the explicit cleanup call to
Room.performUserRemoval in the catch branch for federationSDK.rejectInvite
(inside the action === 'reject' block) and instead log the 403 error and return
so the leave-event flow (handleLeave) performs the removal; keep the existing
this.logger.error line and ensure the function returns early on 403 without
invoking Room.performUserRemoval.
🧹 Nitpick comments (3)
ee/packages/federation-matrix/src/FederationMatrix.ts (1)

122-165: Consider removing the inline “Increment success metrics” comment.

The instrumentation is clear without the extra comment, and it repeats a pattern seen elsewhere.

As per coding guidelines: Avoid code comments in the implementation.

ee/packages/federation-matrix/src/events/member.ts (1)

94-99: Drop the inline spec comment.

The code is self-explanatory here; consider removing the comment per the no-comments guideline.

As per coding guidelines: Avoid code comments in the implementation.

ee/packages/federation-matrix/src/events/message.ts (1)

147-156: Remove inline SPEC comments from implementation.
Consider relocating these notes to external docs/spec references to keep runtime code clean.

Proposed diff
@@
-			// SPEC: For example, an m.thread relationship type denotes that the event is part of a "thread" of messages and should be rendered as such.
 			const hasRelation = relation && 'rel_type' in relation;
@@
-			// SPEC: Though rich replies form a relationship to another event, they do not use rel_type to create this relationship.
-			// Instead, a subkey named m.in_reply_to is used to describe the reply's relationship,
 			const isRichReply = relation && !('rel_type' in relation) && 'm.in_reply_to' in relation;
@@
-			// SPEC: For example, an m.thread relationship type denotes that the event is part of a "thread" of messages and should be rendered as such.
 			const hasRelation = relation && 'rel_type' in relation;
@@
-			// SPEC: Though rich replies form a relationship to another event, they do not use rel_type to create this relationship.
-			// Instead, a subkey named m.in_reply_to is used to describe the reply's relationship,
 			const isRichReply = relation && !('rel_type' in relation) && 'm.in_reply_to' in relation;
As per coding guidelines: Avoid code comments in the implementation.

Also applies to: 297-306

Comment on lines 1049 to +1104
 async handleInvite(roomId: IRoom['_id'], userId: IUser['_id'], action: 'accept' | 'reject'): Promise<void> {
-	const subscription = await Subscriptions.findInvitedSubscription(roomId, userId);
-	if (!subscription) {
-		throw new Error('No subscription found or user does not have permission to accept or reject this invite');
-	}
+	try {
+		const subscription = await Subscriptions.findInvitedSubscription(roomId, userId);
+		if (!subscription) {
+			throw new Error('No subscription found or user does not have permission to accept or reject this invite');
+		}

-	const room = await Rooms.findOneById(roomId);
-	if (!room || !isRoomNativeFederated(room)) {
-		throw new Error('Room not found or not federated');
-	}
+		const room = await Rooms.findOneById(roomId);
+		if (!room || !isRoomNativeFederated(room)) {
+			throw new Error('Room not found or not federated');
+		}

-	const user = await Users.findOneById(userId);
-	if (!user) {
-		throw new Error('User not found');
-	}
+		const user = await Users.findOneById(userId);
+		if (!user) {
+			throw new Error('User not found');
+		}

-	if (!user.username) {
-		throw new Error('User username not found');
-	}
+		if (!user.username) {
+			throw new Error('User username not found');
+		}

-	// TODO: should use common function to get matrix user ID
-	const matrixUserId = isUserNativeFederated(user) ? user.federation.mui : `@${user.username}:${this.serverName}`;
+		// TODO: should use common function to get matrix user ID
+		const matrixUserId = isUserNativeFederated(user) ? user.federation.mui : `@${user.username}:${this.serverName}`;

-	// Add runtime attributes after querying room and user
-	addSpanAttributes({
-		matrixRoomId: room.federation.mrid,
-		matrixUserId,
-		username: user.username,
-		isNativeFederatedUser: isUserNativeFederated(user),
-	});
+		// Add runtime attributes after querying room and user
+		addSpanAttributes({
+			matrixRoomId: room.federation.mrid,
+			matrixUserId,
+			username: user.username,
+			isNativeFederatedUser: isUserNativeFederated(user),
+		});

-	if (action === 'accept') {
-		await federationSDK.acceptInvite(room.federation.mrid, matrixUserId);
-	}
+		if (action === 'accept') {
+			await federationSDK.acceptInvite(room.federation.mrid, matrixUserId);
+		}

-	if (action === 'reject') {
-		try {
-			await federationSDK.rejectInvite(room.federation.mrid, matrixUserId);
-		} catch (err) {
-			if (err instanceof FederationRequestError && err.response.status === 403) {
-				return Room.performUserRemoval(room, user);
+		if (action === 'reject') {
+			try {
+				await federationSDK.rejectInvite(room.federation.mrid, matrixUserId);
+			} catch (err) {
+				if (err instanceof FederationRequestError && err.response.status === 403) {
+					return Room.performUserRemoval(room, user);
+				}
+				this.logger.error({ msg: 'Failed to reject invite in Matrix', err });
+				throw err;
 			}
-			this.logger.error({ msg: 'Failed to reject invite in Matrix', err });
-			throw err;
 		}

+		federationMetrics.federationEventsProcessed.inc({ event_type: 'membership', direction: 'outgoing' });
+	} catch (err) {
+		federationMetrics.federationEventsFailed.inc({
+			event_type: 'membership',
+			direction: 'outgoing',
+			error_type: err instanceof Error ? err.constructor.name : 'Unknown',
+		});
+		throw err;
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the actual current state of handleInvite method
rg -A 60 "async handleInvite" ee/packages/federation-matrix/src/FederationMatrix.ts

Repository: RocketChat/Rocket.Chat

Length of output: 1986


🏁 Script executed:

# Verify handleLeave method exists and performs cleanup
rg -A 20 "handleLeave" ee/packages/federation-matrix/src/events/member.ts

Repository: RocketChat/Rocket.Chat

Length of output: 965


🏁 Script executed:

# Check other usages of rejectInvite and how they handle errors
rg -B 5 -A 10 "rejectInvite" ee/packages/federation-matrix/src/

Repository: RocketChat/Rocket.Chat

Length of output: 1490


🏁 Script executed:

# Verify Room.performUserRemoval is called in handleLeave
rg -n "performUserRemoval" ee/packages/federation-matrix/

Repository: RocketChat/Rocket.Chat

Length of output: 280


🏁 Script executed:

# Look for context around the 403 handling and what it means
rg -B 10 -A 5 "403" ee/packages/federation-matrix/src/FederationMatrix.ts | head -50

Repository: RocketChat/Rocket.Chat

Length of output: 553


🏁 Script executed:

# Check Matrix federation SDK documentation or comments about rejectInvite behavior
rg -B 5 -A 5 "rejectInvite" --type=ts | grep -A 5 -B 5 "403\|reject\|leave"

Repository: RocketChat/Rocket.Chat

Length of output: 7003


🏁 Script executed:

# Look for event subscriptions in federation-matrix to understand what events trigger cleanup
rg "on.*leave|on.*membership|subscribe.*leave" ee/packages/federation-matrix/src/ --type=ts

Repository: RocketChat/Rocket.Chat

Length of output: 1721


🏁 Script executed:

# Check if there are tests that show the 403 behavior
fd "test|spec" ee/packages/federation-matrix/src/ --type=f -name "*.ts" | head -5

Repository: RocketChat/Rocket.Chat

Length of output: 296


🏁 Script executed:

# Look for more context on 403 handling and Matrix federation behavior
rg -B 15 "403" ee/packages/federation-matrix/src/FederationMatrix.ts

Repository: RocketChat/Rocket.Chat

Length of output: 565


🏁 Script executed:

# Check if there are comments or documentation about the 403 case in rejectInvite
rg -B 10 -A 10 "status === 403" ee/packages/federation-matrix/src/

Repository: RocketChat/Rocket.Chat

Length of output: 1866


🏁 Script executed:

# Look at the subscription model to understand when subscriptions are cleaned up
rg -A 5 "findInvitedSubscription" ee/packages/federation-matrix/src/ --type=ts

Repository: RocketChat/Rocket.Chat

Length of output: 664


🏁 Script executed:

# Check if there are event listeners or handlers that might be triggered by 403
rg "on.*event|once|subscribe" ee/packages/federation-matrix/src/events/member.ts --type=ts -A 3

Repository: RocketChat/Rocket.Chat

Length of output: 651


🏁 Script executed:

# Look for any documentation or comments about the 403 case in the federation SDK or Matrix files
rg -B 5 -A 5 "403\|Forbidden" ee/packages/federation-matrix/ --type=ts

Repository: RocketChat/Rocket.Chat

Length of output: 48


🏁 Script executed:

# Check if there's any documentation on what the 403 status means in the context of rejectInvite
rg "rejectInvite" --type=ts -B 10 -A 10 | grep -A 10 -B 10 "Forbidden\|403\|already"

Repository: RocketChat/Rocket.Chat

Length of output: 1739


🏁 Script executed:

# Look for tests or mocks of the handleInvite or rejectInvite scenario
fd "handleInvite\|reject" ee/packages/federation-matrix/src/ -type f -name "*.spec.ts"

Repository: RocketChat/Rocket.Chat

Length of output: 236


🏁 Script executed:

# Check git history or comments for context on why 403 handling was added
rg "performUserRemoval.*403\|403.*performUserRemoval" ee/packages/federation-matrix/ -B 10 -A 5

Repository: RocketChat/Rocket.Chat

Length of output: 48


Avoid explicit cleanup on invite reject; rely on leave-event flow.

Calling Room.performUserRemoval in the 403 branch violates the established event-driven cleanup pattern. The leave-event handler in handleLeave() is the single cleanup point and should be trusted to handle all user removals. Log the error and return instead.

Suggested fix
 			if (action === 'reject') {
 				try {
 					await federationSDK.rejectInvite(room.federation.mrid, matrixUserId);
 				} catch (err) {
 					if (err instanceof FederationRequestError && err.response.status === 403) {
-						return Room.performUserRemoval(room, user);
+						this.logger.debug({ msg: 'Invite already rejected remotely; awaiting leave-event cleanup', roomId, userId });
+						return;
 					}
 					this.logger.error({ msg: 'Failed to reject invite in Matrix', err });
 					throw err;
 				}
 			}
🤖 Prompt for AI Agents
In `@ee/packages/federation-matrix/src/FederationMatrix.ts` around lines 1049 -
1104, In handleInvite, remove the explicit cleanup call to
Room.performUserRemoval in the catch branch for federationSDK.rejectInvite
(inside the action === 'reject' block) and instead log the 403 error and return
so the leave-event flow (handleLeave) performs the removal; keep the existing
this.logger.error line and ensure the function returns early on 403 without
invoking Room.performUserRemoval.
