LaterGauge callbacks aren't thread safe #18764

@MadLittleMods

Spawning from @reivilibre spotting this error in Sentry:

Sentry error: https://sentry.tools.element.io/organizations/element/issues/11209742

RuntimeError: dictionary changed size during iteration
    synapse/federation/sender/__init__.py in <lambda> at line 419
    synapse/metrics/__init__.py in collect at line 176

Relevant code:

LaterGauge(
    name="synapse_federation_transaction_queue_pending_pdus",
    desc="",
    labelnames=[SERVER_NAME_LABEL],
    caller=lambda: {
        (self.server_name,): sum(
            d.pending_pdu_count() for d in self._per_destination_queues.values()
        )
    },
)

Potential suspects

While this code has changed recently in #18714, I don't see why those changes would cause this error. In fact, we see this kind of problem occurring before those changes:

I'm guessing the actual problem occurs when using the dedicated metrics listener type, which runs on a different thread. When we collect from that thread, we sometimes collide with the dictionary changing size on the main thread.
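
To make the suspected interleaving concrete, here is a minimal sketch that replays it deterministically on a single thread. It is not Synapse code: the `queues` dict is a hypothetical stand-in for `self._per_destination_queues`, and the "threads" are simulated by stepping an iterator by hand.

```python
def simulate_interleaved_collect() -> str:
    """Replay the suspected interleaving deterministically on one thread."""
    # Hypothetical stand-in for `self._per_destination_queues`.
    queues = {"a.example": 1, "b.example": 2}

    # The "metrics thread" starts iterating, as the `sum(...)` in the caller does.
    values = iter(queues.values())
    total = next(values)

    # The "main thread" adds a destination queue mid-iteration.
    queues["c.example"] = 3

    try:
        # The "metrics thread" resumes and notices the dict changed size.
        total += next(values)
    except RuntimeError as exc:
        return str(exc)
    return f"no error, total={total}"


message = simulate_interleaved_collect()
print(message)
```

With real threads the same failure would only show up when the collection happens to overlap with an insert or removal, which would fit it surfacing as an occasional Sentry error.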

Potential solutions

We probably need to use threading.Lock() like we do for InFlightGauge.
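
As a rough sketch of the lock idea (not the actual InFlightGauge or LaterGauge implementation), the class below guards a dict with a `threading.Lock` so that the code mutating it and the code summing over it never overlap. `GuardedQueues`, `set_pending`, and `total_pending` are hypothetical names made up for illustration.

```python
import threading
from typing import Dict


class GuardedQueues:
    """Hypothetical holder for per-destination pending counts (not Synapse code)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[str, int] = {}

    def set_pending(self, destination: str, count: int) -> None:
        # Writers (the "main thread") take the lock before mutating the dict.
        with self._lock:
            self._pending[destination] = count

    def total_pending(self) -> int:
        # The metrics callback takes the same lock before iterating.
        with self._lock:
            return sum(self._pending.values())


def run_demo() -> int:
    queues = GuardedQueues()

    def writer() -> None:
        for i in range(2000):
            queues.set_pending(f"host{i}.example", 1)

    thread = threading.Thread(target=writer)
    thread.start()
    # Collect repeatedly while the writer thread is still adding entries.
    for _ in range(200):
        queues.total_pending()
    thread.join()
    return queues.total_pending()


final_total = run_demo()
print(final_total)
```

Note that the lock only helps if every code path that mutates the dict also takes it, so in practice this would mean touching the main-thread code that adds and removes destination queues as well as the gauge callback.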
