Draft: Migrate to OpenTelemetry tracing #13400
Conversation
pyproject.toml

```diff
@@ -196,7 +196,7 @@ oidc = ["authlib"]
 systemd = ["systemd-python"]
 url_preview = ["lxml"]
 sentry = ["sentry-sdk"]
-opentracing = ["jaeger-client", "opentracing"]
+opentelemetry = ["opentelemetry-api", "opentelemetry-sdk"]
```
I don't know what I am doing with the dependencies here. Please double check this stuff.
```
Invalid type StreamToken for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
```

Had to add a few more logs to find this instance since the warning doesn't give much info about where I am setting this invalid attribute. This was good enough to find it in the code.

```
BoundedAttributes __setitem__ key=since_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481272, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
BoundedAttributes __setitem__ key=now_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481287, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
BoundedAttributes __setitem__ key=token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481291, typing_key=0, receipt_key=340, account_data_key=1237, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
```
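A minimal sketch (not the actual Synapse fix) of the constraint behind that warning: OTEL attribute values must be `str`/`bool`/`int`/`float`/`bytes` or sequences of those, so rich objects like `StreamToken` have to be stringified before being attached to a span.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def set_complex_attribute(span: trace.Span, key: str, value: object) -> None:
    # `value` stands in for something like Synapse's StreamToken; str() keeps
    # the information while satisfying the allowed attribute value types.
    span.set_attribute(key, str(value))

with tracer.start_as_current_span("sync") as span:
    set_complex_attribute(span, "since_token", {"stream": 1787})  # placeholder value
```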
Fix error:

```
AttributeError: 'SpanContext' object has no attribute 'get'
```

`Context`:

```
{'current-span-1a226c96-a5db-4412-bcaa-1fdd34213c5c': _Span(name="sendToDevice", context=SpanContext(trace_id=0x5d2dcc3fdc8205046d60a5cd18672ac6, span_id=0x715c736ff5f4d208, trace_flags=0x01, trace_state=[], is_remote=False))}
```

`SpanContext`:

```
SpanContext(trace_id=0xf7cd9d058b7b76f364bdd649c4ba7b8a, span_id=0x287ce71bac31bfc4, trace_flags=0x01, trace_state=[], is_remote=False)
```
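For context, an OTEL `Context` is a dict-like mapping (hence the `.get` the error complains about), while a `SpanContext` only carries the trace/span IDs and flags. A minimal sketch of wrapping a remote `SpanContext` so it can be passed where a `Context` is expected (plain OTEL APIs, not Synapse's helpers):

```python
from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags, set_span_in_context

# A remote SpanContext like the one in the error above (values are illustrative).
remote_span_context = SpanContext(
    trace_id=0xF7CD9D058B7B76F364BDD649C4BA7B8A,
    span_id=0x287CE71BAC31BFC4,
    is_remote=True,
    trace_flags=TraceFlags(0x01),
)

# Wrapping in a NonRecordingSpan and setting it in a Context gives us a real
# mapping that APIs expecting `context.get(...)` will accept.
parent_context = set_span_in_context(NonRecordingSpan(remote_span_context))

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("sendToDevice", context=parent_context):
    pass
```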
The `incoming-federation-request` vs `process-federation-request` split was first introduced in #11870

- Span for remote trace: `incoming-federation-request`
  - `child_of` reference: `origin_span_context`
  - `follows_from` reference: `servlet_span`
- Span for local trace: `process-federation-request`
  - `child_of` reference: `servlet_span` (by the nature of it being active)
  - `follows_from` reference: `incoming-federation-request`
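A rough sketch, using plain OTEL APIs rather than Synapse's wrappers, of how the structure above can be expressed after the migration: each old `child_of` reference becomes the parent `context`, and each `follows_from` reference becomes a `Link`.

```python
from opentelemetry import trace
from opentelemetry.trace import Link, NonRecordingSpan, SpanContext, set_span_in_context

tracer = trace.get_tracer(__name__)

def handle_incoming_request(origin_span_context: SpanContext) -> None:
    # Span for the remote trace: parented on the origin server's span,
    # linked to our local servlet span.
    servlet_span = trace.get_current_span()
    remote_parent = set_span_in_context(NonRecordingSpan(origin_span_context))
    with tracer.start_span(
        "incoming-federation-request",
        context=remote_parent,
        links=[Link(context=servlet_span.get_span_context())],
    ) as incoming_span:
        # Span for the local trace: parented on the servlet span (it is the
        # currently active span), linked back to `incoming-federation-request`.
        with tracer.start_as_current_span(
            "process-federation-request",
            links=[Link(context=incoming_span.get_span_context())],
        ):
            pass  # ... actually handle the request ...
```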
```diff
@@ -701,15 +701,15 @@ def new_transaction(
         exception_callbacks=exception_callbacks,
     )
     try:
-        with opentracing.start_active_span(
+        with tracing.start_active_span(
             "db.txn",
```
When viewing a trace for a servlet, we have the `db.{desc}` span but the child database spans are missing, like `db.connection`, `db.txn`, `db.query` 🤔

Probably need to assert that the parents are set correctly in the tests as well (particularly with deferreds).
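A hypothetical sketch of the kind of assertion suggested here, using the SDK's in-memory exporter rather than Synapse's actual test harness (and ignoring the deferred plumbing):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Collect finished spans in memory so the test can inspect parent/child links.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer(__name__)

with tracer.start_as_current_span("db.some_desc"):
    with tracer.start_as_current_span("db.txn"):
        pass

spans = {span.name: span for span in exporter.get_finished_spans()}
# The db.txn span should be recorded as a child of the db.{desc} span.
assert spans["db.txn"].parent is not None
assert spans["db.txn"].parent.span_id == spans["db.some_desc"].context.span_id
```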
Conflicts: synapse/logging/opentracing.py tests/logging/test_opentracing.py
Conflicts: poetry.lock synapse/federation/federation_client.py synapse/federation/federation_server.py synapse/handlers/federation.py synapse/handlers/federation_event.py synapse/logging/opentracing.py synapse/rest/client/room.py synapse/storage/controllers/persist_events.py synapse/storage/controllers/state.py
Conflicts: poetry.lock synapse/api/auth.py synapse/federation/federation_client.py synapse/logging/opentracing.py synapse/rest/client/keys.py synapse/rest/client/sendtodevice.py synapse/storage/schema/__init__.py
Conflicts: synapse/storage/schema/__init__.py
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 512, in _async_render
    callback_return = await raw_callback_return
  File "/usr/local/lib/python3.9/site-packages/synapse/federation/transport/server/_base.py", line 357, in new_func
    remote_parent_span = create_non_recording_span()
  File "/usr/local/lib/python3.9/site-packages/synapse/logging/tracing.py", line 502, in create_non_recording_span
    return opentelemetry.trace.NonRecordingSpan(
AttributeError: 'NoneType' object has no attribute 'trace'
```
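For context, this failure mode comes from `opentelemetry` being an optional import that evaluates to `None` when the extra isn't installed. A hedged sketch of the pattern and a guard (the names mirror the traceback, but the guard shown is an assumption, not necessarily the PR's actual fix):

```python
try:
    import opentelemetry
    import opentelemetry.trace
except ImportError:
    # Tracing is an optional extra; every helper must cope with this being None.
    opentelemetry = None  # type: ignore[assignment]

def create_non_recording_span():
    """Create a no-op span that does not get recorded or sampled."""
    if opentelemetry is None:
        # Hypothetical guard: without it we hit `'NoneType' object has no attribute 'trace'`.
        return None

    return opentelemetry.trace.NonRecordingSpan(
        opentelemetry.trace.INVALID_SPAN_CONTEXT
    )
```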
Hopefully fix:

```
  File "/home/runner/work/synapse/synapse/synapse/storage/controllers/persist_events.py", line 246, in add_to_queue
    links=[Link(end_item.tracing_span_context)],
builtins.TypeError: __init__() takes 1 positional argument but 2 were given
```
Conflicts: synapse/storage/schema/__init__.py
Conflicts: .github/workflows/tests.yml poetry.lock synapse/storage/schema/__init__.py
Conflicts: synapse/handlers/message.py synapse/logging/opentracing.py
Conflicts: poetry.lock (conflicts not fixed) synapse/handlers/message.py synapse/handlers/relations.py synapse/storage/databases/main/devices.py synapse/storage/schema/__init__.py
Conflicts: docs/usage/configuration/config_documentation.md poetry.lock synapse/handlers/message.py synapse/http/server.py synapse/logging/opentracing.py synapse/rest/client/keys.py synapse/rest/client/room.py
Conflicts: poetry.lock
```python
    span = start_span(
        name=name,
        context=context,
        kind=kind,
        attributes=attributes,
        links=links,
        start_time=start_time,
        record_exception=record_exception,
        set_status_on_exception=set_status_on_exception,
        tracer=tracer,
    )

    # Equivalent to `tracer.start_as_current_span`
    return opentelemetry.trace.use_span(
        span,
        end_on_exit=end_on_exit,
        record_exception=record_exception,
        set_status_on_exception=set_status_on_exception,
    )
```
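Hypothetical usage of the wrapper above: the returned `use_span(...)` context manager activates the span for the block and ends it on exit (parameter names as shown in the code, everything else assumed):

```python
# `start_active_span` is the wrapper defined above in this PR's tracing module.
with start_active_span("db.txn", attributes={"db.statement": "SELECT 1"}) as span:
    span.set_attribute("rows", 1)
```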
Moving this over from #13440 (comment):

Occasionally, I see spans that have a super massive duration and show up as `213503982d 8h` in the Jaeger UI. It obviously looks like some sort of max-duration bug.

The raw data for those spans is below. Only these spans are messed up, and the bogus `end_time` does not affect any more parent spans before `_process_pulled_event`.
`db.get_partial_state_events`:

```json
{
  "traceID": "563478e5d75db03e9fc028822fde0649",
  "spanID": "5b63b0172763c5cf",
  "flags": 1,
  "operationName": "db.get_partial_state_events",
  "references": [
    {
      "refType": "CHILD_OF",
      "traceID": "563478e5d75db03e9fc028822fde0649",
      "spanID": "10c9f2a2c50e286f"
    }
  ],
  "startTime": 1659994809436162,
  "duration": 18446744073709517944,
  "tags": [
    // ...
  ],
  "logs": [],
  "processID": "p1",
  "warnings": null
}
```
`_process_pulled_event`:

```json
{
  "traceID": "563478e5d75db03e9fc028822fde0649",
  "spanID": "6e9ee2608b03b542",
  "flags": 1,
  "operationName": "_process_pulled_event",
  "references": [
    {
      "refType": "CHILD_OF",
      "traceID": "563478e5d75db03e9fc028822fde0649",
      "spanID": "af2c9c8d08284134"
    }
  ],
  "startTime": 1659994809432356,
  "duration": 18446744073709532019,
  "tags": [
    // ...
  ],
  "logs": [],
  "processID": "p1",
  "warnings": null
}
```
That's definitely close to the Python "max int size" (unsigned word, 64 bits): `sys.maxsize * 2 + 1` (or `2^64 - 1`) -> `18446744073709551615` (via https://stackoverflow.com/questions/7604966/maximum-and-minimum-values-for-ints)
It seems like it's some negative value getting turned into an unsigned integer so it wraps to the end of the value range. But I don't see the math where that can happen yet.
`start_time` and `end_time` are calculated by `time.time_ns()` (`_time_ns`), and the Jaeger exporter calculates `startTime` and `duration` with this small `_nsec_to_usec_round` function.
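For reference, the conversion that helper performs boils down to a rounded nanoseconds-to-microseconds division, i.e. something equivalent to (an approximation, not a verbatim copy of the exporter source):

```python
def _nsec_to_usec_round(nsec: int) -> int:
    # Round nanoseconds to the nearest microsecond, as the Jaeger exporter does
    # when translating span timestamps and durations.
    return (nsec + 500) // 10**3
```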
To see if it's just some `endTime (0) - startTime (some positive number) = some negative number` type of problem: if we take the Python max value minus the duration we see from `_process_pulled_event`, `18446744073709551615 - 18446744073709532019` -> `19596` microseconds. TODO: Is that reasonable?
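A quick sketch of the arithmetic behind that theory: a negative nanosecond duration, re-interpreted as an unsigned 64-bit integer, lands exactly in this range.

```python
MAX_UINT64 = 2**64 - 1             # 18446744073709551615, i.e. sys.maxsize * 2 + 1
observed = 18446744073709532019    # "duration" reported for the `_process_pulled_event` span

print(MAX_UINT64 - observed)       # 19596, the figure quoted above
print(observed - 2**64)            # -19597, the negative duration this would wrap around from
```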
Closing as our existing OpenTracing setup seems to just work over UDP inside Docker (for Complement tests) on my Linux machine now, so I don't have a personal need for this anymore (it didn't work properly on macOS). This is probably best tackled with a fresh PR using some of the learnings from here anyway.
Migrate to OpenTelemetry tracing
Fix #11850
Blockers
- Twisted release that includes the `contextvars` fixes
  - `twisted==22.8.0` (2022-09-06) was released 🚀 - need to try it out!
- `jaeger-client` to `opentelemetry` (#11850 (comment))
- `force_tracing_for_users`
Dev notes
via https://stackoverflow.com/a/70064450/796832
Typing
Type hints for optional imports:
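The snippet behind that Stack Overflow link isn't reproduced above; a hedged sketch of the usual pattern (assuming Synapse keeps opentelemetry as an optional extra):

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only imported for the type checker; the runtime fallback below may be None.
    from opentelemetry.trace import Span

try:
    import opentelemetry.trace
except ImportError:
    opentelemetry = None  # type: ignore[assignment]

def get_active_span() -> Optional["Span"]:
    """Return the active span, or None when the opentelemetry extra isn't installed."""
    if opentelemetry is None:
        return None
    return opentelemetry.trace.get_current_span()
```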
OpenTelemetry

- `tags.py`
- `child_of` to supplying the `context` on an OTEL span
- `follows_from` `references` -> `links` on an OTEL span
- `follows_from` and `child_of`? See https://opentracing.io/specification/

Code:
- `synapse/logging/opentracing.py`
- `synapse/config/tracer.py`
OTEL code:
- `SpanAttributes` (`opentelemetry-semantic-conventions/src/opentelemetry/semconv/trace/__init__.py`)

Sampling:
- `opentracing`: https://github.com/jaegertracing/jaeger-client-python#debug-traces-forced-sampling
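On the OTEL side, sampling is configured via samplers on the `TracerProvider`. A minimal sketch (not Synapse's actual config, and it does not cover the jaeger-debug-id style forced sampling in the link above):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 100% of locally-started traces, but defer to the caller's sampling
# decision when a parent span context is present.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(1.0)))
```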
UDP on macOS (maybe Docker)

Error when using `jaeger.thrift` over the Thrift-compact protocol with UDP on port `6831`:

homeserver.yaml
Fix: open-telemetry/opentelemetry-python#1061 (comment) -> https://www.jaegertracing.io/docs/1.19/client-libraries/#emsgsize-and-udp-buffer-limits -> jaegertracing/jaeger-client-node#124
To make those changes stick between OS reboots, make sure to add it to `/etc/sysctl.conf` as well:

/etc/sysctl.conf
You also have to set `udp_split_oversized_batches` to split the batches up if they go over the 65k limit set in OTEL:

homeserver.yaml
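The homeserver.yaml snippet itself isn't captured above; as a rough sketch of the underlying exporter option it maps onto (exporter and parameter names per the opentelemetry-exporter-jaeger-thrift package; the Synapse config plumbing is assumed):

```python
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
    # Split emitted batches that would exceed the max UDP packet size (~65k)
    # instead of failing with EMSGSIZE.
    udp_split_oversized_batches=True,
)
```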
Using thrift over UDP (port `6831`) to communicate with the Jaeger agent doesn't seem to work from within a Complement test Docker container. I wonder why this is the case? I was seeing this same behavior with the Jaeger `opentracing` stuff. Is the UDP connection being oversaturated (1065 spans in one trace)? Can the Jaeger agent in Docker not keep up? We see some spans come over, but never the main overarching servlet span, which is probably the last to be exported. Using the HTTP Jaeger collector endpoint seems to work fine for getting the whole trace, though (`collector_endpoint: "http://localhost:14268/api/traces?format=jaeger.thrift"`).

TODO
- `synapse/logging/scopecontextmanager.py`
- `start_active_span_follows_from`
- `start_active_span_from_edu`
- `synapse/federation/transport/server/_base.py` (search for `scope`)
- `docs/tracing.md`
- `Failed to detach context` errors (`ValueError: <Token var=<ContextVar name='current_context' default={} at 0x111b80cc0> at 0x111b98880> was created in a different Context`)
- `force_tracing_for_users`
Pull Request Checklist
- Pull request includes a changelog file (e.g. "Changed `EventStore` to `EventWorkerStore`."), using markdown where necessary, mostly for `code blocks`.
- Pull request includes a sign off
- Code style is correct (run the linters)