
Trace invalidation from start to finish #710

Merged: dwwoelfel merged 8 commits into main from e2e-tracer on Jan 14, 2025

Conversation

dwwoelfel (Contributor)

Right now we don't have a great way of seeing the full lifecycle of a transaction, from when the user calls db.transact to the point where the clients subscribed to the affected queries re-render with new data. Our best metric is manually visiting the examples page and checking/unchecking tasks.

This PR introduces end-to-end tracing that will allow us to track from when transact! finishes to when we finish delivering the final websocket :refresh message for the transaction. It's not quite the full latency that the user sees, but it's close.

We use the transaction id to tie all of the spans into a single trace. The tx-id is the one value we can track across multiple machines and know will be identical everywhere. We generate our own parent trace-id and span-id from the tx-id and then create parent spans with that trace-id, which lets spans emitted by different machines all share the same parent.
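As a rough sketch of that idea (the helper names and the zero-padding scheme below are illustrative assumptions, not necessarily what this PR ships): every machine that handles the transaction sees the same tx-id, so each one can derive identical ids locally with no coordination.

```clojure
;; Illustrative only: derive deterministic, OpenTelemetry-sized ids from the
;; tx-id. Trace ids are 32 hex chars (128 bits) and span ids are 16 hex chars
;; (64 bits), so zero-padding the tx-id gives every machine the same parent
;; trace-id and span-id for a given transaction.
(defn tx-id->trace-id [tx-id]
  (format "%032x" (long tx-id)))

(defn tx-id->parent-span-id [tx-id]
  (format "%016x" (long tx-id)))
```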

It produces a trace diagram that looks like this:

[image: trace diagram]

All of the spans are just single points because the parent spans have no way of knowing when the child spans complete.

We only log one out of every 10,000 transactions (configurable via instant-config), and we add an extra :entropy attr to encourage Refinery to forward our spans.
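A minimal sketch of what that could look like (the `sample-rate` var and helper names are placeholders; the real sample rate is read from instant-config):

```clojure
;; Illustrative only: trace roughly 1 in 10,000 transactions and tag the
;; spans with a random :entropy attribute.
(def sample-rate 10000)

(defn trace-tx? [tx-id]
  (zero? (mod tx-id sample-rate)))

(defn with-entropy [attrs]
  ;; A random, high-cardinality attribute makes each sampled trace look
  ;; unique to Refinery, encouraging its sampler to forward the spans.
  (assoc attrs :entropy (rand-int Integer/MAX_VALUE)))
```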


View Vercel preview at instant-www-js-e2e-tracer-jsv.vercel.app.

@stopachka (Contributor) left a comment

Nice!

```clojure
(when (instance? Throwable ret)
  (throw ret)))))
p (promise)]
(tracer/with-span! {:name "ws/send-json!"
```
dwwoelfel (Contributor, Author)

Tracing the full time it takes to send the JSON to the client. Looking at Honeycomb, the send lock is almost never locked when we get here.

stopachka (Contributor)

Ah, perhaps it's because with grouped-queue, many of our messages are serialized per session. I wonder whether the lock will start to come into effect if we get something like set-presence and transact at the same time.

dwwoelfel (Contributor, Author)

I think the reason we rarely wait on the lock is that calling WebSockets/sendText is very quick; Undertow is probably just putting the message onto a queue. All of the waiting happens between when we call sendText and when the complete callback is called.
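A minimal sketch of the send path being discussed, assuming Undertow's WebSockets/sendText with a completion callback and the tracer/with-span! macro from the excerpt above (the namespace, tracer alias, and timeout are assumptions, not the PR's exact code):

```clojure
(ns example.ws-send
  (:require [instant.util.tracer :as tracer]) ;; assumed alias for the tracer ns
  (:import (io.undertow.websockets.core WebSocketCallback WebSocketChannel WebSockets)))

(defn send-json! [^WebSocketChannel ws-channel ^String json-str]
  (let [p (promise)]
    (tracer/with-span! {:name "ws/send-json!"}
      ;; sendText returns almost immediately (Undertow queues the frame), so
      ;; the span mostly measures the gap between this call and the
      ;; completion callback firing.
      (WebSockets/sendText json-str
                           ws-channel
                           (reify WebSocketCallback
                             (complete [_ _channel _ctx]
                               (deliver p :sent))
                             (onError [_ _channel _ctx throwable]
                               (deliver p throwable))))
      (let [ret (deref p 5000 ::timeout)] ;; 5s timeout is an arbitrary choice
        (when (instance? Throwable ret)
          (throw ret))))))
```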

dwwoelfel merged commit 1f61651 into main on Jan 14, 2025. 27 checks passed.
dwwoelfel deleted the e2e-tracer branch on January 14, 2025 at 20:27.