make tracing a task public so self-tracing is possible #6972
Conversation
Force-pushed from 35d68a7 to 1716248
cc @carllerche, also for input on how to make this pass CI without removing the test.
Force-pushed from d93b7cf to 5375c24
[PR status: I'm having some people check out this PR to see if it helps them debug some "slow task" problems. After I see the lessons learned I'll make it less WIP.]
Chiming in, @arielb1 asked me to poke at this implementation in a somewhat realistic scenario where it might be useful. One such case I've encountered is operating a service that uses a Redis cluster as a backend for distributed precision/adaptive throttling, with various multi-step state operations and varying key cardinality. Redis's single-threaded event loop and key space partitioning make it prone to bottlenecks if code is poorly optimized, and debugging performance issues can be complex and require a decent amount of context on Redis's underlying behaviors: Redis-side per-command logs typically don't include time spent waiting in the event loop queue, metrics tend to be heavily aggregated, and performance issues are difficult to trigger directly without large-scale load testing (with realistic traffic shapes).

A bit of magic to systematically trace the slower futures, without a lot of manual instrumentation by end users and without introducing new bottlenecks due to blanket log/metric output, would be quite handy. This change would open that door, and other libraries could then wrap their futures with self-tracing logic.

I threw together a crude simulation where I ran a Redis cluster locally (8 nodes, key space evenly distributed). I then simulated a big (and fairly hot) key alongside a bunch of normal keys. My wrapper future takes a trace and dumps it, along with a human-readable key name, if the duration of the total lookup is > 500ms.

This specific implementation is pretty heavy-handed for the use case above compared to a plain timer + simple event output. But I can imagine end-user scenarios with more complex futures that contain multiple I/O calls, different function contexts, etc., where the task trace might be handy. Ariel has another open PR that will also probably make access to the backtrace more useful. Probably the larger benefit is for libraries rather than direct usage by end users. It might be nice for @arielb1 to look at how this API feels in e.g. a tower layer, which seems like a good use case for this.

My code and output are below. You'll see a subset of my bigkey calls tripping the threshold, as well as certain regular keys that are routed to the same node and stuck behind the bigkeys in line. Here is my code:
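(A minimal sketch of the wrapper described above, not the exact code from this experiment: it assumes a build with `RUSTFLAGS="--cfg tokio_unstable"` and the `tokio::runtime::dump::Trace::root` / `Trace::capture` surface this PR makes public; the label and 500 ms threshold plumbing are illustrative.)

```rust
use std::future::Future;
use std::pin::pin;
use std::time::{Duration, Instant};

use tokio::runtime::dump::Trace;

/// Drive `future` to completion, capturing a task trace on every poll.
/// If the whole operation exceeds `threshold`, dump the most recent trace
/// together with a human-readable label (e.g. the Redis key name).
async fn trace_if_slow<F: Future>(label: &str, threshold: Duration, future: F) -> F::Output {
    let started = Instant::now();
    // `Trace::root` marks this future as the root of the captured traces.
    let mut future = pin!(Trace::root(future));
    let mut last_trace = None;

    let output = std::future::poll_fn(|cx| {
        // `Trace::capture` runs the closure (here: one poll of the wrapped
        // future) and returns the recorded trace alongside the poll result.
        let (poll, trace) = Trace::capture(|| future.as_mut().poll(cx));
        last_trace = Some(trace);
        poll
    })
    .await;

    if started.elapsed() > threshold {
        if let Some(trace) = last_trace {
            eprintln!(
                "slow operation `{label}` took {:?}; trace of last poll:\n{trace}",
                started.elapsed()
            );
        }
    }
    output
}
```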
And here is my output:
And then for completeness, here you can see the overlap of the bigkey + regular key futures that were running long - all were routed to node 4:
Going to also throw in a test against a more realistic application that has more complicated usage of the tokio runtime, to see if anything rattles loose.
I did some more testing in a real application. It is an axum server that handles user flows across a series of endpoints: vending JavaScript to the client, vending inputs for a Hashcash proof-of-work challenge, then various crypto operations to validate the solution and vend an encrypted token. I injected the self-tracing task functionality in an outer tower middleware layer (sketched after this comment). It dumps a trace if a request takes longer than 500 ms.

I then injected a 5% chance of a sleep at the point where we make a remote call to AWS DynamoDB. This mimics a real performance bottleneck we encountered, due to an overlarge table, that was annoying to debug because of a lack of specific metrics around that remote call at the time. I then simulated realistic user flows that hit all endpoints in the browser, acquired tokens, etc. I didn't test at load, but I did send enough traffic to hit a bit of concurrency.

I didn't see any signs of strange behavior with regard to the executor or otherwise; it behaved as expected. The stack trace was useful and pointed directly to the line of code where the sleep was added. See abbreviated output below:
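(For reference, a hedged sketch of how that middleware wiring could look with axum 0.7's `middleware::from_fn`, reusing a helper like the `trace_if_slow` sketch shown earlier in this thread; the route, label, and threshold are illustrative, not the production code.)

```rust
use std::time::Duration;

use axum::{
    extract::Request,
    middleware::{self, Next},
    response::Response,
    routing::get,
    Router,
};

// Self-trace the rest of the middleware stack plus the handler, dumping a task
// trace whenever the whole request takes longer than 500 ms. `trace_if_slow`
// is the hypothetical helper sketched in the earlier comment.
async fn dump_slow_requests(req: Request, next: Next) -> Response {
    trace_if_slow("http request", Duration::from_millis(500), next.run(req)).await
}

fn app() -> Router {
    Router::new()
        .route("/health", get(|| async { "ok" }))
        .layer(middleware::from_fn(dump_slow_requests))
}
```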
Force-pushed from 24f51f0 to 048978f
Now ready for review.
Looks reasonable enough.
Force-pushed from 49e2cbd to ba0f34d
Seems like the doc-tests fail to compile.
Fixed the doctests.
A few nits, but this mostly LGTM.
LGTM
Waiting for you, @Noah-Kennedy.
Bumps tokio from 1.43.0 to 1.44.0.

Release notes (sourced from tokio's releases):

Tokio v1.44.0 (March 7th, 2025)

This release changes the from_std method on sockets to panic if a blocking socket is provided. We determined this change is not a breaking change as Tokio is not intended to operate using blocking sockets. Doing so results in runtime hangs and should be considered a bug. Accidentally passing a blocking socket to Tokio is one of the most common user mistakes. If this change causes an issue for you, please comment on #7172.

Added
- coop: add task::coop module (#7116)
- process: add Command::get_kill_on_drop() (#7086)
- sync: add broadcast::Sender::closed (#6685, #7090)
- sync: add broadcast::WeakSender (#7100)
- sync: add oneshot::Receiver::is_empty() (#7153)
- sync: add oneshot::Receiver::is_terminated() (#7152)

Fixed
- fs: empty reads on File should not start a background read (#7139)
- process: calling start_kill on exited child should not fail (#7160)
- signal: fix CTRL_CLOSE, CTRL_LOGOFF, CTRL_SHUTDOWN on windows (#7122)
- sync: properly handle panic during mpsc drop (#7094)

Changes
- runtime: clean up magic number in registration set (#7112)
- coop: make coop yield using waker defer strategy (#7185)
- macros: make select! budget-aware (#7164)
- net: panic when passing a blocking socket to from_std (#7166)
- io: clean up buffer casts (#7142)

Changes to unstable APIs
- rt: add before and after task poll callbacks (#7120)
- tracing: make the task tracing API unstable public (#6972)

Documented
- docs: fix nesting of sections in top-level docs (#7159)
- fs: rename symlink and hardlink parameter names (#7143)
- io: swap reader/writer in simplex doc test (#7176)
- macros: docs about select! alternatives (#7110)
- net: rename the argument for send_to (#7146)
- process: add example for reading Child stdout (#7141)
- process: clarify Child::kill behavior (#7162)
- process: fix grammar of the ChildStdin struct doc comment (#7192)
- runtime: consistently use worker_threads instead of core_threads (#7186)

#6685: tokio-rs/tokio#6685
#6972: tokio-rs/tokio#6972
#7086: tokio-rs/tokio#7086
#7090: tokio-rs/tokio#7090
... (truncated)

Commits
- 8182ecf chore: prepare Tokio v1.44.0 (#7202)
- a258bff ci: enable printing in multi thread loom tests (#7200)
- e076d21 process: clarify Child::kill behavior (#7162)
- 042433c net: debug_assert on creating a tokio socket from a blocking one (#7166)
- 0284d1b macros: make select! budget-aware (#7164)
- 710bc80 rt: coop should yield using waker defer strategy (#7185)
- a2b12bd readme: adjust release schedule to once per month (#7191)
- e7b593c process: fix grammar of the ChildStdin struct doc comment (#7192)
- 3aaf4a5 coop: adjust grammar in tests/coop_budget.rs (#7173)
- 8e741c1 tokio: mark 1.43 as LTS (#7189)

Additional commits viewable in compare view.
There is some desire to make it possible for tasks to trace themselves to discover slow wakeups, something similar to the test I added.
This PR makes some functions public (but unstable) to make that easier.
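For anyone trying this out: the items only exist when Tokio is built with `--cfg tokio_unstable`, and (per the 1.44.0 changelog entry) they end up under `tokio::runtime::dump`. A rough sketch of the surface, with the caveat that exact paths and signatures may drift:

```rust
// Compiled only when RUSTFLAGS="--cfg tokio_unstable" is set for the whole build.
#[cfg(tokio_unstable)]
use tokio::runtime::dump::Trace;

// `Trace::capture` runs a closure (typically one poll of a future previously
// wrapped with `Trace::root`) and returns the trace recorded during that call
// alongside the closure's result; the trace can be printed via its Display impl.
#[cfg(tokio_unstable)]
fn capture_one<R>(poll_once: impl FnOnce() -> R) -> (R, Trace) {
    Trace::capture(poll_once)
}
```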
I've shared it with someone I'm working with to see whether it helps them debug their "slow task" problems. I'll make this PR less WIP after I get their feedback.
WIP: actually add docs.