Cache wasm stacks in thread-local storage to avoid a mutex. #3922

Closed
rlane wants to merge 5 commits

Conversation

rlane
Contributor

@rlane rlane commented May 26, 2023

I noticed that when running many concurrent WASM function calls on a 48-core VM, the mutex protecting the stack freelist was contended. Replacing it with thread-local storage reduced CPU usage by 75% in a benchmark (20% on my 14-core laptop) and didn't affect single-threaded performance.
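A minimal sketch of the idea, with a hypothetical `WasmStack` type standing in for wasmer's actual per-call stack (names and sizes are illustrative, not the PR's code):

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the per-call stack type; not wasmer's real type.
struct WasmStack {
    bytes: Vec<u8>,
}

impl WasmStack {
    fn new(size: usize) -> Self {
        WasmStack { bytes: vec![0u8; size] }
    }
}

// Before: a process-wide freelist behind a Mutex, contended when many
// threads enter and leave wasm calls concurrently.
//
//   static STACK_FREELIST: Mutex<Vec<WasmStack>> = Mutex::new(Vec::new());
//
// After: each thread keeps its own cache, so the hot path never touches
// shared state and never blocks.
thread_local! {
    static STACK_CACHE: RefCell<Vec<WasmStack>> = RefCell::new(Vec::new());
}

fn acquire_stack(size: usize) -> WasmStack {
    STACK_CACHE.with(|cache| {
        cache
            .borrow_mut()
            .pop()
            .unwrap_or_else(|| WasmStack::new(size))
    })
}

fn release_stack(stack: WasmStack) {
    STACK_CACHE.with(|cache| cache.borrow_mut().push(stack));
}
```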

@rlane rlane requested a review from syrusakbary as a code owner May 26, 2023 22:43
Member

@syrusakbary syrusakbary left a comment

This PR looks great to me. Waiting for @ptitSeb or @theduke feedback as well

Contributor

@ptitSeb ptitSeb left a comment

It looks good to me. Thread local instead of a mutex, why not!

@rlane
Contributor Author

rlane commented Jun 7, 2023

Anything I can help with to get this merged?

@theduke
Contributor

theduke commented Jun 7, 2023

@rlane we haven't forgotten about this, we've just been discussing internally whether this is the right way to go for all situations, or whether we want an additional layered cache tier, with a thread-local cache as the first layer and a shared cache behind it.

Or whether we need a feature flag to toggle this, because there are scenarios where keeping a stack around in each thread could be suboptimal for memory consumption, like a large thread pool whose threads only run occasionally.

We are a bit swamped this week, we'll get back to this next week at the latest.

Would we be fine with just putting the thread-local cache behind a feature flag to punt on the larger discussion for now, @syrusakbary / @ptitSeb?
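A layered variant along these lines might look like the following (a hypothetical sketch, not the design that was eventually merged; the one-stack-per-thread bound and the tier names are assumptions):

```rust
use std::cell::RefCell;
use std::sync::Mutex;

struct WasmStack { /* elided */ }

// Second tier: a shared, lock-protected cache that threads spill into, so a
// large, mostly idle thread pool doesn't pin one stack per thread forever.
static SHARED_CACHE: Mutex<Vec<WasmStack>> = Mutex::new(Vec::new());

thread_local! {
    // First tier: at most one stack per thread, taken and returned without locking.
    static LOCAL_CACHE: RefCell<Option<WasmStack>> = RefCell::new(None);
}

fn acquire_cached_stack() -> Option<WasmStack> {
    // Fast path: thread-local slot, no locking.
    if let Some(stack) = LOCAL_CACHE.with(|c| c.borrow_mut().take()) {
        return Some(stack);
    }
    // Slow path: the shared lock is only reached on a local miss.
    SHARED_CACHE.lock().unwrap().pop()
}

fn release_stack(stack: WasmStack) {
    // Keep one stack in the local slot; any overflow goes to the shared tier.
    if let Some(evicted) = LOCAL_CACHE.with(|c| c.borrow_mut().replace(stack)) {
        SHARED_CACHE.lock().unwrap().push(evicted);
    }
}
```

Gating the thread-local tier behind a cargo feature, as suggested above, would let deployments with large, mostly idle thread pools opt out of the per-thread memory cost.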

@Arshia001
Member

@rlane #4196 was just merged with an alternate implementation. Closing this issue.

@Arshia001 Arshia001 closed this Sep 1, 2023
@rlane
Contributor Author

rlane commented Oct 30, 2023

The crossbeam alternative has some performance issues on large machines (224 cores). Here's what I see in perf top:

  61.24%  tournament         [.] crossbeam_queue::seg_queue::SegQueue<T>::push
   9.50%  tournament         [.] crossbeam_queue::seg_queue::SegQueue<T>::pop
... rest of the profile
