Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Fix asset_debug_server hang. There should be at most one ThreadExecut… #7825

Closed

Conversation

shuoli84
Copy link
Contributor

…or's ticker for one thread.

Objective

  • Fix debug_asset_server hang.

Solution

  • Reuse the thread_local executor for MainThreadExecutor resource, so there will be only one ThreadExecutor for main thread.
  • If ThreadTickers from same executor, they are conflict with each other. Then only tick one.

@james7132 james7132 requested a review from hymm February 26, 2023 13:26
@james7132 james7132 added C-Bug An unexpected or incorrect behavior A-Tasks Tools for parallel and async work labels Feb 26, 2023
@alice-i-cecile alice-i-cecile added this to the 0.10 milestone Feb 26, 2023
Copy link
Contributor

@hymm hymm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure both these changes are necessary? The MainThreadExecutor doesn't get cloned into debug asset server app.

crates/bevy_tasks/src/task_pool.rs Outdated Show resolved Hide resolved
crates/bevy_tasks/src/task_pool.rs Outdated Show resolved Hide resolved
@hymm
Copy link
Contributor

hymm commented Feb 26, 2023

Is this only a problem with tick and not with run? We technically have more executors than just the thread executors running on each thread. We also have the shared multithreaded executor and the thread local executors too.

Is it possible to write a test in bevy_tasks that shows that this deadlocks before this PR?

@james7132
Copy link
Member

@NiklasEi can you check if this fixes your use of the debug asset server? I want to make sure this fixes that key issue before giving this a more thorough review.

@shuoli84
Copy link
Contributor Author

shuoli84 commented Feb 27, 2023

Are you sure both these changes are necessary? The MainThreadExecutor doesn't get cloned into debug asset server app.

There are mainly two changes:
*If two tickers generated from same executor, then just tick one.
This fixes dead lock for following code, but not enough for asset_debug_server.

use bevy_app::App;
use bevy_ecs::prelude::*;

fn run_sub_app(mut sub_app: NonSendMut<DebugApp>) {
    sub_app.app.update();
}

struct DebugApp {
    app: App,
}

fn main() {
    let mut app = bevy_app::App::new();

    let sub_app = bevy_app::App::new();
    app.insert_non_send_resource(DebugApp { app: sub_app });
    app.add_system(run_sub_app);

    app.update();
}
  1. Share the ThreadExecutor instance. 1 + 2 fixed the deadlock for debug_asset_server.

Is this only a problem with tick and not with run? We technically have more executors than just the thread executors running on each thread. We also have the shared multithreaded executor and the thread local executors too.

Is it possible to write a test in bevy_tasks that shows that this deadlocks before this PR?

Yes, it is possible to write a test to repro the dead lock, i'll give it a try.

This bug is hard to reason, I spend like 20+ hours on it :'). The only fact I am sure of is: "if the Ticker get leaked, then the async_executor enter the "troubled" state, that it can't be notify." But this doesn't promise a deadlock, if the thread can be unparked by any other means, it still able to proceed. I've tried to create a separate thread just unpark the main thread, it also able to run.

Is this only a problem with tick and not with run?

In theory run also affected, it is just a wrapper on top of ticker. It didn't happen maybe just because we don't have such a code path.

EDIT: format

@shuoli84
Copy link
Contributor Author

Just added an example, without the fix, it would block. You can try disable check by returning false in conflict_with.

Cargo.toml Outdated Show resolved Hide resolved
@shuoli84
Copy link
Contributor Author

shuoli84 commented Feb 28, 2023

I think I figured out the details. The deadlock is caused by following steps.

  • 1 The outer executor, which ticking with ticker.tick().or(ticker.tick()), this is what main branch code works. It actually spins on three futures, ticker_1, ticker_2, and the job_future. Like following code
let forever = async {
    loop {
        ticker_1.tick().or(ticker_2.tick()).await
    }
}
future::block_on(forever, work_future);
  • 2 at first run, ticker_1, ticker_2 and work_future all returns pending, that means this thread is parked on work_future, and ticker_1, ticker_2 are allocated and not freed.
  • 3 then there is a task dispatched to this executor, which also unpark the main thread. So ticker_1 can proceed, it get the task to run, and call runnable.run on it. At the same time, ticker_2 still alive, which means the executor is at the state of notified.
  • 4 The task ticker_1 is running or polling, is the subapp schedule code, which also calls into MultiThreadExecutor, and finally a future::block_on on systems' running. So main thread parked again, it kinda run with same stack as the outer app. The most significant difference is, current executor is in 'notified' state.
  • 5 Here is the last but the most import step, the ComputePool spawn a task (which is the exclusive system running) to the ThreadExecutor and waiting for it. But the ThreadExecutor is already notified, so this spawn could not unpark the main thread. So the deadlock happened.

Back to the code fix, if we replace the ticker_1.tick().or(ticker_2.tick()) with ticker_1.tick(), then at step 3, when ticker_1 get a task, there is no hanging ticker_2, which makes the executor back to unnotified, then the spawning in step 5 able to notify the executor and unpark main thread successfully.

EDIT: format

@shuoli84
Copy link
Contributor Author

@hymm @james7132 ping

@hymm
Copy link
Contributor

hymm commented Feb 28, 2023

I think I figured out the details. The deadlock is caused by following steps.

This explanation makes sense to me. Thanks for figuring it out. So my PR #7564 fixes things by not reusing the executor and so it the inner executor doesn't get into the weird state. While this PR fixes things by not ever having the second ticker in an or. My test code in the other pr's comments never deadlocked, because I needed to add a second executor that the outer schedule is using.

I'm pretty sure I prefer the change in this PR. We don't need to keep recreating the scope executor and we're no longer doing the double ticking which always felt a little weird.

In the longer term, this seems to be a bug in async executor and we should consider upstreaming a fix.

Copy link
Member

@NiklasEi NiklasEi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NiklasEi can you check if this fixes your use of the debug asset server? I want to make sure this fixes that key issue before giving this a more thorough review.

I just checked it and yes, this PR also fixes my stuck integration tests 👍

@james7132
Copy link
Member

Just wanted to quickly chime in and thank @shuoli84 for digging into this rather complex bug. I'll leave a full review soon. Definitely want this fix before 0.10 goes live.

Copy link
Contributor

@hymm hymm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me now. The logic for which tickers need to be ticked in scope is getting a little complicated, so it'd be nice to have some unit tests for that, but not going to block on that. The multiple tickers code should be getting removed when we remove !Send resources from the world.

crates/bevy_tasks/src/thread_executor.rs Outdated Show resolved Hide resolved
Copy link
Member

@james7132 james7132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sans a few code quality nits, this looks good to me. Great work!

crates/bevy_tasks/src/task_pool.rs Outdated Show resolved Hide resolved
crates/bevy_tasks/src/thread_executor.rs Outdated Show resolved Hide resolved
@shuoli84
Copy link
Contributor Author

shuoli84 commented Mar 2, 2023

Just opened a pr #7865, which basically runs the load_gltf example with extra feature, the ci captures the hang. So it should be enough for now. Will continue the test validation with that pr.

@shuoli84 shuoli84 requested a review from james7132 March 2, 2023 06:37
@james7132 james7132 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 2, 2023
let scope_ticker = scope_executor.ticker().unwrap();
if let Some(external_ticker) = external_executor.ticker() {
if tick_task_pool_executor {
let external_ticker = if !external_executor.is_same(scope_executor) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change. This is definitely easier to follow.

@cart
Copy link
Member

cart commented Mar 2, 2023

bors r+

bors bot pushed a commit that referenced this pull request Mar 2, 2023
#7825)

…or's ticker for one thread.

# Objective

- Fix debug_asset_server hang.

## Solution

- Reuse the thread_local executor for MainThreadExecutor resource, so there will be only one ThreadExecutor for main thread. 
- If ThreadTickers from same executor, they are conflict with each other. Then only tick one.
@bors bors bot changed the title Fix asset_debug_server hang. There should be at most one ThreadExecut… [Merged by Bors] - Fix asset_debug_server hang. There should be at most one ThreadExecut… Mar 2, 2023
@bors bors bot closed this Mar 2, 2023
Shfty pushed a commit to shfty-rust/bevy that referenced this pull request Mar 19, 2023
bevyengine#7825)

…or's ticker for one thread.

# Objective

- Fix debug_asset_server hang.

## Solution

- Reuse the thread_local executor for MainThreadExecutor resource, so there will be only one ThreadExecutor for main thread. 
- If ThreadTickers from same executor, they are conflict with each other. Then only tick one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Tasks Tools for parallel and async work C-Bug An unexpected or incorrect behavior S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants