Skip to content

Fix supervisor race condition on Windows trampoline bounce#18170

Closed
zanieb wants to merge 3 commits intomainfrom
zb/fix-bounce-race-ii
Closed

Fix supervisor race condition on Windows trampoline bounce#18170
zanieb wants to merge 3 commits intomainfrom
zb/fix-bounce-race-ii

Conversation

@zanieb
Copy link
Member

@zanieb zanieb commented Feb 23, 2026

Create trampoline children with CREATE_SUSPENDED, assign them to the supervisor job object, then resume the primary thread. This avoids a post-spawn assignment race for short-lived child processes.

Hopefully, this addresses a bug observed as a CI flake where python_upgrade_transparent_from_venv_preview failed with:

  error: uv trampoline failed to assign child process to job object                                     
    Caused by: failed to assign process to job object (os error -2147024891)                            

Create trampoline children with CREATE_SUSPENDED, assign them to the
supervisor job object, then resume the primary thread. This removes the
post-spawn assignment race for short-lived child processes.

Regenerate prebuilt Windows trampoline binaries for i686, x86_64, and
aarch64.
@zanieb zanieb changed the title Spawn trampoline child suspended before job assignment Fix supervisor race condition on trampoline bounce Feb 23, 2026
@zanieb zanieb added the bug Something isn't working label Feb 23, 2026
@zanieb zanieb changed the title Fix supervisor race condition on trampoline bounce Fix supervisor race condition on Windows trampoline bounce Feb 23, 2026
@zanieb zanieb added the windows Specific to the Windows platform label Feb 23, 2026
@zanieb zanieb marked this pull request as ready for review February 24, 2026 23:34
@zanieb zanieb requested a review from EliteTK February 24, 2026 23:35
Copy link
Contributor

@EliteTK EliteTK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of this code, but that's not this PR's fault.

We actually don't care if the child dies before we add it to the job object, and an easier fix here would just be to ignore that case.

But! This does fix another race, which is the one where we spawn the process and we get killed before we add the child to the job object. And that's definitely a win.

@EliteTK
Copy link
Contributor

EliteTK commented Feb 25, 2026

Actually, never mind, I can't find anything that explains what happens if we die while we have a suspended child... So this might mean we leave a suspended child lying around forever.

So maybe the best solution for now is to ignore that error... At least then we can't leak a permanently suspended process.

@EliteTK
Copy link
Contributor

EliteTK commented Feb 25, 2026

Found https://devblogs.microsoft.com/oldnewthing/20230209-00/?p=107812 . Apparently we want PROC_THREAD_ATTRIBUTE_JOB_LIST. It's windows 10+ only though. Not sure if we care about supporting anything older as even Windows 10 is now out of support?

Some enterprises may still be running earlier versions either in locked down environments or using extended support patchsets.

@EliteTK
Copy link
Contributor

EliteTK commented Feb 25, 2026

I wonder if we can add ourselves to the job object and then spawn the child. Normally this isn't what you want since you'd be restricting yourself somehow in some way which you don't intend, but in this case it seems like it wouldn't have any unintended impact.

If we are already in a job and it's pre windows 8 then assigning ourselves to another job would fail, but we already have that problem with the existing code since the child will inherit any job we're in.

@zanieb zanieb temporarily deployed to uv-test-publish February 25, 2026 16:09 — with GitHub Actions Inactive
/// before `AssignProcessToJobObject` is called. However, it requires Windows 10+.
///
/// Returns `None` if the attribute list APIs fail (e.g., on older Windows versions),
/// signaling the caller to use the fallback path.
Copy link
Contributor

@EliteTK EliteTK Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nitpick: I would drop this bit after the comma. It's somewhat unbefitting a function doc comment to attempt to document the call site.

Comment on lines +328 to +332
// The first call is expected to fail with ERROR_INSUFFICIENT_BUFFER, returning
// the required size. If it fails for another reason, fall back.
if init_result.is_ok() || attr_list_size == 0 {
return None;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an error and it's not ERROR_INSUFFICIENT_BUFFER then I don't think we can assume that attr_list_size hasn't been modified to some nonsensical value. We should actually explicitly check that the error was the expected one rather than relying on attr_list_size being non-zero.

Comment on lines +334 to +336
// Allocate the attribute list buffer.
let attr_list_buf = vec![0u8; attr_list_size];
let attr_list = LPPROC_THREAD_ATTRIBUTE_LIST(attr_list_buf.as_ptr() as *mut _);
Copy link
Contributor

@EliteTK EliteTK Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong but if we're allocating for u8 then rust only guarantees that the resulting buffer is aligned for u8 at best? Is it safe to cast this to a type with potentially stricter alignment requirements here?

@zanieb zanieb force-pushed the zb/fix-bounce-race-ii branch 3 times, most recently from 4becdef to ad44925 Compare February 25, 2026 17:32
@zanieb zanieb force-pushed the zb/fix-bounce-race-ii branch from ad44925 to b2ac908 Compare February 25, 2026 18:58
.unwrap_or_else(|_| {
error_and_exit("uv trampoline failed to create layout for attribute list");
});
// SAFETY: layout has non-zero size (attr_list_size comes from InitializeProcThreadAttributeList).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// SAFETY: layout has non-zero size (attr_list_size comes from InitializeProcThreadAttributeList).
// SAFETY: layout has non-zero size (attr_list_size comes from InitializeProcThreadAttributeList).

/// before `AssignProcessToJobObject` is called. However, it requires Windows 10+.
///
/// Returns `None` if the attribute list APIs fail (e.g., on older Windows versions).
fn spawn_child_in_job(si: &STARTUPINFOA, child_cmdline: CString, job: &Job) -> Option<HANDLE> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid variables like si? Use a clear name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Name originally mimicked distlib's launcher.c

result
}

/// Inner implementation of [`spawn_child_in_job`] that may fail at any point.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean that this may fail at any point?

Why is this a separate function at all?

// SAFETY: We successfully initialized the attribute list above.
unsafe { DeleteProcThreadAttributeList(attr_list) };

// CreateProcessA failed — fall back to the non-attribute-list path.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing

/// This approach avoids race conditions where the child exits or the parent is killed
/// before `AssignProcessToJobObject` is called. However, it requires Windows 10+.
///
/// Returns `None` if the attribute list APIs fail (e.g., on older Windows versions).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the first point at which this would fail? Won't that tell us if the attribute list API is supported and everything after that should be an error instead of None?

Comment on lines +588 to +592
// Try the race-free path first: spawn the child directly into the job using
// PROC_THREAD_ATTRIBUTE_JOB_LIST (Windows 10+). If that's not available,
// fall back to spawning normally and assigning afterward.
let child_handle = spawn_child_in_job(&si, child_cmdline.clone(), &job)
.unwrap_or_else(|| spawn_child_and_assign_to_job(&si, child_cmdline, &job));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Try the race-free path first: spawn the child directly into the job using
// PROC_THREAD_ATTRIBUTE_JOB_LIST (Windows 10+). If that's not available,
// fall back to spawning normally and assigning afterward.
let child_handle = spawn_child_in_job(&si, child_cmdline.clone(), &job)
.unwrap_or_else(|| spawn_child_and_assign_to_job(&si, child_cmdline, &job));
let child_handle = spawn_child_in_job(&si, child_cmdline.clone(), &job)
// If the Windows 10 API is not available, fallback to the mode that can race
.unwrap_or_else(|| spawn_child_and_assign_to_job(&si, child_cmdline, &job));

@samypr100
Copy link
Collaborator

Found https://devblogs.microsoft.com/oldnewthing/20230209-00/?p=107812 . Apparently we want PROC_THREAD_ATTRIBUTE_JOB_LIST. It's windows 10+ only though. Not sure if we care about supporting anything older as even Windows 10 is now out of support?

One consequence of this approach is that it moves us much further away from matching distlib's approach in launcher.c or cpython's launcher2.c.

The blog post highlights a scenario which I think doesn't exactly apply since in distlib AssignProcessToJobObject is a best-effort attempt, which is why I believe distlib's approach doesn't use suspend either. The attribute list approach can also fail depending on existing nested jobs and parent state (which is a possibility with the flake).

Was the goal of this PR to try to reduce the risk of failure or enforcement? I'd be curious has this flake occured before or is it after the recent trampoline changes.

If the goal was to reduce risk of failure, I think a smaller simpler change as seen below be a better first step instead, e.g.

-    if let Err(e) = unsafe { job.assign_process(child_handle) } {
-        print_job_error_and_exit(
-            "uv trampoline failed to assign child process to job object",
-            e,
-        );
-    }
+    if let Err(e) = unsafe { job.assign_process(child_handle) } {
+        warn!(
+            "uv trampoline failed to assign child process to job object\n  Caused by: {} (os error {})"
+            e.message(),
+            e.code(),
+        );
+    }

This is permissive, matches distlib more closely, and job assignment will still in most cases work.

zanieb added a commit that referenced this pull request Mar 4, 2026
A simpler short-term fix for #18170

It's unclear if the complexity of that pull request is worth it.
@codspeed-hq
Copy link

codspeed-hq bot commented Mar 15, 2026

Merging this PR will not alter performance

✅ 5 untouched benchmarks


Comparing zb/fix-bounce-race-ii (9b01d99) with main (2591bfb)

Open in CodSpeed

@zanieb
Copy link
Member Author

zanieb commented Mar 16, 2026

We decided on #18291 instead for now

@zanieb zanieb closed this Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working windows Specific to the Windows platform

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants