
Commit fdd53b3

Cleanup internal data-structures when process has been forked (#2676)
Closes #1921

### What

The crux of the problem is the following:

> The child process is created with a single thread—the one that called fork(). The entire virtual address space of the parent is replicated in the child, ...

The major consequence of this is that our global `RecordingStream` context is duplicated into the child's memory space, but none of the threads (batcher, tcp-sender, dropper, etc.) are duplicated. When we call `connect()` inside the forked process, we try to replace the global recording-stream, which in turn tries to drop the forked copy of `RecordingStreamInner`. However, with no threads left to process the flush, things simply hang inside that flush call.

We take a few actions to alleviate this problem:
1. Introduce a new SDK function, `cleanup_if_forked`, which compares the process-ids on existing globals and forgets them as necessary.
2. In Python, use `os.register_at_fork` to proactively call `cleanup_if_forked` in any forked child processes.
3. Also call `cleanup_if_forked` inside of `init()` in case we're forking through a more exotic mechanism.
4. Check for the forked state anywhere we potentially flush, to avoid deadlocks and produce a visible user error.

Additionally, it turns out that forked processes bypass the normal Python `atexit` handler, which means we don't get proper shutdown/flush behavior when the forked processes terminate. To help users work around this, we introduce a `@shutdown_at_exit` decorator which can be used to decorate functions launched via multiprocessing.

### Testing

On Linux:

```
$ python examples/python/multiprocessing/main.py
```

Observe that the demo exits cleanly and all data shows up in the viewer.

### Checklist

* [x] I have read and agree to the [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/2676) (if applicable)

- [PR Build Summary](https://build.rerun.io/pr/2676)
- [Docs preview](https://rerun.io/preview/pr%3Ajleibs%2Fcleanup_if_forked/docs)
- [Examples preview](https://rerun.io/preview/pr%3Ajleibs%2Fcleanup_if_forked/examples)
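For a quick sense of the user-facing pattern this enables, here is a condensed sketch of the updated multiprocessing example. It only illustrates the shape of the API; the `log_text_entry` call and the `p.join()` are illustrative additions rather than part of the actual example.

```python
import multiprocessing

import rerun as rr  # pip install rerun-sdk


# Forked workers bypass the normal atexit handler, so flush explicitly when the task ends.
@rr.shutdown_at_exit
def task(child_index: int) -> None:
    # Same application_id as the parent; by default the recording_id is inherited,
    # so the children's log data is merged into a single recording in the viewer.
    rr.init("multiprocessing")
    rr.connect()  # connect this (possibly forked) process to the running viewer

    rr.log_text_entry("task", f"hello from task {child_index}")  # illustrative log call


def main() -> None:
    rr.init("multiprocessing")
    rr.connect()

    task(0)  # run once in the parent for comparison

    for i in [1, 2, 3]:
        p = multiprocessing.Process(target=task, args=(i,))
        p.start()
        p.join()


if __name__ == "__main__":
    main()
```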
1 parent: 9589002

File tree: 6 files changed, +180 −8 lines

crates/re_sdk/src/global.rs (+82)

```diff
@@ -38,6 +38,41 @@ thread_local! {
     static LOCAL_BLUEPRINT_RECORDING: RefCell<Option<RecordingStream>> = RefCell::new(None);
 }
 
+/// Check whether we are the child of a fork.
+///
+/// If so, then our globals need to be cleaned up because they don't have associated batching
+/// or sink threads. The parent of the fork will continue to process any data in the original
+/// globals so nothing is being lost by doing this.
+pub fn cleanup_if_forked_child() {
+    if let Some(global_recording) = RecordingStream::global(StoreKind::Recording) {
+        if global_recording.is_forked_child() {
+            re_log::debug!("Fork detected. Forgetting global Recording");
+            RecordingStream::forget_global(StoreKind::Recording);
+        }
+    }
+
+    if let Some(global_blueprint) = RecordingStream::global(StoreKind::Blueprint) {
+        if global_blueprint.is_forked_child() {
+            re_log::debug!("Fork detected. Forgetting global Blueprint");
+            RecordingStream::forget_global(StoreKind::Recording);
+        }
+    }
+
+    if let Some(thread_recording) = RecordingStream::thread_local(StoreKind::Recording) {
+        if thread_recording.is_forked_child() {
+            re_log::debug!("Fork detected. Forgetting thread-local Recording");
+            RecordingStream::forget_thread_local(StoreKind::Recording);
+        }
+    }
+
+    if let Some(thread_blueprint) = RecordingStream::thread_local(StoreKind::Blueprint) {
+        if thread_blueprint.is_forked_child() {
+            re_log::debug!("Fork detected. Forgetting thread-local Blueprint");
+            RecordingStream::forget_thread_local(StoreKind::Blueprint);
+        }
+    }
+}
+
 impl RecordingStream {
     /// Returns `overrides` if it exists, otherwise returns the most appropriate active recording
     /// of the specified type (i.e. thread-local first, then global scope), if any.
@@ -106,6 +141,15 @@ impl RecordingStream {
         Self::set_any(RecordingScope::Global, kind, rec)
     }
 
+    /// Forgets the currently active recording of the specified type in the global scope.
+    ///
+    /// WARNING: this intentionally bypasses any drop/flush logic. This should only ever be used in
+    /// cases where you know the batcher/sink threads have been lost such as in a forked process.
+    #[inline]
+    pub fn forget_global(kind: StoreKind) {
+        Self::forget_any(RecordingScope::Global, kind);
+    }
+
     // --- Thread local ---
 
     /// Returns the currently active recording of the specified type in the thread-local scope,
@@ -125,6 +169,15 @@ impl RecordingStream {
         Self::set_any(RecordingScope::ThreadLocal, kind, rec)
     }
 
+    /// Forgets the currently active recording of the specified type in the thread-local scope.
+    ///
+    /// WARNING: this intentionally bypasses any drop/flush logic. This should only ever be used in
+    /// cases where you know the batcher/sink threads have been lost such as in a forked process.
+    #[inline]
+    pub fn forget_thread_local(kind: StoreKind) {
+        Self::forget_any(RecordingScope::ThreadLocal, kind);
+    }
+
     // --- Internal helpers ---
 
     fn get_any(scope: RecordingScope, kind: StoreKind) -> Option<RecordingStream> {
@@ -180,6 +233,35 @@ impl RecordingStream {
             },
         }
     }
+
+    fn forget_any(scope: RecordingScope, kind: StoreKind) {
+        match kind {
+            StoreKind::Recording => match scope {
+                RecordingScope::Global => {
+                    if let Some(global) = GLOBAL_DATA_RECORDING.get() {
+                        std::mem::forget(global.write().take());
+                    }
+                }
+                RecordingScope::ThreadLocal => LOCAL_DATA_RECORDING.with(|cell| {
+                    if let Some(cell) = cell.take() {
+                        std::mem::forget(cell);
+                    }
+                }),
+            },
+            StoreKind::Blueprint => match scope {
+                RecordingScope::Global => {
+                    if let Some(global) = GLOBAL_BLUEPRINT_RECORDING.get() {
+                        std::mem::forget(global.write().take());
+                    }
+                }
+                RecordingScope::ThreadLocal => LOCAL_BLUEPRINT_RECORDING.with(|cell| {
+                    if let Some(cell) = cell.take() {
+                        std::mem::forget(cell);
+                    }
+                }),
+            },
+        }
+    }
 }
 
 // ---
```
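The mechanism is easy to restate outside of Rust: remember the PID a stream was created in, and if the current PID differs we are in a forked child whose batcher/sink threads never came along, so the global copy is discarded without the usual flush-on-drop. Below is a toy Python sketch of that idea; the `_Stream` class and `_GLOBAL` variable are hypothetical stand-ins, not SDK code.

```python
import os
import threading


class _Stream:
    """Toy stand-in for RecordingStreamInner: owns a worker thread and remembers its PID."""

    def __init__(self) -> None:
        self._pid_at_creation = os.getpid()
        self._worker = threading.Thread(target=lambda: None, daemon=True)
        self._worker.start()  # in the real SDK this is the batcher/sink pipeline

    def is_forked_child(self) -> bool:
        # fork() only carries over the calling thread, so a PID mismatch means
        # self._worker does not exist in this process.
        return self._pid_at_creation != os.getpid()

    def flush(self) -> None:
        if self.is_forked_child():
            # In the real SDK, flushing here is what hangs: there is no worker to hand off to.
            return
        self._worker.join(timeout=0.0)  # placeholder for the real flush handshake


_GLOBAL: "_Stream | None" = _Stream()


def cleanup_if_forked_child() -> None:
    """Forget the stale global copy in a forked child; the parent keeps the real one."""
    global _GLOBAL
    if _GLOBAL is not None and _GLOBAL.is_forked_child():
        _GLOBAL = None  # discard without flushing, mirroring std::mem::forget
```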

crates/re_sdk/src/lib.rs (+2)

```diff
@@ -26,6 +26,8 @@ pub use re_log_types::{
     ApplicationId, Component, ComponentName, EntityPath, SerializableComponent, StoreId, StoreKind,
 };
 
+pub use global::cleanup_if_forked_child;
+
 #[cfg(not(target_arch = "wasm32"))]
 impl crate::sink::LogSink for re_log_encoding::FileSink {
     fn send(&self, msg: re_log_types::LogMsg) {
```

crates/re_sdk/src/recording_stream.rs (+30)

```diff
@@ -349,10 +349,17 @@ struct RecordingStreamInner {
 
     batcher: DataTableBatcher,
     batcher_to_sink_handle: Option<std::thread::JoinHandle<()>>,
+
+    pid_at_creation: u32,
 }
 
 impl Drop for RecordingStreamInner {
     fn drop(&mut self) {
+        if self.is_forked_child() {
+            re_log::error_once!("Fork detected while dropping RecordingStreamInner. cleanup_if_forked() should always be called after forking. This is likely a bug in the SDK.");
+            return;
+        }
+
         // NOTE: The command channel is private, if we're here, nothing is currently capable of
         // sending data down the pipeline.
         self.batcher.flush_blocking();
@@ -410,8 +417,14 @@ impl RecordingStreamInner {
             cmds_tx,
             batcher,
             batcher_to_sink_handle: Some(batcher_to_sink_handle),
+            pid_at_creation: std::process::id(),
         })
     }
+
+    #[inline]
+    pub fn is_forked_child(&self) -> bool {
+        self.pid_at_creation != std::process::id()
+    }
 }
 
 enum Command {
@@ -591,6 +604,18 @@ impl RecordingStream {
     pub fn store_info(&self) -> Option<&StoreInfo> {
         (*self.inner).as_ref().map(|inner| &inner.info)
     }
+
+    /// Determine whether a fork has happened since creating this `RecordingStream`. In general, this means our
+    /// batcher/sink threads are gone and all data logged since the fork has been dropped.
+    ///
+    /// It is essential that [`crate::cleanup_if_forked_child`] be called after forking the process. SDK-implementations
+    /// should do this during their initialization phase.
+    #[inline]
+    pub fn is_forked_child(&self) -> bool {
+        (*self.inner)
+            .as_ref()
+            .map_or(false, |inner| inner.is_forked_child())
+    }
 }
 
 impl RecordingStream {
@@ -737,6 +762,11 @@ impl RecordingStream {
     ///
     /// See [`RecordingStream`] docs for ordering semantics and multithreading guarantees.
     pub fn flush_blocking(&self) {
+        if self.is_forked_child() {
+            re_log::error_once!("Fork detected during flush. cleanup_if_forked() should always be called after forking. This is likely a bug in the SDK.");
+            return;
+        }
+
         let Some(this) = &*self.inner else {
             re_log::warn_once!("Recording disabled - call to flush_blocking() ignored");
             return;
```

examples/python/multiprocessing/main.py (+11, −8)

```diff
@@ -10,11 +10,19 @@
 import rerun as rr  # pip install rerun-sdk
 
 
+# Python does not guarantee that the normal atexit-handlers will be called at the
+# termination of a multiprocessing.Process. Explicitly add the `shutdown_at_exit`
+# decorator to ensure data is flushed when the task completes.
+@rr.shutdown_at_exit
 def task(child_index: int) -> None:
-    # All processes spawned with `multiprocessing` will automatically
-    # be assigned the same default recording_id.
-    # We just need to connect each process to the the rerun viewer:
+    # In the new process, we always need to call init with the same `application_id`.
+    # By default, the `recording_id` will match the `recording_id` of the parent process,
+    # so all of these processes will have their log data merged in the viewer.
+    # Caution: if you manually specified `recording_id` in the parent, you also must
+    # pass the same `recording_id` here.
     rr.init("multiprocessing")
+
+    # We then have to connect to the viewer instance.
     rr.connect()
 
     title = f"task {child_index}"
@@ -37,11 +45,6 @@ def main() -> None:
 
     task(0)
 
-    # Using multiprocessing with "fork" results in a hang on shutdown so
-    # always use "spawn"
-    # TODO(https://github.com/rerun-io/rerun/issues/1921)
-    multiprocessing.set_start_method("spawn")
-
     for i in [1, 2, 3]:
         p = multiprocessing.Process(target=task, args=(i,))
         p.start()
```
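With the workaround gone, the example runs under the platform's default start method, which on Linux is "fork" — exactly the code path this commit fixes. If you want to force that path explicitly (POSIX only, and not something this commit adds), a snippet along these lines should do it:

```python
import multiprocessing

if __name__ == "__main__":
    # "fork" is POSIX-only; on Linux it is already the default start method.
    multiprocessing.set_start_method("fork")
```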

rerun_py/rerun_sdk/rerun/__init__.py (+48)

```diff
@@ -1,6 +1,9 @@
 """The Rerun Python SDK, which is a wrapper around the re_sdk crate."""
 from __future__ import annotations
 
+import functools
+from typing import Any, Callable, TypeVar, cast
+
 # NOTE: The imports determine what is public API. Avoid importing globally anything that is not public API. Use
 # (private) function and local import if needed.
 import rerun_bindings as bindings  # type: ignore[attr-defined]
@@ -155,6 +158,10 @@ def init(
     global _strict_mode
     _strict_mode = strict
 
+    # Always check whether we are a forked child when calling init. This should have happened
+    # via `_register_on_fork` but it's worth being conservative.
+    cleanup_if_forked_child()
+
     if init_logging:
         new_recording(
             application_id,
@@ -311,6 +318,47 @@ def unregister_shutdown() -> None:
     atexit.unregister(rerun_shutdown)
 
 
+def cleanup_if_forked_child() -> None:
+    bindings.cleanup_if_forked_child()
+
+
+def _register_on_fork() -> None:
+    # Only relevant on Linux
+    try:
+        import os
+
+        os.register_at_fork(after_in_child=cleanup_if_forked_child)
+    except NotImplementedError:
+        pass
+
+
+_register_on_fork()
+
+
+_TFunc = TypeVar("_TFunc", bound=Callable[..., Any])
+
+
+def shutdown_at_exit(func: _TFunc) -> _TFunc:
+    """
+    Decorator to shutdown Rerun cleanly when this function exits.
+
+    Normally, Rerun installs an atexit-handler that attempts to shutdown cleanly and
+    flush all outgoing data before terminating. However, some cases, such as forked
+    processes will always skip this at-exit handler. In these cases, you can use this
+    decorator on the entry-point to your subprocess to ensure cleanup happens as
+    expected without losing data.
+    """
+
+    @functools.wraps(func)
+    def wrapper(*args: Any, **kwargs: Any) -> Any:
+        try:
+            return func(*args, **kwargs)
+        finally:
+            rerun_shutdown()
+
+    return cast(_TFunc, wrapper)
+
+
 # ---
 
 
```
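`os.register_at_fork` is what makes the cleanup automatic in the common case: a callback registered with `after_in_child` runs in the child right after every `fork()`, including the forks that `multiprocessing`'s "fork" start method performs. Here is a minimal, Rerun-independent demonstration of the hook (the `hasattr` guards are there because it only exists where `fork()` does):

```python
import os


def _after_in_child() -> None:
    # Runs in the child process, right after fork() returns there.
    print(f"child pid {os.getpid()} detected a fork")


if hasattr(os, "register_at_fork"):  # only available where fork() exists
    os.register_at_fork(after_in_child=_after_in_child)

if __name__ == "__main__" and hasattr(os, "fork"):
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: the hook above has already printed
    else:
        os.waitpid(pid, 0)  # parent: wait for the child to finish
```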

rerun_py/src/python_bridge.rs (+7)

```diff
@@ -119,6 +119,7 @@ fn rerun_bindings(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
     m.add_function(wrap_pyfunction!(new_recording, m)?)?;
     m.add_function(wrap_pyfunction!(new_blueprint, m)?)?;
     m.add_function(wrap_pyfunction!(shutdown, m)?)?;
+    m.add_function(wrap_pyfunction!(cleanup_if_forked_child, m)?)?;
 
     // recordings
     m.add_function(wrap_pyfunction!(get_application_id, m)?)?;
@@ -349,6 +350,12 @@ fn get_global_data_recording() -> Option<PyRecordingStream> {
     RecordingStream::global(rerun::StoreKind::Recording).map(PyRecordingStream)
 }
 
+/// Cleans up internal state if this is the child of a forked process.
+#[pyfunction]
+fn cleanup_if_forked_child() {
+    rerun::cleanup_if_forked_child();
+}
+
 /// Replaces the currently active recording in the global scope with the specified one.
 ///
 /// Returns the previous one, if any.
```