Skip to content

Conversation

AhmedMousa-ag
Copy link
Contributor

Related

What

  • The hang in the issue Initializing twice with same recording id hangs the SDK #9948 is caused when trying to insert duplicated recording, and since we don't need this behavior, just avoid inserting the new recording. The real reason for hang not known yet, but it's not replicated in PyTests.

  • Moved the py_rerun_warn function from rerun_py/src/dataframe.rs to rerun_py/src/utils.rs for readability.

  • Added a new function that does both logging the warning using re_log::warn macro and PyErr::warn to detect the warning in python as using re_log::warn macro alone isn't enough to detect the warning in python, and using PyErr::warn doesn't log the warning as per your code convention. I suggest to use this function for other warnings in the python bridge

  • Added a unit test for detecting the warning of duplicated recording stream. Unforntually the hanging behavior can't be replicated in PyTests, still not sure why.

@Wumpf Wumpf self-requested a review June 11, 2025 08:17
@Wumpf Wumpf changed the title Rerun py avoid duplicated recording streams Fix initializing two recordings with the same recording id causing SDK hangs Jun 11, 2025
@Wumpf Wumpf added 🪳 bug Something isn't working sdk-python Python logging API include in changelog labels Jun 11, 2025
Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this! Needs some more polish though

@AhmedMousa-ag AhmedMousa-ag requested a review from Wumpf June 11, 2025 19:53
Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running python tests with pixi run py-test shows that a lot of tests are now broken because of the emitted warning. Haven't looked through the exact cause on all of them whether the warning should be handled, or strict mode is leaking etc. but this clearly needs some attention

@AhmedMousa-ag
Copy link
Contributor Author

Which pixi version are you using? it says in the toml file you're using 0.34.0 but the pixi.lock refer to a higher version.

Running python tests with pixi run py-test shows that a lot of tests are now broken because of the emitted warning. Haven't looked through the exact cause on all of them whether the warning should be handled, or strict mode is leaking etc. but this clearly needs some attention

@AhmedMousa-ag
Copy link
Contributor Author

I found that returning the warning result to python caused multipile errors on the python side, I will ignore it as it was. I believe it's enough to see the warning on in terminal, logs and be known in python side. instead of raising as an error and hault the whole process.

@Wumpf
Copy link
Member

Wumpf commented Jun 13, 2025

I really don't want to create precedence in ignoring warnings. This means that strict_mode no longer works as expected everywhere.
I'm looking into fixing up those tests instead

@Wumpf
Copy link
Member

Wumpf commented Jun 13, 2025

yikes. Turns out we never actually delete rust sided recordings ever. I.e. this:

        with rr.RecordingStream("rerun_example_test_recording") as rec:
            pass

will leak a recording on the rust side 😱

Looking into this as well now separately

@Wumpf
Copy link
Member

Wumpf commented Jun 13, 2025

the leaking recordings is actually a known issue 😬

tried to tackle it, but I need help from @abey79 for that. Wasn't too hard to get things to work out without. So this should do now if ci passes.

@AhmedMousa-ag
Copy link
Contributor Author

AhmedMousa-ag commented Jun 13, 2025

Why not to delete the recording on __del__() destructor when the recording stream is deleted? Here

The implementation in rust and in python should be straight forward.

/// Delete a recording stream by id.
#[allow(clippy::fn_params_excessive_bools)]
#[allow(clippy::struct_excessive_bools)]
#[pyfunction]
#[pyo3(signature = (
    recording_id,
))]
fn delete_recording(_py: Python<'_>, recording_id: String) -> PyResult<()> {
    if all_recordings()
        .remove(&StoreId::from_string(
            StoreKind::Recording,
            recording_id.clone(),
        ))
        .is_none()
    {
        utils::py_rerun_warn(format!("Recording id: {recording_id} not found.").as_str())?;
    };
    Ok(())
}

the leaking recordings is actually a known issue 😬

tried to tackle it, but I need help from @abey79 for that. Wasn't too hard to get things to work out without. So this should do now if ci passes.

@Wumpf
Copy link
Member

Wumpf commented Jun 13, 2025

I think that's what we going end up doing in some form or capactiy but I found untangling the ramifications rather complicated. See:

Recording streams are internally already refcounted but we have this all_recordings registry on top and the Python object that wraps the recording is rather confusing as well: removing from the all_recordings registry doesn't do the actual drop which (see above comment links) is "dangerous" if done in the wrong spot, therefore the "real" drop of the inner recording stream (the one wrapped by the recording) happens at another point in time we don't control 😵‍💫
We might first need to make the python objects hold handles rather than "real" recording streams

@AhmedMousa-ag
Copy link
Contributor Author

Ok, so we don't know what the grabage collector doing at any time, but can't we invoke the garabge collection in the destructor? gc.collect().
Usually invoking the garbage collection should be avoided, but is it worth it? will it solve the issue?

@Wumpf
Copy link
Member

Wumpf commented Jun 15, 2025

There's no gurarantees what-so-ever on what gc.collect exactly does, it's more of a hint

@Wumpf Wumpf merged commit 298049f into rerun-io:main Jun 15, 2025
34 checks passed
jleibs added a commit that referenced this pull request Jul 9, 2025
### Related

- Resoles: #10562
- Resolves: #10561
- While fixing the dead-lock originally addressed in:
#10201

### Also
In order to preserve backwards compatibility in environments where
`rr.init()` was depended upon to clean up existing resources
semi-deterministically due to creation of a new stream, this now does a
cleanup sweep of any recordings that only exist in the all_recordings
list.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🪳 bug Something isn't working include in changelog sdk-python Python logging API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Initializing twice with same recording id hangs the SDK
2 participants