Slow `register_callsite()` can cause other subscribers to not receive `register_callsite()` at all #2743

hds · 2023-10-10T14:17:36Z

Bug Report

There is code below that reproduces this issue consistently (increase the sleep time to make it 100%).

I've uploaded a branch containing the test that reproduces the issue and my first attempt at a fix (which doesn't appear to fix anything). There's a draft PR open against v0.1.x to show the different #2744.

Version

tracing-core: 0.1.30
tracing: 0.1.37 (not necessary, can also be reproduced from just tracing-core)

Also confirmed present on the v0.1.x branch.

This error is not reproducible on master.

Platform

Linux and macOS confirmed. Have not tested on Windows.

Linux <name> 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Darwin <name> 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:31 PST 2024; root:xnu-10063.101.15~2/RELEASE_X86_64 x86_64

Crates

tracing-core

Description

If a Subscriber's register_callsite() function takes too long to complete, it can cause other subscribers to not have register_callsite called at all. Even if the second subscriber is the one which is then recording the tracing event at that callsite.

We start two threads, each thread:
a. Registers a local Subscriber (via tracing::subscriber::set_default)
b. Records a single event.

When the first subscriber takes too long inside register_callsite(), the second subscriber doesn't receive a call to register_callsite() at all, instead it receives a call to event() directly.

What is too long?

By ensuring the order in which the threads start executing, a thread::sleep of 50 microseconds will

I can reliably reproduce this issue with the code below with a thread::sleep of 50 microseconds, although it will sometimes succeed. I have seen it reproduce with a sleep of only 20 microseconds, although that was less frequent (3 failures out of 10 tries just now). The machine is a 6 core 3.xGHz Dell desktop that's about 6 years old - so it's not top of the line, but also not underpowered. Keep in mind that the test is calling println!, so that may be adding additional delay.

Here is a full reproduction of the issue as an integration test is below. Note that the assert that would actually make this test fail has been commented out so that it can be observed that subscriber-2 doesn't receive a call to register_callsite() at all (it is not that it receives it after the call to event() which I had originally suspected).

use std::{
    ptr,
    sync::atomic::{AtomicPtr, Ordering},
    thread::{self, JoinHandle},
    time::Duration,
};

use tracing_core::{
    callsite::{Callsite as _, DefaultCallsite},
    dispatcher::set_default,
    field::{FieldSet, Value},
    span, Dispatch, Event, Kind, Level, Metadata, Subscriber,
};

struct TestSubscriber {
    sleep: Duration,
    callsite: AtomicPtr<Metadata<'static>>,
}

impl TestSubscriber {
    fn new(sleep_micros: u64) -> Self {
        Self {
            sleep: Duration::from_micros(sleep_micros),
            callsite: AtomicPtr::new(ptr::null_mut()),
        }
    }
}

impl Subscriber for TestSubscriber {
    fn register_callsite(&self, metadata: &'static Metadata<'static>) -> tracing_core::Interest {
        if !self.sleep.is_zero() {
            thread::sleep(self.sleep);
        }

        self.callsite
            .store(metadata as *const _ as *mut _, Ordering::SeqCst);

        tracing_core::Interest::always()
    }

    fn event(&self, event: &tracing_core::Event<'_>) {
        let stored_callsite = self.callsite.load(Ordering::SeqCst);
        let event_callsite: *mut Metadata<'static> = event.metadata() as *const _ as *mut _;

        // This assert is the actual test.
        assert_eq!(
            stored_callsite, event_callsite,
            "stored callsite: {stored_callsite:#?} does not match event \
            callsite: {event_callsite:#?}. Was `event` called before \
            `register_callsite`?"
        );
    }

    fn enabled(&self, _metadata: &Metadata<'_>) -> bool {
        true
    }
    fn new_span(&self, _span: &span::Attributes<'_>) -> span::Id {
        span::Id::from_u64(0)
    }
    fn record(&self, _span: &span::Id, _values: &span::Record<'_>) {}
    fn record_follows_from(&self, _span: &span::Id, _follows: &span::Id) {}
    fn enter(&self, _span: &tracing_core::span::Id) {}
    fn exit(&self, _span: &tracing_core::span::Id) {}
}

fn subscriber_thread(idx: usize, register_sleep_micros: u64) -> JoinHandle<()> {
    thread::Builder::new()
        .name(format!("subscriber-{idx}"))
        .spawn(move || {
            // We use a sleep to ensure the starting order of the 2 threads.
            let subscriber = TestSubscriber::new(register_sleep_micros);
            let _dispatch_guard = set_default(&Dispatch::new(subscriber));

            static CALLSITE: DefaultCallsite = {
                // The values of the metadata are unimportant
                static META: Metadata<'static> = Metadata::new(
                    "event ",
                    "module::path",
                    Level::INFO,
                    None,
                    None,
                    None,
                    FieldSet::new(&["message"], tracing_core::callsite::Identifier(&CALLSITE)),
                    Kind::EVENT,
                );
                DefaultCallsite::new(&META)
            };
            let _interest = CALLSITE.interest();

            let meta = CALLSITE.metadata();
            let field = meta.fields().field("message").unwrap();
            let message = format!("event-from-{idx}", idx = idx);
            let values = [(&field, Some(&message as &dyn Value))];
            let value_set = CALLSITE.metadata().fields().value_set(&values);

            Event::dispatch(meta, &value_set);

            // Wait a bit for everything to end (we don't want to remove the subscriber
            // immediately because that will influence the test).
            thread::sleep(Duration::from_millis(10));
        })
        .expect("failed to spawn thread")
}

/// Regression test for missing register_callsite call (#2743)
///
/// This test provokes the race condition which causes the second subscriber to not receive a
/// call to `register_callsite` before it receives a call to `event`.
///
/// Because the test depends on the interaction of multiple dispatchers in different threads,
/// it needs to be in a test file by itself.
#[test]
fn event_before_register() {
    let subscriber_1_register_sleep_micros = 30;
    let subscriber_2_register_sleep_micros = 0;

    let jh1 = subscriber_thread(1, subscriber_1_register_sleep_micros);

    // This delay ensures that the event callsite has interest() called first.
    thread::sleep(Duration::from_micros(10));
    let jh2 = subscriber_thread(2, subscriber_2_register_sleep_micros);

    jh1.join().expect("failed to join thread");
    jh2.join().expect("failed to join thread");
}

If we add println! messages to the register_callsite and event methods, we get the following output:

subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x0000000000000000)
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)

When the sleep in the first subscriber is set to 0, the expected behavior is observed:

subscriber-1: register_callsite: 0x000055555564bb88
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)
subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)

My best guess is that the reregistration of the callsite (because there's a new subscriber subscriber-2 in the new thread) causes subscriber-2 to skip registering the callsite of the event when it executes it.

A description of the fix can be found in #2938.

The text was updated successfully, but these errors were encountered:

This branch contains code to reproduce an issue where 2 thread local subscribers interfer so that when the first one takes too long to complete `register_callsite()`, the second subscriber doesn't have `register_callsite()` called at all. It also contains an attempt at a fix by removing a race condition in `DefaultCallsite::register`. This fix may be necessary anyway, but it isn't sufficient. The issue for this problem is #2743.

hawkw · 2023-10-10T16:10:04Z

Thanks for the repro, I can look into the potential fix and see what else is necessary to make it pass!

We have identified that some of the `console-subscriber` integration tests are flaky (#473). This appears to be a result of the following issue in `tracing` tokio-rs/tracing#2743. Flaky tests are really worse than no tests. So these tests will be removed until the flakiness can be fixed.

There are 2 triggers which will cause a subscriber to receive a call to `Subscriber::register_callsite` for a specific callsite. 1. The first time the event or span at that callsite is executed. 2. When a new subscriber is added or removed (for example, calls to `set_default` or `with_default`) It is trigger (2) that will cause a new subscriber to receive `Subscriber::register_callsite` for all the callsites which had already been registered before it became active. When a callsite is registered for trigger (1), the callsite starts in state `UNREGISTERED`. The first thread to encounter the callsite will transition it to `REGISTERING` and determine the overall interest for the callsite by registering with all known dispatchers (which will call into `Subscriber::register_callsite`). Once that is complete, the callsite is added to the list of all known callsites and its state is transitioned to `REGISTERED`. is (re)built for all known dispatchers. The callsite starts in state `UNREGISTERED`. The This calls down into `Subscriber::register_callsite` for each subscriber. Once that is complete, the callsite is added to the global list of known callsites. While the callsite interest is being rebuilt, other threads that encounter the callsite will be given `Interest::sometimes()` until the registration is complete. However, if a new subscriber is added during this window, all the interest for all callsites will be rebuilt, but because the new callsite (in state `REGISTERING`) won't be included because it isn't yet in the global list of callsites. This can cause a case where that new subscriber being added won't receive `Subscriber::register_callsite` before it receives the subsequent call to `Subscriber::event` or `Subscriber::new_span`. The documentation on [Registering Callsites] is not very explicit on this point, but it does suggest that `Subscriber::register_callsite` will be called before the call to either `Subscriber::event` or `Subscriber::new_span`, and the current behavior can break this implicit contract. [Registering Callsites]: https://docs.rs/tracing-core/0.1.32/tracing_core/callsite/index.html#registering-callsites This change swaps the order of rebuilding the callsite interest and adding the callsite to the global list so that the callsite gets pushed first, avoiding this window in which a subscriber won't get a call to `register_callsite`. As such, a callsite may have its interest read before it is set. In this case, the existing implementation will return `Interest::sometimes()` for the `DefaultCallsite` implementation. Other implementations (outside of the `tracing` project) may perform this differently, but in this case, there is no documented guarantee regarding the ordering. A regression test is included which provokes the race condition 100% of the time before the changes in this fix. Fixes: #2743

There are 2 triggers which will cause a subscriber to receive a call to `Subscriber::register_callsite` for a specific callsite. 1. The first time the event or span at that callsite is executed. 2. When a new subscriber is added or removed (for example, calls to `set_default` or `with_default`) It is trigger (2) that will cause a new subscriber to receive `Subscriber::register_callsite` for all the callsites which had already been registered before it became active. When a callsite is registered for trigger (1), the callsite starts in state `UNREGISTERED`. The first thread to encounter the callsite will transition it to `REGISTERING` and determine the overall interest for the callsite by registering with all known dispatchers (which will call into `Subscriber::register_callsite`). Once that is complete, the callsite is added to the list of all known callsites and its state is transitioned to `REGISTERED`. is (re)built for all known dispatchers. The callsite starts in state `UNREGISTERED`. The This calls down into `Subscriber::register_callsite` for each subscriber. Once that is complete, the callsite is added to the global list of known callsites. While the callsite interest is being rebuilt, other threads that encounter the callsite will be given `Interest::sometimes()` until the registration is complete. However, if a new subscriber is added during this window, all the interest for all callsites will be rebuilt, but because the new callsite (in state `REGISTERING`) won't be included because it isn't yet in the global list of callsites. This can cause a case where that new subscriber being added won't receive `Subscriber::register_callsite` before it receives the subsequent call to `Subscriber::event` or `Subscriber::new_span`. The documentation on [Registering Callsites] is not very explicit on this point, but it does suggest that `Subscriber::register_callsite` will be called before the call to either `Subscriber::event` or `Subscriber::new_span`, and the current behavior can break this implicit contract. [Registering Callsites]: https://docs.rs/tracing-core/0.1.32/tracing_core/callsite/index.html#registering-callsites This change swaps the order of rebuilding the callsite interest and adding the callsite to the global list so that the callsite gets pushed first, avoiding this window in which a subscriber won't get a call to `register_callsite`. As such, a callsite may have its interest read before it is set. In this case, the existing implementation will return `Interest::sometimes()` for the `DefaultCallsite` implementation. Other implementations (outside of the `tracing` project) may perform this differently, but in this case, there is no documented guarantee regarding the ordering. A regression test is included which provokes the race condition 100% of the time before the changes in this fix. Fixes: #2743 Co-authored-by: David Barsky <[email protected]>

hds mentioned this issue Oct 10, 2023

bug(tracing): register_callsite doesn't get called #2744

Draft

hawkw added kind/bug Something isn't working crate/core Related to the `tracing-core` crate labels Oct 10, 2023

hds mentioned this issue Oct 11, 2023

Flaky console-subscriber integration tests tokio-rs/console#473

Open

This was referenced Nov 16, 2023

fix(subscriber): mitigate race in Callsites::contains tokio-rs/console#474

Closed

task: make task span explicit root tokio-rs/tokio#6158

Merged

hds mentioned this issue Apr 10, 2024

test(subscriber): remove very flaky tests tokio-rs/console#544

Merged

hds mentioned this issue Apr 16, 2024

core: fix missed register_callsite error #2938

Merged

ThomWright mentioned this issue Sep 6, 2024

Incorrect Interest::Never is cached and never invalidated on v0.1 #2874

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow `register_callsite()` can cause other subscribers to not receive `register_callsite()` at all #2743

Slow `register_callsite()` can cause other subscribers to not receive `register_callsite()` at all #2743

hds commented Oct 10, 2023 •

edited

Loading

hawkw commented Oct 10, 2023 •

edited

Loading

Slow register_callsite() can cause other subscribers to not receive register_callsite() at all #2743

Slow register_callsite() can cause other subscribers to not receive register_callsite() at all #2743

Comments

hds commented Oct 10, 2023 • edited Loading

Bug Report

Version

Platform

Crates

Description

hawkw commented Oct 10, 2023 • edited Loading

Slow `register_callsite()` can cause other subscribers to not receive `register_callsite()` at all #2743

Slow `register_callsite()` can cause other subscribers to not receive `register_callsite()` at all #2743

hds commented Oct 10, 2023 •

edited

Loading

hawkw commented Oct 10, 2023 •

edited

Loading