Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(dist/features): ship tracing and friends by default #3803

Merged
merged 7 commits into from
Jun 15, 2024

Conversation

rami3l
Copy link
Member

@rami3l rami3l commented May 2, 2024

Warning

Since #3876, RUST_LOG has been replaced with RUSTUP_LOG for configuring Rustup's internal logging system. The rest of the thread has used RUST_LOG due to historic reasons.

Part of #3790.

Rationale

Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:

## Usage
The normal [OTLP environment
variables](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md)
can be used to customise its behaviour, but often the simplest thing is to just
run a Jaeger docker container on the same host:
```sh
docker run -d --name jaeger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -e COLLECTOR_OTLP_ENABLED=true -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 4317:4317 -p 4318:4318 -p 14250:14250 -p 14268:14268 -p 14269:14269 -p 9411:9411 jaegertracing/all-in-one:latest
```
Then build rustup-init with tracing:
```sh
cargo build --features=otel
```
Run the operation you want to analyze:
```sh
RUSTUP_FORCE_ARG0="rustup" ./target/debug/rustup-init show
```
And [look in Jaeger for a trace](http://localhost:16686/search?service=rustup).

After some experiment, it turned out that we actually can ship the tracing features by default without forcing the user to face OTEL connection errors on a daily basis.

To clarify, this does not mean Rustup is setting up a central (a.k.a. phone-home-style) telemetry mechanism, and we will keep the tracing disabled by default unless RUST_LOG has been explicitly set.

Concerns

@rami3l

This comment was marked as outdated.

@djc
Copy link
Contributor

djc commented May 2, 2024

I think we should use a basic console-based tracing-subscriber setup:

pub(super) fn subscribe() -> tracing::subscriber::DefaultGuard {
    let sub = tracing_subscriber::FmtSubscriber::builder()
        .with_max_level(tracing::Level::TRACE)
        .with_writer(|| TestWriter)
        .finish();
    tracing::subscriber::set_default(sub)
}

struct TestWriter;

impl Write for TestWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        print!(
            "{}",
            str::from_utf8(buf).expect("tried to log invalid UTF-8")
        );
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        io::stdout().flush()
    }
}

This is what I've been using in Quinn for many years. Every test just starts by calling let _guard = subscribe();, which has been a highly effective tool for test observability.

If we do this we can massively simplify the scaffolding for traces/logging:

  • Get rid of all the opentelemetry dependencies, which I don't think we need
  • Remove the use of the otel feature
  • Remove the custom test macro

@rbtcollins
Copy link
Contributor

Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:

I don't think this is true. We've not asked people to build with otel enabled that I'm remembering. The OS level traces we use to debug fundamental problems are from strace / truss and other similar tools. Otel / tracing! is not deployed widely enough within rustup to be a replacement for such things.

@rbtcollins
Copy link
Contributor

Get rid of all the opentelemetry dependencies, which I don't think we need
Remove the use of the otel feature
Remove the custom test macro

Please don't - while the OS level debugging is vital, for doing investigations on performance, having a nice report with spans and the detailed call tree is very useful, and since we configure it off by default it has very little overhead to maintain or build with. Really only the all-features-build test matters.

@djc
Copy link
Contributor

djc commented May 2, 2024

Get rid of all the opentelemetry dependencies, which I don't think we need
Remove the use of the otel feature
Remove the custom test macro

Please don't - while the OS level debugging is vital, for doing investigations on performance, having a nice report with spans and the detailed call tree is very useful, and since we configure it off by default it has very little overhead to maintain or build with. Really only the all-features-build test matters.

How many times have you used it in the past year? IMO while custom test macro + opentelemetry dependencies may not impose run-time overhead for downstream users, it does impose significant maintenance overhead that may not be warranted for the additional insight compared to just tracing-subscriber built-in (and maybe tokio-console level output).

@djc
Copy link
Contributor

djc commented May 2, 2024

Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:

I don't think this is true. We've not asked people to build with otel enabled that I'm remembering. The OS level traces we use to debug fundamental problems are from strace / truss and other similar tools. Otel / tracing! is not deployed widely enough within rustup to be a replacement for such things.

IMO the important point is that we should have a user-facing solution like "enable RUST_LOG=trace and give us the output of that", which seems like a decent method of getting better insight into problems that happen only in specific environments, which seems to be an important source of issues for rustup.

@rami3l
Copy link
Member Author

rami3l commented May 3, 2024

I just checked and it looks like tokio-console doesn’t currently have a timeline view (tokio-rs/console#129), so I imagine opentelemetry and jaeger are here to stay for longer...

For now I plan to:

  • Ship tracing by default with a console-based subscriber.
  • If possible, reimplement our current logging system using that subscriber (with a fmt::layer() that mimics the original output style).
  • Keep the opentelemetry related stuff behind the otel feature.

More specifically, I imagine having multiple subscribers (tokio-rs/tracing#971) based on env vars and features:

  • A "classic" subscriber that targets process().stderr(), has a classic output format, will only print rustup log lines up to a certain level (info), and will be disabled if RUST_LOG is set.
  • A "tracing" subscriber that also targets process().stderr() and is not limited to rustup (so we could have tonic log lines as well, for example). Its activation is mutually exclusive with the "classic" subscriber, and the precise logging level will be controlled by RUST_LOG.
  • An OpenTelemetry subscriber (could be replaced by tokio-console in the future, but not just yet) available behind the otel feature.

Finally:

  • We need to make sure this subscriber's use of CLI colors is coherent with the current system (incl. env variable controls via RUSTUP_TERM_COLOR, etc).
  • With the consistent use of RUST_LOG, RUSTUP_DEBUG should be retired accordingly.
  • Make tracing::instrument always run, i.e. no longer behind #[cfg_attr(feature = "otel", ..)] (split into feat(log): unhide tracing::instrument from behind feature = "otel" #3873).

Waiting for #3367 might be worthwhile, since this will change the startup process. Merged.

@djc
Copy link
Contributor

djc commented May 30, 2024

#3367 has been merged, would be good to rebase this!

@rami3l rami3l force-pushed the refactor/tracing branch 8 times, most recently from 2142f7e to 6d54875 Compare June 2, 2024 04:41
@rami3l

This comment was marked as resolved.

@rami3l rami3l force-pushed the refactor/tracing branch from 7dfa263 to 54f078c Compare June 2, 2024 12:30
@rbtcollins
Copy link
Contributor

How many times have you used it in the past year? IMO while custom test macro + opentelemetry dependencies may not impose run-time overhead for downstream users, it does impose significant maintenance overhead that may not be warranted for the additional insight compared to just tracing-subscriber built-in (and maybe tokio-console level output).

I have been largely absent for the last year, so perhaps a better thing to ask is 'of the times you've been investigating performance, how often have you used detailed call graph with opentelemetry' - and the answer there is every single time. I depend on it.

IMO the important point is that we should have a user-facing solution like "enable RUST_LOG=trace and give us the output of that", which seems like a decent method of getting better insight into problems that happen only in specific environments, which seems to be an important source of issues for rustup.

I think that that is important too, for sure.

@rami3l rami3l force-pushed the refactor/tracing branch 2 times, most recently from e919914 to a667805 Compare June 14, 2024 06:06
@rami3l
Copy link
Member Author

rami3l commented Jun 14, 2024

@djc The new approach proposed in #3871 (review) has worked! This PR has been thus unblocked from test refactoring and I'll just need to do some rather simple tweaks (see #3803 (comment) for the list of remaining tasks) to get it ready 🎉

PS: I've also changed the signature of tracing_subscriber() to accept a Process in order to align with #3871 better.

@rami3l rami3l force-pushed the refactor/tracing branch 2 times, most recently from 6489353 to 0cf0c6b Compare June 14, 2024 11:34
@rami3l
Copy link
Member Author

rami3l commented Jun 14, 2024

@rbtcollins @djc I've just finished all 3 cleanup tasks, unfortunately however the 3rd (the final) commit (0cf0c6b) has made our CI red. To make sure that only this commit was bad though, I've reverted it temporarily in 1b7662e, and indeed the CI is green again.

I don't see how a mechanical change like 0cf0c6b can cause Rustup to stack overflow on Windows (https://github.com/rust-lang/rustup/actions/runs/9515433133/job/26229657328)... Am I missing something?

@djc
Copy link
Contributor

djc commented Jun 14, 2024

Maybe try a branch where you apply 0cf0c6b to 1.27.1 directly (or some other commit maybe from before #3868) to see if that works? It might be some "interesting" interaction that causes recursion in an unobvious way?

@rami3l
Copy link
Member Author

rami3l commented Jun 14, 2024

Maybe try a branch where you apply 0cf0c6b to 1.27.1 directly (or some other commit maybe from before #3868) to see if that works? It might be some "interesting" interaction that causes recursion in an unobvious way?

@djc That's a good idea! I'll give it a try tomorrow.

That being said, the overall tracing thing is working pretty well, so it might make some sense to leave 0cf0c6b for another PR. We'll have to test it against another branch via a PR anyways.

@djc
Copy link
Contributor

djc commented Jun 14, 2024

That being said, the overall tracing thing is working pretty well, so it might make some sense to leave 0cf0c6b for another PR.

Oh yes, I'm in favor of merging without it!

@rami3l rami3l force-pushed the refactor/tracing branch from 1b7662e to 8d8c030 Compare June 14, 2024 14:29
@rami3l rami3l marked this pull request as ready for review June 14, 2024 14:29
@rami3l rami3l requested a review from djc June 14, 2024 14:30
src/cli/log.rs Show resolved Hide resolved
src/cli/log.rs Outdated Show resolved Hide resolved
src/cli/log.rs Show resolved Hide resolved
src/cli/log.rs Outdated Show resolved Hide resolved
src/utils/notify.rs Show resolved Hide resolved
src/utils/notify.rs Outdated Show resolved Hide resolved
src/cli/log.rs Outdated Show resolved Hide resolved
src/cli/log.rs Outdated Show resolved Hide resolved
src/cli/log.rs Outdated Show resolved Hide resolved
@rami3l rami3l force-pushed the refactor/tracing branch from 8d8c030 to c95df35 Compare June 15, 2024 02:07
@rami3l rami3l requested a review from djc June 15, 2024 02:08
Copy link
Contributor

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@rami3l rami3l added this pull request to the merge queue Jun 15, 2024
Merged via the queue into master with commit 368f133 Jun 15, 2024
26 checks passed
@rami3l rami3l deleted the refactor/tracing branch June 15, 2024 08:59
@rami3l rami3l mentioned this pull request Oct 6, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants