Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OtlpHttpClient::Export() fails in ASYNC build #1955

Closed
marcalff opened this issue Feb 4, 2023 · 9 comments · Fixed by #2000
Closed

OtlpHttpClient::Export() fails in ASYNC build #1955

marcalff opened this issue Feb 4, 2023 · 9 comments · Fixed by #2000
Assignees
Labels
bug Something isn't working

Comments

@marcalff
Copy link
Member

marcalff commented Feb 4, 2023

Problem

Start a local OpenTelemetry-Collector process on port 4318.

Run the example_otlp_http client.

In the sync build (WITH_ASYNC_EXPORT_PREVIEW=OFF), the client prints no error messages, and the collector gets all the trace data.

The SYNC build works as expected.

Repeat in the async build (WITH_ASYNC_EXPORT_PREVIEW=ON)

The client prints errors:

[email protected]:otlp> ./example_otlp_http
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.No error
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_exporter.cc:93 [OTLP HTTP Client] ERROR: Export 1 trace span(s) error: 1
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.No error
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_exporter.cc:93 [OTLP HTTP Client] ERROR: Export 1 trace span(s) error: 1
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.No error
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_exporter.cc:93 [OTLP HTTP Client] ERROR: Export 1 trace span(s) error: 1
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.No error
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_exporter.cc:93 [OTLP HTTP Client] ERROR: Export 1 trace span(s) error: 1
[Error] File: /home/malff/CODE/MY_GITHUB/opentelemetry-cpp/exporters/otlp/src/otlp_http_client.cc:303 [OTLP HTTP Client] Session state: (manually) cancelled.

The collector gets incomplete data.

Analysis

In both cases, sync and async, the code reaches the async export method:

sdk::common::ExportResult OtlpHttpClient::Export(
    const google::protobuf::Message &message,
    std::function<bool(opentelemetry::sdk::common::ExportResult)> &&result_callback,
    std::size_t max_running_requests) noexcept
{
  ...

  // Wait for the response to be received
  if (options_.console_debug)
  {
    OTEL_INTERNAL_LOG_DEBUG(
        "[OTLP HTTP Client] Waiting for response from "
        << options_.url << " (timeout = "
        << std::chrono::duration_cast<std::chrono::milliseconds>(options_.timeout).count()
        << " milliseconds)");
  }

  // Wait for any session to finish if there are to many sessions
  std::unique_lock<std::mutex> lock(session_waker_lock_);
  bool wait_successful =
      session_waker_.wait_for(lock, options_.timeout, [this, max_running_requests] {
        std::lock_guard<std::recursive_mutex> guard{session_manager_lock_};
        return running_sessions_.size() <= max_running_requests;
      });

  cleanupGCSessions();

  ...

In the sync case:

  • max_running_requests is 0,
  • running_sessions_.size() is 1,
  • the code blocks in session_waker_.wait_for(),
  • the http server reply
  • the response is received
  • running_sessions_.size() become 0
  • session_waker_ is notified
  • the wait_for return.

In the async case:

  • max_running_requests is 64,
  • running_sessions_.size() is 1,
  • the code does not blocks in session_waker_.wait_for(),
  • cleanupGCSessions() is called
  • the running session is aborted

and all this happens before the http server response

Export fails, printing Session state: (manually) cancelled.

The wait_for() predicate:

running_sessions_.size() <= max_running_requests

does not work for async builds.

Setting OtlpHttpExporterOptions::max_concurrent_requests = 0 as a work around makes OtlpHttpClient::Export() to work.

Not sure what the proper fix is, max_running_requests appear broken.

@marcalff marcalff added the bug Something isn't working label Feb 4, 2023
@owent
Copy link
Member

owent commented Feb 5, 2023

Does it happen only in #1793 or is it also happen in main branch?
cleanupGCSessions() only cleanup the sessions that already finished or failed. When OnResponse() is called, we think session end with some HTTP response. When the OnEvent() is called with state http_client::SessionState::CreateFailed, case http_client::SessionState::ConnectFailed, http_client::SessionState::SendFailed, http_client::SessionState::SSLHandshakeFailed,http_client::SessionState::TimedOut, http_client::SessionState::NetworkError, or http_client::SessionState::Cancelled we think the session is abort with failure.
If this problem happens only in #1973, i guess OnEvent() may be called with http_client::SessionState::SSLHandshakeFailed or http_client::SessionState::NetworkError, I can follow the details to find out what really happens in 1 or 2 days later.

@marcalff
Copy link
Member Author

marcalff commented Feb 5, 2023

Hi @owent

Found originally in #1793, and reproduced in the main branch, with a plain http (no ssl) use case.

Will try to investigate more.

@owent
Copy link
Member

owent commented Feb 6, 2023

Hi @owent

Found originally in #1793, and reproduced in the main branch, with a plain http (no ssl) use case.

Will try to investigate more.

Hi @marcalff

I have a quick look at this problem. The problem may be all HTTP requests are canceled when we destroy the exporter(By destroying tracer provider in CleanupTracer() -> trace::Provider::SetTracerProvider(none)). Tracer::ForceFlushWithMicroseconds do nothing now, and when shutdown, all recordables are exported into exporter and then HttpClient::CancelAllSessions() is called by Shutdown() of exporter, processor, and context and provider.
There is no ForceFlush() API in exporters now, I think we can only wait some time to see if the test data can be exported successfully.
Shoud we add ForceFlush() for all exporters all just add APIs to get internal OtlpHttpClient from OTLP exporters and then call OtlpHttpClient::ForceFlush() to wait the exporting?

@lalitb
Copy link
Member

lalitb commented Feb 6, 2023

If these errors are thrown from OtlpHttpClient::Shutdown() ->HttpClient::CancelAllSessions(), will it help if we change the order of calls such that OtlpHttpClient::ForceFlush is called before HttpClient::CancelAllSessions() ?

@owent
Copy link
Member

owent commented Feb 6, 2023

If these errors are thrown from OtlpHttpClient::Shutdown() ->HttpClient::CancelAllSessions(), will it help if we change the order of calls such that OtlpHttpClient::ForceFlush is called before HttpClient::CancelAllSessions() ?

In my use case, we have many modules in our system, and we expect all modules should be close immediately when got a SIGSTOP, any module should not extend the lifetime of the application.
If OtlpHttpClient::ForceFlush is called in OtlpHttpClient::Shutdown and timeout, this module must still wait some time to close all sessions, I think the whole waiting time will be greater than it's passed into OtlpHttpClient::Shutdown and we can not tell how long it's.
So in my understanding.it may be better to let user chose whether to cancel sessions or waiting until all the existing sessions are exported.

@lalitb
Copy link
Member

lalitb commented Feb 6, 2023

Ok thanks for explanation. I think it makes sense to add Exporter::ForceFlush(), and let end-user to explicitly call it through TracerProvider::ForceFlush(). We also have an issue #1623 for this.

@lalitb
Copy link
Member

lalitb commented Feb 13, 2023

@owent - Would you like to see this issue further?

@owent owent self-assigned this Feb 14, 2023
@owent
Copy link
Member

owent commented Feb 14, 2023

@owent - Would you like to see this issue further?

Sure I will fix this issue and #1623 some time laters.

@lalitb
Copy link
Member

lalitb commented Feb 14, 2023

Thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants