Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfaults due to corrupted InstrumentationScope pointer when using multiple meters #1720

Closed
johanpel opened this issue Oct 28, 2022 · 3 comments · Fixed by #1783
Closed

Segfaults due to corrupted InstrumentationScope pointer when using multiple meters #1720

johanpel opened this issue Oct 28, 2022 · 3 comments · Fixed by #1783
Labels
bug Something isn't working

Comments

@johanpel
Copy link
Contributor

When using metrics I've started seeing some segmentation faults related to the InstrumentationScope unique_ptr created in GetMeter. I was initially not being able to capture the segfault in GDB which implies some synchronization issue, so I added some debug prints to the pointer that InstrumentationScope::Create creates, which in the case of the segmentation fault differs from the raw pointer returned by Meter::GetInstrumentationScope in the SDK MetricCollector::Collect function, the latter of which points to an invalid address.

Eventually I was able to create a reproducible example found below. It's a bit flaky, so you may want to play around with num_meters parameter in this example to find a sweet spot where no error logs are being generated but the segfault still occurs.

I noticed when I am using the same meter name for all instruments, I can no longer reproduce this problem, but it may just be because the loop in MetricCollector::Collect becomes very short and the synchronization issue doesn't manifest itself.

I based this off the commit of this morning: dd7e257b6de71eeaf9e3149530962301705b9a0d

#include <chrono>
#include <memory>
#include <thread>
#include <vector>

#include "opentelemetry/exporters/otlp/otlp_grpc_metric_exporter_factory.h"
#include "opentelemetry/metrics/async_instruments.h"
#include "opentelemetry/metrics/provider.h"
#include "opentelemetry/nostd/shared_ptr.h"
#include "opentelemetry/sdk/metrics/export/periodic_exporting_metric_reader.h"
#include "opentelemetry/sdk/metrics/meter_provider.h"

namespace metric_sdk    = opentelemetry::sdk::metrics;
namespace nostd         = opentelemetry::nostd;
namespace common        = opentelemetry::common;
namespace metrics_api   = opentelemetry::metrics;
namespace otlp_exporter = opentelemetry::exporter::otlp;

namespace
{
class MeasurementFetcher
{
public:
  static void Fetch(opentelemetry::metrics::ObserverResult observer_result, void * /* state */)
  {
    if (nostd::holds_alternative<nostd::shared_ptr<opentelemetry::metrics::ObserverResultT<long>>>(
            observer_result))
    {
      nostd::get<nostd::shared_ptr<opentelemetry::metrics::ObserverResultT<long>>>(observer_result)
          ->Observe(0);
    }
  }
};
}  // namespace

int main(int argc, char *argv[])
{
  // Initialize exporter.
  otlp_exporter::OtlpGrpcMetricExporterOptions exporter_options;
  auto exporter = otlp_exporter::OtlpGrpcMetricExporterFactory::Create(exporter_options);

  // Initialize reader.
  metric_sdk::PeriodicExportingMetricReaderOptions reader_options;
  reader_options.export_interval_millis = std::chrono::milliseconds(5);
  reader_options.export_timeout_millis  = std::chrono::milliseconds(4);
  std::unique_ptr<metric_sdk::MetricReader> reader{
      new metric_sdk::PeriodicExportingMetricReader(std::move(exporter), reader_options)};

  // Initialize provider.
  auto provider = std::make_shared<metric_sdk::MeterProvider>();
  provider->AddMetricReader(std::move(reader));

  metrics_api::Provider::SetMeterProvider(provider);

  constexpr size_t num_meters = 64;

  std::vector<std::string> meter_names;
  std::vector<nostd::shared_ptr<metrics_api::Meter>> meters;
  std::vector<nostd::shared_ptr<metrics_api::ObservableInstrument>> instruments;

  for (size_t i = 0; i < num_meters; i++)
  {
    meter_names.push_back("my_meter_" + std::to_string(i));
    // Create a meter.
    meters.push_back(provider->GetMeter(meter_names.back()));
    // Create an instrument.
    instruments.push_back(meters.back()->CreateInt64ObservableGauge("my_instrument"));
    instruments.back()->AddCallback(MeasurementFetcher::Fetch, nullptr);
  }

  std::this_thread::sleep_for(std::chrono::seconds(1));
}

With the following backtrace:

std::__uniq_ptr_impl<opentelemetry::v1::sdk::instrumentationscope::InstrumentationScope, std::default_delete<opentelemetry::v1::sdk::instrumentationscope::InstrumentationScope> >::_M_ptr unique_ptr.h:173
std::unique_ptr<opentelemetry::v1::sdk::instrumentationscope::InstrumentationScope, std::default_delete<opentelemetry::v1::sdk::instrumentationscope::InstrumentationScope> >::get unique_ptr.h:422
opentelemetry::v1::sdk::metrics::Meter::GetInstrumentationScope meter.cc:298
opentelemetry::v1::sdk::metrics::MetricCollector::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics &)>) metric_collector.cc:47
opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics &)>) metric_reader.cc:38
operator() periodic_exporting_metric_reader.cc:57
std::__invoke_impl<void, <lambda> >(std::__invoke_other, struct {...} &&) invoke.h:61
std::__invoke<<lambda> >(struct {...} &&) invoke.h:96
std::thread::_Invoker<std::tuple<<lambda> > >::_M_invoke<0>(std::_Index_tuple<0>) std_thread.h:253
std::thread::_Invoker<std::tuple<<lambda> > >::operator()(void) std_thread.h:260
std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<<lambda> > >, void>::operator()(void) const future:1409
std::__invoke_impl<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<<lambda> > >, void> &>(std::__invoke_other, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<<lambda> > >, void> &) invoke.h:61
std::__invoke_r<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<<lambda> > >, void> &>(std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<<lambda> > >, void> &) invoke.h:116
std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<<lambda> > >, void> >::_M_invoke(const std::_Any_data &) std_function.h:291
std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const std_function.h:590
std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *) future:571
std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *&&, bool *&&) invoke.h:74
std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *>(void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *&&, bool *&&) invoke.h:96
<lambda#1>::operator()() const mutex:776
<lambda#1>::operator()() const mutex:712
<lambda#1>::_FUN() mutex:712
__pthread_once_slow 0x00007ffff7808f68
__gthread_once gthr-default.h:700
std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *>(std::once_flag &, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *&&, bool *&&) mutex:783
std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) future:411
std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::_M_run(void) future:1737
std::__invoke_impl<void, void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*)(), std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *>(std::__invoke_memfun_deref, void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*&&)(std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *const), std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *&&) invoke.h:74
std::__invoke<void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*)(), std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *>(void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*&&)(std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *const)) invoke.h:96
std::thread::_Invoker<std::tuple<void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*)(), std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *> >::_M_invoke<0, 1>(std::_Index_tuple<0, 1>) std_thread.h:253
std::thread::_Invoker<std::tuple<void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*)(), std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *> >::operator()(void) std_thread.h:260
std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void>::*)(), std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<<lambda> > >, void> *> > >::_M_run(void) std_thread.h:211
<unknown> 0x00007ffff7b7a2b3
start_thread 0x00007ffff7803b43
clone3 0x00007ffff7895a00
@johanpel johanpel added the bug Something isn't working label Oct 28, 2022
@lalitb lalitb added this to the Metrics post GA release milestone Oct 28, 2022
@johanpel
Copy link
Contributor Author

Started looking into this a little closer today and I noticed also the meter in the for loop of MetricCollector::Collect also becomes a nullptr sometimes.

The problem is with MeterContext::GetMeters which returns a nostd::span viewing into the MeterContext meters_ field. However, while that span is being used by the background thread, the backing buffer of the vector could be in the process of being reallocated while new meters are added.

A simple fix is to construct and return a copy of meters_ in MeterContext::GetMeters while it is locked, so the background thread can work on the copied shared_ptrs to the meters while another thread can safely mutate the meters_ field at the same time.

I can imagine such a fix has performance implications as the refcounts for all meters need to be atomically updated for the copy. I can imagine one other approach where the span is kept but somehow locking the MeterContext storage_lock_ while iterating over the meters in MetricCollector::Collect.

@malkia
Copy link

malkia commented Jan 4, 2023

I've found that for observables I had to RemoveCallback before shutting down at various places.

@lalitb
Copy link
Member

lalitb commented Jan 4, 2023

I've found that for observables I had to RemoveCallback before shutting down at various places.

I will have a look at it, and raise a issue if able to reproduce. Else, if you have a reproducible code with Observables please feel free to create separate issue for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment