Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage pointer dereference when using OTLP GRPC exporter with inacessible endpoint #3215

Open
tom-neara opened this issue Dec 18, 2024 · 6 comments · May be fixed by #3221
Open

Garbage pointer dereference when using OTLP GRPC exporter with inacessible endpoint #3215

tom-neara opened this issue Dec 18, 2024 · 6 comments · May be fixed by #3221
Labels
bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tom-neara
Copy link

tom-neara commented Dec 18, 2024

Describe your environment

x86_64, Ubuntu-22.04, bazel 7.40, opentelemetry-cpp 1.16.0 and 1.18.0

Steps to reproduce
Lazily setup otel like so (my setup had a lot more indirection, this code doesn't actually crash for me) and run with OTEL_EXPORTER_OTLP_ENDPOINT=127.0.0.1:1337 or some other address that isn't a valid endpoint.

In my actual environment where it reproduces, I'm launching python then calling into a pybind11 module which I built which calls otel_tracer() above and creates a few spans. The crash occurs in the grpc exporter after main returns, see below.

#include <iostream>
#include <opentelemetry/context/propagation/text_map_propagator.h>
#include <opentelemetry/exporters/memory/in_memory_span_exporter.h>
#include <opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h>
#include <opentelemetry/sdk/trace/batch_span_processor_factory.h>
#include <opentelemetry/sdk/trace/processor.h>
#include <opentelemetry/sdk/trace/simple_processor.h>
#include <opentelemetry/sdk/trace/simple_processor_factory.h>
#include <opentelemetry/sdk/trace/tracer_provider_factory.h>
#include <opentelemetry/trace/context.h>
#include <opentelemetry/trace/propagation/http_trace_context.h>
#include <opentelemetry/trace/provider.h>
#include <opentelemetry/trace/span_startoptions.h>
#include <opentelemetry/trace/tracer.h>
#include <opentelemetry/trace/tracer_provider.h>

namespace otel = opentelemetry;
namespace trace_sdk = opentelemetry::sdk::trace;

static std::shared_ptr<otel::trace::TracerProvider> api_provider;

void setup_opentelemetry()
{
    auto exporter = otel::exporter::otlp::OtlpGrpcExporterFactory::Create();
    auto processor =
        trace_sdk::BatchSpanProcessorFactory::Create(std::move(exporter), {});
    auto resource = otel::sdk::resource::Resource(
        otel::sdk::resource::Resource::Create({{"service.name", "test"}}));
    auto provider = trace_sdk::TracerProviderFactory::Create(
        std::move(processor), resource);
    api_provider = std::move(provider);
    otel::trace::Provider::SetTracerProvider(api_provider);
}
std::shared_ptr<otel::trace::Tracer> otel_tracer()
{
    if (!api_provider)
    {
        setup_opentelemetry();
    }
    return api_provider->GetTracer("test", "1.0.0");
}

int
main(void)
{
    auto tracer = otel_tracer();

    {
      auto span = tracer->StartSpan("main");
      auto scope = tracer->WithActiveSpan(span);
      span->SetAttribute("key", "value");
    }
    return 0;
}

What is the expected behavior?
Program exits normally

What is the actual behavior?

Crash in https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/common/global_log_handler.h#L162

Additional context

gdb session:

Thread 714 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff9bfe00640 (LWP 4009503)]
0x00007fff4b8c0d3f in opentelemetry::v1::exporter::otlp::OtlpGrpcExporter::Export(std::__1::span<std::__1::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::__1::default_delete<opentelemetry::v1::sdk::trace::Recordable> >, 18446744073709551615ul> const&) () from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
(gdb) bt
#0  0x00007fff4b8c0d3f in opentelemetry::v1::exporter::otlp::OtlpGrpcExporter::Export(std::__1::span<std::__1::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::__1::default_delete<opentelemetry::v1::sdk::trace::Recordable> >, 18446744073709551615ul> const&) () from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#1  0x00007fff4c835189 in opentelemetry::v1::sdk::trace::BatchSpanProcessor::Export() ()
   from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#2  0x00007fff4c835003 in opentelemetry::v1::sdk::trace::BatchSpanProcessor::DrainQueue() ()
   from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#3  0x00007fff4c834769 in opentelemetry::v1::sdk::trace::BatchSpanProcessor::DoBackgroundWork() ()
   from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#4  0x00007fff4c839439 in decltype (((*(std::declval<opentelemetry::v1::sdk::trace::BatchSpanProcessor*>())).*(std::declval<void (opentelemetry::v1::sdk::trace::BatchSpanProcessor::*)()>()))()) std::__1::__invoke[abi:v160004]<void (opentelemetry::v1::sdk::trace::BatchSpanProcessor::*)(), opentelemetry::v1::sdk::trace::BatchSpanProcessor*, , void>(void (opentelemetry::v1::sdk::trace::BatchSpanProcessor::*&&)(), opentelemetry::v1::sdk::trace::BatchSpanProcessor*&&) ()
   from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#5  0x00007fff4c8393be in void std::__1::__thread_execute[abi:v160004]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (opentelemetry::v1::sdk::trace::BatchSpanProcessor::*)(), opentelemetry::v1::sdk::trace::BatchSpanProcessor*, 2ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (opentelemetry::v1::sdk::trace::BatchSpanProcessor::*)(), opentelemetry::v1::sdk::trace::BatchSpanProcessor*>&, std::__1::__tuple_indices<2ul>) ()
   from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#6  0x00007fff4c839192 in void* std::__1::__thread_proxy[abi:v160004]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (opentelemetry::v1::sdk::trace::BatchSpanProcessor::*)(), opentelemetry::v1::sdk::trace::BatchSpanProcessor*> >(void*) ()
   from /home/tbirch/.cache/bazel/_bazel_tbirch/5dac0c5eac9fcd76e48829b4e488238f/execroot/_main/bazel-out/k8-fastbuild/bin/innerloop/ml/neara_ai/classify.runfiles/_main/innerloop/ml/neara_ai/classifier_module.so
#7  0x00007ffff6694ac3 in start_thread (arg=<optimised out>) at ./nptl/pthread_create.c:442
#8  0x00007ffff6726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) disassemble
Dump of assembler code for function _ZN13opentelemetry2v18exporter4otlp16OtlpGrpcExporter6ExportERKNSt3__14spanINS4_10unique_ptrINS0_3sdk5trace10RecordableENS4_14default_deleteIS9_EEEELm18446744073709551615EEE:
<...>
   0x00007fff4b8c0cf9 <+1705>:	call   0x7fff4ac3f240 <_ZNKSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE5c_strB7v160004Ev>
   0x00007fff4b8c0cfe <+1710>:	mov    %rax,-0x5d0(%rbp)
   0x00007fff4b8c0d05 <+1717>:	lea    -0x520(%rbp),%rdi
   0x00007fff4b8c0d0c <+1724>:	call   0x7fff4b22a720 <_ZN13opentelemetry2v13sdk6common12AttributeMapC2Ev>
   0x00007fff4b8c0d11 <+1729>:	jmp    0x7fff4b8c0d16 <_ZN13opentelemetry2v18exporter4otlp16OtlpGrpcExporter6ExportERKNSt3__14spanINS4_10unique_ptrINS0_3sdk5trace10RecordableENS4_14default_deleteIS9_EEEELm18446744073709551615EEE+1734>
   0x00007fff4b8c0d16 <+1734>:	mov    -0x5d0(%rbp),%r8
   0x00007fff4b8c0d1d <+1741>:	mov    -0x5c8(%rbp),%rdi
   0x00007fff4b8c0d24 <+1748>:	mov    (%rdi),%rax
   0x00007fff4b8c0d27 <+1751>:	mov    $0x1,%esi
   0x00007fff4b8c0d2c <+1756>:	lea    -0x19f0b56(%rip),%rdx        # 0x7fff49ed01dd
   0x00007fff4b8c0d33 <+1763>:	mov    $0xa0,%ecx
   0x00007fff4b8c0d38 <+1768>:	lea    -0x520(%rbp),%r9
=> 0x00007fff4b8c0d3f <+1775>:	call   *0x10(%rax)
   0x00007fff4b8c0d42 <+1778>:	lea    -0x520(%rbp),%rdi
   0x00007fff4b8c0d49 <+1785>:	call   0x7fff4b20c060 <_ZN13opentelemetry2v13sdk6common12AttributeMapD2Ev>
   0x00007fff4b8c0d4e <+1790>:	lea    -0x4f0(%rbp),%rdi
   0x00007fff4b8c0d55 <+1797>:	call   0x7fff4acedfd0 <_ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED2Ev>
   0x00007fff4b8c0d5a <+1802>:	lea    -0x4c0(%rbp),%rdi
   0x00007fff4b8c0d61 <+1809>:	call   0x7fff4ad05390 <_ZNSt3__118basic_stringstreamIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev>
   0x00007fff4b8c0d66 <+1814>:	movl   $0x0,-0x3c(%rbp)
   0x00007fff4b8c0d6d <+1821>:	lea    -0x3a8(%rbp),%rdi
   0x00007fff4b8c0d74 <+1828>:	call   0x7fff4b2312d0 <_ZNSt3__110shared_ptrIN13opentelemetry2v13sdk6common12internal_log10LogHandlerEED2B7v160004Ev>
   0x00007fff4b8c0d79 <+1833>:	mov    -0x3c(%rbp),%eax
   0x00007fff4b8c0d7c <+1836>:	test   %eax,%eax
(gdb) info registers
rax            0x7fec7e568f88      140653708611464
rbx            0x7ff9bfe00640      140710642714176
rcx            0xa0                160
rdx            0x7fff49ed01dd      140734433657309
rsi            0x1                 1
rdi            0x7feb80ee82c0      140649457156800
rbp            0x7ff9bfdffc40      0x7ff9bfdffc40
rsp            0x7ff9bfdff670      0x7ff9bfdff670
r8             0x7febc8002030      140650649493552
r9             0x7ff9bfdff720      140710642710304
r10            0x7ff9bfdfe6f0      140710642706160
r11            0x2eef90b907f2fa4   211380060597202852
r12            0x7ff9bfe00640      140710642714176
r13            0x0                 0
r14            0x7ffff66947d0      140737327482832
r15            0x7ffa13dfdb90      140712051989392
rip            0x7fff4b8c0d3f      0x7fff4b8c0d3f <opentelemetry::v1::exporter::otlp::OtlpGrpcExporter::Export(std::__1::span<std::__1::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::__1::default_delete<opentelemetry::v1::sdk::trace::Recordable> >, 18446744073709551615ul> const&)+1775>
eflags         0x10206             [ PF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) x/20x 0x7fec7e568f88
0x7fec7e568f88:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fec7e568f98:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fec7e568fa8:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fec7e568fb8:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fec7e568fc8:	0x00000000	0x00000000	0x00000000	0x00000000

Applying this patch prevents the destruction of the global handler and thus prevents it from crashing

diff --git sdk/src/common/global_log_handler.cc sdk/src/common/global_log_handler.cc
index f60527ee..66620793 100644
--- sdk/src/common/global_log_handler.cc
+++ sdk/src/common/global_log_handler.cc
@@ -59,9 +59,8 @@ void NoopLogHandler::Handle(LogLevel,
.
 std::pair<nostd::shared_ptr<LogHandler>, LogLevel> &GlobalLogHandler::GetHandlerAndLevel() noexcept
 {
-  static std::pair<nostd::shared_ptr<LogHandler>, LogLevel> handler_and_level{
-      nostd::shared_ptr<LogHandler>(new DefaultLogHandler), LogLevel::Warning};
-  return handler_and_level;
+  static std::pair<nostd::shared_ptr<LogHandler>, LogLevel> *handler_and_level{new std::pair<nostd::shared_ptr<LogHandler>, LogLevel>{nostd::shared_ptr<LogHandler>(new DefaultLogHandler), LogLevel::Warning}};
+  return *handler_and_level;
 }
.
 }  // namespace internal_log
@tom-neara tom-neara added the bug Something isn't working label Dec 18, 2024
@github-actions github-actions bot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 18, 2024
@tom-neara
Copy link
Author

To be clear I think this bug is due to the unpredictable order of static destructors thus the simplest option is to intentionally leak.

@owent
Copy link
Member

owent commented Dec 23, 2024

To be clear I think this bug is due to the unpredictable order of static destructors thus the simplest option is to intentionally leak.

const opentelemetry::nostd::shared_ptr<LogHandler> &log_handler =                     \
    GlobalLogHandler::GetLogHandler();                                                \
if (!log_handler)                                                                     \
{                                                                                     \
  break;                                                                              \
}                                                                                     \
std::stringstream tmp_stream;                                                         \
tmp_stream << message;                                                                \
log_handler->Handle(level, __FILE__, __LINE__, tmp_stream.str().c_str(), attributes); \

The destroying order of static local variables are predictable.The problem here may be the log_handler is destroyed after if (!log_handler) and before log_handler->Handle(level, __FILE__, __LINE__, tmp_stream.str().c_str(), attributes);.

@owent owent linked a pull request Dec 24, 2024 that will close this issue
3 tasks
@owent
Copy link
Member

owent commented Dec 24, 2024

@tom-neara Could you please try #3221 to see if it occurs again.

@tom-neara
Copy link
Author

@owent this was the first thing I tried, but it doesn't prevent the crash.

As can be seen in my gdb session, it's not the object pointer in the shared_ptr (%rax) which is zero, but the entry in the vtable (memory pointed to by %rax). The problem I believe is that ~shared_ptr<T> isn't guaranteed to zero the pointer inside the shared_ptr, it is only required to call delete. Thus a static shared_ptr can have its destructor called but still contain a dangling pointer afterwards.

@owent
Copy link
Member

owent commented Dec 31, 2024

@owent this was the first thing I tried, but it doesn't prevent the crash.

As can be seen in my gdb session, it's not the object pointer in the shared_ptr (%rax) which is zero, but the entry in the vtable (memory pointed to by %rax). The problem I believe is that ~shared_ptr<T> isn't guaranteed to zero the pointer inside the shared_ptr, it is only required to call delete. Thus a static shared_ptr can have its destructor called but still contain a dangling pointer afterwards.

You are right, I commit another approach which is just like some singleton's implementation to check the destroying.

@tom-neara
Copy link
Author

@owent great, I checked the latest version of your try_fix_lifetime_of_global_log_handler branch and it does fix the issue.

@marcalff marcalff added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants