@rynewang
Contributor

@rynewang rynewang commented Nov 9, 2024

This fixes a jemalloc profiling deadlock, the same one ClickHouse hit in ClickHouse/ClickHouse#66346.

Stacktrace:

Thread 36 (Thread 0x7f77d3798700 (LWP 4673)):
#0  __lll_lock_wait (futex=futex@entry=0x7f7ae01333a0 <object_mutex>, private=0) at lowlevellock.c:52
#1  0x00007f7ae06a10a3 in __GI___pthread_mutex_lock (mutex=0x7f7ae01333a0 <object_mutex>) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f7ae012dd88 in __gthread_mutex_lock (__mutex=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at ./gthr-default.h:749
#3  _Unwind_Find_registered_FDE (bases=0x7f77d3792058, bases@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pc=0x7f7ae012be5b <_Unwind_Backtrace+59>, pc@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde.c:1049
#4  _Unwind_Find_FDE (pc=0x7f7ae012be5b <_Unwind_Backtrace+59>, bases=bases@entry=0x7f77d3792058) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde-dip.c:459
#5  0x00007f7ae0129e08 in uw_frame_state_for (context=0x7f77d3791fb0, fs=0x7f77d3791e00) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2.c:1263
#6  0x00007f7ae012b060 in uw_init_context_1 (context=0x7f77d3791fb0, outer_cfa=0x7f77d3792260, outer_ra=0x7f7ae0739446) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2.c:1592
#7  0x00007f7ae012be5c in _Unwind_Backtrace (trace=0x7f7ae072ed70, trace_argument=0x7f77d3792260) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:295
#8  0x00007f7ae0739446 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#9  0x00007f7ae06d6045 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#10 0x00007f7ae012d6be in start_fde_sort (count=1284, accu=0x7f77d3792360) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde.c:443
#11 init_object (ob=0x7f79a9bd5fe0) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde.c:802
#12 search_object (ob=0x7f79a9bd5fe0, pc=<optimized out>) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde.c:992
#13 0x00007f7ae012de76 in _Unwind_Find_registered_FDE (bases=0x7f77d37926d8, bases@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pc=0x7f7ae012be5b <_Unwind_Backtrace+59>, pc@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde.c:1069
#14 _Unwind_Find_FDE (pc=0x7f7ae012be5b <_Unwind_Backtrace+59>, bases=bases@entry=0x7f77d37926d8) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2-fde-dip.c:459
#15 0x00007f7ae0129e08 in uw_frame_state_for (context=0x7f77d3792630, fs=0x7f77d3792480) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2.c:1263
#16 0x00007f7ae012b060 in uw_init_context_1 (context=0x7f77d3792630, outer_cfa=0x7f77d37928e0, outer_ra=0x7f7ae0739446) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind-dw2.c:1592
#17 0x00007f7ae012be5c in _Unwind_Backtrace (trace=0x7f7ae072ed70, trace_argument=0x7f77d37928e0) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:295
#18 0x00007f7ae0739446 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#19 0x00007f7ae06d6045 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#20 0x00000000004d6e91 in _PyMem_RawMalloc (size=<optimized out>, ctx=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /usr/local/src/conda/python-3.10.14/Objects/obmalloc.c:91
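
Reading the trace bottom-up, the pattern appears to be re-entrant: an allocation (#20) enters jemalloc (#18-#19), which calls `_Unwind_Backtrace` (#17); the unwinder then reaches `start_fde_sort` (#10), which calls back into jemalloc (#8-#9), triggering a second `_Unwind_Backtrace` (#7) that waits on `object_mutex` (#0-#3), presumably already held by the outer unwind. Below is a minimal Python sketch of that shape, not the actual jemalloc or libgcc code. The function names `profiled_malloc` and `unwind_backtrace` are hypothetical stand-ins, and a lock timeout replaces the real infinite wait so the sketch terminates.

```python
import threading

# Stand-in for libgcc's non-recursive object_mutex (name taken from the trace).
object_mutex = threading.Lock()
events = []

def profiled_malloc(depth=0):
    """Hypothetical allocator hook that samples a backtrace on each allocation."""
    events.append(f"malloc depth={depth}")
    unwind_backtrace(depth)

def unwind_backtrace(depth):
    """Hypothetical unwinder: takes the lock, then allocates while holding it."""
    # The real code would block forever here; the timeout is only for the demo.
    if not object_mutex.acquire(timeout=0.1):
        events.append(f"deadlock depth={depth}")
        return
    try:
        if depth == 0:
            # Models start_fde_sort allocating memory while the lock is held.
            profiled_malloc(depth + 1)
    finally:
        object_mutex.release()

profiled_malloc()
print(events)
```

The inner call fails to take the lock because the same thread already holds it, which is why a single thread is enough to hang the process.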

@rynewang
Contributor Author

This is what I found when debugging with jemalloc profiling. Our Bazel config for jemalloc needs to be updated with a flag; otherwise it deadlocks.

@rynewang rynewang enabled auto-merge (squash) November 13, 2024 01:09
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Nov 13, 2024
@rynewang rynewang merged commit df15c58 into ray-project:master Nov 13, 2024
1 check passed
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
This fixes a jemalloc profiling deadlock, as hit by clickhouse
ClickHouse/ClickHouse#66346

Signed-off-by: Ruiyang Wang <[email protected]>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
This fixes a jemalloc profiling deadlock, as hit by clickhouse
ClickHouse/ClickHouse#66346

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: mohitjain2504 <[email protected]>
Labels

community-backlog go add ONLY when ready to merge, run all tests
