Fix another libgap segmentation fault#40728
Fix another libgap segmentation fault#40728user202729 wants to merge 6 commits intosagemath:developfrom
Conversation
22c9f77 to
7eeab93
Compare
|
Documentation preview for this PR (built with commit 97f9af0; changes) is ready! 🎉 |
|
Argh, I noticed this while working on #40585 but it's not what was causing that reproducible segfault so I eventually reverted my changes and forgot about it. I think you'll find the same issue in |
|
The CI failures are #40715, you'll probably have to pad the |
|
I suspect the timeout to be more likely #40556 . Anyway… |
|
more mysterious failures, looks worth investigating. (probably not caused by this pull request?) → with the latest commit it becomes less common, but nonetheless still happen sometimes. I'm tempted to conclude it's #40734 but I haven't looked at it yet. although this is also interesting since it may give some clue on how the alarm disappear (looking at polynomial_element.pyx etc.) |
f8e4454 to
97f9af0
Compare
Follow-up to #40613.
Save the following into
b.sagethen run
sage b.sage. It should segmentation fault soon.The cause
As is well known, if Cython generates code of the following form
then the
__Pyx_XDECREF(__pyx_t_8)gets optimized toif (__pyx_t_8 != 0) SEGMENTATION_FAULT;,and if GAP error is raised, we get null pointer dereference.
Before #40613,
KeyboardInterruptorAlarmInterruptjumps tosig_on(),sig_GAP_Enter()is used, first jumps toGAP_Enter(), then asig_error()call jumps tosig_on()to raise a Python exception,GAP_Enter()is used, jumps toGAP_Enter(), which then raises a Python exception.After #40613, in all cases, the outermost
GAP_Enter()is jumped to.Debug guide
There are 3 parts of debugging:
__pyx_t_8,(B),(A).Finding the variable name is easy: it suffices to run
sage --gdbthen run the script.When it segmentation fault, look at the instruction that causes the segmentation fault.
In this case, it is
therefore the variable is
[rbp-0x90].How to map back from assembly to
.cfilecompile_commands.jsonmeson-generated_src_sage_libs_gap_element.pyx.c.ofromgap/element.pyx.c.oto.sthen add-S -fverbose-asm -masm=inteland recompile, then read the.sfile generatedalternatively I believe adding
-gat appropriate places also give you this.How to find the call
Break at either
__pyx_f_4sage_4libs_3gap_4util_gap_interrupt_asap,GAP_THROW, or__longjmp_chk.Then
CYSIGNALS_CRASH_QUIET=1 rr recordit,rr replay -eandreverse-continuefrom the point of segmentation fault.Relevant backtrace:
run
frame 48and see the assembly iswhich correspond to (look at
element.pyx.c:8996and scroll a few lines up)We determine this is the call
(B).How to find the assignment instruction
We want to find
(A).Assume we're in
rr, runcontinuethenreverse-stepia few times until thecmpinstruction is executed, then runwhere the argument to
watchis copied from the output ofp.Then
reverse-continue. Whichever instruction last set the memory address is the instruction(A).Turns out it's the
__Pyx_GetModuleGlobalName(to fetch thelibgapPython global variable).Conclusion
Don't access any Python object between
GAP_Enter...GAP_Leave. Not even accessing the Python global variablelibgap.Side note, if I add
cdef libgapright before thefrom ... import libgapline, then this particular cause is eliminated, but it still segmentation fault because of theenumerate()plus some complex combination of compiler optimization.This fix creates a temporarily variable, but I think the overhead of creating a Python
tupleis small.We're still accessing
gap_elements[i], which is a Python API call (!!!), but as it happens,GAP_AssListnever throws, so we're safe.📝 Checklist
⌛ Dependencies