SIGILL's on ARM32 while using valgrind #33727

Closed
igorsnunes opened this issue Mar 18, 2020 · 14 comments
Labels
arch-arm32
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
JitUntriaged (CLR JIT issues needing additional triage)

@igorsnunes

igorsnunes commented Mar 18, 2020

Hi all,

I am currently debugging an application with valgrind on a Raspberry Pi, and I came across some "Illegal Instruction" issues in "libcoreclr.so".

I am not sure whether these instructions are actually being executed; however, since valgrind detects them, an "Illegal Instruction" signal is raised.

See the messages below:

The first one concerns the sub.w instruction: SP is used in the Rd position and r8 in Rn (according to the ISA, if SP is used as Rd, SP should also be in Rn; see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABFFEJF.html).

done.
0x04000a30 in _start () from /lib/ld-linux-armhf.so.3
(gdb) c
Continuing.
[New Thread 17257]

Thread 1 received signal SIGILL, Illegal instruction.
0x05f837fe in _DacGlobals::InitializeEntries(unsigned int) () from /home/pi/workspace-00032/edge/libcoreclr.so
(gdb) x/i $pc
0x5f837ff <_ZN11_DacGlobals17InitializeEntriesEj+3262>: sub.w sp, r8, #80 ; 0x50

I managed to bypass this SIGILL by patching valgrind to explicitly allow this form (this probably shouldn't be done). However, another SIGILL was then raised, this time at a different location.

(gdb) c
Continuing.
[New Thread 4291]

Thread 1 received signal SIGILL, Illegal instruction.
0x23f7d48e in ?? ()
(gdb) x/i $pc
0x23f7d48f: ldmia.w sp!, {lr}
(gdb)

Apparently, this one concerns the use of only one register in the register list of the LDMIA instruction. For more information, see: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489g/Cihcadda.html.

I wonder whether these instructions are being emitted by the JIT compiler, or whether this is a Clang issue.

Thanks in advance!

PS:
more info about my environment:

First, I am running a container to publish my app. The app publishing process is located in build.sh.

docker run -v %ProductContainersFolder%:/product_containers --rm mcr.microsoft.com/dotnet/core/sdk:3.1 bash /product_containers/build.sh

The "dotnet publish" command is described below:

dotnet publish -c Release --framework netcoreapp3.1 -r linux-arm --self-contained yes --output .....

After publishing the app, I'm copying the whole environment (app + runtime + libs, including libcoreclr.so) to my Raspberry Pi to run it there.

It's important to note that if I don't use valgrind, my application runs without problems on the Raspberry Pi.

category:correctness
theme:codegen
skill-level:beginner
cost:small

@Dotnet-GitSync-Bot
Collaborator

I couldn't add an area label to this Issue.

Checkout this page to find out which area owner to ping, or please add exactly one area label to help train me in the future.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 18, 2020
@igorsnunes igorsnunes changed the title from "SIGILL`s on ARM32 while using valgrind" to "SIGILL's on ARM32 while using valgrind" Mar 18, 2020
@janvorli
Member

> Apparently, this one concerns to the use of only one register in the register list in instruction LDMIA.

What's wrong with using one register there? The document you've pointed to says:

reglist
is a list of one or more registers to be loaded or stored, enclosed in braces. It can contain register ranges. It must be comma separated if it contains more than one register or register range.

As for the sub with SP

> The first one concerns to sub.w instruction: SP is being used in Rd position and r8 in Rn (according to the ISA, if SP is being used as Rd, SP should also be in Rn, see

@dotnet/jit-contrib do we generate sub.w sp, r8, #80 form in JIT?

@igorsnunes
Author

Hi @janvorli ,

Thanks for your reply.

Concerning the use of only one register, you are correct. But the documentation also states (in the restrictions section):

In 32-bit Thumb instructions:

  • ...
  • there must be two or more registers in the list.
    If you write an STM or LDM instruction with only one register in reglist, the assembler automatically substitutes the equivalent STR or LDR instruction. Be aware of this when comparing disassembly listings with source code.

So, this may lead to two possibilities:

  • this is a valgrind issue: valgrind expects the code to have been generated by a standard assembler, whereas it should also accept LDM/STM with only one register in the list.
  • this is a dotnet issue: it should do the assembler's job (substituting the equivalent instruction).

Thanks.

@janvorli
Member

@igorsnunes ah, I missed the restrictions section. I guess LDMIA with one register in the list happens to work but is not supported according to the doc, so we should not be emitting it.

@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 25, 2020
@BruceForstall
Member

> @dotnet/jit-contrib do we generate sub.w sp, r8, #80 form in JIT?

I can look into this, although in the case shown here it looks like that instruction is in native code (unless the symbols are incorrect).

@BruceForstall BruceForstall self-assigned this Mar 27, 2020
@BruceForstall BruceForstall added this to the 5.0 milestone Mar 27, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Mar 27, 2020
@Spongman

I'm seeing the same thing:

valgrind --tool=massif dotnet exec bin/Release/netcoreapp3.1/linux-arm/app.dll
==2241== Massif, a heap profiler
==2241== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==2241== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==2241== Command: dotnet exec bin/Release/netcoreapp3.1/linux-arm/app.dll
==2241== 
--2241-- WARNING: unhandled arm-linux syscall: 389
--2241-- You may be able to write your own handler.
--2241-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--2241-- Nevertheless we consider this a bug.  Please report
--2241-- it at http://valgrind.org/support/bug_reports.html.
disInstr(thumb): unhandled instruction: 0xF1A4 0x0D50
==2241== valgrind: Unrecognised instruction at address 0x52444a5.
==2241==    at 0x52444A4: _DacGlobals::InitializeEntries(unsigned int) (in /home/me/app/bin/Release/netcoreapp3.1/linux-arm/libcoreclr.so)
==2241== Your program just tried to execute an instruction that Valgrind
==2241== did not recognise.  There are two possible reasons for this.
==2241== 1. Your program has a bug and erroneously jumped to a non-code
==2241==    location.  If you are running Memcheck and you just saw a
==2241==    warning about a bad jump, it's probably your program's fault.
==2241== 2. The instruction is legitimate but Valgrind doesn't handle it,
==2241==    i.e. it's Valgrind's fault.  If you think this is the case or
==2241==    you are not sure, please let us know and we'll try to fix it.
==2241== Either way, Valgrind will now raise a SIGILL signal which will
==2241== probably kill your program.
disInstr(thumb): unhandled instruction: 0xF1A4 0x0D50
==2241== valgrind: Unrecognised instruction at address 0x52444a5.
==2241==    at 0x52444A4: _DacGlobals::InitializeEntries(unsigned int) (in /home/me/app/bin/Release/netcoreapp3.1/linux-arm/libcoreclr.so)
==2241== Your program just tried to execute an instruction that Valgrind
==2241== did not recognise.  There are two possible reasons for this.
==2241== 1. Your program has a bug and erroneously jumped to a non-code
==2241==    location.  If you are running Memcheck and you just saw a
==2241==    warning about a bad jump, it's probably your program's fault.
==2241== 2. The instruction is legitimate but Valgrind doesn't handle it,
==2241==    i.e. it's Valgrind's fault.  If you think this is the case or
==2241==    you are not sure, please let us know and we'll try to fix it.
==2241== Either way, Valgrind will now raise a SIGILL signal which will
==2241== probably kill your program.
==2241== 
==2241== Process terminating with default action of signal 4 (SIGILL)
==2241==  Illegal opcode at address 0x52444A5
==2241==    at 0x52444A4: _DacGlobals::InitializeEntries(unsigned int) (in /home/me/app/bin/Release/netcoreapp3.1/linux-arm/libcoreclr.so)
==2241== 
Illegal instruction

@BruceForstall
Member

@janvorli As far as I can tell, the JIT will never generate sub.w sp, r8, #80 form. Also, the jit never generates ldm form. From the info here, it looks like those must be coming from some kind of native code. I didn't see anywhere in the VM that would generate the ldm form.

@AndyAyersMS
Member

I'll investigate.

@AndyAyersMS AndyAyersMS self-assigned this Aug 6, 2020
@AndyAyersMS
Member

Able to repro, here is valgrind on 8Queens.dll

==9409== Memcheck, a memory error detector
==9409== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9409== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==9409== Command: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root/corerun 8Queens.dll
==9409== 
disInstr(thumb): unhandled instruction: 0xF1A2 0x0D28
==9409== valgrind: Unrecognised instruction at address 0x4fee5bd.
==9409==    at 0x4FEE5BC: _DacGlobals::InitializeEntries(unsigned int) (src/coreclr/src/inc/vptr_list.h:35)
==9409==    by 0x50A428D: operator() (src/coreclr/src/vm/ceemain.cpp:1132)
==9409==    by 0x50A428D: EEStartup() (src/coreclr/src/vm/ceemain.cpp:1138)
==9409==    by 0x50A4169: EnsureEEStarted() (src/coreclr/src/vm/ceemain.cpp:322)
==9409==    by 0x50FDEE1: CorHost2::Start() (src/coreclr/src/vm/corhost.cpp:101)
==9409==    by 0x4F7C65D: coreclr_initialize (src/coreclr/src/dlls/mscoree/unixinterface.cpp:236)
==9409==    by 0x109FA9: ExecuteManagedAssembly(char const*, char const*, char const*, int, char const**) (src/coreclr/src/hosts/unixcoreruncommon/coreruncommon.cpp:490)
==9409==    by 0x109491: main (src/coreclr/src/hosts/unixcorerun/corerun.cpp:148)
==9409== Your program just tried to execute an instruction that Valgrind
==9409== did not recognise.  There are two possible reasons for this.
==9409== 1. Your program has a bug and erroneously jumped to a non-code
==9409==    location.  If you are running Memcheck and you just saw a
==9409==    warning about a bad jump, it's probably your program's fault.
==9409== 2. The instruction is legitimate but Valgrind doesn't handle it,
==9409==    i.e. it's Valgrind's fault.  If you think this is the case or
==9409==    you are not sure, please let us know and we'll try to fix it.
==9409== Either way, Valgrind will now raise a SIGILL signal which will
==9409== probably kill your program.
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F82CC: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:528)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F82D0: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:533)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F82D2: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:533)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F82F0: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:544)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F82F2: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:544)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F89AC: GetNativeSigSimdContext (src/coreclr/src/pal/src/include/pal/context.h:430)
==9409==    by 0x53F8333: GetConstNativeSigSimdContext (src/coreclr/src/pal/src/include/pal/context.h:454)
==9409==    by 0x53F8333: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:612)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53F835A: CONTEXTFromNativeContext (src/coreclr/src/pal/src/thread/context.cpp:625)
==9409==    by 0x53A8C33: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:882)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Conditional jump or move depends on uninitialised value(s)
==9409==    at 0x4A76164: sigaddset (sigaddset.c:26)
==9409==    by 0x53A8C43: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:886)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x4A76170: sigaddset (sigaddset.c:32)
==9409==    by 0x53A8C43: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:886)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Conditional jump or move depends on uninitialised value(s)
==9409==    at 0x487A2D2: pthread_sigmask (pthread_sigmask.c:33)
==9409==    by 0x53A8C4F: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:887)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
==9409== Syscall param rt_sigprocmask(set) points to uninitialised byte(s)
==9409==    at 0x487DF06: __libc_do_syscall (libc-do-syscall.S:47)
==9409==    by 0x487A2E7: pthread_sigmask (pthread_sigmask.c:45)
==9409==    by 0x53A8C4F: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:887)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409==  Address 0xfe9fd45c is on thread 1's stack
==9409==  in frame #2, created by common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/src/exception/signal.cpp:838)
==9409== 
==9409== Use of uninitialised value of size 4
==9409==    at 0x53A8CE2: common_signal_handler(int, siginfo_t*, void*, int, ...) (src/coreclr/src/pal/inc/pal.h:0)
==9409==    by 0x53A839B: sigill_handler(int, siginfo_t*, void*) (src/coreclr/src/pal/src/exception/signal.cpp:352)
==9409==    by 0x4A7675F: ??? (sigrestorer.S:77)
==9409== 
disInstr(thumb): unhandled instruction: 0xF1A2 0x0D28
==9409== valgrind: Unrecognised instruction at address 0x4fee5bd.
==9409==    at 0x4FEE5BC: _DacGlobals::InitializeEntries(unsigned int) (src/coreclr/src/inc/vptr_list.h:35)
==9409==    by 0x50A428D: operator() (src/coreclr/src/vm/ceemain.cpp:1132)
==9409==    by 0x50A428D: EEStartup() (src/coreclr/src/vm/ceemain.cpp:1138)
==9409==    by 0x50A4169: EnsureEEStarted() (src/coreclr/src/vm/ceemain.cpp:322)
==9409==    by 0x50FDEE1: CorHost2::Start() (src/coreclr/src/vm/corhost.cpp:101)
==9409==    by 0x4F7C65D: coreclr_initialize (src/coreclr/src/dlls/mscoree/unixinterface.cpp:236)
==9409==    by 0x109FA9: ExecuteManagedAssembly(char const*, char const*, char const*, int, char const**) (src/coreclr/src/hosts/unixcoreruncommon/coreruncommon.cpp:490)
==9409==    by 0x109491: main (src/coreclr/src/hosts/unixcorerun/corerun.cpp:148)
==9409== Your program just tried to execute an instruction that Valgrind
==9409== did not recognise.  There are two possible reasons for this.
==9409== 1. Your program has a bug and erroneously jumped to a non-code
==9409==    location.  If you are running Memcheck and you just saw a
==9409==    warning about a bad jump, it's probably your program's fault.
==9409== 2. The instruction is legitimate but Valgrind doesn't handle it,
==9409==    i.e. it's Valgrind's fault.  If you think this is the case or
==9409==    you are not sure, please let us know and we'll try to fix it.
==9409== Either way, Valgrind will now raise a SIGILL signal which will
==9409== probably kill your program.
==9409== 
==9409== Process terminating with default action of signal 4 (SIGILL): dumping core
==9409==  Illegal opcode at address 0x4FEE5BD
==9409==    at 0x4FEE5BC: _DacGlobals::InitializeEntries(unsigned int) (src/coreclr/src/inc/vptr_list.h:35)
==9409==    by 0x50A428D: operator() (src/coreclr/src/vm/ceemain.cpp:1132)
==9409==    by 0x50A428D: EEStartup() (src/coreclr/src/vm/ceemain.cpp:1138)
==9409==    by 0x50A4169: EnsureEEStarted() (src/coreclr/src/vm/ceemain.cpp:322)
==9409==    by 0x50FDEE1: CorHost2::Start() (src/coreclr/src/vm/corhost.cpp:101)
==9409==    by 0x4F7C65D: coreclr_initialize (src/coreclr/src/dlls/mscoree/unixinterface.cpp:236)
==9409==    by 0x109FA9: ExecuteManagedAssembly(char const*, char const*, char const*, int, char const**) (src/coreclr/src/hosts/unixcoreruncommon/coreruncommon.cpp:490)
==9409==    by 0x109491: main (src/coreclr/src/hosts/unixcorerun/corerun.cpp:148)
==9409== 
==9409== HEAP SUMMARY:
==9409==     in use at exit: 138,752 bytes in 128 blocks
==9409==   total heap usage: 7,602 allocs, 7,474 frees, 938,435 bytes allocated
==9409== 
==9409== LEAK SUMMARY:
==9409==    definitely lost: 60 bytes in 1 blocks
==9409==    indirectly lost: 0 bytes in 0 blocks
==9409==      possibly lost: 75,679 bytes in 19 blocks
==9409==    still reachable: 63,013 bytes in 108 blocks
==9409==         suppressed: 0 bytes in 0 blocks
==9409== Rerun with --leak-check=full to see details of leaked memory
==9409== 
==9409== Use --track-origins=yes to see where uninitialised values come from
==9409== For lists of detected and suppressed errors, rerun with: -s
==9409== ERROR SUMMARY: 12 errors from 12 contexts (suppressed: 0 from 0)

The instruction has the same form as the first bad instruction in the top comment:

=> 0x04fee5bc <_DacGlobals::InitializeEntries(unsigned int)+2204>:      a2 f1 28 0d     sub.w   sp, r2, #40     ; 0x28

;; in context, looks like an odd codegen pattern ... other allocas do the math in gprs and then just a mov to sp

   0x04fee5b8 <+2200>:  mov     r2, sp
   0x04fee5ba <+2202>:  ldr     r4, [r0, #0]
=> 0x04fee5bc <+2204>:  sub.w   sp, r2, #40     ; 0x28

This is from Clang. Seems like it comes from heavy use of _alloca in that method.
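
The halfword pairs that valgrind prints (0xF1A2 0x0D28 in this run, 0xF1A4 0x0D50 in the earlier report) can be decoded by hand to confirm which registers are involved. The following is a minimal Python sketch, assuming the Thumb-2 SUB (immediate) 32-bit layout from the ARM architecture manual. The function name and the result keys are hypothetical, and only the plain 8-bit immediate case is handled.

```python
def decode_sub_imm_t3(hw1: int, hw2: int) -> dict:
    """Decode a Thumb-2 SUB.W Rd, Rn, #imm from its two halfwords.

    Assumed bit layout (not taken from the issue itself):
      hw1 = 11110 i 0 1101 S Rn
      hw2 = 0 imm3 Rd imm8
    """
    # Check the fixed opcode bits of both halfwords.
    if (hw1 & 0xFBE0) != 0xF1A0 or (hw2 & 0x8000) != 0:
        raise ValueError("not a SUB (immediate) 32-bit Thumb encoding")

    i = (hw1 >> 10) & 0x1
    imm3 = (hw2 >> 12) & 0x7
    imm8 = hw2 & 0xFF
    if i != 0 or imm3 != 0:
        # Rotated or replicated immediates are out of scope for this sketch.
        raise NotImplementedError("modified-immediate forms are not handled")

    rn = hw1 & 0xF
    rd = (hw2 >> 8) & 0xF
    return {
        "rd": rd,
        "rn": rn,
        "imm": imm8,
        # True for the form discussed here: SP written from a non-SP source.
        "sp_from_other_reg": rd == 13 and rn != 13,
    }
```

Under these assumptions, both reported halfword pairs decode to a subtraction that writes SP (register 13) from a different base register, which is the form the original report describes as restricted.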

@AndyAyersMS
Member

Checked for the sub sp, Rn issue in the open bugs against Clang and couldn't find anything similar, though it may just be tricky to search for. Probably should extract the code for _DacGlobals::InitializeEntries into a standalone repro and open an issue there.

Did not try hacking past this to get to the next bug, but will try and do so relatively soon.

At any rate, we're unlikely to be able to fix this in 5.0, so will move to future.

@AndyAyersMS AndyAyersMS modified the milestones: 5.0.0, Future Aug 7, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@oldzhu
Contributor

oldzhu commented Dec 31, 2020

Regarding the illegal ARM ldm instruction I reported at #33344: I found a way to reproduce it. The illegal ldm instruction is not generated when the method is jitted, but it is generated when compiling for R2R. The detailed steps are as follows:

  1. Create a new classlib project named testlib and copy the content below into Class1.cs:
using System;

namespace testlib
{
    public struct struct1
    {
            public uint m1;
            public struct1(uint p1)
            {
                   m1=p1;
            }
    }
    public struct struct2
    {
            public uint m2;
            public struct2(uint p2)
            {
                   m2=p2;
            }
    }
    public class class01
    {
        public static struct2 test01(struct1 s1)
        {
                return new struct2(s1.m1 | 0x20000000);
        }
    }
}  
  2. Publish testlib with the command below:

dotnet publish -c release -r linux-arm -p:PublishReadyToRun=true

  3. Disassemble the native code in the published R2R image testlib.dll using R2RDump, as below:

./R2RDump/R2RDump -d -i testlib.dll

testlib.struct2 testlib.class01.test01(testlib.struct1)
Id: 2
StartAddress: 0x0000249C
Size: 18 bytes
UnwindRVA: 0x000024F0

Debug Info
    Bounds:
    Native Offset: 0x0, Prolog, Source Types: StackEmpty
    Native Offset: 0xA, IL Offset: 0x0011, Source Types: SourceTypeInvalid
    Native Offset: 0xA, Epilog, Source Types: StackEmpty

    Variable Locations:
    Variable Number: 0
    Start Offset: 0x0
    End Offset: 0x4
    Loc Type: VLT_REG
    Register: R0

249c: 01 b4            push    {r0}
249e: 00 b5            push    {lr}
24a0: 01 98            ldr     r0, [sp, #4]
24a2: 40 f0 00 50      orr     r0, r0, #536870912
24a6: bd e8 00 40      ldm.w   sp!, {lr}               // this is the illegal instruction generated by our libclrjit.so...
24aa: 01 b0            add     sp, #4
24ac: 70 47            bx      lr

You can see the illegal ldm instruction above.
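
To check a dump like this without reading the bits by eye, the four bytes can be decoded mechanically. This is a minimal Python sketch, assuming the 32-bit Thumb LDM layout from the ARM architecture manual and the little-endian halfword order that R2RDump and gdb print. The function name and result keys are hypothetical.

```python
def decode_ldm_t2(raw: bytes) -> dict:
    """Decode a 4-byte Thumb-2 LDM/LDMIA.W instruction as printed in a dump.

    Assumed bit layout (not taken from the issue itself):
      hw1 = 1110 1000 10 W 1 Rn
      hw2 = P M 0 register_list(13 bits)
    """
    if len(raw) != 4:
        raise ValueError("expected exactly four bytes")
    # Each 16-bit halfword is stored little-endian: "bd e8" is 0xE8BD.
    hw1 = int.from_bytes(raw[0:2], "little")
    hw2 = int.from_bytes(raw[2:4], "little")
    if (hw1 & 0xFFD0) != 0xE890 or (hw2 & 0x2000) != 0:
        raise ValueError("not a 32-bit Thumb LDM encoding")

    names = [f"r{n}" for n in range(13)] + ["sp", "lr", "pc"]
    registers = [names[n] for n in range(16) if hw2 & (1 << n)]
    return {
        "base": names[hw1 & 0xF],
        "writeback": bool(hw1 & 0x20),
        "registers": registers,
        # The cited restriction: 32-bit Thumb LDM needs two or more registers.
        "single_register": len(registers) < 2,
    }
```

Feeding it `bytes.fromhex("bde80040")` from the R2R listing reports a single register, `lr`, with base `sp` and writeback set.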

  4. I also published testlib as a non-R2R image and did live debugging. The JIT generated different instructions, shown below, for the same testlib.struct2 testlib.class01.test01(testlib.struct1):
(lldb) clru 
Normal JIT generated code
testlib.class01.test01(testlib.struct1)
ilAddr is 7550125A pImport is 006A7850
Begin 6CC91498, size 2c
>>> 000000006cc91498 01b42de9             push    {r0, r10, r12, sp, pc}

…
(gdb) disas /rs 0x6cc91498,+0x2c
Dump of assembler code from 0x6cc91498 to 0x6cc914c4:
   0x6cc91498:  01 b4   push    {r0}
   0x6cc9149a:  2d e9 0c 4c     stmdb   sp!, {r2, r3, r10, r11, lr}
   0x6cc9149e:  0d f1 0c 0b     add.w   r11, sp, #12
   0x6cc914a2:  00 21   movs    r1, #0
   0x6cc914a4:  01 a8   add     r0, sp, #4
   0x6cc914a6:  01 60   str     r1, [r0, #0]
   0x6cc914a8:  05 99   ldr     r1, [sp, #20]
   0x6cc914aa:  41 f0 00 51     orr.w   r1, r1, #536870912      ; 0x20000000
   0x6cc914ae:  01 a8   add     r0, sp, #4
   0x6cc914b0:  4e f6 19 23     movw    r3, #59929      ; 0xea19
   0x6cc914b4:  c6 f6 e0 63     movt    r3, #28384      ; 0x6ee0
   0x6cc914b8:  98 47   blx     r3
   0x6cc914ba:  01 98   ldr     r0, [sp, #4]
   0x6cc914bc:  bd e8 0c 4c     ldmia.w sp!, {r2, r3, r10, r11, lr}
   0x6cc914c0:  01 b0   add     sp, #4
   0x6cc914c2:  70 47   bx      lr
End of assembler dump.

That explains why there is no crash for the same method in my testing when I publish it as a non-R2R image.

  5. I did live debugging of crossgen for the scenario that generates the illegal ldm instruction and copied the debugging output for your reference:
(gdb) bt 5
#0  emitter::emitOutputInstr (this=0x55555593ea30, ig=<optimized out>, id=0x555555945d98, dp=0x7fffffffc830) at /home/oldzhu/buildroot/output/build/dotnetruntime-origin_master/src/coreclr/jit/emitarm.cpp:5998
#1  0x00007ffff6f3962b in emitter::emitIssue1Instr (this=0x55555593ea30, ig=0x555555945c80, id=0x555555945d98, dp=0x7fffffffc830) at /home/oldzhu/buildroot/output/build/dotnetruntime-origin_master/src/coreclr/jit/emit.cpp:3624
#2  emitter::emitEndCodeGen (this=0x55555593ea30, comp=<optimized out>, contTrkPtrLcls=true, fullyInt=<optimized out>, fullPtrMap=<optimized out>, xcptnsCount=<optimized out>, prologSize=0x55555593e8a8, epilogSize=0x55555593e8ac, codeAddr=0x7fffffffce08, coldCodeAddr=0x55555593e898, consAddr=0x55555593e8a0) at /home/oldzhu/buildroot/output/build/dotnetruntime-origin_master/src/coreclr/jit/emit.cpp:5209
#3  0x00007ffff6f1c37f in CodeGen::genEmitMachineCode (this=0x55555593e590) at /home/oldzhu/buildroot/output/build/dotnetruntime-origin_master/src/coreclr/jit/codegencommon.cpp:2326
#4  0x00007ffff6f25752 in CodeGenPhase::DoPhase (this=<optimized out>) at /home/oldzhu/buildroot/output/build/dotnetruntime-origin_master/src/coreclr/jit/codegen.h:1605
(More stack frames follow...)
(gdb) list 5998
5993                assert((imm & 0x3) != 0x3);
5994                if (imm & 0x2)
5995                    code |= 0x8000; //  PC bit
5996                if (imm & 0x1)
5997                    code |= 0x4000; //  LR bit
5998                imm >>= 2;
5999                assert(imm <= 0x1fff); //  13 bits
6000                code |= imm;
6001                dst += emitOutput_Thumb2Instr(dst, code);
6002                break;
(gdb) info reg
rax            0x1                 1
rbx            0xe8bd0000          3904700416
rcx            0x4000              16384

I think the illegal ldm instruction I reported at #33344 is not the same as this one. In my case the illegal instruction is clearly generated by our libclrjit.so, but in this case the illegal instruction is in libcoreclr.so, whose instructions are generated by Clang/LLVM or injected by some other tool.

oldzhu added a commit to oldzhu/4dotnet that referenced this issue Jan 3, 2021
@oldzhu
Contributor

oldzhu commented Jan 3, 2021

I worked out a POC fix for the illegal ARM ldm instruction by modifying emitarm.cpp as below:

--- /dotnetruntime-origin_master/src/coreclr/jit/emitarm.cpp 2021-01-03 09:49:48.790000000 +0800
+++ /dotnetcore/dotnetruntime/modified/emitarm.cpp  2021-01-03 09:53:34.740000000 +0800
@@ -1520,12 +1520,19 @@

             if (imm & 0x8000) // Is the PC being popped?
                 hasPC = true;
+           if (imm == 0x4000)
+           {
+                // We have to use the Thumb-2 pop single register encoding but ldm sp!, {lr}
+                regNumber reg = genRegNumFromMask(imm);
+                emitIns_R(ins, attr, reg);
+                return;
+           }
             if (imm & 0x4000) // Is the LR being popped?
             {
                 hasLR = true;
                 useT2 = true;
             }
-
+
         COMMON_PUSH_POP:

             if (((imm - 1) & imm) == 0) // Is only one or zero bits set in imm?

before the patch:

the arm instructions generated for the method testlib.struct2 testlib.class01.test01(testlib.struct1)

249c: 01 b4            push    {r0}
249e: 00 b5            push    {lr}
24a0: 01 98            ldr     r0, [sp, #4]
24a2: 40 f0 00 50      orr     r0, r0, #536870912
24a6: bd e8 00 40      ldm.w   sp!, {lr}               // this is the illegal instruction generated by our libclrjit.so...
24aa: 01 b0            add     sp, #4
24ac: 70 47            bx      lr

after the patch:

the arm instructions generated for the method testlib.struct2 testlib.class01.test01(testlib.struct1)

249c: 01 b4            push    {r0}
249e: 00 b5            push    {lr}
24a0: 01 98            ldr     r0, [sp, #4]
24a2: 40 f0 00 50      orr     r0, r0, #536870912
24a6: 5d f8 04 eb      ldr     lr, [sp], #4          // now it becomes correct single pop instruction..
24aa: 01 b0            add     sp, #4
24ac: 70 47            bx      lr

I also wrote a simple program to test calling the R2R test01 method. Before the POC patch, it crashed with the illegal instruction; after the POC patch, calling into the R2R test01 method works without any problem.
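
The effect of the patch on the emitted bytes can be illustrated with a small encoder. This is a rough Python sketch, not the emitter's actual logic: it assumes the 32-bit Thumb LDM and post-indexed LDR layouts from the ARM architecture manual, and it generalizes to any single register, whereas the POC patch above only special-cases LR. The function name is hypothetical.

```python
def encode_pop(registers: set) -> tuple:
    """Return (hw1, hw2) for a 32-bit Thumb pop of the given register numbers."""
    allowed = set(range(13)) | {14, 15}           # r0..r12, lr, pc
    if not registers or not registers <= allowed:
        raise ValueError("unsupported register set")
    if {14, 15} <= registers:
        raise ValueError("lr and pc cannot both be popped")

    if len(registers) == 1:
        (rt,) = registers
        # ldr Rt, [sp], #4 (post-indexed):
        #   hw1 = 1111 1000 0101 Rn   with Rn = sp
        #   hw2 = Rt 1 P U W imm8     with P=0, U=1, W=1, imm8=4
        return 0xF85D, (rt << 12) | 0x0B04

    # ldmia.w sp!, {...}: hw2 holds one bit per register.
    mask = sum(1 << r for r in registers)
    return 0xE8BD, mask
```

For `{14}` this yields the halfwords 0xF85D 0xEB04, matching the `5d f8 04 eb` bytes in the "after the patch" listing, while a multi-register set still produces the LDM form.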

@danmoseley
Member

@oldzhu thanks. You might consider using ``` blocks in your text above to fix the formatting.

@oldzhu
Contributor

oldzhu commented Jan 3, 2021

> @oldzhu thanks. You might consider using ``` blocks in your text above to fix the formatting.

Thanks, Dan! I fixed it.

oldzhu added a commit to oldzhu/runtime that referenced this issue Jul 12, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 21, 2021