-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exit code 139 on 'dotnet new' #34231
Comments
cc: @janvorli |
Possibly related, ran into 139 when running
|
@karelz @danmosemsft - can we prioritize investigation here? This is causing infrastructure flakiness for ASP.NET Core. |
What is base OS version? Is there some better way how to reproduce beside CI? |
Well the two builds I linked are no longer around because it's been a month. Luckily I think I still have one of the core dumps and it loads on my Ubuntu machine.
Nope, very random and infrequent. |
the one I linked had QueueId |
@BrennanConroy can you please share the core so that I can take a deeper look? |
Sent an email. |
Tagging subscribers to this area: @dotnet/ncl |
@ryanbrandenburg why do you think your callstack in #34231 (comment) is related to this issue? It is because the symptom (exit code) is the same or is there more to it? |
The exit code is the only connection I know of. That call-stack is purely on our end and was included only in case looking at the site that calls it might be helpful to the investigator. @BrennanConroy was the one who originally called out the possible connection, so he might have done a more in-depth analysis. |
Exit code 139 means "access violation", so it can be effectively anything. But this is what makes it point to the old issue fixed in 2.1 / 2.2:
|
Here is the top of stack trace for the old fixed issue:
And here is the full stack trace from the current dump:
As you can see from the fact that the
So the process is exiting. From this point of view, it looks pretty similar to the old fixed issue. |
@bartonjs any idea why the openssl shutdown could get triggered even though we are passing the OPENSSL_INIT_NO_ATEXIT flag? (I've verified that we are passing that flag in there). The openssl on Ubuntu 18.04 is openssl 1.1.1 (I've verified that). |
Nope. Git-blame says that that code hasn't ever changed. https://github.com/openssl/openssl/blame/OpenSSL_1_1_1-stable/crypto/init.c#L650-L657. ... unless something in the process loaded OpenSSL before we did, so our NO_ATEXIT didn't matter? |
@BrennanConroy do you happen to know the full command line used for running the crashed process? GDB shows just a part of it:
|
I don't know 100%, but my guess is: |
It seems to be the one, I've found all these as strings in the dump. |
@bartonjs I've ran the command above, set a breakpoint at the |
So after installing the matching openssl sources on the Ubuntu 18.04, I've found a surprising thing. The sources don't contain the lines https://github.com/openssl/openssl/blame/OpenSSL_1_1_1-stable/crypto/init.c#L650-L657 at all! And examining the disassembly of the function confirmed that. While I can see references to symbols Unfortunately this version registers the OPENSSL_cleanup unconditionally. |
If the only real error we ever see is in ERR_get_error_string, maybe we want to register our own atexit after openssl init that sets a static and we just return NULL from the shim? (Since they're called in reverse order of registration we need to do it after) Presumably other things were happening more gracefully, since we tried to throw an exception. |
It seems like a workaround we could possibly do, however there could be a race between the call to ERR_get_error_string and the new atexit handler you suggest. If the new atexit handler was called while the ERR_get_error_string was already running and the openssl's atexit handler was called before the ERR_get_error_string accessed the lock, we would still crash. So it seems we would need to place a lock around the ERR_get_error_string call and take the same lock for setting the static you've mentioned. |
@bartonjs are you planning to implement the workaround you've suggested? |
@janvorli It's on my todo list, but if you have some free time and want to jump on it first, that's good with me. |
@bartonjs I don't have a free time now, but I'll ping you if I find some before you get to that. |
I can take a look. Should we do that only for OpenSSL 1.1.1? |
When it rains, it pours, eh? 😄 When Jan said he didn't have time I pushed this up to my current task. Hopefully I'll have a PR soon (though I don't know how to test to see that it covered everything we cared about). |
OK, I have a fix... but every weird threading test I've thrown at it hasn't reproduced the problem (all of my threads seem to exit before my atexit handler gets called, whether they're background threads or foreground threads). So I don't get the satisfaction of empirical verification. |
And... so 6 minutes later I finally get one that works (it required background threads, and apparently just luck). |
Terrible code of race-induction: using System;
using System.Reflection;
using System.Runtime.InteropServices;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Threading;
namespace RaceIt
{
internal static class Program
{
private static int Main()
{
int count = 0;
bool shutdown = false;
Type type = typeof(X509Certificate2).Assembly.GetType("Interop").GetNestedType("Crypto", BindingFlags.NonPublic);
MethodInfo info = type.GetMethod("CreateOpenSslCryptographicException", BindingFlags.NonPublic | BindingFlags.Static);
Func<Exception> func = (Func<Exception>)info.CreateDelegate(typeof(Func<Exception>));
try { new X509Certificate2(new byte[5]).Dispose(); } catch {}
for (int i = 0; i < 50; i++)
{
Thread t = new Thread(
() =>
{
string last = null;
var stopwatch = new System.Diagnostics.Stopwatch();
while (stopwatch.ElapsedMilliseconds < 2000)
{
ERR_put_error(1, 2, 3, "this file", 17);
Exception e = func();
string cur = $"{e.GetType().Name}: {e.Message}";
if (cur != last)
{
Console.WriteLine(cur);
last = cur;
}
Interlocked.Increment(ref count);
if (shutdown) { stopwatch.Start(); }
}
Console.WriteLine($"{count} exceptions, says thread {Thread.CurrentThread.ManagedThreadId}.");
});
t.IsBackground = true;
t.Start();
}
for (int i = 10; i >= 1; i--)
{
Console.WriteLine(i);
Thread.Sleep(1000);
}
shutdown = true;
Console.WriteLine($"{count} exceptions, roughly.");
return 2;
}
[DllImport("libcrypto.so.1.1")]
private extern static void ERR_put_error(int lib, int func, int reason, string file, int line);
}
} |
Looks like the same issue as #28932 which was fixed in 2.1 and 2.2?
Core dump: https://dev.azure.com/dnceng/internal/_build/results?buildId=574320&view=ms.vss-test-web.build-test-results-tab&runId=18013826&resultId=120463&paneView=attachments
and
https://dev.azure.com/dnceng/internal/_build/results?buildId=564240&view=ms.vss-test-web.build-test-results-tab&runId=17753940&resultId=120476&paneView=attachments
I took a look at the first core dump:
dotnet dump analyze shows:
gdb shows:
#0 0x00007fb482f382c2 in __pthread_mutex_trylock (mutex=0x0) at ../nptl/pthread_mutex_trylock.c:172
for the first frameRuntime version 5.0.0-preview.2.20119.9
@bartonjs @janvorli
The text was updated successfully, but these errors were encountered: