-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: stack barrier inconsistency during profiling (dup #12932?) #12528
Comments
CC @RLH @aclements |
Same on What did I do?
|
Hi @vsdutka and @daemonfire300. Please post as much of the error and traceback as you can (if you need to redact frames related to your apps, that's fine, but please include all of the frames in the Go runtime). Could you give more detail on how you produced this error, ideally enough that we could try to reproduce it ourselves? Thanks! |
@aclements sorry for the late response, but what do you mean by "redact[ing] frames"? The code that is being executed? The stracktrace? |
@daemonfire300, I mean we need the actual error printed by the crash to debug this, including the exact error message and the stack traces that follow the error message that involve functions in the runtime. It's easiest to just paste all of the output starting with the error message here, but if you're working on a closed system, it's okay if you leave out the bits that are specific to your system. |
Just to clarify our motivation here. If we can reproduce the problem On Tue, Oct 6, 2015 at 11:24 AM, Austin Clements [email protected]
|
In my case app crash without any stack. Only message about mp.p == nil Отправлено из myMail for iOS Вторник, 6 октября 2015 г., 18:31 +0300 от Richard L. Hudson [email protected]:
|
Hmm, Are you using cgo to call C or the unsafe package? Do you have any On Tue, Oct 6, 2015 at 4:18 PM, vsdutka [email protected] wrote:
|
I make RESTFul service for Oracle DB. And I use GOracle lib for working with oracle Pl/sql. When I make stress test (400 concurrent request) app crash after 5 minutes. Отправлено из myMail for iOS Среда, 7 октября 2015 г., 1:04 +0300 от Richard L. Hudson [email protected]:
|
Thanks for the information but unfortunately I don't know how to recreate On Wed, Oct 7, 2015 at 12:29 AM, vsdutka [email protected] wrote:
|
I will try it tomorrow Отправлено из myMail for iOS Среда, 7 октября 2015 г., 14:41 +0300 от Richard L. Hudson [email protected]:
|
I can post the cpu pprof and pprof memory files as pdf if that is helping? Because I dont get a stacktrace, same as the other users. |
import (
"bytes"
"encoding/json"
"errors"
"flag"
"fmt"
"html/template"
"io/ioutil"
"log"
"math"
"net/http"
"runtime"
"sort"
"strconv"
"strings"
"time"
"github.com/Azure/azure-sdk-for-go/storage"
"github.com/davecheney/profile"
_ "github.com/go-sql-driver/mysql"
"github.com/gorilla/mux"
)
func contentTypeWrapper(handler http.HandlerFunc, contentType string) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", contentType)
handler(w, r)
}
}
func benchmarkWrapper(handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
date := time.Now()
start := time.Now().UnixNano()
handler(w, r)
end := time.Now().UnixNano() - start
benchmark := benchmark{
Duration: float64(end),
Route: r.URL.Path,
Method: r.Method,
Tag: "[HTTP]",
Date: date,
}
metrics = append(metrics, benchmark)
gatherCoreMetrics("[HTTP]", start, r)
}
}
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
flag.Parse()
pcfg := profile.Config{
MemProfile: true,
CPUProfile: true,
BlockProfile: true,
}
pcfg.ProfilePath = "."
defer profile.Start(&pcfg).Stop()
r := mux.NewRouter()
r.HandleFunc("/pictures", contentTypeWrapper(benchmarkWrapper(authWrapper(picturesHandler)), "application/json")).Methods("GET")
http.Handle("/", r)
http.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.Dir("./public"))))
//http.ListenAndServe(":"+os.Getenv("HTTP_PLATFORM_PORT"), nil)
http.ListenAndServe(":"+"8989", nil)
} The picturesHandler does some HTTP requests against S3 buckets (only meta data), nothing fancy. Writes it into a []myCustomStruct and json.Marshals it and does w.Write(response) |
Thanks, but I don't think this would help. What's happening is that an internal check for an invariant in the runtime is failing. Most likely whatever's going wrong is going wrong an instant before the failure. It's unfortunate that the runtime isn't printing a stack trace, but there are other ways to get a stack trace. It's likely that if we can get a stack trace, the problem will be obvious. Please run your program under gdb, set a breakpoint on the panic with "br 'runtime.throw'" (note the single quotes), run it with "run" until it stops at the panic, and use "bt" to print a stack trace. |
How I can get stack trace if I recieve this error when running test by 2015-10-08 2:52 GMT+03:00 Austin Clements [email protected]:
|
I try to test my lib and recieve this error if use options Stack trace 2015-10-08 13:40 GMT+03:00 Vyacheslav Dutka [email protected]:
|
Hmm. Well, that stack is definitely not right. Could you get it crash again and in addition to "bt", run "x/256xg $rsp" in gdb and "go tool nm -n " on your binary and paste the results?
Do you mean that it crashes if and only if you pass -test.cpuprofile? |
gdb --args otasker.test.exe -test.cpuprofile=prof.out -test.run=TestDescri
warning: ERROR !!! HeapSetInformation failed to set g_SpyHeap to LFH [New Thread 2692.0x1574] Breakpoint 1, 0x0000000000430310 in runtime.throw ()
Quit anyway? (y or n) y
100000 C size_of_heap_reserve |
@vsdutka, thanks! I haven't had a chance to analyze this yet, but I noticed the nm output got cut off. Could you post it as an attachment or a gist? |
2015-10-13 3:09 GMT+03:00 Austin Clements [email protected]:
100000 C size_of_heap_reserve |
Your nm still got cut off, but I put together a rough call stack below. Note that some of these are probably stray values that look like return RIPs, but aren't (like github.com/vsdutka/otasker.Describe.func1). But overall this looks pretty believable.
The most obvious possibility is that A more complex but likely possibility is that something went wrong in @vsdutka, can you attach your binary (the one you used with gdb to dump the stack) to the issue? I should be able to work out exactly where the real panic happened using that and the stack dump. If you can't, then please send the output of "go tool objdump -s 'cpuProfile..(add|evict)|panicindex' ", since that's most likely to tell me what I need to know. Thanks! |
/cc @mdempsky, in case you have any insight on what could be causing a panicindex in cpuProfile.evict (or possibly cpuProfile.add). |
It's not obvious to me why the values at $rsp mean anything. The gdb backtrace: #0 0x0000000000430310 in runtime.throw () is clear: it is backtracing to the function called via systemstack from writebarrierptr_nostore1. This is on the system stack, which is why gdb can't backtrace any further. But when you look at $rsp from that point, you are just seeing the system stack. You aren't seeing the Go call stack, which is some goroutine call stack, at some completely different location. |
Given that mp.p is nil, we were already on the system stack when we entered
writebarrierptr_nostore1. That should mean that the stack at $rsp is
useful, since there would have been no stack switch.
|
Then I don't understand why the gdb backtrace is so bad. |
I don't understand that either. But if the traceback was cut off because of a switch to the system stack, I would have expected GDB to at least get as far as the return to systemstack. My guess would be that it has something to do with Windows, but that's purely a guess. |
2015-10-14 5:41 GMT+03:00 Austin Clements [email protected]:
|
@vsdutka, did you mean to post a blank message? If you were trying to attach your binary, you need to go to the GitHub web interface and use the file attachment link at the bottom of the issue reply box. |
@vsdutka or @daemonfire300, can you still reproduce this with Go 1.5.2? |
=== RUN TestDescribe --- PASS: TestDescribe (0.17s) === RUN TestDescribeAfterRecompile --- FAIL: TestDescribeAfterRecompile (0.16s) describe_test.go:108: got 2015-12-17T12:50:57+03:00, want 2015-12-17T12:50:57+03:00 === RUN TestDescribe1 --- PASS: TestDescribe1 (3.26s) === RUN TestDescribeNotExists --- PASS: TestDescribeNotExists (0.10s) === RUN TestDescribeConcurent10 found next stack barrier at 0xc0820e1c28; expected [@@@ fatal error: missed stack barrier runtime stack: exit status 3221225477 FAIL github.com/vsdutka/otasker 7.800s |
@vsdutka, just to clarify, that was with Go 1.5.2? Did you happen to get a core file? Is profiling enabled? How reproducible is this? If you do get other reproductions, please post them, since more data may help with diagnosing this.
Hmm. According to this it found the next stack barrier at exactly where it was supposed to be. This may be a tight race where the stkbar array was updated between the check in gentraceback and the panic print. |
I read that wrong. The @@@ indicates gp.stkbarPos, while the ==> indicates where gentraceback found the unexpected stack barrier. However, the message is slightly misleading in this case: we didn't actually go through the two stack barriers it's printing; rather gentraceback wasn't expecting any more stack barriers from the list of stack barriers it captured when it started. It may have started with an empty list, or it may have successfully unwound some barriers that aren't the ones being printed (though starting with an empty list seems more likely). At some point between when gentraceback started (and captured its expected stack barrier list) and when it found this stack barrier, something else set the stack barrier list that's being printed. |
I believe I know what's going on here. If sigprof fails to acquire the stack barrier lock, it can still perform a traceback if the M is in a Cgo call. This makes the following sequence possible:
The solution is that sigprof must not perform any user stack walking if it fails to acquire the stack barrier lock, regardless of the state of the G. |
I went through every (non-testing) use of gentraceback and every place that modifies stkbar or stkbarPos to root out remaining races:
There are four labels on functions above: "STW" means the function stops the world, "stack lock" means that the function holds the stack lock while reading or writing stkbar, "safe point" means the function ensures the G is at a safe point, and "sync" means the function is running synchronously on the G and preventing safe points. Two "stack lock" functions are safe because they serialize on the lock. Two "sync" functions are safe because they cannot both be running on the G at the same time. A "sync" and a "safe point" function are safe because the sync function prevents the G from reaching a safe point, and the safe point function prevents the G from entering a sync function. For each reader/writer pair, I've written what synchronizes them, if anything. In many cases there are multiple things that synchronize them, in which case I've picked one to simplify the table. This shows that we still have three problems in profiling signals (two of which are essentially the same), and a frustrating level of complexity. I plan to make two changes to address these:
This should simplify the synchronization down to:
|
I think the sigprof libcall case is actually safe. I thought that was equivalent to the sigprof cgo case, but if I'm reading the libcall code correctly, libcalls switch to the system stack and leave the G in _Grunning, which means sigprof libcall is actually "sync" in the above nomenclature (it's not at a safe point and won't reach one while it's in the libcall). Since every stkbar write is either "sync" or "safe point", that makes sigprof libcall safe. I think, to keep things simple, I'm still going to require that sigprof hold the stack barrier lock in this case. If I do understand the libcall case correctly, sigprof will always successfully acquire the stack barrier lock if the G is in a libcall, so no harm done. |
CL https://golang.org/cl/18022 mentions this issue. |
CL https://golang.org/cl/18023 mentions this issue. |
Currently, setNextBarrierPC manipulates the stack barriers without acquiring the stack barrier lock. This is mostly okay because setNextBarrierPC also runs synchronously on the G and prevents safe points, but this doesn't prevent a sigprof from occurring during a setNextBarrierPC and performing a traceback. Given that setNextBarrierPC simply sets one entry in the stack barrier array, this is almost certainly safe in reality. However, given that this depends on a subtle argument, which may not hold in the future, and that setNextBarrierPC almost never happens, making it nowhere near performance-critical, we can simply acquire the stack barrier lock and be sure that the synchronization will work. Updates #12528. For 1.5.3. Change-Id: Ife696e10d969f190157eb1cbe762a2de2ebce079 Reviewed-on: https://go-review.googlesource.com/18022 Run-TryBot: Austin Clements <[email protected]> Reviewed-by: Russ Cox <[email protected]>
I received few new errors with cgo+oracle oci -( memory corruption) after compile with 1.5.2. I will try to reproduce. Отправлено из myMail for iOS Пятница, 18 декабря 2015 г., 20:11 +0300 от [email protected] [email protected]:
|
@vsdutka, can you try reproducing with the patch from https://go-review.googlesource.com/18027? I believe that should fix this problem. |
CL https://golang.org/cl/18026 mentions this issue. |
CL https://golang.org/cl/18027 mentions this issue. |
How to download patched go? 2015-12-18 21:02 GMT+03:00 GopherBot [email protected]:
|
@vsdutka, if you don't already have the Go sources downloaded, following the directions at https://golang.org/doc/install/source (which boils down to running Let me know if you can't build Go from source and I'll see if I can build it for you. |
After compilation with patched Go: c:/!dev/go/bin/go.exe test -v -cpuprofile=prof.out ./... === RUN TestDescribe --- PASS: TestDescribe (0.23s) === RUN TestDescribeAfterRecompile --- FAIL: TestDescribeAfterRecompile (0.89s) describe_test.go:108: got 2015-12-21T12:50:53+03:00, want 2015-12-21T12:50:53+03:00 === RUN TestDescribe1 --- PASS: TestDescribe1 (3.62s) === RUN TestDescribeNotExists --- PASS: TestDescribeNotExists (0.10s) === RUN TestDescribeConcurent10 found next stack barrier at 0xc08316fdf0; expected [@@@ fatal error: missed stack barrier runtime stack: exit status 3221225477 2015-12-21 10:12 GMT+03:00 Austin Clements [email protected]:
|
That's too bad. Can you check out https://go-review.googlesource.com/18091 ( |
c:/!dev/go/bin/go.exe test -v -cpuprofile=prof.out ./... === RUN TestDescribe --- PASS: TestDescribe (0.48s) === RUN TestDescribeAfterRecompile --- PASS: TestDescribeAfterRecompile (1.37s) === RUN TestDescribe1 --- PASS: TestDescribe1 (2.99s) === RUN TestDescribeNotExists --- PASS: TestDescribeNotExists (0.09s) === RUN TestDescribeConcurent10 found next stack barrier at 0xc0831f9df0; expected [@@@ flags 0xc8; original stkbar [*0xc0831f9c28=0x4bbb6a ] fatal error: missed stack barrier runtime stack: exit status 3221225477 FAIL github.com/vsdutka/otasker 16.455s Ошибка: процесс завершен с кодом 1. 2015-12-21 21:36 GMT+03:00 Austin Clements [email protected]:
|
I see the problem. My fix was correct in spirit, but I screwed up the case where sp < gp.stack.lo || gp.stack.hi < sp and we're in a cgo call. In this case, tracebackUser is true, so we do the traceback, but we haven't locked the user stack, so this isn't safe. |
CL https://golang.org/cl/18255 mentions this issue. |
CL https://golang.org/cl/18328 mentions this issue. |
I started to give error "writebarrierptr_nostore1" without any stack after compilation of my app "https://github.com/vsdutka/iplsgo" with Go 1.5 and stress it by 400 concurent requests.
OS: Windows Server 2008 R2 Standart 64bit
The text was updated successfully, but these errors were encountered: