-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: Go 2: goroutine collector or closer to prevent goroutine leaks #41414
Comments
I don’t understand this statement, go does have such a facility — when a goroutine exits any memory references that are now unreachable are freed. This includes the goroutines stack and it’s associated runtime data structures. What I think you’re asking for is explicit goroutine collection, that is, a way to cause a goroutine to exit this freeing its resources. Is that what you are asking for? |
@davecheney that is garbage collection, unreachable memory is automatically freed by the runtime. I am referering to cleaning up 'zombie' goroutines that are no longer completing any useful function because whatever created them has dropped every reference to them presumably because the work is complete, cancelled or there was a panic. I suppose it's also addressing #5308 in some form. |
Interesting idea. If a collected goroutine panics due to a blocking channel operation, that will most likely terminate the program. Is that what we want? What if the goroutine is holding a lock when it is collected, then it is quite possible that the lock will not be released, and the program will deadlock. Is that what we want? It seems hard to debug. There are quite a few ways that a goroutine can block. It can be executing a blocking system call, or it can be blocked in cgo code. It can be blocking trying to acquire a mutex. It can be sleeping due to |
Thanks for confirming. I have two additional questions;
|
That's not ideal, I think this type of panic could be automatically recovered by the goroutine before exiting.
You would expect locks to be released with defer, on errors, or after recovering from a panic.
The proposal only applies to channel operations and system IO that can return an error such as methods on os.File and net.Conn. Any other blocking operations do not panic and will continue to block as usual.
This proposal adds goroutine reference types that are returned by go invocations, a goroutine is considered to be a 'zombie' when no references to it exist. If I want a goroutine to exist, I keep a reference to it somewhere. If I have no references to the goroutine then it might as well not exist to me
Some 'zombie' goroutines may exit just fine. Others may be blocked on a IO operations that are taking a very long time or are waiting to recieve on a channel that has no senders anymore. In other words, they think they still have work to do. |
So this is the flow of logic
Yea? Something like this?
This doesn't sound like the reference to a goroutine model you described earlier. It sounds different.
Could you be more precise about "just moves it along". Do you mean send some kind of signal too it? |
To leverage this, we'd probably keep a map of active goroutines, and therefore name them. Terminating a goroutine then looks like Wouldn't it be essential in some cases to discover what "process1" is blocked on before deleting it? The proposal doesn't reference any of the recent discussion re interrupting blocking file ops and/or syscalls (e.g. #41054), but seems to incorporate that goal. Is that accurate?
That could become a CPU-intensive loop if any code is retrying on error. And you can't continue past a mutex that didn't lock. |
Yes, this is what I was illustrating with the example: //goroutine is a new builtin 'non-comparable reference type' that is returned by go calls.
var job goroutine = go DoSomething()
//when a value of the goroutine type is *collected* by the garbage collector,
// 1. blocking IO operations inside that goroutine return an "end of goroutine" error.
// 2. subsequent blocking IO operations inside that goroutine panic with "end of goroutine".
// 3. blocking channel operations panic with "end of goroutine".
job = nil
I gave some examples of why a goroutine wouldn't exit on its own. Whether or not a goroutine will exit is a halting problem and does not have anything to do with being a 'zombie' in this context.
Not a signal. Once a goroutine has been 'collected', closed or marked as a 'zombie' under this proposal, it is not allowed to perform IO operations such as read/write on files or connections. It is also not allowed to block during a send or recieve channel operation. Any IO operations that the goroutine makes will return an error such as "end of goroutine". The rationale here is that the goroutine is able to handle these errors/panics by cleaning up and exiting. Here is an example from https://www.ardanlabs.com/blog/2018/11/goroutine-leaks-the-forgotten-sender.html // [Adapted from Listing 1 in the article]
// leak would normally be a buggy function. It launches a goroutine that
// blocks receiving from a channel. Nothing will ever be
// sent on that channel and the channel is never closed so
// that goroutine would be blocked forever.
func leak() {
ch := make(chan int)
//The returned reference is not stored anywhere so the goroutine will be collected.
//The channel operation will panic and the goroutine will exit, preventing the goroutine leak.
_ := go func() {
defer recover() //Recover is added to prevent the panic from bringing down the entire program.
val := <-ch
fmt.Println("We received a value:", val)
}()
}
Yes for globally scoped routines that could make sense, or a map to a slice of goroutine values so that you can associate a job with all the goroutines that are related to that job. If you wanted to cancel the job, you delete it from the map and all the goroutines would get collected and if written correctly would cleanup and exit.
I think you need to provide an example of such a case.
Yes, this proposal can address situations where a goroutine is blocked for a long time on a resource without any SetTimeout methods, set a timeout in the map of active jobs for deletion and the goroutine will return an error on the blocking resource and the goroutine should exit.
Well, if the goroutine repeats a blocking IO operation after an error was returned, then it should panic. A "end of goroutine" needs to be handled correctly the first time. Edit: I have added this to the proposal. |
For example, a process that alternately waits on two resources, one which blocks indefinitely, the other which blocks for short times, but can become stalled due to external failure. You only want to "collect" this goroutine in the latter case. The app could keep a status variable alongside the goroutine reference, and update it before & after every blocking op, or the runtime could let you inspect what any goroutine is currently blocked on. |
Process the resource that could stall inside of a timeout and pass a buffered status channel so that you can detect if the goroutine was ended due to timeout. package main
import (
"fmt"
"io/ioutil"
"os"
"time"
)
//UnreliableResource represents a function that could block for a long time.
func UnreliableResource(status chan error) {
f, err := os.Open("/path/to/unreliable/resource")
if err != nil {
status <- err
return
}
b, err := ioutil.ReadAll(f)
if err != nil {
status <- err
return
}
status <- err
return
}
//timeout ends the given goroutine after the given duration
//timeout must be given the only reference to g, ie go timeout(time.Second, go DoSomething())
//otherwise this function does nothing.
func timeout(duration time.Duration, g goroutine) {
time.Sleep(duration)
}
func main() {
var status = make(chan error, 1)
go timeout(30 * time.Second, go UnreliableResource(status))
//if there is a timeout, error will be "end of goroutine"
if err := <-status; err != nil {
fmt.Println(err)
}
} |
Compare #19702. |
I don't think this proposal is semantically appropriate. The purpose of garbage collection is to “[simulate] a computer with an infinite amount of memory.” On such a system, the blocked goroutines would never be collected, and therefore would never panic or return an error. We violate that abstraction today using finalizers, but finalizers introduce a lot of other problems (such as #9442, #41362, #7546). I think we should seek to minimize their use, not double-down on them as a mechanism for cancellation. From that standpoint, #19702 was a stronger proposal: it sought to make the behavior of the program more like a computer with infinite memory, not less so. |
@bcmills To avoid losing the abstraction, goroutines could be explicitly closed with close (then the developer is free to catch leaking goroutines with runtime.SetFinalizer) but this starts to look like a different proposal. var job = go DoSomething()
//when a value of the goroutine type is closed.
// 1. blocking IO operations inside that goroutine return an "goroutine closed" error.
// 2. subsequent blocking IO operations inside that goroutine panic with "goroutine closed".
// 3. blocking channel operations panic with "goroutine closed".
close(job) This reads nicely and a timeout function is easier to reason about. //timeout closes the given goroutine after the given duration
func timeout(duration time.Duration, job goroutine) {
time.Sleep(duration)
close(job)
}
go timeout(time.Second, go UnreliableWork()) Then there is no need for a special goroutine type, the go invocation could instead return a 'returning channel' that will return the result of the function. Closing this channel would close the goroutine. Although you would need multi-value channels for multiple return values which is a little odd. func DoSomething() (*Result, error) {
return nil, errors.New("failed to do anything")
}
var status chan(*Result, error) = go DoSomething()
result, err <- status This is more like a builtin context.Context. |
Why not to use ctx, cancel := context.WithCancel(context.Background())
go DoSomething(ctx)
time.Sleep(timeout)
cancel() (btw, |
See #28342, there are a number of challenges with context, for example:
Inexerienced Go developers can easily leak goroutines. If DoSomething is leaky as below, context will not prevent the leak: func DoSomething(ctx context.Context) {
var leak chan struct{}
<- leak
} Closing or collecting the goroutine as described in this proposal will. |
Is the rough goal to ensure that you don't get goroutine leaks? If that's the case, then you could have approaches to solving it. For example https://github.com/egonelbre/antifreeze implements a runtime checker for goroutines that were stuck waiting for too long. In a similar fashion you could have something that monitors goroutines to finish in time, e.g. https://play.golang.org/p/SbBEi4OYOH7. And a more complicated version that also prints the stack frame https://play.golang.org/p/L1xiObvrT4a |
I expect synchronous code like we'd write today. Besides that, before you deliberately terminate a goroutine, you may want to log a stack trace. You could also use that info to decide whether to terminate at all.
|
@egonelbre, note that a runtime check for permanently-blocked goroutines was proposed (and, IMO unfortunately, rejected) in #37681. |
There is a classic problem found in Actor systems where an unreachable
Actor can't be garbage collected since it can spontaneously link itself
into the reachability graph simply by sending a reference to itself to
another Actor. Traditional reachability semantics depend on the fact that
unreachability is stable, once an object becomes unreachable it stays
unreachable. That doesn't seem to be the case with goroutines.
…On Wed, Sep 16, 2020 at 6:12 PM Splizard ***@***.***> wrote:
Why not to use context?
See #28342 <#28342>, there are a
number of challenges with context, for example:
- It creates two classes of functions, ones that know about context
and ones that don't.
- Every function that needs to be cancellable needs to have a ctx
context.Context as an argument.
- Unlike with errors, it has no builtin support for its usecase.
- ctx context.Context stutters.
- Needs to be handled and used correctly otherwise there will be
goroutine leaks.
- The name can be confusing and doesn't imply that any kind of
cancellation is involved.
- Is easy to abuse as a defacto untyped goroutine local-storage.
Inexerienced Go developers can easily leak goroutines. If DoSomething is
leaky, context will not preven this type of leak:
func DoSomething(ctx context.Context) {
var leak chan struct{}
<- leak
}
Closing or collecting the goroutine will.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#41414 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAHNNH5G3OJYM3VLFGHAUK3SGEZ4FANCNFSM4ROEMCYA>
.
|
@Splizard you could also note in the issue that for stalled I/O syscalls, the leak encompasses an OS thread, which is typically a lot more memory than a goroutine. |
The proposal here seems too subtle to me. It seems odd to have a different behavior based on whether the result of the It seems confusing to have the behavior change specifically for channel operations and for I/O syscalls. The latter isn't all that well defined in any case. There may well be ways to improve goroutine cancellation and leaks. But this specific proposal seems to me to be tweaking the language to support a particular use case, rather than providing a general and orthogonal mechanism. |
Based on the discussion above, this is a likely decline. Leaving open for four weeks for final comments. |
This is addressed by explicitly closing goroutines instead of implicitly collecting them, so if the collector option of this proposal is a likely decline, should I create a new proposal about explicitly closeable goroutines? var job = go DoSomething()
//when a value of the goroutine type is closed.
// 1. blocking Read/Write operations inside that goroutine return an "goroutine closed" error.
// 2. subsequent blocking Read/Write operations inside that goroutine panic with "goroutine closed".
// 3. blocking channel operations panic with "goroutine closed".
close(job)
The rationale is that a closed goroutine no longer has permission to block and it should terminate as soon as possible. EDIT:
What is the particular use case? |
What should happen if a goroutine is blocked waiting for a lock? What should happen if a goroutine currently holds this lock, but the author has chosen not to use defer to free that lock? |
It remains blocked. Unless the lock implementation uses channels, or blocking Read/Write calls in which case a panic or error should be raised and the lock will not be aquired.
If they are reading or writing to an unbuffered channel while they are holding a lock, then the goroutine panics and the lock is leaked. Much like present Go behaviour, if a goroutine writes to a closed channel whilst holding a lock, it panics and the lock is leaked. The proposal extends this risk to reading from unbuffered channels too. If you are holding a lock and you perform blocking channel operations, you must defer the release of the lock or catch any potential panics so that the lock doesn't leak. |
Thank you for explaining. The restrictions on locks appear to undermine the utility of this proposal — goroutines waiting for a lock will continue to “leak”. |
Let me first say clearly that I think that any proposal around closeable goroutines is unlikely to be accepted. That said, if you want to pursue this, you should also address the other concerns, not just this one.
Thanks, I think I understand that. The problem is that it is very specific and is not orthogonal. Certain cases work, other cases fail, and the distinction is based on the implementation rather than on any fundamental aspect of the language. There is no deep reason why we say a goroutine will return an error from a Reader but not from, say, a
Goroutines that only block for I/O or channel operations. |
Further discussion, but no change in consensus. |
Whoops, sorry, jumped the gun there. |
No change in consensus. This time for sure. |
Memory management in Go is implicit and this is considered to be a good thing. I think this needs to be kept in mind when discussing language features that are implicit vs explicit.
Memory leaks are one of the issues that an implicit garbage collector is able to prevent. All memory is accounted for in Go.
However, goroutines leaks have replaced memory leaks and just like manual memory-managment, goroutine-management is a pain. Just like in C where developers must think about the lifetime of any memory they allocate. In Go, developers need to think about the lifetime of any goroutine they create. In both cases, this is done explicitly. In C with free and in Go with context.Context. Both are optional and can result in leaks.
@networkimprov notes this about goroutine leaks.
The question would be then, if Go has an implicit garbage collector, why doesn't it have an implicit goroutine collector.
My proposal includes a small language change so that goroutines are referenced and therefore implicitly collectable.
To extend the lifetime of a goroutine, assign it to a global variable or appropriately scoped value (ie a http.Request).
To cancel a goroutine, remove any references to it that are held by your program.
This proposal would also reduce the need of context.Context being added to every function signature and addresses some of the issues of #28342
If backwards compatabillity is desired, go calls made as a statement should not have their goroutines collected. As for future compatabillity, the goroutine type is flexible and other functionality for dealing with goroutines (ie return values, recovering from panics, and/or priority) could be added in the future without breaking compatabillity.
Granted, this wouldn't prevent all goroutines leaks ie (infinite loops without blocking operations or error handling) but I believe it would catch most real-world cases.
This proposal is similar to #27982 however it doesn't have the same objections as that proposal.
Namely, there is no parent-child relationship between goroutines and this is not exactly syntactic sugar for contexts.
Closable goroutines
An alternative to having a goroutine collector, is to have closable goroutines which are closed with the builtin close() function.
When closed they act as if they were collected by a goroutine collector as described above.
See comment below.
FAQ
What blocking operations return an error or panic?
The only blocking IO operations that raise an error or panic are channel operations and system-call backed functions that can return an error such as the methods on os.File or net.Conn, all other blocking operations block as usual.
The text was updated successfully, but these errors were encountered: