Skip to content
This repository has been archived by the owner on Feb 3, 2018. It is now read-only.

Protect against hung subprocesses with inactivity timeouts #84

Closed
sdboyer opened this issue Aug 18, 2016 · 12 comments
Closed

Protect against hung subprocesses with inactivity timeouts #84

sdboyer opened this issue Aug 18, 2016 · 12 comments

Comments

@sdboyer
Copy link
Owner

sdboyer commented Aug 18, 2016

Noticed today that trying to bzr branch with the bzr:// protocol ended up causing the child process to just...sit, generating no output.

We need some basic protection against this (there's no general solution, because halting problem). Sadly, we can't just use a simple timeout, as there are times when we absolutely expect these processes to take a while (e.g., cloning kubernetes). However, if we can watch stdout/stderr for output and update a timer every time we see activity, then we should be able to set an "inactivity" timeout, where we kill the command only if a certain window has elapsed since progress was reported. That'll work for most git, hg, and bzr operations at least, I think.

@sdboyer sdboyer added the bug label Aug 18, 2016
@sdboyer sdboyer changed the title Protect against inactive subprocess timeouts Protect against inactive subprocesses with inactivity timeouts Aug 18, 2016
@sdboyer sdboyer changed the title Protect against inactive subprocesses with inactivity timeouts Protect against hung subprocesses with inactivity timeouts Aug 18, 2016
@erizocosmico
Copy link
Contributor

I can take this one, as said in #159!

@sdboyer
Copy link
Owner Author

sdboyer commented Jan 28, 2017

Awesome. Please don't hesitate AT ALL to ask questions. Here, or in slack - whatever. Meantime, here's some starting thoughts:

  • I think you're likely to run into a blocker right from the start the start, as Masterminds/vcs does not offer an interface for direct control over executing processes. I can't imagine a way of doing this without direct access to the running exec.Command object and using Start(), as everything else will just block until the command completes. So, you may need to start with a patch to Masterminds/vcs that creates a companion to RunFromDir that returns the assembled exec.Command object instead of calling CombinedOutput(). (/cc @mattfarina)
  • We do know we plan to be passing down cancellation channels eventually (per Incorporate context.Context into SourceManager #159) into the command loop, so design with that in mind.
  • Be warned, I plan to convert gps from glide to dep tonight. (This is risky, because https://github.com/golang/dep/milestone/1)
  • The *SourceMgr side of the code (as opposed to the solver) has been devolving towards spaghetti, and the path by which these objects get created/managed is kinda long and winding. I like grokking bigger-picture context before I work, so in case you do, here's how I recommend you go about grokking the flow:
    • All the logic you'll be changing is inside source types, which are called from pretty much all the public interface methods, returned from SourceMgr.getSourceFor()
    • You can mostly skip the deducer stuff there - just see how a maybeSource gets set up, and how it interweaves with baseVCSSource
    • There's tons of logic that tries to "lazily" determine when it needs to do work. Much of the refactor I need to do is bringing some sanity and order to that, so try to ignore it. Just focus on the calls that are made against the baseVCSSource.crepo.r - that's the actual Masterminds/vcs object.
  • Tests, please! 😄 This part of gps isn't nearly as well-tested as the solver, but it's time to impose discipline.
  • Yes, a bunch of these commands are very racey. The solver is a strictly serial algorithm, but that's not a good excuse. Really, I just wasn't thinking about it when I did my first pass. Fixing that is my primary motivation for refactoring all of SourceMgr.

@sdboyer
Copy link
Owner Author

sdboyer commented Jan 28, 2017

I'd also be happy to find time for a an hour pairing session on this, if you think it'd be helpful. Maybe once you've gotten your hands dirty a bit, and you have some questions about how the hell it all fits together.

@erizocosmico
Copy link
Contributor

erizocosmico commented Jan 29, 2017

Nice, let me grok a bit with the code and I'll ping you when I have some questions!

EDIT:

There are some operations in gps that don't directly use RunFromDir, but directly use the API of vcs. These operations could not be monitored unless we migrate them to RunFromDir or add the timeout directly into vcs, which I guess is not what we want. Should I assume we'll manage the exec.Cmd always for now?

Aside from that, I've been thinking about how to implement this and I thought of the following:

  • In vcs: add a method to Repo interface (and implementors) CmdFromDir which does the exact same thing RunFromDir does, except it does not call CombinedOutput and returns the command. Also, modify RunFromDir so it uses CmdFromDir internally and avoid having the same code on both.
  • In gps: create a type MonitoredCmd (any suggestions on the name?, doesn't quite convince me), which would be created with a timeout and an exec.Cmd. We can offer a exec.Cmd-like API providing CombinedOutput/Output and Run/Start+Wait. Adding context.Context to the mix should be fairly easy. Instead of just killing the underlying cmd process when there is no activity, do it when ctx deadline is met as well.

Any thoughts about this before I start coding the proposed solution?

Also, gophers slack says I need to contact an admin to get an invite.

@sdboyer
Copy link
Owner Author

sdboyer commented Jan 29, 2017

Both of these ideas sound 👍!

I don't have any name suggestions for MonitoredCmd - it seems fine. Though, unless there's some reason it needs to be exposed via the public interface, it can just be monitoredCmd for now. That'll give us more leeway in renaming it in the future.

Of course, if we're inferring from stdout/stderr outputs, it may require a bit of research to pick sane numbers for these inactivity timeouts.

@sdboyer
Copy link
Owner Author

sdboyer commented Jan 29, 2017

Oh, and https://invite.slack.golangbridge.org/ should get you an invite for the gopher slack

@erizocosmico
Copy link
Contributor

Thanks! Will start working on the implementation then

@sdboyer
Copy link
Owner Author

sdboyer commented Feb 10, 2017

Just checking in on this :)

@erizocosmico
Copy link
Contributor

I'm working on implementing the monitored cmds on all operations right now. The issue with this is that all operations that are not RunFromDir can't be trivially changed to use monitoredCmd.
For example, when vcs.Repo.Update is used, internally that cmd is not monitored. The solution to that is reimplement the Update (or any other method being used) for all possible VCSs using monitored cmds instead.

What are your thoughts about this?

@sdboyer
Copy link
Owner Author

sdboyer commented Feb 10, 2017

Unfortunately, I think you're right - there's no magical way around that design of the interface. And now that you say it, I'm picturing just how much work that is ☹️ Thanks for sticking with it!

The only real thought I'd have about it is that I'd love to see the WIP as you go in a PR, as the e.g. vcs_source.go file is already poorly organized, but still bloated. I don't know exactly what that SHOULD look like, but I'd rather avoid having you get to the end of the work and finding myself wanting to ask for a bunch of reorganization.

@sdboyer
Copy link
Owner Author

sdboyer commented Apr 16, 2017

I think this can be marked as done now. There are some improvements needed for handling on windows, but that's not the original scope here.

Would you agree, @erizocosmico ?

@erizocosmico
Copy link
Contributor

erizocosmico commented Apr 16, 2017 via email

@sdboyer sdboyer closed this as completed Apr 16, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants