Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection to synclet stops working #2182

Closed
landism opened this issue Sep 11, 2019 · 14 comments
Closed

connection to synclet stops working #2182

landism opened this issue Sep 11, 2019 · 14 comments
Labels
bug Something isn't working

Comments

@landism
Copy link
Member

landism commented Sep 11, 2019

Sometimes if one spins up Tilt and then waits a bit and then tries to do a live update, one gets this:

Trying to build and deploy with *engine.LiveUpdateBuildAndDeployer
 → Updating container(s): 3f6435e478
Will copy 1 file(s) to container(s): 3f6435e478
- ‘/Users/omrieival/dev/…/login.html’ --> ‘/…/login.html’
got unexpected error during build/deploy: failed invoking synclet.UpdateContainer: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = “transport: Error while dialing dial tcp 127.0.0.1:64375: connect: connection refused”

I suspect this is the same issue as #2099 - the port forward to the synclet dies and doesn't come back.

@maiamcc
Copy link
Contributor

maiamcc commented Sep 12, 2019

@landism do we have a sense of how common this issue is / how many people it's breaking?

@maiamcc maiamcc added the bug Something isn't working label Sep 13, 2019
@landism
Copy link
Member Author

landism commented Sep 13, 2019

@landism do we have a sense of how common this issue is / how many people it's breaking?

I've only seen one report. I haven't spent any time trying to see how easy it is to repro.

@nicks
Copy link
Member

nicks commented Sep 19, 2019

from the #tilt kubernetes slack:

regarding the synclet error which I’ve previously raised(and also encountered again).. seems to be derived from a grpc bug that was fixed in this pr: https://github.com/grpc/grpc-go/pull/2669/files/61b6168befee28293298cfcde420115c90997021#diff-e1550a73f5d25064c8b586ec68d81a64R1035
issue: grpc/grpc-go#2636
Looks like this during live update:

  → Updating container(s): 8ef2741b02
Will copy 1 file(s) to container(s): 8ef2741b02
- '/Users/omrieival/dev/core-server/server/utils/sqs.py' --> '/home/ubuntu/tapingo/server/utils/sqs.py'
got unexpected error during build/deploy: failed invoking synclet.UpdateContainer: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:56951: connect: connection refused"

@nicks
Copy link
Member

nicks commented Sep 19, 2019

for next steps: we still need to adapt the retry logic i added for normal port-forwards to the synclet port-forwarding. The logic is slightly different because the local synclet port is dynamically allocated

@landism
Copy link
Member Author

landism commented Sep 20, 2019

In the interim, one can work around this by:

  1. using the 'exec' update mode, which uses kubectl exec instead of the Synclet tilt up --update-mode=exec
  2. restart_container, is not supported with --update-mode=exec, so any services that use it instead need https://github.com/windmilleng/rerun-process-wrapper

@omrikiei
Copy link
Contributor

omrikiei commented Dec 9, 2019

thanks for the intermediate solution @landism, is there a plan to upgrade the grpc package in the next versions to try and resolve this?

@landism
Copy link
Member Author

landism commented Dec 9, 2019

thanks for the intermediate solution @landism, is there a plan to upgrade the grpc package in the next versions to try and resolve this?

The tilt releases have had the grpc fix mentioned above for the last two months.
We have not yet done a synclet release that contains the above fix, but my understanding is that it's a client-side-only fix, so the tilt release should have sufficed.

Have you been observing this problem using a tilt release >= 0.10.14?

@omrikiei
Copy link
Contributor

omrikiei commented Dec 10, 2019

Have you been observing this problem using a tilt release >= 0.10.14?

Yes, one of our engineers just experienced it in 0.10.23:

STEP 1/1 — updating image registry.tapingo.com/tapingo/core_server
│ Updating container: 95b762bf42
Will copy 1 file(s) to container: 95b762bf42 - ‘/Users/alondener/Projects/Tapingo/core-server/server/backoffice/shops/views/editor.py’ --> ‘/home/ubuntu/tapingo/server/backoffice/shops/views/editor.py’
got unexpected error during build/deploy: failed invoking synclet.UpdateContainer: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

It might be an issue with the port-forward to the synclet server

@nicks
Copy link
Member

nicks commented Feb 12, 2020

Guy in the k8s #tilt channel reported this today, and wondered if it might be a port-mapping issue.

got unexpected error during build/deploy: failed invoking synclet.UpdateContainer: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:64375: connect: connection refused"

One high-level concern I have is that we have two separate update mechanisms. They work in different cases. They both have some issues.

I wonder if the overall experience would be better if we disabled the synclet by default and focused our eng investment on the exec-based updater. We're already encouraging people to use tilt up --update-mode=exec when it breaks

@maiamcc
Copy link
Contributor

maiamcc commented Feb 12, 2020

If memory serves, the exec updater adds like .25-.5s overhead per command? Still might be worth it to provide a uniform experience (tho ofc there's some optimization we can do there--e.g. stringing all the run commands into a single shell command so we only need to call kubectl exec once and minimize our overhead).

Though we'd also have to really double down on getting restart_container out of ppl's Tiltfiles -- either a. getting people to wrap their commands in an init script, or b. doing that automatically when we see a need for it.

@nicks
Copy link
Member

nicks commented Feb 12, 2020

ya, exactly

There's a bunch of opportunities for improving the exec codepath (e.g., optimizing the overhead).

There's also a bunch of opportunities for improving the synclet codepath (e.g., supporting containerd)

But because we have two codepaths, it dilutes the benefit for each investment

@apolegoshko
Copy link

+1 for this issue.

@nicks
Copy link
Member

nicks commented Apr 27, 2020

An update - we're currently expecting to remove the synclet, see #3245 for more details. If we do that, this issue will become moot.

@maiamcc
Copy link
Contributor

maiamcc commented Jul 22, 2020

As of #3563, the synclet is NOT enabled by default, and will soon be removed completely--closing this issue.

@maiamcc maiamcc closed this as completed Jul 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants