Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runner Fix nvidia setup and restart #607

Merged
merged 11 commits into from
Jun 21, 2022
Merged

Runner Fix nvidia setup and restart #607

merged 11 commits into from
Jun 21, 2022

Conversation

DavidGOrtega
Copy link
Contributor

@DavidGOrtega DavidGOrtega commented Jun 19, 2022

This has been a tricky PR. After fixing the restart to be able to access the GPU due to kernel upgrade we were hitting a very funny error seen here and here hanging the ssh connection on DIAL and never escaping back, hence the resource timing out.
To solve it I had to put the logs function within a timed-out function. we added Teleport's fix mentioned on #607 (comment)

Related: #606

@0x2b3bfa0
Copy link
Member

As per golang/go#15113 (comment), the 2 second timeout we had in place should be enough, but it apparently isn't. 😅

See also golang/go#15113gravitational/teleport#1153gravitational/teleport#1152 for a possible solution.

@DavidGOrtega
Copy link
Contributor Author

DavidGOrtega commented Jun 19, 2022

We already have a solution in place

@DavidGOrtega
Copy link
Contributor Author

As per golang/go#15113 (comment), the 2 second timeout we had in place should be enough, but it apparently isn't. 😅

See also golang/go#15113gravitational/teleport#1153gravitational/teleport#1152 for a possible solution.

golang/go#21941 (comment)

From the client's perspective, one could wrap ssh.Dial in a goroutine with a buffered channel and have an app-specific timeout. We actually like this approach much better,

@DavidGOrtega DavidGOrtega requested review from dacbd and a team June 19, 2022 20:20
@DavidGOrtega DavidGOrtega self-assigned this Jun 19, 2022
@DavidGOrtega DavidGOrtega added p0-critical Max priority (ASAP) cloud-az Microsoft Azure cloud-gcp Google Cloud resource-runner iterative_runner TF resource gpu Inexplicably convoluted drivers labels Jun 19, 2022
Copy link
Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me, although it would be nice to find a better (?) workaround (e.g. gravitational/teleport#1152) for the SSH timeout issue instead of leaking goroutines. Not that it matters much for this use case either, but doesn't feel quite right.

@DavidGOrtega
Copy link
Contributor Author

DavidGOrtega commented Jun 20, 2022

@0x2b3bfa0 We are using gravitational/teleport#1152 workaround!

9aaac5d

Copy link
Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@0x2b3bfa0
Copy link
Member

Sorry! I missed that commit! 🙈 🚀

Copy link
Contributor

@dacbd dacbd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

environment/setup.sh Outdated Show resolved Hide resolved
iterative/resource_runner.go Show resolved Hide resolved
iterative/resource_runner.go Show resolved Hide resolved
iterative/resource_runner.go Outdated Show resolved Hide resolved
@dacbd
Copy link
Contributor

dacbd commented Jun 21, 2022

@0x2b3bfa0 lets merge and release?
image

@0x2b3bfa0 0x2b3bfa0 merged commit 5194dc5 into master Jun 21, 2022
@0x2b3bfa0 0x2b3bfa0 deleted the runner/nvidia-setup branch June 21, 2022 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-az Microsoft Azure cloud-gcp Google Cloud gpu Inexplicably convoluted drivers p0-critical Max priority (ASAP) resource-runner iterative_runner TF resource
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Google Compute Engine VM with V100 doesn't use the GPU (NVIDIA driver issue?)
4 participants