Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-GPU mandelboxes #6987

Merged
merged 23 commits into from
Aug 12, 2022
Merged

Non-GPU mandelboxes #6987

merged 23 commits into from
Aug 12, 2022

Conversation

yanchenm
Copy link
Contributor

@yanchenm yanchenm commented Jul 29, 2022

Ticket(s) Closed

Description
Modify host-service and mandelboxes to be able to run on non-GPU AWS instances.

Implementation

  • Add a new NOGPU environment variable to both host and containers (automatically set through host-setup and Dockerfile respectively)
  • Bring back dummy display driver for mandelboxes

Documentation & Tests Added

Testing Instructions

Test that host setup, host service, and mandelboxes work on non-GPU instance:

  • Spin up r5 class instance
  • ./setup_host.sh --localdevelopment --nogpu
  • Everything else regarding setup/running mandelboxes should be exactly the same as before, no additional options or manual intervention required
  • Reboot
  • Build desired mandelbox image
  • Start host service
  • Run mandelbox image
  • Echo $NOGPU and make sure value is true
  • Check /usr/share/whist/display.log and ensure that X server has launched successfully and not crashed (last lines of log do not contain Server terminated with error (1).)
  • Check /var/log/whist/display-out.log and look for success message Waiting for AwesomeWM to exit, to keep the X server open indefinitely...
  • Protocol will not run successfully but make sure it has run (logs present in /var/log/whist)

Test that host setup, host service, and mandelboxes still work as expected on GPU instances.

PR Checklist

  • Did the PR author fully test this PR end-to-end?
  • Did one PR reviewer fully test this PR end-to-end?
  • Did one PR reviewer conduct a thorough code design review?

@github-actions github-actions bot added 📁 Repo: services This PR/Issue modifies /services code 📁 Repo: host-setup This PR/Issue modifies /host-setup code 📁 Repo: mandelboxes This PR/Issue modifies /mandelboxes code 📁 Repo: backend This PR/Issue modifies /backend code labels Jul 29, 2022
@github-actions
Copy link

Schema is unchanged, no database migration needed.

Carry on!

@codecov
Copy link

codecov bot commented Jul 29, 2022

Codecov Report

Merging #6987 (3559c1a) into dev (de05a3a) will decrease coverage by 0.02%.
The diff coverage is 23.52%.

❗ Current head 3559c1a differs from pull request most recent head 9e1cf16. Consider uploading reports for the commit 9e1cf16 to get more accurate results

@@            Coverage Diff             @@
##              dev    #6987      +/-   ##
==========================================
- Coverage   54.43%   54.41%   -0.03%     
==========================================
  Files         156      156              
  Lines       32089    32109      +20     
==========================================
+ Hits        17467    17471       +4     
- Misses      14358    14374      +16     
  Partials      264      264              
Flag Coverage Δ *Carryforward flag
backend-services 44.49% <23.52%> (-0.04%) ⬇️ Carriedforward from f5fe0f2
protocol 56.45% <ø> (-0.02%) ⬇️

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files Coverage Δ
...ckend/services/host-service/mandelbox/mandelbox.go 54.21% <0.00%> (ø)
backend/services/metadata/environment.go 54.79% <0.00%> (-5.82%) ⬇️
backend/services/host-service/spinup.go 70.93% <44.44%> (-0.57%) ⬇️
protocol/client/frontend/virtual/interface.cpp 15.70% <0.00%> (-0.40%) ⬇️
protocol/client/frontend/virtual/impl.c 20.88% <0.00%> (-0.38%) ⬇️
protocol/whist/logging/logging.c 89.71% <0.00%> (-0.32%) ⬇️
protocol/whist/fec/wirehair/WirehairTools.cpp 77.54% <0.00%> (ø)
protocol/whist/fec/wirehair/WirehairCodec.cpp 88.04% <0.00%> (+0.05%) ⬆️
backend/services/host-service/host-service.go 13.62% <0.00%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de05a3a...9e1cf16. Read the comment docs.

Copy link
Owner

@philippemnoel philippemnoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks super clean!! Definitely want a review from @MauAraujo and @rpadaki

mandelboxes/base/Dockerfile.20 Outdated Show resolved Hide resolved
backend/services/Makefile Outdated Show resolved Hide resolved
host-setup/setup_host.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@rpadaki rpadaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just developer QoL suggestions. I'll spin up a new instance and get testing later today!

@philippemnoel @gabrieleoliaro what do you think of switching the client side of the end-to-end tester over to a non-GPU instance. I think this would (a) save us money and (b) better simulate conditions on a client's machine.

backend/services/metadata/environment.go Outdated Show resolved Hide resolved
host-setup/docker-daemon-config/daemon.json Show resolved Hide resolved
host-setup/setup_host.sh Outdated Show resolved Hide resolved
mandelboxes/base/Dockerfile.20 Outdated Show resolved Hide resolved
mandelboxes/base/Dockerfile.20 Show resolved Hide resolved
mandelboxes/scripts/build_mandelbox_image.py Outdated Show resolved Hide resolved
Copy link

@MauAraujo MauAraujo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @yanchenm ! Sorry for the delay in reviewing, but we have been a bit busy these last few days. Left a couple of small comments, I think once we fix them this PR will be good to go. I was able to test on an r5.large instance and start the host service + a mandelbox. I confirmed that the Xserver started correctly and even was able to connect the protocol (it complained about nvidia drivers and Chrome didn't start, but it connected and appeared to stream audio). Thanks for taking care of this and great work :)

backend/services/host-service/metrics/metrics.go Outdated Show resolved Hide resolved
backend/services/metadata/environment.go Outdated Show resolved Hide resolved
backend/services/metadata/environment.go Show resolved Hide resolved
mandelboxes/base/Dockerfile.20 Show resolved Hide resolved
@github-actions github-actions bot added the 📁 Repo: protocol This PR/Issue modifies /protocol code label Aug 8, 2022
@rpadaki
Copy link
Contributor

rpadaki commented Aug 8, 2022

Chrome didn't start because we explicitly make it use the gpu @MauAraujo -- great catch. That's an easy fix; just need to experiment with chrome flags

@philippemnoel
Copy link
Owner

This is great @yanchenm ! Sorry for the delay in reviewing, but we have been a bit busy these last few days. Left a couple of small comments, I think once we fix them this PR will be good to go. I was able to test on an r5.large instance and start the host service + a mandelbox. I confirmed that the Xserver started correctly and even was able to connect the protocol (it complained about nvidia drivers and Chrome didn't start, but it connected and appeared to stream audio). Thanks for taking care of this and great work :)

could we make Chrome start as part of this PR? The protocol doesn't need to start we can fix this next, should be easy

@philippemnoel
Copy link
Owner

@yanchenm @MauAraujo can we get this PR wrapped up please?

@MauAraujo
Copy link

@yanchenm @MauAraujo can we get this PR wrapped up please?

Yes, we just have to solve the remaining E2E issues and it's good to go. Will wrap up today

@yanchenm
Copy link
Contributor Author

Testing Chrome flags to get the launch working, will update

@github-actions
Copy link

Protocol End-to-End Streaming Test Results

Experiments Summary

Expand Summary

Experiment 1 - Bandwidth: Unbounded, Delay: None, Packet Drops: None, Queue limit: default. Download logs:

aws s3 cp s3://whist-e2e-protocol-test-logs/yanchen/non-gpu/2022_08_11@16-30-35/ 2022_08_11@16-30-35/ --recursive

Experiment 2 - Bandwidth: variable between 15Mbit and 30Mbit, Delay: 10 ms, Packet Drops: None, Queue Limit: None, Conditions change over time? Yes, frequency is variable between 1000 ms and 2000 ms. Download logs:

aws s3 cp s3://whist-e2e-protocol-test-logs/yanchen/non-gpu/2022_08_11@16-39-45/ 2022_08_11@16-39-45/ --recursive

Experiment 3 - Bandwidth: variable between 10Mbit and 20Mbit, Delay: 10 ms, Packet Drops: None, Queue Limit: None, Conditions change over time? Yes, frequency is variable between 500 ms and 2000 ms. Download logs:

aws s3 cp s3://whist-e2e-protocol-test-logs/yanchen/non-gpu/2022_08_11@16-44-05/ 2022_08_11@16-44-05/ --recursive

Experiment 4 - Bandwidth: 10Mbit, Delay: 10 ms, Packet Drops: None, Queue Limit: 100 packets, Conditions change over time? No.. Download logs:

aws s3 cp s3://whist-e2e-protocol-test-logs/yanchen/non-gpu/2022_08_11@16-48-25/ 2022_08_11@16-48-25/ --recursive

Full Results: link here

Copy link
Owner

@philippemnoel philippemnoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@philippemnoel philippemnoel merged commit d608bf8 into dev Aug 12, 2022
@philippemnoel philippemnoel deleted the yanchen/non-gpu branch August 12, 2022 14:16
rpadaki added a commit that referenced this pull request Aug 12, 2022
rpadaki added a commit that referenced this pull request Aug 12, 2022
@rpadaki rpadaki restored the yanchen/non-gpu branch August 12, 2022 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📁 Repo: backend This PR/Issue modifies /backend code 📁 Repo: host-setup This PR/Issue modifies /host-setup code 📁 Repo: mandelboxes This PR/Issue modifies /mandelboxes code 📁 Repo: protocol This PR/Issue modifies /protocol code 📁 Repo: services This PR/Issue modifies /services code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the host-service/backend & mandelboxes work for non-GPU instances
5 participants