Add deployment examples to docs (#584)
blakehatch authored Dec 31, 2023
1 parent 0269835 commit 546484b
Showing 2 changed files with 272 additions and 1 deletion.
271 changes: 271 additions & 0 deletions nativelink-docs/deployment-examples.mdx
@@ -0,0 +1,271 @@
---
title: Deployment Examples
description: 'See the capabilities of remote execution and cache'
---

# AWS Terraform Deployment

This directory contains a reference/starting point for creating a full AWS
Terraform deployment of Native Link's cache and remote execution system.

## Prerequisites - Setup Hosted Zone / Base Domain
You must first set up a Route53 hosted zone in AWS, because we will generate
SSL certificates and need a domain to register them under.

1. Log in to AWS and go to
[Route53](https://console.aws.amazon.com/route53/v2/hostedzones)
2. Click `Create hosted zone`
3. Enter a domain (or subdomain) that you plan on using as the name and ensure
it's a `Public hosted zone`
4. Click into the hosted zone you just created and expand `Hosted zone details`
and copy the `Name servers`
5. In the DNS server that your domain is currently hosted under (it may be
another Route53 hosted zone), create a new `NS` record for the same
domain/subdomain that you used in Step 3. The value should be the
`Name servers` from Step 4

It may take a few minutes for the DNS changes to propagate.
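
If you prefer the command line, the same hosted-zone setup can be sketched
with the AWS CLI. This is a sketch only; the domain and zone ID below are
placeholders, not values the Terraform files produce:

```sh
# Create the hosted zone; the domain name is a placeholder.
aws route53 create-hosted-zone \
  --name INSERT_DOMAIN_NAME_YOU_PLAN_TO_USE_HERE \
  --caller-reference "$(date +%s)"

# Print the name servers to copy into your parent zone's NS record.
aws route53 get-hosted-zone \
  --id INSERT_HOSTED_ZONE_ID_HERE \
  --query DelegationSet.NameServers
```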

## Terraform Setup
1. [Install Terraform](https://www.terraform.io/downloads)
2. Open a terminal and run `terraform init` in this directory
3. Run `terraform apply -var
base_domain=INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE`
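
Put together, the setup looks roughly like this (the directory path follows
the repository layout linked in the GCP section below):

```sh
cd deployment-examples/terraform/experimental_AWS
terraform init
terraform apply -var base_domain=INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE
```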

Applying will take some time; when it finishes, everything should be
running. The endpoints are:

```txt
CAS: grpcs://cas.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE
Scheduler: grpcs://scheduler.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE
```

As a reference, you should be able to build and test this project using Bazel
with something like:

```sh
bazel test //... \
--remote_cache=grpcs://cas.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE \
--remote_executor=grpcs://scheduler.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE \
--remote_instance_name=main
```

## Server configuration
![Native Link AWS Terraform Diagram](https://user-images.githubusercontent.com/1831202/176286845-ff683266-3f23-489c-b58a-3eda49e484be.png)

## Instances
All instances use the same AMI configuration. There are technically two
AMIs, but only because this solution by default spawns workers on both x86
and ARM servers, so one AMI is required for each architecture.

### CAS
The CAS is only used as a public interface to the S3 data. All the services
will talk to S3 directly, so they don't need to talk to the CAS instance.

#### More optimal configuration
You can reduce cost and increase reliability by moving the CAS onto the same
machine that invokes the remote execution protocol (like Bazel). Then point
the configuration to `localhost`, and it will translate between the Bazel
Remote Execution Protocol API and the underlying S3 calls.
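
As a sketch, with a CAS listening locally the Bazel flags might look like
this (the port is an assumption; use whatever your local configuration
listens on):

```sh
# Hypothetical local CAS; port 50051 is an assumed example.
bazel test //... \
  --remote_cache=grpc://localhost:50051 \
  --remote_executor=grpcs://scheduler.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE \
  --remote_instance_name=main
```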

### Scheduler
The scheduler is currently the only single point of failure in the system;
we only support one scheduler at a time.
The workers look up the scheduler through a Route53 DNS record that is set
by a Lambda function, which runs every time an instance change happens on
the auto-scaling group the scheduler belongs to.
We don't use a load balancer here, mostly for cost reasons and because
there is no real gain from one: we don't need to encrypt the traffic
since it all stays inside the VPC.
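
The update the Lambda performs is roughly equivalent to the following AWS
CLI call (the zone ID, record name, and IP are placeholders, not the exact
values the Terraform files use):

```sh
# Upsert an A record pointing at the current scheduler instance.
aws route53 change-resource-record-sets \
  --hosted-zone-id INSERT_HOSTED_ZONE_ID_HERE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "scheduler.INSERT_DOMAIN_NAME_HERE",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "INSERT_SCHEDULER_IP_HERE"}]
      }
    }]
  }'
```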

### Workers
Worker instances in this configuration will only be 1- or 2-CPU machines
(this can be changed), all with NVMe drives. This is also for cost reasons:
NVMe drives are much faster and often cheaper than EBS volumes.
When an instance spawns, it looks up the available properties of the node
and notifies the scheduler. For example, if the instance has two cores, it
will let the scheduler know it has two cores.
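
On the client side, those advertised properties can be matched through
Bazel's exec properties; for example (the property value here is
illustrative):

```sh
# Request workers that advertise two cores.
bazel test //... \
  --remote_executor=grpcs://scheduler.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE \
  --remote_instance_name=main \
  --remote_default_exec_properties=cpu_count=2
```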

## Security
The security permissions of each instance group are very strict.
The major vulnerabilities are that the instances are all public IP
instances by default, and that we allow incoming traffic to port 22 (SSH)
on all instances for debugging reasons.

The risk of a user running this configuration in production is quite high,
and for this reason we don't allow the two S3 buckets (the ELB access-log
bucket and the S3 CAS bucket) to be deleted if they have content.
If you prefer to use `terraform destroy`, you need to manually purge these
buckets or change the Terraform files to force-destroy them.
Taking the safer route seemed like the best choice, even if it makes the
default developer workflow slightly more difficult.

## Future work / TODOs
* Currently we never delete S3 files. Depending on the configuration, this
needs to be done carefully. The best approach is likely a service that runs
constantly.
* Auto-scaling up the instances is not configured. An endpoint needs to be
made so that a parsable feed (such as JSON) can be read out of the scheduler
through a Lambda, which publishes the results to `CloudWatch`; then a scaling
rule should be made for that ASG. A sketch of this follows the list.
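
As a sketch of the second item, a Lambda could publish a scheduler
queue-depth metric that an ASG scaling policy then tracks. The metric and
namespace names below are illustrative, not something the Terraform files
create today:

```sh
# Hypothetical: publish a queue-depth metric read from the scheduler.
aws cloudwatch put-metric-data \
  --namespace NativeLink/Scheduler \
  --metric-name QueueLength \
  --value "$QUEUE_LENGTH" \
  --unit Count
```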

## Useful tips
You can add `-var terminate_ami_builder=false` to the `terraform apply`
command to make it easier to modify, apply, and test changes to these `.tf`
files. This flag keeps the AMI builder instances from being terminated,
which costs more money, but stops Terraform from creating a new AMI each
time you run the command.
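
Combined with the variable from the Terraform setup above, the full command
looks like:

```sh
terraform apply \
  -var base_domain=INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE \
  -var terminate_ami_builder=false
```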


# GCP Terraform Deployment

This directory contains a reference/starting point for creating a full GCP
[Terraform](https://www.terraform.io/downloads) deployment of Native Link's
cache and remote execution system.

## Prerequisites

1. A Google Cloud project with billing enabled (a CLI sketch for creating
one follows this list).
2. A domain whose name servers can be pointed to Google Cloud DNS.
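
If you are starting from scratch, the project can be created and linked to
billing from the CLI (a sketch; the project and billing-account IDs are
placeholders):

```sh
gcloud projects create example-sandbox
gcloud billing projects link example-sandbox \
  --billing-account=INSERT_BILLING_ACCOUNT_ID_HERE
```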

## Terraform Setup

Setup is done in two configurations: a **global** configuration and a
**dev** configuration. The dev configuration depends on the global
configuration. The global configuration is a one-time setup which requires
an out-of-band step of updating the registrar-managed name servers. This
step is required for Certificate Manager authorization to generate the
certificate chain.

### Global Setup

Set up the basic configurations for DNS, certificates, the Compute API, and
the Terraform state storage bucket. The global setup should be a one-time
process; once properly configured, it does not need to be redone.

It is important to note that after these configurations are applied, the
managed name servers for the DNS zone need to be configured. If certificate
generation fails, the entire process might need to be redone.

After this is applied, grab the name servers from the Terraform state and
enter the four name servers into the owning domain's registrar
configuration page.

Confirm certificates are generated by checking on the
[Certificate Manager](https://cloud.google.com/certificate-manager/docs/overview)
page in the [Google Cloud Console](https://console.cloud.google.com) that
the status is `Active` before moving on to the dev plan.

```sh
PROJECT_ID=example-sandbox
DNS_ZONE=example-sandbox.example.com

cd deployment-examples/terraform/experimental_GCP/deployments/global
terraform init
terraform apply -var gcp_project_id=$PROJECT_ID -var gcp_dns_zone=$DNS_ZONE
# Print the Google name servers, e.g. ns-cloud-XX.googledomains.com.
terraform state show module.native_link.data.google_dns_managed_zone.dns_zone
```
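
The certificate status can also be checked from the command line (a sketch;
assumes an authenticated `gcloud` CLI):

```sh
# List managed certificates and look for an ACTIVE state.
gcloud certificate-manager certificates list --project=$PROJECT_ID
```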

### Development Setup

Set up and deploy the `native-link` servers and dependencies. The general
configuration is laid out similarly to the
[Native Link AWS Terraform Diagram](https://user-images.githubusercontent.com/1831202/176286845-ff683266-3f23-489c-b58a-3eda49e484be.png)
from the
[AWS deployment example](https://github.com/TraceMachina/native-link/blob/main/deployment-examples/terraform/experimental_AWS/README.md).
The deployment has additional flags in `variables.tf` for controlling the
machine type, prefixing the resource namespace for multiple deployments,
and other template parameters.

```sh
PROJECT_ID=example-sandbox
cd deployment-examples/terraform/experimental_GCP/deployments/dev
terraform init
terraform apply -var gcp_project_id=$PROJECT_ID
```

A complete and successful deployment should be able to run remote execution
commands from Bazel (or other supported build systems).

## Example Test

A simple way to test as a client is by
[creating](https://cloud.google.com/sdk/gcloud/reference/compute/instances/create)
a "workstation" instance on Google Cloud Platform, installing Bazel, cloning
`native-link`, and running tests against the deployed remote cache and
remote executor.

```sh
# Example of using a gcloud-generated CLI command to bootstrap the instance.
# The Google Cloud console makes it easy to generate this command.
# Use ubuntu-2204 x86_64 as the base image, as it is compatible
# with the remote execution environment set up by the Terraform scripts.
NAME=dev-workstation-001
PROJECT_ID=example-sandbox
REGION=us-central1 # default value in variables.tf
ZONE=us-central1-a # default value in variables.tf
[email protected]
OS_IMAGE=projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20231201
DISK=projects/example-sandbox/zones/$ZONE/diskTypes/pd-standard

gcloud compute instances create $NAME \
--project=$PROJECT_ID \
--zone=$ZONE \
--machine-type=e2-standard-8 \
--network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--service-account=$SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
--create-disk=auto-delete=yes,boot=yes,device-name=instance-1,image=${OS_IMAGE},mode=rw,size=30,type=$DISK \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud \
--reservation-affinity=any
```

[SSH](https://cloud.google.com/sdk/gcloud/reference/compute/ssh) into the
workstation instance, install the dependencies, and clone `native-link`
(which has a Bazel-compatible remote execution setup). On the first run it
could take a few minutes for the DNS records and load balancers to pick up
the new resources (usually under ten minutes).

```sh
# On local machine
NAME=dev-workstation-001
PROJECT_ID=example-sandbox
ZONE=us-central1-a
gcloud compute ssh --zone $ZONE $NAME --project $PROJECT_ID

# On the GCP workstation
git clone https://github.com/TraceMachina/native-link.git
sudo apt install -y npm
sudo npm install -g @bazel/bazelisk
cd native-link

DNS_ZONE=example-sandbox.example.com
CAS="cas.${DNS_ZONE}"
EXECUTOR="scheduler.${DNS_ZONE}"

bazel test //... --experimental_remote_execution_keepalive \
--remote_instance_name=main \
--remote_cache=$CAS \
--remote_executor=$EXECUTOR \
--remote_default_exec_properties=cpu_count=1 \
--remote_timeout=3600 \
--remote_download_minimal \
--verbose_failures
```

### Developing/Testing

[Visual Studio Code](https://code.visualstudio.com/) can be used to actively
work on the cloned native-link code via
[Visual Studio Remote Development](https://code.visualstudio.com/docs/remote/remote-overview).
This setup runs Visual Studio Code on a local machine connected to the
remote workstation, with the remote file system and terminal available
locally. It allows working on native-link or testing different workloads
without having to reproduce the environment locally. Install the
[Visual Studio Remote Development Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack),
connect over SSH to the workstation instance, and open the `native-link`
folder (or any other cloned project).
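
One way to make the workstation reachable from the Remote - SSH extension is
`gcloud compute config-ssh`, which writes host entries into `~/.ssh/config`
(a sketch; run on the local machine):

```sh
# Add SSH config entries for your GCP instances so VS Code's
# Remote - SSH extension can connect to them by host name.
gcloud compute config-ssh --project=example-sandbox
```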
2 changes: 1 addition & 1 deletion nativelink-docs/mint.json
@@ -43,7 +43,7 @@
"navigation": [
{
"group": "Get Started",
"pages": ["quickstart", "contributing", "about"]
"pages": ["quickstart", "deployment-examples", "contributing", "about"]
}
],
"footerSocials": {
