Terraform deployment template to spin up a 3 node H2O Open Source cluster in GCP. This template is based on this tutorial
This template is a work in progress and is provided without any warranty or support. You are free to refer to it or modify it as needed.
There are two distinct parts to this setup:
- Setting up the GCP project, service account, VPC, subnet etc. This step also creates the GCP compute instance we call the Workspace. All H2O users need to ssh to this Workspace instance to create the H2O cluster in the private subnet (which is not directly accessible). The Workspace instance sits in the public subnet and forms the gateway for all communication between the H2O cluster and the data scientist's machine.
- Creating the H2O cluster using the `h2ocluster` tool available on the Workspace instance.
These activities are performed by someone who has Cloud Admin privileges. In this step we perform the prerequisite activities using the browser on the GCP console, and then set up the `gcloud` SDK tool on a machine where we use `terraform` to set up the entire infrastructure.
- Using a web browser, login to GCP Console
- Create a new Project
- Ensure Billing is enabled for the project
- Enable needed APIs and services (link is on the top of project dashboard)
  - Compute Engine API
  - Identity and Access Management (IAM) API
  - Cloud Resource Manager API
  - Cloud Monitoring API
  - Cloud Logging API
  - OS Config API
- !! NOTE !! - For this work, the project name used is `project48a`.
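The APIs listed above can also be enabled from the command line once `gcloud` is installed and authenticated. The following is a minimal sketch, assuming the standard `googleapis.com` service names for those six APIs; the function name `enable_apis` is a hypothetical helper and is not part of this template.

```shell
# Hypothetical helper (not part of the template): enable the APIs listed above.
# $1 = project id, e.g. project48a
enable_apis() {
  local project="$1"
  gcloud services enable \
    compute.googleapis.com \
    iam.googleapis.com \
    cloudresourcemanager.googleapis.com \
    monitoring.googleapis.com \
    logging.googleapis.com \
    osconfig.googleapis.com \
    --project "$project"
}
```

Calling `enable_apis project48a` would issue a single `gcloud services enable` request covering all six services. Billing still has to be enabled on the project first.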
- Preferably set up the `gcloud` SDK on a Linux-based machine, ideally one used by the Cloud system admin team to manage the cloud infrastructure.
- Follow the steps to install the Google Cloud SDK.
- Create a new gcloud profile and authenticate

      $ gcloud config configurations create hemen-h2oai
      Created [hemen-h2oai].
      Activated [hemen-h2oai].
      $ gcloud auth login
      Your browser has been opened to visit:
      https://acco ..... deleted .... t_account
      You are now logged in as [[email protected]].
      Your current project is [None]. You can change this setting by running:
      $ gcloud config set project PROJECT_ID

- Set up the project. You would have already created a project from the GUI as discussed earlier. Ensure it has billing enabled and the required service APIs enabled.

      $ gcloud config set project project48a
      Updated property [core/project].

- Set up the compute region. You can use `gcloud compute regions list` to get a list of available compute regions.

      $ gcloud config set compute/region us-west1
      Updated property [compute/region].

- Set up the compute zone. You can use `gcloud compute zones list` to get a list of available compute zones.

      $ gcloud config set compute/zone us-west1-a
      Updated property [compute/zone].

- Check that all configuration values are set as expected

      $ gcloud config list
      [compute]
      region = us-west1
      zone = us-west1-a
      [core]
      account = [email protected]
      disable_usage_reporting = True
      project = project48a

      Your active configuration is: [hemen-h2oai]

- A total of 3 service accounts are needed for this to work end to end. Of the three, one is created manually and has the most privileges. The remaining two are created by the Terraform script.
  - `project48a-sa`
    - This one is created manually as shown below using gcloud
    - It is used to set up the VPC, firewalls etc. and also the Workspace instance
    - Needs Compute Admin to create instances and Storage Admin to manage state
  - `workspaceinstate-sa`
    - Terraform creates this
    - The SA assigned to the Workspace instance started above.
    - This SA will then be used by Terraform to control the permissions for starting H2O clusters.
  - `h2ocluster-sa`
    - Terraform creates this
    - This SA will be assigned to each VM instance that forms the H2O cluster nodes
    - Has access to Google Cloud Storage and BigQuery
- Create a Service Account for this Project

      gcloud iam service-accounts create project48a-sa \
        --description="Project48a Service Account" \
        --display-name="project48a-sa"
      gcloud iam service-accounts list

- Ensure the Service Account has the necessary privileges. The roles granted here may be broader than strictly needed; more fine-grained roles could be given instead.

      gcloud projects add-iam-policy-binding project48a --member serviceAccount:[email protected] --role roles/storage.admin
      gcloud projects add-iam-policy-binding project48a --member serviceAccount:[email protected] --role roles/compute.admin
      gcloud projects add-iam-policy-binding project48a --member serviceAccount:[email protected] --role roles/iam.serviceAccountAdmin
      gcloud projects add-iam-policy-binding project48a --member serviceAccount:[email protected] --role roles/iam.serviceAccountUser
      gcloud projects add-iam-policy-binding project48a --member serviceAccount:[email protected] --role roles/iam.securityAdmin

- Create a service account key for use with Terraform. First create the directory structure shown in the directory tree. `cat` is used to check that the key file was created.

      cd gcp/network
      gcloud iam service-accounts keys create gcpkey.json --iam-account [email protected]
      cat gcpkey.json

- This service account key can now be used in Terraform by setting the environment variable `GOOGLE_APPLICATION_CREDENTIALS`. Alternatively, it could also be specified in the Terraform code.

      export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/gcpkey.json"

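The five role bindings shown earlier differ only in the role name, so they can be expressed as a loop. This is a sketch only: `grant_roles` is a hypothetical helper name, and the service account email is passed in as an argument rather than filled in here.

```shell
# Hypothetical helper (not part of the template): bind the five roles used in
# this guide to one service account.
# $1 = project id, $2 = full service account email (supply your own value)
grant_roles() {
  local project="$1"
  local sa_email="$2"
  local role
  for role in \
    roles/storage.admin \
    roles/compute.admin \
    roles/iam.serviceAccountAdmin \
    roles/iam.serviceAccountUser \
    roles/iam.securityAdmin
  do
    # Stop at the first failed binding so the problem is not hidden.
    gcloud projects add-iam-policy-binding "$project" \
      --member "serviceAccount:${sa_email}" \
      --role "$role" || return 1
  done
}
```

Keeping the roles in one list makes it easier to trim them later if more fine-grained access is preferred.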
- The TF code currently assumes that the GCS bucket used to store TF state has already been created. We use this approach.
  - Ensure the value of the variable `gcp_project_name` in `network/main/variables.tf` is in sync with the project name used in the commands below that create the TF state backend bucket.
    - `gsutil mb gs://project48a-tfstate` to create the bucket
    - `gsutil versioning set on gs://project48a-tfstate` to enable versioning support
- In a web browser, select the project, then from the top-left menu select Storage >> Browser and validate that the bucket was created.
- An alternate option is to have TF create the bucket, but that would need `terraform apply` in multiple folders, as in https://github.com/tasdikrahman/terraform-gcp-examples
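If the bucket setup has to be repeated, the two `gsutil` commands above can be wrapped so that re-running them is harmless. A minimal sketch, assuming `gsutil ls -b` fails when the bucket does not exist (it also fails when the caller lacks access, in which case the `mb` step reports the real error); `ensure_state_bucket` is a hypothetical name.

```shell
# Hypothetical helper (not part of the template): create the TF state bucket
# only if it is missing, then enable object versioning on it.
# $1 = bucket name without the gs:// prefix, e.g. project48a-tfstate
ensure_state_bucket() {
  local bucket="$1"
  if ! gsutil ls -b "gs://${bucket}" >/dev/null 2>&1; then
    gsutil mb "gs://${bucket}" || return 1
  fi
  gsutil versioning set on "gs://${bucket}"
}
```

For example, `ensure_state_bucket project48a-tfstate` leaves an existing bucket in place and only re-applies the versioning setting.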
Next we configure Terraform to use the GCS backend for state management
- Add this block to the `gcp/main/terraform.tf` file (a `backend` block belongs inside the top-level `terraform { ... }` block)

      backend "gcs" {
        bucket = "project48a-tfstate"
        prefix = "h2o/terraform"
      }

The directory structure now is
gcp
├── network
├── h2ocluster
- `network` directory
  - contains all Terraform code that will set up the VPC, subnets, firewall, Workspace instance, service accounts etc.
  - executed only one time
  - happens on any external machine, possibly a cloud admin's laptop
- `h2ocluster` directory
  - this directory is zipped and should be moved to `/opt/h2ocluster` on the Workspace system
  - contains Terraform code that will set up an N-node H2O cluster in the private subnet when requested by a user
  - will be executed multiple times by the user to start/stop the cluster
  - will not be executed directly as `terraform apply` or `terraform destroy`. Instead, a bash wrapper will be provided to list, create and destroy the cluster instance
  - list will use gcloud commands, whereas create and destroy will leverage the Terraform code in this directory
- Navigate to the `gcp/network` directory and run `terraform init` to initialize Terraform.
  - a `terraform.tfstate` will be created in the `network/.terraform` directory with details about the GCS backend and modules
  - the TF state file, without any resources, is created in the GCS backend with the file name `default.tfstate`. Use `gsutil cat gs://project48a-tfstate/h2o/terraform/default.tfstate` to view the content of this initial state
- `terraform apply` can be used to create all the necessary network and workspace resources
- `terraform show` can be used to see the state of the resources
- `terraform refresh` can be used to update the state information with changes to the real-world infrastructure that were made via the Google web console.
- At this point we trigger a `terraform apply` to create the VPC, public + private subnets, firewall rules, NAT gateways, service account etc. and the main Workspace machine on a GCP Compute instance.
  - This is a single machine like a bastion host, in the public subnet of the VPC. It should be up and running now.
  - Instances in the public subnet get an external IP and hence are internet accessible.
  - Instances without a public address are private and, as a convention, we put them in the private subnet.
- After `terraform apply` has created the Workspace machine, it can be accessed with `gcloud beta compute ssh --zone "us-west1-a" --project "project48a" --ssh-key-file=~/.ssh/google_compute/id_rsa "h2o-instance-workspace"`
  - Create SSH key - The very first time, we will not have an ssh key to use.
    - Assuming that you have completed the `gcloud auth login` step described earlier, you can run the above command without the `--ssh-key-file` option.
    - This will create the files `google_compute_engine`, `google_compute_engine.pub` and `google_compute_known_hosts` in the `$HOME/.ssh` directory.
    - This works only if, under Project >> IAM, your user has the `Compute OS Login` or `Compute OS Admin Login` role.
  - You should now be able to access this machine with the above command.
  - Additionally, once the above command has been run, we can also use normal ssh. Note the username shown when you connect with the above command and use the same username for ssh.

        ssh -i ~/.ssh/google_compute_engine [email protected]

Once the Workspace machine is created and you are able to ssh to it, step 1 (creating the infrastructure setup) is complete.
- On Workspace Servers
- Once the Workspace server is up and running, check `jq --version`, `terraform --version` and `gcloud config list`. All these commands should work. Additionally, `gcloud` should be able to detect the service account that is associated with the Workspace compute instance.
- To avoid the zone prompt for some of the commands used internally by the `h2ocluster` tool, set the zone information using `gcloud config set compute/region us-west1`
- Update the PATH variable: `export PATH="$PATH:/opt/h2ocluster/terraform"`
- Initialise `h2ocluster --help`
- Read the usage of the `h2ocluster` tool using `h2ocluster --help`
- Create a cluster with `h2ocluster create`
- Once created, note the IP and port information displayed

      H2O Cluster Information:
      =========================
      Cluster Name: h2o-hemenkap-letxk7f-cluster
      Cluster Size: 3
      Cluster Leader IP and PORT: 10.100.1.2:54321
      Cluster Leader Url: http://10.100.1.2:54321/flow/index.html#

- Using this info, create an ssh local port forward to the H2O cluster created in the private subnet via the Workspace machine (which acts like a bastion). You can select any local port to forward; `8888` is used in this example.

      ssh -i ~/.ssh/google_compute_engine -L 8888:10.100.1.2:54321 [email protected]

- Open a browser on your laptop and go to the URL `http://localhost:8888`. You will see the H2O Flow UI.
- If you are running Python or R code to connect to the H2O cluster, the cluster address differs based on where your code is executing.
  - On the Workspace machine, use `http://10.100.1.2:54321`
  - On a local machine/laptop with ssh forwarding, use `http://localhost:8888`
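The leader address changes with every cluster, so it can help to read it from the cluster information text instead of copying it by hand. The sketch below assumes the text has the `Cluster Leader IP and PORT:` line shown above and has been saved to a file or piped in; `leader_address` and `open_tunnel` are hypothetical helper names, not part of the `h2ocluster` tool.

```shell
# Hypothetical helper: read cluster information text on stdin and print the
# leader "IP:PORT" taken from the "Cluster Leader IP and PORT:" line.
leader_address() {
  sed -n 's/^.*Cluster Leader IP and PORT:[[:space:]]*\([0-9.]*:[0-9]*\).*$/\1/p'
}

# Hypothetical helper: open the same ssh local port forward as shown above.
# $1 = local port, $2 = leader IP:PORT, $3 = user@workspace-host
open_tunnel() {
  ssh -i "$HOME/.ssh/google_compute_engine" -L "$1:$2" "$3"
}
```

With the cluster information saved in a file such as `cluster.txt` (an assumed name), `open_tunnel 8888 "$(leader_address < cluster.txt)" <user>@<workspace-host>` forwards local port 8888 to the leader node.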
Here we will upload some data into a GCS bucket and then import it into the H2O-3 cluster. We will also create a BigQuery (BQ) table and import it into H2O-3.
- Create a GCS bucket
gsutil mb -p project48a -c NEARLINE -l US-WEST1 -b on gs://h2ocluster-train-data
- Upload some data files to the bucket
gsutil cp ~/Workspace/Office/Datasets/flights_delay/allyears2k.csv gs://h2ocluster-train-data/flights-delay/
gsutil cp ~/Workspace/Office/Datasets/flights_delay/airlines_all.05p.csv gs://h2ocluster-train-data/flights-delay/
- To import this GCS file in H2O-3, create a new Flow Notebook, and in the cell enter the expression `importFiles ['gs://h2ocluster-train-data/flights-delay/allyears2k.csv']`. Run the cell by hitting `ctrl+Enter`.
- It will indicate that 1 file is imported. Click the `Parse these files..` button.
- A new section will show up. Set any necessary datatype changes for the columns and then click the `Parse` button at the bottom of the section.
- A job will run which generates a `.hex` file. This is the H2O-3 dataframe. Click the `View` button to get the frame summary.
- In the section that opens, click `View Data` to see the imported data in H2O-3.
- When the `bq` command is used for the first time, it initializes and asks for the default project. My project is named `project48a`, so I selected that.
- To list datasets in this project use `bq ls project48a:`. The last part, which names the project, can be omitted if a default project is set.
- Create the airlines dataset: `bq --location=us-west1 mk project48a:airlines`
- Verify it was created: `bq ls --format=pretty`
- Add a table to the dataset and load data into it
  - `bq --location=us-west1 load --autodetect --null_marker="NA" --source_format=CSV project48a:airlines.allyears2k ~/Workspace/Office/Datasets/flights_delay/allyears2k.csv`
  - `bq ls --format=pretty project48a:airlines`
  - `bq show --format=pretty project48a:airlines.allyears2k` will describe the table
- To import this data into H2O-3, create a new notebook and click Data >> Import SQL Table. In the section that opens up, enter the information below to read data from the above table.
  - JDBC URL: `jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=project48a;OAuthType=3;Location=us-west1;LogLevel=4;LogPath=/tmp/h2o-bigquery-logs;`
  - Table: `project48a.airlines.allyears2k`
  - Fetch mode can be Single or Distributed
  - Leave all other fields empty
- Click the Import Button. It will open a new section. Click View button in this section.
- A new section will open that shows the progress of the import job.
- Once the job is 100% completed, click the View button to get the frame summary. Finally, click the View Data button to verify that the data was imported.
- To speed up the cluster creation times you can use an image with H2O preloaded on it.
- Create a service account key for h2ocluster-vm-sa on your local machine and move it to the Workspace machine under `/opt/h2ocluster/packer/scripts`
  - `gcloud iam service-accounts keys create h2ocluster-sa-key.json --iam-account [email protected]`
- To create such an image, follow the instructions below on the Workspace machine
  - `cd /opt/h2ocluster/packer/`
  - If needed, update the variable values in the file `h2o-gcp-image.json`
  - `packer build h2o-gcp-image.json`
- Once the image is built, it can be used in the Terraform code to create H2O-3 clusters.
  - Edit the file `/opt/h2ocluster/terraform/terraform.tfvars` and update the value of `h2o_cluster_instance_boot_disk_image` to the name of the packer image.
- Cluster load times will now be significantly shorter than when starting from a bare RHEL7 base image.
- https://cloud.google.com/solutions/managing-infrastructure-as-code
- https://www.digitalocean.com/community/tutorials/how-to-structure-a-terraform-project
Useful for Workspace
Useful for Startup Script completion tracking
-
- See the updating on a running instance instead of the
- Metadata on GCP instances can be accessed using the metadata URL.
- From machines that are not GCP instances, we can access it as `gcloud compute instances describe h2o-instance-workspace --format='value[](metadata.items.startup-complete)'`
  - We use gcloud topic filters to get the desired value out of the response.
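The `gcloud compute instances describe` command above can be polled to wait until the startup script has finished. This is a sketch under the assumption that the startup script sets a non-empty `startup-complete` metadata value when it is done and that a default compute zone is configured; `wait_for_startup` is a hypothetical helper name.

```shell
# Hypothetical helper: poll instance metadata until "startup-complete" is set.
# $1 = instance name
# $2 = maximum number of attempts (default 30)
# $3 = seconds to sleep between attempts (default 10)
wait_for_startup() {
  local instance="$1"
  local attempts="${2:-30}"
  local delay="${3:-10}"
  local value=""
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    value="$(gcloud compute instances describe "$instance" \
      --format='value[](metadata.items.startup-complete)' 2>/dev/null)"
    if [ -n "$value" ]; then
      # Print the metadata value and report success.
      echo "$value"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  # The value never appeared within the allowed number of attempts.
  return 1
}
```

For example, `wait_for_startup h2o-instance-workspace 60 5` checks every 5 seconds for up to 60 attempts and returns a non-zero status if the value never appears.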