osdataproc

osdataproc is a command-line tool for creating an OpenStack cluster with Apache Spark and Apache Hadoop configured. It comes with JupyterLab and Hail (a genomic data analysis library built on Spark) installed, as well as Netdata for monitoring.

osdataproc uses Terraform to spin up the cluster in OpenStack and Ansible to configure the cluster nodes.

Setup

Create a Python virtual environment. For example:
```
python3 -m venv env
```
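Download Terraform (1.4 or higher) and unzip it into a location on your PATH, e.g. into your venv's bin directory. Make sure to download the appropriate version for your operating system and architecture; the example below is for macOS on Intel.
```
# Example archive for macOS on Intel (darwin_amd64); pick the one matching your system
curl https://releases.hashicorp.com/terraform/1.4.6/terraform_1.4.6_darwin_amd64.zip > terraform_1.4.6_darwin_amd64.zip
# Extract into the venv's bin directory, which is on PATH once the venv is activated
unzip terraform_1.4.6_darwin_amd64.zip -d env/bin
```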

Source the environment, clone this repository and install the requirements into the virtual environment:

```
source env/bin/activate
git clone https://github.com/wtsi-hgi/osdataproc.git
cd osdataproc
pip install -e .
```

Create an SSH keypair with ssh-keygen if you have not already done so; the default options are fine. If your private key has a passphrase, see the Troubleshooting Notes below.
Download your OpenStack project's openrc.sh file. You can find the specific file for your project at Project > API Access, and then Download OpenStack RC File > OpenStack RC File on the right.
Source your openrc.sh file:
```
source <project-name>-openrc.sh
```
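A quick way to confirm that the file was sourced in the current shell is to check a couple of the variables it sets. The sketch below assumes the RC file exports `OS_AUTH_URL` and `OS_USERNAME`; the exact set of variables depends on your OpenStack deployment, and `check_openrc` is a hypothetical helper name, not part of osdataproc.

```shell
# Hypothetical helper: report any expected OpenStack variable that is unset.
# The variable list is an assumption; adjust it to match your openrc.sh.
check_openrc() {
    missing=0
    for name in OS_AUTH_URL OS_USERNAME; do
        # Indirectly read the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_openrc` after sourcing the file returns a non-zero status and names each variable that is still unset.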

You can then run the osdataproc command as shown in the examples below. Run osdataproc --help, or the help of a subcommand such as osdataproc create --help, to see all available arguments.

When run, it prompts you for a password. This password protects the web interfaces, including JupyterLab, and is also used for the encrypted NFS volume (see the NFS documentation). You will be asked for it the first time you access your cluster from a browser.

Example Usage

Create a Cluster

```
osdataproc create [--num-workers]    <Number of desired worker nodes>
                  [--public-key]     <Path to public key file>
                  [--flavour]        <OpenStack flavour to use>
                  [--network-name]   <OpenStack network to use>
                  [--lustre-network] <OpenStack Lustre provider network to use>
                  [--image-name]     <OpenStack image to use - Ubuntu images only>
                  [--nfs-volume]     <Name/ID of volume to attach or create as NFS shared volume>
                  [--volume-size]    <Size of OpenStack volume to create>
                  [--device-name]    <Device mountpoint name of volume>
                  [--floating-ip]    <OpenStack floating IP to associate to master node - will automatically create one if not specified>
                  <cluster_name>
```

NOTE: Ensure that the image used has python3.9 as the default version of python. The focal images should work.
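As an illustration, the sketch below checks a cluster name against the naming rule from the Troubleshooting Notes (alphanumeric characters and dashes only) and then prints a create command for review. The cluster name, worker count, flavour and key path are hypothetical values, and `is_valid_cluster_name` is an illustrative helper rather than part of osdataproc.

```shell
# Illustrative helper (not part of osdataproc): succeeds only when the name
# is non-empty and consists of alphanumeric characters and dashes.
is_valid_cluster_name() {
    case "$1" in
        ""|*[!A-Za-z0-9-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical values; substitute your own.
CLUSTER_NAME="hail-demo"
NUM_WORKERS=4
FLAVOUR="m2.medium"
PUBLIC_KEY="$HOME/.ssh/id_rsa.pub"

if is_valid_cluster_name "$CLUSTER_NAME"; then
    # Printed for review; drop the leading `echo` to actually create the cluster.
    echo osdataproc create \
        --num-workers "$NUM_WORKERS" \
        --flavour "$FLAVOUR" \
        --public-key "$PUBLIC_KEY" \
        "$CLUSTER_NAME"
else
    echo "invalid cluster name: $CLUSTER_NAME" >&2
fi
```

Printing the command first makes it easy to check the arguments before any OpenStack resources are created.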

osdataproc create outputs the public IP of your master node once the node has been created. You can SSH into it using the private key that matches the public key you provided:

```
ssh ubuntu@<public_ip>
```

Note that it will take a few minutes for the configuration to complete. The following services can then be accessed from your browser:

| Service | URL |
| --- | --- |
| JupyterLab | `https://<public_ip>/jupyter` |
| Spark | `https://<public_ip>/spark` |
| Spark History | `https://<public_ip>/sparkhist` |
| HDFS | `https://<public_ip>/hdfs` |
| YARN | `https://<public_ip>/yarn` |
| MapReduce History | `https://<public_ip>/mapreduce` |
| Netdata Metrics | `https://<public_ip>/netdata` |
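
To avoid retyping the address, the service URLs can be generated from the master node's public IP. This is a minimal sketch: the paths are the ones listed above, while `print_service_urls` and the example address are hypothetical.

```shell
# Print the web interface URLs for a given master node IP.
# The paths are the ones listed in the service table.
print_service_urls() {
    public_ip="$1"
    for path in jupyter spark sparkhist hdfs yarn mapreduce netdata; do
        echo "https://${public_ip}/${path}"
    done
}

# 203.0.113.10 is a documentation-only placeholder address.
print_service_urls "203.0.113.10"
```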

Attaching a Volume

You can attach a volume to your cluster as an NFS share, either by creating a new volume or by attaching an existing one. The volume is mounted on the data directory of your master node, and that directory is then shared with all of the worker nodes over NFS.

See the NFS documentation for details and creation options.

Lustre Support

For Lustre support, you must provide the name of the Lustre provider network that exists in your tenant and an image configured to mount Lustre from this network. For Sanger users, please check with ISG for the details.

Destroy a Cluster

```
osdataproc destroy <cluster_name>
```

The destroy plan is printed and you are prompted to type 'yes' to confirm it.

Configuration Options

The vars.yml file holds the default options for creating a cluster and lets you tune Spark and Hadoop configuration items. Additional packages and Python modules to install on the cluster can also be specified there.

Troubleshooting Notes

Check that the installed version of jinja2 is 3.0.3; otherwise you may see the following error message:

```
[WARNING]: Skipping plugin (/lib/python3.9/site-packages/ansible/plugins/filter/mathstuff.py) as it seems to be invalid: cannot import name
'environmentfilter' from 'jinja2.filters' (/lib/python3.9/site-packages/jinja2/filters.py)
```
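
The jinja2 version can be checked from the virtual environment in which osdataproc is installed. The sketch below is one possible way to do this, assuming `python3` in that environment is version 3.8 or later (for `importlib.metadata`); `jinja2_version_matches` is a hypothetical helper name.

```shell
REQUIRED_JINJA2="3.0.3"

# Succeeds when the given version string equals the required one.
jinja2_version_matches() {
    [ "$1" = "$REQUIRED_JINJA2" ]
}

# Look up the installed version; falls back to "none" if the lookup fails
# (for example, when jinja2 or python3 is not available).
installed=$(python3 -c 'from importlib.metadata import version; print(version("jinja2"))' 2>/dev/null || echo "none")

if jinja2_version_matches "$installed"; then
    echo "jinja2 $installed is installed"
else
    echo "jinja2 is '$installed'; to pin it run: pip install 'jinja2==$REQUIRED_JINJA2'"
fi
```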

Your cluster name should never contain underscore characters; valid choices are alphanumeric characters and dashes.
If your private key has a passphrase, Ansible will not be able to connect to the created instances unless you add your key to ssh-agent first:
```
eval $(ssh-agent)
ssh-add
```
You can check the provisioning status of a worker node by connecting through the master node and following /var/log/user_data.log. For example:
```
ssh -J ubuntu@<public_ip> \
    ubuntu@<user>-<cluster_name>-worker-<index> \
    tail -f /var/log/user_data.log
```
osdataproc is configured to use Kryo serialization for use with Hail for up to 10x faster data serialization. However, not all Serializable types are supported and so it may be necessary to change $SPARK_HOME/conf/spark-defaults.conf by commenting out or removing the spark.serializer configuration option. This can also be removed by default in vars.yml when creating a cluster.
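Commenting out the option on a running cluster could look like the sketch below. It assumes the option appears at the start of a line in the file, and `disable_spark_serializer` is a hypothetical helper; depending on the file's ownership you may need elevated permissions to write to it.

```shell
# Comment out any active spark.serializer line in a Spark config file.
# The edited text goes to a temporary file first and is then copied back,
# so the original file keeps its ownership and permissions.
disable_spark_serializer() {
    conf_file="$1"
    tmp_file=$(mktemp)
    sed 's/^[[:space:]]*spark\.serializer[[:space:]=]/# &/' "$conf_file" > "$tmp_file" &&
        cat "$tmp_file" > "$conf_file"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}

# Usage on the cluster (path taken from the note above):
# disable_spark_serializer "$SPARK_HOME/conf/spark-defaults.conf"
```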
For Sanger users, check the appropriate FCE capacity dashboard under the Sanger metrics.
If spark-submit is not on PATH (spark-submit: command not found), activate the venv:
```
source /home/ubuntu/venv/bin/activate
```
If running spark-submit produces the following error:
```
TypeError: 'JavaPackage' object is not callable
```
check that SPARK_HOME is set to /opt/spark:
```
export SPARK_HOME=/opt/spark
```

Contributing and Editing

You can contribute by submitting pull requests to this repository. If you create a fork you will need to update the REPO and BRANCH variables in terraform/user-data.sh.tpl to the new repository location for the changes you make to be reflected in the created cluster.

Developing and debugging notes

Main variables

Variables containing the main package versions are defined in vars.yml. The JDK version is defined in ansible/roles/common/default.

Logs location

All Terraform and Ansible logs are located directly in the /var/log folder. By default, Ansible writes data to syslog.

The log for the master node is also saved to ${OSDataProc_HOME}/terraform/terraform.tfstate.d/<cluster_name>/ansible-master.log.

Updating installation on worker nodes

To run the Ansible playbooks on the worker nodes, Terraform clones them from the public git repository. For updated playbooks to be applied at cluster creation, push your changes to a git branch and reference that branch in the terraform/user-data.sh.tpl script.
