The steps below describe how to set up a fully isolated instance of xBrowse on Amazon Web Services. We begin with a vanilla CentOS 6.5 virtual machine and provision an xBrowse server that can be accessed over the public internet.
Although these steps are specific to AWS and CentOS 6.5, they map closely to the steps you'd take to install xBrowse on other systems. For example, this is very similar to how we administer xBrowse at the Broad Institute, with a few modifications to accommodate specifics of our internal infrastructure.
Note that, though this is an AWS tutorial, we don't maintain any prebuilt AMIs at this time - but you could use the result of this tutorial to package AMIs for internal use.
Make sure that you have an AWS account. This tutorial will not fall under the free usage tier, so you'll need a credit card. It will only cost a few dollars if you delete everything afterward - but that last part is important: make sure you terminate any VMs that you aren't using.
These instructions are sparse since there are multiple ways to create virtual machines on AWS.
- Create a new EC2 virtual machine from Community AMI `ami-8997afe0`. (Note: you must set your region to `us-east-1` for this AMI to appear in search results.) Use instance type `m3.medium`, or something more powerful.
At this point, you should be able to log into the machine:
ssh -i /path/to/private/key <user>@<ec2-public-dns>
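If you prefer the command line over the AWS console, the same VM can be launched with the AWS CLI. This is only a sketch: the key pair and security group names below are placeholders for resources in your own account.

```shell
# Launch an m3.medium instance from the community AMI in us-east-1.
# 'my-keypair' and 'my-security-group' are placeholders; substitute your own.
aws ec2 run-instances \
    --region us-east-1 \
    --image-id ami-8997afe0 \
    --instance-type m3.medium \
    --key-name my-keypair \
    --security-groups my-security-group
```

The security group must allow inbound SSH (port 22) from your IP, and HTTP (port 80) if you want the finished server reachable from a browser.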
- Create an EBS volume with at least 50 GB of storage. This is where all of the xBrowse data and database files will go.
- Attach the EBS volume to the VM.
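This can be done from the AWS console, or with the AWS CLI as sketched below - the volume and instance IDs are placeholders for your own.

```shell
# Attach the 50 GB EBS volume to the instance, exposing it as /dev/xvdl.
# vol-xxxxxxxx and i-xxxxxxxx are placeholders; substitute your own IDs.
aws ec2 attach-volume \
    --region us-east-1 \
    --volume-id vol-xxxxxxxx \
    --instance-id i-xxxxxxxx \
    --device /dev/xvdl
```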
- Mount the EBS volume on the VM. In this document, we assume that the volume is mounted at `/mnt`. One way is to run:
lsblk # this shows all devices that can be mounted along with their name and size
mkfs -t ext4 /dev/xvdl # replace 'xvdl' with the name given by lsblk
mount -t ext4 /dev/xvdl /mnt
Before continuing, make sure that the mountpoint is correctly set up - it should look something like this:
$ df -H
Filesystem Size Used Avail Use% Mounted on
/dev/xvde 8.5G 682M 7.4G 9% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
/dev/xvdl 53G 189M 50G 1% /mnt
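Note that a mount created this way does not survive a reboot. If you want it to persist, you can additionally add an entry to `/etc/fstab` - a sketch, assuming the device name from the `lsblk` output above:

```shell
# Persist the mount across reboots (optional).
# Replace 'xvdl' with the device name reported by lsblk on your machine;
# 'nofail' prevents boot from hanging if the volume is ever detached.
echo '/dev/xvdl /mnt ext4 defaults,nofail 0 2' >> /etc/fstab
```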
The bulk of the provisioning for xBrowse is performed by Puppet, but a few steps are run manually. Log into the machine and do the following:
- Update yum:
yum update -y
- Install git, wget, and unzip:
yum install git wget unzip -y
- Install Puppet, or check that you have at least version 3.7 installed:
rpm -Uvh http://yum.puppetlabs.com/el/6/products/i386/puppetlabs-release-6-7.noarch.rpm
yum -y -q install puppet
- Create subdirectories:
cd /mnt
mkdir -p code/xbrowse-settings data/reference_data data/projects mongodb
- Clone the xbrowse repo from GitHub:
cd /mnt/code
git clone https://github.com/xbrowse/xbrowse.git
- Download the necessary reference data from xBrowse and external sources:
cd /mnt/data/reference_data
wget ftp://atguftp.mgh.harvard.edu/xbrowse-resource-bundle.tar.gz; tar -xzf xbrowse-resource-bundle.tar.gz
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b142_GRCh37p13/VCF/00-All.vcf.gz
wget ftp://dbnsfp:[email protected]/dbNSFPv2.9.zip; unzip -d dbNSFP dbNSFPv2.9.zip
wget ftp://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz*
wget ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/ExAC.r0.3.sites.vep.vcf.gz
cd /mnt/data/projects
wget ftp://atguftp.mgh.harvard.edu/1kg_project.tar.gz; tar -xzf 1kg_project.tar.gz
- Run Puppet to provision this machine for xBrowse. This performs the bulk of the provisioning and takes a while (~2 hours):
puppet apply /mnt/code/xbrowse/deploy/ec2/ec2_provision.pp --modulepath=/mnt/code/xbrowse/deploy/puppet/modules
- Install Perl dependencies. Perl itself is installed by the Puppet step above, but we must install the package manager and a few packages manually. (This should eventually be rolled into Puppet.)
curl -L http://cpanmin.us | perl - --sudo App::cpanminus
cpanm Archive::Extract CGI Time::HiRes Archive::Zip Archive::Tar
- Download and install VEP, which xBrowse uses to annotate variants. (This should also eventually be rolled into Puppet.)
cd /mnt
wget https://github.com/Ensembl/ensembl-tools/archive/release/78.zip
unzip 78.zip
mv ensembl-tools-release-78/scripts/variant_effect_predictor .
rm -rf 78.zip ensembl-tools-release-78
cd variant_effect_predictor
perl INSTALL.pl --AUTO acf --CACHEDIR ../vep_cache_dir --SPECIES homo_sapiens --ASSEMBLY GRCh37 --CONVERT
- Set the Python path:
export PYTHONPATH=/mnt/code/xbrowse:/mnt/code/xbrowse-settings:$PYTHONPATH
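This export only lasts for the current session. To avoid re-running it on every login, you can append it to the shell profile - a sketch, assuming a bash login shell:

```shell
# Persist the PYTHONPATH for future sessions (optional).
# Single quotes keep $PYTHONPATH unexpanded until the line is sourced.
echo 'export PYTHONPATH=/mnt/code/xbrowse:/mnt/code/xbrowse-settings:$PYTHONPATH' >> ~/.bashrc
```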
- Initialize the database. This Django command creates the database xBrowse uses for storing users, projects, and other metadata:
cd /mnt/code/xbrowse
python2.7 manage.py migrate
- Load reference data (genes, population variation, etc.). This will take ~20 minutes; a sequence of progress bars will be shown:
python2.7 manage.py load_resources
- Create superuser(s). You will be asked to choose a username and password, which you can then use to log in to the development website. This user will have access to all xBrowse projects on your development instance:
python2.7 manage.py createsuperuser
Things are mostly set up now, but if you try to load this machine's public DNS in a web browser, you won't be able to connect. One final step is in order: we need to loosen the machine's firewall (iptables) rules so it can accept public traffic:
iptables -F && iptables -A FORWARD -j REJECT && /etc/init.d/iptables save
Now visit your public DNS again, and you should see the familiar xBrowse homepage.
However, this instance does not have any data loaded. We'll load a test project now.
- Initialize the 1kg example project:
python2.7 manage.py add_project 1kg
- Populate the project with data from the test project directory:
python2.7 manage.py load_project_dir 1kg /mnt/data/projects/1kg_project
- Load the VCF data:
python2.7 manage.py load_project 1kg
This should take ~1 hour: it has to parse all the variants in the VCF file, annotate them, and load them into the variant database (annotation speed is the main bottleneck).