Preparing and loading data

Andreas Kusalananda Kähäri edited this page Nov 14, 2017 · 3 revisions

Introduction

Some of this document echoes the ExAC browser's README.md, but has local modifications.

The data is loaded into the MongoDB instance on swefreq-db from the command line. The process is not automated because the load sometimes dies halfway through.

Multi-dataset specifics

Each dataset uses a different collection in the MongoDB database. The collection used is determined by two things:

  1. The FLASK_PORT environment variable, which must be set before the browser is started or the data is loaded. It should be a port number greater than or equal to 8000 (an arbitrarily picked number). Whenever we re-import the SweGen dataset, load it with FLASK_PORT=8000. If FLASK_PORT is unset, the browser uses its default port, 5000.

  2. The settings.json file has entries like

    "mongoDB-8000": "somename1",
    "mongoDB-8001": "somename2",
    

    The number in each key corresponds to the value of the FLASK_PORT environment variable, and each value is the name of the MongoDB collection that will be used.
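
The lookup described above can be sketched as follows. This is a minimal illustration, not the browser's actual code: the function name collection_for_port is hypothetical, and the fallback to 5000 simply mirrors the default port mentioned in item 1.

```python
import json
import os


def collection_for_port(settings, environ=None):
    """Return the collection name configured for the current FLASK_PORT.

    Hypothetical helper that mirrors the lookup described in the text.
    """
    environ = os.environ if environ is None else environ
    # Assumption: fall back to the browser's default port when unset.
    port = environ.get("FLASK_PORT", "5000")
    key = "mongoDB-" + port
    if key not in settings:
        raise KeyError("no entry %r in settings.json" % key)
    return settings[key]


# Example settings with the same shape as the entries shown above.
example_settings = json.loads(
    '{"mongoDB-8000": "somename1", "mongoDB-8001": "somename2"}'
)
print(collection_for_port(example_settings, {"FLASK_PORT": "8001"}))  # prints somename2
```

With these example entries, an unset FLASK_PORT raises KeyError, because there is no "mongoDB-5000" key.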

Preparing the data

The data is loaded from the andkaha container, a private account where everything is already set up.

It is assumed that the dataset comes in two parts:

  1. A VCF file containing the variations, together with a corresponding Tabix index file.
  2. A set of one or more coverage data files, also with Tabix index files.

The VCF file should be accessible as exac_data/variations.vcf.gz (with its index as exac_data/variations.vcf.gz.tbi) from the swefreq-browser directory. The coverage data should be located in the subdirectory exac_data/coverage and be matched by Panel.*.coverage.txt.gz. Using symbolic links works well.
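
One way to set up the symbolic links is sketched below in Python 3. The helper name link_dataset is hypothetical, and the sketch assumes that each coverage file already has a name matching Panel.*.coverage.txt.gz and that every Tabix index sits next to its data file with a .tbi suffix.

```python
from pathlib import Path


def link_dataset(vcf, coverage_files, browser_dir="."):
    """Symlink a VCF and coverage files (plus indexes) into exac_data/."""
    data_dir = Path(browser_dir) / "exac_data"
    coverage_dir = data_dir / "coverage"
    coverage_dir.mkdir(parents=True, exist_ok=True)

    vcf = Path(vcf).resolve()
    # Map each link path to the file it should point at.
    links = {
        data_dir / "variations.vcf.gz": vcf,
        data_dir / "variations.vcf.gz.tbi": Path(str(vcf) + ".tbi"),
    }
    for cov in coverage_files:
        cov = Path(cov).resolve()
        # Coverage files keep their own names inside exac_data/coverage.
        links[coverage_dir / cov.name] = cov
        links[coverage_dir / (cov.name + ".tbi")] = Path(str(cov) + ".tbi")

    for link, target in links.items():
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier dataset
        link.symlink_to(target)  # raises FileExistsError for a regular file
    return sorted(str(path) for path in links)
```

The sketch replaces existing symbolic links but deliberately fails if a regular file is in the way, so real data is never overwritten.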

Supporting data

The following supporting dataset files should also be accessible in the exac_data directory:

  • canonical_transcripts.txt.gz
  • dbNSFP_gene.gz
  • dbSNP.txt.bgz and dbSNP.txt.bgz.tbi
  • gencode.gtf.gz
  • omim_info.txt.gz (unsure how this is used)

For further information about getting and preparing these datasets, see https://github.com/NBISweden/swefreq/wiki/Getting-and-preparing-supporting-datasets
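
Before starting a load, it can help to confirm that every expected file is in place. The following Python 3 sketch checks the file names listed on this page; the function name missing_inputs is hypothetical and not part of the browser.

```python
from pathlib import Path

# File names taken from the lists on this page.
REQUIRED = [
    "variations.vcf.gz",
    "variations.vcf.gz.tbi",
    "canonical_transcripts.txt.gz",
    "dbNSFP_gene.gz",
    "dbSNP.txt.bgz",
    "dbSNP.txt.bgz.tbi",
    "gencode.gtf.gz",
    "omim_info.txt.gz",
]


def missing_inputs(data_dir="exac_data"):
    """Return the expected inputs that are absent from data_dir."""
    data_dir = Path(data_dir)
    # exists() follows symbolic links, so a dangling link counts as missing.
    missing = [name for name in REQUIRED if not (data_dir / name).exists()]
    coverage_dir = data_dir / "coverage"
    has_coverage = coverage_dir.is_dir() and any(
        coverage_dir.glob("Panel.*.coverage.txt.gz")
    )
    if not has_coverage:
        missing.append("coverage/Panel.*.coverage.txt.gz")
    return missing


if __name__ == "__main__":
    for name in missing_inputs():
        print("missing:", name)
```

Run from the swefreq-browser directory, the script prints nothing when all inputs are present.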

Loading data

To actually load the data, first activate the Python virtual environment:

source exac_env/bin/activate

Then set the FLASK_PORT environment variable to the port chosen for the dataset (80nn below is a placeholder):

export FLASK_PORT=80nn

Then load the supporting datasets:

python manage.py load_dbsnp_file
python manage.py load_gene_models

Then load the VCF file and the coverage data:

python manage.py load_variants_file
python manage.py load_base_coverage

Then precalculate some statistics and create the cache:

python manage.py precalculate_metrics
python manage.py create_cache
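
Because a load can die partway through, it is useful to know exactly which step failed. The Python 3 sketch below runs the commands above in order and stops at the first non-zero exit status. The names run_steps and start_at are hypothetical. It invokes whichever python is first on PATH, so the virtual environment should already be activated. Whether a failed step can safely be rerun is not covered here and should be checked before resuming.

```python
import os
import subprocess

# The manage.py subcommands, in the order listed above.
STEPS = [
    "load_dbsnp_file",
    "load_gene_models",
    "load_variants_file",
    "load_base_coverage",
    "precalculate_metrics",
    "create_cache",
]


def run_steps(steps=STEPS, start_at=None, run=subprocess.call, environ=None):
    """Run each step in order; return the first failing step, or None."""
    environ = os.environ if environ is None else environ
    if "FLASK_PORT" not in environ:
        raise RuntimeError("FLASK_PORT must be set before loading data")
    steps = list(steps)
    if start_at is not None:
        # Skip everything before the named step.
        steps = steps[steps.index(start_at):]
    for step in steps:
        print("running", step, flush=True)
        # A non-zero exit status means the step failed.
        if run(["python", "manage.py", step]) != 0:
            return step
    return None
```

Calling run_steps() from the swefreq-browser directory returns None when every step succeeds; run_steps(start_at="load_variants_file") would skip the two supporting-dataset steps.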

Starting the browser

Start the browser, with FLASK_PORT still set, using:

python exac.py --host=0.0.0.0 --port="$FLASK_PORT"