Preparing and loading data
Some of this document echoes the ExAC browser's README.md, but has local modifications.
The data is loaded into the MongoDB instance on swefreq-db from the command line. The process is not automated because it sometimes dies halfway through.
Each dataset uses a different collection in the MongoDB database. The collection used is determined by two things:
- The FLASK_PORT environment variable. It must be set before the browser is started or before the data is loaded. It should be set to a port number greater than or equal to 8000 (an arbitrarily picked number). Whenever we re-import the SweGen dataset, that dataset should be loaded with FLASK_PORT=8000. If FLASK_PORT is unset, the browser uses its default port, 5000.
- The settings.json file, which has entries like "mongoDB-8000": "somename1", "mongoDB-8001": "somename2". The number in the key corresponds to the value of the FLASK_PORT environment variable, and the value is the name of the MongoDB collection that will be used.
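The mapping from FLASK_PORT to a collection name can be sketched as below. This is only an illustration of the rule described above, not the browser's actual lookup code; the function names collection_for_port and load_settings are hypothetical, and the sketch assumes settings.json is a flat JSON object.

```python
import json
import os


def load_settings(path="settings.json"):
    """Read settings.json (assumed here to be a flat JSON object)."""
    with open(path) as handle:
        return json.load(handle)


def collection_for_port(settings, environ=None, default_port=5000):
    """Return the collection name that matches the current FLASK_PORT."""
    env = os.environ if environ is None else environ
    # Fall back to the browser's default port when FLASK_PORT is unset.
    port = int(env.get("FLASK_PORT", default_port))
    key = f"mongoDB-{port}"
    if key not in settings:
        raise KeyError(f"settings.json has no entry {key}")
    return settings[key]
```

With the example entries above, collection_for_port({"mongoDB-8000": "somename1", "mongoDB-8001": "somename2"}, {"FLASK_PORT": "8001"}) returns "somename2". Raising an error on a missing key, rather than guessing a collection, is a choice made for this sketch.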
The data is loaded from the andkaha container (a private account is used where everything is already set up).
It is assumed that the dataset comes in two parts:
- A VCF file containing the variations, together with a corresponding Tabix index file.
- A set of one or several coverage data files, also with Tabix index files.
The VCF file should be accessible as exac_data/variations.vcf.gz (and exac_data/variations.vcf.gz.tbi) from the swefreq-browser directory, and the coverage data should be located in the subdirectory exac_data/coverage and be matched by Panel.*.coverage.txt.gz. Using symbolic links works well.
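As an illustration of the symbolic-link approach, the following minimal sketch links files into exac_data under the expected names. The helper name link_into is hypothetical, and any source path passed to it (for example, wherever the released VCF actually lives) is up to the reader. The sketch replaces an existing symlink but refuses to overwrite a regular file.

```python
from pathlib import Path


def link_into(target_dir, sources):
    """Create symlinks in target_dir; sources maps link name -> real file."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    for name, source in sources.items():
        link = target_dir / name
        if link.is_symlink():
            link.unlink()  # replace an older link with the new target
        elif link.exists():
            raise FileExistsError(f"{link} exists and is not a symlink")
        link.symlink_to(Path(source).resolve())
```

It would be called from the swefreq-browser directory with a dictionary such as {"variations.vcf.gz": ..., "variations.vcf.gz.tbi": ...}, where the values are the real file locations.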
The following supporting dataset files should also be accessible in the exac_data directory:
- canonical_transcripts.txt.gz
- dbNSFP_gene.gz
- dbSNP.txt.bgz and dbSNP.txt.bgz.tbi
- gencode.gtf.gz
- omim_info.txt.gz (unsure how this is used)
For further information about getting and preparing these datasets, see https://github.com/NBISweden/swefreq/wiki/Getting-and-preparing-supporting-datasets
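Since a load can die halfway, it may help to confirm that every expected input is in place before starting. The sketch below checks the file names listed above. The function name missing_inputs is hypothetical, treating Panel.*.coverage.txt.gz as a shell-style glob is an assumption, and the coverage index files are not checked because their naming is not given here.

```python
from pathlib import Path

# File names taken from the lists above, relative to exac_data.
REQUIRED_FILES = [
    "variations.vcf.gz",
    "variations.vcf.gz.tbi",
    "canonical_transcripts.txt.gz",
    "dbNSFP_gene.gz",
    "dbSNP.txt.bgz",
    "dbSNP.txt.bgz.tbi",
    "gencode.gtf.gz",
    "omim_info.txt.gz",
]


def missing_inputs(data_dir="exac_data"):
    """Return the expected input paths that do not exist (or are dangling links)."""
    data_dir = Path(data_dir)
    missing = [
        str(data_dir / name)
        for name in REQUIRED_FILES
        if not (data_dir / name).exists()
    ]
    # At least one coverage file must match the pattern.
    pattern = "Panel.*.coverage.txt.gz"
    if not list((data_dir / "coverage").glob(pattern)):
        missing.append(str(data_dir / "coverage" / pattern))
    return missing
```

An empty list means all listed inputs were found. Because exists() follows symbolic links, a link that points at a missing file is reported as missing.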
To actually load the data, first activate the Python virtual environment:
source exac_env/bin/activate
Then set the FLASK_PORT environment variable:
export FLASK_PORT=80nn
Then load the supporting datasets:
python manage.py load_dbsnp_file
python manage.py load_gene_models
... and the VCF file with coverage:
python manage.py load_variants_file
python manage.py load_base_coverage
Then precalculate some statistics and cache:
python manage.py precalculate_metrics
python manage.py create_cache
Finally, start the browser using:
python exac.py --host=0.0.0.0 --port="$FLASK_PORT"
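The process is kept manual on purpose, but the order of the loading steps above can be summarised in a small sketch that stops at the first failing step and reports it, so the step can be rerun by hand. The names LOAD_STEPS and run_steps are hypothetical. The sketch assumes it is run from the swefreq-browser directory with the virtual environment active and FLASK_PORT already exported.

```python
import subprocess
import sys

# manage.py subcommands in the order given above.
LOAD_STEPS = [
    "load_dbsnp_file",
    "load_gene_models",
    "load_variants_file",
    "load_base_coverage",
    "precalculate_metrics",
    "create_cache",
]


def run_steps(steps=LOAD_STEPS, runner=subprocess.run):
    """Run each step in order; return the first failing step, or None."""
    for step in steps:
        # sys.executable is the interpreter of the active virtual environment.
        result = runner([sys.executable, "manage.py", step])
        if result.returncode != 0:
            return step
    return None
```

The runner parameter exists only so that the ordering logic can be exercised without a real manage.py. In normal use, run_steps() is called with no arguments and the returned step name (if any) tells you where to resume.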