Getting and preparing supporting datasets

Andreas Kusalananda Kähäri edited this page Jan 26, 2018 · 15 revisions

Getting the needed datasets for the SweFreq browser: GRCh37 and GRCh38

GRCh37 datasets

GENCODE v19 GRCh37.p13

curl -O

Remove the GL contigs by keeping only the lines that start with # (comments) or chr:

zgrep -E '^(#|chr)' gencode.v19.annotation.gtf.gz |
gzip -c >gencode.v19.annotation-filtered.gtf.gz

ln -sf gencode.v19.annotation-filtered.gtf.gz gencode.gtf.gz
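A quick way to convince yourself that the filter does what is intended is to run the same pattern on a tiny made-up file. This is only a sketch: `filter_gtf` is a hypothetical helper name, the sample records are invented, and `gzip -dc | grep -E` is used in place of `zgrep -E` purely so that the example runs where `zgrep` is not installed.

```shell
# Hypothetical helper: keep comment lines (#) and records on chr* sequences,
# drop everything else (e.g. the GL contigs).
filter_gtf() {
    gzip -dc "$1" | grep -E '^(#|chr)'
}

# Build a small, made-up GTF-like file: one comment, two chr records, one GL record.
tmpdir=$(mktemp -d)
{
    printf '##description: made-up sample\n'
    printf 'chr1\tHAVANA\tgene\t11869\t14412\n'
    printf 'GL000192.1\tENSEMBL\tgene\t1\t100\n'
    printf 'chrM\tENSEMBL\tgene\t3307\t4262\n'
} | gzip -c >"$tmpdir/sample.gtf.gz"

# Prints the comment line and the two chr records; the GL record is gone.
filter_gtf "$tmpdir/sample.gtf.gz"
```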

dbSNP b150 GRCh37.p13

curl -O

zcat b150_SNPChrPosOnRef_105.bcp.gz | mawk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
sort --parallel=8 -S 256M -k2,2 -k3,3n | bgzip -c >dbSNP_b150.txt.bgz

tabix -s 2 -b 3 -e 3 dbSNP_b150.txt.bgz

ln -sf dbSNP_b150.txt.bgz dbSNP.txt.bgz
ln -sf dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi
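The filter-and-sort part of the pipeline above can be tried on a few invented rows before running it on the full file. The sketch below assumes space-separated input with the columns ID, chromosome, position (the real file has more columns), uses plain `awk` instead of `mawk`, and leaves out the `--parallel` and `-S` tuning flags. `prepare_snp_positions` is a hypothetical name.

```shell
# Hypothetical helper: drop rows without a third field (no position),
# turn runs of spaces into tabs, then sort by chromosome and numeric position.
prepare_snp_positions() {
    awk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
    LC_ALL=C sort -k2,2 -k3,3n
}

# Made-up input: rs4 has no position and should be dropped.
printf '%s\n' 'rs3 2 500' 'rs1 1 200' 'rs4 3' 'rs2 1 100' |
prepare_snp_positions
```

The rows come out tab-separated and ordered so that each chromosome forms one block sorted by position, which is the ordering the `tabix -s 2 -b 3 -e 3` call relies on.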

dbNSFP v2.9.3 GRCh37

This is a 13.4 GB Zip file that is really slow to download. Out of it, we only need a single 26 MB file (sigh).

curl -O ftp://dbnsfp:[email protected]/
unzip dbNSFP2.9_gene
gzip -9 dbNSFP2.9_gene
ln -sf dbNSFP2.9_gene.gz dbNSFP_gene.gz

Ensembl canonical transcripts (Ensembl 75, GRCh37.p13)

mysql -BN -h -u anonymous -D homo_sapiens_core_75_37 \
    -e 'SELECT g.stable_id, t.stable_id FROM gene g JOIN transcript t
        ON (g.canonical_transcript_id = t.transcript_id)' |
sort | gzip -9c >canonical_transcripts_ensembl_75_GRCh37.gz

ln -sf canonical_transcripts_ensembl_75_GRCh37.gz canonical_transcripts.txt.gz
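Because `mysql -BN` writes tab-separated rows without a header, the resulting file is a plain two-column mapping from gene stable ID to transcript stable ID. As a sketch of how that mapping could be queried, the hypothetical helper `canonical_transcript_for` below looks up one gene; the IDs in the sample file are made up.

```shell
# Hypothetical helper: print the canonical transcript ID for gene "$2"
# from the gzipped two-column mapping file "$1".
canonical_transcript_for() {
    gzip -dc "$1" | awk -F '\t' -v gene="$2" '$1 == gene { print $2 }'
}

# Made-up two-row mapping in the same shape as the query output.
map=$(mktemp)
printf 'ENSG00000000002\tENST00000000022\nENSG00000000001\tENST00000000011\n' |
sort | gzip -9c >"$map"

canonical_transcript_for "$map" ENSG00000000002
```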


OMIM

Using the old dataset (still GRCh37?). Updating it requires registration.

GRCh38 datasets

The SweGen data is on GRCh37, but I'm assuming we will have datasets on the newer assembly at some point.

I have prepared the following data in a private user account in the andkaha container (which is the container from which I'm planning to do data loading):

GENCODE v27 GRCh38.p10

curl -O
ln -sf gencode.v27.annotation.gtf.gz gencode.gtf.gz

dbSNP b150 GRCh38.p7

curl -O

zcat b150_SNPChrPosOnRef_108.bcp.gz | mawk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
sort --parallel=8 -S 256M -k2,2 -k3,3n | bgzip -c >dbSNP_b150.txt.bgz

tabix -s 2 -b 3 -e 3 dbSNP_b150.txt.bgz

ln -sf dbSNP_b150.txt.bgz dbSNP.txt.bgz
ln -sf dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi

dbNSFP v3.5a GRCh38 (.p2?)

This is a 16.1 GB Zip file that is really slow to download. Out of it, we only need a single 26 MB file (sigh).

curl -O ftp://dbnsfp:[email protected]/
unzip dbNSFP3.5_gene
gzip -9 dbNSFP3.5_gene
ln -sf dbNSFP3.5_gene.gz dbNSFP_gene.gz

Ensembl canonical transcripts (Ensembl 90, GRCh38.p10)

mysql -BN -h -u anonymous -D homo_sapiens_core_90_38 \
    -e 'SELECT g.stable_id, t.stable_id FROM gene g JOIN transcript t
        ON (g.canonical_transcript_id = t.transcript_id)' |
sort | gzip -9c >canonical_transcripts_ensembl_90_GRCh38.gz

ln -sf canonical_transcripts_ensembl_90_GRCh38.gz canonical_transcripts.txt.gz


OMIM

Using the old dataset (still GRCh37?). Updating it requires registration.

Directory layout for supporting data

GRCh37:

├── canonical_transcripts.txt.gz -> real/canonical_transcripts_ensembl_75_GRCh37.gz
├── dbNSFP_gene.gz -> real/dbNSFP2.9_gene.gz
├── dbSNP.txt.bgz -> real/dbSNP_b150.txt.bgz
├── dbSNP.txt.bgz.tbi -> real/dbSNP_b150.txt.bgz.tbi
├── gencode.gtf.gz -> real/gencode.v27lift37.annotation.gtf.gz
├── omim_info.txt.gz -> real/omim_info.txt.gz
└── real
    ├── b150_SNPChrPosOnRef_105.bcp.gz
    ├── canonical_transcripts_ensembl_75_GRCh37.gz
    ├── dbNSFP2.9_gene.gz
    ├── dbSNP_b150.txt.bgz
    ├── dbSNP_b150.txt.bgz.tbi
    ├── gencode.v27lift37.annotation.gtf.gz
    └── omim_info.txt.gz

GRCh38:

├── canonical_transcripts.txt.gz -> real/canonical_transcripts_ensembl_90_GRCh38.gz
├── dbNSFP_gene.gz -> real/dbNSFP3.5_gene.gz
├── dbSNP.txt.bgz -> real/dbSNP_b150.txt.bgz
├── dbSNP.txt.bgz.tbi -> real/dbSNP_b150.txt.bgz.tbi
├── gencode.gtf.gz -> real/gencode.v27.annotation.gtf.gz
├── omim_info.txt.gz -> real/omim_info.txt.gz
└── real
    ├── b150_SNPChrPosOnRef_108.bcp.gz
    ├── canonical_transcripts_ensembl_90_GRCh38.gz
    ├── dbNSFP3.5_gene.gz
    ├── dbSNP_b150.txt.bgz
    ├── dbSNP_b150.txt.bgz.tbi
    ├── gencode.v27.annotation.gtf.gz
    └── omim_info.txt.gz
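Since everything at the top level of the layout is a symbolic link into `real`, a mistyped target name leaves a dangling link that is easy to overlook. The following is a minimal sketch of a check for that; `find_dangling_links` is a hypothetical helper, and the demo directory is made up (empty placeholder files, one link with a target name that is off by one letter).

```shell
# Hypothetical helper: list symlinks directly under "$1" whose targets do not exist.
find_dangling_links() {
    for link in "$1"/*; do
        # -L: the entry is a symlink; ! -e: its target cannot be resolved.
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
        fi
    done
}

# Made-up demo layout with empty placeholder files.
demo=$(mktemp -d)
mkdir "$demo/real"
: >"$demo/real/canonical_transcripts_ensembl_90_GRCh38.gz"
: >"$demo/real/dbNSFP3.5_gene.gz"
ln -sf real/dbNSFP3.5_gene.gz "$demo/dbNSFP_gene.gz"
# Target name is missing the "s" in "transcripts", so this link dangles.
ln -sf real/canonical_transcript_ensembl_90_GRCh38.gz "$demo/canonical_transcripts.txt.gz"

# Reports only the dangling canonical_transcripts.txt.gz link.
find_dangling_links "$demo"
```

Running the helper on the real data directory after creating the links should print nothing if every link resolves.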