Getting and preparing supporting datasets

Andreas Kusalananda Kähäri edited this page Jan 26, 2018 · 15 revisions

Getting the needed datasets for the SweFreq browser: GRCh37 and GRCh38

GRCh37 datasets

GENCODE v19 GRCh37.p13

curl -O ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

Remove the GL sequences by keeping only the lines that start with # (comments) or chr:

zgrep -E '^(#|chr)' gencode.v19.annotation.gtf.gz |
gzip -c >gencode.v19.annotation-filtered.gtf.gz

ln -sf gencode.v19.annotation-filtered.gtf.gz gencode.gtf.gz
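As a rough illustration of what the filter keeps, the sketch below runs the same regular expression over a tiny made-up GTF. The three records and the temporary file names are invented for the example, and it uses gzip -dc piped to grep -E, which should behave like zgrep -E for this purpose.

```shell
# Toy check of the '^(#|chr)' filter; all records below are made up.
tmp=$(mktemp -d)

{
    printf '##description: toy example\n'
    printf 'chr1\tHAVANA\tgene\t11869\t14412\t.\t+\t.\tgene_id "G1";\n'
    printf 'GL000192.1\tENSEMBL\tgene\t100\t200\t.\t+\t.\tgene_id "G2";\n'
    printf 'chrM\tENSEMBL\tgene\t577\t647\t.\t+\t.\tgene_id "G3";\n'
} | gzip -c >"$tmp/toy.gtf.gz"

# Keep comment lines and records on chr* sequences only.
gzip -dc "$tmp/toy.gtf.gz" | grep -E '^(#|chr)' |
gzip -c >"$tmp/toy-filtered.gtf.gz"

# List the distinct sequence names that survived the filter.
seqnames=$(gzip -dc "$tmp/toy-filtered.gtf.gz" |
    grep -v '^#' | cut -f 1 | LC_ALL=C sort -u | paste -s -d ' ' -)
printf '%s\n' "$seqnames"

rm -rf "$tmp"
```

Running the last pipeline against the real gencode.gtf.gz instead of the toy file would be one way to confirm that no GL sequence names remain.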

dbSNP b150 GRCh37.p13

curl -O ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/database/data/organism_data/b150_SNPChrPosOnRef_105.bcp.gz

zcat b150_SNPChrPosOnRef_105.bcp.gz | mawk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
sort --parallel=8 -S 256M -k2,2 -k3,3n | bgzip -c >dbSNP_b150.txt.bgz

tabix -s 2 -b 3 -e 3 dbSNP_b150.txt.bgz

ln -sf dbSNP_b150.txt.bgz dbSNP.txt.bgz
ln -sf dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi
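The awk program in the pipeline above drops rows whose third whitespace-separated field is empty and turns runs of spaces into tabs; sort then groups the rows by column 2 and orders them numerically by column 3, which is the order tabix is told to index (-s 2 -b 3 -e 3). The sketch below shows that behaviour on made-up rows. It assumes plain awk behaves like mawk for this program, adds LC_ALL=C for a reproducible order, and leaves out --parallel and -S, which only affect performance.

```shell
# Toy rows: id, chromosome, position, orientation (made-up values).
# The row "103" has no chromosome or position and should be dropped.
sorted=$(
    printf '%s\n' '101 1 2000 0' '102 2 150 0' '103' '104 1 300 0' '105 X 42 1' |
    awk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
    LC_ALL=C sort -k2,2 -k3,3n
)
printf '%s\n' "$sorted"

# Order of the ids after sorting by chromosome, then numeric position.
ids=$(printf '%s\n' "$sorted" | cut -f 1 | paste -s -d ' ' -)
printf '%s\n' "$ids"
```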

dbNSFP v2.9.3 GRCh37

This is a 13.4 GB Zip file that is really slow to download. Out of it, we only need a single 26 MB file (sigh).

curl -O ftp://dbnsfp:[email protected]/dbNSFPv2.9.3.zip
unzip dbNSFPv2.9.3.zip dbNSFP2.9_gene
gzip -9 dbNSFP2.9_gene
ln -sf dbNSFP2.9_gene.gz dbNSFP_gene.gz

Ensembl canonical transcripts (Ensembl 75, GRCh37.p13)

mysql -BN -h ensembldb.ensembl.org -u anonymous -D homo_sapiens_core_75_37 \
    -e 'SELECT g.stable_id, t.stable_id FROM gene g JOIN transcript t
        ON (g.canonical_transcript_id = t.transcript_id)' |
sort | gzip -9c >canonical_transcripts_ensembl_75_GRCh37.gz

ln -sf canonical_transcripts_ensembl_75_GRCh37.gz canonical_transcripts.txt.gz
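The query should give one line per gene, holding the gene's stable ID and the stable ID of its canonical transcript separated by a tab. A minimal sanity check for such a file could look like the sketch below; check_canonical is a hypothetical helper, and the IDs are made up to stand in for the mysql output.

```shell
# Hypothetical helper: count lines that do not have exactly two
# tab-separated fields or that repeat a gene ID already seen.
check_canonical() {
    gzip -dc "$1" |
    awk -F '\t' 'NF != 2 || seen[$1]++ { bad++ } END { print bad + 0 }'
}

tmp=$(mktemp -d)

# Made-up IDs standing in for the output of the mysql query.
printf 'ENSG_B\tENST_2\nENSG_A\tENST_1\n' |
sort | gzip -9c >"$tmp/good.gz"

# Same, but with one gene listed twice.
printf 'ENSG_A\tENST_1\nENSG_A\tENST_9\n' |
sort | gzip -9c >"$tmp/dup.gz"

good=$(check_canonical "$tmp/good.gz")
dup=$(check_canonical "$tmp/dup.gz")
printf 'good: %s problem line(s), dup: %s problem line(s)\n' "$good" "$dup"

rm -rf "$tmp"
```

A non-zero count for the real file would suggest that the download was cut short or that the query returned something unexpected.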

OMIM

Using the old dataset (still GRCh37?). Updating it requires registration.

GRCh38 datasets

The SweGen data is on GRCh37, but I'm assuming we will have datasets on the newer assembly at some point.

I have prepared the following data in a private user account in the andkaha container (which is the container from which I'm planning to do data loading):

GENCODE v27 GRCh38.p10

curl -O ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz
ln -sf gencode.v27.annotation.gtf.gz gencode.gtf.gz

dbSNP b150 GRCh38.p7

curl -O ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/database/data/organism_data/b150_SNPChrPosOnRef_108.bcp.gz

zcat b150_SNPChrPosOnRef_108.bcp.gz | mawk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
sort --parallel=8 -S 256M -k2,2 -k3,3n | bgzip -c >dbSNP_b150.txt.bgz

tabix -s 2 -b 3 -e 3 dbSNP_b150.txt.bgz

ln -sf dbSNP_b150.txt.bgz dbSNP.txt.bgz
ln -sf dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi

dbNSFP v3.5a GRCh38 (.p2?)

This is a 16.1 GB Zip file that is really slow to download. Out of it, we only need a single 26 MB file (sigh).

curl -O ftp://dbnsfp:[email protected]/dbNSFPv3.5a.zip
unzip dbNSFPv3.5a.zip dbNSFP3.5_gene
gzip -9 dbNSFP3.5_gene
ln -sf dbNSFP3.5_gene.gz dbNSFP_gene.gz

Ensembl canonical transcripts (Ensembl 90, GRCh38.p10)

mysql -BN -h ensembldb.ensembl.org -u anonymous -D homo_sapiens_core_90_38 \
    -e 'SELECT g.stable_id, t.stable_id FROM gene g JOIN transcript t
        ON (g.canonical_transcript_id = t.transcript_id)' |
sort | gzip -9c >canonical_transcripts_ensembl_90_GRCh38.gz

ln -sf canonical_transcripts_ensembl_90_GRCh38.gz canonical_transcripts.txt.gz

OMIM

Using the old dataset (still GRCh37?). Updating it requires registration.

Directory layout for supporting data

data-GRChg37
├── canonical_transcripts.txt.gz -> real/canonical_transcripts_ensembl_75_GRCh37.gz
├── dbNSFP_gene.gz -> real/dbNSFP2.9_gene.gz
├── dbSNP.txt.bgz -> real/dbSNP_b150.txt.bgz
├── dbSNP.txt.bgz.tbi -> real/dbSNP_b150.txt.bgz.tbi
├── gencode.gtf.gz -> real/gencode.v27lift37.annotation.gtf.gz
├── omim_info.txt.gz -> real/omim_info.txt.gz
└── real
    ├── b150_SNPChrPosOnRef_105.bcp.gz
    ├── canonical_transcripts_ensembl_75_GRCh37.gz
    ├── dbNSFP2.9_gene.gz
    ├── dbNSFPv2.9.3.zip
    ├── dbSNP_b150.txt.bgz
    ├── dbSNP_b150.txt.bgz.tbi
    ├── gencode.v27lift37.annotation.gtf.gz
    └── omim_info.txt.gz

data-GRChg38
├── canonical_transcripts.txt.gz -> real/canonical_transcripts_ensembl_90_GRCh38.gz
├── dbNSFP_gene.gz -> real/dbNSFP3.5_gene.gz
├── dbSNP.txt.bgz -> real/dbSNP_b150.txt.bgz
├── dbSNP.txt.bgz.tbi -> real/dbSNP_b150.txt.bgz.tbi
├── gencode.gtf.gz -> real/gencode.v27.annotation.gtf.gz
├── omim_info.txt.gz -> real/omim_info.txt.gz
└── real
    ├── b150_SNPChrPosOnRef_108.bcp.gz
    ├── canonical_transcripts_ensembl_90_GRCh38.gz
    ├── dbNSFP3.5_gene.gz
    ├── dbNSFPv3.5a.zip
    ├── dbSNP_b150.txt.bgz
    ├── dbSNP_b150.txt.bgz.tbi
    ├── gencode.v27.annotation.gtf.gz
    └── omim_info.txt.gz
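
In both trees the downloaded files live under real/ and the version-independent names are relative symbolic links one level up. One way such a layout could be put together is sketched below with empty placeholder files; the data-example directory and the link_real helper are hypothetical and only serve the illustration.

```shell
# Hypothetical helper.
# $1 = data directory, $2 = file under real/, $3 = generic link name
link_real() {
    ln -sf "real/$2" "$1/$3"
}

tmp=$(mktemp -d)
dir=$tmp/data-example
mkdir -p "$dir/real"

# Empty placeholders standing in for the real downloads.
: >"$dir/real/gencode.v27.annotation.gtf.gz"
: >"$dir/real/dbSNP_b150.txt.bgz"
: >"$dir/real/dbSNP_b150.txt.bgz.tbi"

link_real "$dir" gencode.v27.annotation.gtf.gz gencode.gtf.gz
link_real "$dir" dbSNP_b150.txt.bgz dbSNP.txt.bgz
link_real "$dir" dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi

# Show where each generic name points.
for name in gencode.gtf.gz dbSNP.txt.bgz dbSNP.txt.bgz.tbi; do
    printf '%s -> %s\n' "$name" "$(readlink "$dir/$name")"
done

gencode_target=$(readlink "$dir/gencode.gtf.gz")
rm -rf "$tmp"
```

Because the link targets are relative, the whole data directory can be moved or copied without breaking the links.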