Getting and preparing supporting datasets
Andreas Kusalananda Kähäri edited this page Jan 26, 2018 · 15 revisions
curl -O ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
Remove the GL chromosomes (all lines except the ones starting with # (comments) or chr):
zgrep -E '^(#|chr)' gencode.v19.annotation.gtf.gz |
gzip -c >gencode.v19.annotation-filtered.gtf.gz
ln -sf gencode.v19.annotation-filtered.gtf.gz gencode.gtf.gz
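A quick way to confirm that the filter did what was intended is to list the distinct sequence names left in the file. The following is only a sketch and not part of the original procedure: `gtf_seqnames` is a hypothetical helper name, and it assumes that the first tab-separated column of every non-comment line holds the sequence name, as in GTF.

```shell
# Hypothetical helper: print the distinct sequence names (column 1)
# found in a gzipped GTF file, skipping comment lines.
gtf_seqnames () {
    gzip -dc <"$1" |
    awk -F '\t' '!/^#/ { seen[$1] = 1 } END { for (name in seen) print name }' |
    LC_ALL=C sort
}
```

Running `gtf_seqnames gencode.gtf.gz` after the filtering step should then print only names that begin with chr.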
curl -O ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/database/data/organism_data/b150_SNPChrPosOnRef_105.bcp.gz
zcat b150_SNPChrPosOnRef_105.bcp.gz | mawk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
sort --parallel=8 -S 256M -k2,2 -k3,3n | bgzip -c >dbSNP_b150.txt.bgz
tabix -s 2 -b 3 -e 3 dbSNP_b150.txt.bgz
ln -sf dbSNP_b150.txt.bgz dbSNP.txt.bgz
ln -sf dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi
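To see what the mawk filter keeps, it can be run on a few rows. This is a sketch under assumptions: the sample rows are invented for illustration, `filter_snp_rows` is a hypothetical name, and plain `awk` stands in for `mawk` (the program text is unchanged). Since awk splits on runs of whitespace by default, a row with fewer than three whitespace-separated fields has an empty `$3` and is dropped; in the rows that remain, runs of spaces are turned into tabs.

```shell
# Hypothetical wrapper around the filter used above;
# plain awk stands in for mawk here.
filter_snp_rows () {
    awk 'length($3) > 0 { gsub(/ +/, "\t"); print }'
}

# Invented sample rows (SNP id, chromosome, position, orientation).
# The second row has no chromosome or position and is dropped;
# the third row is space-separated and comes out tab-separated.
printf '101\t1\t1000\t0\n102\t\t\t\n104 2 700 0\n' | filter_snp_rows
```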
This is a 13.4 GB zip file that is really slow to download. Out of it, we need a single 26 MB file (sigh).
curl -O ftp://dbnsfp:[email protected]/dbNSFPv2.9.3.zip
unzip dbNSFPv2.9.3.zip dbNSFP2.9_gene
gzip -9 dbNSFP2.9_gene
ln -sf dbNSFP2.9_gene.gz dbNSFP_gene.gz
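Before loading the extracted gene file it can help to look at which columns it actually contains. The helper below is a sketch, assuming the file is tab-separated and that its first line is a header row; `list_columns` is a hypothetical name that does not appear in the original notes.

```shell
# Hypothetical helper: print the numbered column names taken from the
# first line of a gzipped, tab-separated file.
list_columns () {
    gzip -dc <"$1" |
    awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i "\t" $i }'
}
```

For example, `list_columns dbNSFP_gene.gz` prints one line per column, with the column number followed by its name.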
mysql -BN -h ensembldb.ensembl.org -u anonymous -D homo_sapiens_core_75_37 \
-e 'SELECT g.stable_id, t.stable_id FROM gene g JOIN transcript t
ON (g.canonical_transcript_id = t.transcript_id)' |
sort | gzip -9c >canonical_transcripts_ensembl_75_GRCh37.gz
ln -sf canonical_transcripts_ensembl_75_GRCh37.gz canonical_transcripts.txt.gz
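The query above writes two tab-separated columns without a header: the gene stable ID and the stable ID of its canonical transcript. A minimal sketch of looking up one gene in the resulting file follows; `canonical_transcript` is a hypothetical helper name and is not used elsewhere in these notes.

```shell
# Hypothetical helper: print the canonical transcript ID recorded for a
# gene stable ID in a gzipped two-column (gene, transcript) file.
# Usage: canonical_transcript GENE_ID FILE
canonical_transcript () {
    gzip -dc <"$2" | awk -F '\t' -v gene="$1" '$1 == gene { print $2 }'
}
```

For example, `canonical_transcript ENSG00000139618 canonical_transcripts.txt.gz` prints the transcript listed for that gene, or nothing if the gene is not in the file.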
Using old dataset (still GRCh37?). Update requires registration.
The SweGen data is on GRCh37, but I'm assuming we will have datasets on the newer assembly at some point.
I have prepared the following data in a private user account in the andkaha container (which is the container from which I'm planning to do data loading):
curl -O ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz
ln -sf gencode.v27.annotation.gtf.gz gencode.gtf.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/database/data/organism_data/b150_SNPChrPosOnRef_108.bcp.gz
zcat b150_SNPChrPosOnRef_108.bcp.gz | mawk 'length($3) > 0 { gsub(/ +/, "\t"); print }' |
sort --parallel=8 -S 256M -k2,2 -k3,3n | bgzip -c >dbSNP_b150.txt.bgz
tabix -s 2 -b 3 -e 3 dbSNP_b150.txt.bgz
ln -sf dbSNP_b150.txt.bgz dbSNP.txt.bgz
ln -sf dbSNP_b150.txt.bgz.tbi dbSNP.txt.bgz.tbi
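tabix expects its input to be sorted by sequence name and then by position, which is what the `sort -k2,2 -k3,3n` step provides. If the compressed file is ever rebuilt by other means, the ordering can be verified before indexing. The sketch below uses `sort -c`, which only checks the order and writes nothing; `check_snp_sorted` is a hypothetical name. It reads the file with `gzip -dc`, which works because bgzip output is a valid gzip stream.

```shell
# Hypothetical helper: succeed only if the compressed file is already in
# the order produced by "sort -k2,2 -k3,3n" (chromosome, then position).
check_snp_sorted () {
    gzip -dc <"$1" | sort -c -k2,2 -k3,3n
}
```

For example, `check_snp_sorted dbSNP.txt.bgz && echo sorted` prints "sorted" only when the order is intact; otherwise sort reports the first out-of-order line on standard error.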
This is a 16.1 GB zip file that is really slow to download. Out of it, we need a single 26 MB file (sigh).
curl -O ftp://dbnsfp:[email protected]/dbNSFPv3.5a.zip
unzip dbNSFPv3.5a.zip dbNSFP3.5_gene
gzip -9 dbNSFP3.5_gene
ln -sf dbNSFP3.5_gene.gz dbNSFP_gene.gz
mysql -BN -h ensembldb.ensembl.org -u anonymous -D homo_sapiens_core_90_38 \
-e 'SELECT g.stable_id, t.stable_id FROM gene g JOIN transcript t
ON (g.canonical_transcript_id = t.transcript_id)' |
sort | gzip -9c >canonical_transcripts_ensembl_90_GRCh38.gz
ln -sf canonical_transcripts_ensembl_90_GRCh38.gz canonical_transcripts.txt.gz
Using old dataset (still GRCh37?). Update requires registration.
data-GRChg37
├── canonical_transcripts.txt.gz -> real/canonical_transcripts_ensembl_75_GRCh37.gz
├── dbNSFP_gene.gz -> real/dbNSFP2.9_gene.gz
├── dbSNP.txt.bgz -> real/dbSNP_b150.txt.bgz
├── dbSNP.txt.bgz.tbi -> real/dbSNP_b150.txt.bgz.tbi
├── gencode.gtf.gz -> real/gencode.v27lift37.annotation.gtf.gz
├── omim_info.txt.gz -> real/omim_info.txt.gz
└── real
    ├── b150_SNPChrPosOnRef_105.bcp.gz
    ├── canonical_transcripts_ensembl_75_GRCh37.gz
    ├── dbNSFP2.9_gene.gz
    ├── dbNSFPv2.9.3.zip
    ├── dbSNP_b150.txt.bgz
    ├── dbSNP_b150.txt.bgz.tbi
    ├── gencode.v27lift37.annotation.gtf.gz
    └── omim_info.txt.gz
data-GRChg38
├── canonical_transcripts.txt.gz -> real/canonical_transcripts_ensembl_90_GRCh38.gz
├── dbNSFP_gene.gz -> real/dbNSFP3.5_gene.gz
├── dbSNP.txt.bgz -> real/dbSNP_b150.txt.bgz
├── dbSNP.txt.bgz.tbi -> real/dbSNP_b150.txt.bgz.tbi
├── gencode.gtf.gz -> real/gencode.v27.annotation.gtf.gz
├── omim_info.txt.gz -> real/omim_info.txt.gz
└── real
    ├── b150_SNPChrPosOnRef_108.bcp.gz
    ├── canonical_transcripts_ensembl_90_GRCh38.gz
    ├── dbNSFP3.5_gene.gz
    ├── dbNSFPv3.5a.zip
    ├── dbSNP_b150.txt.bgz
    ├── dbSNP_b150.txt.bgz.tbi
    ├── gencode.v27.annotation.gtf.gz
    └── omim_info.txt.gz
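In this layout the downloaded and derived files live in real/ and are exposed under stable names through symbolic links, so a mistyped link target leaves a dangling link that only shows up when the data is loaded. A minimal sketch of a check for that follows; `broken_links` is a hypothetical helper name, and it only inspects the entries directly inside the given directory.

```shell
# Hypothetical helper: print every symbolic link directly inside a
# directory whose target does not exist.
broken_links () {
    for entry in "$1"/*; do
        # -L is true for a symlink; -e follows the link and is false
        # when the target is missing.
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}
```

Running `broken_links data-GRChg37` and `broken_links data-GRChg38` should print nothing when every link resolves.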