Building the b37 human decoy reference genome
Luca Santuari edited this page Nov 19, 2023 · 12 revisions
The 30x down-sampled BAM file of the NA12878 GIAB pilot sample does not have a reference genome associated with it.
This script is used to create it.
The following steps are required to gather the necessary sequences:
- Download the FASTA and FAI files of the reference genome human_g1k_v37_decoy. Leave the password field blank to download from the FTP server;
- Remove the sequences NC_007605 and hs37d5 from the human_g1k_v37_decoy genome with this script, resulting in a file human_g1k_v37_decoy_filtered.fasta;
- Keep only the chromosome names in the FASTA headers:
  awk '{print $1}' human_g1k_v37_decoy_filtered.fasta > human_g1k_v37_decoy_filtered_short_header.fasta
- Download the NIST ERCC sequences in FASTA format. The file is called SRM2374_Sequence_v1.FASTA.
- seqtk is used to extract the sequence with header 080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev from the file ercc_and_human_rRNA_and_tagdust.fa as follows:
  wget https://raw.githubusercontent.com/Population-Transcriptomics/C1-CAGE-preview/master/ercc_and_human_rRNA_and_tagdust.fa
  echo '080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev' > seq.list
  seqtk subseq -l 60 ercc_and_human_rRNA_and_tagdust.fa seq.list > 080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev.fasta
- Concatenate the files human_g1k_v37_decoy_filtered_short_header.fasta, SRM2374_Sequence_v1.FASTA and 080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev.fasta into a single file named b37_human_decoy_reference.dos.fasta.
- Convert the file to UNIX format:
  dos2unix -n b37_human_decoy_reference.dos.fasta b37_human_decoy_reference.fasta
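The script that removes NC_007605 and hs37d5 is linked from the list above but not reproduced on this page. As a rough illustration only, and not the linked script itself, the filtering step could be sketched with plain awk. The function names `filter_fasta`, `fasta_names` and `count_cr` are hypothetical helpers introduced here, and the input file name `human_g1k_v37_decoy.fasta` in the usage comments is an assumption about how the downloaded file is named after decompression.

```shell
# Hypothetical helpers, not part of the original workflow.

# Drop every FASTA record whose name (first word of the header, without '>')
# is given on the command line.
# Usage: filter_fasta IN.fasta OUT.fasta NAME [NAME...]
filter_fasta() {
    in_fasta="$1"
    out_fasta="$2"
    shift 2
    awk -v names="$*" '
        BEGIN { n = split(names, a, " "); for (i = 1; i <= n; i++) drop[a[i]] = 1 }
        /^>/  { name = substr($1, 2); keep = !(name in drop) }
        keep  { print }
    ' "$in_fasta" > "$out_fasta"
}

# Print the sequence names of a FASTA file, one per line.
fasta_names() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}

# Count carriage-return bytes in a file (a quick check for DOS line endings).
count_cr() {
    tr -cd '\r' < "$1" | wc -c
}

# Possible usage with the file names from the steps above
# (the input name human_g1k_v37_decoy.fasta is assumed):
#   filter_fasta human_g1k_v37_decoy.fasta human_g1k_v37_decoy_filtered.fasta NC_007605 hs37d5
#   fasta_names b37_human_decoy_reference.fasta
#   count_cr b37_human_decoy_reference.fasta
```

Listing the names of the final reference with `fasta_names` is a simple way to confirm that the two removed sequences are gone and that the ERCC and vector sequences were appended. `count_cr` should print 0 on the converted file if no carriage returns remain.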