Building the b37 human decoy reference genome

The 30x down-sampled BAM file of the NA12878 GIAB pilot sample does not have a reference genome associated with it.

This script is used to create it.

The following steps are required to gather the necessary sequences:

Download the FASTA and FAI files of the reference genome human_g1k_v37_decoy. When the FTP server asks for a password, leave the field blank.
Remove the sequences NC_007605 and hs37d5 from the human_g1k_v37_decoy genome with this script. The result is a file named human_g1k_v37_decoy_filtered.fasta.
Keep only the chromosome names in the FASTA headers: awk '{print $1}' human_g1k_v37_decoy_filtered.fasta > human_g1k_v37_decoy_filtered_short_header.fasta
Download the NIST ERCC sequences in FASTA format. The file is called SRM2374_Sequence_v1.FASTA.
Use seqtk to extract the sequence with the header 080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev from the file ercc_and_human_rRNA_and_tagdust.fa as follows:

wget https://raw.githubusercontent.com/Population-Transcriptomics/C1-CAGE-preview/master/ercc_and_human_rRNA_and_tagdust.fa

echo '080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev' > seq.list

seqtk subseq -l 60 ercc_and_human_rRNA_and_tagdust.fa seq.list > 080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev.fasta

Concatenate the files human_g1k_v37_decoy_filtered_short_header.fasta, SRM2374_Sequence_v1.FASTA and 080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev.fasta into a single file named b37_human_decoy_reference.dos.fasta.
Convert the file to UNIX format: dos2unix -n b37_human_decoy_reference.dos.fasta b37_human_decoy_reference.fasta
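
The filtering and concatenation steps above can be sketched in plain shell. This is only an illustration under stated assumptions: filter_fasta and concat_unix are hypothetical helper names, the awk filter is not the script referred to in the steps, and tr -d '\r' stands in for dos2unix. The input file name human_g1k_v37_decoy.fasta in the usage comments is also an assumption, since the steps do not give the downloaded file's name.

```shell
#!/bin/sh
# Hypothetical helper: print a FASTA file without the named records.
# $1 = input FASTA, remaining arguments = sequence names to drop.
filter_fasta() {
    fasta="$1"
    shift
    awk -v drop="$*" '
        BEGIN {
            # Build a lookup table of the sequence names to skip.
            n = split(drop, names, " ")
            for (i = 1; i <= n; i++) skip[names[i]] = 1
        }
        # On each header, take the first word (without ">") as the name
        # and decide whether the record that follows is kept.
        /^>/ { name = substr($1, 2); keep = !(name in skip) }
        keep { print }
    ' "$fasta"
}

# Hypothetical helper: concatenate FASTA files and strip carriage returns.
# tr -d "\r" is a stand-in for dos2unix, not the command used in the steps.
concat_unix() {
    cat "$@" | tr -d '\r'
}

# Possible usage (file names partly assumed, see the note above):
#   filter_fasta human_g1k_v37_decoy.fasta NC_007605 hs37d5 \
#       > human_g1k_v37_decoy_filtered.fasta
#   concat_unix human_g1k_v37_decoy_filtered_short_header.fasta \
#       SRM2374_Sequence_v1.FASTA \
#       080418_Consensus_Vector_Sequence_NIST_SEQUENCING_ASSEMBLY_noRestrict_rev.fasta \
#       > b37_human_decoy_reference.fasta
```

Counting the headers of the result with grep -c '^>' b37_human_decoy_reference.fasta is a quick way to check that every expected sequence made it into the final file.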
