Standard variant callers usually miss mutations in repetitive regions, because low mapping quality makes variant calls there unreliable. We introduce Armadillo, a pipeline that stacks all copies of repetitive genes to make mutations in such regions visible.
- BWA (built with v.0.7.17)
- Samtools (built with v.1.9)
- gfClient (built with v.35)
- Python3 (built with v.3.7)
Use git to download Armadillo:
git clone https://github.com/pbousquets/armadillo
Check that the dependencies are installed:
cd armadillo
./configure # run as sudo to install the Python packages and create a global alias
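As a quick, independent sanity check, the sketch below reports which of the external tools cannot be found on PATH. It is not part of Armadillo. The executable names are assumptions taken from the dependency list, plus gfServer, which is needed later to serve the reference genome.

```python
import shutil

# Executable names are assumptions based on the dependency list;
# adjust them if your installation uses different names.
REQUIRED_TOOLS = ["bwa", "samtools", "gfClient", "gfServer", "python3"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only confirms that the executables exist. It does not check their versions.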
Before running Armadillo, start a gfServer instance on a free port so that gfClient can query it:
nohup gfServer start localhost PORT /path/to/reference_genome.2bit >/dev/null 2>&1 &
The port stays open until the task is killed or the computer is shut down.
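Because the server runs in the background with its output discarded, it can be useful to confirm that something is actually listening on the chosen port before moving on. The following is a minimal sketch using only the Python standard library. The function name `port_is_open` is hypothetical, and the check only tells you that a TCP service accepts connections, not that it is gfServer.

```python
import socket


def port_is_open(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it with the same port number that was passed to gfServer, for example `port_is_open(9008)` if 9008 was the port chosen.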
Before running Armadillo, the regions of interest must be analysed to keep only those that are repetitive and to obtain the coordinates of their copies. Armadillo does this when given a reference genome, a BED-formatted list of regions of interest and the port previously opened for gfClient:
armadillo data-prep -i /path/to/rois.bed -g /path/to/reference_genome.fa -p $PORT [-m min_len] [-o output_dir]
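A malformed BED file is a common reason for this step to fail, so it can help to inspect the regions first. The sketch below is not Armadillo's own code. It parses BED lines and, on the assumption that `-m min_len` refers to a minimum region length in bases, shows which regions would pass such a threshold. The function name `read_bed_regions` is hypothetical.

```python
def read_bed_regions(lines, min_len=0):
    """Parse BED lines into (chrom, start, end) tuples of at least min_len bases."""
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line has fewer than 3 columns: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"Invalid BED interval: {line!r}")
        # BED intervals are 0-based and half-open, so the length is end - start.
        if end - start >= min_len:
            regions.append((chrom, start, end))
    return regions
```

To run it on a file, pass an open file handle, for example `read_bed_regions(open("/path/to/rois.bed"), min_len=100)`.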
Armadillo can be run from a configuration file. The config-file option copies a configuration file into the current working directory. Edit any parameters you need, then pass the file to armadillo run:
armadillo config-file
armadillo run configuration_file.txt
Options can also be passed directly on the command line:
armadillo run -n CASE -C control.bam -T tumor.bam --armadillo_data /path/to/armadillo_data [options]
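When several tumor/control pairs need to be processed, the command can be assembled programmatically. The sketch below builds the argument list from the four options shown above and nothing else. The helper name `build_run_command` is hypothetical, and any additional options would have to be appended by the caller.

```python
import shlex


def build_run_command(case, control_bam, tumor_bam, armadillo_data):
    """Build the argument list for `armadillo run` from the options shown above."""
    return [
        "armadillo", "run",
        "-n", case,
        "-C", control_bam,
        "-T", tumor_bam,
        "--armadillo_data", armadillo_data,
    ]


cmd = build_run_command("CASE", "control.bam", "tumor.bam", "/path/to/armadillo_data")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
# To execute it instead of printing: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.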
If using the dockerized version:
We recommend adding "--net='host' -v /path/to/reference_genome.2bit:/path/to/reference_genome.2bit". This allows the gfClient queries to work properly from within the container. The complete command we recommend is:
docker run --rm -it --net='host' -v /path/to/reference_genome.2bit:/path/to/reference_genome.2bit -v /path/to/genomes:/path/to/genomes -v /home:/path/to/output armadillo [arguments]
The program writes several output files:
- The stacked minibams of both tumor and normal samples, generated in the first step of the pipeline and used for variant calling.
- CASE_candidates.vcf is written first. It is an intermediate file that includes the read names supporting each mutation, which remove_dups.py then uses to remove duplicates.
- CASE_final.vcf is the final VCF.
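To take a quick look at the calls in the final VCF without extra dependencies, a few lines of Python are enough. This sketch assumes the file follows the standard VCF column order (CHROM, POS, ID, REF, ALT, ...) with tab-separated fields. The function name `read_vcf_records` is hypothetical.

```python
def read_vcf_records(lines):
    """Yield (chrom, pos, ref, alt) for each data line of a VCF."""
    for line in lines:
        line = line.rstrip("\n")
        # Header and metadata lines start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Standard VCF column order is assumed: CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alt = fields[:5]
        yield chrom, int(pos), ref, alt
```

For example, `with open("CASE_final.vcf") as handle: records = list(read_vcf_records(handle))` collects every call into a list.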
- Pablo Bousquets Muñoz
- Ander Díaz Navarro
- Xose Antón Suárez Puente
Bousquets-Muñoz, P., Díaz-Navarro, A., Nadeu, F. et al. PanCancer analysis of somatic mutations in repetitive regions reveals recurrent mutations in snRNA U2. npj Genom. Med. 7, 19 (2022). https://doi.org/10.1038/s41525-022-00292-2