*News: 2023/03/01: we uploaded the Uniqness_map and source files (ref.fa for hg38) to Zenodo for users to download.
conda install aquila
(Please ensure channels are properly setup for bioconda before installing)
Aquila_step1 --help
Aquila_step2 --help
Aquila_clean --help
Aquila_step1_multilibs --help
Aquila_assembly_based_variants_call --help
Aquila_phasing_all_variants --help
Aquila_step0_sortbam --help
Aquila_step0_sortbam_multilibs --help
# You can also check the corresponding scripts below for more details
#Download the reference file (hg38)
wget https://zenodo.org/record/7689958/files/source.tar.gz
tar -xvf source.tar.gz
rm source.tar.gz
#Download hg38 "Uniqness_map"
wget https://zenodo.org/record/7689958/files/Uniqness_map_hg38.tar.gz
tar -xvf Uniqness_map_hg38.tar.gz
rm Uniqness_map_hg38.tar.gz
Aquila requires Python3 (with numpy, pysam, sortedcontainers, and scipy), SAMtools, and minimap2. To run these programs by typing their names on the command line, their executables must be in one of the directories listed in the PATH environment variable (set, for example, in ".bashrc").
Alternatively, run "./install.sh" to check whether they are available and install any that are missing. Make sure "python3", "conda", and "wget" are installed first.
git clone https://github.com/maiziex/Aquila.git
cd Aquila
chmod +x install.sh
./install.sh
After running "./install.sh", a folder "source" will be downloaded. It includes the human GRCh38 reference fasta file. You can also download the reference yourself from the corresponding official website.
Add "Aquila/bin" to your PATH in the ".bashrc" file, and source the ".bashrc" file.
Or just use the full paths of "Aquila_step1.py" and "Aquila_step2.py".
Aquila uses 23 for "chrX" and cannot handle "chrY" in the current version.
Aquila/bin/Aquila_step1.py --bam_file possorted_bam.bam --vcf_file S12878_freebayes.vcf --sample_name S12878 --out_dir Assembly_results_S12878 --uniq_map_dir Aquila/Uniqness_map_hg38
--bam_file: "possorted_bam.bam" is a BAM file generated by a barcode-aware aligner such as "Longranger align". To see how to generate the BAM file, check here.
--vcf_file: "S12878_freebayes.vcf" is a VCF file generated by a variant caller such as "FreeBayes". To see how to generate the VCF file, check here. *** We now have a new version of step1 that uses a 1000 Genomes VCF as the input VCF file (please check here); Aquila will use common variants from 1000G to help partition linked-reads. In a later version, Aquila will use a graph genome reference to replace the conventional linear reference.
--sample_name: "S12878" is a sample name that you define.
--uniq_map_dir: "Aquila/Uniqness_map_hg38" is the Uniqness_map directory for GRCh38, which you can download with "./install.sh".
--mbq_threshold: default = 13. This is the phred-scaled quality score for the assertion made in ALT.
--boundary: default = 50000 (50kb), It is the boundary for long fragments with the same barcode.
--out_dir: default = ./Asssembly_results. You can define your own folder, for example "Assembly_results_S12878".
--block_threshold: default = 200000 (200kb)
--block_len_use: default = 100000 (100kb)
--num_threads: default = 8. We recommend keeping this setting unless a large-memory node is available (about 2x the memory suggested in the tables below), in which case you can try "--num_threads 12".
--num_threads_for_samtools_sort: default = 20. This setting is used for "samtools sort".
--chr_start --chr_end: use these if you only want to assemble some chromosomes or a single chromosome. For example, "--chr_start 1 --chr_end 5" will assemble chromosomes 1,2,3,4,5, and "--chr_start 2 --chr_end 2" will only assemble chromosome 2. (*Note: use 23 for "chrX")
When using "--chr_start --chr_end", it is recommended (not required) to run the command below first to save time in step1. (This step is especially recommended if your computing node is unreliable and may fail often.)
python Aquila/bin/Aquila_step0_sortbam.py --bam_file possorted_bam.bam --out_dir Assembly_results_S12878 --num_threads_for_samtools_sort 30
To run Step 1 for all chromosomes in parallel on multiple (23) nodes, check multiple_nodes.sh for guidance.
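As a rough illustration of the per-chromosome parallel run (this is not the content of multiple_nodes.sh, which is not reproduced here), the Python sketch below builds one Step 1 command line per chromosome using only the options documented above. The function name `step1_commands` is hypothetical, and how each command is submitted to a node depends on your scheduler.

```python
import shlex


def step1_commands(bam, vcf, sample, out_dir, uniq_map_dir, chr_start=1, chr_end=23):
    """Return one Aquila_step1 command string per chromosome.

    Chromosome numbers follow the convention in this README: 23 stands for chrX.
    This helper is an illustration only and is not part of Aquila.
    """
    if not 1 <= chr_start <= chr_end <= 23:
        raise ValueError("expected 1 <= chr_start <= chr_end <= 23")
    commands = []
    for chrom in range(chr_start, chr_end + 1):
        args = [
            "Aquila/bin/Aquila_step1.py",
            "--bam_file", bam,
            "--vcf_file", vcf,
            "--sample_name", sample,
            "--out_dir", out_dir,
            "--uniq_map_dir", uniq_map_dir,
            # Restrict this run to a single chromosome.
            "--chr_start", str(chrom),
            "--chr_end", str(chrom),
        ]
        # Quote each argument so paths with spaces survive a shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


for command in step1_commands(
    "possorted_bam.bam",
    "S12878_freebayes.vcf",
    "S12878",
    "Assembly_results_S12878",
    "Aquila/Uniqness_map_hg38",
    chr_start=1,
    chr_end=2,
):
    print(command)
```

Each printed line can be placed in its own job script. All runs share the same "--out_dir", as in the single-node example.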
Coverage | Memory | Time for chr1 on a single node ([D-]HH:MM:SS) |
---|---|---|
60X | 100GB | 13:12:35 |
90X | 150GB | 1-01:08:38 |

Coverage | Memory | Time for WGS on a single node ([D-]HH:MM:SS) |
---|---|---|
60X | 450GB | 2-14:27:37 |
90X | 500GB | 3-12:34:12 |
Aquila/bin/Aquila_step2.py --out_dir Assembly_results_S12878 --num_threads 30 --reference Aquila/source/ref.fa
--reference: "Aquila/source/ref.fa" is the reference fasta file, which you can download with "./install.sh".
--out_dir: default = ./Asssembly_results, make sure it's the same as "--out_dir" from Step1 if you want to define your own output directory name.
--num_threads: default = 30, this determines the number of files assembled simultaneously by SPAdes.
--num_threads_spades: default = 5, this is the "-t" for SPAdes.
--block_len_use: default = 100000 (100kb)
--chr_start --chr_end: use these if you only want to assemble some chromosomes or a single chromosome. For example: "--chr_start 1 --chr_end 2"
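Since "--num_threads" sets how many files are assembled at once and "--num_threads_spades" is the "-t" given to each SPAdes run, the number of SPAdes threads requested at peak is roughly their product. The sketch below only does that arithmetic. Both function names are hypothetical, and it assumes every SPAdes process actually uses all the threads it is given.

```python
def peak_spades_threads(num_threads=30, num_threads_spades=5):
    """Upper bound on SPAdes threads requested at once (defaults from this README)."""
    return num_threads * num_threads_spades


def max_parallel_assemblies(available_cores, num_threads_spades=5):
    """Largest --num_threads that does not oversubscribe the given cores.

    Assumption for illustration: one core per SPAdes thread.
    """
    if available_cores < 1 or num_threads_spades < 1:
        raise ValueError("core and thread counts must be positive")
    return max(1, available_cores // num_threads_spades)


print(peak_spades_threads())          # defaults: 30 * 5
print(max_parallel_assemblies(64))    # 64 cores with -t 5
```

Oversubscribing cores can still be acceptable in practice, so treat the result as a starting point rather than a limit.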
Coverage | Memory | Time for chr1 on a single node ([D-]HH:MM:SS) | --num_threads | --num_threads_spades |
---|---|---|---|---|
60X | 100GB | 09:50:43 | 30 | 20 |
90X | 100GB | 13:25:08 | 40 | 20 |

Coverage | Memory | Time for WGS on a single node ([D-]HH:MM:SS) | --num_threads | --num_threads_spades |
---|---|---|---|---|
60X | 100GB | 3-12:16:27 | 40 | 20 |
90X | 100GB | 4-15:00:00 | 40 | 20 |
Assembly_results_S12878/Assembly_Contigs_files: Aquila_contig.fasta and Aquila_Contig_chr*.fasta
Assembly_results_S12878
|
|-H5_for_molecules (Aquila_step1)
|   └-S12878_chr*_sorted.h5 --> (Fragment files for each chromosome including barcode, variants annotation (0: ref allele; 1: alt allele), coordinates for each fragment)
|
|-HighConf_file (Aquila_step1)
|   └-chr*_global_track.p --> (Pickle file for saving coordinates of high-confidence boundary points)
|
|-results_phased_probmodel (Aquila_step1)
|   └-chr*.phased_final --> (Phased fragment files)
|
|-phase_blocks_cut_highconf (Aquila_step1)
|
|-sorted_bam (Aquila_step1)
|   |-finish_bam.txt --> (generated once "sorted_bam.bam" is completed)
|   └-sorted_bam.bam --> (bam file sorted by read name)
|
|-Raw_fastqs (Aquila_step1)
|   └-fastq_by_Chr_* --> (fastq file for each chromosome)
|
|-ref_dir (Aquila_step2)
|
|-Local_Assembly_by_chunks (Aquila_step1 + Aquila_step2)
|   └-chr*_files_cutPBHC
|       |-fastq_by_*_*_hp1.fastq --> (fastq file for a small phased chunk of haplotype 1)
|       |-fastq_by_*_*_hp2.fastq --> (fastq file for a small phased chunk of haplotype 2)
|       |-fastq_by_*_*_hp1_spades_assembly --> (minicontigs: assembly results for the small chunk of haplotype 1)
|       └-fastq_by_*_*_hp2_spades_assembly --> (minicontigs: assembly results for the small chunk of haplotype 2)
|
└-Assembly_Contigs_files (Aquila_step2)
    |-Aquila_cutPBHC_minicontig_chr*.fasta --> (final minicontigs for each chromosome)
    |-Aquila_Contig_chr*.fasta --> (final contigs for each chromosome)
    └-Aquila_contig.fasta --> (final contigs for WGS)
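The directory tree above can be checked programmatically. The following is a small sketch, not part of Aquila, that tests for the "finish_bam.txt" marker and lists the per-chromosome contig files. The helper names `sorted_bam_ready` and `per_chromosome_contigs` are hypothetical, and the paths are taken from the tree as shown.

```python
import glob
import os


def sorted_bam_ready(out_dir):
    """True if sorted_bam/finish_bam.txt exists, i.e. sorted_bam.bam is complete."""
    return os.path.isfile(os.path.join(out_dir, "sorted_bam", "finish_bam.txt"))


def per_chromosome_contigs(out_dir):
    """Sorted list of files matching Assembly_Contigs_files/Aquila_Contig_chr*.fasta.

    Note: the pattern also matches any per-haplotype files such as
    Aquila_Contig_chr*_hp1.fasta if they are stored in the same folder.
    """
    pattern = os.path.join(out_dir, "Assembly_Contigs_files", "Aquila_Contig_chr*.fasta")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    out_dir = "Assembly_results_S12878"
    print("sorted BAM finished:", sorted_bam_ready(out_dir))
    for path in per_chromosome_contigs(out_dir):
        print(path)
```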
Aquila outputs an overall contig file “Aquila_Contig_chr*.fasta” for each chromosome, and one contig file for each haplotype: Aquila_Contig_chr*_hp1.fasta and Aquila_Contig_chr*_hp2.fasta. For each contig, the header, for instance “>36_PS39049620:39149620_hp1”, includes the contig number “36”, the phase block start coordinate “39049620”, the phase block end coordinate “39149620”, and the haplotype number “1”. Within the same phase block, the assignment of “hp1” and “hp2” to the maternal and paternal haplotypes is arbitrary. For some contigs from large phase blocks, the headers are much longer and more complex, for instance “>56432_PS176969599:181582362_hp1_ merge177969599:178064599_hp1-177869599:177969599_hp1”. “56432” (the number before “_PS”) denotes the contig number, “176969599” denotes the start coordinate of the final big phase block, “181582362” denotes the end coordinate of the final big phase block, and “hp1” denotes haplotype “1”. “177969599:178064599_hp1” and “177869599:177969599_hp1” mean that this contig is concatenated from minicontigs in one small chunk (start coordinate: 177969599, end coordinate: 178064599, haplotype: 1) and another small chunk (start coordinate: 177869599, end coordinate: 177969599, haplotype: 1).
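The header layout described above can be read with a short regular expression. The sketch below is illustrative rather than Aquila's own code: `parse_contig_header` and `ContigHeader` are hypothetical names, and only the leading `<number>_PS<start>:<end>_hp<n>` part is parsed, so any trailing "merge" suffix is ignored.

```python
import re
from typing import NamedTuple

# Leading part of an Aquila contig header: <number>_PS<start>:<end>_hp<1|2>
HEADER_RE = re.compile(r"^>?(\d+)_PS(\d+):(\d+)_hp([12])")


class ContigHeader(NamedTuple):
    contig_number: int
    block_start: int
    block_end: int
    haplotype: int


def parse_contig_header(header: str) -> ContigHeader:
    """Parse the leading fields of a contig header; the '>' prefix is optional."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized contig header: {header!r}")
    return ContigHeader(*(int(group) for group in match.groups()))


info = parse_contig_header(">36_PS39049620:39149620_hp1")
print(info.contig_number, info.block_start, info.block_end, info.haplotype)
# prints: 36 39049620 39149620 1
```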
- Aquila outputs all raw contigs, even those <1kb. For some other downstream analyses (e.g. to obtain QV from Merqury), it is necessary to filter out small contigs.
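One way to drop contigs shorter than 1 kb before downstream analysis is sketched below, using only the Python standard library. The function names and the 1000 bp default are assumptions for illustration. Aquila itself does not ship this filter as far as this README states.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(in_path, out_path, min_len=1000):
    """Write contigs with length >= min_len to out_path; return how many were kept."""
    kept = 0
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if len(seq) >= min_len:
                out.write(f">{header}\n{seq}\n")
                kept += 1
    return kept


if __name__ == "__main__":
    n = filter_contigs("Aquila_contig.fasta", "Aquila_contig.min1kb.fasta")
    print(f"kept {n} contigs of at least 1000 bp")
```

The output file name "Aquila_contig.min1kb.fasta" is only an example. Sequences are written on a single line per contig.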
If your hard drive storage is limited (Aquila generates many intermediate files during local assembly), it is suggested to quickly clean up some data by running Aquila_clean.py after you get all your contig files in Assembly_Contigs_files. Or you can keep them for further analysis (check the output directory tree above for details).
Aquila/bin/Aquila_clean.py --assembly_dir Assembly_results_S12878
For example, you can use Assembly_results_S12878 as the input directory to generate a VCF file that includes SNPs, small Indels, and SVs, along with the phased profile of all of them.
Please check Assembly_based_variants_call_and_phasing for details.
1. Download hg19 reference from 10x Genomics website
2. Download the hg19 "Uniqness_map"
wget https://zenodo.org/record/7689958/files/Uniqness_map_hg19.tar.gz
If you want to run Aquila for other diploid species with high-quality reference genomes, you need to generate a Uniqness_map for Aquila: check the details of hoffmanMappability to get the corresponding "k100.umap.bed.gz", then run Aquila/bin/Get_uniqnessmap_for_Aquila.py to get the final Uniqness_map folder to run Aquila.
Or you can use our "Aquila_uniqmap" to generate the Uniqness_map folder to run Aquila; check How_to_get_Umap for details.
Aquila/bin/Aquila_step1_multilibs.py --bam_file_list ./S24385_Lysis_2/Longranger_align_bam/S24385_lysis_2/outs/possorted_bam.bam,./S24385_Lysis_2H/Longranger_align_bam/S24385_lysis_2H/outs/possorted_bam.bam --vcf_file_list ./S24385_lysis_2/Freebayes_results/S24385_lysis_2_grch38_ref_freebayes.vcf,./S24385_lysis_2H/Freebayes_results/S24385_lysis_2H_grch38_ref_freebayes.vcf --sample_name_list S24385_lysis_2,S24385_lysis_2H --out_dir Assembly_results_merged --uniq_map_dir Aquila/Uniqness_map_hg38
--bam_file_list: "possorted_bam.bam" is a BAM file generated by a barcode-aware aligner such as "Longranger align". BAM files are separated by commas (",").
--vcf_file_list: "S12878_freebayes.vcf" is a VCF file generated by a variant caller such as "FreeBayes". VCF files are separated by commas (",").
--sample_name_list: S24385_lysis_2,S24385_lysis_2H are sample names that you define. Sample names are separated by commas (",").
--uniq_map_dir: "Aquila/Uniqness_map_hg38" is the Uniqness_map directory, which you can download with "./install.sh".
--out_dir: default = ./Asssembly_results
--block_threshold: default = 200000 (200kb)
--block_len_use: default = 100000 (100kb)
--num_threads: default = 8. We recommend keeping this setting unless a large-memory node is available (about 2x the memory suggested in the tables above), in which case you can try "--num_threads 12".
--num_threads_for_samtools_sort: default = 20. This setting is used for "samtools sort".
--chr_start --chr_end: use these if you only want to assemble some chromosomes or a single chromosome. For example, "--chr_start 1 --chr_end 5" will assemble chromosomes 1,2,3,4,5, and "--chr_start 2 --chr_end 2" will only assemble chromosome 2. (*Note: use 23 for "chrX") When using "--chr_start --chr_end", it is recommended (not required) to run the command below first to save time in step1.
python Aquila/bin/Aquila_step0_sortbam_multilibs.py --bam_file_list ./S24385_Lysis_2/Longranger_align_bam/S24385_lysis_2/outs/possorted_bam.bam,./S24385_Lysis_2H/Longranger_align_bam/S24385_lysis_2H/outs/possorted_bam.bam --out_dir Assembly_results_merged --num_threads_for_samtools_sort 10 --sample_name_list S24385_lysis_2,S24385_lysis_2H
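Because the multi-library options take comma-separated lists, it is easy to mismatch the order of BAM files, VCF files, and sample names when typing them by hand. The sketch below builds the three list arguments from one record per library. The helper name `multilibs_args` is hypothetical, and it assumes the lists are expected in the same library order, as in the example command above.

```python
def multilibs_args(libraries):
    """Build list-style arguments for Aquila_step1_multilibs.py.

    libraries: iterable of (sample_name, bam_path, vcf_path) tuples,
    one per library. Returns a flat list of option strings.
    """
    libraries = list(libraries)
    if not libraries:
        raise ValueError("at least one library is required")
    for fields in libraries:
        if len(fields) != 3:
            raise ValueError("each library needs (sample_name, bam_path, vcf_path)")
        if any("," in field for field in fields):
            # A comma inside a value would be read as a list separator.
            raise ValueError(f"commas are not allowed in values: {fields!r}")
    names, bams, vcfs = zip(*libraries)
    return [
        "--bam_file_list", ",".join(bams),
        "--vcf_file_list", ",".join(vcfs),
        "--sample_name_list", ",".join(names),
    ]


args = multilibs_args([
    ("S24385_lysis_2", "lib1/possorted_bam.bam", "lib1/freebayes.vcf"),
    ("S24385_lysis_2H", "lib2/possorted_bam.bam", "lib2/freebayes.vcf"),
])
print(" ".join(args))
```

The file paths in the example are placeholders. Substitute the real paths for each library.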
Aquila/bin/Aquila_step2.py --out_dir Assembly_results_merged --num_threads 30 --reference Aquila/source/ref.fa
--reference: "Aquila/source/ref.fa" is the reference fasta file, which you can download with "./install.sh".
--out_dir: default = ./Asssembly_results, make sure it's the same as "--out_dir" from step1 if you want to define your own output directory name.
--num_threads: default = 20
--block_len_use: default = 100000 (100kb)
--chr_start --chr_end: use these if you only want to assemble some chromosomes or a single chromosome.