PAVC (PAV Classifier) identifies presence/absence variations (PAVs) and writes the results in VCF format. It builds on the output of SyRI, an accurate structural variation detection tool. With PAVC you can classify variants accurately and obtain VCF files that are convenient for downstream analyses such as graph pan-genome construction, GWAS, and population genetic analysis.
- R
- R packages - data.table and Biostrings
install.packages("data.table")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")
- bcftools
First, download PAVC:
git clone https://github.com/Weihankk/PAVC.git
- Suppose you have two genomes prepared for SV calling and want to classify the resulting SVs into PAVs.
  - refgenome: Reference genome
  - qrygenome: Query genome
- Call SVs with SyRI.
nucmer --maxmatch -l 50 -c 100 -t 10 refgenome qrygenome
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta
show-coords -THrd out.filtered.delta > out.filtered.coords
syri -c out.filtered.coords -d out.filtered.delta --allow-offset 0 -r refgenome -q qrygenome
syri.out is the raw output of SyRI; this file is the input for PAV classification.
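Before classifying, it can help to check which annotation types syri.out contains. The following is a minimal sketch, not part of PAVC, assuming the tab-separated 12-column layout described in the SyRI documentation (reference chromosome/start/end, reference sequence, query sequence, query chromosome/start/end, unique ID, parent ID, annotation type, copy status). The column labels, the function names `read_syri` and `count_types`, and the example records are made up for illustration.

```python
import csv
import io
from collections import Counter

# Labels chosen here for illustration; syri.out itself has no header row.
SYRI_COLUMNS = [
    "ref_chr", "ref_start", "ref_end", "ref_seq", "qry_seq",
    "qry_chr", "qry_start", "qry_end", "id", "parent_id",
    "type", "copy_status",
]


def read_syri(handle):
    """Read tab-separated SyRI records into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(SYRI_COLUMNS, row)) for row in reader if row]


def count_types(records):
    """Count how many records carry each annotation type."""
    return Counter(rec["type"] for rec in records)


# Made-up records, only to show the expected shape of the input.
example = (
    "Chr1\t1\t1000\t-\t-\tChr1\t1\t1000\tSYN1\t-\tSYN\t-\n"
    "Chr1\t500\t500\tA\tG\tChr1\t500\t500\tSNP1\tSYN1\tSNP\t-\n"
    "Chr1\t1001\t1500\t-\t-\t-\t-\t-\tNOTAL1\t-\tNOTAL\t-\n"
)
records = read_syri(io.StringIO(example))
type_counts = count_types(records)
print(type_counts)
```

To inspect a real file, pass an open file handle instead of the in-memory example, for instance `read_syri(open("syri.out"))`.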
- Run PAV_Classifier
Rscript GetPAV.R syri.out
Rscript GetVCF.R pavc.txt query_name
bgzip -@ 6 pavc.vcf
bcftools index --threads 6 pavc.vcf.gz
bcftools norm -d all -cx -f refgenome pavc.vcf.gz > pavc.norm.vcf
Rscript FilterPAV.R pavc.norm.vcf 50
Rscript StatResult.R pavc.norm.FilterLen50.vcf pavc.txt
For FilterPAV.R, the argument 50 means that only PAVs with length >= 50 bp are kept.
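The idea of a length threshold can be sketched as follows. This is not the code of FilterPAV.R. It assumes that the length of a PAV is the absolute difference between the REF and ALT allele lengths and that each record has a single sequence-resolved ALT allele; how FilterPAV.R actually measures length may differ. The function names and the example records are hypothetical.

```python
def pav_length(ref, alt):
    """Assumed length definition: size difference between the two alleles."""
    return abs(len(alt) - len(ref))


def filter_vcf_lines(lines, min_len=50):
    """Keep header lines and records whose assumed PAV length is >= min_len."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # VCF columns 4 (REF) and 5 (ALT)
        if pav_length(ref, alt) >= min_len:
            kept.append(line)
    return kept


# Made-up records: a 60 bp insertion, a 1 bp insertion, and a 50 bp deletion.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
long_ins = "Chr1\t100\t.\tA\tA" + "C" * 60 + "\t.\t.\t.\n"
short_ins = "Chr1\t200\t.\tA\tAT\t.\t.\t.\n"
deletion = "Chr1\t300\t.\tG" + "T" * 50 + "\tG\t.\t.\t.\n"

kept = filter_vcf_lines([header, long_ins, short_ins, deletion], min_len=50)
print(len(kept))
```

With a threshold of 50, the 1 bp insertion is dropped, while the 60 bp insertion and the 50 bp deletion pass because the comparison is inclusive.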
SyRI results have two hierarchical levels: differences in structure (abbreviated as DSTR) and differences in sequence (abbreviated as DSEQ). Note that DSEQs are located within DSTRs.
- Differences in structure include five types:
  - SYN, syntenic region
  - INV, inverted region
  - TRANS/INVTR, translocated region or inverted translocated region
  - DUP/INVDP, duplicated region or inverted duplicated region
  - NOTAL, un-aligned region
- Differences in sequence include seven types:
  - SNP, single nucleotide polymorphism
  - CPG, copy gain in query
  - CPL, copy loss in query
  - HDR, highly diverged regions
  - TDM, tandem repeat
  - INS, insertion in query
  - DEL, deletion in query
Therefore, concatenating all DSTRs reconstructs the refgenome or the qrygenome. However, the genomic coordinates of some DSTRs, such as NOTAL, are not clear.
- SYN : Neither presence nor absence.
- INV : Neither presence nor absence.
- TRANS/INVTR : The part located on refgenome is an absence, while the part located on qrygenome is a presence.
- DUP/INVDP : SyRI already labels these as copygain or copyloss, so they are easy to classify. DUP/INVDP-copyloss is an absence, while DUP/INVDP-copygain is a presence.
- NOTAL : A refgenome sequence that cannot be aligned to qrygenome is an absence; a qrygenome sequence that cannot be aligned to refgenome is a presence.
- SNP : Neither presence nor absence.
- CPG : Can be regarded as a presence.
- CPL : Can be regarded as an absence.
- HDR : The part located on refgenome is an absence, while the part located on qrygenome is a presence.
- TDM : If the refgenome segment is longer than the qrygenome segment, it is an absence; otherwise it is a presence.
- INS : Can be regarded as a presence.
- DEL : Can be regarded as an absence.
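The rules above can be sketched as a small lookup function. This is an illustration of the rules as listed, not the implementation in GetPAV.R. The function name `classify` and its parameters are hypothetical: `side` says whether the segment being labelled lies on the reference (`"ref"`) or the query (`"qry"`), and `copy_status` is assumed to be the `copygain`/`copyloss` label that SyRI reports for duplications.

```python
PRESENCE, ABSENCE, NEITHER = "presence", "absence", "neither"


def classify(sv_type, side=None, copy_status=None, ref_len=None, qry_len=None):
    """Label one SyRI annotation (or one side of it) following the listed rules."""
    if sv_type in {"SYN", "INV", "SNP"}:
        return NEITHER
    if sv_type in {"CPG", "INS"}:
        return PRESENCE
    if sv_type in {"CPL", "DEL"}:
        return ABSENCE
    if sv_type in {"TRANS", "INVTR", "HDR", "NOTAL"}:
        # The reference-side segment is an absence, the query-side one a presence.
        if side == "ref":
            return ABSENCE
        if side == "qry":
            return PRESENCE
        raise ValueError("side must be 'ref' or 'qry' for " + sv_type)
    if sv_type in {"DUP", "INVDP"}:
        if copy_status == "copygain":
            return PRESENCE
        if copy_status == "copyloss":
            return ABSENCE
        raise ValueError("copy_status must be 'copygain' or 'copyloss'")
    if sv_type == "TDM":
        if ref_len is None or qry_len is None:
            raise ValueError("TDM needs ref_len and qry_len")
        # A longer reference segment means sequence is missing in the query.
        return ABSENCE if ref_len > qry_len else PRESENCE
    raise ValueError("unknown annotation type: " + sv_type)


print(classify("NOTAL", side="ref"))
print(classify("DUP", copy_status="copygain"))
print(classify("TDM", ref_len=120, qry_len=80))
```

Because TRANS/INVTR and HDR records have a segment on both genomes, a caller would invoke the function once per side and obtain one absence and one presence from the same record.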
If you have any questions about installation or usage, please open a new issue under Issues.