Updated: Sep 18, 2023
gAIRR-annotate provides annotations of IG and TR genes on personal assembly data
gAIRR-call can genotype IG and TR genes for short read data sequenced from germline cell. It is designed for probe captured gAIRR-seq data, but it can also be applied to other type of short read enriched from IG and TR regions.
- BWA aligner (0.7.17)
- SPAdes assembler (v3.13.0)
pip install gAIRR-suite
pip install gAIRR-suite
git clone https://github.com/maojanlin/gAIRRsuite.git
cd gAIRRsuite
Though optional, it is a good practice to install in a virtual environment to manage the dependancies:
python -m venv venv
source venv/bin/activate
Now a virtual environment (named venv) is activated:
python setup.py install
$ gAIRR_annotate -wd <work_dir> -id <sample_id> -a1 <assembly_h1.fa> -a2 <assembly_h2.fa>
The -a2
argument is optional for diploid personal assemblies.
The final annotation report is work_dir/sample_id/group_genes.1.bed
, and work_dir/sample_id/group_genes.2.bed
if the second haplotype of the assembly is provided.
The novel alleles in the annotation will be maked with parentheses, indicating the edit distance of the novel allele to the documented genes. For example, TRAV8-3*01(i:1)
means the annotated gene has TRAV19*01(i:0,h:17)
has no edit-distance to original allele, but has
The novel allele sequence can be found in work_dir/sample_id/novel_sample_ID_geneLocus.fasta.1/2.bed
If only IG or TR genes are prefered, option -lc IG
or -lc TR
can be specified.
$ gAIRR_call -wd <work_dir> -id <sample_ID> -rd1 <read.R1.fastq.gz> -rd2 <read.R2.fastq.gz> -lc <TRV TRJ TRD IGV IGJ IGD>
The prefered IG/TR locus and V/D/J genes can be specified with -lc
option.
The final calling report is work_dir/sample_ID/gene-locus/gAIRR-call_report.rpt
, gene-locus
is the targeted gene set, like TCRV, TCRJ, etc.
The novel allele sequence from the gAIRR-call_report.rpt
can be found in work_dir/sample_ID/gene-locus_novel/gene-locus_with_novel.fasta
--flanking
option can be specified to run the flanking sequence assembly algorithm. The flanking sequnece information can be found in work_dir/sample_ID/gene-locus_flanking/flanking_result/flanking_haplotypes.fasta
Note that the seriel numbers provided by the gAIRR_call
and gAIRR_annotate
are not necessary corresponding. When there are multiple novel alleles, the seril number starting from
Usage:
./scripts/check_RSS.sh
The check_RSS.sh
pipeline uses the RSS and separated heptamer and nonamer sequences downloaded from IMGT database to check if there are proper RSS pattern in the flanking sequences
To run the check_RSS.sh
pipeline, BWA aligner should be installed.
The check_RSS.sh
pipeline first align all the known IMGT RSS to the flanking sequences to check if there are identical or near-identical RSS pattern. The flanking sequences missing RSS are then recorded in RSS_checking/first_scan/missing_RSS_HG002-part_TCRJ_first_scan.fasta
and passed to second scanning. The second scanning aligned heptamer and nonamer sequences separately to the flanking sequences and try to identify heptamer-nonamer pairs that resemble proper RSSs.
Generated files:
RSS_checking/first_scan/database_HG002-part_TCRJ_first_scan.csv
is the RSS report file of HG002-part
. It indicate if the RSS are known, novel or could not be found after first scanning.
RSS_checking/second_scan/database_HG002-part_TCRJ_second_scan.csv
is the RSS report file of the flanking sequences that missed RSS in the first scanning.
Usage:
./scripts/database_collect.sh
./scripts/allele_consensus.sh
The database_collect.sh
pipeline collects the novel and flanking sequence database into database files. The duplicated novel or flanking sequences will be collapsed into one. Taking TRV novel allele as an example, generated file database_novel_TRV.tsv
indicates which samples possess which novel allele, and database_novel_TRV.fasta
recorded the novel allele sequence.
For samples with multiple assembly. Consensus allele result can be get from allele_consensus.sh
pipeline. Taking database_novel_TRV.tsv
and database_novel_TRV.fasta
as input, allele_consensus.sh
will generate database_novel_TRV_consensus.tsv
and database_novel_TRV_consensus.fasta
as output according to ./example/samples/consensus_name_HGSVC.log
.
In ./example/samples/consensus_name_HGSVC.log
, terms are separated by space. The first term is the consensus name while the later terms indicate the samples' different assembly id.
The gAIRR_suite/material/
directory contains IMGT allele sequences and RSS information.
The example/samples/
containts two miniature samples. HG002_part_gAIRR-seq_R1.fasta
and HG002_part_gAIRR-seq_R2.fasta
are a small part of the pair-end gAIRR-seq reads sequenced from HG002. HG002-S22-H1-000000F_1900000-2900000.fasta
is a genome assembly sequence extracted from (Garg, S. et al, 2021). The genome sequence is the 1900000:2900000 segment from the contig HG002-S22-H1 of HG002's maternal haplotype assembly.
In the example settings. Running
$ gAIRR_call -wd target_call -id HG002-part -rd1 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R1.fastq.gz -rd2 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R2.fastq.gz
or ./scripts/AIRRCall.sh
will gAIRR-call the HG002's AIRR alleles based on HG002_part_gAIRR-seq_R1.fasta
and HG002_part_gAIRR-seq_R2.fasta
.
Running
$ gAIRR_annotate -wd target_annotate -id HG002-part -a1 gAIRR_suite/example/samples/HG002-S22-H1-000000F_1900000-2900000.fasta
or ./scripts/AIRRAnnotate.sh
will gAIRR-annotate part of the HG002's genome assembly HG002-S22-H1-000000F_1900000-2900000.fasta
. In ./scripts/AIRRAnnotate.sh
, several shell script commands are commented. The commented commands are the settings to gAIRR-annotate two phased assemblies while in the example is to gAIRR-annotate single strend genome assembly.