Skip to content

Profiling Germline Adaptive Immune Receptor Repertoire with gAIRR Suite

License

Notifications You must be signed in to change notification settings

maojanlin/gAIRRsuite

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Updated: Sep 18, 2023

gAIRR-suite: Profiling genes encoding the adaptive immune receptor repertoire

gAIRR-annotate provides annotations of IG and TR genes on personal assembly data

gAIRR-call can genotype IG and TR genes for short read data sequenced from germline cell. It is designed for probe captured gAIRR-seq data, but it can also be applied to other type of short read enriched from IG and TR regions.

Prerequisite programs:

  • BWA aligner (0.7.17)
  • SPAdes assembler (v3.13.0)

Installation

pip install gAIRR-suite

pip install gAIRR-suite
git clone https://github.com/maojanlin/gAIRRsuite.git
cd gAIRRsuite

Though optional, it is a good practice to install in a virtual environment to manage the dependancies:

python -m venv venv
source venv/bin/activate

Now a virtual environment (named venv) is activated:

python setup.py install

gAIRR-annotate usage

$ gAIRR_annotate -wd <work_dir> -id <sample_id> -a1 <assembly_h1.fa> -a2 <assembly_h2.fa> 

The -a2 argument is optional for diploid personal assemblies.

The final annotation report is work_dir/sample_id/group_genes.1.bed, and work_dir/sample_id/group_genes.2.bed if the second haplotype of the assembly is provided.

The novel alleles in the annotation will be maked with parentheses, indicating the edit distance of the novel allele to the documented genes. For example, TRAV8-3*01(i:1) means the annotated gene has $1$ edit-distance to TRAV8-3*01, TRAV19*01(i:0,h:17) has no edit-distance to original allele, but has $17$ bases being clipped.

The novel allele sequence can be found in work_dir/sample_id/novel_sample_ID_geneLocus.fasta.1/2.bed

If only IG or TR genes are prefered, option -lc IG or -lc TR can be specified.

gAIRR-call usage

$ gAIRR_call -wd <work_dir> -id <sample_ID> -rd1 <read.R1.fastq.gz> -rd2 <read.R2.fastq.gz> -lc <TRV TRJ TRD IGV IGJ IGD>

The prefered IG/TR locus and V/D/J genes can be specified with -lc option.

The final calling report is work_dir/sample_ID/gene-locus/gAIRR-call_report.rpt, gene-locus is the targeted gene set, like TCRV, TCRJ, etc.

The novel allele sequence from the gAIRR-call_report.rpt can be found in work_dir/sample_ID/gene-locus_novel/gene-locus_with_novel.fasta

--flanking option can be specified to run the flanking sequence assembly algorithm. The flanking sequnece information can be found in work_dir/sample_ID/gene-locus_flanking/flanking_result/flanking_haplotypes.fasta

Note that the seriel numbers provided by the gAIRR_call and gAIRR_annotate are not necessary corresponding. When there are multiple novel alleles, the seril number starting from $0$ can be in different order.

checking RSS

Usage:

./scripts/check_RSS.sh

The check_RSS.sh pipeline uses the RSS and separated heptamer and nonamer sequences downloaded from IMGT database to check if there are proper RSS pattern in the flanking sequences

To run the check_RSS.sh pipeline, BWA aligner should be installed.

The check_RSS.sh pipeline first align all the known IMGT RSS to the flanking sequences to check if there are identical or near-identical RSS pattern. The flanking sequences missing RSS are then recorded in RSS_checking/first_scan/missing_RSS_HG002-part_TCRJ_first_scan.fasta and passed to second scanning. The second scanning aligned heptamer and nonamer sequences separately to the flanking sequences and try to identify heptamer-nonamer pairs that resemble proper RSSs.

Generated files: RSS_checking/first_scan/database_HG002-part_TCRJ_first_scan.csv is the RSS report file of HG002-part. It indicate if the RSS are known, novel or could not be found after first scanning. RSS_checking/second_scan/database_HG002-part_TCRJ_second_scan.csv is the RSS report file of the flanking sequences that missed RSS in the first scanning.

Data collection pipeline

Usage:

./scripts/database_collect.sh
./scripts/allele_consensus.sh

The database_collect.sh pipeline collects the novel and flanking sequence database into database files. The duplicated novel or flanking sequences will be collapsed into one. Taking TRV novel allele as an example, generated file database_novel_TRV.tsv indicates which samples possess which novel allele, and database_novel_TRV.fasta recorded the novel allele sequence.

For samples with multiple assembly. Consensus allele result can be get from allele_consensus.sh pipeline. Taking database_novel_TRV.tsv and database_novel_TRV.fasta as input, allele_consensus.sh will generate database_novel_TRV_consensus.tsv and database_novel_TRV_consensus.fasta as output according to ./example/samples/consensus_name_HGSVC.log.

In ./example/samples/consensus_name_HGSVC.log, terms are separated by space. The first term is the consensus name while the later terms indicate the samples' different assembly id.

Example

The gAIRR_suite/material/ directory contains IMGT allele sequences and RSS information. The example/samples/ containts two miniature samples. HG002_part_gAIRR-seq_R1.fasta and HG002_part_gAIRR-seq_R2.fasta are a small part of the pair-end gAIRR-seq reads sequenced from HG002. HG002-S22-H1-000000F_1900000-2900000.fasta is a genome assembly sequence extracted from (Garg, S. et al, 2021). The genome sequence is the 1900000:2900000 segment from the contig HG002-S22-H1 of HG002's maternal haplotype assembly.

In the example settings. Running

$ gAIRR_call -wd target_call -id HG002-part -rd1 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R1.fastq.gz -rd2 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R2.fastq.gz

or ./scripts/AIRRCall.sh will gAIRR-call the HG002's AIRR alleles based on HG002_part_gAIRR-seq_R1.fasta and HG002_part_gAIRR-seq_R2.fasta.

Running

$ gAIRR_annotate -wd target_annotate -id HG002-part -a1 gAIRR_suite/example/samples/HG002-S22-H1-000000F_1900000-2900000.fasta

or ./scripts/AIRRAnnotate.sh will gAIRR-annotate part of the HG002's genome assembly HG002-S22-H1-000000F_1900000-2900000.fasta. In ./scripts/AIRRAnnotate.sh , several shell script commands are commented. The commented commands are the settings to gAIRR-annotate two phased assemblies while in the example is to gAIRR-annotate single strend genome assembly.

About

Profiling Germline Adaptive Immune Receptor Repertoire with gAIRR Suite

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •