Step 3.1 and 3.2: Generation of labels and window (pairs) data for split reads (1) and SV filtering (2) approaches
Arnold Kuzniar edited this page Feb 5, 2020 · 2 revisions
Currently, the following two scripts can be used to generate labelled JSON files:
- split reads approach: label_window_pairs_on_split_read_positions.py
- SV filtering approach: label_window_pairs_with_svcallset.py
label_window_pairs_on_split_read_positions.py
Input arguments:
- ibam: input BAM file. Only used to get the dictionary {keys:chromosome names, values:chromosome lengths} with the function get_chr_len_dict. The same dictionary could be obtained from the FAI index.
- sample: sample ID, used to reconstruct the path to the directory containing the channel data ("outputpath/sample").
- window: window size to consider (default: 200 bp). This must be the same window size used with create_window_pairs.py to generate the window data.
- ground_truth: ground truth SV callset (deletions), one of:
  - GIAB sample NA12878 (HG001): Personalis_1000_Genomes_deduplicated_deletions.bed. Dataset obtained with the svclassify method.
  - GIAB sample NA24385 (HG002): HG002_SVs_Tier1_v0.6.PASS.vcf.gz. Filtered (PASS) deletions from the NIST_SVs_Integration_v0.6 dataset, described on bioRxiv. NA24385 SVs were also characterized with PacBio CCS reads.
  - Synthetic diploid CHM1_CHM13: mix of two haploid cell lines, CHM1 and CHM13, sequenced with PacBio. Ground truth deletion callset published in Huddleston2016.
  - in artificial data mode: SURVIVOR SV file for deletions and insertions (INDEL), inversions (INV), tandem duplications (DUP) or inter-chromosomal translocations (TRA).
- outputpath: directory containing the channel data per sample
- out: output filename
- logfile: name of the logfile
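The ibam argument is only needed for the chromosome lengths, and the same dictionary could be read from the FAI index instead. The following is a minimal sketch of that alternative, assuming a standard samtools faidx .fai file (tab-separated, with the sequence name in the first column and its length in the second). The function name chr_len_dict_from_fai is hypothetical and not part of the repository.

```python
def chr_len_dict_from_fai(fai_path):
    """Return {chromosome name: chromosome length} read from a .fai index.

    Hypothetical helper, not part of the repository. Assumes the standard
    tab-separated .fai layout: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
    """
    chr_len = {}
    with open(fai_path) as fai:
        for line in fai:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            chr_len[fields[0]] = int(fields[1])
    return chr_len
```

Only the first two columns are read, so the sketch does not depend on the offset and line-width columns of the index.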
label_window_pairs_with_svcallset.py
Most of the input arguments are the same as for label_window_pairs_on_split_read_positions.py, apart from:
- sv_caller: GRIDSS or Manta SV callset in BEDPE format, generated using the StructuralVariantAnnotation R/Bioconductor package
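To illustrate what a BEDPE callset contains, the sketch below reads the first six columns of each record (chrom1, start1, end1, chrom2, start2, end2), which describe the two breakpoint intervals of an SV. It is an assumption-laden illustration: the names BedpeRecord and read_bedpe are hypothetical, and how label_window_pairs_with_svcallset.py actually parses and uses the callset is not shown here.

```python
from collections import namedtuple

# Hypothetical record type holding the two breakpoint intervals of one SV.
BedpeRecord = namedtuple(
    "BedpeRecord", "chrom1 start1 end1 chrom2 start2 end2"
)


def read_bedpe(path):
    """Read the six mandatory BEDPE columns from a tab-separated file."""
    records = []
    with open(path) as bedpe:
        for line in bedpe:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header/comment lines
            f = line.rstrip("\n").split("\t")
            records.append(
                BedpeRecord(f[0], int(f[1]), int(f[2]),
                            f[3], int(f[4]), int(f[5]))
            )
    return records
```

Any further columns (name, score, strands and caller-specific fields) are ignored in this sketch.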
The script create_window_pairs.py generates the training data from the labelled (pairs of) positions.
Input arguments are:
- bam: BAM file. This argument is currently unused and can be safely removed.
- outputpath: folder where to place the output files (parent folder of the chr_array)
- sample: sample name
- logfile: name of the logfile
- window: window size. It must be the same as the window size used in generating the labels.
- sv_caller: in 'SV filtering' mode, the name of the SV caller used in the labelling. In 'split reads' mode this parameter is set to the empty string ('').
- mode: training/test. In training mode the data are split into positive (DEL) and negative (noDEL) sets; in test mode the data are not split.
- save_npz: boolean. By default, output files are saved in carray format (for large arrays). Setting save_npz to True additionally saves the data in NumPy NPZ format.
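The effect of the mode argument can be sketched as follows. This is only an illustration of the described behaviour, not the code in create_window_pairs.py: the function name split_by_mode and the key "all" are hypothetical, and plain Python lists stand in for the window arrays.

```python
def split_by_mode(windows, labels, mode):
    """Group windows by label in training mode; keep them together in test mode.

    Hypothetical helper: 'windows' and 'labels' are parallel sequences, and
    labels are assumed to be either "DEL" or "noDEL".
    """
    if mode == "training":
        split = {"DEL": [], "noDEL": []}
        for window, label in zip(windows, labels):
            split[label].append(window)
        return split
    if mode == "test":
        return {"all": list(windows)}
    raise ValueError("mode must be 'training' or 'test'")
```

In training mode this yields two sets, and in test mode one, which matches the number of output arrays listed under Output.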
Output:
- One (test mode) or two (training mode) bcolz carray(s) with suffix '_win200_carray' (if window size is 200 bp) with label info in metadata
- If save_npz is set to True: one (test mode) or two (training mode) NPZ file(s) with label info