Steps 3.1 and 3.2: Generating labels and window-pair data for the split reads (1) and SV filtering (2) approaches

SV labelling

Currently, the following two scripts can be used to generate labelled JSON files:

label_window_pairs_on_split_read_positions.py Input arguments:

ibam: input BAM file. Only used to get the dictionary {keys:chromosome names, values:chromosome lengths} with the function get_chr_len_dict. The same dictionary could be obtained from the FAI index.
sample: sample ID, used to reconstruct the path to the directory containing the channel data ("outputpath/sample")
window: window size to consider (default: 200 bp). This must be the same window size used with create_window_pairs.py to generate the window data.
ground_truth: ground truth SV callset (deletions), one of:
- GIAB sample NA12878 (HG001): Personalis_1000_Genomes_deduplicated_deletions.bed. Dataset obtained with the svclassify method.
- GIAB sample NA24385 (HG002): HG002_SVs_Tier1_v0.6.PASS.vcf.gz. Filtered (PASS) deletions from the NIST_SVs_Integration_v0.6 dataset, described in a preprint on bioRxiv. NA24385 SVs were also characterized with PacBio CCS reads.
- Synthetic diploid CHM1_CHM13: mix of two haploid cell lines, CHM1 and CHM13, sequenced with PacBio. The ground truth deletion callset was published in Huddleston2016.
- in artificial data mode: SURVIVOR SV file for deletions and insertions (INDEL), inversions (INV), tandem duplications (DUP) or inter-chromosomal translocations (TRA).
outputpath: directory containing the channel data per sample
out: output filename
logfile: name of the logfile
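
As noted for ibam, the chromosome-length dictionary could also be built from a FAI index instead of the BAM header. The following is a minimal sketch, assuming a standard FAI file in which the first two tab-separated columns hold the sequence name and its length. The function name chr_len_dict_from_fai is hypothetical and is not part of the repository; the toy lengths are made up for illustration.

```python
import io


def chr_len_dict_from_fai(handle):
    """Return {chromosome name: chromosome length} from an open FAI file.

    Hypothetical helper: it mirrors the dictionary described for
    get_chr_len_dict, but reads a FAI index rather than a BAM header.
    """
    chr_len = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        chr_len[fields[0]] = int(fields[1])
    return chr_len


# Toy FAI content standing in for a real file such as "genome.fa.fai".
example_fai = io.StringIO(
    "chr1\t1000\t6\t60\t61\n"
    "chr2\t500\t1030\t60\t61\n"
)
chr_lengths = chr_len_dict_from_fai(example_fai)
print(chr_lengths)
```

With a real index, the same function could be called on a file opened with open(path_to_fai).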

label_window_pairs_with_svcallset.py Most of the input arguments are the same as for label_window_pairs_on_split_read_positions.py, apart from:

sv_caller: GRIDSS or Manta SV callset in BEDPE format, generated with the StructuralVariantAnnotation R/Bioconductor package
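
To illustrate what a BEDPE callset contains, here is a minimal sketch of a reader for breakpoint pairs. It assumes the usual BEDPE column order (chrom1, start1, end1, chrom2, start2, end2, name, ...) and tab-separated fields. The names BreakpointPair and read_bedpe are hypothetical, and the actual script may parse the file differently.

```python
from collections import namedtuple

# Hypothetical container for one BEDPE record (first seven columns only).
BreakpointPair = namedtuple(
    "BreakpointPair", "chrom1 start1 end1 chrom2 start2 end2 name"
)


def read_bedpe(lines):
    """Parse BEDPE lines into a list of BreakpointPair records."""
    pairs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.rstrip("\n").split("\t")
        name = f[6] if len(f) > 6 else "."
        pairs.append(
            BreakpointPair(f[0], int(f[1]), int(f[2]),
                           f[3], int(f[4]), int(f[5]), name)
        )
    return pairs


# Toy record: one deletion-like pair of breakpoints on chr1.
example_lines = [
    "#chrom1\tstart1\tend1\tchrom2\tstart2\tend2\tname\n",
    "chr1\t100\t101\tchr1\t500\t501\tdel_1\t.\t+\t-\n",
]
pairs = read_bedpe(example_lines)
print(pairs[0])
```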

The script create_window_pairs.py takes care of generating the training data from the labelled (pair of) positions.

Input arguments are:

bam: BAM file. Currently unused, so this argument can be safely removed.
outputpath: folder where to place the output files (parent folder of the chr_array)
sample: sample name
logfile: name of the logfile
window: window size. It must be the same as the window size used in generating the labels.
sv_caller: in 'SV filtering' mode, the name of the SV caller used in the labelling. In 'split reads' mode, this parameter is the empty string ('').
mode: training or test. In training mode the data are split into positive (DEL) and negative (noDEL) sets; in test mode the data are not split.
save_npz: boolean. By default, output files are saved in bcolz carray format (suited to large arrays). Setting save_npz to True additionally saves the data in NumPy NPZ format.
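
The role of the window argument can be sketched as follows. This example assumes that each window is centred on a labelled position and that pairs falling too close to a chromosome end are skipped; neither assumption is confirmed by the description above. The function names are hypothetical.

```python
def window_bounds(position, window, chr_len):
    """Return (start, end) of a window centred on position, or None.

    Assumption: the window extends window // 2 bp on each side of the
    position, and windows running off the chromosome are rejected.
    """
    half = window // 2
    start, end = position - half, position + half
    if start < 0 or end > chr_len:
        return None
    return start, end


def window_pair_bounds(pos1, pos2, window, chr_len1, chr_len2):
    """Return the two window intervals for a pair of positions, or None."""
    first = window_bounds(pos1, window, chr_len1)
    second = window_bounds(pos2, window, chr_len2)
    if first is None or second is None:
        return None
    return first, second


# A pair of positions on the same toy chromosome, default window of 200 bp.
pair = window_pair_bounds(1000, 3000, 200, 5000, 5000)
print(pair)
```

Because the window size determines the length of every extracted interval, the labels and the window data only line up when both steps use the same value.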

Output:

One (test mode) or two (training mode) bcolz carray(s) with suffix '_win200_carray' (if window size is 200 bp) with label info in metadata
If save_npz is set to True: one (test mode) or two (training mode) NPZ file(s) with label info
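
The relation between mode, window size and the number of outputs can be sketched as below. This is an illustration only: the function names and the 'all' key used for test mode are hypothetical, and plain Python lists stand in for the bcolz carrays written by the real script.

```python
def output_suffix(window):
    """Build the output suffix, e.g. '_win200_carray' for a 200 bp window."""
    return "_win{}_carray".format(window)


def split_by_mode(windows, labels, mode):
    """Group windows as described for the mode argument.

    training: two sets, positive (DEL) and negative (noDEL).
    test: a single unsplit set (stored under the hypothetical key 'all').
    """
    if mode == "test":
        return {"all": list(windows)}
    if mode != "training":
        raise ValueError("mode must be 'training' or 'test'")
    out = {"DEL": [], "noDEL": []}
    for win, label in zip(windows, labels):
        out["DEL" if label == "DEL" else "noDEL"].append(win)
    return out


# Toy data: three windows identified by name, with their labels.
toy_windows = ["w1", "w2", "w3"]
toy_labels = ["DEL", "noDEL", "DEL"]

training_sets = split_by_mode(toy_windows, toy_labels, "training")
test_sets = split_by_mode(toy_windows, toy_labels, "test")
print(output_suffix(200), training_sets, test_sets)
```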