
Step 3.1 and 3.2: Generation of labels and window (pairs) data for split reads (1) and SV filtering (2) approaches

Arnold Kuzniar edited this page Feb 5, 2020 · 2 revisions

SV labelling

Currently, the following two scripts can be used to generate labelled JSON files:

label_window_pairs_on_split_read_positions.py

Input arguments:

  • ibam: input BAM file. Used only to build the dictionary {chromosome name: chromosome length} with the function get_chr_len_dict. The same dictionary could be obtained from the FAI index.
  • sample: sample ID, used to reconstruct the path to the directory containing the channel data ("outputpath/sample")
  • window: window size to consider (default: 200 bp). This must be the same window size used with create_window_pairs.py to generate the window data.
  • ground_truth: ground_truth SV callsets (deletions), one among:
  • outputpath: directory containing the channel data per sample
  • out: output filename
  • logfile: name of the logfile
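
As noted for ibam, the chromosome-length dictionary could also be read from a FASTA index instead of the BAM header. The following is a minimal sketch of that alternative, not the code used by the scripts: chr_len_dict_from_fai and channel_data_dir are hypothetical helper names, and the sketch assumes a standard tab-separated .fai file whose first two columns are the sequence name and its length.

```python
import os


def chr_len_dict_from_fai(fai_path):
    """Return {chromosome name: chromosome length} read from a .fai file.

    Hypothetical helper: assumes the standard FASTA index layout
    (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH), tab-separated.
    """
    chr_len = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            chr_len[fields[0]] = int(fields[1])
    return chr_len


def channel_data_dir(outputpath, sample):
    """Hypothetical helper: directory with the channel data ("outputpath/sample")."""
    return os.path.join(outputpath, sample)
```

The resulting dictionary has the same shape as the one described for get_chr_len_dict, so it could be substituted wherever only chromosome names and lengths are needed.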

label_window_pairs_with_svcallset.py

Most of the input arguments are the same as for label_window_pairs.py apart from:

Generation of window data

The script create_window_pairs.py generates the training data from the labelled (pairs of) positions.

Input arguments are:

  • bam: BAM file. Currently unused; this argument can be safely removed.
  • outputpath: folder in which to place the output files (parent folder of the chr_array)
  • sample: sample name
  • logfile: name of the logfile
  • window: window size. It must be the same as the window size used to generate the labels.
  • sv_caller: in 'SV filtering' mode, the name of the SV caller used for the labelling. In 'split reads' mode this parameter is the empty string ('').
  • mode: training/test. In training mode the data are split into a positive (DEL) and a negative (noDEL) set; in test mode the data are not split.
  • save_npz: boolean. By default, output files are saved in carray format (for large arrays). If save_npz is set to True, the data are also saved in NumPy NPZ format.
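
The effect of the mode argument can be illustrated with a small sketch. This is not the implementation in create_window_pairs.py: split_by_mode is a hypothetical function, the windows are stand-in values, and the "all" key used for test mode is an assumption made for the example.

```python
def split_by_mode(windows, labels, mode):
    """Group (window, label) pairs according to the mode.

    training: two sets, "DEL" (positive) and "noDEL" (negative).
    test:     a single unsplit set (stored under the assumed key "all").
    """
    if len(windows) != len(labels):
        raise ValueError("windows and labels must have the same length")

    pairs = list(zip(windows, labels))
    if mode == "training":
        sets = {"DEL": [], "noDEL": []}
        for window, label in pairs:
            # Anything not labelled DEL goes to the negative set.
            key = "DEL" if label == "DEL" else "noDEL"
            sets[key].append((window, label))
        return sets
    if mode == "test":
        return {"all": pairs}
    raise ValueError("mode must be 'training' or 'test'")
```

Training mode therefore yields two sets and test mode one, which matches the number of output arrays listed under Output.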

Output:

  • One (test mode) or two (training mode) bcolz carray(s) with the suffix '_win200_carray' (for a window size of 200 bp), with the label information stored in the metadata
  • If save_npz is set to True: one (test mode) or two (training mode) NPZ file(s) with the label information
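
Because the window size is encoded in the carray suffix, it can be checked against the window size used for labelling before the data are used. The sketch below assumes only the suffix pattern '_win<window>_carray' described above; carray_suffix and window_from_name are hypothetical helpers, and the "DEL" prefix in the usage example is an illustrative name, not a documented one.

```python
import re


def carray_suffix(window):
    """Build the carray suffix for a window size, e.g. 200 -> '_win200_carray'."""
    return f"_win{window}_carray"


def window_from_name(name):
    """Recover the window size from a name ending in '_win<window>_carray'."""
    match = re.search(r"_win(\d+)_carray$", name)
    if match is None:
        raise ValueError(f"no window suffix found in {name!r}")
    return int(match.group(1))


# Usage: verify that an output array matches the labelling window size.
label_window = 200
array_name = "DEL" + carray_suffix(label_window)  # illustrative prefix
assert window_from_name(array_name) == label_window
```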