Generating all the necessary files in the data directory #76

lsantuari · 2020-11-24T09:47:53Z

sv-channels should take as input:

a BAM file (symlinked to test.bam) generated by mapping the reads (FASTQ format) to the reference sequence (FASTA format) with a read mapper that supports split read mapping, such as BWA mem;
the reference sequence (symlinked to test.fasta) in FASTA format that was used to map the reads.

All the other files in the data directory must be generated from either the BAM file or the FASTA file:

the FASTA index test.fasta.fai
the file reference_N_regions.bed generated with Ns_to_bed.py;
the file seqs.bed generated from the FASTA index test.fasta.fai;
the file test.2bit generated with faToTwoBit
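
For reference, the generation steps listed above could be collected in one place. The following is only a sketch, not the project's actual workflow: `build_commands` is a hypothetical helper, the `Ns_to_bed.py` flags are copied from the command shown later in this thread, and `samtools faidx` is assumed for the FASTA index (samtools was added to the conda environment for this issue). Nothing is executed; the function only returns the argument lists.

```python
from pathlib import Path


def build_commands(fasta="test.fasta", chroms=("12", "22")):
    """Return argv lists for the data-generation steps (hypothetical helper).

    Nothing is run here; each list can be passed to subprocess.run(..., check=True).
    """
    twobit = str(Path(fasta).with_suffix(".2bit"))  # test.fasta -> test.2bit
    return [
        # FASTA index: writes <fasta>.fai next to the FASTA file
        ["samtools", "faidx", fasta],
        # 2bit version of the reference
        ["faToTwoBit", fasta, twobit],
        # BED file of N regions; flags as used in this thread
        ["python", "Ns_to_bed.py",
         "-b", "reference_N_regions.bed",
         "-t", twobit,
         "-c", ",".join(chroms)],
    ]


for cmd in build_commands():
    print(" ".join(cmd))
```

seqs.bed is not included here because it is derived from the FASTA index rather than produced by an external tool.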

arnikz · 2020-11-25T14:27:36Z

the file seqs.bed generated from the FASTA index test.fasta.fai;

@lsantuari: Could you clarify how you got

sv-channels/data/seqs.bed

Lines 1 to 2 in 9866b35

12	44000000	46000000
22	44000000	46000000

from

sv-channels/data/test.fasta.fai

Lines 1 to 2 in 9866b35

12	2000000	4	2000000	2000001
22	2000000	2000009	2000000	2000001

?

arnikz · 2020-11-25T14:59:07Z

the file test.2bit generated with faToTwoBit

the file reference_N_regions.bed generated with Ns_to_bed.py;

@lsantuari: Accordingly, I got an empty BED file.

faToTwoBit test.fasta test.2bit
python Ns_to_bed.py -b reference_N_regions.bed -t test.2bit -c 12,22

related to #79

lsantuari · 2020-11-25T15:23:08Z

the file test.2bit generated with faToTwoBit

the file reference_N_regions.bed generated with Ns_to_bed.py;

@lsantuari: Accordingly, I got an empty BED file.
faToTwoBit test.fasta test.2bit
python Ns_to_bed.py -b reference_N_regions.bed -t test.2bit -c 12,22
related to #79

It is correct. It means that there are no genomic intervals containing only ambiguous bases (represented by 'N' in this case) in our test 'reference genome'. Our test.fasta file contains:

chromosome 12 from position 44000000 to position 46000000
chromosome 22 from position 44000000 to position 46000000
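
To illustrate why an empty BED file is the expected result, here is a minimal sketch of how runs of Ns map to BED intervals. It is not the implementation of Ns_to_bed.py (which reads the .2bit file); `n_runs_to_bed` is a hypothetical function that works on a plain string and reports 0-based, half-open intervals.

```python
import re


def n_runs_to_bed(seqid, sequence):
    """Return (seqid, start, end) for every run of N/n in the sequence.

    Coordinates are 0-based with an exclusive end (hypothetical helper).
    """
    return [(seqid, m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


# A sequence without ambiguous bases yields no intervals, i.e. an empty BED file.
print(n_runs_to_bed("12", "ACGTACGT"))
# A sequence with two runs of Ns yields two intervals.
print(n_runs_to_bed("22", "ACNNNGTN"))
```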

lsantuari · 2020-11-25T15:39:52Z

the file seqs.bed generated from the FASTA index test.fasta.fai;

@lsantuari: Could you clarify how you got

sv-channels/data/seqs.bed

Lines 1 to 2 in 9866b35

12 44000000 46000000

22 44000000 46000000

from

sv-channels/data/test.fasta.fai

Lines 1 to 2 in 9866b35

12 2000000 4 2000000 2000001

22 2000000 2000009 2000000 2000001

?

seqs.bed is an old version, where the values were hard-coded, and it must be replaced.

BED is 0-based, so the file seqs.bed should be:

12	0	1999999
22	0	1999999

which can be obtained from the FASTA index test.fasta.fai with the following command:

awk '{print $1 "\t0\t" $2-1}' test.fasta.fai > seqs.bed
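
The same conversion can be sketched in Python, which may be handy if the step ends up inside one of the scripts. `fai_to_bed_lines` is a hypothetical name, and the input is assumed to be tab-separated .fai records with the sequence name in column 1 and its length in column 2. The end column follows the convention used in this thread (length - 1, the last 0-based position); under the usual half-open reading of BED the end column would be the sequence length itself.

```python
def fai_to_bed_lines(fai_lines):
    """Convert .fai records to 'name<TAB>0<TAB>length-1' lines (hypothetical helper)."""
    bed = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        name, length = fields[0], int(fields[1])
        bed.append(f"{name}\t0\t{length - 1}")
    return bed


# The two records of test.fasta.fai quoted in this thread
fai = [
    "12\t2000000\t4\t2000000\t2000001\n",
    "22\t2000000\t2000009\t2000000\t2000001\n",
]
print("\n".join(fai_to_bed_lines(fai)))
```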

arnikz · 2020-11-25T16:15:00Z

@lsantuari: Could you clarify the purpose of these?

$ ls data/ *.bed
ENCFF001TDO.bed  reference_N_regions.bed  seqs.bed  test.bed  # the latter is a symlink to seqs.bed

EXCL_LIST=ENCFF001TDO.bed - used by merge_sv_calls.R
REF_REG=reference_N_regions.bed - used by merge_sv_calls.R
BED=test.bed - used by label_windows.py.

lsantuari · 2020-11-25T16:30:06Z

@lsantuari: Could you clarify the purpose of these?
$ ls data/ *.bed
ENCFF001TDO.bed  reference_N_regions.bed  seqs.bed  test.bed  # the latter is a symlink to seqs.bed
EXCL_LIST=ENCFF001TDO.bed - used by merge_sv_calls.R

The ENCODE blacklist is used to filter out SVs that fall in these regions. However, the same filtering is also performed in the downstream analysis, for instance in generate_figure2.R, so it is not necessary at this stage.

REF_REG=reference_N_regions.bed - used by merge_sv_calls.R

The BED file with intervals containing Ns is used to make sure that no SVs overlap with these regions. As with the ENCODE blacklist, this filtering is also performed in the downstream analysis and is therefore not necessary at this stage.

BED=test.bed - used by label_windows.py.

This is used as a check to make sure that we only consider genomic positions in the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
This file could be replaced by using only the FASTA index test.fasta.fai.

- remove (old) generated data files

arnikz · 2020-11-26T14:24:52Z

This is used as a check to make sure that we only consider genomic positions in the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
This file could be replaced by using only the FASTA index test.fasta.fai.

@lsantuari: Are you sure there is always a 1:1 correspondence between FASTA/FAI and BED? I'll change the code accordingly (e.g., using pysam).

lsantuari · 2020-11-26T17:07:30Z

This is used as a check to make sure that we only consider genomic positions in the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
This file could be replaced by using only the FASTA index test.fasta.fai.

@lsantuari: Are you sure there is always a 1:1 correspondence between FASTA/FAI and BED? I'll change the code accordingly (e.g., using pysam).

@arnikz Yes. Basically what we need are the chromosome IDs and lengths (first and second columns of the FASTA.FAI index).

arnikz · 2020-11-26T17:26:32Z

Yes. Basically what we need are the chromosome IDs and lengths (first and second columns of the FASTA.FAI index).

Just to make sure: length or length - 1?

sv-channels/data/test.fasta.fai

Lines 1 to 2 in 9866b35

12	2000000	4	2000000	2000001
22	2000000	2000009	2000000	2000001

gives

$ awk '{print $1 "\t0\t" $2-1}' test.fasta.fai
12	0	1999999
22	0	1999999

previously

sv-channels/scripts/genome_wide/functions.py

Line 600 in f14c3dd

d[columns[0]] = int(columns[2]) - int(columns[1])

so length - 1 was stored here

hence the updated code

sv-channels/scripts/genome_wide/functions.py

Line 596 in 6e97987

d[seqid] = fa.lengths[i] - 1

lsantuari · 2020-11-30T08:18:17Z

length - 1

length - 1 is correct
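
As a sanity check, the old BED-based computation and the new index-based one can be compared side by side. This is only a sketch: the function names below are hypothetical, plain lists of lines stand in for the files (the actual code uses pysam on the FASTA file), and `in_range` is an assumed illustration of the position check, not the code in functions.py.

```python
def chr_dict_from_bed_lines(bed_lines):
    """Old approach: end - start from a BED record (hypothetical helper)."""
    d = {}
    for line in bed_lines:
        columns = line.rstrip().split("\t")
        if len(columns) < 3:
            continue  # skip blank or malformed lines
        d[columns[0]] = int(columns[2]) - int(columns[1])
    return d


def chr_dict_from_fai_lines(fai_lines):
    """New approach: sequence length - 1 from a .fai record (hypothetical helper)."""
    d = {}
    for line in fai_lines:
        fields = line.rstrip().split("\t")
        if len(fields) < 2:
            continue
        d[fields[0]] = int(fields[1]) - 1
    return d


def in_range(chr_dict, chrom, pos):
    """True if pos is a valid 0-based position on chrom (assumed check)."""
    return chrom in chr_dict and 0 <= pos <= chr_dict[chrom]


bed = ["12\t0\t1999999", "22\t0\t1999999"]
fai = ["12\t2000000\t4\t2000000\t2000001", "22\t2000000\t2000009\t2000000\t2000001"]

# Both approaches store the last valid 0-based position per chromosome.
print(chr_dict_from_bed_lines(bed) == chr_dict_from_fai_lines(fai))
```

With these inputs both dictionaries map each chromosome to 1999999, so position 1999999 is accepted and position 2000000 is rejected.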

lsantuari added the data label Nov 24, 2020

lsantuari added this to the 0.1.0 milestone Nov 24, 2020

lsantuari assigned arnikz and lsantuari Nov 24, 2020

arnikz unassigned lsantuari Nov 24, 2020

arnikz mentioned this issue Nov 25, 2020

Windown labeling fails on CTX SV type #77

Open

arnikz pushed a commit that referenced this issue Nov 25, 2020

Update conda env: add samtools & fatotwobit #76.

f8da830

arnikz pushed a commit that referenced this issue Nov 25, 2020

Add samtools to index FASTA #76.

1d9d3d9

arnikz pushed a commit that referenced this issue Nov 25, 2020

Add step to write *.2bit & *.bed files #76.

302c865

arnikz pushed a commit that referenced this issue Nov 25, 2020

Fix: output BED file #76.

f14c3dd

arnikz pushed a commit that referenced this issue Nov 26, 2020

Add FASTA->2bit conversion step #76.

dad6945

- remove (old) generated data files

arnikz pushed a commit that referenced this issue Nov 26, 2020

Use FASTA index instead of BED file #76.

f4cbf60

arnikz pushed a commit that referenced this issue Nov 26, 2020

Use FASTA index instead of BED file #76.

c847ba7

arnikz pushed a commit that referenced this issue Nov 26, 2020

Wait for .2bit and .bw files #76.

b73e1cf

arnikz pushed a commit that referenced this issue Nov 26, 2020

Wait for .2bit and .bw files #76.

6e97987

arnikz added code enhancement labels Nov 26, 2020

arnikz mentioned this issue Nov 26, 2020

Replace Ns_to_bed.py with seqkit #80

Closed

arnikz closed this as completed Dec 2, 2020

arnikz mentioned this issue Dec 2, 2020

Introduce Ns at random positions #84

Closed
