Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generating all the necessary files in the data directory #76

Closed
3 of 4 tasks
lsantuari opened this issue Nov 24, 2020 · 10 comments
Closed
3 of 4 tasks

Generating all the necessary files in the data directory #76

lsantuari opened this issue Nov 24, 2020 · 10 comments
Assignees
Milestone

Comments

@lsantuari
Copy link
Contributor

lsantuari commented Nov 24, 2020

sv-channels should take in input:

  • a BAM file (symlinked to test.bam) generated by mapping the reads (FASTQ format) to the reference sequence (FASTA format) with a read mapper that supports split read mapping, such as BWA mem;
  • the reference sequence (symlinked to test.fasta) in FASTA format that was used to map the reads.

All the other files in the data directory must be generated from either the BAM file or the FASTA file:

@arnikz
Copy link
Contributor

arnikz commented Nov 25, 2020

  • the file seqs.bed generated from the FASTA index test.fasta.fai;

@lsantuari: Could you clarify how did you get

12 44000000 46000000
22 44000000 46000000

from

12 2000000 4 2000000 2000001
22 2000000 2000009 2000000 2000001
?

@arnikz
Copy link
Contributor

arnikz commented Nov 25, 2020

  • the file test.2bit generated with faToTwoBit
  • the file reference_N_regions.bed generated with Ns_to_bed.py;

@lsantuari: Accordingly, I got an empty BED file.

faToTwoBit test.fasta test.2bit
python Ns_to_bed.py -b reference_N_regions.bed -t test.2bit -c 12,22

related to #79

arnikz pushed a commit that referenced this issue Nov 25, 2020
@lsantuari
Copy link
Contributor Author

  • the file test.2bit generated with faToTwoBit
  • the file reference_N_regions.bed generated with Ns_to_bed.py;

@lsantuari: Accordingly, I got an empty BED file.

faToTwoBit test.fasta test.2bit
python Ns_to_bed.py -b reference_N_regions.bed -t test.2bit -c 12,22

related to #79

It is correct. It means that there are no genomic intervals containing only ambiguous bases (represented by 'N' in this case) in our test 'reference genome'. Our test.fasta file contains:

  • chromosome 12 from position 44000000 to position 46000000
  • chromosome 22 from position 44000000 to position 46000000

@lsantuari
Copy link
Contributor Author

  • the file seqs.bed generated from the FASTA index test.fasta.fai;

@lsantuari: Could you clarify how did you get

12 44000000 46000000
22 44000000 46000000

from

12 2000000 4 2000000 2000001
22 2000000 2000009 2000000 2000001

?

seqs.bed is an old version, where the values were hard-coded, and it must be replaced.

BED is 0-based, so the file seqs.bed should be:

12	0	1999999
22	0	1999999

which can be obtained from the FASTA index test.fasta.fai with the following command:

awk '{print $1 "\t0\t" $2-1}' test.fasta.fai > seqs.bed

@arnikz
Copy link
Contributor

arnikz commented Nov 25, 2020

@lsantuari: Could you clarify the purpose of these?

$ ls data/ *.bed
ENCFF001TDO.bed  reference_N_regions.bed  seqs.bed  test.bed  # the latter is a symlink to seqs.bed
  • EXCL_LIST=ENCFF001TDO.bed - used by merge_sv_calls.R
  • REF_REG=reference_N_regions.bed - used by merge_sv_calls.R
  • BED=test.bed - used by label_windows.py.

@lsantuari
Copy link
Contributor Author

@lsantuari: Could you clarify the purpose of these?

$ ls data/ *.bed
ENCFF001TDO.bed  reference_N_regions.bed  seqs.bed  test.bed  # the latter is a symlink to seqs.bed
  • EXCL_LIST=ENCFF001TDO.bed - used by merge_sv_calls.R

The ENCODE blacklist is used to filter out SVs that falls in these regions. However, the same filtering is also performed in the downstream analysis, for instance in generate_figure2.R, so it is not necessary at this stage

  • REF_REG=reference_N_regions.bed - used by merge_sv_calls.R

The BED file with intervals containing Ns are used to make sure that no SVs overlap with these regions. As for the ENCODE blacklist, this filtering is also performed in the downstream analysis and therefore it is not necessary at this stage

  • BED=test.bed - used by label_windows.py.

this is used as a check to make sure that we only consider genomic positions in the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
This file could be replaced by using only the FASTA index test.fasta.fai

arnikz pushed a commit that referenced this issue Nov 25, 2020
arnikz pushed a commit that referenced this issue Nov 25, 2020
arnikz pushed a commit that referenced this issue Nov 26, 2020
- remove (old) generated data files
@arnikz
Copy link
Contributor

arnikz commented Nov 26, 2020

this is used as a check to make sure that we only consider genomic positions in the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
This file could be replaced by using only the FASTA index test.fasta.fai

@lsantuari: Are you sure it's always 1:1 correspondence between FASTA/FAI and BED? I'll change the code accordingly (e.g., using pysam).

arnikz pushed a commit that referenced this issue Nov 26, 2020
arnikz pushed a commit that referenced this issue Nov 26, 2020
arnikz pushed a commit that referenced this issue Nov 26, 2020
@lsantuari
Copy link
Contributor Author

this is used as a check to make sure that we only consider genomic positions in the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
This file could be replaced by using only the FASTA index test.fasta.fai

@lsantuari: Are you sure it's always 1:1 correspondence between FASTA/FAI and BED? I'll change the code accordingly (e.g., using pysam).

@arnikz Yes. Basically what we need are the chromosome IDs and lengths (first and second columns of the FASTA.FAI index).

arnikz pushed a commit that referenced this issue Nov 26, 2020
@arnikz
Copy link
Contributor

arnikz commented Nov 26, 2020

Yes. Basically what we need are the chromosome IDs and lengths (first and second columns of the FASTA.FAI index).

Just to make sure: length or length - 1?

12 2000000 4 2000000 2000001
22 2000000 2000009 2000000 2000001

gives

$ awk '{print $1 "\t0\t" $2-1}' test.fasta.fai
12	0	1999999
22	0	1999999

previously

d[columns[0]] = int(columns[2]) - int(columns[1])
so here length - 1 stored

hence updated code

d[seqid] = fa.lengths[i] - 1

@lsantuari
Copy link
Contributor Author

length - 1

length - 1 is correct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants