Generating all the necessary files in the data directory #76
Comments
@lsantuari: Could you clarify how you got the file shown at lines 1 to 2 in 9866b35 from sv-channels/data/test.fasta.fai (lines 1 to 2 in 9866b35)?
|
@lsantuari: Accordingly, I got an empty BED file.
related to #79 |
It is correct. It means that there are no genomic intervals containing only ambiguous bases (represented by 'N' in this case) in our test 'reference genome'. Our test.fasta file contains:
|
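The check described above (an empty BED file when the reference has no runs of 'N') can be illustrated with a minimal sketch. It is not the code used in sv-channels: the helper names `read_fasta` and `n_intervals` and the example records are hypothetical, and it assumes intervals are reported as 0-based, half-open BED coordinates.

```python
import re


def read_fasta(lines):
    """Minimal FASTA parser yielding (name, sequence) pairs."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_intervals(name, sequence):
    """Return (chrom, start, end) for each run of N/n, 0-based half-open."""
    return [(name, m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


# Hypothetical records, not the content of test.fasta.
records = [">chrA", "ACGTNNNNAC", "GTNACG", ">chrB", "ACGT"]
intervals = [
    iv for name, seq in read_fasta(records) for iv in n_intervals(name, seq)
]
for chrom, start, end in intervals:
    print(f"{chrom}\t{start}\t{end}")
```

A sequence without any 'N' contributes no intervals, so a reference made only of unambiguous bases yields an empty BED file.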
seqs.bed is an old version, where the values were hard-coded, and it must be replaced. BED is 0-based, so the file seqs.bed should be:
which can be obtained from the FASTA index test.fasta.fai with the following command:
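As an illustration of the conversion discussed here, the following sketch derives whole-chromosome BED intervals from a FASTA index in Python. It is an assumption-laden example rather than the command used in the thread: the function name `fai_to_bed` and the index content are hypothetical. It relies only on the first two tab-separated columns of a `.fai` file being the sequence name and its length.

```python
import io


def fai_to_bed(fai_lines):
    """Yield BED lines (chrom, 0, length) from FASTA index lines."""
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        # BED is 0-based and half-open, so a whole sequence is [0, length).
        yield f"{name}\t0\t{length}"


# Hypothetical index content, not the actual test.fasta.fai.
example_fai = io.StringIO("chrA\t1000\t6\t60\t61\nchrB\t250\t1030\t60\t61\n")
bed_lines = list(fai_to_bed(example_fai))
print("\n".join(bed_lines))
```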
|
@lsantuari: Could you clarify the purpose of these?
|
The ENCODE blacklist is used to filter out SVs that fall in these regions. However, the same filtering is also performed in the downstream analysis, for instance in generate_figure2.R, so it is not necessary at this stage.
The BED file with intervals containing Ns is used to make sure that no SVs overlap with these regions. As with the ENCODE blacklist, this filtering is also performed in the downstream analysis and is therefore not necessary at this stage.
This is used as a check to make sure that we only consider genomic positions within the chromosome intervals. It is used in the function chr_dict_from_bed in functions.py and called here.
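The bounds check described in the last point can be sketched as follows. This is not the `chr_dict_from_bed` function from functions.py: the names `chrom_lengths_from_bed` and `in_bounds` are hypothetical, and the sketch assumes each BED line covers a whole chromosome starting at 0, so the end column equals the chromosome length.

```python
def chrom_lengths_from_bed(bed_lines):
    """Map chromosome name to length, taken from the BED end column."""
    lengths = {}
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, _start, end = line.split("\t")[:3]
        lengths[chrom] = int(end)
    return lengths


def in_bounds(lengths, chrom, pos):
    """True if the 0-based position lies in [0, length) of a known chromosome."""
    return chrom in lengths and 0 <= pos < lengths[chrom]


# Hypothetical BED content, not the actual seqs.bed.
lengths = chrom_lengths_from_bed(["chrA\t0\t1000", "chrB\t0\t250"])
print(in_bounds(lengths, "chrA", 999))   # last valid 0-based position
print(in_bounds(lengths, "chrA", 1000))  # one past the end
```

Because BED ends are exclusive, the stored value is the full length and the last valid 0-based position is length - 1.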
@lsantuari: Are you sure there is always a 1:1 correspondence between FASTA/FAI and BED? I'll change the code accordingly (e.g., using |
@arnikz Yes. Basically what we need are the chromosome IDs and lengths (first and second columns of the FASTA.FAI index). |
Just to make sure: sv-channels/data/test.fasta.fai (lines 1 to 2 in 9866b35)
gives
Previously, sv-channels/scripts/genome_wide/functions.py (line 600 in f14c3dd) stored length - 1,
hence the updated code in sv-channels/scripts/genome_wide/functions.py (line 596 in 6e97987).
|
|
sv-channels should take as input:
All the other files in the data directory must be generated from either the BAM file or the FASTA file: