Better description of how to run the tool would be helpful #4

Open
rsharris opened this issue Jul 12, 2019 · 3 comments

Comments

@rsharris
Contributor

The current readme doesn't clearly describe how the user can use the tool to solve the problem it is intended to solve. If I have an assembly, and I want to identify Y-specific contigs, how do I do that?

My best guess, from trying to run the example in the repo, is that the info about which contigs are Y-specific is encoded in the headers of proportion_annotated_contigs.fastq. But this information is not described in the readme, nor is any step mentioned that will separate Y contigs from the input contigs.

Note that this conclusion is based on the fact that, for me, the output of discoverY.py (proportion_annotated_contigs.fastq) is identical to the input (data/male_contigs.fasta), except that an annotation has been added to each header.

The command I ran was the one shown as "a typical run":
python discoverY.py --female_bloom --mode female+male
But it is not clear whether this is the appropriate command to run for the example. Based on the files provided, and after digging through the code to see which options would cause all the provided files to be used, that was the command I came up with. This would be made clearer by having a "tutorial" section in the readme that showed the command to be run.

It would also be helpful to provide, as part of the example, the expected output. As it stands, I don't know whether my run of discoverY.py worked. It's possible that it is not working and that this has fed into my misunderstanding of how it is supposed to be used.

It's also possible that I don't understand what the example is intended to demonstrate.

The discussion of 'best mode' and the Jupyter notebook should clarify whether this step is intended as part of the typical usage pipeline or not. After having a lot of difficulty with the notebook, looking at it in more detail, and realizing that it doesn't read the output from discoverY.py, my best guess is that this is a pre-computing step, to be run before discoverY.py, to guide the choice of threshold. However (assuming that is true), there's nothing that indicates how the resulting threshold would be used.

To recap: as the tool is currently described, quite a bit of insight, digging, and guesswork is required on the part of the user.

@rsharris
Contributor Author

I should add that when I run the example, it reports a proportion of 1.0 for each and every contig. That seems really strange; it would be a strange example. How can I know whether this is expected, or whether it instead indicates that something's wrong with my installation of the program?

@rsharris
Contributor Author

In the current readme, the threshold output by discoverY.py is described as "proportion_shared_with_female". But I think it is really "proportion_NOT_shared_with_female".

Thus values closer to 1 mean a contig is more likely to be from Y.
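
If that reading is right, separating candidate Y contigs could be a small post-processing step along the lines of the sketch below. This is not part of DiscoverY. It assumes the header layout reported elsewhere in this thread (contig ID, a second numeric field, the proportion, the median), and the function names, the demo contigs, and the 0.5 cutoff are all made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header_without_'>', sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_y_contigs(lines: Iterable[str], threshold: float = 0.5) -> List[Tuple[str, float, str]]:
    """Keep contigs whose proportion (assumed to be the third header field) is >= threshold."""
    selected = []
    for header, seq in read_fasta(lines):
        fields = header.split()
        proportion = float(fields[2])  # assumption: proportion NOT shared with female
        if proportion >= threshold:
            selected.append((fields[0], proportion, seq))
    return selected


# Hypothetical two-contig input; real input would be the annotated contig file.
demo = [
    ">contigA 8 0.93 12.0",
    "ACGT",
    "ACGT",
    ">contigB 8 0.01 95.0",
    "TTTTGGGG",
]
for contig_id, proportion, seq in select_y_contigs(demo, threshold=0.5):
    print(contig_id, proportion, len(seq))
```

On a real run the same function can take an open file handle in place of the `demo` list. The cutoff would presumably come from the notebook step, but the readme would need to confirm that.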

@deilepaita

Agree!

The output of DiscoverY described in README.md should be corrected to "proportion_NOT_shared_with_female", because after running DiscoverY, the contig file has the following header: '>Sc0000000 7492748 0.012910911534003885 102.0'; while the results printed in the terminal are:
'No. of contigs seen so far: 1
Current contig ID is : Sc0000000
Median is: 102.0
Total No. of k-mers from this contig: 7492732
No. of k-mers not shared with female: 96738
Proportion is: 0.012910911534003885'
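
The numbers above can be cross-checked with a few lines of Python. This is only a sanity check on the values quoted in this comment: it recomputes "not shared / total" and compares it with the third header field. The field names in `parse_header` are my own guesses at what each column means, not names taken from the DiscoverY code.

```python
import math


def parse_header(header: str) -> dict:
    """Split an annotated contig header into its four whitespace-separated fields."""
    fields = header.lstrip(">").split()
    return {
        "contig_id": fields[0],
        "second_field": int(fields[1]),   # assumption: probably the contig length
        "proportion": float(fields[2]),   # the value the readme calls "shared with female"
        "median": float(fields[3]),       # matches "Median is" in the terminal output
    }


header = ">Sc0000000 7492748 0.012910911534003885 102.0"
total_kmers = 7492732       # "Total No. of k-mers from this contig"
not_shared_kmers = 96738    # "No. of k-mers not shared with female"

record = parse_header(header)
recomputed = not_shared_kmers / total_kmers

# True if the header proportion equals not_shared / total (up to rounding),
# i.e. the field is the proportion NOT shared with female.
print(math.isclose(recomputed, record["proportion"], rel_tol=1e-6))
```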

Another correction that should be made is to the description of how to compute k-mers from male reads, which currently reads:
"cd dependency
ln -s ../data/female.fasta #make sure the correct reads file is provided to DSK
./run_dsk_Linux.sh r1.fastq 25"
which is misleading for new users. Why does the user need to soft-link female.fasta into dependency if it is completely unnecessary for running DSK?


2 participants