-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Best snippy output file to use? #41
Comments
@pgcudahy depends on what you need. If you want to calculate distances with all the different |
Thanks for the considered response. Ultimately I want snp distances for looking at transmission. It seemed to me that |
@pgcudahy see my reply in your other issue. Perhaps a better strategy might be to be ruthless and remove everything that is remotely dodgy. Maybe you end up with just 100 samples or even fewer. But, it gives you a starting point to go from end to end with your analysis with confidence in the data. Once you have a good idea of how things should look and behave with a high quality subset, you can start relaxing your criteria for inclusion. This way, you have a solid baseline to compare to in order to evaluate different inclusion/exclusion criteria. Keep in mind that the core SNPs will change depending on the samples you include. My limited experience with transmission analysis in TB using genomics is that every SNP counts. People are ruthless in excluding them using stringent criteria for variant filtering, and samples need to be well justified to be included in regards to their quality. To answer your question, the size of the genome is not really important in terms of what you might expect the core genome size to be. As I mentioned in the other issue, the diversity of the sample, the quality and evolutionary distance of the reference genome, and the quality and number of samples will all play a role. Additionally, repeat elements, mobile elements, recombination (not a big issue in TB, AFAIK), and any other genomic elements not present in the MRCA of the samples (and therefore not vertically inherited by all samples in your collection) will affect your core genome. Best of luck. |
OK, thanks again for your very patient and thoughtful answers. I did as you suggested and used fastp to clean up the fastq files and kraken2 to filter out the non-Mtb labelled reads from the fastq files. Then I removed the samples with <40x depth and finally ran the remaining files through snippy and snp-dists. Now I get a core genome of 7325bp which still seems short for a reference of > 4,000,000 bp. To check this, I did 100 trials where I picked ten samples at random and ran them through
So it looks like my final result is reasonable for my samples. Does this seem right to you, to have a core genome that short? |
Silly question, but is it preferable to use
core.aln
orcore.full.aln
fromsnippy-core
as the input tosnp-dists
, or even some other file output fromsnippy
?The text was updated successfully, but these errors were encountered: