Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best snippy output file to use? #41

Open
pgcudahy opened this issue Aug 26, 2020 · 4 comments
Open

Best snippy output file to use? #41

pgcudahy opened this issue Aug 26, 2020 · 4 comments

Comments

@pgcudahy
Copy link

Silly question, but is it preferable to use core.aln or core.full.aln from snippy-core as the input to snp-dists, or even some other file output from snippy?

@andersgs
Copy link

@pgcudahy depends on what you need. If you want to calculate distances with all the different N and - that you will see in the core.full.aln, then use that. If you just want a clean core SNP distance, then use core.aln.

@pgcudahy
Copy link
Author

pgcudahy commented Sep 1, 2020

Thanks for the considered response. Ultimately I want snp distances for looking at transmission. It seemed to me that core.aln was better but the reason I had asked is I've been having trouble creating core.aln because I have hundreds of samples and the lowest quality ones either make snippy-core fail to create the core.aln file or result in a very short core genome. Are there any rules of thumb for criteria to filter the individual aligned fasta files (snps.aligned.fa) to get a good core.aln? With a reference length of ~ 4,100,000 bp, what roughly would a good quality core genome length be?

@andersgs
Copy link

andersgs commented Sep 4, 2020

@pgcudahy see my reply in your other issue.

Perhaps a better strategy might be to be ruthless and remove everything that is remotely dodgy. Maybe you end up with just 100 samples or even fewer. But, it gives you a starting point to go from end to end with your analysis with confidence in the data.

Once you have a good idea of how things should look and behave with a high quality subset, you can start relaxing your criteria for inclusion. This way, you have a solid baseline to compare to in order to evaluate different inclusion/exclusion criteria. Keep in mind that the core SNPs will change depending on the samples you include.

My limited experience with transmission analysis in TB using genomics is that every SNP counts. People are ruthless in excluding them using stringent criteria for variant filtering, and samples need to be well justified to be included in regards to their quality.

To answer your question, the size of the genome is not really important in terms of what you might expect the core genome size to be. As I mentioned in the other issue, the diversity of the sample, the quality and evolutionary distance of the reference genome, and the quality and number of samples will all play a role. Additionally, repeat elements, mobile elements, recombination (not a big issue in TB, AFAIK), and any other genomic elements not present in the MRCA of the samples (and therefore not vertically inherited by all samples in your collection) will affect your core genome.

Best of luck.

@pgcudahy
Copy link
Author

OK, thanks again for your very patient and thoughtful answers. I did as you suggested and used fastp to clean up the fastq files and kraken2 to filter out the non-Mtb labelled reads from the fastq files. Then I removed the samples with <40x depth and finally ran the remaining files through snippy and snp-dists. Now I get a core genome of 7325bp which still seems short for a reference of > 4,000,000 bp. To check this, I did 100 trials where I picked ten samples at random and ran them through snp-dists and got the core genome length in base pairs. The distribution was

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1209    2117    2408    2448    2768    4024 

image

So it looks like my final result is reasonable for my samples. Does this seem right to you, to have a core genome that short?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants