Assess Popgen48/scalepopgen #80

Closed
Tracked by #65
muffato opened this issue Jun 10, 2024 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@muffato
Member

muffato commented Jun 10, 2024

We need to review how many of our population genomics ideas Popgen48/scalepopgen can already handle, to determine:

  • if it could be used as is
  • if it could be used with modifications
  • if we'd rather extract and replicate some functionality here

Links: poster

Summary

  1. All the different tools and analyses can be independently enabled.
  2. There were a couple of things to do to the input VCF files, but then the pipeline runs fine.
  3. We'd want to clarify whether it's going to be part of nf-core or not.
  4. We need to decide in which pipeline (scalepopgen or a new pipeline) the ROH and population size analyses should go.

Next developments

Based on the tests above, to use scalepopgen, we would want to:

  • Add automatic splitting of VCF files by chromosome
  • Handle non-numeric chromosome names when plotting TajimaD and SweepFinder results
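Until the plotting steps accept arbitrary chromosome names, one possible workaround is to rename the chromosomes to integers before running the pipeline. The sketch below builds a rename map for `bcftools annotate --rename-chrs`; the function name `make_chrom_map` and the file names in the comments are hypothetical, and numbering chromosomes in input order is an assumption (sex chromosomes or unplaced scaffolds would simply get the next integer).

```shell
#!/bin/sh
# make_chrom_map (hypothetical helper)
# Reads chromosome names (one per line) on stdin and prints "old<TAB>new",
# numbering them 1, 2, 3, ... in input order. Blank lines are skipped.
make_chrom_map() {
    awk 'NF { n++; print $1 "\t" n }'
}

# Possible usage (file names are placeholders):
#   bcftools index -s in.vcf.gz | cut -f 1 | make_chrom_map > chrom_map.txt
#   bcftools annotate --rename-chrs chrom_map.txt -O z -o renamed.vcf.gz in.vcf.gz
```

Keeping the map file around makes it possible to translate the integer labels on the plots back to the original chromosome names.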
@hangxue-wustl
Collaborator

hangxue-wustl commented Jul 2, 2024

Requirements for input files:

  1. All VCF files need to be split by chromosome and indexed with tabix.
  2. The sample map has two tab-delimited columns and no header line. The first column holds individual IDs and the second holds population IDs.

vcf_input.csv:
chrom,vcf,vcf_idx
chr1,chrom1.vcf.gz,chrom1.vcf.gz.tbi
chr2,chrom2.vcf.gz,chrom2.vcf.gz.tbi

sample.map:
ind1 pop1
ind2 pop1
ind3 pop2
ind4 pop2
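A quick sanity check of the sample map before launching the pipeline can catch space-delimited or incomplete lines. This is a minimal sketch, not part of scalepopgen: the function names `check_sample_map` and `count_by_population` are hypothetical, and it only checks the shape described above (two non-empty tab-separated fields per line). It cannot tell a header line from a data line.

```shell
#!/bin/sh
# check_sample_map FILE (hypothetical helper)
# Succeeds only if every line has exactly two non-empty tab-separated fields.
check_sample_map() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected <individual><TAB><population>\n", NR > "/dev/stderr"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# count_by_population FILE (hypothetical helper)
# Prints "population<TAB>count", sorted by population ID.
# Assumes population IDs contain no whitespace.
count_by_population() {
    cut -f 2 "$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

Running `check_sample_map sample.map && count_by_population sample.map` gives a per-population sample count that can be compared against expectations before a long run.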

Splitting the VCF file by chromosome:
bcftools index -s mLutLut_renamed_autosomes_bisnps.vcf.gz | cut -f 1 | while read -r C; do bcftools view -O z -o "split.${C}.vcf.gz" mLutLut_renamed_autosomes_bisnps.vcf.gz "${C}" ; done
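Once the per-chromosome files exist and have been indexed (for example with `tabix -p vcf` on each file), the vcf_input.csv table can be generated rather than written by hand. The sketch below assumes the files are named `split.<chrom>.vcf.gz` in a single directory, as in the split command above; the function name `write_vcf_input_csv` is hypothetical, and it does not check that the `.tbi` files are actually present.

```shell
#!/bin/sh
# write_vcf_input_csv DIR (hypothetical helper)
# Prints a chrom,vcf,vcf_idx table with one row per DIR/split.<chrom>.vcf.gz.
write_vcf_input_csv() {
    dir=$1
    echo "chrom,vcf,vcf_idx"
    for vcf in "$dir"/split.*.vcf.gz; do
        [ -e "$vcf" ] || continue      # no match: the glob is left unexpanded
        chrom=${vcf##*/split.}         # drop the directory and "split." prefix
        chrom=${chrom%.vcf.gz}         # drop the extension
        echo "${chrom},${vcf},${vcf}.tbi"
    done
}

# Possible usage:
#   write_vcf_input_csv "$PWD" > vcf_input.csv
```

Rows come out in the shell's glob order, so chr10 sorts before chr2. That should only matter if a downstream step relies on row order.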

@hangxue-wustl
Collaborator

hangxue-wustl commented Jul 5, 2024

Downloaded the supplementary data from https://doi.org/10.1093/molbev/msad207 and followed EurasianOtter_PopGen.html to obtain the vcf.gz files, rename the samples, and select only autosomes and biallelic SNPs for analysis. Split the VCF file by chromosome using bcftools, then ran "nextflow run scalepopgen -profile singularity -params-file /global/scratch/users/hangxue/otter/vcf_publication/jul4_parameters.yml -qs 10". See the output graphs at https://docs.google.com/presentation/d/1O8vFmYImrJd6p4pvSLyzwiMsf9fTAZSTaG_FJGLz8t8/edit#slide=id.p

@hangxue-wustl
Collaborator

hangxue-wustl commented Jul 18, 2024

Tested PCA, Admixture, pairwise Fst and Treemix in scalepopgen. These run successfully with few modifications. Scalepopgen can also compute Tajimas_D and search for selective sweeps (SweepFinder2), but plotting these two results requires the chromosome names to be integers. Of all these analyses, SweepFinder2 takes the longest (~7 hr for the otter data), followed by Admixture (~1 hr).
Additional potential analyses:

  1. ROH identification (e.g. RZooRoH)
  2. Population-size inference (e.g. GONE)

@muffato
Member Author

muffato commented Aug 19, 2024

Regarding the otter data: here is more information about the sample confusion that occurred during that project.

The label swaps were very visible on the admixture plots: see left (labels corrected) vs right (wrong labels).
[Image: Admixture]
In your pipeline run, only k=2 is a bit messy; all the other k are clean. I think you may have the correct labels, and the differences are due to different methods / parameters?

@hangxue-wustl
Collaborator

I have double-checked the labels. I think the ones I am working with are labeled correctly. Yeah, I think the difference might be due to different software / parameters.

@muffato muffato closed this as completed Sep 18, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Genome After Party Sep 18, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in variantcalling Sep 18, 2024