[FEATURE REQUEST] Discrepancy in CheckM Results Between PGAP and Standalone Runs #329

Open
jonniechan opened this issue Jan 9, 2025 · 3 comments

Comments

@jonniechan

I am writing about an issue I encountered when running PGAP on my genome data. As part of the pipeline, PGAP reports CheckM results. However, I noticed a significant discrepancy between the CheckM results reported by PGAP and those I obtained by running CheckM standalone on the same genome.

PGAP command: ./pgap.py -n -o /result -g fasta/1C88.fasta -s "Staphylococcus epidermidis" --no-internet --debug --docker singularity --container-path ./pgap.sif --ignore-all-errors

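PGAP result:

| Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| annotation | Staphylococcus epidermidis (6) | 20 | 933 | 208 | 467 | 465 | 0 | 1 | 0 | 0 | 49.85 | 0.32 | 33.33 |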

CheckM command: checkm analyze staphylococcus_epidermidis.ms ./1C88_bins ./1C88_staphylococcus_outputtttt -x fna -t 30

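CheckM result:

| Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1C88_PGAP | Staphylococcus epidermidis (6) | 20 | 933 | 208 | 53 | 873 | 5 | 2 | 0 | 0 | 94.51 | 1.12 | 18.18 |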
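
As a quick sanity check on the two CheckM rows above, here is a minimal Python sketch that confirms the copy-count columns (0 through 5+) add up to the 933 markers in both runs and shows how many markers each run is missing. The dictionary layout and the `summarize` helper are my own illustration, not part of CheckM or PGAP. The "naive" percentage is simply the share of markers seen at least once; as I understand it, CheckM's reported completeness is averaged over marker sets, so the two numbers are only expected to be close, not identical.

```python
# Marker counts copied from the two CheckM rows in this report.
# Columns 0, 1, 2, 3, 4, 5+ give the number of markers found that many times.
ROWS = {
    "PGAP (annotation)": {
        "markers": 933,
        "counts": [467, 465, 0, 1, 0, 0],
        "completeness": 49.85,
    },
    "standalone (1C88_PGAP)": {
        "markers": 933,
        "counts": [53, 873, 5, 2, 0, 0],
        "completeness": 94.51,
    },
}


def summarize(row):
    """Return (missing, found, naive_pct) for one CheckM result row."""
    counts = row["counts"]
    # The copy-count columns should account for every marker exactly once.
    assert sum(counts) == row["markers"], "count columns do not add up"
    missing = counts[0]
    found = row["markers"] - missing
    # Naive share of markers seen at least once (an assumption for
    # illustration; not CheckM's marker-set-based completeness formula).
    naive_pct = 100.0 * found / row["markers"]
    return missing, found, naive_pct


if __name__ == "__main__":
    for name, row in ROWS.items():
        missing, found, naive_pct = summarize(row)
        print(
            f"{name}: missing={missing} found={found} "
            f"naive={naive_pct:.2f}% reported={row['completeness']:.2f}%"
        )
```

Read this way, the gap sits almost entirely in column 0: the PGAP run reports 467 of the 933 markers as absent, against 53 in the standalone run.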

The genome data I used for both analyses is identical. In addition, GC content, total genome size, N50, and gene count showed no notable differences from the NCBI complete reference genome.
I also ran CheckM2:
| Name | Completeness | Contamination | Completeness_Model_Used | Translation_Table_Used | Coding_Density | Contig_N50 | Average_Gene_Length | Genome_Size | GC_Content | Total_Coding_Sequences | Total_Contigs | Max_Contig_Length | Additional_Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1C88_PGAP | 81.25 | 6.64 | Neural Network (Specific Model) | 11 | 0.797 | 2487636 | 199.0716743119266 | 2589755 | 0.32 | 3488 | 5 | 2487636 | None |

Are there any specific parameters or configurations in PGAP's implementation of CheckM that could explain the observed differences?
Could this discrepancy be related to differences in the CheckM database versions or other factors?
Your guidance on this matter would be greatly appreciated. Please let me know if you require any additional details about the genome or the analysis setup.

Thank you for your support.

@azat-badretdin
Contributor

Thank you for your report, @jonniechan.

First of all, CheckM2 is quite a different beast from CheckM in both methodology and results (for example, it does not report a marker lineage in its output).

As for the comparison between standalone CheckM and PGAP's CheckM: the difference is significant, and I have opened an internal investigation into it.

Is the genome public? If not, can you share it?

@jonniechan
Author

Hi @azat-badretdin, I uploaded it to my repository.

@azat-badretdin
Contributor

Thanks!
