[FEATURE REQUEST] <title>Discrepancy in CheckM Results Between PGAP and Standalone Runs #329

jonniechan · 2025-01-09T07:59:43Z

I am writing about an issue I encountered when running PGAP on my genome data. As part of the pipeline, PGAP provides CheckM results. However, I noticed a significant discrepancy between the CheckM results reported by PGAP and those I obtained by running standalone CheckM on the same genome.

PGAP code: ./pgap.py -n -o /result -g fasta/1C88.fasta -s "Staphylococcus epidermidis" --no-internet --debug --docker singularity --container-path ./pgap.sif --ignore-all-errors

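PGAP result:

| Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| annotation | Staphylococcus epidermidis (6) | 20 | 933 | 208 | 467 | 465 | 0 | 1 | 0 | 0 | 49.85 | 0.32 | 33.33 |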

CheckM code: checkm analyze staphylococcus_epidermidis.ms ./1C88_bins ./1C88_staphylococcus_outputtttt -x fna -t 30

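CheckM result:

| Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1C88_PGAP | Staphylococcus epidermidis (6) | 20 | 933 | 208 | 53 | 873 | 5 | 2 | 0 | 0 | 94.51 | 1.12 | 18.18 |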
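As a quick sanity check on the two tables above, the per-copy marker counts can be compared directly. The sketch below is only an illustration: it checks that the counts in the 0 to 5+ columns add up to the reported number of markers, and computes a naive "fraction of markers found at least once". That naive figure is an assumption made for this example and is not CheckM's own formula, which weights markers by collocated marker set, so it will only approximate the reported completeness. The names `RUNS` and `naive_completeness` are hypothetical.

```python
# Values copied from the two CheckM tables above.
# "counts" holds the number of markers found 0, 1, 2, 3, 4 and 5+ times.
RUNS = {
    "PGAP": {"markers": 933, "counts": [467, 465, 0, 1, 0, 0], "reported": 49.85},
    "standalone": {"markers": 933, "counts": [53, 873, 5, 2, 0, 0], "reported": 94.51},
}


def naive_completeness(counts):
    """Percentage of markers found at least once.

    This is NOT CheckM's marker-set-weighted completeness; it is a rough
    approximation used here only to compare the two runs.
    """
    total = sum(counts)
    return 100.0 * (total - counts[0]) / total


for name, run in RUNS.items():
    # The per-copy counts should account for every marker in the set.
    assert sum(run["counts"]) == run["markers"]
    print(
        f"{name}: missing={run['counts'][0]}, "
        f"naive={naive_completeness(run['counts']):.2f}%, "
        f"reported={run['reported']:.2f}%"
    )
```

With these numbers the naive figure comes out at roughly 49.9% for the PGAP run and 94.3% for the standalone run, close to the reported values. The gap is therefore almost entirely in the "0" column: 467 markers were not found in the PGAP run versus 53 in the standalone run.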

The genome data I used for both analyses is identical. There was also no notable difference in GC content, total genome size, N50, or gene count compared to the NCBI complete reference genome.
I also ran CheckM2.
CheckM2:

| Name | Completeness | Contamination | Completeness_Model_Used | Translation_Table_Used | Coding_Density | Contig_N50 | Average_Gene_Length | Genome_Size | GC_Content | Total_Coding_Sequences | Total_Contigs | Max_Contig_Length | Additional_Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1C88_PGAP | 81.25 | 6.64 | Neural Network (Specific Model) | 11 | 0.797 | 2487636 | 199.0716743119266 | 2589755 | 0.32 | 3488 | 5 | 2487636 | None |

Are there any specific parameters or configurations in PGAP's implementation of CheckM that could explain the observed differences?
Could this discrepancy be related to differences in the CheckM database versions or other factors?
Your guidance on this matter would be greatly appreciated. Please let me know if you require any additional details about the genome or the analysis setup.

Thank you for your support.

azat-badretdin · 2025-01-10T14:29:59Z

Thank you for your report, user @jonniechan

First of all, CheckM2 is quite a different beast in both methodology and results (it does not report a marker lineage, for example).

As for the comparison of standalone CheckM with PGAP's CheckM: the difference is significant, and I have opened an internal investigation into it.

Is the genome public? If not, can you share it?

jonniechan · 2025-01-12T03:55:45Z

Hi, I uploaded it to my repository. @azat-badretdin

azat-badretdin · 2025-01-13T10:01:50Z

Thanks!

azat-badretdin added the PGAPX-1470 label Jan 10, 2025
