
Zero coverage for all samples #17

Open
cedwardson4 opened this issue Feb 5, 2018 · 14 comments

@cedwardson4

I am having an issue with "refinem outliers", where I'm getting an error: "invalid value encountered in double_scalars." This originally happened in v0.0.22, but after checking the GitHub issues page I updated to v0.0.23 and still hit the same error.

I examined my scaffold_stats output, and found that the coverage values in coverage.tsv were all zero (in both v0.0.22 and v0.0.23). I looked at some output on a different dataset run on v0.0.21 and did not see this.

It is possible this is an issue with how I am calling "refinem scaffold_stats", but I'm not getting a warning that it isn't reading my files correctly (refinem scaffold_stats appears to run without error).

@donovan-h-parks
Owner

Hello. The "scaffold_stats" command has not changed between v0.0.21 and v0.0.23. Can you run your other dataset through v0.0.23 to confirm that it isn't a RefineM issue? Assuming that runs cleanly, it would suggest something unique about your current dataset, which would help narrow down the problem.

@cedwardson4
Author

OK. I think this has something to do with my dataset.

The headers of the BAM files and the contigs are SPAdes-style but were "fixed" with sed and samtools to be compatible with Anvio:
for BAM in *.bam
do
samtools view -H "$BAM" | sed 's/\./_/g' | samtools reheader - "$BAM" > "$BAM.ok.bam"
done

sed 's/\./_/g' all_spades_contigs.fasta > all_spades_contigs_fixed.fasta

I also confirmed that the bins have matching headers.
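A quick way to double-check that kind of header agreement (a hedged sketch of my own, not part of RefineM; the file path `all_spades_contigs_fixed.fasta` and the `bins/*.fasta` layout are assumptions, and only FASTA headers are compared here, not the BAM references):

```python
import glob
import os


def fasta_ids(path):
    """Collect the first whitespace-delimited token of every '>' header line."""
    ids = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                ids.add(line[1:].split()[0])
    return ids


def check_bins(assembly_fasta, bins_glob="bins/*.fasta"):
    """Return {bin_path: headers present in the bin but absent from the assembly}."""
    assembly = fasta_ids(assembly_fasta)
    problems = {}
    for bin_path in sorted(glob.glob(bins_glob)):
        extra = fasta_ids(bin_path) - assembly
        if extra:
            problems[bin_path] = sorted(extra)
    return problems


# Example usage (paths are assumptions from this thread):
if os.path.exists("all_spades_contigs_fixed.fasta"):
    print(check_bins("all_spades_contigs_fixed.fasta"))
```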

@cedwardson4
Author

I believe the problem is that my reads are single-end, not paired-end. Example output:
[2018-02-06 23:08:18] INFO: Calculating coverage profile for 31b_SL115863_0_trimmed.sorted.bam (10 of 10):
Finished processing 355358 of 355358 (100.00%) reference sequences.

total reads: 25707472

# properly mapped reads: 0 (0.0%)
# duplicate reads: 0 (0.0%)
# secondary reads: 914257 (3.6%)
# reads failing QC: 0 (0.0%)
# reads failing alignment length: 4241798 (16.5%)
# reads failing edit distance: 3269395 (12.7%)
# reads not properly paired: 17282022 (67.2%)
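The log above suggests why coverage comes out as zero: 67.2% of the reads fail the "not properly paired" filter, and a single-end read is never properly paired. A minimal sketch of that filtering logic (my own illustration, not RefineM's actual code; the `all_reads` parameter mirrors the `--all_reads` flag mentioned elsewhere in this thread):

```python
def counts_toward_coverage(is_paired, is_proper_pair, all_reads=False):
    """Mimic a 'properly paired reads only' default with an all-reads override."""
    if all_reads:
        return True
    return is_paired and is_proper_pair


# A single-end read is never 'properly paired', so it is dropped by default:
assert counts_toward_coverage(False, False) is False
# With an --all_reads-style option, the same read is counted:
assert counts_toward_coverage(False, False, all_reads=True) is True
```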

@donovan-h-parks
Owner

Hello. That would do it. If you pass the "--all_reads" flag, RefineM will use singleton reads in the coverage estimate.

@cedwardson4
Author

OK that worked, but the error in refinem outliers did not disappear.

[2018-02-07 23:33:01] INFO: RefineM v0.0.23
[2018-02-07 23:33:01] INFO: refinem outliers coverage_stats_new_bins/scaffold_stats.tsv bin_outliers_new_bins
[2018-02-07 23:33:01] INFO: Reading scaffold statistics.
[2018-02-07 23:33:22] INFO: Calculating statistics for 336 genomes over 355358 scaffolds.
[2018-02-07 23:33:42] INFO: Reading reference distributions.
Finding outliers in 197 of 336 (58.6%) genomes.
Unexpected error: <type 'exceptions.FloatingPointError'>
Traceback (most recent call last):
File "/usr/local/bin/refinem", line 396, in <module>
parser.parse_options(args)
File "/usr/local/lib/python2.7/dist-packages/refinem/main.py", line 687, in parse_options
self.outliers(options)
File "/usr/local/lib/python2.7/dist-packages/refinem/main.py", line 280, in outliers
options.report_type, outlier_file)
File "/usr/local/lib/python2.7/dist-packages/refinem/outliers.py", line 484, in identify
cov_perc)
File "/usr/local/lib/python2.7/dist-packages/refinem/outliers.py", line 390, in outlier_info
corr_r, _corr_p = pearsonr(gs.median_coverage, stats.coverage)
File "/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py", line 3029, in pearsonr
r = r_num / r_den
FloatingPointError: invalid value encountered in double_scalars
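For what it's worth, this failure mode can be reproduced outside RefineM. Pearson correlation divides by the product of the two vectors' deviations, so a zero-variance (e.g. all-zero coverage) vector makes the denominator zero, and with NumPy's error state set to raise, the 0/0 becomes a FloatingPointError. A minimal repro of my own (mirroring scipy's `r_num / r_den` division, not RefineM's code; the `seterr` setting is an assumption about how RefineM configures NumPy):

```python
import numpy as np

np.seterr(all="raise")  # strict floating-point handling (assumption)


def pearson_r(x, y):
    """Bare-bones Pearson r, mirroring the r_num / r_den division in the traceback."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    r_num = np.dot(xm, ym)
    r_den = np.sqrt(np.dot(xm, xm) * np.dot(ym, ym))
    return r_num / r_den  # 0/0 when either vector has zero variance


try:
    pearson_r([0.0, 0.0, 0.0], [2.1, 3.5, 1.8])  # zero-coverage contig
except FloatingPointError as err:
    print("FloatingPointError:", err)
```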

@donovan-h-parks
Owner

Hello. If you can send me your scaffold_stats.tsv file I can look into what is causing the issue. donovan[dot]parks[at]gmail.com

@cedwardson4
Author

I've sent you a Google Drive link, since it's 91 MB compressed.

@donovan-h-parks
Owner

donovan-h-parks commented Feb 10, 2018

Hello. The issue is that one of your contigs has zero coverage across all your samples (namely, c_000000311451). This probably speaks to an underlying issue with your assembly and/or read mapping files, as many of the contigs have extremely low coverage. I will add a warning in the next release of RefineM so the program continues to run and reports such problematic contigs, but it is probably best to try to determine why the coverage is so low for some contigs.

If you wish to move forward with the current data, you need to remove c_000000311451 from the scaffold_stats.tsv file.
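If you go that route, here is a hedged sketch of the filtering step (my own helper, not a RefineM command; it assumes the scaffold id is the first tab-separated column of scaffold_stats.tsv, so check your file before relying on it):

```python
def drop_scaffolds(in_tsv, out_tsv, bad_ids):
    """Copy a scaffold_stats-style TSV, skipping rows whose first column is in bad_ids."""
    with open(in_tsv) as fin, open(out_tsv, "w") as fout:
        for line in fin:
            if line.split("\t", 1)[0] not in bad_ids:
                fout.write(line)


# Example usage with the contig named above:
# drop_scaffolds("scaffold_stats.tsv", "scaffold_stats.filtered.tsv", {"c_000000311451"})
```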

@cedwardson4
Author

cedwardson4 commented May 27, 2018

Hi, I'm just getting back to this problem. I'm still getting the error after exploring some things as described below.

First, these bins were from 10 samples, 5 depths in a lake, total sample coassembly.

  1. I realized that SPAdes never actually finished correctly, which might have been part of the issue.
  2. Next, I used a per-depth assembly to bin with metabat2. I was still getting zero coverage on a large number of contigs, and realized I had assembled with a fastq file whose name differs from the individual read fastq files (probably because I concatenated them, but I can't find the original file to verify).
  3. bwa mem defaults to an edit distance of 0.04, whereas refinem uses 0.02. I tried 0.04 and still got zero-coverage contigs.
  4. All of the contigs with zero coverage are smaller than 2500 bp, which was the minimum contig size for my binning with metabat2.

Points 1, 2, and 3 may not be relevant, but 4 is: the bottom line is that for calculating "bin outliers", which is what I want to do, I don't need coverage for these contigs because they aren't in any bins.

@donovan-h-parks
Owner

Hi. Can I consider this issue closed? It appears the zero coverage contigs are real and not an issue of RefineM. If you wish to use RefineM, you will need to remove all zero coverage contigs from any files provided to the program.

@cedwardson4
Author

OK. Thanks!

@cedwardson4
Author

Reopening this issue, but also this is sort of a follow-up to #11. After removing the zero-coverage contigs, remapping, and recalculating scaffold stats, as well as removing single-contig bins (as suggested in #11), I got a new stats.py error:

prob = _betai(0.5*df, 0.5, df/(df+t_squared))
FloatingPointError: invalid value encountered in double_scalars

After looking around in the code and trying to figure things out, I came across a scipy issue that seemed relevant (I'm using two BAM files), so I manually changed the stats.py code to the fix mentioned in that issue.
Now I get a different error:
File "/usr/local/lib/python2.7/dist-packages/refinem/outliers.py", line 390, in outlier_info
corr_r, _corr_p = pearsonr(gs.median_coverage, stats.coverage)
File "/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py", line 3010, in pearsonr
r = r_num / r_den
FloatingPointError: invalid value encountered in double_scalars

I'm struggling to figure this one out, so I'm not sure what is going on here, but I'm also wondering whether this Pearson calculation is necessary for my data, since on the front page you state that only the "mean absolute error" criterion is used?

@cedwardson4 cedwardson4 reopened this May 30, 2018
@donovan-h-parks
Owner

Can you confirm you are using the latest version of RefineM? If you can send me a simple example that produces this problem I can take a look. Ideally, a single genome and the exact RefineM commands you ran that result in the issue.

@SagesWang

Hi, I have encountered the same issue, but after I filtered out the contigs whose Genome Id is "unbinned" or "bin.unbinned", the issue was gone.
The command I used to do the filtering is as follows:
awk -F"\t" '$2!="unbinned" && $2!="bin.unbinned"' scaffold_stats.tsv > scaffold_stats_filter.tsv
Hope this information helps!
