Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pysam crashes when quality only has a single value of "*" #1211

Open
ymcki opened this issue Aug 19, 2023 · 2 comments
Open

pysam crashes when quality only has a single value of "*" #1211

ymcki opened this issue Aug 19, 2023 · 2 comments

Comments

@ymcki
Copy link

ymcki commented Aug 19, 2023

I have a unmapped bam with many entries of basecalls with only one base and a quality of "*".

pysam 0.21.0 then crashes whenever it reaches a line like that.

Traceback (most recent call last):
  File "./q.py", line 8, in <module>
    sum_bq += sum(read.query_qualities)
TypeError: 'NoneType' object is not iterable

The simple progam that allows you replicate the problem is:

import pysam

bam = sys.argv[1]

sum_bq = 0
bam_file = pysam.AlignmentFile(bam, "r", check_sq=False, threads=4)
for read in bam_file:
    sum_bq += sum(read.query_qualities)

This is the content of the ubam file presented in sam format I used to reproduce the error

@HD	VN:1.6	SO:unknown
@PG	ID:basecaller	PN:dorado	VN:0.2.1+c70423e	CL:dorado duplex -t 96 /models/[email protected] /pod5/ --pairs pairs_from_bam/pair_ids_filtered.txt
9642cde2-f8c3-4741-86cc-0fcbe29776d7;7c1c2b91-f9ff-4d5d-87c5-83c29b74c749	4	*	0	0	*	*	0	1	C	*	qs:i:9
88f36782-d9dc-45fa-92e7-347f1493b65b;ee2c206e-41bb-4c2a-9575-5170fe7f3ec0	4	*	0	0	*	*	0	1	C	*	qs:i:9
0f2f76b2-beb8-4625-8f75-a8065a72cbe6;51d2ca59-ba74-4f82-82d3-09490bbf8b21	4	*	0	0	*	*	0	1	T	*	qs:i:9
38b6482b-0c91-42d0-92b1-6d4298d0eba5;7ade0e9f-c25c-43be-99de-d7bf59e660a9	4	*	0	0	*	*	0	1	T	*	qs:i:9
d7dc1c40-d297-4c2e-94cf-015d80dc7aab;d38c2d9f-e3bd-4c23-9cc7-5c9e6b704528	4	*	0	0	*	*	0	1	C	*	qs:i:9
8d7d8c03-c449-4bf0-af44-36ae3271d6fe;2a851990-847f-42b6-9ad0-f2b809868599	4	*	0	0	*	*	0	1	C	*	qs:i:9
55e10091-8299-446b-a31a-ca3c5f57088a;47ab3a62-383c-478d-bc0e-030be2da5786	4	*	0	0	*	*	0	1	T	*	qs:i:9

My guess is that pysam treated single "*" as this field is empty instead of single base quality of *

@jmarshall
Copy link
Member

In SAM, in general a QUAL field of * indicates that base qualities are not available. Historically no-one has particularly cared about single-base reads and it is unspecified whether in that case * indicates unavailable or a base quality of 9. That ambiguity is samtools/hts-specs#715, on which you may wish to express an opinion.

However there is no such ambiguity in BAM. If the “ubam file” you are actually feeding to this script is a BAM file, then getting None here indicates that the record really does have QUAL absent. (However the qs:i:9 tag, if it is indeed an average base quality score, suggests that this data originated from a SAM file that intended QUAL = * to mean base quality 9…)

Pysam is not crashing here. Your script is crashing when read.query_qualities returns None, which your script is not dealing with. This property is None when the QUAL field is absent, and your script probably needs to deal with this possibility.

Are these single-base reads important in your analysis, or are they a handful of degenerate reads that could be filtered out without adverse effect? Are there other single-base reads present in your data with other characters in their QUAL fields?

@ymcki
Copy link
Author

ymcki commented Aug 20, 2023

Thanks for your reply. I don't know about there is an undefined spec regarding this situation. Intuitively, I would think it is a way to specify a single base read with quality 9.

I encountered this situation while using ONT's dorado basecaller. Anyway, I will relay this ambiguity to the dorado basecaller team and see what they want to do with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants