Pangolin lineage calling despite "N" at defining SNP position #492
Comments
Thanks for reporting this @MantaRay87. I think I might know the cause but it would help to have some more info:
When a sequence has an 'N' or other IUPAC ambiguous base, UShER will impute the value based on matches at other positions. When there are multiple equally parsimonious placements (EPPs), AFAIK UShER picks the node with the most descendants, which would make it pick BA.5.2 when the sequence matches both BA.5.2 and BA.5.2.28 because of the N at 4575. But when adding UShER mode to pangolin, I added a bit of logic to override UShER's favorite when there was a plurality of matches in one lineage. Unfortunately, I have seen some cases in which one lineage gets a plurality of matches only because of Ns matching multiple branches within the lineage, so that "voting" might be harming more than helping, given the amount of amplicon dropout that is common now. Without looking at complete output for your sequences (or ideally, running pangolin on the sequences myself), I can't be sure that's causing the odd assignments, but that's my best guess.
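To make the difference between the two selection rules concrete, here is a toy sketch. It is not pangolin's or UShER's actual code: the function names, the placement records, and the descendant counts are all invented for illustration, and the only thing taken from the thread is the idea of "most descendants" versus "plurality of matches per lineage".

```python
from collections import Counter

def usher_default(placements):
    """Pick the lineage of the placement whose node has the most descendants."""
    return max(placements, key=lambda p: p["descendants"])["lineage"]

def plurality_vote(placements):
    """Pick the lineage annotated on the most placements; None on a tie."""
    counts = Counter(p["lineage"] for p in placements).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical equally parsimonious placements for one sequence with an N
# at the lineage-defining site. All values below are made up.
placements = [
    {"node": "node_a", "lineage": "BA.5.2", "descendants": 50000},
    {"node": "node_b", "lineage": "BA.5.2.28", "descendants": 40},
    {"node": "node_c", "lineage": "BA.5.2.28", "descendants": 12},
]

print(usher_default(placements))   # lineage of the largest node
print(plurality_vote(placements))  # lineage with the most placements
```

In this toy input the N lets the sequence match two small branches inside BA.5.2.28, so the vote goes to BA.5.2.28 even though the single largest matching node is BA.5.2.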
Hey @AngieHinrichs, thanks for the fast reply. I sure can: The corresponding GISAID IDs can be found here: The ONT sequences in which all 4 samples are called as BA.5.2.21 are not public. If you also need the output and sequences here, let me know :)
Perfect, thanks. Yes, it is what I guessed: UShER picks BA.5.2 for all four of those sequences, but since there are more matches within the BA.5.2.28 branch due to the Ns, the "voting" picks BA.5.2.28. I probably should remove the "voting" from pangolin. I will try to find time soon to see how many sequences' assignments will change, and evaluate whether it looks like an improvement overall. In the meantime, the
Not even sure if pangolin itself is the right repository to mention this issue/question or if it has been explained before.
While using some of our Illumina sequence data to compare it with our ONT sequences, I noticed the following: with pangolin-data v1.14 the consensus files were called as BA.5.2, but with the new pangolin-data v1.15.1 they are called as BA.5.2.28. The lineage-defining SNP of BA.5.2.28 is ORF1a:T1437I (from pango-designation issue #1133). However, the consensus sequences in question are missing an amplicon at that particular stretch of the genome, so they contain "N" there. How/why does pangolin assign this lineage anyway? How can I be sure it is correct?
We have the same issue with the ONT data, where lineage BA.5.2.21 is called after the update, but here too the sequences have "N" at this particular position.
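For anyone who wants to check their own consensus for this, a small sketch of the coordinate arithmetic follows. It assumes the consensus is already aligned to the Wuhan-Hu-1 reference (NC_045512.2) with no indels shifting coordinates, and it uses the ORF1a start at nucleotide 266 in that reference. The helper names are made up for this example and are not part of pangolin.

```python
ORF1A_START = 266  # 1-based start of ORF1a in Wuhan-Hu-1 (NC_045512.2)

def codon_positions(aa_pos, orf_start=ORF1A_START):
    """Return the three 1-based genome positions of an ORF codon."""
    first = orf_start + (aa_pos - 1) * 3
    return (first, first + 1, first + 2)

def masked_positions(aligned_seq, positions):
    """Return the positions whose base is not an unambiguous A, C, G or T."""
    return [p for p in positions if aligned_seq[p - 1].upper() not in "ACGT"]

# ORF1a:T1437I sits in the codon covering these genome positions.
site = codon_positions(1437)
print(site)

# Toy reference-aligned sequence (assumption) with an amplicon dropout
# covering the codon; a real check would read the aligned FASTA instead.
toy_seq = "A" * 4573 + "NNN" + "A" * 100
print(masked_positions(toy_seq, site))
```

The codon spans 4574 to 4576, so the middle base is position 4575, the position mentioned in the reply. If the check reports masked positions there, the sequence carries no direct evidence for or against T1437I.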