ambiguous nucleotides? #31

Open
trommleralex opened this issue Aug 26, 2019 · 6 comments

@trommleralex

Dear Torsten,

I want to calculate genetic distances between sequences that contain ambiguous bases, e.g. W, S, Y and so on. If I am not mistaken, snp-dists can either ignore these positions or count them as a SNP. However, I would like to use the ambiguity information, e.g.:

W vs. A or T -> print distance 0
W vs. G or C -> print distance 1
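
The rule in the two examples above amounts to "distance 0 if the two codes share at least one possible base, otherwise 1". A minimal sketch in C, assuming each IUPAC code is stored as a 4-bit set; the names `iupac_mask` and `overlap_dist` are made up for illustration and are not part of snp-dists, and skipping gaps and unknown characters is also an assumption:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Map an IUPAC nucleotide code to a 4-bit set: A=1, C=2, G=4, T=8.
   Returns 0 for gaps and anything unrecognised. */
unsigned iupac_mask(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 4;
    case 'T': case 'U': return 8;
    case 'R': return 1 | 4;      /* A or G */
    case 'Y': return 2 | 8;      /* C or T */
    case 'S': return 4 | 2;      /* G or C */
    case 'W': return 1 | 8;      /* A or T */
    case 'K': return 4 | 8;      /* G or T */
    case 'M': return 1 | 2;      /* A or C */
    case 'B': return 2 | 4 | 8;  /* C or G or T */
    case 'D': return 1 | 4 | 8;  /* A or G or T */
    case 'H': return 1 | 2 | 8;  /* A or C or T */
    case 'V': return 1 | 2 | 4;  /* A or C or G */
    case 'N': return 15;         /* any base */
    default:  return 0;          /* gap or unknown */
    }
}

/* Count aligned positions whose base sets do not overlap.
   Assumption: positions where either sequence has a gap or an
   unknown character are skipped rather than counted. */
size_t overlap_dist(const char *a, const char *b, size_t len)
{
    assert(a != NULL && b != NULL);
    size_t d = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned ma = iupac_mask(a[i]);
        unsigned mb = iupac_mask(b[i]);
        if (ma && mb && (ma & mb) == 0)
            d++;
    }
    return d;
}
```

With this sketch, `overlap_dist("W", "A", 1)` and `overlap_dist("W", "T", 1)` give 0, while `overlap_dist("W", "G", 1)` and `overlap_dist("W", "C", 1)` give 1.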

I also would love to stick to unix command line because I have thousands of sequences and could loop the command easily in unix.

Would you consider implementing support for ambiguous bases in snp-dists, or could you recommend another program that can deal with them?

Thanks a lot and best wishes!
Alex

@kloetzl
Contributor

kloetzl commented Aug 26, 2019

Hi there,

Implementing ambiguous nucleotides is possible. However, weighting the comparisons is not trivial. You suggest that d(W,A) = 0 and d(W,T) = 0, but d(A,T) = 1. That violates the triangle inequality, since d(A,T) = 1 > d(A,W) + d(W,T) = 0, so the distance is no longer a true metric. I am therefore unsure how the weighting should be implemented to satisfy the needs of all snp-dists users.

Best,
Fabian

@trommleralex
Author

trommleralex commented Aug 26, 2019 via email

tseemann self-assigned this Aug 26, 2019
@tseemann
Owner

I think supporting IUPAC codes in some manner would be a good option to include, but it is complicated. What about d(W,W) and d(W, B/D/G/V) etc?

Nucleotide Code:  Base:
----------------  -----
A.................Adenine
C.................Cytosine
G.................Guanine
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or -............gap
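
To make the question about d(W,W) and d(W, B/D/G/V) concrete, here is a hedged sketch in C of two candidate weightings applied to the table above. Neither is what snp-dists does; the names `code_mask`, `dist_overlap` and `dist_expected` are hypothetical, and the "expected mismatch" rule assumes that each base in an ambiguity set is equally likely. Only upper-case codes from the table are handled, and gaps are not valid input.

```c
#include <assert.h>
#include <string.h>

/* IUPAC codes and their base sets as bitmasks (A=1, C=2, G=4, T=8),
   listed in the same order as the table above. */
static const char CODES[] = "ACGTRYSWKMBDHVN";
static const unsigned char MASKS[] =
    {1, 2, 4, 8, 5, 10, 6, 9, 12, 3, 14, 13, 11, 7, 15};

/* Look up the base set for an upper-case code; 0 if unknown. */
unsigned code_mask(char c)
{
    if (c == '\0')
        return 0;                 /* strchr would match the terminator */
    const char *p = strchr(CODES, c);
    return p ? MASKS[p - CODES] : 0;
}

/* Number of bases in a 4-bit set. */
int popcount4(unsigned m)
{
    return (m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
}

/* Candidate 1: 0 if the two sets share any base, otherwise 1. */
int dist_overlap(char x, char y)
{
    unsigned a = code_mask(x), b = code_mask(y);
    assert(a && b);
    return (a & b) ? 0 : 1;
}

/* Candidate 2: probability that two bases, drawn uniformly and
   independently from each set, differ: 1 - |a & b| / (|a| * |b|). */
double dist_expected(char x, char y)
{
    unsigned a = code_mask(x), b = code_mask(y);
    assert(a && b);
    return 1.0 - (double)popcount4(a & b) / (popcount4(a) * popcount4(b));
}
```

The two rules already disagree on the simplest case: `dist_overlap('W', 'W')` is 0, whereas `dist_expected('W', 'W')` is 0.5, because two independent draws from {A, T} differ half the time. For W against B, the overlap rule gives 0 (both can be T) while the expected-mismatch rule gives 1 - 1/6.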

@kloetzl
Contributor

kloetzl commented Aug 28, 2019

I came up with an implementation that works with ambiguous nucleotides. @trommleralex Note that you have to fill in the table in main.c yourself. You also have to compile using make.

@trommleralex
Author

trommleralex commented Aug 29, 2019 via email

@kullrich

Dear @tseemann, we once met in Hinxton, UK, during an ENA meeting. I took the opportunity to clone your nice snp-dists repo. I have added a basic so-called literal-distance, which handles IUPAC codes. The original code was not touched, so one can still calculate the snp-dists distance.
https://github.com/kullrich/literal-dists
Best regards
Kristian


4 participants