Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weight is incorrect, not a proper DL, not metric #1

Open
nreynis opened this issue Apr 26, 2014 · 3 comments
Open

Weight is incorrect, not a proper DL, not metric #1

nreynis opened this issue Apr 26, 2014 · 3 comments

Comments

@nreynis
Copy link

nreynis commented Apr 26, 2014

This implementation seems to be wrong.
Althought I've not dive deep enough in the code to spot where is the bug, I've isolated a simple test case.

DL("beak", "water") returns 5 instead of 4

step 1
beak water => substitution (b by w cost = 1)
step 2
weak water => permutation (e and a cost = 1)
step 3
waek water => deletion (t cost = 1)
step 4
waek waer => substitution (r by k cost = 1)

result
waek = waek => in 4 operations of cost 1

This bug cause the algorithm to lose it's metric properties, which can have a lot of implications. If you use it to build indexes it will certainly screw up your search engine. If you use it for a simple string compare, it should be good enough (even if not a real DamerauLevenshtein).

@cbaatz
Copy link
Owner

cbaatz commented Apr 26, 2014

Hi ywg,

thanks for reporting this. You're right that this does not implement a proper Damerau-Levenshtein distance - it implements the optimal string alignment algorithm, as stated in the first sentence of the README.

I might have misinterpreted the Wikipedia article to say that this was also called the restricted Damerau-Levenshtein distance. Re-reading the article it's still not clear to me whether that is an accepted use since it talks about a "true" Damerau-Levenshtein algorithm which presumably would be pointless if there wasn't a restricted version? Perhaps you could shed some light on the use and point to a more credible source than Wikipedia? :)

@nreynis
Copy link
Author

nreynis commented Apr 27, 2014

Hi cbaatz,
effectively, I haven't read thoroughly the README file, this is my fault.

Good information source regarding this algorithm is hard to find, and a lot of ressources are actually restricted edit distance (OSA) confused with DL. Even the french wikipedia article is totally wrong and is illustrated with an OSA implementation.

I've found these two blog post which are (IMHO) quite clear on the question:
http://scarcitycomputing.blogspot.com/2013/04/damerau-levenshtein-edit-distance.html
http://software-and-algorithms.blogspot.com/2012/09/damerau-levenshtein-edit-distance.html

Sadly the original article from Fred Damerau is behind a paywall, but this one isn't and can be considered more authoritative than blog posts:
https://eprint.iacr.org/2006/364.pdf

"A.1 Proof of Metric Properties
Damerau (Damerau 1964) identified the four string-
edit operations, but did not construct a metric. Lev-
enshtein (Levenshtein 1966) constructed and proved
a metric for the three operations excluding transposi-
tion. While it is well-known that one still has a metric
when the fourth operation is used, neither paper dis-
cusses this and the proof is very short, so we include
it here."

If you wan't I've wrote some tests with JSSpec
who check for metric properties by calculating the distance
between random words and a corpus of 6000 words.

@cbaatz
Copy link
Owner

cbaatz commented Apr 28, 2014

Hi ywg, I'd be interested in the tests - I can't promise that I'll rewrite this to true DL, but I'll consider it and then the tests would be great. CB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants