Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could partial_ratio support Levenshtein.normalized_similarity? #401

Open
rocke2020 opened this issue Aug 11, 2024 · 2 comments
Open

Could partial_ratio support Levenshtein.normalized_similarity? #401

rocke2020 opened this issue Aug 11, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@rocke2020
Copy link

rapidfuzz 3.9.6 and python 3.10.
I have carefully read the rapidfuzz docs and made tests. I find: partial_ratio use Indel.normalized_similarity.
Could partial_ratio also support Levenshtein.normalized_similarity? Maybe default to use Indel.normalized_similarity, but we can choose to use Levenshtein.normalized_similarity.
In my use case, I indeed need partial_ratio to use Levenshtein.normalized_similarity.
thanks!!

from icecream import ic
from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein, Indel

a = '34cdef16z'
c = '09cdef78'
ic(fuzz.partial_ratio(a, c))
ic(fuzz.partial_ratio_alignment(a, c))
ic(fuzz.ratio(a[0:6], c))
ic(Indel.normalized_similarity(a[0:6], c) * 100)
ic(Levenshtein.normalized_similarity(a[0:6], c) * 100)

ic| test.py:25 in <module>- fuzz.partial_ratio(a, c): 57.14285714285714
ic| test.py:26 in <module>- fuzz.partial_ratio_alignment(a, c): ScoreAlignment(score=57.14285714285714, src_start=0, src_end=6, dest_start=0, dest_end=8)
ic| test.py:27 in <module>- fuzz.ratio(a[0:6], c): 57.14285714285714
ic| test.py:28 in <module>- Indel.normalized_similarity(a[0:6], c) * 100: 57.14285714285714
ic| test.py:29 in <module>- Levenshtein.normalized_similarity(a[0:6], c) * 100: 50.0
@maxbachmann
Copy link
Member

In theory we could support this, but this comes with a couple of issues:

  1. partial_ratio is used in a lot of the "combined" algorithms like WRatio. So it's not completely clear what to do about them
  2. would this only allow Indel and Levenshtein which seems a bit arbitrary or more metrics? For more metrics it gets a lot more complex to handle things like the different kinds of normalizations
  3. This has a couple of optimizations to keep the sliding window algorithm remotely fast by skipping some unneeded calculations. This doesn't work for the generic case though. Probably this could have a generic fallback, but it would likely need optimized versions for algorithms where this is possible.

One more possible setting could be whether the sliding window allows for shorter substrings at the start / end of the longer string.

Not completely sure whether this should be added under a different name either.

I assume a generic implementation shouldn't be too much work and optimized versions could be added over time. Still not sure when I will get around to it though.

@maxbachmann maxbachmann added the enhancement New feature or request label Aug 22, 2024
@rocke2020
Copy link
Author

thanks a lot for maxbachmann.
You have a global view on this issue!
As Indel and Levenshtein are very similar, I think it would be easier only support Indel and Levenshtein in partial_ratio, to save your time. For me, I hope to only use Levenshtein in most NLP process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants