Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support tensors in edit distance computations #2856

Open
kylebgorman opened this issue Dec 1, 2024 · 1 comment
Open

Support tensors in edit distance computations #2856

kylebgorman opened this issue Dec 1, 2024 · 1 comment
Labels
enhancement New feature or request
Milestone

Comments

@kylebgorman
Copy link

kylebgorman commented Dec 1, 2024

🚀 Feature

Edit distance is well-defined over arrays of any kind of categorical values (like integers standing in for characters or strings), not just strings. However, the implementation of edit distance is hard-coded to only accept strings or lists of strings.

Motivation

First, I'd like to use the edit distance operators in this library but converting tensors back into strings is a quite expensive operation.

Second, in my implementation, the tables needed to do said conversion are not currently local to the module that would hold the torchmetric metric objects either, so it would add additional bookkeeping as well.

Third and finally, I don't even want to do comparison based on the number of edits as computed via Unicode strings but rather based on arbitrary symbols which may be multi-character. For instance, if we treat strings as nominal attributes, then the edit distance for:

  • gold: ["this", "is", "fun"]
  • hypothesis: ["this", "isn't", "fun"]

is 1, but we get a different result (3) if we join the strings together and use EditDistance:

  • gold: "this is fun"
  • hypothesis: "this isn't fun"

I would like the ability to get the former measure and keep the computations on the right device.

Pitch

Currently the update function is typed to accept Union[str, List[str]] and raises an error if this is not the case. I would remove this restriction so that pairs of 1d and 2d integral tensors are also accepted, and would be interpreted the same way as a pair of strings and a pair of lists of strings would be, respectively. Other than that preprocessing, the checks on type would be loosened there and in the lower-level implementation here.

I would be willing to take a stab at this for EditDistance; I am not volunteering to do it for all the other text metrics however.

Alternatives

Of course I could always just roll my own metric and keep it to myself.

Additional context

I wonder if this is going to be miserable on CUDA since there's nothing CUDAish about the edit distance algorithm's very iterative implementation.

@kylebgorman kylebgorman added the enhancement New feature or request label Dec 1, 2024
Copy link

github-actions bot commented Dec 1, 2024

Hi! thanks for your contribution!, great first issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants