The method compares two text samples for their similarity/dissimilarity as edits needed to convert source string to target string.

taimoorkhan-nlp/text_edit_distance_similarity

Analyzing Text Similarity Using Edit Distance

Description

This method calculates the edit distance between two texts to estimate their similarity or dissimilarity. The edit distance measures how many operations (such as inserting, deleting, or substituting characters) are needed to transform one text into another. For instance, the simple edit distance between "cut" and "cat" is 1, as only one substitution is needed. Similarly, the simple edit distance between "cat" and "at" is also 1, as one deletion suffices. In its simplest form, edit distance assigns an equal cost to all operations: insertions, deletions, and substitutions. Variants of the method allow for different cost structures, making it adaptable to various applications. For example, this method can be used to compare texts such as dialects of a language, definitions of similar concepts across disciplines, or versions of the same news article from different media sources. It can also be applied to anonymize personal information by distorting text.
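The two examples above can be checked with a minimal dynamic-programming sketch of the unit-cost case. The function name `simple_edit_distance` is chosen here for illustration and is not necessarily the name used in utils.py.

```python
def simple_edit_distance(source, target):
    """Minimum number of unit-cost insertions, deletions, and substitutions."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the cost of turning source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete the first i characters of source
    for j in range(cols):
        dist[0][j] = j  # insert the first j characters of target
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


print(simple_edit_distance("cut", "cat"))  # 1 (one substitution)
print(simple_edit_distance("cat", "at"))   # 1 (one deletion)
```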

The method offers three edit distance variants (simple, Levenshtein, and Damerau-Levenshtein) between two texts, at both the character and the word level. The variants differ in the operations they allow and the costs they assign:

  • Simple edit distance: insertion, deletion, and substitution, each with cost 1.
  • Levenshtein edit distance: insertion and deletion with cost 1 and substitution with cost 2. This is equivalent to disallowing substitution, since a substitution then costs the same as a deletion followed by an insertion.
  • Damerau-Levenshtein edit distance: insertion, deletion, substitution, and transposition (swapping two adjacent items), each with cost 1.
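The three cost structures can be illustrated with one parameterized function. This is a sketch under stated assumptions, not the code in utils.py: the name `edit_distance` and the parameters `sub_cost` and `transpose` are hypothetical, and transposition is handled in its restricted form (often called optimal string alignment), which may differ from the repository's treatment.

```python
def edit_distance(source, target, sub_cost=1, transpose=False):
    """Edit distance with unit insert/delete cost, a configurable
    substitution cost, and optional adjacent transposition (cost 1)."""
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            best = min(
                dist[i - 1][j] + 1,                           # deletion
                dist[i][j - 1] + 1,                           # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
            # Restricted transposition: the last two items are swapped.
            if (transpose and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + 1)
            dist[i][j] = best
    return dist[n][m]


print(edit_distance("form", "from"))                  # simple: 2
print(edit_distance("form", "from", sub_cost=2))      # Levenshtein as defined above: 2
print(edit_distance("form", "from", transpose=True))  # Damerau-Levenshtein: 1
print(edit_distance("cut", "cat", sub_cost=2))        # substitution now costs 2
```

Because the function only indexes and compares items, passing lists of words instead of strings gives the word-level variants.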

Reproducibility: The method is reproducible because it is a plain Python implementation that requires no additional packages or resources to be installed. It uses only the string and random modules from the Python standard library. It gives full control to update costs and to scale as needed. Random seeds are set so that random numbers are predictable across runs.

Keywords

Edit distance, text similarity, Levenshtein edit distance, Damerau-Levenshtein edit distance

Use Case(s)

  • Identifying different mentions of entities (e.g. names like "Donald Trump", "D. Trump", and "Trump")
  • Finding tweets/social media posts similar to a certain tweet, sentence, or claim.
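As an illustration of the entity-mention use case, candidate strings can be ranked by their distance to a query mention. The sketch below uses the simple (unit-cost) character-level distance; the helper names and the candidate list are made up for this example and are not part of the repository.

```python
def simple_edit_distance(source, target):
    """Unit-cost edit distance, keeping only two rows of the table."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def rank_by_distance(query, candidates):
    """Return (candidate, distance) pairs, closest candidate first."""
    scored = [(c, simple_edit_distance(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical candidate mentions for the query "D. Trump".
for mention, distance in rank_by_distance(
        "D. Trump", ["Donald Trump", "Joe Biden", "Trump"]):
    print(mention, distance)
```

Raw distances favor candidates of similar length, so a practical matcher would likely normalize by text length or apply a threshold; that choice is left open here.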

Repo Structure

The methods are defined in utils.py and are called on sample tweets from the notebook text_edit_distance_similarity.ipynb.

Environment Setup

No setup is needed.

Input Data

The method is directly applicable to textual digital behavioral data from social media and other digital platforms. Users can provide the input texts by writing them directly in the notebook text_edit_distance_similarity.ipynb.

Sample Input and Output

Provide two posts/strings as input directly in the notebook text_edit_distance_similarity.ipynb to compare.

For example, suppose we want to measure the dissimilarity (edit distance) between two tweets that share the same news:

tweet1 = "Excited to share our latest research on AI and its impact on social sciences! Leveraging data for better insights"
tweet2 = "Thrilled about our new findings on how AI transforms social science research. Innovation meets impact!"

After running the notebook, the method prints its output to the screen as a string. The output string contains the following information:

  • The edit distance variant used, from the available implementations.
  • The level (word or character) at which the method was applied.
  • The score (an integer) representing the edit distance or cost between the texts, interpreted as the minimum cost of the edits needed to transform the source text into the target text using the available operations and their costs.

Two sample outputs for tweet1 and tweet2 from the input above are shown below. The first uses Levenshtein edit distance at the word level, and the second uses simple edit distance at the character level. The Levenshtein edit distance (at word level) is 26, i.e., transforming tweet1 into tweet2 costs 26 in word-level edits. The simple edit distance (at character level) is 71, i.e., tweet1 and tweet2 are 71 character edits apart.

Levenshtein edit distance (at word level): 26
simple edit distance (at character level): 71
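The difference between the two levels can be sketched as follows. This is not the code in utils.py: the function name, the `level` and `sub_cost` parameters, and the whitespace tokenization with case-sensitive comparison are assumptions made for illustration.

```python
def edit_distance(source, target, level="char", sub_cost=1):
    """Edit distance with unit insert/delete cost and a configurable
    substitution cost, at character or word level."""
    if level == "word":
        # Assumption: words are whitespace-separated, compared case-sensitively.
        source, target = source.split(), target.split()
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            substitution = 0 if s_item == t_item else sub_cost
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + substitution,  # match or substitution
            ))
        previous = current
    return previous[-1]


tweet1 = "Excited to share our latest research on AI and its impact on social sciences! Leveraging data for better insights"
tweet2 = "Thrilled about our new findings on how AI transforms social science research. Innovation meets impact!"

print("Levenshtein edit distance (at word level):",
      edit_distance(tweet1, tweet2, level="word", sub_cost=2))
print("simple edit distance (at character level):",
      edit_distance(tweet1, tweet2, level="char", sub_cost=1))
```

Under the whitespace-tokenization assumption, the word-level call gives 26 for these tweets: the token lists have 19 and 15 words and share a longest common subsequence of four tokens (our, on, AI, social), so the cost is 19 + 15 - 2 × 4 = 26. The character-level result depends on preprocessing details that the notebook may handle differently.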

How to Use

  • Run pip install jupyter or conda install jupyter if Jupyter is not already installed.
  • Start Jupyter with the command jupyter lab or jupyter notebook.
  • Open text_edit_distance_similarity.ipynb.
  • Execute all notebook cells to call the methods defined in utils.py on the same texts.

Contact Details

Taimoor Khan ([email protected])

Publications

  1. Hossain, E., Rana, R., Higgins, N., Soar, J., Barua, P. D., Pisani, A. R., & Turner, K. (2023). Natural language processing in electronic health records in relation to healthcare decision-making: a systematic review. Computers in biology and medicine, 155, 106649.
  2. Chaabi, Y., & Allah, F. A. (2022). Amazigh spell checker using the Damerau-Levenshtein algorithm and N-gram. Journal of King Saud University-Computer and Information Sciences, 34(8), 6116-6124.
