Analyzing Text Similarity Using Edit Distance

Description

This method calculates the edit distance between two texts to estimate how similar or dissimilar they are. The edit distance is the minimum total cost of the operations (such as inserting, deleting, or substituting characters) needed to transform one text into the other. For instance, the simple edit distance between "cut" and "cat" is 1, because a single substitution is enough. Likewise, the simple edit distance between "cat" and "at" is 1, because a single deletion suffices. In its simplest form, edit distance assigns the same cost to every operation: insertions, deletions, and substitutions. Variants of the method use different cost structures, which makes it adaptable to various applications. For example, the method can be used to compare dialects of a language, definitions of similar concepts across disciplines, or versions of the same news article from different media sources. It can also be applied to anonymize personal information by distorting text.
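The "cut"/"cat" and "cat"/"at" examples can be checked with a minimal sketch of the unit-cost dynamic program. The function name `simple_edit_distance` is chosen here for illustration and is not necessarily the name used in utils.py.

```python
def simple_edit_distance(source: str, target: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions (each costing 1) that turn source into target."""
    # Row 0: turning the empty prefix into target[:j] takes j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Column 0: turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


print(simple_edit_distance("cut", "cat"))  # 1 (one substitution)
print(simple_edit_distance("cat", "at"))   # 1 (one deletion)
```

Only two rows of the table are kept at a time, so memory grows with the length of the target text rather than with the product of both lengths.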

The method offers three edit distance variants (simple, Levenshtein, and Damerau-Levenshtein), each applicable at both the character and the word level. The variants differ in their operations and costs:

Simple edit distance: insertion, deletion, and substitution, each with cost 1.
Levenshtein edit distance: insertion and deletion with cost 1 and substitution with cost 2. This is equivalent to disallowing substitution, because a substitution then costs the same as a deletion followed by an insertion. (Many references instead use the name Levenshtein distance for the unit-cost variant called simple edit distance here.)
Damerau-Levenshtein edit distance: insertion, deletion, substitution, and transposition of two adjacent items, each with cost 1.
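The three variants listed above can be sketched as one function with configurable costs. This is an illustration under stated assumptions, not the code in utils.py: the names `edit_distance`, `sub_cost`, and `transpose_cost` are hypothetical, and the transposition step uses the restricted "optimal string alignment" form, which only swaps adjacent items and never edits a swapped pair again.

```python
from typing import Optional, Sequence


def edit_distance(source: Sequence, target: Sequence,
                  ins_cost: int = 1, del_cost: int = 1, sub_cost: int = 1,
                  transpose_cost: Optional[int] = None) -> int:
    """Edit distance over any two sequences (strings or lists of words).

    transpose_cost=None disables transposition; otherwise adjacent
    swaps are allowed at that cost (optimal string alignment form).
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * del_cost   # delete everything in source[:i]
    for j in range(1, cols):
        dist[0][j] = j * ins_cost   # insert everything in target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            best = min(
                dist[i - 1][j] + del_cost,
                dist[i][j - 1] + ins_cost,
                dist[i - 1][j - 1] + (0 if same else sub_cost),
            )
            # Adjacent swap: the last two items of each prefix are crossed.
            if (transpose_cost is not None and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + transpose_cost)
            dist[i][j] = best
    return dist[-1][-1]


# The three variants as described in this README (hypothetical helper names).
def simple(a, b):
    return edit_distance(a, b)


def levenshtein(a, b):
    return edit_distance(a, b, sub_cost=2)


def damerau_levenshtein(a, b):
    return edit_distance(a, b, transpose_cost=1)


print(simple("cut", "cat"))             # 1
print(levenshtein("cut", "cat"))        # 2 (delete "u", insert "a")
print(damerau_levenshtein("ab", "ba"))  # 1 (one transposition)
print(simple("the cat sat".split(), "the cat sat down".split()))  # 1 (word level)
```

Because the function only compares items for equality, passing a string gives character-level distance and passing a list of words gives word-level distance.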

Reproducibility: The method is reproducible because it is a plain Python implementation that requires no additional packages or resources to be installed. It uses only the string and random modules from the Python standard library. It gives full control to update costs and scale as needed. Random seeds are fixed so that any random numbers are predictable across runs.

Keywords

Edit distance, text similarity, Levenshtein edit distance, Damerau-Levenshtein edit distance

Use Case(s)

Identifying different mentions of entities (e.g. names like "Donald Trump", "D. Trump", and "Trump")
Finding tweets/social media posts similar to a certain tweet, sentence, or claim.
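As a rough sketch of the first use case, candidate mentions can be ranked by a length-normalized edit distance to a reference name. The normalization (dividing by the longer length), the helper names, and the candidate list are assumptions made for this illustration and are not taken from the repository.

```python
def simple_edit_distance(source, target):
    """Unit-cost edit distance (insert, delete, substitute)."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (s_item != t_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the length of the longer text."""
    longest = max(len(a), len(b))
    return simple_edit_distance(a, b) / longest if longest else 0.0


reference = "Donald Trump"
# Invented candidate list, for illustration only.
candidates = ["Angela Merkel", "Trump", "D. Trump"]

ranked = sorted(candidates, key=lambda c: normalized_distance(reference, c))
for name in ranked:
    print(f"{name}: {normalized_distance(reference, name):.2f}")
```

Short forms such as "Trump" still sit fairly far from the full name at the character level, so any cut-off for deciding that two mentions refer to the same entity would need to be tuned on real data.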

Repo Structure

The methods are defined in utils.py and are called on sample tweets from the notebook text_edit_distance_similarity.ipynb.

Environment Setup

No setup is needed.

Input Data

The method is directly applicable to textual digital behavioral data from social media and other digital platforms. Users can provide the input texts by writing them directly in the notebook text_edit_distance_similarity.ipynb.

Sample Input and Output

Provide the two posts/strings to compare as input directly in the notebook text_edit_distance_similarity.ipynb.

For example, to measure the dissimilarity (edit distance) between two tweets sharing the same news:

tweet1 = "Excited to share our latest research on AI and its impact on social sciences! Leveraging data for better insights"
tweet2 = "Thrilled about our new findings on how AI transforms social science research. Innovation meets impact!"

After running the notebook, the method prints its output to the screen as a string. The output string contains the following information:

The edit distance variant used, from the available implementations.
The level (word or character) at which the method was applied.
The score (an integer) representing the edit distance or cost between the texts, interpreted as the minimum cost of transforming the source text into the target text using the available operations and their costs.

The two sample outputs below are for tweet1 and tweet2 given above. The first uses Levenshtein edit distance at the word level, and the second uses simple edit distance at the character level. The Levenshtein edit distance at the word level is 26, i.e., tweet1 and tweet2 are 26 word insertions or deletions apart. The simple edit distance at the character level is 71, i.e., tweet1 and tweet2 are 71 character changes apart.

Levenshtein edit distance (at word level): 26
simple edit distance (at character level): 71
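The following sketch shows one way output lines of this shape could be produced for the two sample tweets. It assumes case-sensitive whitespace tokenization for the word level and no preprocessing for the character level; the notebook may prepare the texts differently, and the function name `weighted_edit_distance` is hypothetical.

```python
def weighted_edit_distance(source, target, sub_cost=1):
    """Edit distance with insertion/deletion cost 1 and a configurable
    substitution cost; works on strings or on lists of words."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,      # deletion
                current[j - 1] + 1,   # insertion
                previous[j - 1] + (0 if s_item == t_item else sub_cost),
            ))
        previous = current
    return previous[-1]


tweet1 = ("Excited to share our latest research on AI and its impact on "
          "social sciences! Leveraging data for better insights")
tweet2 = ("Thrilled about our new findings on how AI transforms social "
          "science research. Innovation meets impact!")

# Word level: split on whitespace (assumed tokenization), substitution cost 2.
word_level = weighted_edit_distance(tweet1.split(), tweet2.split(), sub_cost=2)
# Character level: compare the raw strings, substitution cost 1.
char_level = weighted_edit_distance(tweet1, tweet2, sub_cost=1)

print(f"Levenshtein edit distance (at word level): {word_level}")
print(f"simple edit distance (at character level): {char_level}")
```

With whitespace tokenization the word-level line gives 26, matching the value reported above: the tweets have 19 and 15 tokens and share four in order ("our", "on", "AI", "social"), so 19 + 15 - 2 × 4 = 26. The character-level value depends on how the notebook preprocesses the text, so it is not stated here.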

How to Use

Run pip install jupyter or conda install jupyter if Jupyter is not already installed.
Start Jupyter with the command jupyter lab or jupyter notebook.
Open text_edit_distance_similarity.ipynb.
Execute all notebook cells to call the methods defined in utils.py on the same texts.

Contact Details

Taimoor Khan ([email protected])

Publications

Hossain, E., Rana, R., Higgins, N., Soar, J., Barua, P. D., Pisani, A. R., & Turner, K. (2023). Natural language processing in electronic health records in relation to healthcare decision-making: a systematic review. Computers in biology and medicine, 155, 106649.
Chaabi, Y., & Allah, F. A. (2022). Amazigh spell checker using the Damerau-Levenshtein algorithm and N-gram. Journal of King Saud University-Computer and Information Sciences, 34(8), 6116-6124.
