Skip to content

String-to-String Algorithms for Natural Language Processing

License

Notifications You must be signed in to change notification settings

stanfordnlp/string2string

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

string2string

PyPI - Python Version PyPI version Downloads license arXiv

Table of Contents

Getting Started | Tutorials | Example Usage | Documentation | Tests | Citation | Thanks | Contributing

Abstract

The string2string library is an open-source tool that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. It includes both traditional algorithmic solutions and recent advanced neural approaches to address various problems in pairwise string alignment, distance measurement, lexical and semantic search, and similarity analysis. Additionally, the library provides several helpful visualization tools and metrics to facilitate the interpretation and analysis of these methods.

The library features notable algorithms such as the Smith-Waterman algorithm for pairwise local alignment, the Hirschberg algorithm for global alignment, the Wagner-Fisher algorithm for edit distance, BARTScore and BERTScore for similarity analysis, the Knuth-Morris-Pratt algorithm for lexical search, and Faiss for semantic search. Moreover, it wraps existing highly efficient and widely-used implementations of certain frameworks and metrics, such as sacreBLEU and ROUGE, whenever it is appropriate and suitable.

In general, the string2string library seeks to provide extensive coverage and increased flexibility compared to existing libraries for strings. It can be used for many downstream applications, tasks, and problems in natural-language processing, bioinformatics, and computational social sciences. With its comprehensive suite of algorithms, visualization tools, and metrics, the string2string library is a valuable resource for researchers and practitioners in various fields.

Getting Started

Install the string2string library by running the following command in your terminal:

pip install string2string

Once the installation is complete, you can import the library and start using its functionalities.

Remark: We recommend using Python 3.7+ for the library.

Tutorials

Example Usage

Alignment

In the following example, we illustrate how to align two sequences of strings globally by using the Needleman-Wunsch algorithm. This algorithm, along with other alignment techniques, can be found in the alignment module of the library. The code snippet below demonstrates how to apply the Needleman-Wunsch algorithm to perform a global alignment of two given strings.

This example provides a practical illustration of how to use the Needleman-Wunsch algorithm to solve the problem of sequence alignment, which is a fundamental problem in bioinformatics.

>>> # Import the NeedlemanWunsch class from the alignment module
>>> from string2string.alignment import NeedlemanWunsch

>>> # Create an instance of the NeedlemanWunsch class
>>> nw = NeedlemanWunsch()

>>> # Let's create a list of strings (resembling DNA sequences), but they can be any strings (e.g., words), of course.
>>> seq1 = ['X', 'ATT', 'GC', 'GC', 'A', 'A', 'G']
>>> seq2 = ['ATT', 'G', 'GC', 'GC', 'A', 'C', 'G']

>>> # Compute the alignment between two strings
>>> aligned_seq1, aligned_seq2 = nw.get_alignment(seq1, seq2)

>>> # Print the alignment between the sequences, as computed by the Needleman-Wunsch algorithm.
>>> nw.print_alignment(aligned_seq1, aligned_seq2)
X | ATT | - | GC | GC | A | A | G
- | ATT | G | GC | GC | A | C | G

>>> alg_path, alg_seq1_parts, alg_seq2_parts = nw.get_alignment_strings_and_indices(aligned_seq1, aligned_seq2)
>>> from string2string.misc.plotting_functions import plot_pairwise_alignment
>>> plot_pairwise_alignment(seq1_pieces = alg_seq1_parts, seq2_pieces = alg_seq2_parts, alignment = alg_path, str2colordict = {'-': 'lightgray', 'ATT': 'indianred', 'GC': 'darkseagreen', 'A': 'skyblue', 'G': 'palevioletred', 'C': 'steelblue'}, title = 'Global Alignment Between Two Sequences of Strings')

Distance

The following code snippet demonstrates how to use the Levenshtein edit distance algorithm to compute the edit distance between two strings, at the character level and at the word level.

>>> # Let's create a Levenshtein edit distance class instance, with the default (unit cost) weights, from the distance module
>>> from string2string.distance import LevenshteinEditDistance
>>> edit_dist = LevenshteinEditDistance()

>>> # Let's also create a Tokenizer class instance with the default word delimiter (i.e., space)
>>> from string2string.misc import Tokenizer
>>> tokenizer = Tokenizer(word_delimiter=' ')

>>> # Let's create two strings
>>> text1 = "The quick brown fox jumps over the lazy dog"
>>> text2 = "The kuack brown box jumps over the lazy dog"

>>> # Get the edit distance between them at the character level
>>> edit_dist_score  = edit_dist.compute(text1, text2)

>>> print(f"Edit distance between these two texts at the character level is {edit_dist_score}")
# Edit distance between these two texts at the character level is 3.0

>>> # Tokenize the two texts
>>> text1_tokens = tokenizer.tokenize(text1)
>>> text2_tokens = tokenizer.tokenize(text2)

>>> # Get the distance between them at the word level
>>> edit_dist_score  = edit_dist.compute(text1_tokens, text2_tokens)

>>> print(f"Edit distance between these two texts at the word level is {edit_dist_score}")
# Edit distance between these two texts at the word level is 2.0

Search

The following code snippet demonstrates how to use the Knuth-Morrs-Pratt (KMP) search algorithm to find the index of a pattern in a text.

>>> # Let's create a KMPSearch class instance from the search module
>>> from string2string.search import KMPSearch
>>> knuth_morris_pratt = KMPSearch()

>>> # Let's define a pattern and a text
>>> pattern = Jane Austen'
>>> text = 'Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey, Persuasion, and Lady Susan were written by Jane Austen and are important works of English literature.'

>>> # Now let's find the index of the pattern in the text, if it exists (otherwise, -1 is returned).
>>> idx = knuth_morris_pratt.search(pattern=pattern,text=text)

>>> print(f'The index of the pattern in the text is {idx}.')
# The index of the pattern in the text is 127.

Faiss Semantic Search

The example below demonstrates how to use the Faiss tool developed by FAIR to perform semantic search. First, we use the BART-Large model from Hugging Face to generate embeddings for a small corpus of 25 sentences. To perform the search, we first encode a query sentence using the same BART model and use it to search the corpus. Specifically, we search for sentences that are most similar in meaning to the query sentence. After performing the search, we print the top thre sentences from the corpus that are most similar to the query sentence.

This approach can be useful in a variety of natural-processing applications, such as question-answering and information retrieval, where it is essential to find relevant information quickly and accurately.

>>> # Let's create a FaissSearch class instance from the search module to perform semantic search
>>> from string2string.search import FaissSearch
>>> faiss_search = FaissSearch(model_name_or_path = 'facebook/bart-large')

>>> # Let's create a corpus of strings (e.g., sentences)
>>> corpus = {
        'text': [
            "Coffee is my go-to drink in the morning.", 
            "I always try to make time for exercise.", 
            "Learning something new every day keeps me motivated.", 
            "The sunsets in my hometown are breathtaking.", 
            "I am grateful for the support of my friends and family.", 
            "The book I'm reading is incredibly captivating.", 
            "I love listening to music while I work.", 
            "I'm excited to try the new restaurant in town.", 
            "Taking a walk in nature always clears my mind.", 
            "I believe that kindness is the most important trait.", 
            "It's important to take breaks throughout the day.", 
            "I'm looking forward to the weekend.", 
            "Reading before bed helps me relax.", 
            "I try to stay positive even in difficult situations.", 
            "Cooking is one of my favorite hobbies.", 
            "I'm grateful for the opportunity to learn and grow every day.", 
            "I love traveling and experiencing new cultures.", 
            "I'm proud of the progress I've made so far.", 
            "A good night's sleep is essential for my well-being.", 
            "Spending time with loved ones always brings me joy.", 
            "I'm grateful for the beauty of nature around me.", 
            "I try to live in the present moment and appreciate what I have.", 
            "I believe that honesty is always the best policy.", 
            "I enjoy challenging myself and pushing my limits.", 
            "I'm excited to see what the future holds."
        ],
    }

>>> # Next we need to initialize and encode the corpus
>>> faiss_search.initialize_corpus(
    corpus=corpus,
    section='text', 
    embedding_type='mean_pooling',
    )

>>> # Let's define a query, and the number of top results we want to retrieve; then, let's perform the semantic search.
>>> query = 'I like going for a run in the morning.'
>>> top_k = 5
>>> top_k_results = faiss_search.search(query=query, k = top_k)

>>> # Let's define a function to print the results of the search.
>>> def print_results(query, results, top_k):
        # Let's first print the query.
        print(f'Query: "{query}"\n')

        # Let's now print the top k results.
        print(f'Top {top_k} most similar sentences in the corpus to the query (smallest score is most similar):')
        for i in range(top_k):
            print(f' - {i+1}: "{results["text"][i]}" with a similarity score of {top_k_results["score"][i]:.2f}')
>>> print_results(query=query, results=top_k_results, top_k=top_k)
# Query: "I like going for a run in the morning."

# Top 3 most similar sentences in the corpus to the query (smallest score is most similar):
#  - 1: "I always try to make time for exercise." with a similarity score of 170.65
#  - 2: "The sunsets in my hometown are breathtaking." with a similarity score of 238.20
#  - 3: "Coffee is my go-to drink in the morning." with a similarity score of 238.85

Cosine Similarity (with Glove and fastText Embeddings)

The following example demonstrates how to use pre-trained GloVe embeddings to calculate the cosine similarity between different pairs of words. Specifically, we compute the cosine similarity between the embeddings of four words: "cat", "dog", "phone", and "computer". We then create a similarity matrix and use the plot_heatmap function in the visualization module to plot this matrix.

Overall, this example provides a practical demonstration of how to use pre-trained embeddings such as GloVe and fastText to quantify the semantic similarity between pairs of words, which can be useful in a variety of natural-language processing tasks.

>>> # Let's create a Cosine Similarity class instance from the similarity module
>>> from string2string.similarity import CosineSimilarity
>>> cosine_similarity = CosineSimilarity()

>>> # Let's also create an instance of the GloVeEmbeddings class from the misc module to compute the embeddings of words
>>> from string2string.misc import GloVeEmbeddings
>>> glove = GloVeEmbeddings(model='glove.6B.200d', dim=50, force_download=True, dir='./models/glove-model/')

>>> # Let's define a list of words
>>> words = ['cat', 'dog', 'phone', 'computer']

>>> # Let's create a list to store the embeddings of the words and compute them
>>> embeds = []
>>> for word in words:
>>>     embedding = glove.get_embedding(word)
>>>     embeds.append(embedding)

>>> # Let's create a similarity matrix to store the cosine similarity between each pair of embeddings
>>> similarity_matrix = np.zeros((len(words), len(words)))
>>> for i in range(len(embeds)):
        similarity_matrix[i, i] = 1
        for j in range(i + 1, len(embeds)):
            result = cosine_similarity.compute(embeds[i], embeds[j], dim=1).item()
            similarity_matrix[i, j] = result
            similarity_matrix[j, i] = result

>>> # Let's visualize the similarity matrix
>>> from string2string.misc.plotting_functions import plot_heatmap
>>> plot_heatmap(
        similarity_matrix, 
        title='Similarity Between GloVe Embeddings',
        x_ticks = words,
        y_ticks = words,
        x_label = 'Words',
        y_label = 'Words',
        valfmt = '{x:.2f}',
        cmap="Blues",
    )

Metrics (sacreBLEU and ROUGE)

In the code snippet below, you can see an example of how to employ the sacreBLEU metric for calculating the BLEU score. This metric is implemented as a wrapper around the sacreBLEU library. The computation is performed by providing a list of candidate sentences and a list of reference sentences as input and calling the compute method of the metric instance.

>>> # Let's import the sacreBLEU metric from the metrics module and create an instance of it
>>> from string2string.metrics import sacreBLEU
>>> sbleu = sacreBLEU()

>>> # Let's define a list of candidate sentences and a list of reference sentences
>>> candidates = ['The sun is shining.', 'The birds are chirping.', 'She is playing the guitar.', 'He is cooking dinner.']
>>> references = [['The sun is shining.', 'The sun is bright.'], ['The birds are singing.', 'The harold is singing.'], ['Julie is playing the flute.', 'She is playing the piano.'], ['Chef is cooking dinner.', 'He is cooking lunch.']]

>>> # Compute the sacreBLEU score
>>> result = sbleu.compute(candidates, references)
>>> print(result)
# {'score': 67.92604743211312, 'counts': [19, 13, 9, 4], 'totals': [21, 17, 13, 9], 'precisions': [90.47619047619048, 76.47058823529412, 69.23076923076923, 44.44444444444444], 'bp': 1.0, 'sys_len': 21, 'ref_len': 21}

Similarly, we can use the ROUGE metric to calculate the ROUGE score. This metric is implemented as a wrapper around the ROUGE library.

>>> # Let's import the ROUGE metric from the metrics module and create an instance of it
>>> from string2string.metrics import ROUGE
>>> rogue = ROUGE()

>>> # Let's define a list of candidate sentences and a list of reference sentences
>>> candidates = ["The cat is sitting on the mat.", "The dog is barking at the mailman.", "The bird is singing in the tree."] 
>>> references = [["The cat is sitting on the mat."], ["The dog is barking at the postman."], ["The bird sings on the tree."]]

>>> # Compute the ROUGE score
>>> result = rogue.compute(candidates, references)
>>> print(result)
# {'rouge1': 0.8241758241758242, 'rouge2': 0.7323232323232324, 'rougeL': 0.8241758241758242, 'rougeLsum': 0.8241758241758242}

BARTScore and BERTScore

This code snippet shows how to utilize the BARTScore and BERTScore metrics to compute their corresponding scores. These similarity metrics serve as wrappers around the BARTScore and BERTScore libraries.

>>> # Importing the BERTScore and BARTScore classes from the similarity module
>>> from string2string.similarity import BERTScore, BARTScore

>>> # Initializing the BARTScore metric
>>> bart_scorer = BARTScore(model_name_or_path='facebook/bart-large-cnn')

>>> # Initializing the BERTScore metric
>>> bert_scorer = BERTScore(lang="en")

>>> # Define the source and target sentences
>>> source_sentences = ["I'm super happy today.", "John von Neumann was a mathematician and physicist."]
>>> target_sentences = ["I feel pretty wonderful today.", "Kurt Vonnegut's Slaughterhouse Five had a profound impact on many people."]

>>> # Compute the BARTScore
>>> score = bart_scorer.compute(source_sentences, target_sentences, agg="mean", batch_size=4)
>>> print(score)
# {'score': array([-2.77911878, -4.60774326])}

>>> # Compute the BERTScore
>>> score = bert_scorer.compute(source_sentences, target_sentences)
>>> print(score)
# {'precision': array([0.943804 , 0.8559195], dtype=float32), 'recall': array([0.943804  , 0.85145634], dtype=float32), 'f1': array([0.943804 , 0.8536821], dtype=float32)}

Potential Future Enhancements and Improvements

  • Use the Numba library to accelerate the execution time of the algorithms.
  • Include additional string-based metrics in the miscellaneous module.
  • Implement BLAST and FASTA algorithms. [Work-in-Progress]
  • Add more customizable and useful visualization features.

Citation

@article{suzgun2023string2string,
  title={string2string: A Modern Python Library for String-to-String Algorithms},
  author={Suzgun, Mirac and Shieber, Stuart M and Jurafsky, Dan},
  journal={arXiv preprint arXiv:2304.14395},
  year={2023}
}

Thanks

Our project owes a debt of gratitude to the following individuals for their contributions, comments, and feedback: Federico Bianchi, Corinna Coupette, Sebastian Gehrmann, Tayfun Gür, Şule Kahraman, Deniz Keleş, Luke Melas-Kyriazi, Christopher Manning, Tolúlopé Ògúnrèmí, Alexander "Sasha" Rush, Kyle Swanson, and Garrett Tanzer.

Contributing

If you are interested in contributing to this library, we encourage you to review our documentation first to gain a better understanding of the codebase's format and structure. Once you are familiar with the codebase, you are welcome to submit a pull request. We are looking for new contributors to help us drive our efforts forward!