Semantic Textual Similarity (STS) is a measure of the “semantic equivalence” between two blocks of text, such as phrases, sentences, or documents. Semantic similarity methods usually return a ranking or percentage expressing how similar two texts are.
The main objective of semantic similarity is to measure the distance between the semantic meanings of a pair of words, phrases, sentences, or documents.
For our BTP project, we took inspiration from the implementations listed below, which are currently in use.
- Customer Support : Companies can build a corpus of previously seen frequent queries; with our engine they can retrieve the most similar past queries and send an automated response, saving a lot of unnecessary work-force.
- Medical : When a new case comes to a practitioner, he/she can search for similar past cases, retrieve the most closely resembling one, and make better decisions.
- Law : When a new case comes in, law firms look through their database for similar past cases and then strategize from those case studies. (Currently used by some firms.) These examples are a reality but are still kept as internal tools by organizations, as they bring a competitive edge.
We wish to put this technology in the hands of the general consumer with this platform, where individuals and organizations will be able to collect and classify text data, accelerating their processes.
The basic idea is to divide the entire process into two steps:
- Generate embeddings for the sentences
- Compare those embeddings
For these steps we draw on the following approaches:
- Sentence similarity using semantic nets and corpus statistics [1]
- Sentence embeddings using Siamese BERT networks [3]
- Cosine-based similarity
The proposed method derives text similarity from the semantic and syntactic information contained in the compared texts. Unlike existing methods that use a fixed vocabulary, the proposed method dynamically forms a joint word set using only the distinct words in the pair of sentences.
For each sentence, a raw semantic vector is derived with the assistance of a lexical database. A word order vector is formed for each sentence, again using information from the lexical database.
Since each word in a sentence contributes differently to the meaning of the whole sentence, the significance of a word is weighted using information content derived from a corpus.
By combining the raw semantic vector with information content from the corpus, a semantic vector is obtained for each of the two sentences.
Semantic similarity is computed based on the two semantic vectors. A word order similarity is calculated using the two order vectors. Finally, the sentence similarity is derived by combining semantic similarity and word order similarity.
Here we utilize a semantic knowledge base to compare words; for this project we use WordNet [2].
In WordNet, words are organized into synonym sets (synsets), with semantics and relation pointers to other synsets.
The similarity between words is determined not only by path length but also by depth. Following [1], the similarity $s(w_1, w_2)$ between words $w_1$ and $w_2$ is a function of path length and depth:

$$s(w_1, w_2) = f_1(l) \cdot f_2(h) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$$

where:
- $l$ is the shortest path length between $w_1$ and $w_2$,
- $h$ is the depth of their subsumer in the hierarchical semantic nets,
- $f_1$ and $f_2$ are transfer functions of path length and depth, respectively,
- $\alpha$ is an experimental constant,
- $\beta > 0$ is a smoothing factor; as $\beta \to \infty$, the depth of a word in the semantic nets is no longer considered.
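As a concrete illustration, the sketch below computes this word-level score with NLTK's WordNet interface. The synset search strategy and the values of ALPHA and BETA are illustrative assumptions rather than tuned parameters.

```python
# Sketch: word-level similarity from WordNet path length and subsumer depth.
# Requires: pip install nltk, then nltk.download('wordnet') once.
import math
from nltk.corpus import wordnet as wn

ALPHA = 0.2   # path-length scaling constant (assumed value)
BETA = 0.45   # depth smoothing factor (assumed value)

def word_similarity(w1: str, w2: str) -> float:
    """s(w1, w2) = exp(-alpha * l) * tanh(beta * h), maximized over synset pairs."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            l = s1.shortest_path_distance(s2)
            if l is None:                       # no path between these synsets
                continue
            subsumers = s1.lowest_common_hypernyms(s2)
            h = max((s.max_depth() for s in subsumers), default=0)
            # tanh(beta*h) == (e^{beta*h} - e^{-beta*h}) / (e^{beta*h} + e^{-beta*h})
            best = max(best, math.exp(-ALPHA * l) * math.tanh(BETA * h))
    return best

print(word_similarity("dog", "fox"))
```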
Given two sentences, T1 and T2, a joint word set is formed:

$$T = T_1 \cup T_2$$
Since the joint word set is purely derived from the compared sentences, it is compact, with no redundant information. The joint word set, T, can be viewed as the semantic information for the compared sentences.
The vector derived from the joint word set is called the lexical semantic vector. The value of each entry $s_i$ of the lexical semantic vector is determined as follows:
- Case 1: If $w_i$ appears in the sentence, $s_i$ is set to 1.
- Case 2: Otherwise, a semantic similarity score is computed between $w_i$ and each word in the sentence T1, using the method above; let the highest score be $x$. Then $s_i = x$ if $x > \epsilon$, and 0 otherwise.
Now that we have the lexical semantic vectors, the semantic similarity between two sentences is defined as the cosine coefficient between the two vectors:

$$S_s = \frac{s_1 \cdot s_2}{\|s_1\|\,\|s_2\|}$$
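A minimal sketch of how the lexical semantic vectors and their cosine coefficient could be computed. It assumes a word-level similarity function such as the one sketched earlier, and EPSILON is an illustrative threshold, not a tuned value.

```python
import math

EPSILON = 0.2  # threshold on word similarity (assumed value)

def lexical_semantic_vector(sentence_words, joint_words, word_similarity):
    """Entry s_i is 1 if word w_i occurs in the sentence; otherwise the best
    word-to-word similarity against the sentence if it exceeds EPSILON, else 0."""
    vec = []
    for w in joint_words:
        if w in sentence_words:
            vec.append(1.0)                                    # Case 1
        else:
            x = max(word_similarity(w, sw) for sw in sentence_words)
            vec.append(x if x > EPSILON else 0.0)              # Case 2
    return vec

def cosine(u, v):
    """Cosine coefficient between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Usage (word_similarity as sketched above):
# joint = list(dict.fromkeys(t1_words + t2_words))   # T = T1 ∪ T2
# s1 = lexical_semantic_vector(t1_words, joint, word_similarity)
# s2 = lexical_semantic_vector(t2_words, joint, word_similarity)
# semantic_similarity = cosine(s1, s2)
```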
Let us consider a pair of sentences, T1 and T2, that contain exactly the same words in the same order, with the exception of two words from T1 which occur in the reverse order in T2. For example:
- T1: A quick brown dog jumps over the lazy fox.
- T2: A quick brown fox jumps over the lazy dog.
For the example pair of sentences T1 and T2, the joint word set is:
- T : { A, quick, brown, dog, jumps, over, the, lazy, fox }
We assign a unique index number to each word in T1 and T2. The index number is simply the order number in which the word appears in the sentence.
indexes : { A : 1 , quick : 2 , brown : 3 , dog : 4 , jumps : 5 , over : 6 , the : 7 , lazy : 8 , fox : 9 }
The word order vectors for T1 and T2 are r1 and r2, respectively:
- r1 : { 1, 2, 3, 4, 5, 6, 7, 8, 9 }
- r2 : { 1, 2, 3, 9, 5, 6, 7, 8, 4 }
The measure of the word order similarity of two sentences is:

$$S_r = 1 - \frac{\|r_1 - r_2\|}{\|r_1 + r_2\|}$$
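For the simple case above, where every joint-set word occurs in both sentences, the word order vectors and $S_r$ can be sketched as follows (the 1-based indexing matches the example):

```python
import math

def word_order_vector(sentence_words, joint_words):
    """1-based position in the sentence of each word from the joint word set
    (simple case: every joint-set word occurs in the sentence)."""
    return [sentence_words.index(w) + 1 for w in joint_words]

def word_order_similarity(r1, r2):
    """S_r = 1 - ||r1 - r2|| / ||r1 + r2||."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1 - diff / total

t1 = "A quick brown dog jumps over the lazy fox".split()
t2 = "A quick brown fox jumps over the lazy dog".split()
joint = list(dict.fromkeys(t1 + t2))
r1 = word_order_vector(t1, joint)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
r2 = word_order_vector(t2, joint)   # [1, 2, 3, 9, 5, 6, 7, 8, 4]
print(word_order_similarity(r1, r2))
```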
Both semantic and syntactic information (in terms of word order) play a role in conveying the meaning of sentences. Thus, the overall sentence similarity is defined as a combination of semantic similarity and word order similarity:

$$S(T_1, T_2) = \delta S_s + (1 - \delta) S_r$$

where $\delta \le 1$ decides the relative contributions of semantic and word order similarity; since word order plays a subordinate role, $\delta$ is chosen greater than 0.5 [1].
First we will talk about BERT. BERT (Devlin et al., 2018) [4] is a pre-trained transformer network (Vaswani et al., 2017) which set new state-of-the-art results for various NLP tasks. The input to BERT for sentence-pair regression consists of the two sentences, separated by a special [SEP] token. Multi-head attention over 12 layers (base model) or 24 layers (large model) is applied, and the output is passed to a simple regression function to derive the final label.
A large disadvantage of the BERT network structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT.
We use the pre-trained BERT network and only fine-tune it to yield useful sentence embeddings. This significantly reduces the needed training time: SBERT can be tuned in less than 20 minutes, while yielding better results than comparable sentence embedding methods.
SBERT adds a pooling operation to the output of BERT to derive a fixed-sized sentence embedding. There are many available pooling strategies, but we are going to use the MEAN strategy.
In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine similarity.
We make use of the Regression Objective Function (ROF). In the ROF, the cosine similarity between the two sentence embeddings $u$ and $v$ is computed (see Figure 2 of [3]), and we use mean-squared-error loss as the objective function.
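As a sketch of how SBERT embeddings are generated and compared in practice with the sentence-transformers package; the model name below is an assumption, and any SBERT model trained with MEAN pooling would behave similarly.

```python
# Sketch: SBERT sentence embeddings compared with cosine similarity.
# Assumes: pip install sentence-transformers; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed pre-trained SBERT model

sentences_a = ["A quick brown dog jumps over the lazy fox."]
sentences_b = ["A quick brown fox jumps over the lazy dog."]

emb_a = model.encode(sentences_a, convert_to_tensor=True)
emb_b = model.encode(sentences_b, convert_to_tensor=True)

# scores[i][j] is the cosine similarity between sentences_a[i] and sentences_b[j]
scores = util.cos_sim(emb_a, emb_b)
print(float(scores[0][0]))
```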
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined as the cosine of the angle between them, which equals the inner product of the two vectors after both have been normalized to length 1.
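Formally, for two vectors $u$ and $v$:

$$\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$$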
Marelli et al. compiled the SICK dataset for sentence-level semantic similarity/relatedness in 2014. It is composed of 10,000 sentence pairs obtained from the ImageFlickr 8 and MSR-Video description datasets. The sentence pairs were derived from image descriptions by various annotators. 750 random sentence pairs from the two datasets were selected, followed by three steps to obtain the final SICK dataset: sentence normalisation, sentence expansion, and sentence pairing.
In order to encourage research in the field of semantic similarity, semantic textual similarity tasks called SemEval have been conducted since 2012. The organizers of the SemEval tasks collected sentences from a wide variety of sources and compiled them into a benchmark dataset against which the performance of the models submitted by the participants in the task was measured.
- Front End
  - Home Page
  - Authentication
    - Single User
    - Organization
  - Comparison page ( for both organizations and individuals )
    - Multiple sentence A and Multiple Sentence B ( Comparison )
      - Inserting one by one
      - Inserting with file upload in a specific format
  - Company registration page
  - Upload file
  - API Help Page
- Back End
  - DB
    - DB Models
    - DB Layer
  - REST API layer
    - For interaction with the front-end
    - For interaction with the outside world
  - Comparison Engine Integration
- Comparison Engine
  - Corpus Based Algorithm - Embedding generator
  - Sent-Sim Algorithm - Embedding generator
  - Comparison Methods
    - One to One
    - One to Many
    - Many to Many
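A minimal sketch of the three comparison modes on top of either embedding generator. The `embed` function referenced in the comments is a stand-in assumption for whichever engine produces the embeddings.

```python
import numpy as np

def cosine_matrix(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the rows of emb_a and the rows of emb_b."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T

# With embed(list_of_sentences) -> 2-D array of embeddings (assumed helper):
# One to One  : cosine_matrix(embed([a]), embed([b]))[0, 0]
# One to Many : cosine_matrix(embed([a]), embed(b_list))[0]
# Many to Many: cosine_matrix(embed(a_list), embed(b_list))
```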
- [1] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138-1150, Aug. 2006, doi: 10.1109/TKDE.2006.130.
- [2] Miller, G.A., 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11), pp. 39-41.
- [3] Reimers, N. and Gurevych, I., 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
- [4] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.