- Deep semantic matching has broad applications in information retrieval, question answering, dialog systems, paraphrase identification, etc.
- In production (e.g., a retrieval-based chatbot), information retrieval is usually done in two steps: retrieval (e.g., Elasticsearch, an inverted index) followed by **reranking** (e.g., a deep matching model)
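The two-step pipeline can be sketched in a few lines. This is a minimal illustration, not how Elasticsearch works internally: `build_inverted_index`, `retrieve`, `rerank`, and `token_overlap` are hypothetical names, and the token-overlap scorer is only a placeholder for where a deep matching model would sit.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def retrieve(index, query):
    """Step 1: cheap candidate generation, any document sharing a query token."""
    candidates = set()
    for token in query.lower().split():
        candidates |= index.get(token, set())
    return candidates

def rerank(docs, query, candidates, score_fn, top_k=3):
    """Step 2: score only the candidates with a more expensive model."""
    scored = [(score_fn(query, docs[i]), i) for i in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [i for _, i in scored[:top_k]]

def token_overlap(query, doc):
    """Placeholder scorer (assumption); a deep matching model would go here."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

docs = [
    "how do I reset my password",
    "reset the router to factory settings",
    "what is the weather today",
]
index = build_inverted_index(docs)
candidates = retrieve(index, "reset my password")
ranking = rerank(docs, "reset my password", candidates, token_overlap)
```

The point of the split is cost: the first step touches only the index, so the expensive scorer runs on a handful of candidates instead of the whole collection.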
- TF-IDF
- BM25 (Best Matching)
- `k1` controls term frequency saturation, typical values in [1.2, 2.0]
- `b` controls field length normalization, values in [0, 1], usually 0.75
- Query likelihood
- Jaccard
- Size of intersection divided by size of union of two sets
- Word duplication does not matter
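```python
def jaccard_sim(str1, str2):
    # Compare sets of whitespace-separated tokens, so duplicates are ignored.
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # Two empty strings have an empty union; return 0.0 instead of dividing by zero.
    return float(len(c)) / union_size if union_size else 0.0
```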
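To make the role of the BM25 parameters `k1` and `b` concrete, here is a small sketch of the Okapi BM25 score over whitespace tokens. The function name `bm25_scores` is hypothetical, the IDF formula is one common non-negative variant (an assumption, other variants exist), and the document list is assumed to be non-empty.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption; several variants are in use).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 sets how quickly repeated terms saturate;
            # b sets how strongly long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

scores = bm25_scores("cat", ["the cat sat on the mat", "the dog barked", "cat cat cat"])
```

With `b=0` document length is ignored entirely, and with `b=1` the term frequency is fully normalized by relative length, which is why a value in between such as 0.75 is the usual default.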
- Interaction-based is preferred
- Principle
- Examples
- ARC-II, DeepMatch, MatchPyramid
- ESIM
- Kernel pooling network
- Representation-based
- Model Review: A Deep Look into Neural Ranking Models for Information Retrieval
- Implementation
- MatchZoo
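The difference between the two families can be illustrated with toy word vectors. This is only a sketch under stated assumptions: the 2-d embeddings in `EMB` are made up, mean pooling stands in for a learned encoder, and max-then-average pooling stands in for the network that models such as MatchPyramid apply on top of the word-by-word matching matrix. None of the function names come from MatchZoo.

```python
import math

# Toy 2-d word vectors, invented for illustration; real systems learn these.
EMB = {
    "cheap": (1.0, 0.1), "inexpensive": (0.9, 0.2),
    "flight": (0.1, 1.0), "plane": (0.2, 0.9), "ticket": (0.5, 0.5),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def representation_score(query, doc):
    """Encode each text on its own (mean of word vectors), then compare once."""
    def encode(tokens):
        vecs = [EMB[t] for t in tokens]
        return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
    return cosine(encode(query), encode(doc))

def interaction_matrix(query, doc):
    """Compare every query word with every document word before any pooling."""
    return [[cosine(EMB[q], EMB[d]) for d in doc] for q in query]

def interaction_score(query, doc):
    """Stand-in for the network on top: best match per query word, averaged."""
    matrix = interaction_matrix(query, doc)
    return sum(max(row) for row in matrix) / len(matrix)

query = ["cheap", "flight"]
doc = ["inexpensive", "plane", "ticket"]
matrix = interaction_matrix(query, doc)
```

The representation-based path compresses each text to a single vector before the two ever meet, while the interaction-based path keeps the full `len(query) x len(doc)` matrix of word-level signals for the model to use.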
- SPTAG
- Annoy
- Faiss
- HNSW
- How to get k-NN points for the query?
- The search starts from a random sample on the top layer. The search on a layer stops when no closer neighbor can be found.
- The closest neighbor discovered on the current layer becomes the starting point (i.e., the "enter point") of the search on the next lower layer, and this repeats until the bottom layer is reached
- The standard NN-Descent search is adopted to search for top-k nearest neighbors on this layer
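The layered search described in the steps above can be sketched in pure Python. This is a simplified illustration, not the code of any HNSW library: the graph is given rather than built, `layers` is assumed to be ordered top to bottom, and the bottom layer uses a best-first search with a result pool of size `ef` (an assumption about how the top-k step is carried out). All function names are hypothetical.

```python
import heapq
import math

def greedy_search(points, graph, query, entry):
    """Move to a closer neighbor until no neighbor on this layer is closer."""
    current, improved = entry, True
    while improved:
        improved = False
        for nb in graph.get(current, ()):
            if math.dist(points[nb], query) < math.dist(points[current], query):
                current, improved = nb, True
    return current

def search_bottom(points, graph, query, entry, k, ef=10):
    """Best-first search that keeps the `ef` closest nodes seen so far."""
    visited = {entry}
    d0 = math.dist(points[entry], query)
    candidates = [(d0, entry)]  # min-heap of nodes still to expand
    best = [(-d0, entry)]       # max-heap (negated distances) of results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # the closest unexpanded node is worse than every result
        for nb in graph.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(points[nb], query)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return [n for _, n in sorted((-neg, n) for neg, n in best)[:k]]

def knn_search(points, layers, query, k, entry):
    """Descend through the upper layers greedily, then search the bottom layer."""
    for graph in layers[:-1]:
        entry = greedy_search(points, graph, query, entry)
    return search_bottom(points, layers[-1], query, entry, k)

# Toy index: ten points on a line, sparse upper layers, a chain at the bottom.
points = [(float(i),) for i in range(10)]
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
layers = [{0: [8], 8: [0]}, {0: [4], 4: [0, 8], 8: [4]}, bottom]
result = knn_search(points, layers, (6.2,), k=3, entry=0)
```

The upper layers contain few nodes with long links, so the greedy walk covers distance quickly; the dense bottom layer is only searched locally around the entry point handed down from above.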
- Use machine learning models to build ranking models
- Features
- BM25
- PageRank
- Edit distance
- Categories
- Pointwise
- Pairwise
- Listwise
- LambdaMART
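The three categories differ mainly in what the loss function looks at. The sketch below shows one possible loss per category for a single query: squared error (pointwise), a logistic loss over ordered pairs (pairwise), and a softmax cross-entropy over the whole list (listwise). These particular losses are illustrative assumptions, the function names are hypothetical, and LambdaMART itself (gradient-boosted trees with lambda gradients) is not implemented here.

```python
import math

def pointwise_loss(scores, labels):
    """Pointwise: each document is its own regression target (squared error)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels):
    """Pairwise: logistic loss on every pair where document i is more relevant than j."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0

def listwise_loss(scores, labels):
    """Listwise: cross-entropy between softmax(labels) and softmax(scores)."""
    def softmax(xs):
        m = max(xs)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]
    target, predicted = softmax(labels), softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

labels = [2, 1, 0]            # graded relevance for three documents of one query
good = [3.0, 2.0, 1.0]        # model scores in the right order
bad = [1.0, 2.0, 3.0]         # model scores in the reversed order
```

Note that the pairwise and listwise losses depend only on score differences within the query, so they reward getting the order right rather than matching the label values.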
- Set retrieval
- Precision
- Proportion of retrieved documents that are relevant: `|retrieved & relevant| / |retrieved|`
- Recall
- Proportion of relevant documents that are retrieved: `|retrieved & relevant| / |relevant|`
- F1
- Harmonic mean of precision and recall
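The set-retrieval definitions above translate directly into set operations. A minimal sketch, with the hypothetical helper `precision_recall_f1` returning 0.0 for any metric whose denominator would be zero (a convention chosen here, not a standard):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
```

Here two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were found (recall 2/3).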
- Ranked retrieval
- Precision@k
- Proportion of top-k documents that are relevant
- Recall@k
- Precision-recall curve
- Average precision
- Discounted Cumulative Gain (DCG)
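The ranked-retrieval metrics listed above can be sketched as follows. The function names are hypothetical, `relevant` is assumed to be a non-empty set, and the DCG uses the common `gain / log2(rank + 1)` discount (an assumption; a variant with `2**gain - 1` in the numerator is also widely used).

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Mean of precision@rank taken at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def dcg_at_k(gains, k):
    """Sum of graded gains, each discounted by log2 of its rank plus one."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

ranking = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
```

Unlike the set-based metrics, all four depend on the order of `ranking`: moving a relevant document earlier raises average precision and DCG even when the retrieved set is unchanged.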
- Learning to Rank for Information Retrieval by Tie-Yan Liu, 2015
- Learning to Rank for Information Retrieval and Natural Language Processing by Hang Li, 2011