Introduction

Deep semantic match has broad applications in information retrieval, question answering, dialog system, paraphrase, etc.
In production (e.g., retrieval-based chatbot), retrieval (e.g., Elasticsearch, inverted index) + ** rerank** (e.g., deep matching model) is two steps to perform information retrieval

Models

Statistical models

TFIDF
BM25 (BestMatch)
- k1 is term frequency saturation (词频饱和度), value between [1.2, 2.0]
- b is field length normalization, value between [0, 1], usually 0.75
Query likelihood

Size of intersection divided by size of union of two sets
Word duplication does not matter

def jaccard_sim(str1, str2): 
    a = set(str1.split()) 
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

Deep learning models

Interaction-based is perfered
- Principle
- Examples
  - ARC-II, DeepMatch, MatchPyramid
  - ESIM
  - Kernel pooling network
Representation-based
- Principle
- Examples
  - DSSM, CDSSM, ARC-I
Model Review: A Deep Look into Neural Ranking Models for Information Retrieval
Implementation
- MatchZoo
- SPTAG
- Annoy
- Faiss
- HNSW
  - How to get k-NN points for the query?
    - The search starts from a random sample on the top layer. The search on one layer stops as no closer neighbor could be found.
    - The discovered closest neighbor on the current layer is treated as the starting point (i.e., “enter point”) of the search on the lower layer, until it reaches to the bottom layer
    - The standard NN-Descent search is adopted to search for top-k nearest neighbors on this layer

Learning to Rank

Use machine learning models to build ranking models
Features
- BM25
- PageRank
- Edit distance
Categories
- Pointwise
- Pariwise
- Listwise
LambdaMART

Evaluation metrics

Set retrieval
- Precision
  - Proportion of retrieved documents that are relevant, |retrieved & relevant| / |retrieved|
- Recall
  - Proportion of relevant documents that are retrieved, |retrieved & relevant| / |relevant|
- F1
  - Harmonic mean of precision and recall
Ranked retrieval
- Precision@k
  - Proportion of top-k documents that are relevant
- Recall@k
  - Proportion of relevant documents that are in the top-k
- Precision-recall curve
- Average precision
  - Average all precision@k values when recall is 1.0
  - Mean Average Precision (MAP)
    - Take mean of average precisions across a set of queries
  - Example: Number of relevant documents for thisquery is 10
- Discounted Cumulative Gain (DCG)
  - Assumption: more than two levels of relevance
  - Normalized Discounted Cumulative Gain (NDCG)

Books

Learning to Rank for Information Retrieval by Tie-Yan Liu, 2015
Learning to Rank for Information Retrieval and Natural Language Processing by Hang Li, 2011

Tutorials

SIGIR 2018 Deep Learning for Matching in Search and Recommendation
《搜索与推荐中的深度学习匹配》之搜索篇
基于GRU和am-softmax的句子相似度模型
Top competition solutions
问题对语义相似度计算-参赛总结
平安人寿-智能问答系统的探索与实践, pdf
Simple and Effective Text Matching with Richer Alignment Features, ACL2019

Industrial applications

Sogou IR
Baidu SimNet
Yunzhisheng
Baidu AnyQ

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Introduction

Models

Statistical models

Deep learning models

Learning to Rank

Evaluation metrics

Books

Tutorials

Industrial applications

Files

README.md

Latest commit

History

README.md

File metadata and controls

Introduction

Models

Statistical models

Deep learning models

Learning to Rank

Evaluation metrics

Books

Tutorials

Industrial applications