diff --git a/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb b/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb
index 21b54aa..89e076b 100644
--- a/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb
+++ b/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb
@@ -1,69 +1,92 @@
{
"cells": [
+ {
+ "cell_type": "markdown",
+ "id": "30e3e801-3659-4455-b405-5a2b083ea952",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Approximate Nearest Neighbor (ANN) Search in PECOS "
+ ]
+ },
{
"cell_type": "markdown",
"id": "e5073aac",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Introduction\n",
+ "Recall that PECOS is a scalable ML library for prediction on enormous and correlated output spaces. It therefore also supports inference for embedding-based models such as dual encoders, which is often formulated as a Maximum Inner Product Search (**MIPS**) problem or, equivalently, an approximate nearest neighbor (**ANN**) search problem.\n",
+ "\n",
+ "In PECOS, we implemented a state-of-the-art graph-based ANN search algorithm, namely the **H**ierarchical **N**avigable **S**mall **W**orld (**HNSW**) model. The life cycle of HNSW can be divided into two steps:\n",
+ "* *Training*: given user-provided database vectors, build the HNSW graph data structures for indexing;\n",
+ "* *Prediction*: given any query vector, return the top-K approximate nearest vectors indexed in the database.\n",
+ "\n",
+ "More specifically, the **search (i.e., inference)** procedure of HNSW can be summarized as:\n",
+ "* For each layer, conduct a best-first search traversal. The best candidate serves as the entry point to the next layer;\n",
+ "* Traverse from the top layer (coarse-grained graph, long-range links) to the bottom layer (fine-grained graph, short-range links);\n",
+ "* The bottom layer graph contains all database items as the nodes.\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2eaba286-135c-4694-9756-638c8b3b1169",
"metadata": {},
"source": [
- "# Approximate Nearest Neighbor (ANN) Search in PECOS \n",
- "\n",
- "PECOS provides the efficient approach for **approximate nearest neighbor (ANN) search**. More specifically, after training an hierarchical navigable small world (HNSW) model (or buildling the **PECOS-HNSW indexer**) with a corpus of vectors, PECOS supports to efficiently infer top-K approximated nearest indexed vectors for an arbitrary query vector. In this part of the tutorial, we will demonstrate how to use PECOS-HNSW tackle the approximate nearest neighbor (ANN) search problem and how to integrate HNSW with PECOS XMR models.\n",
- "\n",
- "#### HNSW at a glimpse\n",
- "The search procedure of HNSW can be summarized as:\n",
- "* traverse from top layer (course-grain graph, long-range link) to bottom layer (fine-grain graph, short-range link)\n",
- "* best first search traversal on each graph, where the best candidate serves as initial to next layer\n",
- "\n",
- "\n",
- "\n",
"## Highlight of PECOS-HNSW\n",
+ "In this part of the tutorial, we show how to use PECOS-HNSW to tackle the ANN search problem and highlight some of its key features:\n",
"\n",
- "* Support both sparse and dense input features\n",
- "* Support SIMD instructions (SSE, AVX256, and AVX512)\n",
- "* Modularity implementation\n",
- "\n",
- "## Comparison of PECOS and NMSLIB on the sparse data\n",
- "\n",
- "#### Disclaimer \n",
- "The benchmarking results listed in this notebook are based on an `r5dn-24xlarge` AWS instance with 96 Intel(R) Xeon(R) Platinum 8259CL CPUs @ 2.50GHz. With distinct environments, the magnitude of improvments could be also different.\n",
+ "* support inference on both sparse and dense input features;\n",
+ "* support SIMD instructions (AVX, AVX256, and AVX512) and select the best available one at runtime;\n",
+ "* achieve new SOTA results compared to other popular graph-based ANN libraries (e.g., NMSLIB and HNSWLIB).\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "94b45789-b24f-4ed5-8d9c-0568c2d4d1cc",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Benchmarking PECOS-HNSW with NMSLIB/HNSWLIB\n",
"\n",
- "#### Results\n",
- "* We compare two implementations of HNSW: `PECOS` and `NMSLIB` on a sparse dataset (i.e., RCV1).\n",
- "* For RCV1, the instances in training/test set are `781,265` and `23,149`, respectively. The feature dimension is `47,236`.\n",
- "* The HNSW index is constructed under `M=16` and `efConstruction=500`.\n",
- "* From the table below, we see that, under similar Recall@10, `PECOS` achieves `[88%,93%]` speedup compared to the `NMSLIB` package.\n",
+ "### Disclaimer \n",
+ "We follow the [ANN-Benchmarks](https://github.com/erikbern/ann-benchmarks) evaluation protocol, which conducts inference on each test query sequentially (**batch_size=1**) with a **single-thread** setup.  \n",
+ "The benchmarking results are based on an r5dn-24xlarge AWS instance (**with AVX512 support**) with 96 Intel(R) Xeon(R) Platinum 8259CL CPUs @ 2.50GHz. In other environments, the magnitude of the improvements may differ.\n",
"\n",
- "| M=16, efC=500 | | | HNSW (PECOS) | | | HNSW (NMSLIB) | speedup (PECOS/NMSLIB) |\n",
- "|:-------------:|:---------:|:-----------------------:|:------------------:|:---------:|:-----------------------:|:------------------:|:----------------------------:|\n",
- "| efS | Recall@10 | Throughput (#query/sec) | Latency (ms/query) | Recall@10 | Throughput (#query/sec) | Latency (ms/query) | |\n",
- "| 10 | 0.7733 | 5250.297 | 0.1905 | 0.7790 | 2710.256 | 0.3690 | 93.72% |\n",
- "| 20 | 0.8545 | 3677.292 | 0.2719 | 0.8581 | 1924.505 | 0.5196 | 91.08% |\n",
- "| 40 | 0.9043 | 2409.959 | 0.4149 | 0.9055 | 1271.085 | 0.7867 | 89.60% |\n",
- "| 80 | 0.9325 | 1508.349 | 0.6630 | 0.9326 | 800.999 | 1.2484 | 88.31% |\n",
- "| 120 | 0.9434 | 1125.047 | 0.8889 | 0.9426 | 597.873 | 1.6726 | 88.17% |\n",
- "| 200 | 0.9533 | 763.752 | 1.3093 | 0.9523 | 404.518 | 2.4721 | 88.81% |\n",
- "| 400 | 0.9621 | 433.872 | 2.3048 | 0.9608 | 229.553 | 4.3563 | 89.01% |\n",
- "| 600 | 0.9657 | 305.747 | 3.2707 | 0.9644 | 161.879 | 6.1775 | 88.87% |\n",
- "| 800 | 0.9678 | 237.651 | 4.2078 | 0.9663 | 124.806 | 8.0124 | 90.42% |\n",
"\n",
- "## Hands-on Tutorial\n",
+ "### Results on *RCV1-47236-angular*\n",
"\n",
- "The life cycle of a PECOS-HNSW model consists of two stages:\n",
+ "* For RCV1, the training and test sets contain 781,265 and 23,149 instances, respectively. The feature dimension is 47,236.\n",
+ "* PECOS-HNSW achieves an average of **1.9x** speedup compared to the NMSLIB package.\n",
"\n",
- "* building the indexer (training)\n",
- "* inference (testing).\n",
+ "### Results on *SIFT-128-euclidean*\n",
"\n",
- "### Install PECOS through Python PIP"
+ "* For SIFT, the training and test sets contain 1,000,000 and 10,000 instances, respectively. The feature dimension is 128.\n",
+ "* PECOS-HNSW achieves an average of **1.3x** speedup compared to the HNSWLIB package.\n",
+ "* PECOS-HNSW++ (ongoing work) achieves an average of **3x** speedup compared to the HNSWLIB package."
]
},
{
- "cell_type": "code",
- "execution_count": null,
- "id": "f6df49a3",
- "metadata": {},
- "outputs": [],
+ "cell_type": "markdown",
+ "id": "d74d7466-e379-4378-9992-19f3bdb09ccc",
+ "metadata": {
+ "tags": []
+ },
"source": [
- "! pip install libpecos"
+ "## Hands-on Tutorial"
]
},
{
@@ -76,37 +99,10 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": null,
"id": "140a0d24",
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2022-07-15 21:03:07-- https://archive.org/download/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz\n",
- "Resolving archive.org (archive.org)... 207.241.224.2\n",
- "Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.\n",
- "HTTP request sent, awaiting response... 302 Found\n",
- "Location: https://ia802308.us.archive.org/21/items/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz [following]\n",
- "--2022-07-15 21:03:07-- https://ia802308.us.archive.org/21/items/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz\n",
- "Resolving ia802308.us.archive.org (ia802308.us.archive.org)... 207.241.228.48\n",
- "Connecting to ia802308.us.archive.org (ia802308.us.archive.org)|207.241.228.48|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 317972212 (303M) [application/octet-stream]\n",
- "Saving to: ‘rcv1-angular-47236.tar.gz’\n",
- "\n",
- "100%[======================================>] 317,972,212 11.0MB/s in 40s \n",
- "\n",
- "2022-07-15 21:03:47 (7.68 MB/s) - ‘rcv1-angular-47236.tar.gz’ saved [317972212/317972212]\n",
- "\n",
- "rcv1-angular-47236/\n",
- "rcv1-angular-47236/X.trn.npz\n",
- "rcv1-angular-47236/X.tst.npz\n",
- "rcv1-angular-47236/Y.tst.npy\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"! wget https://archive.org/download/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz\n",
"! tar -zxvf ./rcv1-angular-47236.tar.gz"
@@ -114,7 +110,7 @@
},
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 1,
"id": "46dc982b",
"metadata": {},
"outputs": [
@@ -128,7 +124,9 @@
],
"source": [
"import numpy as np\n",
+ "import os, time\n",
"from pecos.utils import smat_util\n",
+ "from pecos.ann.hnsw import HNSW\n",
"X_trn = smat_util.load_matrix(\"./rcv1-angular-47236/X.trn.npz\").astype(np.float32)\n",
"X_tst = smat_util.load_matrix(\"./rcv1-angular-47236/X.tst.npz\").astype(np.float32)\n",
"Y_tst = smat_util.load_matrix(\"./rcv1-angular-47236/Y.tst.npy\")\n",
@@ -144,18 +142,22 @@
"source": [
"### Training Indexer\n",
"\n",
- "To train a PECOS-HNSW model, training parameters need to be defined in an object of HNSW.TrainParams as the argument train_params. The key parameters of training a PECOS-HNSW model include:\n",
+ "To train a [PECOS-HNSW](https://github.com/amzn/pecos/tree/v0.4.0/pecos/ann/hnsw) model, training parameters need to be defined in an object of [HNSW.TrainParams](https://github.com/amzn/pecos/blob/v0.4.0/pecos/ann/hnsw/model.py#L33) as the argument `train_params`.\n",
+ "\n",
+ "The key parameters of training a [PECOS-HNSW](https://github.com/amzn/pecos/tree/v0.4.0/pecos/ann/hnsw) model include:\n",
"* `M` (default 32): The maximum number of edges per node for each layer. A larger M leads to a larger model size and greater memory consumption. Higher/lower M are more suitable for high/low dimensional data or the pursue of high/low recall.\n",
"* `efC` (default 100): The size of the priority queue for best first search in construction. `efC` can be considered as the trade-off between efficiency and accuracy for indexing. A higher `efC` results in longer construction time but better quality of indexing.\n",
"* `metric_type` (default ip): The distance metric type for ANN search. PECOS-HNSW currently supports Euclidean distance (`l2`); and inner product (`ip`)\n",
"* `threads` (default -1): The number of threads for training, or -1 to use all available cores.\n",
"\n",
- "The parameters for inference can be also decided as the argument pred_params during model construction so that the model can be directly applied for inference without further parameter designation.\n"
+ "Details on these hyper-parameters can be found in the original HNSW paper ([Malkov et al., TPAMI 2018](https://arxiv.org/abs/1603.09320)).\n",
+ "\n",
+ "The inference parameters can also be specified via the `pred_params` argument during model construction, so that the model can be applied for inference without specifying them again.\n"
]
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 2,
"id": "553aaf55",
"metadata": {},
"outputs": [
@@ -163,14 +165,11 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "HNSW Indexer | M 32 efC 100 metric ip | time(s) 11.980276823043823\n"
+ "HNSW Indexer | M 32 efC 100 metric ip | time(s) 12.000147581100464\n"
]
}
],
"source": [
- "import time\n",
- "from pecos.ann.hnsw import HNSW\n",
- "\n",
"M, efC = 32, 100\n",
"metric = \"ip\"\n",
"train_params = HNSW.TrainParams(\n",
@@ -196,7 +195,7 @@
},
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 3,
"id": "bf7905f3",
"metadata": {},
"outputs": [],
@@ -214,36 +213,30 @@
"source": [
"### Inference and Evaluation\n",
"\n",
- "To conduct inference with a train HNSW model, prediction parameters need to be defined in an object of HNSW.PredParams as the argument pred_params. The key parameters of inference with a PECOS-HNSW model include:\n",
+ "To conduct inference, prediction parameters need to be defined in an object of [HNSW.PredParams](https://github.com/amzn/pecos/blob/v0.4.0/pecos/ann/hnsw/model.py#L51) as the argument `pred_params`.\n",
"\n",
+ "The key parameters of inference with a PECOS-HNSW model include:\n",
"* `efS` (default 100): The size of the priority queue for best first search during inference. Similar to efC, efS can be considered as the trade-off between search efficiency and accuracy. A higher efS results in more accurate results with slower speed. efS is required to be greater than topk.\n",
"* `topk` (default 10): The number of approximate nearest neighbor to be returned. \n",
"* `threads` (default -1): The number of searchers for parallel inference, -1 to use all available searchers.\n",
"\n",
- "The predict function derives the search results based on a query matrix of shape (# of data points for inference, # of dimentions) and `pred_params`, as well as searchers. The argument `ret_csr` (default `true`) decides the format of returned results as:\n",
+ "Users should also construct `searchers` to avoid memory overhead:\n",
+ "```\n",
+ "searchers = model.searchers_create(num_searcher=1)  # multiple searchers can run inference on multiple queries in parallel\n",
+ "```\n",
"\n",
- "* If `ret_csr` is false, the returned results would be two matrices of shape (# of data points, topk), which indicate the topk indices in the training corpus and the corresponding distances for each testing instance.\n",
- "* If `ret_csr` is true, the returned results would be a [Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) of shape (# of data points, # of points in the training corpus). Each row contains sorted topk distance values at the corresponding columns (i.e., indices in training corpus). The data for each row (i.e., `data[indptr[i]:indptr[i + 1]]`) are also sorted by the distance values.\n",
- "\n"
+ "The predict function derives search results based on a query matrix of shape (# of data points for inference, # of dimensions), `pred_params`, and `searchers`. "
]
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 4,
"id": "e25e31d4",
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Prediction Time = 15.7988 seconds.\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"pred_params = HNSW.PredParams(efS=100, topk=10)\n",
- "searchers = model.searchers_create(num_searcher=1)\n",
+ "searchers = model.searchers_create(num_searcher=1)  # multiple searchers can run inference on multiple queries in parallel\n",
"start_time = time.time()\n",
"indices, distances = model.predict(\n",
" X_tst,\n",
@@ -251,21 +244,33 @@
" searchers=searchers,\n",
" ret_csr=False,\n",
")\n",
- "pred_time = time.time() - start_time\n",
- "print(f\"Prediction Time = {pred_time:.4f} seconds.\")"
+ "pred_time = time.time() - start_time"
]
},
{
"cell_type": "markdown",
- "id": "0ce9aefa",
+ "id": "dc06d282-ba6b-4a57-a963-1203fcc87c63",
"metadata": {},
+ "source": [
+ "The argument `ret_csr` (default `True`) determines the format of the returned results:\n",
+ "\n",
+ "* If `ret_csr` is false, the returned results would be two matrices of shape (# of data points, topk), which indicate the topk indices in the training corpus and the corresponding distances for each testing instance.\n",
+ "* If `ret_csr` is true, the returned results would be a [Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) of shape (# of data points, # of points in the training corpus). Each row contains sorted topk distance values at the corresponding columns (i.e., indices in training corpus). The data for each row (i.e., `data[indptr[i]:indptr[i + 1]]`) are also sorted by the distance values."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0ce9aefa",
+ "metadata": {
+ "tags": []
+ },
"source": [
"### Evaluation"
]
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 5,
"id": "38401700",
"metadata": {},
"outputs": [],
@@ -279,7 +284,7 @@
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 6,
"id": "b0b6d72a",
"metadata": {},
"outputs": [
@@ -287,7 +292,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "HNSW inference | R@10 0.9025 Throughput(q/s) 1465.236 latency(ms/q) 0.6825\n"
+ "HNSW inference | R@10 0.9020 Throughput(q/s) 1478.575 latency(ms/q) 0.6763\n"
]
}
],
@@ -301,14 +306,17 @@
{
"cell_type": "markdown",
"id": "75880f0d",
- "metadata": {},
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true,
+ "tags": []
+ },
"source": [
- "## Recall vs Throughput Trade-off"
+ "## Appendix: Recall vs Throughput Trade-off"
]
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 7,
"id": "12bd7fb6",
"metadata": {},
"outputs": [],
@@ -318,9 +326,14 @@
" M_list = [16]\n",
" efC = 500\n",
" topk = 10\n",
- " efS_list = [10, 20, 40, 80, 120, 200, 400, 600, 800]\n",
+ " efS_list = [10, 20, 40, 80, 120, 200, 400, 600]\n",
" for M in M_list:\n",
- " train_params = HNSW.TrainParams(M=M, efC=efC, metric_type=metric, threads=-1)\n",
+ " train_params = HNSW.TrainParams(\n",
+ " M=M,\n",
+ " efC=efC,\n",
+ " metric_type=metric,\n",
+ " threads=-1,\n",
+ " )\n",
" start_time = time.time()\n",
" model = HNSW.train(X_trn, train_params=train_params, pred_params=None)\n",
" print(\"Indexer | M {} efC {} metric {} | train time(s) {}\".format(\n",
@@ -345,7 +358,7 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 8,
"id": "0b4af0fb",
"metadata": {},
"outputs": [
@@ -353,16 +366,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Indexer | M 16 efC 500 metric ip | train time(s) 46.87919640541077\n",
- "inference | efS 10 R@10 0.7733 Throughput(q/s) 5250.297 latency(ms/q) 0.1905\n",
- "inference | efS 20 R@10 0.8545 Throughput(q/s) 3677.292 latency(ms/q) 0.2719\n",
- "inference | efS 40 R@10 0.9043 Throughput(q/s) 2409.959 latency(ms/q) 0.4149\n",
- "inference | efS 80 R@10 0.9325 Throughput(q/s) 1508.349 latency(ms/q) 0.6630\n",
- "inference | efS 120 R@10 0.9434 Throughput(q/s) 1125.047 latency(ms/q) 0.8889\n",
- "inference | efS 200 R@10 0.9533 Throughput(q/s) 763.752 latency(ms/q) 1.3093\n",
- "inference | efS 400 R@10 0.9621 Throughput(q/s) 433.872 latency(ms/q) 2.3048\n",
- "inference | efS 600 R@10 0.9657 Throughput(q/s) 305.747 latency(ms/q) 3.2707\n",
- "inference | efS 800 R@10 0.9678 Throughput(q/s) 237.651 latency(ms/q) 4.2078\n"
+ "Indexer | M 16 efC 500 metric ip | train time(s) 47.05828666687012\n",
+ "inference | efS 10 R@10 0.7737 Throughput(q/s) 5228.054 latency(ms/q) 0.1913\n",
+ "inference | efS 20 R@10 0.8552 Throughput(q/s) 3665.046 latency(ms/q) 0.2728\n",
+ "inference | efS 40 R@10 0.9043 Throughput(q/s) 2406.416 latency(ms/q) 0.4156\n",
+ "inference | efS 80 R@10 0.9320 Throughput(q/s) 1504.624 latency(ms/q) 0.6646\n",
+ "inference | efS 120 R@10 0.9432 Throughput(q/s) 1122.713 latency(ms/q) 0.8907\n",
+ "inference | efS 200 R@10 0.9536 Throughput(q/s) 763.034 latency(ms/q) 1.3106\n",
+ "inference | efS 400 R@10 0.9622 Throughput(q/s) 433.572 latency(ms/q) 2.3064\n",
+ "inference | efS 600 R@10 0.9656 Throughput(q/s) 305.857 latency(ms/q) 3.2695\n"
]
}
],
@@ -370,15 +382,362 @@
"run_pecos(X_trn, X_tst, Y_tst)"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "ccea74a4-ca0d-4839-b1fa-b42ce86aea82",
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true,
+ "tags": []
+ },
+ "source": [
+ "## Appendix: Plot Recall vs Throughput Curve "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "cc70d86f-a4fc-46b6-a079-860db59efc96",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_pareto_frontier(Xs, Ys, maxX=True, maxY=True):\n",
+ "    '''Pareto frontier selection process'''\n",
+ "    sorted_list = sorted([[Xs[i], Ys[i]] for i in range(len(Xs))], reverse=maxX)\n",
+ "    pareto_front = [sorted_list[0]]\n",
+ "    for pair in sorted_list[1:]:\n",
+ "        if maxY:\n",
+ "            if pair[1] >= pareto_front[-1][1]:\n",
+ "                pareto_front.append(pair)\n",
+ "        else:\n",
+ "            if pair[1] <= pareto_front[-1][1]:\n",
+ "                pareto_front.append(pair)\n",
+ "    return pareto_front\n",
+ "\n",
+ "def plot_one(\n",
+ "    results_dict,\n",
+ "    xlim, ylim, title,\n",
+ "    FONTSIZE=28):\n",
+ "    import matplotlib.pyplot as plt\n",
+ "    f, axs = plt.subplots(1, 1, figsize=(10,10))\n",
+ "    for algo_name in results_dict.keys():\n",
+ "        algo_dict = results_dict[algo_name]\n",
+ "        pareto_front = get_pareto_frontier(algo_dict[\"recall\"], algo_dict[\"throughput\"])\n",
+ "        Xs_list, Ys_list = zip(*pareto_front)\n",
+ "        axs.plot(\n",
+ "            Xs_list,\n",
+ "            Ys_list,\n",
+ "            label=algo_name,\n",
+ "            ms=7, mew=3, lw=3,\n",
+ "            color=algo_dict[\"color\"],\n",
+ "            linestyle=algo_dict[\"linestyle\"],\n",
+ "            marker=algo_dict[\"marker\"],\n",
+ "        )\n",
+ "    axs.set_xlim([xlim, 1.01])\n",
+ "    axs.set_ylim([0.0, ylim])\n",
+ "    axs.tick_params(axis='both', which='major', labelsize=FONTSIZE-8)\n",
+ "    axs.tick_params(axis='both', which='minor', labelsize=FONTSIZE-8)\n",
+ "    axs.set_ylabel(\"Throughput (#queries/sec)\", fontsize=FONTSIZE-4)\n",
+ "    axs.set_xlabel(\"Recall@10\", fontsize=FONTSIZE-4)\n",
+ "    axs.set_title(title, fontsize=FONTSIZE)\n",
+ "    axs.legend(fontsize=18)\n",
+ "    #axs[i, j].set_legend(loc='upper center', bbox_to_anchor=(0.76, 1.01), ncol=1, fancybox=True, shadow=True, fontsize=FONTSIZE-8)\n",
+ "    axs.grid(visible=True, which='major', color='black', linestyle='-')\n",
+ "    axs.grid(visible=True, which='minor', color='gray', linestyle='--')\n",
+ "    axs.minorticks_on()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4c7927f8-626a-4ac6-a889-47c8bd023954",
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true,
+ "tags": []
+ },
+ "source": [
+ "### Plot RCV1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "be44342f-f53a-4266-a865-42d1e4f0cc0e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_results_rcv1():\n",
+ " results_dict = {\n",
+ " \"PECOS-HNSW\": {\n",
+ " \"color\": \"blue\",\n",
+ " \"marker\": \"D\",\n",
+ " \"linestyle\": \"-\",\n",
+ " \"recall\": [0.7733, 0.8545, 0.9043, 0.9325, 0.9434, 0.9533, 0.9621, 0.9657, 0.9678],\n",
+ " \"throughput\": [5250.297, 3677.292, 2409.959, 1508.349, 1125.047, 763.752, 433.872, 305.747, 237.651],\n",
+ " },\n",
+ " \"HNSW(NMSLIB)\": {\n",
+ " \"color\": \"black\",\n",
+ " \"marker\": \"o\",\n",
+ " \"linestyle\": \"--\",\n",
+ " \"recall\": [0.7790, 0.8581, 0.9055, 0.9326, 0.9426, 0.9523, 0.9608, 0.9644, 0.9663],\n",
+ " \"throughput\": [2710.256, 1924.505, 1271.085, 800.999, 597.873, 404.518, 229.553, 161.879, 124.806],\n",
+ " }\n",
+ " }\n",
+ " return results_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "d8466781-720d-4eeb-8762-314f08f3c656",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "