diff --git a/tutorials/kdd22/README.md b/tutorials/kdd22/README.md
new file mode 100644
index 0000000..2f3e3d2
--- /dev/null
+++ b/tutorials/kdd22/README.md
@@ -0,0 +1,24 @@
+# KDD 2022 Hands-on Tutorial - PECOS: Prediction for Enormous and Correlated Output Spaces
+
+In this tutorial, we will introduce several key functions and features of the PECOS library.
+Through real-world examples, attendees will learn how to efficiently train large-scale machine learning models for enormous output spaces and obtain predictions in less than 1 millisecond for an input with a million labels, in the context of product recommendation and natural language processing.
+We will also show how the assorted built-in utilities in PECOS make it flexible enough to handle diverse machine learning problems and data formats.
+By the end of the tutorial, we believe that attendees will be able to adapt these concepts to their own projects and address different machine learning problems with enormous output spaces.
+
+* Presenters: Hsiang-Fu Yu (Amazon Search), Jiong Zhang (Amazon Search), Wei-Cheng Chang (Amazon Search), Jyun-Yu Jiang (Amazon Search), and Cho-Jui Hsieh (UCLA)
+
+* Contributor: Wei Li (Amazon Search)
+
+## Agenda
+
+| Time | Session | Material |
+|---|---|---|
+| 8:00 AM - 8:30 AM | Check-in and Environment Setup | |
+| 8:30 AM - 8:50 AM | Session 1: Introduction to PECOS | |
+| 8:50 AM - 9:30 AM | Session 2: Extreme Multi-label Ranking with PECOS | [Notebook](https://github.com/amzn/pecos/blob/tutorials/kdd22/Session%202%20Extreme%20Multi-label%20Ranking%20with%20PECOS.ipynb) |
+| 9:30 AM - 10:00 AM | Coffee Break | |
+| 10:00 AM - 10:30 AM | Session 3: Approximate Nearest Neighbor (ANN) Search in PECOS | [Notebook](https://github.com/amzn/pecos/blob/tutorials/kdd22/Session%203%20Approximate%20Nearest%20Neighbor%20Search%20in%20PECOS.ipynb) |
+| 10:30 AM - 11:10 AM | Session 4: Utilities in PECOS | [Notebook](https://github.com/amzn/pecos/blob/tutorials/kdd22/Session%204%20Utilities%20in%20PECOS) |
+| 11:10 AM - 11:40 AM | Session 5: XR-Transformer cookbook and Distributed PECOS | [Notebook](https://github.com/amzn/pecos/blob/tutorials/kdd22/Session%205%20XR-Transformer%20cookbook%20and%20Distributed%20PECOS) |
+| 11:40 AM - 11:50 AM | Session 6: Research with PECOS | |
+| 11:50 AM - 12:00 PM | Closing Remarks | |
diff --git a/tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb b/tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb
new file mode 100644
index 0000000..81f6ce5
--- /dev/null
+++ b/tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb
@@ -0,0 +1,1493 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "67e70878",
+ "metadata": {},
+ "source": [
+ "# eXtreme Multi-label Ranking (XMR) Problem and PECOS\n",
+ "\n",
+ "Prediction for Enormous and Correlated Output Spaces (PECOS) is a versatile and modular machine learning framework for solving prediction problems with very large output spaces. We apply PECOS to the eXtreme Multi-label Ranking (XMR) problem: given an input instance, find and rank the most relevant items from an enormous but fixed and finite output space.\n",
+ "\n",
+ "*(Figure: the three stages of the PECOS framework.)*\n",
+ "\n",
+ "As shown in the figure above, PECOS addresses the XMR problem in three conceptual stages: semantic label indexing, machine-learned matching, and ranking. For more details about the XMR problem and the model formulation, please refer to the presentations from PECOS Day. In this part of the tutorial, we use XR-Linear as an example to demonstrate how to tackle real-world problems with PECOS and to understand its model architecture."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "41d87d24",
+ "metadata": {},
+ "source": [
+ "## Experimental Dataset\n",
+ "\n",
+ "`eurlex-4k`, `wiki10-31k`, `amazoncat-13k`, `amazon-670k`, `wiki-500k`, and `amazon-3m` are available."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "1073ac9c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2022-07-14 08:54:02 URL:https://ia802308.us.archive.org/21/items/pecos-dataset/xmc-base/wiki10-31k.tar.gz [162277861/162277861] -> \"wiki10-31k.tar.gz\" [1]\n",
+ "xmc-base/wiki10-31k/output-items.txt\n",
+ "xmc-base/wiki10-31k/tfidf-attnxml\n",
+ "xmc-base/wiki10-31k/tfidf-attnxml/X.trn.npz\n",
+ "xmc-base/wiki10-31k/tfidf-attnxml/X.tst.npz\n",
+ "xmc-base/wiki10-31k/X.trn.txt\n",
+ "xmc-base/wiki10-31k/X.tst.txt\n",
+ "xmc-base/wiki10-31k/Y.trn.npz\n",
+ "xmc-base/wiki10-31k/Y.trn.txt\n",
+ "xmc-base/wiki10-31k/Y.tst.npz\n",
+ "xmc-base/wiki10-31k/Y.tst.txt\n"
+ ]
+ }
+ ],
+ "source": [
+ "DATASET = \"wiki10-31k\"\n",
+ "! wget -nv -nc https://archive.org/download/pecos-dataset/xmc-base/{DATASET}.tar.gz\n",
+ "! tar --skip-old-files -zxf {DATASET}.tar.gz \n",
+ "! find xmc-base/{DATASET}/*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "73f0fa78",
+ "metadata": {},
+ "source": [
+ "### Analyze Sparse Features and Label Space"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "d680e1e0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "import matplotlib.pyplot as plt\n",
+ "X_trn = smat.load_npz(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\")\n",
+ "Y_trn = smat.load_npz(f\"xmc-base/{DATASET}/Y.trn.npz\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "0b34281d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'14146 instances with 101938 features.'"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"{} instances with {} features.\".format(*X_trn.shape)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "7f16a000",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Overall Sparsity: 99.34%'"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"Overall Sparsity: {:.2f}%\".format(100 * (1 - X_trn.nnz / (X_trn.shape[0] * X_trn.shape[1])))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "dcf0f0cf",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "counts, bins = np.histogram(100 - 100 * X_trn.getnnz(1) / X_trn.shape[1], bins=20)\n",
+ "plt.hist(bins[:-1], bins, weights=counts)\n",
+ "plt.title(DATASET);\n",
+ "plt.xlabel(\"Feature Sparsity (%)\");\n",
+ "plt.ylabel(\"Number of Instances\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1e0157b",
+ "metadata": {},
+ "source": [
+ "### Extremely large label space"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "49f8fe28",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'14146 instances with 30938 labels.'"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"{} instances with {} labels.\".format(*Y_trn.shape)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "64f5fc9b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Overall Sparsity: 99.94%'"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"Overall Sparsity: {:.2f}%\".format(100 * (1 - Y_trn.nnz / (Y_trn.shape[0] * Y_trn.shape[1])))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "b5cb2084",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "counts, bins = np.histogram(100 - 100 * Y_trn.getnnz(1) / Y_trn.shape[1], bins=20, range=(99.85, 100))\n",
+ "plt.hist(bins[:-1], bins, weights=counts)\n",
+ "plt.title(DATASET);\n",
+ "plt.xlabel(\"Label Sparsity (%)\");\n",
+ "plt.ylabel(\"Number of Instances\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "057fb642",
+ "metadata": {},
+ "source": [
+ "## Numerical Feature and Label Format in PECOS\n",
+ "\n",
+ "In PECOS, numerical features of instances can be either a [dense NumPy matrix](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or a [Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) of shape `(nr_inst, nr_feat)`, where `nr_inst` and `nr_feat` are the numbers of instances and features. Similarly, labels of instances can be represented as a dense or a sparse matrix of shape `(nr_inst, nr_labels)`, where `nr_labels` is the number of labels in the XMR problem. Note that in the sparse format, training labels should be a [Compressed Sparse Column (CSC) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html) while testing labels should be a CSR matrix, for computational efficiency. For convenience, PECOS also provides APIs for loading features and labels from binary files in arbitrary formats.\n",
+ "\n",
+ "In addition to numerical features, PECOS also supports handling text data with transformers. Please refer to [Part 2](Part%202%20-%20Text%20Processing.ipynb) in this tutorial for more details about text processing in PECOS."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "c518d892",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training features X_trn is a csr matrix of shape (14146, 101938).\n",
+ "Training labels Y_trn is a csc matrix of shape (14146, 30938).\n",
+ "Testing features X_tst is a csr matrix of shape (6616, 101938).\n",
+ "Testing labels Y_tst is a csr matrix of shape (6616, 30938).\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "from pecos.xmc.xlinear.model import XLinearModel\n",
+ "\n",
+ "DATASET = \"wiki10-31k\"\n",
+ "\n",
+ "X_trn = XLinearModel.load_feature_matrix(\"xmc-base/{}/tfidf-attnxml/X.trn.npz\".format(DATASET))\n",
+ "Y_trn = XLinearModel.load_label_matrix(\"xmc-base/{}/Y.trn.npz\".format(DATASET), for_training=True)\n",
+ "\n",
+ "X_tst = XLinearModel.load_feature_matrix(\"xmc-base/{}/tfidf-attnxml/X.tst.npz\".format(DATASET))\n",
+ "Y_tst = XLinearModel.load_label_matrix(\"xmc-base/{}/Y.tst.npz\".format(DATASET), for_training=False)\n",
+ "\n",
+ "print(f\"Training features X_trn is a {X_trn.getformat()} matrix of shape {X_trn.shape}.\")\n",
+ "print(f\"Training labels Y_trn is a {Y_trn.getformat()} matrix of shape {Y_trn.shape}.\")\n",
+ "print(f\"Testing features X_tst is a {X_tst.getformat()} matrix of shape {X_tst.shape}.\")\n",
+ "print(f\"Testing labels Y_tst is a {Y_tst.getformat()} matrix of shape {Y_tst.shape}.\")"
+ ]
+ },
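+ {
+ "cell_type": "markdown",
+ "id": "sketch-csr-csc",
+ "metadata": {},
+ "source": [
+ "The CSR and CSC conventions described above can be illustrated with plain SciPy. The following is a minimal sketch on a made-up toy label matrix, not PECOS code; the variable names and values are assumptions chosen for illustration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sketch-csr-csc-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "# Toy label matrix (made-up values): 3 instances, 4 labels.\n",
+ "Y_dense = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=np.float32)\n",
+ "\n",
+ "# CSC stores data column by column: cheap access to all positives of one label.\n",
+ "Y_csc = smat.csc_matrix(Y_dense)\n",
+ "# CSR stores data row by row: cheap access to all labels of one instance.\n",
+ "Y_csr = Y_csc.tocsr()\n",
+ "\n",
+ "print(Y_csc.getformat(), Y_csr.getformat())\n",
+ "print(\"Instances that are positive for label 3:\", Y_csc[:, 3].indices)\n",
+ "print(\"Labels of instance 0:\", Y_csr[0].indices)"
+ ]
+ },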
+ {
+ "cell_type": "markdown",
+ "id": "b0c731f5",
+ "metadata": {},
+ "source": [
+ "## Hands-on Example: XMR with XR-Linear\n",
+ "\n",
+ "XR-Linear is a recursive, linear, machine-learned realization of our PECOS framework. As shown in the figure below, XR-Linear treats machine-learned matching as a smaller XMR problem, thereby recursively applying the three-stage framework of PECOS to address the problem.\n",
+ "\n",
+ "\n",
+ "*(Figure: XR-Linear recursively applies the three-stage PECOS framework.)*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "150fea14",
+ "metadata": {},
+ "source": [
+ "### Semantic Label Indexing and Cluster Chain in XR-Linear\n",
+ "\n",
+ "The first step of training an XR-Linear model is to conduct semantic label indexing and establish the *hierarchical label tree* used to recursively train the XR-Linear model and to run inference.\n",
+ "\n",
+ "PECOS supports any method for semantic label indexing. As a built-in method, the PECOS library provides Label Representation via Positive Instance Feature Aggregation (PIFA), which needs only the positive instances and their features in the training data. PECOS can also take additional label features `Z` of shape `(nr_labels, nr_label_feat)` in either dense or sparse matrix format, where `nr_label_feat` is the number of label features. These representations and features for each label are concatenated or combined into a label embedding by `LabelEmbeddingFactory` in PECOS.\n",
+ "\n",
+ "To conduct semantic label indexing, PECOS learns an indexer based on the label embedding. PECOS currently supports Hierarchical K-Means for semantic label indexing, with a hyper-parameter `nr_splits` (the number of clusters in each layer, or `B` in [our report](https://arxiv.org/pdf/2010.05878.pdf)) that decides the depth `D` of the hierarchical label tree."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "26794215",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "4 layers in the trained hierarchical label tree.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc import Indexer, LabelEmbeddingFactory\n",
+ "\n",
+ "label_feat = LabelEmbeddingFactory.create(Y_trn, X_trn, method=\"pifa\")\n",
+ "# label_feat = LabelEmbeddingFactory.create(Y_trn, X_trn, Z, method=\"pifa_lf_concat\") # for using label features Z\n",
+ "\n",
+ "cluster_chain = Indexer.gen(label_feat, nr_splits=8, indexer_type=\"hierarchicalkmeans\")\n",
+ "\n",
+ "print(f\"{len(cluster_chain)} layers in the trained hierarchical label tree.\")"
+ ]
+ },
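+ {
+ "cell_type": "markdown",
+ "id": "sketch-pifa",
+ "metadata": {},
+ "source": [
+ "To make the idea behind PIFA concrete, the following is a minimal sketch in plain SciPy on made-up toy data. It assumes that a label embedding is the sum of the feature vectors of the label's positive instances, followed by L2 normalization; the implementation inside `LabelEmbeddingFactory` may weight or normalize differently."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sketch-pifa-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "# Toy data (made-up values): 3 instances, 4 features, 2 labels.\n",
+ "X_toy = smat.csr_matrix(np.array([[1.0, 0, 0, 0], [0, 2.0, 0, 0], [0, 0, 0, 3.0]]))\n",
+ "Y_toy = smat.csc_matrix(np.array([[1, 0], [1, 0], [0, 1]], dtype=np.float64))\n",
+ "\n",
+ "# Sum the feature vectors of each label's positive instances: (nr_labels, nr_feat).\n",
+ "Z_toy = (Y_toy.T @ X_toy).tocsr()\n",
+ "\n",
+ "# L2-normalize each label row (an assumption of this sketch); empty rows stay zero.\n",
+ "norms = np.sqrt(np.asarray(Z_toy.multiply(Z_toy).sum(axis=1)).ravel())\n",
+ "norms[norms == 0] = 1.0\n",
+ "Z_toy = smat.diags(1.0 / norms) @ Z_toy\n",
+ "\n",
+ "print(Z_toy.toarray())"
+ ]
+ },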
+ {
+ "cell_type": "markdown",
+ "id": "02ffda21",
+ "metadata": {},
+ "source": [
+ "### Training XR-Linear: Negative Sampling and Sparsification\n",
+ "\n",
+ "Negative sampling plays an important role in solving the XMR problem. PECOS currently provides two negative sampling schemes: Teacher Forcing Negatives (TFN) and Matcher Aware Negatives (MAN). Please refer to [our report](https://arxiv.org/pdf/2010.05878.pdf) and the presentations from [PECOS Day](https://w.amazon.com/bin/view/Search/MIDAS/Projects/PECOS/PecosDay/) for more details about negative sampling schemes.\n",
+ "\n",
+ "To reduce model size and improve efficiency, PECOS conducts model sparsification with a hyper-parameter `threshold`. Model weights with absolute values smaller than the threshold are discarded."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "bd3d6527",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training time: 40.9793 seconds.\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "start_time = time.time()\n",
+ "\n",
+ "# For negative_sampling_scheme in model training, \"man\" and \"tfn+man\" are also available.\n",
+ "xlm = XLinearModel.train(X_trn, Y_trn, C=cluster_chain, threshold=0.1, negative_sampling_scheme=\"tfn\")\n",
+ "\n",
+ "training_time = time.time() - start_time\n",
+ "print(f\"Training time: {training_time:.4f} seconds.\")"
+ ]
+ },
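+ {
+ "cell_type": "markdown",
+ "id": "sketch-sparsify",
+ "metadata": {},
+ "source": [
+ "The effect of `threshold` can be sketched on a toy weight matrix with plain SciPy. This is an illustration under the assumption that sparsification simply drops weights whose magnitude is below the threshold; it is not how PECOS implements it internally, and the values are made up."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sketch-sparsify-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "# Toy weight matrix (made-up values), one column per label.\n",
+ "W_toy = smat.csc_matrix(np.array([[0.5, -0.05], [0.02, 0.3], [-0.2, 0.0]]))\n",
+ "toy_threshold = 0.1\n",
+ "nnz_before = W_toy.nnz\n",
+ "\n",
+ "# Zero out entries whose magnitude is below the threshold, then drop them from storage.\n",
+ "W_toy.data[np.abs(W_toy.data) < toy_threshold] = 0.0\n",
+ "W_toy.eliminate_zeros()\n",
+ "\n",
+ "print(f\"Kept {W_toy.nnz} of {nnz_before} stored weights.\")"
+ ]
+ },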
+ {
+ "cell_type": "markdown",
+ "id": "20f5cfa7",
+ "metadata": {},
+ "source": [
+ "PECOS can serialize a trained model to disk in binary form and load it back through convenient interfaces. Note that loading a model with `is_predict_only=True` can speed up prediction, at the cost of disabling model modification."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "3d5d468d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xlm.save(\"{}.xlm.model\".format(DATASET))\n",
+ "xlm = XLinearModel.load(\"{}.xlm.model\".format(DATASET), is_predict_only=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4b6038ec",
+ "metadata": {},
+ "source": [
+ "### Prediction and Evaluation\n",
+ "\n",
+ "For a tree model such as XR-Linear, the inference method significantly affects prediction efficiency. As illustrated in the following figure, prediction in PECOS employs beam search with a hyper-parameter `beam_size`. A second hyper-parameter, `only_topk`, limits the number of most relevant labels returned for each instance. The `predict` function of the trained model returns a CSR matrix of shape `(nr_inst, nr_labels)` with exactly `only_topk` non-zero columns in each row (or instance).\n",
+ "\n",
+ "\n",
+ "*(Figure: beam search during XR-Linear prediction.)*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "7f851bc1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Y_pred is a csr matrix of shape (6616, 30938) and 66160 non-zero elements.\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "Y_pred = xlm.predict(X_tst, beam_size=10, only_topk=10)\n",
+ "\n",
+ "print(f\"Y_pred is a {Y_pred.getformat()} matrix of shape {Y_pred.shape} and {Y_pred.nnz} non-zero elements.\")\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "plt.plot(range(Y_pred.shape[0]), Y_pred.getnnz(1))\n",
+ "plt.xlabel(\"Instance ID\");\n",
+ "plt.ylabel(\"Number of Predictions in Y_pred\");"
+ ]
+ },
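+ {
+ "cell_type": "markdown",
+ "id": "sketch-beam-search",
+ "metadata": {},
+ "source": [
+ "The roles of `beam_size` and `only_topk` can be sketched on a toy two-layer tree for a single instance. The scores are made up, and combining layer scores by multiplication is an assumption of this sketch rather than a statement about the PECOS implementation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sketch-beam-search-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Toy tree (made-up scores for one instance): 2 clusters with 2 labels each.\n",
+ "cluster_scores = np.array([0.9, 0.4])\n",
+ "label_scores = np.array([[0.8, 0.1], [0.7, 0.6]])  # row = parent cluster\n",
+ "toy_beam_size, toy_only_topk = 1, 2\n",
+ "\n",
+ "# Layer 1: keep only the best `toy_beam_size` clusters.\n",
+ "beam = np.argsort(-cluster_scores)[:toy_beam_size]\n",
+ "\n",
+ "# Layer 2: expand only the kept clusters; labels under pruned clusters are never scored.\n",
+ "candidates = [\n",
+ "    (int(c) * 2 + j, float(cluster_scores[c] * label_scores[c, j]))\n",
+ "    for c in beam\n",
+ "    for j in range(label_scores.shape[1])\n",
+ "]\n",
+ "\n",
+ "# Return the `toy_only_topk` best (label_id, score) pairs among the survivors.\n",
+ "print(sorted(candidates, key=lambda pair: -pair[1])[:toy_only_topk])"
+ ]
+ },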
+ {
+ "cell_type": "markdown",
+ "id": "fb1ed22c",
+ "metadata": {},
+ "source": [
+ "We evaluate the trained model with conventional ranking metrics, including Precision@K and Recall@K, using the evaluation interface that PECOS provides for predicted sparse matrices."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "4c57da1a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "prec = 84.07 78.17 72.68 67.79 63.79 60.06 56.63 53.51 50.83 48.33\n",
+ "recall = 4.97 9.16 12.68 15.60 18.25 20.49 22.40 24.05 25.60 26.95\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.utils import smat_util\n",
+ "metrics = smat_util.Metrics.generate(Y_tst, Y_pred, topk=10)\n",
+ "print(metrics)"
+ ]
+ },
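+ {
+ "cell_type": "markdown",
+ "id": "sketch-prec-recall",
+ "metadata": {},
+ "source": [
+ "For readers who want to see what these metrics measure, the following is a minimal sketch of Precision@K and Recall@K on made-up toy matrices. The helper `prec_recall_at_k` is hypothetical and written for this illustration; `smat_util.Metrics` may handle ties and rows with fewer than K predictions differently."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sketch-prec-recall-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "def prec_recall_at_k(Y_true, Y_pred, k):\n",
+ "    \"\"\"Average Precision@k and Recall@k over instances (hypothetical helper).\"\"\"\n",
+ "    Y_true, Y_pred = Y_true.tocsr(), Y_pred.tocsr()\n",
+ "    precs, recalls = [], []\n",
+ "    for i in range(Y_true.shape[0]):\n",
+ "        truth = set(Y_true[i].indices)\n",
+ "        row = Y_pred[i]\n",
+ "        # Predicted labels sorted by descending score, truncated to the top k.\n",
+ "        top = row.indices[np.argsort(-row.data)][:k]\n",
+ "        hits = sum(1 for label in top if label in truth)\n",
+ "        precs.append(hits / k)\n",
+ "        recalls.append(hits / max(len(truth), 1))\n",
+ "    return float(np.mean(precs)), float(np.mean(recalls))\n",
+ "\n",
+ "# Toy example (made-up values): 2 instances, 4 labels.\n",
+ "Y_true_toy = smat.csr_matrix(np.array([[1, 0, 1, 0], [0, 1, 0, 0]]))\n",
+ "Y_pred_toy = smat.csr_matrix(np.array([[0.9, 0.1, 0.8, 0.0], [0.2, 0.7, 0.0, 0.6]]))\n",
+ "print(prec_recall_at_k(Y_true_toy, Y_pred_toy, k=2))"
+ ]
+ },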
+ {
+ "cell_type": "markdown",
+ "id": "32908310",
+ "metadata": {},
+ "source": [
+ "### Deep Dive into the Cluster Chain\n",
+ "\n",
+ "Specifically, PECOS trains a *cluster_chain* of `D` matching matrices `C[d]`, where `C[d]` is a CSC matrix of shape `(L[d], K[d])`, and `L[d]` and `K[d]` are the numbers of labels and clusters in layer `d`. Note that the labels of a layer are the clusters of the next layer, that is, `K[d + 1] = L[d]`. The number of labels in the last layer, `L[D - 1]`, equals `nr_labels`, the number of labels in the overall XMR problem."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "6b0cb55e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "4 layers in the trained hierarchical label tree with C[d] as:\n",
+ "cluster_chain[0] is a csc matrix of shape (8, 1)\n",
+ "cluster_chain[1] is a csc matrix of shape (64, 8)\n",
+ "cluster_chain[2] is a csc matrix of shape (512, 64)\n",
+ "cluster_chain[3] is a csc matrix of shape (30938, 512)\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"{len(cluster_chain)} layers in the trained hierarchical label tree with C[d] as:\")\n",
+ "for d, C in enumerate(cluster_chain):\n",
+ "    print(f\"cluster_chain[{d}] is a {C.getformat()} matrix of shape {C.shape}\")"
+ ]
+ },
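+ {
+ "cell_type": "markdown",
+ "id": "sketch-chain-check",
+ "metadata": {},
+ "source": [
+ "The shape relationship between consecutive layers can be checked on a small hand-built chain. This sketch uses plain SciPy matrices with made-up assignments in place of a real PECOS cluster chain, and assumes each label belongs to exactly one cluster per layer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "sketch-chain-check-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "# Toy chain (made-up): root -> 2 clusters -> 4 labels.\n",
+ "toy_chain = [\n",
+ "    smat.csc_matrix(np.ones((2, 1))),\n",
+ "    smat.csc_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]])),\n",
+ "]\n",
+ "\n",
+ "# The clusters of layer d are the rows (labels) of layer d - 1.\n",
+ "for d in range(1, len(toy_chain)):\n",
+ "    assert toy_chain[d].shape[1] == toy_chain[d - 1].shape[0]\n",
+ "\n",
+ "# Assumed here: every label is assigned to exactly one cluster in its layer.\n",
+ "for C_toy in toy_chain:\n",
+ "    assert (C_toy.getnnz(axis=1) == 1).all()\n",
+ "\n",
+ "print([C_toy.shape for C_toy in toy_chain])"
+ ]
+ },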
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "55eeb4e5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAbgAAAEyCAYAAACI4cUNAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAA3aElEQVR4nO3debhcVZnv8e+PMIU5MQjpYDgBmb2CMQwKQpQZh9BeQUAREMQBWnAkIBfQtvuJqKjdKDQgQ2gFQUSC0BBAZGgFkmAIswmYSGJIQKaEIUB47x9rFSlO6pxTp3aN5/w+z1NPdq091Lt37dR79tprr6WIwMzMbKBZpdUBmJmZNYITnJmZDUhOcGZmNiA5wZmZ2YDkBGdmZgOSE5yZmQ1ITnBmZjYgOcGZtYCkuZL2anUc3Uk6TNI8SS9K+q2k4a2OyaxWTnBmg5CkVSuUbQf8F3A4sBHwEvCzJodmVjdOcGZtRNIwSb+T9JSkZ/P0JnneQZJmdFv+q5KuydNrSPqBpL9JWiTpXElD87zxkuZLOknSk8BFFT7+U8C1EXF7RCwF/h/wcUnrNnSnzRrECc6svaxCSj6bAqOBl4Gz87wpwBhJ25QtfzgwOU9PArYEdgDeCYwCTitbdmNgeN72sRU+ezvgvtKbiHgMeDVv06zjOMGZtZGI+EdEXBURL0XEEuDfgD3yvGXAr4BPw5tVil3A7ySJlLS+EhHP5HX/HTikbPNvAKdHxLKIeLnCx68DPN+t7HnAV3DWkVaqhzez1pG0FvAjYD9gWC5eV9KQiFgOXAJcJulU0tXbFRGxTNLbgbWAGSnXpc0BQ8o2/1REvNLLxy8F1utWth6wpMg+mbWKr+DM2svXgK2AnSNiPWD3XC6AiLiLVG34AeAw4NI8/2lSdeZ2EbFBfq0fEeuUbbuvoUMeBLYvvZG0GbAG8Jdiu2TWGk5wZq2zmqQ1y16rkqoDXwaey030T6+w3mTSfbnXIuJOgIh4Azgf+FG+mkPSKEn79iOeXwAflfQBSWsD3wF+k6s7zTpOVQlO0q75hEfSpyWdJWnTxoZmNuBdT0pmpdcZwI+BoaQrsruAGyqsdynwLuC/u5WfBMwB7pL0AnAz6WqwKhHxIPAFUqJbTEq2X6p2fbN2o2oGPJU0i1R18W7gYuAC4OCI2KOh0ZnZSnLT/8XA2IiY3ep4zNpVtVWUr0fKhBOAsyPip7hllVmrfBGY5uRm1rtqW1EukXQyqdXWByStAqzWuLDMrBJJc0kNTg5sbSRm7a/aKsqNSS22pkXEHZJGA+MjYnIfq5qZmbVEVQkOIDcq2SIibs7P6gxx6yozM2tXVVVRSvocqZeE4cDmpC6AzgX2bFxotRsxYkR0dXW1OgwzM2uwGTNmPB0RG1aaV+09uOOAnYC7ASJidulZm3bU1dXF9OnTWx2GmZk1mKR5Pc2rNsEti4hXS10A5QdSq6vbNLOm6Jp4XatDqNrcSR9udQg2CFT7mMBtkk4BhkraG7gSuLZxYZmZmRVT7RXcROBo4H7g86QeGC5oVFBmNrD5atOaodoENxS4MCLOB5A0JJe91KjAzMzMiqi2ivIWUkIrGUrq587MzKwtVZvg1sxD2AOQp9dqTEhmZmbFVZvgXpQ0tvRG0ntJvZ+bmZm1pWrvwZ0IXCnp76R+8DYGPtmooMzMzIqqKsFFxDRJW7NibKlHI+K1xoVlZmZWTLVXcAA7Al15nbGScGfLZmbWrqrti/JSUh+UM4HluTgAJzgzM2tL1V7BjQO2jWqHHjAbIDrpgWQze6tqW1E+QGpYUjVJF0paLOmBsrLhkm6SNDv/OyyXS9J/SJojaVa3FptH5OVnSzqiPzGYmdngVW2CGwE8JOlGSVNKrz7WuRjYr1vZROCWiNiC9PD4xFy+P7BFfh0LnAMpIQKnAzuTRjM4vZQUzczMelNtFeUZ/d1wRNwuqatb8QRgfJ6+BPgDcFIun5yrQO+StIGkkXnZ
myLiGQBJN5GS5mX9jcfMzAaXah8TuK1On7dRRCzM008CG+XpUcATZcvNz2U9la9E0rGkqz9Gjx5dp3DNbLDrtPuw7hx6haqqKCXtImmapKWSXpW0XNILRT44X63VrdFKRJwXEeMiYtyGG1Yc3NXMzAaRau/BnQ0cCswmdbR8DPDTGj5vUa56JP+7OJcvAN5RttwmuayncjMzs15Vm+CIiDnAkIhYHhEXsXIDkmpMAUotIY8Arikr/0xuTbkL8HyuyrwR2EfSsNy4ZJ9cZmZm1qtqG5m8JGl1YKakM4GF9JEcJV1GaiQyQtJ8UmvIScAVko4G5gEH58WvBw4A5pDGmDsKICKekfSvwLS83HdKDU7MzMx6U22CO5yU0I4HvkKqNvx4bytExKE9zNqzwrIBHNfDdi4ELqwyTjMzM6D6KsoDI+KViHghIr4dEV8FPtLIwMzMzIqoNsFV6kHkyDrGYWZmVle9VlFKOhQ4DBjTreeS9QDfCzMzs7bV1z24P5IalIwAflhWvgSY1aigzMzMiuo1wUXEPGCepL2AlyPiDUlbAlsD9zcjQBt4Oq1nCDPrTNXeg7sdWFPSKGAqqVXlxY0KyszMrKhqE5wi4iXSowE/i4iDgO0aF5aZmVkxVSc4Se8DPgWU6peGNCYkMzOz4qpNcCcAJwNXR8SDkjYDbm1cWGZmZsVUO1zO7aT7cKX3jwNfblRQZmZWm05qxNXooX2qSnC55eTXga7ydSLiQ40Jy8zMrJhq+6K8EjgXuABY3rhwrFad9FebmVkzVJvgXo+IcxoaSR8k7Qf8hNS45YKImNTKeMzMrL1V28jkWklfkjRS0vDSq6GRlZE0hDTA6v7AtsChkrZt1uebmVnnqfYKrtTZ8jfKygLYrL7h9GgnYE5u3IKky4EJwENN+nwzM+sw1baiHNPoQPowCnii7P18YOfyBSQdCxyb3y6V9GgdPncE8HQdttNK3of2MBD2AQbGfngf2oS+V5f92LSnGX2NJtDXoKa/qTWieouI84Dz6rlNSdMjYlw9t9ls3of2MBD2AQbGfngf2kej96OvK7iP9jIvgGYluAWkUcRLNsllZmZmFfU1msBRAJJOjYjv5uk1ImJZM4IrMw3YQtIYUmI7hDROnZmZWUW9tqKUdFLug/ITZcV/amxIK4uI14HjgRuBh4ErIuLBJnx0Xas8W8T70B4Gwj7AwNgP70P7aOh+KCJ6nilNAPYAjgHuAx4B9gH2iYh6NOIwMzNriL4S3B7A3aSRvXcEtiGNJvB7YKuIeH8zgjQzM+uvvhqZ7AucBmwOnAXMAl4s3ZszMzNrV73eg4uIUyJiT2AucCmpm6wNJd0p6domxNcSki6UtFjSA62OpRaS3iHpVkkPSXpQ0gmtjqkWktaUdI+k+/J+fLvVMdVK0hBJf5b0u1bHUgtJcyXdL2mmpOmtjqdWkjaQ9GtJj0h6OLcx6BiStsrfQen1gqQTWx1Xf0n6Sv4//YCkyySt2ZDP6a2KsiyYMyPim3n6zxHxHkkjIqLjHzSsRNLuwFJgckS8q9Xx9JekkcDIiLhX0rrADODAiOionl8kCVg7IpZKWg24EzghIu5qcWj9JumrwDhgvYj4SKvj6S9Jc4Fxnf5/XtIlwB0RcYGk1YG1IuK5FodVk9yF4QJg54iY1+p4qiVpFOn/8rYR8bKkK4DrI+Lien9WVX1RlpJbdmQu6+gTvTd5/LtnWh1HrSJiYUTcm6eXkFqejmptVP0XydL8drX86vsvsjYjaRPgw6TROKxFJK0P7A78HCAiXu3U5JbtCTzWScmtzKrAUEmrAmsBf2/Eh1Tb2fKbIuK+RgRijSGpC3gPqbFQx8lVezOBxcBNEdGJ+/Fj4JvAGy2Oo4gApkqakbvF60RjgKeAi3J18QWS1m51UAUcAlzW6iD6KyIWAD8A/gYsBJ6PiKmN+Kx+JzjrHJLWAa4CToyIF1odTy0iYnlE7EDqvWYnSR1VZSzpI8DiiJjR6lgK
2i0ixpJG9DguV+N3mlWBscA5EfEe4EVgYmtDqk2uXv0YaazOjiJpGKmz/DHAPwFrS/p0Iz7LCW6AyvesrgJ+0U59htYqVyXdCuzX4lD6a1fgY/ke1uXAhyT9d2tD6r/8VzcRsRi4mjTCR6eZD8wvqwX4NSnhdaL9gXsjYlGrA6nBXsBfI+KpiHiN1OVjQx45c4IbgHLjjJ8DD0fEWa2Op1aSNpS0QZ4eCuxN6mygY0TEyRGxSUR0kaqUfh8RDflrtVEkrZ0bK5Gr9PYBOq6FcUQ8CTwhaatctCedO+TWoXRg9WT2N2AXSWvl36o9Se0E6s4JrgJJl5G6JNtK0nxJR7c6pn7aFTicdLVQak58QKuDqsFI4FZJs0j9kd4UER3ZzL7DbQTcKek+4B7guoi4ocUx1epfgF/kc2oH4N9bG07/5T8y9qZ5nd3XVb6C/jVwL3A/KQ81pMuuqh4TMDMz6zS+gjMzswHJCc6sBXLPIHu1Oo5ykkZKmiLp75IiP2Ji1rGc4MwGofyAbXdvADcA/7fJ4Zg1hBOcWRuRNEzS7yQ9JenZPL1JnneQpBndlv+qpGvy9BqSfiDpb5IWSTo3tz5F0vjcYOokSU8CF3X/7IhYFBE/IzXoMet4TnBm7WUVUvLZFBgNvAycnedNAcZI2qZs+cOByXl6ErAlqXXgO0nds51WtuzGwPC87U7tjcSsam5FadYC+cHvYyLi5j6W2wG4NSKG5ffnAM9ExLckbUfqtHZj4FVSB+HvjojH8rLvA34ZEWMkjQemkjp7fqWPz1wVeA0YExFza91Hs1brazw4M2siSWsBPyL12DIsF68raUhELAcuAS6TdCrp6u2KiFgm6e2kTmtnpGdn0+ZIQ1yVPNVXcjMbSFxFadZevgZsRRoCZT1S7/eQkhV5qKBXgQ8Ah5HGaQR4mlSduV1EbJBf60fEOmXbdnWNDSpVJThJu5Z63Zb0aUlnSdq0saGZDXirKQ3qWnqtCqxLSlTPSRoOnF5hvcmk+3KvRcSdABHxBnA+8KN8NYekUZL27U9AeeDJNfLbNRo1EKVZM1R7BXcO8JKk7Ul/YT7GihvbZlab60nJrPQ6gzS0zlDSFdldpGb73V0KvAvo3mnzScAc4C5JLwA3k64G++Nl0r08SP1+vtzP9c3aRrUjet8bEWMlnQYsiIifl8oaH6KZlctN/xcDYyNidqvjMWtX1TYyWSLpZNJN7Q9IWoU0urKZNd8XgWlObma9qzbBfZJ0Q/uzEfGkpNHA9xsXlplVkh8vEHBgayMxa39VPweXG5VsERE356bMQyJiSUOjMzMzq1G1rSg/Rxq/579y0Sjgtw2KyczMrLBqqyiPIw1RfzdARMwuNUVuRyNGjIiurq5Wh2FmZg02Y8aMpyNiw0rzqk1wyyLi1VIPCfl5nbZ9aLSrq4vp06e3OgyzpuqaeF1Dtjt30ocbsl2zepA0r6d51T4Hd5ukU4ChkvYGrgSurUdwZmZmjVBtgpsIPAXcD3ye9IDqqY0KyszMrKhqqyiHAhdGxPkAkobkspcaFZiZmVkR1V7B3UJKaCVDSd0AmZmZtaVqE9yaEVHqn448vVZvK0i6UNJiSQ+UlQ2XdJOk2fnf0hhXkvQfkuZImiVpbNk6R+TlZ0s6on+7Z2Zmg1W1Ce7FbknnvfTdCevFpDGtyk0EbomILUhXhRNz+f7AFvl1LKlzZ8p6U9+Z9JjC6aWkaGZm1ptq78GdCFwp6e+kboI2JnXf1aOIuF1SV7fiCcD4PH0J8AdSD+gTgMmRulW5S9IGkkbmZW+KiGcAJN1ESpqXVRm3mZkNUlUluIiYJmlrVgy98WhEvFbD520UEQvz9JPARnl6FPBE2XLzc1lP5WZmZr2q9goOYEegK68zVhIRUfOYcBERkur2sLikY0nVm4wePbpemzUzsw5VVYKTdCmwOTATWJ6Lg/4PerpI0siIWJirIBfn8gXAO8qW2ySXLWBFlWap/A+VNhwR5wHnAYwb
N65te1mxxvS44d42zKy7aq/gxgHbRrVDD/RsCnAEMCn/e01Z+fGSLic1KHk+J8EbgX8va1iyD3BywRjMzGwQqDbBPUBqWLKwrwVLJF1GuvoaIWk+qTXkJOAKSUcD84CD8+LXAwcAc0gPjx8FEBHPSPpXYFpe7julBidmZma9qTbBjQAeknQPsKxUGBEf62mFiDi0h1l7Vlg2SCMWVNrOhcCFVcZpZmYGVJ/gzmhkEGZmZvVW7WMCtzU6EDMzs3qqdkTvXSRNk7RU0quSlkt6odHBmZmZ1ararrrOBg4FZpM6Wj4G+GmjgjIzMyuq6ge9I2KOpCERsRy4SNKfcZN9G+AaNUq2mTVetQnuJUmrAzMlnUl6XKDaqz8zM7OmqzbBHU5KaMcDXyH1OvLxRgVlZu3DPc9Yp6o2wR0YET8BXgG+DSDpBOAnjQrMrD9clWhm3VWb4I5g5WR2ZIUyM7M+NeoPEl8ZWrleE5ykQ4HDgDGSppTNWg9wl1lmZta2+rqC+yOpQckI4Idl5UuAWY0KysysXfhqs3P1muAiYh4wT9JewMsR8YakLYGtgfubEaCZmVktqm3qfzuwpqRRwFRSq8qLGxWUmZlZUdU2MlFEvJSHuflZRJwpaWYD4zIz6ze3prVy1V7BSdL7gE8BpTNoSGNCMjMzK67aK7gTSN1yXR0RD0raDLi1cWFZO/Bfw2bWyaodLud20n240vvHgS83KigzM7OiqkpwueXk14Gu8nUi4kONCcvMzKyYaqsorwTOBS4AljcuHKuVqxPNzN6q2gT3ekSc09BI+iBpP1LXYEOACyJiUivjMTMrwg+QN161Ce5aSV8CrgaWlQojoinddUkaQhpgdW9gPjBN0pSIeKgZn19PvtIyM2uO/nS2DPCNsrIANqtvOD3aCZiTG7cg6XJgAtBxCc7MrJE8vNEK1baiHNPoQPowCnii7P18YOfyBSQdCxyb3y6T9ECTYitqBPB0q4Poh06K17E2hmNtjLaNVd9bqaidYt20pxl9jSbQ66CmEfGbWiOqt4g4DzgPQNL0iBjX4pCq0kmxQmfF61gbw7E2hmOtv76u4D7ay7wAmpXgFpBGES/ZJJeZmZlV1NdoAkcBSDo1Ir6bp9eIiGW9rdcA04AtJI0hJbZDSOPUmZmZVdRrX5SSTsp9UH6irPhPjQ1pZRHxOnA8cCPwMHBFRDzYyyrnNSWw+uikWKGz4nWsjeFYG8Ox1pkioueZ0gRgD+AY4D7gEWAfYJ+IeLQpEZqZmdWgrwS3B3A3aWTvHYFtSKMJ/B7YKiLe34wgzczM+quvRib7AqcBmwNnAbOAF0v35szMzNpVr/fgIuKUiNgTmAtcSuoma0NJd0q6tgnx9UrSfpIelTRH0sQK89eQ9Ks8/25JXS0IE0nvkHSrpIckPSjphArLjJf0vKSZ+XVaK2LNscyVdH+OY3qF+ZL0H/m4zpI0thVx5li2KjtmMyW9IOnEbsu07NhKulDS4vLnMiUNl3STpNn532E9rHtEXma2pCMqLdOEWL8v6ZH8PV8taYMe1u31nGlSrGdIWlD2PR/Qw7q9/m40KdZflcU5t6cBpFtwXCv+VrXrOduniOjzBZxZNv3n/O+IatZt1IuUbB8j9aayOuke4bbdlvkScG6ePgT4VYtiHQmMzdPrAn+pEOt44HetPKZlsczt7fsFDgD+BxCwC3B3q2MuOyeeBDZtl2ML7A6MBR4oKzsTmJinJwLfq7DecODx/O+wPD2sBbHuA6yap79XKdZqzpkmxXoG8PUqzpFefzeaEWu3+T8ETmuT41rxt6pdz9m+XlWN6B0R3yx7e2Qua/VT7G923xURrwKl7rvKTQAuydO/BvaUpCbGCEBELIyIe/P0ElJL0FHNjqOOJgCTI7kL2EDSyFYHBewJPBYR81odSEmksRS799lafl5eAhxYYdV9gZsi4pmIeBa4CdivUXFC5VgjYmqkVswAd5GeQW25Ho5rNar53air3mLNv0cH
A5c1MoZq9fJb1ZbnbF+qSnDlIuK+RgRSg0rdd3VPGm8uk/+TPg+8rSnR9SBXk76H1Hinu/dJuk/S/0jarrmRvUUAUyXNUOoCrbtqjn0rHELPPxTtcmwBNoqIhXn6SWCjCsu04zH+LOnKvZK+zplmOT5Xp17YQzVaux3XDwCLImJ2D/Nbdly7/VZ15Dnb7wRntZO0DnAVcGJEvNBt9r2kqrXtgf8Eftvk8MrtFhFjgf2B4yTt3sJYqiJpdeBjpLELu2unY/sWkep2em7K3CYkfQt4HfhFD4u0wzlzDqlB3A7AQlLVX7s7lN6v3lpyXHv7reqUcxY6O8FV033Xm8tIWhVYH/hHU6LrRtJqpBPmF1GhD8+IeCEilubp64HVJI1ocpilWBbkfxeThkjaqdsi7dh12v7AvRGxqPuMdjq22aJSlW7+d3GFZdrmGEs6EvgI8Kn847aSKs6ZhouIRRGxPCLeAM7vIYZ2Oq6rAh8HftXTMq04rj38VnXUOVvSyQnuze678l/vhwBTui0zhRVD/XwC+H1P/0EbKdez/xx4OCLO6mGZjUv3ByXtRPpump6MJa0tad3SNKmRQfeRGaYAn1GyC/B8WfVFq/T4l3C7HNsy5eflEcA1FZa5EdhH0rBc1bZPLmsqpYGGvwl8LCJe6mGZas6Zhut2H/ife4ihmt+NZtkLeCQi5lea2Yrj2stvVcecs2/RyhYuRV+k1nx/IbWK+lYu+w7pPyPAmqQqqznAPcBmLYpzN9Il/SxgZn4dAHwB+EJe5njgQVKrrruA97co1s1yDPfleErHtTxWkQagfQy4HxjX4vNgbVLCWr+srC2OLSnpLgReI92TOJp0H/gWYDZwMzA8LzuONFp9ad3P5nN3DnBUi2KdQ7qvUjpvS62S/wm4vrdzpgWxXprPx1mkH+SR3WPN71f63Wh2rLn84tI5WrZsq49rT79VbXnO9vXqtScTMzOzTtXJVZRmZmY9coIza4HcQ8VerY6jnKQPK/VS9JykJyVdULoHZNaJnODMBqHcgq+79YHvku4DbUN6hun7zYzLrJ6c4MzaSG6B9jtJT0l6Nk9vkucdJGlGt+W/KumaPL2GpB9I+pukRZLOlTQ0zxsvab7SGI9PAhd1/+yI+GVE3BARL0XqieJ8YNeG77RZgzjBmbWXVUjJZ1NgNPAycHaeNwUYI2mbsuUPBybn6UnAlqQHnd9JugIr71h6Y1I/gZsC1fSKsTup9Z5ZR3IrSrMWkDQXOCYibu5juR2AWyNiWH5/DvBMRHwrdzl2JylxvQosBd4dEY/lZd8H/DIixkgaD0wF1ouIV6qIb2/gCmDniPhLTTtp1mJ9jQdnZk0kaS3gR6ROakt9Ka4raUhELCd1dHuZpFNJV29XRMQySW8H1gJmlPUnLlLv+SVPVZncdgF+CXzCyc06masozdrL14CtSFdO65GqCSElKyKN3vAqqZPew0gPNwM8TarO3C4iNsiv9SNinbJt91ldI+k9pKrQz0bELfXYIbNWcYIza53VJK1Z9lqVNAbXy8BzkoYDp1dYbzLpvtxrEXEnQKzof/FH+WoOSaMk7VttMJLeBdwA/EtEtHxAY7OiCiU4SbvmPtKQ9GlJZ0natD6hmQ1415OSWel1BvBjYCjpiuwuUsLp7lLgXcB/dys/idRF0l2SXiB1qbRVP+L5GrAh8HNJS/PLjUysYxVqZCJpFrA98G5Sv2oXAAdHxB51ic7MVpKb/i8mjbzc0zhiZoNe0SrK1yNlyAnA2RHxU1IVi5k1zheBaU5uZr0r2opyiaSTSa25PiBpFWC14mGZWSX58QIBB7Y2ErP2V7SKcmNSS65pEXGHpNHA+IiY3MeqZmZmDVX4Qe/cqGSLiLg5P8MzJCKW1CU6MzOzGhWqopT0OVKXP8OBzUldA50L7Fk8tNqNGDEiurq6WhmCmZk1wYwZM56OiA0rzSt6D+44YCfgboCImF16BqeVurq6mD59eqvDMDOzBpM0r6d5
RRPcsoh4tdQ1UH5Q1Z1bmrVA18TrGrLduZM+3JDtmjVa0ccEbpN0CjA0d856JeAeEMzMrOWKXsFNBI4G7gc+T+qZ4YKiQZkNdI262uoUvtq0Ziia4IYCF0bE+QCShuSyl4oGZmZmVkTRKspbSAmtZCip/zszM7OWKprg1oyIpaU3eXqtgts0MzMrrGgV5YuSxkbEvQCS3kvqFd2sqXxPx8y6K5rgTgSulPR3Uv94GwOfLBqUmZlZUYUSXERMk7Q1K8acejQiXiselpmZWTFFr+AAdgS68rbGSsKdLZuZWasV7YvyUlIflDOB5bk4ACc4MzNrqaJXcOOAbaPokARmZmZ1VvQxgQdIDUvMzMzaStEruBHAQ5LuAZaVCiPiY72tlEclXkKq1nw9IsZJGg78inQ/by5wcEQ8q9ST80+AA0g9pBxZeizBzKzR/AhK5yqa4M4osO4HI+LpsvcTgVsiYpKkifn9ScD+wBb5tTNwTv7XzMysR0UfE7itXoEAE4DxefoS4A+kBDcBmJzv890laQNJIyNiYR0/28zMBphC9+Ak7SJpmqSlkl6VtFzSC1WsGsBUSTMkHZvLNipLWk8CG+XpUcATZevOz2VmZmY9KlpFeTZwCGkcuHHAZ4Atq1hvt4hYkEf/vknSI+UzIyIk9atlZk6UxwKMHj26P6uamdkAVLQVJRExBxgSEcsj4iJgvyrWWZD/XQxcDewELJI0EiD/uzgvvgB4R9nqm+Sy7ts8LyLGRcS4DTfcsMgumZnZAFD0Cu4lSasDMyWdCSykj6QpaW1glYhYkqf3Ab4DTAGOACblf6/Jq0wBjpd0OalxyfO+/9Ycbj3mgUnBx8A6V9EEdzgpoR0PfIV0pfXxPtbZCLg6tf5nVeCXEXGDpGnAFZKOBuYBB+flryc9IjCH9JjAUQVjNjOzQaBogjswIn4CvAJ8G0DSCaTn1iqKiMeB7SuU/wPYs0J5AMcVjNPMBgFfbVq5ovfgjqhQdmTBbZqZmRVW0xWcpEOBw4AxkqaUzVoPeKYegdnA5b+yzawZaq2i/COpQckI4Idl5UuAWUWDMjMzK6qmBBcR84B5kvYCXo6INyRtCWwN3F/PAM3MzGpR9B7c7cCakkYBU0mtKi8uGpSZmVlRRROcIuIl0qMBP4uIg4DtiodlZmZWTNHHBCTpfcCngKNz2ZCC2zQzG/DckULjFb2COwE4Gbg6Ih6UtBlwa/GwzMzMiik6XM7tpPtwpfePA18uGpSZmVlRhRJcbjn5ddIo3G9uKyI+VCws6y8/W2Zm0Jjfgk6t9ix6D+5K4FzgAmB58XB6Jmk/UhdgQ4ALImJSIz/PzMw6W9EE93pEnFOXSHohaQjwU2Bv0oCn0yRNiYiHGv3Z9eYrLTOz5ijayORaSV+SNFLS8NKrLpG91U7AnIh4PCJeBS4HJjTgc8zMbIAoegVX6mz5G2VlAWxWcLvdjQKeKHs/nzQ23JvKR/QGlkp6tGz2CODpOsfU6XxMVuZjsjIfk5UNumOi7/W5SCuPyaY9zSjainJMkfXrKSLOA86rNE/S9IgY1+SQ2pqPycp8TFbmY7IyH5OVtesxqXU0gV4HNY2I39QWTo8WkAZTLdkkl5mZmVVU6xXcR3uZF0C9E9w0YAtJY0iJ7RDScD1mZmYV1TqawFEAkk6NiO/m6TUiYlk9gyv7vNclHQ/cSHpM4MKIeLAfm6hYdTnI+ZiszMdkZT4mK/MxWVlbHhNFRP9Xkk4i9WByTkTskMvujYix9Q3PzMysNrVWUT4CHARsJumO/P5tkraKiEd7X9XMzKzxar2C2wO4mzSy947ANsB1wO+BrSLi/fUM0szMrL9qfdB7X1JC2xw4i/RM2osRcVQ7JTdJ+0l6VNIcSRNbHU87kDRX0v2SZkqa3up4WkXShZIWS3qgrGy4pJskzc7/DmtljM3WwzE5Q9KCfL7MlHRAK2NsJknvkHSrpIckPSjphFw+aM+TXo5JW54nNV3BvbmydB9p
HLixwL8BjwLPRkRvrSybInfv9RfKuvcCDu3E7r3qSdJcYFxEDKoHVbuTtDuwFJgcEe/KZWcCz0TEpPwH0bCIOKmVcTZTD8fkDGBpRPyglbG1gqSRwMiIuFfSusAM4EDgSAbpedLLMTmYNjxPinbVdWNETM8PWc+PiN2Ao+oQVz24ey/rUR7q6ZluxROAS/L0JaT/uINGD8dk0IqIhRFxb55eAjxM6lVp0J4nvRyTtlQowUXEN8veHpnL2uXKoFL3Xm37RTRRAFMlzcjdm9kKG0XEwjz9JLBRK4NpI8dLmpWrMAdNdVw5SV3Ae0htD3yesNIxgTY8T4pewb0pIu6r17asoXbLj3PsDxyXq6Wsm0h197XX3w8c55Dute8ALAR+2NJoWkDSOsBVwIkR8UL5vMF6nlQ4Jm15ntQtwbUhd+9VQUQsyP8uBq4mVeVasijfYyjda1jc4nhaLiIWRcTyiHgDOJ9Bdr5IWo30Q/6Lsi4IB/V5UumYtOt5MpAT3Jvde0landS915QWx9RSktbON4aRtDawD/BA72sNKlNYMULGEcA1LYylLZR+yLN/ZhCdL5IE/Bx4OCLOKps1aM+Tno5Ju54nhVpRtrvcVPXHrOje699aG1FrSdqMdNUG6SH/Xw7WYyLpMmA8aZiPRcDpwG+BK4DRwDzg4IgYNI0uejgm40nVTgHMBT5fdv9pQJO0G3AHcD/wRi4+hXTPaVCeJ70ck0Npw/NkQCc4MzMbvAZyFaWZmQ1iTnBmLZB7lNmr1XGUk/TB3MvNc5L+IelqSX60xjqWE5zZICSpUkfrDwH7RsQGwD8Bs0nNv806khOcWRuRNEzS7yQ9JenZPL1JnneQpBndlv+qpGvy9BqSfiDpb5IWSTpX0tA8b7yk+ZJOkvQkcFH3z85Nvf9eVrQceGfDdtaswZzgzNrLKqTksympld7LwNl53hRgjKRtypY/HJicpycBW5Jas72T1HPPaWXLbgwMz9uu2IuNpNGSnsuf+3XgzKI7ZNYqbkVp1gK50+tjIuLmPpbbAbg1Iobl9+eQOvr9lqTtgDtJietVUkfJ746Ix/Ky7yM9CjJG0nhgKrBeRLxSRXzDgc8Bt0XEXTXtpFmL1TrgqZk1gKS1gB8B+wGl/vzWlTQkIpaTOve9TNKppKu3KyJimaS3A2sBM9KzuGlzpGdAS56qJrkBRMQzki4B7pM0KiJeL7xzZk3mKkqz9vI1YCtg54hYDyj1FSqAfDX1KvAB4DDg0jz/aVK14nYRsUF+rR8R65Rtu7/VNasCbwfWq2lPzFrMCc6sdVaTtGbZa1VgXVKiei5XE55eYb3JpPtyr0XEnQBlfQD+KF/NIWmUpH2rDUbSxyVtJWkVSRuSBjP+82DppcMGnkIJTtKuuU9DJH1a0lmSNq1PaGYD3vWkZFZ6nUHqWm4o6YrsLuCGCutdCrwL+O9u5ScBc4C7JL0A3Ey6GqzWqPx5S1jRFdM/92N9s7ZSdETvWcD2wLuBi4ELSP2y7VGX6MxsJbnp/2JgbETMbnU8Zu2qaBXl63k8pAnA2RHxU1IVi5k1zheBaU5uZr0r2opyiaSTSa25PiBpFWC14mGZWSX58QIBB7Y2ErP2V7SKcmNSS65pEXGHpNHA+IiY3MeqZmZmDVX4Qe/cqGSLiLg5P8MzJCKW1CU6MzOzGhWqopT0OVKXP8OBzUmtsM4F9iweWu1GjBgRXV1drQzBzMyaYMaMGU9HxIaV5hW9B3ccsBNphFsiYnbpGZxW6urqYvr06a0Ow8zMGkzSvJ7mFU1wyyLi1VLXQPlBVXduaWbWh66J1zVku3Mnfbgh2+1ERR8TuE3SKcBQSXsDVwLXFg/LzMysmKIJbiLwFKnXg8+TemY4tWhQZmZmRRWtohwKXBgR5wNIGpLLXioamJmZ9V8jqj47tdqz6BXcLaSEVjKU1P+dmZlZSxVNcGtGxNLSmzy9VsFtmpmZFVa0ivJFSWMj4l4ASe8l9YpuNiB0Uku3
TorVrBmKJrgTgSsl/Z3UP97GwCeLBmVm1i4a9YeDNV6hBBcR0yRtzYoxpx6NiNf6Wi93GLsEWE4akWBcHtzxV0AXMJc07M6zSg/Z/QQ4gNR45cjSFaNZiX+EzKy7eozovSNpPLixwKGSPlPleh+MiB0iYlx+PxG4JSK2IDVemZjL9we2yK9jgXPqELOZmQ1wRfuivJTUB+VM0tUYpJ5MahlNYAIwPk9fAvyBNELxBGByHnfuLkkbSBoZEQtrj9zMquVm59apit6DGwdsG/0fkiCAqZIC+K+IOA/YqCxpPQlslKdHAU+UrTs/lznBmXUoN4ixZiia4B4gNSzpb7LZLSIW5I6Zb5L0SPnMiIic/Kom6VhSFSajR4/uZzhmZjbQFE1wI4CHJN0DLCsVRsTHelspIhbkfxdLupo0IsGiUtWjpJHA4rz4AuAdZatvksu6b/M84DyAcePGucNnM7NBrmiCO6O/K0haG1glIpbk6X2A7wBTgCOASfnfa/IqU4DjJV0O7Aw87/tvZmbWl6KPCdxWw2obAVfnIXZWBX4ZETdImgZcIeloYB5wcF7+etIjAnNIjwkcVSRmaz036TezZijainIX4D+BbYDVgSHAixGxXk/rRMTjwPYVyv9BhZHAcwOW44rEaWZmtevURkFFn4M7GzgUmE3qaPkY4KdFgzIzMyuq8IPeETEHGBIRyyPiImC/4mGZmZkVU7SRyUuSVgdmSjqT9LhAPXpHMTMzK6RoMjo8b+N44EVSc/6PFw3KzMysqKJXcAdGxE+AV4BvA0g6gdQ5spn1wC1JzRqv6BXcERXKjiy4TTMzs8JquoKTdChwGDBG0pSyWesBz9QjMDMzsyJqraL8I6lByQjgh2XlS4BZRYMyM6uFq36tXE0JLiLmAfMk7QW8HBFvSNoS2Bq4v54BmpmZ1aLoPbjbgTUljQKmklpVXlw0KDMzs6KKJjhFxEukRwN+FhEHAdsVD8vMzKyYoo8JSNL7gE8BR+eyIQW3aTXwqMtmZm9VNMGdAJwMXB0RD0raDLi1eFjWDnzD3sw6WdHhcm4n3YcrvX8c+HLRoMzMzIoqOlzOlsDXga7ybUXEh4qFZWZmVkzRKsorgXOBC4DlxcMxMzOrj6IJ7vWIOKcukfRB0n6kPi6HABdExKRmfK6ZmXWmognuWklfAq4GlpUKI6Ku3XVJGkIaSHVvYD4wTdKUiHionp/TDG64YWbWHEUTXKmz5W+UlQWwWcHtdrcTMCc3YkHS5cAEoOMSnJmZNUfRVpRj6hVIH0YBT5S9nw/sXL6ApGOBY/PbpZIebVJszTYCeLrVQTTZYNxnGJz7PRj3GQbpfut7ddnvTXuaUetoAr0OahoRv6llu0VExHnAec3+3GaTND0ixrU6jmYajPsMg3O/B+M+g/e7Uduv9Qruo73MC6DeCW4BabTwkk1ymZmZWUW1jiZwFICkUyPiu3l6jYhY1vuaNZsGbCFpDCmxHUIaj87MzKyimjpblnRS7oPyE2XFf6pPSCuLiNeB44EbgYeBKyLiwUZ9Xpsb8NWwFQzGfYbBud+DcZ/B+90Qioj+ryRNAPYAjgHuAx4B9gH2iYiB2rjDzMw6SK0Jbg/gbtLI3jsC2wDXAb8HtoqI99czSDMzs/6qtZHJvsBpwObAWcAs4MXSvTkzM7NWq+keXEScEhF7AnOBS0ndZ20o6U5J19YxPsskzZV0v6SZkqa3Op5GkXShpMWSHigrGy7pJkmz87/DWhljI/Sw32dIWpC/85mSDmhljPUm6R2SbpX0kKQHJZ2Qywfs993LPg/073pNSfdIui/v97dz+RhJd0uaI+lXklav6+fWUkX55srSmRHxzTz954h4j6QRETHoHlhsNElzgXED/dhK2h1YCkyOiHflsjOBZyJikqSJwLCIOKmVcdZbD/t9BrA0In7QytgaRdJIYGRE3CtpXWAGcCBwJAP0++5lnw9mYH/XAtaOiKWSVgPuJI0n+lXgNxFxuaRz
gfvq2b9xTVdwJaXklh2Zywb0D7A1Vh5jsHtfphOAS/L0JaQfhAGlh/0e0CJiYUTcm6eXkFpIj2IAf9+97POAFsnS/Ha1/ArgQ8Cvc3ndv+tCCa5cRNxXr21ZRQFMlTQjd0s2mGwUEQvz9JPARq0MpsmOlzQrV2EOmKq67iR1Ae8hNV4bFN93t32GAf5dSxoiaSawGLgJeAx4Lj8GBqkLxrom+7olOGu43SJiLLA/cFyu0hp0ItWp116v3lnOITXk2gFYCPywpdE0iKR1gKuAEyPihfJ5A/X7rrDPA/67jojlEbEDqSeqnYCtG/2ZTnAdIiIW5H8Xk4Yn2qm1ETXVonzvonQPY3GL42mKiFiUfxTeAM5nAH7n+X7MVcAvyvqwHdDfd6V9HgzfdUlEPAfcCrwP2EBSqTV/3btgdILrAJLWzjekkbQ26aH6B3pfa0CZwoqhmY4ArmlhLE1T+pHP/pkB9p3nhgc/Bx6OiLPKZg3Y77unfR4E3/WGkjbI00NJY3s+TEp0pR6x6v5dF2pFac0haTPSVRukZxd/GRH/1sKQGkbSZcB40vAhi4DTgd8CVwCjgXnAwfUeVLfVetjv8aQqqyA9kvP5sntTHU/SbsAdwP3AG7n4FNI9qQH5ffeyz4cysL/rd5MakQwhXVhdERHfyb9tlwPDgT8Dn65nn8ZOcGZmNiC5itLMzAYkJzgzMxuQnODMzGxAcoIzM7MByQnOzMwGJCc4azhJIemHZe+/njsSrse2L5b0ib6XLPw5B0l6WNKt3cq7JL2ce4C/T9IfJW3Vx7a6ykcNaDalkSlGVChfR9J/SXosdwn3B0k753lLV95SVZ91oKRti8bcbZv/I2mTHN+4KtcZL+l3/fycqrdv7ckJzpphGfDxSj+qrVTWg0I1jgY+FxEfrDDvsYjYISK2Jz3rc0pdAmy+C0gdPm8REe8FjiI9l1fEgUC/Elxv30t+SPhtETG/YFw2CDjBWTO8DpwHfKX7jO5XYKUrhfwX922SrpH0uKRJkj6Vx5S6X9LmZZvZS9J0SX+R9JG8/hBJ35c0LXdg+/my7d4haQrwUIV4Ds3bf0DS93LZacBuwM8lfb+PfV0PeLa3GLp93pqSLsqf+WdJH8zlR0r6jaQblMZFO7NsnaPzvt4j6XxJZ+fyDSVdlT9vmqRdc/nbJE1VGofrAkAV4tgc2Bk4NXcXRUT8NSKu67bcW66EJJ0t6cg8PUlpnLNZkn4g6f3Ax4Dv5yvczfPrhnyFeIekrfO6F0s6V9LdwJmS9tCKsdH+rNyTD+nh9z/0dPDz1fEdku7Nr/eXfzeSrpP0aP6sVfI6+0j6U17+SqV+Isu3OSTH90D+nlY6j6091Tqit1l//RSYVf5DXYXtgW1IVxWPAxdExE5Kg0T+C3BiXq6L1Hff5sCtkt4JfAZ4PiJ2lLQG8L+SpublxwLvioi/ln+YpH8Cvge8l5Skpko6MPe48CHg6xFRabDZzZV6SV8XWIuUKCBd9VWKobx3heNIfQr/n/xjP1XSlnneDqTe5pcBj0r6T2A58P/yPiwBfg+URvL4CfCjiLhT0mjgxnz8TgfuzPvx4RxXd9sBMyNieYV5fZL0NlIXU1tHREjaICKey39I/C4ifp2XuwX4QkTMVqr+/BlpyBRIfRG+PyKWKw2cfFxE/G9OOK/kZfYn9WzTk8XA3hHxiqQtgMuAUjXjTqSryXnADaRahT8ApwJ7RcSLkk4ijVH2nbJt7gCMKhunb4MaDpG1gBOcNUVEvCBpMvBl4OUqV5tW6q5I0mNAKUHdD5RXFV6RrzpmS3qc1Ev5PsC7teLqcH1gC+BV4J7uyS3bEfhDRDyVP/MXwO70/oMKuYoyr/NJ0tXqfr3E8JeydXcD/hMgIh6RNA8oJbhbIuL5vN2HgE1JVYa3lbquknRl2fJ7AdtKb16grZeTw+7Ax/NnXCfp2T72pxbPk5LQz/MV3kr3u3Is7weuLItxjbJFrixLsP8LnJW/g9+U
VUnuCny9lzhWA86WtAPpj4Ety+bdExGP51guIx37V0hJ739zTKsDf+q2zceBzfIfGNex4jy0NucEZ830Y+Be4KKystfJVeW5yqh8yPryPuneKHv/Bm89d7v3Nxekarh/iYgby2dIGg+8WEvwVZrCiv3rKYauKrdVvv/L6fv/6yrALhHxSnlhWTLpzYPA9pKG9HEV9+b3la0JEBGvS9oJ2JPUee7xrLgyK4/vudIfAxW8+b3k0byvAw4gJZ99SX+cPBERr/YS31dIfXlunz+v/Fj0dJ7cFBGH9rTBiHhW0vbAvsAXSKNvf7aXGKxN+B6cNU2+6riCt1aRzSVVCUK6X7NaDZs+SNIq+T7SZsCjpOq5LyoNTYKkLZVGYujNPcAekkZIGkLqAPe2fsayG2kgR6qM4Q7gU6X5pA6GH+1l+9NyjMOUGmP837J5U0lVt+Tt7ZAnbwcOy2X7AysNphkRjwHTgW8rZ8R8P+vD3RadR7pKXCNX1e2Zl10HWD8iriclme3z8ktIVbfkcc/+KumgvI5y4liJpM0j4v6I+F7e561J1ZM39HJsIF0lL8xX9IeTOvct2UnSmPyH1CeBO4G7gF1ztXZp5I7yqz6UGketEhFXkaozx/YRg7UJX8FZs/2Q9Nd9yfnANZLuI/141XJ19TdSclqPdH/nFaXGFF3AvfkH+ylSi74eRcRCSRNJQ3gIuC4iqhm+o3QPTqSrjGNyeTUx/Aw4R9L9pKujIyNiWU9XXRGxQNK/5/19BniEVD0Iqfr3p5Jmkf5v30664vg2cJmkB4E/ko5XJceQvp85kl4Gnga+0e3zn5B0BWk4l7+SeoCHlMSukbRmPg5fzeWXA+dL+jLpyu5TeX9PJf0xczkr7iGWO1Gpwc0bpKvL/wF+TVkCz66T9Fqe/hOpBetVkj7DyufTNOBs4J2k7/jqiHhDqZHMZfk+KaQkVl6NPAq4KCdGgJMrxGttyKMJmHUYSetExNJ8BXc1cGFEXN3Xep2s1EgnIvxcmlXNCc6sw0j6AalByZqkaskTwv+RzVbiBGdmZgOSG5mYmdmA5ARnZmYDkhOcmZkNSE5wZmY2IDnBmZnZgPT/AaO8B7imYR5xAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "from pecos.core import clib\n",
+ "from pecos.utils import smat_util\n",
+ "\n",
+ "fig, axes = plt.subplots(nrows=len(cluster_chain), ncols=1)\n",
+ "fig.tight_layout()\n",
+ "\n",
+ "cur_Y = Y_tst\n",
+ "\n",
+ "counts, bins = np.histogram(cur_Y.getnnz(1), bins=16) \n",
+ "ax = plt.subplot(len(cluster_chain), 1, len(cluster_chain))\n",
+ "ax.hist(bins[:-1], bins, weights=counts)\n",
+ "ax.set_title(\"Layer {}\".format(len(cluster_chain) - 1))\n",
+ "plt.ylabel(\"#Instances\")\n",
+ "\n",
+ "for d in range(len(cluster_chain) - 1, 0, -1):\n",
+ " cur_Y = smat_util.binarized(clib.sparse_matmul(cur_Y, cluster_chain[d]))\n",
+ " counts, bins = np.histogram(cur_Y.getnnz(1), bins=min(16, cluster_chain[d].shape[1])) \n",
+ " ax = plt.subplot(len(cluster_chain), 1, d)\n",
+ " ax.hist(bins[:-1], bins, weights=counts)\n",
+ " ax.set_title(\"Layer {}\".format(d - 1))\n",
+ " plt.ylabel(\"#Instances\")\n",
+ " \n",
+ " \n",
+ "plt.subplot(len(cluster_chain), 1, len(cluster_chain))\n",
+ "plt.xlabel(\"Number of Belonged Clusters/Labels\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "790b21dc",
+ "metadata": {},
+ "source": [
+ "### Dive Deep in Model Weights\n",
+ "\n",
+ "Model weights in an XR-Linear model are also accessible through `xlm.model.model_chain` for analysis and computation. For the i-th layer in the hierarchy, the weights of the matchers/rankers are stored as a CSC matrix of shape `(nr_feat + 1, L[i])`, where `L[i]` is the number of clusters (or labels) in that layer and the extra row holds the bias term."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "9e101f6b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "model_chain[0].W is a csc matrix of shape (101939, 8)\n",
+ "model_chain[1].W is a csc matrix of shape (101939, 64)\n",
+ "model_chain[2].W is a csc matrix of shape (101939, 512)\n",
+ "model_chain[3].W is a csc matrix of shape (101939, 30938)\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "for d, m in enumerate(xlm.model.model_chain):\n",
+ " print(\"model_chain[{}].W is a {} matrix of shape {}\".format(d, m.W.getformat(), m.W.shape))\n",
+ "\n",
+ "layer_d = 1\n",
+ "from sklearn.decomposition import TruncatedSVD\n",
+ "svd = TruncatedSVD(n_components=2, random_state=0)\n",
+ "Wt = svd.fit_transform(xlm.model.model_chain[layer_d].W.transpose())\n",
+ "\n",
+ "import numpy as np\n",
+ "color = cluster_chain[layer_d].tocsr() * np.arange(cluster_chain[layer_d].shape[1])\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "plt.scatter(Wt[:, 0], Wt[:, 1], c=color)\n",
+ "plt.xlim(-4, 4);\n",
+ "plt.ylim(-10, 10);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ae9baf18",
+ "metadata": {},
+ "source": [
+ "### PECOS and One-versus-All (OVA) Model\n",
+ "\n",
+ "PECOS also supports training an OVA model without leveraging a clustering hierarchy, if needed.\n",
+ "\n",
+ "**Training OVA models is time-consuming, so we suggest trying it offline after the tutorial.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "c95f0acf",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training time for the OVA model: 1047.3194 seconds.\n",
+ "XR-Linear is 25.56 times faster than the OVA model\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "start_time = time.time()\n",
+ "\n",
+ "xlm_ova = XLinearModel.train(X_trn, Y_trn, C=None, negative_sampling_scheme=\"tfn\") \n",
+ "\n",
+ "training_time_ova = time.time() - start_time\n",
+ "print(\"Training time for the OVA model: {:.4f} seconds.\".format(training_time_ova))\n",
+ "\n",
+ "print(\"XR-Linear is {:.2f} times faster than the OVA model\".format(training_time_ova / training_time))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "73a599a0",
+ "metadata": {},
+ "source": [
+ "## Customized Parameters and Advanced Training Options\n",
+ "\n",
+ "PECOS also supports using customized parameters and several advanced training options, such as different solvers and cost-sensitive learning.\n",
+ "\n",
+ "### Customized Parameters\n",
+ "\n",
+ "The parameters for indexing, training, and inference can each be customized by passing a dictionary to the `from_dict` constructor of the corresponding parameter class:\n",
+ "\n",
+ "* Semantic Indexing (Hierarchical K-Means): `HierarchicalKMeans.TrainParams.from_dict(dict)`\n",
+ "* Training: `XLinearModel.TrainParams.from_dict(dict)`\n",
+ "* Inference: `XLinearModel.PredParams.from_dict(dict)`\n",
+ "\n",
+ "Although most of the parameters can also be passed as `kwargs` of the Python methods, **we encourage using a dictionary to specify the parameters because it is easier to manage, modularize, and store in formats such as JSON.**\n",
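+ "\n",
+ "As a minimal sketch of why dictionaries are convenient, the snippet below round-trips a parameter dictionary through JSON using only the Python standard library. The values in `params` are illustrative, and passing the restored dictionary to `from_dict` is only mentioned in a comment because it requires PECOS to be installed.\n",
+ "\n",
+ "```python\n",
+ "import json\n",
+ "\n",
+ "# Illustrative parameter dictionary (values are assumptions, not recommendations).\n",
+ "params = {\n",
+ "    'rel_mode': 'disable',\n",
+ "    'hlm_args': {\n",
+ "        'neg_mining_chain': 'tfn',\n",
+ "        'model_chain': {'solver_type': 'L2R_L2LOSS_SVC_DUAL', 'Cp': 1.0, 'Cn': 1.0},\n",
+ "    },\n",
+ "}\n",
+ "\n",
+ "# Serialize to a JSON string; json.dump would write to a file instead.\n",
+ "serialized = json.dumps(params, indent=2)\n",
+ "\n",
+ "# Loading the string gives back an equal dictionary, which could then be\n",
+ "# passed to XLinearModel.TrainParams.from_dict(restored).\n",
+ "restored = json.loads(serialized)\n",
+ "print(restored == params)\n",
+ "```\n",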
+ "\n",
+ "For XR-Linear models, a skeleton of the parameters with their default values can be generated by the following command:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "1ddc9bfa",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{\r\n",
+ " \"train_params\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.xlinear.model###XLinearModel.TrainParams\"\r\n",
+ " },\r\n",
+ " \"mode\": \"full-model\",\r\n",
+ " \"ranker_level\": 1,\r\n",
+ " \"nr_splits\": 16,\r\n",
+ " \"min_codes\": null,\r\n",
+ " \"shallow\": false,\r\n",
+ " \"rel_mode\": \"disable\",\r\n",
+ " \"rel_norm\": \"no-norm\",\r\n",
+ " \"hlm_args\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.base###HierarchicalMLModel.TrainParams\"\r\n",
+ " },\r\n",
+ " \"neg_mining_chain\": \"tfn\",\r\n",
+ " \"model_chain\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.base###MLModel.TrainParams\"\r\n",
+ " },\r\n",
+ " \"threshold\": 0.1,\r\n",
+ " \"max_nonzeros_per_label\": null,\r\n",
+ " \"solver_type\": \"L2R_L2LOSS_SVC_DUAL\",\r\n",
+ " \"Cp\": 1.0,\r\n",
+ " \"Cn\": 1.0,\r\n",
+ " \"max_iter\": 100,\r\n",
+ " \"eps\": 0.1,\r\n",
+ " \"bias\": 1.0,\r\n",
+ " \"threads\": -1,\r\n",
+ " \"verbose\": 0,\r\n",
+ " \"newton_eps\": 0.01\r\n",
+ " }\r\n",
+ " }\r\n",
+ " },\r\n",
+ " \"pred_params\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.xlinear.model###XLinearModel.PredParams\"\r\n",
+ " },\r\n",
+ " \"hlm_args\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.base###HierarchicalMLModel.PredParams\"\r\n",
+ " },\r\n",
+ " \"model_chain\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.base###MLModel.PredParams\"\r\n",
+ " },\r\n",
+ " \"only_topk\": 20,\r\n",
+ " \"post_processor\": \"l3-hinge\"\r\n",
+ " }\r\n",
+ " }\r\n",
+ " },\r\n",
+ " \"indexer_params\": {\r\n",
+ " \"__meta__\": {\r\n",
+ " \"class_fullname\": \"pecos.xmc.base###HierarchicalKMeans.TrainParams\"\r\n",
+ " },\r\n",
+ " \"nr_splits\": 16,\r\n",
+ " \"min_codes\": null,\r\n",
+ " \"max_leaf_size\": 100,\r\n",
+ " \"imbalanced_ratio\": 0.0,\r\n",
+ " \"imbalanced_depth\": 100,\r\n",
+ " \"spherical\": true,\r\n",
+ " \"seed\": 0,\r\n",
+ " \"kmeans_max_iter\": 20,\r\n",
+ " \"threads\": -1\r\n",
+ " }\r\n",
+ "}\r\n"
+ ]
+ }
+ ],
+ "source": [
+ "! python3 -m pecos.xmc.xlinear.train --generate-params-skeleton"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "35472517",
+ "metadata": {},
+ "source": [
+ "### Training Parameters for Hierarchical Models in XR-Linear\n",
+ "\n",
+ "Hierarchical models can have different parameters for different layers. To customize the parameters of the hierarchical model, `hlm_args` needs to be specified in the parameter dictionary. The values of `model_chain` and `neg_mining_chain` in `hlm_args` can be either **a single dictionary** of general parameters shared by all layers or **a list of dictionaries** with specific parameters for individual layers.\n",
+ "\n",
+ "#### General Parameters for All Layers\n",
+ "\n",
+ "```\n",
+ "train_params_l1 = XLinearModel.TrainParams.from_dict(\n",
+ " {\n",
+ " ...\n",
+ " \"hlm_args\": {\n",
+ " ...\n",
+ " \"neg_mining_chain\": \"tfn\", # Negative sampling scheme for all layers\n",
+ " \"model_chain\":{...}, # Parameters for all layers\n",
+ " }\n",
+ " ...\n",
+ " })\n",
+ "```\n",
+ "\n",
+ "#### Specific Parameters of Individual Layers\n",
+ "\n",
+ "```\n",
+ "train_params_l1 = XLinearModel.TrainParams.from_dict(\n",
+ " {\n",
+ " ...\n",
+ " \"hlm_args\": {\n",
+ " ...\n",
+ " \"neg_mining_chain\": [\n",
+ " \"tfn\", # Negative sampling scheme for layer-0\n",
+ " \"tfn\", # Negative sampling scheme for layer-1\n",
+ " \"tfn+man\", # Negative sampling scheme for layer-2\n",
+ " ...\n",
+ " ],\n",
+ " \"model_chain\": [\n",
+ " {...}, # Parameters for layer-0\n",
+ " {...}, # Parameters for layer-1\n",
+ " {...}, # Parameters for layer-2\n",
+ " ...\n",
+ " ],\n",
+ " }\n",
+ " ...\n",
+ " })\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "81f86b8a",
+ "metadata": {},
+ "source": [
+ "### Variety of Solvers\n",
+ "\n",
+ "The optimization solver can be selected with the argument `solver_type` of the `train` function. PECOS currently provides the following solvers for training each matcher/ranker:\n",
+ "\n",
+ "* \"L2R_L2LOSS_SVC_DUAL\" (default): L2-regularized L2-loss Dual SVM\n",
+ "* \"L2R_L1LOSS_SVC_DUAL\": L2-regularized L1-loss Dual SVM\n",
+ "* \"L2R_LR_DUAL\": L2-regularized Logistic Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "8f26ee42",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xlm_l1_kwargs = XLinearModel.train(\n",
+ " X_trn, Y_trn,\n",
+ " C=cluster_chain,\n",
+ " threshold=0.1,\n",
+ " negative_sampling_scheme=\"tfn\",\n",
+ " solver_type=\"L2R_L1LOSS_SVC_DUAL\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "197926a7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_params_l1 = XLinearModel.TrainParams.from_dict(\n",
+ " {\n",
+ "        \"hlm_args\": {\n",
+ "            \"neg_mining_chain\": \"tfn\",\n",
+ "            \"model_chain\":{\n",
+ "                \"threshold\": 0.1,\n",
+ "                \"solver_type\": \"L2R_L1LOSS_SVC_DUAL\",\n",
+ "            },\n",
+ "        }\n",
+ " }\n",
+ ")\n",
+ "\n",
+ "xlm_l1_dict = XLinearModel.train(\n",
+ " X_trn, Y_trn,\n",
+ " C=cluster_chain,\n",
+ " train_params=train_params_l1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "eddf91a6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by method kwargs)\n",
+ "prec = 83.43 77.66 72.47 67.67 63.73 60.18 56.90 54.04 51.45 49.04\n",
+ "recall = 4.93 9.11 12.62 15.59 18.19 20.51 22.49 24.28 25.92 27.36\n",
+ "\n",
+ "Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by dictionary)\n",
+ "prec = 83.43 77.66 72.47 67.67 63.73 60.18 56.90 54.04 51.45 49.04\n",
+ "recall = 4.93 9.11 12.62 15.59 18.19 20.51 22.49 24.28 25.92 27.36\n"
+ ]
+ }
+ ],
+ "source": [
+ "Y_pred_l1_kwargs = xlm_l1_kwargs.predict(X_tst, beam_size=10, only_topk=10)\n",
+ "Y_pred_l1_dict = xlm_l1_dict.predict(X_tst, beam_size=10, only_topk=10)\n",
+ "metrics_l1_kwargs = smat_util.Metrics.generate(Y_tst, Y_pred_l1_kwargs, topk=10)\n",
+ "metrics_l1_dict = smat_util.Metrics.generate(Y_tst, Y_pred_l1_dict, topk=10)\n",
+ "\n",
+ "print(\"Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by method kwargs)\")\n",
+ "print(metrics_l1_kwargs)\n",
+ "\n",
+ "print(\"\\nEvaluation Metrics with L2R_L1LOSS_SVC_DUAL (by dictionary)\")\n",
+ "print(metrics_l1_dict)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9f481ed3",
+ "metadata": {},
+ "source": [
+ "### Cost-sensitive Learning\n",
+ "\n",
+ "PECOS supports adjusting the cost of each training instance. To enable cost-sensitive learning, provide a **relevance matrix** `R_trn` with the same shape as the label matrix `Y_trn` through the argument `R`. When `R` is `None` (default), cost-sensitive learning is disabled.\n",
+ "\n",
+ "Since PECOS models are usually hierarchical, the costs for the upper layers also need to be determined. This is controlled by the cost-sensitive learning mode, set through the argument `rel_mode`. Currently, PECOS supports the following modes:\n",
+ "\n",
+ "* `\"disable\"` (default): Cost-sensitive learning is disabled.\n",
+ "* `\"induce\"`: Induce the costs for the upper layers through the clustering chain.\n",
+ "* `\"ranker-only\"`: Apply cost-sensitive learning only to the model in the last (ranker) layer, without induction.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "382277f3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# An example of using training label frequency scores as costs.\n",
+ "import copy\n",
+ "from sklearn.preprocessing import normalize\n",
+ "\n",
+ "R_trn = copy.deepcopy(Y_trn)\n",
+ "\n",
+ "# Training parameters for cost-sensitive learning.\n",
+ "train_params_cost = XLinearModel.TrainParams.from_dict(\n",
+ " {\n",
+ " \"rel_mode\": \"induce\",\n",
+ " \"rel_norm\": \"l1\",\n",
+ " \"hlm_args\": {\n",
+ " \"neg_mining_chain\": \"tfn\",\n",
+ " \"model_chain\":\n",
+ " [\n",
+ " {\n",
+ " \"threshold\": 0.1,\n",
+ " \"Cp\": 1.0,\n",
+ " \"Cn\": 1.0,\n",
+ " },\n",
+ " {\n",
+ " \"threshold\": 0.1,\n",
+ " \"Cp\": 8.0,\n",
+ " \"Cn\": 1.0,\n",
+ " },\n",
+ " {\n",
+ " \"threshold\": 0.1,\n",
+ " \"Cp\": 4.0,\n",
+ " \"Cn\": 1.0,\n",
+ " },\n",
+ " {\n",
+ " \"threshold\": 0.1,\n",
+ " \"Cp\": 4.0,\n",
+ " \"Cn\": 1.0,\n",
+ " },\n",
+ " ],\n",
+ " }\n",
+ " })\n",
+ " \n",
+ "# Cost-sensitive learning.\n",
+ "xlm_cost = XLinearModel.train(\n",
+ " X_trn, Y_trn,\n",
+ " C=cluster_chain,\n",
+ " R=R_trn,\n",
+ " train_params=train_params_cost)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "559c15cb",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Evaluation Metrics with Cost-sensitive Learning\n",
+ "prec = 85.02 80.58 74.57 69.37 64.79 60.82 57.34 54.29 51.37 48.81\n",
+ "recall = 5.02 9.46 13.02 15.99 18.54 20.74 22.69 24.42 25.92 27.27\n",
+ "\n",
+ "Original Evaluation Metrics\n",
+ "prec = 84.07 78.17 72.68 67.79 63.79 60.06 56.63 53.51 50.83 48.33\n",
+ "recall = 4.97 9.16 12.68 15.60 18.25 20.49 22.40 24.05 25.60 26.95\n"
+ ]
+ }
+ ],
+ "source": [
+ "Y_pred_cost = xlm_cost.predict(X_tst, beam_size=10, only_topk=10)\n",
+ "metrics_cost = smat_util.Metrics.generate(Y_tst, Y_pred_cost, topk=10)\n",
+ "print(\"Evaluation Metrics with Cost-sensitive Learning\")\n",
+ "print(metrics_cost)\n",
+ "print(\"\\nOriginal Evaluation Metrics\")\n",
+ "print(metrics)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1663277d",
+ "metadata": {},
+ "source": [
+ "# Customized PECOS Model\n",
+ "\n",
+ "Besides the pre-defined models in PECOS, such as XR-Linear, users can also customize PECOS for specific purposes. Specifically, we suggest establishing a model class that wraps the fundamental PECOS functions together with tailored operations. As a result, the customized model can be easily constructed and used with arbitrary data types and feature extractors.\n",
+ "\n",
+ "## Structure of a Customized PECOS Model\n",
+ "\n",
+ "Even though a customized machine learning pipeline can be separated into several independent scripts, we recommend declaring a customized PECOS model as a **model class** for better re-usability and code maintenance.\n",
+ "\n",
+ "A customized PECOS model should at least consist of the following components:\n",
+ "\n",
+ "* `preprocessor` or `encoder`: A procedure, which can be a method or a callable object, that pre-processes or encodes an arbitrary input in the designated data format into features. For example, text data and image data can be encoded by BERT and ResNet, respectively.\n",
+ "* `train()`: The training method takes a set of training data with a preprocessor, learns a primitive PECOS model, and returns a PECOS-based customized machine learning model. The training function could be a class method to construct the model object with the learned model and essential components after training.\n",
+ "* `model`: A primitive PECOS model taking pre-processed features is capable of deriving the predictions for arbitrary testing data. The model weights should be learned by `train()`. \n",
+ "* `predict()`: The prediction method takes arbitrary testing data and infers the prediction based on the pre-processor and the learned model.\n",
+ "* `save()`: The saving function serializes the trained model, including model weights and configuration, for further usage.\n",
+ "* `load()`: The loading function reads the serialized model so that the trained model can be loaded and re-used.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "In this part of the tutorial, we will use the task of *extreme multi-label text classification* as an example to demonstrate how to **customize a PECOS model that can handle text data with either a conventional bag-of-words (BoW) model or a deep learning model as the text encoder for feature extraction**.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c3acc325",
+ "metadata": {},
+ "source": [
+ "## Example: eXtreme Multi-label Text Classification (XMTC)\n",
+ "\n",
+ "The task of extreme multi-label text classification (XMTC) seeks to find relevant labels from an extremely large label collection for a given text input. Many real-world applications can be formulated as XMTC tasks, such as recommendation systems, document tagging, and semantic search.\n",
+ "\n",
+ "In this section, we walk through how to build a customized PECOS model for XMTC tasks: (1) PECOS' built-in BoW model for text preprocessing and vectorizing; (2) how to customize a PECOS model; and (3) advanced usage of XR-Transformer based on deep learning.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b2ec0e6",
+ "metadata": {},
+ "source": [
+ "### Preprocessor: Text Preprocessing and Vectorizing\n",
+ "\n",
+ "The preprocessor encodes input data into machine-readable vector representations. Any encoder that can transform text data into a vector representation can serve as the preprocessor or encoder of a customized PECOS model for XMTC tasks.\n",
+ "\n",
+ "In the PECOS library, we provide [various text vectorizers](https://github.com/amzn/pecos/blob/mainline/pecos/utils/featurization/text/vectorizers.py), such as TF-IDF, hashing, and pretrained transformers, as **built-in preprocessors** to deal with text data. In this tutorial, we will use the [n-gram](https://en.wikipedia.org/wiki/N-gram) [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) model as our preprocessor.\n",
+ "\n",
+ "#### Label Space File Format for Built-in Text Preprocessors\n",
+ "\n",
+ "The label space file is also essential for the text preprocessors, in particular for determining the size of the label space so that a label matrix of the appropriate shape can be created. Label IDs start from zero: label ID `i` refers to the i-th line (counting from zero) of the label space file, which holds the text description of that label."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "3f48f4f7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Artificial intelligence researchers\r\n",
+ "Computability theorists\r\n",
+ "British computer scientists\r\n",
+ "Machine learning researchers\r\n",
+ "Turing Award laureates\r\n",
+ "Deep Learning\r\n"
+ ]
+ }
+ ],
+ "source": [
+ "! cat \"./text2text_demo/output-labels.txt\""
+ ]
+ },
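+ {
+ "cell_type": "markdown",
+ "id": "5e1f2a90",
+ "metadata": {},
+ "source": [
+ "The mapping between label IDs and lines of the label space file can be sketched in plain Python. The names `label_lines`, `id_to_text`, and `text_to_id` are illustrative and not part of the PECOS API; the labels are copied from the demo file shown above so that the sketch does not depend on any file on disk.\n",
+ "\n",
+ "```python\n",
+ "# Lines of the label space file, in file order.\n",
+ "label_lines = [\n",
+ "    'Artificial intelligence researchers',\n",
+ "    'Computability theorists',\n",
+ "    'British computer scientists',\n",
+ "    'Machine learning researchers',\n",
+ "    'Turing Award laureates',\n",
+ "    'Deep Learning',\n",
+ "]\n",
+ "\n",
+ "# Label ID i is the zero-based index of the i-th line.\n",
+ "id_to_text = dict(enumerate(label_lines))\n",
+ "text_to_id = {text: i for i, text in id_to_text.items()}\n",
+ "\n",
+ "# The number of lines gives the number of columns of the label matrix.\n",
+ "num_labels = len(label_lines)\n",
+ "print(num_labels, id_to_text[4], text_to_id['Deep Learning'])\n",
+ "# prints: 6 Turing Award laureates 5\n",
+ "```"
+ ]
+ },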
+ {
+ "cell_type": "markdown",
+ "id": "e0862645",
+ "metadata": {},
+ "source": [
+ "#### Data File Format for Built-in Text Preprocessors\n",
+ "\n",
+ "PECOS built-in text preprocessors mainly take text data files with labels in a tab-separated values (TSV) format. Each line of the TSV file consists of two fields: the comma-separated label IDs and the input text of a data instance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "bd5ebfc6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0,1,2\tAlan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.\r\n",
+ "0,2,3\tHinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks.\r\n",
+ "3,4,5\tHinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on artificial intelligence and deep learning.\r\n",
+ "0,3,5\tYoshua Bengio is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning.\r\n"
+ ]
+ }
+ ],
+ "source": [
+ "! cat ./text2text_demo/training-data.txt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "566d1eb5",
+ "metadata": {},
+ "source": [
+ "The data file format can also represent label relevance for cost-sensitive learning, using a double colon (`::`) to separate each label ID from its relevance:\n",
+ "\n",
+ "`0::0.1,1::0.2,2::0.8 <TAB> Alan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.`\n"
+ ]
+ },
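+ {
+ "cell_type": "markdown",
+ "id": "7b3d4c18",
+ "metadata": {},
+ "source": [
+ "The two line formats described above can be parsed with a few lines of plain Python. This is only a sketch: `parse_line` and `TAB` are hypothetical names introduced for illustration, and PECOS uses its own parser inside `Preprocessor.load_data_from_file`.\n",
+ "\n",
+ "```python\n",
+ "TAB = '\\t'  # separator between the label field and the text field\n",
+ "\n",
+ "def parse_line(line):\n",
+ "    # Hypothetical helper: returns (label IDs, relevances, text).\n",
+ "    label_field, text = line.rstrip('\\n').split(TAB, 1)\n",
+ "    labels, relevances = [], []\n",
+ "    for item in label_field.split(','):\n",
+ "        label, sep, rel = item.partition('::')\n",
+ "        labels.append(int(label))\n",
+ "        if sep:  # a relevance score was given for this label\n",
+ "            relevances.append(float(rel))\n",
+ "    # relevances stays empty when the line carries plain label IDs only\n",
+ "    return labels, relevances, text\n",
+ "\n",
+ "print(parse_line('0::0.1,1::0.2,2::0.8' + TAB + 'Alan Turing'))\n",
+ "# prints: ([0, 1, 2], [0.1, 0.2, 0.8], 'Alan Turing')\n",
+ "```"
+ ]
+ },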
+ {
+ "cell_type": "markdown",
+ "id": "4d4e4419",
+ "metadata": {},
+ "source": [
+ "#### Training a Text Preprocessor\n",
+ "\n",
+ "The preprocessor model `Preprocessor` is defined in `pecos.utils.featurization.text.preprocess`. Given a training text corpus and a configuration dictionary, the class method `Preprocessor.train` trains the corresponding text preprocessor. The built-in preprocessors also support serialization through `save()` for re-use.\n",
+ "\n",
+ "With the data and label space file formats described above, the utility function `Preprocessor.load_data_from_file(input_text_path, output_text_path)` returns a dictionary with three keys:\n",
+ "\n",
+ "* `label_matrix`: a `(num_inst, num_labels)` CSR matrix holding the labels of each instance.\n",
+ "* `label_relevance`: `None`, or a `(num_inst, num_labels)` CSR matrix holding the relevance of each label for cost-sensitive learning, if available.\n",
+ "* `corpus`: a list of strings holding the text corpus read from `input_text_path`.\n",
+ "\n",
+ "The configuration of a text preprocessor, including the preprocessor type and hyper-parameters, is defined in a dictionary. Specifically, the key `type` selects the preprocessor, while the key `kwargs` holds the hyper-parameters. In this tutorial, we adopt n-gram TF-IDF features containing *word unigrams*, *word bigrams*, and *character trigrams*. Note that each n-gram feature can have its own hyper-parameters, such as `max_feature` and `max_df_ratio`. Users need to set `max_feature` properly (e.g., hundreds of thousands or millions) based on the corpus size and downstream tasks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "b7f70a8f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pecos.utils.featurization.text.preprocess import Preprocessor\n",
+ "\n",
+ "input_text_path = \"./text2text_demo/training-data.txt\"\n",
+ "output_text_path = \"./text2text_demo/output-labels.txt\"\n",
+ "model_folder = \"./text2text_demo/pecos-text2text-model\"\n",
+ "\n",
+ "parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path) # Read files\n",
+ "corpus = parsed_result[\"corpus\"] # Corpus input text: List of strings\n",
+ "\n",
+ "vectorizer_config = {\n",
+ " \"type\": \"tfidf\",\n",
+ " \"kwargs\": {\n",
+ " \"base_vect_configs\": [\n",
+ " \n",
+ " {\n",
+ " \"ngram_range\": [1, 1],\n",
+ " \"max_df_ratio\": 0.98,\n",
+ " \"analyzer\": \"word\",\n",
+ " },\n",
+ " {\n",
+ " \"ngram_range\": [2, 2],\n",
+ " \"max_df_ratio\": 0.98,\n",
+ " \"analyzer\": \"word\",\n",
+ " },\n",
+ " {\n",
+ " \"ngram_range\": [3, 3],\n",
+ " \"max_df_ratio\": 0.98,\n",
+ " \"analyzer\": \"char_wb\",\n",
+ " },\n",
+ " ],\n",
+ " },\n",
+ " }\n",
+ "\n",
+ "preprocessor = Preprocessor.train(corpus, vectorizer_config)\n",
+ "preprocessor.save(model_folder) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a0300f8c",
+ "metadata": {},
+ "source": [
+ "#### Preprocessing with a Trained Text Preprocessor\n",
+ "\n",
+ "The function `predict` of a trained text preprocessor encodes texts, given either as a list of strings or as a **text data file**, into a CSR matrix of shape `(num_inst, dim)` of numerical vector representations, where `num_inst` is the number of instances and `dim` is the number of feature dimensions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "3b182171",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The file consists of 4 instances with 405-dimensional features in a csr matrix.\n",
+ "\n",
+ "Text 0: Alan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.\n",
+ "Text 1: Hinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks.\n",
+ "Text 2: Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on artificial intelligence and deep learning.\n",
+ "Text 3: Yoshua Bengio is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning.\n",
+ "\n",
+ "The cosine similarity is 0.0076 between text 0 and text 1.\n",
+ "The cosine similarity is 0.0325 between text 0 and text 2.\n",
+ "The cosine similarity is 0.0082 between text 1 and text 2.\n",
+ "The cosine similarity is 0.0366 between text 0 and text 3.\n",
+ "The cosine similarity is 0.0267 between text 1 and text 3.\n",
+ "The cosine similarity is 0.0943 between text 2 and text 3.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Obtaining numerical vectors from text\n",
+ "X = preprocessor.predict(corpus)\n",
+ "\n",
+ "print(\"The file consists of {} instances \"\n",
+ " \"with {}-dimensional features \"\n",
+ " \"in a {} matrix.\\n\".format(*X.shape, X.getformat()))\n",
+ "\n",
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
+ "\n",
+ "sim = cosine_similarity(X)\n",
+ "\n",
+ "for i, ti in enumerate(corpus):\n",
+ " print(\"Text {}: {}\".format(i, ti))\n",
+ "\n",
+ "print(\"\")\n",
+ "for i in range(X.shape[0]):\n",
+ " for j in range(i):\n",
+ " print(\"The cosine similarity is {:.4f} between text {} and text {}.\".format(sim[i][j], j, i))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "18fcd09b",
+ "metadata": {},
+ "source": [
+ "#### Efficiency of PECOS Built-in TF-IDF Vectorizer\n",
+ "\n",
+ "Moreover, the TF-IDF vectorizer in PECOS is implemented in C++ for efficiency."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "a3d6f675",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "PECOS TFIDF time: 27.30768s, result shape=(14146, 10858825), nnz=37194670\n"
+ ]
+ }
+ ],
+ "source": [
+ "vectorizer_config = {\n",
+ " \"type\": \"tfidf\",\n",
+ " \"kwargs\": {\n",
+ " \"base_vect_configs\": [ \n",
+ " {\n",
+ " \"ngram_range\": [1, 2],\n",
+ " \"max_df_ratio\": 0.98,\n",
+ " \"analyzer\": \"word\",\n",
+ " },\n",
+ " ],\n",
+ " },\n",
+ " }\n",
+ "\n",
+ "input_text_path = \"xmc-base/wiki10-31k/X.trn.txt\"\n",
+ "corpus = Preprocessor.load_data_from_file(input_text_path, text_pos=0)[\"corpus\"]\n",
+ "\n",
+ "import time\n",
+ "start_time = time.time()\n",
+ "preprocessor = Preprocessor.train(corpus, vectorizer_config)\n",
+ "X = preprocessor.predict(input_text_path)\n",
+ "print(f\"PECOS TFIDF time: {time.time() - start_time:.5f}s, result shape={X.shape}, nnz={X.nnz}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e63c62ce",
+ "metadata": {},
+ "source": [
+ "As a baseline, we compare against the [scikit-learn TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "677f77de",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Sklearn TFIDF time: 221.65870s, result shape=(14146, 7269690), nnz=33505461\n"
+ ]
+ }
+ ],
+ "source": [
+ "start_time = time.time()\n",
+ "preprocessor = Preprocessor.train(\n",
+ " corpus,\n",
+ " {\"type\": \"sklearntfidf\", \"kwargs\":{\"ngram_range\": [1, 2], \"max_df\": 0.98}},\n",
+ ")\n",
+ "X = preprocessor.predict(corpus)\n",
+ "print(f\"Sklearn TFIDF time: {time.time() - start_time:.5f}s, result shape={X.shape}, nnz={X.nnz}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "75f5aaf8",
+ "metadata": {},
+ "source": [
+ "### Customized PECOS Model with TF-IDF Preprocessor\n",
+ "\n",
+ "\n",
+ "With text preprocessors in hand, and following the [aforementioned illustration](#Structure-of-a-Customized-PECOS-Model), we now declare a **customized PECOS model class** that combines a TF-IDF preprocessor with an XR-Linear model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "3893c23b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "from os import path\n",
+ "import pathlib\n",
+ "from pecos.utils.featurization.text.preprocess import Preprocessor\n",
+ "from pecos.xmc.xlinear.model import XLinearModel\n",
+ "from pecos.xmc import Indexer, LabelEmbeddingFactory\n",
+ "from pecos.utils import smat_util\n",
+ "\n",
+ "class CustomPECOS:\n",
+ " def __init__(self, preprocessor=None, xlinear_model=None, output_items=None):\n",
+ " self.preprocessor = preprocessor\n",
+ " self.xlinear_model = xlinear_model\n",
+ " self.output_items = output_items\n",
+ " \n",
+ " @classmethod\n",
+ " def train(cls, input_text_path, output_text_path):\n",
+ " \"\"\"Train a CustomPECOS model\n",
+ " \n",
+ " Args: \n",
+ " input_text_path (str): Text input file name. \n",
+ " output_text_path (str): The file path for output text items.\n",
+ " \n",
+ " Returns:\n",
+ " A CustomPECOS object\n",
+ " \"\"\"\n",
+ " # Obtain X_text, Y\n",
+ " parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path)\n",
+ " Y = parsed_result[\"label_matrix\"]\n",
+ " corpus = parsed_result[\"corpus\"]\n",
+ "\n",
+ " # Train TF-IDF vectorizer\n",
+ " preprocessor = Preprocessor.train(corpus, {\"type\": \"tfidf\", \"kwargs\":{}}) \n",
+ " X = preprocessor.predict(corpus) \n",
+ " \n",
+ " # Train an XR-Linear model with TF-IDF features\n",
+ " label_feat = LabelEmbeddingFactory.create(Y, X, method=\"pifa\")\n",
+ " cluster_chain = Indexer.gen(label_feat)\n",
+ " xlinear_model = XLinearModel.train(X, Y, C=cluster_chain)\n",
+ " \n",
+ " # Load output items\n",
+ " with open(output_text_path, \"r\", encoding=\"utf-8\") as f:\n",
+ " output_items = [q.strip() for q in f]\n",
+ " \n",
+ " return cls(preprocessor, xlinear_model, output_items)\n",
+ " \n",
+ " def predict(self, corpus):\n",
+ " \"\"\"Predict labels for given inputs\n",
+ " \n",
+ " Args:\n",
+ " corpus (list of strings): input strings.\n",
+ " Returns:\n",
+ " csr_matrix: predicted label matrix (num_samples x num_labels)\n",
+ " \"\"\"\n",
+ " X = self.preprocessor.predict(corpus)\n",
+ " Y_pred = self.xlinear_model.predict(X)\n",
+ " return smat_util.sorted_csr(Y_pred)\n",
+ "\n",
+ " def save(self, model_folder):\n",
+ " \"\"\"Save the CustomPECOS model\n",
+ "\n",
+ " Args:\n",
+ " model_folder (str): folder name to save\n",
+ " \"\"\"\n",
+ " self.preprocessor.save(f\"{model_folder}/preprocessor\")\n",
+ " self.xlinear_model.save(f\"{model_folder}/xlinear_model\")\n",
+ " with open(f\"{model_folder}/output_items.json\", \"w\", encoding=\"utf-8\") as fp:\n",
+ " json.dump(self.output_items, fp)\n",
+ "\n",
+ " @classmethod\n",
+ " def load(cls, model_folder):\n",
+ " \"\"\"Load the CustomPECOS model\n",
+ "\n",
+ " Args:\n",
+ " model_folder (str): folder name to load\n",
+ " Returns:\n",
+ " CustomPECOS\n",
+ " \"\"\"\n",
+ " preprocessor = Preprocessor.load(f\"{model_folder}/preprocessor\")\n",
+ " xlinear_model = XLinearModel.load(f\"{model_folder}/xlinear_model\")\n",
+ " with open(f\"{model_folder}/output_items.json\", \"r\", encoding=\"utf-8\") as fin:\n",
+ " output_items = json.load(fin)\n",
+ " return cls(preprocessor, xlinear_model, output_items)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fcdbb2c6",
+ "metadata": {},
+ "source": [
+ "### Operating the Customized PECOS Model\n",
+ "\n",
+ "With a well-declared model class, the customized PECOS model is modular and convenient to use."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "24134357",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Declare the folder for model serialization.\n",
+ "model_folder = \"./text2text_demo/pecos-CustomPECOS-model\"\n",
+ "\n",
+ "# Train and save the trained model\n",
+ "input_text_path = \"./text2text_demo/training-data.txt\"\n",
+ "output_text_path = \"./text2text_demo/output-labels.txt\"\n",
+ "model = CustomPECOS.train(input_text_path, output_text_path)\n",
+ "model.save(model_folder)\n",
+ "\n",
+ "# Load the trained model and predict\n",
+ "model = CustomPECOS.load(model_folder)\n",
+ "testing_text_path = \"./text2text_demo/testing-data.txt\"\n",
+ "Y_pred = model.predict(testing_text_path)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "31efd9ac",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Text Input: In 1989, Yann LeCun et al. applied the standard backpropagation algorithm on neural networks for hand digit recognition.\n",
+ "Score 0.9515: Machine learning researchers\n",
+ "Score 0.8233: Artificial intelligence researchers\n",
+ "Score 0.4659: Deep Learning\n",
+ "Score 0.2779: British computer scientists\n",
+ "Score 0.0569: Turing Award laureates\n",
+ "Score 0.0129: Computability theorists\n"
+ ]
+ }
+ ],
+ "source": [
+ "test_texts = Preprocessor.load_data_from_file(testing_text_path, output_text_path)[\"corpus\"]\n",
+ "\n",
+ "for i, text in enumerate(test_texts):\n",
+ " print(\"Text Input: {}\".format(text))\n",
+ " for j in range(Y_pred.indptr[i], Y_pred.indptr[i + 1]):\n",
+ " pred_label = model.output_items[Y_pred.indices[j]]\n",
+ " pred_score = Y_pred.data[j]\n",
+ " print(\"Score {:.4f}: {}\".format(pred_score, pred_label))"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb b/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb
new file mode 100644
index 0000000..2574bd8
--- /dev/null
+++ b/tutorials/kdd22/Session 3 Approximate Nearest Neighbor Search in PECOS.ipynb
@@ -0,0 +1,495 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "e5073aac",
+ "metadata": {},
+ "source": [
+ "# Approximate Nearest Neighbor (ANN) Search in PECOS \n",
+ "\n",
+ "PECOS provides an efficient approach to **approximate nearest neighbor (ANN) search**. More specifically, after training a hierarchical navigable small world (HNSW) model (that is, building the **PECOS-HNSW indexer**) on a corpus of vectors, PECOS can efficiently retrieve the top-K approximate nearest indexed vectors for an arbitrary query vector. In this part of the tutorial, we demonstrate how to use PECOS-HNSW to tackle the ANN search problem and how to integrate HNSW with PECOS XMR models.\n",
+ "\n",
+ "#### HNSW at a Glance\n",
+ "The search procedure of HNSW can be summarized as follows:\n",
+ "* Traverse from the top layer (coarse-grained graph, long-range links) to the bottom layer (fine-grained graph, short-range links).\n",
+ "* Run a best-first search on each layer's graph; the best candidate found serves as the entry point for the next layer.\n",
+ " \n",
+ "\n",
+ "\n",
+ "## Highlights of PECOS-HNSW\n",
+ "\n",
+ "* Supports both sparse and dense input features\n",
+ "* Supports SIMD instructions (SSE, AVX256, and AVX512)\n",
+ "* Modular implementation\n",
+ "\n",
+ "## Comparison of PECOS and NMSLIB on Sparse Data\n",
+ "\n",
+ "#### Disclaimer \n",
+ "The benchmarking results listed in this notebook were obtained on an `r5dn.24xlarge` AWS instance with 96 Intel(R) Xeon(R) Platinum 8259CL CPUs @ 2.50GHz. In a different environment, the magnitude of the improvements may differ.\n",
+ "\n",
+ "#### Results\n",
+ "* We compare two implementations of HNSW: `PECOS` and `NMSLIB` on a sparse dataset (i.e., RCV1).\n",
+ "* For RCV1, the training and test sets contain `781,265` and `23,149` instances, respectively. The feature dimension is `47,236`.\n",
+ "* The HNSW index is constructed with `M=16` and `efConstruction=500`.\n",
+ "* From the table below, we see that, at similar Recall@10, `PECOS` achieves between 88% and 94% higher throughput than the `NMSLIB` package.\n",
+ "\n",
+ "| efS | PECOS Recall@10 | PECOS throughput (#query/sec) | PECOS latency (ms/query) | NMSLIB Recall@10 | NMSLIB throughput (#query/sec) | NMSLIB latency (ms/query) | Throughput speedup of PECOS over NMSLIB |\n",
+ "|:---:|:---------------:|:-----------------------------:|:------------------------:|:----------------:|:------------------------------:|:-------------------------:|:---------------------------------------:|\n",
+ "| 10 | 0.7733 | 5250.297 | 0.1905 | 0.7790 | 2710.256 | 0.3690 | 93.72% |\n",
+ "| 20 | 0.8545 | 3677.292 | 0.2719 | 0.8581 | 1924.505 | 0.5196 | 91.08% |\n",
+ "| 40 | 0.9043 | 2409.959 | 0.4149 | 0.9055 | 1271.085 | 0.7867 | 89.60% |\n",
+ "| 80 | 0.9325 | 1508.349 | 0.6630 | 0.9326 | 800.999 | 1.2484 | 88.31% |\n",
+ "| 120 | 0.9434 | 1125.047 | 0.8889 | 0.9426 | 597.873 | 1.6726 | 88.17% |\n",
+ "| 200 | 0.9533 | 763.752 | 1.3093 | 0.9523 | 404.518 | 2.4721 | 88.81% |\n",
+ "| 400 | 0.9621 | 433.872 | 2.3048 | 0.9608 | 229.553 | 4.3563 | 89.01% |\n",
+ "| 600 | 0.9657 | 305.747 | 3.2707 | 0.9644 | 161.879 | 6.1775 | 88.87% |\n",
+ "| 800 | 0.9678 | 237.651 | 4.2078 | 0.9663 | 124.806 | 8.0124 | 90.42% |\n",
+ "\n",
+ "## Hands-on Tutorial\n",
+ "\n",
+ "The life cycle of a PECOS-HNSW model consists of two stages:\n",
+ "\n",
+ "* building the indexer (training)\n",
+ "* inference (testing).\n",
+ "\n",
+ "### Data Loading"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "140a0d24",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "--2022-07-15 21:03:07-- https://archive.org/download/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz\n",
+ "Resolving archive.org (archive.org)... 207.241.224.2\n",
+ "Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.\n",
+ "HTTP request sent, awaiting response... 302 Found\n",
+ "Location: https://ia802308.us.archive.org/21/items/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz [following]\n",
+ "--2022-07-15 21:03:07-- https://ia802308.us.archive.org/21/items/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz\n",
+ "Resolving ia802308.us.archive.org (ia802308.us.archive.org)... 207.241.228.48\n",
+ "Connecting to ia802308.us.archive.org (ia802308.us.archive.org)|207.241.228.48|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 317972212 (303M) [application/octet-stream]\n",
+ "Saving to: ‘rcv1-angular-47236.tar.gz’\n",
+ "\n",
+ "100%[======================================>] 317,972,212 11.0MB/s in 40s \n",
+ "\n",
+ "2022-07-15 21:03:47 (7.68 MB/s) - ‘rcv1-angular-47236.tar.gz’ saved [317972212/317972212]\n",
+ "\n",
+ "rcv1-angular-47236/\n",
+ "rcv1-angular-47236/X.trn.npz\n",
+ "rcv1-angular-47236/X.tst.npz\n",
+ "rcv1-angular-47236/Y.tst.npy\n"
+ ]
+ }
+ ],
+ "source": [
+ "! wget https://archive.org/download/pecos-dataset/ann-benchmarks/rcv1-angular-47236.tar.gz\n",
+ "! tar -zxvf ./rcv1-angular-47236.tar.gz"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "46dc982b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "n_trn 781265 n_tst 23149 data_dim 47236\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "from pecos.utils import smat_util\n",
+ "X_trn = smat_util.load_matrix(\"./rcv1-angular-47236/X.trn.npz\").astype(np.float32)\n",
+ "X_tst = smat_util.load_matrix(\"./rcv1-angular-47236/X.tst.npz\").astype(np.float32)\n",
+ "Y_tst = smat_util.load_matrix(\"./rcv1-angular-47236/Y.tst.npy\")\n",
+ "print(\"n_trn {:7d} n_tst {:7d} data_dim {:7d}\".format(\n",
+ " X_trn.shape[0], X_tst.shape[0], X_trn.shape[1])\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bb73c619",
+ "metadata": {},
+ "source": [
+ "### Training Indexer\n",
+ "\n",
+ "To train a PECOS-HNSW model, define the training parameters in an `HNSW.TrainParams` object and pass it as the argument `train_params`. The key training parameters include:\n",
+ "* `M` (default 32): The maximum number of edges per node in each layer. A larger `M` leads to a larger model and greater memory consumption. A higher `M` suits high-dimensional data or high-recall requirements; a lower `M` suits low-dimensional data or cases where lower recall is acceptable.\n",
+ "* `efC` (default 100): The size of the priority queue for best-first search during construction. `efC` controls the trade-off between efficiency and accuracy of indexing: a higher `efC` results in a longer construction time but a better-quality index.\n",
+ "* `metric_type` (default `ip`): The distance metric for ANN search. PECOS-HNSW currently supports Euclidean distance (`l2`) and inner product (`ip`).\n",
+ "* `threads` (default -1): The number of threads for training, or -1 to use all available cores.\n",
+ "\n",
+ "Inference parameters can also be supplied as the argument `pred_params` at training time, so that the model can be applied for inference without specifying them again.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "553aaf55",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "HNSW Indexer | M 32 efC 100 metric ip | time(s) 11.980276823043823\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "from pecos.ann.hnsw import HNSW\n",
+ "\n",
+ "M, efC = 32, 100\n",
+ "metric = \"ip\"\n",
+ "train_params = HNSW.TrainParams(\n",
+ " M=M,\n",
+ " efC=efC,\n",
+ " metric_type=metric,\n",
+ " threads=-1,\n",
+ ")\n",
+ "start_time = time.time()\n",
+ "model = HNSW.train(X_trn, train_params=train_params, pred_params=None)\n",
+ "print(\"HNSW Indexer | M {} efC {} metric {} | time(s) {}\".format(\n",
+ " M, efC, metric, time.time() - start_time),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6a150c44",
+ "metadata": {},
+ "source": [
+ "### Save and Load Indexer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "bf7905f3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_folder = \"./rcv1.pecos-hnsw.index\"\n",
+ "model.save(model_folder)\n",
+ "del model\n",
+ "model = HNSW.load(model_folder)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b1af6ee7",
+ "metadata": {},
+ "source": [
+ "### Inference and Evaluation\n",
+ "\n",
+ "To run inference with a trained HNSW model, define the prediction parameters in an `HNSW.PredParams` object and pass it as the argument `pred_params`. The key inference parameters include:\n",
+ "\n",
+ "* `efS` (default 100): The size of the priority queue for best-first search during inference. Similar to `efC`, `efS` controls the trade-off between search efficiency and accuracy: a higher `efS` gives more accurate results at a lower speed. `efS` is required to be greater than `topk`.\n",
+ "* `topk` (default 10): The number of approximate nearest neighbors to be returned.\n",
+ "* `threads` (default -1): The number of searchers for parallel inference, or -1 to use all available searchers.\n",
+ "\n",
+ "The `predict` function derives the search results from a query matrix of shape (# of data points for inference, # of dimensions), `pred_params`, and the searchers. The argument `ret_csr` (default `True`) decides the format of the returned results:\n",
+ "\n",
+ "* If `ret_csr` is `False`, the results are two matrices of shape (# of data points, topk), which hold the topk indices in the training corpus and the corresponding distances for each testing instance.\n",
+ "* If `ret_csr` is `True`, the result is a [Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) of shape (# of data points, # of points in the training corpus). Each row stores the topk distance values at the corresponding columns (i.e., indices in the training corpus). The data for each row (i.e., `data[indptr[i]:indptr[i + 1]]`) are sorted by distance value.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "e25e31d4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Prediction Time = 15.7988 seconds.\n"
+ ]
+ }
+ ],
+ "source": [
+ "pred_params = HNSW.PredParams(efS=100, topk=10)\n",
+ "searchers = model.searchers_create(num_searcher=1)\n",
+ "start_time = time.time()\n",
+ "indices, distances = model.predict(\n",
+ " X_tst,\n",
+ " pred_params=pred_params,\n",
+ " searchers=searchers,\n",
+ " ret_csr=False,\n",
+ ")\n",
+ "pred_time = time.time() - start_time\n",
+ "print(f\"Prediction Time = {pred_time:.4f} seconds.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0ce9aefa",
+ "metadata": {},
+ "source": [
+ "### Evaluation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "38401700",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def compute_recall(neighbors, true_neighbors):\n",
+ " total = 0\n",
+ " for gt_row, row in zip(true_neighbors, neighbors):\n",
+ " total += np.intersect1d(gt_row, row).shape[0]\n",
+ " return total / true_neighbors.size"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "b0b6d72a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "HNSW inference | R@10 0.9025 Throughput(q/s) 1465.236 latency(ms/q) 0.6825\n"
+ ]
+ }
+ ],
+ "source": [
+ "recall = compute_recall(indices, Y_tst)\n",
+ "throughput = indices.shape[0] / pred_time\n",
+ "latency = 1.0 / throughput * 1000.\n",
+ "print(f\"HNSW inference | R@10 {recall:.4f} Throughput(q/s) {throughput:8.3f} latency(ms/q) {latency:8.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "75880f0d",
+ "metadata": {},
+ "source": [
+ "## Recall vs Throughput Trade-off"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "12bd7fb6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def run_pecos(X_trn, X_tst, Y_tst):\n",
+ " metric = \"ip\"\n",
+ " M_list = [16]\n",
+ " efC = 500\n",
+ " topk = 10\n",
+ " efS_list = [10, 20, 40, 80, 120, 200, 400, 600, 800]\n",
+ " for M in M_list:\n",
+ " train_params = HNSW.TrainParams(M=M, efC=efC, metric_type=metric, threads=-1)\n",
+ " start_time = time.time()\n",
+ " model = HNSW.train(X_trn, train_params=train_params, pred_params=None)\n",
+ " print(\"Indexer | M {} efC {} metric {} | train time(s) {}\".format(\n",
+ " M, efC, metric, time.time() - start_time)\n",
+ " )\n",
+ " \n",
+ " for efS in efS_list:\n",
+ " pred_params = HNSW.PredParams(efS=efS, topk=topk)\n",
+ " searchers = model.searchers_create(num_searcher=1)\n",
+ " \n",
+ " start_time = time.time()\n",
+ " indices, distances = model.predict(X_tst, pred_params=pred_params, searchers=searchers, ret_csr=False)\n",
+ " pred_time = time.time() - start_time\n",
+ " \n",
+ " recall = compute_recall(indices, Y_tst)\n",
+ " throughput = indices.shape[0] / pred_time\n",
+ " latency = 1.0 / throughput * 1000.\n",
+ " print(\"inference | efS {:3d} R@10 {:.4f} Throughput(q/s) {:8.3f} latency(ms/q) {:8.4f}\".format(\n",
+ " efS, recall, throughput, latency)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "0b4af0fb",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Indexer | M 16 efC 500 metric ip | train time(s) 46.87919640541077\n",
+ "inference | efS 10 R@10 0.7733 Throughput(q/s) 5250.297 latency(ms/q) 0.1905\n",
+ "inference | efS 20 R@10 0.8545 Throughput(q/s) 3677.292 latency(ms/q) 0.2719\n",
+ "inference | efS 40 R@10 0.9043 Throughput(q/s) 2409.959 latency(ms/q) 0.4149\n",
+ "inference | efS 80 R@10 0.9325 Throughput(q/s) 1508.349 latency(ms/q) 0.6630\n",
+ "inference | efS 120 R@10 0.9434 Throughput(q/s) 1125.047 latency(ms/q) 0.8889\n",
+ "inference | efS 200 R@10 0.9533 Throughput(q/s) 763.752 latency(ms/q) 1.3093\n",
+ "inference | efS 400 R@10 0.9621 Throughput(q/s) 433.872 latency(ms/q) 2.3048\n",
+ "inference | efS 600 R@10 0.9657 Throughput(q/s) 305.747 latency(ms/q) 3.2707\n",
+ "inference | efS 800 R@10 0.9678 Throughput(q/s) 237.651 latency(ms/q) 4.2078\n"
+ ]
+ }
+ ],
+ "source": [
+ "run_pecos(X_trn, X_tst, Y_tst)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b58276b2",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Appendix: Install PECOS and NMSLIB\n",
+ "\n",
+ "### Install via Conda \n",
+ "```bash\n",
+ "conda create -n pecos-hnsw-tutorial python=3.8\n",
+ "conda activate pecos-hnsw-tutorial\n",
+ "\n",
+ "pip install pyarrow pandas ipython jupyterlab\n",
+ "```\n",
+ "\n",
+ "### Install PECOS from Source\n",
+ "\n",
+ "We install PECOS from source with the `-march=native` flag so that the build uses the best SIMD instructions available on your machine. More details are available at https://github.com/amzn/pecos#installation-from-source\n",
+ "\n",
+ "```bash\n",
+ "# prerequisite, assuming amazon linux 2 \n",
+ "sudo yum -y install python3 python3-devel python3-distutils python3-venv && sudo yum -y groupinstall 'Development Tools' \n",
+ "sudo amazon-linux-extras install epel -y\n",
+ "sudo yum install openblas-devel -y\n",
+ "# pecos with -march=native flag\n",
+ "git clone https://github.com/amzn/pecos\n",
+ "cd pecos\n",
+ "PECOS_MANUAL_COMPILE_ARGS=\"-march=native\" python -m pip install --editable .\n",
+ "```\n",
+ "\n",
+ "### Install NMSLIB from Source\n",
+ "\n",
+ "We follow the [install guide](https://github.com/erikbern/ann-benchmarks/blob/master/install/Dockerfile.nmslib) from ANN-Benchmarks to install NMSLIB from source for the best performance.\n",
+ "\n",
+ "```bash\n",
+ "# pre-requisite, assuming amazon linux 2\n",
+ "sudo yum -y install cmake boost-devel eigen3-devel\n",
+ "git clone https://github.com/searchivarius/nmslib.git\n",
+ "cd nmslib/similarity_search\n",
+ "cmake . -DWITH_EXTRAS=1\n",
+ "make -j4\n",
+ "pip install pybind11\n",
+ "cd ../python_bindings/\n",
+ "python setup.py build\n",
+ "python setup.py install\n",
+ "python -c 'import nmslib'\n",
+ "```\n",
+ "\n",
+ "### Install via Docker (as in ANN-Benchmarks)\n",
+ "\n",
+ "```bash\n",
+ "# install some basic stuff\n",
+ "sudo yum -y update\n",
+ "sudo yum install -y git curl zip unzip vim gcc-c++ htop\n",
+ "\n",
+ "# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html\n",
+ "# sudo yum update -y\n",
+ "# sudo amazon-linux-extras install docker\n",
+ "sudo service docker start\n",
+ "sudo systemctl enable docker\n",
+ "sudo usermod -a -G docker ec2-user\n",
+ "docker info\n",
+ "```\n",
+ "\n",
+ "### Install Docker Image\n",
+ "\n",
+ "```bash\n",
+ "# install miniconda first!\n",
+ "conda create -n ann-benchmarks python=3.8\n",
+ "conda activate ann-benchmarks\n",
+ "\n",
+ "# install ANN package supported by ann-benchmarks\n",
+ "git clone https://github.com/erikbern/ann-benchmarks.git\n",
+ "cd ann-benchmarks\n",
+ "pip install -r requirements.txt\n",
+ "\n",
+ "# install docker containers\n",
+ "python -u install.py --algorithm faiss\n",
+ "python -u install.py --algorithm hnswlib\n",
+ "python -u install.py --algorithm n2\n",
+ "python -u install.py --algorithm pecos\n",
+ "python -u install.py --algorithm scann\n",
+ "python -u install.py --algorithm ngt\n",
+ "python -u install.py --algorithm nmslib\n",
+ "python -u install.py --algorithm diskann\n",
+ "python -u install.py --algorithm pynndescent\n",
+ "\n",
+ "# list all Docker images\n",
+ "docker image ls\n",
+ "REPOSITORY TAG IMAGE ID CREATED SIZE\n",
+ "ann-benchmarks-hnswlib latest 2e1ea8d11df7 2 hours ago 1.04GB\n",
+ "ann-benchmarks-nmslib latest 1e094d3e96f7 3 hours ago 1.64GB\n",
+ "ann-benchmarks-faiss latest 44e5bd15bfcd 5 hours ago 4.9GB\n",
+ "ann-benchmarks-scann latest 5151abe3b09e 5 hours ago 2.76GB\n",
+ "ann-benchmarks latest c2c612131da4 5 hours ago 938MB\n",
+ "```\n",
+ "\n",
+ "### Enter Docker Env\n",
+ "\n",
+ "```bash\n",
+ "EFS_DIR=/PATH/TO/pecos-hnsw-kdd22\n",
+ "DOCKER_IMAGE=ann-benchmarks-nmslib\n",
+ "\n",
+ "docker run --rm -it -v ${EFS_DIR}:/home/app/ws \\\n",
+ " --entrypoint /bin/bash ${DOCKER_IMAGE}\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0b85b8a-f1a6-42f0-acf0-6596361e35b3",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb b/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb
new file mode 100644
index 0000000..18cd0c8
--- /dev/null
+++ b/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb
@@ -0,0 +1,1105 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "b1ebc316",
+ "metadata": {},
+ "source": [
+ "# Utilities in PECOS\n",
+ "\n",
+ "PECOS provides various useful interfaces and utility functions for XMR problems and related tasks. In this session, we will introduce how to tackle arbitrary data formats for XMR, and then present some utilities in PECOS for efficient matrix operations and hierarchical clustering."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3e9cc45c",
+ "metadata": {},
+ "source": [
+ "## Working with Arbitrary Data Formats\n",
+ "\n",
+ "PECOS is a general machine learning framework that can work with arbitrary data formats and interact with data manipulation and analysis libraries such as [Pandas](https://pandas.pydata.org/). In the following example, we show how to learn a PECOS model from Pandas-loaded data with text, categorical, and numerical features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "9f0f17a7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pecos\n",
+ "import pandas as pd\n",
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "526820f8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Archive: drugLib_raw.zip\r\n",
+ " inflating: drugLibTest_raw.tsv \r\n",
+ " inflating: drugLibTrain_raw.tsv \r\n"
+ ]
+ }
+ ],
+ "source": [
+ "! wget -nv -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00461/drugLib_raw.zip\n",
+ "! unzip -o drugLib_raw.zip"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "ac8ab46f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training DataFrame consists of 3107 instances.\n",
+ "Testing DataFrame consists of 1036 instances.\n",
+ "Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',\n",
+ " 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],\n",
+ " dtype='object')\n"
+ ]
+ }
+ ],
+ "source": [
+ "train_df = pd.read_csv(\"drugLibTrain_raw.tsv\", sep=\"\\t\")\n",
+ "test_df = pd.read_csv(\"drugLibTest_raw.tsv\", sep=\"\\t\")\n",
+ "print(f\"Training DataFrame consists of {len(train_df)} instances.\")\n",
+ "print(f\"Testing DataFrame consists of {len(test_df)} instances.\")\n",
+ "print(train_df.columns)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "6d6ee989",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "label_name = \"effectiveness\"\n",
+ "text_features = [\"condition\", \"benefitsReview\", \"sideEffectsReview\", \"commentsReview\"]\n",
+ "categorical_features = [\"sideEffects\"]\n",
+ "numerical_features = [\"rating\"]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "c25b1cb2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_trn_list = []\n",
+ "X_tst_list = []"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f72d047",
+ "metadata": {},
+ "source": [
+ "### Label Encoding\n",
+ "\n",
+ "To encode labels into the sparse matrix format compatible with PECOS, [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer) are helpful for multi-class and multi-label classification, respectively."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "4f0d9ec7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Y_trn is a csr matrix with a shape (3107, 5) and 3107 non-zero values.\n",
+ "Y_tst is a csr matrix with a shape (1036, 5) and 1036 non-zero values.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from sklearn.preprocessing import OneHotEncoder\n",
+ "\n",
+ "label_encoder = OneHotEncoder(dtype=np.float32)\n",
+ "Y_trn = label_encoder.fit_transform(train_df[[label_name]])\n",
+ "Y_tst = label_encoder.transform(test_df[[label_name]])\n",
+ "\n",
+ "print(f\"Y_trn is a {Y_trn.getformat()} matrix with a shape {Y_trn.shape} and {Y_trn.nnz} non-zero values.\")\n",
+ "print(f\"Y_tst is a {Y_tst.getformat()} matrix with a shape {Y_tst.shape} and {Y_tst.nnz} non-zero values.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "62ec7371",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Y_trn_mlb is a csr matrix with a shape (3107, 5) and 3107 non-zero values.\n",
+ "Y_tst_mlb is a csr matrix with a shape (1036, 5) and 1036 non-zero values.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from sklearn.preprocessing import MultiLabelBinarizer\n",
+ "\n",
+ "label_encoder_multilabel = MultiLabelBinarizer(sparse_output=True)\n",
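+ "# Fit the binarizer on the training labels only, then reuse it for the test labels.\n",
+ "Y_trn_mlb = label_encoder_multilabel.fit_transform([[lbl] for lbl in train_df[label_name].tolist()])\n",
+ "Y_tst_mlb = label_encoder_multilabel.transform([[lbl] for lbl in test_df[label_name].tolist()])\n",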
+ "print(f\"Y_trn_mlb is a {Y_trn_mlb.getformat()} matrix with a shape {Y_trn_mlb.shape} and {Y_trn_mlb.nnz} non-zero values.\")\n",
+ "print(f\"Y_tst_mlb is a {Y_tst_mlb.getformat()} matrix with a shape {Y_tst_mlb.shape} and {Y_tst_mlb.nnz} non-zero values.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "260a9a8a",
+ "metadata": {},
+ "source": [
+ "### Text Feature Encoding\n",
+ "\n",
+ "As introduced in Session 1, we can use the PECOS vectorizer to featurize text data. In addition, the encoder of [XR-Transformer](https://github.com/amzn/pecos/tree/mainline/pecos/xmc/xtransformer) can also be used to derive text features after proper fine-tuning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "96fef619",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "condition: (3107, 3759) and (1036, 3759) in training and testing.\n",
+ "benefitsReview: (3107, 72861) and (1036, 72861) in training and testing.\n",
+ "sideEffectsReview: (3107, 64321) and (1036, 64321) in training and testing.\n",
+ "commentsReview: (3107, 91731) and (1036, 91731) in training and testing.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.utils.featurization.text.vectorizers import Vectorizer\n",
+ "\n",
+ "for feature_name in text_features:\n",
+ " vectorizer_config = {\n",
+ " \"type\": \"tfidf\",\n",
+ " \"kwargs\": {\n",
+ " \"base_vect_configs\": [\n",
+ "\n",
+ " {\n",
+ " \"ngram_range\": [1, 2],\n",
+ " \"max_df_ratio\": 0.98,\n",
+ " \"analyzer\": \"word\",\n",
+ " },\n",
+ " ],\n",
+ " },\n",
+ " } \n",
+ " train_texts = [str(x) for x in train_df[feature_name].tolist()]\n",
+ " test_texts = [str(x) for x in test_df[feature_name].tolist()]\n",
+ " vectorizer = Vectorizer.train(train_texts, config=vectorizer_config)\n",
+ " X_trn_local = vectorizer.predict(train_texts)\n",
+ " X_tst_local = vectorizer.predict(test_texts)\n",
+ " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n",
+ " \n",
+ " X_trn_list.append(X_trn_local)\n",
+ " X_tst_list.append(X_tst_local)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38e75fa2",
+ "metadata": {},
+ "source": [
+ "### Categorical Feature Encoding\n",
+ "\n",
+ "Similar to labels, categorical features can also be encoded as one-hot or multi-hot vectors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "386b96ad",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sideEffects: (3107, 5) and (1036, 5) in training and testing.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from sklearn.preprocessing import OneHotEncoder\n",
+ "\n",
+ "for feature_name in categorical_features:\n",
+ " local_encoder = OneHotEncoder(dtype=np.float32)\n",
+ " X_trn_local = local_encoder.fit_transform(train_df[[feature_name]])\n",
+ " X_tst_local = local_encoder.transform(test_df[[feature_name]])\n",
+ " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n",
+ " \n",
+ " X_trn_list.append(X_trn_local)\n",
+ " X_tst_list.append(X_tst_local)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6deaf4e8",
+ "metadata": {},
+ "source": [
+ "### Numerical Feature Encoding\n",
+ "\n",
+ "Numerical features can be directly incorporated as model inputs after some simple normalization."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "90668ea4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "rating: (3107, 1) and (1036, 1) in training and testing.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from scipy.sparse import csr_matrix\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "for feature_name in numerical_features:\n",
+ " X_trn_values = train_df[[feature_name]].values\n",
+ " X_tst_values = test_df[[feature_name]].values\n",
+ " scaler = StandardScaler()\n",
+ " X_trn_local = csr_matrix(scaler.fit_transform(X_trn_values), dtype=np.float32)\n",
+ " X_tst_local = csr_matrix(scaler.transform(X_tst_values), dtype=np.float32)\n",
+ " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n",
+ " \n",
+ " X_trn_list.append(X_trn_local)\n",
+ " X_tst_list.append(X_tst_local)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2441580",
+ "metadata": {},
+ "source": [
+ "### Feature Concatenation\n",
+ "\n",
+ "PECOS provides easy-to-use utility functions for efficient matrix operations. The `hstack_csr` function horizontally concatenates the feature matrices, so that each instance ends up with a single combined feature vector. More details about other utilities will be introduced later in this session."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "d0d3e69c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "X_trn is a csr matrix with a shape (3107, 232678) and 653987 non-zero values.\n",
+ "X_tst is a csr matrix with a shape (1036, 232678) and 164272 non-zero values.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.utils import smat_util\n",
+ "\n",
+ "X_trn = smat_util.hstack_csr(X_trn_list)\n",
+ "X_tst = smat_util.hstack_csr(X_tst_list)\n",
+ "\n",
+ "print(f\"X_trn is a {X_trn.getformat()} matrix with a shape {X_trn.shape} and {X_trn.nnz} non-zero values.\")\n",
+ "print(f\"X_tst is a {X_tst.getformat()} matrix with a shape {X_tst.shape} and {X_tst.nnz} non-zero values.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5e61775b",
+ "metadata": {},
+ "source": [
+ "### Model Training and Testing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "38189597",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "prec = 52.80 40.69 30.92 24.52 20.00\n",
+ "recall = 52.80 81.37 92.76 98.07 100.00\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc.xlinear.model import XLinearModel\n",
+ "xlm = XLinearModel.train(X_trn, Y_trn)\n",
+ "\n",
+ "Y_pred = xlm.predict(X_tst, beam_size=10, only_topk=5)\n",
+ "metrics = smat_util.Metrics.generate(Y_tst, Y_pred, topk=5)\n",
+ "print(metrics)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c2b3f61e",
+ "metadata": {},
+ "source": [
+ "## Sparse Matrix Operations\n",
+ "\n",
+ "Most of the computations in PECOS are based on sparse matrices, so PECOS also provides a range of efficient utilities for operating on them."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5fba58d4",
+ "metadata": {},
+ "source": [
+ "### Generic Matrix IO and Conversion\n",
+ "\n",
+ "`smat_util.load_matrix` and `smat_util.save_matrix` provide generic interfaces for loading and storing matrices in common formats, including NumPy [dense matrices](https://numpy.org/doc/stable/reference/generated/numpy.array.html) and several SciPy sparse matrix formats: [Compressed Sparse Row (CSR)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html), [Compressed Sparse Column (CSC)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html), and [COOrdinate (COO)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "385ed0ba",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dense Matrix IO\n",
+ "mat is a matrix with a shape (2, 3).\n",
+ "[[0.6757516 0.42168422 0.40557039]\n",
+ " [0.86806547 0.9198075 0.7494449 ]]\n",
+ "mat_loaded is a matrix with a shape (2, 3).\n",
+ "[[0.6757516 0.42168422 0.40557039]\n",
+ " [0.86806547 0.9198075 0.7494449 ]]\n",
+ "\n",
+ "csr Sparse Matrix IO\n",
+ "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n",
+ " (1, 2)\t0.17821196669035588\n",
+ " (2, 1)\t0.8259001065480657\n",
+ " (4, 2)\t0.5111159408743305\n",
+ " (4, 3)\t0.6337428297507509\n",
+ "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n",
+ " (1, 2)\t0.17821196669035588\n",
+ " (2, 1)\t0.8259001065480657\n",
+ " (4, 2)\t0.5111159408743305\n",
+ " (4, 3)\t0.6337428297507509\n",
+ "\n",
+ "csc Sparse Matrix IO\n",
+ "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n",
+ " (0, 1)\t0.868116915403953\n",
+ " (1, 2)\t0.7454473997071077\n",
+ " (0, 3)\t0.21167432752493887\n",
+ " (1, 3)\t0.4685535255015949\n",
+ "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n",
+ " (0, 1)\t0.868116915403953\n",
+ " (1, 2)\t0.7454473997071077\n",
+ " (0, 3)\t0.21167432752493887\n",
+ " (1, 3)\t0.4685535255015949\n",
+ "\n",
+ "coo Sparse Matrix IO\n",
+ "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n",
+ " (1, 0)\t0.3217900085041965\n",
+ " (3, 3)\t0.15316424313380772\n",
+ " (2, 3)\t0.7835729602784944\n",
+ " (2, 2)\t0.396664789900256\n",
+ "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n",
+ " (1, 0)\t0.3217900085041965\n",
+ " (3, 3)\t0.15316424313380772\n",
+ " (2, 3)\t0.7835729602784944\n",
+ " (2, 2)\t0.396664789900256\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.utils import smat_util\n",
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "print(\"Dense Matrix IO\")\n",
+ "mat = np.random.rand(2, 3)\n",
+ "print(f\"mat is a {type(mat)} matrix with a shape {mat.shape}.\")\n",
+ "print(mat)\n",
+ "smat_util.save_matrix(\"mat.npz\", mat)\n",
+ "mat_loaded = smat_util.load_matrix(\"mat.npz\")\n",
+ "print(f\"mat_loaded is a {type(mat_loaded)} matrix with a shape {mat_loaded.shape}.\")\n",
+ "print(mat_loaded)\n",
+ "print(\"\") \n",
+ "\n",
+ "for matrix_format in [\"csr\", \"csc\", \"coo\"]:\n",
+ " print(f\"{matrix_format} Sparse Matrix IO\")\n",
+ " mat = smat.random(5, 4, density=0.2, format=matrix_format)\n",
+ " print(f\"mat is a {type(mat)} matrix\"\n",
+ " f\" with a shape {mat.shape} and {mat.nnz} non-zero values.\")\n",
+ " print(mat)\n",
+ " \n",
+ " smat_util.save_matrix(\"mat.npz\", mat)\n",
+ " mat_loaded = smat_util.load_matrix(\"mat.npz\")\n",
+ " print(f\"mat_loaded is a {type(mat_loaded)} matrix\"\n",
+ " f\" with a shape {mat_loaded.shape} and {mat_loaded.nnz} non-zero values.\")\n",
+ " print(mat_loaded)\n",
+ " print(\"\") "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "2579b855",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Original Matrix mat\n",
+ " [[7.75480297e-01 3.06141999e-01 1.48439313e-01 4.82371302e-01\n",
+ " 7.94080899e-01 4.22684703e-04]\n",
+ " [1.22368102e-01 9.06639305e-02 5.88889479e-01 3.37880581e-01\n",
+ " 7.45595442e-01 9.58851999e-01]\n",
+ " [5.65962202e-01 7.18344997e-01 1.99721347e-01 2.02474399e-01\n",
+ " 5.86650110e-01 1.93471414e-01]\n",
+ " [5.72012816e-02 1.20470044e-01 1.27986695e-01 6.43206432e-01\n",
+ " 5.42874998e-01 7.05113274e-01]] \n",
+ "\n",
+ "csr_mat = dense_to_csr(mat)\n",
+ "csr_mat is a matrix with a shape (4, 6) and 24 non-zero values.\n",
+ "[[7.75480297e-01 3.06141999e-01 1.48439313e-01 4.82371302e-01\n",
+ " 7.94080899e-01 4.22684703e-04]\n",
+ " [1.22368102e-01 9.06639305e-02 5.88889479e-01 3.37880581e-01\n",
+ " 7.45595442e-01 9.58851999e-01]\n",
+ " [5.65962202e-01 7.18344997e-01 1.99721347e-01 2.02474399e-01\n",
+ " 5.86650110e-01 1.93471414e-01]\n",
+ " [5.72012816e-02 1.20470044e-01 1.27986695e-01 6.43206432e-01\n",
+ " 5.42874998e-01 7.05113274e-01]] \n",
+ "\n",
+ "csr_mat_topk = dense_to_csr(mat, topk=2)\n",
+ "csr_mat_topk is a matrix with a shape (4, 6) and 8 non-zero values.\n",
+ "[[0.7754803 0. 0. 0. 0.7940809 0. ]\n",
+ " [0. 0. 0. 0. 0.74559544 0.958852 ]\n",
+ " [0.5659622 0. 0. 0. 0.58665011 0. ]\n",
+ " [0. 0. 0. 0. 0.542875 0.70511327]] \n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "mat = np.random.rand(4, 6)\n",
+ "\n",
+ "print(f\"Original Matrix mat\\n\", mat, \"\\n\")\n",
+ "\n",
+ "print(\"csr_mat = dense_to_csr(mat)\")\n",
+ "csr_mat = smat_util.dense_to_csr(mat)\n",
+ "print(f\"csr_mat is a {type(csr_mat)} matrix\"\n",
+ " f\" with a shape {csr_mat.shape} and {csr_mat.nnz} non-zero values.\")\n",
+ "print(csr_mat.toarray(), \"\\n\")\n",
+ "\n",
+ "print(\"csr_mat_topk = dense_to_csr(mat, topk=2)\")\n",
+ "csr_mat_topk = smat_util.dense_to_csr(mat, topk=2)\n",
+ "print(f\"csr_mat_topk is a {type(csr_mat_topk)} matrix\"\n",
+ " f\" with a shape {csr_mat_topk.shape} and {csr_mat_topk.nnz} non-zero values.\")\n",
+ "print(csr_mat_topk.toarray(), \"\\n\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b2746e3c",
+ "metadata": {},
+ "source": [
+ "### Memory-efficient Sparse Matrix Operations\n",
+ "\n",
+ "To manipulate sparse matrices, PECOS provides many useful memory-efficient functions. For example, for CSR matrices, the following functions combine multiple matrices.\n",
+ "\n",
+ "* `hstack_csr([mat, mat, mat])`\n",
+ "* `vstack_csr([mat, mat, mat])`\n",
+ "* `block_diag_csr([mat, mat, mat])`\n",
+ "\n",
+ "These functions are also available for CSC matrices as `hstack_csc`, `vstack_csc`, and `block_diag_csc`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "b9dac617",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Original Matrix mat\n",
+ " [[0.45989041 0. ]\n",
+ " [0.77400378 0. ]\n",
+ " [0. 0.44557291]] \n",
+ "\n",
+ "hstack_csr([mat, mat, mat])\n",
+ "[[0.45989041 0. 0.45989041 0. 0.45989041 0. ]\n",
+ " [0.77400378 0. 0.77400378 0. 0.77400378 0. ]\n",
+ " [0. 0.44557291 0. 0.44557291 0. 0.44557291]] \n",
+ "\n",
+ "vstack_csr([mat, mat, mat])\n",
+ "[[0.45989041 0. ]\n",
+ " [0.77400378 0. ]\n",
+ " [0. 0.44557291]\n",
+ " [0.45989041 0. ]\n",
+ " [0.77400378 0. ]\n",
+ " [0. 0.44557291]\n",
+ " [0.45989041 0. ]\n",
+ " [0.77400378 0. ]\n",
+ " [0. 0.44557291]] \n",
+ "\n",
+ "block_diag_csr([mat, mat, mat])\n",
+ "[[0.45989041 0. 0. 0. 0. 0. ]\n",
+ " [0.77400378 0. 0. 0. 0. 0. ]\n",
+ " [0. 0.44557291 0. 0. 0. 0. ]\n",
+ " [0. 0. 0.45989041 0. 0. 0. ]\n",
+ " [0. 0. 0.77400378 0. 0. 0. ]\n",
+ " [0. 0. 0. 0.44557291 0. 0. ]\n",
+ " [0. 0. 0. 0. 0.45989041 0. ]\n",
+ " [0. 0. 0. 0. 0.77400378 0. ]\n",
+ " [0. 0. 0. 0. 0. 0.44557291]] \n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.utils import smat_util\n",
+ "import scipy.sparse as smat\n",
+ "\n",
+ "mat = smat.random(3, 2, density=0.5, format=\"csr\")\n",
+ "print(f\"Original Matrix {type(mat)} mat\\n\", mat.toarray(), \"\\n\")\n",
+ "\n",
+ "print(f\"hstack_csr([mat, mat, mat])\")\n",
+ "print(smat_util.hstack_csr([mat, mat, mat]).toarray(), \"\\n\")\n",
+ "\n",
+ "print(f\"vstack_csr([mat, mat, mat])\")\n",
+ "print(smat_util.vstack_csr([mat, mat, mat]).toarray(), \"\\n\")\n",
+ "\n",
+ "print(f\"block_diag_csr([mat, mat, mat])\")\n",
+ "print(smat_util.block_diag_csr([mat, mat, mat]).toarray(), \"\\n\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f9cf7f1",
+ "metadata": {},
+ "source": [
+ "### Sparse-to-sparse Matrix Multiplication (SpMM)\n",
+ "\n",
+ "Many operations in PECOS or XMR problems rely on Sparse-to-sparse Matrix Multiplication (SpMM), such as the computation of PIFA features. It is also one of the key primitives in large-scale linear algebra operations, with a broad range of applications in machine learning and natural language processing.\n",
+ "\n",
+ "For SpMM, PECOS provides a highly optimized multi-core CPU implementation with state-of-the-art performance, where the underlying operations are implemented and optimized in C/C++.\n",
+ "Specifically, the Python interface and parameters are as follows:\n",
+ "\n",
+ "```python\n",
+ "from pecos.core import clib as pecos_clib\n",
+ "Z = pecos_clib.sparse_matmul(X, Y, eliminate_zeros=False, sorted_indices=True, threads=-1)\n",
+ "```\n",
+ "* Parameters\n",
+ " * `X` (scipy.sparse.csr_matrix or scipy.sparse.csc_matrix): the first sparse matrix to be multiplied.\n",
+ " * `Y` (scipy.sparse.csr_matrix or scipy.sparse.csc_matrix): the second sparse matrix to be multiplied.\n",
+ " * `eliminate_zeros` (bool, optional): if true, then eliminate (potential) zeros created by maxnnz in output matrix Z. Default is false.\n",
+ " * `sorted_indices` (bool, optional): if true, then sort the Z.indices for the output matrix Z. Default is true.\n",
+ " * `threads` (int, optional): The number of threads. Default -1 to use all CPU cores."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "b2e6c54a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "||Z_true - Z_pred|| = 0.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "import scipy.sparse as smat\n",
+ "from scipy.sparse import linalg\n",
+ "from pecos.core import clib as pecos_clib\n",
+ "X = smat.random(1000, 1000, density=0.01, format='csr', dtype=np.float32)\n",
+ "Y = smat.random(1000, 1000, density=0.01, format='csr', dtype=np.float32)\n",
+ "Z_true = X.dot(Y)\n",
+ "Z_pred = pecos_clib.sparse_matmul(X, Y)\n",
+ "print(\"||Z_true - Z_pred|| = \", linalg.norm(Z_true - Z_pred))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "14e4bded",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "DATASET = \"wiki10-31k\"\n",
+ "X = smat_util.load_matrix(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\").astype(np.float32)\n",
+ "Y = smat_util.load_matrix(f\"xmc-base/{DATASET}/Y.trn.npz\").astype(np.float32)\n",
+ "YT_csr = Y.T.tocsr()\n",
+ "X_csr = X.tocsr()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bb5af1e9",
+ "metadata": {},
+ "source": [
+ "#### Benchmarking Sparse Matrix Multiplication\n",
+ "\n",
+ "The SpMM utility achieves state-of-the-art efficiency, as shown in the following figure.\n",
+ "\n",
+ "[Figure: SpMM efficiency comparison]\n",
+ "\n",
+ "In this part, we provide hands-on instructions for benchmarking different SpMM implementations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "c71cffdb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmarking SciPy\n",
+ "\n",
+ "start = time.time()\n",
+ "Z = YT_csr.dot(X_csr)\n",
+ "Z.sort_indices()\n",
+ "run_time_scipy = time.time() - start"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "2f2b9795",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmarking PyTorch\n",
+ "\n",
+ "import torch\n",
+ "\n",
+ "def csr_to_coo(A):\n",
+ " A_coo = smat.coo_matrix(A)\n",
+ " indices = np.vstack([A_coo.row, A_coo.col]).T\n",
+ " values = A_coo.data\n",
+ " return indices, values\n",
+ "\n",
+ "def get_pt_data(A_csr):\n",
+ " A_indices, A_values = csr_to_coo(A_csr)\n",
+ " A_pt = torch.sparse_coo_tensor(\n",
+ " A_indices.T.astype(np.int64),\n",
+ " A_values.astype(np.float32),\n",
+ " A_csr.shape,\n",
+ " )\n",
+ " return A_pt\n",
+ " \n",
+ "YT_pt = get_pt_data(YT_csr)\n",
+ "X_pt = get_pt_data(X_csr)\n",
+ "start = time.time()\n",
+ "Z_pt = torch.sparse.mm(YT_pt, X_pt)\n",
+ "run_time_pytorch = time.time() - start"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "54b24694",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmarking PECOS\n",
+ "\n",
+ "start = time.time()\n",
+ "Z = pecos_clib.sparse_matmul(\n",
+ " YT_csr, X_csr,\n",
+ " eliminate_zeros=False,\n",
+ " sorted_indices=True\n",
+ ")\n",
+ "run_time_pecos = time.time() - start"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "e12f29e9",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAD4CAYAAADmWv3KAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAW/ElEQVR4nO3deZRkZZnn8e8PEFlkaaRUXKBEFEcdZEkXDrZKtTqitONxp911KLuPC4z2zIFpuhV7uhunGwTU8UwJKoKioLK6oKDY2u0oVVBAgcehQFGUhkJpWWQRfOaPuClJkhl1MyJuVFbw/ZwTJ+PuTxDUk28+973vm6pCkjR5NtnQAUiSumGCl6QJZYKXpAllgpekCWWCl6QJtdmGDmCmHXfcsZYuXbqhw5CkjcaqVatuqqolc21bVAl+6dKlrFy5ckOHIUkbjSTXzrfNEo0kTSgTvCRNKBO8JE0oE7wkTSgTvCRNKBO8JE0oE7wkTSgTvCRNKBO8JE2ozp5kTbI78IUZq3YF/qaqju3ieksP+0oXpxXw06NeuqFDkDSAzhJ8Vf0Y2BMgyabAL4AzurqeJOn+xlWi+RPg6qqad8wESdJojSvBvw44da4NSZYnWZlk5bp168YUjiRNvs4TfJLNgZcBp8+1vapWVNVUVU0tWTLniJeSpAGMowV/AHBxVd0whmtJkhrjSPAHMU95RpLUnU4TfJKtgRcCX+7yOpKkB+p0Rqequh14eJfXkCTNzSdZJWlCrbcFn+QRwH7Ao4E7gDXAyqr6fcexSZKGMG+CT7I/cBiwA3AJcCOwBfBy4AlJvggcXVW3jCFOSdIC9WvBvwQ4uKp+NntDks2AA+ndQP1SR7FJkoYwb4Kvqv/WZ9s9wJldBCRJGo313mRNckiSbdNzYpKLk7xoHMFJkgbXphfN25o6+4uAPwLeCBzVaVSSpKG1SfBpfr4EOLmqrpixTpK0SLVJ8KuSfINegj8vyTaAXSQlaZFr8yTr2+lN3HFNVf02ycOBt3YalSRpaP36we89a9WuiZUZSdpY9GvBH9383ALYB7iMXu19D2AlsG+3oUmShjFvDb6q9q+q/YHrgX2aSTn2AfaiN7+qJGkRa3OTdfequnx6oarWAP+hu5AkSaPQ5ibrZUlOAE5pll9Pr1wjSVrE2iT4twJ/ARzSLP8z8PHOIpIkjcR6E3xV3Ql8uHlJkjYSbcaD3w/4ALDLzP2ratfuwpIkDatNieZE4L8Cq4B7uw1HkjQqbRL8b6rqa51HIkkaqTYJ/ttJ/hH4MnDX9Mqqunh9BybZHjgBeBpQ9Eam/P5goUqSFqJNgn9W83NqxroClrU49jjg61X1qiSbA1stMD5J0oDa9KLZf5ATJ9kOeC7wluY8dwN3D3IuSdLCtZnRabskxyRZ2byObpL3+jweWAd8KsklSU5IsvUc518+fe5169YN8BEkSXNpM1TBJ4Fbgdc0r1uAT7U4bjNgb+DjVbUXcDtw2OydqmpFM87N1JIlS1oHLknqr00N/glV9coZy0cmWd3iuOuA66rqB83yF5kjwUuSutGmBX9HkudMLzQPPt2xvoOq6t+AnyfZvVn1J8CVA0UpSVqwNi34vwBOmlF3v5nmxmkL7wY+2/SguQZngpKksWnTi2Y18PQk2zbLt7Q9eXPs1Pr2kySNXpteNH+fZPuquqWqbknyR0n+5ziCkyQNrk0N/oCq+vfphaq6GXhJZxFJkkaiTYLfNMlDpxeSbAk8tM/+kqRFoM1N1s8CFySZ7vv+VuCk7kKSJI1Cm5usH0pyKfCCZtXfVtV53YYlSRpWmxY8wI+Ae6rq/CRbJdmmqm7tMjBJ0nDa9KI5mN5TqP+nWfUY4MwOY5IkjUCbm6zvBPajNwYNVXUV8Igug5IkDa9Ngr+rGeoXgCSb0RsPXpK0iLVJ8N9J8j+ALZO8EDgdOKfbsCRJw2qT4A+jN6775cA7gK8CR3QZlCRpeG26Sf4e+ATw
iSQ7AI+tKks0krTItelFc2GSbZvkvopeov9w96FJkobRpkSzXTOC5CuAz1TVs+iN7S5JWsTaJPjNkuxEb7q+czuOR5I0Im0S/AeB84C1VXVRkl2Bq7oNS5I0rDY3WU+n1zVyevka4JXzHyFJWgzmbcEnOaK5sTrf9mVJDuwmLEnSsPq14C8HzklyJ3Axvb7wWwBPBPYEzgf+vusAJUmDmTfBV9VZwFlJnkhvLJqd6I1HcwqwvKruGE+IkqRBtKnBX8WAN1WT/BS4FbiX3nDDTsAtSWPSdjz4YexfVTeN4TqSpBnadJOUJG2Euk7wBXwjyaoky+faIcnyJCuTrFy3bl3H4UjSg0ebsWielOSCJGua5T2StB1N8jlVtTdwAPDOJM+dvUNVraiqqaqaWrJkyYKClyTNr00L/hPA4cDvAKrqMuB1bU5eVb9oft4InAE8c7AwJUkL1SbBb1VVP5y17p71HZRk6yTbTL8HXgSsWXiIkqRBtOlFc1OSJ9BM05fkVcD1LY57JHBGkunrfK6qvj5ooJKkhWmT4N8JrACenOQXwE+AN6zvoGbMmqcPF54kaVBtHnS6BnhBU2bZpKpu7T4sSdKw1pvgk2wPvAlYSm9seACq6j1dBiZJGk6bEs1Xgf9Lb/Cx33cbjiRpVNok+C2q6r2dRyJJGqk23SRPTnJwkp2S7DD96jwySdJQ2rTg7wb+Efgrmq6Szc9duwpKkjS8Ngn+fcBujggpSRuXNiWatcBvuw5EkjRabVrwtwOrk3wbuGt6pd0kJWlxa5Pgz2xekqSNSJsnWU8aRyCSpNGaN8EnOa2qXpPkcu7rPfMHVbVHp5FJkobSrwX/4ebngeMIRA8uSw/7yoYOYWL99KiXbugQtEj0S/AfA/auqmvHFYwkaXT6dZPM2KKQJI1cvxb8Y5IcP99Gu0lK0uLWL8HfAawaVyCSpNHql+B/ZRdJSdp49avB3z22KCRJIzdvgq+qZ48zEEnSaLUZbGwoSTZNckmSc7u+liTpPp0neOAQ4EdjuI4kaYZWCb5phT86yc7Tr5bHPRZ4KXDCMEFKkhZuvYONJXk38H7gBu6bdLuANmPRHAv8d2CbPudfDiwH2HnnVr83JEkttBku+BBg96r61UJOnORA4MaqWpXk+fPtV1UrgBUAU1NTDxjUTJI0mDYlmp8Dvxng3PsBL0vyU+DzwLIkpwxwHknSANq04K8BLkzyFe4/o9Mx/Q6qqsOBwwGaFvxfVtUbBo5UkrQgbRL8z5rX5s1LkrQRaDOj05EASR7WLN+20ItU1YXAhQs9TpI0uPXW4JM8LcklwBXAFUlWJXlq96FJkobR5ibrCuC9VbVLVe0CvA/4RLdhSZKG1SbBb11V355eaMotW3cWkSRpJFr1okny18DJzfIb6PWskSQtYm1a8G8DlgBfbl5LmnWSpEWsTS+amwGn55Okjcy8CT7JsVV1aJJz6I09cz9V9bJOI5MkDaVfC3665v5P4whEkjRa8yb4qpqecHvPqjpu5rYkhwDf6TIwSdJw2txkffMc694y4jgkSSPWrwZ/EPBnwOOTnD1j0zbAr7sOTJI0nH41+H8Frgd2BI6esf5W4LIug5IkDa9fDf5a4Fpg3/GFI0kalTaDjT07yUVJbktyd5J7k9wyjuAkSYNrc5P1o8BBwFXAlsB/AT7WZVCSpOG1SfBU1Vpg06q6t6o+Bby427AkScNqM9jYb5NsDqxO8r/o3Xht9YtBkrThtEnUb2z2exdwO/A44JVdBiVJGl6bFvxNwN1VdSdwZJJNgYd2G5YkaVhtWvAXAFvNWN4SOL+bcCRJo9ImwW8xc6Lt5v1WffYHIMkWSX6Y5NIkVyQ5cphAJUkL0ybB355k7+mFJPsAd7Q47i5gWVU9HdgTeHGSZw8UpSRpwdrU4A8FTk/ySyDAo4DXru+gqipguuX/kOb1gHHlJUndaDOj00VJngzs3qz6cVX9rs3Jmxuyq4DdgI9V1Q8GjlSStCD9
RpNcVlXfSvKKWZuelISq+vL6Tl5V9wJ7JtkeOCPJ06pqzazrLAeWA+y8884L/gCSpLn1a8E/D/gW8KdzbCt6E3C3UlX/nuTb9J6AXTNr2wpgBcDU1JQlHEkakX6jSb6/+fnWQU6cZAnwuya5bwm8EPjQQFFKkhasX4nmvf0OrKpj1nPunYCTmjr8JsBpVXXuwkOUJA2iX4lmm2FOXFWXAXsNcw5J0uD6lWh8MEmSNmJtJvzYNck5SdYluTHJWUl2HUdwkqTBtXmS9XPAafRq6o8GTgdO7TIoSdLw2iT4rarq5Kq6p3mdAmzRdWCSpOG0Garga0kOAz5Pr//7a4GvJtkBoKp+3WF8kqQBtUnwr2l+vmPW+tfRS/jW4yVpEWozFs3jxxGIJGm0BhmLBqDVWDSSpA1nLGPRSJLGb71j0QAfrKqfzNyWxLKNJC1ybbpJfmmOdV8cdSCSpNHqV4N/MvBUYLtZdfhtsR+8JC16/WrwuwMHAttz/zr8rcDBHcYkSRqBfjX4s4CzkuxbVd8fY0ySpBHoV6L5CM0k2UkOmr29qt7TYVySpCH1K9GsHFsUkqSR61eiOWmcgUiSRmu9QxU0k2U/YDLsqlrWSUSSpJFoM9jYX854vwXwSuCebsKRJI1Km8HGVs1a9S9JfthRPJKkEWlTotlhxuImwD7Adi2OexzwGeCR9Eo8K6rquAHjlCQtUJsSzSp6CTr0SjM/Ad7e4rh7gPdV1cVJtgFWJflmVV05cLSSpNY6Gw++qq4Hrm/e35rkR8BjABO8JI1Bvwed5hwHftpCxoNPshTYC/hB68gkSUPp14L/IrC6eUGvRDOt9XjwSR5Gb0TKQ6vqljm2LweWA+y8885tTilJaqFfgn8FvXlX9wDOAk6tqrULOXmSh9BL7p+dr8VfVSuAFQBTU1MP6G8vSRrMvOPBV9WZVfU6ejM7XQ0cneR7SZ7X5sRJApwI/KiqjhlJtJKk1tpM+HEn8BvgFuBhtB8Lfj/gjcCyJKub10sGC1OStFB9J92mV6J5JnA+cFxVtR6ArKq+x/3r9pKkMepXgz8fuAz4HvBQ4E1J3jS90eGCJWlx65fg3zq2KCRJI+dwwZI0odrcZJUkbYRM8JI0oQZK8Ek2H3UgkqTRWm+CT3JhM5bM9PIzgYu6DEqSNLw2wwX/A/D1JMfTGw3yAOxhI0mLXpvhgs9L8ufAN4GbgL2q6t86j0ySNJQ2JZq/Bj4CPBf4AHBhkpd2HJckaUhtSjQPB55ZVXcA30/ydeAE4CudRiZJGkqbEs2hs5avBV7YVUCSpNHoN9jYsVV1aJJz6E3wcT9V9bJOI5MkDaVfC/7k5uc/jSMQSdJo9RuLZlWSTYHlVfX6McYkSRqBvr1oqupeYBefXJWkjU+bXjTXAP+S5Gzg9umVTsMnSYtbmwR/dfPaBNimWefk2JK0yLVJ8FdW1ekzVyR5dUfxSJJGpM1okoe3XCdJWkT69YM/AHgJ8JhmoLFp2wL3dB2YJGk4/VrwvwRWAncCq2a8zgb+0/pOnOSTSW5MsmYUgUqSFqZfP/hLgUuTfK6qfjfAuT8NfBT4zICxSZKG0OYm69Ik/wA8BdhiemVV7drvoKr655kThUiSxqvNTdZPAR+nV3ffn16L/JRRBZBkeZKVSVauW7duVKeVpAe9Ngl+y6q6AEhVXVtVHwBGNh58Va2oqqmqmlqyZMmoTitJD3ptSjR3JdkEuCrJu4BfAA/rNixJ0rDatOAPAbYC3gPsA7wReHOXQUmShtdmwo+Lmre3sYDJtpOcCjwf2DHJdcD7q+rEQYKUJC1cvwedzu534Pom/KiqgwYNSpI0vH4t+H2BnwOnAj8AMpaIJEkj0S/BP4re3KsHAX9Gb5LtU6vqinEEJmlxWXrYVzZ0CBPrp0eNrGPi/cx7k7Wq7q2qr1fVm4FnA2uBC5ueNJKkRa7vTdYkD6XX5/0gYClwPHBG92FJkobV7ybrZ4CnAV8FjqwqBw2TpI1I
vxb8G+hN0XcI8J7kD/dYA1RVbdtxbJKkIfQbTbLNQ1CSpEXKJC5JE8oEL0kTygQvSRPKBC9JE8oEL0kTygQvSRPKBC9JE8oEL0kTygQvSRPKBC9JE8oEL0kTygQvSRPKBC9JE6rTBJ/kxUl+nGRtksO6vJYk6f46S/BJNgU+BhwAPAU4KMlTurqeJOn+umzBPxNYW1XXVNXdwOeB/9zh9SRJM/Sdk3VIjwF+PmP5OuBZs3dKshxY3izeluTHHca0WOwI3LShg2grH9rQESwKG8135vf1Bw+W72yX+TZ0meBbqaoVwIoNHcc4JVlZVVMbOg6153e28fE767ZE8wvgcTOWH9uskySNQZcJ/iLgiUken2Rz4HXA2R1eT5I0Q2clmqq6J8m7gPOATYFPVtUVXV1vI/OgKklNCL+zjc+D/jtLVW3oGCRJHfBJVkmaUCZ4SZpQJvgRSfJXSa5IclmS1Uke0Oe/2W8qyfHN+7ckWdfsf2WSg8cb9eRLcm/z33dNktOTbDXPfv+x2W91kl8n+Unz/vwhrv3pJK8aPPoHt/m+uxnrp1+HNesfkuSoJFcluTjJ95Mc0GzbLslnmmFTrm7eb9ds2yTJ8c11Lk9yUZLHb7hPPjobvB/8JEiyL3AgsHdV3ZVkR2DzufatqpXAyhmrvlBV70ryCOCKJGdX1Q3dR/2gcUdV7QmQ5LPAnwPHzN6pqi4Hpvf7NHBuVX2xzQWSbFpV944oXt1nvu/uD+tn+VtgJ+Bpzb/DRwLPa7adCKypqjc15zsSOAF4NfBa4NHAHlX1+ySPBW7v7FONkS340dgJuKmq7gKoqpuq6pdJnpHkX5NcmuSHSbZJ8vwk584+QVXdCFwN7NK0QJbAH1oXa6eXNZTvArsl+WCSQ6dXJvm7JIfMdUCSg5pW3ZrkvucNk9yW5OgklwL7JnlT89fbpUlOnnGK5zb/D1xja34o3wV2m29j07o/GHj3jH+HN1TVaUl2A/ah9wtg2geBqSRPoPfv9/qq+n1z3HVVdXNHn2OsTPCj8Q3gcUn+X5L/neR5Td//LwCHVNXTgRcAd8x3giS7ArsCa4FTgNc3m14AXFpV6zr9BBMuyWb0Br67HPgkMN2S24TeMxqnzHHMo4EPAcvote6fkeTlzeatgR803+3NwBHAsmZ55i+LnYDn0PsL76hRf64Hg1nfHcCWs0o0r6WX/H9WVbfMcYqnAKtn/pXVvF8NPBU4DfjT5lxHJ9mry88zTpZoRqCqbkuyD/DHwP70Evvf0WsVXNTscwtAktmHvzbJc4C7gHdU1a+TfBI4CzgWeBvwqXF8jgm1ZZLVzfvvAidW1d1JftX8Q34kcElV/WqOY58BXDj9y7UpEzwXOBO4F/hSs98y4PSqugmgqn494xxnNi3DK5uSgdp7wHfXvH9AiSbJHoNepKquS7I7ve9xGXBBkldX1QWDnnOxMMGPSNMiuBC4MMnlwDtbHvqFqnrXrHP9PMkNSZbRG5Xz9XMfqhbmq9eeALwFeBS9Fv1C3dmy7n7XjPcP+O2uvub77uayFtg5ybZztOKvBPZMssl0Gab5y23PZhtNWedrwNeS3AC8HNjoE7wlmhFIsnuSJ85YtSfwI2CnJM9o9tmm+VOzrRPolQ1O9wZeJ84AXkyvlX7ePPv8EHhekh3Tm9/gIOA7c+z3LeDVSR4OkGSHDuJVH1X1W3ot/OOa8ihJljQt8bXAJfTKaNOOAC6uqrVJ9m7KcdOJfw/g2vF+gm6Y4EfjYcBJ6XV1vIxeze9v6N2d/0hzI+6bwBYLOOfZzXktz3SgmaPg28Bp8/0CrarrgcOa/S4FVlXVWXPsdwW9ktx3mu/6Ab10NFKza/DT9zaOANbRK4etAc4Fplvzbwee1HSRvBp4UrMO4BHAOc0xlwH3AB8d14fpkkMVLFJJpoAPV9Ufb+hYJlHTUrsYeHVVXbWh45G6YAt+EUrvwY0vAYdv6FgmUXpTR64FLjC5a5LZgpekCWULXpIm
lAlekiaUCV6SJpQJXpImlAlekibU/wc3dIaywUV9lgAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from matplotlib import pyplot as plt\n",
+ "plt.bar(\n",
+ " [1,2,3],\n",
+ " [run_time_scipy, run_time_pytorch, run_time_pecos],\n",
+ " tick_label = [\"SciPy\", \"PyTorch\", \"PECOS\"])\n",
+ "\n",
+ "plt.ylabel(\"Matrix Multiplication Time (seconds)\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f833846",
+ "metadata": {},
+ "source": [
+ "## Hierarchical Clustering\n",
+ "\n",
+ "Hierarchical clustering is an essential component of tree-based XMR models and serves as the indexer in PECOS. Accordingly, PECOS implements hierarchical K-means efficiently in C/C++, and the implementation can also be used as a general-purpose clustering utility for other tasks. The Python interface of PECOS hierarchical K-means is as follows:\n",
+ "\n",
+ "```python\n",
+ "from pecos.xmc import HierarchicalKMeans\n",
+ "HierarchicalKMeans.gen(feature_matrix, ... [training parameters])\n",
+ "```\n",
+ "* Training Parameters\n",
+ " * `nr_splits` (int, optional): The out-degree of each internal node of the tree. Ignored if `imbalanced_ratio != 0` because imbalanced clustering supports only 2-means. Default is `16`.\n",
+ " * `min_codes` (int): The number of direct child nodes that the top level of the hierarchy should have.\n",
+ " * `max_leaf_size` (int, optional): The maximum size of each leaf node of the tree. Default is `100`.\n",
+ " * `spherical` (bool, optional): True will l2-normalize the centroids of k-means after each iteration. Default is `True`.\n",
+ " * `seed` (int, optional): Random seed. Default is `0`.\n",
+ " * `kmeans_max_iter` (int, optional): Maximum number of iterations for each k-means problem. Default is `20`.\n",
+ " * `threads` (int, optional): Number of threads to use. `-1` denotes all CPUs. Default is `-1`.\n",
+ " \n",
+ "#### Clustering Chains\n",
+ "\n",
+ "Similar to the results of semantic label indexing in PECOS, the hierarchical clustering results are returned as a list of `D` CSC matrices `C[d]` that denote the cluster assignments at each layer, where `D` is the number of layers in the resulting hierarchy."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b3b073c",
+ "metadata": {},
+ "source": [
+ "### Naive Clustering as Degenerate Hierarchical Clustering\n",
+ "\n",
+ "When the stopping criteria `min_codes` and `max_leaf_size` are large enough, hierarchical clustering stops after a single split and degenerates to conventional flat clustering."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "4a5da8c1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pecos.utils import smat_util\n",
+ "import time\n",
+ "DATASET = \"wiki10-31k\"\n",
+ "X = smat_util.load_matrix(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\").astype(np.float32)\n",
+ "Y = smat_util.load_matrix(f\"xmc-base/{DATASET}/Y.trn.npz\").astype(np.float32)\n",
+ "YT_csr = Y.T.tocsr()\n",
+ "X_csr = X.tocsr()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "088d87a3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2 layers in the trained hierarchical clusters with C[d] as:\n",
+ "cluster_chain[0] is a csc matrix of shape (4, 1).\n",
+ "cluster_chain[1] is a csc matrix of shape (14146, 4).\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc.base import HierarchicalKMeans\n",
+ "import scipy.sparse as smat\n",
+ "import numpy as np\n",
+ "\n",
+ "num_splits = 4\n",
+ "cluster_chain = HierarchicalKMeans.gen(\n",
+ " X_csr,\n",
+ " min_codes=num_splits,\n",
+ " nr_splits=num_splits,\n",
+ " max_leaf_size=int(np.ceil(X_csr.shape[0] / num_splits)))\n",
+ "\n",
+ "print(f\"{len(cluster_chain)} layers in the trained hierarchical clusters with C[d] as:\")\n",
+ "for d, C in enumerate(cluster_chain):\n",
+ " print(f\"cluster_chain[{d}] is a {C.getformat()} matrix of shape {C.shape}.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "ef1c9ee1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(14146,) (14146,)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from sklearn.decomposition import TruncatedSVD\n",
+ "svd = TruncatedSVD(n_components=2)\n",
+ "X_svd = svd.fit_transform(X_csr)\n",
+ "\n",
+ "from matplotlib import pyplot as plt\n",
+ "cluster_x, cluster_y = X_svd[:, 0], X_svd[:, 1]\n",
+ "cluster_c = list(cluster_chain[-1].tocsr().indices)\n",
+ "print(cluster_x.shape, cluster_y.shape)\n",
+ "plt.scatter(cluster_x, cluster_y, c=cluster_c, s=25)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3acd550d",
+ "metadata": {},
+ "source": [
+ "### Tracing Cluster in Hierarchical Clustering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "b6c75c21",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "4 layers in the trained hierarchical clusters with C[d] as:\n",
+ "cluster_chain[0] is a csc matrix of shape (4, 1).\n",
+ "cluster_chain[1] is a csc matrix of shape (32, 4).\n",
+ "cluster_chain[2] is a csc matrix of shape (256, 32).\n",
+ "cluster_chain[3] is a csc matrix of shape (14146, 256).\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc.base import HierarchicalKMeans\n",
+ "import scipy.sparse as smat\n",
+ "import numpy as np\n",
+ "\n",
+ "cluster_chain = HierarchicalKMeans.gen(X_csr, nr_splits=8)\n",
+ "\n",
+ "print(f\"{len(cluster_chain)} layers in the trained hierarchical clusters with C[d] as:\")\n",
+ "for d, C in enumerate(cluster_chain):\n",
+ " print(f\"cluster_chain[{d}] is a {C.getformat()} matrix of shape {C.shape}.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "dc644526",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "55 instances belong to the first cluster in the layer-3.\n",
+ "442 instances belong to the first cluster in the layer-2.\n",
+ "3536 instances belong to the first cluster in the layer-1.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from scipy.sparse import linalg\n",
+ "from pecos.core import clib as pecos_clib\n",
+ "\n",
+ "current_cluster = cluster_chain[-1]\n",
+ "for i in range(len(cluster_chain) - 2, -1, -1):\n",
+ " print(f\"{current_cluster.getnnz(0)[0]} instances belong to the first cluster in the layer-{i + 1}.\")\n",
+ " current_cluster = pecos_clib.sparse_matmul(current_cluster, cluster_chain[i])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "f2203831",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The 10-th instance belongs to the cluster-228 in the layer-3.\n",
+ "The 10-th instance belongs to the cluster-28 in the layer-2.\n",
+ "The 10-th instance belongs to the cluster-3 in the layer-1.\n"
+ ]
+ }
+ ],
+ "source": [
+ "inst_idx = 10\n",
+ "\n",
+ "current_cluster = cluster_chain[-1]\n",
+ "for i in range(len(cluster_chain) - 2, -1, -1):\n",
+ " print(f\"The {inst_idx}-th instance belongs to the cluster-{current_cluster.tocsr().indices[inst_idx]} in the layer-{i + 1}.\")\n",
+ " current_cluster = pecos_clib.sparse_matmul(current_cluster, cluster_chain[i])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "080f044a",
+ "metadata": {},
+ "source": [
+ "### Performance Benchmarking\n",
+ "\n",
+ "Here we benchmark the efficiency of PECOS hierarchical clustering and compare it with a pure Python implementation based on [sklearn.cluster.KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "cbcacb51",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "PECOS takes 1.0543 seconds for hierarchical clustering with a depth 5.\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "from pecos.xmc.base import HierarchicalKMeans\n",
+ "\n",
+ "nr_splits = 4\n",
+ "\n",
+ "start_time = time.time()\n",
+ "cluster_chain = HierarchicalKMeans.gen(X_csr, nr_splits=nr_splits)\n",
+ "pred_time = time.time() - start_time\n",
+ "\n",
+ "cluster_depth = len(cluster_chain)\n",
+ "print(f\"PECOS takes {pred_time:.4f} seconds for hierarchical clustering with a depth {cluster_depth}.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "5c8179cd",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "scikit-learn takes 49.1826 seconds for hierarchical clustering with a depth 5.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from sklearn.cluster import KMeans\n",
+ "import numpy as np\n",
+ "\n",
+ "start_time = time.time()\n",
+ "current_clusters = [X_csr]\n",
+ "for d in range(cluster_depth):\n",
+ "    next_clusters = []\n",
+ "    for cur_X in current_clusters:\n",
+ "        if cur_X.shape[0] >= nr_splits:\n",
+ "            kmeans = KMeans(n_clusters=nr_splits).fit(cur_X)\n",
+ "            # keep every one of the nr_splits child clusters, not only the first two\n",
+ "            next_clusters.extend(cur_X[kmeans.labels_ == k] for k in range(nr_splits))\n",
+ "        else:\n",
+ "            next_clusters.append(cur_X)\n",
+ "\n",
+ "    current_clusters = next_clusters\n",
+ "pred_time = time.time() - start_time\n",
+ "\n",
+ "print(f\"scikit-learn takes {pred_time:.4f} seconds for hierarchical clustering with a depth {cluster_depth}.\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/kdd22/Session 5 XR-Transformer cookbook and Distributed PECOS.ipynb b/tutorials/kdd22/Session 5 XR-Transformer cookbook and Distributed PECOS.ipynb
new file mode 100644
index 0000000..684272f
--- /dev/null
+++ b/tutorials/kdd22/Session 5 XR-Transformer cookbook and Distributed PECOS.ipynb
@@ -0,0 +1,883 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "3ee4c624-46ff-4a69-b315-097a4a471737",
+ "metadata": {},
+ "source": [
+ "# eXtreme Multi-label Ranking (XMR) with Transformers\n",
+ "In many XMC applications, XR-Transformer is able to yield better performance than XR-Linear due to better extraction of semantic information. However, unlike the linear models, the training hyper-parameters need to be carefully set to achieve the best performance. Naively using the default setting will often lead to sub-optimal results.\n",
+ "\n",
+ "In this section, we will discuss the crucial components of training a good XR-Transformer model.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "235e6dce-80eb-48f7-8604-f1695f04c878",
+ "metadata": {},
+ "source": [
+ "## 1. Overview: Multi-Resolution Fine-tuning\n",
+ "\n",
+ "One important thing to note is that XR-Transformer leverages multi-resolution fine-tuning to allow tuning from easy to hard tasks. The training can be separated into three steps:\n",
+ "\n",
+ "* **Step1**: Label features are computed (usually via PIFA) and used to build the preliminary hierarchical label tree (HLT) via hierarchical k-means.\n",
+ "* **Step2**: Fine-tune the transformer encoder on the chosen levels of the preliminary HLT.\n",
+ "* **Step3**: Concatenate the final instance embeddings with the sparse features and train the linear rankers on the refined HLT.\n",
+ "\n",
+ " \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0a355aa-3f50-4471-9b8e-08538f3bc57e",
+ "metadata": {},
+ "source": [
+ "## 2. Parameter structure\n",
+ "\n",
+ "Although the CLI modules `pecos.xmc.xtransformer.train`, `pecos.xmc.xtransformer.predict` and `pecos.xmc.xtransformer.encode` provide basic options for supplying training and prediction parameters, advanced users are recommended to supply parameters in JSON format.\n",
+ "\n",
+ "You can generate a `.json` skeleton containing all of the parameters, which you can then edit and fill in, via\n",
+ "```bash\n",
+ "python3 -m pecos.xmc.xtransformer.train --generate-params-skeleton &> params.json\n",
+ "```\n",
+ "\n",
+ "After filling in the desired parameters in `params.json`, the training can be done end to end via:\n",
+ "```bash\n",
+ "python3 -m pecos.xmc.xtransformer.train -t ${T_path} -x ${X_path} -y ${Y_path} -m ${model_dir} --params-path params.json\n",
+ "```\n",
+ "\n",
+ "The high-level structure of the training and prediction parameters for XR-Transformer:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "24e47f23-8525-4588-ab44-97c8f52f6cec",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training Parameters of XTransformer.\n",
+ "\n",
+ " preliminary_indexer_params (HierarchicalKMeans.TrainParams): params to generate preliminary hierarchial label tree.\n",
+ " ignored if clustering is given\n",
+ " refined_indexer_params (HierarchicalKMeans.TrainParams): params to generate refined hierarchial label tree.\n",
+ " ignored if fix_clustering is True\n",
+ " matcher_params_chain (TransformerMatcher.TrainParams or list): chain of params for TransformerMatchers.\n",
+ " ranker_params (XLinearModel.TrainParams): train params for linear ranker\n",
+ "\n",
+ " do_fine_tune (bool, optional): if False, skip fine-tuning steps and directly use pre-trained transformer models.\n",
+ " Default True\n",
+ " only_encoder (bool, optional): if True, skip linear ranker training. Default False\n",
+ " fix_clustering (bool, optional): if True, use the same hierarchial label tree for fine-tuning and final prediction. Default false.\n",
+ " max_match_clusters (int, optional): max number of clusters on which to fine-tune transformer. Default 32768\n",
+ " \n",
+ "Pred Parameters of XTransformer.\n",
+ "\n",
+ " matcher_params_chain (TransformerMatcher.PredParams or list): chain of params for TransformerMatchers\n",
+ " ranker_params (XLinearModel.PredParams): pred params for linear ranker\n",
+ " \n"
+ ]
+ }
+ ],
+ "source": [
+ "import logging\n",
+ "from pecos.xmc.xtransformer.model import XTransformer\n",
+ "from pecos.utils import logging_util\n",
+ "\n",
+ "LOGGER = logging.getLogger(__name__)\n",
+ "\n",
+ "logging_util.setup_logging_config(level=1)\n",
+ "\n",
+ "print(XTransformer.TrainParams.__doc__)\n",
+ "print(XTransformer.PredParams.__doc__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "960b84bd-5c52-425c-b31f-93eb7c6b5e49",
+ "metadata": {},
+ "source": [
+ "We provide the flexibility to control almost every aspect of XR-Transformer training. Let's cover the main components."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e101dc60-0fed-47dd-8c2b-de3ecbc7bd97",
+ "metadata": {},
+ "source": [
+ "### 2.1 Specify the Label Hierarchy\n",
+ "\n",
+ "The structure and construction of the preliminary HLT and the refined-HLT are controlled by `preliminary_indexer_params` and `refined_indexer_params`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "22f9523a-b7e5-49ff-9874-bcb21bb55eee",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training Parameters of Hierarchical K-means.\n",
+ "\n",
+ " nr_splits (int, optional): The out-degree of each internal node of the tree. Ignored if `imbalanced_ratio != 0` because imbalanced clustering supports only 2-means. Default is `16`.\n",
+ " min_codes (int): The number of direct child nodes that the top level of the hierarchy should have.\n",
+ " max_leaf_size (int, optional): The maximum size of each leaf node of the tree. Default is `100`.\n",
+ " imbalanced_ratio (float, optional): Value between `0.0` and `0.5` (inclusive). Indicates how relaxed the balancedness constraint of 2-means can be. Specifically, if an iteration of 2-means is clustering `L` labels, the size of the output 2 clusters will be within approx `imbalanced_ratio * 2 * L` of each other. Default is `0.0`.\n",
+ " imbalanced_depth (int, optional): Maximum depth of imbalanced clustering. After depth `imbalanced_depth` is reached, balanced clustering will be used. Default is `100`.\n",
+ " spherical (bool, optional): True will l2-normalize the centroids of k-means after each iteration. Default is `True`.\n",
+ " seed (int, optional): Random seed. Default is `0`.\n",
+ " kmeans_max_iter (int, optional): Maximum number of iterations for each k-means problem. Default is `20`.\n",
+ " threads (int, optional): Number of threads to use. `-1` denotes all CPUs. Default is `-1`.\n",
+ " \n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc.base import HierarchicalKMeans; print(HierarchicalKMeans.TrainParams.__doc__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "18c61c61-20ad-4a4f-93a3-85a736215310",
+ "metadata": {},
+ "source": [
+ "Here is an example of the label-hierarchy parameters in the `eurlex-4k` model:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "2759001f-413f-49be-bfad-38f58b9147d1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{\n",
+ " \"__meta__\": {\n",
+ " \"class_fullname\": \"pecos.xmc.base###HierarchicalKMeans.TrainParams\"\n",
+ " },\n",
+ " \"nr_splits\": 16,\n",
+ " \"min_codes\": 16,\n",
+ " \"max_leaf_size\": 16,\n",
+ " \"imbalanced_ratio\": 0.0,\n",
+ " \"imbalanced_depth\": 100,\n",
+ " \"spherical\": true,\n",
+ " \"seed\": 0,\n",
+ " \"kmeans_max_iter\": 20,\n",
+ " \"threads\": -1\n",
+ "}\n"
+ ]
+ }
+ ],
+ "source": [
+ "import json\n",
+ "import pecos\n",
+ "import requests\n",
+ "import numpy as np\n",
+ "from pecos.utils import smat_util\n",
+ "from pecos.xmc import Indexer, LabelEmbeddingFactory\n",
+ "\n",
+ "param_url = \"https://raw.githubusercontent.com/amzn/pecos/mainline/examples/xr-transformer-neurips21/params/eurlex-4k/bert/params.json\"\n",
+ "params = json.loads(requests.get(param_url).text)\n",
+ " \n",
+ "eurlex4k_train_params = XTransformer.TrainParams.from_dict(params[\"train_params\"])\n",
+ "eurlex4k_pred_params = XTransformer.PredParams.from_dict(params[\"pred_params\"])\n",
+ "\n",
+ "print(json.dumps(eurlex4k_train_params.preliminary_indexer_params.to_dict(), indent=True))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "90802b2f-55ef-4a64-9252-08f7a1e30bcc",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Preliminary HLT structure [16, 256, 3956]\n"
+ ]
+ }
+ ],
+ "source": [
+ "X_feat = smat_util.load_matrix(\"work_dir/xmc-base/eurlex-4k/X.trn.npz\", dtype=np.float32)\n",
+ "Y = smat_util.load_matrix(\"work_dir/xmc-base/eurlex-4k/Y.trn.npz\", dtype=np.float32)\n",
+ "\n",
+ "with open(\"work_dir/xmc-base/eurlex-4k/X.trn.txt\", 'r') as fin:\n",
+ " X_txt = [xx.strip() for xx in fin.readlines()]\n",
+ "\n",
+ "preliminary_hlt = Indexer.gen(\n",
+ " LabelEmbeddingFactory.create(Y, X_feat, method=\"pifa\"),\n",
+ " train_params=eurlex4k_train_params.preliminary_indexer_params,\n",
+ ")\n",
+ "\n",
+ "print(f\"Preliminary HLT structure {[c.shape[0] for c in preliminary_hlt]}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11232266-fceb-464b-8b71-ceaf723d7b39",
+ "metadata": {},
+ "source": [
+ "In this case the preliminary HLT has 3 levels (16-256-3956) and the refined HLT has 4 levels (4-32-256-3956).\n",
+ "Since we set `max_match_clusters` to `32768`, fine-tuning happens on all 3 levels of the preliminary HLT.\n",
+ "\n",
+ "The preliminary HLT is usually constructed such that:\n",
+ "* The initial fine-tuning task has a low enough label resolution (i.e. < 1000 labels, in this case 16). This ensures that the Transformer can 'warm up' on a simple task.\n",
+ "* The final fine-tuning task has a high enough label resolution, capped by `max_match_clusters` (in this case 32768). The cap keeps training efficient."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ac812a91-024f-45cc-9d3c-f97b7f98ac94",
+ "metadata": {},
+ "source": [
+ "### 2.2 Control fine-tuning at each level\n",
+ "\n",
+ "At each level of the fine-tuning task, users can independently specify training parameters such as `loss_function` and `batch_size`.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "be25c125-3dc6-4ec7-b83f-ef289312778f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training Parameters of MLModel\n",
+ "\n",
+ " model_shortcut (str): string of pre-trained model shortcut. Default 'bert-base-cased'\n",
+ " negative_sampling (str): negative sampling types. Default tfn\n",
+ " loss_function (str): type of loss function to use for transformer\n",
+ " training. Default 'squared-hinge'\n",
+ " bootstrap_method (str): algorithm to bootstrap text_model. If not None, initialize\n",
+ " TransformerMatcher projection layer with one of:\n",
+ " 'linear' (default): linear model trained on final embeddings of parent layer\n",
+ " 'inherit': inherit weights from parent labels\n",
+ " lr_schedule (str): learning rate schedule. See transformers.SchedulerType for details.\n",
+ " Default 'linear'\n",
+ "\n",
+ " threshold (float): threshold to sparsify the model weights. Default 0.1\n",
+ " hidden_dropout_prob (float): hidden dropout prob in deep transformer models. Default 0.1\n",
+ " batch_size (int): batch size for transformer training. Default 8\n",
+ " batch_gen_workers (int): number of workers for batch generation. Default 4\n",
+ " max_active_matching_labels (int): max number of active matching labels,\n",
+ " will sub-sample from existing negative samples if necessary. Default None\n",
+ " to ignore\n",
+ " max_num_labels_in_gpu (int): Upper limit on labels to put output layer in GPU.\n",
+ " Default 65536.\n",
+ " max_steps (int): if > 0: set total number of training steps to perform.\n",
+ " Override num-train-epochs. Default -1.\n",
+ " max_no_improve_cnt (int): if > 0, training will stop when this number of\n",
+ " validation steps result in no improvement. Default -1.\n",
+ " num_train_epochs (int): total number of training epochs to perform. Default 5\n",
+ " gradient_accumulation_steps (int): number of updates steps to accumulate\n",
+ " before performing a backward/update pass. Default 1.\n",
+ " weight_decay (float): weight decay rate for regularization. Default 0 to ignore\n",
+ " max_grad_norm (float): max gradient norm used for gradient clipping. Default 1.0\n",
+ " learning_rate (float): maximum learning rate for Adam. Default 5e-5\n",
+ " adam_epsilon (float): epsilon for Adam optimizer.Default 1e-8\n",
+ " warmup_steps (float): learning rate warmup over warmup-steps. Default 0\n",
+ " logging_steps (int): log training information every NUM updates steps. Default 50\n",
+ " save_steps (int): save checkpoint every NUM updates steps. Default 100\n",
+ "\n",
+ " cost_sensitive_ranker (bool, optional): if True, use clustering count aggregating for ranker's cost-sensitive learnin\n",
+ " Default False\n",
+ " pre_tokenize (bool, optional): if True, will tokenize training instances before training\n",
+ " This could potentially accelerate batch-generation but increases memory cost.\n",
+ " Default False\n",
+ " use_gpu (bool, optional): whether to use GPU even if available. Default True\n",
+ " eval_by_true_shorlist (bool, optional): if True, will compute validation scores by true label\n",
+ " shortlisting at intermediat layer. Default False\n",
+ "\n",
+ " checkpoint_dir (str): path to save training checkpoints. Default empty to use a temp dir.\n",
+ " cache_dir (str): dir to store the pre-trained models downloaded from\n",
+ " s3. Default empty to use a temp dir.\n",
+ " init_model_dir (str): path to load checkpoint of TransformerMatcher. If given,\n",
+ " start from the given checkpoint rather than downloading a\n",
+ " pre-trained model from S3. Default empty to ignore\n",
+ " \n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc.xtransformer.matcher import TransformerMatcher; print(TransformerMatcher.TrainParams.__doc__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38cc7480-d6a7-4c81-96cf-cb35aa823fc2",
+ "metadata": {},
+ "source": [
+ "For the `eurlex-4k` model, we are fine-tuning the `bert-base-uncased` pre-trained model at 3 levels of the preliminary HLT:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "db634cb8-6763-411a-8f6e-6100ff1d4ead",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "========== matcher_params_chain[0] (len=3) ==========\n",
+ "{\n",
+ " \"__meta__\": {\n",
+ " \"class_fullname\": \"pecos.xmc.xtransformer.matcher###TransformerMatcher.TrainParams\"\n",
+ " },\n",
+ " \"adam_epsilon\": 1e-08,\n",
+ " \"batch_gen_workers\": 16,\n",
+ " \"batch_size\": 32,\n",
+ " \"bootstrap_method\": \"weighted-linear\",\n",
+ " \"cache_dir\": \"\",\n",
+ " \"checkpoint_dir\": \"\",\n",
+ " \"cost_sensitive_ranker\": false,\n",
+ " \"eval_by_true_shorlist\": false,\n",
+ " \"gradient_accumulation_steps\": 1,\n",
+ " \"hidden_dropout_prob\": 0.1,\n",
+ " \"init_model_dir\": \"\",\n",
+ " \"learning_rate\": 5e-05,\n",
+ " \"logging_steps\": 50,\n",
+ " \"loss_function\": \"weighted-squared-hinge\",\n",
+ " \"lr_schedule\": \"linear\",\n",
+ " \"max_active_matching_labels\": 1000,\n",
+ " \"max_grad_norm\": 1.0,\n",
+ " \"max_no_improve_cnt\": -1,\n",
+ " \"max_num_labels_in_gpu\": 65536,\n",
+ " \"max_steps\": 600,\n",
+ " \"model_shortcut\": \"bert-base-uncased\",\n",
+ " \"negative_sampling\": \"tfn+man\",\n",
+ " \"num_train_epochs\": 10,\n",
+ " \"pre_tensorize_labels\": false,\n",
+ " \"pre_tokenize\": false,\n",
+ " \"save_steps\": 200,\n",
+ " \"threshold\": 0.001,\n",
+ " \"use_gpu\": true,\n",
+ " \"warmup_steps\": 100,\n",
+ " \"weight_decay\": 0.0\n",
+ "}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"=\"*10, f\"matcher_params_chain[0] (len={len(eurlex4k_train_params.matcher_params_chain)})\", \"=\"*10)\n",
+ "print(json.dumps(eurlex4k_train_params.matcher_params_chain[0].to_dict(), sort_keys=True, indent=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "888b1b30-a300-42e2-b8e0-295c426fe655",
+ "metadata": {},
+ "source": [
+ "Though the best parameters may vary a lot across tasks, there are some common notes you should always keep in mind:\n",
+ "* It's recommended to finish at least one epoch at each level, so that the model visits the label matrix at least once.\n",
+ "    * i.e. `max_steps * batch_size * num_gpus > num_instances` (if `max_steps` is null, it will be inferred from `num_train_epochs`)\n",
+ "* `model_shortcut` will only be used in the first fine-tuning layer, as the later ones will just continue on the same encoder.\n",
+ "* The learning rate and its schedule are controlled by `learning_rate`, `lr_schedule`, `warmup_steps` and `max_steps`. For more info, refer to: https://huggingface.co/docs/transformers/main_classes/optimizer_schedules"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9a7f5f39-9d20-4776-970c-2d46464ea983",
+ "metadata": {},
+ "source": [
+ "#### 2.2.1 Use pre-trained models\n",
+ "\n",
+ "There are two ways to provide a pre-trained Transformer encoder:\n",
+ "* **Download from the Hugging Face repo** (https://huggingface.co/models): model name provided by `model_shortcut` (e.g. `bert-base-uncased` or `w11wo/javanese-distilbert-small`).\n",
+ "* **Load from local disk**: model path provided by `init_model_dir`. The model should be loadable through `TransformerMatcher.load()`.\n",
+ "\n",
+ "Note that both `model_shortcut` and `init_model_dir` will only be used in the first fine-tuning layer, as the later ones will just continue from the final state of the parent encoder.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "9570261c-10b4-4c74-b018-0bd3c8b524d7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " model loaded with encoder_type=distilbert num_labels=2\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "import scipy.sparse as smat\n",
+ "from pecos.xmc.xtransformer.matcher import TransformerMatcher\n",
+ "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
+ "\n",
+ "init_model_dir = \"work_dir/my_pre_trained_model\"\n",
+ "os.makedirs(init_model_dir, exist_ok=True)\n",
+ "\n",
+ "# example to use your own pre-trained model, here we use huggingface model as an example\n",
+ "my_tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n",
+ "my_encoder = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-uncased\")\n",
+ "\n",
+ "# do my own pre-training/tuning/etc\n",
+ "# ...\n",
+ "\n",
+ "# save my own model to disk\n",
+ "my_tokenizer.save_pretrained(f\"{init_model_dir}/text_tokenizer\")\n",
+ "my_encoder.save_pretrained(f\"{init_model_dir}/text_encoder\")\n",
+ "\n",
+ "# the directory can now be passed as the `init_model_dir` training parameter to serve as the initial model.\n",
+ "# Sanity check: verify that this dir can be loaded via TransformerMatcher.load()\n",
+ "matcher = TransformerMatcher.load(init_model_dir)\n",
+ "print(f\"{matcher.__class__} model loaded with encoder_type={matcher.model_type} num_labels={matcher.nr_labels}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "7561e709-f14c-47c4-b14b-d80c62d156bf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%bash\n",
+ "DATASET=\"eurlex-4k\"\n",
+ "wget -q https://archive.org/download/xr-transformer-encoders/${DATASET}.tar.gz\n",
+ "mkdir -p ./work_dir/xr-transformer-encoder\n",
+ "tar -zxf ./${DATASET}.tar.gz -C ./work_dir/xr-transformer-encoder"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "9e04d9f9-0e1c-4000-a947-a417d04e84b7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " model loaded with encoder_type=bert num_labels=3956\n"
+ ]
+ }
+ ],
+ "source": [
+ "matcher = TransformerMatcher.load(\"./work_dir/xr-transformer-encoder/eurlex-4k/bert/text_encoder\")\n",
+ "print(f\"{matcher.__class__} model loaded with encoder_type={matcher.model_type} num_labels={matcher.nr_labels}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f374aaf-c2cd-4daf-937b-f5974b0d4a75",
+ "metadata": {},
+ "source": [
+ "#### 2.2.2 Bootstrapping and Cost-Sensitive Learning\n",
+ "\n",
+ "We provide three options to bootstrap the XMC head at the child level (i.e. W^(t+1)) from the parent level (i.e. W^(t)):\n",
+ "* `bootstrap_method=None`: No bootstrap; W^(t+1) will be randomly initialized.\n",
+ "* `bootstrap_method='inherit'`: Bootstrap by inheriting the weight vector from the parent node.\n",
+ "* `bootstrap_method='linear'` (default): A linear model is trained on the final embeddings of the parent layer and used as the initial point for W^(t+1).\n",
+ "\n",
+ "In most cases the default linear bootstrapper gives a good enough initial point for the XMC heads.\n",
+ "Compared with the linear bootstrapper, the inherit bootstrapper has less memory/time overhead."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2be8776a-6e46-488c-97cd-d72a5842078b",
+ "metadata": {},
+ "source": [
+ "XR-Transformer can take the magnitude of label strength into consideration via cost-sensitive learning.\n",
+ "This is available even when the input label matrix is binary. In this case, the cost (at non-leaf levels) will be inferred via label aggregation.\n",
+ "\n",
+ "To use cost-sensitive fine-tuning, use the `weighted-` versions of the loss functions, i.e."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "6217ae9b-eb10-4dd7-a41d-3b36024ea915",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['weighted-hinge', 'weighted-squared-hinge']\n"
+ ]
+ }
+ ],
+ "source": [
+ "print([lf for lf in TransformerMatcher.LOSS_FUNCTION_TYPES.keys() if 'weighted-' in lf])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5a01e03b-427d-47c9-ba5d-8ae45e647996",
+ "metadata": {},
+ "source": [
+ "### 2.3 Linear models with concatenated feature\n",
+ "\n",
+ "The training of linear models is controlled by the `ranker_params`, which is of the same format as PECOS XR-Linear.\n",
+ "\n",
+ "Users should pay special attention to `threshold`, which controls the sparsification of the final linear models.\n",
+ "Unlike models trained on purely sparse features, linear models trained on concatenated sparse+dense features are more sensitive to sparsification.\n",
+ "Usually `threshold=0.01` or `0.001` is recommended for XR-Transformer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "f8531b6d-f96f-476e-836e-01a2f6daa35f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "prec = 84.97 78.05 71.25 64.93 58.97 53.42 48.24 43.70 39.92 36.81\n",
+ "recall = 17.26 31.35 42.42 51.00 57.39 62.01 65.08 67.23 68.95 70.56\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pecos.xmc.xtransformer.module import MLProblemWithText\n",
+ "prob = MLProblemWithText(X_txt, Y, X_feat=X_feat)\n",
+ "\n",
+ "# disable fine-tuning, use pre-trained bert model from huggingface\n",
+ "eurlex4k_train_params.do_fine_tune = False\n",
+ "\n",
+ "xtf_pretrained = XTransformer.train(\n",
+ " prob,\n",
+ " clustering=preliminary_hlt,\n",
+ " train_params=eurlex4k_train_params,\n",
+ " pred_params=eurlex4k_pred_params,\n",
+ ")\n",
+ "\n",
+ "X_feat_tst = smat_util.load_matrix(\"work_dir/xmc-base/eurlex-4k/X.tst.npz\", dtype=np.float32)\n",
+ "Y_tst = smat_util.load_matrix(\"work_dir/xmc-base/eurlex-4k/Y.tst.npz\", dtype=np.float32)\n",
+ "\n",
+ "with open(\"work_dir/xmc-base/eurlex-4k/X.tst.txt\", 'r') as fin:\n",
+ " X_txt_tst = [xx.strip() for xx in fin.readlines()]\n",
+ "\n",
+ "P_pretrained = xtf_pretrained.predict(X_txt_tst, X_feat=X_feat_tst)\n",
+ "metrics = smat_util.Metrics.generate(Y_tst, P_pretrained, topk=10)\n",
+ "print(metrics)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "34ecca84-75a5-4d38-b1ad-3256761fb695",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "prec = 87.17 80.98 74.50 68.13 61.61 55.58 50.18 45.46 41.51 38.12\n",
+ "recall = 17.72 32.60 44.41 53.52 59.92 64.44 67.58 69.79 71.55 72.89\n"
+ ]
+ }
+ ],
+ "source": [
+ "# use fine-tuned bert model\n",
+ "eurlex4k_train_params.matcher_params_chain[0].init_model_dir = \"./work_dir/xr-transformer-encoder/eurlex-4k/bert/text_encoder\"\n",
+ "\n",
+ "xtf_fine_tuned = XTransformer.train(\n",
+ " prob,\n",
+ " clustering=preliminary_hlt,\n",
+ " train_params=eurlex4k_train_params,\n",
+ " pred_params=eurlex4k_pred_params,\n",
+ ")\n",
+ "\n",
+ "P_fine_tuned = xtf_fine_tuned.predict(X_txt_tst, X_feat=X_feat_tst)\n",
+ "metrics = smat_util.Metrics.generate(Y_tst, P_fine_tuned, topk=10)\n",
+ "print(metrics)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "175c3e79-97bb-4cfa-968b-5a018c5ef2f9",
+ "metadata": {},
+ "source": [
+ "# (BETA) Distributed PECOS\n",
+ "\n",
+ "`pecos.distributed` is a PECOS module that enables distributed training.\n",
+ "\n",
+ "Currently the following sub-modules are implemented:\n",
+ "\n",
+ "* Distributed X-Linear ([`pecos.distributed.xmc.xlinear`](xmc/xlinear/README.md))\n",
+ "\n",
+ "We are working on implementing more distributed algorithms for existing PECOS models. Please watch for our newest releases."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "de65b077-2339-48b6-92e9-554d676a2b6b",
+ "metadata": {},
+ "source": [
+ "## 1. Distributed XR-Linear\n",
+ "\n",
+ "`pecos.distributed.xmc.xlinear` enables distributed training for PECOS XLinear model (`pecos.xmc.xlinear`).\n",
+ "\n",
+ "### Prerequisites\n",
+ "\n",
+ "* **Hardware**: \n",
+ " * Cluster of machines connected by network which can password-less SSH to each other.\n",
+ " * IP address of every machine in the cluster is known.\n",
+ " * Shared network disk mounted on all machines.\n",
+ " * For accessing data and saving trained models.\n",
+ "\n",
+ "* **Software**: Install the following software on **every** machine of your cluster\n",
+ " * MPI and mpi4py\n",
+ " \n",
+ "Because of hardware constraints during the tutorial, we only include a basic example in local (single-machine) mode here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "2834997c-4bd9-4732-acb6-f7e4e7f6a090",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Using built-in specs.\n",
+ "COLLECT_GCC=/usr/bin/gcc\n",
+ "COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper\n",
+ "Target: x86_64-redhat-linux\n",
+ "Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-libmpx --enable-libsanitizer --enable-gnu-indirect-function --enable-libcilkrts --enable-libatomic --enable-libquadmath --enable-libitm --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux\n",
+ "Thread model: posix\n",
+ "gcc version 7.3.1 20180712 (Red Hat 7.3.1-15) (GCC) \n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "/opt/amazon/openmpi/bin/mpiexec\n",
+ "Hello, World! I am process 3 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 0 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 1 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 2 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 4 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 5 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 6 of 8 on ip-172-31-8-94.ec2.internal.\n",
+ "Hello, World! I am process 7 of 8 on ip-172-31-8-94.ec2.internal.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%bash\n",
+ "# check the required dependencies\n",
+ "mpicc -v\n",
+ "which mpiexec\n",
+ "python3 -m pip install mpi4py\n",
+ "mpiexec -n 8 python3 -m mpi4py.bench helloworld"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b04cd3aa-3d7a-4e11-8b15-5074d24525ec",
+ "metadata": {},
+ "source": [
+ "### Basic Usage\n",
+ "\n",
+ "Below is a simple showcase of the usage of `pecos.distributed.xmc.xlinear.train`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "f3b7e689-480d-4c81-a408-d4b840695880",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ },
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "07/07/2022 22:33:34 - INFO - pecos.utils.profile_util - psutil module installed, will print memory info.\n",
+ "07/07/2022 22:33:34 - INFO - __main__ - Started loading data on Rank 1 ... RSS 91.0 MB. Full mem info: pmem(rss=95412224, vms=35591151616, shared=47382528, text=2732032, lib=0, data=166395904, dirty=0)\n",
+ "07/07/2022 22:33:34 - INFO - pecos.utils.profile_util - psutil module installed, will print memory info.\n",
+ "07/07/2022 22:33:34 - INFO - __main__ - Started loading data on Rank 0 ... RSS 91.2 MB. Full mem info: pmem(rss=95682560, vms=35591151616, shared=47669248, text=2732032, lib=0, data=166395904, dirty=0)\n",
+ "07/07/2022 22:33:34 - INFO - __main__ - Done loading data on Rank 0. RSS 126.0 MB. Full mem info: pmem(rss=132136960, vms=35626360832, shared=47800320, text=2732032, lib=0, data=201605120, dirty=0)\n",
+ "07/07/2022 22:33:34 - INFO - pecos.distributed.xmc.base - Starts creating label embedding PIFA for meta tree on Rank 0 node... RSS 126.0 MB. Full mem info: pmem(rss=132136960, vms=35626360832, shared=47800320, text=2732032, lib=0, data=201605120, dirty=0)\n",
+ "07/07/2022 22:33:34 - INFO - __main__ - Done loading data on Rank 1. RSS 125.8 MB. Full mem info: pmem(rss=131928064, vms=35626360832, shared=47579136, text=2732032, lib=0, data=201605120, dirty=0)\n",
+ "07/07/2022 22:33:34 - INFO - pecos.distributed.xmc.base - Done creating label embedding PIFA for meta tree on Rank 0 node. RSS 198.7 MB. Full mem info: pmem(rss=208375808, vms=35776598016, shared=48451584, text=2732032, lib=0, data=285478912, dirty=0)\n",
+ "07/07/2022 22:33:34 - INFO - pecos.distributed.xmc.base - Starts generating meta tree cluster on main node...\n",
+ "07/07/2022 22:33:34 - INFO - pecos.distributed.xmc.base - Determined meta-tree leaf clusters number: 4. 2 nodes will train 4 sub-trees. Number of data labels: 3956, nr_splits: 16\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done generating meta tree cluster. RSS 225.4 MB. Full mem info: pmem(rss=236306432, vms=35804254208, shared=48713728, text=2732032, lib=0, data=313135104, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Rank 0 get 2 sub-tree assignments.\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Rank 1 get 2 sub-tree assignments.\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - On rank 0, 0th sub-tree assignment has 989 labels: [0, 1, 2, 4, 5, 6, 7, 8, 9, 11]...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - On rank 1, 0th sub-tree assignment has 989 labels: [18, 22, 31, 32, 35, 37, 38, 39, 57, 62]...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts creating label embedding PIFA for 0th sub-tree on rank 0... RSS 155.5 MB. Full mem info: pmem(rss=163016704, vms=35730997248, shared=48934912, text=2732032, lib=0, data=239878144, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts creating label embedding PIFA for 0th sub-tree on rank 1... RSS 125.8 MB. Full mem info: pmem(rss=131928064, vms=35626622976, shared=47579136, text=2732032, lib=0, data=201867264, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done creating label embedding PIFA for 0th sub-tree on rank 0. RSS 160.6 MB. Full mem info: pmem(rss=168361984, vms=35734740992, shared=49242112, text=2732032, lib=0, data=245108736, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts generating 0th sub-tree cluster on rank 0...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done creating label embedding PIFA for 0th sub-tree on rank 1. RSS 148.8 MB. Full mem info: pmem(rss=156049408, vms=35724161024, shared=48783360, text=2732032, lib=0, data=233041920, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts generating 0th sub-tree cluster on rank 1...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done generating 0th sub-tree cluster on rank 0. RSS 160.8 MB. Full mem info: pmem(rss=168628224, vms=35734740992, shared=49242112, text=2732032, lib=0, data=245108736, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - On rank 0, 1th sub-tree assignment has 989 labels: [3, 10, 12, 13, 17, 21, 26, 33, 34, 36]...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts creating label embedding PIFA for 1th sub-tree on rank 0... RSS 160.8 MB. Full mem info: pmem(rss=168628224, vms=35734740992, shared=49242112, text=2732032, lib=0, data=245108736, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done creating label embedding PIFA for 1th sub-tree on rank 0. RSS 179.4 MB. Full mem info: pmem(rss=188153856, vms=35754147840, shared=49242112, text=2732032, lib=0, data=264515584, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts generating 1th sub-tree cluster on rank 0...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done generating 0th sub-tree cluster on rank 1. RSS 159.7 MB. Full mem info: pmem(rss=167444480, vms=35735547904, shared=48979968, text=2732032, lib=0, data=244428800, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - On rank 1, 1th sub-tree assignment has 989 labels: [14, 20, 30, 46, 50, 54, 60, 78, 85, 100]...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts creating label embedding PIFA for 1th sub-tree on rank 1... RSS 159.7 MB. Full mem info: pmem(rss=167444480, vms=35735547904, shared=48979968, text=2732032, lib=0, data=244428800, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done generating 1th sub-tree cluster on rank 0. RSS 179.4 MB. Full mem info: pmem(rss=188153856, vms=35754147840, shared=49242112, text=2732032, lib=0, data=264515584, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done creating label embedding PIFA for 1th sub-tree on rank 1. RSS 176.8 MB. Full mem info: pmem(rss=185393152, vms=35751538688, shared=49176576, text=2732032, lib=0, data=261906432, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts generating 1th sub-tree cluster on rank 1...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done generating 1th sub-tree cluster on rank 1. RSS 176.8 MB. Full mem info: pmem(rss=185393152, vms=35751538688, shared=49176576, text=2732032, lib=0, data=261906432, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Starts assmebling cluster chain... RSS 129.2 MB. Full mem info: pmem(rss=135426048, vms=35701280768, shared=49242112, text=2732032, lib=0, data=211648512, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done assmebling cluster chain. Split depth: 1. Chain length: 3 RSS 129.2 MB. Full mem info: pmem(rss=135426048, vms=35701280768, shared=49242112, text=2732032, lib=0, data=211648512, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Broadcasting distributed cluster chain from Node 0...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - Done broadcast distributed cluster chain from Node 0.\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - meta, sub negative samples: 32 61\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Starts receiving sub-training jobs from source 0 for rank 1...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - meta_tree_leaf_cluster: (3956, 64)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Main node workload: 69941.31147540984\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Min worker node workload, machine rank: (69387, 0). Max worker node workload, machine rank: (69387, 0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Training jobs for all Sub-trees divided onto 2 machines: Main node will train for 13 sub-trees, Worker nodes will train for [51] sub-trees, worker receive order: [1].\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Starts sending sub-training jobs from node 0 to 1...\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Done sending sub-training jobs from node 0 to 1.\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Rank 0 starts meta-tree training... RSS 130.0 MB. Full mem info: pmem(rss=136282112, vms=35702648832, shared=49307648, text=2732032, lib=0, data=213016576, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Done receiving sub-training jobs from source 0 for rank 1.\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Rank 1 get 51 sub-trees to train\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.xlinear.model - Rank 1 starts sub-tree training... RSS 129.0 MB. Full mem info: pmem(rss=135274496, vms=35701280768, shared=49176576, text=2732032, lib=0, data=211648512, dirty=0)\n",
+ "07/07/2022 22:33:35 - INFO - pecos.distributed.xmc.base - meta_tree_leaf_cluster: (3956, 64)\n",
+ "07/07/2022 22:33:39 - INFO - pecos.distributed.xmc.xlinear.model - Rank 0 done meta-tree training. RSS 163.5 MB. Full mem info: pmem(rss=171479040, vms=35735203840, shared=49307648, text=2732032, lib=0, data=250142720, dirty=0)\n",
+ "07/07/2022 22:33:39 - INFO - pecos.distributed.xmc.xlinear.model - Rank 0 get 13 sub-trees to train\n",
+ "07/07/2022 22:33:39 - INFO - pecos.distributed.xmc.xlinear.model - Rank 0 starts sub-tree training... RSS 163.5 MB. Full mem info: pmem(rss=171479040, vms=35735203840, shared=49307648, text=2732032, lib=0, data=250142720, dirty=0)\n",
+ "07/07/2022 22:33:42 - INFO - pecos.distributed.xmc.xlinear.model - Rank 0 total 13 sub-tree training finished. RSS 163.5 MB. Full mem info: pmem(rss=171479040, vms=35735203840, shared=49307648, text=2732032, lib=0, data=250142720, dirty=0)\n",
+ "07/07/2022 22:33:42 - INFO - pecos.distributed.xmc.xlinear.model - Main node start recv 51 sub-tree models from rank 1\n",
+ "07/07/2022 22:33:48 - INFO - pecos.distributed.xmc.xlinear.model - Rank 1 total 51 sub-tree training finished. RSS 148.8 MB. Full mem info: pmem(rss=155975680, vms=35721224192, shared=49369088, text=2732032, lib=0, data=232181760, dirty=0)\n",
+ "07/07/2022 22:33:48 - INFO - pecos.distributed.xmc.xlinear.model - Rank 1 node starts sending 51 sub-tree models.\n",
+ "07/07/2022 22:33:48 - INFO - pecos.distributed.xmc.xlinear.model - Main node done receive 51 sub-tree models from rank 1\n",
+ "07/07/2022 22:33:48 - INFO - pecos.distributed.xmc.xlinear.model - Rank 1 node done sending 51 sub-tree models.\n",
+ "07/07/2022 22:33:48 - INFO - pecos.distributed.xmc.xlinear.model - Reconstruct full model on Rank 0 node... RSS 163.9 MB. Full mem info: pmem(rss=171864064, vms=35735465984, shared=49446912, text=2732032, lib=0, data=250404864, dirty=0)\n",
+ "07/07/2022 22:33:48 - INFO - pecos.distributed.xmc.xlinear.model - Done reconstruct full model on Rank 0 node. RSS 164.7 MB. Full mem info: pmem(rss=172675072, vms=35735990272, shared=49446912, text=2732032, lib=0, data=250929152, dirty=0)\n",
+ "07/07/2022 22:33:48 - INFO - __main__ - Saving model to work_dir/dist_xlinear_model...\n",
+ "07/07/2022 22:33:49 - INFO - __main__ - Done saving model.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%bash\n",
+ "mpiexec -n 2 \\\n",
+ "python3 -m pecos.distributed.xmc.xlinear.train \\\n",
+ "-x work_dir/xmc-base/eurlex-4k/X.trn.npz \\\n",
+ "-y work_dir/xmc-base/eurlex-4k/Y.trn.npz \\\n",
+ "-m work_dir/dist_xlinear_model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "afb08655-7de1-426c-b203-0c012ac50c7a",
+ "metadata": {},
+ "source": [
+ "We did not set up a multi-node cluster, so only a single machine is used here. In practice, you can list your machines' addresses in a `hostfile` and launch distributed training as follows (`-f` is the MPICH flag; Open MPI uses `--hostfile` instead):\n",
+ "```\n",
+ "mpiexec -f hostfile -n ${NUM_MACHINE} python3 -m pecos.distributed.xmc.xlinear.train [..]\n",
+ "```\n",
+ "\n",
+ "A model trained in distributed mode is serialized in the same format as a model trained on a single node, so we can run prediction and evaluation in the same way:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "b39cbf96-6cf2-4721-86d7-90cc27bdbe1f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "==== evaluation results ====\n",
+ "prec = 82.25 74.92 68.58 62.92 57.58 52.55 47.68 43.56 40.08 37.08\n",
+ "recall = 16.62 29.97 40.75 49.33 55.96 60.91 64.26 66.97 69.14 70.97\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%bash\n",
+ "python3 -m pecos.xmc.xlinear.predict \\\n",
+ "-x work_dir/xmc-base/eurlex-4k/X.tst.npz \\\n",
+ "-y work_dir/xmc-base/eurlex-4k/Y.tst.npz \\\n",
+ "-m work_dir/dist_xlinear_model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed14fb96-7ab3-4e1c-8b23-f1ef76d83a0d",
+ "metadata": {},
+ "source": [
+ "## Distributed Training Algorithm\n",
+ "\n",
+ "Because the PECOS XR-Linear model is separable, we can split the original problem into multiple independent problems:\n",
+ "* **One meta-problem**: an XMC problem that matches an input X to K clusters.\n",
+ "* **K sub-problems**: each is an XMC problem that ranks the labels within one of the K clusters for an input X.\n",
+ "\n",
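+ "The split above can be illustrated with a minimal sketch in plain Python. It is not the PECOS implementation: the function name `split_into_meta_and_sub_problems`, the toy label sets, and the label-to-cluster assignment are all hypothetical, and the sketch assumes that each sub-problem keeps only the instances with at least one positive label in its cluster.\n",
+ "\n",
+ "```python\n",
+ "def split_into_meta_and_sub_problems(Y, cluster_of_label, num_clusters):\n",
+ "    '''Split a multi-label problem into one meta-problem and K sub-problems.\n",
+ "\n",
+ "    Y: list of sets, the positive label ids of each instance.\n",
+ "    cluster_of_label: list mapping label id -> cluster id (assumed given).\n",
+ "    '''\n",
+ "    # Meta-problem targets: which clusters are relevant to each instance.\n",
+ "    meta_targets = []\n",
+ "    # Sub-problem targets: per cluster, instance id -> positive labels in that cluster.\n",
+ "    sub_targets = [dict() for _ in range(num_clusters)]\n",
+ "    for i, labels in enumerate(Y):\n",
+ "        clusters_hit = set()\n",
+ "        for label in labels:\n",
+ "            c = cluster_of_label[label]\n",
+ "            clusters_hit.add(c)\n",
+ "            sub_targets[c].setdefault(i, set()).add(label)\n",
+ "        meta_targets.append(clusters_hit)\n",
+ "    return meta_targets, sub_targets\n",
+ "\n",
+ "\n",
+ "# Toy data (hypothetical): 3 instances, 4 labels, K = 2 clusters.\n",
+ "Y = [{0, 3}, {1}, {2, 3}]\n",
+ "cluster_of_label = [0, 0, 1, 1]  # labels 0, 1 -> cluster 0; labels 2, 3 -> cluster 1\n",
+ "meta, subs = split_into_meta_and_sub_problems(Y, cluster_of_label, 2)\n",
+ "print('meta-problem targets:', meta)\n",
+ "for c, sub in enumerate(subs):\n",
+ "    print('sub-problem', c, 'targets:', sub)\n",
+ "```\n",
+ "\n",
+ "Each sub-problem depends only on the labels of its own cluster, so once the clustering is fixed the K sub-problems can be trained on different machines.\n",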
+ "\n",
+ "\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1522bf81-3d13-4821-815e-ddb24cb412d0",
+ "metadata": {},
+ "source": [
+ "In addition to distributed training, `pecos.distributed.xmc.xlinear` also has the following features:\n",
+ "\n",
+ "* **Distributed Hierarchical Clustering**: We leverage the same meta/sub-problem split to build the hierarchical label tree. Because building label features for a huge dataset can be memory intensive on the meta node, we provide an option to use a simpler label embedding for meta-tree generation: `--meta-label-embedding-method pii`\n",
+ "* **Load Balancing**: Because of the long-tail label distribution in most XMC problems, the workload of training each sub-problem varies a lot. To address this, the distributed training algorithm performs load balancing when K > #workers. The number of sub-trees K can be controlled via `--min-n-sub-tree`.\n"
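+ ,
+ "\n",
+ "The load-balancing idea can be sketched with a generic greedy heuristic (largest job first, always given to the least-loaded worker). This is only an illustration under stated assumptions, not the scheduler PECOS uses: the function name `balance_sub_problems` and the workload numbers are hypothetical.\n",
+ "\n",
+ "```python\n",
+ "import heapq\n",
+ "\n",
+ "\n",
+ "def balance_sub_problems(workloads, num_workers):\n",
+ "    '''Greedily assign sub-problems to workers, largest workload first.'''\n",
+ "    # Min-heap of (current total load, worker id).\n",
+ "    heap = [(0, w) for w in range(num_workers)]\n",
+ "    assignment = [[] for _ in range(num_workers)]\n",
+ "    for job in sorted(range(len(workloads)), key=lambda j: -workloads[j]):\n",
+ "        load, w = heapq.heappop(heap)  # least-loaded worker so far\n",
+ "        assignment[w].append(job)\n",
+ "        heapq.heappush(heap, (load + workloads[job], w))\n",
+ "    return assignment\n",
+ "\n",
+ "\n",
+ "# Hypothetical long-tailed workloads for K = 8 sub-problems on 2 workers.\n",
+ "workloads = [90, 10, 10, 10, 40, 30, 5, 5]\n",
+ "assignment = balance_sub_problems(workloads, 2)\n",
+ "loads = [sum(workloads[j] for j in jobs) for jobs in assignment]\n",
+ "print('assignment:', assignment)\n",
+ "print('loads:', loads)\n",
+ "```\n",
+ "\n",
+ "With K larger than the number of workers, one heavy sub-problem can be offset by several light ones, which is why a larger K gives the balancer more room."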
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/kdd22/imgs/dist-xlinear.png b/tutorials/kdd22/imgs/dist-xlinear.png
new file mode 100644
index 0000000..c7f2e95
Binary files /dev/null and b/tutorials/kdd22/imgs/dist-xlinear.png differ
diff --git a/tutorials/kdd22/imgs/hnsw_example.png b/tutorials/kdd22/imgs/hnsw_example.png
new file mode 100644
index 0000000..93f59b3
Binary files /dev/null and b/tutorials/kdd22/imgs/hnsw_example.png differ
diff --git a/tutorials/kdd22/imgs/illus_customized_model.jpg b/tutorials/kdd22/imgs/illus_customized_model.jpg
new file mode 100644
index 0000000..a292f01
Binary files /dev/null and b/tutorials/kdd22/imgs/illus_customized_model.jpg differ
diff --git a/tutorials/kdd22/imgs/pecos_beam_search.png b/tutorials/kdd22/imgs/pecos_beam_search.png
new file mode 100644
index 0000000..f13ac0b
Binary files /dev/null and b/tutorials/kdd22/imgs/pecos_beam_search.png differ
diff --git a/tutorials/kdd22/imgs/pecos_matcher_ranker.png b/tutorials/kdd22/imgs/pecos_matcher_ranker.png
new file mode 100644
index 0000000..8302b8b
Binary files /dev/null and b/tutorials/kdd22/imgs/pecos_matcher_ranker.png differ
diff --git a/tutorials/kdd22/imgs/pecos_spmm.png b/tutorials/kdd22/imgs/pecos_spmm.png
new file mode 100644
index 0000000..6f7278b
Binary files /dev/null and b/tutorials/kdd22/imgs/pecos_spmm.png differ
diff --git a/tutorials/kdd22/imgs/pecos_xmr_framework.png b/tutorials/kdd22/imgs/pecos_xmr_framework.png
new file mode 100644
index 0000000..dcef9d1
Binary files /dev/null and b/tutorials/kdd22/imgs/pecos_xmr_framework.png differ