From 8b51f95c40bf689cceaf0b16c6f8ea7d79f20d0b Mon Sep 17 00:00:00 2001 From: Jyun-Yu Jiang Date: Sat, 13 Aug 2022 07:21:37 +0000 Subject: [PATCH] Update KDD22 Tutorial Instructions, Session 2, and Session 5 --- tutorials/kdd22/README.md | 27 +- ...ulti-label Classification with PECOS.ipynb | 1458 ++++++++++++++++ ...treme Multi-label Ranking with PECOS.ipynb | 1505 ----------------- .../kdd22/Session 4 Utilities in PECOS.ipynb | 1022 ++++++----- tutorials/kdd22/imgs/pecos_label_matrix.png | Bin 0 -> 73040 bytes tutorials/kdd22/imgs/pecos_pipeline.png | Bin 0 -> 18041 bytes tutorials/kdd22/imgs/pecos_xmc_examples.png | Bin 0 -> 48386 bytes 7 files changed, 1987 insertions(+), 2025 deletions(-) create mode 100644 tutorials/kdd22/Session 2 Extreme Multi-label Classification with PECOS.ipynb delete mode 100644 tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb create mode 100644 tutorials/kdd22/imgs/pecos_label_matrix.png create mode 100644 tutorials/kdd22/imgs/pecos_pipeline.png create mode 100644 tutorials/kdd22/imgs/pecos_xmc_examples.png diff --git a/tutorials/kdd22/README.md b/tutorials/kdd22/README.md index c6760bf5..6e6eddfc 100644 --- a/tutorials/kdd22/README.md +++ b/tutorials/kdd22/README.md @@ -15,10 +15,33 @@ By the end of the tutorial, we believe that attendees will be easily capable of |---|---|---| | 8:00 AM - 8:30 AM | Check-in and Environment Setup | | | 8:30 AM - 8:50 AM | Session 1: Introduction to PECOS | | -| 8:50 AM - 9:30 AM | Session 2: Extreme Multi-label Ranking with PECOS | [Notebook](https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%202%20Extreme%20Multi-label%20Ranking%20with%20PECOS.ipynb) | +| 8:50 AM - 9:30 AM | Session 2: Extreme Multi-label Classification with PECOS | [Notebook](https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%202%20Extreme%20Multi-label%20Classification%20with%20PECOS.ipynb) | | 9:30 AM - 10:00 AM | Coffee Break | | | 10:00 AM - 10:30 AM | Session 3: Approximate 
Nearest Neighbor (ANN) Search in PECOS | [Notebook](https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%203%20Approximate%20Nearest%20Neighbor%20Search%20in%20PECOS.ipynb) | | 10:30 AM - 11:10 AM | Session 4: Utilities in PECOS | [Notebook](https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%204%20Utilities%20in%20PECOS.ipynb) | -| 11:10 AM - 11:40 AM | Session 5: XR-Transformer cookbook and Distributed PECOS | [Notebook](https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%205%20XR-Transformer%20cookbook%20and%20Distributed%20PECOS.ipynb) | +| 11:10 AM - 11:40 AM | Session 5: eXtreme Multi-label Ranking (XMR) with XR-Transformer | [Notebook](https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%205%20eXtreme%20Multi-label%20Classification%20with%20Transformers.ipynb) | | 11:40 AM - 11:50 AM | Session 6: Research with PECOS | | | 11:50 AM - 12:00 PM | Closing Remarks | | + +## Tutorial Instructions + +### Miniconda Installation +```bash +mkdir -p ~/miniconda3 +wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh +bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3 +rm -rf ~/miniconda3/miniconda.sh +~/miniconda3/bin/conda init bash +~/miniconda3/bin/conda init zsh +``` + +### Tutorial Material Execution +```bash +conda create -n tutorial_env python=3.9 -y +conda activate tutorial_env +python -m pip install libpecos==0.4.0 matplotlib pandas requests jupyterlab +mkdir -p ~/pecos_tutorial_playground +cd ~/pecos_tutorial_playground +git clone https://github.com/amzn/pecos +python -m jupyterlab.labapp --ip=0.0.0.0 --port 8888 --no-browser --allow-root --notebook-dir=pecos/tutorials/kdd22 +``` \ No newline at end of file diff --git a/tutorials/kdd22/Session 2 Extreme Multi-label Classification with PECOS.ipynb b/tutorials/kdd22/Session 2 Extreme Multi-label Classification with PECOS.ipynb new file mode 100644 index 00000000..3d098965 --- /dev/null +++ 
b/tutorials/kdd22/Session 2 Extreme Multi-label Classification with PECOS.ipynb @@ -0,0 +1,1458 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "67e70878", + "metadata": {}, + "source": [ + "# eXtreme Multi-label Classification (XMC) Problem and PECOS\n", + "\n", + "Prediction for Enormous and Correlated Output Spaces (PECOS) is a versatile and modular machine learning framework for solving prediction problems with very large output spaces. For a given input instance, we apply PECOS to the eXtreme Multi-label Classification (XMC) problem to find and rank the most relevant items from an enormous but fixed and finite output space. Generally, PECOS trains an XMC model that takes numerical features to rank labels from the enormous output space. PECOS also provides feature extraction functions for text data, such as TF-IDF (this session) and Transformers (Session 5).\n", + "\n", + "

\n", + "\n", + "
\n", + "\n", + "Using PECOS, we can tackle many real-world large-scale applications with only a few commands or a small amount of code.\n", + "\n", + "

\n", + "\n", + "
\n", + "\n", + "\n", + "In this part of the tutorial, we will use XR-Linear as an example to demonstrate how to use PECOS to tackle real-world problems and understand the model architecture in PECOS." + ] + }, + { + "cell_type": "markdown", + "id": "11c281dc", + "metadata": {}, + "source": [ + "## Outline in this Session\n", + "\n", + "1. Experimental dataset preparation\n", + "2. Hands-on PECOS in only a few commands \n", + "3. Code with the PECOS library\n", + "4. Build your customized PECOS XR-Linear model" + ] + }, + { + "cell_type": "markdown", + "id": "41d87d24", + "metadata": {}, + "source": [ + "## 1. Experimental Dataset Preparation\n", + "\n", + "`eurlex-4k`, `wiki10-31k`, `amazoncat-13k`, `amazon-670k`, `wiki-500k`, and `amazon-3m` are available." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1073ac9c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "xmc-base/wiki10-31k/output-items.txt\r\n", + "xmc-base/wiki10-31k/tfidf-attnxml\r\n", + "xmc-base/wiki10-31k/tfidf-attnxml/X.trn.npz\r\n", + "xmc-base/wiki10-31k/tfidf-attnxml/X.tst.npz\r\n", + "xmc-base/wiki10-31k/X.trn.txt\r\n", + "xmc-base/wiki10-31k/X.tst.txt\r\n", + "xmc-base/wiki10-31k/Y.trn.npz\r\n", + "xmc-base/wiki10-31k/Y.trn.txt\r\n", + "xmc-base/wiki10-31k/Y.tst.npz\r\n", + "xmc-base/wiki10-31k/Y.tst.txt\r\n" + ] + } + ], + "source": [ + "DATASET = \"wiki10-31k\"\n", + "! wget -nv -nc https://archive.org/download/pecos-dataset/xmc-base/{DATASET}.tar.gz\n", + "! tar --skip-old-files -zxf {DATASET}.tar.gz \n", + "! 
find xmc-base/{DATASET}/*" + ] + }, + { + "cell_type": "markdown", + "id": "057fb642", + "metadata": {}, + "source": [ + "### Numerical Feature and Label Format in PECOS\n", + "\n", + "In PECOS, numerical features of instances can be in either a [dense NumPy matrix](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or a [Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) of shape `(nr_inst, nr_feat)`, where `nr_inst` and `nr_feat` are the numbers of instances and features. \n", + "\n", + "Similarly, labels of instances can also be represented as a dense or a sparse matrix of shape `(nr_inst, nr_labels)`, where `nr_labels` is the number of labels in the XMC problem. The following figure shows an example of the sparse matrix for label representations of six instances.\n", + "\n", + "

\n", + "\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "fcf8d41d", + "metadata": {}, + "source": [ + "## 2. Hands-on PECOS in Only a Few Commands\n", + "\n", + "PECOS provides convenient command-line interfaces to establish a pipeline from feature extraction to training and inference. Specifically, a PECOS XR-Linear model for text data can be established and evaluated using the following command-line modules without writing any code.\n", + "\n", + "* Text Vectorizer: `pecos.utils.featurization.text.preprocess`\n", + "* XR-Linear Train/Predict/Evaluate: `pecos.xmc.xlinear.train`, `pecos.xmc.xlinear.predict`, `pecos.xmc.xlinear.evaluate`\n", + "\n", + "All of these commands can be supplied with JSON-format configuration files (See Section 4.2.2 and Appendix 2). In this section, we first use the default setting to have a quick hands-on demo.\n", + "\n", + "\n", + "### 2.1. Text Vectorizer\n", + "\n", + "PECOS text vectorizer `pecos.utils.featurization.text.preprocess` extracts TF-IDF features. The options `build` and `run` learn the vectorizer and extract features, respectively. The `--help` argument will list all available command-line options. In the default setting, the vectorizer learns a unigram TF-IDF without any filtering. If you have prepared a JSON-format vectorizer configuration file, the argument `--vectorizer-config-path` can help customize the vectorizer (See Section 4.2.2)." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "6b47e6d2", + "metadata": {}, + "outputs": [], + "source": [ + "! python3 -m pecos.utils.featurization.text.preprocess build \\\n", + " --text-pos 0 \\\n", + " --input-text-path xmc-base/{DATASET}/X.trn.txt \\\n", + " --output-model-folder simplest.{DATASET}.vectorizer \\\n", + " --from-file true" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "870f6b0d", + "metadata": {}, + "outputs": [], + "source": [ + "! 
python3 -m pecos.utils.featurization.text.preprocess run \\\n", + " --text-pos 0 \\\n", + " --input-preprocessor-folder simplest.{DATASET}.vectorizer \\\n", + " --input-text-path xmc-base/{DATASET}/X.trn.txt \\\n", + " --output-inst-path simplest.{DATASET}.X.trn.npz \\\n", + " --from-file true\n", + "! python3 -m pecos.utils.featurization.text.preprocess run \\\n", + " --text-pos 0 \\\n", + " --input-preprocessor-folder simplest.{DATASET}.vectorizer \\\n", + " --input-text-path xmc-base/{DATASET}/X.tst.txt \\\n", + " --output-inst-path simplest.{DATASET}.X.tst.npz \\\n", + " --from-file true" + ] + }, + { + "cell_type": "markdown", + "id": "a34bb4e2", + "metadata": {}, + "source": [ + "### 2.2. Train, Predict, Evaluate a PECOS Model\n", + "\n", + "With feature and label matrices, the pipeline of training, prediction, and evaluation can be easily established with the modules `pecos.xmc.xlinear.train`, `pecos.xmc.xlinear.predict`, and `pecos.xmc.xlinear.evaluate`. The `--help` argument will list all available command-line options. If you have prepared a JSON-format configuration file, the argument `--params-path` can help customize training and prediction procedures (See Appendix 2)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "b2316bff", + "metadata": {}, + "outputs": [], + "source": [ + "! python3 -m pecos.xmc.xlinear.train \\\n", + " -x simplest.{DATASET}.X.trn.npz \\\n", + " -y xmc-base/{DATASET}/Y.trn.npz \\\n", + " -m simplest.{DATASET}.model \\\n", + " --nr-splits 16 \\\n", + " -t 0.1" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "580ecfc7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==== evaluation results ====\r\n", + "prec = 83.89 78.41 72.53 67.68 63.45 59.72 56.48 53.53 50.86 48.39\r\n", + "recall = 4.95 9.21 12.64 15.60 18.11 20.34 22.30 24.07 25.63 27.00\r\n" + ] + } + ], + "source": [ + "! 
python3 -m pecos.xmc.xlinear.predict \\\n", + " -x simplest.{DATASET}.X.tst.npz \\\n", + " -y xmc-base/{DATASET}/Y.tst.npz \\\n", + " -m simplest.{DATASET}.model \\\n", + " -o simplest.{DATASET}.Y.tst.pred.npz \\\n", + " -b 10 \\\n", + " -k 10" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "5e75b408", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==== evaluation results ====\r\n", + "prec = 83.89 78.41 72.53 67.68 63.45 59.72 56.48 53.53 50.86 48.39\r\n", + "recall = 4.95 9.21 12.64 15.60 18.11 20.34 22.30 24.07 25.63 27.00\r\n" + ] + } + ], + "source": [ + "! python3 -m pecos.xmc.xlinear.evaluate \\\n", + " -y xmc-base/{DATASET}/Y.tst.npz \\\n", + " -p simplest.{DATASET}.Y.tst.pred.npz \\\n", + " -k 10" + ] + }, + { + "cell_type": "markdown", + "id": "b0c731f5", + "metadata": {}, + "source": [ + "## 3. Code with the PECOS library\n", + "\n", + "PECOS includes a comprehensive Python library and interfaces, so we can use PECOS at the code level with more flexibility.\n", + "\n", + "### 3.1. Loading Features and Labels\n", + "For convenience, PECOS also provides APIs `load_feature_matrix` and `load_label_matrix` for loading features and labels from binary files in arbitrary formats.\n", + "Note that for the sparse format, training labels should be loaded as a [Compressed Sparse Column (CSC) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html) while testing labels should be loaded as a CSR matrix for the purpose of computational efficiency. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "c518d892", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training features X_trn is a csr matrix of shape (14146, 101938).\n", + "Training labels Y_trn is a csc matrix of shape (14146, 30938).\n", + "Testing features X_tst is a csr matrix of shape (6616, 101938).\n", + "Testing labels Y_tst is a csr matrix of shape (6616, 30938).\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "from pecos.xmc.xlinear.model import XLinearModel\n", + "\n", + "DATASET = \"wiki10-31k\"\n", + "\n", + "X_trn = XLinearModel.load_feature_matrix(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\")\n", + "Y_trn = XLinearModel.load_label_matrix(f\"xmc-base/{DATASET}/Y.trn.npz\", for_training=True)\n", + "\n", + "X_tst = XLinearModel.load_feature_matrix(f\"xmc-base/{DATASET}/tfidf-attnxml/X.tst.npz\")\n", + "Y_tst = XLinearModel.load_label_matrix(f\"xmc-base/{DATASET}/Y.tst.npz\", for_training=False)\n", + "\n", + "print(f\"Training features X_trn is a {X_trn.getformat()} matrix of shape {X_trn.shape}.\")\n", + "print(f\"Training labels Y_trn is a {Y_trn.getformat()} matrix of shape {Y_trn.shape}.\")\n", + "print(f\"Testing features X_tst is a {X_tst.getformat()} matrix of shape {X_tst.shape}.\")\n", + "print(f\"Testing labels Y_tst is a {Y_tst.getformat()} matrix of shape {Y_tst.shape}.\")" + ] + }, + { + "cell_type": "markdown", + "id": "150fea14", + "metadata": {}, + "source": [ + "### 3.2. Semantic Label Indexing and Cluster Chain in XR-Linear\n", + "\n", + "The first step of training an XR-Linear model is to conduct semantic label indexing and establish the *hierarchical label tree* for recursively training the XR-Linear model and for its inference. \n", + "\n", + "

\n", + "\n", + "
\n", + "\n", + "PECOS supports any method for semantic label indexing. As a built-in method, the PECOS library provides Label Representation via Positive Instance Feature Aggregation (PIFA) for semantic label indexing, which needs only the positive instances and their features in the training data. PECOS can also consider additional label features `Z` of shape `(nr_labels, nr_label_feat)` in either dense or sparse matrix format, where `nr_label_feat` is the number of label features. These representations and features for each label are concatenated or combined as label embedding in `LabelEmbeddingFactory` in PECOS.\n", + "\n", + "To conduct semantic label indexing, PECOS learns an indexer based on label embedding. PECOS currently supports **Hierarchical K-Means** for semantic label indexing with a hyper-parameter `nr_splits` (the number of clusters in each layer, or `B` in [our report](https://arxiv.org/pdf/2010.05878.pdf)), which decides the depth `D` of the hierarchical label tree. " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "26794215", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "4 layers in the trained hierarchical label tree.\n" + ] + } + ], + "source": [ + "from pecos.xmc import Indexer, LabelEmbeddingFactory\n", + "\n", + "label_feat = LabelEmbeddingFactory.create(Y_trn, X_trn, method=\"pifa\")\n", + "# label_feat = LabelEmbeddingFactory.create(Y_trn, X_trn, Z, method=\"pifa_lf_concat\") # for using label features Z\n", + "\n", + "cluster_chain = Indexer.gen(label_feat, nr_splits=16, indexer_type=\"hierarchicalkmeans\")\n", + "\n", + "print(f\"{len(cluster_chain)} layers in the trained hierarchical label tree.\")" + ] + }, + { + "cell_type": "markdown", + "id": "02ffda21", + "metadata": {}, + "source": [ + "### 3.3. Training XR-Linear Negative Sampling and Sparsification\n", + "\n", + "Negative sampling plays an important role in solving the XMC problem. 
PECOS currently provides two negative sampling schemes, including Teacher Forcing Negatives (TFN) and Matcher Aware Negatives (MAN). Please refer to [our report](https://arxiv.org/pdf/2010.05878.pdf) for more details about negative sampling schemes.\n", + "\n", + "To reduce model sizes and improve efficiency, PECOS conducts model sparsification with a hyper-parameter `threshold`. The model weights with absolute values smaller than the threshold will be discarded." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "bd3d6527", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training time: 39.8403 seconds.\n" + ] + } + ], + "source": [ + "import time\n", + "start_time = time.time()\n", + "\n", + "# For negative_sampling_scheme in model training, \"man\" and \"tfn+man\" are also available.\n", + "xlm = XLinearModel.train(X_trn, Y_trn, C=cluster_chain, threshold=0.1, negative_sampling_scheme=\"tfn\")\n", + "\n", + "training_time = time.time() - start_time\n", + "print(f\"Training time: {training_time:.4f} seconds.\")" + ] + }, + { + "cell_type": "markdown", + "id": "20f5cfa7", + "metadata": {}, + "source": [ + "PECOS supports serializing and loading the trained model into binary on disk with convenient interfaces. Note that model loading with `is_predict_only=True` could lead to faster prediction speed by disabling the flexibility of model modification." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "3d5d468d", + "metadata": {}, + "outputs": [], + "source": [ + "xlm.save(f\"{DATASET}.xlm.model\")\n", + "xlm = XLinearModel.load(f\"{DATASET}.xlm.model\", is_predict_only=False)" + ] + }, + { + "cell_type": "markdown", + "id": "4b6038ec", + "metadata": {}, + "source": [ + "### 3.4. Prediction and Evaluation\n", + "\n", + "As a tree model, the inference method significantly affects the prediction efficiency of XR-Linear in PECOS. 
As illustrated in the following figure, the prediction process in PECOS employs a beam search with a hyper-parameter `beam_size`. The other hyper-parameter `only_topk` also needs to be decided to limit the predicted most relevant labels for each instance. The `predict` function of the trained model will result in a CSR matrix of shape `(nr_inst, nr_labels)` and exactly `only_topk` non-zero columns for each row (or instance).\n", + "\n", + "
\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "7f851bc1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Y_pred is a csr matrix of shape (6616, 30938) and 66160 non-zero elements.\n" + ] + } + ], + "source": [ + "Y_pred = xlm.predict(X_tst, beam_size=10, only_topk=10)\n", + "\n", + "print(f\"Y_pred is a {Y_pred.getformat()} matrix of shape {Y_pred.shape} and {Y_pred.nnz} non-zero elements.\")" + ] + }, + { + "cell_type": "markdown", + "id": "fb1ed22c", + "metadata": {}, + "source": [ + "For evaluation, we evaluate the trained model with conventional ranking metrics, including Precision@K and Recall@K. PECOS also provides the evaluation interface for predicted sparse matrices." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "4c57da1a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "prec = 84.36 78.20 72.67 68.03 63.85 60.23 56.86 53.85 51.07 48.63\n", + "recall = 4.99 9.17 12.65 15.67 18.28 20.56 22.50 24.21 25.74 27.15\n" + ] + } + ], + "source": [ + "from pecos.utils import smat_util\n", + "metrics = smat_util.Metrics.generate(Y_tst, Y_pred, topk=10)\n", + "print(metrics)" + ] + }, + { + "cell_type": "markdown", + "id": "ae9baf18", + "metadata": {}, + "source": [ + "### 3.5. PECOS and One-versus-All (OVA) Model\n", + "\n", + "PECOS also supports training an OVA model without leveraging the clustering hierarchy if needed. \n", + "\n", + "**Training OVA models is time-consuming, so we suggest trying the following code offline after the tutorial. 
Note that training the above XR-Linear model is 26 times faster than training an OVA model on an AWS *i3.4xlarge* instance.**\n", + "\n", + "```python\n", + "import time\n", + "start_time = time.time()\n", + "\n", + "xlm_ova = XLinearModel.train(X_trn, Y_trn, C=None, negative_sampling_scheme=\"tfn\") \n", + "\n", + "training_time_ova = time.time() - start_time\n", + "print(f\"Training time for the OVA model: {training_time_ova:.4f} seconds.\")\n", + "pecos_faster_ratio = training_time_ova / training_time\n", + "print(f\"XR-Linear is {pecos_faster_ratio:.2f} times faster than the OVA model\")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "1663277d", + "metadata": {}, + "source": [ + "# 4. Customized PECOS Model\n", + "\n", + "Besides pre-defined models in PECOS, such as XR-Linear, it is also convenient for users to customize PECOS for specific purposes and usage. Specifically, we suggest establishing a model class to wrap fundamental PECOS functions and tailored operations. As a result, the customized model can be easily constructed and consumed for arbitrary data types and feature extractors. \n", + "\n", + "## 4.1. Structure of a Customized PECOS Model\n", + "\n", + "Even though a customized machine learning pipeline can be separated into several independent scripts, we recommend declaring a customized PECOS model as a **model class** for better re-usability and code maintenance.\n", + "\n", + "A customized PECOS model should at least consist of the following components:\n", + "\n", + "* `preprocessor` or `encoder`: The procedure, which can be a method or a callable object, pre-processes or encodes an arbitrary input with the designated data format into features. For example, text data and image data can be encoded by BERT and ResNet.\n", + "* `train()`: The training method takes a set of training data with a preprocessor, learns a primitive PECOS model, and returns a PECOS-based customized machine learning model. 
The training function could be a class method to construct the model object with the learned model and essential components after training.\n", + "* `model`: A primitive PECOS model taking pre-processed features is capable of deriving the predictions for arbitrary testing data. The model weights should be learned by `train()`. \n", + "* `predict()`: The prediction method takes arbitrary testing data and infers the prediction based on the pre-processor and the learned model.\n", + "* `save()`: The saving function serializes the trained model, including model weights and configuration, for further usage.\n", + "* `load()`: The loading function reads the serialized model so that the trained model can be loaded and re-used.\n", + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "In this part of the tutorial, we will use the task of *extreme multi-label text classification* as an example to demonstrate how to **customize a PECOS model that can handle text data with either a conventional bag-of-words (BoW) model or a deep learning model as the text encoder for feature extraction**.\n" + ] + }, + { + "cell_type": "markdown", + "id": "c3acc325", + "metadata": {}, + "source": [ + "## 4.2. Example: eXtreme Multi-label Text Classification (XMTC)\n", + "\n", + "The task of extreme multi-label text classification (XMTC) seeks to find relevant labels from an extremely large label collection for a given text input. Many real-world applications can be formulated as XMTC tasks, such as recommendation systems, document tagging, and semantic search. \n", + "\n", + "In this section, we show how to establish a customized PECOS model for XMTC tasks. We will walk through (1) PECOS' built-in BoW model for text preprocessing and vectorizing; (2) how to customize a PECOS model; and (3) \n", + "advanced usage of XR-Transformer based on deep learning." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "237164ca", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "text2text_demo/output-labels.txt\r\n", + "text2text_demo/pecos-CustomPECOS-model\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base/tokenizer\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base/tokenizer/vocab.txt\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base/tokenizer/config.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base/vectorizer\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base/vectorizer/tfidf-model.txt\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/0.base/vectorizer/config.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/meta.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/preprocessor/config.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/ranker\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/ranker/0.model\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/ranker/0.model/W.npz\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/ranker/0.model/C.npz\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/ranker/0.model/param.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/ranker/param.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/xlinear_model/param.json\r\n", + "text2text_demo/pecos-CustomPECOS-model/output_items.json\r\n", + "text2text_demo/pecos-text2text-model\r\n", + "text2text_demo/pecos-text2text-model/0.base\r\n", + "text2text_demo/pecos-text2text-model/0.base/tokenizer\r\n", + "text2text_demo/pecos-text2text-model/0.base/tokenizer/vocab.txt\r\n", + 
"text2text_demo/pecos-text2text-model/0.base/tokenizer/config.json\r\n", + "text2text_demo/pecos-text2text-model/0.base/vectorizer\r\n", + "text2text_demo/pecos-text2text-model/0.base/vectorizer/tfidf-model.txt\r\n", + "text2text_demo/pecos-text2text-model/0.base/vectorizer/config.json\r\n", + "text2text_demo/pecos-text2text-model/meta.json\r\n", + "text2text_demo/pecos-text2text-model/config.json\r\n", + "text2text_demo/pecos-text2text-model/2.base\r\n", + "text2text_demo/pecos-text2text-model/2.base/tokenizer\r\n", + "text2text_demo/pecos-text2text-model/2.base/tokenizer/vocab.txt\r\n", + "text2text_demo/pecos-text2text-model/2.base/tokenizer/config.json\r\n", + "text2text_demo/pecos-text2text-model/2.base/vectorizer\r\n", + "text2text_demo/pecos-text2text-model/2.base/vectorizer/tfidf-model.txt\r\n", + "text2text_demo/pecos-text2text-model/2.base/vectorizer/config.json\r\n", + "text2text_demo/pecos-text2text-model/1.base\r\n", + "text2text_demo/pecos-text2text-model/1.base/tokenizer\r\n", + "text2text_demo/pecos-text2text-model/1.base/tokenizer/vocab.txt\r\n", + "text2text_demo/pecos-text2text-model/1.base/tokenizer/config.json\r\n", + "text2text_demo/pecos-text2text-model/1.base/vectorizer\r\n", + "text2text_demo/pecos-text2text-model/1.base/vectorizer/tfidf-model.txt\r\n", + "text2text_demo/pecos-text2text-model/1.base/vectorizer/config.json\r\n", + "text2text_demo/testing-data.txt\r\n", + "text2text_demo/training-data.txt\r\n" + ] + } + ], + "source": [ + "! wget -nv -nc https://archive.org/download/text2text_demo.tar.gz/text2text_demo.tar.gz\n", + "! tar --skip-old-files -zxf text2text_demo.tar.gz\n", + "! find text2text_demo/*" + ] + }, + { + "cell_type": "markdown", + "id": "3b2ec0e6", + "metadata": {}, + "source": [ + "### 4.2.1. Preprocessor: Text Preprocessing and Vectorizing\n", + "\n", + "The preprocessor plays a role of encoding input data into machine readable vector representations. 
Any encoder that can transform text data into a vector representation can be considered as the preprocessor or encoder of a customized PECOS model for XMTC tasks.\n", + "\n", + "In the PECOS library, we provide [various text vectorizers](https://github.com/amzn/pecos/blob/mainline/pecos/utils/featurization/text/vectorizers.py), such as TF-IDF, hashing, and pretrained transformer, as **built-in preprocessors** to deal with text data. In this tutorial, we will utilize the [n-gram](https://en.wikipedia.org/wiki/N-gram) [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) model as our preprocessor.\n", + "\n", + "#### Label Space File Format for Built-in Text Preprocessors\n", + "\n", + "The label space is also essential for text preprocessors, especially for determining the label space size to create the appropriate label matrix. The label IDs start from zero and correspond to the line numbers of the text descriptions in the label space file." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "3f48f4f7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Artificial intelligence researchers\r\n", + "Computability theorists\r\n", + "British computer scientists\r\n", + "Machine learning researchers\r\n", + "Turing Award laureates\r\n", + "Deep Learning\r\n" + ] + } + ], + "source": [ + "! cat \"./text2text_demo/output-labels.txt\"" + ] + }, + { + "cell_type": "markdown", + "id": "e0862645", + "metadata": {}, + "source": [ + "#### Data File Format for Built-in Text Preprocessors\n", + "\n", + "PECOS built-in text preprocessors mainly take files of text data with labels in a tab-separated values (TSV) format. Each line in the TSV file consists of two elements that represent the comma-separated label IDs and the input text of a data instance. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "bd5ebfc6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0,1,2\tAlan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.\r\n", + "0,2,3\tHinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks.\r\n", + "3,4,5\tHinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on artificial intelligence and deep learning.\r\n", + "0,3,5\tYoshua Bengio is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning.\r\n" + ] + } + ], + "source": [ + "! cat ./text2text_demo/training-data.txt" + ] + }, + { + "cell_type": "markdown", + "id": "566d1eb5", + "metadata": {}, + "source": [ + "The data file format also supports to represent the label relevance for cost-sensitive learning by using double colons to separate a label and its relevance.\n", + "\n", + "

\n", + "0::0.1,1::0.2,2::0.8 <TAB> Alan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.

\n" + ] + }, + { + "cell_type": "markdown", + "id": "4d4e4419", + "metadata": {}, + "source": [ + "#### Training a Text Preprocessor\n", + "\n", + "The preprocessor model `Preprocessor` is defined in `pecos.utils.featurization.text.preprocess`. Given a training text corpus and the configuration dictionary, the class method `Preprocessor.train` will train a corresponding text preprocessor. Besides, the built-in preprocessors also support serialization with the function `save()` for reusability.\n", + "\n", + "With the previously mentioned data and label space file formats, the utility function `Preprocessor.load_data_from_file(input_text_path, output_text_path)` returns a dictionary with three keys:\n", + "\n", + "* `label_matrix`: a `(num_inst, num_labels)` CSR matrix for the labels of each instance.\n", + "* `label_relevance`: `None` or a `(num_inst, num_labels)` CSR matrix for the relevance of each label in cost-sensitive learning if available.\n", + "* `corpus`: a list of strings as the text corpus in the input_text_path.\n", + "\n", + "The configuration settings of the text preprocessor, including the preprocessor type and hyper-parameters, should be defined in a dictionary. Specifically, the key `type` defines the preprocessor choice while the key `kwargs` represents the hyper-parameters. In this tutorial, we adopt n-gram TF-IDF features containing *word unigrams*, *word bigrams*, and *character trigrams*. Note that each n-gram feature type can have different hyper-parameters, such as `max_feature` and `max_df_ratio`. Users need to properly set `max_feature` (e.g., hundreds of thousands or millions) based on the corpus size and downstream tasks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "b7f70a8f", + "metadata": {}, + "outputs": [], + "source": [ + "from pecos.utils.featurization.text.preprocess import Preprocessor\n", + "\n", + "input_text_path = \"./text2text_demo/training-data.txt\"\n", + "output_text_path = \"./text2text_demo/output-labels.txt\"\n", + "model_folder = \"./text2text_demo/pecos-text2text-model\"\n", + "\n", + "parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path) # Read files\n", + "corpus = parsed_result[\"corpus\"] # Corpus input text: List of strings\n", + "Y = parsed_result[\"label_matrix\"] # Label Matrix: Sparse Matrix\n", + "\n", + "vectorizer_config = {\n", + " \"type\": \"tfidf\",\n", + " \"kwargs\": {\n", + " \"base_vect_configs\": [\n", + " \n", + " {\n", + " \"ngram_range\": [1, 1],\n", + " \"max_df_ratio\": 0.98,\n", + " \"analyzer\": \"word\",\n", + " },\n", + " {\n", + " \"ngram_range\": [2, 2],\n", + " \"max_df_ratio\": 0.98,\n", + " \"analyzer\": \"word\",\n", + " },\n", + " {\n", + " \"ngram_range\": [3, 3],\n", + " \"max_df_ratio\": 0.98,\n", + " \"analyzer\": \"char_wb\",\n", + " },\n", + " ],\n", + " },\n", + " }\n", + "\n", + "preprocessor = Preprocessor.train(corpus, vectorizer_config)\n", + "preprocessor.save(model_folder) " + ] + }, + { + "cell_type": "markdown", + "id": "a0300f8c", + "metadata": {}, + "source": [ + "#### Preprocessing with a Trained Text Preprocessor\n", + "\n", + "The function `predict` of a trained text preprocessor encodes texts in a **text data file** into a CSR matrix of shape `(num_inst, dim)` as numerical vector representations, where `num_inst` is the number of instances in the file; `dim` is the number of feature dimensions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "3b182171", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The file consists of 4 instances with {X.shape[1]}-dimensional features in a {X.getformat()} matrix.\n", + "\n", + "Text 0: Alan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.\n", + "Text 1: Hinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks.\n", + "Text 2: Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on artificial intelligence and deep learning.\n", + "Text 3: Yoshua Bengio is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning.\n", + "\n", + "The cosine similarity is 0.0076 between text 0 and text 1.\n", + "The cosine similarity is 0.0325 between text 0 and text 2.\n", + "The cosine similarity is 0.0082 between text 1 and text 2.\n", + "The cosine similarity is 0.0366 between text 0 and text 3.\n", + "The cosine similarity is 0.0267 between text 1 and text 3.\n", + "The cosine similarity is 0.0943 between text 2 and text 3.\n" + ] + } + ], + "source": [ + "# Obtaining numerical vectors from text\n", + "X = preprocessor.predict(corpus)\n", + "\n", + "print(f\"The file consists of {X.shape[0]} instances \"\n", + " f\"with {X.shape[1]}-dimensional features \"\n", + " f\"in a {X.getformat()} matrix.\\n\")\n", + "\n", + "from sklearn.metrics.pairwise import cosine_similarity\n", + "\n", + "sim = cosine_similarity(X)\n", + "\n", + "for i, ti in enumerate(corpus):\n", + " print(f\"Text {i}: {ti}\")\n", + "\n", + "print(\"\")\n", + "for i in range(X.shape[0]):\n", + " for j in range(i):\n", + " print(f\"The cosine similarity is {sim[i][j]:.4f} between text {j} and text {i}.\")" + ] + }, + { + "cell_type": "markdown", + "id": "18fcd09b", + 
"metadata": {}, + "source": [ + "#### Command-line Interface\n", + "\n", + "The above vectorizer operations can also be run from the command line with a JSON-format configuration file:\n", + "\n", + "```bash\n", + "python3 -m pecos.utils.featurization.text.preprocess build \\\n", + " --text-pos 1 \\\n", + " --input-text-path ./text2text_demo/training-data.txt \\\n", + " --vectorizer-config-path /path/to/vectorizer-config.json \\\n", + " --output-model-folder ./text2text_demo/pecos-text2text-model\n", + "\n", + "python3 -m pecos.utils.featurization.text.preprocess run \\\n", + " --input-preprocessor-folder ./text2text_demo/pecos-text2text-model \\\n", + " --text-pos 1 \\\n", + " --input-text-path ./text2text_demo/training-data.txt \\\n", + " --output-inst-path /path/to/X.npz \\\n", + " --label-pos 0 \\\n", + " --output-label-path /path/to/Y.npz \\\n", + " --label-text-path ./text2text_demo/output-labels.txt\n", + "```\n", + "\n", + "#### Efficiency of PECOS Built-in TF-IDF Vectorizer\n", + "\n", + "The TF-IDF vectorizer in PECOS is implemented in C++ for efficiency." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "a3d6f675", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "PECOS TFIDF time: 27.30493s, result shape=(14146, 10858825), nnz=37194670\n" + ] + } + ], + "source": [ + "vectorizer_config = {\n", + " \"type\": \"tfidf\",\n", + " \"kwargs\": {\n", + " \"base_vect_configs\": [ \n", + " {\n", + " \"ngram_range\": [1, 2],\n", + " \"max_df_ratio\": 0.98,\n", + " \"analyzer\": \"word\",\n", + " },\n", + " ],\n", + " },\n", + " }\n", + "\n", + "input_text_path = \"xmc-base/wiki10-31k/X.trn.txt\"\n", + "corpus = Preprocessor.load_data_from_file(input_text_path, text_pos=0)[\"corpus\"]\n", + "\n", + "import time\n", + "start_time = time.time()\n", + "preprocessor = Preprocessor.train(corpus, vectorizer_config)\n", + "X = preprocessor.predict(input_text_path)\n", + "print(f\"PECOS TFIDF time: {time.time() - start_time:.5f}s, result shape={X.shape}, nnz={X.nnz}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e63c62ce", + "metadata": {}, + "source": [ + "As a baseline method, we compare with the [Sklearn TFIDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):\n", + "```python\n", + "start_time = time.time()\n", + "preprocessor = Preprocessor.train(\n", + " corpus,\n", + " {\"type\": \"sklearntfidf\", \"kwargs\":{\"ngram_range\": [1, 2], \"max_df\": 0.98}},\n", + ")\n", + "X = preprocessor.predict(corpus)\n", + "print(f\"Sklearn TFIDF time: {time.time() - start_time:.5f}s, result shape={X.shape}, nnz={X.nnz}\")\n", + "```\n", + "\n", + "**Training Sklearn TFIDF models is time-consuming, so we suggest trying the code above offline after the tutorial. 
The execution results using an AWS *i3.4xlarge* instance are as follows:**\n", + "```\n", + "Sklearn TFIDF time: 221.40709s, result shape=(14146, 7269690), nnz=33505461\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "75f5aaf8", + "metadata": {}, + "source": [ + "### 4.2.2. Customized PECOS Model with TF-IDF Preprocessor\n", + "\n", + "\n", + "With text preprocessors in hand, and following the [aforementioned illustration](#Structure-of-a-Customized-PECOS-Model), we declare a **customized PECOS model class** based on a TF-IDF preprocessor and an XR-Linear model." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "3893c23b", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from os import path\n", + "import pathlib\n", + "from pecos.utils.featurization.text.preprocess import Preprocessor\n", + "from pecos.xmc.xlinear.model import XLinearModel\n", + "from pecos.xmc import Indexer, LabelEmbeddingFactory\n", + "from pecos.utils import smat_util\n", + "\n", + "class CustomPECOS:\n", + " def __init__(self, preprocessor=None, xlinear_model=None, output_items=None):\n", + " self.preprocessor = preprocessor\n", + " self.xlinear_model = xlinear_model\n", + " self.output_items = output_items\n", + " \n", + " @classmethod\n", + " def train(cls, input_text_path, output_text_path):\n", + " \"\"\"Train a CustomPECOS model\n", + " \n", + " Args: \n", + " input_text_path (str): Text input file name. \n", + " output_text_path (str): The file path for output text items.\n", + " vectorizer_config (str): Json_format string for vectorizer config (default None). e.g. 
{\"type\": \"tfidf\", \"kwargs\": {}}\n", + " \n", + " Returns:\n", + " A CustomPECOS object\n", + " \"\"\"\n", + " # Obtain X_text, Y\n", + " parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path)\n", + " Y = parsed_result[\"label_matrix\"]\n", + " corpus = parsed_result[\"corpus\"]\n", + "\n", + " # Train TF-IDF vectorizer\n", + " preprocessor = Preprocessor.train(corpus, {\"type\": \"tfidf\", \"kwargs\":{}}) \n", + " X = preprocessor.predict(corpus) \n", + " \n", + " # Train a XR-Linear model with TF-IDF features\n", + " label_feat = LabelEmbeddingFactory.create(Y, X, method=\"pifa\")\n", + " cluster_chain = Indexer.gen(label_feat)\n", + " xlinear_model = XLinearModel.train(X, Y, C=cluster_chain)\n", + " \n", + " # Load output items\n", + " with open(output_text_path, \"r\", encoding=\"utf-8\") as f:\n", + " output_items = [q.strip() for q in f]\n", + " \n", + " return cls(preprocessor, xlinear_model, output_items)\n", + " \n", + " def predict(self, corpus):\n", + " \"\"\"Predict labels for given inputs\n", + " \n", + " Args:\n", + " corpus (list of strings): input strings.\n", + " Returns:\n", + " csr_matrix: predicted label matrix (num_samples x num_labels)\n", + " \"\"\"\n", + " X = self.preprocessor.predict(corpus)\n", + " Y_pred = self.xlinear_model.predict(X)\n", + " return smat_util.sorted_csr(Y_pred)\n", + "\n", + " def save(self, model_folder):\n", + " \"\"\"Save the CustomPECOS model\n", + "\n", + " Args:\n", + " model_folder (str): folder name to save\n", + " \"\"\"\n", + " self.preprocessor.save(f\"{model_folder}/preprocessor\")\n", + " self.xlinear_model.save(f\"{model_folder}/xlinear_model\")\n", + " with open(f\"{model_folder}/output_items.json\", \"w\", encoding=\"utf-8\") as fp:\n", + " json.dump(self.output_items, fp)\n", + "\n", + " @classmethod\n", + " def load(cls, model_folder):\n", + " \"\"\"Load the CustomPECOS model\n", + "\n", + " Args:\n", + " model_folder (str): folder name to load\n", + " Returns:\n", + " 
CustomPECOS\n", + " \"\"\"\n", + " preprocessor = Preprocessor.load(f\"{model_folder}/preprocessor\")\n", + " xlinear_model = XLinearModel.load(f\"{model_folder}/xlinear_model\")\n", + " with open(f\"{model_folder}/output_items.json\", \"r\", encoding=\"utf-8\") as fin:\n", + " output_items = json.load(fin)\n", + " return cls(preprocessor, xlinear_model, output_items)" + ] + }, + { + "cell_type": "markdown", + "id": "fcdbb2c6", + "metadata": {}, + "source": [ + "### 4.2.3. Operating the Customized PECOS Model\n", + "\n", + "With a well-declared model class, the customized PECOS model is modular and convenient to use." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "24134357", + "metadata": {}, + "outputs": [], + "source": [ + "# Declare the path for model serialization.\n", + "model_folder = \"./text2text_demo/pecos-CustomPECOS-model\"\n", + "\n", + "# Train and save the trained model\n", + "input_text_path = \"./text2text_demo/training-data.txt\"\n", + "output_text_path = \"./text2text_demo/output-labels.txt\"\n", + "model = CustomPECOS.train(input_text_path, output_text_path)\n", + "model.save(model_folder)\n", + "\n", + "# Load the trained model and predict\n", + "model = CustomPECOS.load(model_folder)\n", + "testing_text_path = \"./text2text_demo/testing-data.txt\"\n", + "Y_pred = model.predict(testing_text_path)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "31efd9ac", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Text Input: {text}\n", + "Score {pred_score:.4f}: {pred_label}\n", + "Score {pred_score:.4f}: {pred_label}\n", + "Score {pred_score:.4f}: {pred_label}\n", + "Score {pred_score:.4f}: {pred_label}\n", + "Score {pred_score:.4f}: {pred_label}\n", + "Score {pred_score:.4f}: {pred_label}\n" + ] + } + ], + "source": [ + "test_texts = Preprocessor.load_data_from_file(testing_text_path, output_text_path)[\"corpus\"]\n", + 
"\n", + "for i, text in enumerate(test_texts):\n", + " print(f\"Text Input: {text}\")\n", + " for j in range(Y_pred.indptr[i], Y_pred.indptr[i + 1]):\n", + " pred_label = model.output_items[Y_pred.indices[j]]\n", + " pred_score = Y_pred.data[j]\n", + " print(f\"Score {pred_score:.4f}: {pred_label}\")" + ] + }, + { + "cell_type": "markdown", + "id": "32908310", + "metadata": {}, + "source": [ + "## Appendix 1: Model Parameters in PECOS Implementation\n", + "\n", + "### A1.1. Cluster Chain in PECOS Implementation\n", + "\n", + "PECOS trains a *cluster_chain* of `D` matching matrices `C[d]`, where `C[d]` is a CSC matrix of shape `(L[d], K[d])`; `L[d]` and `K[d]` are the numbers of labels and clusters in layer `d`. Note that the labels of a layer serve as the clusters of the next layer, i.e., `K[d + 1] = L[d]`. The number of labels in the last layer, `L[D - 1]`, equals the number of labels in the overall XMC problem, `nr_labels`." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "6b0cb55e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "4 layers in the trained hierarchical label tree with C[d] as:\n", + "cluster_chain[0] is a csc matrix of shape (2, 1)\n", + "cluster_chain[1] is a csc matrix of shape (32, 2)\n", + "cluster_chain[2] is a csc matrix of shape (512, 32)\n", + "cluster_chain[3] is a csc matrix of shape (30938, 512)\n" + ] + } + ], + "source": [ + "print(f\"{len(cluster_chain)} layers in the trained hierarchical label tree with C[d] as:\")\n", + "for d, C in enumerate(cluster_chain):\n", + " print(f\"cluster_chain[{d}] is a {C.getformat()} matrix of shape {C.shape}\")" + ] + }, + { + "cell_type": "markdown", + "id": "790b21dc", + "metadata": {}, + "source": [ + "### A1.2. Model Weights in PECOS Implementation\n", + "\n", + "Model weights in an XR-Linear model are also accessible as `model_chain` for analysis and computations. 
For the i-th layer in the hierarchy, the model weights of matchers/rankers are available as a CSC matrix of shape `(nr_feat + 1, L[i])`, which concatenates weights for features and the bias term. " + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "9e101f6b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "model_chain[0].W is a csc matrix of shape (101939, 2)\n", + "model_chain[1].W is a csc matrix of shape (101939, 32)\n", + "model_chain[2].W is a csc matrix of shape (101939, 512)\n", + "model_chain[3].W is a csc matrix of shape (101939, 30938)\n" + ] + } + ], + "source": [ + "for d, m in enumerate(xlm.model.model_chain):\n", + " print(f\"model_chain[{d}].W is a {m.W.getformat()} matrix of shape {m.W.shape}\")" + ] + }, + { + "cell_type": "markdown", + "id": "73a599a0", + "metadata": {}, + "source": [ + "## Appendix 2: Customized Parameters and Advanced Training Options\n", + "\n", + "PECOS also supports using customized parameters and several advanced training options, such as different solvers and cost-sensitive learning.\n", + "\n", + "### A2.1. 
Customized Parameters\n", + "\n", + "The parameters for indexing, training, and inference can each be customized by passing a dictionary to the `from_dict` constructor of the corresponding parameter class:\n", + "\n", + "* Semantic Indexing (Hierarchical K-Means): `HierarchicalKMeans.TrainParams.from_dict(dict)`\n", + "* Training: `XLinearModel.TrainParams.from_dict(dict)`\n", + "* Inference: `XLinearModel.PredParams.from_dict(dict)`\n", + "\n", + "Although most of the parameters can also be passed as keyword arguments of the Python methods, **we encourage using a dictionary to specify the parameters because it is easier to manage, modularize, and store in formats like JSON.**\n", + "\n", + "For XR-Linear models, the parameter skeleton with default values can be generated by the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "1ddc9bfa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\r\n", + " \"train_params\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.xlinear.model###XLinearModel.TrainParams\"\r\n", + " },\r\n", + " \"mode\": \"full-model\",\r\n", + " \"ranker_level\": 1,\r\n", + " \"nr_splits\": 16,\r\n", + " \"min_codes\": null,\r\n", + " \"shallow\": false,\r\n", + " \"rel_mode\": \"disable\",\r\n", + " \"rel_norm\": \"no-norm\",\r\n", + " \"hlm_args\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.base###HierarchicalMLModel.TrainParams\"\r\n", + " },\r\n", + " \"neg_mining_chain\": \"tfn\",\r\n", + " \"model_chain\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.base###MLModel.TrainParams\"\r\n", + " },\r\n", + " \"threshold\": 0.1,\r\n", + " \"max_nonzeros_per_label\": null,\r\n", + " \"solver_type\": \"L2R_L2LOSS_SVC_DUAL\",\r\n", + " \"Cp\": 1.0,\r\n", + " \"Cn\": 1.0,\r\n", + " \"max_iter\": 100,\r\n", + " \"eps\": 0.1,\r\n", + " \"bias\": 1.0,\r\n", + " 
\"threads\": -1,\r\n", + " \"verbose\": 0,\r\n", + " \"newton_eps\": 0.01\r\n", + " }\r\n", + " }\r\n", + " },\r\n", + " \"pred_params\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.xlinear.model###XLinearModel.PredParams\"\r\n", + " },\r\n", + " \"hlm_args\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.base###HierarchicalMLModel.PredParams\"\r\n", + " },\r\n", + " \"model_chain\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.base###MLModel.PredParams\"\r\n", + " },\r\n", + " \"only_topk\": 20,\r\n", + " \"post_processor\": \"l3-hinge\"\r\n", + " }\r\n", + " }\r\n", + " },\r\n", + " \"indexer_params\": {\r\n", + " \"__meta__\": {\r\n", + " \"class_fullname\": \"pecos.xmc.base###HierarchicalKMeans.TrainParams\"\r\n", + " },\r\n", + " \"nr_splits\": 16,\r\n", + " \"min_codes\": null,\r\n", + " \"max_leaf_size\": 100,\r\n", + " \"imbalanced_ratio\": 0.0,\r\n", + " \"imbalanced_depth\": 100,\r\n", + " \"spherical\": true,\r\n", + " \"seed\": 0,\r\n", + " \"kmeans_max_iter\": 20,\r\n", + " \"threads\": -1\r\n", + " }\r\n", + "}\r\n" + ] + } + ], + "source": [ + "! python3 -m pecos.xmc.xlinear.train --generate-params-skeleton" + ] + }, + { + "cell_type": "markdown", + "id": "35472517", + "metadata": {}, + "source": [ + "### A2.2. Training Parameters for Hierarchical Models in XR-Linear\n", + "\n", + "Hierarchical models can use different parameters for different layers. To customize the parameters of the hierarchical model, specify `hlm_args` in the parameter dictionary. 
The values of `model_chain` and `neg_mining_chain` in `hlm_args` can each be **a single value** (a dictionary for `model_chain`, a string for `neg_mining_chain`) shared by all layers or **a list** with one entry per layer.\n", + "\n", + "#### General Parameters for All Layers\n", + "\n", + "```\n", + "train_params_l1 = XLinearModel.TrainParams.from_dict(\n", + " {\n", + " ...\n", + " \"hlm_args\": {\n", + " ...\n", + " \"neg_mining_chain\": \"tfn\", # Negative sampling scheme for all layers\n", + " \"model_chain\":{...}, # Parameters for all layers\n", + " }\n", + " ...\n", + " })\n", + "```\n", + "\n", + "#### Specific Parameters of Individual Layers\n", + "\n", + "```\n", + "train_params_l1 = XLinearModel.TrainParams.from_dict(\n", + " {\n", + " ...\n", + " \"hlm_args\": {\n", + " ...\n", + " \"neg_mining_chain\": [\n", + " \"tfn\", # Negative sampling scheme for layer-0\n", + " \"tfn\", # Negative sampling scheme for layer-1\n", + " \"tfn+man\", # Negative sampling scheme for layer-2\n", + " ...\n", + " ],\n", + " \"model_chain\": [\n", + " {...}, # Parameters for layer-0\n", + " {...}, # Parameters for layer-1\n", + " {...}, # Parameters for layer-2\n", + " ...\n", + " ],\n", + " }\n", + " ...\n", + " })\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "81f86b8a", + "metadata": {}, + "source": [ + "### A2.3. Variety of Solvers\n", + "\n", + "The optimization solver can be selected with the argument `solver_type` of the `train` function. 
PECOS currently provides the following solvers for training each matcher/ranker:\n", + "\n", + "* \"L2R_L2LOSS_SVC_DUAL\" (default): L2-regularized L2-loss Dual SVM\n", + "* \"L2R_L1LOSS_SVC_DUAL\": L2-regularized L1-loss Dual SVM\n", + "* \"L2R_LR_DUAL\": L2-regularized Logistic Regression" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "8f26ee42", + "metadata": {}, + "outputs": [], + "source": [ + "xlm_l1_kwargs = XLinearModel.train(\n", + " X_trn, Y_trn,\n", + " C=cluster_chain,\n", + " threshold=0.1,\n", + " negative_sampling_scheme=\"tfn\",\n", + " solver_type=\"L2R_L1LOSS_SVC_DUAL\")" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "197926a7", + "metadata": {}, + "outputs": [], + "source": [ + "train_params_l1 = XLinearModel.TrainParams.from_dict(\n", + " {\n", + " \"hlm_args\": {\n", + " \"neg_mining_chain\": \"tfn\",\n", + " \"model_chain\":{\n", + " \"threshold\": 0.1,\n", + " \"solver_type\": \"L2R_L1LOSS_SVC_DUAL\",\n", + " },\n", + " }\n", + " }\n", + ")\n", + "\n", + "xlm_l1_dict = XLinearModel.train(\n", + " X_trn, Y_trn,\n", + " C=cluster_chain,\n", + " train_params=train_params_l1)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "eddf91a6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by method kwargs)\n", + "prec = 83.65 77.27 72.01 67.67 63.91 60.53 57.33 54.53 52.03 49.66\n", + "recall = 4.94 9.06 12.53 15.59 18.26 20.64 22.67 24.52 26.23 27.72\n", + "\n", + "Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by dictionary)\n", + "prec = 83.65 77.27 72.01 67.67 63.91 60.53 57.33 54.53 52.03 49.66\n", + "recall = 4.94 9.06 12.53 15.59 18.26 20.64 22.67 24.52 26.23 27.72\n" + ] + } + ], + "source": [ + "Y_pred_l1_kwargs = xlm_l1_kwargs.predict(X_tst, beam_size=10, only_topk=10)\n", + "Y_pred_l1_dict = xlm_l1_dict.predict(X_tst, beam_size=10, only_topk=10)\n", + "metrics_l1_kwargs = 
smat_util.Metrics.generate(Y_tst, Y_pred_l1_kwargs, topk=10)\n", + "metrics_l1_dict = smat_util.Metrics.generate(Y_tst, Y_pred_l1_dict, topk=10)\n", + "\n", + "print(\"Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by method kwargs)\")\n", + "print(metrics_l1_kwargs)\n", + "\n", + "print(\"\\nEvaluation Metrics with L2R_L1LOSS_SVC_DUAL (by dictionary)\")\n", + "print(metrics_l1_dict)" + ] + }, + { + "cell_type": "markdown", + "id": "9f481ed3", + "metadata": {}, + "source": [ + "## Appendix 3: Cost-sensitive Learning\n", + "\n", + "PECOS supports adjusting the cost of each training instance. To enable cost-sensitive learning, we need to provide a **relevance matrix** `R_trn` with the same shape as the label matrix `Y_trn` through the argument `R`. When `R` is `None` (default), cost-sensitive learning is disabled.\n", + "\n", + "Since PECOS models are usually hierarchical, the way costs are assigned to upper layers must also be chosen, via the argument `rel_mode`. Currently, PECOS supports the following cost-sensitive learning modes:\n", + "\n", + "* `\"disable\"` (default): Cost-sensitive learning is disabled.\n", + "* `\"induce\"`: Induce the costs into upper layers by the clustering chain.\n", + "* `\"ranker-only\"`: Only apply cost-sensitive learning to the model in the last ranker layer without induction.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "382277f3", + "metadata": {}, + "outputs": [], + "source": [ + "# An example of using training label frequency scores as costs. 
\n", + "import copy\n", + "from sklearn.preprocessing import normalize\n", + "\n", + "R_trn = copy.deepcopy(Y_trn)\n", + "\n", + "# Training parameters for cost-sensitive learning.\n", + "train_params_cost = XLinearModel.TrainParams.from_dict(\n", + " {\n", + " \"rel_mode\": \"induce\",\n", + " \"rel_norm\": \"l1\",\n", + " \"hlm_args\": {\n", + " \"neg_mining_chain\": \"tfn\",\n", + " \"model_chain\":\n", + " [\n", + " {\n", + " \"threshold\": 0.1,\n", + " \"Cp\": 1.0,\n", + " \"Cn\": 1.0,\n", + " },\n", + " {\n", + " \"threshold\": 0.1,\n", + " \"Cp\": 8.0,\n", + " \"Cn\": 1.0,\n", + " },\n", + " {\n", + " \"threshold\": 0.1,\n", + " \"Cp\": 4.0,\n", + " \"Cn\": 1.0,\n", + " },\n", + " {\n", + " \"threshold\": 0.1,\n", + " \"Cp\": 4.0,\n", + " \"Cn\": 1.0,\n", + " },\n", + " ],\n", + " }\n", + " })\n", + " \n", + "# Cost-sensitive learning.\n", + "xlm_cost = XLinearModel.train(\n", + " X_trn, Y_trn,\n", + " C=cluster_chain,\n", + " R=R_trn,\n", + " train_params=train_params_cost)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "559c15cb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Evaluation Metrics with Cost-sensitive Learning\n", + "prec = 84.93 80.15 74.50 69.32 64.73 61.05 57.68 54.52 51.70 49.10\n", + "recall = 5.01 9.42 13.00 15.99 18.52 20.83 22.85 24.54 26.09 27.43\n", + "\n", + "Original Evaluation Metrics\n", + "prec = 84.36 78.20 72.67 68.03 63.85 60.23 56.86 53.85 51.07 48.63\n", + "recall = 4.99 9.17 12.65 15.67 18.28 20.56 22.50 24.21 25.74 27.15\n" + ] + } + ], + "source": [ + "Y_pred_cost = xlm_cost.predict(X_tst, beam_size=10, only_topk=10)\n", + "metrics_cost = smat_util.Metrics.generate(Y_tst, Y_pred_cost, topk=10)\n", + "print(\"Evaluation Metrics with Cost-sensitive Learning\")\n", + "print(metrics_cost)\n", + "print(\"\\nOriginal Evaluation Metrics\")\n", + "print(metrics)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + 
"language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb b/tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb deleted file mode 100644 index fdb579f2..00000000 --- a/tutorials/kdd22/Session 2 Extreme Multi-label Ranking with PECOS.ipynb +++ /dev/null @@ -1,1505 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "67e70878", - "metadata": {}, - "source": [ - "# eXtreme Multi-label Ranking (XMR) Problem and PECOS\n", - "\n", - "Prediction for Enormous and Correlated Output Spaces (PECOS) is a versatile and modular machine learning framework for solving prediction problems with very large outputs spaces. For a given input instance, we apply PECOS to the eXtreme Multilabel Ranking (XMR) problem to find and rank the most relevant items from an enormous but fixed and finite output space.\n", - "\n", - "

\n", - "\n", - "As shown in the above figure, to address the XMR problem, PECOS conceptually consists of three stages, including semantic label indexing, machine-learned matching, and ranking. In this part of the tutorial, we will use XR-Linear as an example to demonstrate how to use PECOS to tackle real-world problems and understrand the model architecture in PECOS.\n", - "\n", - "### Install PECOS through Python PIP" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "6d9fa78b", - "metadata": {}, - "outputs": [], - "source": [ - "! pip install libpecos" - ] - }, - { - "cell_type": "markdown", - "id": "41d87d24", - "metadata": {}, - "source": [ - "## Experimental Dataset\n", - "\n", - "`eurlex-4k`, `wiki10-31k`, `amazoncat-13k`, `amazon-670k`, `wiki-500k`, and `amazon-3m` are available." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "1073ac9c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2022-07-14 08:54:02 URL:https://ia802308.us.archive.org/21/items/pecos-dataset/xmc-base/wiki10-31k.tar.gz [162277861/162277861] -> \"wiki10-31k.tar.gz\" [1]\n", - "xmc-base/wiki10-31k/output-items.txt\n", - "xmc-base/wiki10-31k/tfidf-attnxml\n", - "xmc-base/wiki10-31k/tfidf-attnxml/X.trn.npz\n", - "xmc-base/wiki10-31k/tfidf-attnxml/X.tst.npz\n", - "xmc-base/wiki10-31k/X.trn.txt\n", - "xmc-base/wiki10-31k/X.tst.txt\n", - "xmc-base/wiki10-31k/Y.trn.npz\n", - "xmc-base/wiki10-31k/Y.trn.txt\n", - "xmc-base/wiki10-31k/Y.tst.npz\n", - "xmc-base/wiki10-31k/Y.tst.txt\n" - ] - } - ], - "source": [ - "DATASET = \"wiki10-31k\"\n", - "! wget -nv -nc https://archive.org/download/pecos-dataset/xmc-base/{DATASET}.tar.gz\n", - "! tar --skip-old-files -zxf {DATASET}.tar.gz \n", - "! 
find xmc-base/{DATASET}/*" - ] - }, - { - "cell_type": "markdown", - "id": "73f0fa78", - "metadata": {}, - "source": [ - "### Analyze Sparse Features and Label Space" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "d680e1e0", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import scipy.sparse as smat\n", - "import matplotlib.pyplot as plt\n", - "X_trn = smat.load_npz(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\")\n", - "Y_trn = smat.load_npz(f\"xmc-base/{DATASET}/Y.trn.npz\")" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "0b34281d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'14146 instances with 101938 features.'" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"{} instances with {} features.\".format(*X_trn.shape)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "7f16a000", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Overall Sparsity: 99.34%'" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"Overall Sparsity: {:.2f}%\".format(100 * (1 - X_trn.nnz / (X_trn.shape[0] * X_trn.shape[1])))" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "dcf0f0cf", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "counts, bins = np.histogram(100 - 100 * X_trn.getnnz(1) / X_trn.shape[1], bins=20)\n", - "plt.hist(bins[:-1], bins, weights=counts)\n", - "plt.title(DATASET);\n", - "plt.xlabel(\"Feature Sparsity (%)\");\n", - "plt.ylabel(\"Number of Instances\");" - ] - }, - { - "cell_type": "markdown", - "id": "a1e0157b", - "metadata": {}, - "source": [ - "### Extremely large label space" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "49f8fe28", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'14146 instances with 30938 labels.'" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"{} instances with {} labels.\".format(*Y_trn.shape)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "64f5fc9b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Overall Sparsity: 99.94%'" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"Overall Sparsity: {:.2f}%\".format(100 * (1 - Y_trn.nnz / (Y_trn.shape[0] * Y_trn.shape[1])))" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "b5cb2084", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "counts, bins = np.histogram(100 - 100 * Y_trn.getnnz(1) / Y_trn.shape[1], bins=20, range=(99.85, 100))\n", - "plt.hist(bins[:-1], bins, weights=counts)\n", - "plt.title(DATASET);\n", - "plt.xlabel(\"Label Sparsity (%)\");\n", - "plt.ylabel(\"Number of Instances\");" - ] - }, - { - "cell_type": "markdown", - "id": "057fb642", - "metadata": {}, - "source": [ - "## Numerical Feature and Label Format in PECOS\n", - "\n", - "In PECOS, numerical features of instances can be in either a [dense NumPy matrix](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or a [Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) of shape `(nr_inst, nr_feat)`, where `nr_inst` and `nr_feat` are numbers of instances and features. Similary, labels of instances can be also presented as a dense or a sparse matrix of shape `(nr_inst, nr_labels)`, where `nr_labels` is the number of labels in the XMR problem. Note that for the sparse format, training labels should be a [Compressed Sparse Column (CSC) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html) while testing labels should be a CSR matrix for the purpose of computational efficiency. For convenience, PECOS also provides APIs for loading features and labels from binary files in arbitary formats.\n", - "\n", - "In addition to numerical features, PECOS also supports handling text data with transformer." 
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "c518d892", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Training features X_trn is a csr matrix of shape (14146, 101938).\n", - "Training labels Y_trn is a csc matrix of shape (14146, 30938).\n", - "Testing features X_tst is a csr matrix of shape (6616, 101938).\n", - "Testing labels Y_tst is a csr matrix of shape (6616, 30938).\n" - ] - } - ], - "source": [ - "import numpy as np\n", - "from pecos.xmc.xlinear.model import XLinearModel\n", - "\n", - "DATASET = \"wiki10-31k\"\n", - "\n", - "X_trn = XLinearModel.load_feature_matrix(\"xmc-base/{}/tfidf-attnxml/X.trn.npz\".format(DATASET))\n", - "Y_trn = XLinearModel.load_label_matrix(\"xmc-base/{}/Y.trn.npz\".format(DATASET), for_training=True)\n", - "\n", - "X_tst = XLinearModel.load_feature_matrix(\"xmc-base/{}/tfidf-attnxml/X.tst.npz\".format(DATASET))\n", - "Y_tst = XLinearModel.load_label_matrix(\"xmc-base/{}/Y.tst.npz\".format(DATASET), for_training=False)\n", - "\n", - "print(f\"Training features X_trn is a {X_trn.getformat()} matrix of shape {X_trn.shape}.\")\n", - "print(f\"Training labels Y_trn is a {Y_trn.getformat()} matrix of shape {Y_trn.shape}.\")\n", - "print(f\"Testing features X_tst is a {X_tst.getformat()} matrix of shape {X_tst.shape}.\")\n", - "print(f\"Testing labels Y_tst is a {Y_tst.getformat()} matrix of shape {Y_tst.shape}.\")" - ] - }, - { - "cell_type": "markdown", - "id": "b0c731f5", - "metadata": {}, - "source": [ - "## Hands-on Example: XMR with XR-Linear\n", - "\n", - "XR-LINEAR is a recursive linear machine learned realization of our PECOS framework. As shown in the below figure, XR-Linear treats machine-learned matching as a smaller XMR problem, thereby recursively apply the three-stage framework of PECOS to address the problem.\n", - "\n", - "

\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "150fea14", - "metadata": {}, - "source": [ - "### Semantic Label Indexing and Cluster Chain in XR-Linear\n", - "\n", - "The first step of training an XR-Linear model is to conduct semantic label indexing and establish the *hierarchial label tree* for resursive training the XR-Linear model and its inference. \n", - "\n", - "PECOS supports any method for semantic label indexing. In the PECOS library, as a build-in method, we provide Label Representation via Positive Instance Feature Aggregation (PIFA) for semantic label indexing with only the need of positive instances and their features in training data. PECOS can also consider additional label features `Z` of shape `(nr_labels, nr_label_feat)` in either dense or sparse matrix format, where `nr_label_feat` is the number of label features. These representations and features for each label are concatenated or combined as label embedding in `LabelEmbeddingFactory` in PECOS.\n", - "\n", - "To conduct semantic label indexing, PECOS learns an indexer based on label embedding. PECOS currently supports to use the Hierarchical K-Means for semantic label indexing with a hyper-parameter `nr_splits` (the number of clusters in each layer, or `B` in [our report](https://arxiv.org/pdf/2010.05878.pdf)), which decides the depth `D` of the hierarchical label tree. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "26794215", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "4 layers in the trained hierarchical label tree.\n" - ] - } - ], - "source": [ - "from pecos.xmc import Indexer, LabelEmbeddingFactory\n", - "\n", - "label_feat = LabelEmbeddingFactory.create(Y_trn, X_trn, method=\"pifa\")\n", - "# label_feat = LabelEmbeddingFactory.create(Y_trn, X_trn, Z, method=\"pifa_lf_concat\") # for using label features Z\n", - "\n", - "cluster_chain = Indexer.gen(label_feat, nr_splits=8, indexer_type=\"hierarchicalkmeans\")\n", - "\n", - "print(f\"{len(cluster_chain)} layers in the trained hierarchical label tree.\")" - ] - }, - { - "cell_type": "markdown", - "id": "02ffda21", - "metadata": {}, - "source": [ - "### Training XR-Linear Negative Sampling and Sparsification\n", - "\n", - "Negative sampling plays an important role in solving the XMR problem. PECOS currently provides two negative sampling schemes, including Teacher Forcing Negatives (TFN) and Matcher Aware Negatives (MAN). Please refer to [our report](https://arxiv.org/pdf/2010.05878.pdf) for more details about negative sampling schemes.\n", - "\n", - "To reduce model sizes and improve efficiency, PECOS conduct model sparsification with a hyper-parameter `threshold`. The model weights with absolute values smaller than the threshold will be discarded." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "bd3d6527", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Training time: 40.9793 seconds.\n" - ] - } - ], - "source": [ - "import time\n", - "start_time = time.time()\n", - "\n", - "# For negative_sampling_scheme in model training, \"man\" and tfn+man\" are also available.\n", - "xlm = XLinearModel.train(X_trn, Y_trn, C=cluster_chain, threshold=0.1, negative_sampling_scheme=\"tfn\")\n", - "\n", - "training_time = time.time() - start_time\n", - "print(f\"Training time: {training_time:.4f} seconds.\")" - ] - }, - { - "cell_type": "markdown", - "id": "20f5cfa7", - "metadata": {}, - "source": [ - "PECOS supports serializing and loading the trained model into binary on disk with convenient interfaces. Note that model loading with `is_predict_only=True` could lead to faster prediction speed by disabling the flexibility of model modification." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "3d5d468d", - "metadata": {}, - "outputs": [], - "source": [ - "xlm.save(\"{}.xlm.model\".format(DATASET))\n", - "xlm = XLinearModel.load(\"{}.xlm.model\".format(DATASET), is_predict_only=False)" - ] - }, - { - "cell_type": "markdown", - "id": "4b6038ec", - "metadata": {}, - "source": [ - "### Prediction and Evaluation\n", - "\n", - "As a tree model, the inference method significantly affects the prediction efficiency of XR-Linear in PECOS. As illustrated in the following figure, the prediction process in PECOS employs a beam search with a hyper-parameter `beam_size`. The other hyper-parameter `only_topk` also needs to be decided to limit the predicted most relevant labels for each instance. The `predict` function of the trained model will result in a CSR matrix of shape `(nr_inst, nr_labels)` and exactly `only_topk` non-zero columns for each row (or instance).\n", - "\n", - "
\n", - "
\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "7f851bc1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Y_pred is a csr matrix of shape (6616, 30938) and 66160 non-zero elements.\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "Y_pred = xlm.predict(X_tst, beam_size=10, only_topk=10)\n", - "\n", - "print(f\"Y_pred is a {Y_pred.getformat()} matrix of shape {Y_pred.shape} and {Y_pred.nnz} non-zero elements.\")\n", - "\n", - "import matplotlib.pyplot as plt\n", - "plt.plot(range(Y_pred.shape[0]), Y_pred.getnnz(1))\n", - "plt.xlabel(\"Instance ID\");\n", - "plt.ylabel(\"Number of Predictions in Y_pred\");" - ] - }, - { - "cell_type": "markdown", - "id": "fb1ed22c", - "metadata": {}, - "source": [ - "For evaluation, we evaluate the trained model with conventional ranking metrics, including Precision@K and Recall@K. PECOS also provides the evaluation interface for predicted sparse matrices." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "4c57da1a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "prec = 84.07 78.17 72.68 67.79 63.79 60.06 56.63 53.51 50.83 48.33\n", - "recall = 4.97 9.16 12.68 15.60 18.25 20.49 22.40 24.05 25.60 26.95\n" - ] - } - ], - "source": [ - "from pecos.utils import smat_util\n", - "metrics = smat_util.Metrics.generate(Y_tst, Y_pred, topk=10)\n", - "print(metrics)" - ] - }, - { - "cell_type": "markdown", - "id": "32908310", - "metadata": {}, - "source": [ - "### Dive Deep in Cluster Chain\n", - "\n", - "Specifically, PECOS trains a *cluster_chain* of `D` matching matrices `C[d]`, where `C[d]` is a CSC matrix of shape `(L[d], K[d])`; `L[d]` and `K[d]` are the numbers of labels and clusters in the layer `d`. Note that the clusters of a layer would be the labels of the next layer. The labels of the last layer `L[D - 1]` would be the labels of the overall XMR problem `nr_labels`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "6b0cb55e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "4 layers in the trained hierarchical label tree with C[d] as:\n", - "cluster_chain[0] is a csc matrix of shape (8, 1)\n", - "cluster_chain[1] is a csc matrix of shape (64, 8)\n", - "cluster_chain[2] is a csc matrix of shape (512, 64)\n", - "cluster_chain[3] is a csc matrix of shape (30938, 512)\n" - ] - } - ], - "source": [ - "print(f\"{len(cluster_chain)} layers in the trained hierarchical label tree with C[d] as:\")\n", - "for d, C in enumerate(cluster_chain):\n", - " print(f\"cluster_chain[{d}] is a {C.getformat()} matrix of shape {C.shape}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "55eeb4e5", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "import matplotlib.pyplot as plt\n", - "from pecos.core import clib\n", - "from pecos.utils import smat_util\n", - "\n", - "fig, axes = plt.subplots(nrows=len(cluster_chain), ncols=1)\n", - "fig.tight_layout()\n", - "\n", - "cur_Y = Y_tst\n", - "\n", - "counts, bins = np.histogram(cur_Y.getnnz(1), bins=16) \n", - "ax = plt.subplot(len(cluster_chain), 1, len(cluster_chain))\n", - "ax.hist(bins[:-1], bins, weights=counts)\n", - "ax.set_title(\"Layer {}\".format(len(cluster_chain) - 1))\n", - "plt.ylabel(\"#Instances\")\n", - "\n", - "for d in range(len(cluster_chain) - 1, 0, -1):\n", - " cur_Y = smat_util.binarized(clib.sparse_matmul(cur_Y, cluster_chain[d]))\n", - " counts, bins = np.histogram(cur_Y.getnnz(1), bins=min(16, cluster_chain[d].shape[1])) \n", - " ax = plt.subplot(len(cluster_chain), 1, d)\n", - " ax.hist(bins[:-1], bins, weights=counts)\n", - " ax.set_title(\"Layer {}\".format(d - 1))\n", - " plt.ylabel(\"#Instances\")\n", - " \n", - " \n", - "plt.subplot(len(cluster_chain), 1, len(cluster_chain))\n", - "plt.xlabel(\"Number of Belonged Clusters/Labels\");" - ] - }, - { - "cell_type": "markdown", - "id": "790b21dc", - "metadata": {}, - "source": [ - "### Dive Deep in Model Weights\n", - "\n", - "Model weights in an XR-Linear model are also accessible as `model_chain` for analysis and computations. For the i-th layer in the hierarchy, the model weights of matchers/rankers are available as a CSC matrix of shape `(nr_feat + 1, L[i])`, which concatenates weights for features and the bias term. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "9e101f6b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "model_chain[0].W is a csc matrix of shape (101939, 8)\n", - "model_chain[1].W is a csc matrix of shape (101939, 64)\n", - "model_chain[2].W is a csc matrix of shape (101939, 512)\n", - "model_chain[3].W is a csc matrix of shape (101939, 30938)\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "for d, m in enumerate(xlm.model.model_chain):\n", - " print(\"model_chain[{}].W is a {} matrix of shape {}\".format(d, m.W.getformat(), m.W.shape))\n", - "\n", - "layer_d = 1\n", - "from sklearn.decomposition import TruncatedSVD\n", - "svd = TruncatedSVD(n_components=2, random_state=0)\n", - "Wt = svd.fit_transform(xlm.model.model_chain[layer_d].W.transpose())\n", - "\n", - "import numpy as np\n", - "color = cluster_chain[layer_d].tocsr() * np.arange(cluster_chain[layer_d].shape[1])\n", - "\n", - "import matplotlib.pyplot as plt\n", - "plt.scatter(Wt[:, 0], Wt[:, 1], c=color)\n", - "plt.xlim(-4, 4);\n", - "plt.ylim(-10, 10);" - ] - }, - { - "cell_type": "markdown", - "id": "ae9baf18", - "metadata": {}, - "source": [ - "### PECOS and One-versus-All (OVA) Model\n", - "\n", - "PECOS also supports to train an OVA model without leveraing clustering hierarchy if needed.\n", - "\n", - "**Training OVA models is time-consuming, we suggest to try it offline after the tutorial.**" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "c95f0acf", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Training time for the OVA model: 1047.3194 seconds.\n", - "XR-Linear is 25.56 times faster than the OVA model\n" - ] - } - ], - "source": [ - "import time\n", - "start_time = time.time()\n", - "\n", - "xlm_ova = XLinearModel.train(X_trn, Y_trn, C=None, negative_sampling_scheme=\"tfn\") \n", - "\n", - "training_time_ova = time.time() - start_time\n", - "print(\"Training time for the OVA model: {:.4f} seconds.\".format(training_time_ova))\n", - "\n", - "print(\"XR-Linear is {:.2f} times faster than the OVA model\".format(training_time_ova / training_time))" - ] - }, - { - "cell_type": "markdown", - "id": "73a599a0", - "metadata": {}, - "source": [ - "## Customized Parameters and Advanced Training 
Options\n", - "\n", - "PECOS also supports using customized parameters and several advanced training options, such as different solvers and cost-sensitive learning.\n", - "\n", - "### Customized Parameters\n", - "\n", - "The parameters for indexing, training, and inference can each be customized by passing a dictionary to the `from_dict` constructor of the corresponding parameter class:\n", - "\n", - "* Semantic Indexing (Hierarchical K-Means): `HierarchicalKMeans.TrainParams.from_dict(dict)`\n", - "* Training: `XLinearModel.TrainParams.from_dict(dict)`\n", - "* Inference: `XLinearModel.PredParams.from_dict(dict)`\n", - "\n", - "Although most of the parameters can also be passed by `kwargs` of Python methods, **we encourage using a dictionary to designate the parameters because it is easier to manage, modularize, and store parameters in certain formats like JSON.**\n", - "\n", - "For XR-Linear models, the default values and skeleton of the parameters can be generated by the following command:" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "1ddc9bfa", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{\r\n", - " \"train_params\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.xlinear.model###XLinearModel.TrainParams\"\r\n", - " },\r\n", - " \"mode\": \"full-model\",\r\n", - " \"ranker_level\": 1,\r\n", - " \"nr_splits\": 16,\r\n", - " \"min_codes\": null,\r\n", - " \"shallow\": false,\r\n", - " \"rel_mode\": \"disable\",\r\n", - " \"rel_norm\": \"no-norm\",\r\n", - " \"hlm_args\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.base###HierarchicalMLModel.TrainParams\"\r\n", - " },\r\n", - " \"neg_mining_chain\": \"tfn\",\r\n", - " \"model_chain\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.base###MLModel.TrainParams\"\r\n", - " },\r\n", - " \"threshold\": 0.1,\r\n", - " \"max_nonzeros_per_label\": 
null,\r\n", - " \"solver_type\": \"L2R_L2LOSS_SVC_DUAL\",\r\n", - " \"Cp\": 1.0,\r\n", - " \"Cn\": 1.0,\r\n", - " \"max_iter\": 100,\r\n", - " \"eps\": 0.1,\r\n", - " \"bias\": 1.0,\r\n", - " \"threads\": -1,\r\n", - " \"verbose\": 0,\r\n", - " \"newton_eps\": 0.01\r\n", - " }\r\n", - " }\r\n", - " },\r\n", - " \"pred_params\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.xlinear.model###XLinearModel.PredParams\"\r\n", - " },\r\n", - " \"hlm_args\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.base###HierarchicalMLModel.PredParams\"\r\n", - " },\r\n", - " \"model_chain\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.base###MLModel.PredParams\"\r\n", - " },\r\n", - " \"only_topk\": 20,\r\n", - " \"post_processor\": \"l3-hinge\"\r\n", - " }\r\n", - " }\r\n", - " },\r\n", - " \"indexer_params\": {\r\n", - " \"__meta__\": {\r\n", - " \"class_fullname\": \"pecos.xmc.base###HierarchicalKMeans.TrainParams\"\r\n", - " },\r\n", - " \"nr_splits\": 16,\r\n", - " \"min_codes\": null,\r\n", - " \"max_leaf_size\": 100,\r\n", - " \"imbalanced_ratio\": 0.0,\r\n", - " \"imbalanced_depth\": 100,\r\n", - " \"spherical\": true,\r\n", - " \"seed\": 0,\r\n", - " \"kmeans_max_iter\": 20,\r\n", - " \"threads\": -1\r\n", - " }\r\n", - "}\r\n" - ] - } - ], - "source": [ - "! python3 -m pecos.xmc.xlinear.train --generate-params-skeleton" - ] - }, - { - "cell_type": "markdown", - "id": "35472517", - "metadata": {}, - "source": [ - "### Training Parameters for Hierarchical Models in XR-Linear\n", - "\n", - "Hierarchical models can use different parameters for different layers. To have customized parameters for the hierarchical model, `hlm_args` needs to be designated in the parameter dictionary. 
The values of `model_chain` and `neg_mining_chain` in `hlm_args` can be **a single dictionary** of general parameters for all layers or **a list of dictionaries** for specific parameters of individual layers.\n", - "\n", - "#### General Parameters for All Layers\n", - "\n", - "```\n", - "train_params_l1 = XLinearModel.TrainParams.from_dict(\n", - " {\n", - " ...\n", - " \"hlm_args\": {\n", - " ...\n", - " \"neg_mining_chain\": \"tfn\", # Negative sampling scheme for all layers\n", - " \"model_chain\":{...}, # Parameters for all layers\n", - " }\n", - " ...\n", - " })\n", - "```\n", - "\n", - "#### Specific Parameters of Individual Layers\n", - "\n", - "```\n", - "train_params_l1 = XLinearModel.TrainParams.from_dict(\n", - " {\n", - " ...\n", - " \"hlm_args\": {\n", - " ...\n", - " \"neg_mining_chain\": [\n", - " \"tfn\", # Negative sampling scheme for layer-0\n", - " \"tfn\", # Negative sampling scheme for layer-1\n", - " \"tfn+man\", # Negative sampling scheme for layer-2\n", - " ...\n", - " ],\n", - " \"model_chain\": [\n", - " {...}, # Parameters for layer-0\n", - " {...}, # Parameters for layer-1\n", - " {...}, # Parameters for layer-2\n", - " ...\n", - " ],\n", - " }\n", - " ...\n", - " })\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "81f86b8a", - "metadata": {}, - "source": [ - "### Variety of Solvers\n", - "\n", - "The solver for optimization can be adjusted by the argument `solver_type` in the `train` function. 
PECOS currently provides the following solvers for training each matcher/ranker:\n", - "\n", - "* \"L2R_L2LOSS_SVC_DUAL\" (default): L2-regularized L2-loss Dual SVM\n", - "* \"L2R_L1LOSS_SVC_DUAL\": L2-regularized L1-loss Dual SVM\n", - "* \"L2R_LR_DUAL\": L2-regularized Logistic Regression" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "8f26ee42", - "metadata": {}, - "outputs": [], - "source": [ - "xlm_l1_kwargs = XLinearModel.train(\n", - " X_trn, Y_trn,\n", - " C=cluster_chain,\n", - " threshold=0.1,\n", - " negative_sampling_scheme=\"tfn\",\n", - " solver_type=\"L2R_L1LOSS_SVC_DUAL\")" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "197926a7", - "metadata": {}, - "outputs": [], - "source": [ - "train_params_l1 = XLinearModel.TrainParams.from_dict(\n", - " {\n", - " \"hlm_args\": {\n", - " \"threshold\": 0.1,\n", - " \"neg_mining_chain\": \"tfn\",\n", - " \"model_chain\":{\n", - " \"solver_type\": \"L2R_L1LOSS_SVC_DUAL\",\n", - " },\n", - " }\n", - " }\n", - ")\n", - "\n", - "xlm_l1_dict = XLinearModel.train(\n", - " X_trn, Y_trn,\n", - " C=cluster_chain,\n", - " train_params=train_params_l1)" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "eddf91a6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by method kwargs)\n", - "prec = 83.43 77.66 72.47 67.67 63.73 60.18 56.90 54.04 51.45 49.04\n", - "recall = 4.93 9.11 12.62 15.59 18.19 20.51 22.49 24.28 25.92 27.36\n", - "\n", - "Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by dictionary)\n", - "prec = 83.43 77.66 72.47 67.67 63.73 60.18 56.90 54.04 51.45 49.04\n", - "recall = 4.93 9.11 12.62 15.59 18.19 20.51 22.49 24.28 25.92 27.36\n" - ] - } - ], - "source": [ - "Y_pred_l1_kwargs = xlm_l1_kwargs.predict(X_tst, beam_size=10, only_topk=10)\n", - "Y_pred_l1_dict = xlm_l1_dict.predict(X_tst, beam_size=10, only_topk=10)\n", - "metrics_l1_kwargs = 
smat_util.Metrics.generate(Y_tst, Y_pred_l1_kwargs, topk=10)\n", - "metrics_l1_dict = smat_util.Metrics.generate(Y_tst, Y_pred_l1_dict, topk=10)\n", - "\n", - "print(\"Evaluation Metrics with L2R_L1LOSS_SVC_DUAL (by method kwargs)\")\n", - "print(metrics_l1_kwargs)\n", - "\n", - "print(\"\\nEvaluation Metrics with L2R_L1LOSS_SVC_DUAL (by dictionary)\")\n", - "print(metrics_l1_dict)" - ] - }, - { - "cell_type": "markdown", - "id": "9f481ed3", - "metadata": {}, - "source": [ - "### Cost-sensitive Learning\n", - "\n", - "PECOS supports adjusting the cost of each training instance. To enable cost-sensitive learning, we need to provide a **relevance matrix** `R_trn` with the same shape as the label matrix `Y_trn` for the argument `R`. When `R` is `None` (default), cost-sensitive learning is disabled. \n", - "\n", - "Since PECOS models are usually hierarchical, costs for the upper layers also need to be decided; the argument `rel_mode` selects the cost-sensitive learning mode that determines them. Currently, PECOS supports the following cost-sensitive learning modes:\n", - "\n", - "* `\"disable\"` (default): Cost-sensitive learning is disabled.\n", - "* `\"induce\"`: Induce the costs into upper layers by the clustering chain.\n", - "* `\"ranker-only\"`: Only apply cost-sensitive learning to the model in the last ranker layer without induction.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "382277f3", - "metadata": {}, - "outputs": [], - "source": [ - "# An example of using training label frequency scores as costs. 
\n", - "import copy\n", - "from sklearn.preprocessing import normalize\n", - "\n", - "R_trn = copy.deepcopy(Y_trn)\n", - "\n", - "# Training parameters for cost-sensitive learning.\n", - "train_params_cost = XLinearModel.TrainParams.from_dict(\n", - " {\n", - " \"rel_mode\": \"induce\",\n", - " \"rel_norm\": \"l1\",\n", - " \"hlm_args\": {\n", - " \"neg_mining_chain\": \"tfn\",\n", - " \"model_chain\":\n", - " [\n", - " {\n", - " \"threshold\": 0.1,\n", - " \"Cp\": 1.0,\n", - " \"Cn\": 1.0,\n", - " },\n", - " {\n", - " \"threshold\": 0.1,\n", - " \"Cp\": 8.0,\n", - " \"Cn\": 1.0,\n", - " },\n", - " {\n", - " \"threshold\": 0.1,\n", - " \"Cp\": 4.0,\n", - " \"Cn\": 1.0,\n", - " },\n", - " {\n", - " \"threshold\": 0.1,\n", - " \"Cp\": 4.0,\n", - " \"Cn\": 1.0,\n", - " },\n", - " ],\n", - " }\n", - " })\n", - " \n", - "# Cost-sensitive learning.\n", - "xlm_cost = XLinearModel.train(\n", - " X_trn, Y_trn,\n", - " C=cluster_chain,\n", - " R=R_trn,\n", - " train_params=train_params_cost)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "559c15cb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Evaluation Metrics with Cost-sensitive Learning\n", - "prec = 85.02 80.58 74.57 69.37 64.79 60.82 57.34 54.29 51.37 48.81\n", - "recall = 5.02 9.46 13.02 15.99 18.54 20.74 22.69 24.42 25.92 27.27\n", - "\n", - "Original Evaluation Metrics\n", - "prec = 84.07 78.17 72.68 67.79 63.79 60.06 56.63 53.51 50.83 48.33\n", - "recall = 4.97 9.16 12.68 15.60 18.25 20.49 22.40 24.05 25.60 26.95\n" - ] - } - ], - "source": [ - "Y_pred_cost = xlm_cost.predict(X_tst, beam_size=10, only_topk=10)\n", - "metrics_cost = smat_util.Metrics.generate(Y_tst, Y_pred_cost, topk=10)\n", - "print(\"Evaluation Metrics with Cost-sensitive Learning\")\n", - "print(metrics_cost)\n", - "print(\"\\nOriginal Evaluation Metrics\")\n", - "print(metrics)" - ] - }, - { - "cell_type": "markdown", - "id": "1663277d", - "metadata": {}, - "source": [ 
- "# Customized PECOS Model\n", - "\n", - "Besides pre-defined models in PECOS, such as XR-Linear, it is also convenient for users to customize PECOS for specific purposes and usage. Specifically, we suggest to establishing a model class to wrap fundamental PECOS functions and tailored operations. As a result, the customized model can be easily constructed and consumed for arbitrary data types and feature extractors. \n", - "\n", - "## Structure of a Customized PECOS Model\n", - "\n", - "Even though a customized machine learning pipeline can be seperated into several independent scripts, we recommend declaring a customized PECOS model as a **model class** for better re-usability and code maintenance.\n", - "\n", - "A customized PECOS model should at least consist of the following components:\n", - "\n", - "* `preprocessor` or `encoder`: The procedure, which can be a method or a functionable object, pre-processes or encodes an arbitrary input with the designated data format into features. For example, text data and image data can be encoded by BERT and ResNet.\n", - "* `train()`: The training method takes a set of training data with a preprocessor, learns a primitive PECOS model, and returns a PECOS-based customized machine learning model. The training function could be a class method to construct the model object with the learned model and essential components after training.\n", - "* `model`: A primitive PECOS model taking pre-processed features is capable of deriving the predictions for arbitrary testing data. The model weights should be learned by `train()`. 
\n", - "* `predict()`: The prediction method takes arbitrary testing data and infers the prediction based on the pre-processor and the learned model.\n", - "* `save()`: The saving function serializes the trained model, including model weights and configuration, for further usage.\n", - "* `load()`: The loading function reads the serialized model so that the trained model can be loaded and re-used.\n", - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "In this part of the tutorial, we will use the task of *extreme multi-label text classification* as an example to demonstrate how to **customize a PECOS model that can handle text data with either a conventional bag-of-words (BoW) model or a deep learning model as the text encoder for feature extraction**.\n" - ] - }, - { - "cell_type": "markdown", - "id": "c3acc325", - "metadata": {}, - "source": [ - "## Example: eXtreme Multi-label Text Classification (XMTC)\n", - "\n", - "The task of extreme multi-label text classification (XMTC) seeks to find relevant labels from an extreme large label collection for a given text input. Many real-world applications can be formulated as XMTC tasks, such as recommendation systems, document tagging, and semantic search. \n", - "\n", - "In this section, we guide through how to establish a customized PECOS model for XMTC tasks. We will walk through (1) PECOS' built-in BOW model for text preprocessing and vectorizing; (2) how to customize a PECOS model; and (3) \n", - "advanced usage of XR-Transformer based on deep learning.\n" - ] - }, - { - "cell_type": "markdown", - "id": "3b2ec0e6", - "metadata": {}, - "source": [ - "### Preprocessor: Text Preprocessing and Vectorizing\n", - "\n", - "The preprocessor plays a role of encoding input data into machine readable vector representations. Any encoder that can transform text data into a vector representation can be considered as the preprocessor or encoder of a customized PECOS model for XMTC tasks.\n", - "\n", - "In the PECOS library, we provide [various text vectorizers](https://github.com/amzn/pecos/blob/mainline/pecos/utils/featurization/text/vectorizers.py), such as TF-IDF, hashing, and pretrained transformer, as **built-in preprocessors** to deal with text data. 
In this tutorial, we will utilize the [n-gram](https://en.wikipedia.org/wiki/N-gram) [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) model as our preprocessor.\n", - "\n", - "#### Label Space File Format for Built-in Text Preprocessors\n", - "\n", - "The label space is also essential for text preprocessors, especially for determining the label space size to create the appropriate label matrix. The label IDs start from zero and correspond to the line numbers of the label space file, where each line holds the text description of that label." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "3f48f4f7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Artificial intelligence researchers\r\n", - "Computability theorists\r\n", - "British computer scientists\r\n", - "Machine learning researchers\r\n", - "Turing Award laureates\r\n", - "Deep Learning\r\n" - ] - } - ], - "source": [ - "! cat \"./text2text_demo/output-labels.txt\"" - ] - }, - { - "cell_type": "markdown", - "id": "e0862645", - "metadata": {}, - "source": [ - "#### Data File Format for Built-in Text Preprocessors\n", - "\n", - "The built-in text preprocessors in PECOS mainly take text data files with labels in a tab-separated values (TSV) format. Each line in the TSV file consists of two tab-separated elements: the comma-separated label IDs and the input text of a data instance. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "bd5ebfc6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0,1,2\tAlan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.\r\n", - "0,2,3\tHinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks.\r\n", - "3,4,5\tHinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on artificial intelligence and deep learning.\r\n", - "0,3,5\tYoshua Bengio is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning.\r\n" - ] - } - ], - "source": [ - "! cat ./text2text_demo/training-data.txt" - ] - }, - { - "cell_type": "markdown", - "id": "566d1eb5", - "metadata": {}, - "source": [ - "The data file format also supports to represent the label relevance for cost-sensitive learning by using double colons to separate a label and its relevance.\n", - "\n", - "

\n", - "0::0.1,1::0.2,2::0.8 <TAB> Alan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.

\n" - ] - }, - { - "cell_type": "markdown", - "id": "4d4e4419", - "metadata": {}, - "source": [ - "#### Training a Text Preprocessor\n", - "\n", - "The preprocessor model `Preprocessor` is defined in `pecos.utils.featurization.text.preprocess`. Given a training text corpus and the configuration dictionary, the class method `Preprocessor.train` will train a corresponding text preprocesssor. Besides, the built-in preprocessors also support serialization with the function `save()` for the re-usability.\n", - "\n", - "With the previously mentioned data and label space file formats, the utility function `Preprocessor.load_data_from_file(input_text_path, output_text_path)` returns a dictionary with three keys:\n", - "\n", - "* `label_matrix`: a `(num_inst, num_labels)` CSR matrix for the labels of each instance.\n", - "* `label_relevance`: `None` or a `(num_inst, num_labels)` CSR matrix for the relevance of each label in cost-sensitive learning if available.\n", - "* `corpus`: a list of string as the text corpus in the input_text_path.\n", - "\n", - "The configuration settings of text preprocessor including the preprocessor type and hyper-parameters should be defined in a dictionary. Specifially, the key `type` defines the preprocessor choice while the key `kwargs` represents the hyper-parameters. In this tutorial, we adopt n-gram TFIDF features containing *word unigrams*, *word bigrams*, and *character trigrams*. Note that each of the n-gram feature can have different hyper-parameters, such as `max_feature` and `max_df`. Users need to properly set max_feature (e.g., hundred of thousands or millions) based on the corpus size and downstream tasks." 
- ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "b7f70a8f", - "metadata": {}, - "outputs": [], - "source": [ - "from pecos.utils.featurization.text.preprocess import Preprocessor\n", - "\n", - "input_text_path = \"./text2text_demo/training-data.txt\"\n", - "output_text_path = \"./text2text_demo/output-labels.txt\"\n", - "model_folder = \"./text2text_demo/pecos-text2text-model\"\n", - "\n", - "parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path) # Read files\n", - "corpus = parsed_result[\"corpus\"] # Corpus input text: List of strings\n", - "\n", - "vectorizer_config = {\n", - " \"type\": \"tfidf\",\n", - " \"kwargs\": {\n", - " \"base_vect_configs\": [\n", - " \n", - " {\n", - " \"ngram_range\": [1, 1],\n", - " \"max_df_ratio\": 0.98,\n", - " \"analyzer\": \"word\",\n", - " },\n", - " {\n", - " \"ngram_range\": [2, 2],\n", - " \"max_df_ratio\": 0.98,\n", - " \"analyzer\": \"word\",\n", - " },\n", - " {\n", - " \"ngram_range\": [3, 3],\n", - " \"max_df_ratio\": 0.98,\n", - " \"analyzer\": \"char_wb\",\n", - " },\n", - " ],\n", - " },\n", - " }\n", - "\n", - "preprocessor = Preprocessor.train(corpus, vectorizer_config)\n", - "preprocessor.save(model_folder) " - ] - }, - { - "cell_type": "markdown", - "id": "a0300f8c", - "metadata": {}, - "source": [ - "#### Preprocessing with a Trained Text Preprocessor\n", - "\n", - "The function `predict` of a trained text preprocessor encodes texts in a **text data file** into a CSR matrix of shape `(num_inst, dim)` as numerical vector representations, where `num_inst` is the number of instances in the file; `dim` is the number of feature dimensions." 
- ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "3b182171", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The file consists of 4 instances with 405-dimensional features in a csr matrix.\n", - "\n", - "Text 0: Alan Turing is widely considered to be the father of theoretical computer science and artificial intelligence.\n", - "Text 1: Hinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks.\n", - "Text 2: Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on artificial intelligence and deep learning.\n", - "Text 3: Yoshua Bengio is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning.\n", - "\n", - "The cosine similarity is 0.0076 between text 0 and text 1.\n", - "The cosine similarity is 0.0325 between text 0 and text 2.\n", - "The cosine similarity is 0.0082 between text 1 and text 2.\n", - "The cosine similarity is 0.0366 between text 0 and text 3.\n", - "The cosine similarity is 0.0267 between text 1 and text 3.\n", - "The cosine similarity is 0.0943 between text 2 and text 3.\n" - ] - } - ], - "source": [ - "# Obtaining numerical vectors from text\n", - "X = preprocessor.predict(corpus)\n", - "\n", - "print(\"The file consists of {} instances \"\n", - " \"with {}-dimensional features \"\n", - " \"in a {} matrix.\\n\".format(*X.shape, X.getformat()))\n", - "\n", - "from sklearn.metrics.pairwise import cosine_similarity\n", - "\n", - "sim = cosine_similarity(X)\n", - "\n", - "for i, ti in enumerate(corpus):\n", - " print(\"Text {}: {}\".format(i, ti))\n", - "\n", - "print(\"\")\n", - "for i in range(X.shape[0]):\n", - " for j in range(i):\n", - " print(\"The cosine similarity is {:.4f} between text {} and text {}.\".format(sim[i][j], j, i))" - ] - }, - { - "cell_type": "markdown", - "id": "18fcd09b", - 
"metadata": {}, - "source": [ - "#### Efficiency of PECOS Built-in TF-IDF Vectorizer\n", - "\n", - "Moreover, the TF-IDF vectorizer in PECOS is implemented in C++ and efficient." - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "a3d6f675", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "PECOS TFIDF time: 27.30768s, result shaepe=(14146, 10858825), nnz=37194670\n" - ] - } - ], - "source": [ - "vectorizer_config = {\n", - " \"type\": \"tfidf\",\n", - " \"kwargs\": {\n", - " \"base_vect_configs\": [ \n", - " {\n", - " \"ngram_range\": [1, 2],\n", - " \"max_df_ratio\": 0.98,\n", - " \"analyzer\": \"word\",\n", - " },\n", - " ],\n", - " },\n", - " }\n", - "\n", - "input_text_path = \"xmc-base/wiki10-31k/X.trn.txt\"\n", - "corpus = Preprocessor.load_data_from_file(input_text_path, text_pos=0)[\"corpus\"]\n", - "\n", - "import time\n", - "start_time = time.time()\n", - "preprocessor = Preprocessor.train(corpus, vectorizer_config)\n", - "X = preprocessor.predict(input_text_path)\n", - "print(f\"PECOS TFIDF time: {time.time() - start_time:.5f}s, result shaepe={X.shape}, nnz={X.nnz}\")" - ] - }, - { - "cell_type": "markdown", - "id": "e63c62ce", - "metadata": {}, - "source": [ - "As a baseline method, we compare with the [Sklearn TFIDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "677f77de", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Sklearn TFIDF time: 221.65870s, result shaepe=(14146, 7269690), nnz=33505461\n" - ] - } - ], - "source": [ - "start_time = time.time()\n", - "preprocessor = Preprocessor.train(\n", - " corpus,\n", - " {\"type\": \"sklearntfidf\", \"kwargs\":{\"ngram_range\": [1, 2], \"max_df\": 0.98}},\n", - ")\n", - "X = preprocessor.predict(corpus)\n", - "print(f\"Sklearn TFIDF time: {time.time() 
- start_time:.5f}s, result shaepe={X.shape}, nnz={X.nnz}\")" - ] - }, - { - "cell_type": "markdown", - "id": "75f5aaf8", - "metadata": {}, - "source": [ - "### Customized PECOS Model with TF-IDF Preprocessor\n", - "\n", - "\n", - "After being powered with text preprocessors, following the [aforementioned illustration](#Structure-of-a-Customized-PECOS-Model), we demonstrate an example of declaring a **customized PECOS model class** based on a TF-IDF preprocessor and a XR-Linear model." - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "id": "3893c23b", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "from os import path\n", - "import pathlib\n", - "from pecos.utils.featurization.text.preprocess import Preprocessor\n", - "from pecos.xmc.xlinear.model import XLinearModel\n", - "from pecos.xmc import Indexer, LabelEmbeddingFactory\n", - "from pecos.utils import smat_util\n", - "\n", - "class CustomPECOS:\n", - " def __init__(self, preprocessor=None, xlinear_model=None, output_items=None):\n", - " self.preprocessor = preprocessor\n", - " self.xlinear_model = xlinear_model\n", - " self.output_items = output_items\n", - " \n", - " @classmethod\n", - " def train(cls, input_text_path, output_text_path):\n", - " \"\"\"Train a CustomPECOS model\n", - " \n", - " Args: \n", - " input_text_path (str): Text input file name. \n", - " output_text_path (str): The file path for output text items.\n", - " vectorizer_config (str): Json_format string for vectorizer config (default None). e.g. 
{\"type\": \"tfidf\", \"kwargs\": {}}\n", - " \n", - " Returns:\n", - " A CustomPECOS object\n", - " \"\"\"\n", - " # Obtain X_text, Y\n", - " parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path)\n", - " Y = parsed_result[\"label_matrix\"]\n", - " corpus = parsed_result[\"corpus\"]\n", - "\n", - " # Train TF-IDF vectorizer\n", - " preprocessor = Preprocessor.train(corpus, {\"type\": \"tfidf\", \"kwargs\":{}}) \n", - " X = preprocessor.predict(corpus) \n", - " \n", - " # Train a XR-Linear model with TF-IDF features\n", - " label_feat = LabelEmbeddingFactory.create(Y, X, method=\"pifa\")\n", - " cluster_chain = Indexer.gen(label_feat)\n", - " xlinear_model = XLinearModel.train(X, Y, C=cluster_chain)\n", - " \n", - " # Load output items\n", - " with open(output_text_path, \"r\", encoding=\"utf-8\") as f:\n", - " output_items = [q.strip() for q in f]\n", - " \n", - " return cls(preprocessor, xlinear_model, output_items)\n", - " \n", - " def predict(self, corpus):\n", - " \"\"\"Predict labels for given inputs\n", - " \n", - " Args:\n", - " corpus (list of strings): input strings.\n", - " Returns:\n", - " csr_matrix: predicted label matrix (num_samples x num_labels)\n", - " \"\"\"\n", - " X = self.preprocessor.predict(corpus)\n", - " Y_pred = self.xlinear_model.predict(X)\n", - " return smat_util.sorted_csr(Y_pred)\n", - "\n", - " def save(self, model_folder):\n", - " \"\"\"Save the CustomPECOS model\n", - "\n", - " Args:\n", - " model_folder (str): folder name to save\n", - " \"\"\"\n", - " self.preprocessor.save(f\"{model_folder}/preprocessor\")\n", - " self.xlinear_model.save(f\"{model_folder}/xlinear_model\")\n", - " with open(f\"{model_folder}/output_items.json\", \"w\", encoding=\"utf-8\") as fp:\n", - " json.dump(self.output_items, fp)\n", - "\n", - " @classmethod\n", - " def load(cls, model_folder):\n", - " \"\"\"Load the CustomPECOS model\n", - "\n", - " Args:\n", - " model_folder (str): folder name to load\n", - " Returns:\n", - " 
CustomPECOS\n", - " \"\"\"\n", - " preprocessor = Preprocessor.load(f\"{model_folder}/preprocessor\")\n", - " xlinear_model = XLinearModel.load(f\"{model_folder}/xlinear_model\")\n", - " with open(f\"{model_folder}/output_items.json\", \"r\", encoding=\"utf-8\") as fin:\n", - " output_items = json.load(fin)\n", - " return cls(preprocessor, xlinear_model, output_items)" - ] - }, - { - "cell_type": "markdown", - "id": "fcdbb2c6", - "metadata": {}, - "source": [ - "### Operating the Customized PECOS Model\n", - "\n", - "With a well-declared model class, the customized PECOS model can be modularized and very convenient to use." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "24134357", - "metadata": {}, - "outputs": [], - "source": [ - "# Declare the path for model serialization and preprocessor configuration.\n", - "model_folder = \"./text2text_demo/pecos-CustomPECOS-model\"\n", - "\n", - "# Train and save the trained model\n", - "input_text_path = \"./text2text_demo/training-data.txt\"\n", - "output_text_path = \"./text2text_demo/output-labels.txt\"\n", - "model = CustomPECOS.train(input_text_path, output_text_path)\n", - "model.save(model_folder)\n", - "\n", - "# Load the trained model and predict\n", - "model = model.load(model_folder)\n", - "testing_text_path = \"./text2text_demo/testing-data.txt\"\n", - "Y_pred = model.predict(testing_text_path)" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "id": "31efd9ac", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Text Input: In 1989, Yann LeCun et al. 
applied the standard backpropagation algorithm on neural networks for hand digit recognition.\n", - "Score 0.9515: Machine learning researchers\n", - "Score 0.8233: Artificial intelligence researchers\n", - "Score 0.4659: Deep Learning\n", - "Score 0.2779: British computer scientists\n", - "Score 0.0569: Turing Award laureates\n", - "Score 0.0129: Computability theorists\n" - ] - } - ], - "source": [ - "test_texts = Preprocessor.load_data_from_file(testing_text_path, output_text_path)[\"corpus\"]\n", - "\n", - "for i, text in enumerate(test_texts):\n", - " print(\"Text Input: {}\".format(text))\n", - " for j in range(Y_pred.indptr[i], Y_pred.indptr[i + 1]):\n", - " pred_label = model.output_items[Y_pred.indices[j]]\n", - " pred_score = Y_pred.data[j]\n", - " print(\"Score {:.4f}: {}\".format(pred_score, pred_label))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb b/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb index 81757997..20f4f90b 100644 --- a/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb +++ b/tutorials/kdd22/Session 4 Utilities in PECOS.ipynb @@ -5,794 +5,780 @@ "id": "b1ebc316", "metadata": {}, "source": [ - "# Utilities in PECOS\n", + "# PECOS Utilities\n", "\n", - "PECOS provides various useful interfaces and utility functions for XMR problems and related tasks. 
In this session, we will introduce how to tackle arbitrary data formats for XMR, and then present some utilities in PECOS for efficient matrix operations and hierarchical clustering.\n", + "PECOS provides various useful interfaces and utility functions for XMC problems and related tasks. In this session, we will present some utilities in PECOS for efficient matrix operations and hierarchical clustering.\n", "\n", - "### Install PECOS through Python PIP" + "## Outline\n", + "\n", + "1. Sparse Matrix Operations\n", + "2. Hierarchical Clustering" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "4eba0f0b", + "cell_type": "markdown", + "id": "c2b3f61e", "metadata": {}, - "outputs": [], "source": [ - "! pip install libpecos" + "## 1. Sparse Matrix Operations\n", + "\n", + "Most of the computations in PECOS are based on sparse matrices, so PECOS also provides various useful and efficient operation utilities for sparse matrices." ] }, { "cell_type": "markdown", - "id": "3e9cc45c", + "id": "5fba58d4", "metadata": {}, "source": [ - "## Working with Arbitrary Data Formats\n", + "### 1.1. Generic Matrix IO and Conversion\n", "\n", - "PECOS is a general machine learning framework and able to fit arbitary data format and interact with different data manipulation and analysis libraries like [Pandas](https://pandas.pydata.org/). In the following example, we will show how to learn a PECOS model with Pandas-loaded data of text, categorical, and numerical features." 
+ "`smat_util.load_matrix` and `smat_util.save_matrix` provide generic interfaces for loading and storing matrices in arbitrary common formats, including [dense matrix](https://numpy.org/doc/stable/reference/generated/numpy.array.html) in NumPy or different sparse matrix formats (i.e., [sparse Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html), [sparse Compressed Sparse Column (CSC) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html), and [sparse COOrdinate (COO) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html))." ] }, { "cell_type": "code", "execution_count": 1, - "id": "9f0f17a7", - "metadata": {}, - "outputs": [], - "source": [ - "import pecos\n", - "import pandas as pd\n", - "import numpy as np" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "526820f8", + "id": "385ed0ba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Archive: drugLib_raw.zip\r\n", - " inflating: drugLibTest_raw.tsv \r\n", - " inflating: drugLibTrain_raw.tsv \r\n" + "Dense Matrtix IO\n", + "mat is a matrix with a shape (2, 3).\n", + "[[0.32706124 0.94765886 0.16764024]\n", + " [0.29065096 0.23160388 0.3871939 ]]\n", + "mat_loaded is a matrix with a shape (2, 3).\n", + "[[0.32706124 0.94765886 0.16764024]\n", + " [0.29065096 0.23160388 0.3871939 ]]\n", + "\n", + "csr Sparse Matrix IO\n", + "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n", + " (1, 1)\t0.3859982113301277\n", + " (2, 1)\t0.5399444869534915\n", + " (3, 1)\t0.008896715300809821\n", + " (4, 2)\t0.9634283904734527\n", + "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n", + " (1, 1)\t0.3859982113301277\n", + " (2, 1)\t0.5399444869534915\n", + " (3, 1)\t0.008896715300809821\n", + " (4, 2)\t0.9634283904734527\n", + "\n", + "csc Sparse Matrix IO\n", + "mat is a matrix with a shape (5, 4) and 4 
non-zero values.\n", + " (2, 0)\t0.8430107100693552\n", + " (0, 3)\t0.5602000410939516\n", + " (3, 3)\t0.4358575080842668\n", + " (4, 3)\t0.454532975053182\n", + "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n", + " (2, 0)\t0.8430107100693552\n", + " (0, 3)\t0.5602000410939516\n", + " (3, 3)\t0.4358575080842668\n", + " (4, 3)\t0.454532975053182\n", + "\n", + "coo Sparse Matrix IO\n", + "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n", + " (2, 0)\t0.6008844736009242\n", + " (3, 3)\t0.5296164005351621\n", + " (3, 2)\t0.6884529935778093\n", + " (0, 3)\t0.847894528567365\n", + "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n", + " (2, 0)\t0.6008844736009242\n", + " (3, 3)\t0.5296164005351621\n", + " (3, 2)\t0.6884529935778093\n", + " (0, 3)\t0.847894528567365\n", + "\n" ] } ], "source": [ - "! wget -nv -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00461/drugLib_raw.zip\n", - "! unzip -o drugLib_raw.zip" + "from pecos.utils import smat_util\n", + "import numpy as np\n", + "import scipy.sparse as smat\n", + "\n", + "print(\"Dense Matrtix IO\")\n", + "mat = np.random.rand(2, 3)\n", + "print(f\"mat is a {type(mat)} matrix with a shape {mat.shape}.\")\n", + "print(mat)\n", + "smat_util.save_matrix(\"mat.npz\", mat)\n", + "mat_loaded = smat_util.load_matrix(\"mat.npz\")\n", + "print(f\"mat_loaded is a {type(mat_loaded)} matrix with a shape {mat_loaded.shape}.\")\n", + "print(mat_loaded)\n", + "print(\"\") \n", + "\n", + "for matrix_format in [\"csr\", \"csc\", \"coo\"]:\n", + " print(f\"{matrix_format} Sparse Matrix IO\")\n", + " mat = smat.random(5, 4, density=0.2, format=matrix_format)\n", + " print(f\"mat is a {type(mat)} matrix\"\n", + " f\" with a shape {mat.shape} and {mat.nnz} non-zero values.\")\n", + " print(mat)\n", + " \n", + " smat_util.save_matrix(\"mat.npz\", mat)\n", + " mat_loaded = smat_util.load_matrix(\"mat.npz\")\n", + " print(f\"mat_loaded is a {type(mat_loaded)} matrix\"\n", + " f\" with a 
shape {mat_loaded.shape} and {mat_loaded.nnz} non-zero values.\")\n", + " print(mat_loaded)\n", + " print(\"\") " ] }, { "cell_type": "code", - "execution_count": 3, - "id": "ac8ab46f", + "execution_count": 2, + "id": "2579b855", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Training DataFrame consists of 3107 instances.\n", - "Testing DataFrame consists of 1036 instances.\n", - "Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',\n", - " 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],\n", - " dtype='object')\n" + "Original Matrix mat\n", + " [[0.16936286 0.78425304 0.8562633 0.61722574 0.4486684 0.23233178]\n", + " [0.79099373 0.3961628 0.91564054 0.58414229 0.43155964 0.55876417]\n", + " [0.44718835 0.05151288 0.42833526 0.12533758 0.2968885 0.82826553]\n", + " [0.57886779 0.45415528 0.24104546 0.04155873 0.7281743 0.08374103]] \n", + "\n", + "csr_mat = dense_to_csr(mat)\n", + "csr_mat is a matrix with a shape (4, 6) and 24 non-zero values.\n", + "[[0.16936286 0.78425304 0.8562633 0.61722574 0.4486684 0.23233178]\n", + " [0.79099373 0.3961628 0.91564054 0.58414229 0.43155964 0.55876417]\n", + " [0.44718835 0.05151288 0.42833526 0.12533758 0.2968885 0.82826553]\n", + " [0.57886779 0.45415528 0.24104546 0.04155873 0.7281743 0.08374103]] \n", + "\n", + "csr_mat_topk = dense_to_csr(mat, topk=2)\n", + "csr_mat is a matrix with a shape (4, 6) and 8 non-zero values.\n", + "[[0. 0.78425304 0.8562633 0. 0. 0. ]\n", + " [0.79099373 0. 0.91564054 0. 0. 0. ]\n", + " [0. 0. 0.42833526 0. 0. 0.82826553]\n", + " [0. 0.45415528 0. 0. 0.7281743 0. 
]] \n", + "\n" ] } ], "source": [ - "train_df = pd.read_csv(\"drugLibTrain_raw.tsv\", sep=\"\\t\")\n", - "test_df = pd.read_csv(\"drugLibTest_raw.tsv\", sep=\"\\t\")\n", - "print(f\"Training DataFrame consists of {len(train_df)} instances.\")\n", - "print(f\"Testing DataFrame consists of {len(test_df)} instances.\")\n", - "print(train_df.columns)" + "mat = np.random.rand(4, 6)\n", + "\n", + "print(f\"Original Matrix mat\\n\", mat, \"\\n\")\n", + "\n", + "print(\"csr_mat = dense_to_csr(mat)\")\n", + "csr_mat = smat_util.dense_to_csr(mat)\n", + "print(f\"csr_mat is a {type(csr_mat)} matrix\"\n", + " f\" with a shape {csr_mat.shape} and {csr_mat.nnz} non-zero values.\")\n", + "print(csr_mat.toarray(), \"\\n\")\n", + "\n", + "print(\"csr_mat_topk = dense_to_csr(mat, topk=2)\")\n", + "csr_mat_topk = smat_util.dense_to_csr(mat, topk=2)\n", + "print(f\"csr_mat is a {type(csr_mat_topk)} matrix\"\n", + " f\" with a shape {csr_mat_topk.shape} and {csr_mat_topk.nnz} non-zero values.\")\n", + "print(csr_mat_topk.toarray(), \"\\n\")" ] }, { - "cell_type": "code", - "execution_count": 4, - "id": "6d6ee989", + "cell_type": "markdown", + "id": "b2746e3c", "metadata": {}, - "outputs": [], "source": [ - "label_name = \"effectiveness\"\n", - "text_features = [\"condition\", \"benefitsReview\", \"sideEffectsReview\", \"commentsReview\"]\n", - "categorical_features = [\"sideEffects\"]\n", - "numerical_features = [\"rating\"]" + "### 1.2. Memory-efficient Sparse Matrix Operations\n", + "\n", + "To manipulate with sparse matrix, PECOS provides many useful memory-efficient functions. 
For example, for CSR matrices, we have the following functions to combine multiple matrices.\n", + "\n", + "* `hstack_csr([mat, mat, mat])`\n", + "* `vstack_csr([mat, mat, mat])`\n", + "* `block_diag_csr([mat, mat, mat])`\n", + "\n", + "These functions are also available for CSC matrices as `hstack_csc`, `vstack_csc`, and `block_diag_csc`.\n" ] }, { "cell_type": "code", - "execution_count": 5, - "id": "c25b1cb2", + "execution_count": 3, + "id": "b9dac617", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original Matrix mat\n", + " [[0.77445782 0.45465131]\n", + " [0. 0.43456434]\n", + " [0. 0. ]] \n", + "\n", + "hstack_csr([mat, mat, mat])\n", + "[[0.77445782 0.45465131 0.77445782 0.45465131 0.77445782 0.45465131]\n", + " [0. 0.43456434 0. 0.43456434 0. 0.43456434]\n", + " [0. 0. 0. 0. 0. 0. ]] \n", + "\n", + "vstack_csr([mat, mat, mat])\n", + "[[0.77445782 0.45465131]\n", + " [0. 0.43456434]\n", + " [0. 0. ]\n", + " [0.77445782 0.45465131]\n", + " [0. 0.43456434]\n", + " [0. 0. ]\n", + " [0.77445782 0.45465131]\n", + " [0. 0.43456434]\n", + " [0. 0. ]] \n", + "\n", + "block_diag_csr([mat, mat, mat])\n", + "[[0.77445782 0.45465131 0. 0. 0. 0. ]\n", + " [0. 0.43456434 0. 0. 0. 0. ]\n", + " [0. 0. 0. 0. 0. 0. ]\n", + " [0. 0. 0.77445782 0.45465131 0. 0. ]\n", + " [0. 0. 0. 0.43456434 0. 0. ]\n", + " [0. 0. 0. 0. 0. 0. ]\n", + " [0. 0. 0. 0. 0.77445782 0.45465131]\n", + " [0. 0. 0. 0. 0. 0.43456434]\n", + " [0. 0. 0. 0. 0. 0. 
]] \n", + "\n" + ] + } + ], "source": [ - "X_trn_list = []\n", - "X_tst_list = []" + "from pecos.utils import smat_util\n", + "import scipy.sparse as smat\n", + "\n", + "mat = smat.random(3, 2, density=0.5, format=\"csr\")\n", + "print(f\"Original Matrix {type(mat)} mat\\n\", mat.toarray(), \"\\n\")\n", + "\n", + "print(f\"hstack_csr([mat, mat, mat])\")\n", + "print(smat_util.hstack_csr([mat, mat, mat]).toarray(), \"\\n\")\n", + "\n", + "print(f\"vstack_csr([mat, mat, mat])\")\n", + "print(smat_util.vstack_csr([mat, mat, mat]).toarray(), \"\\n\")\n", + "\n", + "print(f\"block_diag_csr([mat, mat, mat])\")\n", + "print(smat_util.block_diag_csr([mat, mat, mat]).toarray(), \"\\n\")" ] }, { "cell_type": "markdown", - "id": "4f72d047", + "id": "5f9cf7f1", "metadata": {}, "source": [ - "### Label Encoding\n", + "### 1.3. Sparse-to-sparse Matrix Multiplication (SpMM)\n", + "\n", + "Many operations in PECOS or XMC problems rely on Sparse-to-sparse Matrix Multiplication (SpMM), such as the computation of PIFA features. It is also one of the key primitives in large-scale linear algebra operations, with a broad range of applications in machine learning and natural language processing.\n", "\n", - "To encode labels into the sparse matrix format compatible to PECOS, [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer) and [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer) are helpful for the scenarios of multi-class and multi-label classification." 
+ "For SpMM, PECOS provides a highly optimized multi-core CPU implementation with state-of-the-art performance, where the underlying operations are implemented and optimized in C/C++.\n", + "Specifically, the Python interface and parameters are as follows:\n", + "\n", + "```python\n", + "from pecos.core import clib as pecos_clib\n", + "Z = pecos_clib.sparse_matmul(X, Y, eliminate_zeros=False, sorted_indices=True, threads=-1)\n", + "```\n", + "* Parameters\n", + " * `X` (scipy.sparse.csr_matrix or scipy.sparse.csc_matrix): the first sparse matrix to be multiplied.\n", + " * `Y` (scipy.sparse.csr_matrix or scipy.sparse.csc_matrix): the second sparse matrix to be multiplied.\n", + " * `eliminate_zeros` (bool, optional): if true, then eliminate (potential) zeros created by maxnnz in output matrix Z. Default is false.\n", + " * `sorted_indices` (bool, optional): if true, then sort the Z.indices for the output matrix Z. Default is true.\n", + " * `threads` (int, optional): The number of threads. Default -1 to use all CPU cores." 
] }, { "cell_type": "code", - "execution_count": 6, - "id": "4f0d9ec7", + "execution_count": 4, + "id": "b2e6c54a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Y_trn is a csr matrix with a shape (3107, 5) and 3107 non-zero values.\n", - "Y_tst is a csr matrix with a shape (1036, 5) and 1036 non-zero values.\n" + "||Z_true - Z_pred|| = 0.0\n" ] } ], "source": [ - "from sklearn.preprocessing import OneHotEncoder\n", - "\n", - "label_encoder = OneHotEncoder(dtype=np.float32)\n", - "Y_trn = label_encoder.fit_transform(train_df[[label_name]])\n", - "Y_tst = label_encoder.transform(test_df[[label_name]])\n", - "\n", - "print(f\"Y_trn is a {Y_trn.getformat()} matrix with a shape {Y_trn.shape} and {Y_trn.nnz} non-zero values.\")\n", - "print(f\"Y_tst is a {Y_tst.getformat()} matrix with a shape {Y_tst.shape} and {Y_tst.nnz} non-zero values.\")" + "import numpy as np\n", + "import scipy.sparse as smat\n", + "from scipy.sparse import linalg\n", + "from pecos.core import clib as pecos_clib\n", + "X = smat.random(1000, 1000, density=0.01, format='csr', dtype=np.float32)\n", + "Y = smat.random(1000, 1000, density=0.01, format='csr', dtype=np.float32)\n", + "Z_true = X.dot(Y)\n", + "Z_pred = pecos_clib.sparse_matmul(X, Y)\n", + "print(\"||Z_true - Z_pred|| = \", linalg.norm(Z_true - Z_pred))" ] }, { "cell_type": "code", - "execution_count": 7, - "id": "62ec7371", + "execution_count": 5, + "id": "14e4bded", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Y_trn_mlb is a csr matrix with a shape (3107, 5) and 3107 non-zero values.\n", - "Y_tst_mlb is a csr matrix with a shape (1036, 5) and 1036 non-zero values.\n" - ] - } - ], + "outputs": [], "source": [ - "from sklearn.preprocessing import MultiLabelBinarizer\n", - "\n", - "label_encoder_multilabel = MultiLabelBinarizer(sparse_output=True)\n", - "Y_trn_mlb = label_encoder.fit_transform([[lbl] for lbl in train_df[label_name].tolist()])\n", - 
"Y_tst_mlb = label_encoder.fit_transform([[lbl] for lbl in test_df[label_name].tolist()])\n", - "print(f\"Y_trn_mlb is a {Y_trn_mlb.getformat()} matrix with a shape {Y_trn_mlb.shape} and {Y_trn_mlb.nnz} non-zero values.\")\n", - "print(f\"Y_tst_mlb is a {Y_tst_mlb.getformat()} matrix with a shape {Y_tst_mlb.shape} and {Y_tst_mlb.nnz} non-zero values.\")" + "import time\n", + "DATASET = \"wiki10-31k\"\n", + "X = smat_util.load_matrix(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\").astype(np.float32)\n", + "Y = smat_util.load_matrix(f\"xmc-base/{DATASET}/Y.trn.npz\").astype(np.float32)\n", + "YT_csr = Y.T.tocsr()\n", + "X_csr = X.tocsr()" ] }, { "cell_type": "markdown", - "id": "260a9a8a", + "id": "bb5af1e9", "metadata": {}, "source": [ - "### Text Feature Encoding\n", + "#### Benchmarking Sparse Matrix Multiplication\n", "\n", - "As introduced in Session 1, we can use PECOS vectorizer for featurize text data. In addition, the encoder of [XR-Transformer](https://github.com/amzn/pecos/tree/mainline/pecos/xmc/xtransformer) can be also utilized for deriving text features with proper fine-tuning." 
+ "The SpMM utility has state-of-the-art performance in efficiency as shown in the following figure.\n", + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "In this part, we provide some hands-on instructions for benchmarking different methods for SpMM." ] }, { "cell_type": "code", - "execution_count": 8, - "id": "96fef619", + "execution_count": 6, + "id": "c71cffdb", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "condition: (3107, 3759) and (1036, 3759) in training and testing.\n", - "benefitsReview: (3107, 72861) and (1036, 72861) in training and testing.\n", - "sideEffectsReview: (3107, 64321) and (1036, 64321) in training and testing.\n", - "commentsReview: (3107, 91731) and (1036, 91731) in training and testing.\n" - ] - } - ], + "outputs": [], "source": [ - "from pecos.utils.featurization.text.vectorizers import Vectorizer\n", + "# Benchmarking SciPy\n", "\n", - "for feature_name in text_features:\n", - " vectorizer_config = {\n", - " \"type\": \"tfidf\",\n", - " \"kwargs\": {\n", - " \"base_vect_configs\": [\n", + "start = time.time()\n", + "Z = YT_csr.dot(X_csr)\n", + "Z.sort_indices()\n", + "run_time_scipy = time.time() - start" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "2f2b9795", + "metadata": {}, + "outputs": [], + "source": [ + "# Benchmarking PyTorch\n", "\n", - " {\n", - " \"ngram_range\": [1, 2],\n", - " \"max_df_ratio\": 0.98,\n", - " \"analyzer\": \"word\",\n", - " },\n", - " ],\n", - " },\n", - " } \n", - " train_texts = [str(x) for x in train_df[feature_name].tolist()]\n", - " test_texts = test_df[feature_name].tolist()\n", - " vectorizer = Vectorizer.train(train_texts, config=vectorizer_config)\n", - " X_trn_local = vectorizer.predict(train_texts)\n", - " X_tst_local = vectorizer.predict(test_texts)\n", - " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n", + "import torch\n", + "\n", + "def csr_to_coo(A):\n", + " A_coo = smat.coo_matrix(A)\n", + " indices = np.vstack([A_coo.row, A_coo.col]).T\n", + " values = A_coo.data\n", + " return indices, values\n", + 
"\n", + "def get_pt_data(A_csr):\n", + " A_indices, A_values = csr_to_coo(A_csr)\n", + " A_pt = torch.sparse_coo_tensor(\n", + " A_indices.T.astype(np.int64),\n", + " A_values.astype(np.float32),\n", + " A_csr.shape,\n", + " )\n", + " return A_pt\n", " \n", - " X_trn_list.append(X_trn_local)\n", - " X_tst_list.append(X_tst_local)" + "YT_pt = get_pt_data(YT_csr)\n", + "X_pt = get_pt_data(X_csr)\n", + "start = time.time()\n", + "Z_pt = torch.sparse.mm(YT_pt, X_pt)\n", + "run_time_pytorch = time.time() - start" ] }, { - "cell_type": "markdown", - "id": "38e75fa2", + "cell_type": "code", + "execution_count": 8, + "id": "54b24694", "metadata": {}, + "outputs": [], "source": [ - "### Categorical Feature Encoding\n", + "# Benchmarking PECOS\n", "\n", - "Similar to labels, categorical features can also be considered as one-hot or multi-hot embeddings." + "start = time.time()\n", + "Z = pecos_clib.sparse_matmul(\n", + " YT_csr, X_csr,\n", + " eliminate_zeros=False,\n", + " sorted_indices=True\n", + ")\n", + "run_time_pecos = time.time() - start" ] }, { "cell_type": "code", "execution_count": 9, - "id": "386b96ad", + "id": "e12f29e9", "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "sideEffects: (3107, 5) and (1036, 5) in training and testing.\n" - ] + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" } ], "source": [ - "from sklearn.preprocessing import OneHotEncoder\n", + "from matplotlib import pyplot as plt\n", + "plt.bar(\n", + " [1,2,3],\n", + " [run_time_scipy, run_time_pytorch, run_time_pecos],\n", + " tick_label = [\"SciPy\", \"PyTorch\", \"PECOS\"])\n", "\n", - "for feature_name in categorical_features:\n", - " local_encoder = OneHotEncoder(dtype=np.float32)\n", - " X_trn_local = local_encoder.fit_transform(train_df[[feature_name]])\n", - " X_tst_local = local_encoder.transform(test_df[[feature_name]])\n", - " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n", - " \n", - " X_trn_list.append(X_trn_local)\n", - " X_tst_list.append(X_tst_local)" + "plt.ylabel(\"Matrix Multiplication Time (seconds)\");" ] }, { "cell_type": "markdown", - "id": "6deaf4e8", + "id": "3e9cc45c", "metadata": {}, "source": [ - "### Numerical Features Encoding\n", + "### 1.4. Sparse Matrix Operations for Working with Arbitrary Data Formats\n", "\n", - "Numberical features can be directly incorporated as model inputs after some simple normalization." + "PECOS is a general machine learning framework that can fit arbitrary data formats and interact with different data manipulation and analysis libraries like [Pandas](https://pandas.pydata.org/). In the following example, we will show how to learn a PECOS model with Pandas-loaded data of text, categorical, and numerical features based on sparse matrix operations." 
] }, { "cell_type": "code", "execution_count": 10, - "id": "90668ea4", + "id": "9f0f17a7", + "metadata": {}, + "outputs": [], + "source": [ + "import pecos\n", + "import pandas as pd\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "526820f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "rating: (3107, 1) and (1036, 1) in training and testing.\n" + "2022-08-13 06:42:02 URL:https://archive.ics.uci.edu/ml/machine-learning-databases/00461/drugLib_raw.zip [1133354/1133354] -> \"drugLib_raw.zip\" [1]\n", + "Archive: drugLib_raw.zip\n", + " inflating: drugLibTest_raw.tsv \n", + " inflating: drugLibTrain_raw.tsv \n" ] } ], "source": [ - "from scipy.sparse import csr_matrix\n", - "from sklearn.preprocessing import StandardScaler\n", - "\n", - "for feature_name in numerical_features:\n", - " X_trn_values = train_df[[\"rating\"]].values\n", - " X_tst_values = test_df[[\"rating\"]].values\n", - " scaler = StandardScaler()\n", - " X_trn_local = csr_matrix(scaler.fit_transform(X_trn_values), dtype=np.float32)\n", - " X_tst_local = csr_matrix(scaler.transform(X_tst_values), dtype=np.float32)\n", - " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n", - " \n", - " X_trn_list.append(X_trn_local)\n", - " X_tst_list.append(X_tst_local)" - ] - }, - { - "cell_type": "markdown", - "id": "d2441580", - "metadata": {}, - "source": [ - "### Feature Concatenation\n", - "\n", - "PECOS provides easy-going utility functions for efficient matrix operations. The `hstack_csr` function can concatenate different features for each individual instance. More detils about other utilities will be introduced later in this session." + "! wget -nv -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00461/drugLib_raw.zip\n", + "! 
unzip -o drugLib_raw.zip" ] }, { "cell_type": "code", - "execution_count": 11, - "id": "d0d3e69c", + "execution_count": 12, + "id": "ac8ab46f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "X_trn is a csr matrix with a shape (3107, 232678) and 653987 non-zero values.\n", - "X_tst is a csr matrix with a shape (1036, 232678) and 164272 non-zero values.\n" + "Training DataFrame consists of 3107 instances.\n", + "Testing DataFrame consists of 1036 instances.\n", + "Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',\n", + " 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],\n", + " dtype='object')\n" ] } ], "source": [ - "from pecos.utils import smat_util\n", - "\n", - "X_trn = smat_util.hstack_csr(X_trn_list)\n", - "X_tst = smat_util.hstack_csr(X_tst_list)\n", - "\n", - "print(f\"X_trn is a {X_trn.getformat()} matrix with a shape {X_trn.shape} and {X_trn.nnz} non-zero values.\")\n", - "print(f\"X_tst is a {X_tst.getformat()} matrix with a shape {X_tst.shape} and {X_tst.nnz} non-zero values.\")" + "train_df = pd.read_csv(\"drugLibTrain_raw.tsv\", sep=\"\\t\")\n", + "test_df = pd.read_csv(\"drugLibTest_raw.tsv\", sep=\"\\t\")\n", + "print(f\"Training DataFrame consists of {len(train_df)} instances.\")\n", + "print(f\"Testing DataFrame consists of {len(test_df)} instances.\")\n", + "print(train_df.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "6d6ee989", + "metadata": {}, + "outputs": [], + "source": [ + "label_name = \"effectiveness\"\n", + "text_features = [\"condition\", \"benefitsReview\", \"sideEffectsReview\", \"commentsReview\"]\n", + "categorical_features = [\"sideEffects\"]\n", + "numerical_features = [\"rating\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "c25b1cb2", + "metadata": {}, + "outputs": [], + "source": [ + "X_trn_list = []\n", + "X_tst_list = []" ] }, { "cell_type": "markdown", - "id": "5e61775b", + "id": 
"4f72d047", "metadata": {}, "source": [ - "### Model Training and Testing" + "#### 1.4.1. Label Encoding\n", + "\n", + "To encode labels into a sparse matrix format compatible with PECOS, [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer) are helpful for multi-class and multi-label classification, respectively." ] }, { "cell_type": "code", - "execution_count": 12, - "id": "38189597", + "execution_count": 15, + "id": "4f0d9ec7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "prec = 52.80 40.69 30.92 24.52 20.00\n", - "recall = 52.80 81.37 92.76 98.07 100.00\n" + "Y_trn is a csr matrix with a shape (3107, 5) and 3107 non-zero values.\n", + "Y_tst is a csr matrix with a shape (1036, 5) and 1036 non-zero values.\n" ] } ], "source": [ - "from pecos.xmc.xlinear.model import XLinearModel\n", - "xlm = XLinearModel.train(X_trn, Y_trn)\n", + "from sklearn.preprocessing import OneHotEncoder\n", "\n", - "Y_pred = xlm.predict(X_tst, beam_size=10, only_topk=5)\n", - "metrics = smat_util.Metrics.generate(Y_tst, Y_pred, topk=5)\n", - "print(metrics)" + "label_encoder = OneHotEncoder(dtype=np.float32)\n", + "Y_trn = label_encoder.fit_transform(train_df[[label_name]])\n", + "Y_tst = label_encoder.transform(test_df[[label_name]])\n", + "\n", + "print(f\"Y_trn is a {Y_trn.getformat()} matrix with a shape {Y_trn.shape} and {Y_trn.nnz} non-zero values.\")\n", + "print(f\"Y_tst is a {Y_tst.getformat()} matrix with a shape {Y_tst.shape} and {Y_tst.nnz} non-zero values.\")" ] }, { - "cell_type": "markdown", - "id": "c2b3f61e", + "cell_type": "code", + "execution_count": 16, + "id": "62ec7371", "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Y_trn_mlb is a csr matrix with a shape 
(3107, 
5) and 3107 non-zero values.\n", + "Y_tst_mlb is a csr matrix with a shape (1036, 5) and 1036 non-zero values.\n" + ] + } + ], "source": [ - "## Sparse Matrix Operations\n", + "from sklearn.preprocessing import MultiLabelBinarizer\n", "\n", - "Most of the computations in PECOS are based on sparse matrices, so PECOS also provides various useful and efficient operation utilities for sparse matrices." + "label_encoder_multilabel = MultiLabelBinarizer(sparse_output=True)\n", + "Y_trn_mlb = label_encoder_multilabel.fit_transform([[lbl] for lbl in train_df[label_name].tolist()])\n", + "Y_tst_mlb = label_encoder_multilabel.transform([[lbl] for lbl in test_df[label_name].tolist()])\n", + "print(f\"Y_trn_mlb is a {Y_trn_mlb.getformat()} matrix with a shape {Y_trn_mlb.shape} and {Y_trn_mlb.nnz} non-zero values.\")\n", + "print(f\"Y_tst_mlb is a {Y_tst_mlb.getformat()} matrix with a shape {Y_tst_mlb.shape} and {Y_tst_mlb.nnz} non-zero values.\")" ] }, { "cell_type": "markdown", - "id": "5fba58d4", + "id": "260a9a8a", "metadata": {}, "source": [ - "### Genric Matriox IO and Conversion\n", + "#### 1.4.2. Text Feature Encoding\n", "\n", - "`smat_util.load_matrix` and `smat_util.save_matrix` provide generic interfaces for loading and storing matrices in arbitrary common formats, including [dense matrix](https://numpy.org/doc/stable/reference/generated/numpy.array.html) in NumPy or different sparse matrix formats (i.e., [sparse Compressed Sparse Row (CSR) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html), [sparse Compressed Sparse Column (CSC) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html), and [sparse COOrdinate (COO) matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html))." + "As introduced in Session 1, we can use the PECOS vectorizer to featurize text data. 
In addition, the encoder of [XR-Transformer](https://github.com/amzn/pecos/tree/mainline/pecos/xmc/xtransformer) can also be used to derive text features after proper fine-tuning." ] }, { "cell_type": "code", - "execution_count": 13, - "id": "385ed0ba", + "execution_count": 17, + "id": "96fef619", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Dense Matrtix IO\n", - "mat is a matrix with a shape (2, 3).\n", - "[[0.6757516 0.42168422 0.40557039]\n", - " [0.86806547 0.9198075 0.7494449 ]]\n", - "mat_loaded is a matrix with a shape (2, 3).\n", - "[[0.6757516 0.42168422 0.40557039]\n", - " [0.86806547 0.9198075 0.7494449 ]]\n", - "\n", - "csr Sparse Matrix IO\n", - "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n", - " (1, 2)\t0.17821196669035588\n", - " (2, 1)\t0.8259001065480657\n", - " (4, 2)\t0.5111159408743305\n", - " (4, 3)\t0.6337428297507509\n", - "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n", - " (1, 2)\t0.17821196669035588\n", - " (2, 1)\t0.8259001065480657\n", - " (4, 2)\t0.5111159408743305\n", - " (4, 3)\t0.6337428297507509\n", - "\n", - "csc Sparse Matrix IO\n", - "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n", - " (0, 1)\t0.868116915403953\n", - " (1, 2)\t0.7454473997071077\n", - " (0, 3)\t0.21167432752493887\n", - " (1, 3)\t0.4685535255015949\n", - "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n", - " (0, 1)\t0.868116915403953\n", - " (1, 2)\t0.7454473997071077\n", - " (0, 3)\t0.21167432752493887\n", - " (1, 3)\t0.4685535255015949\n", - "\n", - "coo Sparse Matrix IO\n", - "mat is a matrix with a shape (5, 4) and 4 non-zero values.\n", - " (1, 0)\t0.3217900085041965\n", - " (3, 3)\t0.15316424313380772\n", - " (2, 3)\t0.7835729602784944\n", - " (2, 2)\t0.396664789900256\n", - "mat_loaded is a matrix with a shape (5, 4) and 4 non-zero values.\n", - " (1, 0)\t0.3217900085041965\n", - " (3, 3)\t0.15316424313380772\n", - " (2, 
3)\t0.7835729602784944\n", - " (2, 2)\t0.396664789900256\n", - "\n" + "condition: (3107, 3759) and (1036, 3759) in training and testing.\n", + "benefitsReview: (3107, 72861) and (1036, 72861) in training and testing.\n", + "sideEffectsReview: (3107, 64321) and (1036, 64321) in training and testing.\n", + "commentsReview: (3107, 91731) and (1036, 91731) in training and testing.\n" ] } ], "source": [ - "from pecos.utils import smat_util\n", - "import numpy as np\n", - "import scipy.sparse as smat\n", + "from pecos.utils.featurization.text.vectorizers import Vectorizer\n", "\n", - "print(\"Dense Matrtix IO\")\n", - "mat = np.random.rand(2, 3)\n", - "print(f\"mat is a {type(mat)} matrix with a shape {mat.shape}.\")\n", - "print(mat)\n", - "smat_util.save_matrix(\"mat.npz\", mat)\n", - "mat_loaded = smat_util.load_matrix(\"mat.npz\")\n", - "print(f\"mat_loaded is a {type(mat_loaded)} matrix with a shape {mat_loaded.shape}.\")\n", - "print(mat)\n", - "print(\"\") \n", + "for feature_name in text_features:\n", + " vectorizer_config = {\n", + " \"type\": \"tfidf\",\n", + " \"kwargs\": {\n", + " \"base_vect_configs\": [\n", "\n", - "for matrix_format in [\"csr\", \"csc\", \"coo\"]:\n", - " print(f\"{matrix_format} Sparse Matrix IO\")\n", - " mat = smat.random(5, 4, density=0.2, format=matrix_format)\n", - " print(f\"mat is a {type(mat)} matrix\"\n", - " f\" with a shape {mat.shape} and {mat.nnz} non-zero values.\")\n", - " print(mat)\n", + " {\n", + " \"ngram_range\": [1, 2],\n", + " \"max_df_ratio\": 0.98,\n", + " \"analyzer\": \"word\",\n", + " },\n", + " ],\n", + " },\n", + " } \n", + " train_texts = [str(x) for x in train_df[feature_name].tolist()]\n", + " test_texts = test_df[feature_name].tolist()\n", + " vectorizer = Vectorizer.train(train_texts, config=vectorizer_config)\n", + " X_trn_local = vectorizer.predict(train_texts)\n", + " X_tst_local = vectorizer.predict(test_texts)\n", + " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training 
and testing.\")\n", " \n", - " smat_util.save_matrix(\"mat.npz\", mat)\n", - " mat_loaded = smat_util.load_matrix(\"mat.npz\")\n", - " print(f\"mat_loaded is a {type(mat_loaded)} matrix\"\n", - " f\" with a shape {mat_loaded.shape} and {mat_loaded.nnz} non-zero values.\")\n", - " print(mat_loaded)\n", - " print(\"\") " + " X_trn_list.append(X_trn_local)\n", + " X_tst_list.append(X_tst_local)" + ] + }, + { + "cell_type": "markdown", + "id": "38e75fa2", + "metadata": {}, + "source": [ + "#### 1.4.3. Categorical Feature Encoding\n", + "\n", + "Similar to labels, categorical features can also be considered as one-hot or multi-hot embeddings." ] }, { "cell_type": "code", - "execution_count": 14, - "id": "2579b855", + "execution_count": 18, + "id": "386b96ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Original Matrix mat\n", - " [[7.75480297e-01 3.06141999e-01 1.48439313e-01 4.82371302e-01\n", - " 7.94080899e-01 4.22684703e-04]\n", - " [1.22368102e-01 9.06639305e-02 5.88889479e-01 3.37880581e-01\n", - " 7.45595442e-01 9.58851999e-01]\n", - " [5.65962202e-01 7.18344997e-01 1.99721347e-01 2.02474399e-01\n", - " 5.86650110e-01 1.93471414e-01]\n", - " [5.72012816e-02 1.20470044e-01 1.27986695e-01 6.43206432e-01\n", - " 5.42874998e-01 7.05113274e-01]] \n", - "\n", - "csr_mat = dense_to_csr(mat)\n", - "csr_mat is a matrix with a shape (4, 6) and 24 non-zero values.\n", - "[[7.75480297e-01 3.06141999e-01 1.48439313e-01 4.82371302e-01\n", - " 7.94080899e-01 4.22684703e-04]\n", - " [1.22368102e-01 9.06639305e-02 5.88889479e-01 3.37880581e-01\n", - " 7.45595442e-01 9.58851999e-01]\n", - " [5.65962202e-01 7.18344997e-01 1.99721347e-01 2.02474399e-01\n", - " 5.86650110e-01 1.93471414e-01]\n", - " [5.72012816e-02 1.20470044e-01 1.27986695e-01 6.43206432e-01\n", - " 5.42874998e-01 7.05113274e-01]] \n", - "\n", - "csr_mat_topk = dense_to_csr(mat, topk=2)\n", - "csr_mat is a matrix with a shape (4, 6) and 8 non-zero values.\n", - "[[0.7754803 
0. 0. 0. 0.7940809 0. ]\n", - " [0. 0. 0. 0. 0.74559544 0.958852 ]\n", - " [0.5659622 0. 0. 0. 0.58665011 0. ]\n", - " [0. 0. 0. 0. 0.542875 0.70511327]] \n", - "\n" + "sideEffects: (3107, 5) and (1036, 5) in training and testing.\n" ] } ], "source": [ - "mat = np.random.rand(4, 6)\n", - "\n", - "print(f\"Original Matrix mat\\n\", mat, \"\\n\")\n", - "\n", - "print(\"csr_mat = dense_to_csr(mat)\")\n", - "csr_mat = smat_util.dense_to_csr(mat)\n", - "print(f\"csr_mat is a {type(csr_mat)} matrix\"\n", - " f\" with a shape {csr_mat.shape} and {csr_mat.nnz} non-zero values.\")\n", - "print(csr_mat.toarray(), \"\\n\")\n", + "from sklearn.preprocessing import OneHotEncoder\n", "\n", - "print(\"csr_mat_topk = dense_to_csr(mat, topk=2)\")\n", - "csr_mat_topk = smat_util.dense_to_csr(mat, topk=2)\n", - "print(f\"csr_mat is a {type(csr_mat_topk)} matrix\"\n", - " f\" with a shape {csr_mat_topk.shape} and {csr_mat_topk.nnz} non-zero values.\")\n", - "print(csr_mat_topk.toarray(), \"\\n\")" + "for feature_name in categorical_features:\n", + " local_encoder = OneHotEncoder(dtype=np.float32)\n", + " X_trn_local = local_encoder.fit_transform(train_df[[feature_name]])\n", + " X_tst_local = local_encoder.transform(test_df[[feature_name]])\n", + " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n", + " \n", + " X_trn_list.append(X_trn_local)\n", + " X_tst_list.append(X_tst_local)" ] }, { "cell_type": "markdown", - "id": "b2746e3c", + "id": "6deaf4e8", "metadata": {}, "source": [ - "### Memory-efficient Sparse Matrix Operations\n", + "#### 1.4.4. Numerical Features Encoding\n", "\n", - "To manipulate with sparse matrix, PECOS provides many useful memory-efficient functions. 
For example, for CSR matrices, we have following functions to combine multiple matrices.\n", - "\n", - "* `hstack_csr([mat, mat, mat]`\n", - "* `vstack_csr([mat, mat, mat]`\n", - "* `block_diag_csr([mat, mat, mat]`\n", - "\n", - "These funcations are also available for CSC matrices as `hstack_csc`, `vstack_csr`, and `block_diag_csr`.\n" + "Numerical features can be directly incorporated as model inputs after some simple normalization." ] }, { "cell_type": "code", - "execution_count": 15, - "id": "b9dac617", + "execution_count": 19, + "id": "90668ea4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Original Matrix mat\n", - " [[0.45989041 0. ]\n", - " [0.77400378 0. ]\n", - " [0. 0.44557291]] \n", - "\n", - "hstack_csr([mat, mat, mat])\n", - "[[0.45989041 0. 0.45989041 0. 0.45989041 0. ]\n", - " [0.77400378 0. 0.77400378 0. 0.77400378 0. ]\n", - " [0. 0.44557291 0. 0.44557291 0. 0.44557291]] \n", - "\n", - "vstack_csr([mat, mat, mat])\n", - "[[0.45989041 0. ]\n", - " [0.77400378 0. ]\n", - " [0. 0.44557291]\n", - " [0.45989041 0. ]\n", - " [0.77400378 0. ]\n", - " [0. 0.44557291]\n", - " [0.45989041 0. ]\n", - " [0.77400378 0. ]\n", - " [0. 0.44557291]] \n", - "\n", - "block_diag_csr([mat, mat, mat])\n", - "[[0.45989041 0. 0. 0. 0. 0. ]\n", - " [0.77400378 0. 0. 0. 0. 0. ]\n", - " [0. 0.44557291 0. 0. 0. 0. ]\n", - " [0. 0. 0.45989041 0. 0. 0. ]\n", - " [0. 0. 0.77400378 0. 0. 0. ]\n", - " [0. 0. 0. 0.44557291 0. 0. ]\n", - " [0. 0. 0. 0. 0.45989041 0. ]\n", - " [0. 0. 0. 0. 0.77400378 0. ]\n", - " [0. 0. 0. 0. 0. 
0.44557291]] \n", - "\n" + "rating: (3107, 1) and (1036, 1) in training and testing.\n" ] } ], "source": [ - "from pecos.utils import smat_util\n", - "import scipy.sparse as smat\n", - "\n", - "mat = smat.random(3, 2, density=0.5, format=\"csr\")\n", - "print(f\"Original Matrix {type(mat)} mat\\n\", mat.toarray(), \"\\n\")\n", - "\n", - "print(f\"hstack_csr([mat, mat, mat])\")\n", - "print(smat_util.hstack_csr([mat, mat, mat]).toarray(), \"\\n\")\n", - "\n", - "print(f\"vstack_csr([mat, mat, mat])\")\n", - "print(smat_util.vstack_csr([mat, mat, mat]).toarray(), \"\\n\")\n", + "from scipy.sparse import csr_matrix\n", + "from sklearn.preprocessing import StandardScaler\n", "\n", - "print(f\"block_diag_csr([mat, mat, mat])\")\n", - "print(smat_util.block_diag_csr([mat, mat, mat]).toarray(), \"\\n\")" + "for feature_name in numerical_features:\n", + " X_trn_values = train_df[[feature_name]].values\n", + " X_tst_values = test_df[[feature_name]].values\n", + " scaler = StandardScaler()\n", + " X_trn_local = csr_matrix(scaler.fit_transform(X_trn_values), dtype=np.float32)\n", + " X_tst_local = csr_matrix(scaler.transform(X_tst_values), dtype=np.float32)\n", + " print(f\"{feature_name}: {X_trn_local.shape} and {X_tst_local.shape} in training and testing.\")\n", + " \n", + " X_trn_list.append(X_trn_local)\n", + " X_tst_list.append(X_tst_local)" ] }, { "cell_type": "markdown", - "id": "5f9cf7f1", + "id": "d2441580", "metadata": {}, "source": [ - "### Sparse-to-sparse Matrix Multiplication (SpMM)\n", - "\n", - "Many operations in PECOS or XMR problems rely on Sparse-to-sparse Matrix Multiplication (SpMM), such as the computation of PIFA features. 
It is also one of the key primitives in large-scale linear algebra operations, with a broad range of applications in machine learning and natural language processing.\n", - "\n", - "For SpMM, PECOS provides a highly optimized multi-core CPU implementation with state-of-the-art performance, where the underlying operations are implemented and optimized in C/C++.\n", - "Specifically, the Python interface and parameters are as follows:\n", + "#### 1.4.5. Feature Concatenation\n", "\n", - "```python\n", - "from pecos.core import clib as pecos_clib\n", - "Z = pecos_clib.sparse_matmul(X, Y, eliminate_zeros=False, sorted_indices=True, threads=-1)\n", - "```\n", - "* Parameters\n", - " * `X` (scipy.sparse.csr_matrix or scipy.sparse.csc_matrix): the first sparse matrix to be multiplied.\n", - " * `Y` (scipy.sparse.csr_matrix or scipy.sparse.csc_matrix): the second sparse matrix to be multiplied.\n", - " * `eliminate_zeros` (bool, optional): if true, then eliminate (potential) zeros created by maxnnz in output matrix Z. Default is false.\n", - " * `sorted_indices` (bool, optional): if true, then sort the Z.indices for the output matrix Z. Default is true.\n", - " * `threads` (int, optional): The number of threads. Default -1 to use all CPU cores." + "PECOS provides easy-to-use utility functions for efficient matrix operations. The `hstack_csr` function horizontally concatenates the different feature matrices so that each instance gets a single feature vector. More details about other utilities will be introduced later in this session." 
] }, { "cell_type": "code", - "execution_count": 16, - "id": "b2e6c54a", + "execution_count": 20, + "id": "d0d3e69c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "||Z_true - Z_pred|| = 0.0\n" + "X_trn is a csr matrix with a shape (3107, 232678) and 653987 non-zero values.\n", + "X_tst is a csr matrix with a shape (1036, 232678) and 164272 non-zero values.\n" ] } ], "source": [ - "import numpy as np\n", - "import scipy.sparse as smat\n", - "from scipy.sparse import linalg\n", - "from pecos.core import clib as pecos_clib\n", - "X = smat.random(1000, 1000, density=0.01, format='csr', dtype=np.float32)\n", - "Y = smat.random(1000, 1000, density=0.01, format='csr', dtype=np.float32)\n", - "Z_true = X.dot(Y)\n", - "Z_pred = pecos_clib.sparse_matmul(X, Y)\n", - "print(\"||Z_true - Z_pred|| = \", linalg.norm(Z_true - Z_pred))" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "14e4bded", - "metadata": {}, - "outputs": [], - "source": [ - "import time\n", - "DATASET = \"wiki10-31k\"\n", - "X = smat_util.load_matrix(f\"xmc-base/{DATASET}/tfidf-attnxml/X.trn.npz\").astype(np.float32)\n", - "Y = smat_util.load_matrix(f\"xmc-base/{DATASET}/Y.trn.npz\").astype(np.float32)\n", - "YT_csr = Y.T.tocsr()\n", - "X_csr = X.tocsr()" - ] - }, - { - "cell_type": "markdown", - "id": "bb5af1e9", - "metadata": {}, - "source": [ - "#### Benchmarking Sparse Matrix Muplication\n", - "\n", - "The SpMM utility has state-of-the-art performance in efficiency as shown in the following figure.\n", - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "In this part, we provide some hands-on instructions for benchmarking different methods for SpMM." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "c71cffdb", - "metadata": {}, - "outputs": [], - "source": [ - "# Benchmarking SciPy\n", - "\n", - "start = time.time()\n", - "Z = YT_csr.dot(X_csr)\n", - "Z.sort_indices()\n", - "run_time_scipy = time.time() - start" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "2f2b9795", - "metadata": {}, - "outputs": [], - "source": [ - "# Benchmarking PyTorch\n", - "\n", - "import torch\n", + "from pecos.utils import smat_util\n", "\n", - "def csr_to_coo(A):\n", - " A_coo = smat.coo_matrix(A)\n", - " indices = np.vstack([A_coo.row, A_coo.col]).T\n", - " values = A_coo.data\n", - " return indices, values\n", + "X_trn = smat_util.hstack_csr(X_trn_list)\n", + "X_tst = smat_util.hstack_csr(X_tst_list)\n", "\n", - "def get_pt_data(A_csr):\n", - " A_indices, A_values = csr_to_coo(A_csr)\n", - " A_pt = torch.sparse_coo_tensor(\n", - " A_indices.T.astype(np.int64),\n", - " A_values.astype(np.float32),\n", - " A_csr.shape,\n", - " )\n", - " return A_pt\n", - " \n", - "YT_pt = get_pt_data(YT_csr)\n", - "X_pt = get_pt_data(X_csr)\n", - "start = time.time()\n", - "Z_pt = torch.sparse.mm(YT_pt, X_pt)\n", - "run_time_pytorch = time.time() - start" + "print(f\"X_trn is a {X_trn.getformat()} matrix with a shape {X_trn.shape} and {X_trn.nnz} non-zero values.\")\n", + "print(f\"X_tst is a {X_tst.getformat()} matrix with a shape {X_tst.shape} and {X_tst.nnz} non-zero values.\")" ] }, { - "cell_type": "code", - "execution_count": 20, - "id": "54b24694", + "cell_type": "markdown", + "id": "5e61775b", "metadata": {}, - "outputs": [], "source": [ - "# Benchmarking PECOS\n", - "\n", - "start = time.time()\n", - "Z = pecos_clib.sparse_matmul(\n", - " YT_csr, X_csr,\n", - " eliminate_zeros=False,\n", - " sorted_indices=True\n", - ")\n", - "run_time_pecos = time.time() - start" + "#### 1.4.6. 
Model Training and Testing" ] }, { "cell_type": "code", "execution_count": 21, - "id": "e12f29e9", + "id": "38189597", "metadata": {}, "outputs": [ { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" + "name": "stdout", + "output_type": "stream", + "text": [ + "prec = 52.80 40.69 30.92 24.52 20.00\n", + "recall = 52.80 81.37 92.76 98.07 100.00\n" + ] } ], "source": [ - "from matplotlib import pyplot as plt\n", - "plt.bar(\n", - " [1,2,3],\n", - " [run_time_scipy, run_time_pytorch, run_time_pecos],\n", - " tick_label = [\"SciPy\", \"PyTorch\", \"PECOS\"])\n", + "from pecos.xmc.xlinear.model import XLinearModel\n", + "xlm = XLinearModel.train(X_trn, Y_trn)\n", "\n", - "plt.ylabel(\"Matrix Multiplication Time (seconds)\");" + "Y_pred = xlm.predict(X_tst, beam_size=10, only_topk=5)\n", + "metrics = smat_util.Metrics.generate(Y_tst, Y_pred, topk=5)\n", + "print(metrics)" ] }, { @@ -800,9 +786,9 @@ "id": "1f833846", "metadata": {}, "source": [ - "## Hierarchical Clustering\n", + "## 2. Hierarchical Clustering\n", "\n", - "Hierarchical clustering is an essential function for tree-based XMR models and plays a role of the indexer in PECOS. Accordingly, PECOS also implements hierarchical K-means algorithms in the manner of efficient C/C++, which can also be considered as useful functions for arbitrary tasks. The Python interface of PECOS hierarchical K-means algorithms is as follows:\n", + "Hierarchical clustering is an essential function for tree-based XMC models and plays a role of the indexer in PECOS. Accordingly, PECOS also implements hierarchical K-means algorithms in the manner of efficient C/C++, which can also be considered as useful functions for arbitrary tasks. The Python interface of PECOS hierarchical K-means algorithms is as follows:\n", "\n", "```python\n", "from pecos.xmc import HierarchicalKMeans\n", @@ -827,7 +813,7 @@ "id": "3b3b073c", "metadata": {}, "source": [ - "### Naive Clustering as Degenerated Hierarchical Clustering\n", + "### 2.1. 
Naive Clustering as Degenerated Hierarchical Clustering\n", "\n", "When `min_codes` and `max_leaf_size` as the stopping criteria are large enough, the hierarchical clustering will be degenerated to conventional naive clustering." ] @@ -897,7 +883,7 @@ { "data": { "text/plain": [ - "" + "" ] }, "execution_count": 24, @@ -906,7 +892,7 @@ }, { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -934,7 +920,7 @@ "id": "3acd550d", "metadata": {}, "source": [ - "### Tracing Cluster in Hierarchical Clustering" + "### 2.2. Tracing Cluster in Hierarchical Clustering" ] }, { @@ -1023,7 +1009,7 @@ "id": "080f044a", "metadata": {}, "source": [ - "### Performance Benchmarking\n", + "### 2.3. Performance Benchmarking\n", "\n", "Here we benchmark the efficiency performance of PECOS hierarchicaly clustering and compare with a pure Python implementation based on [sklearn.cluster.KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)." ] @@ -1038,7 +1024,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "PECOS takes 1.0543 seconds for hierarchical clustering with a depth 5.\n" + "PECOS takes 1.0039 seconds for hierarchical clustering with a depth 5.\n" ] } ], @@ -1066,7 +1052,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "scikit-learn takes 49.1826 seconds for hierarchical clustering with a depth 5.\n" + "scikit-learn takes 58.2397 seconds for hierarchical clustering with a depth 5.\n" ] } ], diff --git a/tutorials/kdd22/imgs/pecos_label_matrix.png b/tutorials/kdd22/imgs/pecos_label_matrix.png new file mode 100644 index 0000000000000000000000000000000000000000..837cc9fce3393bac75e08e19f2e2d9c4c8febf8e GIT binary patch literal 73040 zcmeEvcRba7|39UWj0j0MR%Av+c8Vk`$;vp9nN_xoQzSAXvR5f&?>$cCNOqDv4%vIp z;~c;DM|EB4y6^jY-}iNWzkmE5SC0qYpU>yL$LswX&)2Ib*OcW*4;?>*hlfWhe?{gx z9^OF|9^O6}(LwM=Q2THg9^PRK^Glbm$zQqzyJl-+Vs2@Shj-;km?oi?S{+Ti&UJaS zhu2;|8X$P4|N7A<7NRTEmXFwiuNf*5aXH?&ad(!*k6Mpf)t>hO50@dozu~<9fv_-# z+!vf&HW&Am&2gi}(85kQ*Re)KeO&1}8Qx6^@zk@l?FV5hA^iS9L^u^&7M8vua4FAJDoso{mn~# ztTb;I6>8o^6U+?mv?iRjKtJ&<6 z@g-%}XM66${>G~4*jD@Bo;G&Ajio>B@yg@;n?xydIneK=WOh6T<*W# zWL``wJ+B^3pHfuI>s@-}Y9%|0amkZlRaabuiBFkYD*jYckiw0cC$uKZmp`O_eHYel5 zj8kRie6Ypub+f&l&S))&W~JpLehaL5zbcG&fP2eS{$M4$SIRY3oR1TurQ0zZKFj(= zL^jhWkG7nbCvRq|9(Fb|C)J{?_I-)EVtBq4U--lRdkFF4`d$Wh)$VuE55=pxsp5~^ 
zBn~1LCbD`YMpYx5$CP;DKo1O2Qbt3E?%Y@G9UJdV#E93;e5M%{`0&geJ~n3&HqlLW zM`WrbOx*zc7F&F{%JujiF+c7DNyn~vo?4S?{$l@S<%E%H#};PmjbYQV3}Mr|BXT>j z^J)AK+`j~ccd^XC9BITJ#?>Cr6@B~u>Fw1n_L(=7T?N@ty_%!WNuj@G{GVrJ0ndFep z>~npGe~s|?JL0xS>@}k@^-y?=kyI{(`^k2OQSPZV5 zr3`w_GI@oDMf&K+!$yqP{W4@aSaqIpjnIw|+3phwDqwL6mb*f@ba3;LAt!13B_&lI z`(smHd99phk1zz8ybNuIg}ZP(@}SF4sOI|8(3*^tR3O_dN|fwteX176OJJ_H$h1fxMy~)(@^2Y zsqN!9qWSko$_WB0PpWr8Z`lKwjgBN!_z@dFVt7>jsO>I)A=LoY6Pc5(PbWy%4Y>qj z&c~HJDtT5C)FordF3Ic|dWDTOgzD6TqY7W2Ey>rsc6crEdiZtvYsc&4+Fq8*j;ubR zZ67IZkCX;f#JB~xg>A?7DYPbcUEAi&J&JF9+?dW-Ea{E1?u5DAn!&SYM8R_?fu(9`4>yh*dK&$Z5d z8u&bEuXF#jL}0uM{AzY$!fUD5>t=#y_0E6NeX6Vmz9R?GJJD;@IwP`2j>x@@4ljwzHWm-pOU2v+J z6Y38wQ#p35+V7pg0^RblUb+;z+50jQViH;UMf#}&`js@5eC~SgK65u+-h&+ zv9|C!<1xVq0rJx}f>Lym$8$&@9rP!?b2#g`+vyXA$1FL|@(A>qI#(}*9%$ttR{xRkq?d!##~P{d;ST}QJ} zjpIwJxLFdhOJZ4iFZBxh-CfJvgRw@KtC(pr8A2wX6Fz}tnkQ7{ROFuB_8vCo_)^y( zKU2_}>cDC6X8u~;r}~%0vF6Q|(Z69$ZQa>FIX5(y-#^_CcRlYyBq8L|w`S!0bV+xjduFqy zp;uu{eYkzdYBp}J@*O5TC|)JDg)+geiU3W4?yd@~V!~EzFJZ|xu%q)c3B$*MTckz=7~I-9dDWDujjzMOCEo` zNao`H-};#M)8is{+TUOVgakq&%d{vO0`;gdux37M z+rI2qPhJgQ)(r5VkrMZ~yxmQ3`911UKk+=(CCUi)0fju3>ul#ZEZBn-vaaHSDWmfQ zldKwZ1kTjzDxOoOPU%Q__Bo`@@wGz`DSf$#CFXp~y_#X`r^WPA4%?f=AE<0OPQ<-A z_e^;4mMF6XySvktsXkA3NN3!;rmo=Dn(>9sCHNuF1L+4C-s25j9a?UD&=?lQpSLeh z!`Q*sQ^q(^SwOBre_K>c(A93uC$=j#PQ<o z{z+D$*pt~&eeC`rv%#!Oi%VN>5*um=d4p`N7tS{*w63XsI>YW#|1MQjja`jj$u73Y zxo7$P!uy1j;kM1T7*nmrNU`Bo#kI!F{9tXm{Ag31rg*7xcV$$po?WV)-55?|=@s_9^BWaZ9lwp&h_F|?tjzpOMJ?#_PLLNtWh4o@5rF5P5_0G?EpIa0* z6!^B!Zl^3IY=xWfS!s_ITDIAKwXIu<;r7&+xtW@ucXu>Y{gp<0mPsa`HrBccb#A;t zyk00fDN8-WiaUre8*W-@y;;^WEjw*<JNvX>e;eIv|Cz1=#rzag#l zj;up%{5{vK8Pa|o_^@ui1Ct$m(jy0+yAuTej3{-1O7b^kBQ z>9c86?S<_r9mR}vF0V`9A%_;S`H*v#1!fuN(-8LE0#mu0wOdbV8RxA-gdE~OiY~r( zF;lKJ%}k|(b}#1=J8OCr;v>j*#x{G0 zXtKk58U`Y!vp4k3%@|tm&)oUcPStraJFc|3(qr**V_-*&ys!=WxKsLU(cqfDSgDgf znzy9a)u!3~JvM}}hrFIKT0CRBe`wv^O5aM3$y>5)<^0fMCC4@ge?onNhjqeE04`&u zbaOe4)|&sg*n6?8ZGp9*wTRCVm2owQsqNCCFl_oI#DN`$R0~B-j79f!m9WCV@yh}l 
z`@y@%CMNq9OgQjXZr~kT=)262E=saSk|>H?V_itdnX2gG*T?(vR(X@Y#eCl_Da#@| znA2&zdiVzF)`PNga-qVHXZrhZ=$>_^WLa5Rtvq+&!56%@wAyg^{D-P4{$7I|Cp@$0 z3{fvGZ=#mNIv2wxn$s5T-jXMZCU`^)2T5GJJc#cxvvUCB6jWo&$Cv`74?klqc@rfi zJT8zX!Xv;xj&}g0@WGEHKK*a$%lMpl`*+{(!^872$0PWjM;ZKu{sn^{sLZ#&`$Hb% z5rQX5@ZUomW;XV@XS@Qz3*x(1wC(Wl zXxO12eEI8a3!wf^^BY?BT1twdw{5KW42*0Hjrp9d?n3q8i93seq?NI~0nFLT(%Mec zS%PIZhbTxxulZSEyIJflBv`bRuE8$Z*c!ux_)hbkW|2GugTcgYjZ8$Z%Uu3m96U*| znAzLk73Jr5a&qExI>TpUYsxPmA|k?nT998*kQe0OwR5qyH*n^)wqyNP$&Y$ujO}jQ zn%}iIx3Pvn^%@x3IM_?Dus{v{_U~Iijh)SZZ^_#3`>?wDp!KYlML&JPX!r>Xch&%1BIL`xnL=l^ZdBoEP>pxVGXo-mhDxdHxyT?YNb zrvSe=zx{>M?wzUOg}iup7x3g|q;ELm&-4>V9&ImOSai}lN`2s&tc=VN3fYUnSAv7d zRrm4vY92YFiPF=RvXNY_u{J!1? z^G~M;`T!>B#NU(j_ayy`ZTa_3`r~nXXu_3aR7ZL0=%F-Ibnb1r$<|kLUUqQjp#$3G zj#ggFiNw7&$jX(OZnbj}@_T5)1K9^Klh)h3dnit@+QTqz&9r?HuJ^qN$-`^>Y0J|G zW*?K%zjnv0E!A-(INM}PIBzf3Hq3jhm-kfb%I%>?$Mqh-IwoHCG4HL^{h>DrZT-Sf zb$d327-lBlxTnB;RWi32?V;n?@1VK4z1%V#H*Q_A_U( z&ADTHHpj=Jw-(uzCu#NUpEX9GOP_g4cKqy>WaT(6%xp2dTO?*6$E4}_=D>QDrj0ng zi0$j`F=6Dhe#c&`wj|xld%7L12)b=l&?oR~wLrqq}_y0^x~N?+(TO*CLe|Qo=f6e58OjaM=spL zZms6EDT~!%=87}p! 
z{4_RMX=x$d`KAp9J`5TshWG8MAg2`EaxprwwU_oRqaf!3*r8WypE9uD$DovuWe(FP zZ9vzPFB$WzLZcX^UR#8spRb}GO|l)~LaYo*&D#v|~A$aO~NL45Pp#G2&lU$&M|Ii9p-e(3Up zJ!kKN(1EPdvE!u+?@6X-`oc8zPk~*m8h1~aqSA9F9ga=Q(68(vt!oyW=TQjfd8QmE zN?lvB=SUE}22-n`5ML|1cM4ACSau6e_7>a3Rr={qj=GL&on7?juL{Plqv15XH*e^6 zmN{5x=iGWY15RI!0N-B5#rzNj>hqWJ+#^5V#18|AXR0uBOH=9D;WO>wI@YJ=kvmnE zm8%`>!>wWSZsTH0{_yQYqiw03WvS7MxZXnn%f=L;|fzfslV*9qZ2Io<~VL2xEf<6z-Y|XUp&;7wC$OGI$!+X_ROndim z08t-e2kY=&v%o@?Ge*?m&OLtg{prjNyK3BK3xcLLu~=t&8otA2)my~-^dA1+&Uk_@ zlE^)D-`kQPOOE60&B;$My)tx)vy6i7uTHTyMDPdI3R!l+ST6?z1Hwi#ZkgXCv=o8d zmR;+y$o_hac5k;b0M*X3W$Pl}v)f>-`v{rbavI~rikk<34yl3C{X-Xfgf5E|rn}{; z^$vC(!C%YhYFT&emy`NWt3z(JXO939<#I0tt-%rvy+9-tW%Q>Z%e8rv~zIO*| zfFqePJYy8dF(J1@doo4Ar18l9uRZbc)d z2e9x$*7vx*n~9FX*Hi20hbc@CIKxJDA?$e#eC1z+1AF&Q&lk{M(((18k>N%*#$?eSAAh<(0RI<)A@407S!Nx#L2@fK}Jl^yE zHQQ;+uG88DW~y;=2;Anladzce+P_;CwopS0{|Ihn4vWq#UY8Ze*`7LZFktHH>g>b1 z(tCDA9v@ijts{mzdl$QjX8~{xb&{AFwQ*JCYNkzHh0F3(z?r=S(+Kb;;j64D+}>r3 z(H`8I&W^}$7ENz{g&QQ43Rd8+EcnB;bwJZ5b#aDi&$Q)49Z)_$L=Yu#`*NII<$M)p zB={0p>K}&g0H&=Yhyt_sVzD%N7%$;k4|o`jaaFYYy53}8sa@Q?U#|QNa4F?^JnE9# zvvqHsIh3K5dl5`qv1m|Np@pV);T`oqlobb*XX)BKLe&ODYfM`>)!qROKoRM@uV~Ap19!cna(}O z2Q>5WAC79z>_GJ@frDmYYsmO(@x6V6DS_#Y-VztuGwHwf@CEDdT?R;=?rxQo#*2E@!HIfH>YT_|A)TeQoqJ zm;udrw{LD#m9eF?e@fymdC&a|Hlw~pL#-|R)>P6Co2q3bdl zhQI!)B4Qg{biPQ$dY_%j<|sd+yFy@JJT}N@4Mb1beixjwq?tL?j4b?;9mQDU{f zq^K~JjzwEU3txVlzf-5;!4f2ebKZ7jhuF@2b1z-^OK0*Lj9L?A7kk^WTb)Sjfn)eB zI8!N1l=xcL?ZsXPUeY9=va)6a#3WrAYjfm zhnaRb8N?i)>sXf_c=l*t>y917Z{}5#iG76B9FNkY7_o6K&9NeR*tOn+g2<@S@;Cp)tfHcNv}hzR1CJddzm{>U06ZdipJb0lmv zEp3VsR(VX$#J4=&Vq>j>-gYCG+!);8FQ|S@8VUrE==to@{&z2Z7?1CGTTXm>xyb)G zrgynm5)B5xgCpqdmW>tn#@Ki1!a+SHP3!yt~pW*8mpa)c%8qgHsYbM2GGq?_RusKUUbmVMhT;IKcMu zZEh_icX7X?&i0OA>rUH3nBw3gD&=6hGbUHiRqmsJ$5RBeZTnfCqXtRz6eryRv>z0p zcA!(LWoFshemU9@!J*r}U8kwjMf!C5@JzT6B>g7*G11D|&;)5g7_y@{DW> zf|)F{`gU)FN2&kz@w%Q8g+dq_Ge{u7?|(lSmYlq#%;xK&{l$ww;9Qs!i*#c{A|sn` zYH9@)v7tWADok2YX*BK%#d6*oLSn-^T=O!XkrA>IS18ofTsSCE6+BbYo{`b@V$Wb> 
z#;W9$s4q7Lp8TV#)z?l^qVk*+i42QwU!=(2Jc6?HAP|UQl5QN6`!65iY3am+k;3bNxRLD{V%J&DVH7k^*H3 zjpt8OFt9Yp8=6$8>!a(E=l{3J)jt_lSmhIf+!+Qc&&XM5Jb#vgK}!=cb;naHZF_)_ z<$@DcVkHya z@{XgHmd5$PrHK1}xx+QQwDdior*SZyF$xrbkFOmmXj2Z2(7&IG!g!MK1*d0Ml^OBO za%&$pbXfw`G!OC|%oT@P?uELr%O^nJw!U=;N{WJ(i=Zz2OiK9urFm2kI7-mg4}xbb z;5#}AJvri%%E~@_bpic78!g-|x9@H>>2=PdQ(9E0so>{_M_Pkm8n#4K@Ve_14a1-v z1-JSz9++>~UAC%=6bgzOc`yZOo**gEEPIODS8nY?hA!gZ;fYFV$N_>}%BLtz@KAS^ zh5~iu+AL|XCH4G%KujEcZ98qdhfsoSv`1i2NsO$7>W zgthtiT9|#qid*|a!eHT@(CRj6hqsD^9YaAf_!Hc9=kYivp*PAD=?c%CKui6%qm9^R zVd3jw>cdAY)5RiS8l7x)MAAG@H2O=;33DrUwpo4KB~n5HUNfMUpmh(4wQ?AzL`_Ar z(SccMP5TE$K!?&@mD%@vztSVkGY?I-d~;DiQzi=Lh+XY$&^C=zXX#H;poJvik*8vF z!N4q;?MyG7aYeNv80z8TwS#(B#C%7wpdSa~9BX6=a#>_Q9q`nm(7C7jqs|(&Qxg}V z2WSZ<3b^@Le5p@)@=>5XyZro#3}yXZkPzg;ioPZmcxesnY_t1Pt0aAY#p0>;jOANl zshwyRU)hB9*IUsdb|x$Dt}P-Itm$+&EZ^Y=ZlR|LAJg)Q-R^up=wWq83|N26&_!uE zDrjDkPF%vkK&{K|wVe0zKK48_1DWbC^;Y}b*L(8rPOGdfHCxTNla?ZT`KL8kbjs{g z>nmQ$6EZsAa!UZl9W%2#p?w~kqY<?-I-)(9s7YBgt;fX z2^->`9zVx5P&-uZqceBX*=hhwCOYw=%cd*G;N8I{mmy}K6O;VKbtD)4LfY%P%XWp-1kgtqB7q>5ymVs;w`{Y$ zv6zc~fWLpYq;lkG=Cx@w62YayK0l5Imgw8SEV-5RHDaAm0=F@I8(HO!8N;T&&fDRL zcQ%~qOi@X=!6-FPE^7Yybyv1#w!RC3D|KdPq%QO}5@?~5mo{7wpMV9KcWoLV2%uUK zxcGY?=yKse(o%yutFdz%*OjePY`t3&v06KV@&HO#rP0M@+>7yeSL+!~EKA6Z_^Sg5 zoU9V_9TFf>Tb@NK668i$qq9HRrGzhr+O;-WrKcq<1SqWY5bL#|MPGkPy5ujqoPw|k z3UOf`N=xmyVMtLvnH=-=_T9Jmgh%6nv>Fon8JpL`Am^z?tm4K#??I4TR(rN?chin1 zVCXJUDA;ziH?sS-Ydz9s@%4Vwo*@ryO0eHD#&i#rK6j*Oz<@JNA~Joag93Gj=OW(` z)Wt7*)6faR0b;saLWY7wHC_7>MS3pLx6&=hWFkZL)`uBSX-GZ522e;WCi1IFgEOCW zcP+5G8)))AI%tO-p0I7|GR#C@uZ*^i;nEgyEwOAZ^WGlzY8icy&lI7l*&YXHs5dIM z)`_T0IBW9;Rje4G9M_(%so?N3^V0M{b7TKC_oe!FP55k5M)P*I4 zWpi}`k|1~M>8I9Eq#K{G z^-==eWqZ65MLKWDjHonEl;&(VL2eIAT^7|}@qwBibw{eT4S@@2jKqxga&}blmU%RJ zmp4rAbW}U3nbhv~kRt`^A+(Re^4WptoP=Ry_VpH_L_r#1&&W_B(96FhnPx(TvW~61 z-i(|wUawFn3azydfg{Qhby|#B6e#G1R9kQGD1HnZW8*Sa0Ia+Mw9q7WY`)&5rRoF% z97l@S|Dt5mNnh-0%{jsC#-4g>1*Mz%M|JE%3qxNv3ckw 
zW^fehy8$;=``JB+0`>j`v7v#rHMMl3{3ENsD5;)LY&hLv^IVy*DM*UHQDV~!QyE&` z;KZ>yV9zNWK@9CAj`aV-b;7O6O5Z;r3GZ;bez!$pfUy+e8g*|{lKnQbW+DdNmo3QZ z3$aA$a}ii;QZN%Yn!^Db9X{v!2u3q@Wp|Gg83NnGx9$GtlhH~PV(azeBgN}g$W$S!6$DLNs==%9fxc{MZfbR@Ee6LYj0TVot7A48b19MS7`@{+d(4Wd zpzW`G^8af4Q!}Ax_^}r*C+s|T!+E>bI~0@SrqY)J@foHD2I)2}XAhQcA>o<0d8JKD zy~$qwr9%SbTC8NHaEw*DLW(qex{<`i z;Ge;)BT7Gw3OHrKM3p7k{d>Yxs(tN45xgtBf;U zo9IZDz1JwQ_9@#^-K~$=wfNLbd1tuyREhct|26KL<>;qRf&lcd;8n&pJTqi##U0vNuVQ?ONxaj^A01sMt7LWMMi?VY=FHa%2jZe8POoj zZQL9+@cHnRfj$8N@UP9)b0*eILFd2vcG&=6VpMOj_{^stY`KZGQ-2H1`g$`D4r}~_ zITsv{@L`BPZE$g>6N%VqoV8|bMn0k9_yBnG)rMF&qvOQJa8M9-Yi%~hr@yIBGT!;? zlU#U$q`P3TjZ4={`OuR!9+mf3>vtRGP>DZ*S)J)VV}}X%VGx}yXfs4IO58WHc3NH- zpQkg8J%VwVACR!~03`u_g?nwfBU6XbQP>W19$-7VvH_sxTbE16j9^5<)}6X0ULD_UTrJjUM!5WFEwbxcu*3)c;9^b0rpT=tdffQMRUNMSSh$DM z+O;_a3f>YKkPPL>yd&Vk9lPyOJz|_2|oEE?mS+_A(76K%~a!5~=ocvvJ^1Bi% z2@(Xq>DxY&8M;7Pkw5CbAgv6fYx?~Qly_%SQ$TvAtwreL=T6D+b|w$Le$m2b5>8g! zIt5u-y2e!;PGeWJMxF#>GvC|JZ|+!Hx{i=W}gIU{O<$injAsZ*G+sNHWW>< z%_lM(HL@Qj5QtMN)h5Wb6)d~w8QGlL$K$CbLf3y@T5=Z{i}q?(${gn zVg|$~GTUmWgICnm@4b=d2gtPN!V}HJhSHXM)R#wqmh*`HLl_N8Lsy1^`#OOh%=bVp zHSH0U-%)8=qQ4+}>w-6Rx@0Jm%1*i4MjdjbP4Gy5{J)?DzzGUEY^A45o=6%UVd5T2s6uzXEt}(-N&qH7r!OoV=w|2^)cdoj9qnk|B zW22kEGF@cH(gOmc0#_mBn?;tVfB~eSU;B)1V8ZN+VV?x0w?)y*Eo18d(O3su-I8Se zX+Ec_ar~o75Zk$KYu%=cI|#SM8rmW;OhUW5_Y{AVgdTXdt)bR5OBPJQe^@?WrIUz= zv%&hYXsswX{7aAZhgYfI0GGNzdSRjZw(Eqs=~-o0`N}2|H>5cNW?&3O7X6%JpLFNI zJ(N(_$07MNVeVCsBp2`}L~f?e190CG0FvrK9(Ap~!i!;wutT9X6^oB&fOT2Wy065@ zKZ-Zsv?Uxkk6!J}(gXf?+yI)_XV)UJ1R;-s^^kM8Hb@0P04Xa!JJ$G7Q&wIVx{o;x zGBrk>k<)jbJ+~c_OKpmFLc2qbW*>IS&fUkHgX%!+_~T8F9G4CL}X#Wwt{y(&upx(Kk8A;5C23YLijvY&yLXvP7$ zu~&X&Ee^n!odG{HvN@g*VO89(aZ)Mzxurlp`^Od>PPH)-O!9Kl|EIJ`=tN9e<_a13Vf48Zoi**px8S3_q2Yv=YD5R?|c6fDw# z8ym=QW3{FLBCv57K$#MlO_qoMqfGqmEffG6K*k6d+|Hu_^e3dgmuLQkVEy}#f1@G) zJQ(b6G~~a;=l#u}{mq~K1fc#K4f!uc#lO*ze?#qm_8_PqK#d(mb!5Z&rFJ%3M#&Ky zUclE1g6X6`FryyVCN*>s+0~Bnr%Utje^ZwkpYt6Z1!%DDW@N^} 
z+=^HPlxq<@R{?G59^n2r^r8)fSOmC`qIo#rFm@|KBX?nsQ&VG;LoL~VgFDiAX~jdDz1i&sLyCqrpg#oq&TrsFDAx#h zcHj9$3AT$@Y(Yc#M+o)^H5>znDuOx-pfm05VT>rrtqIoBjj#TrkLj(0FTT818x7O& zVO!t-1C&@~^N{C2u6X<|fHDFCA%0UH=N&~YIpd50_7I>QO)xlsTKE|t-vB`$6Od3X zD?g?O3033)^%8<_a*Oyj0Z*c2#V8h=%2f^R+n!q5(R~BK5O{F6iPd~c)YAQL!~X$D z1jFA&Fh+n}n$UD$B7T8wa!d1MBXB0{zM}y1_`P3wprS48u8KG{D+>i%TdZYLSc!Nq zHeZo14XhttGNm!WzbRLjAf@mhw9emNe%BX&o6Dcoko%@|M#qIMVa=@$gcmBUw69|zZ4YdL3-Fy)xc3x%Euv3@0Tpe;#N#A-=gS*sCLeMp>(UsL^0gOUWPVr>$0poQ zdY}b)iA;oTjb7B=jLQSS-hd#8^0@T)u>5H12zHwuz}9AjofiggY%k$n$ep++Gb3#F z=>-IAYeC){{!rtwpO8$kh96+*JhnHwY~DV=PZbXXGXkQz_+q!#=Y!x6-u?jO9mp`L z1GCA5S)J{r4ZwQ{;;#s-O}C|}yuA~y7=8BC+H`w5?J@j)KuCCsy}vd?mlMt2o^TH2 zpJ-*J4yf3{js~1*_QjSt%|8Xz0!$_~21u&Bx6Ub~p{aS+rh5vxLGfMq9WAyQ*b4wE z)Ia*-7Y0$snJzOof?NXC21A zNS&^@+f7$ab8y{4t?Q+p|3yrZy0`n5wRXwc&dOR9{ z)1&mu+>sYs;%?Hwtwva`7G$R3Y(6BphCj+0w_&(G@%g1}+-4iHSgx+(gP(TbJt=_f z6|(Nrat1zv6;jl@*2WQPL;Yl3@*k!Y%M|2a-%mP}M$@LrOIK6GVLJiia;vmM7K}Yo zPxZLx7&V0AWnX3pd{tNLlLIrk<#D0}z$I>9p-5NZPACS8Kc&`C{R4LwBlf}kK!{C>_BDy&E)EmFDm{Z>g zsTfe?gmatU1qB?dYUvp0eO0V&=@;CtwJuLl{-SCgdBYmSBe>zlN6I8A9#-3e@dBh& zE*KC%JA@h4{|+Xq=0I)`Bc1BT)olL#T>!UGBW;0mz&Lw#9-u0tsY2MUf3?yn3V)xJ zGIxCq)5u2r@NmAia6uBzEVLbyhzDigXA>{@^Lg!&Ty@x5y-;~qBlTD zqvxlP;xhqaAHIQM7ZqpodWDPCw6`~%Himo}B-Z;_e?&6VGfqGH&G8)2bLpj#P!%ps zt9WyO{dg)m#aTL?Rc9i$)Qh_esr1CIBffhRT4LD8`gA5fOu zG%DR3i%9~2>^!&4CE#i70UW?Tu1vNinYP43x-LARp;~EEiJ-TsPz3>Sn7K7Fu9TFS zr;HEH^P9%IBty;>pDSIRv`cM`+dSDS*2OpCfJdxtzeA^dv336laKKmGI2`5CEHeJF zz`X4>0EOh;Y3BkC%&{Oi4dng|?oWRZwMmRvQ+Reo+-*nSdC1$g>0BoiNNW`M$66c> z1|aU9mXVbNDb_P%Sy>Xg6+a-Fubd^wfjJC>?1)W~Ujq(KGU;==o?3dLY^-zSR-Rgv zh0&!1xySv36$o-E3Zu{cwP^s6>_6|C6pLtY(0^5TTeL14*^hSB2?w6DAi7RM0yF0F zPf+fkVcq{5=`WU^U2wQAYjsNohDH=1!^3OZeB`G!y4@{MwuBH~liSI9#32@N&$fTk z#}K{;i)pErHy_uhS0$FZo_wjaR1qleM#qfQ=QyPmiPcSxOJ^Jou#JQzKjJzQ2-7%t zSwxXS?P6FI=+6FECr_Ot{@2TEI#*+Ob=b7;#4CVKMyVtQ^aH!MB2%dlannzQt$eA>? 
z-#tT=*3hSX9JnOIT^r3HH0D(8wmlBtTh2ow=kDAF1BvUprCyigblO+GN$%u{OvI_~K#&pI+LHY(&pPLdIk0ao;1XQwfppGjD$gI3Jgp#1eke)bydKc-nI9yxn z0i5wa_RWnAZO44anjyCL5u8+xJ%(AG?&JsD;ZrHxMpw2Cy6P4IQS1+p**=YYqtCD9 zbK%+gm2ZhCBE?;7XF7q}{y{qrBxZd>myxl8>se$t;EW-1uD^Vk7IwEo+qA}`zs!L| zt0@G;65DnAIKX;g5UBS9vA}>rnNwySF)}m`UlA*uShv{q`v_<}kPGa|3{vvz{wSql zUY#pL33l{kS_n+`JMS#90?yKe4&J03_>Pdbo|C1v;6S-6cf0#dL#+bnpjZ^l_XPAE z0b2tG?S*w8q5maxrW=K;eb3wXG?`Uy)IG_}mTe6X%9OM!ae63qzd~YXkjk{G^LW(` zb+*60h$CQK!~1YZj=3*`?%)(|+o^T)2_VZ|y10t|&kA+; zkr}pZ&rN$X^t4x`ao9@i%{T8|WZer#NjfdvAw`gk}*11J5XEvX*YPTlfJwq9tnm4wlBGRin&?!|nZS5LpSdcsSr06hBd^L;wLPw2)7D+~C+v)vp$LPQDP>=-# zExV49cztxIKmU3&X0%}u{0ARtSHt`A5@z~Jo06USCnA2s4`KQA%5H3}PMdlJ0OVeM zHMtZI6l773(FNs(ke|N=h>-1Qm}Wh&%T4oC0=(MDXaw@?Q|lAr&N(ue@<~mAfKA$Z z9S(w$+K5(k^9393m#Z8yLEHp5KQf7xYo#D;pD_Rg z=b5!8Hq8N7I`NsDY9kQ$MbPLYg$pnj3S-*pFuyUDR6CaX70!z$>kXLwmk~^UIQ_F` zpI|N^q{;VuUsUDlKrBCYpfGsOt`KhNh^&wr1m<5}zY zilsv^zxzA#V)ay{IshjL_AEJFR0EYwB zO~pmXY59-o-K-wq>l(I!8`xPvYOMB0B4UjjavHSh$%d36P<1=N2_Xg!J!95wYAP=N zGcc&f6RoIhhXkAFDG~0R=~B1|d&^?)mC33=OQhbkV15HR61R4f&SlQfWxA3de$R2U zyS&bY%8?uS;)qQuO4GL=L^ZYTyH^|aJQ2ZcUK#X>VUXJJr7b*50YeR8R-L^e#N)qC zfA;VOz)~nDCPEWPo*>l}^drGPs|nzeE$9L`bGVa~J~L1hj#7g-5YRJN%w%V8EiTAs zFz~$w65F!6lxdd`tf^_`J3olwaEjOwOLgjsJCKnEREa-BHz|&ox2LIZYj~qKJ_vhu z=m0dGrRQ%to6TL_(KpTQuP^@_eG@_M+hv>8=$3Ri-P$7_olRLc@||^5+$bhxg}9

G#+*=~cOV5B4blhyhqNdF-BvxYBLCy92#}myOE&V$N_wY~ikdvcbtKnvwT0#vsH1?4c(={b${=hSFu#WRj16}w*7f*givHqmQ z`vD#P4}=uTExo(#*EgGH5U6xzhF)zCht>gyIfFT>`MS0S$>(+-=1*D`TWgQ6{jb0n z?Pw~w0ZM$+x#k_lirWci2?vUa6*3LP2LEQyYuTka4E<6lKbY;s)yBU_udZn^iHgWm-zb9O_!Zl7sYY) zdg$ORhGL$mwnz}bATYK2{5t3XaAN!mpt*{ZPMe5I9;x`M>mry2^ZAxuc#oy_G`rE0 zxtF*)h|Bz35(YC-1E9{I%@Dx!@ZUUso_{kherjC`_sq8FV|Y71BWCdB{x8dyK**_! zU!om>v)ve^;Q#L?b5BsD|G&b5{2g=Z&jvpGJLc5Cq!xe2oca$!Ap+hbJ_Y`)$GW%zcjS&L$fmqZdnZ*Mk&2eH1N1ZmVo?HOvQYf(c6S!c}SIhBE zfB^+z5*6>ol88R^VZ483z1F&sQwjMm4^qH(-X!`qw7qlKON)%**(cRgl%iz8T{o@ZIm zkDkvvKtl5o1Onok?I{eMVA6zwHL4)CLMwp2IS+sw>wt@62(L*Ik8a6n9QGXuX35D)u+0WVLz3Ak2>?S>5np^%0c`A_la=Uy;5wmDr$A<)z+{3@ z&#E$MhuY!Mu4RxR-j6%$lcKTm{(S%{>hx-RALo|rXAsHt)7w|~&it!D#KBSz)6%%Q zOYYr%HB%z$i(T-Mdd>u(zl&au^L^Kg%=Fs70boOSS_ZJ*S6t%moscsB5`4CR_Q*_L z7JX)+P7#Yx4)rr@b*>J-8?rN=zE@;3!q_w)2FK_V*o_-cDP6kGN#ONWHcR0G*CPtf z+(9Ls=#I-uN?2Me7L)R2AybFg6tNap4~u4zMedI?u`bQc8!o0FjFB@{!cru(l3QqL z*X1T*(@~C)g9i`32l8snMXB@WH(p$8?m^}+w<*Qt^_^NFq2-G$Y{9&6OKeu@>@T^u zABo!k6wcB&-kXR9F+(XIKYm=GvW7Q@1H0^fEfeCzf`Zt_Mp9{wUaQSOg^K|Q(0VUQ zA?3oj1roSZ5pKucS`Qdy|t64tA>4 zS20NS)|4N|0-22piaa|nG$u(~;=E{Ba)0&CTmIQo58=@|ebiV+kBy-m=f$rjjY(>Q zK!iW6nWam-A@XjHu6mv^I9&4;e0(Pe#E~iNvkEHr>H_Yu$%xGA^6WtGXG68KE&q?b z_ke0DTi=JTBNlW-6vYt)VgW`084IA)!LA6*h#(ye3Q|Huq?e4Kf+CV_u2a`&-*-&)7ReL zO4%Qi8G3#D{;u8?=s1~P(T1rl!B<~ZY^7}0sa=^e=_<0JCc-VmAplMp?fM6&Sm(5^j|4#>PnmdFDnNzP4 z#3=j4*}&9t@X0j*ljO%f`FroIJm3;&Y7#IC#ZYpgzQ8Y6)u6Z; zAzaKudgIfE67pKrh4HAEXnL_*;MK~P)@wov(w+e6>lfavl3WuKjqt@%k>2=-kKlJ5 za;!sNw7Lr0xmEYcQCDS%mLGhD`r?~C3l>wV5C%_#F7ki=e1G@k?|t|=LvkU1JO-p?V&;&)S|cU z`O=%l-a$fy5oBUQdgGtYr@uV;Z@(mG773%xCTle0L#`vYD4FYcV9HJ zV~h`b5^(^;ihKJ%tQ^#V*tg3dd2x`bK6Cou^lr7kU!C>lp#eg&piYZzxdn*hu3o=B z=v^#CkdDkgE5X}053RykL@BiXDhn@4F^_V`phDbT^GXjK5X~6oMWC8FYg_Vf+(wyn zjpE_|TR@yQ(9_D9r<Uk$9L`!jZfLb`S}X40YI-P_Bp7Pq#=Kd;^F@c8PG*YzB)DocrP(NV!F<8tNMZt z1e#_samhjibD&1pt4!Od3aEuGBAz<)0$V#I7&>wx`D#VJb94t?!dNn56l#1&$!bs0 zB9QWW?Pv4?lI@;4ilcz2Rn}t!6a<+vfK#xuWk)t?dpz;YRH 
zZ^3jPLYZ(!l|4l>0JGN2jdRZ3ChBynNkdO`K*b?95`b%bek}YygwSI4wulw*SLg&3 zkNRGWBQr}?RlU2i+@~-{-f%*1H3Zn#UU$95mNX{^_>XRdS~(i_3T?sZ1(K?p1G>7p z%Ci*5#ah8h=_1+H#0A`HcbV%<%}X1$DDla6RnX%JsbV|ZCji5lPXCr9`IQCH*jBD>(chu=0?Nzx&B`;+`F6_woaQw{SMHKTLO0r)qrr0sIYB(i@pi+KZ{M%Z8at;9@ zyPXV78QlqZ5I4CbNU>Fd&#ET4V3*x81RfL8xO+dNU!-R~y%cCu04B^Izvv1Ws(s4rwL<4{CIxB?Pj_yL(k5lXt5p~(aLejS9_ zvD*jWd76(}NzO%Et;oW6i=gw#0(fPeobFb}-W71$8S; zis-y8!KOf=@(Sk?Y$GGE*y7#qH!-TCy?}%@JSRK9_iPf@Yzb@X=$+nZ-@cL57iVwT zKT6ZF77-Ml)KIk8tgNXqT=S+u#_yIr@eHug3h%zb-2iQ)MYUcXlD~8yQ9i`oH?Kte zR%N4M$XUQ(lREhK>v~1do1}h%cl=scao@yxr#H}oc9EG6B2cOgiw1FkWhFn`FD68| zc8Wuf(Fmy-o9pBySRGPxN{IXG-~1lv8Q@8=QxD_;m*u!xoI@Z=HOfFhgU-4pePn?E zhx8rS-yx`#+lxJ?Uso!{AE~2hRlL5D7ER~(YaF31oRfm(4ircp=}@MNKrq1BCu{Tp z)_THYqAg-}QS-inZK4W2d)mEM2oW^I*UBuS9B3RGTp7|q*faTUz%w90eK!9zD(02A zXAfVPFtH`GdV(54ivDJaFnaj)ZL7tU`u3OW15K~&DOwYP1OtQ!-(L3EE2`%fS~M5^ z<0?kv^lukYZtkGU>MGVK4FTQm`8++)cd}^TIX%VORG~a3sabD$CAhgba;xN{RShkr zoFBXd-CkGK*a|ORl64jg1h`T!+T%DK>36-}5bkB&feZW?e9Qa0_bl>dp<_oyzq?~`KQc|7iziFBPauV{#;i6?JS=iRHjzO`#3Q!Nh&yk5CM znwvO1?MkXTAD=3(s0gz#>zBU^MFHhlpoQh_D(Dbx&2up7B0)r?%Or*)?A%#gFdRP* z{q|7KF}*G|RaLw8IM21xvTn>q|&TnE>Mq;tJ=P<50f0}SDLEb1-*!^*KZK!EO<0&DMzpUw^$M_{T@@W@B+h)1PbtzwhE#{U)7?`W^M|giI{5l*y7jkB>Egz+ z$!UuZZrgl2&e37&hn;6c7wdnkduEU9&681sxZ1t;%BWnoNjrr}(v^J7PF<^=dM@0x_ZTmO$p!pvP=8p2 zKYujIFaoL!d{wm{=V#wvMxH2ZL};Cd#uZi9->uW;-VOkrU5wu)C#%yS4Wv9S2|yHI zt1h_VH$mtaxpm)QcW&7Z-?pce4AjCNtcZ7QEu*o_-SG zZ;M)b3rvLrKukB*hzv314TkI3M1=>;H;)z?@@9Yi&GP^exa$c(Zl?r*?v;O6kmbO< z6=3hC*?)WkT>cwdA3BNw6{!^uXE2l$a$yJnx>RJ&#M_yA~g zo$GL3z5%_Xp9BQWF;cr^r>3@m9?xjid8jI;yBd2V;A~t^OzIx3IL!bpMqldS_8~X9ef4Fgt-n z;1~!a{k$~rD%@}T$gnwe2CLH->-+rloJ?R`8tJZ7Vl>3+fYf3~3l!}igKlf}s65Q6|Nyi|& zQw@|l$8)@Fgae?y=nQZ&)WR(gt=D?=P2wav_la#8y4U6W{{OQuIeTk8~>pg z4{9vuAkI=4(B(azG2j_(eHWI(PsB0CnM1?P3UP$p03cc+bWETmS!e2lKC@SdACYst%5E3~w~x z4K;kxV?Q{x5dD%Q2se!Oi#qL*BfAG)wKM$Gn4(RnFj_A*E60Vq-@ryWQ?+fU{CLiM z!}dbffig?EkZ5|^st||XPmaHmq1{aU4j%ItNpzmCb0>YZ$7|^X%#$*WCNqVeptGTe 
zS|hhP4?H_2<2(3yLPZhjU0UF#C`sJr^zrBG3NB9 z6^W>r8ClKZKuCA-Yl!+Sf9i-?8PfFd@d+q!_<{$_Bz2#p2VxPT6vqP`dZWc`6X522 zlRrXtr&9|#9v4W5{OBq9#w8jKtrA$zR|!SNah_IlBEj*c}aN?=%oMfaB@)W+#jo{@Hs>9tK zzT#SMhbxu~Q1`t&5LdNlb2OL1`UQ%DC;LeN)_k?Mr_)1rLJIOjX^4_Rd);ZcaJc5k z<3=n%AFl_TBOnl;)=$7$DBO|i5Kj&y`O$^?j;gmLwdw+~NQNG;%c7T>I z>j`+BfG^#{Q|&m_kn`xynIsu#0+0R{kfxp5ICl+MY+ry*W4=?%*F`*~V@oOgfR1ou z1}!_>rOf7x{}j&#H+fBEHbEXo(wk@8?%UI5aWFVfU?6QB?PpoM9#B6GoXp{?LK_70 z??fIFMpN6~rI)_^6lNzyN$DtbYAR1~lx}ioy@+o-Z&LP?v4*u0pPT6NI9-Y&vb?!1 zVR_E>ivBL1YQ}6U zLm$gy&$fu>2KYDIbS0jfjm&Hv?8zb}3xdy0;}=%L24k=5seJDSmuA~8_rcepiA3QD z#dNo;M?I#N(rBADEo+m#)qA^T>B~4PZ8S~Y&53jPu|ouPld_Uwg@MlPU>^FYKm~I- zo6LMhVmxM+UXs4`Mgnc()#=~oRrv~(F{`Eh-smb6pZgPQ22ND<9_fx@^;IW<>hzFP zl$?IfL#L27$N;aB!6!={7Vy|4AaA12!>_c9&|;bDamRY<*!O2H%!7QZ`HQt;3@A&E zjcBPCIq~Mf92!#Qg}g109laD2j6uOIp*@z@cbI;Qe|OJSwC-Zwr*8rgRoylZ96R z!K

lW&j{Yp!m%l~<$xomC%>pK#)#ut&8L$@GN^Wf1rR9KX68;~o-apkC=xI%zk3 zcAUz9n4nvMH^97NWIzC!6AR>AKyZ}fcbo?8lNy)e;a$j?B~X+oJ7hNnb$2A-kT!$U zoNKQ)k>~G}U4H_&nW!MrK0P?V_*q?gx$@xi%-q@>s2{9F+Q6;1mPavO$LpWCNEr! zZHAtf8aLVCv%JIZ(gQr>`v)BBno5h0)e(*7^EYPR=5Fc_EN!aLcX8ujg7#69dY+** zw<&U>S(P=zb2rq)?S>GP<~GQXFG9!JbVi5*Ul=zXC(^>dW?j|uP+cN8UUT3ZXBziT(-J~HM z-@}_r*e4iOX6_C#E~Hc8;7+n^+{O?qZI2pUQ^w63LY7_|yIl(A#-ac;kD~Oz{{Fdid zW}*=B^CqtE{R&(;zOU;c`EHBonGWwcgnjI44R{W)wvd;ReHVsfqD)|vESi3kAk$DCIAO&ZQ#%Zj=iF_|rLQW-5J^9@VyJsp}Fh*x#1 zox^IJFGHQ9m#*kdEknBgth@`Rq~$JM!!AZ8apQ(iDz)R0k(^_x=CPZvDqtf3*@$Hu z$&~`;g$RSQ4PvYt{Z|3In0`{f$Ky*Yudy=XM*^xS@VeZ!VhSA*srwsCgS@MD)F4~X zcW_8BI=z?w_7Bc`C*pp8qB1jef95H2{_vIaP_ZcpxL2Pd@Fggj=C%8H3>>Dpl$PdW z58bg`P^@HCT9xSUa9PQ;_+g=kBi7$H%U%-Gnti&ju=-HDIxWDo^T&gN;fQWQM%$D1)mB}7%wq-s8w?y>)qt8RIqN%0H4zg&(6I50qdmi8dZnd zOF3OBr65)B_#@8?{4tQ%{U@-UUmEXyPtNoylg%c(*iW5`$zQKI^QD{Z>^AZEEDLOJ zol62n=bnQ>VQ0~9$@r}vrX|->EP<0Z*{)$lC@KYT+@B0Mn9t$4yc|${X`fr)cVnhH zwM2*3q@LMO=)-ORbE0&0Dl&uA=iA_Xoitnp)=e$=Fc446zmYB#_33YpB)I2unnci* zF+-C1MBZ$yv(!Ap8(Rj#1k~+@{)aP+uC_VM@g!x`(bju)588q-r-zf1yzc&}cT?

y-tW7QOwn(g$g(s(HcGc}!77>BDQKG?i^mqayjWz**tB9V4MFrY zm&0_FvhSb1)SnH;Y6PHkD1Z%|*6a1WKyW1GB%rLm@a8%)02w9|5J-u^$9|6=bZ7zK zVtT5WouCJMOc=mD4-XCjxaSLR_8>4TvKw?r?dJi!wE$|*a;&G8Dk%V4 zec8?QDFof)3El8(Kz z3HYoanW^u!t@SkSBL+W&O0WBK8j^@I{5QdGBfSW|mo=DM2g!EcV5Hvq%hL`IC_2)h z$k7(E-}&86mr3hFu6t8Uun5cOkT@$un2Ps5L;2KzR@l(5VnguLxk}Ce1i-^FE0B zVGx?(OiyNKjGywTN!)#v7E6Frv#w%@ZFVt!N+Ylf(ArK&MB2aKC@JUNlZ|KCvke2%i918c z3j&$fQMANg`~bOHT6;z=^6N)sghrhSh5f z-GBsGW8Q9y5_hQa?K6rQ1PU<^pyREoaB`TlCvm%$cU7#FM_ZXo5ra7GT4P66^j8{{ zt}*QJ1|RhC+tUJBFM?$Z1d8`yAX-eP2`b&5${S8@JbeE@l4>s3r_%}DTVKogyk1E= z5-c^CT{PtAkW@M~2|7g&`OtSMhGu>+Sl?YQNZY$A5_p80<@R5FTCeRtIjWzC>0Q>A%dS63oDtX7~t9@oP$RpH~Ll}^a>r+CjvWw3aHQNP3c&@UdWYiFHHqK?RMW1 z`Lh<#GcESqlsNtO;mDtPBh?uK!lreur@Lh~%i$BEif2Rkpj3kX_~<(LQF*s}Ya~ho z@-heC=4BdMtRYpGS^^&_gfKKG=Y+Lbd0KHK65(5(b$iF#tE`R}YV#g_6@4#$7-GR{ z{)~=pm2Z3NlIUwtLIk-ZmZQ3J(N3zI+wx#7fU?F%!s6evua*B@Yd>~f{;X>)V4KO7 zfCoO=|5jw35(O9)y|K=>MA zN`C*FRcJw(tkuC)tArR|fCU#}8veC<=f6F|Hxz)&|A;tV7%1v<9~uSBIwL(JJ3Tjs z4fGX^uVIoSpZppfJ;t*L*UgMXkmzjX5AA_yThWZ$N#veyDC;W0ll?WCj)Z{zetLUw zavmDohSCHeqg8cm{5B&!J>3(im&f<43pfpv(?d8OJ;Adu(9K*7SlvI~*+?{l0(V|Q zSqB`;|Lo4fXeda%yJ+>URw%^n>#RF^8P$x^FSFJaZ7{Hc`1d{@UkTLd(x46s;ZXiy z!l;rBG48LUpA<$%1CG4`VT=B?O6TtbBZoV6orhnR#Y`PI(tgi%h?fBsY8hG2G}nt! 
z@ecdkejfvow|_X`I8n$QYO6KKRx_k3_Q3n4{nw}fBijF33fV( zp*s9BRGDaypl$|uIOZ1J%x|ChbwWe9UHbwJQFeSIP&4@gh0`PMHW$4WyU0&J$^(!d~40z`UfwP(c zVW7@dnEV-V6o*797Z%s_YC&Wr7RhA^_%-j6Ss#4ah#iNF6v5OV4&%+{5M=cZQ5SRp zOJqcCZY7?Kv9z7H>;M!aPAv*GF{gJj811)6dCLAS zQ9ijnccHo?IQIv(Hjw!GT;YHm-a$g)s1Tu0IyvCrML8kDfgY2(UMY%1QA?Sd#r2gT zOJ(Egd?CKLepCb<*1_F3@yl~(5;om?HGsGij?RIo=Bu@F_|+j2+*)GqBE>IX)o3}2r z-Dj8x0Hb&l8#Yr-C+~&5mIGE`Z#-^(^2pduTI-KCug&bUEVWS~(bd=P6v!aNql&R} z(PoW~EK!Q1VvglHzTtdb#n8m%x{C1@t}VbS>j=5o2nJB0s)Bs(%Z<#SuY?GwgEJx0 zJ^OHA^`~#wYz$GdEPm*rH7Ud5XpGyMK0ESg+`R3&@S{xsqGndDZ#8D(T13(C;91(7 zHo-uFi>rlMiF`G9_3WAII^x4-!+;BC{b3OV*;M%#AL48r4+=TR{` zv+Ne?W|pt2mJKu&&yUmkV_SdH57OFzWO*K5Om!@;Vz8jCnsNU|!eDK+2T|U}^0@Ey z4f%<6k5{YZFKEU%>2Nv*`rHTdHWj@&)9FE*>Bg{e)6VgqjXB(fE7U9T@5zC29lPmA zH=Vg?6>+omizqpFngB)LWO}qt3}Y43=< z^BeYcueEEYCLIApTEp!83n232mHMvB{w;YH+&l~RoLB?QP)v#z&#+t zOwX|bURZkCXYtt(aNs*orSqYI?wN&c_|p6mU#qQz#*fETEv zd7Qjl`CSY1Juubmv_+EYWkQ5=M@K9G+QjnsEH+BW-#7UaIC1tDK8q9g z7w}2tV=oZw1M{= z+-VIqd1}Ik7U4h7sgVJoXIC?cU!XIv*5dAQ z$v6_RK4fpH`@kH=9LZTB<#=_8-{Jefjec zs|}+;JF;wlehb}<%T~iHQ)L>Eg0d zU$hPPWp>Af)#IJyk@8^G*BOSbbMzux77QGZ&?2o#fZ6qyjl>S&Sq#hMktGbwzmTYlrwZk8i3wRt`=m5J~oS z>vQyW)$PS=^oHm&Mi$YUuPFq)c~-TKHu?nD+OR3M&BG!uSI>}XZ^L1Y_B9C+ZXT%i z1I%@;#&3=eppA&2LQ(_-Ihe}GCc*c~Ec;C!gQoFXz8>TC_fOBn;7eHP{4E7fv#m5F zLHO7mSnF#!^5ZigJ&qItexxY}tlFt|ig`y;4YN|8-SAf^V)o9Q7xB`ZCvn76^G!;^ z$u_<5SB}cgZMj)mTYW;@2cM>}PLB74x?FW2-?Pom$2zfIhZQZ6b|p57tPC;c=5-_e zF!)|S0QyEu-XO1g#X9V*-4Vp@Wg6ba(Aa=Ge>IJV`sWTwoW3Y`+zL-s!}}+P?}~4> zBU#ncMW`ja$9+a#XJbTgt)1pSupPugXKEGY6VK}iGBFMIyq5xWZ zoL5~+EJ)x+D-D+JrW03_x!cqMh1P!}{|^Hnd=eKj&qL$YRomN!XsCu5*^tvY1DH$- znz-;?+919R83$HrG{{qY=&Zpq03B=6r}M|dPO7?tPCDMTF1*pACW*bRg{*iU)CrB- z%o*%^XtY||^~EOlo2HBFL?k=Q)^Bx}9LV1bK-$LAyN936qm6f0I1d$NIJ4_Aop7a& zm-5#S7WKaH_ISd!*}oek(aX#`0lVq}L@U4gjW9kyZ}OQ3)j=Lgdd*S{Z-b8n5YMBM z8;|Zu;q}V}UX|MeGc5v6R?mqnppv=+C1mBn4Jv+j(>P7cewPL+DsYGy-5Y{b127xeuG{b?sk zOjwPtb(>yf-ay4@N@Tfbm^f^98oR$7pUHci2YDI`C_>sKon~h-61NU1!$C4Ntf$)- 
zHd#JMzVshmMM~pyQQb`4xTi}&3cx66J6P2Y+YW)zzs)n}VlYf+l}Ley_nZxk&R((x z6BR@L$pdD&Uz_ONqWwFV*Lwgkm3UOjhD>jEnAoemi-BH^XSB{llzYh|c;>ki@# zYe_-m@nV1GZEIG77%4juQw89elt$+`TNv=WKa8sUW4IF14Kv8g@d1ao8~9hntjFg_ zTba|2rM4`vG_12c-V%Y+(5ja^J5Z-p7{TKVZDmo5qdm&ZELzQivC{XV>vzvZC&yA-$EFnoeoPX{;s63P7X78wL~pTnm6*sFd1)eQu6c+0uBpl+Yt*#dFpyBuIr8aI{%pl-K;uP)(f-QsE z=>rQ}Ub1ii*UATAP6LO)Ua1BA{o<cf0mo*(*tU@bVhycO37sl^jyD(e(|>*;-~4&N;qWh90Ke^hyY+pWaGqmR z6=1xY*04iS&1)_uOPxWP#4T%h{|dkR)zc9nNVWjuX^HH@MTVWF-un%qj5HxeUfHi) zxj3iQh2JnWHa9|>b7#TwjXSnb`?D$=*2^tyGECgK%6nof0G-w>B>D9v$2voOP+s?> zrMH4UYI4&`?>hk_S0pPRv5~*M@0}VsxTw;|Wz9l*Z`S71^r=^B2z$M0lD5C1_k=cb z6+_SO{5huvI*LgCU6S-~F`F7#G&;W-Ms&g0TARpJo#*hw?mp|tRL6aZBRz1;!y*0} zkD&TBZI+JZChVD zfd=5;HyR$9vW88a2*|HHQg2^-%>+=vDo$FlK>hT%4(D0Kf7bqp@-7_iP8L^fJv_4r zuuFRFHC`j9*b8;0?q-qs93l{Z%FPhg*{0~771+JAJBpM>`Z_Pb-)jNkK<%%r-NFks>)U<2O#{S1zmMPIdUJuq=8%@Bm2Z+SX<`69*XaZ`s@*E5 zHcrUCg?1IswipT3M~de`+GB92h)G147IksfsVkDTCJWq@kEtGG9`p84J6?tJ93*EY?$ z1mj?i*BZecjmbIQSSV3LYVrp%a}oHWo1AU(O!JO-OK6QM2%(zafQ*kcMMMe3+j>2VlDrDj`s_pS^kBv{F(ppuQ~j$Is9*J z!SB(k|M%uFKq~*mJN{{E;{Wf9#ovLx+UX~KAi!gP3qgAM@uHnXGnV{5#`_lR1<7k_ z3j(hG_z9n|BB&LZHXKtEebRv@F}JpN!^!SFMg8d+{h2*H$XaO3L=atl1Nap|rm_(0 z`T&n)vvoFMA7>K4Ql)$(#{T#Cqrby~0(h$vC-B!0OpQ)c3>h06Gu=QddUe{w*=hLK z_@Q6mUotCyiyM-skBtTa>WlYj9&fd5i2l{@9)Ljc5iIZo1hGLH2?2urxL!B&0GW?q zz0wg_-7UD+FCo7}$T1vXA@cA@I_u-1$mn13U_O8cb1JTb2;*Yo_V1^27X0Z#bcj~+ zAHNsmeha~x15d<3qJe*pH~YJqJb)h|=9LY;f zXFu%UD+7F+$ocPI24O-EuNySGtt&!Dy0hZB;|zHG=BD_QE>FB)|Z+@GF`R`vxnT-z+79Mqnia_@*`}mO1 z-)NSwK!~8eLKX)&m1~)`K)x##9;x#fAYjJEZGp4qXM%k5_>k`~G>XDp^v?6IMzxU) z7SJ*vyazBMa53)z)2FjVZ@&_i0Tl6Q-@c+FQ%2+ z&O6$MzeJ3kht{@bXDtrkFQ!On1UZ#v1Ab-eXiRHBueJmJiIth*4dWv{&7q|H?d;Y( zP9cswc>@ycdwsb&8R;#N^KU6r%fmU_>eQWjMu5l5vUW!e59XTRVSf$`D+7W_(Z^1R z;Ov4BW!~IDWdGGRl!aXnTfV6r(X8BL_vEq1CGlb?22b$&%fLNAu!tvykt89mb%5Cz zab#yU5MCT}llwZC#W1I7FqKFz?6rr=bCR|7ap`I z2DPq@)z8(CHoa7}f{Z;Bugdk0S9S8%N^_0H+H*zvUAVZeB-We8g#%V(*VJYfr^L6> zLOFI;plQMN^Ko3=|0Cen`_QsvyJZyZb;Sm)PaC9jd+n4-GlM;HG?Zy=iGHp(7_Q}w 
zKH94z+8b3(Nm;ztp7_+3tcKdbhhjc7%)%%6xSGkwtmS#9!vuZn$63+UU;Sn0e;MG{ z`_Qrjx%1IG&r}*MYi*9tvezo|+!IUG8ad*NvF=FyVKC>2g2k9_be3YVvniy`arc}x zXt)!edN--*#>UNs?g~ycIqx#Uh#!K?^8P%?EIUbxcI`=OPXi>0nhbAE3=c#`;I%#V zG#GmR#Tiu|AQ~Q!dhT-pjfz|!uF&XRmTQ$2weYnanbXV~a_@nxd%~uwR zD_XkD2^iQ*J4UPZsPU$)4L0wx4s4Y;uRRPa+b)9-7q;IFsQ(p)gFx!+Z6=F(tkTKx zJPymnsjib#;4{$qP)u1d4{y!7?rA(ur3uU&WE9hKv^>n~hRBXB{o|iErD(oGl-ogR z!BFb#3Jw4>Xoh1c20W+XxO3 zuncRy2ekO-O@J8MPAuI)1Pa&Lt7%)UTKo;-DN!SbicFHQ@rjP{9YxX`mz`vaw# zz4jl2JoM|Vgc7(B5Z(M@GXB@3^LysQZ;$xDp~Qf^0%J}}SBp|?6hY^fFCe-|=6m)= zmNlQX&BJubq(0D>g}MfXMT27B0ATF#ZqqGEl690lU@=tp9PF6^$L* zMHqXH!N4@YG%Fb0=vFD=kA+%gF<(S6Y9=#gnq78zBO=AXwhYAtZ1JP$D4s{twFr2B z*A|zu%VYiw6pMQ1Mst?iBi+o}@_eBA|D5C=00|+!ytQtiw2qExyU>KHw9QTD+c(_O zbK>^u7nwzbwGeJ8`%vXHbMAzjZ`9ylU@)#-3muGn=2kxd)OAKheiWiRFi*K}nYh0Q z6|(hPS;a+^efLl95=M*c69#1P;=9*@y)x*)!^OIpLW=fRP_KTza(9QQ!i?(WveT!7 zy$*c%zmchxNJEUhCQ+-%#~;X+i7nQ?#<=T%56MmlTd}oX-E+v&o)^V|F|-%Z_3L;8 z({}Cet5BwqSxGWr4GiG_3?c?l%?z8>fbs_Rq4mCLotBKVO1{K=`P~Z%;ZoE_>+I1h z=QX!wiY$N;|9tF{jMg8$vyRn5%SDj#fpv&mBF ze;P_2ob8VU=s>VeXqqnhyxZiEA!&jP$lJXf3q{wq_^u z5;$45Q8+E&dUaT;<8?j?KGz4@rvX!Q4x0VbD@HSNxhcNAtt~k1A|Hle7sEM;(USJr z38}8WI%LltS(=eqy-bdlF3|FQsZ1tB`;Y(J>24P{GBeQB$WgV10E2z?-wmVRP2nGo zvG)$*JBO4OCOBi2zSRsyzhKNs9V3oL(9=8j8!5qhx-X|M9E>!@PhL+~hs@%ch)c6n zj+L*d$T!&LE|xJQ5+fgb($ODWuaQ;D1OFaDSr0u)F7_pF1zX$SE!E;Cqx(XU}cs5GMI%iEZu6-=7x_khpR>6LOmNs#+QoX=i8(3Ta1nM`xX zbcACzu2A9v*Rw}aem+zZRQ~OB^*3+#fvxxP!LUyt0r)p7`avK3VgUThxc0H^Z2!-L z1j{UgS`lh6Iv``}HHvKMETT)W>NGMfaHmY$lNOWPOAzYhHRX4(CBAGTV2nD{-TRMf z4Bid$xhGB%O!g*`&dy)0Uc2Gs z%8dc9l#Aa4@_Kay8>Vt*i1V6laQ-yc3aMPF+q0NH!)41bQ0;rphNv8P69IbJYcTSR zh(USJtsr!G7vzjv-c+-8{xD_%U^+~1f@dufO>~0Fd{9d8d!$TT_5g&FkAmh>xcgx1 zGi`2Lot?&>`JHEzJVrtTq|{H{1@fw$8m9Bxd3YSJx-)=_fge>SsW`>d#H-aco7qcA zZaHyy`i48nEC!7R%v97s*&P4=T>CmAsne(WO~Vv5-^o5`_JS(p-UGmmOazi=iGa4! 
zxsF_r0@I}FY7;YzjbAWqorZ#YlW?V0u4PMy+!bTw-c;!DpGD#l@TEW{Cc9v9ziVs6 zub8mkA-a%1uf%_!VUp$Cly0nllJrn_aHLw91d7k!l{|Z(aVl@z>x}7M`R57j7+RiL zhQceLZaQo;@o?EbDsurOYF+`WGTv-)`NT5S#2>nx47J7&a+IgksK;0H&+2y_9LfhG z>b-)qKE1UGlgn0R|IigWp)!=4;)JMyjoI9hkGRxF(tRI)OoLY*FIG<^=7q+$6xAv{p9Co^t_CVo6K;`Ns!3$+N zpRYJH@m1v}zcswo!%$grP3PcC$1Ncp@Vl5|yImuy<1we3@pT;8&V$v`y(&4SLC zyWacwH0UR=H~q`V^k?F_pOBFMc=o0*$;c!imlQ=1)3Zmn@JnLnU(keqdtTmIH1^%` zZ8sf`hAq(Va?IP@66d;a^Tyb#hL7)=R9E$c*e#5_wdw5bkl9yn96y$|ZTt5tbeAkT zv$kwc{iQjV?eDqO9bwE2=F!*zZ2o_Ild8FVx7x`eW+}V9MVqv~q=-G+tR39F_;vSy zR5j;bFbF3bg8+>9QP+ha1fBjIM(yL5(ulD2gPnn_ePR#%m$0 zu(&rtV*lmFpY@H>z2NmM#ukIt1F%3N~c;kf%X0AASYDzx9R? z$OO=kN({Z&^;sOv@{7xD5mzPM&RgeOv}lGbg8!QLPDu85pZ^8WmG0vrCrkr@c7Bb{ z9~*~qU~AkThwRA@W`iS_16$86kQ_t=+z@p>2}w!Yq?x*kyAG(_*$Mi^O|tuWKcQmc zGW&%BO_%QOpCd$QJZda2N=e^xKnE3)ESI4UbTDO?Il7lYAuB+zTU3u1`Qbj^V#)z> zce(*8#Mt$A0H{3co47*|4LAZIVCk5XE;bi!ydr-b5HH08JE3Fu{utPLarvfcAHX=n zL^fcX5MhTzhA}+Nh|2hil_B0&B0cA#D^(L$3vGxvFe0L>*jDa#2KDO4eVr>qWNBXE zOAu9QHoSXXhN5>cD#p2M>j6YP8tIDPc5M?K{Z^JKN{Pg1Lcy(ca<^2VX^#Gj2q0bZ zJP+onN~J~!Qu<=T*%J2?x!n%rKCVsIfakaghR)!=$yTDvb zc`4@rY)G5Uhphrl+t<{r0eH&*9Rz#pi>WV*gb4Tc#_a`oOEWnDWU;L=@6I}dirJ-5 zswWy6fBTv!rT%!Cr!d+zDh{?Uc&5F3j1EvKgesrc>4rBC;jL9c&=e^S(c({ZMFSzx zyq!G;SwIuEE_P!Byn`IEYA>=gMMH?N-#I#JE_&&O{m@|7TGz?LNlUH`L7lZD4~C0r zN~~4Snn2UDs~QpjPLqX$uG&~K_{(Xcl#K*qgH<8M-)`(!8FCDraeaqq#t)61aF5~a z<;a8xSsx2*NmF^alm&l{goH#DS#=nwsdJzh$aIO`&ALk{@x6luS?=W55(_0&pS-F~ zDC!mKO4R!a^$NYua#e_I{XG#;iskN+9TNE4DllX zxO!-OCqP7q;24n?0jUTiE>N%2k01kK0b1z%=J|tY!<83gpZ)aP8;Vg_mJOTPbI}fY z?|S>M#t1Z!otFcb~O?MpaVPDx0|p={d>{SHLUjLLYq@z!f4-YhdAH&M)IGJ(allQ zg9iELe|zfbcjL&0Bzxz-9m$);0!<~dCf4gJx)3KP6Q?Sz>l2A8$7Zqlm#oGj$zOI_I z7urBJJ|2YusR8>!+Rdes&d)|L8=@UZ zUlZ%T9&lKVPqa5wYF(lca4^=#x9eFhdi3Vk0Msj3?3*}8*EwQm479PLqi^kUAPLbL zricOKLuwn4(dYJlKVzurq0XI~y?aEfS%L8JxrNq(2=n4LXot4;o4Iq*mv;~?5T(o4 zih-tUE;C{PXs4te3hi;UNcMj{J)>zBK}sJdKRBkBL|AsnITxp163QEV6@}qPx_dOg zp7pHPvOsFIOuM-+WHy&og1urNoUbCu##T^g8uH7mtqLC1E^n?32pHI{z};E2T~vX( 
zbOS7Yh?((;UYjDTgl|1+Y@_z~`|QvQ$GD)hGLuLY+tO9s-Fn z|A=U5oFKi6QSZhL7Fp`COH`pV86a4n;nlaxh4AVwMtk%~DFm9^@SPXNpJ2=hs<@!J zCS)nIAArV)`SWQRFZxO`N}}(vgoMea3Hl;moPfw0q%>qdEK{D#kt*sC3xZjy#A0A|05=I|t=ST9o2_#!{d1=~V z<#xevYx|~ec@pJVK6_(){_~Rb@o>h#pgn!^XaAR1hKxsfU~v)2JIC4JynX%YO|d4h zwPX|hZZ+0I`ii3wmenF(o41-u5|%cUFz|Cu_V3~OoT2*0KOxhQ z{V39ahIb2W4d`iCWHJL~gqf+2H!U5>Ei}`8B`H(~Bsp&O4J3EbcQi zA;O+jG$n+)@szl(V&r2Fy-z(ozaK^P>+w~6IZ)RYH(peg#9Z3TCTgT+g=@2`7$ju| z_K?jfzscd+6rKMxr<7)&bf#-W)OlCel8?B0`*{_qb33f$X}b+x7c;54-k?bP?UOX^ zxw?u=4>?~$iLc(P0=DzZ3rb=X^qREULWF~s1fYt$xTXng?F$OT{i30lZ%c?$4!D-Z z2%|Sd9t2ByU3Tk6u*)MI!GbPZY9SX=w&%ctxiD1G)k1`C_NLGH!^KJID33=G)`XVM z_M=JC%?$?2TbpVni|%)NIoVG}X*=zi<2x08!{g;-u5%NuXr|e1S86bohGkBfPQD1K z?loO|fmcx`PfZPGg_u;4a8|iZ2Blu%CAl{UQxn+(26w@f%Ajt&3cNgXjA(3t9cZWC zRthwYyofar9o;I6Swx9E>gN`CwQj9LC@N-6SG*X-&tR{G5Mj-7I`GZA?W2MPSUK@Z zXhV`wJ@nh&a+q@_uSXK+qW5kr-KzVf_e$PQQN76F8sXnxIzJ9Y^xu7JToXFQojYWh zKeoxbC^y;+?=r5+tdPmI-YsRftaGBcX5tZNT}1s*T?Ou`_tP-(Z#U7{Ow_ zjA_2-Xh}8W)m~WFWaB&B}o+$5Rw|C_J9E%6Bb7U@vqc|kqv>S)w^mi0FP*JnEXaJ^VqFds&mmb zeR+NtQO!9&x=9KVUY#=627S@CI|T~@P2JR$Wd7tfw}5%9*V@e9KEPdf&`pVF{-~mq zdqk3dqXV~k@PYiHwnHw~;xwA?2|xPmcpcY!`8vh=$J$#DwrHPtV8r6|cj&l!aF@;; zEH;$JkmI;9B~ulE1nm0%+Pl)Irp`2spiY5;2%<6=q=(akf=aDJP*w%2vY3bqLNn}w zB1=@5AcZ8dD6O(Mt!za=?64^z1|cXR#9{!;5*8&S1_H`%HVA7%66U+K9d(M#oTLA0 z{@;7Qd-E;t^S;mfJes_Y${cL}1H{ZP$Eq_1P(y<%5s8MFOFow)owanve?+`@8lDIS zPAQ}XcK1jcO_Jf$foo!duVaNtUTKn)oNA#FOG!z~$Z`L^%AJ_y&N$w)%Msn+Nu%rs4}l&W)CofhHY69dee?qDhCv zv0{&Oae?dDBp`N$vdils@h{Nv9M0Ft0{DpFPP+Z29T2pJgcHiFcl1JRDTtt7jS+lp zZs)!R3c4X7Ft3t7!ep?ILvmqE+JDeIM6{a*LE`}HL>Hy!$p`L+!Lg|~um+f@w} z>9y+#L(fef`N&DV%|d>bZkB{{*R6M&;Mc6}<)eN=l;fzJlU!uPNUgCV#G5+ZHR>xN zx4BNwu%A*;qOe4Bc>)WQEgnRX1LeYYH>r6NjS~E=1}vgqk$4_}@}Vts6_8ic123@a zG^66^I#({l$}7Nc9G*xA!~6%73b?cMzqyDs+jJ*IK4{A@uHLwNDv8;YwPKOY>WE&oIRY#b_U=K^?=9Bjk}E8H3q2REMy=`RbKMIhI)g5Nfk8Lz&IC` z<`$9c5{aRrK9(wxwWf&b!L^^AT@Fyj(2ImW1cOzrEISeJ!AD5-QBbfq%QV9 zQaNY2&&2`gJ_ohgF5Cw_Q4;ktZGgu4j_?~&RAZ0efgL7g^{?Y7w3(NIjAT&vA1*@B 
zsCY5Y0pHCSmJd$#XW_rD{AD#Z$ZH7fYxd_4^?+zMBkP?{dc8crTwFHa_*9e;-dB#- zqdchTt@O{4OGrp}Y(Mgb&E!j~nG#d4Eg~4VjKNh*doT-gy|YG7l}XIOWmh*dsJL1i>bYZ*QU>f5zlA0W zu}KX%-Ru@t!M+C17%A9@d0ITnwy6D^VSO}`=A`9}#@uQHO^9_LVgyoixV(YB`7pj! zm?{2|fIEG%i(Np|gUcSn_~}HAil0?1Pr$6Spb*3UM0DNgAB zgnoNf{4}qN`zjk}$@ihIrE8{FEDS&^KmTx?`=NR}qi-by`|OpZ>sRpyv?fuP;i`V0 z=`JK04p*s#YSZma6ngtw(7Mk7SL?WJS6pVFNrj`9ddb{7@&;>{xUo}m%O{}XbL3;i z=WSs7vEuU{b;gespC!+o|0=tDppxyfHLMo%#kTUY2QRORe~?R~sDy*v7@;xaHrtel zu1Lg9BC=kN73Gn4F?Kry>RDgBN0o`2;lM!<(p?&yjVz|d1Eas^S9MMKz_jo_FrkN? zg)9UC0gXzIkzC;Q(P)6}6(F9sddB_vUyeWUfx~1a@RMXZ_4SC;5&cDW@{;W+atqX$ z3WR+TkzaB!eb`xcXMwXI(?5mOX{UM7+4u_(aiPjjeFwu%b+}0zSD?pM8eHs;70;Ss zrBfv_=q#blrso{Mr6Sc|+kg^rsf4LiQ*NiK^<01aOsBJ%2hoelZ!#sifZIb_uk34P zwP$RiXDWb7mEreCeO^5)S=7)mMKL}hKHyLhS709?56h7QS(u7&E8j|&o<-)v(!6DzMZ@!b8u26}T%}sPU z7zAisSrzz`IN^ z_bSCYAPqAN$>>;98&$4Md0t`Yg~d`A1`}GTMAdGi6(v~{-0qdMz+2jyFEW${HVup< zIJrMIThXwmp-naI-W6}B;KzHtIWztg->ZLN#b;5etJ2gQ9w>pI<5!6N JW%fRC{|57&S8D(O literal 0 HcmV?d00001 diff --git a/tutorials/kdd22/imgs/pecos_pipeline.png b/tutorials/kdd22/imgs/pecos_pipeline.png new file mode 100644 index 0000000000000000000000000000000000000000..634edf2b796516e30303759e7e2fd8a0a471d98d GIT binary patch literal 18041 zcmd43byQrv_V|sv4DLF(dvSLu?#11m;_mKFad#>1ZpEQUao6I-`y1|k?$_3O|9>ZI zI5}Zw@7PIBl20a5>9Z6v0zLv57#OmQw73cw7$gnIuL%bM`fG^rzzn*8yQoNsg4IqF zo`7x=%(P_86%@eeKz=wd7;qRc$PW=PkUTj0KYnnKA0G_*U*TY28sG%~<<|$N`EU7f zD=?^kWXwU=kB>Cy3yS^E6*3?EzcuDV{6`v+CLi)YelX3CXi9B_e?S*_M`>*rFfc@n zk1se_Rt_#07zCn~nwG1Uf;_K@gB_ERse`c@lZTz-M=UUY4_=Vh&dk+_)Wgo!-i6mg zfc#$)ydeKaFf%#nzeHSZ1jw}%lt{%KoXtqNm{^!t$OREdNlE#gP0e{##3lcU4!RQ{ zw{&%NW*S9T@`XA5Rl9v&WM7B*%!Hb#&Hql>4#tC0tzy$kT)O8%oq+|0$q z*~-z?%E6xWL$8ssgPW@WIr)d7|N8tpovv2q|82?M<)3VUGRXY#A7)l27Uuub4T{SD z5z4D#=Hg)M_Mu+g-pW;wjsIW5|2O!*t^GTeqO+A5D7_!i1X=%y@;}4=ue_Lpoujjv zi_3>R=RflQGwlD$D_J_YI)KXB*~&!5-qp+*6!{-(|4HEgx5PiX_?bV7>wgsAzw72- zp`gkUMBr!suZk8#SVTAlwQylD8F3Lc5AchPuiI}+uiTDkjWUw%vhZYlreRSNG{rHd 
zA>yWL=IC@q;$$2W(MmezB}u8NOTrot)9(UzuRL3O-#6=CI{x@>{pk@pdcXF|`{|kc z#6NTU>!?fP`A_d(-;x`80L}k$T0U^qU5|Px#P6sOnmsqE(Zw(R+cXt2!iFWO8qYs_ z-#H`G(sT*TI%C@lp}&tn8CK|%&cX$fdvFGsHL%I(NnGhLY}b}w?8ecbF5+Z-UA0F9 zvn==t{T(|<$(QFxmF`$NBCzs~YX+w&k*)4=yW7*&vaXONImgN{sU z;R~a85tvUFm;u@|o%B-61T>pn!bxT?1=#CyYtQ$y+Rn?S89|%6Zuc3dG-G;FD98D{ z_q#Hb*~!wc}^^rCaCv zhW*n_%akJ$+$@58$6oSh4 zz%JoZWHy1@GSohTt;dnPw{wf1dS5rb0U3r7pwOYFx%XmV`M1H#a^2S?3G7nXEZCKy z>MCj{@8DO}$~1X(5#TV3t%JP+f zJ#Rny#(0V{yvThG4(v|8och}>_$7dRq5&-&}mkr;|W=qgLA(x z$|6o;LG*U*wodb&^=)S6nnYe_t(M@~7=CRVBoZgm@C#O#p^k~;yKIwV?slza{jTZv z!$w~f&zk?T>vsAgwYC|z=o!?^>`N;aY`*N_Gy7dfFPO+wZv)CnD_zuE+ke=7`_pyw z_R!jkJjQX!H|8_Vb0i*E=W{^+r)^ZA)+P7s*X_4IY934KnxXst@3(m_eV46Ueyv>x zj1JDp*}q;k|Ma}y*TskW+!c}hKZ0YhLEoqjDr8ToPLtx@32o@M(-;^Z=izlo@8u&oedYf1C>mVf6Y5( zY+Cvb&y44Ejf3x5*D?%&ZGw=-d^Bt2Aga(WC%U~z-{@{+7#*|b11CJ;{Gf7>;jcp1 zk&0cnxxZhZX8irm8%DDX_j-~Q6K?+Ok^Fts(6y7u^1rV#Pc-1OByY!`dah%ceNW^#o$uzPe%BA849sl3KeRSw z=K4LK{PzF*l6MtCop5pK^|OEGx z9+6MJ;O3FwpK(Ru7=(&;dPPmi&Lhi}8J|P;p9cPKnfO;vN4)gxL9%)w2f?eJ4C5B&)!b5cJQtlI6iXrF>`x&l0| z+h=^^YUIWhxK>R^iN1DC>RsM^wB6ghzx=nUZH6l8uY2)Af?pC){T~M-jZuA%#SES% zG?QvAsi2stI3P%V$N~u*x>)apC+`ZMu(~$AhU<I1mMEH}O_qwlkCTH+Gzj5jQsaR>+eccZB zJ@0w34D8hi)jK2ScAD=u8LDRl7w1hOPB&b)G?pra>fC#J@9RNIGWyG9XWhujUE*!5 zgU{aO*WMRIgU5mA@>9J}pQ+cL?#hHN7CV&tF^>CP$+mvYDFMm;cJE@aQ7Tws#RG@m`8bO!<%bA~HenF>5GSd@ z`A=jN$ti6+PD*`DDPpAfchS-ev+~Ph9Cs6}#20H?H+7`!^X9!rd-!g~fLOr%(Nh~WX^Sa+Aj)eX`Avw@ms6<9W*4r7j zkRi8U*zJo^97BmiJ-11biu?H6=`P0fA&}u_$%fF62qRiAwz z$@m|U5C|HETkoxx%_8D^n&;=1b^Ej5pj&;y&@0&_VapXid)Fd@(9+udI=}lWDBcDQ z)6$Oz-MKz*lDYSn-N56h>bD!rUAWaaNtCq^f`l^K`9swWXpD7--tGh(1c6;+ipP#b zNBM*cx}r+g4K5;gh<$>jjOm4M4;-@W-oe+=m*jQXIB8)ap-^EVbH&I>S~92Pns{IX zUdILD=o^dZMq&r4S7|b-t&8U@nn7dOy0#75=ZCFeFKOd`JBS7z4^4Rr0I5Sdmq1jD zJYO!W6}HvX;5|h8AJc03Bx%o_>CbeyI$_l~Nql21s4geFZx69IL)e?lEv zRm^jwUWa=Qaas;8>=%m$3^0Xx$u&e|zRJcf=$YsV*Pn2go^biOX3Mf>qY#-;mONc4 zDH6h^UYbA5o%WSy5`7C)pZl6Ny2i@YwMC}lU1u`6TGfovi^$Yw_`Cbhh$>dC?YB6l 
z!^_mS%-{0tZVk*fTiKxT(1L#t@3-fS_Y_uSFiTWm|Q;id^G`kJ9z63)=0-G_7(cONsB9)D?`E0OLVGwRki!*LFsMeEY zojHk2qQS+4%z3Qp-4N$21aC9V7(AR9*j&+O`)hvWd32xYF}q%4DDtD>6S^%b6WA87A`!Xd#&`^k3eKiPr$+Izd!JnuFV>j9q&1+u4 z^dj?{x}$Us)k8*$iE4I%_e+-Sk4_}C`_UPZk4Gm+Lg|@kNyCMAKR&sCviNJ0()hRbj-$AZEV-wUC>+ZDfkLLRA zMN2*{XjSr^fL{SlOj>3Sh)QZVA_IyO!V?w0JB7Wm(ad0M%PljdfHYX}!4jFX%u~@m zGk4yzBCy>r8^5_=v~t5(4dA`x)&quX(XNd?*GjfvQN87p!R3~t@6yJb>W;rZJtB^h z`xk~@4l>yOb!;?Y;*ZBNYxM8+lzC>E!DxDZVkT^I+>rUN} z9_e3^uM9ZvLNNSBwd6~7BBnQk%37$iW#ZO`Iu3wDDaT|t;JMHDPOOwa9)p?;F+_Zw z@2Oe5c#`IfuQzA~7;#NO`g!X49>mis*xMu4QWH3-?h2qOCReJVSJ*tI42+xr=)Wg> zyNpE?Qqn0L#B*%OVmZLq#gxj_?ZPOiQ;Axu+OVe@qL_XyZmlQ$LxQUwPB_yt;o85u z(z|P$X8H~O7M&}mXK&((o8V6_#6x;7m&EYdDIEgYgQ+&LuBa~V$PG-%1yJz$I7NQu z;78MV%juE&im1meJdIsTf^6{Z&6u+o0mWVOhWj!%S{Mldzx#1U`Zgr93sr#Bn25>O zd-*EThX5M+$D-wB#F!LFIQ~WPi<~ir6^Q(6xh%o;^$cOH`Z_!YQfy-wsnoDLp>Me< zd{Xeg@blyzvfqJH98Zy?X51UknmK1l4Q@g=iSgU{Pw{S9`?9!Z?PxMrWkPT8E*Crt z5tp)|s@0Nrp}G4+GMY#*6~Ow_6l|d@j;{c%FK3ZCr&> z$fn;mBuRP7&dp(RP{_>&YascO@?02Y9&6(OuXN8OQ)jWHl2Dw(gJ0F)WZF%-2t*}I zIvZI$Nac{7OC|6GR%-+tnvLAqCflwB?>ib)Abg(kYab*k;dgs|V!%Kqd(t8n;}sV^ zkI8L@@Yty8JVs!j52F4UpkW+2glGK>6~~>q6f+Czx9(pZ7do;o3oDf$Hb3pT^bD0=5|>LO()}oZR-6)%y_ZHtu4UhGN>@lQfHVZ7lv|1 zmZ81vN){ua-o6FX`eb>&cNNR*-xov;VMVWw4r!#Quz59l!7BQ3ty^mU%4F<-=tS}T*KWrv2FYOT65nyPue+rBO%F@|J!z{xI^q+`6-fC)uV2jG%GDr!Z zi&UDd41Vd_EoRgf!2{Pd-uvb1AcuZ;kqajl67_%?11;%3wDtG>PwhMU38kZPeT)j; z#U07s^J=IoZz5&;HZ+{E?KMBNnvss7TZCiDNaQDT4M}c;_-PqK6B9Pk7BMcc zK=2cW0UvNO$@PXd{m6duT+KtXWhmyxlw66#zPUjX1yr(Fv;q}R?d?D@cZ+p#ttpBTi|LeS=R+>KSUMbL)0Wxc#U z-3|ADGK=l<=EV*g#5@8j-WeDNKBRdR)4dUpXgI4#4@V5Zh_#s_R!c2y6Ber{-PMbf z>V)WN5E|P|;N@&LQOzn5#8fJi=U|$`#Mjr5>jhO{VLM&6o{;%$#SPTobIntVY(ccz z(ZgVMw)N=S>BXlA(SnV1zbZx3ElIPU61Dpb)##z=^UP<^hw@nOu!q(p^==mhZ=T2P z9$_=A>VYLQ=}|Cq zWLx!%ja@#Ph&5@Q`x;ojmY%ahC3;zRlH*-E=B+wvR#NCg^WAFifpd!_HCyJ;=Hf(i zNK__dAbR1cLu}$uXiK_dODOoY8D@(MO zS>$f~iX(HHMQWO%OPFGwY|`r_F5E0kJX8uC)uZ1?oF2WtE?KtJ80xuO5en>;AJR12 
zUBeqOIL8E>2_3d10-tgj~t`NB;#Y#(R@aaD5A5q=-c7PPJzxgnb{1QQj2a@>Y( z3(|GJF>#H>k~0tNn-?W5V4B1}f=gy}5lGh{t?C&V)OCckbpPiv($SIx`R$9xn_$H&u8S^Zl&);J5jA!DGf5QD~ z0UW%vv178Z$Y+$^A~kHMD>44WB1YNDR0XNlB#ZstdgS+YmP!BQ2<{eus7>@I&RbMO z{8Nz^3{j+c`)xfG)frrHE^QY>ur@IOfNn}|vTLVyge=1nIMOxcJOm+0%1_Lacrl`E zV<2zZC;E1CYy8=7`;0or5hAg{4=9$UBNGD^7+IZyDq1Wyd`| z5h0UzrQ{qfr50NrZ01b>84(7awNPRsS_kMFrffrtD(CWNH~?D(YhZyT=ZH5iVyZY%hP92OA_*53!iB*De%^Jp zpF26eV-fA^cg1}+`2pGYO{EglO-Ck_MvhE#%nOYXZb%T0GS{5R>`s;|l+0xbH z_28Y$Io0ej*70k=&N>e}dnd^v-a%kU)_mZrh5;_z#NFEEVUP+$xi6P)C+)Xt}WJf6HVRAb%v39;~hWqp|`T2|RC zYj8e#3BWEy*bwF$d_+Y0%5URik~&kIHN8zcWg@{cqwv_o70MHLUC7gKwWDn^uy}%X zZc6U-T+AfqthQN3Cz;eC9|$<{tFDmIvk&|%PsCd+(xcP$_5H zkCuxx18?(pM4Mby^*%%nNC3;8D8D7;(wi>R9%4(5jAAnce|0&diel5!kTz(*bh{5` zz37jj??$0P#i!4!3S{HqxvmL!i0 zvF--3kg(!f+bbpHDOQKZ@7Gs+8n;+gf{KF?Ydzk3QkZ?sv zu_~H-U|)(y={Fv&i{+Q8(i!8`C=u4*$9_H)UY=V3S1Yver2DA{vym}+fD+%OJZ*s1|#?V zu*j9Hg`K znApa86r{pE;rsyO)Tjg2@Q?RwL0GaN1XYsr=$#FdS;3RyWaoT&p&!)) z^!+O=-)qM)WQwmu(1UQ4X-|rJh)gL$n%Wlq`pThNov_6vSSx#$WV>Aa6;)E7IpPj^ zQL$nZHbt*)hC9Ib@V-Zhu}Z>ym#pDly%`C(e5qbYLe!r!JkRZE&ai(XVNu;h9LlJ0 z#1Ok#Rj`zVFxfWsa3^=(^8KSwkm-T+73ZUtOb^I#XuvEGAI-(9&KEUH zvWsUK`D)@bL?uM9q~URWQvt-ZSED*{A8SSIvs_`2mm>BwHThW@Qwp#rJT%S{ZTHY0 zON6CVI(2e`U#f(svzA6 zUUZ{;7C!P&tPKw7OPC@BsS<7#tc6ZZ3jRwlyR1-bpC@wz z27%zHXNe23d2@EUgO$uGj5Z+wEW_~j82*y`Y@6trPVFmrxn9WG+wAz%%QxQ9i}H z3;U~fdm%0ttWVjL{w}_-K%{*%-%zN|sM8aX4>B*pC1?pVr)-P{JJg*8?P*uM8yPcD zlqz4`$kjkPM4;5Cd!J!{1qR|I1B!|hw~``gtX#NuceG$8G5lCZd}&s76&JMQCrFwQ zSYMZu{x8s)xT3(lnjoz?mpdk7IW%%H?Ju-6 z?=`D31q9dNb+Q)YThc>~Yh{9do_b;GpQI1vZaAG?skG<6V0e6^eFF^Zreu)bIwrQQwnTM$FMOmhK>Aj2@3_cxj#L*85o*QR{MsQ2 zUs3RB-(xAv&<>c{?^8a2c-|y!$ai{`AkMc~sU4(zc6e|iY(i)*x>ST z^pKix@rr4(!k2*0E957;jy^u<#gy+Fl86@=hi!aLd@Uib;&%9wA;wy=D)>o6Xr)X_ zq)s=p1p!b8xJx3iL=?7ZucIo3+Koiu%De$4+6v-~BE*5XUL-INP|(K;iIY;qB&K+kNG5@r~=HKGEZo6OF91t#-QNoCai%4SvnFZE_= zni)Xh@N(Fq&q8mmLxF@O<fGvO%1bOz8t_GI`tEslZJB_<5@W zJm_&Fq|8mxLrneBj8n+F4cw`rzc)4*iZVMC62Yc8=f?x+05tBb>l#W_)iYdbEAHiC 
zouARG*$G2jrY-}%t0QB_D2$~^Lu@C6Tk}+` zuM8b?{bXh^+o>c0r^=e^u6gbe{SF4v3iASyHT7#&IirNoW)MgP7ZP_lQ3VdrLJ3N@ zDv~|NbA@O<7?E)`eY!pxg0=r}-WNSw%b__xms?eMTEs zCMqTN4or2(}iEiB#`KuZ6o=E_Jb#wtZ=_Ms~Ce;x|F`XupOwOZ!m_B zQmn;CITEy+;Kyqgpp(z zsuE)Yzo3ZGh(h)@-KH4bW)|$VVeBIuDC;J)OL|hWQvUX3o*_{lZ1CB-N@DqFHK4ru z8(0ugh1l<;$Xy9eAg>ku?wLh|$* z99QL-O*2;6kf3J~;j{(0wT;E|6O7fuTJ2+19(V#uvM=R`Dm)gFeyuF8Hz#L0S77Al z%zO%Rx)MJ0I2rUvIyMEVr*C!h(5bF`iR$DHmLHSX3(=+Zpau%jG9es