diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 00000000..1b273e59 Binary files /dev/null and b/.DS_Store differ diff --git a/.gitignore b/.gitignore index 1320f90e..df4b4ee0 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,2 @@ site +venv/ \ No newline at end of file diff --git a/docs/.DS_Store b/docs/.DS_Store new file mode 100644 index 00000000..666b13c1 Binary files /dev/null and b/docs/.DS_Store differ diff --git a/docs/pephub/.DS_Store b/docs/pephub/.DS_Store new file mode 100644 index 00000000..d0452c3a Binary files /dev/null and b/docs/pephub/.DS_Store differ diff --git a/docs/pephub/development.md b/docs/pephub/development.md index 1db98008..f83d1362 100644 --- a/docs/pephub/development.md +++ b/docs/pephub/development.md @@ -4,7 +4,7 @@ _The following assumes you have already setup a database. If you have not, please see [here](#1-database-setup)._ -There are two components to PEPhub: a FastAPI backend, and a React frontend. As such, when developing, you will need to run both the backend and frontend development servers. +There are two components to PEPhub: a FastAPI backend, and a React frontend. As such, when developing, you will need to run both the backend and frontend development servers. Full API documentation can be found at https://pephub-api.databio.org/api/v1/docs. 
## Backend development diff --git a/docs/pephub/img/architecture.png b/docs/pephub/img/architecture.png new file mode 100644 index 00000000..16baa8bc Binary files /dev/null and b/docs/pephub/img/architecture.png differ diff --git a/docs/pephub/img/cartoon_sample_modifiers.svg b/docs/pephub/img/cartoon_sample_modifiers.svg new file mode 100644 index 00000000..fac9ed56 --- /dev/null +++ b/docs/pephub/img/cartoon_sample_modifiers.svg @@ -0,0 +1,332 @@ [SVG markup not shown; text labels: "yaml", "csv"] diff --git a/docs/pephub/img/pepembed-arch.svg b/docs/pephub/img/pepembed-arch.svg new file mode 100644 index 00000000..132cc9ea --- /dev/null +++ b/docs/pephub/img/pepembed-arch.svg @@ -0,0 +1,1276 @@ [SVG markup not shown; text labels. Search: 1. Submit query ("cardiomyocytes in females"), 2. Send embedding, 3. Return hits (KNN), 4. Return full PEPs; components: Client, Qdrant; example vectors such as [0.1, -1.2, 0.9]. Populate: 1. Pull all PEPs, 2. Extract, 3. Insert embeddings; components: pepmbed, Qdrant.] diff --git a/docs/pephub/pepembed/README.md b/docs/pephub/pepembed/README.md new file mode 100644 index 00000000..ade8b6b2 --- /dev/null +++ b/docs/pephub/pepembed/README.md @@ -0,0 +1,104 @@ +# pepembed + +## Overview + +PEPembed is a Python package for computing text embeddings of sample metadata stored in [pephub](https://github.com/pepkit/pephub) for search-and-retrieval tasks. It provides both a CLI and a Python API. It handles the long-running job of downloading projects inside pephub, mining any relevant metadata from them, computing a rich text embedding on that data, and finally upserting it into a vector database. We use [qdrant](https://qdrant.tech/) as our vector database for its performance, simplicity, and payload capabilities. + +Understand everything? Jump to [running `pepembed`](#install-and-run). Or view the quick start below. 
+ +## Quick Start + +```console +pip install . +``` + +```console +pepembed \ + --postgres-host $POSTGRES_HOST \ + --postgres-user $POSTGRES_USER \ + --postgres-password $POSTGRES_PASSWORD \ + --postgres-db $POSTGRES_DB +``` + +## Architecture + +

+ pepembed architecture +

+ +`pepembed` works in three steps: 1) download PEPs from pephub, 2) extract metadata from these PEPs and embed it using a [sentence transformer](https://www.sbert.net/), and 3) insert the resulting embeddings into a [qdrant](https://qdrant.tech/) instance. + +**1. Download PEPs:** +`pepembed` downloads all PEPs from pephub. This is the most time-consuming process. Currently there is no way to parametrize this, but in the future there should be. We should also allow for generating embeddings straight from files on disk. + +**2. Extract Metadata from PEPs and Compute Embeddings:** +Once the PEPs are downloaded, we then extract any relevant metadata from them. This is done by looking for **keywords** in the [**project-level** attributes](https://pep.databio.org/en/latest/specification/#project-attribute-sample_modifiers). For each PEP, a pseudo-description is built by looking for these keywords and building a string. Some example keyword attributes might be: `cell_type`, `protocol`, `procedure`, `institution`, etc. You can specify your own keywords to `pepembed` if you wish. + +
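The keyword-mining step described above can be sketched roughly as follows. This is a minimal illustration and not the actual `pepembed` implementation: the function name `build_pseudo_description`, the `EXAMPLE_KEYWORDS` tuple, and the `key: value` joining format are all assumptions made for the example. The real default keywords live in `pepembed/const.py`.

```python
from typing import Any, Mapping, Sequence

# Hypothetical keyword list for illustration only; the real defaults
# are defined in pepembed/const.py and may differ.
EXAMPLE_KEYWORDS = ("cell_type", "protocol", "procedure", "institution")


def build_pseudo_description(
    project_attrs: Mapping[str, Any],
    keywords: Sequence[str] = EXAMPLE_KEYWORDS,
) -> str:
    """Build a pseudo-description string from project-level attributes.

    Only attributes whose names appear in `keywords` are kept; missing
    or empty values are skipped. The output format is an assumption.
    """
    parts = []
    for key in keywords:
        value = project_attrs.get(key)
        if value is None or str(value).strip() == "":
            continue
        parts.append(f"{key}: {str(value).strip()}")
    return "; ".join(parts)


# Example project-level attributes (made-up values).
attrs = {
    "cell_type": "cardiomyocyte",
    "protocol": "ATAC-seq",
    "pep_version": "2.1.0",  # not a keyword, so it is ignored
}
description = build_pseudo_description(attrs)
print(description)
```

The resulting string is the kind of text that would then be handed to the sentence transformer for embedding.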

+ Sample modifiers in a configuration file +

+ +Once the pseudo-descriptions are mined, we can then utilize a `sentence-transformer` to generate low-dimensional representations of these descriptions. By default, we use a [state-of-the-art transformer](https://arxiv.org/abs/1908.10084) trained for the semantic textual similarity task (*Reimers & Gurevych, 2019*). The embeddings are linked back to the original PEP registry path, along with other information like the mined pseudo-description and the row id in the database. + +**3. Insert Embeddings:** +Finally, we insert the embeddings into a [qdrant](https://qdrant.tech/) instance. qdrant is a **vector database** that is designed to store embeddings as first-class data types and supports native graph-based indexing of these embeddings. This allows for near-instant search and retrieval of the nearest neighboring embeddings given a new embedding (say, an encoded search query on a web application). qdrant supports attaching a [**payload**](https://qdrant.tech/documentation/payload/) to each embedding, where we store basic information on that PEP like registry path, row id, and its description. + +## Install and Run + +While simple to install and run, `pepembed` requires lots of information to function. There are three key aspects: 1) the pephub instance, 2) the qdrant instance, and 3) the keywords. Ensure the following before running the `cli`: + +### Setup + +**1. PEPhub instance:** +Make sure you have access to a running pephub instance populated with PEPs. Once complete, you can use the following environment variables to tell `pepembed` where to get data. Alternatively, you can pass these as command-line args: +* `POSTGRES_HOST` +* `POSTGRES_DB` +* `POSTGRES_USER` +* `POSTGRES_PASSWORD` + +**2. Qdrant instance:** +In addition to a pephub instance, you will need a running instance of qdrant. Setting one up is quite simple, and instructions can be found [here](https://qdrant.tech/documentation/quick_start/). 
The TL;DR is: + +```console +docker pull qdrant/qdrant +docker run -p 6333:6333 \ + -v $(pwd)/qdrant_storage:/qdrant/storage \ + qdrant/qdrant +``` + +This will give you a qdrant instance served at http://localhost:6333. You can pass this information to `pepembed` as environment variables. Alternatively, you may pass these as command-line args: +* `QDRANT_HOST` +* `QDRANT_PORT` +* `QDRANT_API_KEY` +* `QDRANT_COLLECTION_NAME` + +*Unless you are running this for production, you most likely do not need to specify any of these.* + +**3. Keywords:** +Finally, we need a keywords file. This is technically optional, and `pepembed` comes with [default keywords](pepembed/const.py), but you may supply your own as a plain text file. This can be supplied only as command-line args: +* `KEYWORDS_FILE` + +There are many other options as well (like specifying the transformer model to use), but the defaults work great for a first try. Use `pepembed --help` to see all options. If you are like me, and like to keep your secrets in a `.env` file, you can export them easily to the environment with `export $(cat .env | xargs)` + +### Install + +Clone this repository and install with `pip`: + +```console +pip install . 
+``` + +### Run + +```console +pepembed \ + --keywords-file keywords.txt \ + --postgres-host $POSTGRES_HOST \ + --postgres-user $POSTGRES_USER \ + --postgres-password $POSTGRES_PASSWORD \ + --postgres-db $POSTGRES_DB +``` diff --git a/mkdocs.yml b/mkdocs.yml index 823230d4..00940d9c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -184,6 +184,8 @@ nav: - Deployment: pephub/deployment.md - Development: pephub/development.md - Server settings: pephub/server-settings.md + - PEPembed: + - PEPembed: pephub/pepembed/README.md - pepdbagent: - pepdbagent: pephub/pepdbagent/README.md - Database tutorial: pephub/pepdbagent/db_tutorial.md @@ -192,8 +194,6 @@ nav: - Schema registry: https://schema.databio.org - How to cite: citations.md - Changelog: pephub/changelog.md - - - Peppy: - Peppy: peppy/README.md - Getting started: