diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 00000000..1b273e59
Binary files /dev/null and b/.DS_Store differ
diff --git a/.gitignore b/.gitignore
index 1320f90e..df4b4ee0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
site
+venv/
\ No newline at end of file
diff --git a/docs/.DS_Store b/docs/.DS_Store
new file mode 100644
index 00000000..666b13c1
Binary files /dev/null and b/docs/.DS_Store differ
diff --git a/docs/pephub/.DS_Store b/docs/pephub/.DS_Store
new file mode 100644
index 00000000..d0452c3a
Binary files /dev/null and b/docs/pephub/.DS_Store differ
diff --git a/docs/pephub/development.md b/docs/pephub/development.md
index 1db98008..f83d1362 100644
--- a/docs/pephub/development.md
+++ b/docs/pephub/development.md
@@ -4,7 +4,7 @@
_The following assumes you have already setup a database. If you have not, please see [here](#1-database-setup)._
-There are two components to PEPhub: a FastAPI backend, and a React frontend. As such, when developing, you will need to run both the backend and frontend development servers.
+There are two components to PEPhub: a FastAPI backend, and a React frontend. As such, when developing, you will need to run both the backend and frontend development servers. Full API documentation can be found at https://pephub-api.databio.org/api/v1/docs.
## Backend development
diff --git a/docs/pephub/img/architecture.png b/docs/pephub/img/architecture.png
new file mode 100644
index 00000000..16baa8bc
Binary files /dev/null and b/docs/pephub/img/architecture.png differ
diff --git a/docs/pephub/img/cartoon_sample_modifiers.svg b/docs/pephub/img/cartoon_sample_modifiers.svg
new file mode 100644
index 00000000..fac9ed56
--- /dev/null
+++ b/docs/pephub/img/cartoon_sample_modifiers.svg
@@ -0,0 +1,332 @@
+
+
+
+
diff --git a/docs/pephub/img/pepembed-arch.svg b/docs/pephub/img/pepembed-arch.svg
new file mode 100644
index 00000000..132cc9ea
--- /dev/null
+++ b/docs/pephub/img/pepembed-arch.svg
@@ -0,0 +1,1276 @@
+
+
+
+
diff --git a/docs/pephub/pepembed/README.md b/docs/pephub/pepembed/README.md
new file mode 100644
index 00000000..ade8b6b2
--- /dev/null
+++ b/docs/pephub/pepembed/README.md
@@ -0,0 +1,104 @@
+# pepembed
+
+## Overview
+
+PEPembed is a Python package for computing text embeddings of sample metadata stored in [pephub](https://github.com/pepkit/pephub) for search-and-retrieval tasks. It provides both a CLI and a Python API. It handles the long-running job of downloading projects from pephub, mining relevant metadata from them, computing a text embedding of that metadata, and finally upserting it into a vector database. We use [qdrant](https://qdrant.tech/) as our vector database for its performance, simplicity, and payload capabilities.
+
+Understand everything? Jump to [running `pepembed`](#install-and-run). Or view the quick start below.
+
+## Quick Start
+
+```console
+pip install .
+```
+
+```console
+pepembed \
+ --postgres-host $POSTGRES_HOST \
+ --postgres-user $POSTGRES_USER \
+ --postgres-password $POSTGRES_PASSWORD \
+ --postgres-db $POSTGRES_DB
+```
+
+## Architecture
+
+
+
+
+
+`pepembed` works in three steps: 1) it downloads PEPs from pephub, 2) it extracts metadata from these PEPs and embeds it using a [sentence transformer](https://www.sbert.net/), and 3) it inserts the resulting embeddings into a [qdrant](https://qdrant.tech/) instance.
+
+**1. Download PEPs:**
+`pepembed` downloads all PEPs from pephub. This is the most time-consuming step. Currently there is no way to parametrize which PEPs are downloaded, but this should be added in the future. We should also allow generating embeddings directly from files on disk.
+
+**2. Extract Metadata from PEPs and Compute Embeddings:**
+Once the PEPs are downloaded, we then extract any relevant metadata from them. This is done by looking for **keywords** in the [**project-level** attributes](https://pep.databio.org/en/latest/specification/#project-attribute-sample_modifiers). For each PEP, a pseudo-description is built by looking for these keywords and building a string. Some example keyword attributes might be: `cell_type`, `protocol`, `procedure`, `institution`, etc. You can specify your own keywords to `pepembed` if you wish.
+
+
+
+
+
+Once the pseudo-descriptions are mined, we use a `sentence-transformer` to generate dense, fixed-length vector representations of these descriptions. By default, we use a [state-of-the-art transformer](https://arxiv.org/abs/1908.10084) trained for the semantic textual similarity task (*Reimers & Gurevych, 2019*). The embeddings are linked back to the original PEP registry path, along with other information like the mined pseudo-description and the row id in the database.
+
+**3. Insert Embeddings:**
+Finally, we insert the embeddings into a [qdrant](https://qdrant.tech/) instance. qdrant is a **vector database** designed to store embeddings as first-class data types and to support native graph-based indexing of those embeddings. This allows for near-instant search and retrieval of the nearest neighbors of a new embedding (say, an encoded search query from a web application). qdrant also supports attaching a [**payload**](https://qdrant.tech/documentation/payload/) to each embedding, where we store basic information on that PEP like its registry path, row id, and description.
+
+## Install and Run
+
+While simple to install and run, `pepembed` requires several pieces of information to function. There are three key aspects: 1) the pephub instance, 2) the qdrant instance, and 3) the keywords. Make sure the following are in place before running the CLI:
+
+### Setup
+
+**1. PEPhub instance:**
+Make sure you have access to a running pephub instance populated with PEPs. Once it is available, you can use the following environment variables to tell `pepembed` where to get data. Alternatively, you can pass these as command-line args:
+* `POSTGRES_HOST`
+* `POSTGRES_DB`
+* `POSTGRES_USER`
+* `POSTGRES_PASSWORD`
+
+**2. Qdrant instance:**
+In addition to a pephub instance, you will need a running instance of qdrant. Setting one up is simple, and instructions can be found [here](https://qdrant.tech/documentation/quick_start/). The TL;DR is:
+
+```console
+docker pull qdrant/qdrant
+docker run -p 6333:6333 \
+ -v $(pwd)/qdrant_storage:/qdrant/storage \
+ qdrant/qdrant
+```
+
+This will give you a qdrant instance served at http://localhost:6333. You can pass this information to `pepembed` as environment variables. Alternatively, you may pass these as command-line args:
+* `QDRANT_HOST`
+* `QDRANT_PORT`
+* `QDRANT_API_KEY`
+* `QDRANT_COLLECTION_NAME`
+
+*Unless you are running this for production, you most likely do not need to specify any of these.*
+
+**3. Keywords:**
+Finally, we need a keywords file. This is technically optional, since `pepembed` comes with [default keywords](pepembed/const.py), but you may supply your own as a plain text file. The keywords file can only be supplied as a command-line arg:
+* `KEYWORDS_FILE`
+
+There are many other options as well (like specifying the transformer model to use), but the defaults work well for a first try. Use `pepembed --help` to see all options. If you like to keep your secrets in a `.env` file, you can export them to the environment with `export $(cat .env | xargs)`.
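+
+As a minimal sketch, a `.env` file covering the variables listed above might look like the following. Every value here is a placeholder assumption (only the qdrant host and port mirror the local Docker setup shown earlier); substitute your own credentials and collection name.
+
+```text
+POSTGRES_HOST=localhost
+POSTGRES_DB=pephub
+POSTGRES_USER=postgres
+POSTGRES_PASSWORD=change-me
+QDRANT_HOST=localhost
+QDRANT_PORT=6333
+QDRANT_API_KEY=change-me
+QDRANT_COLLECTION_NAME=pephub
+```
+
+Because the output of `xargs` is split on whitespace by the shell, this export approach only works when the file contains plain `KEY=value` lines, with no comments and no spaces in the values.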
+
+### Install
+
+Clone this repository and install with `pip`:
+
+```console
+pip install .
+```
+
+### Run
+
+```console
+pepembed \
+ --keywords-file keywords.txt \
+ --postgres-host $POSTGRES_HOST \
+ --postgres-user $POSTGRES_USER \
+ --postgres-password $POSTGRES_PASSWORD \
+ --postgres-db $POSTGRES_DB
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index 823230d4..00940d9c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -184,6 +184,8 @@ nav:
- Deployment: pephub/deployment.md
- Development: pephub/development.md
- Server settings: pephub/server-settings.md
+ - PEPembed:
+ - PEPembed: pephub/pepembed/README.md
- pepdbagent:
- pepdbagent: pephub/pepdbagent/README.md
- Database tutorial: pephub/pepdbagent/db_tutorial.md
@@ -192,8 +194,6 @@ nav:
- Schema registry: https://schema.databio.org
- How to cite: citations.md
- Changelog: pephub/changelog.md
-
-
- Peppy:
- Peppy: peppy/README.md
- Getting started: