We need to update the function that mines a description from PEPs inside pephub for subsequent embedding and insertion into Qdrant. Currently, the pipeline looks inside each PEP and attempts to extract any project-level attributes. These are typically attributes that describe the data from a global perspective. See here for more info on the architecture. It would be better if we extracted all meaningful text from the PEP and used that to compute an embedding for vector search and retrieval.
We need this to be more flexible and more intelligent. For example, if we have a project yaml/dict that looks like:
name: GSE226825
pep_version: 2.1.0
sample_table: GSE226825_PEP_raw.csv
sample_modifiers:
  append:
    sample_data_processing: Adapter sequences were trimmed by Trimmomatic (v0.39). Trimmed reads aligned using HISAT2 (v2.2.0) with referring hg19 genome. Aligned reads are sorted by samtools (v1.9)...
    sample_extract_protocol_ch1: "RNA was extracted from ...
experiment_metadata:
  series_type: Expression profiling by high throughput sequencing
  series_title: RNA sequencing of peripheral blood mononuclear cells isolated from Korean patients with asthma
  series_status: Public on Mar 08 2023
  series_summary: More than 200 asthma-associated genetic variants have been identified in genome-wide association studies (GWASs). Expression quantitative trait loci (eQTL) ...
We would need to extract fields such as sample_data_processing and series_summary, since these carry most of the information about the data.
This is the exact spot in the code that is mining the description. This is where the magic is happening! Nearly all else in this repo can be thought of as a convenient "glue" that keeps the pipeline going and consistent. The mine_metadata_from_dict function is the only one that needs significant changes (at least for now...)
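One possible direction, as a minimal sketch rather than a proposed final implementation: walk the whole project dict recursively and keep every string value that looks like free text. The function names (`iter_text_fields`, `mine_description`), the `min_words` threshold, and the word-count heuristic itself are all assumptions made for illustration; none of them come from the pepembed code base.

```python
from typing import Any, Iterator, Tuple


def iter_text_fields(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_key_path, text) for every string found at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from iter_text_fields(value, child)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_text_fields(item, path)
    elif isinstance(node, str):
        yield path, node.strip()


def mine_description(project: dict, min_words: int = 6) -> str:
    """Join every string value that has at least `min_words` words.

    The word-count cutoff is a hypothetical heuristic for "meaningful text";
    it drops short values such as accession IDs, versions, and status flags.
    """
    parts = []
    for _path, text in iter_text_fields(project):
        if len(text.split()) >= min_words:
            parts.append(text)
    return " ".join(parts)


# Trimmed-down version of the project shown above.
project = {
    "name": "GSE226825",
    "pep_version": "2.1.0",
    "sample_modifiers": {
        "append": {
            "sample_data_processing": "Adapter sequences were trimmed by Trimmomatic (v0.39).",
        }
    },
    "experiment_metadata": {
        "series_type": "Expression profiling by high throughput sequencing",
        "series_status": "Public on Mar 08 2023",
        "series_summary": (
            "More than 200 asthma-associated genetic variants have been "
            "identified in genome-wide association studies (GWASs)."
        ),
    },
}

description = mine_description(project)
print(description)
```

Because the walk does not depend on where a key lives, the same function reaches text under sample_modifiers and experiment_metadata alike. A word-count cutoff is crude, so a real version would likely need a smarter filter.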
Secrets, debugging, and Testing
There are three things that you will need for efficient development and testing.
The first is lab secrets. This package works with two databases, so there are a handful of secrets and passwords we use to connect to them. This repo is set up to be compatible with the lab secret workflow. If you are set up properly with the lab secret workflow, you can simply run source production.env and your environment will be populated with the correct credentials. Ask @nleroy917 or @nsheff if you need help here...
The second is debugging. I also have this repository set up to work with VSCode debugging. By hitting F5, you can launch the debugger, and you should then be able to use breakpoints to stop the code and inspect things.
The third is testing. I have a tests/ directory, but it doesn't contain anything 😅. The best way to test currently is to install the package locally with pip (pip install .) and then run the CLI: pepembed. You can speed things up by limiting the results from the database: pepembed -n 100.
Extras
In addition, we discussed in a meeting that we should store multiple vectors for each object in the Qdrant collection. Here is a blog post that explains how to do just that with Qdrant.
Design Goals
There are two main design goals for the updated pepembed description mining functionality. Basically, we want one function that operates on any PEP you give it, and it should be capable of accessing rich biological information at any level.
Technical details and code
Pseudo-code of the current implementation looks like this:
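As a rough illustration only (this is not the repository's actual code), a project-level miner of the kind described above might look like the following. The keyword list, the function name `mine_project_level_text`, and the sample values are all hypothetical. The point is that only top-level string attributes are inspected.

```python
from typing import Tuple

# Hypothetical keyword list; the real pipeline's keywords are not shown in this issue.
KEYWORDS: Tuple[str, ...] = ("description", "summary", "title")


def mine_project_level_text(project: dict, keywords: Tuple[str, ...] = KEYWORDS) -> str:
    """Concatenate top-level string attributes whose key contains a keyword."""
    parts = []
    for key, value in project.items():
        # Only the top level is inspected, so nested text is never reached.
        if isinstance(value, str) and any(k in key.lower() for k in keywords):
            parts.append(value)
    return " ".join(parts)


# Made-up project used purely to show the limitation.
project = {
    "name": "GSE226825",
    "description": "RNA sequencing of PBMCs",
    "experiment_metadata": {
        "series_summary": "More than 200 asthma-associated genetic variants ...",
    },
}

flat = mine_project_level_text(project)
print(flat)
```

In this sketch the nested series_summary is silently skipped, which is the kind of gap the updated mining function is meant to close.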