Skip to content

LLM agents for working with the SRA and associated bioinformatics databases.

License

Notifications You must be signed in to change notification settings

ArcInstitute/SRAgent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SRAgent

Agentic workflows for obtaining data from the Sequence Read Archive.

Manuscript

scBaseCamp: An AI agent-curated, uniformly processed, and continually expanding single cell data repository. Nicholas D Youngblut, Christopher Carpenter, Jaanak Prashar, Chiara Ricci-Tam, Rajesh Ilango, Noam Teyssier, Silvana Konermann, Patrick Hsu, Alexander Dobin, David P Burke, Hani Goodarzi, Yusuf H Roohani. bioRxiv 2025.02.27.640494; doi: https://doi.org/10.1101/2025.02.27.640494

Install

Create a conda environment [optional]:

mamba create -n sragent-env -y python=3.12 sra-tools=3.1 \
  && conda activate sragent-env

Clone the repository:

git clone https://github.com/ArcInstitute/SRAgent.git \
  && cd SRAgent

Install the package:

pip install .

Environmental variables

  • OPENAI_API_KEY = API key for using the OpenAI API
    • required
    • currently, no other models are supported besides OpenAI
  • EMAIL = email for using the Entrez API
    • optional, but HIGHLY recommended
  • NCBI_API_KEY = API key for using the Entrez API
    • optional, increases rate limits
  • DYNACONF = switch between "test" and "prod" environments
    • optional, default is "prod"
    • this only affects the SQL database used, and no database is used by default

Testing

pip install pytest
pytest tests/

Usage

SQL database

Components of SRAgent can use an SQL database to store the results.

This was crucial for the scBaseCamp project, in order to:

  • track which datasets had been processed
  • quickly assess the progress of the project

However, for most users, the SQL database is not necessary. SRAgent does not use the SQL database by default.

Note: currently only a GCP Postgresql database is supported.

To set up the database, see Setting up the SQL Database.

Entrez Agent

The lowest-level agent in the SRAgent hiearchy. The agent can call various Entrez tools (esearch, efetch, esummary, and elink). Usually, the SRAgent agent will be more useful, since it includes more tools, including calling the Entrez agent.

Example accession conversion:

SRAgent entrez "Convert GSE121737 to SRX accessions"

Example of obtaining pubmed articles associated with a dataset accession:

SRAgent entrez "Obtain any available publications for GSE196830"

SRAgent agent

A general tool for extracting data from the SRA database. The tools available:

  • Entrez agent (see above)
  • SRA BigQuery
  • scraping NCBI webpage HTML
  • sra-stat and fastq-dump (directly assessing sequence data)

Example of converting a GEO accession to SRX accessions:

SRAgent sragent "Convert GSE121737 to SRX accessions"

Example of obtaining metadata for a specific SRX accession:

SRAgent sragent "Obtain any available publications for GSE196830"

Example of obtaining specific metadata fields for a dataset:

SRAgent sragent "Which 10X Genomics technology was used for ERX11887200?"

SRX-info agent

Obtain specific metadata for >=1 SRA dataset.

  • Input: >=1 Entrez ID
  • Output metadata fields:
    • SRX accession for the Entrez ID
    • SRR accessions for the SRX accession
    • Is the dataset Illumina sequence data?
    • Is the dataset single cell RNA-seq data?
    • Is the dataset paired-end sequencing data?
    • Which scRNA-seq library preparation technology?
    • If 10X Genomics, which particular 10X technologies?
    • Single nucleus or single cell RNA sequencing?
    • Which organism was sequenced?
    • Which tissue was sequenced?
    • Any disease information?
    • Any treatment/purturbation information?
    • Any cell line information?
  • Workflow
    • The agent converts the Entrez IDs to SRX accessions
    • For each SRX accession, the agent obtains metadata
    • The agent consolidates the metadata into a single report

As of now, the metadata fields are hard-coded into the agent. If you need alternative metadata fields, you will have to modify metadata.py.

Examples

A single SRA dataset:

SRAgent srx-info 25576380

Multiple SRA datasets:

SRAgent srx-info 36404865 36106630 32664033

Use the SQL database to filter out already-processed datasets:

SRAgent srx-info --use-database 18060880 27454880 27454942 27694586

Metadata agent

Similar to the SRX-info agent, but you can provide SRX accessions directly, instead of Entrez IDs. This saves compute time, since the agent does not need to convert the Entrez IDs to SRX accessions.

Provide a CSV of Entrez IDs and their associated SRX accessions to obtain metadata. Useful for when you already have the SRX accessions, instead of providing the Entrez IDs to SRAgent srx-info.

The CSV should have the header: entrez_id,srx_accession.

The metadata fields are the same as the SRX-info agent.

Examples

SRAgent metadata "entrez-id_srx-accession.csv"

find-datasets agent

A high-level agent for finding datasets in the SRA via esearch and then processing them with the SRX-info agent.

  • Input: a search query
  • Output: metadata fields for the datasets found (same as SRX-info agent)
  • Workflow
    • The agent uses esearch to find datasets
    • The agent processes the datasets with the SRX-info agent
    • The agent consolidates the metadata into a single report

Examples

SRAgent find-datasets "Obtain recent single cell RNA-seq datasets in the SRA database"

Target specific organisms

SRAgent find-datasets --no-summaries --max-datasets 1 --organisms pig -- \
  "Obtain recent single cell RNA-seq datasets in the SRA database"
Available organisms
  • Mammals

    • Human (Homo sapiens)
    • Mouse (Mus musculus)
    • Rat (Rattus norvegicus)
    • Macaque (Macaca mulatta)
    • Marmoset (Callithrix jacchus)
    • Horse (Equus caballus)
    • Dog (Canis lupus)
    • Bovine (Bos taurus)
    • Sheep (Ovis aries)
    • Pig (Sus scrofa)
    • Rabbit (Oryctolagus cuniculus)
    • Naked mole-rat (Heterocephalus glaber)
    • Chimpanzee (Pan troglodytes)
    • Gorilla (Gorilla gorilla)
  • Birds

    • Chicken (Gallus gallus)
  • Amphibians

    • Frog (Xenopus tropicalis)
  • Fish

    • Zebrafish (Danio rerio)
  • Invertebrates

    • Fruit fly (Drosophila melanogaster)
    • Roundworm (Caenorhabditis elegans)
    • Mosquito (Anopheles gambiae)
    • Blood fluke (Schistosoma mansoni)
  • Plants

    • Thale cress (Arabidopsis thaliana)
    • Rice (Oryza sativa)
    • Tomato (Solanum lycopersicum)
    • Corn (Zea mays)

Using an SQL database to store results

Using the test database:

SRAgent find-datasets --use-database --no-summaries --max-datasets 1 --organisms rat -- \
  "Obtain recent single cell RNA-seq datasets in the SRA database"

Setting up the SQL database

  • Create a GCP Postgresql database. See the docs.
  • Required secrets:
    • GCP_SQL_DB_PASSWORD
    • Store in the .env file or GCP Secret Manager
    • If using GCP Secret Manager, you must also provide:
      • GOOGLE_APPLICATION_CREDENTIALS
      • GCP_PROJECT_ID
  • Update the settings.py file with the database information.

Evaluations

See the eval.py script for running evaluations.

Contributing

Feel free to fork the repository and submit a pull request. However, the top priority is to keep SRAgent functioning for the ongoing scBaseCamp project.

About

LLM agents for working with the SRA and associated bioinformatics databases.

Resources

License

Stars

Watchers

Forks

Packages

No packages published