SRAgent

Agentic workflows for obtaining data from the Sequence Read Archive.

Manuscript

scBaseCamp: An AI agent-curated, uniformly processed, and continually expanding single cell data repository. Nicholas D Youngblut, Christopher Carpenter, Jaanak Prashar, Chiara Ricci-Tam, Rajesh Ilango, Noam Teyssier, Silvana Konermann, Patrick Hsu, Alexander Dobin, David P Burke, Hani Goodarzi, Yusuf H Roohani. bioRxiv 2025.02.27.640494; doi: https://doi.org/10.1101/2025.02.27.640494

Install

Create a conda environment [optional]:

mamba create -n sragent-env -y python=3.12 sra-tools=3.1 \
  && conda activate sragent-env

Clone the repository:

git clone https://github.com/ArcInstitute/SRAgent.git \
  && cd SRAgent

Install the package:

pip install .

Environment variables

OPENAI_API_KEY = API key for using the OpenAI API
- required
- currently, only OpenAI models are supported
EMAIL = email for using the Entrez API
- optional, but HIGHLY recommended
NCBI_API_KEY = API key for using the Entrez API
- optional, increases rate limits
DYNACONF = switch between "test" and "prod" environments
- optional, default is "prod"
- this only affects the SQL database used, and no database is used by default
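
A quick pre-flight check of these variables can catch a missing key before an agent run starts. The sketch below is not part of SRAgent; the function names `check_environment` and `dynaconf_environment` are hypothetical, and the only facts it relies on are the variable names and the "prod" default listed above.

```python
import os

# Variable names taken from the list above.
REQUIRED = ("OPENAI_API_KEY",)
RECOMMENDED = ("EMAIL", "NCBI_API_KEY")


def check_environment(env):
    """Return (missing_required, missing_recommended) for a mapping of variables."""
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_recommended = [name for name in RECOMMENDED if not env.get(name)]
    return missing_required, missing_recommended


def dynaconf_environment(env):
    """Return the selected environment, falling back to the documented default."""
    return env.get("DYNACONF", "prod")


if __name__ == "__main__":
    required, recommended = check_environment(os.environ)
    for name in required:
        print(f"ERROR: {name} is not set")
    for name in recommended:
        print(f"WARNING: {name} is not set")
    print(f"DYNACONF environment: {dynaconf_environment(os.environ)}")
```

Passing the mapping in as an argument, rather than reading `os.environ` inside the functions, keeps the check easy to test.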

Testing

pip install pytest

pytest tests/

Usage

SQL database

Components of SRAgent can use an SQL database to store the results.

This was crucial for the scBaseCamp project, in order to:

track which datasets had been processed
quickly assess the progress of the project

However, for most users, the SQL database is not necessary. SRAgent does not use the SQL database by default.

Note: currently, only a GCP PostgreSQL database is supported.

To set up the database, see Setting up the SQL Database.

Entrez Agent

The lowest-level agent in the SRAgent hierarchy. It can call the Entrez tools esearch, efetch, esummary, and elink. In most cases the SRAgent agent is more useful, because it has access to more tools and can itself call the Entrez agent.

Example accession conversion:

SRAgent entrez "Convert GSE121737 to SRX accessions"

Example of obtaining PubMed articles associated with a dataset accession:

SRAgent entrez "Obtain any available publications for GSE196830"
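
For orientation, the sketch below shows what a raw esearch request to the NCBI E-utilities looks like, and where the EMAIL and NCBI_API_KEY values end up in it. It is independent of SRAgent's internals, which may call Entrez through a different client; the function name `build_esearch_url` and its defaults are hypothetical. The code only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20, email=None, api_key=None):
    """Build an esearch URL; email and api_key are added only when provided."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    if email:
        params["email"] = email
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Example search against the SRA database; the address is a placeholder.
    print(build_esearch_url("sra", "single cell RNA-seq", email="user@example.org"))
```

NCBI allows a higher request rate for calls that carry an API key, which is why setting NCBI_API_KEY helps when many datasets are processed.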

SRAgent agent

A general-purpose agent for extracting data from the SRA database. Available tools:

Entrez agent (see above)
SRA BigQuery
scraping NCBI webpage HTML
sra-stat and fastq-dump (directly assessing sequence data)

Example of converting a GEO accession to SRX accessions:

SRAgent sragent "Convert GSE121737 to SRX accessions"

Example of obtaining publications associated with a dataset accession:

SRAgent sragent "Obtain any available publications for GSE196830"

Example of obtaining specific metadata fields for a dataset:

SRAgent sragent "Which 10X Genomics technology was used for ERX11887200?"

SRX-info agent

Obtain specific metadata for one or more SRA datasets.

Input: one or more Entrez IDs
Output metadata fields:
- SRX accession for the Entrez ID
- SRR accessions for the SRX accession
- Is the dataset Illumina sequence data?
- Is the dataset single cell RNA-seq data?
- Is the dataset paired-end sequencing data?
- Which scRNA-seq library preparation technology?
- If 10X Genomics, which particular 10X technologies?
- Single nucleus or single cell RNA sequencing?
- Which organism was sequenced?
- Which tissue was sequenced?
- Any disease information?
- Any treatment/perturbation information?
- Any cell line information?
Workflow
- The agent converts the Entrez IDs to SRX accessions
- For each SRX accession, the agent obtains metadata
- The agent consolidates the metadata into a single report

As of now, the metadata fields are hard-coded into the agent. If you need alternative metadata fields, you will have to modify metadata.py.

Examples

A single SRA dataset:

SRAgent srx-info 25576380

Multiple SRA datasets:

SRAgent srx-info 36404865 36106630 32664033

Use the SQL database to filter out already-processed datasets:

SRAgent srx-info --use-database 18060880 27454880 27454942 27694586
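
When the Entrez IDs live in a file, the command line can be assembled programmatically. The sketch below assumes a plain-text file with one Entrez ID per line; that file layout and the helper names (`read_entrez_ids`, `build_srx_info_command`, `run_srx_info`) are illustrative assumptions, not part of SRAgent. Only the `srx-info` subcommand and the `--use-database` flag shown above are used.

```python
import subprocess


def read_entrez_ids(lines):
    """Parse one Entrez ID per line, skipping blank lines and '#' comments."""
    ids = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        if not text.isdigit():
            raise ValueError(f"not a numeric Entrez ID: {text!r}")
        ids.append(text)
    return ids


def build_srx_info_command(entrez_ids, use_database=False):
    """Return the argument list for an `SRAgent srx-info` call."""
    if not entrez_ids:
        raise ValueError("at least one Entrez ID is required")
    command = ["SRAgent", "srx-info"]
    if use_database:
        command.append("--use-database")
    command.extend(str(entrez_id) for entrez_id in entrez_ids)
    return command


def run_srx_info(entrez_ids, use_database=False):
    """Run the command; requires SRAgent to be installed and on PATH."""
    return subprocess.run(build_srx_info_command(entrez_ids, use_database), check=True)
```

Building the argument list separately from running it makes it possible to inspect or log the exact command before any API calls are made.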

Metadata agent

Similar to the SRX-info agent, but you provide the SRX accessions directly alongside the Entrez IDs. This saves compute time, since the agent does not need to convert the Entrez IDs to SRX accessions.

Provide a CSV of Entrez IDs and their associated SRX accessions. This is useful when you already have the SRX accessions and would otherwise pass the Entrez IDs to SRAgent srx-info.

The CSV should have the header: entrez_id,srx_accession.

The metadata fields are the same as the SRX-info agent.

Examples

SRAgent metadata "entrez-id_srx-accession.csv"
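
The input CSV can be produced with a few lines of Python. The sketch below writes the documented header followed by one row per pair. The helper name `pairs_to_csv` is hypothetical, the IDs and accessions in the usage example are placeholders rather than real records, and the prefix check (SRX, ERX, DRX) is an assumption about which experiment accessions are acceptable.

```python
import csv
import io

# Experiment accession prefixes used by the INSDC archives (assumed acceptable here).
EXPERIMENT_PREFIXES = ("SRX", "ERX", "DRX")


def pairs_to_csv(pairs):
    """Serialize (entrez_id, srx_accession) pairs with the documented header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["entrez_id", "srx_accession"])
    for entrez_id, accession in pairs:
        if not str(accession).startswith(EXPERIMENT_PREFIXES):
            raise ValueError(f"unexpected accession: {accession!r}")
        writer.writerow([entrez_id, accession])
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    text = pairs_to_csv([(111, "SRX0000001"), (222, "SRX0000002")])
    with open("entrez-id_srx-accession.csv", "w", newline="") as handle:
        handle.write(text)
```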

find-datasets agent

A high-level agent for finding datasets in the SRA via esearch and then processing them with the SRX-info agent.

Input: a search query
Output: metadata fields for the datasets found (same as SRX-info agent)
Workflow
- The agent uses esearch to find datasets
- The agent processes the datasets with the SRX-info agent
- The agent consolidates the metadata into a single report

Examples

SRAgent find-datasets "Obtain recent single cell RNA-seq datasets in the SRA database"

Target specific organisms

SRAgent find-datasets --no-summaries --max-datasets 1 --organisms pig -- \
  "Obtain recent single cell RNA-seq datasets in the SRA database"
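
The `--` in the command above separates the options from the search query. One plausible reason it is needed, assuming `--organisms` accepts one or more values, is that the parser could otherwise read the query as another organism. The parser below is a stand-in that mirrors the flags shown above. It is not SRAgent's actual CLI definition, and the argument types and the positional name `prompt` are assumptions.

```python
import argparse


def build_parser():
    """Stand-in parser mirroring the find-datasets flags shown above."""
    parser = argparse.ArgumentParser(prog="find-datasets-demo")
    parser.add_argument("--no-summaries", action="store_true")
    parser.add_argument("--max-datasets", type=int, default=None)
    # nargs="+" makes the option consume every following plain argument.
    parser.add_argument("--organisms", nargs="+", default=[])
    parser.add_argument("prompt")
    return parser


if __name__ == "__main__":
    query = "Obtain recent single cell RNA-seq datasets in the SRA database"
    parser = build_parser()
    # With "--", the query is parsed as the positional prompt.
    args = parser.parse_args(
        ["--no-summaries", "--max-datasets", "1", "--organisms", "pig", "--", query]
    )
    print(args.organisms, args.prompt)
```

Without the `--`, this stand-in parser would treat the query as a second organism and then exit with an error because the prompt is missing.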

Available organisms

Mammals
- Human (Homo sapiens)
- Mouse (Mus musculus)
- Rat (Rattus norvegicus)
- Macaque (Macaca mulatta)
- Marmoset (Callithrix jacchus)
- Horse (Equus caballus)
- Dog (Canis lupus)
- Bovine (Bos taurus)
- Sheep (Ovis aries)
- Pig (Sus scrofa)
- Rabbit (Oryctolagus cuniculus)
- Naked mole-rat (Heterocephalus glaber)
- Chimpanzee (Pan troglodytes)
- Gorilla (Gorilla gorilla)
Birds
- Chicken (Gallus gallus)
Amphibians
- Frog (Xenopus tropicalis)
Fish
- Zebrafish (Danio rerio)
Invertebrates
- Fruit fly (Drosophila melanogaster)
- Roundworm (Caenorhabditis elegans)
- Mosquito (Anopheles gambiae)
- Blood fluke (Schistosoma mansoni)
Plants
- Thale cress (Arabidopsis thaliana)
- Rice (Oryza sativa)
- Tomato (Solanum lycopersicum)
- Corn (Zea mays)

Using an SQL database to store results

Using the database (set DYNACONF to "test" to target the test database):

SRAgent find-datasets --use-database --no-summaries --max-datasets 1 --organisms rat -- \
  "Obtain recent single cell RNA-seq datasets in the SRA database"

Setting up the SQL database

Create a GCP PostgreSQL database. See the docs.
Required secrets:
- GCP_SQL_DB_PASSWORD
- Store in the .env file or GCP Secret Manager
- If using GCP Secret Manager, you must also provide:
  - GOOGLE_APPLICATION_CREDENTIALS
  - GCP_PROJECT_ID
Update the settings.py file with the database information.
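
The secret requirements above can be checked before the first run with `--use-database`. The sketch below reads the list as follows: with a .env file the password itself must be present, and with GCP Secret Manager the credentials and project ID must be present instead. The helper names, the `use_secret_manager` switch, and the simple KEY=VALUE parsing of .env are assumptions for illustration; SRAgent may load its settings differently.

```python
def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#") or "=" not in text:
            continue
        key, _, value = text.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_database_secrets(env, use_secret_manager):
    """Return the names of secrets that are required but not set."""
    if use_secret_manager:
        required = ("GOOGLE_APPLICATION_CREDENTIALS", "GCP_PROJECT_ID")
    else:
        required = ("GCP_SQL_DB_PASSWORD",)
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Placeholder content standing in for a real .env file.
    env = parse_env_lines(["# database", "GCP_SQL_DB_PASSWORD=change-me"])
    print(missing_database_secrets(env, use_secret_manager=False))
```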

Evaluations

See the eval.py script for running evaluations.

Contributing

Feel free to fork the repository and submit a pull request. However, the top priority is to keep SRAgent functioning for the ongoing scBaseCamp project.
