Agentic workflows for obtaining data from the Sequence Read Archive.
scBaseCount: An AI agent-curated, uniformly processed, and continually expanding single cell data repository. Nicholas D Youngblut, Christopher Carpenter, Jaanak Prashar, Chiara Ricci-Tam, Rajesh Ilango, Noam Teyssier, Silvana Konermann, Patrick Hsu, Alexander Dobin, David P Burke, Hani Goodarzi, Yusuf H Roohani. bioRxiv 2025.02.27.640494; doi: https://doi.org/10.1101/2025.02.27.640494
Create a conda environment [optional]:
mamba create -n sragent-env -y python=3.12 sra-tools=3.1 \
&& conda activate sragent-env
Clone the repository:
git clone https://github.com/ArcInstitute/SRAgent.git \
&& cd SRAgent
Create a uv venv:
uv venv
Source the venv:
source .venv/bin/activate
Install the package:
pip install .
For development, install the package with the dev
optional dependencies:
pip install -e ".[dev]"
OPENAI_API_KEY
= API key for using the OpenAI API- required when using OpenAI models
- See Configuring models information on setting models
ANTHROPIC_API_KEY
= API key for using the Anthropic API- required when using Claude models
EMAIL
= email for using the Entrez API- optional, but HIGHLY recommended
NCBI_API_KEY
= API key for using the Entrez API- optional, increases rate limits
DYNACONF
= switch between "test", "prod", and "claude" environments- optional, default is "prod"
- this affects the SQL database used and models selected
- no database is used by default
pip install pytest
pytest tests/
Components of SRAgent can use an SQL database to store the results.
This was crucial for the scBaseCount project, in order to:
- track which datasets had been processed
- quickly assess the progress of the project
However, for most users, the SQL database is not necessary. SRAgent does not use the SQL database by default.
Note: currently only a GCP Postgresql database is supported.
To set up the database, see Setting up the SQL Database.
The lowest-level agent in the SRAgent hiearchy.
The agent can call various Entrez tools (esearch
, efetch
, esummary
, and elink
).
Usually, the SRAgent agent will be more useful, since it includes more tools, including calling the Entrez agent.
SRAgent entrez "Convert GSE121737 to SRX accessions"
SRAgent entrez "Summarize SRX4967527"
A general tool for extracting data from the SRA database. The tools available:
- Entrez agent (see above)
- SRA BigQuery
- scraping NCBI webpage HTML
- sra-stat and fastq-dump (directly assessing sequence data)
using more tools than the Entrez agent
SRAgent sragent "Summarize SRX4967527"
SRAgent sragent "Obtain any available publications for GSE196830"
SRAgent sragent "Which 10X Genomics technology was used for ERX11887200?"
Obtain specific metadata for >=1 SRA dataset.
- Input: >=1 Entrez ID
- Output metadata fields:
- SRX accession for the Entrez ID
- SRR accessions for the SRX accession
- Is the dataset Illumina sequence data?
- Is the dataset single cell RNA-seq data?
- Is the dataset paired-end sequencing data?
- Which scRNA-seq library preparation technology?
- If 10X Genomics, which particular 10X technologies?
- Single nucleus or single cell RNA sequencing?
- Which organism was sequenced?
- Which tissue(s) were sequenced?
- Corresponding tissue ontology ID(s)
- Any disease information?
- Any treatment/perturbation information?
- Any cell line information?
- Workflow
- The agent converts the Entrez IDs to SRX accessions
- For each SRX accession, the agent obtains metadata
- The agent consolidates the metadata into a single report
As of now, the metadata fields are hard-coded into the agent. If you need alternative metadata fields, you will have to modify metadata.py.
A single SRA dataset:
SRAgent srx-info 25576380
Multiple SRA datasets:
SRAgent srx-info 36106630 32664033 27694586
Use the SQL database to filter out already-processed datasets:
SRAgent srx-info --use-database 18060880 27454880 27454942 27694586
Similar to the SRX-info
agent, but you can provide SRX accessions directly, instead of Entrez IDs.
This saves compute time, since the agent does not need to convert the Entrez IDs to SRX accessions.
Provide a CSV of Entrez IDs and their associated SRX accessions to obtain metadata.
Useful for when you already have the SRX accessions, instead of providing the Entrez IDs to SRAgent srx-info
.
The CSV should have the header: entrez_id,srx_accession
.
The metadata fields are the same as the SRX-info
agent.
SRAgent metadata "entrez-id_srx-accession.csv"
An agent for categorizing tissue descriptions using the Uberon ontology. The agent helps identify the most suitable Uberon ontology term for a given tissue description.
- Input: Free text description of one or more tissues
- Output: Uberon ontology IDs (UBERON:XXXXXXX) for each tissue description
- Workflow
- The agent processes each tissue description separately
- For each description, it identifies the most suitable Uberon ontology term
- The agent returns the corresponding Uberon ID for each tissue
Categorize a single tissue:
SRAgent tissue-ontology "Categorize the following tissue: brain"
Categorize multiple tissues:
SRAgent tissue-ontology "Tissues: lung, heart, liver"
Finding ontology terms for complex tissue descriptions:
SRAgent tissue-ontology "Find the ontology term for the thin layer of epithelial cells lining the alveoli in lungs"
A high-level agent for finding datasets in the SRA via esearch
and then
processing them with the SRX-info
agent.
- Input: a search query
- Output: metadata fields for the datasets found (same as
SRX-info
agent) - Workflow
- The agent uses
esearch
to find datasets - The agent processes the datasets with the
SRX-info
agent - The agent consolidates the metadata into a single report
- The agent uses
SRAgent find-datasets --max-datasets 2 \
"Obtain recent single cell RNA-seq datasets in the SRA database"
SRAgent --no-summaries \
find-datasets --max-datasets 1 --organisms pig -- \
"Obtain recent single cell RNA-seq datasets in the SRA database"
Available organisms
- Mammals
- Human (Homo sapiens)
- Mouse (Mus musculus)
- Rat (Rattus norvegicus)
- Macaque (Macaca mulatta)
- Marmoset (Callithrix jacchus)
- Horse (Equus caballus)
- Dog (Canis lupus)
- Bovine (Bos taurus)
- Sheep (Ovis aries)
- Pig (Sus scrofa)
- Rabbit (Oryctolagus cuniculus)
- Naked mole-rat (Heterocephalus glaber)
- Chimpanzee (Pan troglodytes)
- Gorilla (Gorilla gorilla)
- Cat (Felis catus)
- Bonobo (Pan paniscus)
- Green monkey (Chlorocebus aethiops)
- Gray short-tailed opposum (Monodelphis domestica)
- Goat (Capra hircus)
- Alpaca (Vicugna pacos)
- Chinchilla (Chinchilla lanigera)
- Domestic guinea pig (Cavia porcellus)
- Golden hamster (Mesocricetus auratus)
- Eurasian hedgehog (Erinaceus europaeus)
- American mink (Neovison vison)
- Sunda pangolin (Manis javanica)
- Platypus (Ornithorhynchus anatinus)
- Ferret (Mustela putorius)
- Northern tree shrew (Tupaia belangeri)
- Birds
- Chicken (Gallus gallus)
- Zebrafinch (Taeniopygia guttata)
- Goose (Anser cygnoides)
- Duck (Anas platyrhynchos)
- Reptiles
- Turtle (Trachemys scripta)
- Amphibians
- Frog (Xenopus tropicalis)
- Axolotl (Ambystoma mexicanum)
- Fish
- Zebrafish (Danio rerio)
- Salmon (Salmo salar)
- Stickleback (Gasterosteus aculeatus)
- Invertebrates
- Fruit fly (Drosophila melanogaster)
- Roundworm (Caenorhabditis elegans)
- Mosquito (Anopheles gambiae)
- Blood fluke (Schistosoma mansoni)
- Plants
- Thale cress (Arabidopsis thaliana)
- Rice (Oryza sativa)
- Tomato (Solanum lycopersicum)
- Corn (Zea mays)
- Microorganisms
- Metagenome
- Other
- Other
Using the test
database:
SRAgent --no-summaries \
find-datasets --use-database --tenant test \
--max-datasets 1 --organisms rat -- \
"Obtain recent single cell RNA-seq datasets in the SRA database"
By default, most SRAgent sub-commands (e.g., entrez
and sragent
) use "fancy" CLI output formatting:
--no-progress
: Disable progress bar--no-summaries
: Disable LLM summaries for each workflowstep
Example:
SRAgent --no-summaries --no-progress \
entrez "Convert GSE121737 to SRX accessions"
The models used by SRAgent are configured in the settings.yml file. Options for updating the settings:
- Create a new settings yaml file
- Set the
DYNACONF_SETTINGS_PATH
environment variable to the path to the new settings file- e.g.,
export DYNACONF_SETTINGS_PATH=/path/to/settings.yml
- e.g.,
- No need to (re)install the package. The settings will be loaded from the new file.
- Clone the repository
- Update the settings.yml file
- (Re)install the the package
- e.g.,
pip install .
- e.g.,
SRAgent supports using Anthropic's Claude models:
- Set the
ANTHROPIC_API_KEY
environment variable to your Anthropic API key - Switch to the Claude environment using
export DYNACONF=claude
- Update the settings.yml file, as needed
- See Configuring models information on setting model parameters
- Run SRAgent commands as usual
Claude models support different reasoning effort levels:
low
: 1024 thinking tokens (best for simple tasks)medium
: 4096 thinking tokens (good balance)high
: 16384 thinking tokens (best for complex reasoning)- Anything else: Disables thinking tokens feature
Example:
export ANTHROPIC_API_KEY=your_api_key
export DYNACONF="claude"
SRAgent entrez "Convert GSE121737 to SRX accessions"
You can also customize the specific Claude model in settings.yml:
claude:
models:
default: "claude-sonnet-4-0" # Or any other Claude model version
temperature:
default: 0.1
reasoning_effort:
default: "medium" # Set your preferred reasoning effort; use "" to disable
- Create a GCP Postgresql database. See the docs.
- Required secrets:
GCP_SQL_DB_PASSWORD
- Store in the
.env
file or GCP Secret Manager - If using GCP Secret Manager, you must also provide:
GOOGLE_APPLICATION_CREDENTIALS
GCP_PROJECT_ID
- Update the settings.py file with the database information.
See the eval.py script for running evaluations.
Feel free to fork the repository and submit a pull request. However, the top priority is to keep SRAgent functioning for the ongoing scBaseCount project.