
HolobiomicsLab/MetaboT


General Information

Take a break, brew a cup of tea while πŸ§ͺ MetaboT 🍡 digs into mass spec data!

πŸ§ͺ MetaboT 🍡 is an AI system that accelerates mass spectrometry-based metabolomics data mining. Leveraging advanced large language models and knowledge graph technologies, πŸ§ͺ MetaboT 🍡 translates natural language questions into SPARQL queries, enabling researchers to explore and interpret complex metabolomics datasets. Built in Python and powered by state-of-the-art libraries, πŸ§ͺ MetaboT 🍡 offers an intuitive chat interface that bridges the gap between data complexity and user-friendly access. πŸ§ͺ MetaboT 🍡 can be installed locally, and you can try our demo instance, running on an open dataset of 1,600 plant extracts, at https://metabot.holobiomicslab.eu.


Documentation

Comprehensive documentation is available at https://holobiomicslab.github.io/MetaboT/. It includes:

  • Installation and Quick Start Guides
  • User Guide with configuration details
  • API Reference for core components, agents, and graph management
  • Usage Examples for both basic and advanced scenarios
  • Contributing Guidelines

The documentation is automatically built and deployed using GitHub Actions on every push to the main branch.

To preview and build the documentation locally:

# Install the required dependencies
pip install mkdocs mkdocs-material mkdocstrings mkdocstrings-python

# To serve documentation locally, run:
mkdocs serve

# To build the documentation, run:
mkdocs build

Citation, Institutions & Funding Support

If you use or reference πŸ§ͺ MetaboT 🍡 in your research, please cite it as follows:

πŸ§ͺ MetaboT 🍡: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge
Madina Bekbergenova, Lucas Pradi, Benjamin Navet, Emma Tysinger, Matthieu Feraud, Yousouf Taghzouti, Martin Legrand, Tao Jiang, Franck Michel, Yan Zhou Chen, Soha Hassoun, Olivier Kirchhoffer, Jean-Luc Wolfender, Florence Mehl, Marco Pagni, Wout Bittremieux, Fabien Gandon, Louis-FΓ©lix Nothias. PREPRINT (Version 1) available at Research Square DOI

Institutions:

  • UniversitΓ© CΓ΄te d'Azur, CNRS, ICN, Nice, France
  • Interdisciplinary Institute for Artificial Intelligence (3iA) CΓ΄te d'Azur, Sophia-Antipolis, France
  • Department of Computer Science, University of Antwerp, Antwerp, Belgium
  • Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
  • INRIA, UniversitΓ© CΓ΄te d'Azur, CNRS, I3S, France
  • Department of Computer Science, Tufts University, Medford, MA 02155, USA
  • Department of Chemical and Biological Engineering, Tufts University, Medford, MA 02155, USA
  • Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Centre MΓ©dical Universitaire, Geneva, Switzerland
  • School of Pharmaceutical Sciences, University of Geneva, Centre MΓ©dical Universitaire, Geneva, Switzerland
  • Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

Lab Websites:

Funding Support:
This work was supported by the French government through the France 2030 investment plan managed by the National Research Agency (ANR), as part of the Initiative of Excellence UniversitΓ© CΓ΄te d'Azur (ANR-15-IDEX-01) and served as an early prototype for the MetaboLinkAI project (ANR-24-CE93-0012-01). This work also benefited from project 189921 funded by the Swiss National Foundation (SNF).


Prepare Your Mass Spectrometry Data

To use πŸ§ͺ MetaboT 🍡, your mass spectrometry processing and annotation results must first be represented as a knowledge graph, with the corresponding SPARQL endpoint deployed. You can use the Experimental Natural Products Knowledge Graph library for this purpose; see the ENPKG repository.

By default, πŸ§ͺ MetaboT 🍡 connects to the public SPARQL endpoint of the ENPKG knowledge graph, which hosts an open and reusable annotated mass spectrometry dataset derived from a chemodiverse collection of 1,600 plant extracts. For further details, please refer to the associated publication.
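Under the hood, querying such an endpoint boils down to an HTTP request carrying a SPARQL query. A minimal standard-library sketch, for illustration only (the endpoint URL below is a placeholder, not the real ENPKG address; the actual endpoint is configured in app/config/sparql.ini):

```python
import urllib.parse
import urllib.request

# Placeholder endpoint for illustration only; substitute the actual
# public ENPKG SPARQL endpoint configured in app/config/sparql.ini.
ENDPOINT = "https://example.org/enpkg/sparql"

def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build (without sending) an HTTP POST request for a SPARQL query."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )

req = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
```

Sending `req` with `urllib.request.urlopen` would return JSON-formatted SPARQL results; πŸ§ͺ MetaboT 🍡 generates and dispatches such queries for you from natural language.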


Hardware

  • CPU: Any modern processor
  • RAM: At least 8GB

Software Requirements

OS Requirements

This package has been tested on:

  • macOS: Sonoma (14.5)
  • Linux: Ubuntu 22.04 LTS, Debian 11

It should also work on other Unix-based systems. If you run into trouble, check GitHub Issues for known compatibility notes.


Installation Guide πŸš€

Prerequisites

  1. Conda Installation

  2. API Keys

    Required API keys:

    • Get an API key for your chosen language model:
      • OpenAI API Key: Get it from OpenAI Platform
      • DeepSeek API Key: Get it from DeepSeek
      • Claude API Key: Get it from Anthropic
      • Or other models supported by LiteLLM

    Disclaimer: Most LLM APIs are commercial and paid services. Our default model is gpt-4o, and its usage will incur costs according to the provider's pricing policy.

    Data Privacy: Please note that data submitted to LLM APIs is subject to their respective privacy policies. Avoid sending sensitive or confidential information, as data may be logged for quality assurance and research purposes.

    Optional API keys:

    • LangSmith API Key: Used to view interaction traces in LangSmith. This is free.

    Create a .env file in the root directory with your credentials:

    OPENAI_API_KEY=your_openai_key_here
    LANGCHAIN_API_KEY=your_langsmith_key_here
    LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
    LANGCHAIN_PROJECT=metabot_project 

    Note: The system can also be used with other LLM models, such as Meta-Llama-3_1-70B-Instruct and deepseek-reasoner. For Meta-Llama-3_1-70B-Instruct (hosted on OVH Cloud), add the OVHCLOUD_API_KEY API key to your .env file; for deepseek-reasoner, add DEEPSEEK_API_KEY. Detailed information on configuring other LLM models is available here. Currently, all agents use the OpenAI model gpt-4o (including the SPARQL generation chain). If the initial query yields no results, a SPARQL improvement chain using the OpenAI o3-mini model is activated.
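For reference, a .env file like the one above is just KEY=VALUE lines. A minimal standard-library sketch of a loader (πŸ§ͺ MetaboT 🍡 itself may rely on a package such as python-dotenv, so treat this as illustrative):

```python
import os
from pathlib import Path

def load_dotenv(path: str = ".env") -> dict:
    """Minimal .env loader: read KEY=VALUE lines into os.environ.

    Stdlib stand-in for the python-dotenv package, shown here only to
    illustrate the file format; blanks and # comments are skipped.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        # Do not clobber variables already set in the real environment
        os.environ.setdefault(key.strip(), value.strip())
    return loaded

config = load_dotenv()
```

Either way, the keys end up in the process environment, which is where the application reads them from.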

Installation Steps

  1. Clone the Repository

    git clone https://github.com/holobiomicslab/MetaboT.git
    cd MetaboT
    git checkout dev
  2. Create and Activate the Conda Environment

    For macOS:

    conda env create -f environment.yml
    conda activate metabot

    For Linux:

    # Update system dependencies first
    sudo apt-get update
    sudo apt-get install -y python3-dev build-essential
    
    # Then create and activate the conda environment
    conda env create -f environment.yml
    conda activate metabot

    For Windows (using WSL):

    Install WSL if you haven't already:

    wsl --install

    Open WSL and install the required packages:

    sudo apt-get update
    sudo apt-get install -y python3-dev build-essential

    Install Miniconda in WSL:

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc

    Create and activate the conda environment:

    conda env create -f environment.yml
    conda activate metabot

Pro-tip: If you hit any issues with psycopg2, the environment.yml uses psycopg2-binary for maximum compatibility.


Application Startup Instructions ▢️

The application is structured as a Python module with dot notation importsβ€”so choose your style, whether absolute (e.g., app.core.module1.module2) or relative (e.g., ..core.module1.module2).

Demo

To launch the application, use Python's -m option. The main entry point is in app.core.main.

To try one of the standard questions, run the following command:

cd MetaboT
python -m app.core.main -q 1

Here, the number following -q specifies the question number from the standard questions, which can be viewed in app/data/standard_questions.txt. Expected output includes runtime metrics and a welcoming prompt. 😎

Running with a Custom Question

python -m app.core.main -c "Your custom question"
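The two invocation styles above suggest a simple command-line interface. Here is a hypothetical argparse sketch mirroring the -q and -c flags (the real argument handling lives in app/core/main.py and may differ in its details):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the CLI described above.

    Illustrative only; see app/core/main.py for the real argument
    handling used by `python -m app.core.main`.
    """
    parser = argparse.ArgumentParser(prog="python -m app.core.main")
    parser.add_argument(
        "-q", "--question", type=int,
        help="index of a standard question in app/data/standard_questions.txt",
    )
    parser.add_argument(
        "-c", "--custom", type=str,
        help="free-form natural language question",
    )
    return parser

args = build_parser().parse_args(["-q", "1"])
```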

Running via Streamlit

To launch the application through Streamlit, set the required environment variables, install the dependencies, and run the app. In your terminal, execute:

export ADMIN_OPENAI_KEY=your_openai_api_key
export LANGCHAIN_API_KEY=your_langchain_api_key
pip install -r requirements.txt
streamlit run streamlit_webapp/streamlit_app.py

If you encounter an error stating that the app directory cannot be found (e.g., "ModuleNotFoundError: No module named 'app'"), it means Python is unable to locate the module. To resolve this, add the current directory to your PYTHONPATH by running:

export PYTHONPATH="${PYTHONPATH}:$(pwd)"

This command ensures that Python can locate the app directory.
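The same fix can be applied from inside Python before importing any app modules, using only the standard library:

```python
import os
import sys

# Equivalent of `export PYTHONPATH="${PYTHONPATH}:$(pwd)"`: make the
# repository root importable so that `import app` resolves.
repo_root = os.getcwd()
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)
```

Run this (or the export above) from the repository root, otherwise the path added will not contain the app directory.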


Running in Docker

If you prefer to run the application in a containerized environment, Docker support is provided. Make sure Docker and docker-compose are installed on your system.

Building the Docker Image

To build the Docker image, run:

docker-compose build

Running the Application

To launch the application and run the first standard question, execute:

docker-compose run metabot python -m app.core.main -q 1

This command will start the container, run the application inside Docker, and process the first standard question from app/data/standard_questions.txt. You can adjust parameters as needed.


Project Structure

.
β”œβ”€β”€ README.md
β”œβ”€β”€ app
β”‚   β”œβ”€β”€ config
β”‚   β”‚   β”œβ”€β”€ langgraph.json
β”‚   β”‚   β”œβ”€β”€ logging.ini
β”‚   β”‚   β”œβ”€β”€ logs
β”‚   β”‚   β”‚   └── app.log
β”‚   β”‚   β”œβ”€β”€ params.ini
β”‚   β”‚   └── sparql.ini
β”‚   β”œβ”€β”€ core
β”‚   β”‚   β”œβ”€β”€ agents
β”‚   β”‚   β”‚   β”œβ”€β”€ agents_factory.py
β”‚   β”‚   β”‚   β”œβ”€β”€ enpkg
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ agent.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ prompt.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ tool_chemicals.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ tool_smiles.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ tool_target.py
β”‚   β”‚   β”‚   β”‚   └── tool_taxon.py
β”‚   β”‚   β”‚   β”œβ”€β”€ entry
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ agent.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ prompt.py
β”‚   β”‚   β”‚   β”‚   └── tool_filesparser.py
β”‚   β”‚   β”‚   β”œβ”€β”€ interpreter
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ agent.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ prompt.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ tool_interpreter.py
β”‚   β”‚   β”‚   β”‚   └── tool_spectrum.py  
β”‚   β”‚   β”‚   β”œβ”€β”€ sparql
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ agent.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ prompt.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ tool_merge_result.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ tool_sparql.py
β”‚   β”‚   β”‚   β”‚   └── tool_wikidata_query.py
β”‚   β”‚   β”‚   β”œβ”€β”€ validator
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ agent.py
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ prompt.py
β”‚   β”‚   β”‚   β”‚   └── tool_validator.py
β”‚   β”‚   β”‚   └── supervisor
β”‚   β”‚   β”‚       β”œβ”€β”€ agent.py
β”‚   β”‚   β”‚       └── prompt.py
β”‚   β”‚   β”œβ”€β”€ graph_management
β”‚   β”‚   β”‚   └── RdfGraphCustom.py
β”‚   β”‚   β”œβ”€β”€ main.py
β”‚   β”‚   β”œβ”€β”€ memory
β”‚   β”‚   β”‚   β”œβ”€β”€ custom_sqlite_file.py
β”‚   β”‚   β”‚   β”œβ”€β”€ database_manager.py
β”‚   β”‚   β”‚   β”œβ”€β”€ test_db_connection.py
β”‚   β”‚   β”‚   └── tools_database.py
β”‚   β”‚   β”œβ”€β”€ utils.py
β”‚   β”‚   β”œβ”€β”€ workflow
β”‚   β”‚   β”‚   └── langraph_workflow.py
β”‚   β”‚   └── tests
β”‚   β”‚       β”œβ”€β”€ evaluation.py
β”‚   β”‚       └── test_utils.py
β”‚   β”œβ”€β”€ data
β”‚   β”‚   β”œβ”€β”€ submitted_plants.csv
β”‚   β”‚   β”œβ”€β”€ npc_class.csv
β”‚   β”‚   └── evaluation_dataset.csv
β”‚   β”œβ”€β”€ graphs
β”‚   β”‚   β”œβ”€β”€ graph.pkl
β”‚   β”‚   └── schema.ttl
β”‚   └── notebooks
β”œβ”€β”€ docs
β”‚    β”œβ”€β”€ api-reference
β”‚    β”œβ”€β”€ assets
β”‚    β”œβ”€β”€ examples
β”‚    β”œβ”€β”€ getting-started
β”‚    β”œβ”€β”€ user-guide
β”‚    β”œβ”€β”€ contributing.md
β”‚    └── index.md
β”œβ”€β”€ streamlit_webapp
β”‚   β”œβ”€β”€ streamlit_app.py
β”‚   └── streamlit_utils.py
β”œβ”€β”€ environment.yml
β”œβ”€β”€ mkdocs.yml
└── requirements.txt

Agent Setup Guidelines πŸ§‘β€πŸ’»

Agent Directory Creation

Create a dedicated folder for your agent within the app/core/agents/ directory. See here.

Standard File Structure

  • Agent (agent.py): Copy from an existing agent unless your tool requires private class property access. Refer to "If Your Tool Serves as an Agent" for special cases.

    Psst... don't let the complexities of Python imports overcomplicate your flowβ€”trust the process!

  • Prompt (prompt.py): Adapt the prompt for your specific context/tasks. Configure MODEL_CHOICE; the default is llm-o, which maps to gpt-4o (see app/config/params.ini).

  • Tools (tool_xxxx.py) (optional): Inherit from the LangChain BaseTool, defining:

    • name, description, args_schema
    • A Pydantic model for input validation
    • The _run method for execution
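The pieces above fit together roughly as follows. This is a standard-library stand-in that mirrors the contract only; a real tool would subclass LangChain's BaseTool and use a Pydantic model, and the taxon example here is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class TaxonInput:
    """Stand-in for the Pydantic args_schema validating tool input."""
    taxon_name: str

    def __post_init__(self):
        if not self.taxon_name.strip():
            raise ValueError("taxon_name must be a non-empty string")

class TaxonLookupTool:
    """Mirrors the LangChain BaseTool contract: name, description,
    args_schema, and a _run method. Illustrative only; a real agent
    tool inherits from langchain's BaseTool instead.
    """
    name = "taxon_lookup"
    description = "Resolve a taxon name for downstream SPARQL queries."
    args_schema = TaxonInput

    def _run(self, taxon_name: str) -> str:
        # Validate input through the schema, then do the actual work
        validated = self.args_schema(taxon_name=taxon_name)
        return f"resolved:{validated.taxon_name.lower()}"

result = TaxonLookupTool()._run("Arabidopsis thaliana")
```

The supervisor picks the tool by its name and description, so keep both short and unambiguous.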

Supervisor Configuration

Modify the supervisor prompt (see supervisor prompt) to detect and select your agent. Our AI PR-Agent πŸ€– is triggered automatically through issues and pull requests, so you'll be in good hands!

Configuration Updates

Update app/config/langgraph.json to include your agent in the workflow. For reference, see langgraph.json.

If Your Tool Serves as an Agent

For LLM interaction, make sure the additional class properties are set in agent.py (refer to tool_sparql.py and agent.py). Keep it snazzy and smart!


Development Guidelines

Contributing to πŸ§ͺ MetaboT 🍡

We push contributions to the dev branch here on GitHub. Please create your own branch (either user-centric like dev_benjamin or feature-centric like dev_langgraph) and submit a pull request to the dev branch when you're ready for review. Our AI PR-Agent πŸ€– is always standing by to help with pull requests and even handle issues smartly, because why not let a smarty-pants bot lend a hand?

Documentation Standards

  • Use Google Docstring Format
  • Consider the Mintlify Doc Writer for VSCode for automatically generated, precise docstrings.

Code Formatting

  • Stick to PEP8
  • Leverage the Black Formatter for a neat, uniform style.

Because code deserves to look as sharp as your ideas. 😎

Good Practices with Keys

For scalable production deployments, pass API keys as explicit parameters rather than relying on environment variables.
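For illustration, a hypothetical client that accepts the key as a parameter while still falling back to the environment for local development (this class and its fallback behavior are not part of πŸ§ͺ MetaboT 🍡; the real clients are configured via app/config/params.ini and the agent factory):

```python
import os
from typing import Optional

class LLMClient:
    """Hypothetical client illustrating the key-as-parameter guideline."""

    def __init__(self, api_key: Optional[str] = None):
        # Prefer the explicit parameter (scales to multi-tenant
        # deployments); fall back to the environment for local use.
        self.api_key = api_key or os.environ.get("OPENAI_API_KEY")
        if not self.api_key:
            raise ValueError("No API key provided")

client = LLMClient(api_key="sk-example")
```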


Logging Guidelines

Centralized logging resides in app/config/logging.ini. See here.

Use the following snippet at the start of your Python scripts:

from pathlib import Path
import logging.config

parent_dir = Path(__file__).parent.parent
config_path = parent_dir / "config" / "logging.ini"
logging.config.fileConfig(config_path, disable_existing_loggers=False)
logger = logging.getLogger(__name__)

Pro-tip: Use logger over print for more elegant and traceable output.


Additional Resources


Contributing 🀝

We warmly welcome your contributions! Here's how to dive in:

  1. Fork & Clone
    • Fork the repo on GitHub and clone your fork.
  2. Create a Feature Branch
    • Branch from dev (e.g., dev_your_branch_name).
  3. Develop Your Feature
    • Write clean code with clear documentation (Google Docstring format is preferred).
    • Our AI PR-Agent πŸ€– automatically kicks in when you raise an issue or a pull request.
  4. Commit
    • Use atomic commits with present-tense messages:
      git commit -m "Add new agent for processing chemical data"
    • That's the secret sauce to a smooth GitHub PR journey!
  5. Submit a Pull Request
    • Push your changes and create a PR against the dev branch. Fill out all necessary details, including links to related issues (e.g., GitHub Issues).

Pull Request Process

  • Update documentation, run tests, and ensure your code is formatted.
  • The AI PR-Agent is active and will provide first-line feedback!

Code Quality Guidelines

  • Write meaningful tests.
  • Maintain rich inline documentation.
  • Adhere to PEP8 and best practices.

Reporting Issues

For bug reports or feature requests, please use our GitHub Issues page.


Your contributions make πŸ§ͺ MetaboT 🍡 awesome! Thank you for being part of our journey and for keeping the code as sharp as your wit. πŸ˜ŽπŸš€


License

πŸ§ͺ MetaboT 🍡 is open source and released under the Apache License 2.0. This license allows you to freely use, modify, and distribute the software, provided that you include the original copyright notice and license terms.


β˜• πŸ§ͺ MetaboT 🍡 Tea Time Word Game 🍡

Take a break, brew a cup of tea, and have some fun with words while πŸ§ͺ MetaboT 🍡 digs into mass spec data!

Here's a little puzzle to steep your brain:

  1. Unscramble the letters in t-e-a-m-o-b-o-t to reveal the secret spice behind our data wizard!
  2. What do you get when you mix a hot cup of tea with a powerful AI? Absolutely tea-rific insights!

Remember: While you relax with your favorite treat, πŸ§ͺ MetaboT 🍡 is busy infusing data with meaning. Sip, smile, and let the insights steep into brilliance!

Enjoy your brew and happy puzzling!
