Skip to content

Python tool for scraping clinical trial data, processing it with an LLM via OpenRouter API, and exporting the results to CSV.

License

Notifications You must be signed in to change notification settings

mrjxtr/Data_Extractor_LLM_Parser_Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CLINICAL TRIAL DATA EXTRACTOR WITH LLM PARSING

Project Summary 📝

The Clinical Trial Data Extractor with LLM Parsing project scrapes clinical trial data from a specified website (which will remain unnamed), processes it using a Large Language Model (LLM) via the OpenRouter API, and exports the results to a CSV file. This tool is designed for researchers, providing a streamlined and customizable solution for extracting and analyzing clinical trial data.


LinkedIn Upwork Facebook Instagram Threads Twitter Gmail

Report outline 🧾

Features 🚀

  • Customizable Scraping: Extract clinical trial data based on user-defined keywords entered via the terminal.
  • LLM-Powered Analysis: Process scraped data using advanced LLM models through OpenRouter API.
  • CSV Output: Generate CSV for trial data processed from the LLM response.
  • Data Control: Specify the number of pages to scrape, giving control over the data volume.
  • Page Count Detection: Automatically retrieves the total number of pages for any search query.
  • Automated Directory Setup: Automatically creates required directories for storing scraped and processed data.
  • Modular Design: Clean architecture with separate modules for scraping, processing, and saving data.
  • Real-Time Feedback: Displays live progress updates during scraping and data processing phases.
  • Error Handling: Robust error management for network issues and unexpected data formats.

Requirements 💻

  • Python 3.12.5+
  • All required packages are listed in requirements.txt.

Installation ⚙️

  1. Clone the repository:

    git clone https://github.com/mrjxtr/Clinical_Trial_Data_Extractor.git
    cd Clinical_Trial_Data_Extractor
  2. Install the required dependencies:

    pip install -r requirements.txt
  3. Configure your OpenRouter API key by adding it to the .env file or directly in src/main.py.

Usage 🖥

Run the script using:

python src/main.py

You will be prompted to provide a search keyword and specify the number of pages to scrape.

Project Structure 📂

  • src/main.py: Main orchestrator for scraping, processing, and saving data.
  • src/scraper.py: Contains the Scraper class for fetching clinical trial data.
  • src/llm_processor.py: Implements the LLMProcessor class for analyzing data with the LLM.
  • src/data_saver.py: Saves processed data in CSV format.
  • src/prompts.py: Houses customizable LLM prompt templates.

Notes 📌

  • Randomized Delays: To avoid server overload, requests include randomized delays.
  • Compliance: Always adhere to the website's terms of service when scraping data.
  • OpenRouter API Usage: Ensure you have sufficient API credits and follow OpenRouter's usage policies.
  • Ethical Considerations: Use this tool responsibly and only for research purposes. It is not intended for medical diagnosis or treatment.
  • Maintenance: Updates may be needed to adapt to changes in the website, LLM models, or API specifications.
  • Debugging: If issues occur with LLM parsing or CSV saving, additional debugging may be required.
  • Environment: Ensure a stable internet connection for running the script on a single machine.

Important: The current parser is optimized for "Breast Cancer" search results. You may need to modify the parser to suit other use cases. All intermediate data is stored in the output/ directory. The parsing code is located in src/llm_processor.py with the parse_llm_response function.

About

Python tool for scraping clinical trial data, processing it with an LLM via OpenRouter API, and exporting the results to CSV.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages