SciDaEx is an open-source system for extracting and structuring data from scientific literature using Large Language Models (LLMs). It integrates a computational backend with an interactive user interface to support efficient data extraction, structuring, and refinement for evidence synthesis in scientific research.
Note: This repository contains two main branches: `main`, the latest version, optimized for customization and extension, and `scidasynth`, the original version, as described in our research paper.
- 🔍 Automated data extraction from scientific papers (text, tables, and figures)
- 📊 Structured data table output in standardized formats
- 🖥️ Interactive user interface for data validation and refinement
- 🚀 Retrieval-augmented generation (RAG) for enhanced accuracy and speed
- 📈 Quality evaluation metrics for extracted data
- 👥 Support for both technical and non-technical users

To install from source:

    # Clone the repository
    git clone https://github.com/xingbow/SciDaEx.git
    cd SciDaEx

    # Set up a virtual environment
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

    # Install backend dependencies (Python 3.10)
    pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"

    # Install frontend dependencies
    cd frontend
    npm install

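
Before installing the backend dependencies, it can help to confirm that the active interpreter and environment match what the steps above expect. The following is a minimal sketch, not part of SciDaEx: the helper names `python_matches` and `in_virtualenv` are hypothetical, and it assumes the "Python 3.10" note means the backend targets the 3.10 series.

```python
import sys

# Assumption: the "Python 3.10" note in the install step means the backend
# targets the 3.10 series; adjust if the project documents another range.
EXPECTED_PYTHON = (3, 10)


def python_matches(version_info=sys.version_info, expected=EXPECTED_PYTHON):
    """Return True when the interpreter's major.minor equals `expected`."""
    return tuple(version_info[:2]) == tuple(expected)


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when running inside an environment made by `python -m venv`."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the interpreter the venv was created from.
    return prefix != base_prefix


if __name__ == "__main__":
    if not python_matches():
        major, minor = sys.version_info[:2]
        print(f"Warning: running Python {major}.{minor}, expected 3.10")
    if not in_virtualenv():
        print("Warning: no virtual environment appears to be active")
```

Running the script after `source venv/bin/activate` should print nothing when both checks pass.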
- Backend configuration
  - Create a `config.yml` file in the `backend/app/dataService` directory.
  - Update the `config.yml` file with the required configurations:

        openai_key: your_openai_api_key
        adobe_credentials:
          client_id: your_adobe_client_id
          client_secret: your_adobe_client_secret

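
A quick way to catch a missing or unfilled credential is to check the parsed configuration before starting the backend. The sketch below is illustrative only and is not how SciDaEx itself reads the file: `missing_keys` and `load_config` are hypothetical helpers, the required keys are taken from the sample `config.yml`, and parsing assumes PyYAML is installed.

```python
from pathlib import Path

# Keys taken from the sample config.yml; nested keys use dotted paths.
REQUIRED_KEYS = (
    "openai_key",
    "adobe_credentials.client_id",
    "adobe_credentials.client_secret",
)


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted paths that are absent, empty, or still placeholders."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            node = node.get(part) if isinstance(node, dict) else None
        # Values such as "your_openai_api_key" are the sample placeholders.
        if not node or str(node).startswith("your_"):
            missing.append(dotted)
    return missing


def load_config(path="backend/app/dataService/config.yml"):
    """Parse the YAML file. Assumes PyYAML is available in the environment."""
    import yaml  # imported lazily so missing_keys() works without PyYAML

    return yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
```

For example, `missing_keys(load_config())` returns an empty list once every credential has been filled in.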
- Place your PDF documents in the `backend/app/dataService/data` directory.
- Run the preprocessing script:

      cd backend/app/dataService
      python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta

  This script extracts tables, figures, and metadata from the PDFs and stores them in the respective directories.
  For details, please refer to the preprocessing documentation.

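
The preprocessing step can also be driven from Python, which makes it easy to fail early when the data directory is empty. This is a rough sketch under stated assumptions: the helper names are hypothetical, the flags are exactly those shown in the command above, and the working directory is assumed to be given relative to the repository root.

```python
import subprocess
import sys
from pathlib import Path


def find_pdfs(pdf_dir):
    """Return the PDF files directly inside `pdf_dir`, sorted by path."""
    return sorted(
        p for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_preprocess_command(pdf_dir="data", table_dir="data/table",
                             figure_dir="data/figure", meta_dir="data/meta"):
    """Build the argument list for preprocess.py using the documented flags."""
    return [
        sys.executable, "preprocess.py",
        "--pdf_dir", str(pdf_dir),
        "--table_dir", str(table_dir),
        "--figure_dir", str(figure_dir),
        "--meta_dir", str(meta_dir),
    ]


def run_preprocessing(work_dir="backend/app/dataService"):
    """Run preprocess.py from `work_dir` if at least one PDF is present."""
    pdfs = find_pdfs(Path(work_dir) / "data")
    if not pdfs:
        raise FileNotFoundError("no PDF files found in the data directory")
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_preprocess_command(), cwd=work_dir, check=True)
    return len(pdfs)
```

Calling `run_preprocessing()` from the repository root runs the same command as the shell version and returns the number of PDFs it found.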
- Start the backend server (from the repository root):

      cd backend
      python run-data-backend.py

- Start the frontend server (from the repository root):

      cd frontend
      npm run serve

- Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.

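
If the page does not load right away, the frontend may still be starting. A small polling helper can confirm when port 8080 begins accepting connections. This is a sketch only: `port_is_open` and `wait_for_port` are hypothetical names, and the only address assumed is the `localhost:8080` given above.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def wait_for_port(host="localhost", port=8080, attempts=30, delay=1.0):
    """Poll until the port accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_port()` after `npm run serve` returns `True` as soon as the development server is reachable, and `False` if it never comes up within the allotted attempts.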

| Period | Role | Contributor | Details |
|---|---|---|---|
| 2024-08-06 to present | Project Maintainer | Xingbo Wang | - |
| Until 2024-08-06 | Lead Developer | Xingbo Wang | 63 commits, +20,575 lines |
| Until 2024-08-06 | Contributor | Rui Sheng | 14 commits, +166 lines |
| Until 2024-08-06 | Contributor | Winston Tsui | 2 commits, +106 lines |

If you use the repository, please cite the following paper:

    @article{wang2024scidasynth,
      title={SciDaSynth: Interactive Structured Knowledge Extraction and Synthesis from Scientific Literature with Large Language Model},
      author={Wang, Xingbo and Huey, Samantha L and Sheng, Rui and Mehta, Saurabh and Wang, Fei},
      journal={arXiv preprint arXiv:2404.13765},
      year={2024}
    }