GitHub - open-sciencelab/GraphGen: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

📚 Table of Contents

📝 What is GraphGen?
🚀 Quick Start
📌 Latest Updates
🏗️ System Architecture
🍀 Acknowledgements
📚 Citation
📜 License

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. For details, see our paper and best practices.

It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
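To make the calibration idea concrete, here is a minimal sketch of the standard equal-width-bin expected calibration error (ECE). It is an illustration of the metric only, not GraphGen's own implementation: the function name, the `confidences`/`correct` inputs, and the default of 10 bins are assumptions chosen for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    if not confidences:
        return 0.0

    # Per-bin accumulators: sample count, summed confidence, number correct.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = hit / count
        mean_conf = conf_sum / count
        ece += (count / total) * abs(accuracy - mean_conf)
    return ece
```

A model that answers with confidence 0.9 but is right only three times out of four gets an ECE of about 0.15 under this sketch; a larger value means the model's stated confidence is a poorer guide to its actual accuracy.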

🚀 Quick Start

Try GraphGen online through the Web entrance or the Backup Web entrance.

If you have questions, check the FAQ, open a new issue, or join our WeChat group and ask.

Preparation

Install uv

```
# If you run into network issues, try installing uv with pipx or pip instead; see the uv docs for details
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository

```
git clone https://github.com/open-sciencelab/GraphGen
cd GraphGen
```

Create a new uv environment
```
uv venv --python 3.10
```
Configure the dependencies
```
uv pip install -r requirements.txt
```

Run Gradio Demo

```
uv run webui/app.py
```

Run from PyPI

Install GraphGen
```
uv pip install graphg
```

Run in CLI

```
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```

Run from Source

Configure the environment

Create a .env file in the root directory
```
cp .env.example .env
```

Set the following environment variables:

```
# Synthesizer: the model used to construct the knowledge graph and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee: the model to be trained on the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
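Before launching a run, it can help to confirm that all six variables are actually set. The following is a small pre-flight sketch, not part of GraphGen: the helper name `missing_env_vars` is hypothetical, and it only inspects the mapping it is given (by default the process environment), so it does not read the .env file itself.

```python
import os
from typing import Mapping

# Variable names taken from the list above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Illustrative call with an explicit mapping instead of the real environment.
example_env = {"SYNTHESIZER_MODEL": "your_synthesizer_model_name"}
print("Missing:", ", ".join(missing_env_vars(example_env)))
```

Calling `missing_env_vars()` with no argument checks the current process environment, which is what matters when the variables are exported in the shell or passed inline as in the CLI example.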

(Optional) To change the default generation configuration, edit the configs/graphgen_config.yaml file.

```
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
```
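If you point `input_file` at your own data, a quick sanity check that the file is well-formed JSON Lines (one JSON value per line) can save a failed run. The sketch below is an assumption-laden helper rather than anything GraphGen ships: the function name is hypothetical, and it deliberately does not check field names, since the record schema expected by GraphGen is not described here.

```python
import json
from pathlib import Path


def invalid_jsonl_lines(path: str) -> list[int]:
    """Return the 1-based numbers of non-blank lines that are not valid JSON."""
    bad: list[int] = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)
    return bad
```

For example, `invalid_jsonl_lines("resources/examples/raw_demo.jsonl")` returns an empty list when every non-blank line in that file parses as JSON.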

Run the generation script
```
bash scripts/generate.sh
```
Get the generated data
```
ls cache/data/graphgen
```

Run with Docker

Build the Docker image
```
docker build -t graphgen .
```
Run the Docker container
```
docker run -p 7860:7860 graphgen
```

📌 Latest Updates

2025.04.21: We have released the initial version of GraphGen.

🏗️ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

🍀 Acknowledgements

SiliconFlow: abundant LLM APIs, some models are free
LightRAG: simple and efficient graph retrieval solution
ROGRAG: a robustly optimized GraphRAG framework

📚 Citation

If you find this repository useful, please consider citing our work:

@misc{chen2025graphgenenhancingsupervisedfinetuning,
      title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation}, 
      author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
      year={2025},
      eprint={2505.20416},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20416}, 
}

📜 License

This project is licensed under the Apache License 2.0.
