GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper and best-practice guide for details.
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. GraphGen also uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to diversify the resulting QA data.
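To make the two ideas above concrete, here is a minimal sketch of a binned expected calibration error and a k-hop neighborhood lookup. It is an illustration under stated assumptions, not GraphGen's implementation: the function names, the equal-width binning, and the plain adjacency-dict graph are all hypothetical choices made for this example.

```python
from collections import deque


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Assumption: confidences are probabilities in [0, 1] and `correct` holds
    booleans of the same length. GraphGen's own scoring may differ.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; clamp so that conf == 1.0 falls into the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def k_hop_neighborhood(adjacency, start, k):
    """Return every node reachable from `start` in at most k hops (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

A large gap between confidence and accuracy signals poorly calibrated knowledge, which is the kind of signal the paragraph above describes using to prioritize QA generation. The k-hop set is one simple way to gather related entities around a seed node before generating a multi-hop question.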
Try it on the OpenXLab Application Center, and see the FAQ for common questions.
python webui/app.py
- Install GraphGen
pip install graphg
- Run in CLI
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
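If you prefer to launch the CLI from a Python script, the same invocation can be assembled as a command list plus an environment mapping. This is a minimal sketch: `build_graphg_call` is a hypothetical helper written for this example, and only the `graphg` command, the `--output_dir` flag, and the six variable names come from the command above.

```python
import os


def build_graphg_call(synth_model, synth_url, synth_key,
                      trainee_model, trainee_url, trainee_key,
                      output_dir="cache"):
    """Return (cmd, env) mirroring the shell invocation shown above."""
    env = dict(os.environ)  # copy, so the current process is left untouched
    env.update({
        "SYNTHESIZER_MODEL": synth_model,
        "SYNTHESIZER_BASE_URL": synth_url,
        "SYNTHESIZER_API_KEY": synth_key,
        "TRAINEE_MODEL": trainee_model,
        "TRAINEE_BASE_URL": trainee_url,
        "TRAINEE_API_KEY": trainee_key,
    })
    cmd = ["graphg", "--output_dir", output_dir]
    return cmd, env


# To actually run it (requires the graphg package to be installed):
#   import subprocess
#   cmd, env = build_graphg_call(...)
#   subprocess.run(cmd, env=env, check=True)
```

Passing the variables through `env` rather than exporting them globally keeps the API keys scoped to the single child process.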
- Install dependencies
pip install -r requirements.txt
- Configure the environment
- Create an .env file in the root directory
cp .env.example .env
- Set the following environment variables:
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model used to train with the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
- (Optional) To change the default generation settings, edit configs/graphgen_config.yaml.
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
- Run the generation script
bash scripts/generate.sh
- Get the generated data
ls cache/data/graphgen
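The source-install steps above can be sanity-checked with a short script: one function reports which required variables are missing from an .env file, and another lists whatever files ended up in the output directory. This is a sketch under assumptions: both helpers are hypothetical and not part of GraphGen, the .env parsing handles only simple KEY=VALUE lines, and nothing is assumed about the format of the generated files.

```python
from pathlib import Path

# Variable names taken from the configuration step above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL", "SYNTHESIZER_BASE_URL", "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL", "TRAINEE_BASE_URL", "TRAINEE_API_KEY",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


def list_generated_files(output_dir="cache/data/graphgen"):
    """Return all files under the output directory, sorted by path."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())
```

For example, `missing_vars(parse_env_text(Path(".env").read_text()))` returns an empty list once every variable is filled in, and `list_generated_files()` is a Python equivalent of the `ls` command above.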
- 2025.04.21: We have released the initial version of GraphGen.
See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
- SiliconCloud: abundant LLM APIs, some models are free
- LightRAG: simple and efficient graph retrieval solution
- ROGRAG: A Robustly Optimized GraphRAG Framework
If you find this repository useful, please consider citing our work:
@software{Chen_GraphGen_2025,
author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
month = apr,
title = {{GraphGen}},
url = {https://github.com/open-sciencelab/GraphGen},
year = {2025}
}
This project is licensed under the Apache License 2.0.