GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper and best-practice guide for details.

GraphGen first constructs a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in the LLM using the expected calibration error (ECE) metric and prioritizes generating QA pairs that target high-value, long-tail knowledge. It also uses multi-hop neighborhood sampling to capture complex relational information, and style-controlled generation to diversify the resulting QA data.
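Two of the ideas in the paragraph above can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not GraphGen's implementation: `expected_calibration_error` is the standard binned ECE over (confidence, correctness) pairs, and `k_hop_neighborhood` is a plain breadth-first collection of the nodes within k edges of a start node, with no sampling strategy applied on top. Both function names, the dictionary-based adjacency format, and the default of 10 bins are hypothetical choices made for illustration.

```python
from collections import deque


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    if not confidences:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map the confidence to a bin index; 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its calibration gap, weighted by its size.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def k_hop_neighborhood(adjacency, start, k):
    """Return the set of nodes within k edges of start (breadth-first)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # Do not expand beyond k hops.
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen
```

A model that answers with 90% confidence and is always right, and with 10% confidence and is always wrong, is off by 0.1 in each bin, so its ECE is about 0.1. A larger value suggests the model's confidence is a poor guide to what it actually knows.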

🚀 Quick Start

Try GraphGen on the OpenXLab Application Center, and see the FAQ for common questions.

Gradio Demo

python webui/app.py

Run from PyPI

  1. Install GraphGen

    pip install graphg
  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
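The command above depends on six environment variables. As a hedged convenience, the sketch below reports which of them are unset or empty before you launch the CLI. The variable names are taken from this README; the helper `missing_variables` and the `REQUIRED_VARS` tuple are hypothetical and not part of GraphGen.

```python
import os

# Variable names as listed in this README.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_variables(env=None):
    """Return the required variable names that are unset or blank."""
    if env is None:
        env = os.environ  # Default to the current process environment.
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_variables()` with no arguments checks the current environment; an empty list means all six variables are set.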

Run from Source

  1. Install dependencies
    pip install -r requirements.txt
  2. Configure the environment
    • Create an .env file in the root directory
      cp .env.example .env
    • Set the following environment variables:
      # Synthesizer is the model used to construct the KG and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model to be trained on the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
  3. (Optional) To change the default generation settings, edit configs/graphgen_config.yaml.
    # configs/graphgen_config.yaml
    # Example configuration
    data_type: "raw"
    input_file: "resources/examples/raw_demo.jsonl"
    # more configurations...
  4. Run the generation script
    bash scripts/generate.sh
  5. Get the generated data
    ls cache/data/graphgen
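A common slip in step 2 is leaving template values such as `your_api_key_for_synthesizer_model` in the .env file. The sketch below parses simple `KEY=VALUE` text and lists keys whose values still start with `your_`. It assumes the .env file uses the plain format shown in step 2; the helpers `parse_env_text` and `unchanged_placeholders` are hypothetical, and the parser deliberately ignores features such as `export` prefixes and escape sequences that a full dotenv library would handle.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and simple matching quote characters.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unchanged_placeholders(values, prefix="your_"):
    """Return the keys whose value still looks like a template placeholder."""
    return sorted(key for key, value in values.items() if value.startswith(prefix))
```

To check a real file, read it with `pathlib.Path(".env").read_text()` and pass the result to `parse_env_text`; any key returned by `unchanged_placeholders` still needs a real value.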

📌 Latest Updates

  • 2025.04.21: We have released the initial version of GraphGen.

🏗️ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and its core functionality.

Workflow

(Figure: GraphGen workflow diagram.)

🍀 Acknowledgements

  • SiliconCloud: abundant LLM APIs, some models are free
  • LightRAG: a simple and efficient graph retrieval solution
  • ROGRAG: a robustly optimized GraphRAG framework

📚 Citation

If you find this repository useful, please consider citing our work:

@software{Chen_GraphGen_2025,
author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
month = apr,
title = {{GraphGen}},
url = {https://github.com/open-sciencelab/GraphGen},
year = {2025}
}

📜 License

This project is licensed under the Apache License 2.0.