Generate synthetic datasets for instruction tuning and preference alignment using tools like `distilabel` for efficient and scalable data creation.


thibaud-perrin/synthetic-datasets

Synthetic Datasets

Synthetic data offers a way to expand or enhance datasets by generating artificial data that mimics real-world scenarios. With the advent of large language models, synthetic datasets have become a powerful tool for pre-training, fine-tuning, and evaluating models. This repository provides tools and techniques to generate synthetic datasets for instruction tuning and preference alignment using the distilabel framework.


Repository Overview

Synthetic Data Taxonomy

The repository focuses on two primary categories of synthetic data:

  1. Instruction Datasets: Used for instruction tuning, including techniques like basic prompting, SelfInstruct, EvolInstruct, and Magpie.
  2. Preference Datasets: Designed for preference alignment, involving the generation of multiple completions and the evaluation of their quality using tools like EvolQuality and UltraFeedback.
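The two categories map to two different record layouts. As an illustration, here are minimal schemas for each; the field names (`instruction`, `completion`, `chosen`, `rejected`) follow a common convention for these dataset types rather than a format mandated by distilabel:

```python
# Minimal record layouts for the two dataset types. Field names are a
# common convention (e.g. chosen/rejected pairs for preference data),
# shown here for illustration rather than as a fixed schema.

# Instruction dataset: one prompt paired with one completion.
instruction_record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "completion": "Synthetic data can supplement scarce real-world datasets.",
}

# Preference dataset: one prompt with a preferred and a dispreferred completion.
preference_record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "chosen": "Synthetic data can supplement scarce real-world datasets.",
    "rejected": "It is about data.",
}

assert set(preference_record) == {"instruction", "chosen", "rejected"}
```

Preference records in this shape can be fed directly to alignment trainers that expect chosen/rejected pairs.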

Contents

Instruction Datasets

Explore how to create datasets for instruction tuning. Techniques include:

  • Basic Prompting: Generating synthetic prompts and completions for fine-tuning.
  • SelfInstruct: Expanding datasets by generating diverse instructions from a seed dataset.
  • EvolInstruct: Iteratively evolving instructions to improve their complexity and domain relevance.
  • Magpie: Leveraging chat-template structures to generate efficient, multi-turn instruction datasets.
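To make the EvolInstruct idea concrete, here is a minimal sketch independent of distilabel: each evolution round wraps the current instruction in a meta-prompt asking an LLM to rewrite it along a randomly chosen strategy. The strategy texts and the `generate` callable are illustrative stand-ins, not the exact EvolInstruct prompts or API:

```python
import random

# Paraphrased evolution strategies for illustration; the real EvolInstruct
# prompts are more elaborate.
STRATEGIES = [
    "Add one extra constraint or requirement to the instruction.",
    "Deepen the instruction by asking for reasoning or justification.",
    "Make the instruction more specific to a concrete domain.",
]

def build_evolution_prompt(instruction: str, rng: random.Random) -> str:
    """Wrap an instruction in a meta-prompt asking a model to evolve it."""
    strategy = rng.choice(STRATEGIES)
    return (
        "Rewrite the instruction below to make it more complex.\n"
        f"Strategy: {strategy}\n"
        f"Instruction: {instruction}\n"
        "Rewritten instruction:"
    )

def evolve(instruction: str, generate, steps: int = 2, seed: int = 0) -> list:
    """Run `steps` evolution rounds, returning every intermediate version.

    `generate` is any callable mapping a prompt string to a model reply;
    in a real pipeline it would call an LLM.
    """
    rng = random.Random(seed)
    versions = [instruction]
    for _ in range(steps):
        versions.append(generate(build_evolution_prompt(versions[-1], rng)))
    return versions

# Stub generator for demonstration: echoes the instruction back with an
# added demand, where a real pipeline would query a model.
def stub(prompt: str) -> str:
    current = prompt.rsplit("Instruction: ", 1)[-1].split("\n")[0]
    return current + " Explain your reasoning."

history = evolve("Write a haiku about the sea.", stub, steps=2)
assert len(history) == 3  # seed instruction plus two evolved versions
```

Keeping every intermediate version, as `evolve` does, is useful for filtering: later rounds sometimes drift off-topic, and the history lets you back off to an earlier variant.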

Preference Datasets

Build on instruction generation techniques to create preference datasets for alignment tasks. Key methods include:

  • Model Pooling: Generating multiple completions for each prompt using diverse models and configurations.
  • EvolQuality: Improving the quality of generated responses through iterative evolution. (soon)
  • UltraFeedback: Scoring and critiquing responses to identify preferred completions and filter low-quality data. (soon)
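The methods above compose into a simple pipeline: pool completions from several models, score each one with a judge, and keep the best and worst as a chosen/rejected pair. The sketch below uses canned completions and a toy length-based score as a stand-in for the LLM judge that UltraFeedback-style scoring would use:

```python
def build_preference_pair(prompt: str, completions: dict, score) -> dict:
    """Turn a pool of model completions into one chosen/rejected record.

    `completions` maps a model name to its completion for `prompt`;
    `score` is any callable mapping (prompt, completion) to a number,
    standing in for an LLM judge.
    """
    ranked = sorted(completions.items(), key=lambda kv: score(prompt, kv[1]))
    (_, worst), (_, best) = ranked[0], ranked[-1]
    return {"instruction": prompt, "chosen": best, "rejected": worst}

# Demonstration with canned completions from three hypothetical models.
pool = {
    "model-a": "Paris is the capital of France.",
    "model-b": "Paris.",
    "model-c": "The capital of France is Paris, located on the Seine.",
}
pair = build_preference_pair(
    "What is the capital of France?",
    pool,
    score=lambda _p, completion: len(completion),  # toy heuristic, not a real judge
)
assert pair["chosen"] != pair["rejected"]
```

In practice the scoring callable would query a judge model that rates completions along axes such as helpfulness and truthfulness, and low-scoring pools would be filtered out entirely rather than turned into pairs.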

Notebooks

Instruction Dataset Notebook


Preference Dataset Notebook
