Synthetic data offers a way to expand or enhance datasets by generating artificial data that mimics real-world scenarios. With the advent of large language models, synthetic datasets have become a powerful tool for pre-training, fine-tuning, and evaluating models. This repository provides tools and techniques for generating synthetic datasets for instruction tuning and preference alignment using the distilabel framework.
The repository focuses on two primary categories of synthetic data:
- Instruction Datasets: Used for instruction tuning, including techniques like basic prompting, SelfInstruct, EvolInstruct, and Magpie.
- Preference Datasets: Designed for preference alignment, involving the generation of multiple completions and the evaluation of their quality using tools like EvolQuality and UltraFeedback.
Explore how to create instruction datasets for instruction tuning. Techniques include the following (a pipeline sketch follows the list):
- Basic Prompting: Generating synthetic prompts and completions for fine-tuning.
- SelfInstruct: Expanding datasets by generating diverse instructions from a seed dataset.
- EvolInstruct: Iteratively evolving instructions to improve their complexity and domain relevance.
- Magpie: Leveraging chat-template structures to generate efficient, multi-turn instruction datasets.
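
As a concrete starting point, here is a minimal basic-prompting sketch with distilabel (assuming distilabel 1.x with the Transformers extra installed; the seed prompt and model choice are illustrative, not prescribed by this repository). It chains two `TextGeneration` steps: the first invents a synthetic instruction, the second answers it to produce a completion.

```python
# pip install "distilabel[hf-transformers]"  # assumed install for this sketch
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="basic-prompting") as pipeline:
    # Seed prompt asking the model to invent a new instruction (illustrative).
    data = LoadDataFromDicts(
        data=[{"instruction": "Generate a question a student might ask about synthetic data."}]
    )
    llm = TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    # First pass: generate a synthetic instruction and feed it forward as "instruction".
    gen_instruction = TextGeneration(
        llm=llm, output_mappings={"generation": "instruction"}
    )
    # Second pass: answer the synthetic instruction to produce a completion.
    gen_response = TextGeneration(
        llm=llm, output_mappings={"generation": "response"}
    )
    data >> gen_instruction >> gen_response

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset["default"]["train"][0])
```

The same pipeline shape carries over to the other techniques: distilabel ships task implementations such as `SelfInstruct`, `EvolInstruct`, and `Magpie` that can slot in where the `TextGeneration` steps sit above.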
Build on the instruction generation techniques to create preference datasets for alignment tasks. Key methods include the following (a pipeline sketch follows the list):
- Model Pooling: Generating multiple completions for each prompt using diverse models and configurations.
- EvolQuality: Improving the quality of generated responses through iterative evolution. (coming soon)
- UltraFeedback: Scoring and critiquing responses to identify preferred completions and filter out low-quality data. (coming soon)
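
The sketch below combines model pooling with an UltraFeedback-style judge in one distilabel pipeline (again assuming distilabel 1.x; both model names and the single seed prompt are placeholders). Two different models answer the same prompt, their completions are grouped into one column, and a judge model rates them.

```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback

with Pipeline(name="preference-dataset") as pipeline:
    data = LoadDataFromDicts(data=[{"instruction": "What is synthetic data?"}])
    # Model pooling: two different models answer the same prompt.
    gen_a = TextGeneration(
        llm=TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    )
    gen_b = TextGeneration(
        llm=TransformersLLM(model="Qwen/Qwen2.5-1.5B-Instruct")
    )
    # Collect both completions into a single "generations" column.
    group = GroupColumns(columns=["generation"], output_columns=["generations"])
    # Rate and critique the pooled completions to identify the preferred one.
    judge = UltraFeedback(
        llm=TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    )
    data >> [gen_a, gen_b] >> group >> judge

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset["default"]["train"][0])
```

The ratings and rationales the judge emits can then be used to select chosen and rejected completions for DPO-style preference training.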
Exercise notebooks:
- Instruction Dataset Notebook: Demonstrates generating instruction datasets for instruction tuning using distilabel.
- Preference Dataset Notebook: Explains how to create datasets for preference alignment.