Synthetic data offers a way to expand or enhance datasets by generating artificial data that mimics real-world scenarios. With the advent of large language models, synthetic datasets have become a powerful tool for pre-training, fine-tuning, and evaluating models. This repository provides tools and techniques for generating synthetic datasets for instruction tuning and preference alignment using the distilabel framework.
The repository focuses on two primary categories of synthetic data:
- Instruction Datasets: Used for instruction tuning, including techniques like basic prompting, SelfInstruct, EvolInstruct, and Magpie.
- Preference Datasets: Designed for preference alignment, involving the generation of multiple completions and the evaluation of their quality using tools like EvolQuality and UltraFeedback.
Explore how to create instruction datasets for instruction tuning. Techniques include the following (a pipeline sketch follows the list):
- Basic Prompting: Generating synthetic prompts and completions for fine-tuning.
- SelfInstruct: Expanding datasets by generating diverse instructions from a seed dataset.
- EvolInstruct: Iteratively evolving instructions to improve their complexity and domain relevance.
- Magpie: Leveraging chat-template structures to generate efficient, multi-turn instruction datasets.
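
As a concrete starting point, here is a minimal basic-prompting sketch with distilabel (assuming distilabel 1.x with the Transformers extra installed; the seed prompt and model choice are illustrative, not prescribed by this repository). It chains two `TextGeneration` steps: the first invents a synthetic instruction, the second answers it to produce a completion.

```python
# pip install "distilabel[hf-transformers]"  # assumed install for this sketch
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="basic-prompting") as pipeline:
    # Seed prompt asking the model to invent a new instruction (illustrative).
    data = LoadDataFromDicts(
        data=[{"instruction": "Generate a question a student might ask about synthetic data."}]
    )
    llm = TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    # First pass: generate a synthetic instruction and feed it forward as "instruction".
    gen_instruction = TextGeneration(
        llm=llm, output_mappings={"generation": "instruction"}
    )
    # Second pass: answer the synthetic instruction to produce a completion.
    gen_response = TextGeneration(
        llm=llm, output_mappings={"generation": "response"}
    )
    data >> gen_instruction >> gen_response

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset["default"]["train"][0])
```

The same pipeline shape carries over to the other techniques: distilabel ships task implementations such as `SelfInstruct`, `EvolInstruct`, and `Magpie` that can slot in where the `TextGeneration` steps sit above.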
Build on the instruction generation techniques to create preference datasets for alignment tasks. Key methods include the following (a pipeline sketch follows the list):
- Model Pooling: Generating multiple completions for each prompt using diverse models and configurations.
- EvolQuality: Improving the quality of generated responses through iterative evolution. (coming soon)
- UltraFeedback: Scoring and critiquing responses to identify preferred completions and filter out low-quality data. (coming soon)
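
The sketch below combines model pooling with an UltraFeedback-style judge in one distilabel pipeline (again assuming distilabel 1.x; both model names and the single seed prompt are placeholders). Two different models answer the same prompt, their completions are grouped into one column, and a judge model rates them.

```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback

with Pipeline(name="preference-dataset") as pipeline:
    data = LoadDataFromDicts(data=[{"instruction": "What is synthetic data?"}])
    # Model pooling: two different models answer the same prompt.
    gen_a = TextGeneration(
        llm=TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    )
    gen_b = TextGeneration(
        llm=TransformersLLM(model="Qwen/Qwen2.5-1.5B-Instruct")
    )
    # Collect both completions into a single "generations" column.
    group = GroupColumns(columns=["generation"], output_columns=["generations"])
    # Rate and critique the pooled completions to identify the preferred one.
    judge = UltraFeedback(
        llm=TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    )
    data >> [gen_a, gen_b] >> group >> judge

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset["default"]["train"][0])
```

The ratings and rationales the judge emits can then be used to select chosen and rejected completions for DPO-style preference training.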
Exercise notebooks:
- Instruction Dataset Notebook: Demonstrates generating instruction datasets for instruction tuning using distilabel.
- Preference Dataset Notebook: Explains how to create datasets for preference alignment.