MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
🤗 HF Models | 🤗 HF Datasets | 📄 Paper | 🚀 Project
Welcome to the MIG (Maximize the Information Gain) project!
We will continue to update this repository. Please stay tuned!
MIG is an automatic data selection method for instruction tuning. It proposes an information-based dataset measurement that comprehensively evaluates data quality and diversity.
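At a high level, MIG models the semantic space as a label graph, scores each data point over those labels, and iteratively selects the sample that maximizes the information gain of the selected set. Below is a minimal illustrative sketch of such a greedy max-gain loop; the label_scores matrix, the concave utility, and all names are assumptions for illustration, not the project's actual implementation:

```python
import numpy as np

# Illustrative sketch of greedy information-gain selection (NOT the official
# MIG implementation). Assumption: each data point is summarized as a row of
# quality-weighted label scores (e.g. after propagation over a label graph),
# and dataset information is a concave function of accumulated label mass,
# so the marginal gain of a label shrinks once it is already well covered.
def greedy_information_gain(label_scores: np.ndarray, k: int) -> list[int]:
    n_points, n_labels = label_scores.shape
    acc = np.zeros(n_labels)               # accumulated label mass of the selected set
    selected, remaining = [], set(range(n_points))

    def info(mass: np.ndarray) -> float:
        return float(np.sum(mass ** 0.8))  # placeholder concave utility (assumption)

    for _ in range(min(k, n_points)):
        base = info(acc)
        # Pick the point whose addition yields the largest information gain.
        best = max(remaining, key=lambda i: info(acc + label_scores[i]) - base)
        selected.append(best)
        remaining.remove(best)
        acc += label_scores[best]
    return selected
```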
- 🎉 [05/2025] MIG is accepted to ACL 2025 Findings!
- 📄 [04/2025] MIG paper MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space is released!
- 🤗 [04/2025] Annotated data pools and sampled datasets by different data selection methods are released at HuggingFace.
Comparison with different data selection methods:
- Sample 50K from the Tulu3 pool (939K).
- Training on Llama3.1-8B.
- Comprehensive evaluations including human-preference and knowledge-based benchmarks.
| Method | Data Size | ARC | BBH | GSM | HE | MMLU | IFEval | Avg_obj | AE | MT | Wild | Avg_sub | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pool | 939K | 69.15 | 63.88 | 83.40 | 63.41 | 65.77 | 67.10 | 68.79 | 8.94 | 6.86 | -24.66 | 38.40 | 53.59 |
| Random | 50K | 74.24 | 64.80 | 70.36 | 51.22 | 63.86 | 61.00 | 64.25 | 8.57 | 7.06 | -22.15 | 39.36 | 51.81 |
| ZIP | 50K | 77.63 | 63.00 | 52.54 | 35.98 | 65.00 | 61.00 | 59.19 | 6.71 | 6.64 | -32.10 | 35.69 | 47.44 |
| IFD | 50K | 75.93 | 63.56 | 61.03 | 49.39 | 64.39 | 53.60 | 61.32 | 12.30 | 7.03 | -20.20 | 40.83 | 51.08 |
| #InsTag | 50K | 72.54 | 64.80 | 69.83 | 48.17 | 63.50 | 65.99 | 64.14 | 6.58 | 6.84 | -20.70 | 38.21 | 51.17 |
| DEITA | 50K | 78.98 | 66.11 | 74.07 | 49.39 | 64.00 | 64.33 | 66.15 | 10.19 | 6.83 | -19.95 | 39.50 | 52.83 |
| CaR | 50K | 78.98 | 69.04 | 71.42 | 52.44 | 65.15 | 56.75 | 65.63 | 12.55 | 6.95 | -20.67 | 40.57 | 53.10 |
| QDIT | 50K | 79.66 | 65.42 | 70.74 | 53.05 | 65.06 | 57.30 | 65.21 | 15.78 | 6.76 | -20.56 | 41.03 | 53.12 |
| MIG | 50K | 80.00 | 66.39 | 72.02 | 57.93 | 64.44 | 65.06 | 67.64 | 14.66 | 7.32 | -17.77 | 42.99 | 55.32 |
HE denotes HumanEval, AE denotes AlpacaEvalv2, MT denotes MTBench, and Wild denotes WildBench. Avg_obj is the average of the six knowledge-based benchmarks, Avg_sub is the average of the three human-preference benchmarks after normalization to a 0-100 scale, and Avg is the mean of Avg_obj and Avg_sub.
Please refer to our paper for more results on different data pools (Openhermes2.5, Deita-Sota-Pool) and different base LLMs (Qwen2.5-7B, Mistral-7B-v0.3).
- Create an environment

```bash
conda create -n mig python=3.10
conda activate mig
```

- Install PyTorch (>2.0)

```bash
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```

- Install MIG

```bash
git clone https://github.com/yichengchen24/MIG.git
cd MIG
pip install -e .
```
- Embedding Model

Please download the embedding model to <embedding_model_path>. We recommend e5-mistral-7b-instruct, the model used in our paper.
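If you prefer to script the download, here is a minimal sketch using huggingface_hub (the local directory is a placeholder for <embedding_model_path>):

```python
from huggingface_hub import snapshot_download

# Download intfloat/e5-mistral-7b-instruct from the Hugging Face Hub; pass
# this directory to --embedding-model as <embedding_model_path>.
snapshot_download(
    repo_id="intfloat/e5-mistral-7b-instruct",
    local_dir="models/e5-mistral-7b-instruct",  # placeholder path
)
```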
- Sampling

```bash
mig sample <src> --out <save_path> --num-sample <num_sample> \
    --valid-tag-path ./configs/valid_tag_path.json \
    --label-graph-type sim \
    --embedding-model <embedding_model_path> \
    --sampler-type mig \
    --batch-size 32768
```
<src> should be the path to the data pool in JSONL format; please refer to data/example.jsonl for an example. We have open-sourced our processed data pools (Tulu3, Openhermes2.5, $X_{sota}$) with annotated #InsTag labels and Deita scores.
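For orientation, a single pool record might look like the line below. The dialogs field matches the LLaMA-Factory dataset entry later in this README; the annotation field names (labels, deita_score) are illustrative assumptions, so check data/example.jsonl for the exact schema:

```json
{"dialogs": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "2 + 2 = 4."}], "labels": ["arithmetic"], "deita_score": 0.92}
```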
We use LLaMA-Factory to fine-tune base models.
- Preparation
Add the sampled dataset to data/dataset_info.json, pointing "file_name" at the <out> file produced by mig sample:
"tulu3_pool_mig_50k": {
"file_name": <out>,
"formatting": "sharegpt",
"columns": {
"messages": "dialogs"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"system_tag": "system"
}
}
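Here "formatting": "sharegpt" with "messages" mapped to "dialogs" tells LLaMA-Factory to read each record's dialogs list as role/content turns, matching the pool schema above.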
- Training
```bash
torchrun --nnodes=1 --nproc_per_node=8 --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} src/train.py \
    --stage sft \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --do_train \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --dataset tulu3_pool_mig_50k \
    --cutoff_len 4096 \
    --template llama3 \
    --finetuning_type full \
    --output_dir ckpts/tulu3_pool_mig_50k \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type linear \
    --logging_steps 100 \
    --save_steps 5000 \
    --learning_rate 5e-6 \
    --num_train_epochs 3.0 \
    --warmup_ratio 0.03 \
    --plot_loss \
    --bf16 \
    --save_only_model
```
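With these settings, the effective batch size is 1 (per device) × 16 (gradient accumulation steps) × 8 GPUs = 128 sequences per optimizer step.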
We use OpenCompass to evaluate fine-tuned models.
- Preparation
Please install the environment according to the instructions from OpenCompass.
- Evaluation
```bash
opencompass eval/eval_objective.py
```
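eval/eval_objective.py is an OpenCompass config that points the evaluator at your fine-tuned checkpoint and the objective benchmarks. For orientation, a minimal sketch of what such a config can look like, assuming OpenCompass's HuggingFacewithChatTemplate wrapper and MMLU as an example dataset; names and paths are illustrative, so consult the repository's actual file:

```python
# Illustrative OpenCompass config sketch; eval/eval_objective.py in the repo
# is the authoritative version.
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate

with read_base():
    # Dataset configs shipped with OpenCompass (example subset).
    from opencompass.configs.datasets.mmlu.mmlu_gen import mmlu_datasets

datasets = [*mmlu_datasets]

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr="tulu3_pool_mig_50k",
        path="ckpts/tulu3_pool_mig_50k",  # checkpoint from the training step
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```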
We will continue to update:
- More automatic data selection strategies
If you find the content of this project helpful, please cite our paper as follows:
```bibtex
@inproceedings{chen2025mig,
  title={MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space},
  author={Chen, Yicheng and Li, Yining and Hu, Kai and Ma, Zerun and Ye, Haochen and Chen, Kai},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}
```