MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
🤗 HF Models | 🤗 HF Datasets | 📄 Paper | 🚀 Project
Welcome to the MIG (Maximize the Information Gain) project!
We will continue to update this repository. Please stay tuned!
MIG is an automatic data selection method for instruction tuning. It proposes an information-based dataset measurement that comprehensively evaluates data quality and diversity.
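At a high level, MIG models the semantic space as a label graph, scores each data point over those labels, and iteratively selects the sample that maximizes the information gain of the selected set. Below is a minimal illustrative sketch of such a greedy max-gain loop; the label_scores matrix, the concave utility, and all names are assumptions for illustration, not the project's actual implementation:

```python
import numpy as np

# Illustrative sketch of greedy information-gain selection (NOT the official
# MIG implementation). Assumption: each data point is summarized as a row of
# quality-weighted label scores (e.g. after propagation over a label graph),
# and dataset information is a concave function of accumulated label mass,
# so the marginal gain of a label shrinks once it is already well covered.
def greedy_information_gain(label_scores: np.ndarray, k: int) -> list[int]:
    n_points, n_labels = label_scores.shape
    acc = np.zeros(n_labels)               # accumulated label mass of the selected set
    selected, remaining = [], set(range(n_points))

    def info(mass: np.ndarray) -> float:
        return float(np.sum(mass ** 0.8))  # placeholder concave utility (assumption)

    for _ in range(min(k, n_points)):
        base = info(acc)
        # Pick the point whose addition yields the largest information gain.
        best = max(remaining, key=lambda i: info(acc + label_scores[i]) - base)
        selected.append(best)
        remaining.remove(best)
        acc += label_scores[best]
    return selected
```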
- 🎉 [05/2025] MIG is accepted to ACL 2025 Findings!
- 📄 [04/2025] MIG paper MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space is released!
- 🤗 [04/2025] Annotated data pools and sampled datasets by different data selection methods are released at HuggingFace.
Comparison with different data selection methods:
- Sample 50K from the Tulu3 pool (939K).
- Training on Llama3.1-8B.
- Comprehensive evaluations including human-preference and knowledge-based benchmarks.
| Method | Data Size | ARC | BBH | GSM | HE | MMLU | IFEval | Avg_obj | AE | MT | Wild | Avg_sub | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pool | 939K | 69.15 | 63.88 | 83.40 | 63.41 | 65.77 | 67.10 | 68.79 | 8.94 | 6.86 | -24.66 | 38.40 | 53.59 |
| Random | 50K | 74.24 | 64.80 | 70.36 | 51.22 | 63.86 | 61.00 | 64.25 | 8.57 | 7.06 | -22.15 | 39.36 | 51.81 |
| ZIP | 50K | 77.63 | 63.00 | 52.54 | 35.98 | 65.00 | 61.00 | 59.19 | 6.71 | 6.64 | -32.10 | 35.69 | 47.44 |
| IFD | 50K | 75.93 | 63.56 | 61.03 | 49.39 | 64.39 | 53.60 | 61.32 | 12.30 | 7.03 | -20.20 | 40.83 | 51.08 |
| #InsTag | 50K | 72.54 | 64.80 | 69.83 | 48.17 | 63.50 | 65.99 | 64.14 | 6.58 | 6.84 | -20.70 | 38.21 | 51.17 |
| DEITA | 50K | 78.98 | 66.11 | 74.07 | 49.39 | 64.00 | 64.33 | 66.15 | 10.19 | 6.83 | -19.95 | 39.50 | 52.83 |
| CaR | 50K | 78.98 | 69.04 | 71.42 | 52.44 | 65.15 | 56.75 | 65.63 | 12.55 | 6.95 | -20.67 | 40.57 | 53.10 |
| QDIT | 50K | 79.66 | 65.42 | 70.74 | 53.05 | 65.06 | 57.30 | 65.21 | 15.78 | 6.76 | -20.56 | 41.03 | 53.12 |
| MIG | 50K | 80.00 | 66.39 | 72.02 | 57.93 | 64.44 | 65.06 | 67.64 | 14.66 | 7.32 | -17.77 | 42.99 | 55.32 |
HE denotes HumanEval, AE denotes AlpacaEvalv2, MT denotes MTBench, and Wild denotes WildBench. Avg_obj is the average of the six knowledge-based benchmarks, Avg_sub is the average of the three human-preference benchmarks after normalization to a 0-100 scale, and Avg is the mean of Avg_obj and Avg_sub.
Please refer to our paper for more results on different data pools (Openhermes2.5, Deita-Sota-Pool) and different base LLMs (Qwen2.5-7B, Mistral-7B-v0.3).
- Create an environment

```bash
conda create -n mig python=3.10
conda activate mig
```

- Install PyTorch (>2.0)

```bash
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```

- Install MIG

```bash
git clone https://github.com/yichengchen24/MIG.git
cd MIG
pip install -e .
```
- Embedding Model

Please download the embedding model to <embedding_model_path>. We recommend e5-mistral-7b-instruct, the model used in our paper.
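If you prefer to script the download, here is a minimal sketch using huggingface_hub (the local directory is a placeholder for <embedding_model_path>):

```python
from huggingface_hub import snapshot_download

# Download intfloat/e5-mistral-7b-instruct from the Hugging Face Hub; pass
# this directory to --embedding-model as <embedding_model_path>.
snapshot_download(
    repo_id="intfloat/e5-mistral-7b-instruct",
    local_dir="models/e5-mistral-7b-instruct",  # placeholder path
)
```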
- Sampling

```bash
mig sample <src> --out <save_path> --num-sample <num_sample> \
    --valid-tag-path ./configs/valid_tag_path.json \
    --label-graph-type sim \
    --embedding-model <embedding_model_path> \
    --sampler-type mig \
    --batch-size 32768
```
<src> should be the path to the data pool in JSONL format; please refer to data/example.jsonl for an example. We have open-sourced our processed data pools (Tulu3, Openhermes2.5, $X_{sota}$) with annotated #InsTag labels and Deita scores.
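For orientation, a single pool record might look like the line below. The dialogs field matches the LLaMA-Factory dataset entry later in this README; the annotation field names (labels, deita_score) are illustrative assumptions, so check data/example.jsonl for the exact schema:

```json
{"dialogs": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "2 + 2 = 4."}], "labels": ["arithmetic"], "deita_score": 0.92}
```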
We use LLaMA-Factory to fine-tune base models.
- Preparation
Add the sampled dataset to data/dataset_info.json, pointing "file_name" at the <out> file produced by mig sample:
"tulu3_pool_mig_50k": {
"file_name": <out>,
"formatting": "sharegpt",
"columns": {
"messages": "dialogs"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"system_tag": "system"
}
}
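Here "formatting": "sharegpt" with "messages" mapped to "dialogs" tells LLaMA-Factory to read each record's dialogs list as role/content turns, matching the pool schema above.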
- Training
```bash
torchrun --nnodes=1 --nproc_per_node=8 --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} src/train.py \
    --stage sft \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --do_train \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --dataset tulu3_pool_mig_50k \
    --cutoff_len 4096 \
    --template llama3 \
    --finetuning_type full \
    --output_dir ckpts/tulu3_pool_mig_50k \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type linear \
    --logging_steps 100 \
    --save_steps 5000 \
    --learning_rate 5e-6 \
    --num_train_epochs 3.0 \
    --warmup_ratio 0.03 \
    --plot_loss \
    --bf16 \
    --save_only_model
```
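With these settings, the effective batch size is 1 (per device) × 16 (gradient accumulation steps) × 8 GPUs = 128 sequences per optimizer step.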
We use OpenCompass to evaluate fine-tuned models.
- Preparation
Please install the environment according to the instructions from OpenCompass.
- Evaluation
```bash
opencompass eval/eval_objective.py
```
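eval/eval_objective.py is an OpenCompass config that points the evaluator at your fine-tuned checkpoint and the objective benchmarks. For orientation, a minimal sketch of what such a config can look like, assuming OpenCompass's HuggingFacewithChatTemplate wrapper and MMLU as an example dataset; names and paths are illustrative, so consult the repository's actual file:

```python
# Illustrative OpenCompass config sketch; eval/eval_objective.py in the repo
# is the authoritative version.
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate

with read_base():
    # Dataset configs shipped with OpenCompass (example subset).
    from opencompass.configs.datasets.mmlu.mmlu_gen import mmlu_datasets

datasets = [*mmlu_datasets]

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr="tulu3_pool_mig_50k",
        path="ckpts/tulu3_pool_mig_50k",  # checkpoint from the training step
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```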
We will continue to update:
- More automatic data selection strategies
If you find the content of this project helpful, please cite our paper as follows:
```bibtex
@inproceedings{chen2025mig,
  title={MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space},
  author={Chen, Yicheng and Li, Yining and Hu, Kai and Ma, Zerun and Ye, Haochen and Chen, Kai},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}
```