MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

🤗 HF Models · 🤗 HF Datasets · 📄 Paper · 🚀 Project

Welcome to the MIG (Maximize the Information Gain) project!

We will continue to update this repository. Please stay tuned!

What is MIG?

MIG is an automatic data selection method for instruction tuning. It introduces an information-based dataset measurement that jointly evaluates data quality and diversity in semantic space.
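
The intuition can be pictured with a toy greedy loop: each sample carries a quality score and a set of semantic labels, and we repeatedly pick the sample whose labels yield the largest marginal gain under a concave (saturating) per-label information measure, so under-covered labels are favored. This is only an illustrative sketch, not the official implementation; the actual method measures information on a label graph with propagation between related labels (see the paper), and all names below are made up for this example.

# Toy sketch of greedy information-gain selection (illustration only).
import math

def greedy_select(pool, k):
    """pool: list of (quality, labels) pairs; returns indices of k picks."""
    mass = {}                      # accumulated quality mass per label
    selected = []
    remaining = set(range(len(pool)))
    def info(m):                   # concave, so each label saturates
        return math.sqrt(m)
    for _ in range(min(k, len(pool))):
        best, best_gain = None, -1.0
        for i in remaining:
            q, labels = pool[i]
            gain = sum(info(mass.get(l, 0.0) + q) - info(mass.get(l, 0.0))
                       for l in labels)
            if gain > best_gain:
                best, best_gain = i, gain
        q, labels = pool[best]
        for l in labels:
            mass[l] = mass.get(l, 0.0) + q
        remaining.remove(best)
        selected.append(best)
    return selected

# Example: high-quality samples with under-covered labels are picked first.
pool = [(0.9, {"math"}), (0.8, {"code"}), (0.9, {"math"}), (0.5, {"chat"})]
print(greedy_select(pool, 3))      # -> [0, 1, 3]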


Performance

🔦 Highlights


Comparison with different data selection methods:

  • Sample 50K instances from the Tulu3 pool (939K).
  • Train on Llama3.1-8B.
  • Evaluate comprehensively on both human-preference and knowledge-based benchmarks.

📈 Full Results

| Method  | Data Size | ARC   | BBH   | GSM   | HE    | MMLU  | IFEval | $Avg_\text{obj}$ | AE    | MT   | Wild   | $Avg_\text{sub}$ | Avg   |
|---------|-----------|-------|-------|-------|-------|-------|--------|------------------|-------|------|--------|------------------|-------|
| Pool    | 939K      | 69.15 | 63.88 | 83.40 | 63.41 | 65.77 | 67.10  | 68.79            | 8.94  | 6.86 | -24.66 | 38.40            | 53.59 |
| Random  | 50K       | 74.24 | 64.80 | 70.36 | 51.22 | 63.86 | 61.00  | 64.25            | 8.57  | 7.06 | -22.15 | 39.36            | 51.81 |
| ZIP     | 50K       | 77.63 | 63.00 | 52.54 | 35.98 | 65.00 | 61.00  | 59.19            | 6.71  | 6.64 | -32.10 | 35.69            | 47.44 |
| IFD     | 50K       | 75.93 | 63.56 | 61.03 | 49.39 | 64.39 | 53.60  | 61.32            | 12.30 | 7.03 | -20.20 | 40.83            | 51.08 |
| #InsTag | 50K       | 72.54 | 64.80 | 69.83 | 48.17 | 63.50 | 65.99  | 64.14            | 6.58  | 6.84 | -20.70 | 38.21            | 51.17 |
| DEITA   | 50K       | 78.98 | 66.11 | 74.07 | 49.39 | 64.00 | 64.33  | 66.15            | 10.19 | 6.83 | -19.95 | 39.50            | 52.83 |
| CaR     | 50K       | 78.98 | 69.04 | 71.42 | 52.44 | 65.15 | 56.75  | 65.63            | 12.55 | 6.95 | -20.67 | 40.57            | 53.10 |
| QDIT    | 50K       | 79.66 | 65.42 | 70.74 | 53.05 | 65.06 | 57.30  | 65.21            | 15.78 | 6.76 | -20.56 | 41.03            | 53.12 |
| MIG     | 50K       | 80.00 | 66.39 | 72.02 | 57.93 | 64.44 | 65.06  | 67.64            | 14.66 | 7.32 | -17.77 | 42.99            | 55.32 |

HE denotes HumanEval, AE denotes AlpacaEval v2, MT denotes MTBench, and Wild denotes WildBench. $Avg_\text{obj}$ and $Avg_\text{sub}$ are the averages of the normalized knowledge-based and human-preference benchmark scores, respectively. Avg is the mean of $Avg_\text{obj}$ and $Avg_\text{sub}$ (e.g., for MIG: (67.64 + 42.99) / 2 = 55.32).

Please refer to our paper for more results on other data pools (Openhermes2.5, Deita-Sota-Pool) and base LLMs (Qwen2.5-7B, Mistral-7B-v0.3).

🏃‍♂️ How to start?

Installation

  • Create an environment
conda create -n mig python=3.10
conda activate mig
  • Install PyTorch (≥2.0)
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
  • Install MIG
git clone https://github.com/yichengchen24/MIG.git
cd MIG
pip install -e .
  • Embedding Model

Please download the embedding model to <embedding_model_path>. We recommend e5-mistral-7b-instruct, the model used in our paper.
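
For convenience, the checkpoint can be fetched from the Hugging Face Hub, e.g. with huggingface_hub (a sketch; the local_dir below is just an example path to use as <embedding_model_path>):

# Download e5-mistral-7b-instruct from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; the checkpoint is roughly 15 GB.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="intfloat/e5-mistral-7b-instruct",
    local_dir="models/e5-mistral-7b-instruct",  # pass this as <embedding_model_path>
)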

Data Sampling

mig sample <src> --out <save_path> --num-sample <num_sample> --valid-tag-path ./configs/valid_tag_path.json --label-graph-type sim --embedding-model <embedding_model_path> --sampler-type mig --batch-size 32768

<src> should be the data pool path in jsonl format. Please refer to data/example.jsonl for an example. We have open-sourced our processed data pools (Tulu3, Openhermes2.5, $X_{sota}$) with annotated #InsTag labels and Deita scores.
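
For illustration, one pool record could be written as below. This is a hypothetical sketch: only the dialogs field and its role/content keys are implied by the LLaMA-Factory config in the next section, and the annotation key names are guesses, so defer to data/example.jsonl for the authoritative schema.

# Hypothetical example of one line in <src>; "instag_labels" and
# "deita_score" are illustrative field names, not the exact schema.
import json

record = {
    "dialogs": [
        {"role": "user", "content": "What does gradient accumulation do?"},
        {"role": "assistant", "content": "It sums gradients over several micro-batches before each optimizer step, emulating a larger batch."},
    ],
    "instag_labels": ["machine learning", "concept explanation"],
    "deita_score": 4.2,
}

with open("my_pool.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")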

SFT Training

We use LLaMA-Factory to fine-tune base models.

  • Preparation

Add the sampled data to data/dataset_info.json, for example:

"tulu3_pool_mig_50k": {
    "file_name": <out>,
    "formatting": "sharegpt",
    "columns": {
      "messages": "dialogs"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
}
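
Equivalently, the entry can be added programmatically. The sketch below mirrors the JSON above; the dataset name and file path are examples, and it assumes it runs from the LLaMA-Factory root.

# Register a sampled jsonl file (the --out of `mig sample`) in
# LLaMA-Factory's data/dataset_info.json.
import json

def register_dataset(info_path: str, name: str, file_name: str) -> None:
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    info[name] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "dialogs"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
            "system_tag": "system",
        },
    }
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, indent=2, ensure_ascii=False)

register_dataset("data/dataset_info.json", "tulu3_pool_mig_50k", "tulu3_pool_mig_50k.jsonl")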
  • Training
torchrun --nnodes=1 --nproc_per_node=8 --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} src/train.py \
    --stage sft \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --do_train \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --dataset tulu3_pool_mig_50k \
    --cutoff_len 4096 \
    --template llama3 \
    --finetuning_type full \
    --output_dir ckpts/tulu3_pool_mig_50k \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type linear \
    --logging_steps 100 \
    --save_steps 5000 \
    --learning_rate 5e-6 \
    --num_train_epochs 3.0 \
    --warmup_ratio 0.03 \
    --plot_loss \
    --bf16 \
    --save_only_model
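
Note that with 8 GPUs, per_device_train_batch_size 1, and gradient_accumulation_steps 16, the effective global batch size is 1 × 16 × 8 = 128; scale gradient_accumulation_steps accordingly if you change the GPU count.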

Evaluation

We use OpenCompass to evaluate fine-tuned models.

  • Preparation

Please install the environment according to the instructions from OpenCompass.

  • Evaluation
opencompass eval/eval_objective.py
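
Judging by its name and the results table above, eval/eval_objective.py is the OpenCompass config for the knowledge-based ($Avg_\text{obj}$) benchmarks (ARC, BBH, GSM8K, HumanEval, MMLU, IFEval); point its model settings at the fine-tuned checkpoint, e.g. ckpts/tulu3_pool_mig_50k.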

💪 What's more?

We will continue to update:

  • More automatic data selection strategies

Citation

If you find the content of this project helpful, please cite our paper as follows:

@inproceedings{chen2025mig,
  title={MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space},
  author={Chen, Yicheng and Li, Yining and Hu, Kai and Ma, Zerun and Ye, Haochen and Chen, Kai},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}
