
Issue in fine-tuning Qwen2.5 7B Coder model with LLaMA Factory and llama.cpp #8360

Open
@kharanshu

Description


Reminder

  • I have read the above rules and searched the existing issues.

System Info

dev
OP — 2:34 PM
We are trying to fine-tune the Qwen2.5 7B Coder model. Initially we fine-tuned with a context length of 2k, and it failed after around 50 queries. We then increased the context length to 32k and ran the fine-tuning again; this time it fails after 200 queries. What might be going wrong?
Can you help me with this?
We fine-tune using LLaMA Factory, then convert to GGUF using llama.cpp, and after executing some queries the container crashes.
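For reference, the packaging path after training looks roughly like the sketch below. This is a sketch only: the paths, model names, and quantization type are placeholders, and the llama.cpp script name and binary locations should be checked against the local checkout.

```python
# Rough sketch of the post-training packaging path (placeholder paths/names).
# Assumes a llama.cpp checkout and an Ollama install; verify script names and
# flags against your versions before running.
import subprocess
from pathlib import Path

merged_model = Path("/models/qwen2.5-coder-7b-pandas")   # merged LoRA weights (HF format)
llama_cpp = Path("/opt/llama.cpp")                       # local llama.cpp checkout
gguf_f16 = Path("/models/qwen2.5-coder-7b-pandas-f16.gguf")
gguf_q8 = Path("/models/qwen2.5-coder-7b-pandas-q8_0.gguf")

# 1) HF -> GGUF (fp16) using llama.cpp's converter script
subprocess.run(
    ["python", str(llama_cpp / "convert_hf_to_gguf.py"), str(merged_model),
     "--outfile", str(gguf_f16), "--outtype", "f16"],
    check=True,
)

# 2) Optional quantization with llama.cpp's llama-quantize
subprocess.run(
    [str(llama_cpp / "build/bin/llama-quantize"), str(gguf_f16), str(gguf_q8), "Q8_0"],
    check=True,
)

# 3) Build an Ollama model with the desired context window
modelfile = f"FROM {gguf_q8}\nPARAMETER num_ctx 32768\n"
Path("Modelfile").write_text(modelfile)
subprocess.run(["ollama", "create", "qwen2.5-coder-pandas", "-f", "Modelfile"], check=True)
```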
Dominic — 4:01 PM
Remember to attach logs; we aren't wizards and can't guess the one-in-a-million ways software can fail.
frob — 4:21 PM
"failed after around": what does "failed" mean?
"container crashes": does the container itself crash, or does the process inside the container crash?
Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) are required for debugging.
Kharanshu — 4:58 PM
We are trying to fine-tune the Qwen2.5 7B Coder model for a specific use case: generating pandas code from a user query. I am attaching the dataset used for fine-tuning.
The issue we encountered: I created an evaluation pipeline containing 526 test cases, which run in series and hit the pandas codegen agent that serves my fine-tuned model.
Attachment: [pandas_training_dataset.jsonl](https://cdn.discordapp.com/attachments/1382284333884375051/1382320690211852438/pandas_training_dataset.jsonl?ex=684ab9f6&is=68496876&hm=6358fe731055ca8cfcfd20bd3867a20733b1eb955bdaeb4d544f96f84ee0e69f&) (230.08 KB)
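A minimal sketch of the serial evaluation loop described above, assuming the agent ultimately calls Ollama's /api/generate endpoint; the model name, base URL, test-case file, and field names are placeholders for whatever the real pipeline uses.

```python
# Minimal sketch of the serial evaluation loop (assumed setup: the agent calls
# Ollama's /api/generate; model name, URL, and file paths are placeholders).
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5-coder-pandas"                       # placeholder Ollama model name

with open("pandas_eval_cases.jsonl") as f:           # 526 test cases, one JSON object per line
    cases = [json.loads(line) for line in f]

for i, case in enumerate(cases, start=1):
    payload = {
        "model": MODEL,
        "prompt": case["instruction"],               # field name depends on the dataset schema
        "stream": False,
        "options": {"num_ctx": 32768},
    }
    try:
        # A timeout makes a hung request show up as an error instead of blocking the run.
        resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
        resp.raise_for_status()
        answer = resp.json()["response"]
    except requests.RequestException as exc:
        print(f"case {i}: request failed or timed out: {exc}")
        break
    print(f"case {i}: {len(answer)} chars generated")
```

The per-request timeout is the main design choice here: it makes the "stuck in between" behaviour visible as a distinct failure rather than an indefinitely blocked run.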
Training parameters (these are all my tuning settings):
top.booster: auto
top.checkpoint_path: []
top.finetuning_type: lora
top.model_name: Qwen2.5-Coder-7B
top.quantization_bit: '8'
top.quantization_method: bnb
top.rope_scaling: linear
top.template: qwen
train.additional_target: ''
train.apollo_rank: 16
train.apollo_scale: 32
train.apollo_target: all
train.apollo_update_interval: 200
train.badam_mode: layer
train.badam_switch_interval: 50
train.badam_switch_mode: ascending
train.badam_update_ratio: 0.05
train.batch_size: 2
train.compute_type: fp16
train.create_new_adapter: false
train.cutoff_len: 2048
train.dataset: pandas_training_dataset
train.dataset_dir: /home/azureuser/LLaMA-Factory/data
train.ds_offload: false
train.ds_stage: none
train.enable_thinking: true
train.extra_args: '{"optim": "adamw_torch"}'
train.freeze_extra_modules: ''
train.freeze_language_model: false
train.freeze_multi_modal_projector: true
train.freeze_trainable_layers: 2
train.freeze_trainable_modules: all
train.freeze_vision_tower: true
train.galore_rank: 16
train.galore_scale: 2
train.galore_target: all
train.galore_update_interval: 200
train.gradient_accumulation_steps: 8
train.image_max_pixels: 768*768
train.image_min_pixels: 32*32
train.learning_rate: 5e-5
train.logging_steps: 10
train.lora_alpha: 128
train.lora_dropout: 0.05
train.lora_rank: 64
train.lora_target: ''
train.loraplus_lr_ratio: 0
train.lr_scheduler_type: cosine
train.mask_history: false
train.max_grad_norm: '1.0'
train.max_samples: '1000'
train.neat_packing: false
train.neftune_alpha: 0
train.num_train_epochs: '3.0'
train.packing: false
train.ppo_score_norm: false
train.ppo_whiten_rewards: false
train.pref_beta: 0.1
train.pref_ftx: 0
train.pref_loss: sigmoid
train.report_to: none
train.resize_vocab: false
train.reward_model: []
train.save_steps: 100
train.swanlab_api_key: ''
train.swanlab_link: ''
train.swanlab_mode: cloud
train.swanlab_project: llamafactory
train.swanlab_run_name: ''
train.swanlab_workspace: ''
train.train_on_prompt: false
train.training_stage: Supervised Fine-Tuning
train.use_apollo: false
train.use_badam: false
train.use_dora: false
train.use_galore: false
train.use_llama_pro: false
train.use_pissa: false
train.use_rslora: false
train.use_swanlab: false
train.val_size: 0
train.video_max_pixels: 256*256
train.video_min_pixels: 16*16
train.warmup_steps: 0
=======================================
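For reproducibility, here is the same configuration restated as a rough LLaMA-Factory CLI YAML. This is a sketch only: key names follow the project's example configs and should be verified against the installed version, and the model path and output_dir are placeholders.

```yaml
# Rough CLI equivalent of the WebUI settings above (verify keys against your
# LLaMA-Factory version; model path and output_dir are placeholders).
model_name_or_path: Qwen/Qwen2.5-Coder-7B
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target: all          # WebUI field was left empty; "all" is the usual default
quantization_bit: 8
template: qwen
rope_scaling: linear

dataset: pandas_training_dataset
dataset_dir: /home/azureuser/LLaMA-Factory/data
cutoff_len: 2048
max_samples: 1000

output_dir: saves/qwen2.5-coder-7b/lora/sft
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
optim: adamw_torch
logging_steps: 10
save_steps: 100
```

Such a config would be launched with `llamafactory-cli train <config>.yaml`.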
Tuning finished properly. I created a GGUF image and ran it on Ollama with a context window size of 32k. But in my case it executes a few test cases from my evaluation pipeline and then gets stuck in between. Once it is stuck, it generates incorrect answers for the remaining queries until I restart my agent container, or else it crashes the container with the error:
Child exited for unknown reason! (wstatus == 15)
or
Child exited for unknown reason! (wstatus == 13) 
frob — 5:06 PM
The training part is not relevant for an Ollama Discord.
Kharanshu — 5:07 PM
Container memory usage and I/O usage are about 350 MB, and I am using an A100 GPU with 80 GB VRAM for this task. GPU usage usually spikes to 88% while it is working fine, but when it gets stuck at a particular point it stays stuck at 88%.
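A small polling sketch (assuming nvidia-smi is on the PATH; the interval and output format are arbitrary choices) that can log whether GPU utilization really stays pinned around 88% while a request is stuck:

```python
# Polls GPU utilization and memory via nvidia-smi (assumed to be on PATH);
# interval and output format are arbitrary choices for illustration.
import subprocess
import time

def gpu_snapshot() -> str:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), gpu_snapshot())
        time.sleep(10)
```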

Reproduction

Put your message here.

Others

No response

Labels: bug (Something isn't working), pending (This problem is yet to be addressed)
