[Question]: Why does GPU memory suddenly spike during PP-UIE-7B fine-tuning, causing OOM even with roughly 80 GB of GPU memory? #10572

Open
Gnem-zx opened this issue May 8, 2025 · 1 comment
Labels: question (Further information is requested)

Comments

Gnem-zx commented May 8, 2025

Please describe your question

This is the config file, taken straight from the official documentation. I am running on a single A800 with 80 GB of GPU memory.

{
"model_name_or_path": "paddlenlp/PP-UIE-7B",
"dataset_name_or_path": "./application/information_extraction/data",
"output_dir": "./checkpoints/ie_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 1,
"per_device_eval_batch_size": 1,
"eval_accumulation_steps":8,
"num_train_epochs": 3,
"learning_rate": 3e-05,
"warmup_steps": 30,
"logging_steps": 1,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
"fp16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
"recompute": false,
"save_total_limit": 1,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"sharding": "stage1",
"zero_padding": false,
"unified_checkpoint": true,
"use_flash_attention": false
}
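
For context, here is a rough back-of-envelope estimate (a sketch, assuming that fp16 "O2" AMP keeps fp32 master weights and that AdamW allocates two fp32 moment tensors per parameter; gradients and activations are ignored) of why full-parameter SFT of this model is unlikely to fit in 80 GB regardless of batch size. The parameter count comes from the training log below ("Number of trainable parameters = 7,615,616,512"), and the traceback shows the OOM exactly while AdamW creates its moment accumulators; the failed 259 MB allocation happens to match the fp32 moment for a single 3584 × 18944 MLP weight from the model config.

params = 7_615_616_512          # from the training log below
GiB = 1024 ** 3

fp16_weights  = params * 2      # fp16 copy used for forward/backward
fp32_master   = params * 4      # fp32 master weights kept by O2 AMP
adamw_moments = params * 4 * 2  # fp32 moment1 + moment2 created by AdamW

total = fp16_weights + fp32_master + adamw_moments
print(f"fp16 weights:          {fp16_weights / GiB:5.1f} GiB")   # ~14.2 GiB
print(f"fp32 master weights:   {fp32_master / GiB:5.1f} GiB")    # ~28.4 GiB
print(f"AdamW moments (fp32):  {adamw_moments / GiB:5.1f} GiB")  # ~56.7 GiB
print(f"total, no activations: {total / GiB:5.1f} GiB")          # ~99 GiB > 80 GiB

If this estimate is roughly right, the weights plus optimizer state alone exceed 80 GB before any activation memory, which matches the failure below at optimizer-state creation even with per_device_train_batch_size = 1.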

This is the error log:

jovyan@9ee80a409cbe:/mnt/data/lyy/mzx/PaddleNLP/llm$ python -u  -m paddle.distributed.launch  run_finetune.py ./config/pp-uie/sft_argument.json
/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
LAUNCH INFO 2025-05-08 10:39:34,154 -----------  Configuration  ----------------------
LAUNCH INFO 2025-05-08 10:39:34,154 auto_cluster_config: 0
LAUNCH INFO 2025-05-08 10:39:34,154 auto_parallel_config: None
LAUNCH INFO 2025-05-08 10:39:34,154 auto_tuner_json: None
LAUNCH INFO 2025-05-08 10:39:34,154 devices: None
LAUNCH INFO 2025-05-08 10:39:34,154 elastic_level: -1
LAUNCH INFO 2025-05-08 10:39:34,154 elastic_timeout: 30
LAUNCH INFO 2025-05-08 10:39:34,154 enable_gpu_log: True
LAUNCH INFO 2025-05-08 10:39:34,154 gloo_port: 6767
LAUNCH INFO 2025-05-08 10:39:34,154 host: None
LAUNCH INFO 2025-05-08 10:39:34,154 ips: None
LAUNCH INFO 2025-05-08 10:39:34,154 job_id: default
LAUNCH INFO 2025-05-08 10:39:34,154 legacy: False
LAUNCH INFO 2025-05-08 10:39:34,154 log_dir: log
LAUNCH INFO 2025-05-08 10:39:34,154 log_level: INFO
LAUNCH INFO 2025-05-08 10:39:34,154 log_overwrite: False
LAUNCH INFO 2025-05-08 10:39:34,154 master: None
LAUNCH INFO 2025-05-08 10:39:34,154 max_restart: 3
LAUNCH INFO 2025-05-08 10:39:34,154 nnodes: 1
LAUNCH INFO 2025-05-08 10:39:34,154 nproc_per_node: None
LAUNCH INFO 2025-05-08 10:39:34,154 rank: -1
LAUNCH INFO 2025-05-08 10:39:34,154 run_mode: collective
LAUNCH INFO 2025-05-08 10:39:34,154 server_num: None
LAUNCH INFO 2025-05-08 10:39:34,154 servers: 
LAUNCH INFO 2025-05-08 10:39:34,154 sort_ip: False
LAUNCH INFO 2025-05-08 10:39:34,154 start_port: 6070
LAUNCH INFO 2025-05-08 10:39:34,154 trainer_num: None
LAUNCH INFO 2025-05-08 10:39:34,154 trainers: 
LAUNCH INFO 2025-05-08 10:39:34,154 training_script: run_finetune.py
LAUNCH INFO 2025-05-08 10:39:34,154 training_script_args: ['./config/pp-uie/sft_argument.json']
LAUNCH INFO 2025-05-08 10:39:34,154 with_gloo: 1
LAUNCH INFO 2025-05-08 10:39:34,154 --------------------------------------------------
LAUNCH INFO 2025-05-08 10:39:34,155 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2025-05-08 10:39:34,156 Run Pod: tvtjzi, replicas 1, status ready
LAUNCH INFO 2025-05-08 10:39:34,183 Watching Pod: tvtjzi, replicas 1, status running
/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
[2025-05-08 10:39:36,847] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2025-05-08 10:39:36,847] [   DEBUG] - ============================================================
[2025-05-08 10:39:36,847] [   DEBUG] -      Model Configuration Arguments      
[2025-05-08 10:39:36,847] [   DEBUG] - paddle commit id              : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc
[2025-05-08 10:39:36,847] [   DEBUG] - paddlenlp commit id           : a286abc1063e516ed56b746fcca33bedce5fcef3
[2025-05-08 10:39:36,847] [   DEBUG] - aistudio_repo_id              : None
[2025-05-08 10:39:36,847] [   DEBUG] - aistudio_repo_license         : Apache License 2.0
[2025-05-08 10:39:36,847] [   DEBUG] - aistudio_repo_private         : True
[2025-05-08 10:39:36,847] [   DEBUG] - aistudio_token                : None
[2025-05-08 10:39:36,847] [   DEBUG] - attention_probs_dropout_prob  : 0.1
[2025-05-08 10:39:36,847] [   DEBUG] - continue_training             : True
[2025-05-08 10:39:36,847] [   DEBUG] - flash_mask                    : False
[2025-05-08 10:39:36,847] [   DEBUG] - from_aistudio                 : False
[2025-05-08 10:39:36,847] [   DEBUG] - fuse_attention_ffn            : None
[2025-05-08 10:39:36,847] [   DEBUG] - fuse_attention_qkv            : None
[2025-05-08 10:39:36,847] [   DEBUG] - hidden_dropout_prob           : 0.1
[2025-05-08 10:39:36,848] [   DEBUG] - lokr                          : False
[2025-05-08 10:39:36,848] [   DEBUG] - lokr_dim                      : 8
[2025-05-08 10:39:36,848] [   DEBUG] - lokr_path                     : None
[2025-05-08 10:39:36,848] [   DEBUG] - lora                          : False
[2025-05-08 10:39:36,848] [   DEBUG] - lora_path                     : None
[2025-05-08 10:39:36,848] [   DEBUG] - lora_plus_scale               : 1.0
[2025-05-08 10:39:36,848] [   DEBUG] - lora_rank                     : 8
[2025-05-08 10:39:36,848] [   DEBUG] - lora_use_mixer                : False
[2025-05-08 10:39:36,848] [   DEBUG] - model_name_or_path            : paddlenlp/PP-UIE-7B
[2025-05-08 10:39:36,848] [   DEBUG] - neftune                       : False
[2025-05-08 10:39:36,848] [   DEBUG] - neftune_noise_alpha           : 5.0
[2025-05-08 10:39:36,848] [   DEBUG] - num_prefix_tokens             : 128
[2025-05-08 10:39:36,848] [   DEBUG] - pissa                         : False
[2025-05-08 10:39:36,848] [   DEBUG] - prefix_path                   : None
[2025-05-08 10:39:36,848] [   DEBUG] - prefix_tuning                 : False
[2025-05-08 10:39:36,848] [   DEBUG] - reft                          : False
[2025-05-08 10:39:36,848] [   DEBUG] - rope_scaling_factor           : 1.0
[2025-05-08 10:39:36,848] [   DEBUG] - rslora                        : False
[2025-05-08 10:39:36,848] [   DEBUG] - save_to_aistudio              : False
[2025-05-08 10:39:36,848] [   DEBUG] - strategy_name                 : None
[2025-05-08 10:39:36,848] [   DEBUG] - strategy_type                 : None
[2025-05-08 10:39:36,848] [   DEBUG] - tokenizer_name_or_path        : None
[2025-05-08 10:39:36,848] [   DEBUG] - use_fast_layer_norm           : False
[2025-05-08 10:39:36,848] [   DEBUG] - use_long_sequence_strategies  : False
[2025-05-08 10:39:36,848] [   DEBUG] - use_mora                      : False
[2025-05-08 10:39:36,848] [   DEBUG] - use_quick_lora                : False
[2025-05-08 10:39:36,849] [   DEBUG] - vera                          : False
[2025-05-08 10:39:36,849] [   DEBUG] - vera_rank                     : 8
[2025-05-08 10:39:36,849] [   DEBUG] - weight_blocksize              : 64
[2025-05-08 10:39:36,849] [   DEBUG] - weight_double_quant           : False
[2025-05-08 10:39:36,849] [   DEBUG] - weight_double_quant_block_size: 256
[2025-05-08 10:39:36,849] [   DEBUG] - weight_quantize_algo          : None
[2025-05-08 10:39:36,849] [   DEBUG] - 
[2025-05-08 10:39:36,849] [   DEBUG] - ============================================================
[2025-05-08 10:39:36,849] [   DEBUG] -       Data Configuration Arguments      
[2025-05-08 10:39:36,849] [   DEBUG] - paddle commit id              : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc
[2025-05-08 10:39:36,849] [   DEBUG] - paddlenlp commit id           : a286abc1063e516ed56b746fcca33bedce5fcef3
[2025-05-08 10:39:36,849] [   DEBUG] - autoregressive                : False
[2025-05-08 10:39:36,849] [   DEBUG] - chat_template                 : None
[2025-05-08 10:39:36,849] [   DEBUG] - dataset_name_or_path          : ./application/information_extraction/data
[2025-05-08 10:39:36,849] [   DEBUG] - eval_with_do_generation       : False
[2025-05-08 10:39:36,849] [   DEBUG] - greedy_zero_padding           : False
[2025-05-08 10:39:36,849] [   DEBUG] - lazy                          : False
[2025-05-08 10:39:36,849] [   DEBUG] - max_length                    : 2048
[2025-05-08 10:39:36,849] [   DEBUG] - pad_to_max_length             : False
[2025-05-08 10:39:36,849] [   DEBUG] - pad_to_multiple_of            : None
[2025-05-08 10:39:36,849] [   DEBUG] - save_generation_output        : False
[2025-05-08 10:39:36,849] [   DEBUG] - src_length                    : 1024
[2025-05-08 10:39:36,849] [   DEBUG] - task_name                     : None
[2025-05-08 10:39:36,849] [   DEBUG] - use_pose_convert              : False
[2025-05-08 10:39:36,849] [   DEBUG] - zero_padding                  : False
[2025-05-08 10:39:36,849] [   DEBUG] - 
[2025-05-08 10:39:36,850] [   DEBUG] - ============================================================
[2025-05-08 10:39:36,850] [   DEBUG] -    Generation Configuration Arguments   
[2025-05-08 10:39:36,850] [   DEBUG] - paddle commit id              : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc
[2025-05-08 10:39:36,850] [   DEBUG] - paddlenlp commit id           : a286abc1063e516ed56b746fcca33bedce5fcef3
[2025-05-08 10:39:36,850] [   DEBUG] - top_k                         : 1
[2025-05-08 10:39:36,850] [   DEBUG] - top_p                         : 1.0
[2025-05-08 10:39:36,850] [   DEBUG] - 
[2025-05-08 10:39:36,850] [    INFO] - The global seed is set to 42, local seed is set to 43 and random seed is set to 42.
[2025-05-08 10:39:36,850] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: True
[2025-05-08 10:39:36,850] [    INFO] - Loading configuration file /home/jovyan/.paddlenlp/models/paddlenlp/PP-UIE-7B/config.json
[2025-05-08 10:39:36,851] [    INFO] - Final model config: Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "context_parallel_degree": -1,
  "dpo_config": null,
  "dtype": "float16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "pad_token_id": 0,
  "paddlenlp_version": "3.0.0b4",
  "pipeline_parallel_degree": -1,
  "refined_recompute": {},
  "rms_norm_eps": 1e-06,
  "rope_scaling_factor": 1.0,
  "rope_scaling_type": null,
  "rope_theta": 1000000.0,
  "sep_parallel_degree": -1,
  "seq_length": 2048,
  "sliding_window": 131072,
  "tensor_parallel_degree": -1,
  "tensor_parallel_output": false,
  "tie_word_embeddings": false,
  "use_fast_layer_norm": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[2025-05-08 10:39:36,852] [    INFO] - Creating model
[2025-05-08 10:39:36,852] [    INFO] - We are using <class 'paddlenlp.transformers.qwen2.modeling.Qwen2ForCausalLM'> to load 'paddlenlp/PP-UIE-7B'.
[2025-05-08 10:39:36,852] [    INFO] - Loading weights file from cache at /home/jovyan/.paddlenlp/models/paddlenlp/PP-UIE-7B/model.safetensors.index.json

Downloading shards:   0%|                                                                                                                                          | 0/4 [00:00<?, ?it/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 44384.17it/s]
W0508 10:39:37.002657 726199 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.3, Runtime API Version: 12.6
W0508 10:39:37.003424 726199 gpu_resources.cc:164] device: 0, cuDNN Version: 9.5.
W0508 10:39:37.003436 726199 gpu_resources.cc:196] WARNING: device: 0. The installed Paddle is compiled with CUDA 12.6, but CUDA runtime version in your machine is 12.3, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDA version.

Loading checkpoint shards:   0%|                                                                                                                                   | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██████████████████████████████▊                                                                                            | 1/4 [00:10<00:30, 10.15s/it]
Loading checkpoint shards:  50%|█████████████████████████████████████████████████████████████▌                                                             | 2/4 [00:20<00:20, 10.24s/it]
Loading checkpoint shards:  75%|████████████████████████████████████████████████████████████████████████████████████████████▎                              | 3/4 [00:30<00:10, 10.17s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:40<00:00, 10.19s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:40<00:00, 10.19s/it]
[2025-05-08 10:40:28,104] [    INFO] - All model checkpoint weights were used when initializing Qwen2ForCausalLM.

[2025-05-08 10:40:28,104] [    INFO] - All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at paddlenlp/PP-UIE-7B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[2025-05-08 10:40:28,106] [    INFO] - Loading configuration file /home/jovyan/.paddlenlp/models/paddlenlp/PP-UIE-7B/generation_config.json
[2025-05-08 10:40:28,166] [    INFO] - The `unk_token` parameter needs to be defined: we use `eos_token` by default.
[2025-05-08 10:40:28,360] [    INFO] - load train
[2025-05-08 10:40:28,381] [    INFO] - load eval
[2025-05-08 10:40:28,385] [    INFO] - load test
[2025-05-08 10:40:28,385] [    INFO] - Trans the dataset text into token ids, please wait for a moment.
[2025-05-08 10:40:28,385] [    INFO] - The global seed is set to 42, local seed is set to 43 and random seed is set to 42.
[2025-05-08 10:40:28,453] [    INFO] - Using half precision
[2025-05-08 10:40:28,471] [   DEBUG] - ============================================================
[2025-05-08 10:40:28,471] [   DEBUG] -     Training Configuration Arguments    
[2025-05-08 10:40:28,471] [   DEBUG] - paddle commit id              : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc
[2025-05-08 10:40:28,471] [   DEBUG] - paddlenlp commit id           : a286abc1063e516ed56b746fcca33bedce5fcef3
[2025-05-08 10:40:28,471] [   DEBUG] - _no_sync_in_gradient_accumulation: True
[2025-05-08 10:40:28,471] [   DEBUG] - adam_beta1                    : 0.9
[2025-05-08 10:40:28,472] [   DEBUG] - adam_beta2                    : 0.999
[2025-05-08 10:40:28,472] [   DEBUG] - adam_epsilon                  : 1e-08
[2025-05-08 10:40:28,472] [   DEBUG] - amp_custom_black_list         : None
[2025-05-08 10:40:28,472] [   DEBUG] - amp_custom_white_list         : None
[2025-05-08 10:40:28,472] [   DEBUG] - amp_master_grad               : False
[2025-05-08 10:40:28,472] [   DEBUG] - auto_parallel_resume_form_hybrid_parallel: False
[2025-05-08 10:40:28,472] [   DEBUG] - autotuner_benchmark           : False
[2025-05-08 10:40:28,472] [   DEBUG] - benchmark                     : False
[2025-05-08 10:40:28,472] [   DEBUG] - bf16                          : False
[2025-05-08 10:40:28,472] [   DEBUG] - bf16_full_eval                : False
[2025-05-08 10:40:28,472] [   DEBUG] - ckpt_quant_stage              : O0
[2025-05-08 10:40:28,472] [   DEBUG] - context_parallel_degree       : -1
[2025-05-08 10:40:28,472] [   DEBUG] - count_trained_tokens          : False
[2025-05-08 10:40:28,472] [   DEBUG] - current_device                : gpu:0
[2025-05-08 10:40:28,472] [   DEBUG] - data_parallel_config          : 
[2025-05-08 10:40:28,472] [   DEBUG] - data_parallel_degree          : 1
[2025-05-08 10:40:28,472] [   DEBUG] - data_parallel_rank            : 0
[2025-05-08 10:40:28,472] [   DEBUG] - dataloader_drop_last          : False
[2025-05-08 10:40:28,472] [   DEBUG] - dataloader_num_workers        : 0
[2025-05-08 10:40:28,472] [   DEBUG] - dataset_batch_size            : 1000
[2025-05-08 10:40:28,472] [   DEBUG] - dataset_kwargs                : {}
[2025-05-08 10:40:28,472] [   DEBUG] - dataset_num_proc              : None
[2025-05-08 10:40:28,472] [   DEBUG] - dataset_rank                  : 0
[2025-05-08 10:40:28,472] [   DEBUG] - dataset_text_field            : text
[2025-05-08 10:40:28,472] [   DEBUG] - dataset_world_size            : 1
[2025-05-08 10:40:28,473] [   DEBUG] - ddp_find_unused_parameters    : None
[2025-05-08 10:40:28,473] [   DEBUG] - decay_steps                   : 0
[2025-05-08 10:40:28,473] [   DEBUG] - device                        : gpu
[2025-05-08 10:40:28,473] [   DEBUG] - disable_tqdm                  : True
[2025-05-08 10:40:28,473] [   DEBUG] - distributed_dataloader        : False
[2025-05-08 10:40:28,473] [   DEBUG] - do_eval                       : True
[2025-05-08 10:40:28,473] [   DEBUG] - do_export                     : False
[2025-05-08 10:40:28,473] [   DEBUG] - do_predict                    : False
[2025-05-08 10:40:28,473] [   DEBUG] - do_train                      : True
[2025-05-08 10:40:28,473] [   DEBUG] - enable_auto_parallel          : False
[2025-05-08 10:40:28,473] [   DEBUG] - eval_accumulation_steps       : 8
[2025-05-08 10:40:28,473] [   DEBUG] - eval_batch_size               : 1
[2025-05-08 10:40:28,473] [   DEBUG] - eval_packing                  : None
[2025-05-08 10:40:28,473] [   DEBUG] - eval_steps                    : None
[2025-05-08 10:40:28,473] [   DEBUG] - evaluation_strategy           : IntervalStrategy.EPOCH
[2025-05-08 10:40:28,473] [   DEBUG] - expert_max_capacity           : 4294967296
[2025-05-08 10:40:28,473] [   DEBUG] - expert_min_capacity           : 1
[2025-05-08 10:40:28,473] [   DEBUG] - expert_parallel_degree        : -1
[2025-05-08 10:40:28,473] [   DEBUG] - expert_tensor_parallel_degree : -1
[2025-05-08 10:40:28,473] [   DEBUG] - flatten_param_grads           : False
[2025-05-08 10:40:28,473] [   DEBUG] - force_reshard_pp              : False
[2025-05-08 10:40:28,473] [   DEBUG] - fp16                          : True
[2025-05-08 10:40:28,473] [   DEBUG] - fp16_full_eval                : False
[2025-05-08 10:40:28,473] [   DEBUG] - fp16_opt_level                : O2
[2025-05-08 10:40:28,473] [   DEBUG] - fuse_sequence_parallel_allreduce: False
[2025-05-08 10:40:28,473] [   DEBUG] - gradient_accumulation_steps   : 1
[2025-05-08 10:40:28,474] [   DEBUG] - greater_is_better             : True
[2025-05-08 10:40:28,474] [   DEBUG] - hybrid_parallel_topo_order    : pp_first
[2025-05-08 10:40:28,474] [   DEBUG] - ignore_data_skip              : False
[2025-05-08 10:40:28,474] [   DEBUG] - ignore_load_lr_and_optim      : False
[2025-05-08 10:40:28,474] [   DEBUG] - ignore_save_lr_and_optim      : False
[2025-05-08 10:40:28,474] [   DEBUG] - label_names                   : None
[2025-05-08 10:40:28,474] [   DEBUG] - lazy_data_processing          : True
[2025-05-08 10:40:28,474] [   DEBUG] - learning_rate                 : 3e-05
[2025-05-08 10:40:28,474] [   DEBUG] - load_best_model_at_end        : True
[2025-05-08 10:40:28,474] [   DEBUG] - load_sharded_model            : False
[2025-05-08 10:40:28,474] [   DEBUG] - local_process_index           : 0
[2025-05-08 10:40:28,474] [   DEBUG] - local_rank                    : -1
[2025-05-08 10:40:28,474] [   DEBUG] - log_level                     : -1
[2025-05-08 10:40:28,474] [   DEBUG] - log_level_replica             : -1
[2025-05-08 10:40:28,474] [   DEBUG] - log_on_each_node              : True
[2025-05-08 10:40:28,474] [   DEBUG] - logging_dir                   : ./checkpoints/ie_ckpts/runs/May08_10-39-36_9ee80a409cbe
[2025-05-08 10:40:28,474] [   DEBUG] - logging_first_step            : False
[2025-05-08 10:40:28,474] [   DEBUG] - logging_steps                 : 1
[2025-05-08 10:40:28,474] [   DEBUG] - logging_strategy              : IntervalStrategy.STEPS
[2025-05-08 10:40:28,474] [   DEBUG] - logical_process_index         : 0
[2025-05-08 10:40:28,474] [   DEBUG] - lr_end                        : 1e-07
[2025-05-08 10:40:28,474] [   DEBUG] - lr_scheduler_type             : SchedulerType.LINEAR
[2025-05-08 10:40:28,474] [   DEBUG] - max_evaluate_steps            : -1
[2025-05-08 10:40:28,474] [   DEBUG] - max_grad_norm                 : 1.0
[2025-05-08 10:40:28,474] [   DEBUG] - max_seq_length                : 2048
[2025-05-08 10:40:28,474] [   DEBUG] - max_steps                     : -1
[2025-05-08 10:40:28,475] [   DEBUG] - metric_for_best_model         : accuracy
[2025-05-08 10:40:28,475] [   DEBUG] - metrics_output_path           : None
[2025-05-08 10:40:28,475] [   DEBUG] - minimum_eval_times            : None
[2025-05-08 10:40:28,475] [   DEBUG] - model_init_kwargs             : None
[2025-05-08 10:40:28,475] [   DEBUG] - no_cuda                       : False
[2025-05-08 10:40:28,475] [   DEBUG] - no_recompute_layers           : None
[2025-05-08 10:40:28,475] [   DEBUG] - num_cycles                    : 0.5
[2025-05-08 10:40:28,475] [   DEBUG] - num_train_epochs              : 3.0
[2025-05-08 10:40:28,475] [   DEBUG] - offload_optim                 : False
[2025-05-08 10:40:28,475] [   DEBUG] - offload_recompute_inputs      : False
[2025-05-08 10:40:28,475] [   DEBUG] - optim                         : OptimizerNames.ADAMW
[2025-05-08 10:40:28,475] [   DEBUG] - optimizer_name_suffix         : None
[2025-05-08 10:40:28,475] [   DEBUG] - ordered_save_group_size       : 0
[2025-05-08 10:40:28,475] [   DEBUG] - output_dir                    : ./checkpoints/ie_ckpts
[2025-05-08 10:40:28,475] [   DEBUG] - output_signal_dir             : ./checkpoints/ie_ckpts
[2025-05-08 10:40:28,475] [   DEBUG] - overwrite_output_dir          : False
[2025-05-08 10:40:28,475] [   DEBUG] - pad_token_id                  : 0
[2025-05-08 10:40:28,475] [   DEBUG] - past_index                    : -1
[2025-05-08 10:40:28,475] [   DEBUG] - pdc_download_ckpt             : False
[2025-05-08 10:40:28,475] [   DEBUG] - pdc_download_timeout          : 300
[2025-05-08 10:40:28,475] [   DEBUG] - per_device_eval_batch_size    : 1
[2025-05-08 10:40:28,475] [   DEBUG] - per_device_train_batch_size   : 1
[2025-05-08 10:40:28,475] [   DEBUG] - pipeline_parallel_config      : 
[2025-05-08 10:40:28,475] [   DEBUG] - pipeline_parallel_degree      : -1
[2025-05-08 10:40:28,475] [   DEBUG] - pipeline_parallel_rank        : 0
[2025-05-08 10:40:28,475] [   DEBUG] - power                         : 1.0
[2025-05-08 10:40:28,476] [   DEBUG] - pp_recompute_interval         : 1
[2025-05-08 10:40:28,476] [   DEBUG] - prediction_loss_only          : False
[2025-05-08 10:40:28,476] [   DEBUG] - process_index                 : 0
[2025-05-08 10:40:28,476] [   DEBUG] - recompute                     : False
[2025-05-08 10:40:28,476] [   DEBUG] - recompute_granularity         : full
[2025-05-08 10:40:28,476] [   DEBUG] - recompute_use_reentrant       : False
[2025-05-08 10:40:28,476] [   DEBUG] - refined_recompute             : {}
[2025-05-08 10:40:28,476] [   DEBUG] - release_grads                 : False
[2025-05-08 10:40:28,476] [   DEBUG] - remove_unused_columns         : True
[2025-05-08 10:40:28,476] [   DEBUG] - report_to                     : ['visualdl']
[2025-05-08 10:40:28,476] [   DEBUG] - resume_from_checkpoint        : None
[2025-05-08 10:40:28,476] [   DEBUG] - run_name                      : ./checkpoints/ie_ckpts
[2025-05-08 10:40:28,476] [   DEBUG] - save_on_each_node             : False
[2025-05-08 10:40:28,476] [   DEBUG] - save_sharded_model            : False
[2025-05-08 10:40:28,476] [   DEBUG] - save_sharding_stage1_model_include_freeze_params: False
[2025-05-08 10:40:28,476] [   DEBUG] - save_steps                    : 500
[2025-05-08 10:40:28,476] [   DEBUG] - save_strategy                 : IntervalStrategy.EPOCH
[2025-05-08 10:40:28,476] [   DEBUG] - save_total_limit              : 1
[2025-05-08 10:40:28,476] [   DEBUG] - scale_loss                    : 32768
[2025-05-08 10:40:28,476] [   DEBUG] - seed                          : 42
[2025-05-08 10:40:28,476] [   DEBUG] - sep_parallel_degree           : -1
[2025-05-08 10:40:28,476] [   DEBUG] - sequence_parallel             : False
[2025-05-08 10:40:28,476] [   DEBUG] - sequence_parallel_config      : 
[2025-05-08 10:40:28,476] [   DEBUG] - sharding                      : []
[2025-05-08 10:40:28,476] [   DEBUG] - sharding_comm_buffer_size_MB  : -1
[2025-05-08 10:40:28,476] [   DEBUG] - sharding_degree               : -1
[2025-05-08 10:40:28,477] [   DEBUG] - sharding_parallel_config      : 
[2025-05-08 10:40:28,477] [   DEBUG] - sharding_parallel_degree      : -1
[2025-05-08 10:40:28,477] [   DEBUG] - sharding_parallel_mesh_dimension: dp
[2025-05-08 10:40:28,477] [   DEBUG] - sharding_parallel_rank        : 0
[2025-05-08 10:40:28,477] [   DEBUG] - should_load_dataset           : True
[2025-05-08 10:40:28,477] [   DEBUG] - should_load_sharding_stage1_model: False
[2025-05-08 10:40:28,477] [   DEBUG] - should_log                    : True
[2025-05-08 10:40:28,477] [   DEBUG] - should_save                   : True
[2025-05-08 10:40:28,477] [   DEBUG] - should_save_model_state       : True
[2025-05-08 10:40:28,477] [   DEBUG] - should_save_model_with_tensor_fusion: False
[2025-05-08 10:40:28,477] [   DEBUG] - should_save_sharding_stage1_model: False
[2025-05-08 10:40:28,477] [   DEBUG] - skip_data_intervals           : None
[2025-05-08 10:40:28,477] [   DEBUG] - skip_memory_metrics           : True
[2025-05-08 10:40:28,477] [   DEBUG] - skip_profile_timer            : True
[2025-05-08 10:40:28,477] [   DEBUG] - split_inputs_sequence_dim     : True
[2025-05-08 10:40:28,477] [   DEBUG] - ssa_group_size_ratio          : 0.25
[2025-05-08 10:40:28,477] [   DEBUG] - tensor_parallel_config        : 
[2025-05-08 10:40:28,477] [   DEBUG] - tensor_parallel_degree        : -1
[2025-05-08 10:40:28,477] [   DEBUG] - tensor_parallel_output        : False
[2025-05-08 10:40:28,477] [   DEBUG] - tensor_parallel_rank          : 0
[2025-05-08 10:40:28,477] [   DEBUG] - to_static                     : False
[2025-05-08 10:40:28,477] [   DEBUG] - train_batch_size              : 1
[2025-05-08 10:40:28,477] [   DEBUG] - unified_checkpoint            : True
[2025-05-08 10:40:28,477] [   DEBUG] - unified_checkpoint_config     : ['']
[2025-05-08 10:40:28,477] [   DEBUG] - use_async_save                : False
[2025-05-08 10:40:28,477] [   DEBUG] - use_expert_parallel           : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_flash_attention           : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_fused_dropout_add         : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_fused_linear              : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_fused_linear_cross_entropy: False
[2025-05-08 10:40:28,478] [   DEBUG] - use_fused_rms_norm            : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_fused_rope                : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_hybrid_parallel           : False
[2025-05-08 10:40:28,478] [   DEBUG] - use_ssa                       : False
[2025-05-08 10:40:28,478] [   DEBUG] - virtual_pp_degree             : 1
[2025-05-08 10:40:28,478] [   DEBUG] - wandb_api_key                 : None
[2025-05-08 10:40:28,478] [   DEBUG] - wandb_http_proxy              : None
[2025-05-08 10:40:28,478] [   DEBUG] - warmup_ratio                  : 0.0
[2025-05-08 10:40:28,478] [   DEBUG] - warmup_steps                  : 30
[2025-05-08 10:40:28,478] [   DEBUG] - weight_decay                  : 0.0
[2025-05-08 10:40:28,478] [   DEBUG] - weight_name_suffix            : None
[2025-05-08 10:40:28,478] [   DEBUG] - world_size                    : 1
[2025-05-08 10:40:28,478] [   DEBUG] - 
[2025-05-08 10:40:28,479] [    INFO] - Starting training from resume_from_checkpoint : None
[2025-05-08 10:40:28,480] [    INFO] - [timelog] checkpoint loading time: 0.00s (2025-05-08 10:40:28) 
[2025-05-08 10:40:28,480] [    INFO] - ***** Running training *****
[2025-05-08 10:40:28,480] [    INFO] -   Num examples = 176
[2025-05-08 10:40:28,480] [    INFO] -   Num Epochs = 3
[2025-05-08 10:40:28,481] [    INFO] -   Instantaneous batch size per device = 1
[2025-05-08 10:40:28,481] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 1
[2025-05-08 10:40:28,481] [    INFO] -   Gradient Accumulation steps = 1
[2025-05-08 10:40:28,481] [    INFO] -   Total optimization steps = 528
[2025-05-08 10:40:28,481] [    INFO] -   Total num train samples = 528
[2025-05-08 10:40:28,483] [   DEBUG] -   Number of trainable parameters = 7,615,616,512 (per device)
W0508 10:40:29.779784 726199 multiply_fwd_func.cc:76] got different data type, run type promotion automatically, this may cause data type been changed.
Traceback (most recent call last):
  File "/mnt/data/lyy/mzx/PaddleNLP/llm/run_finetune.py", line 723, in <module>
    main()
  File "/mnt/data/lyy/mzx/PaddleNLP/llm/run_finetune.py", line 458, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 892, in train
    return self._inner_training_loop(
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1232, in _inner_training_loop
    self.scaler.step(self.optimizer)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/amp/grad_scaler.py", line 848, in step
    optimizer.step()
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 386, in __impl__
    return func(*args, **kwargs)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 40, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/framework.py", line 718, in __impl__
    return func(*args, **kwargs)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 684, in step
    optimize_ops = self._apply_optimize(
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1685, in _apply_optimize
    optimize_ops = self._create_optimization_pass(
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1319, in _create_optimization_pass
    self._create_accumulators(
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 453, in _create_accumulators
    self._add_moments_pows(master_p)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 409, in _add_moments_pows
    self._add_accumulator(self._moment1_acc_str, p, dtype=acc_dtype)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1104, in _add_accumulator
    self.helper.set_variable_initializer(
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/layer_helper_base.py", line 589, in set_variable_initializer
    initializer(var, self.main_program.global_block())
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/nn/initializer/initializer.py", line 69, in __call__
    return self.forward(param, block)
  File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/nn/initializer/constant.py", line 91, in forward
    _C_ops.full_(
MemoryError: 

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_full_(_object*, _object*, _object*)
1   full__ad_func(paddle::Tensor&, paddle::experimental::IntArrayBase<paddle::Tensor>, paddle::experimental::ScalarBase<paddle::Tensor>, phi::DataType, phi::Place)
2   paddle::experimental::full_(paddle::Tensor&, paddle::experimental::IntArrayBase<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, phi::DataType, phi::Place const&)
3   void phi::FullKernel<float, phi::GPUContext>(phi::GPUContext const&, paddle::experimental::IntArrayBase<phi::DenseTensor> const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::DataType, phi::DenseTensor*)
4   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
6   paddle::memory::allocation::Allocator::Allocate(unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
11  common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 259.000000MB memory on GPU 0, 79.306213GB memory has been allocated and available memory is only 19.250000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
 (at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:71)

LAUNCH INFO 2025-05-08 10:40:44,265 Pod failed
LAUNCH ERROR 2025-05-08 10:40:44,265 Container failed !!!
Container rank 0 status failed cmd ['/home/jovyan/.conda/envs/mzx/bin/python', '-u', 'run_finetune.py', './config/pp-uie/sft_argument.json'] code 1 log log/workerlog.0
LAUNCH INFO 2025-05-08 10:40:44,265 ------------------------- ERROR LOG DETAIL -------------------------
[... the worker log detail repeats the Python traceback, C++ traceback, and ResourceExhaustedError already shown above ...]

LAUNCH INFO 2025-05-08 10:40:44,266 Exit code 1
Gnem-zx added the question (Further information is requested) label on May 8, 2025
JunnYu (Member) commented Jun 3, 2025

Hi, you can try enabling flash attention and training with bf16 here. You only need to add these two arguments to the script: --use_flash_attention 1 --bf16 1
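
For reference, the same change expressed in the JSON config above would presumably be the fragment below (key names mirror the existing config; fp16 and bf16 are normally mutually exclusive, so fp16 is switched off; this is a sketch, not an officially documented snippet):

"fp16": false,
"bf16": true,
"use_flash_attention": true

The remaining keys can stay as in the original config, and the same settings can also be passed on the command line as suggested in the reply (--use_flash_attention 1 --bf16 1).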
