This is the configuration file, taken straight from the official documentation. I am running on a single A800 with 80 GB of GPU memory:
{
    "model_name_or_path": "paddlenlp/PP-UIE-7B",
    "dataset_name_or_path": "./application/information_extraction/data",
    "output_dir": "./checkpoints/ie_ckpts",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "per_device_eval_batch_size": 1,
    "eval_accumulation_steps": 8,
    "num_train_epochs": 3,
    "learning_rate": 3e-05,
    "warmup_steps": 30,
    "logging_steps": 1,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
    "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
    "do_eval": true,
    "disable_tqdm": true,
    "load_best_model_at_end": true,
    "eval_with_do_generation": false,
    "metric_for_best_model": "accuracy",
    "recompute": false,
    "save_total_limit": 1,
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "sharding": "stage1",
    "zero_padding": false,
    "unified_checkpoint": true,
    "use_flash_attention": false
}
Here is the error log:
jovyan@9ee80a409cbe:/mnt/data/lyy/mzx/PaddleNLP/llm$ python -u -m paddle.distributed.launch run_finetune.py ./config/pp-uie/sft_argument.json /home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md warnings.warn(warning_message) LAUNCH INFO 2025-05-08 10:39:34,154 ----------- Configuration ---------------------- LAUNCH INFO 2025-05-08 10:39:34,154 auto_cluster_config: 0 LAUNCH INFO 2025-05-08 10:39:34,154 auto_parallel_config: None LAUNCH INFO 2025-05-08 10:39:34,154 auto_tuner_json: None LAUNCH INFO 2025-05-08 10:39:34,154 devices: None LAUNCH INFO 2025-05-08 10:39:34,154 elastic_level: -1 LAUNCH INFO 2025-05-08 10:39:34,154 elastic_timeout: 30 LAUNCH INFO 2025-05-08 10:39:34,154 enable_gpu_log: True LAUNCH INFO 2025-05-08 10:39:34,154 gloo_port: 6767 LAUNCH INFO 2025-05-08 10:39:34,154 host: None LAUNCH INFO 2025-05-08 10:39:34,154 ips: None LAUNCH INFO 2025-05-08 10:39:34,154 job_id: default LAUNCH INFO 2025-05-08 10:39:34,154 legacy: False LAUNCH INFO 2025-05-08 10:39:34,154 log_dir: log LAUNCH INFO 2025-05-08 10:39:34,154 log_level: INFO LAUNCH INFO 2025-05-08 10:39:34,154 log_overwrite: False LAUNCH INFO 2025-05-08 10:39:34,154 master: None LAUNCH INFO 2025-05-08 10:39:34,154 max_restart: 3 LAUNCH INFO 2025-05-08 10:39:34,154 nnodes: 1 LAUNCH INFO 2025-05-08 10:39:34,154 nproc_per_node: None LAUNCH INFO 2025-05-08 10:39:34,154 rank: -1 LAUNCH INFO 2025-05-08 10:39:34,154 run_mode: collective LAUNCH INFO 2025-05-08 10:39:34,154 server_num: None LAUNCH INFO 2025-05-08 10:39:34,154 servers: LAUNCH INFO 2025-05-08 10:39:34,154 sort_ip: False LAUNCH INFO 2025-05-08 10:39:34,154 start_port: 6070 LAUNCH INFO 2025-05-08 10:39:34,154 trainer_num: None LAUNCH INFO 2025-05-08 10:39:34,154 trainers: LAUNCH INFO 2025-05-08 
10:39:34,154 training_script: run_finetune.py LAUNCH INFO 2025-05-08 10:39:34,154 training_script_args: ['./config/pp-uie/sft_argument.json'] LAUNCH INFO 2025-05-08 10:39:34,154 with_gloo: 1 LAUNCH INFO 2025-05-08 10:39:34,154 -------------------------------------------------- LAUNCH INFO 2025-05-08 10:39:34,155 Job: default, mode collective, replicas 1[1:1], elastic False LAUNCH INFO 2025-05-08 10:39:34,156 Run Pod: tvtjzi, replicas 1, status ready LAUNCH INFO 2025-05-08 10:39:34,183 Watching Pod: tvtjzi, replicas 1, status running /home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md warnings.warn(warning_message) /home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml warnings.warn( [2025-05-08 10:39:36,847] [ INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-). 
[2025-05-08 10:39:36,847] [ DEBUG] - ============================================================ [2025-05-08 10:39:36,847] [ DEBUG] - Model Configuration Arguments [2025-05-08 10:39:36,847] [ DEBUG] - paddle commit id : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc [2025-05-08 10:39:36,847] [ DEBUG] - paddlenlp commit id : a286abc1063e516ed56b746fcca33bedce5fcef3 [2025-05-08 10:39:36,847] [ DEBUG] - aistudio_repo_id : None [2025-05-08 10:39:36,847] [ DEBUG] - aistudio_repo_license : Apache License 2.0 [2025-05-08 10:39:36,847] [ DEBUG] - aistudio_repo_private : True [2025-05-08 10:39:36,847] [ DEBUG] - aistudio_token : None [2025-05-08 10:39:36,847] [ DEBUG] - attention_probs_dropout_prob : 0.1 [2025-05-08 10:39:36,847] [ DEBUG] - continue_training : True [2025-05-08 10:39:36,847] [ DEBUG] - flash_mask : False [2025-05-08 10:39:36,847] [ DEBUG] - from_aistudio : False [2025-05-08 10:39:36,847] [ DEBUG] - fuse_attention_ffn : None [2025-05-08 10:39:36,847] [ DEBUG] - fuse_attention_qkv : None [2025-05-08 10:39:36,847] [ DEBUG] - hidden_dropout_prob : 0.1 [2025-05-08 10:39:36,848] [ DEBUG] - lokr : False [2025-05-08 10:39:36,848] [ DEBUG] - lokr_dim : 8 [2025-05-08 10:39:36,848] [ DEBUG] - lokr_path : None [2025-05-08 10:39:36,848] [ DEBUG] - lora : False [2025-05-08 10:39:36,848] [ DEBUG] - lora_path : None [2025-05-08 10:39:36,848] [ DEBUG] - lora_plus_scale : 1.0 [2025-05-08 10:39:36,848] [ DEBUG] - lora_rank : 8 [2025-05-08 10:39:36,848] [ DEBUG] - lora_use_mixer : False [2025-05-08 10:39:36,848] [ DEBUG] - model_name_or_path : paddlenlp/PP-UIE-7B [2025-05-08 10:39:36,848] [ DEBUG] - neftune : False [2025-05-08 10:39:36,848] [ DEBUG] - neftune_noise_alpha : 5.0 [2025-05-08 10:39:36,848] [ DEBUG] - num_prefix_tokens : 128 [2025-05-08 10:39:36,848] [ DEBUG] - pissa : False [2025-05-08 10:39:36,848] [ DEBUG] - prefix_path : None [2025-05-08 10:39:36,848] [ DEBUG] - prefix_tuning : False [2025-05-08 10:39:36,848] [ DEBUG] - reft : False [2025-05-08 10:39:36,848] [ DEBUG] 
- rope_scaling_factor : 1.0 [2025-05-08 10:39:36,848] [ DEBUG] - rslora : False [2025-05-08 10:39:36,848] [ DEBUG] - save_to_aistudio : False [2025-05-08 10:39:36,848] [ DEBUG] - strategy_name : None [2025-05-08 10:39:36,848] [ DEBUG] - strategy_type : None [2025-05-08 10:39:36,848] [ DEBUG] - tokenizer_name_or_path : None [2025-05-08 10:39:36,848] [ DEBUG] - use_fast_layer_norm : False [2025-05-08 10:39:36,848] [ DEBUG] - use_long_sequence_strategies : False [2025-05-08 10:39:36,848] [ DEBUG] - use_mora : False [2025-05-08 10:39:36,848] [ DEBUG] - use_quick_lora : False [2025-05-08 10:39:36,849] [ DEBUG] - vera : False [2025-05-08 10:39:36,849] [ DEBUG] - vera_rank : 8 [2025-05-08 10:39:36,849] [ DEBUG] - weight_blocksize : 64 [2025-05-08 10:39:36,849] [ DEBUG] - weight_double_quant : False [2025-05-08 10:39:36,849] [ DEBUG] - weight_double_quant_block_size: 256 [2025-05-08 10:39:36,849] [ DEBUG] - weight_quantize_algo : None [2025-05-08 10:39:36,849] [ DEBUG] - [2025-05-08 10:39:36,849] [ DEBUG] - ============================================================ [2025-05-08 10:39:36,849] [ DEBUG] - Data Configuration Arguments [2025-05-08 10:39:36,849] [ DEBUG] - paddle commit id : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc [2025-05-08 10:39:36,849] [ DEBUG] - paddlenlp commit id : a286abc1063e516ed56b746fcca33bedce5fcef3 [2025-05-08 10:39:36,849] [ DEBUG] - autoregressive : False [2025-05-08 10:39:36,849] [ DEBUG] - chat_template : None [2025-05-08 10:39:36,849] [ DEBUG] - dataset_name_or_path : ./application/information_extraction/data [2025-05-08 10:39:36,849] [ DEBUG] - eval_with_do_generation : False [2025-05-08 10:39:36,849] [ DEBUG] - greedy_zero_padding : False [2025-05-08 10:39:36,849] [ DEBUG] - lazy : False [2025-05-08 10:39:36,849] [ DEBUG] - max_length : 2048 [2025-05-08 10:39:36,849] [ DEBUG] - pad_to_max_length : False [2025-05-08 10:39:36,849] [ DEBUG] - pad_to_multiple_of : None [2025-05-08 10:39:36,849] [ DEBUG] - save_generation_output : False 
[2025-05-08 10:39:36,849] [ DEBUG] - src_length : 1024 [2025-05-08 10:39:36,849] [ DEBUG] - task_name : None [2025-05-08 10:39:36,849] [ DEBUG] - use_pose_convert : False [2025-05-08 10:39:36,849] [ DEBUG] - zero_padding : False [2025-05-08 10:39:36,849] [ DEBUG] - [2025-05-08 10:39:36,850] [ DEBUG] - ============================================================ [2025-05-08 10:39:36,850] [ DEBUG] - Generation Configuration Arguments [2025-05-08 10:39:36,850] [ DEBUG] - paddle commit id : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc [2025-05-08 10:39:36,850] [ DEBUG] - paddlenlp commit id : a286abc1063e516ed56b746fcca33bedce5fcef3 [2025-05-08 10:39:36,850] [ DEBUG] - top_k : 1 [2025-05-08 10:39:36,850] [ DEBUG] - top_p : 1.0 [2025-05-08 10:39:36,850] [ DEBUG] - [2025-05-08 10:39:36,850] [ INFO] - The global seed is set to 42, local seed is set to 43 and random seed is set to 42. [2025-05-08 10:39:36,850] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: True [2025-05-08 10:39:36,850] [ INFO] - Loading configuration file /home/jovyan/.paddlenlp/models/paddlenlp/PP-UIE-7B/config.json [2025-05-08 10:39:36,851] [ INFO] - Final model config: Qwen2Config { "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "context_parallel_degree": -1, "dpo_config": null, "dtype": "float16", "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 32768, "max_window_layers": 28, "model_type": "qwen2", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "pad_token_id": 0, "paddlenlp_version": "3.0.0b4", "pipeline_parallel_degree": -1, "refined_recompute": {}, "rms_norm_eps": 1e-06, "rope_scaling_factor": 1.0, "rope_scaling_type": null, "rope_theta": 1000000.0, "sep_parallel_degree": -1, "seq_length": 2048, "sliding_window": 131072, "tensor_parallel_degree": -1, 
"tensor_parallel_output": false, "tie_word_embeddings": false, "use_fast_layer_norm": false, "use_sliding_window": false, "vocab_size": 152064 } [2025-05-08 10:39:36,852] [ INFO] - Creating model [2025-05-08 10:39:36,852] [ INFO] - We are using <class 'paddlenlp.transformers.qwen2.modeling.Qwen2ForCausalLM'> to load 'paddlenlp/PP-UIE-7B'. [2025-05-08 10:39:36,852] [ INFO] - Loading weights file from cache at /home/jovyan/.paddlenlp/models/paddlenlp/PP-UIE-7B/model.safetensors.index.json Downloading shards: 0%| | 0/4 [00:00<?, ?it/s] Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 44384.17it/s] W0508 10:39:37.002657 726199 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.3, Runtime API Version: 12.6 W0508 10:39:37.003424 726199 gpu_resources.cc:164] device: 0, cuDNN Version: 9.5. W0508 10:39:37.003436 726199 gpu_resources.cc:196] WARNING: device: 0. The installed Paddle is compiled with CUDA 12.6, but CUDA runtime version in your machine is 12.3, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDA version. 
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 25%|██████████████████████████████▊ | 1/4 [00:10<00:30, 10.15s/it] Loading checkpoint shards: 50%|█████████████████████████████████████████████████████████████▌ | 2/4 [00:20<00:20, 10.24s/it] Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████████████████████▎ | 3/4 [00:30<00:10, 10.17s/it] Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:40<00:00, 10.19s/it] Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:40<00:00, 10.19s/it] [2025-05-08 10:40:28,104] [ INFO] - All model checkpoint weights were used when initializing Qwen2ForCausalLM. [2025-05-08 10:40:28,104] [ INFO] - All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at paddlenlp/PP-UIE-7B. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training. [2025-05-08 10:40:28,106] [ INFO] - Loading configuration file /home/jovyan/.paddlenlp/models/paddlenlp/PP-UIE-7B/generation_config.json [2025-05-08 10:40:28,166] [ INFO] - The `unk_token` parameter needs to be defined: we use `eos_token` by default. [2025-05-08 10:40:28,360] [ INFO] - load train [2025-05-08 10:40:28,381] [ INFO] - load eval [2025-05-08 10:40:28,385] [ INFO] - load test [2025-05-08 10:40:28,385] [ INFO] - Trans the dataset text into token ids, please wait for a moment. [2025-05-08 10:40:28,385] [ INFO] - The global seed is set to 42, local seed is set to 43 and random seed is set to 42. 
[2025-05-08 10:40:28,453] [ INFO] - Using half precision [2025-05-08 10:40:28,471] [ DEBUG] - ============================================================ [2025-05-08 10:40:28,471] [ DEBUG] - Training Configuration Arguments [2025-05-08 10:40:28,471] [ DEBUG] - paddle commit id : 6ed5dd3833c32c3b21e14b1fb1a71f5a535a0fcc [2025-05-08 10:40:28,471] [ DEBUG] - paddlenlp commit id : a286abc1063e516ed56b746fcca33bedce5fcef3 [2025-05-08 10:40:28,471] [ DEBUG] - _no_sync_in_gradient_accumulation: True [2025-05-08 10:40:28,471] [ DEBUG] - adam_beta1 : 0.9 [2025-05-08 10:40:28,472] [ DEBUG] - adam_beta2 : 0.999 [2025-05-08 10:40:28,472] [ DEBUG] - adam_epsilon : 1e-08 [2025-05-08 10:40:28,472] [ DEBUG] - amp_custom_black_list : None [2025-05-08 10:40:28,472] [ DEBUG] - amp_custom_white_list : None [2025-05-08 10:40:28,472] [ DEBUG] - amp_master_grad : False [2025-05-08 10:40:28,472] [ DEBUG] - auto_parallel_resume_form_hybrid_parallel: False [2025-05-08 10:40:28,472] [ DEBUG] - autotuner_benchmark : False [2025-05-08 10:40:28,472] [ DEBUG] - benchmark : False [2025-05-08 10:40:28,472] [ DEBUG] - bf16 : False [2025-05-08 10:40:28,472] [ DEBUG] - bf16_full_eval : False [2025-05-08 10:40:28,472] [ DEBUG] - ckpt_quant_stage : O0 [2025-05-08 10:40:28,472] [ DEBUG] - context_parallel_degree : -1 [2025-05-08 10:40:28,472] [ DEBUG] - count_trained_tokens : False [2025-05-08 10:40:28,472] [ DEBUG] - current_device : gpu:0 [2025-05-08 10:40:28,472] [ DEBUG] - data_parallel_config : [2025-05-08 10:40:28,472] [ DEBUG] - data_parallel_degree : 1 [2025-05-08 10:40:28,472] [ DEBUG] - data_parallel_rank : 0 [2025-05-08 10:40:28,472] [ DEBUG] - dataloader_drop_last : False [2025-05-08 10:40:28,472] [ DEBUG] - dataloader_num_workers : 0 [2025-05-08 10:40:28,472] [ DEBUG] - dataset_batch_size : 1000 [2025-05-08 10:40:28,472] [ DEBUG] - dataset_kwargs : {} [2025-05-08 10:40:28,472] [ DEBUG] - dataset_num_proc : None [2025-05-08 10:40:28,472] [ DEBUG] - dataset_rank : 0 [2025-05-08 10:40:28,472] 
[ DEBUG] - dataset_text_field : text [2025-05-08 10:40:28,472] [ DEBUG] - dataset_world_size : 1 [2025-05-08 10:40:28,473] [ DEBUG] - ddp_find_unused_parameters : None [2025-05-08 10:40:28,473] [ DEBUG] - decay_steps : 0 [2025-05-08 10:40:28,473] [ DEBUG] - device : gpu [2025-05-08 10:40:28,473] [ DEBUG] - disable_tqdm : True [2025-05-08 10:40:28,473] [ DEBUG] - distributed_dataloader : False [2025-05-08 10:40:28,473] [ DEBUG] - do_eval : True [2025-05-08 10:40:28,473] [ DEBUG] - do_export : False [2025-05-08 10:40:28,473] [ DEBUG] - do_predict : False [2025-05-08 10:40:28,473] [ DEBUG] - do_train : True [2025-05-08 10:40:28,473] [ DEBUG] - enable_auto_parallel : False [2025-05-08 10:40:28,473] [ DEBUG] - eval_accumulation_steps : 8 [2025-05-08 10:40:28,473] [ DEBUG] - eval_batch_size : 1 [2025-05-08 10:40:28,473] [ DEBUG] - eval_packing : None [2025-05-08 10:40:28,473] [ DEBUG] - eval_steps : None [2025-05-08 10:40:28,473] [ DEBUG] - evaluation_strategy : IntervalStrategy.EPOCH [2025-05-08 10:40:28,473] [ DEBUG] - expert_max_capacity : 4294967296 [2025-05-08 10:40:28,473] [ DEBUG] - expert_min_capacity : 1 [2025-05-08 10:40:28,473] [ DEBUG] - expert_parallel_degree : -1 [2025-05-08 10:40:28,473] [ DEBUG] - expert_tensor_parallel_degree : -1 [2025-05-08 10:40:28,473] [ DEBUG] - flatten_param_grads : False [2025-05-08 10:40:28,473] [ DEBUG] - force_reshard_pp : False [2025-05-08 10:40:28,473] [ DEBUG] - fp16 : True [2025-05-08 10:40:28,473] [ DEBUG] - fp16_full_eval : False [2025-05-08 10:40:28,473] [ DEBUG] - fp16_opt_level : O2 [2025-05-08 10:40:28,473] [ DEBUG] - fuse_sequence_parallel_allreduce: False [2025-05-08 10:40:28,473] [ DEBUG] - gradient_accumulation_steps : 1 [2025-05-08 10:40:28,474] [ DEBUG] - greater_is_better : True [2025-05-08 10:40:28,474] [ DEBUG] - hybrid_parallel_topo_order : pp_first [2025-05-08 10:40:28,474] [ DEBUG] - ignore_data_skip : False [2025-05-08 10:40:28,474] [ DEBUG] - ignore_load_lr_and_optim : False [2025-05-08 10:40:28,474] [ 
DEBUG] - ignore_save_lr_and_optim : False [2025-05-08 10:40:28,474] [ DEBUG] - label_names : None [2025-05-08 10:40:28,474] [ DEBUG] - lazy_data_processing : True [2025-05-08 10:40:28,474] [ DEBUG] - learning_rate : 3e-05 [2025-05-08 10:40:28,474] [ DEBUG] - load_best_model_at_end : True [2025-05-08 10:40:28,474] [ DEBUG] - load_sharded_model : False [2025-05-08 10:40:28,474] [ DEBUG] - local_process_index : 0 [2025-05-08 10:40:28,474] [ DEBUG] - local_rank : -1 [2025-05-08 10:40:28,474] [ DEBUG] - log_level : -1 [2025-05-08 10:40:28,474] [ DEBUG] - log_level_replica : -1 [2025-05-08 10:40:28,474] [ DEBUG] - log_on_each_node : True [2025-05-08 10:40:28,474] [ DEBUG] - logging_dir : ./checkpoints/ie_ckpts/runs/May08_10-39-36_9ee80a409cbe [2025-05-08 10:40:28,474] [ DEBUG] - logging_first_step : False [2025-05-08 10:40:28,474] [ DEBUG] - logging_steps : 1 [2025-05-08 10:40:28,474] [ DEBUG] - logging_strategy : IntervalStrategy.STEPS [2025-05-08 10:40:28,474] [ DEBUG] - logical_process_index : 0 [2025-05-08 10:40:28,474] [ DEBUG] - lr_end : 1e-07 [2025-05-08 10:40:28,474] [ DEBUG] - lr_scheduler_type : SchedulerType.LINEAR [2025-05-08 10:40:28,474] [ DEBUG] - max_evaluate_steps : -1 [2025-05-08 10:40:28,474] [ DEBUG] - max_grad_norm : 1.0 [2025-05-08 10:40:28,474] [ DEBUG] - max_seq_length : 2048 [2025-05-08 10:40:28,474] [ DEBUG] - max_steps : -1 [2025-05-08 10:40:28,475] [ DEBUG] - metric_for_best_model : accuracy [2025-05-08 10:40:28,475] [ DEBUG] - metrics_output_path : None [2025-05-08 10:40:28,475] [ DEBUG] - minimum_eval_times : None [2025-05-08 10:40:28,475] [ DEBUG] - model_init_kwargs : None [2025-05-08 10:40:28,475] [ DEBUG] - no_cuda : False [2025-05-08 10:40:28,475] [ DEBUG] - no_recompute_layers : None [2025-05-08 10:40:28,475] [ DEBUG] - num_cycles : 0.5 [2025-05-08 10:40:28,475] [ DEBUG] - num_train_epochs : 3.0 [2025-05-08 10:40:28,475] [ DEBUG] - offload_optim : False [2025-05-08 10:40:28,475] [ DEBUG] - offload_recompute_inputs : False [2025-05-08 
10:40:28,475] [ DEBUG] - optim : OptimizerNames.ADAMW [2025-05-08 10:40:28,475] [ DEBUG] - optimizer_name_suffix : None [2025-05-08 10:40:28,475] [ DEBUG] - ordered_save_group_size : 0 [2025-05-08 10:40:28,475] [ DEBUG] - output_dir : ./checkpoints/ie_ckpts [2025-05-08 10:40:28,475] [ DEBUG] - output_signal_dir : ./checkpoints/ie_ckpts [2025-05-08 10:40:28,475] [ DEBUG] - overwrite_output_dir : False [2025-05-08 10:40:28,475] [ DEBUG] - pad_token_id : 0 [2025-05-08 10:40:28,475] [ DEBUG] - past_index : -1 [2025-05-08 10:40:28,475] [ DEBUG] - pdc_download_ckpt : False [2025-05-08 10:40:28,475] [ DEBUG] - pdc_download_timeout : 300 [2025-05-08 10:40:28,475] [ DEBUG] - per_device_eval_batch_size : 1 [2025-05-08 10:40:28,475] [ DEBUG] - per_device_train_batch_size : 1 [2025-05-08 10:40:28,475] [ DEBUG] - pipeline_parallel_config : [2025-05-08 10:40:28,475] [ DEBUG] - pipeline_parallel_degree : -1 [2025-05-08 10:40:28,475] [ DEBUG] - pipeline_parallel_rank : 0 [2025-05-08 10:40:28,475] [ DEBUG] - power : 1.0 [2025-05-08 10:40:28,476] [ DEBUG] - pp_recompute_interval : 1 [2025-05-08 10:40:28,476] [ DEBUG] - prediction_loss_only : False [2025-05-08 10:40:28,476] [ DEBUG] - process_index : 0 [2025-05-08 10:40:28,476] [ DEBUG] - recompute : False [2025-05-08 10:40:28,476] [ DEBUG] - recompute_granularity : full [2025-05-08 10:40:28,476] [ DEBUG] - recompute_use_reentrant : False [2025-05-08 10:40:28,476] [ DEBUG] - refined_recompute : {} [2025-05-08 10:40:28,476] [ DEBUG] - release_grads : False [2025-05-08 10:40:28,476] [ DEBUG] - remove_unused_columns : True [2025-05-08 10:40:28,476] [ DEBUG] - report_to : ['visualdl'] [2025-05-08 10:40:28,476] [ DEBUG] - resume_from_checkpoint : None [2025-05-08 10:40:28,476] [ DEBUG] - run_name : ./checkpoints/ie_ckpts [2025-05-08 10:40:28,476] [ DEBUG] - save_on_each_node : False [2025-05-08 10:40:28,476] [ DEBUG] - save_sharded_model : False [2025-05-08 10:40:28,476] [ DEBUG] - save_sharding_stage1_model_include_freeze_params: False 
[2025-05-08 10:40:28,476] [ DEBUG] - save_steps : 500 [2025-05-08 10:40:28,476] [ DEBUG] - save_strategy : IntervalStrategy.EPOCH [2025-05-08 10:40:28,476] [ DEBUG] - save_total_limit : 1 [2025-05-08 10:40:28,476] [ DEBUG] - scale_loss : 32768 [2025-05-08 10:40:28,476] [ DEBUG] - seed : 42 [2025-05-08 10:40:28,476] [ DEBUG] - sep_parallel_degree : -1 [2025-05-08 10:40:28,476] [ DEBUG] - sequence_parallel : False [2025-05-08 10:40:28,476] [ DEBUG] - sequence_parallel_config : [2025-05-08 10:40:28,476] [ DEBUG] - sharding : [] [2025-05-08 10:40:28,476] [ DEBUG] - sharding_comm_buffer_size_MB : -1 [2025-05-08 10:40:28,476] [ DEBUG] - sharding_degree : -1 [2025-05-08 10:40:28,477] [ DEBUG] - sharding_parallel_config : [2025-05-08 10:40:28,477] [ DEBUG] - sharding_parallel_degree : -1 [2025-05-08 10:40:28,477] [ DEBUG] - sharding_parallel_mesh_dimension: dp [2025-05-08 10:40:28,477] [ DEBUG] - sharding_parallel_rank : 0 [2025-05-08 10:40:28,477] [ DEBUG] - should_load_dataset : True [2025-05-08 10:40:28,477] [ DEBUG] - should_load_sharding_stage1_model: False [2025-05-08 10:40:28,477] [ DEBUG] - should_log : True [2025-05-08 10:40:28,477] [ DEBUG] - should_save : True [2025-05-08 10:40:28,477] [ DEBUG] - should_save_model_state : True [2025-05-08 10:40:28,477] [ DEBUG] - should_save_model_with_tensor_fusion: False [2025-05-08 10:40:28,477] [ DEBUG] - should_save_sharding_stage1_model: False [2025-05-08 10:40:28,477] [ DEBUG] - skip_data_intervals : None [2025-05-08 10:40:28,477] [ DEBUG] - skip_memory_metrics : True [2025-05-08 10:40:28,477] [ DEBUG] - skip_profile_timer : True [2025-05-08 10:40:28,477] [ DEBUG] - split_inputs_sequence_dim : True [2025-05-08 10:40:28,477] [ DEBUG] - ssa_group_size_ratio : 0.25 [2025-05-08 10:40:28,477] [ DEBUG] - tensor_parallel_config : [2025-05-08 10:40:28,477] [ DEBUG] - tensor_parallel_degree : -1 [2025-05-08 10:40:28,477] [ DEBUG] - tensor_parallel_output : False [2025-05-08 10:40:28,477] [ DEBUG] - tensor_parallel_rank : 0 
[2025-05-08 10:40:28,477] [ DEBUG] - to_static : False [2025-05-08 10:40:28,477] [ DEBUG] - train_batch_size : 1 [2025-05-08 10:40:28,477] [ DEBUG] - unified_checkpoint : True [2025-05-08 10:40:28,477] [ DEBUG] - unified_checkpoint_config : [''] [2025-05-08 10:40:28,477] [ DEBUG] - use_async_save : False [2025-05-08 10:40:28,477] [ DEBUG] - use_expert_parallel : False [2025-05-08 10:40:28,478] [ DEBUG] - use_flash_attention : False [2025-05-08 10:40:28,478] [ DEBUG] - use_fused_dropout_add : False [2025-05-08 10:40:28,478] [ DEBUG] - use_fused_linear : False [2025-05-08 10:40:28,478] [ DEBUG] - use_fused_linear_cross_entropy: False [2025-05-08 10:40:28,478] [ DEBUG] - use_fused_rms_norm : False [2025-05-08 10:40:28,478] [ DEBUG] - use_fused_rope : False [2025-05-08 10:40:28,478] [ DEBUG] - use_hybrid_parallel : False [2025-05-08 10:40:28,478] [ DEBUG] - use_ssa : False [2025-05-08 10:40:28,478] [ DEBUG] - virtual_pp_degree : 1 [2025-05-08 10:40:28,478] [ DEBUG] - wandb_api_key : None [2025-05-08 10:40:28,478] [ DEBUG] - wandb_http_proxy : None [2025-05-08 10:40:28,478] [ DEBUG] - warmup_ratio : 0.0 [2025-05-08 10:40:28,478] [ DEBUG] - warmup_steps : 30 [2025-05-08 10:40:28,478] [ DEBUG] - weight_decay : 0.0 [2025-05-08 10:40:28,478] [ DEBUG] - weight_name_suffix : None [2025-05-08 10:40:28,478] [ DEBUG] - world_size : 1 [2025-05-08 10:40:28,478] [ DEBUG] - [2025-05-08 10:40:28,479] [ INFO] - Starting training from resume_from_checkpoint : None [2025-05-08 10:40:28,480] [ INFO] - [timelog] checkpoint loading time: 0.00s (2025-05-08 10:40:28) [2025-05-08 10:40:28,480] [ INFO] - ***** Running training ***** [2025-05-08 10:40:28,480] [ INFO] - Num examples = 176 [2025-05-08 10:40:28,480] [ INFO] - Num Epochs = 3 [2025-05-08 10:40:28,481] [ INFO] - Instantaneous batch size per device = 1 [2025-05-08 10:40:28,481] [ INFO] - Total train batch size (w. 
parallel, distributed & accumulation) = 1 [2025-05-08 10:40:28,481] [ INFO] - Gradient Accumulation steps = 1 [2025-05-08 10:40:28,481] [ INFO] - Total optimization steps = 528 [2025-05-08 10:40:28,481] [ INFO] - Total num train samples = 528 [2025-05-08 10:40:28,483] [ DEBUG] - Number of trainable parameters = 7,615,616,512 (per device) W0508 10:40:29.779784 726199 multiply_fwd_func.cc:76] got different data type, run type promotion automatically, this may cause data type been changed. Traceback (most recent call last): File "/mnt/data/lyy/mzx/PaddleNLP/llm/run_finetune.py", line 723, in <module> main() File "/mnt/data/lyy/mzx/PaddleNLP/llm/run_finetune.py", line 458, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 892, in train return self._inner_training_loop( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1232, in _inner_training_loop self.scaler.step(self.optimizer) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/amp/grad_scaler.py", line 848, in step optimizer.step() File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/decorator.py", line 235, in fun return caller(func, *(extras + args), **kw) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 386, in __impl__ return func(*args, **kwargs) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/decorator.py", line 235, in fun return caller(func, *(extras + args), **kw) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 40, in __impl__ return wrapped_func(*args, **kwargs) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/framework.py", line 718, in __impl__ return func(*args, **kwargs) File 
"/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 684, in step optimize_ops = self._apply_optimize( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1685, in _apply_optimize optimize_ops = self._create_optimization_pass( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1319, in _create_optimization_pass self._create_accumulators( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 453, in _create_accumulators self._add_moments_pows(master_p) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 409, in _add_moments_pows self._add_accumulator(self._moment1_acc_str, p, dtype=acc_dtype) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1104, in _add_accumulator self.helper.set_variable_initializer( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/layer_helper_base.py", line 589, in set_variable_initializer initializer(var, self.main_program.global_block()) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/nn/initializer/initializer.py", line 69, in __call__ return self.forward(param, block) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/nn/initializer/constant.py", line 91, in forward _C_ops.full_( MemoryError: -------------------------------------- C++ Traceback (most recent call last): -------------------------------------- 0 paddle::pybind::eager_api_full_(_object*, _object*, _object*) 1 full__ad_func(paddle::Tensor&, paddle::experimental::IntArrayBase<paddle::Tensor>, paddle::experimental::ScalarBase<paddle::Tensor>, phi::DataType, phi::Place) 2 paddle::experimental::full_(paddle::Tensor&, paddle::experimental::IntArrayBase<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, 
phi::DataType, phi::Place const&) 3 void phi::FullKernel<float, phi::GPUContext>(phi::GPUContext const&, paddle::experimental::IntArrayBase<phi::DenseTensor> const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::DataType, phi::DenseTensor*) 4 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const 5 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool) 6 paddle::memory::allocation::Allocator::Allocate(unsigned long) 7 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long) 8 paddle::memory::allocation::Allocator::Allocate(unsigned long) 9 paddle::memory::allocation::Allocator::Allocate(unsigned long) 10 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int) 11 common::enforce::GetCurrentTraceBackString[abi:cxx11](bool) ---------------------- Error Message Summary: ---------------------- ResourceExhaustedError: Out of memory error on GPU 0. Cannot allocate 259.000000MB memory on GPU 0, 79.306213GB memory has been allocated and available memory is only 19.250000MB. Please check whether there is any other process using GPU 0. 1. If yes, please stop them, or start PaddlePaddle on another GPU. 2. If no, please decrease the batch size of your model. (at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:71) LAUNCH INFO 2025-05-08 10:40:44,265 Pod failed LAUNCH ERROR 2025-05-08 10:40:44,265 Container failed !!! 
Container rank 0 status failed cmd ['/home/jovyan/.conda/envs/mzx/bin/python', '-u', 'run_finetune.py', './config/pp-uie/sft_argument.json'] code 1 log log/workerlog.0 LAUNCH INFO 2025-05-08 10:40:44,265 ------------------------- ERROR LOG DETAIL ------------------------- in _create_optimization_pass self._create_accumulators( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 453, in _create_accumulators self._add_moments_pows(master_p) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/adamw.py", line 409, in _add_moments_pows self._add_accumulator(self._moment1_acc_str, p, dtype=acc_dtype) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/optimizer/optimizer.py", line 1104, in _add_accumulator self.helper.set_variable_initializer( File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/base/layer_helper_base.py", line 589, in set_variable_initializer initializer(var, self.main_program.global_block()) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/nn/initializer/initializer.py", line 69, in __call__ return self.forward(param, block) File "/home/jovyan/.conda/envs/mzx/lib/python3.10/site-packages/paddle/nn/initializer/constant.py", line 91, in forward _C_ops.full_( MemoryError: -------------------------------------- C++ Traceback (most recent call last): -------------------------------------- 0 paddle::pybind::eager_api_full_(_object*, _object*, _object*) 1 full__ad_func(paddle::Tensor&, paddle::experimental::IntArrayBase<paddle::Tensor>, paddle::experimental::ScalarBase<paddle::Tensor>, phi::DataType, phi::Place) 2 paddle::experimental::full_(paddle::Tensor&, paddle::experimental::IntArrayBase<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, phi::DataType, phi::Place const&) 3 void phi::FullKernel<float, phi::GPUContext>(phi::GPUContext const&, paddle::experimental::IntArrayBase<phi::DenseTensor> 
const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::DataType, phi::DenseTensor*) 4 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const 5 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool) 6 paddle::memory::allocation::Allocator::Allocate(unsigned long) 7 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long) 8 paddle::memory::allocation::Allocator::Allocate(unsigned long) 9 paddle::memory::allocation::Allocator::Allocate(unsigned long) 10 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int) 11 common::enforce::GetCurrentTraceBackString[abi:cxx11](bool) ---------------------- Error Message Summary: ---------------------- ResourceExhaustedError: Out of memory error on GPU 0. Cannot allocate 259.000000MB memory on GPU 0, 79.306213GB memory has been allocated and available memory is only 19.250000MB. Please check whether there is any other process using GPU 0. 1. If yes, please stop them, or start PaddlePaddle on another GPU. 2. If no, please decrease the batch size of your model. (at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:71) LAUNCH INFO 2025-05-08 10:40:44,266 Exit code 1
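The failing allocation happens while AdamW creates its accumulators, which matches a rough back-of-the-envelope memory estimate. A minimal sketch, assuming fp16 `O2` keeps an fp32 master copy of the weights and AdamW stores two fp32 moment tensors per parameter (the exact bookkeeping in Paddle may differ slightly):

```python
# Rough GPU memory estimate for the optimizer-state OOM above.
# Assumption (not from the log): fp16 O2 training keeps fp32 master
# weights, and AdamW adds two fp32 moments per parameter.
PARAMS = 7_615_616_512  # "Number of trainable parameters" from the log
GIB = 2**30

weights_fp16 = PARAMS * 2 / GIB   # model weights in fp16
master_fp32 = PARAMS * 4 / GIB    # fp32 master weights (O2)
moments_fp32 = PARAMS * 8 / GIB   # AdamW moment1 + moment2 in fp32

total = weights_fp16 + master_fp32 + moments_fp32
print(f"weights: {weights_fp16:.1f} GiB")                        # ~14.2 GiB
print(f"optimizer state: {master_fp32 + moments_fp32:.1f} GiB")  # ~85.1 GiB
print(f"total before activations: {total:.1f} GiB")              # ~99.3 GiB
```

Roughly 99 GiB of static state alone exceeds the 80 GB of an A800, which is consistent with the log showing 79.3 GB already allocated when the 259 MB accumulator allocation fails.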
Hi, you can try enabling flash attention and training with bf16. You only need to add these two arguments to your launch script:
--use_flash_attention 1 --bf16 1
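For reference, the JSON equivalents of those two flags in `sft_argument.json` would look like this (a sketch; note that `bf16` replaces `fp16`, so the latter should be turned off rather than set alongside it):

```json
{
    "fp16": false,
    "bf16": true,
    "use_flash_attention": true
}
```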
lugimzzz