Description
Is there an existing issue / discussion for this?
- I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- I have searched the FAQ
Current Behavior
After training for a while, the process gets Killed (see the attached screenshot of the log).
Here is my configuration:
```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="../../../agent/Qwen-VL/pretrain_model/Qwen/Qwen-VL-Chat/" # "Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/qwen_vl_train_data.json"
EVAL_DATA="../data/qwen_vl_val_data.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NNODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --bf16 True \
    --fix_vit True \
    --output_dir output_qwen_model \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2800 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --use_lora \
    --gradient_checkpointing \
    --deepspeed finetune/ds_config_zero2.json
```
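A bare `Killed` with no Python traceback usually means the Linux kernel's OOM killer terminated the process because host RAM ran out, rather than a CUDA out-of-memory error on the GPUs. A quick way to confirm this, using standard Linux commands (`dmesg` may require root or `kernel.dmesg_restrict=0`):

```bash
# Look for kernel OOM-killer events around the time training died.
dmesg -T | grep -i -E "out of memory|killed process" | tail -n 20

# Equivalent query on systemd hosts via the kernel journal.
journalctl -k | grep -i -E "out of memory|killed process" | tail -n 20
```

If these logs name one of the training ranks, the pressure is on host RAM rather than GPU memory.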
I don't know how to solve this.
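One way to narrow it down is to watch the resident memory of the torchrun workers while training runs: with ZeRO-2, optimizer state (plus any CPU offload configured in `ds_config_zero2.json`) lives in host RAM, and eight ranks can add up quickly. A rough monitoring loop, assuming the worker command lines contain `finetune.py`:

```bash
# Print PID, resident set size (KiB), and elapsed time for every
# process whose command line mentions finetune.py, every 30 seconds.
while true; do
    date
    pids=$(pgrep -d, -f finetune.py)
    [ -n "$pids" ] && ps -o pid,rss,etime,cmd --no-headers -p "$pids"
    sleep 30
done
```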
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?
No response