[BUG] <title>训练到一段出现Killed

### 是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

- [ ] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

### 该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

- [ ] 我已经搜索过FAQ | I have searched FAQ

### 当前行为 | Current Behavior

训练到一段时间就出现Killed, 如下图

![Image](https://github.com/user-attachments/assets/27cb36d7-99b0-4a9e-8233-4d78d34d033f)
下面是配置项：
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="../../../agent/Qwen-VL/pretrain_model/Qwen/Qwen-VL-Chat/" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL"  Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/qwen_vl_train_data.json"
EVAL_DATA="../data/qwen_vl_val_data.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --bf16 True \
    --fix_vit True \
    --output_dir output_qwen_model \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2800 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --use_lora \
    --gradient_checkpointing \
    --deepspeed finetune/ds_config_zero2.json
    
不知道怎么解决

### 期望行为 | Expected Behavior

_No response_

### 复现方法 | Steps To Reproduce

_No response_

### 运行环境 | Environment

```Markdown
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
```

### 备注 | Anything else?

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] <title>训练到一段出现Killed #507

是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.

See the section for finetuning in README for more information.

期望行为 | Expected Behavior

复现方法 | Steps To Reproduce

运行环境 | Environment

备注 | Anything else?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] <title>训练到一段出现Killed #507

Description

是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.

See the section for finetuning in README for more information.

期望行为 | Expected Behavior

复现方法 | Steps To Reproduce

运行环境 | Environment

备注 | Anything else?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions