fit the cases of mmdata in sys_msg #7694
base: main
Conversation
Hi, @Kuangdd01
First, please rebase your code onto the main branch.
Thanks for the hint. I'm currently struggling with permission problems. Are there any other ways?
Hi, @Kuangdd01. I then ran it step by step carefully and found it stuck here:
Update, TL;DR: I found it stuck at a few lines. First it enters one function; stepping into that, it finally gets stuck at a single line. Before it gets stuck, it also passes through another function and then returns to that same line. My system info:
Could you share which deepspeed and torch versions you are using?
I find it needs to use two.
Env info:
The hanging reason may be the same as the mix-up of (image-text + text) data samples. When some batch is text-only, it may lead to inconsistent backward passes, especially with ZeRO-3 turned on for model module broadcasting. But according to your description, every sample has the same mmdata in the SYSTEM_PROMPT, so I am also confused.
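For reference, the usual workaround for that (image-text + text) mix-up is to make sure every sample still drives the vision tower, e.g. by padding text-only samples with a dummy image. A minimal sketch, assuming a `messages`/`images` sample layout like the demo data below and a hypothetical `<image>` placeholder token (not part of this PR):

```python
# Minimal sketch of a common workaround (not part of this PR): give text-only
# samples a tiny dummy image so every rank still executes the vision tower
# under ZeRO-3. Placeholder token, message schema, and image size are
# assumptions for illustration only.
from PIL import Image

IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder token used by the template

def pad_text_only_sample(messages, images):
    """Attach a small dummy image and a matching placeholder to samples that
    carry no visual input, so the multimodal branch runs on every rank."""
    if images:  # already multimodal, leave it untouched
        return messages, images
    dummy = Image.new("RGB", (28, 28))  # tiny black image, negligible cost
    for msg in messages:
        if msg["role"] == "user":  # prepend placeholder to the first user turn
            msg["content"] = IMAGE_PLACEHOLDER + msg["content"]
            break
    return messages, [dummy]
```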
The role images I put into the SYSTEM_PROMPT may differ when the role changes (I am training for several roles). Could this be the reason? But it hangs even if I give only 1 training sample.
Following your sys mmdata PR (commit id: 8ceba26b9e22d569ccda357be5653c862eed0730) and using the demo data_info as you showed in #7767, I still cannot reproduce the hanging error.

### model
model_name_or_path: ./test_models/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ./examples/deepspeed/ds_z3_config.json
### dataset
dataset: test
template: qwen2_omni
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: saves/qwen2_omni-7b-video/lora/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
### train
use_audio_in_video: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
freeze_vision_tower: true
learning_rate: 1.0e-4
num_train_epochs: 25.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

data info:
[
{
"messages": [
{
"content": "<video><audio>What is the video describing?",
"role": "system"
},
{
"content": "<video><audio>What is the video describing?",
"role": "user"
},
{
"content": "A girl who is drawing a picture of a guitar and feel nervous.",
"role": "assistant"
}
],
"videos": [
"mllm_demo_data/4.mp4",
"mllm_demo_data/4.mp4"
],
"audios": [
"mllm_demo_data/4.mp3",
"mllm_demo_data/4.mp3"
]
}
]
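As a side note, a quick way to sanity-check a sample like the one above is to confirm that the placeholder counts match the media lists. A minimal sketch (my own helper, not a LLaMA-Factory API; the dataset path is assumed):

```python
# Quick sanity check for samples like the one above: the number of
# <video>/<audio> placeholders across all messages should match the lengths
# of the "videos"/"audios" lists.
import json

def check_sample(sample):
    text = "".join(m["content"] for m in sample["messages"])
    assert text.count("<video>") == len(sample.get("videos", [])), "video count mismatch"
    assert text.count("<audio>") == len(sample.get("audios", [])), "audio count mismatch"

if __name__ == "__main__":
    with open("data/test.json") as f:  # assumed location of the demo dataset
        for sample in json.load(f):
            check_sample(sample)
```

For the sample above, both counts are 2 (one tag in the system message and one in the user message), so the check passes.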
May I ask if we pass
If we turn on
Oh, that's why. I found the same in my dataset. And btw, I didn't catch the hanging issue either with that setting.
I checked this on my machine and didn't catch the hanging problem with this PR either. Let's move to #7767 to discuss that issue further.
Force-pushed from 8ceba26 to 52152dd
Force-pushed from 52152dd to 14d1bfb
Force-pushed from 14d1bfb to 36dc24d
What does this PR do?
Fit the cases of mmdata in sys_msg
In some tasks, such as multimodal roleplaying, we may need to provide the role image in the system message.
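For illustration, a sample of the kind this change targets might look like the sketch below. This is a hypothetical roleplaying entry: the image path and wording are made up, and the `images` key simply mirrors the `videos`/`audios` layout of the demo data above.

```python
# Illustrative only: a hypothetical roleplaying sample with the role image
# referenced from the system message. Path, keys, and wording are made up.
roleplay_sample = {
    "messages": [
        {"role": "system",
         "content": "<image>You are the character shown above. Stay in character."},
        {"role": "user", "content": "Hi, who are you?"},
        {"role": "assistant", "content": "I'm the character in the portrait, pleased to meet you."},
    ],
    "images": ["mllm_demo_data/1.jpg"],
}
```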
I've tested locally on supervised.py and unsupervised.py, and it works fine on my machine. If other files such as feedback.py don't work, please feel free to comment or open an issue. 😊

Before submitting