
fit the cases of mmdata in sys_msg #7694


Open
Luffy-ZY-Wang wants to merge 1 commit into main

Conversation

@Luffy-ZY-Wang commented Apr 12, 2025

What does this PR do?

Fit the cases of mmdata in sys_msg

In some tasks like multimodal roleplaying, we may need to put the role image in the system message.

I've tested supervised.py and unsupervised.py locally and they work fine on my machine. If other workflows such as feedback.py don't work, please feel free to comment or open an issue. 😊


@Luffy-ZY-Wang (Author)

Hi, @Kuangdd01
Please feel free to add any comments here! 🙂

@Kuangdd01 (Collaborator)

First, rebase your code onto the main branch.
Second, try using py-spy to trace the process and see which line causes the hang during Omni training. 🤗

@Luffy-ZY-Wang (Author)

> try using py-spy to trace the process and see which line causes the hang during Omni training

Thanks for the hint. I'm currently struggling with permission problems with py-spy, and I can't elevate its permissions because I can't reach the machine administrator right now.

Are there any other ways?

@Luffy-ZY-Wang (Author)

Hi, @Kuangdd01
I used pudb instead and placed a set_trace() here:
[screenshot: the location where set_trace() was inserted]

Then I ran it step by step and found it gets stuck here:
[screenshot: the line where execution hangs]
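As a side note, here is a minimal sketch of how such a breakpoint can be gated in a multi-process DeepSpeed run so that only one rank opens the console (the debug_breakpoint helper and the LOCAL_RANK check are illustrative assumptions, not LLaMA-Factory code):

import os

import pudb


def debug_breakpoint() -> None:
    # Open the pudb console only on local rank 0 so the terminals of the
    # other DeepSpeed workers are not hijacked by competing debugger UIs.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        pudb.set_trace()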

@Luffy-ZY-Wang (Author)

Update, TL;DR: it gets stuck in some lines in DeepSpeed and torch. Could you share your env info regarding deepspeed and torch? @Kuangdd01

First, execution enters the forward function of DeepSpeedEngine (which actually inherits from nn.Module) from our compute_loss, and it gets stuck when calling the module:
[screenshot: DeepSpeedEngine.forward calling self.module]

Let's step into it. It finally gets stuck at this line in Module._call_impl() of torch.nn.modules.module:
[screenshot: Module._call_impl]

But before it gets stuck, it enters this function in deepspeed.utils.nvtx:
[screenshot: instrument_w_nvtx]

Then I found that right after this instrument_w_nvtx, execution goes back to the line loss = self.module(*inputs, **kwargs) in DeepSpeedEngine.forward (the first screenshot), runs into the second screenshot again, and gets stuck when it tries to set forward_call = self.forward.
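For context, instrument_w_nvtx is only a profiling wrapper, so it is unlikely to be the culprit by itself; the hang presumably happens inside the wrapped forward call. A paraphrased sketch of what the decorator does (simplified from memory; the exact DeepSpeed code differs by version):

from deepspeed.accelerator import get_accelerator


def instrument_w_nvtx(func):
    # Record an NVTX profiling range around the wrapped call, then delegate
    # to the real function (here, DeepSpeedEngine.forward).
    def wrapped_fn(*args, **kwargs):
        get_accelerator().range_push(func.__qualname__)
        ret_val = func(*args, **kwargs)
        get_accelerator().range_pop()
        return ret_val

    return wrapped_fn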

My system info:

- `llamafactory` version: 0.9.3.dev0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.12.9
- PyTorch version: 2.6.0+cu118 (GPU)
- Transformers version: 4.50.0.dev0
- Datasets version: 3.4.1
- Accelerate version: 1.5.2
- PEFT version: 0.15.1
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 4090
- GPU number: 8
- GPU memory: 23.65GB
- DeepSpeed version: 0.16.4
- vLLM version: 0.8.1
- Git commit: 610f164c69118d390ac6aad5e721b33d3212c91f

Could you share which deepspeed and torch you are using?

@Luffy-ZY-Wang (Author)

I found that tracing needs two pudb windows (the model is loaded in a distributed fashion). That _call_impl() line isn't where the issue really lies. Still tracing...

@Kuangdd01 (Collaborator)

> Update, TL;DR: it gets stuck in some lines in DeepSpeed and torch. [...] Could you share which deepspeed and torch you are using?

Env info

- Python version: 3.10.0
- Huggingface_hub version: 0.30.0
- Safetensors version: 0.5.3
- Accelerate version: 1.4.0
- Accelerate config:    not found
- DeepSpeed version: 0.16.5
- PyTorch version (GPU?): 2.6.0+cu124 (True)

@Kuangdd01 (Collaborator)

The hanging may have the same cause as mixing (image-text + text) data samples: when some batch is text-only, it can lead to inconsistent backward passes, especially when ZeRO-3 is enabled for model module broadcasting. But according to your description, every sample has the same mmdata in SYSTEM_PROMPT, so I am also confused.
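A hypothetical toy illustration of that failure mode (ToyOmniModel is made up, not LLaMA-Factory's model): under ZeRO-3, every forward through a submodule triggers a collective all-gather of its partitioned parameters, so all ranks must run the same submodules in the same order.

import torch.nn as nn


class ToyOmniModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.vision_tower = nn.Linear(16, 16)
        self.language_model = nn.Linear(16, 2)

    def forward(self, text_features, pixel_values=None):
        hidden = text_features
        if pixel_values is not None:
            # Only multimodal samples take this branch. If one rank gets a
            # text-only batch and skips it while another rank does not, the
            # ZeRO-3 all-gather for vision_tower's parameters has no matching
            # call on every rank, and training hangs.
            hidden = hidden + self.vision_tower(pixel_values)
        return self.language_model(hidden)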

@Luffy-ZY-Wang (Author)

> The hanging may have the same cause as mixing (image-text + text) data samples [...] so I am also confused.

The role images I put into SYSTEM_PROMPT may differ when the role changes (I'm training on several roles). Could this be the reason? But it hangs even when I give only 1 training sample.

@Kuangdd01 (Collaborator)

Kuangdd01 commented Apr 22, 2025

Following your sys mmdata PR (commit id: 8ceba26b9e22d569ccda357be5653c862eed0730) and using the demo data_info you showed in #7767, I still can't reproduce the hanging error.
Here are my configs.

### model
model_name_or_path: ./test_models/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ./examples/deepspeed/ds_z3_config.json


### dataset
dataset: test
template: qwen2_omni
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen2_omni-7b-video/lora/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
use_audio_in_video: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
freeze_vision_tower: true
learning_rate: 1.0e-4
num_train_epochs: 25.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

data info

[
    {
      "messages": [
        
        {
          "content": "<video><audio>What is the video describing?",
          "role": "system"
        },
  
        {
          "content": "<video><audio>What is the video describing?",
          "role": "user"
        },
        {
          "content": "A girl who is drawing a picture of a guitar and feel nervous.",
          "role": "assistant"
        }
      ],
      "videos": [
        "mllm_demo_data/4.mp4",
        "mllm_demo_data/4.mp4"
      ],
      "audios": [
        "mllm_demo_data/4.mp3",
        "mllm_demo_data/4.mp3"
      ]
    }
  ]

@Luffy-ZY-Wang (Author)

May I ask: if we pass use_audio_in_video: true and the data includes <video><audio>, would it cause any issues, and which audio would it use, the <audio> input or the audio in the video?

@Kuangdd01 (Collaborator)

> May I ask: if we pass use_audio_in_video: true and the data includes <video><audio>, would it cause any issues, and which audio would it use [...]

If we turn on use_audio_in_video, <video><audio> will be replaced with the audio-video interleaved layout in the Omni model:

<video><audio> -> [VVVVVAAAAVVVVAA]. If we don't turn it on, it should be <video><audio> -> [VVVVVVVAAAAA...]
(V: video token, A: audio token)
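A rough sketch of the two expansions described above (expand_av_placeholders is a hypothetical helper; the real placeholder expansion is done by the Qwen2.5-Omni processor in transformers and is considerably more involved):

def expand_av_placeholders(video_chunks, audio_chunks, use_audio_in_video):
    # video_chunks / audio_chunks: per-time-window lists of placeholder tokens,
    # e.g. [["V"] * 5, ["V"] * 4] and [["A"] * 4, ["A"] * 2].
    if use_audio_in_video:
        # Interleave per time window: [VVVVVAAAA VVVV AA ...]
        tokens = []
        for video_tokens, audio_tokens in zip(video_chunks, audio_chunks):
            tokens.extend(video_tokens)
            tokens.extend(audio_tokens)
        return tokens
    # Otherwise keep the modalities contiguous: [VVVVVVVVV ... AAAAAA]
    flat_video = [t for chunk in video_chunks for t in chunk]
    flat_audio = [t for chunk in audio_chunks for t in chunk]
    return flat_video + flat_audio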

Kuangdd01 self-assigned this Apr 22, 2025
@Luffy-ZY-Wang (Author)

Luffy-ZY-Wang commented Apr 22, 2025

Oh, that's why: I found that my dataset has <audio><video> in the dialogue, and it ended up as [VVVV...VVVAAAAA...AAA] in the final input. Maybe that's where the problem lies. I'm converting it into the correct format <video><audio> now.

And by the way, I didn't hit the hanging issue either when I set use_audio_in_video: false and removed all audios while keeping the role images in sys_msg. Now I'm trying the <video><audio> format in USER_PROMPT.
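For the tag-order conversion mentioned above, a possible one-off helper (hypothetical script; it assumes a ShareGPT-style JSON file whose message content fields may contain the wrong ordering):

import json


def fix_tag_order(path_in: str, path_out: str) -> None:
    with open(path_in, encoding="utf-8") as f:
        data = json.load(f)
    for sample in data:
        for message in sample["messages"]:
            # Reorder the placeholders so the video tag always precedes audio.
            message["content"] = message["content"].replace("<audio><video>", "<video><audio>")
    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)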

@Luffy-ZY-Wang (Author)

> Following your sys mmdata PR (commit id: 8ceba26b9e22d569ccda357be5653c862eed0730) and using the demo data_info you showed in #7767, I still can't reproduce the hanging error. Here are my configs. [...]

I checked this on my machine and didn't hit the hanging problem with this PR either. Let's move to #7767 to discuss that issue further.
