
fit the cases of mmdata in sys_msg #7694


Open
Luffy-ZY-Wang wants to merge 1 commit into main

Conversation

@Luffy-ZY-Wang commented Apr 12, 2025

What does this PR do?

Fit the cases of mmdata in sys_msg

In some tasks like multimodal roleplaying, we may need to put the role image in the system message.

I've tested supervised.py and unsupervised.py locally and they work fine on my machine. If other workflows such as feedback.py don't work, please feel free to comment or open an issue. 😊


@Luffy-ZY-Wang (Author)

Hi, @Kuangdd01
Please feel free to add any comments here! 🙂

@Kuangdd01 (Collaborator)

First, rebase your code onto the main branch.
Second, try using py-spy to trace the process and see which line causes the hang during Omni training. 🤗

@Luffy-ZY-Wang (Author)

> try using py-spy to trace the process and see which line causes the hang during Omni training

Thanks for the hint. I'm currently struggling with permission problems with py-spy, and I can't elevate its permissions because I can't reach the machine administrator right now.

Are there any other ways?

@Luffy-ZY-Wang (Author)

Hi, @Kuangdd01
I used pudb instead and placed a set_trace() here:
[screenshot: the location where set_trace() was inserted]

Then I ran it step by step and found it gets stuck here:
[screenshot: the line where execution hangs]
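As a side note, here is a minimal sketch of how such a breakpoint can be gated in a multi-process DeepSpeed run so that only one rank opens the console (the debug_breakpoint helper and the LOCAL_RANK check are illustrative assumptions, not LLaMA-Factory code):

import os

import pudb


def debug_breakpoint() -> None:
    # Open the pudb console only on local rank 0 so the terminals of the
    # other DeepSpeed workers are not hijacked by competing debugger UIs.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        pudb.set_trace()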

@Luffy-ZY-Wang (Author)

Update, TL;DR: it gets stuck in some lines in DeepSpeed and torch. Could you share your env info regarding deepspeed and torch? @Kuangdd01

First, execution enters the forward function of DeepSpeedEngine (which actually inherits from nn.Module) from our compute_loss, and it gets stuck when calling the module:
[screenshot: DeepSpeedEngine.forward calling self.module]

Let's step into it. It finally gets stuck at this line in Module._call_impl() of torch.nn.modules.module:
[screenshot: Module._call_impl]

But before it gets stuck, it enters this function in deepspeed.utils.nvtx:
[screenshot: instrument_w_nvtx]

Then I found that right after this instrument_w_nvtx, execution goes back to the line loss = self.module(*inputs, **kwargs) in DeepSpeedEngine.forward (the first screenshot), runs into the second screenshot again, and gets stuck when it tries to set forward_call = self.forward.
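For context, instrument_w_nvtx is only a profiling wrapper, so it is unlikely to be the culprit by itself; the hang presumably happens inside the wrapped forward call. A paraphrased sketch of what the decorator does (simplified from memory; the exact DeepSpeed code differs by version):

from deepspeed.accelerator import get_accelerator


def instrument_w_nvtx(func):
    # Record an NVTX profiling range around the wrapped call, then delegate
    # to the real function (here, DeepSpeedEngine.forward).
    def wrapped_fn(*args, **kwargs):
        get_accelerator().range_push(func.__qualname__)
        ret_val = func(*args, **kwargs)
        get_accelerator().range_pop()
        return ret_val

    return wrapped_fn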

My system info:

- `llamafactory` version: 0.9.3.dev0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.12.9
- PyTorch version: 2.6.0+cu118 (GPU)
- Transformers version: 4.50.0.dev0
- Datasets version: 3.4.1
- Accelerate version: 1.5.2
- PEFT version: 0.15.1
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 4090
- GPU number: 8
- GPU memory: 23.65GB
- DeepSpeed version: 0.16.4
- vLLM version: 0.8.1
- Git commit: 610f164c69118d390ac6aad5e721b33d3212c91f

Could you share which deepspeed and torch you are using?

@Luffy-ZY-Wang (Author)

I found that tracing needs two pudb windows (the model is loaded in a distributed fashion). That _call_impl() line isn't where the issue really lies. Still tracing...

@Kuangdd01 (Collaborator)

> Update, TL;DR: it gets stuck in some lines in DeepSpeed and torch. [...] Could you share which deepspeed and torch you are using?

Env info

- Python version: 3.10.0
- Huggingface_hub version: 0.30.0
- Safetensors version: 0.5.3
- Accelerate version: 1.4.0
- Accelerate config:    not found
- DeepSpeed version: 0.16.5
- PyTorch version (GPU?): 2.6.0+cu124 (True)

@Kuangdd01 (Collaborator)

The hanging may have the same cause as mixing (image-text + text) data samples: when some batch is text-only, it can lead to inconsistent backward passes, especially when ZeRO-3 is enabled for model module broadcasting. But according to your description, every sample has the same mmdata in SYSTEM_PROMPT, so I am also confused.
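A hypothetical toy illustration of that failure mode (ToyOmniModel is made up, not LLaMA-Factory's model): under ZeRO-3, every forward through a submodule triggers a collective all-gather of its partitioned parameters, so all ranks must run the same submodules in the same order.

import torch.nn as nn


class ToyOmniModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.vision_tower = nn.Linear(16, 16)
        self.language_model = nn.Linear(16, 2)

    def forward(self, text_features, pixel_values=None):
        hidden = text_features
        if pixel_values is not None:
            # Only multimodal samples take this branch. If one rank gets a
            # text-only batch and skips it while another rank does not, the
            # ZeRO-3 all-gather for vision_tower's parameters has no matching
            # call on every rank, and training hangs.
            hidden = hidden + self.vision_tower(pixel_values)
        return self.language_model(hidden)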

@Luffy-ZY-Wang (Author)

> The hanging may have the same cause as mixing (image-text + text) data samples [...] so I am also confused.

The role images I put into SYSTEM_PROMPT may differ when the role changes (I'm training on several roles). Could this be the reason? But it hangs even when I give only 1 training sample.

@Kuangdd01 (Collaborator)

Kuangdd01 commented Apr 22, 2025

Following your sys mmdata PR (commit id: 8ceba26b9e22d569ccda357be5653c862eed0730) and using the demo data_info you showed in #7767, I still can't reproduce the hanging error.
Here are my configs.

### model
model_name_or_path: ./test_models/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ./examples/deepspeed/ds_z3_config.json


### dataset
dataset: test
template: qwen2_omni
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen2_omni-7b-video/lora/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
use_audio_in_video: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
freeze_vision_tower: true
learning_rate: 1.0e-4
num_train_epochs: 25.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

data info

[
    {
      "messages": [
        
        {
          "content": "<video><audio>What is the video describing?",
          "role": "system"
        },
  
        {
          "content": "<video><audio>What is the video describing?",
          "role": "user"
        },
        {
          "content": "A girl who is drawing a picture of a guitar and feel nervous.",
          "role": "assistant"
        }
      ],
      "videos": [
        "mllm_demo_data/4.mp4",
        "mllm_demo_data/4.mp4"
      ],
      "audios": [
        "mllm_demo_data/4.mp3",
        "mllm_demo_data/4.mp3"
      ]
    }
  ]

@Luffy-ZY-Wang (Author)

May I ask: if we pass use_audio_in_video: true and the data includes <video><audio>, would it cause any issues, and which audio would it use, the <audio> input or the audio in the video?

@Kuangdd01 (Collaborator)

> May I ask: if we pass use_audio_in_video: true and the data includes <video><audio>, would it cause any issues, and which audio would it use [...]

If we turn on use_audio_in_video, <video><audio> will be replaced with the audio-video interleaved layout in the Omni model:

<video><audio> -> [VVVVVAAAAVVVVAA]. If we don't turn it on, it should be <video><audio> -> [VVVVVVVAAAAA...]
(V: video token, A: audio token)
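A rough sketch of the two expansions described above (expand_av_placeholders is a hypothetical helper; the real placeholder expansion is done by the Qwen2.5-Omni processor in transformers and is considerably more involved):

def expand_av_placeholders(video_chunks, audio_chunks, use_audio_in_video):
    # video_chunks / audio_chunks: per-time-window lists of placeholder tokens,
    # e.g. [["V"] * 5, ["V"] * 4] and [["A"] * 4, ["A"] * 2].
    if use_audio_in_video:
        # Interleave per time window: [VVVVVAAAA VVVV AA ...]
        tokens = []
        for video_tokens, audio_tokens in zip(video_chunks, audio_chunks):
            tokens.extend(video_tokens)
            tokens.extend(audio_tokens)
        return tokens
    # Otherwise keep the modalities contiguous: [VVVVVVVVV ... AAAAAA]
    flat_video = [t for chunk in video_chunks for t in chunk]
    flat_audio = [t for chunk in audio_chunks for t in chunk]
    return flat_video + flat_audio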

Kuangdd01 self-assigned this Apr 22, 2025
@Luffy-ZY-Wang (Author)

Luffy-ZY-Wang commented Apr 22, 2025

Oh, that's why: I found that my dataset has <audio><video> in the dialogue, and it ended up as [VVVV...VVVAAAAA...AAA] in the final input. Maybe that's where the problem lies. I'm converting it into the correct format <video><audio> now.

And by the way, I didn't hit the hanging issue either when I set use_audio_in_video: false and removed all audios while keeping the role images in sys_msg. Now I'm trying the <video><audio> format in USER_PROMPT.
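For the tag-order conversion mentioned above, a possible one-off helper (hypothetical script; it assumes a ShareGPT-style JSON file whose message content fields may contain the wrong ordering):

import json


def fix_tag_order(path_in: str, path_out: str) -> None:
    with open(path_in, encoding="utf-8") as f:
        data = json.load(f)
    for sample in data:
        for message in sample["messages"]:
            # Reorder the placeholders so the video tag always precedes audio.
            message["content"] = message["content"].replace("<audio><video>", "<video><audio>")
    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)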

@Luffy-ZY-Wang (Author)

> Following your sys mmdata PR (commit id: 8ceba26b9e22d569ccda357be5653c862eed0730) and using the demo data_info you showed in #7767, I still can't reproduce the hanging error. Here are my configs. [...]

I checked this on my machine and didn't hit the hanging problem with this PR either. Let's move to #7767 to discuss that issue further.
