Description
What I’m Trying to Do
I'm adapting the /examples/advanced/llm_hf SFT example in NVFlare to continue pretraining or fine-tuning a model with the HuggingFace Trainer, with support for DDP or FSDP across multiple GPUs per client.
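For reference, here is a stripped-down sketch of what my adapted client script roughly looks like: a HuggingFace Trainer loop wrapped with the NVFlare Client API. The model, dataset, and TrainingArguments below are placeholders (gpt2 / wikitext), not the ones from llm_hf:

```python
import nvflare.client as flare
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "gpt2"  # placeholder; the real run uses a much larger causal LM


def main():
    flare.init()  # register this process with the NVFlare client runtime

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    # Tiny public dataset just to keep the sketch self-contained.
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:200]")
    ds = ds.map(
        lambda x: tokenizer(x["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="./out",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        # fsdp="full_shard auto_wrap",  # where I'd expect to turn on FSDP sharding
    )

    while flare.is_running():
        input_model = flare.receive()  # global weights from the server
        model.load_state_dict(input_model.params, strict=False)

        trainer = Trainer(model=model, args=args, train_dataset=ds, data_collator=collator)
        trainer.train()

        # send the locally updated weights back to the server
        flare.send(flare.FLModel(params=model.cpu().state_dict()))


if __name__ == "__main__":
    main()
```

This runs fine on a single GPU per client; the problem starts when I try to spread it over several GPUs.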
What I've Tried
- Set `gpu=[0,1,...,n]` in the client config
- Wrapped the model with `nn.DataParallel` and attempted `DistributedDataParallel`
- Enabled `launch_external_process=True` in `ScriptRunner` (see the job-definition sketch after this list)
- Verified that a similar training script works with multi-GPU when run outside of NVFlare
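The job definition I've been experimenting with looks roughly like this. I'm not sure the `command="torchrun ..."` override or the GPU grouping passed to `simulator_run` is the intended usage, which is part of what I'm asking below:

```python
from nvflare.app_common.workflows.fedavg import FedAvg
from nvflare.job_config.api import FedJob
from nvflare.job_config.script_runner import ScriptRunner

N_CLIENTS = 2

job = FedJob(name="llm_hf_multi_gpu")
job.to(FedAvg(num_clients=N_CLIENTS, num_rounds=2), "server")

for i in range(N_CLIENTS):
    runner = ScriptRunner(
        script="client.py",                     # the HF Trainer script sketched above
        launch_external_process=True,           # run the script in a separate process
        command="torchrun --nproc_per_node=2",  # my guess at replacing the default "python3 -u"
    )
    job.to(runner, f"site-{i + 1}")

# "[0,1],[2,3]" is the simulator CLI syntax for GPU groups; I'm assuming it also
# works here to give each simulated client two GPUs.
job.simulator_run("/tmp/nvflare/llm_hf_workdir", gpu="[0,1],[2,3]")
```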
Issue
The main issue I'm seeing is that the workload is never distributed across more than one GPU per client.
Questions
- What's the proper way to enable multi-GPU training in NVFlare with the HuggingFace Trainer using `accelerate` or native PyTorch? Are there any examples of multi-GPU setups that don't use PyTorch Lightning?
- Should I be using `PTMultiProcessExecutor` instead of `ScriptRunner`? Are there any sample configs or documentation for this, specifically with the HuggingFace Trainer?
- Does `ScriptRunner` with `launch_external_process=True` support DDP workloads, or is it limited to single-process training?
- Can the NVFlare simulator scale to 8 clients × 8 GPUs (64 GPUs total)? I'd like to scale up to 16 GPUs per client. What's the difference between:
  - Simulator
  - POC
  - the actual NVFlare production platform

  both in general and in terms of distributed compute support?
Any guidance or references would be really appreciated. Thanks!