Description
What I’m Trying to Do
I'm adapting the /examples/advanced/llm_hf SFT example in NVFlare to continue pretraining or fine-tuning a model with the HuggingFace Trainer, with support for DDP or FSDP across multiple GPUs per client.
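For reference, here is a stripped-down sketch of what my adapted client script roughly looks like: a HuggingFace Trainer loop wrapped with the NVFlare Client API. The model, dataset, and TrainingArguments below are placeholders (gpt2 / wikitext), not the ones from llm_hf:

```python
import nvflare.client as flare
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "gpt2"  # placeholder; the real run uses a much larger causal LM


def main():
    flare.init()  # register this process with the NVFlare client runtime

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    # Tiny public dataset just to keep the sketch self-contained.
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:200]")
    ds = ds.map(
        lambda x: tokenizer(x["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="./out",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        # fsdp="full_shard auto_wrap",  # where I'd expect to turn on FSDP sharding
    )

    while flare.is_running():
        input_model = flare.receive()  # global weights from the server
        model.load_state_dict(input_model.params, strict=False)

        trainer = Trainer(model=model, args=args, train_dataset=ds, data_collator=collator)
        trainer.train()

        # send the locally updated weights back to the server
        flare.send(flare.FLModel(params=model.cpu().state_dict()))


if __name__ == "__main__":
    main()
```

This runs fine on a single GPU per client; the problem starts when I try to spread it over several GPUs.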
What I've Tried
- Set `gpu=[0,1,...,n]` in the client config
- Wrapped the model with `nn.DataParallel` and attempted `DistributedDataParallel`
- Enabled `launch_external_process=True` in `ScriptRunner` (see the job-definition sketch after this list)
- Verified that a similar training script works with multi-GPU when run outside of NVFlare
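The job definition I've been experimenting with looks roughly like this. I'm not sure the `command="torchrun ..."` override or the GPU grouping passed to `simulator_run` is the intended usage, which is part of what I'm asking below:

```python
from nvflare.app_common.workflows.fedavg import FedAvg
from nvflare.job_config.api import FedJob
from nvflare.job_config.script_runner import ScriptRunner

N_CLIENTS = 2

job = FedJob(name="llm_hf_multi_gpu")
job.to(FedAvg(num_clients=N_CLIENTS, num_rounds=2), "server")

for i in range(N_CLIENTS):
    runner = ScriptRunner(
        script="client.py",                     # the HF Trainer script sketched above
        launch_external_process=True,           # run the script in a separate process
        command="torchrun --nproc_per_node=2",  # my guess at replacing the default "python3 -u"
    )
    job.to(runner, f"site-{i + 1}")

# "[0,1],[2,3]" is the simulator CLI syntax for GPU groups; I'm assuming it also
# works here to give each simulated client two GPUs.
job.simulator_run("/tmp/nvflare/llm_hf_workdir", gpu="[0,1],[2,3]")
```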
Issue
The main issue I'm seeing is that the workload is never distributed across more than one GPU per client.
Questions
- What's the proper way to enable multi-GPU training in NVFlare with the HuggingFace Trainer using `accelerate` or native PyTorch? Are there any examples of multi-GPU setups that don't use PyTorch Lightning?
- Should I be using `PTMultiProcessExecutor` instead of `ScriptRunner`? Are there any sample configs or documentation for this, specifically with the HuggingFace Trainer?
- Does `ScriptRunner` with `launch_external_process=True` support DDP workloads, or is it limited to single-process training?
- Can the NVFlare simulator scale to 8 clients × 8 GPUs (64 GPUs total)? I'd like to scale up to 16 GPUs per client. What's the difference between:
  - Simulator
  - POC
  - the actual NVFlare production platform

  both in general and in terms of distributed compute support?
Any guidance or references would be really appreciated. Thanks!