Hello, I am following the instructions to get started with EasyR1; however, I keep getting errors in the process. The errors change every time I run the docker container: I have gotten everything from a problem with the number of GPUs being passed in ("got gpu 0 expected 8") to workers not being synchronized.
Any help in understanding what I am doing wrong would be greatly appreciated. I have looked online and in the issues on this repo, but have not found the solution. I am new to docker, so apologies if this is a simple problem to fix or if I did not set up something correctly.
The steps I have taken are:
- pulled the docker image:
docker pull hiyouga/verl:ngc-th2.7.0-cu12.6-vllm0.9.1
- started the container with (a GPU visibility check is sketched right after these steps):
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --rm -v /HOME_DIR/EasyR1:/workspace/EasyR1 -w /workspace db618adc68d5 bash
- ran
bash examples/qwen2_5_vl_7b_geo3k_grpo.sh
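Since a couple of the errors complain about the GPU count, the first thing I can double-check is whether the GPUs are actually visible inside the container. This is just a diagnostic sketch using the same image tag I pulled:
# list the GPUs the container sees, with the same image and --gpus flag
docker run --gpus all --rm hiyouga/verl:ngc-th2.7.0-cu12.6-vllm0.9.1 nvidia-smi
# inside the running container, confirm PyTorch also reports 8 devices
python3 -c "import torch; print(torch.cuda.device_count())"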
And, most recently, I am getting the following traceback/error:
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py:930: UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is /root/.cache/torch/kernels. This warning will appear only once per process. (Triggered internally at /pytorch/aten/src/ATen/native/cuda/jit_utils.cpp:1442.)
(WorkerDict pid=12946) sizes = grid_thw.prod(-1) // merge_size // merge_size
(WorkerDict pid=12943) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [repeated 7x across cluster]
[...]
NCCL version 2.21.5+cuda12.4
(WorkerDict pid=25283) [rank0]:[W713 04:12:56.043733966 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/EasyR1/verl/trainer/main.py", line 128, in <module>
main()
File "/workspace/EasyR1/verl/trainer/main.py", line 124, in main
ray.get(runner.run.remote(ppo_config))
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::Runner.run() (pid=24423, ip=192.168.0.2, actor_id=971cb4efa74e3acac3b5b28a01000000, repr=<main.Runner object at 0x7f9a701eca90>)
File "/workspace/EasyR1/verl/trainer/main.py", line 93, in run
trainer.init_workers()
File "/workspace/EasyR1/verl/trainer/ray_trainer.py", line 302, in init_workers
self.actor_rollout_ref_wg.init_model()
File "/workspace/EasyR1/verl/single_controller/ray/base.py", line 47, in func
output = ray.get(output)
ray.exceptions.RayTaskError(DistBackendError): ray::WorkerDict.actor_rollout_ref_init_model() (pid=29344, ip=192.168.0.2, actor_id=a4a7e38e959ee13c17167d8501000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f5dbc82b310>)
File "/workspace/EasyR1/verl/single_controller/ray/base.py", line 432, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/workspace/EasyR1/verl/single_controller/base/decorator.py", line 207, in inner
return func(*args, **kwargs)
File "/workspace/EasyR1/verl/workers/fsdp_workers.py", line 366, in init_model
self._build_model_optimizer(
File "/workspace/EasyR1/verl/workers/fsdp_workers.py", line 242, in _build_model_optimizer
dist.barrier()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
work = group.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
(WorkerDict pid=29344)
(WorkerDict pid=29344) e4077bb80b9c:29344:30132 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'
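The message above suggests re-running with NCCL_DEBUG=INFO, so I can collect a more detailed log on the next attempt, for example:
# re-run the example with verbose NCCL logging to capture the underlying cuda failure
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,P2P bash examples/qwen2_5_vl_7b_geo3k_grpo.sh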
Another time I ran this, I got:
ValueError: Total available GPUs 0 is less than total desired GPUs 8.
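For this one, here is the check I was planning to run inside the container to see how many GPUs Ray itself registers (again just a diagnostic sketch, not something from the EasyR1 docs):
# print the GPU count that Ray detects inside the container
python3 -c "import ray; ray.init(); print(ray.cluster_resources().get('GPU', 0))"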
Another time, I got:
(Runner pid=7873)   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/backends.py", line 677, in __call__
(Runner pid=7873)     with torch.cuda.graph(cudagraph, pool=self.graph_pool):
(Runner pid=7873)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 186, in __exit__
(Runner pid=7873)     self.cuda_graph.capture_end()
(Runner pid=7873)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 84, in capture_end
(Runner pid=7873)     super().capture_end()
(Runner pid=7873) RuntimeError: CUDA error: operation not permitted
(Runner pid=7873) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Runner pid=7873) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Runner pid=7873) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(Runner pid=7873) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_ref_init_model() (pid=12947, ip=192.168.0.3, actor_id=92b10155f368aac73ed7f31601000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f2ce366b2e0>)
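For the CUDA graph failure, my guess is that disabling CUDA graph capture in vLLM (its enforce_eager option) might work around it, but I am not sure how EasyR1 exposes that; the config key and config path below are only assumptions on my part:
# assumption: EasyR1 may forward an enforce_eager switch to vLLM under a rollout key;
# the key name (worker.rollout.enforce_eager) and config path are guesses, please correct me
python3 -m verl.trainer.main config=examples/config.yaml worker.rollout.enforce_eager=true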
I am not sure what I am doing wrong. Shouldn't this script run without any issues? At least, that is what the tutorial makes it seem like.
Thank you!