Hello, I am following the instructions to get started with EasyR1; however, I keep getting errors in the process. The errors change every time I run the docker container: I have gotten everything from a problem with the number of GPUs being passed in ("got gpu 0 expected 8") to workers not being synchronized.
Any help in understanding what I am doing wrong would be greatly appreciated. I have looked online and in the issues on this repo, but have not found the solution. I am new to docker, so apologies if this is a simple problem to fix or if I did not set up something correctly.
The steps I have taken are:
- pulled the docker image:
docker pull hiyouga/verl:ngc-th2.7.0-cu12.6-vllm0.9.1
- started the container with (a GPU visibility check is sketched right after these steps):
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --rm -v /HOME_DIR/EasyR1:/workspace/EasyR1 -w /workspace db618adc68d5 bash
- ran
bash examples/qwen2_5_vl_7b_geo3k_grpo.sh
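Since a couple of the errors complain about the GPU count, the first thing I can double-check is whether the GPUs are actually visible inside the container. This is just a diagnostic sketch using the same image tag I pulled:
# list the GPUs the container sees, with the same image and --gpus flag
docker run --gpus all --rm hiyouga/verl:ngc-th2.7.0-cu12.6-vllm0.9.1 nvidia-smi
# inside the running container, confirm PyTorch also reports 8 devices
python3 -c "import torch; print(torch.cuda.device_count())"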
And, most recently, I am getting the following traceback/error:
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py:930: UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is /root/.cache/torch/kernels. This warning will appear only once per process. (Triggered internally at /pytorch/aten/src/ATen/native/cuda/jit_utils.cpp:1442.)
(WorkerDict pid=12946) sizes = grid_thw.prod(-1) // merge_size // merge_size
(WorkerDict pid=12943) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [repeated 7x across cluster]
[...]
NCCL version 2.21.5+cuda12.4
(WorkerDict pid=25283) [rank0]:[W713 04:12:56.043733966 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/EasyR1/verl/trainer/main.py", line 128, in <module>
main()
File "/workspace/EasyR1/verl/trainer/main.py", line 124, in main
ray.get(runner.run.remote(ppo_config))
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::Runner.run() (pid=24423, ip=192.168.0.2, actor_id=971cb4efa74e3acac3b5b28a01000000, repr=<main.Runner object at 0x7f9a701eca90>)
File "/workspace/EasyR1/verl/trainer/main.py", line 93, in run
trainer.init_workers()
File "/workspace/EasyR1/verl/trainer/ray_trainer.py", line 302, in init_workers
self.actor_rollout_ref_wg.init_model()
File "/workspace/EasyR1/verl/single_controller/ray/base.py", line 47, in func
output = ray.get(output)
ray.exceptions.RayTaskError(DistBackendError): ray::WorkerDict.actor_rollout_ref_init_model() (pid=29344, ip=192.168.0.2, actor_id=a4a7e38e959ee13c17167d8501000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f5dbc82b310>)
File "/workspace/EasyR1/verl/single_controller/ray/base.py", line 432, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/workspace/EasyR1/verl/single_controller/base/decorator.py", line 207, in inner
return func(*args, **kwargs)
File "/workspace/EasyR1/verl/workers/fsdp_workers.py", line 366, in init_model
self._build_model_optimizer(
File "/workspace/EasyR1/verl/workers/fsdp_workers.py", line 242, in _build_model_optimizer
dist.barrier()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
work = group.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
(WorkerDict pid=29344)
(WorkerDict pid=29344) e4077bb80b9c:29344:30132 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'
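The message above suggests re-running with NCCL_DEBUG=INFO, so I can collect a more detailed log on the next attempt, for example:
# re-run the example with verbose NCCL logging to capture the underlying cuda failure
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,P2P bash examples/qwen2_5_vl_7b_geo3k_grpo.sh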
Another time I ran this, I got:
ValueError: Total available GPUs 0 is less than total desired GPUs 8.
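For this one, here is the check I was planning to run inside the container to see how many GPUs Ray itself registers (again just a diagnostic sketch, not something from the EasyR1 docs):
# print the GPU count that Ray detects inside the container
python3 -c "import ray; ray.init(); print(ray.cluster_resources().get('GPU', 0))"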
Another time, I got:
(Runner pid=7873)   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/backends.py", line 677, in __call__
(Runner pid=7873)     with torch.cuda.graph(cudagraph, pool=self.graph_pool):
(Runner pid=7873)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 186, in __exit__
(Runner pid=7873)     self.cuda_graph.capture_end()
(Runner pid=7873)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 84, in capture_end
(Runner pid=7873)     super().capture_end()
(Runner pid=7873) RuntimeError: CUDA error: operation not permitted
(Runner pid=7873) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Runner pid=7873) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Runner pid=7873) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(Runner pid=7873) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_ref_init_model() (pid=12947, ip=192.168.0.3, actor_id=92b10155f368aac73ed7f31601000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f2ce366b2e0>)
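For the CUDA graph failure, my guess is that disabling CUDA graph capture in vLLM (its enforce_eager option) might work around it, but I am not sure how EasyR1 exposes that; the config key and config path below are only assumptions on my part:
# assumption: EasyR1 may forward an enforce_eager switch to vLLM under a rollout key;
# the key name (worker.rollout.enforce_eager) and config path are guesses, please correct me
python3 -m verl.trainer.main config=examples/config.yaml worker.rollout.enforce_eager=true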
I am not sure what I am doing wrong. Shouldn't this script run without any issues? At least, that is what the tutorial makes it seem like.
Thank you!