Hi all, I am trying to use NIXL as the KV cache transfer backend for SGLang PD disaggregation. I manually selected GPUNETIO as the NIXL backend and configured it to use "mlx5_bond_0".
However, I get an error when the NIXL agent tries to register VRAM via GPUNETIO.
The log is below. Could someone help me find the cause and fix it?
E0000 00:00:1762238119.118504 2368741 gpunetio_backend.cpp:1105] Can't register memory for unknown device 3
E0000 00:00:1762238119.118515 2368739 gpunetio_backend.cpp:1105] Can't register memory for unknown device 1
E1104 06:35:19.118512 2368741 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
E1104 06:35:19.118523 2368739 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
[2025-11-04 06:35:19 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3040, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 602, in __init__
self.init_disaggregation()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 960, in init_disaggregation
self.disagg_prefill_bootstrap_queue = PrefillBootstrapQueue(
^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 111, in __init__
self.kv_manager = self._init_kv_manager()
^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 152, in _init_kv_manager
kv_manager: BaseKVManager = kv_manager_class(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 138, in __init__
self.register_buffer_to_engine()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 286, in register_buffer_to_engine
self.kv_descs = self.agent.register_memory(kv_addrs, "VRAM")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nixl/_api.py", line 389, in register_memory
self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND
[2025-11-04 06:35:19 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3040, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 602, in __init__
self.init_disaggregation()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 960, in init_disaggregation
self.disagg_prefill_bootstrap_queue = PrefillBootstrapQueue(
^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 111, in __init__
self.kv_manager = self._init_kv_manager()
^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 152, in _init_kv_manager
kv_manager: BaseKVManager = kv_manager_class(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 138, in __init__
self.register_buffer_to_engine()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 286, in register_buffer_to_engine
self.kv_descs = self.agent.register_memory(kv_addrs, "VRAM")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nixl/_api.py", line 389, in register_memory
self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND
[2025-11-04 06:35:19] Received sigquit from a child process. It usually means the child failed.
[06:35:19:122075][1141507200][DOCA][ERR][linux_device_adapter.cpp:1268] failed to extract sockaddr_in
I0000 00:00:1762238119.123875 2368740 gpunetio_backend.cpp:438] DOCA Server socket created successfully
I0000 00:00:1762238119.123896 2368740 gpunetio_backend.cpp:470] Listening for incoming connections
2025-11-04 06:35:19 NIXL INFO _api.py:366 Backend GPUNETIO was instantiated
2025-11-04 06:35:19 NIXL INFO _api.py:256 Initialized NIXL agent: d17b1fc9-b249-4944-8290-2b349784dfb2
E0000 00:00:1762238119.125715 2368740 gpunetio_backend.cpp:1105] Can't register memory for unknown device 2
E1104 06:35:19.125723 2368740 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
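
In case it helps with triage, here is a minimal sketch for pulling the failing device IDs out of a log like the one above. It is plain Python and does not call NIXL or DOCA; the helper name `failed_devices` is hypothetical, and the regular expression assumes the error message keeps the exact wording shown in these log lines. In this excerpt, only devices 1, 2, and 3 are reported as unknown.

```python
import re

# Assumption: the GPUNETIO error keeps the wording seen in the log above.
UNKNOWN_DEVICE = re.compile(r"Can't register memory for unknown device (\d+)")


def failed_devices(log_text: str) -> list[int]:
    """Return the sorted, de-duplicated device IDs that failed registration."""
    return sorted({int(m.group(1)) for m in UNKNOWN_DEVICE.finditer(log_text)})


# Three error lines copied from the log above.
sample = """\
E0000 00:00:1762238119.118504 2368741 gpunetio_backend.cpp:1105] Can't register memory for unknown device 3
E0000 00:00:1762238119.118515 2368739 gpunetio_backend.cpp:1105] Can't register memory for unknown device 1
E0000 00:00:1762238119.125715 2368740 gpunetio_backend.cpp:1105] Can't register memory for unknown device 2
"""

print(failed_devices(sample))  # → [1, 2, 3]
```

Running the same function over the full scheduler log would show whether any device other than these three is rejected.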