How to use GPUNetIO as the backend for NIXL in SGLang PD disaggregation? #987

@TTThanos

Description

Hi all, I am trying to use NIXL as the KV cache transfer backend for SGLang PD disaggregation. I have manually assigned GPUNETIO as the NIXL backend and configured it to use "mlx5_bond_0".

However, I get an error when the NIXL agent tries to register VRAM via GPUNETIO.

The log is below. Could someone help me identify the problem and resolve it?

E0000 00:00:1762238119.118504 2368741 gpunetio_backend.cpp:1105] Can't register memory for unknown device 3
E0000 00:00:1762238119.118515 2368739 gpunetio_backend.cpp:1105] Can't register memory for unknown device 1
E1104 06:35:19.118512 2368741 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
E1104 06:35:19.118523 2368739 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
[2025-11-04 06:35:19 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3040, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 602, in __init__
self.init_disaggregation()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 960, in init_disaggregation
self.disagg_prefill_bootstrap_queue = PrefillBootstrapQueue(
^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 111, in __init__
self.kv_manager = self._init_kv_manager()
^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 152, in _init_kv_manager
kv_manager: BaseKVManager = kv_manager_class(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 138, in __init__
self.register_buffer_to_engine()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 286, in register_buffer_to_engine
self.kv_descs = self.agent.register_memory(kv_addrs, "VRAM")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nixl/_api.py", line 389, in register_memory
self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND

[2025-11-04 06:35:19 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3040, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 602, in __init__
self.init_disaggregation()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 960, in init_disaggregation
self.disagg_prefill_bootstrap_queue = PrefillBootstrapQueue(
^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 111, in __init__
self.kv_manager = self._init_kv_manager()
^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 152, in _init_kv_manager
kv_manager: BaseKVManager = kv_manager_class(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 138, in __init__
self.register_buffer_to_engine()
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl/conn.py", line 286, in register_buffer_to_engine
self.kv_descs = self.agent.register_memory(kv_addrs, "VRAM")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nixl/_api.py", line 389, in register_memory
self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND

[2025-11-04 06:35:19] Received sigquit from a child process. It usually means the child failed.
[06:35:19:122075][1141507200][DOCA][ERR][linux_device_adapter.cpp:1268] failed to extract sockaddr_in
I0000 00:00:1762238119.123875 2368740 gpunetio_backend.cpp:438] DOCA Server socket created successfully
I0000 00:00:1762238119.123896 2368740 gpunetio_backend.cpp:470] Listening for incoming connections
2025-11-04 06:35:19 NIXL INFO _api.py:366 Backend GPUNETIO was instantiated
2025-11-04 06:35:19 NIXL INFO _api.py:256 Initialized NIXL agent: d17b1fc9-b249-4944-8290-2b349784dfb2
E0000 00:00:1762238119.125715 2368740 gpunetio_backend.cpp:1105] Can't register memory for unknown device 2
E1104 06:35:19.125723 2368740 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
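
To make it easier to see which device IDs the backend rejected, here is a minimal sketch that pulls them out of the log text. It does not call any NIXL or DOCA API; it only pattern-matches the "unknown device" message exactly as it appears above. The function name `failed_device_ids` and the embedded `sample` string are just illustrative helpers, not part of SGLang or NIXL.

```python
import re
from collections import Counter

# Matches the GPUNETIO backend error line exactly as it appears in the log above.
PATTERN = re.compile(r"Can't register memory for unknown device (\d+)")


def failed_device_ids(log_text: str) -> Counter:
    """Count registration failures per device ID found in the given log text."""
    return Counter(int(m.group(1)) for m in PATTERN.finditer(log_text))


# The three error lines copied from the log above.
sample = """\
E0000 00:00:1762238119.118504 2368741 gpunetio_backend.cpp:1105] Can't register memory for unknown device 3
E0000 00:00:1762238119.118515 2368739 gpunetio_backend.cpp:1105] Can't register memory for unknown device 1
E0000 00:00:1762238119.125715 2368740 gpunetio_backend.cpp:1105] Can't register memory for unknown device 2
"""

if __name__ == "__main__":
    for dev, count in sorted(failed_device_ids(sample).items()):
        print(f"device {dev}: {count} failed registration(s)")
```

In the pasted excerpt the failures are reported for devices 1, 2 and 3; no such line for device 0 appears in this portion of the log.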
