
Openchat hangs when running a model from a docker container #231

@zeionara

Description

Hi, I've made the following Dockerfile to set up the dependencies and run an OpenChat model. However, the container hangs on startup.

FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

# Python tooling plus the openchat serving stack
RUN apt-get update && apt-get install -y python3-pip && apt-get clean
RUN pip3 install packaging torch && pip3 install ochat && pip3 cache purge

RUN apt-get install -y git
RUN pip3 install flash_attn==2.5.8

# $model and $port are expected to be provided as environment variables at run time
ENTRYPOINT python3 -m ochat.serving.openai_api_server --model $model --host 0.0.0.0 --port $port
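
For reference, this is roughly how I build and start the container (the image name, model, and port values here are just placeholders for whatever is actually passed in):

sudo docker build -t openchat-server .
sudo docker run --gpus all \
    -e model=openchat/openchat-3.5-0106-gemma \
    -e port=8000 \
    -p 8000:8000 \
    openchat-server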

The following log is emitted, after which the container hangs and I can't even stop it with sudo docker stop.

INFO 03-10 13:58:32 __init__.py:207] Automatically detected platform cuda.
2025-03-10 13:58:33,122	WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67100672 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 13:58:33,250	INFO worker.py:1821 -- Started a local Ray instance.
INFO 03-10 13:58:40 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 03-10 13:58:40 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='openchat/openchat-3.5-0106-gemma', speculative_config=None, tokenizer='openchat/openchat-3.5-0106-gemma', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openchat/openchat-3.5-0106-gemma, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-10 13:58:43 cuda.py:229] Using Flash Attention backend.
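
Judging by the Ray warning above, /dev/shm inside the container is very small; if that is part of the problem, I assume the flag the warning suggests would be added to the run command like this (size value taken directly from the warning):

sudo docker run --gpus all --shm-size=10.24gb \
    -e model=openchat/openchat-3.5-0106-gemma -e port=8000 \
    -p 8000:8000 openchat-server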

If I run the same container without GPUs, it fails on startup with the following error:

INFO 03-10 14:07:35 __init__.py:211] No platform detected, vLLM is running on UnspecifiedPlatform
openchat.json: 100% 484/484 [00:00<00:00, 9.62MB/s]
2025-03-10 14:07:36,435	WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 14:07:36,562	INFO worker.py:1821 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/ochat/serving/openai_api_server.py", line 373, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 639, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1126, in create_engine_config
    device_config = DeviceConfig(device=self.device)
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1660, in __init__
    raise RuntimeError("Failed to infer device type")
RuntimeError: Failed to infer device type

Please help me fix this.
