Try it out at Unmute.sh!
Unmute is a system that allows text LLMs to listen and speak by wrapping them in Kyutai's speech-to-text and text-to-speech models. The speech-to-text transcribes what the user says, the LLM generates a response in text, and the text-to-speech reads it out loud. Both the STT and TTS are optimized for low latency, and the system works with any text LLM you like.
At a high level, it works like this:
```mermaid
graph LR
    UB[User browser] --> T((Traefik))
    T --> B(Backend)
    T --> F(Frontend)
    B --> STT(Speech-to-text)
    B --> LLM(LLM)
    B --> TTS(Text-to-speech)
```
- The user opens the Unmute website, served by the frontend.
- By clicking "connect", the user establishes a websocket connection to the backend, sending audio and other metadata back and forth in real time.
- The backend connects via websocket to the speech-to-text server, sending it the audio from the user and receiving back the transcription in real time.
- Once the speech-to-text detects that the user has stopped speaking and it's time to generate a response, the backend connects to an LLM server to retrieve the response. We host our own LLM using vLLM, but you could also use an external API like OpenAI or Mistral.
- As the response is being generated, the backend feeds it to the text-to-speech server to read it out loud, and forwards the generated speech to the user.
- Traefik routes requests to paths under `/api` to the backend and the rest to the frontend.
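To make the flow concrete, here is a conceptual sketch in Python of a single conversational turn. It is not the actual backend code (the real orchestration lives in `unmute/unmute_handler.py`), and the `stt`, `llm`, and `tts` objects and their methods are hypothetical stand-ins:

```python
# Conceptual sketch of the loop described above, not the actual backend code.
# The stt/llm/tts objects and their methods are hypothetical stand-ins.
async def handle_turn(user_audio_stream, stt, llm, tts, send_audio_to_user):
    # 1. Stream the user's audio to the STT and collect words as they arrive.
    words = []
    async for event in stt.transcribe(user_audio_stream):
        if event.type == "word":
            words.append(event.text)
        elif event.type == "end_of_turn":  # the STT decided the user stopped speaking
            break

    # 2. Ask the LLM for a streaming text response...
    text_stream = llm.chat_completion(" ".join(words))

    # 3. ...and feed it to the TTS as it is generated, forwarding audio to the user.
    async for audio_chunk in tts.synthesize(text_stream):
        await send_audio_to_user(audio_chunk)
```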
Note
If something isn't working for you, don't hesitate to open an issue. We'll do our best to help you figure out what's wrong.
We provide multiple ways of deploying your own unmute.sh:
| Name | Number of GPUs | Number of machines | Difficulty | Documented | Kyutai support |
|---|---|---|---|---|---|
| Docker Compose | 1+ | 1 | Very easy | ✅ | ✅ |
| Without Docker | 1 to 3 | 1 to 5 | Easy | ✅ | ✅ |
| Docker Swarm | 1 to ~100 | 1 to ~100 | Medium | ✅ | ❌ |
Since Unmute is a complex system with many services that need to be running at the same time, we recommend using Docker Compose to run Unmute. It allows you to start or stop all services using a single command. Since the services are Docker containers, you get a reproducible environment without having to worry about dependencies.
While we support deploying with Docker Compose and without Docker, the Docker Swarm deployment is only provided to show how we deploy and scale unmute.sh. It looks a lot like the Compose files, but since debugging multi-node applications is hard, we cannot help you debug the Swarm deployment.
You can use any LLM you want. By default, Unmute uses Mistral Small 3.2 24B as the LLM. (Gemma 3 12B is also a good choice.) This model is freely available, but you need to accept its conditions of use to access it:
- Create a Hugging Face account.
- Accept the conditions on the Mistral Small 3.2 24B model page.
- Create an access token. You can use a fine-grained token; the only permission you need to grant is "Read access to contents of all public gated repos you can access". Do not use tokens with write access when deploying publicly: if the server were somehow compromised, the attacker would get write access to any models/datasets/etc. you have on Hugging Face.
- Add the token to your `~/.bashrc` or equivalent as `export HUGGING_FACE_HUB_TOKEN=hf_...your token here...`
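Optionally, you can verify the token before starting anything. The sketch below assumes the `huggingface_hub` Python package is installed, and the repo id is an assumption (adjust it to whichever model you actually configured):

```python
# Quick, optional sanity check for the Hugging Face token.
from huggingface_hub import HfApi

api = HfApi()  # picks up HUGGING_FACE_HUB_TOKEN / HF_TOKEN from the environment
print(api.whoami()["name"])  # raises if the token is missing or invalid

# Raises a gated-repo error if you haven't accepted the model's conditions yet.
# The repo id below is an assumption; change it to the model you use.
api.model_info("mistralai/Mistral-Small-3.2-24B-Instruct-2506")
```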
Make sure you have Docker Compose installed. You'll also need the NVIDIA Container Toolkit to allow Docker to access your GPU. To make sure the NVIDIA Container Toolkit is installed correctly, run:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
If you use `meta-llama/Llama-3.2-1B`, the default in `docker-compose.yml`, 16GB of GPU memory is sufficient. If you're running into memory issues, open `docker-compose.yml` and look for `NOTE:` comments to see the places you might need to adjust.
On a machine with a GPU, run:
# Make sure you have the environment variable with the token:
echo $HUGGING_FACE_HUB_TOKEN # This should print hf_...something...
docker compose up --build
On Unmute.sh, we run the speech-to-text, text-to-speech, and the vLLM server on separate GPUs, which improves the latency compared to a single-GPU setup. The TTS latency decreases from ~750ms when running everything on a single L40S GPU to around ~450ms on Unmute.sh.
If you have at least three GPUs available, add this snippet to the `stt`, `tts`, and `llm` services to ensure they run on separate GPUs:
```yaml
stt:  # Similarly for `tts` and `llm`
  # ...other configuration
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
```
Alternatively, you can choose to run Unmute by manually starting the services without going through Docker. This can be more difficult to set up because of the various dependencies needed.
The following instructions only work for Linux and WSL.
- `uv`: Install with `curl -LsSf https://astral.sh/uv/install.sh | sh`
- `cargo`: Install with `curl https://sh.rustup.rs -sSf | sh`
- `pnpm`: Install with `curl -fsSL https://get.pnpm.io/install.sh | sh -`
- CUDA 12.1: Install it with conda or directly from the NVIDIA website. Needed for the Rust processes (TTS and STT).
Start the services one by one, each in its own tmux session or terminal:
./dockerless/start_frontend.sh
./dockerless/start_backend.sh
./dockerless/start_llm.sh # Needs 6.1GB of vram
./dockerless/start_stt.sh # Needs 2.5GB of vram
./dockerless/start_tts.sh # Needs 5.3GB of vram
The website should then be accessible at http://localhost:3000.
Here are some high-level pointers about how you'd go about making certain changes to Unmute.
The characters' voices and prompts are defined in `voices.yaml`. The format of the config file should be intuitive. Certain system prompts contain dynamically generated elements. For example, "Quiz show" has its 5 questions randomly chosen in advance from a fixed list. System prompts like this are defined in `unmute/llm/system_prompt.py`. Note that the file is only loaded when the backend starts and is then cached, so if you change something in `voices.yaml`, you'll need to restart the backend.
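To illustrate what "dynamically generated" means here, a prompt like the quiz show's could be built roughly like this. This is purely illustrative; the questions and function name are made up, and the real logic lives in `unmute/llm/system_prompt.py`:

```python
# Illustrative sketch only; not the actual code in unmute/llm/system_prompt.py.
import random

QUIZ_QUESTIONS = [  # placeholder questions, not the real list
    "What is the capital of Australia?",
    "Which planet has the most moons?",
    "Who wrote 'The Old Man and the Sea'?",
    "How many strings does a standard violin have?",
    "What is the chemical symbol for gold?",
    "Which ocean is the deepest?",
]

def make_quiz_show_prompt() -> str:
    # The questions are chosen once, when the prompt is built, then stay fixed.
    questions = random.sample(QUIZ_QUESTIONS, 5)
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return f"You are the host of a quiz show. Ask these questions one at a time:\n{numbered}"
```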
The backend and frontend communicate over a websocket using a protocol based on the OpenAI Realtime API ("ORA"). Where possible, we try to match the ORA format, but there are some extra messages we needed to add, and others have simplified parameters. We try to make it clear where we deviate from the ORA format; see `unmute/openai_realtime_api_events.py`.
For detailed information about the WebSocket communication protocol, message types, and audio processing pipeline, see the browser-backend communication documentation.
Ideally, it should be simple to write a single frontend that can communicate with either the Unmute backend or the OpenAI Realtime API, but we are not fully compatible yet. Contributions welcome!
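To get a feel for the wire format, here is a minimal client sketch using the `websockets` library. The endpoint path, the event payloads, and the audio encoding are assumptions based on the ORA format; check `unmute/openai_realtime_api_events.py` and the protocol documentation for what the backend actually expects.

```python
# Minimal ORA-style client sketch. The endpoint path and exact event shapes are
# assumptions; see unmute/openai_realtime_api_events.py for the real definitions.
import asyncio
import base64
import json

import websockets  # pip install websockets


async def main():
    url = "ws://localhost:8000/v1/realtime"  # hypothetical path, adjust to your deployment
    async with websockets.connect(url) as ws:
        # Send a chunk of audio in the ORA "input_audio_buffer.append" style.
        # (Encoding details omitted; the backend expects a specific audio format.)
        fake_audio = base64.b64encode(b"\x00" * 960).decode()
        await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": fake_audio}))

        # Print whatever events come back (transcriptions, text deltas, audio deltas, ...).
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))


asyncio.run(main())
```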
The frontend is a Next.js app defined in `frontend/`. If you'd like to compare to a different frontend implementation, there is a Python client defined in `unmute/loadtest/loadtest_client.py`, a script that we use to benchmark the latency and throughput of Unmute.
This is a common requirement, so we would appreciate a contribution to support tool calling in Unmute!
- Look into how vLLM does tool calling and modify the vLLM call in `docker-compose.yml` to use appropriate arguments.
- On the Unmute side, modify `_generate_response_task()` in `unmute/unmute_handler.py`. Currently, `llm.chat_completion()` yields the words one by one. This will need to be changed to also allow yielding tool calls and handling them (a rough sketch of the streaming side follows below).
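As a starting point, here is a sketch of what consuming streamed tool calls from an OpenAI-compatible endpoint (which vLLM exposes, once tool calling is enabled) can look like. This is not the existing `_generate_response_task()` code; the base URL, model name, and example tool are placeholders.

```python
# Sketch of streaming a chat completion that may yield either text or tool calls.
# The endpoint, model name, and the example tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-for-local-vllm")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather in a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",  # whichever model you serve
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # normal text: forward to the TTS as before
        print(delta.content, end="", flush=True)
    for tool_call in delta.tool_calls or []:  # tool-call arguments arrive in fragments
        print("\n[tool call fragment]", tool_call.function.name, tool_call.function.arguments)
```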
If you're curious to know how we deploy and scale unmute.sh, take a look at our docs on the swarm deployments.
First install `pre-commit` itself. You likely want to install it globally using `pip install pre-commit` rather than in a virtual environment or `uv`, because you need the `pre-commit` executable to always be available. Then run:
pre-commit install --hook-type pre-commit
We recommend using uv to manage Python dependencies. The commands below assume you are using uv.
uv run fastapi dev unmute/main_websocket.py  # development server with auto-reload
uv run fastapi run unmute/main_websocket.py  # production server
`loadtest_client.py` is a script that connects to Unmute and simulates conversations with it in order to measure latency and throughput.
uv run unmute/loadtest/loadtest_client.py --server-url ws://localhost:8000 --n-workers 16