- Multi-modal Interactions: Communicate with the AI using text, voice, images, and video
- Voice Conversations: Natural voice-based conversations with automatic speech detection
- Real-time Visualizations: Dynamic audio visualizations that respond to speech and AI processing
- File Uploads: Support for uploading and processing various file types
- Responsive Design: Works seamlessly on desktop and mobile devices
- Dark Mode Support: Automatic theme switching based on system preferences
Note: I have not tested image/video input yet; I do not have enough VRAM.
- Docker and Docker Compose
- NVIDIA GPU with 24GB+ VRAM recommended
- NVIDIA Container Toolkit (nvidia-docker)
- 50GB+ of free disk space (for model weights and Docker images)
```bash
git clone https://github.com/phildougherty/qwen2.5_omni_chat.git
cd qwen2.5_omni_chat
```
The application requires SSL certificates for secure WebSocket connections. You have two options:
```bash
mkdir -p certs
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout certs/key.pem -out certs/cert.pem
```
When prompted, use `localhost` as the Common Name if you're running the application locally.
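To skip the interactive prompts entirely, the subject can be supplied inline with `-subj` (a minimal sketch; the `/CN=localhost` value assumes a local deployment):

```bash
# Generate a self-signed certificate non-interactively
mkdir -p certs
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout certs/key.pem -out certs/cert.pem \
  -subj "/CN=localhost"
```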
If you have a domain name and want to deploy the application publicly:
1. Install certbot:

   ```bash
   sudo apt-get update
   sudo apt-get install certbot
   ```

2. Generate certificates:

   ```bash
   sudo certbot certonly --standalone -d yourdomain.com
   ```

3. Copy the certificates to the project:

   ```bash
   sudo cp /etc/letsencrypt/live/yourdomain.com/fullchain.pem certs/cert.pem
   sudo cp /etc/letsencrypt/live/yourdomain.com/privkey.pem certs/key.pem
   sudo chmod 644 certs/cert.pem certs/key.pem
   ```
Edit `backend/app/config.py` to customize the model settings:

```python
# Model settings
MODEL_PATH: str = "Qwen/Qwen2.5-Omni-7B"  # Model ID from Hugging Face
DEVICE_MAP: str = "auto"                  # Device mapping strategy
ATTN_IMPLEMENTATION: str = "sdpa"         # Attention implementation (sdpa recommended)
DEFAULT_VOICE: str = "Chelsie"            # Default voice for audio responses
SAMPLE_RATE: int = 24000                  # Audio sample rate
ENABLE_AUDIO_OUTPUT: bool = True          # Enable audio generation
USE_AUDIO_IN_VIDEO: bool = True           # Extract audio from video inputs
MODEL_MAX_LENGTH: int = 8192              # Maximum context length
```
Edit `nginx/nginx.conf` to customize the server settings:
- Port configuration (default: 80 for HTTP, 443 for HTTPS)
- SSL settings
- Proxy settings for the backend
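As a rough orientation, the relevant pieces of `nginx.conf` typically look like the sketch below; the certificate paths and the backend upstream (`backend:8000`) are assumptions, so match them to the repo's actual config:

```nginx
server {
    listen 443 ssl;              # HTTPS port inside the container
    server_name localhost;

    # Paths assume the certs/ directory is mounted into the container
    ssl_certificate     /etc/nginx/certs/cert.pem;
    ssl_certificate_key /etc/nginx/certs/key.pem;

    location / {
        proxy_pass http://backend:8000;  # backend service name/port are assumptions
    }
}
```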
Edit `docker-compose.yml` to customize:
- Port mappings (default: 8888 for HTTP, 8443 for HTTPS)
- GPU allocation
- Volume mounts
```bash
docker-compose up -d
```
This will:
- Build the Docker images for the frontend, backend, and nginx
- Download the Qwen2.5-Omni model (this may take some time on first run)
- Start all services
Open your browser and navigate to:

- `https://localhost:8443` (if using self-signed certificates)
- `https://yourdomain.com:8443` (if using Let's Encrypt certificates)
If you're using self-signed certificates, you'll need to accept the security warning in your browser.
- Click the microphone button in the bottom right to start a voice call
- Speak naturally - the system will automatically detect speech and silence
- Use the mute button to temporarily disable the microphone
- Click the end call button to terminate the voice session
- Type your message in the text input field
- Click the paperclip icon to attach files (images, videos, audio, documents)
- Press Enter or click the send button to submit your message
- Space: Toggle voice call (when not focused on text input)
- Escape: End voice call
- Ctrl+R or Cmd+R: Reset conversation
- Enter: Send message
- Shift+Enter: New line in text input
- `Qwen/Qwen2.5-Omni-7B`: 7-billion-parameter model (default)
The model supports multiple voice options. Change `DEFAULT_VOICE` in `backend/app/config.py`:

- `Chelsie`: default female voice
- `Ethan`: male voice
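For example, switching to the male voice is a one-line change (the setting name comes from the model configuration shown above):

```python
# backend/app/config.py
DEFAULT_VOICE: str = "Ethan"  # was "Chelsie"
```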
By default, the application uses:
- Port 8888 for HTTP
- Port 8443 for HTTPS
To change these ports, edit the `docker-compose.yml` file:

```yaml
nginx:
  ports:
    - "8888:80"    # Change 8888 to your desired HTTP port
    - "8443:443"   # Change 8443 to your desired HTTPS port
```
If you encounter "CUDA out of memory" errors:
- Reduce `MAX_CONVERSATION_TURNS` in `backend/app/config.py`
- Increase the GPU memory allocation in `docker-compose.yml`
If you have trouble with WebSocket connections:
- Ensure your SSL certificates are properly configured
- Check that ports 8888 and 8443 (or your custom ports) are open in your firewall
- Verify that the nginx configuration is correctly routing WebSocket requests
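For reference, WebSocket routing in nginx requires the HTTP/1.1 upgrade headers. A minimal sketch (the `/ws` path and `backend:8000` upstream are assumptions; adapt them to the actual `nginx.conf`):

```nginx
location /ws {
    proxy_pass http://backend:8000;           # assumed backend service name/port
    proxy_http_version 1.1;                   # WebSockets require HTTP/1.1
    proxy_set_header Upgrade $http_upgrade;   # pass through the upgrade request
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 86400;                 # keep long-lived connections open
}
```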
If audio input or output isn't working:
- Ensure your browser has permission to access the microphone
- Click anywhere on the page to enable audio (browser autoplay policy)
- Check that `ENABLE_AUDIO_OUTPUT` is set to `True` in the backend configuration
To optimize memory usage, adjust these settings in `backend/app/config.py`:

```python
# Memory management settings
MAX_CONVERSATION_TURNS: int = 3      # Maximum number of conversation turns to keep
CLEANUP_AFTER_RESPONSE: bool = True  # Force garbage collection after responses
PYTORCH_CUDA_ALLOC_CONF: str = "expandable_segments:True"  # PyTorch memory allocation
```
Ensure your `docker-compose.yml` is configured to use the NVIDIA runtime:

```yaml
backend:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
```
This project is licensed under the MIT License - see the LICENSE file for details.
- Qwen Team at Alibaba for creating the Qwen2.5-Omni model
- Hugging Face for hosting the model and providing the transformers library
- FastAPI for the backend framework
- NVIDIA for GPU acceleration support
This project uses the Qwen2.5-Omni model which has its own license and usage restrictions. Please review the Qwen2.5-Omni license before using this application for any purpose.
For questions, issues, or feature requests, please open an issue on GitHub.