Real-time Voice-Controlled 3D Avatar with Multimodal AI
Your 3D AI companion that never stops listening, never stops caring.
Transform conversations into immersive experiences with AI-powered 3D avatars that see, hear, and respond naturally.
- 🎤 Real-time Voice Activity Detection - Advanced VAD with configurable sensitivity (a minimal sketch follows this list)
- 🗣️ Speech-to-Text - Powered by OpenAI Whisper (tiny model) for instant transcription
- 👁️ Vision Understanding - SmolVLM2-256M-Video-Instruct for multimodal comprehension
- 🔊 Natural Text-to-Speech - Kokoro TTS with native word-level timing
- 🎭 3D Avatar Animation - Lip-sync and emotion-driven animations using TalkingHead
- 📹 Camera Integration - Real-time image capture with voice commands
- ⚡ Streaming Responses - Chunked audio generation for minimal latency
- 🎬 Native Timing Sync - Perfect lip-sync using Kokoro's native timing data
- 🎨 Draggable Camera View - Floating, resizable camera interface
- 📊 Real-time Analytics - Voice energy visualization and transmission tracking
- 🔌 WebSocket Communication - Low-latency bidirectional data flow
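The energy-based VAD above can be pictured as a threshold over per-frame RMS energy. The sketch below is a minimal illustration assuming float32 audio frames; the function names, threshold, and frame handling are made up for this example, not the project's actual implementation.

```python
import numpy as np

def frame_energy(frame: np.ndarray) -> float:
    """RMS energy of one audio frame (float32 samples in [-1, 1])."""
    return float(np.sqrt(np.mean(frame ** 2)))

def detect_speech(frames, energy_threshold=0.02, min_speech_frames=5):
    """Yield concatenated runs of frames whose energy exceeds the threshold.

    energy_threshold and min_speech_frames are illustrative defaults,
    not the project's tuned values.
    """
    run = []
    for frame in frames:
        if frame_energy(frame) >= energy_threshold:
            run.append(frame)
        elif run:
            if len(run) >= min_speech_frames:
                yield np.concatenate(run)
            run = []
    if len(run) >= min_speech_frames:
        yield np.concatenate(run)
```

Raising `energy_threshold` makes the detector less sensitive to background noise, which is exactly the "configurable sensitivity" knob the feature list refers to.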
- 🧠 AI Models from HuggingFace 🤗 (loading sketch below):
  - openai/whisper-tiny - Speech recognition
  - HuggingFaceTB/SmolVLM2-256M-Video-Instruct - Vision-language understanding
  - Kokoro TTS - High-quality voice synthesis
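For orientation, the Whisper checkpoint can be pulled from the Hub with the standard `transformers` pipeline. This is a generic loading sketch, not the project's actual code (which may add device placement or Flash Attention 2); `audio.wav` is a placeholder path.

```python
from transformers import pipeline

# Downloads openai/whisper-tiny from the HuggingFace Hub on first use.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Transcribe a recording; "audio.wav" is a placeholder file path.
result = asr("audio.wav")
print(result["text"])
```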
- ⚡ Framework: FastAPI with WebSocket support (endpoint sketch below)
- 🧠 Processing: PyTorch, Transformers, Flash Attention 2
- 🎵 Audio: SoundFile, NumPy for real-time processing
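A minimal FastAPI WebSocket endpoint of the kind this stack implies could look like the following. The `/ws` route name and message shapes are assumptions for illustration, not the project's actual protocol.

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws")  # route name is an assumption
async def audio_stream(websocket: WebSocket):
    """Accept a socket, receive audio chunks, and stream replies back."""
    await websocket.accept()
    try:
        while True:
            chunk = await websocket.receive_bytes()  # raw PCM from the browser
            # ... VAD -> Whisper -> VLM -> Kokoro TTS would run here ...
            await websocket.send_json({"received_bytes": len(chunk)})
    except WebSocketDisconnect:
        pass  # client closed the connection
```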
- 🖼️ Framework: Next.js 15 with TypeScript
- 🎨 UI: Tailwind CSS + shadcn/ui components
- 🎭 3D Rendering: TalkingHead library
- 🎙️ Audio: Web Audio API with AudioWorklet
- 📡 Communication: Native WebSocket with React Context
- 📦 Package Management: UV (Python) + PNPM (Node.js)
- 🎨 Code Formatting:
  - Backend: Black (Python)
  - Frontend: Prettier (TypeScript/React)
- 🔍 Quality Control: Husky for pre-commit hooks
- OS: Windows 11 (Linux/macOS support coming soon via a Docker image)
- GPU: NVIDIA RTX 3070 (8GB VRAM)
- Node.js 20+
- PNPM
- Python 3.10
- UV (Python package manager)
# Sets up both frontend and backend (requires the prerequisites above)
pnpm run monorepo-setup
# Format code before committing (recommended)
pnpm format
Start Development Servers
# Run both frontend and backend from root
pnpm dev
# Or run individually
pnpm dev:client # Frontend (http://localhost:3000)
pnpm dev:server # Backend (http://localhost:8000)
- Allow microphone access when prompted
- Enable camera for multimodal interactions
- Click "Connect" to establish WebSocket connection
- Start Voice Control and begin speaking!
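To sanity-check the backend outside the browser, a tiny client built on the `websockets` package can connect and push one silent chunk. The endpoint path and payload format are assumptions; check the server code for the real contract.

```python
import asyncio
import websockets

async def main():
    # ws://localhost:8000/ws is an assumed endpoint path.
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(b"\x00" * 3200)  # ~100 ms of silent 16-bit PCM at 16 kHz
        print(await ws.recv())         # whatever the server sends back

asyncio.run(main())
```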
- Drag to move camera window
- Resize with maximize/minimize buttons
- Toggle on/off as needed
- Energy Threshold: Adjust sensitivity to background noise
- Pause Duration: How long to wait before processing speech
- Min/Max Speech: Control segment length limits (grouped in the config sketch below)
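Conceptually these knobs form a single configuration object. A hypothetical grouping, with illustrative names and defaults rather than the project's actual ones:

```python
from dataclasses import dataclass

@dataclass
class VADConfig:
    """Hypothetical container for the tunable VAD settings listed above."""
    energy_threshold: float = 0.02  # higher -> less sensitive to background noise
    pause_duration_s: float = 0.8   # silence required before a segment is finalized
    min_speech_s: float = 0.3       # discard blips shorter than this
    max_speech_s: float = 15.0      # force a cut on very long utterances
```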
- TalkingHead (met4citizen) for 3D avatar rendering and lip-sync
- yeyu2 (Multimodal-local-phi4) for multimodal implementation inspiration
⭐ Star this repo if you find it useful! ⭐
Made with ❤️ by Kiranbaby14