
🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync


kiranbaby14/TalkMateAI


🎭 TalkMateAI

Real-time Voice-Controlled 3D Avatar with Multimodal AI

Your 3D AI companion that never stops listening, never stops caring.

Transform conversations into immersive experiences with AI-powered 3D avatars that see, hear, and respond naturally.

Python FastAPI Next.js CUDA License

🎥 Demo Video

TalkMateAI Demo

✨ Features

🎯 Core Capabilities

  • 🎤 Real-time Voice Activity Detection - Advanced VAD with configurable sensitivity
  • 🗣️ Speech-to-Text - Powered by OpenAI Whisper (tiny model) for instant transcription
  • 👁️ Vision Understanding - SmolVLM2-256M-Video-Instruct for multimodal comprehension
  • 🔊 Natural Text-to-Speech - Kokoro TTS with native word-level timing
  • 🎭 3D Avatar Animation - Lip-sync and emotion-driven animations using TalkingHead

🚀 Advanced Features

  • 📹 Camera Integration - Real-time image capture with voice commands
  • ⚡ Streaming Responses - Chunked audio generation for minimal latency
  • 🎬 Native Timing Sync - Perfect lip-sync using Kokoro's native timing data
  • 🎨 Draggable Camera View - Floating, resizable camera interface
  • 📊 Real-time Analytics - Voice energy visualization and transmission tracking
  • 🔄 WebSocket Communication - Low-latency bidirectional data flow
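
The streaming and timing-sync features above can be pictured with a minimal sketch, assuming the reply is split into sentence-sized chunks and that each chunk's word timings arrive relative to the start of that chunk. The function names, the `WordTiming` type, and the JSON message schema are all hypothetical illustrations, not the project's actual code or wire format.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the start of the whole reply
    end: float


def split_into_chunks(text: str) -> list[str]:
    """Split a reply into sentence-sized chunks so synthesis can begin early."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def offset_timings(chunk_timings, chunk_offset: float) -> list[WordTiming]:
    """Shift chunk-relative (word, start, end) tuples onto the reply timeline."""
    return [
        WordTiming(word, round(start + chunk_offset, 3), round(end + chunk_offset, 3))
        for word, start, end in chunk_timings
    ]


def build_message(chunk_index: int, timings: list[WordTiming]) -> str:
    """Serialize one chunk's timing data as a JSON text frame (hypothetical schema)."""
    return json.dumps(
        {
            "type": "audio_chunk",
            "index": chunk_index,
            "words": [asdict(t) for t in timings],
        }
    )
```

A client receiving such frames could schedule mouth shapes per word against its own audio clock; sending timings alongside each chunk means the avatar does not have to wait for the full reply.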

๐Ÿ—๏ธ Architecture

System Architecture

๐Ÿ› ๏ธ Technology Stack

Backend (Python)

  • 🧠 AI Models from HuggingFace🤗:
    • openai/whisper-tiny - Speech recognition
    • HuggingFaceTB/SmolVLM2-256M-Video-Instruct - Vision-language understanding
    • Kokoro TTS - High-quality voice synthesis
  • ⚡ Framework: FastAPI with WebSocket support
  • 🔧 Processing: PyTorch, Transformers, Flash Attention 2
  • 🎵 Audio: SoundFile, NumPy for real-time processing

Frontend (TypeScript/React)

  • ๐Ÿ–ผ๏ธ Framework: Next.js 15 with TypeScript
  • ๐ŸŽจ UI: Tailwind CSS + shadcn/ui components
  • ๐ŸŽญ 3D Rendering: TalkingHead library
  • ๐ŸŽ™๏ธ Audio: Web Audio API with AudioWorklet
  • ๐Ÿ“ก Communication: Native WebSocket with React Context

🔧 Development Tools

  • 📦 Package Management: UV (Python) + PNPM (Node.js)
  • 🎨 Code Formatting:
    • Backend: Black (Python)
    • Frontend: Prettier (TypeScript/React)
  • 🔍 Quality Control: Husky for pre-commit hooks

📋 Requirements

System Tested On

  • OS: Windows 11 (Linux/macOS support is planned, via a Docker image)
  • GPU: NVIDIA RTX 3070 (8GB VRAM)

🚀 Quick Start

1. Prerequisites

  • Node.js 20+
  • PNPM
  • Python 3.10
  • UV (Python package manager)
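
Before running the setup step, it can help to confirm the prerequisites are actually on the PATH. The following is a minimal sketch, assuming the tools are invoked as `node`, `pnpm`, and `uv`; the helper names are hypothetical and this script is not part of the repository.

```python
import shutil
import subprocess
import sys


def find_missing(tools):
    """Return the subset of command names that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def parse_major(version_text: str) -> int:
    """Extract the major version from strings such as 'v20.11.1' or '20.11.1'."""
    return int(version_text.strip().lstrip("v").split(".")[0])


def check_prerequisites() -> list[str]:
    """Collect human-readable problems; an empty list means everything looks fine."""
    problems = [f"{tool} not found on PATH" for tool in find_missing(["node", "pnpm", "uv"])]
    if sys.version_info[:2] != (3, 10):
        problems.append(f"Python 3.10 expected, found {sys.version_info.major}.{sys.version_info.minor}")
    if shutil.which("node"):
        # `node --version` prints something like 'v20.11.1'.
        out = subprocess.run(["node", "--version"], capture_output=True, text=True).stdout
        if out and parse_major(out) < 20:
            problems.append(f"Node.js 20+ expected, found {out.strip()}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the file directly prints one warning per missing or mismatched prerequisite and nothing when all checks pass.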

2. Setup monorepo dependencies from root

# Sets up both frontend and backend; requires the prerequisites above
pnpm run monorepo-setup

3. Development Workflow

# Format code before committing (recommended)
pnpm format

4. Run the Application

Start Development Servers

# Run both frontend and backend from root
pnpm dev

# Or run individually
pnpm dev:client  # Frontend (http://localhost:3000)
pnpm dev:server  # Backend (http://localhost:8000)

5. Initial Setup

  1. Allow microphone access when prompted
  2. Enable camera for multimodal interactions
  3. Click "Connect" to establish WebSocket connection
  4. Start Voice Control and begin speaking!

🎮 Usage Guide

Camera Controls

  • Drag to move camera window
  • Resize with maximize/minimize buttons
  • Toggle on/off as needed

Voice Settings

  • Energy Threshold: Adjust sensitivity to background noise
  • Pause Duration: How long to wait before processing speech
  • Min/Max Speech: Control segment length limits
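
To show how these four settings interact, here is a minimal sketch of an energy-based segmenter, assuming fixed-length frames of float samples and RMS energy as the loudness measure. The `VADConfig` fields, their default values, and the frame length are illustrative assumptions, not the values or algorithm the app actually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class VADConfig:
    energy_threshold: float = 0.01  # RMS above this counts as speech
    pause_duration: float = 0.5     # seconds of silence that end a segment
    min_speech: float = 0.3         # segments shorter than this are dropped
    max_speech: float = 10.0        # segments are force-closed at this length
    frame_duration: float = 0.1     # seconds of audio per frame (assumed)


def rms(frame) -> float:
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def segment_speech(frames, cfg=None):
    """Return (start, end) frame index pairs, end exclusive, for detected speech."""
    cfg = cfg or VADConfig()
    # Work in whole frames to avoid floating-point comparisons on durations.
    pause_frames = max(1, round(cfg.pause_duration / cfg.frame_duration))
    min_frames = max(1, round(cfg.min_speech / cfg.frame_duration))
    max_frames = max(1, round(cfg.max_speech / cfg.frame_duration))

    segments = []
    start = None        # first frame of the open segment, if any
    last_voiced = None  # most recent frame above the threshold

    def close(end):
        nonlocal start, last_voiced
        if end - start >= min_frames:
            segments.append((start, end))
        start = None
        last_voiced = None

    for i, frame in enumerate(frames):
        voiced = rms(frame) > cfg.energy_threshold
        if start is None:
            if voiced:
                start = last_voiced = i
            continue
        if voiced:
            last_voiced = i
        if i - last_voiced >= pause_frames:
            close(last_voiced + 1)   # trailing silence is trimmed
        elif i + 1 - start >= max_frames:
            close(i + 1)             # cap overly long speech
    if start is not None:
        close(last_voiced + 1)
    return segments
```

Raising `energy_threshold` makes the detector ignore more background noise, while a longer `pause_duration` tolerates natural pauses at the cost of a slower response after the speaker stops.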

๐Ÿ™ Acknowledgments


โญ Star this repo if you find it useful! โญ

Made with โค๏ธ by the Kiranbaby14
