Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, and NPUs, with backend support for CUDA, Metal, Vulkan, and Qualcomm NPU. It handles multiple input modalities, including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON schema-based function calling and streaming, and it supports model formats such as GGUF, MLX, and Nexa AI's own `.nexa` format, enabling efficient quantized inference across diverse platforms.
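For a concrete feel of that API surface, here is a minimal sketch of JSON schema-based function calling against a locally running server. The `/v1/chat/completions` route is assumed from OpenAI compatibility, and the port and model name are just examples (see `nexa serve` below):

```bash
# Sketch: JSON schema-based function calling via the OpenAI-compatible server.
# The route, port, and model name are assumptions, not confirmed defaults.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-4B-4bit-MLX",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```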
- LLM inference with DeepSeek-r1-distill-Qwen-1.5B and Llama3.2-3B on Intel NPU
- Real-time speech recognition with Parakeet v3 model
- First-ever Gemma-3n multimodal inference for GPU & CPU, in GGUF format
- SDXL image generation from Civitai for GPU
- EmbeddingGemma for Qualcomm NPU
- Phi4-mini turbo and Phi3.5-mini for Qualcomm NPU
- Parakeet V3 model for Qualcomm NPU
- Nexa ML Turbo engine for optimized NPU performance
- Try Phi4-mini turbo and Llama3.2-3B-NPU-Turbo
- 80% faster at shorter contexts (<=2048), 33% faster at longer contexts (>2048) than current NPU solutions
- Unified interface supporting NPU/GPU/CPU backends
- Single installer architecture eliminating dependency conflicts
- Lazy loading and plugin isolation for improved performance
- OmniNeural-4B: the first multimodal AI model built natively for NPUs — handling text, images, and audio in one model
- Check out the model and demos in the Hugging Face repo
- Read our OmniNeural-4B technical blog
- Support for Parakeet and Kokoro models in MLX format
- New `/mic` mode to transcribe live speech directly in your terminal
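A minimal sketch of the `/mic` flow, assuming an ASR-capable model is loaded in an interactive session (the repo name below is illustrative, not an exact model ID):

```bash
# Load a speech model (placeholder repo name), then type /mic at the
# prompt to start transcribing live microphone input
nexa infer NexaAI/parakeet-v3-mlx
> /mic
```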
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
You can run any compatible GGUF, MLX, or `.nexa` model from 🤗 Hugging Face by using its full repo name.
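For example, any full repo name used elsewhere in this README works directly:

```bash
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```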
Tip
You need to download the arm64 build with Qualcomm NPU support and make sure your laptop has a Snapdragon® X Elite chip.
- Login & Get Access Token (required for Pro Models)
  - Create an account at sdk.nexa.ai
  - Go to Deployment → Create Token
  - Run this once in your terminal (replace `<your_token_here>` with your token):
    nexa config set license '<your_token_here>'
- Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU:
  nexa infer omni-neural
  nexa infer NexaAI/OmniNeural-4B
  nexa infer NexaAI/qwen3-1.7B-npu
Tip
GGUF runs on macOS, Linux, and Windows.
📝 Run and chat with LLMs, e.g. Qwen3:
nexa infer ggml-org/Qwen3-1.7B-GGUF
🖼️ Run and chat with multimodal models, e.g. Qwen2.5-Omni:
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
Tip
MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results. For example:
📝 Run and chat with LLMs, e.g. Qwen3:
nexa infer NexaAI/Qwen3-4B-4bit-MLX
🖼️ Run and chat with multimodal models, e.g. Gemma3n:
nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX
| Essential Command | What it does |
|---|---|
| `nexa -h` | Show all CLI commands |
| `nexa pull <repo>` | Interactive download & cache of a model |
| `nexa infer <repo>` | Local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <repo>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch OpenAI-compatible REST server |
| `nexa run <repo>` | Chat with a model via an existing server |
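Once the server is up via `nexa serve`, it speaks the OpenAI wire format, so a streaming request is a plain curl call. This is a sketch: the `/v1/chat/completions` route is assumed from OpenAI compatibility, and the model name is one of the examples above:

```bash
# Streaming chat completion against the local OpenAI-compatible server
# (-N disables curl's buffering so tokens print as they arrive)
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/Qwen3-1.7B-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```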
👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!
See CLI Reference for full commands.
We would like to thank the following projects: