Nexa SDK

Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, and NPUs, with backend support for CUDA, Metal, Vulkan, and Qualcomm / Intel / AMD NPUs. It handles multiple input modalities, including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON schema-based function calling and streaming, and it supports model formats such as GGUF, MLX, and Nexa AI's own .nexa format, enabling efficient quantized inference across diverse platforms.
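As a rough sketch of what talking to that server looks like, assuming you have started it with nexa serve --host 127.0.0.1:8080 (see the CLI Reference below) and that it exposes the standard OpenAI /v1/chat/completions route; the model name is just one of the models listed later in this README:

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/Qwen3-1.7B-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

The -N flag disables curl's output buffering, so streamed tokens print as they arrive.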

Qualcomm NPU PC Demos

🖼️ Multi-Image Reasoning
Spot the difference across two images in a multi-round dialogue.

🎤 Image + Audio → Function Call
Snap a poster, add a voice note, and the AI agent creates a calendar event.

🎶 Multi-Audio Comparison
Tell the difference between two music clips locally.

Recent updates

📣 2025.10.01: AMD NPU Support

  • Image Generation with SDXL on AMD NPU

📣 2025.09.23: Intel NPU Support

📣 2025.09.22: Apple Neural Engine (ANE) Support

📣 2025.09.15: New Models Support

📣 2025.09.05: Turbo Engine & Unified Interface

  • Nexa ML Turbo engine for optimized NPU performance
  • Unified interface supporting NPU/GPU/CPU backends:
    • Single installer architecture eliminating dependency conflicts
    • Lazy loading and plugin isolation for improved performance

📣 2025.08.20: Qualcomm NPU Support with NexaML Turbo Engine

📣 2025.08.12: ASR & TTS Support in MLX format

  • Support for Parakeet and Kokoro models in MLX format.
  • New /mic mode to transcribe live speech directly in your terminal.

Installation

macOS and Windows

Download the installer for your platform from the latest GitHub release.

Linux

curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
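Once installed, a quick sanity check is to list the available commands (documented in the CLI Reference below):

nexa -h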

Supported Models

You can run any compatible GGUF, MLX, or .nexa model from 🤗 Hugging Face by passing its full repo name (in the form <org>/<model>) to the CLI.
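For example, to download a model once and then chat with it (using one of the MLX models listed below; any full repo name works the same way):

nexa pull NexaAI/Qwen3-4B-4bit-MLX
nexa infer NexaAI/Qwen3-4B-4bit-MLX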

Qualcomm NPU models

Tip

You need the Windows arm64 build with Qualcomm NPU support, and your laptop must have a Snapdragon® X Elite chip.

Quick Start (Windows arm64, Snapdragon X Elite)

  1. Login & Get Access Token (required for Pro Models)

    • Create an account at sdk.nexa.ai
    • Go to Deployment → Create Token
    • Run this once in your terminal (replace with your token):
      nexa config set license '<your_token_here>'
  2. Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU

nexa infer omni-neural
nexa infer NexaAI/OmniNeural-4B
nexa infer NexaAI/qwen3-1.7B-npu

GGUF models

Tip

GGUF runs on macOS, Linux, and Windows.

📝 Run and chat with LLMs, e.g. Qwen3:

nexa infer ggml-org/Qwen3-1.7B-GGUF

🖼️ Run and chat with Multimodal models, e.g. Qwen2.5-Omni:

nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF

MLX models

Tip

MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results. For example:

📝 Run and chat with LLMs, e.g. Qwen3:

nexa infer NexaAI/Qwen3-4B-4bit-MLX

🖼️ Run and chat with Multimodal models, e.g. Gemma3n:

nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX

CLI Reference

| Essential Command | What it does |
| --- | --- |
| nexa -h | Show all CLI commands |
| nexa pull <repo> | Interactive download & cache of a model |
| nexa infer <repo> | Local inference |
| nexa list | Show all cached models with sizes |
| nexa remove <repo> / nexa clean | Delete one / all cached models |
| nexa serve --host 127.0.0.1:8080 | Launch OpenAI-compatible REST server (see the sketch below) |
| nexa run <repo> | Chat with a model via an existing server |
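The server's JSON schema-based function calling follows the usual OpenAI tools convention. A minimal sketch, assuming the standard /v1/chat/completions route; the get_weather tool and its schema are invented for illustration, and the model name is one of the models listed above:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-4B-4bit-MLX",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }]
  }'

If the model decides to call the tool, the response carries a tool_calls entry with JSON arguments matching the schema, which your application executes and feeds back as a tool message.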

👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!

See CLI Reference for full commands.

Acknowledgements

We would like to thank the following projects:
