A Rust inference server providing OpenAI-compatible APIs for local LLM deployment. Run language models directly from the HuggingFace Hub with native support for macOS Metal, CUDA, and CPU backends.
This is a work in progress and the API is not yet stable!
- Native acceleration on Metal (Apple Silicon) and CUDA
- Direct model loading from HuggingFace Hub
- Run various architectures such as Mistral, Qwen, and TinyLlama
- Generate embeddings using models like all-MiniLM-L6-v2
FastLLM adheres to the following core design principles:
**Simple and Modular**
- Clean, well-documented code structure
- Modular architecture for easy model integration
- Trait-based design for flexible model implementations
- Automatic architecture detection from model configs
**Zero Config**
- Sensible defaults for all features and optimizations
- Automatic hardware detection and optimization
- Smart fallbacks when optimal settings aren't available
**Easy to Extend**
- Clear separation of concerns
- Minimal boilerplate for adding new models
- Comprehensive test coverage and examples
- Detailed documentation for model integration
The goal is to make it as straightforward as possible to add new models while maintaining high performance by default. The sketch below illustrates the idea.
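As a rough illustration of the trait-based design and config-driven architecture detection described above, a backend could look like the following. This is a hypothetical sketch; the trait and function names are illustrative, not FastLLM's actual API:

```rust
// Hypothetical sketch of the trait-based design; names are illustrative,
// not FastLLM's actual API. Requires the `anyhow` crate.
use anyhow::{bail, Result};

/// A text-generation model that can be served behind the chat API.
pub trait CausalLM {
    /// Generate a completion for `prompt`, up to `max_tokens` tokens.
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String>;
}

/// Dispatch on the `architectures` field of the model's HuggingFace
/// `config.json` (e.g. "LlamaForCausalLM"): adding a backend means one
/// trait impl plus one match arm.
pub fn load_model(architecture: &str) -> Result<Box<dyn CausalLM>> {
    match architecture {
        "LlamaForCausalLM" => todo!("construct the Llama backend"),
        "MistralForCausalLM" => todo!("construct the Mistral backend"),
        other => bail!("unsupported architecture: {other}"),
    }
}
```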
| Model Family | Supported Architectures | Example Models |
|---|---|---|
| Llama | `LlamaForCausalLM` | TinyLlama-1.1B-Chat • Any Llama 2 derivative |
| Mistral | `MistralForCausalLM` | Mistral-7B and derivatives • Mixtral-8x7B |
| Qwen | `Qwen2ForCausalLM` • `Qwen2_5_VLForConditionalGeneration` | Qwen2 • Qwen2.5 |
| BERT | `BertModel` • `RobertaModel` • `DebertaModel` | all-MiniLM-L6-v2 • Any BERT/RoBERTa/DeBERTa model |
To build and run FastLLM you need:
- A Rust toolchain (install from [rustup.rs](https://rustup.rs))
- A HuggingFace token (only for gated models)
```bash
# Clone the repository
git clone https://github.com/yourusername/fastllm.git
cd fastllm

# Optional: set a HuggingFace token for gated models
export HF_TOKEN="your_token_here"

# Build the project (macOS Metal)
cargo build --release --features "metal"

# Build the project (Linux CUDA)
cargo build --release --features "cuda"

# Build the project (CPU only)
cargo build --release
```
```bash
# Start with default settings
./target/release/fastllm

# Or specify a model directly
./target/release/fastllm --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
FastLLM can be configured through multiple methods (in order of precedence):
1. **Command-line arguments**

   ```bash
   ./target/release/fastllm --model mistralai/Mistral-7B-v0.1
   ```

2. **Environment variables**

   ```bash
   export FASTLLM_SERVER__HOST=0.0.0.0
   export FASTLLM_SERVER__PORT=8080
   export FASTLLM_MODEL__MODEL_ID=your-model-id
   ```

3. **Configuration file** (`config.json`)

   ```json
   {
     "server": {
       "host": "127.0.0.1",
       "port": 3000
     },
     "model": {
       "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
       "revision": "main"
     }
   }
   ```
Once the server is running, the chat completions endpoint accepts standard OpenAI-style requests:

```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "stream": true
  }'
```
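With `"stream": true`, OpenAI-compatible servers reply with server-sent events, one JSON chunk per token delta. Assuming FastLLM follows that convention, the stream looks roughly like this (fields abbreviated for illustration):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" capital"}}]}

data: [DONE]
```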
The embeddings endpoint works the same way:

```bash
curl http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": "The food was delicious and the service was excellent."
  }'
```
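The response should follow the OpenAI embeddings schema, roughly like this (vector truncated for illustration):

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.0456, ...]
    }
  ],
  "model": "sentence-transformers/all-MiniLM-L6-v2"
}
```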
Planned work includes:
- Support for more architectures (DeepSeek, Phi, etc.)
- A comprehensive benchmarking suite
- Model management API (`/v1/models`)
- Improved caching and optimization
- Multi-GPU inference
Contributions are welcome! Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests
- Share benchmarks and performance reports
Apache 2.0 - See LICENSE for details.