llamactl

Unified management and routing for llama.cpp, MLX, and vLLM models, with a web dashboard.

📚 Full Documentation →

Dashboard Screenshot

Features

🚀 Easy Model Management

  • Multiple Models Simultaneously: Run different models at the same time (7B for speed, 70B for quality)
  • Smart Resource Management: Automatic idle timeout, LRU eviction, and configurable instance limits
  • Web Dashboard: Modern React UI for managing instances, monitoring health, and viewing logs

🔗 Flexible Integration

  • OpenAI API Compatible: Drop-in replacement that routes requests to different models by instance name (see the example below)
  • Multi-Backend Support: Native support for llama.cpp, MLX (Apple Silicon optimized), and vLLM
  • Docker Ready: Run backends in containers with full GPU support
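
Because the API is OpenAI compatible, the instance name doubles as the model name in requests. A minimal sketch, assuming an instance named my-model, the default port 8080, the conventional OpenAI-style /v1/chat/completions path, and an inference key configured in auth.inference_keys:

# Route a chat completion to the instance named "my-model".
# Endpoint path, model-name routing, and Bearer auth follow the usual OpenAI conventions (assumed here).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_INFERENCE_KEY" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'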

🌐 Distributed Deployment

  • Remote Instances: Deploy instances on remote hosts
  • Central Management: Manage everything from a single dashboard with automatic routing

Quick Start

  1. Install a backend (llama.cpp, MLX, or vLLM) - see Prerequisites below
  2. Download llamactl for your platform
  3. Run llamactl and open http://localhost:8080
  4. Create an instance and start running inference!

Prerequisites

Backend Dependencies

For the llama.cpp backend, you need llama-server from llama.cpp installed:

# Homebrew (macOS)
brew install llama.cpp

# Or build from source - see llama.cpp docs
# Or use Docker - no local installation required

For the MLX backend (macOS only), you need MLX-LM installed:

# Install via pip (requires Python 3.8+)
pip install mlx-lm

# Or in a virtual environment (recommended)
python -m venv mlx-env
source mlx-env/bin/activate
pip install mlx-lm

For the vLLM backend, you need vLLM installed:

# Install via pip (requires Python 3.8+, GPU required)
pip install vllm

# Or in a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm

# Or use Docker - no local installation required

Docker Support

llamactl can run backends in Docker containers, eliminating the need for local backend installation:

backends:
  llama-cpp:
    docker:
      enabled: true
  vllm:
    docker:
      enabled: true

Installation

Option 1: Download Binary (Recommended)

# Linux/macOS - Get latest version and download
LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz
sudo mv llamactl /usr/local/bin/

# Or download manually from the releases page:
# https://github.com/lordmathis/llamactl/releases/latest

# Windows - Download from releases page

Option 2: Docker (No local backend installation required)

# Clone repository and build Docker images
git clone https://github.com/lordmathis/llamactl.git
cd llamactl
mkdir -p data/llamacpp data/vllm models

# Build and start llamactl with llama.cpp CUDA backend
docker-compose -f docker/docker-compose.yml up llamactl-llamacpp -d

# Build and start llamactl with vLLM CUDA backend
docker-compose -f docker/docker-compose.yml up llamactl-vllm -d

# Build from source using multi-stage build
docker build -f docker/Dockerfile.source -t llamactl:source .

Note: Dockerfiles are configured for CUDA. Adapt base images for other platforms (CPU, ROCm, etc.).
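
Since the default images and compose services above target CUDA, it is worth confirming that the host can expose GPUs to containers before starting the stack. This check is plain Docker/NVIDIA tooling rather than anything llamactl-specific, and assumes the NVIDIA Container Toolkit is already installed:

# Verify that containers can see the GPUs (requires the NVIDIA Container Toolkit)
docker run --rm --gpus all ubuntu nvidia-smi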

Option 3: Build from Source

Requires Go 1.24+ and Node.js 22+

git clone https://github.com/lordmathis/llamactl.git
cd llamactl
cd webui && npm ci && npm run build && cd ..
go build -o llamactl ./cmd/server
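
The build produces a single llamactl binary; running it from the repository root should bring up the server on the default port from the Configuration section below:

# Start the server you just built, then open http://localhost:8080
./llamactl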

Usage

  1. Open http://localhost:8080
  2. Click "Create Instance"
  3. Choose backend type (llama.cpp, MLX, or vLLM)
  4. Configure your model and options
  5. Start the instance and use it with any OpenAI-compatible client (see the example below)
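
Most OpenAI-compatible SDKs and tools only need a base URL and an API key, so pointing them at llamactl is typically just two environment variables. The /v1 prefix is an assumption based on the drop-in-replacement claim, and the key is only required when require_inference_auth is enabled (see Configuration below):

# Point an OpenAI-compatible client at llamactl instead of api.openai.com
export OPENAI_BASE_URL="http://localhost:8080/v1"   # assumed /v1 prefix
export OPENAI_API_KEY="your-inference-key"          # one of auth.inference_keys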

Configuration

llamactl works out of the box with sensible defaults; to customize it, override the options below in a configuration file:

server:
  host: "0.0.0.0"                # Server host to bind to
  port: 8080                     # Server port to bind to
  allowed_origins: ["*"]         # Allowed CORS origins (default: all)
  allowed_headers: ["*"]         # Allowed CORS headers (default: all)
  enable_swagger: false          # Enable Swagger UI for API docs

backends:
  llama-cpp:
    command: "llama-server"
    args: []
    environment: {}               # Environment variables for the backend process
    docker:
      enabled: false
      image: "ghcr.io/ggml-org/llama.cpp:server"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "-v", "~/.local/share/llamactl/llama.cpp:/root/.cache/llama.cpp"]
      environment: {}             # Environment variables for the container

  vllm:
    command: "vllm"
    args: ["serve"]
    environment: {}               # Environment variables for the backend process
    docker:
      enabled: false
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g", "-v", "~/.local/share/llamactl/huggingface:/root/.cache/huggingface"]
      environment: {}             # Environment variables for the container

  mlx:
    command: "mlx_lm.server"
    args: []
    environment: {}               # Environment variables for the backend process

instances:
  port_range: [8000, 9000]       # Port range for instances
  data_dir: ~/.local/share/llamactl         # Data directory (platform-specific; see the Configuration Guide)
  configs_dir: ~/.local/share/llamactl/instances  # Instance configs directory
  logs_dir: ~/.local/share/llamactl/logs    # Logs directory
  auto_create_dirs: true         # Auto-create data/config/logs dirs if missing
  max_instances: -1              # Max instances (-1 = unlimited)
  max_running_instances: -1      # Max running instances (-1 = unlimited)
  enable_lru_eviction: true      # Enable LRU eviction for idle instances
  default_auto_restart: true     # Auto-restart new instances by default
  default_max_restarts: 3        # Max restarts for new instances
  default_restart_delay: 5       # Restart delay (seconds) for new instances
  default_on_demand_start: true  # Default on-demand start setting
  on_demand_start_timeout: 120   # Default on-demand start timeout in seconds
  timeout_check_interval: 5      # Idle instance timeout check in minutes

auth:
  require_inference_auth: true   # Require auth for inference endpoints
  inference_keys: []             # Keys for inference endpoints
  require_management_auth: true  # Require auth for management endpoints
  management_keys: []            # Keys for management endpoints

For detailed configuration options including environment variables, file locations, and advanced settings, see the Configuration Guide.
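
The auth section expects pre-shared keys in inference_keys and management_keys; any sufficiently random string works. One convenient (purely illustrative) way to generate one:

# Generate a random key to place in inference_keys or management_keys
openssl rand -hex 32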

License

MIT License - see LICENSE file.
