llamactl
Unified management and routing for llama.cpp, MLX and vLLM models with web dashboard.
Features
🚀 Easy Model Management
- Multiple Model Serving: Run different models simultaneously (7B for speed, 70B for quality)
- On-Demand Instance Start: Automatically launch instances upon receiving API requests
- State Persistence: Ensure instances remain intact across server restarts
🔗 Universal Compatibility
- OpenAI API Compatible: Drop-in replacement - route requests by instance name
- Multi-Backend Support: Native support for llama.cpp, MLX (Apple Silicon optimized), and vLLM
- Docker Support: Run backends in containers
🌐 User-Friendly Interface
- Web Dashboard: Modern React UI for visual management (unlike CLI-only tools)
- API Key Authentication: Separate keys for management vs inference access
⚡ Smart Operations
- Instance Monitoring: Health checks, auto-restart, log management
- Smart Resource Management: Idle timeout, LRU eviction, and configurable instance limits
Quick Start
# 1. Install backend (one-time setup)
# For llama.cpp: https://github.com/ggml-org/llama.cpp#quick-start
# For MLX on macOS: pip install mlx-lm
# For vLLM: pip install vllm
# Or use Docker - no local installation required
# 2. Download and run llamactl
LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-linux-amd64.tar.gz | tar -xz
sudo mv llamactl /usr/local/bin/
# 3. Start the server
llamactl
# Access dashboard at http://localhost:8080
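To confirm the server is up from the command line, you can query the management API. This is a sketch only: it assumes the instance-listing endpoint is GET /api/v1/instances and that "your-key" is one of your configured management keys; adjust both to your setup.
# List configured instances (an empty list on a fresh install)
curl -H "Authorization: Bearer your-key" http://localhost:8080/api/v1/instances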
Usage
Create and manage instances via web dashboard:
- Open http://localhost:8080
- Click "Create Instance"
- Choose backend type (llama.cpp, MLX, or vLLM)
- Set model path and backend-specific options
- Start or stop the instance
Or use the REST API:
# Create llama.cpp instance
curl -X POST localhost:8080/api/v1/instances/my-7b-model \
-H "Authorization: Bearer your-key" \
-d '{"backend_type": "llama_cpp", "backend_options": {"model": "/path/to/model.gguf", "gpu_layers": 32}}'
# Create MLX instance (macOS)
curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
-H "Authorization: Bearer your-key" \
-d '{"backend_type": "mlx_lm", "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"}}'
# Create vLLM instance
curl -X POST localhost:8080/api/v1/instances/my-vllm-model \
-H "Authorization: Bearer your-key" \
-d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}}'
# Use the OpenAI-compatible API (also works with the OpenAI SDK)
curl -X POST localhost:8080/v1/chat/completions \
-H "Authorization: Bearer your-key" \
-d '{"model": "my-7b-model", "messages": [{"role": "user", "content": "Hello!"}]}'
Installation
Option 1: Download Binary (Recommended)
# Linux/macOS - Get latest version and download
LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz
sudo mv llamactl /usr/local/bin/
# Or download manually from the releases page:
# https://github.com/lordmathis/llamactl/releases/latest
# Windows - Download from releases page
Option 2: Build from Source
Requires Go 1.24+ and Node.js 22+
git clone https://github.com/lordmathis/llamactl.git
cd llamactl
cd webui && npm ci && npm run build && cd ..
go build -o llamactl ./cmd/server
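Assuming the web UI build is embedded into the Go binary (as the build order suggests), running the freshly built binary should serve the dashboard on the default port:
# Run the locally built binary
./llamactl
# Dashboard should be available at http://localhost:8080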
Prerequisites
Backend Dependencies
For llama.cpp backend:
You need llama-server from llama.cpp installed:
# Homebrew (macOS)
brew install llama.cpp
# Or build from source - see llama.cpp docs
# Or use Docker - no local installation required
For MLX backend (macOS only): You need MLX-LM installed:
# Install via pip (requires Python 3.8+)
pip install mlx-lm
# Or in a virtual environment (recommended)
python -m venv mlx-env
source mlx-env/bin/activate
pip install mlx-lm
For vLLM backend: You need vLLM installed:
# Install via pip (requires Python 3.8+, GPU required)
pip install vllm
# Or in a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
# Or use Docker - no local installation required
Docker Support
llamactl supports running backends in Docker containers with identical behavior to native execution. This is particularly useful for:
- Production deployments without local backend installation
- Isolating backend dependencies
- GPU-accelerated inference using official Docker images
Docker Configuration
Enable Docker support using the new structured backend configuration:
backends:
  llama-cpp:
    command: "llama-server"
    docker:
      enabled: true
      image: "ghcr.io/ggml-org/llama.cpp:server"
      args: ["run", "--rm", "--network", "host", "--gpus", "all"]

  vllm:
    command: "vllm"
    args: ["serve"]
    docker:
      enabled: true
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
Key Features
- Host Networking: Uses --network host for seamless port management
- GPU Support: Includes --gpus all for GPU acceleration
- Environment Variables: Configure container environment as needed (see the example below)
- Flexible Configuration: Per-backend Docker settings with sensible defaults
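For example, per-backend environment variables can be passed into the container via the environment map shown in the default configuration. The values below (a Hugging Face token and a GPU selection) are purely illustrative; replace them with your own.
backends:
  vllm:
    docker:
      enabled: true
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
      environment:
        HUGGING_FACE_HUB_TOKEN: "hf_..."   # example placeholder, use your own token
        CUDA_VISIBLE_DEVICES: "0"          # example: restrict the container to GPU 0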
Requirements
- Docker installed and running
- For GPU support: nvidia-docker2 (Linux) or Docker Desktop with GPU support
- No local backend installation required when using Docker
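A quick way to check that Docker can see your GPU (assuming the NVIDIA container runtime is set up) is to run nvidia-smi inside a throwaway container:
# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all ubuntu nvidia-smi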
Configuration
llamactl works out of the box with sensible defaults. The full default configuration looks like this:
server:
  host: "0.0.0.0"          # Server host to bind to
  port: 8080               # Server port to bind to
  allowed_origins: ["*"]   # Allowed CORS origins (default: all)
  enable_swagger: false    # Enable Swagger UI for API docs

backends:
  llama-cpp:
    command: "llama-server"
    args: []
    docker:
      enabled: false
      image: "ghcr.io/ggml-org/llama.cpp:server"
      args: ["run", "--rm", "--network", "host", "--gpus", "all"]
      environment: {}

  vllm:
    command: "vllm"
    args: ["serve"]
    docker:
      enabled: false
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
      environment: {}

  mlx:
    command: "mlx_lm.server"
    args: []

instances:
  port_range: [8000, 9000]                        # Port range for instances
  data_dir: ~/.local/share/llamactl               # Data directory (platform-specific, see below)
  configs_dir: ~/.local/share/llamactl/instances  # Instance configs directory
  logs_dir: ~/.local/share/llamactl/logs          # Logs directory
  auto_create_dirs: true                          # Auto-create data/config/logs dirs if missing
  max_instances: -1                               # Max instances (-1 = unlimited)
  max_running_instances: -1                       # Max running instances (-1 = unlimited)
  enable_lru_eviction: true                       # Enable LRU eviction for idle instances
  default_auto_restart: true                      # Auto-restart new instances by default
  default_max_restarts: 3                         # Max restarts for new instances
  default_restart_delay: 5                        # Restart delay (seconds) for new instances
  default_on_demand_start: true                   # Default on-demand start setting
  on_demand_start_timeout: 120                    # Default on-demand start timeout in seconds
  timeout_check_interval: 5                       # Idle instance timeout check in minutes

auth:
  require_inference_auth: true     # Require auth for inference endpoints
  inference_keys: []               # Keys for inference endpoints
  require_management_auth: true    # Require auth for management endpoints
  management_keys: []              # Keys for management endpoints
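In practice you usually override only a handful of these values. A minimal sketch of an llamactl.yaml in the working directory, assuming unspecified fields fall back to the defaults above:
server:
  port: 9090                                   # run the server on a non-default port

instances:
  max_running_instances: 2                     # cap concurrent running instances

auth:
  management_keys: ["your-management-key"]     # placeholder keys, use your own
  inference_keys: ["your-inference-key"]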
For detailed configuration options including environment variables, file locations, and advanced settings, see the Configuration Guide.
License
MIT License - see LICENSE file.