llamactl
Unified management and routing for llama.cpp, MLX and vLLM models with web dashboard.
Features
🚀 Easy Model Management
- Multiple Model Serving: Run different models simultaneously (7B for speed, 70B for quality)
- On-Demand Instance Start: Automatically launch instances upon receiving API requests
- State Persistence: Ensure instances remain intact across server restarts
🔗 Universal Compatibility
- OpenAI API Compatible: Drop-in replacement - route requests by instance name
- Multi-Backend Support: Native support for llama.cpp, MLX (Apple Silicon optimized), and vLLM
- Docker Support: Run backends in containers
🌐 User-Friendly Interface
- Web Dashboard: Modern React UI for visual management (unlike CLI-only tools)
- API Key Authentication: Separate keys for management vs inference access
⚡ Smart Operations
- Instance Monitoring: Health checks, auto-restart, log management
- Smart Resource Management: Idle timeout, LRU eviction, and configurable instance limits
Quick Start
# 1. Install backend (one-time setup)
# For llama.cpp: https://github.com/ggml-org/llama.cpp#quick-start
# For MLX on macOS: pip install mlx-lm
# For vLLM: pip install vllm
# Or use Docker - no local installation required
# 2. Download and run llamactl
LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-linux-amd64.tar.gz | tar -xz
sudo mv llamactl /usr/local/bin/
# 3. Start the server
llamactl
# Access dashboard at http://localhost:8080
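To confirm the server is up from the command line, you can query the management API. This is a sketch only: it assumes the instance-listing endpoint is GET /api/v1/instances and that "your-key" is one of your configured management keys; adjust both to your setup.
# List configured instances (an empty list on a fresh install)
curl -H "Authorization: Bearer your-key" http://localhost:8080/api/v1/instances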
Usage
Create and manage instances via web dashboard:
- Open http://localhost:8080
- Click "Create Instance"
- Choose backend type (llama.cpp, MLX, or vLLM)
- Set model path and backend-specific options
- Start or stop the instance
Or use the REST API:
# Create llama.cpp instance
curl -X POST localhost:8080/api/v1/instances/my-7b-model \
-H "Authorization: Bearer your-key" \
-d '{"backend_type": "llama_cpp", "backend_options": {"model": "/path/to/model.gguf", "gpu_layers": 32}}'
# Create MLX instance (macOS)
curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
-H "Authorization: Bearer your-key" \
-d '{"backend_type": "mlx_lm", "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"}}'
# Create vLLM instance
curl -X POST localhost:8080/api/v1/instances/my-vllm-model \
-H "Authorization: Bearer your-key" \
-d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}}'
# Use the OpenAI-compatible API (also works with the OpenAI SDK)
curl -X POST localhost:8080/v1/chat/completions \
-H "Authorization: Bearer your-key" \
-d '{"model": "my-7b-model", "messages": [{"role": "user", "content": "Hello!"}]}'
Installation
Option 1: Download Binary (Recommended)
# Linux/macOS - Get latest version and download
LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz
sudo mv llamactl /usr/local/bin/
# Or download manually from the releases page:
# https://github.com/lordmathis/llamactl/releases/latest
# Windows - Download from releases page
Option 2: Build from Source
Requires Go 1.24+ and Node.js 22+
git clone https://github.com/lordmathis/llamactl.git
cd llamactl
cd webui && npm ci && npm run build && cd ..
go build -o llamactl ./cmd/server
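Assuming the web UI build is embedded into the Go binary (as the build order suggests), running the freshly built binary should serve the dashboard on the default port:
# Run the locally built binary
./llamactl
# Dashboard should be available at http://localhost:8080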
Prerequisites
Backend Dependencies
For llama.cpp backend:
You need llama-server from llama.cpp installed:
# Homebrew (macOS)
brew install llama.cpp
# Or build from source - see llama.cpp docs
# Or use Docker - no local installation required
For MLX backend (macOS only): You need MLX-LM installed:
# Install via pip (requires Python 3.8+)
pip install mlx-lm
# Or in a virtual environment (recommended)
python -m venv mlx-env
source mlx-env/bin/activate
pip install mlx-lm
For vLLM backend: You need vLLM installed:
# Install via pip (requires Python 3.8+, GPU required)
pip install vllm
# Or in a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
# Or use Docker - no local installation required
Docker Support
llamactl supports running backends in Docker containers with identical behavior to native execution. This is particularly useful for:
- Production deployments without local backend installation
- Isolating backend dependencies
- GPU-accelerated inference using official Docker images
Docker Configuration
Enable Docker support using the new structured backend configuration:
backends:
  llama-cpp:
    command: "llama-server"
    docker:
      enabled: true
      image: "ghcr.io/ggml-org/llama.cpp:server"
      args: ["run", "--rm", "--network", "host", "--gpus", "all"]

  vllm:
    command: "vllm"
    args: ["serve"]
    docker:
      enabled: true
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
Key Features
- Host Networking: Uses --network host for seamless port management
- GPU Support: Includes --gpus all for GPU acceleration
- Environment Variables: Configure container environment as needed (see the example below)
- Flexible Configuration: Per-backend Docker settings with sensible defaults
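For example, per-backend environment variables can be passed into the container via the environment map shown in the default configuration. The values below (a Hugging Face token and a GPU selection) are purely illustrative; replace them with your own.
backends:
  vllm:
    docker:
      enabled: true
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
      environment:
        HUGGING_FACE_HUB_TOKEN: "hf_..."   # example placeholder, use your own token
        CUDA_VISIBLE_DEVICES: "0"          # example: restrict the container to GPU 0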
Requirements
- Docker installed and running
- For GPU support: nvidia-docker2 (Linux) or Docker Desktop with GPU support
- No local backend installation required when using Docker
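A quick way to check that Docker can see your GPU (assuming the NVIDIA container runtime is set up) is to run nvidia-smi inside a throwaway container:
# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all ubuntu nvidia-smi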
Configuration
llamactl works out of the box with sensible defaults. The full default configuration looks like this:
server:
  host: "0.0.0.0"          # Server host to bind to
  port: 8080               # Server port to bind to
  allowed_origins: ["*"]   # Allowed CORS origins (default: all)
  enable_swagger: false    # Enable Swagger UI for API docs

backends:
  llama-cpp:
    command: "llama-server"
    args: []
    docker:
      enabled: false
      image: "ghcr.io/ggml-org/llama.cpp:server"
      args: ["run", "--rm", "--network", "host", "--gpus", "all"]
      environment: {}

  vllm:
    command: "vllm"
    args: ["serve"]
    docker:
      enabled: false
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
      environment: {}

  mlx:
    command: "mlx_lm.server"
    args: []

instances:
  port_range: [8000, 9000]                        # Port range for instances
  data_dir: ~/.local/share/llamactl               # Data directory (platform-specific, see below)
  configs_dir: ~/.local/share/llamactl/instances  # Instance configs directory
  logs_dir: ~/.local/share/llamactl/logs          # Logs directory
  auto_create_dirs: true                          # Auto-create data/config/logs dirs if missing
  max_instances: -1                               # Max instances (-1 = unlimited)
  max_running_instances: -1                       # Max running instances (-1 = unlimited)
  enable_lru_eviction: true                       # Enable LRU eviction for idle instances
  default_auto_restart: true                      # Auto-restart new instances by default
  default_max_restarts: 3                         # Max restarts for new instances
  default_restart_delay: 5                        # Restart delay (seconds) for new instances
  default_on_demand_start: true                   # Default on-demand start setting
  on_demand_start_timeout: 120                    # Default on-demand start timeout in seconds
  timeout_check_interval: 5                       # Idle instance timeout check in minutes

auth:
  require_inference_auth: true     # Require auth for inference endpoints
  inference_keys: []               # Keys for inference endpoints
  require_management_auth: true    # Require auth for management endpoints
  management_keys: []              # Keys for management endpoints
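In practice you usually override only a handful of these values. A minimal sketch of an llamactl.yaml in the working directory, assuming unspecified fields fall back to the defaults above:
server:
  port: 9090                                   # run the server on a non-default port

instances:
  max_running_instances: 2                     # cap concurrent running instances

auth:
  management_keys: ["your-management-key"]     # placeholder keys, use your own
  inference_keys: ["your-inference-key"]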
For detailed configuration options including environment variables, file locations, and advanced settings, see the Configuration Guide.
License
MIT License - see LICENSE file.