Enhance documentation for MLX backend support

2025-09-18 20:01:18 +02:00
parent 11296bc5f8
commit 8820dc1146
4 changed files with 92 additions and 44 deletions

View File

@@ -19,6 +19,10 @@ server:
allowed_origins: ["*"] # Allowed CORS origins (default: all)
enable_swagger: false # Enable Swagger UI for API docs
backends:
llama_executable: llama-server # Path to llama-server executable
mlx_lm_executable: mlx_lm.server # Path to mlx_lm.server executable
instances:
port_range: [8000, 9000] # Port range for instances
data_dir: ~/.local/share/llamactl # Data directory (platform-specific, see below)
@@ -28,7 +32,6 @@ instances:
max_instances: -1 # Max instances (-1 = unlimited)
max_running_instances: -1 # Max running instances (-1 = unlimited)
enable_lru_eviction: true # Enable LRU eviction for idle instances
llama_executable: llama-server # Path to llama-server executable
default_auto_restart: true # Auto-restart new instances by default
default_max_restarts: 3 # Max restarts for new instances
default_restart_delay: 5 # Restart delay (seconds) for new instances
@@ -79,11 +82,23 @@ server:
enable_swagger: false # Enable Swagger UI (default: false)
```
**Environment Variables:**
- `LLAMACTL_HOST` - Server host
- `LLAMACTL_PORT` - Server port
- `LLAMACTL_ALLOWED_ORIGINS` - Comma-separated CORS origins
- `LLAMACTL_ENABLE_SWAGGER` - Enable Swagger UI (true/false)
### Backend Configuration
```yaml
backends:
llama_executable: "llama-server" # Path to llama-server executable (default: "llama-server")
mlx_lm_executable: "mlx_lm.server" # Path to mlx_lm.server executable (default: "mlx_lm.server")
```
**Environment Variables:**
- `LLAMACTL_LLAMA_EXECUTABLE` - Path to llama-server executable
- `LLAMACTL_MLX_LM_EXECUTABLE` - Path to mlx_lm.server executable
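For example, if the backend binaries live outside your `PATH`, you could point llamactl at them via the environment before starting it; the paths below are placeholders, not defaults:

```bash
# Illustrative only: adjust to wherever llama-server and mlx_lm.server are installed
export LLAMACTL_LLAMA_EXECUTABLE=/opt/llama.cpp/bin/llama-server
export LLAMACTL_MLX_LM_EXECUTABLE=/opt/mlx-env/bin/mlx_lm.server
```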
### Instance Configuration
@@ -97,7 +112,6 @@ instances:
max_instances: -1 # Maximum instances (-1 = unlimited)
max_running_instances: -1 # Maximum running instances (-1 = unlimited)
enable_lru_eviction: true # Enable LRU eviction for idle instances
llama_executable: "llama-server" # Path to llama-server executable
default_auto_restart: true # Default auto-restart setting
default_max_restarts: 3 # Default maximum restart attempts
default_restart_delay: 5 # Default restart delay in seconds
@@ -113,9 +127,8 @@ instances:
- `LLAMACTL_LOGS_DIR` - Log directory path
- `LLAMACTL_AUTO_CREATE_DATA_DIR` - Auto-create data/config/logs directories (true/false)
- `LLAMACTL_MAX_INSTANCES` - Maximum number of instances
- `LLAMACTL_MAX_RUNNING_INSTANCES` - Maximum number of running instances
- `LLAMACTL_ENABLE_LRU_EVICTION` - Enable LRU eviction for idle instances
- `LLAMACTL_LLAMA_EXECUTABLE` - Path to llama-server executable
- `LLAMACTL_DEFAULT_AUTO_RESTART` - Default auto-restart setting (true/false)
- `LLAMACTL_DEFAULT_MAX_RESTARTS` - Default maximum restarts
- `LLAMACTL_DEFAULT_RESTART_DELAY` - Default restart delay in seconds
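A minimal sketch of driving the instance limits from the environment instead of the config file; the variable names come from the list above, while the `llamactl` command name and the values are assumptions for illustration:

```bash
# Illustrative values: allow at most 2 running instances and evict idle ones
export LLAMACTL_MAX_RUNNING_INSTANCES=2
export LLAMACTL_ENABLE_LRU_EVICTION=true
export LLAMACTL_DEFAULT_AUTO_RESTART=true
llamactl  # assumed binary name
```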

View File

@@ -4,11 +4,14 @@ This guide will walk you through installing Llamactl on your system.
## Prerequisites
### Backend Dependencies
llamactl supports multiple backends. Install at least one:
**For llama.cpp backend (all platforms):**
You need `llama-server` from [llama.cpp](https://github.com/ggml-org/llama.cpp) installed:
**Quick install methods:**
```bash
# Homebrew (macOS/Linux)
brew install llama.cpp
@@ -18,6 +21,22 @@ winget install llama.cpp
Or build from source - see llama.cpp docs
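To confirm the install worked, a quick sanity check (assuming `llama-server` ended up on your `PATH`):

```bash
# Should print the llama-server usage text
llama-server --help
```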
**For MLX backend (macOS only):**
MLX provides optimized inference on Apple Silicon. Install MLX-LM:
```bash
# Install via pip (requires Python 3.8+)
pip install mlx-lm
# Or in a virtual environment (recommended)
python -m venv mlx-env
source mlx-env/bin/activate
pip install mlx-lm
```
Note: The MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc.).
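To verify the MLX install, you can check that the server entry point used by llamactl resolves (run this inside the virtual environment if you created one):

```bash
# Should print the mlx_lm.server argument list
mlx_lm.server --help
```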
## Installation Methods
### Option 1: Download Binary (Recommended)

View File

@@ -1,22 +1,23 @@
# Llamactl Documentation
Welcome to the Llamactl documentation! **Management server and proxy for multiple llama.cpp and MLX instances with OpenAI-compatible API routing.**
![Dashboard Screenshot](images/dashboard.png)
## What is Llamactl?
Llamactl is designed to simplify the deployment and management of llama-server and MLX instances. It provides a modern solution for running multiple large language models with centralized management and multi-backend support.
## Features
🚀 **Multiple Model Serving**: Run different models simultaneously (7B for speed, 70B for quality)
🔗 **OpenAI API Compatible**: Drop-in replacement - route requests by model name
🍎 **Multi-Backend Support**: Native support for both llama.cpp and MLX (Apple Silicon optimized)
🌐 **Web Dashboard**: Modern React UI for visual management (unlike CLI-only tools)
🔐 **API Key Authentication**: Separate keys for management vs inference access
📊 **Instance Monitoring**: Health checks, auto-restart, log management
**Smart Resource Management**: Idle timeout, LRU eviction, and configurable instance limits
💡 **On-Demand Instance Start**: Automatically launch instances upon receiving OpenAI-compatible API requests
💾 **State Persistence**: Ensure instances remain intact across server restarts
## Quick Links

View File

@@ -1,6 +1,6 @@
# Managing Instances
Learn how to effectively manage your llama.cpp and MLX instances with Llamactl through both the Web UI and API.
## Overview
@@ -39,40 +39,55 @@ Each instance is displayed as a card showing:
1. Click the **"Create Instance"** button on the dashboard 1. Click the **"Create Instance"** button on the dashboard
2. Enter a unique **Name** for your instance (only required field) 2. Enter a unique **Name** for your instance (only required field)
3. Configure model source (choose one): 3. **Choose Backend Type**:
- **Model Path**: Full path to your downloaded GGUF model file - **llama.cpp**: For GGUF models using llama-server
- **HuggingFace Repo**: Repository name (e.g., `unsloth/gemma-3-27b-it-GGUF`) - **MLX**: For MLX-optimized models (macOS only)
- **HuggingFace File**: Specific file within the repo (optional, uses default if not specified) 4. Configure model source:
4. Configure optional instance management settings: - **For llama.cpp**: GGUF model path or HuggingFace repo
- **For MLX**: MLX model path or identifier (e.g., `mlx-community/Mistral-7B-Instruct-v0.3-4bit`)
5. Configure optional instance management settings:
- **Auto Restart**: Automatically restart instance on failure - **Auto Restart**: Automatically restart instance on failure
- **Max Restarts**: Maximum number of restart attempts - **Max Restarts**: Maximum number of restart attempts
- **Restart Delay**: Delay in seconds between restart attempts - **Restart Delay**: Delay in seconds between restart attempts
- **On Demand Start**: Start instance when receiving a request to the OpenAI compatible endpoint - **On Demand Start**: Start instance when receiving a request to the OpenAI compatible endpoint
- **Idle Timeout**: Minutes before stopping idle instance (set to 0 to disable) - **Idle Timeout**: Minutes before stopping idle instance (set to 0 to disable)
5. Configure optional llama-server backend options: 6. Configure backend-specific options:
- **Threads**: Number of CPU threads to use - **llama.cpp**: Threads, context size, GPU layers, port, etc.
- **Context Size**: Context window size (ctx_size) - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
- **GPU Layers**: Number of layers to offload to GPU 7. Click **"Create"** to save the instance
- **Port**: Network port (auto-assigned by llamactl if not specified)
- **Additional Parameters**: Any other llama-server command line options (see [llama-server documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md))
6. Click **"Create"** to save the instance
### Via API
```bash
# Create llama.cpp instance with local model file
curl -X POST http://localhost:8080/api/instances/my-llama-instance \
-H "Content-Type: application/json" \
-d '{
"backend_type": "llama_cpp",
"backend_options": {
"model": "/path/to/model.gguf",
"threads": 8,
"ctx_size": 4096,
"gpu_layers": 32
}
}'

# Create MLX instance (macOS only)
curl -X POST http://localhost:8080/api/instances/my-mlx-instance \
-H "Content-Type: application/json" \
-d '{
"backend_type": "mlx_lm",
"backend_options": {
"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
"temp": 0.7,
"top_p": 0.9,
"max_tokens": 2048
},
"auto_restart": true,
"max_restarts": 3
}'
# Create llama.cpp instance with HuggingFace model
curl -X POST http://localhost:8080/api/instances/gemma-3-27b \
-H "Content-Type: application/json" \
-d '{
@@ -81,9 +96,7 @@ curl -X POST http://localhost:8080/api/instances/gemma-3-27b \
"hf_repo": "unsloth/gemma-3-27b-it-GGUF", "hf_repo": "unsloth/gemma-3-27b-it-GGUF",
"hf_file": "gemma-3-27b-it-GGUF.gguf", "hf_file": "gemma-3-27b-it-GGUF.gguf",
"gpu_layers": 32 "gpu_layers": 32
}
"auto_restart": true,
"max_restarts": 3
}'
```
@@ -166,14 +179,16 @@ curl -X DELETE http://localhost:8080/api/instances/{name}
## Instance Proxy
Llamactl proxies all requests to the underlying backend instances (llama-server or MLX).
```bash
# Get instance details
curl http://localhost:8080/api/instances/{name}/proxy/
```
Both backends provide OpenAI-compatible endpoints. Check the respective documentation:
- [llama-server docs](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
- [MLX-LM docs](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)
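Since both backends expose OpenAI-compatible endpoints, the same request shape works through the proxy regardless of backend. A sketch, assuming llamactl serves the standard `/v1/chat/completions` route, that an instance named `my-llama-instance` exists, and that no inference API key is required by your configuration:

```bash
# Hypothetical chat request routed to an instance by using its name as the model
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llama-instance",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```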
### Instance Health