Add vLLM backend support to documentation and update instance management instructions

2025-09-21 21:57:36 +02:00
parent 6ff9aa5470
commit 55765d2020
5 changed files with 107 additions and 16 deletions

View File

@@ -13,7 +13,7 @@
 ### 🔗 Universal Compatibility

 - **OpenAI API Compatible**: Drop-in replacement - route requests by model name
-- **Multi-Backend Support**: Native support for both llama.cpp and MLX (Apple Silicon optimized)
+- **Multi-Backend Support**: Native support for llama.cpp, MLX (Apple Silicon optimized), and vLLM

 ### 🌐 User-Friendly Interface
 - **Web Dashboard**: Modern React UI for visual management (unlike CLI-only tools)
@@ -31,6 +31,7 @@
 # 1. Install backend (one-time setup)
 # For llama.cpp: https://github.com/ggml-org/llama.cpp#quick-start
 # For MLX on macOS: pip install mlx-lm
+# For vLLM: pip install vllm

 # 2. Download and run llamactl
 LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
@@ -47,7 +48,7 @@ llamactl
 ### Create and manage instances via web dashboard:

 1. Open http://localhost:8080
 2. Click "Create Instance"
-3. Choose backend type (llama.cpp or MLX)
+3. Choose backend type (llama.cpp, MLX, or vLLM)
 4. Set model path and backend-specific options
 5. Start or stop the instance
@@ -63,6 +64,11 @@ curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
   -H "Authorization: Bearer your-key" \
   -d '{"backend_type": "mlx_lm", "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"}}'

+# Create vLLM instance
+curl -X POST localhost:8080/api/v1/instances/my-vllm-model \
+  -H "Authorization: Bearer your-key" \
+  -d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}}'
+
 # Use with OpenAI SDK
 curl -X POST localhost:8080/v1/chat/completions \
   -H "Authorization: Bearer your-key" \
@@ -121,6 +127,21 @@ source mlx-env/bin/activate
 pip install mlx-lm
 ```

+**For vLLM backend:**
+
+You need vLLM installed:
+
+```bash
+# Install via pip (requires Python 3.8+, GPU required)
+pip install vllm
+
+# Or in a virtual environment (recommended)
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install vllm
+
+# For production deployments, consider container-based installation
+```
+
 ## Configuration

 llamactl works out of the box with sensible defaults.
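
As a quick sanity check after the install step in this hunk (not part of the original docs), you can confirm that the `vllm` entry point llamactl will invoke is importable and resolvable from the environment llamactl runs in; the check below is a hedged sketch.

```bash
# Assumed sanity check: verify the vllm package imports and that the CLI
# entry point referenced by the vllm_executable setting is on PATH.
python -c "import vllm; print(vllm.__version__)"
command -v vllm || echo "vllm not on PATH - point vllm_executable at the right binary"
```
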
@@ -135,6 +156,7 @@ server:
backends: backends:
llama_executable: llama-server # Path to llama-server executable llama_executable: llama-server # Path to llama-server executable
mlx_lm_executable: mlx_lm.server # Path to mlx_lm.server executable mlx_lm_executable: mlx_lm.server # Path to mlx_lm.server executable
vllm_executable: vllm # Path to vllm executable
instances: instances:
port_range: [8000, 9000] # Port range for instances port_range: [8000, 9000] # Port range for instances

View File

@@ -22,6 +22,7 @@ server:
 backends:
   llama_executable: llama-server # Path to llama-server executable
   mlx_lm_executable: mlx_lm.server # Path to mlx_lm.server executable
+  vllm_executable: vllm # Path to vllm executable

 instances:
   port_range: [8000, 9000] # Port range for instances
@@ -94,11 +95,13 @@ server:
 backends:
   llama_executable: "llama-server" # Path to llama-server executable (default: "llama-server")
   mlx_lm_executable: "mlx_lm.server" # Path to mlx_lm.server executable (default: "mlx_lm.server")
+  vllm_executable: "vllm" # Path to vllm executable (default: "vllm")
 ```

 **Environment Variables:**

 - `LLAMACTL_LLAMA_EXECUTABLE` - Path to llama-server executable
 - `LLAMACTL_MLX_LM_EXECUTABLE` - Path to mlx_lm.server executable
+- `LLAMACTL_VLLM_EXECUTABLE` - Path to vllm executable

 ### Instance Configuration
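
For illustration (not from the diff), the environment variables listed above could be exported before starting llamactl, for example to point at a vLLM binary inside a dedicated virtual environment; the paths below are placeholders.

```bash
# Placeholder paths - adjust to wherever the backend executables actually live.
export LLAMACTL_LLAMA_EXECUTABLE=/usr/local/bin/llama-server
export LLAMACTL_MLX_LM_EXECUTABLE=mlx_lm.server
export LLAMACTL_VLLM_EXECUTABLE="$HOME/vllm-env/bin/vllm"
llamactl
```
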

View File

@@ -37,6 +37,22 @@ pip install mlx-lm
 Note: MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc.)

+**For vLLM backend:**
+
+vLLM provides high-throughput distributed serving for LLMs. Install vLLM:
+
+```bash
+# Install via pip (requires Python 3.8+, GPU required)
+pip install vllm
+
+# Or in a virtual environment (recommended)
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install vllm
+
+# For production deployments, consider container-based installation
+```
+
 ## Installation Methods

 ### Option 1: Download Binary (Recommended)
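
If vLLM lives in its own virtual environment, as recommended in the hunk above, llamactl has to launch that environment's `vllm` binary rather than a system-wide one. A hedged sketch, with an assumed venv location:

```bash
# Assumed layout: vLLM installed into ~/vllm-env as in the docs above.
python -m venv ~/vllm-env
~/vllm-env/bin/pip install vllm

# Point llamactl at that environment's vllm entry point, either via the
# vllm_executable config key or the documented environment variable.
export LLAMACTL_VLLM_EXECUTABLE="$HOME/vllm-env/bin/vllm"
```
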

View File

@@ -29,8 +29,9 @@ You should see the Llamactl web interface.
 1. Click the "Add Instance" button
 2. Fill in the instance configuration:
    - **Name**: Give your instance a descriptive name
-   - **Model Path**: Path to your Llama.cpp model file
-   - **Additional Options**: Any extra Llama.cpp parameters
+   - **Backend Type**: Choose from llama.cpp, MLX, or vLLM
+   - **Model**: Model path or identifier for your chosen backend
+   - **Additional Options**: Backend-specific parameters
 3. Click "Create Instance"
@@ -43,17 +44,46 @@ Once created, you can:
 - **View logs** by clicking the logs button
 - **Stop** the instance when needed

-## Example Configuration
+## Example Configurations

-Here's a basic example configuration for a Llama 2 model:
+Here are basic example configurations for each backend:

+**llama.cpp backend:**
 ```json
 {
   "name": "llama2-7b",
-  "model_path": "/path/to/llama-2-7b-chat.gguf",
-  "options": {
+  "backend_type": "llama_cpp",
+  "backend_options": {
+    "model": "/path/to/llama-2-7b-chat.gguf",
     "threads": 4,
-    "context_size": 2048
+    "ctx_size": 2048,
+    "gpu_layers": 32
+  }
+}
+```
+
+**MLX backend (macOS only):**
+```json
+{
+  "name": "mistral-mlx",
+  "backend_type": "mlx_lm",
+  "backend_options": {
+    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
+    "temp": 0.7,
+    "max_tokens": 2048
+  }
+}
+```
+
+**vLLM backend:**
+```json
+{
+  "name": "dialogpt-vllm",
+  "backend_type": "vllm",
+  "backend_options": {
+    "model": "microsoft/DialoGPT-medium",
+    "tensor_parallel_size": 2,
+    "gpu_memory_utilization": 0.9
   }
 }
 ```
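
Any of the JSON bodies above can be posted to the instances API shown in the next hunk. As a sketch, saving the vLLM configuration to a file and sending it might look like this; the file name is illustrative, and dropping the `"name"` field in favor of the name in the URL mirrors the API examples below.

```bash
# Illustrative only: config.json holds the "dialogpt-vllm" example above,
# with the instance name taken from the URL as in the API examples below.
curl -X POST http://localhost:8080/api/instances/dialogpt-vllm \
  -H "Content-Type: application/json" \
  -d @config.json
```
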
@@ -66,12 +96,14 @@ You can also manage instances via the REST API:
# List all instances # List all instances
curl http://localhost:8080/api/instances curl http://localhost:8080/api/instances
# Create a new instance # Create a new llama.cpp instance
curl -X POST http://localhost:8080/api/instances \ curl -X POST http://localhost:8080/api/instances/my-model \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"name": "my-model", "backend_type": "llama_cpp",
"model_path": "/path/to/model.gguf", "backend_options": {
"model": "/path/to/model.gguf"
}
}' }'
# Start an instance # Start an instance
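
A small convenience on top of the listing call above; `jq` is an assumption here, not a documented dependency.

```bash
# Pretty-print the instance list; requires jq to be installed.
curl -s http://localhost:8080/api/instances | jq .
```
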

View File

@@ -1,6 +1,6 @@
 # Managing Instances

-Learn how to effectively manage your llama.cpp and MLX instances with Llamactl through both the Web UI and API.
+Learn how to effectively manage your llama.cpp, MLX, and vLLM instances with Llamactl through both the Web UI and API.

 ## Overview
@@ -42,9 +42,11 @@ Each instance is displayed as a card showing:
 3. **Choose Backend Type**:
    - **llama.cpp**: For GGUF models using llama-server
    - **MLX**: For MLX-optimized models (macOS only)
+   - **vLLM**: For distributed serving and high-throughput inference
 4. Configure model source:
    - **For llama.cpp**: GGUF model path or HuggingFace repo
    - **For MLX**: MLX model path or identifier (e.g., `mlx-community/Mistral-7B-Instruct-v0.3-4bit`)
+   - **For vLLM**: HuggingFace model identifier (e.g., `microsoft/DialoGPT-medium`)
 5. Configure optional instance management settings:
    - **Auto Restart**: Automatically restart instance on failure
    - **Max Restarts**: Maximum number of restart attempts
@@ -54,6 +56,7 @@ Each instance is displayed as a card showing:
 6. Configure backend-specific options:
    - **llama.cpp**: Threads, context size, GPU layers, port, etc.
    - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
+   - **vLLM**: Tensor parallel size, GPU memory utilization, quantization, etc.
 7. Click **"Create"** to save the instance

 ### Via API
@@ -87,6 +90,20 @@ curl -X POST http://localhost:8080/api/instances/my-mlx-instance \
     "max_restarts": 3
   }'

+# Create vLLM instance
+curl -X POST http://localhost:8080/api/instances/my-vllm-instance \
+  -H "Content-Type: application/json" \
+  -d '{
+    "backend_type": "vllm",
+    "backend_options": {
+      "model": "microsoft/DialoGPT-medium",
+      "tensor_parallel_size": 2,
+      "gpu_memory_utilization": 0.9
+    },
+    "auto_restart": true,
+    "on_demand_start": true
+  }'
+
 # Create llama.cpp instance with HuggingFace model
 curl -X POST http://localhost:8080/api/instances/gemma-3-27b \
   -H "Content-Type: application/json" \
@@ -179,16 +196,17 @@ curl -X DELETE http://localhost:8080/api/instances/{name}
 ## Instance Proxy

-Llamactl proxies all requests to the underlying backend instances (llama-server or MLX).
+Llamactl proxies all requests to the underlying backend instances (llama-server, MLX, or vLLM).

 ```bash
 # Get instance details
 curl http://localhost:8080/api/instances/{name}/proxy/
 ```

-Both backends provide OpenAI-compatible endpoints. Check the respective documentation:
+All backends provide OpenAI-compatible endpoints. Check the respective documentation:

 - [llama-server docs](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
 - [MLX-LM docs](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)
+- [vLLM docs](https://docs.vllm.ai/en/latest/)
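
Since every backend behind the proxy speaks the OpenAI API, a request through the per-instance proxy could plausibly look like the sketch below; the exact sub-path under `/proxy/` is an assumption extrapolated from the curl example above, not something this diff documents.

```bash
# Assumed proxy sub-path: forwards to the backend's own OpenAI-compatible route.
curl -X POST http://localhost:8080/api/instances/my-vllm-instance/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/DialoGPT-medium", "messages": [{"role": "user", "content": "Hi"}]}'
```
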
 ### Instance Health