Add vLLM backend support to documentation and update instance management instructions
README.md
@@ -13,7 +13,7 @@
 
 ### 🔗 Universal Compatibility
 - **OpenAI API Compatible**: Drop-in replacement - route requests by model name
-- **Multi-Backend Support**: Native support for both llama.cpp and MLX (Apple Silicon optimized)
+- **Multi-Backend Support**: Native support for llama.cpp, MLX (Apple Silicon optimized), and vLLM
 
 ### 🌐 User-Friendly Interface
 - **Web Dashboard**: Modern React UI for visual management (unlike CLI-only tools)
@@ -31,6 +31,7 @@
 # 1. Install backend (one-time setup)
 # For llama.cpp: https://github.com/ggml-org/llama.cpp#quick-start
 # For MLX on macOS: pip install mlx-lm
+# For vLLM: pip install vllm
 
 # 2. Download and run llamactl
 LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
@@ -47,7 +48,7 @@ llamactl
 ### Create and manage instances via web dashboard:
 1. Open http://localhost:8080
 2. Click "Create Instance"
-3. Choose backend type (llama.cpp or MLX)
+3. Choose backend type (llama.cpp, MLX, or vLLM)
 4. Set model path and backend-specific options
 5. Start or stop the instance
 
@@ -63,6 +64,11 @@ curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
   -H "Authorization: Bearer your-key" \
   -d '{"backend_type": "mlx_lm", "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"}}'
 
+# Create vLLM instance
+curl -X POST localhost:8080/api/v1/instances/my-vllm-model \
+  -H "Authorization: Bearer your-key" \
+  -d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}}'
+
 # Use with OpenAI SDK
 curl -X POST localhost:8080/v1/chat/completions \
   -H "Authorization: Bearer your-key" \
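The "Use with OpenAI SDK" command above is truncated in this hunk; for reference, a minimal Python sketch of the same call is shown below. It assumes llamactl's OpenAI-compatible endpoint at `http://localhost:8080/v1`, the same placeholder API key as the curl examples, and that the instance name (here `my-vllm-model`) is passed as the `model` field, matching the route-by-model-name behavior described in the README.

```python
# Minimal sketch: chat completion through llamactl's OpenAI-compatible endpoint.
# Assumes the instance name ("my-vllm-model") is used as the model identifier.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llamactl's OpenAI-compatible base URL (assumed)
    api_key="your-key",                   # same placeholder key as in the curl examples
)

response = client.chat.completions.create(
    model="my-vllm-model",  # instance name from the example above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```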
@@ -121,6 +127,21 @@ source mlx-env/bin/activate
 pip install mlx-lm
 ```
 
+**For vLLM backend:**
+You need vLLM installed:
+
+```bash
+# Install via pip (requires Python 3.8+, GPU required)
+pip install vllm
+
+# Or in a virtual environment (recommended)
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install vllm
+
+# For production deployments, consider container-based installation
+```
+
 ## Configuration
 
 llamactl works out of the box with sensible defaults.
@@ -135,6 +156,7 @@ server:
 backends:
   llama_executable: llama-server    # Path to llama-server executable
   mlx_lm_executable: mlx_lm.server  # Path to mlx_lm.server executable
+  vllm_executable: vllm             # Path to vllm executable
 
 instances:
   port_range: [8000, 9000]  # Port range for instances

@@ -22,6 +22,7 @@ server:
 backends:
   llama_executable: llama-server    # Path to llama-server executable
   mlx_lm_executable: mlx_lm.server  # Path to mlx_lm.server executable
+  vllm_executable: vllm             # Path to vllm executable
 
 instances:
   port_range: [8000, 9000]  # Port range for instances
@@ -94,11 +95,13 @@ server:
 backends:
   llama_executable: "llama-server"     # Path to llama-server executable (default: "llama-server")
   mlx_lm_executable: "mlx_lm.server"   # Path to mlx_lm.server executable (default: "mlx_lm.server")
+  vllm_executable: "vllm"              # Path to vllm executable (default: "vllm")
 ```
 
 **Environment Variables:**
 - `LLAMACTL_LLAMA_EXECUTABLE` - Path to llama-server executable
 - `LLAMACTL_MLX_LM_EXECUTABLE` - Path to mlx_lm.server executable
+- `LLAMACTL_VLLM_EXECUTABLE` - Path to vllm executable
 
 ### Instance Configuration
 

@@ -37,6 +37,22 @@ pip install mlx-lm
 
 Note: MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc.)
 
+**For vLLM backend:**
+
+vLLM provides high-throughput distributed serving for LLMs. Install vLLM:
+
+```bash
+# Install via pip (requires Python 3.8+, GPU required)
+pip install vllm
+
+# Or in a virtual environment (recommended)
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install vllm
+
+# For production deployments, consider container-based installation
+```
+
 ## Installation Methods
 
 ### Option 1: Download Binary (Recommended)
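After installing vLLM as shown in this hunk, a quick sanity check can confirm the environment is usable. The snippet below is an illustrative addition, not part of the documented steps; it assumes PyTorch is present as a vLLM dependency.

```python
# Quick post-install sanity check (illustrative; not from the llamactl docs).
import torch  # installed as a vLLM dependency
import vllm

print("vLLM version:", vllm.__version__)              # confirms the package imports
print("CUDA available:", torch.cuda.is_available())   # vLLM serving expects a GPU
```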

@@ -29,8 +29,9 @@ You should see the Llamactl web interface.
 1. Click the "Add Instance" button
 2. Fill in the instance configuration:
    - **Name**: Give your instance a descriptive name
-   - **Model Path**: Path to your Llama.cpp model file
-   - **Additional Options**: Any extra Llama.cpp parameters
+   - **Backend Type**: Choose from llama.cpp, MLX, or vLLM
+   - **Model**: Model path or identifier for your chosen backend
+   - **Additional Options**: Backend-specific parameters
 
 3. Click "Create Instance"
 
@@ -43,17 +44,46 @@ Once created, you can:
 - **View logs** by clicking the logs button
 - **Stop** the instance when needed
 
-## Example Configuration
+## Example Configurations
 
-Here's a basic example configuration for a Llama 2 model:
+Here are basic example configurations for each backend:
 
+**llama.cpp backend:**
 ```json
 {
   "name": "llama2-7b",
-  "model_path": "/path/to/llama-2-7b-chat.gguf",
-  "options": {
+  "backend_type": "llama_cpp",
+  "backend_options": {
+    "model": "/path/to/llama-2-7b-chat.gguf",
     "threads": 4,
-    "context_size": 2048
+    "ctx_size": 2048,
+    "gpu_layers": 32
+  }
+}
+```
+
+**MLX backend (macOS only):**
+```json
+{
+  "name": "mistral-mlx",
+  "backend_type": "mlx_lm",
+  "backend_options": {
+    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
+    "temp": 0.7,
+    "max_tokens": 2048
+  }
+}
+```
+
+**vLLM backend:**
+```json
+{
+  "name": "dialogpt-vllm",
+  "backend_type": "vllm",
+  "backend_options": {
+    "model": "microsoft/DialoGPT-medium",
+    "tensor_parallel_size": 2,
+    "gpu_memory_utilization": 0.9
   }
 }
 ```
@@ -66,12 +96,14 @@ You can also manage instances via the REST API:
 # List all instances
 curl http://localhost:8080/api/instances
 
-# Create a new instance
-curl -X POST http://localhost:8080/api/instances \
+# Create a new llama.cpp instance
+curl -X POST http://localhost:8080/api/instances/my-model \
   -H "Content-Type: application/json" \
   -d '{
-    "name": "my-model",
-    "model_path": "/path/to/model.gguf",
+    "backend_type": "llama_cpp",
+    "backend_options": {
+      "model": "/path/to/model.gguf"
+    }
   }'
 
 # Start an instance
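The create call in this hunk translates directly to any HTTP client. Below is a minimal Python sketch using the `requests` library against the same `/api/instances/my-model` endpoint and payload; the library choice, timeout, and error handling are illustrative additions.

```python
# Illustrative sketch of the "create instance" call above, using the requests library.
# Endpoint and payload mirror the curl example; adjust host/port and model path as needed.
import requests

payload = {
    "backend_type": "llama_cpp",
    "backend_options": {"model": "/path/to/model.gguf"},
}

resp = requests.post(
    "http://localhost:8080/api/instances/my-model",
    json=payload,            # sends Content-Type: application/json automatically
    timeout=30,
)
resp.raise_for_status()      # fail loudly if the API returns an error status
print(resp.json())           # created instance details, as returned by llamactl
```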

@@ -1,6 +1,6 @@
 # Managing Instances
 
-Learn how to effectively manage your llama.cpp and MLX instances with Llamactl through both the Web UI and API.
+Learn how to effectively manage your llama.cpp, MLX, and vLLM instances with Llamactl through both the Web UI and API.
 
 ## Overview
 
@@ -42,9 +42,11 @@ Each instance is displayed as a card showing:
 3. **Choose Backend Type**:
    - **llama.cpp**: For GGUF models using llama-server
    - **MLX**: For MLX-optimized models (macOS only)
+   - **vLLM**: For distributed serving and high-throughput inference
 4. Configure model source:
    - **For llama.cpp**: GGUF model path or HuggingFace repo
    - **For MLX**: MLX model path or identifier (e.g., `mlx-community/Mistral-7B-Instruct-v0.3-4bit`)
+   - **For vLLM**: HuggingFace model identifier (e.g., `microsoft/DialoGPT-medium`)
 5. Configure optional instance management settings:
    - **Auto Restart**: Automatically restart instance on failure
    - **Max Restarts**: Maximum number of restart attempts
@@ -54,6 +56,7 @@ Each instance is displayed as a card showing:
 6. Configure backend-specific options:
    - **llama.cpp**: Threads, context size, GPU layers, port, etc.
    - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
+   - **vLLM**: Tensor parallel size, GPU memory utilization, quantization, etc.
 7. Click **"Create"** to save the instance
 
 ### Via API
@@ -87,6 +90,20 @@ curl -X POST http://localhost:8080/api/instances/my-mlx-instance \
     "max_restarts": 3
   }'
 
+# Create vLLM instance
+curl -X POST http://localhost:8080/api/instances/my-vllm-instance \
+  -H "Content-Type: application/json" \
+  -d '{
+    "backend_type": "vllm",
+    "backend_options": {
+      "model": "microsoft/DialoGPT-medium",
+      "tensor_parallel_size": 2,
+      "gpu_memory_utilization": 0.9
+    },
+    "auto_restart": true,
+    "on_demand_start": true
+  }'
+
 # Create llama.cpp instance with HuggingFace model
 curl -X POST http://localhost:8080/api/instances/gemma-3-27b \
   -H "Content-Type: application/json" \
@@ -179,16 +196,17 @@ curl -X DELETE http://localhost:8080/api/instances/{name}
 
 ## Instance Proxy
 
-Llamactl proxies all requests to the underlying backend instances (llama-server or MLX).
+Llamactl proxies all requests to the underlying backend instances (llama-server, MLX, or vLLM).
 
 ```bash
 # Get instance details
 curl http://localhost:8080/api/instances/{name}/proxy/
 ```
 
-Both backends provide OpenAI-compatible endpoints. Check the respective documentation:
+All backends provide OpenAI-compatible endpoints. Check the respective documentation:
 - [llama-server docs](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
 - [MLX-LM docs](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)
+- [vLLM docs](https://docs.vllm.ai/en/latest/)
 
 ### Instance Health
 
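Because a backend can take a while to load its model after start, a simple readiness poll against the proxy endpoint shown in this hunk can be handy. The sketch below is illustrative only: the `my-vllm-instance` name is the hypothetical one from the earlier example, the retry count and delay are arbitrary, and the exact response body depends on the backend behind the proxy.

```python
# Illustrative readiness check: poll the proxy endpoint shown above until the
# underlying backend (llama-server, MLX, or vLLM) starts answering requests.
import time
import requests

PROXY_URL = "http://localhost:8080/api/instances/my-vllm-instance/proxy/"  # hypothetical instance name

for attempt in range(30):
    try:
        resp = requests.get(PROXY_URL, timeout=5)
        if resp.ok:
            print("Backend is responding:", resp.status_code)
            break
    except requests.RequestException:
        pass  # backend not up yet
    time.sleep(2)  # wait before retrying
else:
    print("Backend did not become ready in time")
```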