Add vLLM backend support to documentation and update instance management instructions

2025-12-23 17:44:24 +00:00 · 2025-09-21 21:57:36 +02:00
parent 6ff9aa5470
commit 55765d2020
5 changed files with 107 additions and 16 deletions
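
Reviewer note: the quick-start changes in this commit move model settings under `backend_type` and `backend_options` and put the instance name in the URL path. A minimal Python sketch of that request shape follows. The helper `build_create_request` is hypothetical, and using the same endpoint for a `vllm` instance is an assumption extrapolated from the llama.cpp curl example in the diff, not something the commit confirms.

```python
import json
import urllib.request
from urllib.parse import quote

# Host and port as used in the quick-start curl examples.
BASE_URL = "http://localhost:8080"


def build_create_request(name, backend_type, backend_options):
    """Build (but do not send) a POST request that creates an instance.

    Hypothetical helper: the instance name goes in the URL path and the
    backend settings go in the JSON body, mirroring the curl example.
    """
    body = json.dumps({
        "backend_type": backend_type,
        "backend_options": backend_options,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/instances/{quote(name, safe='')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumption: a vLLM instance is created through the same endpoint.
    req = build_create_request(
        "dialogpt-vllm",
        "vllm",
        {"model": "microsoft/DialoGPT-medium", "gpu_memory_utilization": 0.9},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it against a running server: urllib.request.urlopen(req)
```

Building the request without sending it makes it easy to inspect the URL and body before pointing it at a running server.
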
--- a/docs/getting-started/configuration.md
+++ b/docs/getting-started/configuration.md
@@ -22,6 +22,7 @@ server:
 backends:
  llama_executable: llama-server # Path to llama-server executable
  mlx_lm_executable: mlx_lm.server # Path to mlx_lm.server executable
+  vllm_executable: vllm # Path to vllm executable

 instances:
  port_range: [8000, 9000]       # Port range for instances
@@ -94,11 +95,13 @@ server:
 backends:
  llama_executable: "llama-server"     # Path to llama-server executable (default: "llama-server")
  mlx_lm_executable: "mlx_lm.server"   # Path to mlx_lm.server executable (default: "mlx_lm.server")
+  vllm_executable: "vllm"              # Path to vllm executable (default: "vllm")
 ```

 **Environment Variables:**
 - `LLAMACTL_LLAMA_EXECUTABLE` - Path to llama-server executable
 - `LLAMACTL_MLX_LM_EXECUTABLE` - Path to mlx_lm.server executable
+- `LLAMACTL_VLLM_EXECUTABLE` - Path to vllm executable

 ### Instance Configuration

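Reviewer note: the configuration changes above list a default and an environment variable for each backend executable. The sketch below shows one way those values could be combined. The helper `resolve_executables` is hypothetical, and the precedence it applies (environment variable, then config file, then default) is an assumption that the documentation in this diff does not state.

```python
import os

# Variable names and defaults as listed in the configuration docs above.
BACKEND_EXECUTABLES = {
    "llama_executable": ("LLAMACTL_LLAMA_EXECUTABLE", "llama-server"),
    "mlx_lm_executable": ("LLAMACTL_MLX_LM_EXECUTABLE", "mlx_lm.server"),
    "vllm_executable": ("LLAMACTL_VLLM_EXECUTABLE", "vllm"),
}


def resolve_executables(file_config=None, env=None):
    """Hypothetical helper: merge defaults, config-file values, and environment.

    Assumed precedence (not confirmed by the docs): environment variable,
    then config file value, then documented default.
    """
    file_config = file_config or {}
    env = os.environ if env is None else env
    resolved = {}
    for key, (env_name, default) in BACKEND_EXECUTABLES.items():
        resolved[key] = env.get(env_name) or file_config.get(key) or default
    return resolved


if __name__ == "__main__":
    for key, path in resolve_executables().items():
        print(f"{key}: {path}")
```
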
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -37,6 +37,22 @@ pip install mlx-lm

 Note: MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc.)

+**For vLLM backend:**
+
+vLLM is a high-throughput serving engine for LLMs that also supports distributed inference. Install it with pip:
+
+```bash
+# Install via pip (requires Python 3.8+ and a GPU)
+pip install vllm
+
+# Or in a virtual environment (recommended)
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install vllm
+
+# For production deployments, consider container-based installation
+```
+
 ## Installation Methods

 ### Option 1: Download Binary (Recommended)
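
Reviewer note: after installing a backend as described above, it can help to confirm that its executable is actually discoverable. A minimal sketch, assuming the default executable names from the configuration docs; the helper `find_backends` is hypothetical and is not part of Llamactl.

```python
import shutil


def find_backends(executables=("llama-server", "mlx_lm.server", "vllm")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in executables}


if __name__ == "__main__":
    for name, path in find_backends().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If vLLM was installed into a virtual environment, `vllm` is only on `PATH` while that environment is active, so run the check from the same shell session.
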
--- a/docs/getting-started/quick-start.md
+++ b/docs/getting-started/quick-start.md
@@ -29,8 +29,9 @@ You should see the Llamactl web interface.
 1. Click the "Add Instance" button
 2. Fill in the instance configuration:
   - **Name**: Give your instance a descriptive name
-   - **Model Path**: Path to your Llama.cpp model file
-   - **Additional Options**: Any extra Llama.cpp parameters
+   - **Backend Type**: Choose from llama.cpp, MLX, or vLLM
+   - **Model**: Model path or identifier for your chosen backend
+   - **Additional Options**: Backend-specific parameters

 3. Click "Create Instance"

@@ -43,17 +44,46 @@ Once created, you can:
 - **View logs** by clicking the logs button
 - **Stop** the instance when needed

-## Example Configuration
+## Example Configurations

-Here's a basic example configuration for a Llama 2 model:
+Here are basic example configurations for each backend:

+**llama.cpp backend:**
 ```json
 {
  "name": "llama2-7b",
-  "model_path": "/path/to/llama-2-7b-chat.gguf",
-  "options": {
+  "backend_type": "llama_cpp",
+  "backend_options": {
+    "model": "/path/to/llama-2-7b-chat.gguf",
    "threads": 4,
-    "context_size": 2048
+    "ctx_size": 2048,
+    "gpu_layers": 32
+  }
+}
+```
+
+**MLX backend (macOS only):**
+```json
+{
+  "name": "mistral-mlx",
+  "backend_type": "mlx_lm",
+  "backend_options": {
+    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
+    "temp": 0.7,
+    "max_tokens": 2048
+  }
+}
+```
+
+**vLLM backend:**
+```json
+{
+  "name": "dialogpt-vllm",
+  "backend_type": "vllm",
+  "backend_options": {
+    "model": "microsoft/DialoGPT-medium",
+    "tensor_parallel_size": 2,
+    "gpu_memory_utilization": 0.9
  }
 }
 ```
@@ -66,12 +96,14 @@ You can also manage instances via the REST API:
 # List all instances
 curl http://localhost:8080/api/instances

-# Create a new instance
-curl -X POST http://localhost:8080/api/instances \
+# Create a new llama.cpp instance
+curl -X POST http://localhost:8080/api/instances/my-model \
  -H "Content-Type: application/json" \
  -d '{
-    "name": "my-model",
-    "model_path": "/path/to/model.gguf",
+    "backend_type": "llama_cpp",
+    "backend_options": {
+      "model": "/path/to/model.gguf"
+    }
  }'

 # Start an instance