Improve instance creation documentation with clearer settings and options
@@ -42,33 +42,41 @@ Each instance is displayed as a card showing:
 
 1. Click the **"Create Instance"** button on the dashboard
-2. *Optional*: Click **"Import"** in the dialog header to load a previously exported configuration
-2. Enter a unique **Name** for your instance (only required field)
-3. **Select Target Node**: Choose which node to deploy the instance to from the dropdown
-4. **Choose Backend Type**:
-   - **llama.cpp**: For GGUF models using llama-server
-   - **MLX**: For MLX-optimized models (macOS only)
+2. *Optional*: Click **"Import"** to load a previously exported configuration
+
+**Instance Settings:**
+
+3. Enter a unique **Instance Name** (required)
+4. **Select Node**: Choose which node to deploy the instance to
+5. Configure **Auto Restart** settings:
+   - Enable automatic restart on failure
+   - Set max restarts and delay between attempts
+6. Configure basic instance options:
+   - **Idle Timeout**: Minutes before stopping idle instance
+   - **On Demand Start**: Start instance only when needed
+
+**Backend Configuration:**
+
+7. **Select Backend Type**:
+   - **Llama Server**: For GGUF models using llama-server
+   - **MLX LM**: For MLX-optimized models (macOS only)
    - **vLLM**: For distributed serving and high-throughput inference
-5. Configure model source:
-   - **For llama.cpp**: GGUF model path or HuggingFace repo
-   - **For MLX**: MLX model path or identifier (e.g., `mlx-community/Mistral-7B-Instruct-v0.3-4bit`)
-   - **For vLLM**: HuggingFace model identifier (e.g., `microsoft/DialoGPT-medium`)
-6. Configure optional instance management settings:
-   - **Auto Restart**: Automatically restart instance on failure
-   - **Max Restarts**: Maximum number of restart attempts
-   - **Restart Delay**: Delay in seconds between restart attempts
-   - **On Demand Start**: Start instance when receiving a request to the OpenAI compatible endpoint
-   - **Idle Timeout**: Minutes before stopping idle instance (set to 0 to disable)
-   - **Environment Variables**: Set custom environment variables for the instance process
-7. Configure backend-specific options:
-   - **llama.cpp**: Threads, context size, GPU layers, port, etc.
-   - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
-   - **vLLM**: Tensor parallel size, GPU memory utilization, quantization, etc.
+8. *Optional*: Click **"Parse Command"** to import settings from an existing backend command
+9. Configure **Execution Context**:
+   - **Enable Docker**: Run backend in Docker container
+   - **Command Override**: Custom path to backend executable
+   - **Environment Variables**: Custom environment variables
 
 !!! tip "Auto-Assignment"
     Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and generates API keys if authentication is enabled. You typically don't need to manually specify these values.
 
-8. Click **"Create"** to save the instance
+10. Configure **Basic Backend Options** (varies by backend):
+    - **llama.cpp**: Model path, threads, context size, GPU layers, etc.
+    - **MLX**: Model identifier, temperature, max tokens, etc.
+    - **vLLM**: Model identifier, tensor parallel size, GPU memory utilization, etc.
+11. *Optional*: Expand **Advanced Backend Options** for additional settings
+12. *Optional*: Add **Extra Args** as key-value pairs for custom command-line arguments
+13. Click **"Create"** to save the instance
 
 **Via API**
 
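The **Parse Command** and **Extra Args** steps above map a backend command line onto the instance's backend options. As a rough sketch of that mapping (the flags shown are standard llama-server options, but the exact fields llamactl fills in are not confirmed by this page), pasting a command like the following into **"Parse Command"** would pre-populate options similar to the `backend_options` used in the API examples below:

```bash
# Illustrative llama-server command for the "Parse Command" dialog.
# Assumes common llama-server flags: -m/--model, -t/--threads, -c/--ctx-size, -ngl/--gpu-layers.
llama-server -m /path/to/model.gguf -t 8 -c 4096 -ngl 32

# Roughly equivalent backend options (field names as used in the API examples below):
#   {"model": "/path/to/model.gguf", "threads": 8, "ctx_size": 4096, "gpu_layers": 32}
```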
@@ -83,11 +91,34 @@ curl -X POST http://localhost:8080/api/v1/instances/my-llama-instance \
       "model": "/path/to/model.gguf",
       "threads": 8,
       "ctx_size": 4096,
-      "gpu_layers": 32
+      "gpu_layers": 32,
+      "flash_attn": "on"
     },
+    "auto_restart": true,
+    "max_restarts": 3,
+    "docker_enabled": false,
+    "command_override": "/opt/llama-server-dev",
     "nodes": ["main"]
   }'
 
+# Create vLLM instance with environment variables
+curl -X POST http://localhost:8080/api/v1/instances/my-vllm-instance \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer <token>" \
+  -d '{
+    "backend_type": "vllm",
+    "backend_options": {
+      "model": "microsoft/DialoGPT-medium",
+      "tensor_parallel_size": 2,
+      "gpu_memory_utilization": 0.9
+    },
+    "on_demand_start": true,
+    "environment": {
+      "CUDA_VISIBLE_DEVICES": "0,1"
+    },
+    "nodes": ["worker1", "worker2"]
+  }'
+
 # Create MLX instance (macOS only)
 curl -X POST http://localhost:8080/api/v1/instances/my-mlx-instance \
   -H "Content-Type: application/json" \
@@ -97,74 +128,10 @@ curl -X POST http://localhost:8080/api/v1/instances/my-mlx-instance \
     "backend_options": {
       "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
       "temp": 0.7,
-      "top_p": 0.9,
       "max_tokens": 2048
     },
-    "auto_restart": true,
-    "max_restarts": 3,
     "nodes": ["main"]
   }'
-
-# Create vLLM instance
-curl -X POST http://localhost:8080/api/v1/instances/my-vllm-instance \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "vllm",
-    "backend_options": {
-      "model": "microsoft/DialoGPT-medium",
-      "tensor_parallel_size": 2,
-      "gpu_memory_utilization": 0.9
-    },
-    "auto_restart": true,
-    "on_demand_start": true,
-    "environment": {
-      "CUDA_VISIBLE_DEVICES": "0,1",
-      "NCCL_DEBUG": "INFO",
-      "PYTHONPATH": "/custom/path"
-    },
-    "nodes": ["main"]
-  }'
-
-# Create llama.cpp instance with HuggingFace model
-curl -X POST http://localhost:8080/api/v1/instances/gemma-3-27b \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "llama_cpp",
-    "backend_options": {
-      "hf_repo": "unsloth/gemma-3-27b-it-GGUF",
-      "hf_file": "gemma-3-27b-it-GGUF.gguf",
-      "gpu_layers": 32
-    },
-    "nodes": ["main"]
-  }'
-
-# Create instance on specific remote node
-curl -X POST http://localhost:8080/api/v1/instances/remote-llama \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "llama_cpp",
-    "backend_options": {
-      "model": "/models/llama-7b.gguf",
-      "gpu_layers": 32
-    },
-    "nodes": ["worker1"]
-  }'
-
-# Create instance on multiple nodes for high availability
-curl -X POST http://localhost:8080/api/v1/instances/multi-node-llama \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "llama_cpp",
-    "backend_options": {
-      "model": "/models/llama-7b.gguf",
-      "gpu_layers": 32
-    },
-    "nodes": ["worker1", "worker2", "worker3"]
-  }'
 ```
 
 ## Start Instance
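For instances created with `on_demand_start` enabled (as in the vLLM example above), the instance is started automatically when a request arrives at the OpenAI-compatible endpoint. A minimal sketch of such a request, assuming llamactl proxies the usual `/v1/chat/completions` route and selects the instance by name in the `model` field (neither detail is confirmed by this page):

```bash
# Sketch only: the /v1/chat/completions path and model-name routing are assumptions.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "my-vllm-instance",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```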