# Managing Instances

Learn how to effectively manage your llama.cpp, MLX, and vLLM instances with Llamactl through both the Web UI and API.

## Overview

Llamactl provides two ways to manage instances:

- **Web UI**: Accessible at `http://localhost:8080` with an intuitive dashboard
- **REST API**: Programmatic access for automation and integration

![Dashboard Screenshot](images/dashboard.png)

### Authentication

Llamactl uses a **Management API Key** to authenticate requests to the management API (creating, starting, stopping instances). All curl examples below use `<token>` as a placeholder - replace it with your actual Management API Key.

By default, authentication is required. If you don't configure a management API key in your configuration file, llamactl auto-generates one and prints it to the terminal on startup. See the [Configuration](configuration.md) guide for details.

For Web UI access:

1. Navigate to the web UI
2. Enter your Management API Key
3. The Bearer token is stored for the session

### Theme Support

- Switch between light and dark themes
- The setting is remembered across sessions

## Instance Cards

Each instance is displayed as a card showing:

- **Instance name**
- **Health status badge** (unknown, ready, error, failed)
- **Action buttons** (start, stop, edit, logs, delete)

## Create Instance

**Via Web UI**

![Create Instance Screenshot](images/create_instance.png)

1. Click the **"Create Instance"** button on the dashboard
2. *Optional*: Click **"Import"** to load a previously exported configuration

**Instance Settings:**

3. Enter a unique **Instance Name** (required)
4. **Select Node**: Choose which node to deploy the instance to
5. Configure **Auto Restart** settings:
    - Enable automatic restart on failure
    - Set max restarts and delay between attempts
6. Configure basic instance options:
    - **Idle Timeout**: Minutes before stopping an idle instance
    - **On Demand Start**: Start the instance only when needed

**Backend Configuration:**

7. **Select Backend Type**:
    - **Llama Server**: For GGUF models using llama-server
    - **MLX LM**: For MLX-optimized models (macOS only)
    - **vLLM**: For distributed serving and high-throughput inference
8. *Optional*: Click **"Parse Command"** to import settings from an existing backend command (see the example after this list)
9. Configure **Execution Context**:
    - **Enable Docker**: Run the backend in a Docker container
    - **Command Override**: Custom path to the backend executable
    - **Environment Variables**: Custom environment variables

!!! tip "Auto-Assignment"
    Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and generates API keys if authentication is enabled. You typically don't need to specify these values manually.

10. Configure **Basic Backend Options** (varies by backend):
    - **llama.cpp**: Model path, threads, context size, GPU layers, etc.
    - **MLX**: Model identifier, temperature, max tokens, etc.
    - **vLLM**: Model identifier, tensor parallel size, GPU memory utilization, etc.
11. *Optional*: Expand **Advanced Backend Options** for additional settings
12. *Optional*: Add **Extra Args** as key-value pairs for custom command-line arguments
13. Click **"Create"** to save the instance
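The **Parse Command** option in step 8 takes the command line you would otherwise run by hand and fills in the matching backend options for you. A minimal sketch of a llama-server command you might paste (the model path and values are placeholders):

```bash
# Example command to paste into "Parse Command" for a llama.cpp backend
llama-server \
  --model /path/to/model.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  --threads 8
```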
Click **"Create"** to save the instance **Via API** ```bash # Create llama.cpp instance with local model file curl -X POST http://localhost:8080/api/v1/instances/my-llama-instance \ -H "Content-Type: application/json" \ -H "Authorization: Bearer " \ -d '{ "backend_type": "llama_cpp", "backend_options": { "model": "/path/to/model.gguf", "threads": 8, "ctx_size": 4096, "gpu_layers": 32, "flash_attn": "on" }, "auto_restart": true, "max_restarts": 3, "docker_enabled": false, "command_override": "/opt/llama-server-dev", "nodes": ["main"] }' # Create vLLM instance with environment variables curl -X POST http://localhost:8080/api/v1/instances/my-vllm-instance \ -H "Content-Type: application/json" \ -H "Authorization: Bearer " \ -d '{ "backend_type": "vllm", "backend_options": { "model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2, "gpu_memory_utilization": 0.9 }, "on_demand_start": true, "environment": { "CUDA_VISIBLE_DEVICES": "0,1" }, "nodes": ["worker1", "worker2"] }' # Create MLX instance (macOS only) curl -X POST http://localhost:8080/api/v1/instances/my-mlx-instance \ -H "Content-Type: application/json" \ -H "Authorization: Bearer " \ -d '{ "backend_type": "mlx_lm", "backend_options": { "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit", "temp": 0.7, "max_tokens": 2048 }, "nodes": ["main"] }' ``` ## Start Instance **Via Web UI** 1. Click the **"Start"** button on an instance card 2. Watch the status change to "Unknown" 3. Monitor progress in the logs 4. Instance status changes to "Ready" when ready **Via API** ```bash curl -X POST http://localhost:8080/api/v1/instances/{name}/start \ -H "Authorization: Bearer " ``` ## Stop Instance **Via Web UI** 1. Click the **"Stop"** button on an instance card 2. Instance gracefully shuts down **Via API** ```bash curl -X POST http://localhost:8080/api/v1/instances/{name}/stop \ -H "Authorization: Bearer " ``` ## Edit Instance **Via Web UI** 1. Click the **"Edit"** button on an instance card 2. Modify settings in the configuration dialog 3. Changes require instance restart to take effect 4. Click **"Update & Restart"** to apply changes **Via API** Modify instance settings: ```bash curl -X PUT http://localhost:8080/api/v1/instances/{name} \ -H "Content-Type: application/json" \ -H "Authorization: Bearer " \ -d '{ "backend_options": { "threads": 8, "context_size": 4096 } }' ``` !!! note Configuration changes require restarting the instance to take effect. ## Export Instance **Via Web UI** 1. Click the **"More actions"** button (three dots) on an instance card 2. Click **"Export"** to download the instance configuration as a JSON file ## View Logs **Via Web UI** 1. Click the **"Logs"** button on any instance card 2. Real-time log viewer opens **Via API** Check instance status in real-time: ```bash # Get instance logs curl http://localhost:8080/api/v1/instances/{name}/logs \ -H "Authorization: Bearer " ``` ## Delete Instance **Via Web UI** 1. Click the **"Delete"** button on an instance card 2. Only stopped instances can be deleted 3. Confirm deletion in the dialog **Via API** ```bash curl -X DELETE http://localhost:8080/api/v1/instances/{name} \ -H "Authorization: Bearer " ``` ## Multi-Model llama.cpp Instances !!! info "llama.cpp Router Mode" llama.cpp instances support [**router mode**](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp), allowing a single instance to serve multiple models dynamically. Models are loaded on-demand from the llama.cpp cache without restarting the instance. 
## View Logs

**Via Web UI**

1. Click the **"Logs"** button on any instance card
2. A real-time log viewer opens

**Via API**

Fetch the instance logs:

```bash
# Get instance logs
curl http://localhost:8080/api/v1/instances/{name}/logs \
  -H "Authorization: Bearer <token>"
```

## Delete Instance

**Via Web UI**

1. Click the **"Delete"** button on an instance card
2. Only stopped instances can be deleted
3. Confirm deletion in the dialog

**Via API**

```bash
curl -X DELETE http://localhost:8080/api/v1/instances/{name} \
  -H "Authorization: Bearer <token>"
```

## Multi-Model llama.cpp Instances

!!! info "llama.cpp Router Mode"
    llama.cpp instances support [**router mode**](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp), allowing a single instance to serve multiple models dynamically. Models are loaded on demand from the llama.cpp cache without restarting the instance.

### Creating a Multi-Model Instance

**Via Web UI**

1. Click **"Create Instance"**
2. Select **Backend Type**: "Llama Server"
3. Leave **Backend Options** empty (`{}`) or omit the model field
4. Create the instance

**Via API**

```bash
# Create instance without specifying a model (router mode)
curl -X POST http://localhost:8080/api/v1/instances/my-router \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {},
    "nodes": ["main"]
  }'
```

### Managing Models

**Via Web UI**

1. Start the router mode instance
2. The instance card displays a badge showing loaded/total models (e.g., "2/5 models")
3. Click the **"Models"** button on the instance card
4. The models dialog opens, showing:
    - All models available to the llama.cpp instance
    - A status indicator (loaded, loading, or unloaded)
    - Load/Unload buttons for each model
5. Click **"Load"** to load a model into memory
6. Click **"Unload"** to free up memory

**Via API**

```bash
# List available models
curl http://localhost:8080/api/v1/llama-cpp/my-router/models \
  -H "Authorization: Bearer <token>"

# Load a model
curl -X POST http://localhost:8080/api/v1/llama-cpp/my-router/models/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf/load \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"model": "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"}'

# Unload a model
curl -X POST http://localhost:8080/api/v1/llama-cpp/my-router/models/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf/unload \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"model": "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"}'
```

### Using Multi-Model Instances

When making inference requests to a multi-model instance, specify the model using the format `instance_name/model_name`:

```bash
# OpenAI-compatible chat completion with a specific model
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "my-router/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

# List all available models (includes multi-model instances)
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer <token>"
```

The response from `/v1/models` lists each model from a multi-model instance as a separate entry in the format `instance_name/model_name`.

### Model Discovery

Models are automatically discovered from the llama.cpp cache directory. The default cache locations are:

- **Linux/macOS**: `~/.cache/llama.cpp/`
- **Windows**: `%LOCALAPPDATA%\llama.cpp\`

Place your GGUF model files in the cache directory and they will appear in the models list when you start a router mode instance.

## Instance Proxy

Llamactl proxies all requests to the underlying backend instances (llama-server, MLX, or vLLM).

```bash
# Proxy requests to the instance
curl http://localhost:8080/api/v1/instances/{name}/proxy/ \
  -H "Authorization: Bearer <token>"
```

All backends provide OpenAI-compatible endpoints. Check the respective documentation:

- [llama-server docs](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
- [MLX-LM docs](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)
- [vLLM docs](https://docs.vllm.ai/en/latest/)

### Instance Health

**Via Web UI**

The health status badge is displayed on each instance card.

**Via API**

Check the health status of your instances:

```bash
curl http://localhost:8080/api/v1/instances/{name}/proxy/health \
  -H "Authorization: Bearer <token>"
```
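If you script against the API, the health endpoint above can be polled until the instance reports ready. A minimal sketch, assuming bash with curl and using `my-llama-instance` and `<token>` as placeholders:

```bash
#!/usr/bin/env bash
# Wait for an instance to become healthy, giving up after roughly 60 seconds.
INSTANCE="my-llama-instance"   # placeholder instance name
TOKEN="<token>"                # placeholder Management API Key

for _ in $(seq 1 30); do
  # -f makes curl fail on HTTP error responses, -s silences progress output
  if curl -sf "http://localhost:8080/api/v1/instances/${INSTANCE}/proxy/health" \
       -H "Authorization: Bearer ${TOKEN}" > /dev/null; then
    echo "Instance is ready"
    exit 0
  fi
  sleep 2
done

echo "Instance did not become ready in time" >&2
exit 1
```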