# Managing Instances

Learn how to effectively manage your llama.cpp, MLX, and vLLM instances with Llamactl through both the Web UI and API.

## Overview

Llamactl provides two ways to manage instances:

- **Web UI**: Accessible at `http://localhost:8080` with an intuitive dashboard
- **REST API**: Programmatic access for automation and integration
### Authentication

Llamactl uses a **Management API Key** to authenticate requests to the management API (creating, starting, stopping instances). All curl examples below use `<token>` as a placeholder; replace it with your actual Management API Key.

By default, authentication is required. If you don't configure a management API key in your configuration file, llamactl will auto-generate one and print it to the terminal on startup. See the [Configuration](configuration.md) guide for details.

For Web UI access:

1. Navigate to the web UI
2. Enter your Management API Key
3. The Bearer token is stored for the session
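
Every management request must carry this key in an `Authorization: Bearer` header. As a quick sanity check, you can call any management endpoint shown later in this guide, for example fetching logs for an existing instance (a minimal sketch; the instance name is a placeholder):

```bash
# Minimal authenticated management request.
# Assumes an instance named "my-llama-instance" already exists;
# replace <token> with your Management API Key.
curl http://localhost:8080/api/v1/instances/my-llama-instance/logs \
  -H "Authorization: Bearer <token>"
```

A missing or invalid key should be rejected with an authentication error.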
### Theme Support

- Switch between light and dark themes
- Setting is remembered across sessions

## Instance Cards

Each instance is displayed as a card showing:

- **Instance name**
- **Health status badge** (unknown, ready, error, failed)
- **Action buttons** (start, stop, edit, logs, delete)
## Create Instance

**Via Web UI**

1. Click the **"Create Instance"** button on the dashboard
2. *Optional*: Click **"Import"** to load a previously exported configuration

**Instance Settings:**

3. Enter a unique **Instance Name** (required)
4. **Select Node**: Choose which node to deploy the instance to
5. Configure **Auto Restart** settings:
    - Enable automatic restart on failure
    - Set max restarts and delay between attempts
6. Configure basic instance options:
    - **Idle Timeout**: Minutes before stopping an idle instance
    - **On Demand Start**: Start the instance only when needed

**Backend Configuration:**

7. **Select Backend Type**:
    - **Llama Server**: For GGUF models using llama-server
    - **MLX LM**: For MLX-optimized models (macOS only)
    - **vLLM**: For distributed serving and high-throughput inference
8. *Optional*: Click **"Parse Command"** to import settings from an existing backend command
9. Configure **Execution Context**:
    - **Enable Docker**: Run the backend in a Docker container
    - **Command Override**: Custom path to the backend executable
    - **Environment Variables**: Custom environment variables

!!! tip "Auto-Assignment"

    Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and generates API keys if authentication is enabled. You typically don't need to manually specify these values.

10. Configure **Basic Backend Options** (varies by backend):
    - **llama.cpp**: Model path, threads, context size, GPU layers, etc.
    - **MLX**: Model identifier, temperature, max tokens, etc.
    - **vLLM**: Model identifier, tensor parallel size, GPU memory utilization, etc.
11. *Optional*: Expand **Advanced Backend Options** for additional settings
12. *Optional*: Add **Extra Args** as key-value pairs for custom command-line arguments
13. Click **"Create"** to save the instance

**Via API**
```bash
# Create llama.cpp instance with local model file
curl -X POST http://localhost:8080/api/v1/instances/my-llama-instance \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {
      "model": "/path/to/model.gguf",
      "threads": 8,
      "ctx_size": 4096,
      "gpu_layers": 32,
      "flash_attn": "on"
    },
    "auto_restart": true,
    "max_restarts": 3,
    "docker_enabled": false,
    "command_override": "/opt/llama-server-dev",
    "nodes": ["main"]
  }'

# Create vLLM instance with environment variables
curl -X POST http://localhost:8080/api/v1/instances/my-vllm-instance \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "backend_type": "vllm",
    "backend_options": {
      "model": "microsoft/DialoGPT-medium",
      "tensor_parallel_size": 2,
      "gpu_memory_utilization": 0.9
    },
    "on_demand_start": true,
    "environment": {
      "CUDA_VISIBLE_DEVICES": "0,1"
    },
    "nodes": ["worker1", "worker2"]
  }'

# Create MLX instance (macOS only)
curl -X POST http://localhost:8080/api/v1/instances/my-mlx-instance \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "backend_type": "mlx_lm",
    "backend_options": {
      "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
      "temp": 0.7,
      "max_tokens": 2048
    },
    "nodes": ["main"]
  }'
```
## Start Instance

**Via Web UI**

1. Click the **"Start"** button on an instance card
2. Watch the status badge change to "Unknown" while the instance starts
3. Monitor progress in the logs
4. The status changes to "Ready" once the instance is up

**Via API**

```bash
curl -X POST http://localhost:8080/api/v1/instances/{name}/start \
  -H "Authorization: Bearer <token>"
```
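
If you script instance startup, you can poll the health endpoint described later in this guide until the backend responds (a rough sketch; tune the attempt count and delay to your model's load time):

```bash
# Start the instance, then poll its health endpoint until it responds.
# {name} is a placeholder; replace <token> with your Management API Key.
curl -X POST http://localhost:8080/api/v1/instances/{name}/start \
  -H "Authorization: Bearer <token>"

for i in $(seq 1 60); do
  if curl -sf http://localhost:8080/api/v1/instances/{name}/proxy/health \
       -H "Authorization: Bearer <token>" > /dev/null; then
    echo "Instance is ready"
    break
  fi
  sleep 2
done
```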
## Stop Instance

**Via Web UI**

1. Click the **"Stop"** button on an instance card
2. Instance gracefully shuts down

**Via API**

```bash
curl -X POST http://localhost:8080/api/v1/instances/{name}/stop \
  -H "Authorization: Bearer <token>"
```
## Edit Instance

**Via Web UI**

1. Click the **"Edit"** button on an instance card
2. Modify settings in the configuration dialog
3. Changes require an instance restart to take effect
4. Click **"Update & Restart"** to apply changes

**Via API**

Modify instance settings:

```bash
curl -X PUT http://localhost:8080/api/v1/instances/{name} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "backend_options": {
      "threads": 8,
      "ctx_size": 4096
    }
  }'
```

!!! note

    Configuration changes require restarting the instance to take effect.
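
Because updates only take effect after a restart, a scripted edit typically pairs the `PUT` with a stop/start cycle using the endpoints above (a sketch; `{name}` and the settings are placeholders):

```bash
# Apply new settings, then restart the instance so they take effect.
curl -X PUT http://localhost:8080/api/v1/instances/{name} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"backend_options": {"threads": 8}}'

curl -X POST http://localhost:8080/api/v1/instances/{name}/stop \
  -H "Authorization: Bearer <token>"

curl -X POST http://localhost:8080/api/v1/instances/{name}/start \
  -H "Authorization: Bearer <token>"
```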
## Export Instance

**Via Web UI**

1. Click the **"More actions"** button (three dots) on an instance card
2. Click **"Export"** to download the instance configuration as a JSON file
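
The exported file is the instance's JSON configuration, which can later be re-imported from the create dialog. As a rough illustration only (field names follow the create examples above; your export may contain additional fields):

```json
{
  "backend_type": "llama_cpp",
  "backend_options": {
    "model": "/path/to/model.gguf",
    "threads": 8,
    "ctx_size": 4096
  },
  "auto_restart": true,
  "max_restarts": 3,
  "nodes": ["main"]
}
```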
## View Logs

**Via Web UI**

1. Click the **"Logs"** button on any instance card
2. A real-time log viewer opens

**Via API**

Retrieve the instance logs:

```bash
# Get instance logs
curl http://localhost:8080/api/v1/instances/{name}/logs \
  -H "Authorization: Bearer <token>"
```
## Delete Instance

**Via Web UI**

1. Click the **"Delete"** button on an instance card (only stopped instances can be deleted)
2. Confirm deletion in the dialog

**Via API**

```bash
curl -X DELETE http://localhost:8080/api/v1/instances/{name} \
  -H "Authorization: Bearer <token>"
```
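
Since only stopped instances can be deleted, scripted cleanup usually stops the instance first (a sketch using the endpoints above; `{name}` is a placeholder):

```bash
# Stop the instance, then delete it.
curl -X POST http://localhost:8080/api/v1/instances/{name}/stop \
  -H "Authorization: Bearer <token>"

curl -X DELETE http://localhost:8080/api/v1/instances/{name} \
  -H "Authorization: Bearer <token>"
```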
## Multi-Model llama.cpp Instances

!!! info "llama.cpp Router Mode"

    llama.cpp instances support [**router mode**](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp), allowing a single instance to serve multiple models dynamically. Models are loaded on demand from the llama.cpp cache without restarting the instance.

### Creating a Multi-Model Instance

**Via Web UI**

1. Click **"Create Instance"**
2. Select **Backend Type**: "Llama Server"
3. Leave **Backend Options** empty (`{}`) or omit the model field
4. Create the instance

**Via API**

```bash
# Create instance without specifying a model (router mode)
curl -X POST http://localhost:8080/api/v1/instances/my-router \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {},
    "nodes": ["main"]
  }'
```
### Managing Models

**Via Web UI**

1. Start the router mode instance
2. The instance card displays a badge showing loaded/total models (e.g., "2/5 models")
3. Click the **"Models"** button on the instance card
4. The models dialog opens, showing:
    - All available models from the llama.cpp instance
    - A status indicator (loaded, loading, or unloaded)
    - Load/Unload buttons for each model
5. Click **"Load"** to load a model into memory
6. Click **"Unload"** to free up memory

**Via API**

```bash
# List available models
curl http://localhost:8080/api/v1/llama-cpp/my-router/models \
  -H "Authorization: Bearer <token>"

# Load a model
curl -X POST http://localhost:8080/api/v1/llama-cpp/my-router/models/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf/load \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"model": "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"}'

# Unload a model
curl -X POST http://localhost:8080/api/v1/llama-cpp/my-router/models/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf/unload \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"model": "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"}'
```
### Using Multi-Model Instances

When making inference requests to a multi-model instance, specify the model using the format `instance_name/model_name`:

```bash
# OpenAI-compatible chat completion with specific model
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <inference-key>" \
  -d '{
    "model": "my-router/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

# List all available models (includes multi-model instances)
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer <inference-key>"
```

The response from `/v1/models` will include each model from multi-model instances as separate entries in the format `instance_name/model_name`.
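
For example, a router instance named `my-router` serving two cached models might appear in the listing roughly like this (an illustrative sketch of an OpenAI-style model list; the exact fields returned may differ):

```json
{
  "object": "list",
  "data": [
    {"id": "my-router/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf", "object": "model"},
    {"id": "my-router/another-model.gguf", "object": "model"}
  ]
}
```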
### Model Discovery

Models are automatically discovered from the llama.cpp cache directory. The default cache locations are:

- **Linux/macOS**: `~/.cache/llama.cpp/`
- **Windows**: `%LOCALAPPDATA%\llama.cpp\`

Place your GGUF model files in the cache directory, and they will appear in the models list when you start a router mode instance.
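
For example, on Linux or macOS you might stage a downloaded model like this (a minimal sketch; the filename is just an example):

```bash
# Copy a GGUF file into the default llama.cpp cache so router mode can discover it.
mkdir -p ~/.cache/llama.cpp
cp /path/to/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf ~/.cache/llama.cpp/
```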
## Instance Proxy

Llamactl proxies all requests to the underlying backend instances (llama-server, MLX, or vLLM).

```bash
# Proxy requests to the instance
curl http://localhost:8080/api/v1/instances/{name}/proxy/ \
  -H "Authorization: Bearer <token>"
```

All backends provide OpenAI-compatible endpoints. Check the respective documentation:

- [llama-server docs](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
- [MLX-LM docs](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)
- [vLLM docs](https://docs.vllm.ai/en/latest/)
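
For example, an OpenAI-style chat completion can be sent through the proxy path (a sketch; it assumes the backend's `/v1/chat/completions` route is reachable under the proxy prefix shown above, so verify the exact path for your backend and deployment):

```bash
# Hypothetical chat completion routed through the instance proxy.
# Assumes an instance named "my-llama-instance" and that the backend exposes
# /v1/chat/completions beneath the proxy prefix.
curl -X POST http://localhost:8080/api/v1/instances/my-llama-instance/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "my-llama-instance",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```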
### Instance Health

**Via Web UI**

1. The health status badge is displayed on each instance card

**Via API**

Check the health status of your instances:

```bash
curl http://localhost:8080/api/v1/instances/{name}/proxy/health \
  -H "Authorization: Bearer <token>"
```