Improve instance creation documentation with clearer settings and options
@@ -42,33 +42,41 @@ Each instance is displayed as a card showing:
 
 1. Click the **"Create Instance"** button on the dashboard
-2. *Optional*: Click **"Import"** in the dialog header to load a previously exported configuration
-2. Enter a unique **Name** for your instance (only required field)
-3. **Select Target Node**: Choose which node to deploy the instance to from the dropdown
-4. **Choose Backend Type**:
-   - **llama.cpp**: For GGUF models using llama-server
-   - **MLX**: For MLX-optimized models (macOS only)
+2. *Optional*: Click **"Import"** to load a previously exported configuration
+
+**Instance Settings:**
+
+3. Enter a unique **Instance Name** (required)
+4. **Select Node**: Choose which node to deploy the instance to
+5. Configure **Auto Restart** settings:
+   - Enable automatic restart on failure
+   - Set max restarts and delay between attempts
+6. Configure basic instance options:
+   - **Idle Timeout**: Minutes before stopping idle instance
+   - **On Demand Start**: Start instance only when needed
+
+**Backend Configuration:**
+
+7. **Select Backend Type**:
+   - **Llama Server**: For GGUF models using llama-server
+   - **MLX LM**: For MLX-optimized models (macOS only)
    - **vLLM**: For distributed serving and high-throughput inference
-5. Configure model source:
-   - **For llama.cpp**: GGUF model path or HuggingFace repo
-   - **For MLX**: MLX model path or identifier (e.g., `mlx-community/Mistral-7B-Instruct-v0.3-4bit`)
-   - **For vLLM**: HuggingFace model identifier (e.g., `microsoft/DialoGPT-medium`)
-6. Configure optional instance management settings:
-   - **Auto Restart**: Automatically restart instance on failure
-   - **Max Restarts**: Maximum number of restart attempts
-   - **Restart Delay**: Delay in seconds between restart attempts
-   - **On Demand Start**: Start instance when receiving a request to the OpenAI compatible endpoint
-   - **Idle Timeout**: Minutes before stopping idle instance (set to 0 to disable)
-   - **Environment Variables**: Set custom environment variables for the instance process
-7. Configure backend-specific options:
-   - **llama.cpp**: Threads, context size, GPU layers, port, etc.
-   - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
-   - **vLLM**: Tensor parallel size, GPU memory utilization, quantization, etc.
+8. *Optional*: Click **"Parse Command"** to import settings from an existing backend command
+9. Configure **Execution Context**:
+   - **Enable Docker**: Run backend in Docker container
+   - **Command Override**: Custom path to backend executable
+   - **Environment Variables**: Custom environment variables
 
 !!! tip "Auto-Assignment"
     Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and generates API keys if authentication is enabled. You typically don't need to manually specify these values.
 
-8. Click **"Create"** to save the instance
+10. Configure **Basic Backend Options** (varies by backend):
+    - **llama.cpp**: Model path, threads, context size, GPU layers, etc.
+    - **MLX**: Model identifier, temperature, max tokens, etc.
+    - **vLLM**: Model identifier, tensor parallel size, GPU memory utilization, etc.
+11. *Optional*: Expand **Advanced Backend Options** for additional settings
+12. *Optional*: Add **Extra Args** as key-value pairs for custom command-line arguments
+13. Click **"Create"** to save the instance
 
 **Via API**
 
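The **Parse Command** and **Extra Args** steps above map a backend command line onto the instance's backend options. As a rough sketch of that mapping (the flags shown are standard llama-server options, but the exact fields llamactl fills in are not confirmed by this page), pasting a command like the following into **"Parse Command"** would pre-populate options similar to the `backend_options` used in the API examples below:

```bash
# Illustrative llama-server command for the "Parse Command" dialog.
# Assumes common llama-server flags: -m/--model, -t/--threads, -c/--ctx-size, -ngl/--gpu-layers.
llama-server -m /path/to/model.gguf -t 8 -c 4096 -ngl 32

# Roughly equivalent backend options (field names as used in the API examples below):
#   {"model": "/path/to/model.gguf", "threads": 8, "ctx_size": 4096, "gpu_layers": 32}
```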
@@ -83,11 +91,34 @@ curl -X POST http://localhost:8080/api/v1/instances/my-llama-instance \
       "model": "/path/to/model.gguf",
       "threads": 8,
       "ctx_size": 4096,
-      "gpu_layers": 32
+      "gpu_layers": 32,
+      "flash_attn": "on"
     },
+    "auto_restart": true,
+    "max_restarts": 3,
+    "docker_enabled": false,
+    "command_override": "/opt/llama-server-dev",
     "nodes": ["main"]
   }'
 
+# Create vLLM instance with environment variables
+curl -X POST http://localhost:8080/api/v1/instances/my-vllm-instance \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer <token>" \
+  -d '{
+    "backend_type": "vllm",
+    "backend_options": {
+      "model": "microsoft/DialoGPT-medium",
+      "tensor_parallel_size": 2,
+      "gpu_memory_utilization": 0.9
+    },
+    "on_demand_start": true,
+    "environment": {
+      "CUDA_VISIBLE_DEVICES": "0,1"
+    },
+    "nodes": ["worker1", "worker2"]
+  }'
+
 # Create MLX instance (macOS only)
 curl -X POST http://localhost:8080/api/v1/instances/my-mlx-instance \
   -H "Content-Type: application/json" \
@@ -97,74 +128,10 @@ curl -X POST http://localhost:8080/api/v1/instances/my-mlx-instance \
     "backend_options": {
       "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
       "temp": 0.7,
-      "top_p": 0.9,
       "max_tokens": 2048
     },
-    "auto_restart": true,
-    "max_restarts": 3,
     "nodes": ["main"]
   }'
-
-# Create vLLM instance
-curl -X POST http://localhost:8080/api/v1/instances/my-vllm-instance \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "vllm",
-    "backend_options": {
-      "model": "microsoft/DialoGPT-medium",
-      "tensor_parallel_size": 2,
-      "gpu_memory_utilization": 0.9
-    },
-    "auto_restart": true,
-    "on_demand_start": true,
-    "environment": {
-      "CUDA_VISIBLE_DEVICES": "0,1",
-      "NCCL_DEBUG": "INFO",
-      "PYTHONPATH": "/custom/path"
-    },
-    "nodes": ["main"]
-  }'
-
-# Create llama.cpp instance with HuggingFace model
-curl -X POST http://localhost:8080/api/v1/instances/gemma-3-27b \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "llama_cpp",
-    "backend_options": {
-      "hf_repo": "unsloth/gemma-3-27b-it-GGUF",
-      "hf_file": "gemma-3-27b-it-GGUF.gguf",
-      "gpu_layers": 32
-    },
-    "nodes": ["main"]
-  }'
-
-# Create instance on specific remote node
-curl -X POST http://localhost:8080/api/v1/instances/remote-llama \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "llama_cpp",
-    "backend_options": {
-      "model": "/models/llama-7b.gguf",
-      "gpu_layers": 32
-    },
-    "nodes": ["worker1"]
-  }'
-
-# Create instance on multiple nodes for high availability
-curl -X POST http://localhost:8080/api/v1/instances/multi-node-llama \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer <token>" \
-  -d '{
-    "backend_type": "llama_cpp",
-    "backend_options": {
-      "model": "/models/llama-7b.gguf",
-      "gpu_layers": 32
-    },
-    "nodes": ["worker1", "worker2", "worker3"]
-  }'
 ```
 
 ## Start Instance
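For instances created with `on_demand_start` enabled (as in the vLLM example above), the instance is started automatically when a request arrives at the OpenAI-compatible endpoint. A minimal sketch of such a request, assuming llamactl proxies the usual `/v1/chat/completions` route and selects the instance by name in the `model` field (neither detail is confirmed by this page):

```bash
# Sketch only: the /v1/chat/completions path and model-name routing are assumptions.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "my-vllm-instance",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```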