Update documentation
README.md | 10
@@ -22,7 +22,8 @@
 
 ### ⚡ Smart Operations
 - **Instance Monitoring**: Health checks, auto-restart, log management
 - **Smart Resource Management**: Idle timeout, LRU eviction, and configurable instance limits
+- **Environment Variables**: Set custom environment variables per instance for advanced configuration
 
 
 
@@ -52,7 +53,8 @@ llamactl
 2. Click "Create Instance"
 3. Choose backend type (llama.cpp, MLX, or vLLM)
 4. Set model path and backend-specific options
-5. Start or stop the instance
+5. Configure environment variables if needed (optional)
+6. Start or stop the instance
 
 ### Or use the REST API:
 ```bash
@@ -66,10 +68,10 @@ curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
   -H "Authorization: Bearer your-key" \
   -d '{"backend_type": "mlx_lm", "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"}}'
 
-# Create vLLM instance
+# Create vLLM instance with environment variables
 curl -X POST localhost:8080/api/v1/instances/my-vllm-model \
   -H "Authorization: Bearer your-key" \
-  -d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}}'
+  -d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}, "environment": {"CUDA_VISIBLE_DEVICES": "0,1", "NCCL_DEBUG": "INFO"}}'
 
 # Use with OpenAI SDK
 curl -X POST localhost:8080/v1/chat/completions \
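The hunk above ends at the OpenAI-compatible request, which is truncated in this view. As a rough sketch only (not part of the commit), and assuming the instance name is passed as the `model` field with the standard chat completions body shape, the full call would look something like:

```bash
# Sketch: complete the OpenAI-compatible request shown above.
# Assumption: the instance name ("my-mlx-model") is used as the "model" value.
curl -X POST localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-mlx-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```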
@@ -116,7 +116,18 @@ Create and start a new instance.
 POST /api/v1/instances/{name}
 ```
 
-**Request Body:** JSON object with instance configuration. See [Managing Instances](managing-instances.md) for available configuration options.
+**Request Body:** JSON object with instance configuration. Common fields include:
+
+- `backend_type`: Backend type (`llama_cpp`, `mlx_lm`, or `vllm`)
+- `backend_options`: Backend-specific configuration
+- `auto_restart`: Enable automatic restart on failure
+- `max_restarts`: Maximum restart attempts
+- `restart_delay`: Delay between restarts in seconds
+- `on_demand_start`: Start instance when receiving requests
+- `idle_timeout`: Idle timeout in minutes
+- `environment`: Environment variables as key-value pairs
+
+See [Managing Instances](managing-instances.md) for complete configuration options.
 
 **Response:**
 ```json
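The field list added above maps directly onto the create request. A hedged sketch of a request exercising those fields (the instance name, model path, and concrete values are illustrative, not taken from the commit):

```bash
# Sketch: create an instance using the documented common fields.
# Field names follow the list above; the values are examples only.
curl -X POST localhost:8080/api/v1/instances/my-model \
  -H "Authorization: Bearer your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {"model": "/models/llama-2-7b.gguf"},
    "auto_restart": true,
    "max_restarts": 3,
    "restart_delay": 5,
    "on_demand_start": false,
    "idle_timeout": 30,
    "environment": {"OMP_NUM_THREADS": "8"}
  }'
```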
@@ -354,7 +365,15 @@ curl -X POST http://localhost:8080/api/v1/instances/my-model \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer your-api-key" \
   -d '{
-    "model": "/models/llama-2-7b.gguf"
+    "backend_type": "llama_cpp",
+    "backend_options": {
+      "model": "/models/llama-2-7b.gguf",
+      "gpu_layers": 32
+    },
+    "environment": {
+      "CUDA_VISIBLE_DEVICES": "0",
+      "OMP_NUM_THREADS": "8"
+    }
   }'
 
 # Check instance status
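The status-check command that follows the hunk is not shown in this view. If the instances API also accepts a plain GET on the same path (an assumption, not confirmed by this diff), the check would look roughly like:

```bash
# Assumption: GET on the instance path returns its current status and configuration.
curl http://localhost:8080/api/v1/instances/my-model \
  -H "Authorization: Bearer your-api-key"
```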
@@ -53,6 +53,7 @@ Each instance is displayed as a card showing:
    - **Restart Delay**: Delay in seconds between restart attempts
    - **On Demand Start**: Start instance when receiving a request to the OpenAI compatible endpoint
    - **Idle Timeout**: Minutes before stopping idle instance (set to 0 to disable)
+   - **Environment Variables**: Set custom environment variables for the instance process
 6. Configure backend-specific options:
    - **llama.cpp**: Threads, context size, GPU layers, port, etc.
    - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
@@ -101,7 +102,12 @@ curl -X POST http://localhost:8080/api/instances/my-vllm-instance \
       "gpu_memory_utilization": 0.9
     },
     "auto_restart": true,
-    "on_demand_start": true
+    "on_demand_start": true,
+    "environment": {
+      "CUDA_VISIBLE_DEVICES": "0,1",
+      "NCCL_DEBUG": "INFO",
+      "PYTHONPATH": "/custom/path"
+    }
   }'
 
 # Create llama.cpp instance with HuggingFace model
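The same `environment` block applies to the other backends. A hedged parallel for an MLX instance, reusing the model from the quick-start example; the variable names (`HF_HOME`, `TOKENIZERS_PARALLELISM`) are illustrative and not taken from the commit:

```bash
# Sketch: MLX instance with per-instance environment variables.
# HF_HOME / TOKENIZERS_PARALLELISM are example values only.
curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
  -H "Authorization: Bearer your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "mlx_lm",
    "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"},
    "environment": {"HF_HOME": "/custom/hf-cache", "TOKENIZERS_PARALLELISM": "false"}
  }'
```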