Mirror of https://github.com/lordmathis/llamactl.git (synced 2025-11-05 16:44:22 +00:00)

Commit: Minor docs improvements
@@ -42,15 +42,10 @@ Note: MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc
 vLLM provides high-throughput distributed serving for LLMs. Install vLLM:

 ```bash
-# Install via pip (requires Python 3.8+, GPU required)
-pip install vllm
-
-# Or in a virtual environment (recommended)
+# Install in a virtual environment
 python -m venv vllm-env
 source vllm-env/bin/activate
 pip install vllm
-
-# For production deployments, consider container-based installation
 ```

 ## Installation Methods
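As a quick sanity check on the shortened install snippet, vLLM can be imported from the virtualenv it was installed into; this is an illustration rather than part of the docs change:

```bash
# Verify the vLLM install from the vllm-env virtualenv created above
source vllm-env/bin/activate
python -c "import vllm; print(vllm.__version__)"
```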
@@ -78,7 +78,8 @@ curl -X POST http://localhost:8080/api/instances/my-llama-instance \
       "threads": 8,
       "ctx_size": 4096,
       "gpu_layers": 32
-    }
+    },
+    "nodes": ["main"]
   }'

 # Create MLX instance (macOS only)
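To see where the new `"nodes": ["main"]` field placed the instance, a request along these lines should work, assuming the API also answers GET on the same instance path (an assumption, not shown in this diff):

```bash
# Assumed GET endpoint mirroring the POST path above; adjust to the actual API reference
curl http://localhost:8080/api/instances/my-llama-instance
```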
@@ -93,7 +94,8 @@ curl -X POST http://localhost:8080/api/instances/my-mlx-instance \
       "max_tokens": 2048
     },
     "auto_restart": true,
-    "max_restarts": 3
+    "max_restarts": 3,
+    "nodes": ["main"]
   }'

 # Create vLLM instance
@@ -112,7 +114,8 @@ curl -X POST http://localhost:8080/api/instances/my-vllm-instance \
       "CUDA_VISIBLE_DEVICES": "0,1",
       "NCCL_DEBUG": "INFO",
       "PYTHONPATH": "/custom/path"
-    }
+    },
+    "nodes": ["main"]
   }'

 # Create llama.cpp instance with HuggingFace model
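A brief aside on the environment block above: before pinning `CUDA_VISIBLE_DEVICES` to specific indices, it helps to list the GPUs the driver actually exposes:

```bash
# List visible GPUs and their indices (requires the NVIDIA driver tools)
nvidia-smi -L
```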
@@ -124,7 +127,8 @@ curl -X POST http://localhost:8080/api/instances/gemma-3-27b \
       "hf_repo": "unsloth/gemma-3-27b-it-GGUF",
       "hf_file": "gemma-3-27b-it-GGUF.gguf",
       "gpu_layers": 32
-    }
+    },
+    "nodes": ["main"]
   }'

 # Create instance on specific remote node
@@ -138,6 +142,18 @@ curl -X POST http://localhost:8080/api/instances/remote-llama \
     },
     "nodes": ["worker1"]
   }'
+
+# Create instance on multiple nodes for high availability
+curl -X POST http://localhost:8080/api/instances/multi-node-llama \
+  -H "Content-Type: application/json" \
+  -d '{
+    "backend_type": "llama_cpp",
+    "backend_options": {
+      "model": "/models/llama-7b.gguf",
+      "gpu_layers": 32
+    },
+    "nodes": ["worker1", "worker2", "worker3"]
+  }'
 ```

 ## Start Instance
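Once the multi-node instance is created and started, it could be exercised through an OpenAI-compatible chat completion that selects the instance by model name; the route and routing behavior are assumptions here, so defer to the project's API reference:

```bash
# Hypothetical request; assumes an OpenAI-compatible route that picks the instance by "model"
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "multi-node-llama",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```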
@@ -29,13 +29,17 @@ You should see the Llamactl web interface.
 1. Click the "Add Instance" button
 2. Fill in the instance configuration:
    - **Name**: Give your instance a descriptive name
+   - **Node**: Select which node to deploy the instance to (defaults to "main" for single-node setups)
    - **Backend Type**: Choose from llama.cpp, MLX, or vLLM
    - **Model**: Model path or huggingface repo
    - **Additional Options**: Backend-specific parameters

 !!! tip "Auto-Assignment"
     Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and generates API keys if authentication is enabled. You typically don't need to manually specify these values.

+!!! note "Remote Node Deployment"
+    If you have configured remote nodes in your configuration file, you can select which node to deploy the instance to. This allows you to distribute instances across multiple machines. See the [Configuration](configuration.md#remote-node-configuration) guide for details on setting up remote nodes.
+
 3. Click "Create Instance"

 ## Start Your Instance
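Tying the Auto-Assignment tip back to the API examples: if authentication is enabled, the generated key has to accompany API calls. The bearer-token header and listing endpoint below are assumed for illustration; the authentication docs define the exact scheme:

```bash
# Assumed bearer-token scheme for a generated API key; endpoint also assumed
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:8080/api/instances
```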
@@ -61,7 +65,8 @@ Here are basic example configurations for each backend:
     "threads": 4,
     "ctx_size": 2048,
     "gpu_layers": 32
-  }
+  },
+  "nodes": ["main"]
 }
 ```

@@ -74,7 +79,8 @@ Here are basic example configurations for each backend:
     "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
     "temp": 0.7,
     "max_tokens": 2048
-  }
+  },
+  "nodes": ["main"]
 }
 ```

@@ -87,7 +93,21 @@ Here are basic example configurations for each backend:
     "model": "microsoft/DialoGPT-medium",
     "tensor_parallel_size": 2,
     "gpu_memory_utilization": 0.9
-  }
+  },
+  "nodes": ["main"]
+}
+```
+
+**Multi-node deployment example:**
+```json
+{
+  "name": "distributed-model",
+  "backend_type": "llama_cpp",
+  "backend_options": {
+    "model": "/path/to/model.gguf",
+    "gpu_layers": 32
+  },
+  "nodes": ["worker1", "worker2"]
 }
 ```

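To connect the multi-node JSON example with the curl workflow from the earlier hunks, the same configuration could be submitted to the create endpoint. This is a sketch that reuses only fields and paths appearing in this diff, with the instance name moved into the URL as in the other curl examples:

```bash
# Create the "distributed-model" instance using the multi-node configuration above
curl -X POST http://localhost:8080/api/instances/distributed-model \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {
      "model": "/path/to/model.gguf",
      "gpu_layers": 32
    },
    "nodes": ["worker1", "worker2"]
  }'
```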