# Backends

LlamaCtl supports multiple backends for running large language models. This guide covers the available backends and their configuration.

## Llama.cpp Backend

The primary backend for LlamaCtl, providing robust support for GGUF models.

### Features

- **GGUF Support**: Native support for the GGUF model format
- **GPU Acceleration**: CUDA, OpenCL, and Metal support
- **Memory Optimization**: Efficient memory usage and mapping
- **Multi-threading**: Configurable CPU thread utilization
- **Quantization**: Support for various quantization levels

### Configuration

```yaml
backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
```

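Whatever path you configure, it's worth confirming it points at an executable before starting instances; a quick manual check:

```bash
# Confirm the configured binary exists and is executable
# (--version support varies by llama.cpp build)
test -x /usr/local/bin/llama-server && /usr/local/bin/llama-server --version
```
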
### Supported Options

| Option | Description | Default |
|--------|-------------|---------|
| `threads` | Number of CPU threads | 4 |
| `context_size` | Context window size (tokens) | 2048 |
| `batch_size` | Batch size for prompt processing | 512 |
| `gpu_layers` | Number of layers to offload to the GPU | 0 |
| `memory_lock` | Lock the model in memory (mlock) | false |
| `no_mmap` | Disable memory mapping | false |
| `rope_freq_base` | RoPE frequency base | 10000 |
| `rope_freq_scale` | RoPE frequency scale | 1.0 |

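These options correspond to `llama-server` command-line flags. How llamactl passes them through is internal to the backend, but a hand-run equivalent of the defaults above looks roughly like this (verify flag names against `llama-server --help` for your build):

```bash
# Hand-run equivalent of the default options above
llama-server \
  --model /models/model.gguf \
  --threads 4 \
  --ctx-size 2048 \
  --batch-size 512 \
  --n-gpu-layers 0
```
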
### GPU Acceleration

#### CUDA Setup

```bash
# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify CUDA installation
nvcc --version
nvidia-smi
```

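How many layers you can offload (the `layers`/`gpu_layers` setting) depends on available VRAM, so it helps to query it per GPU first:

```bash
# Total and free VRAM per GPU, to size gpu_layers against your model
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```
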
#### Configuration for GPU

```json
{
  "name": "gpu-accelerated",
  "model_path": "/models/llama-2-13b.gguf",
  "port": 8081,
  "options": {
    "gpu_layers": 35,
    "threads": 2,
    "context_size": 4096
  }
}
```

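Assuming the same `/api/instances/<name>` endpoint that the benchmark example later in this guide uses, a config like this can be applied with curl (save the JSON as `gpu-instance.json` first; the filename is just illustrative):

```bash
# Apply the instance config above; endpoint and method are assumed
# from the benchmark example elsewhere in this guide
curl -X PUT http://localhost:8080/api/instances/gpu-accelerated \
  -H "Content-Type: application/json" \
  -d @gpu-instance.json
```
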
### Performance Tuning

#### Memory Optimization

```yaml
# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
```

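Which profile fits depends on how much RAM is left for the model file and its context buffers; check before choosing:

```bash
# Available memory; the model plus context buffers must fit
free -h
```
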
#### CPU Optimization

```yaml
# Match thread count to CPU cores
# For an 8-core CPU:
options:
  threads: 6  # leave 2 cores for the system

# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
```

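A reasonable starting point is to derive the thread count from the machine's core count and tune from there; a small sketch:

```bash
# Leave two cores free for the system and llamactl itself
threads=$(( $(nproc) - 2 ))
echo "suggested threads: $threads"
```
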
## Future Backends

LlamaCtl is designed to support multiple backends. Planned additions:

### vLLM Backend

High-performance inference engine optimized for serving:

- **Features**: Fast inference, continuous batching, streaming
- **Models**: Supports various model formats
- **Scaling**: Horizontal scaling support

### TensorRT-LLM Backend

NVIDIA's optimized inference engine:

- **Features**: Maximum GPU performance
- **Models**: Optimized for NVIDIA GPUs
- **Deployment**: Production-ready inference

### Ollama Backend

Integration with Ollama for easy model management:

- **Features**: Simplified model downloading
- **Models**: Large model library
- **Integration**: Seamless model switching

## Backend Selection

### Automatic Detection

LlamaCtl can automatically detect the best backend:

```yaml
backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"
```

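Detection presumably probes for backend binaries on the PATH; you can run the equivalent check by hand (binary names for the still-planned backends are guesses, only `llama-server` applies today):

```bash
# Which backend binaries are installed?
for bin in llama-server vllm; do
  command -v "$bin" >/dev/null && echo "$bin: found" || echo "$bin: missing"
done
```
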
### Manual Selection

Force a specific backend for an instance:

```json
{
  "name": "manual-backend",
  "backend": "llamacpp",
  "model_path": "/models/model.gguf",
  "port": 8081
}
```

## Backend-Specific Features

### Llama.cpp Features

#### Model Formats

- **GGUF**: Primary format, best compatibility
- **GGML**: Legacy format (limited support)

#### Quantization Levels

- `Q2_K`: Smallest size, lowest quality
- `Q4_K_M`: Balanced size and quality
- `Q5_K_M`: Higher quality, larger size
- `Q6_K`: Near-original quality
- `Q8_0`: Minimal quality loss, largest size

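On-disk size is a decent first proxy for memory needs, so comparing quantizations of the same model makes the trade-off concrete (paths are illustrative):

```bash
# Compare on-disk sizes across quantizations of one model
ls -lh /models/llama-2-13b.Q*.gguf
```
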
#### Advanced Options

```yaml
advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true
```

## Monitoring Backend Performance

### Metrics Collection

Monitor backend-specific metrics:

```bash
# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats
```

**Response:**

```json
{
  "backend": "llamacpp",
  "version": "b1234",
  "metrics": {
    "tokens_per_second": 15.2,
    "memory_usage": 4294967296,
    "gpu_utilization": 85.5,
    "context_usage": 75.0
  }
}
```

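Note the units: `memory_usage` is in bytes (4294967296 bytes = 4 GiB), while `gpu_utilization` and `context_usage` read as percentages. To watch a single metric, pipe the endpoint through `jq`:

```bash
# Pull one metric out of the stats payload above
curl -s http://localhost:8080/api/instances/my-instance/backend/stats \
  | jq '.metrics.tokens_per_second'
```
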
### Performance Optimization

#### Benchmark Different Configurations

```bash
# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run performance test
done
```

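The loop leaves the actual performance test to you; one option is timing a short completion against the instance itself. This assumes the instance's llama-server is reachable directly on its configured port (8081 in the examples above) and exposes an OpenAI-compatible completions route, which may not match your deployment:

```bash
# Hypothetical timing probe; adjust the URL to however your
# deployment exposes the instance's inference endpoint
time curl -s http://localhost:8081/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 64}'
```
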
#### Memory Usage Optimization

```bash
# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
```

## Troubleshooting Backends

### Common Llama.cpp Issues

**Model won't load:**

```bash
# Check the model file type
file /path/to/model.gguf

# Verify the model loads (flag support varies by llama.cpp build)
llama-server --model /path/to/model.gguf --dry-run
```

**GPU not detected:**

```bash
# Check CUDA installation
nvidia-smi

# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
```

**Performance issues:**

```bash
# Check system resources
htop
nvidia-smi

# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config
```

## Custom Backend Development

### Backend Interface

Implement the backend interface for custom backends:

```go
type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}
```

### Registration

Register your custom backend:

```go
func init() {
    backends.Register("custom", &CustomBackend{})
}
```

## Best Practices

### Production Deployments

1. **Resource allocation**: Plan for peak usage
2. **Backend selection**: Choose based on requirements
3. **Monitoring**: Set up comprehensive monitoring
4. **Fallback**: Configure backup backends

### Development

1. **Rapid iteration**: Use smaller models
2. **Resource monitoring**: Track usage patterns
3. **Configuration testing**: Validate settings
4. **Performance profiling**: Optimize bottlenecks

## Next Steps

- Learn about [Monitoring](monitoring.md) backend performance and setting it up for production
- Explore the [Troubleshooting](troubleshooting.md) guides