Create initial documentation structure

2025-08-31 14:27:00 +02:00
parent 7675271370
commit bd31c03f4a
16 changed files with 3514 additions and 0 deletions

docs/advanced/backends.md

@@ -0,0 +1,316 @@
# Backends
LlamaCtl supports multiple backends for running large language models. This guide covers the available backends and their configuration.
## Llama.cpp Backend
The primary backend for LlamaCtl, providing robust support for GGUF models.
### Features
- **GGUF Support**: Native support for GGUF model format
- **GPU Acceleration**: CUDA, OpenCL, and Metal support
- **Memory Optimization**: Efficient memory usage and mapping
- **Multi-threading**: Configurable CPU thread utilization
- **Quantization**: Support for various quantization levels
### Configuration
```yaml
backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
```
### Supported Options
| Option | Description | Default |
|--------|-------------|---------|
| `threads` | Number of CPU threads | 4 |
| `context_size` | Context window size | 2048 |
| `batch_size` | Batch size for processing | 512 |
| `gpu_layers` | Layers to offload to GPU | 0 |
| `memory_lock` | Lock model in memory | false |
| `no_mmap` | Disable memory mapping | false |
| `rope_freq_base` | RoPE frequency base | 10000 |
| `rope_freq_scale` | RoPE frequency scale | 1.0 |
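Several of these options often appear together. As a purely hypothetical example (the instance name, model path, port, and exact option spellings below are illustrative and may differ in your LlamaCtl version), a memory-locked instance with linear RoPE scaling could be defined as:

```json
{
  "name": "rope-scaled",
  "model_path": "/models/llama-2-7b.gguf",
  "port": 8082,
  "options": {
    "threads": 8,
    "context_size": 4096,
    "memory_lock": true,
    "rope_freq_scale": 0.5
  }
}
```

Setting `rope_freq_scale` below 1.0 is how llama.cpp's linear RoPE scaling stretches the effective context window, usually at some cost to output quality.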
### GPU Acceleration
#### CUDA Setup
```bash
# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit
# Verify CUDA installation
nvcc --version
nvidia-smi
```
#### Configuration for GPU
```json
{
"name": "gpu-accelerated",
"model_path": "/models/llama-2-13b.gguf",
"port": 8081,
"options": {
"gpu_layers": 35,
"threads": 2,
"context_size": 4096
}
}
```
### Performance Tuning
#### Memory Optimization
```yaml
# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
```
#### CPU Optimization
```yaml
# Match thread count to CPU cores
# For an 8-core CPU:
options:
  threads: 6  # Leave 2 cores for the system

# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
```
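To pick a starting thread count on a given machine, you can derive it from the available core count; a minimal sketch:

```bash
# Logical CPU cores available on this machine
nproc

# A common starting point: all cores minus a couple reserved for the system
echo "suggested threads: $(( $(nproc) - 2 ))"
```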
## Future Backends
LlamaCtl is designed to support multiple backends. Planned additions:
### vLLM Backend
High-performance inference engine optimized for serving:
- **Features**: Fast inference, batching, streaming
- **Models**: Supports various model formats
- **Scaling**: Horizontal scaling support
### TensorRT-LLM Backend
NVIDIA's optimized inference engine:
- **Features**: Maximum GPU performance
- **Models**: Optimized for NVIDIA GPUs
- **Deployment**: Production-ready inference
### Ollama Backend
Integration with Ollama for easy model management:
- **Features**: Simplified model downloading
- **Models**: Large model library
- **Integration**: Seamless model switching
## Backend Selection
### Automatic Detection
LlamaCtl can automatically detect the best backend:
```yaml
backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"
```
### Manual Selection
Force a specific backend for an instance:
```json
{
"name": "manual-backend",
"backend": "llamacpp",
"model_path": "/models/model.gguf",
"port": 8081
}
```
## Backend-Specific Features
### Llama.cpp Features
#### Model Formats
- **GGUF**: Primary format, best compatibility
- **GGML**: Legacy format (limited support)
#### Quantization Levels
- `Q2_K`: Smallest size, lower quality
- `Q4_K_M`: Balanced size and quality
- `Q5_K_M`: Higher quality, larger size
- `Q6_K`: Near-original quality
- `Q8_0`: Minimal loss, largest size
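The quantization level is a property of the GGUF file itself, so choosing one simply means pointing `model_path` at the corresponding file. A hypothetical example (the filename follows a common llama.cpp naming convention; your files may be named differently):

```json
{
  "name": "balanced-7b",
  "model_path": "/models/llama-2-7b.Q4_K_M.gguf",
  "port": 8083
}
```

Moving up to `Q6_K` or `Q8_0` means swapping in a larger file and budgeting more memory for the instance.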
#### Advanced Options
```yaml
advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true
```
## Monitoring Backend Performance
### Metrics Collection
Monitor backend-specific metrics:
```bash
# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats
```
**Response:**
```json
{
"backend": "llamacpp",
"version": "b1234",
"metrics": {
"tokens_per_second": 15.2,
"memory_usage": 4294967296,
"gpu_utilization": 85.5,
"context_usage": 75.0
}
}
```
### Performance Optimization
#### Benchmark Different Configurations
```bash
# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run performance test
done
```
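For the "performance test" step inside the loop, one lightweight option is to time a fixed-size completion directly against the instance's llama-server port (8081 in the earlier examples) and compare results across thread counts. A sketch using llama-server's native `/completion` endpoint (the exact response fields, such as `timings`, depend on your llama.cpp build):

```bash
# Time a fixed 128-token generation against the llama-server instance itself
time curl -s http://localhost:8081/completion \
  -d '{"prompt": "Benchmark prompt", "n_predict": 128}' \
  | jq '.timings'
```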
#### Memory Usage Optimization
```bash
# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
```
## Troubleshooting Backends
### Common Llama.cpp Issues
**Model won't load:**
```bash
# Check model file
file /path/to/model.gguf
# Verify format
llama-server --model /path/to/model.gguf --dry-run
```
**GPU not detected:**
```bash
# Check CUDA installation
nvidia-smi
# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
```
**Performance issues:**
```bash
# Check system resources
htop
nvidia-smi
# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config
```
## Custom Backend Development
### Backend Interface
Implement the backend interface for custom backends:
```go
// Backend abstracts a model-serving engine that LlamaCtl can manage.
type Backend interface {
    // Start launches a new instance from the given configuration.
    Start(config InstanceConfig) error
    // Stop shuts down a running instance.
    Stop(instance *Instance) error
    // Health reports whether the instance is currently serving requests.
    Health(instance *Instance) (*HealthStatus, error)
    // Stats returns backend-specific runtime metrics for the instance.
    Stats(instance *Instance) (*Stats, error)
}
```
### Registration
Register your custom backend:
```go
func init() {
    backends.Register("custom", &CustomBackend{})
}
```
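To make the shape of a backend concrete, here is a minimal, self-contained sketch. The struct types are stand-ins for LlamaCtl's real ones, the `my-llm-server` binary is hypothetical, and process handling is deliberately simplified; the point is only to show how the four interface methods fit together.

```go
package custom

import (
    "fmt"
    "os/exec"
)

// Stand-in types: the real definitions come from LlamaCtl's backends package.
type InstanceConfig struct {
    Name      string
    ModelPath string
    Port      int
}

type Instance struct {
    Name string
}

type HealthStatus struct {
    Healthy bool
    Message string
}

type Stats struct {
    TokensPerSecond float64
    MemoryUsage     uint64
}

// CustomBackend shells out to a hypothetical "my-llm-server" binary and
// tracks the resulting processes by instance name.
type CustomBackend struct {
    procs map[string]*exec.Cmd
}

func (b *CustomBackend) Start(config InstanceConfig) error {
    if b.procs == nil {
        b.procs = make(map[string]*exec.Cmd)
    }
    cmd := exec.Command("my-llm-server",
        "--model", config.ModelPath,
        "--port", fmt.Sprintf("%d", config.Port))
    if err := cmd.Start(); err != nil {
        return err
    }
    b.procs[config.Name] = cmd
    return nil
}

func (b *CustomBackend) Stop(instance *Instance) error {
    cmd, ok := b.procs[instance.Name]
    if !ok || cmd.Process == nil {
        return nil // nothing to stop
    }
    delete(b.procs, instance.Name)
    return cmd.Process.Kill()
}

func (b *CustomBackend) Health(instance *Instance) (*HealthStatus, error) {
    // A real backend would probe the server's health endpoint here.
    if _, ok := b.procs[instance.Name]; !ok {
        return &HealthStatus{Healthy: false, Message: "not running"}, nil
    }
    return &HealthStatus{Healthy: true, Message: "ok"}, nil
}

func (b *CustomBackend) Stats(instance *Instance) (*Stats, error) {
    // A real backend would query the server for runtime metrics.
    return &Stats{}, nil
}
```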
## Best Practices
### Production Deployments
1. **Resource allocation**: Plan for peak usage
2. **Backend selection**: Choose based on requirements
3. **Monitoring**: Set up comprehensive monitoring
4. **Fallback**: Configure backup backends
### Development
1. **Rapid iteration**: Use smaller models
2. **Resource monitoring**: Track usage patterns
3. **Configuration testing**: Validate settings
4. **Performance profiling**: Optimize bottlenecks
## Next Steps
- Learn about [Monitoring](monitoring.md) backend performance
- Explore [Troubleshooting](troubleshooting.md) guides
- Set up [Production Monitoring](monitoring.md)