Create initial documentation structure

2025-11-06 09:04:27 +00:00 · 2025-08-31 14:27:00 +02:00
parent 7675271370
commit bd31c03f4a
16 changed files with 3514 additions and 0 deletions
--- a/docs/advanced/backends.md
+++ b/docs/advanced/backends.md
@@ -0,0 +1,316 @@
+# Backends
+
+LlamaCtl supports multiple backends for running large language models. This guide covers the available backends and their configuration.
+
+## Llama.cpp Backend
+
+The primary backend for LlamaCtl, providing robust support for GGUF models.
+
+### Features
+
+- **GGUF Support**: Native support for GGUF model format
+- **GPU Acceleration**: CUDA, OpenCL, and Metal support
+- **Memory Optimization**: Efficient memory usage and mapping
+- **Multi-threading**: Configurable CPU thread utilization
+- **Quantization**: Support for various quantization levels
+
+### Configuration
+
+```yaml
+backends:
+  llamacpp:
+    binary_path: "/usr/local/bin/llama-server"
+    default_options:
+      threads: 4
+      context_size: 2048
+      batch_size: 512
+    gpu:
+      enabled: true
+      layers: 35
+```
+
+### Supported Options
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `threads` | Number of CPU threads | 4 |
+| `context_size` | Context window size | 2048 |
+| `batch_size` | Batch size for processing | 512 |
+| `gpu_layers` | Layers to offload to GPU | 0 |
+| `memory_lock` | Lock model in memory | false |
+| `no_mmap` | Disable memory mapping | false |
+| `rope_freq_base` | RoPE frequency base | 10000 |
+| `rope_freq_scale` | RoPE frequency scale | 1.0 |
+
+### GPU Acceleration
+
+#### CUDA Setup
+
+```bash
+# Install CUDA toolkit
+sudo apt update
+sudo apt install nvidia-cuda-toolkit
+
+# Verify CUDA installation
+nvcc --version
+nvidia-smi
+```
+
+#### Configuration for GPU
+
+```json
+{
+  "name": "gpu-accelerated",
+  "model_path": "/models/llama-2-13b.gguf",
+  "port": 8081,
+  "options": {
+    "gpu_layers": 35,
+    "threads": 2,
+    "context_size": 4096
+  }
+}
+```
+
+### Performance Tuning
+
+#### Memory Optimization
+
+```yaml
+# For limited memory systems
+options:
+  context_size: 1024
+  batch_size: 256
+  no_mmap: true
+  memory_lock: false
+
+# For high-memory systems
+options:
+  context_size: 8192
+  batch_size: 1024
+  memory_lock: true
+  no_mmap: false
+```
+
+#### CPU Optimization
+
+```yaml
+# Match thread count to CPU cores
+# For 8-core CPU:
+options:
+  threads: 6  # Leave 2 cores for system
+  
+# For high-performance CPUs:
+options:
+  threads: 16
+  batch_size: 1024
+```
+
+## Future Backends
+
+LlamaCtl is designed to support multiple backends. Planned additions:
+
+### vLLM Backend
+
+High-performance inference engine optimized for serving:
+
+- **Features**: Fast inference, batching, streaming
+- **Models**: Supports various model formats
+- **Scaling**: Horizontal scaling support
+
+### TensorRT-LLM Backend
+
+NVIDIA's optimized inference engine:
+
+- **Features**: Maximum GPU performance
+- **Models**: Optimized for NVIDIA GPUs
+- **Deployment**: Production-ready inference
+
+### Ollama Backend
+
+Integration with Ollama for easy model management:
+
+- **Features**: Simplified model downloading
+- **Models**: Large model library
+- **Integration**: Seamless model switching
+
+## Backend Selection
+
+### Automatic Detection
+
+LlamaCtl can automatically detect the best backend:
+
+```yaml
+backends:
+  auto_detect: true
+  preference_order:
+    - "llamacpp"
+    - "vllm"
+    - "tensorrt"
+```
+
+### Manual Selection
+
+Force a specific backend for an instance:
+
+```json
+{
+  "name": "manual-backend",
+  "backend": "llamacpp",
+  "model_path": "/models/model.gguf",
+  "port": 8081
+}
+```
+
+## Backend-Specific Features
+
+### Llama.cpp Features
+
+#### Model Formats
+
+- **GGUF**: Primary format, best compatibility
+- **GGML**: Legacy format (limited support)
+
+#### Quantization Levels
+
+- `Q2_K`: Smallest size, lower quality
+- `Q4_K_M`: Balanced size and quality
+- `Q5_K_M`: Higher quality, larger size
+- `Q6_K`: Near-original quality
+- `Q8_0`: Minimal loss, largest size
+
+#### Advanced Options
+
+```yaml
+advanced:
+  rope_scaling:
+    type: "linear"
+    factor: 2.0
+  attention:
+    flash_attention: true
+    grouped_query: true
+```
+
+## Monitoring Backend Performance
+
+### Metrics Collection
+
+Monitor backend-specific metrics:
+
+```bash
+# Get backend statistics
+curl http://localhost:8080/api/instances/my-instance/backend/stats
+```
+
+**Response:**
+```json
+{
+  "backend": "llamacpp",
+  "version": "b1234",
+  "metrics": {
+    "tokens_per_second": 15.2,
+    "memory_usage": 4294967296,
+    "gpu_utilization": 85.5,
+    "context_usage": 75.0
+  }
+}
+```
+
+### Performance Optimization
+
+#### Benchmark Different Configurations
+
+```bash
+# Test various thread counts
+for threads in 2 4 8 16; do
+  echo "Testing $threads threads"
+  curl -X PUT http://localhost:8080/api/instances/benchmark \
+    -d "{\"options\": {\"threads\": $threads}}"
+  # Run performance test
+done
+```
+
+#### Memory Usage Optimization
+
+```bash
+# Monitor memory usage
+watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
+```
+
+## Troubleshooting Backends
+
+### Common Llama.cpp Issues
+
+**Model won't load:**
+```bash
+# Check model file
+file /path/to/model.gguf
+
+# Verify format
+llama-server --model /path/to/model.gguf --dry-run
+```
+
+**GPU not detected:**
+```bash
+# Check CUDA installation
+nvidia-smi
+
+# Verify llama.cpp GPU support
+llama-server --help | grep -i gpu
+```
+
+**Performance issues:**
+```bash
+# Check system resources
+htop
+nvidia-smi
+
+# Verify configuration
+curl http://localhost:8080/api/instances/my-instance/config
+```
+
+## Custom Backend Development
+
+### Backend Interface
+
+Implement the backend interface for custom backends:
+
+```go
+type Backend interface {
+    Start(config InstanceConfig) error
+    Stop(instance *Instance) error
+    Health(instance *Instance) (*HealthStatus, error)
+    Stats(instance *Instance) (*Stats, error)
+}
+```
+
+### Registration
+
+Register your custom backend:
+
+```go
+func init() {
+    backends.Register("custom", &CustomBackend{})
+}
+```
+
+## Best Practices
+
+### Production Deployments
+
+1. **Resource allocation**: Plan for peak usage
+2. **Backend selection**: Choose based on requirements
+3. **Monitoring**: Set up comprehensive monitoring
+4. **Fallback**: Configure backup backends
+
+### Development
+
+1. **Rapid iteration**: Use smaller models
+2. **Resource monitoring**: Track usage patterns
+3. **Configuration testing**: Validate settings
+4. **Performance profiling**: Optimize bottlenecks
+
+## Next Steps
+
+- Learn about [Monitoring](monitoring.md) backend performance
+- Explore [Troubleshooting](troubleshooting.md) guides
+- Set up [Production Monitoring](monitoring.md)