# Backends

LlamaCtl supports multiple backends for running large language models. This guide covers the available backends and their configuration.

## Llama.cpp Backend

The primary backend for LlamaCtl, providing robust support for GGUF models.

### Features

- **GGUF Support**: Native support for GGUF model format
- **GPU Acceleration**: CUDA, OpenCL, and Metal support
- **Memory Optimization**: Efficient memory usage and mapping
- **Multi-threading**: Configurable CPU thread utilization
- **Quantization**: Support for various quantization levels

### Configuration

```yaml
backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
```

### Supported Options

| Option | Description | Default |
|--------|-------------|---------|
| `threads` | Number of CPU threads | 4 |
| `context_size` | Context window size | 2048 |
| `batch_size` | Batch size for processing | 512 |
| `gpu_layers` | Layers to offload to GPU | 0 |
| `memory_lock` | Lock model in memory | false |
| `no_mmap` | Disable memory mapping | false |
| `rope_freq_base` | RoPE frequency base | 10000 |
| `rope_freq_scale` | RoPE frequency scale | 1.0 |

### GPU Acceleration

#### CUDA Setup

```bash
# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify CUDA installation
nvcc --version
nvidia-smi
```

#### Configuration for GPU

```json
{
  "name": "gpu-accelerated",
  "model_path": "/models/llama-2-13b.gguf",
  "port": 8081,
  "options": {
    "gpu_layers": 35,
    "threads": 2,
    "context_size": 4096
  }
}
```

### Performance Tuning

#### Memory Optimization

```yaml
# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
```

#### CPU Optimization

```yaml
# Match thread count to CPU cores
# For an 8-core CPU:
options:
  threads: 6  # Leave 2 cores for the system

# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
```
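To automate the CPU guidance above, the following Go sketch derives a thread count from the host and applies it to a running instance. It assumes the `PUT /api/instances/{name}` endpoint and `{"options": ...}` payload used in the benchmarking example later in this guide; `my-instance` is a placeholder name, and whether the change takes effect without a restart depends on your LlamaCtl version.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"runtime"
)

func main() {
	// Leave some headroom for the system, mirroring the YAML example above.
	threads := runtime.NumCPU() - 2
	if threads < 1 {
		threads = 1
	}

	// Assumed endpoint and payload shape; adjust to your deployment.
	body := fmt.Sprintf(`{"options": {"threads": %d, "batch_size": 512}}`, threads)
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8080/api/instances/my-instance",
		bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("update status:", resp.Status)
}
```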
## Future Backends

LlamaCtl is designed to support multiple backends. Planned additions:

### vLLM Backend

High-performance inference engine optimized for serving:

- **Features**: Fast inference, batching, streaming
- **Models**: Supports various model formats
- **Scaling**: Horizontal scaling support

### TensorRT-LLM Backend

NVIDIA's optimized inference engine:

- **Features**: Maximum GPU performance
- **Models**: Optimized for NVIDIA GPUs
- **Deployment**: Production-ready inference

### Ollama Backend

Integration with Ollama for easy model management:

- **Features**: Simplified model downloading
- **Models**: Large model library
- **Integration**: Seamless model switching

## Backend Selection

### Automatic Detection

LlamaCtl can automatically detect the best backend:

```yaml
backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"
```

### Manual Selection

Force a specific backend for an instance:

```json
{
  "name": "manual-backend",
  "backend": "llamacpp",
  "model_path": "/models/model.gguf",
  "port": 8081
}
```

## Backend-Specific Features

### Llama.cpp Features

#### Model Formats

- **GGUF**: Primary format, best compatibility
- **GGML**: Legacy format (limited support)

#### Quantization Levels

- `Q2_K`: Smallest size, lower quality
- `Q4_K_M`: Balanced size and quality
- `Q5_K_M`: Higher quality, larger size
- `Q6_K`: Near-original quality
- `Q8_0`: Minimal loss, largest size

#### Advanced Options

```yaml
advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true
```

## Monitoring Backend Performance

### Metrics Collection

Monitor backend-specific metrics:

```bash
# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats
```

**Response:**

```json
{
  "backend": "llamacpp",
  "version": "b1234",
  "metrics": {
    "tokens_per_second": 15.2,
    "memory_usage": 4294967296,
    "gpu_utilization": 85.5,
    "context_usage": 75.0
  }
}
```

### Performance Optimization

#### Benchmark Different Configurations

```bash
# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run performance test
done
```

#### Memory Usage Optimization

```bash
# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
```

## Troubleshooting Backends

### Common Llama.cpp Issues

**Model won't load:**

```bash
# Check model file
file /path/to/model.gguf

# Verify format
llama-server --model /path/to/model.gguf --dry-run
```

**GPU not detected:**

```bash
# Check CUDA installation
nvidia-smi

# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
```

**Performance issues:**

```bash
# Check system resources
htop
nvidia-smi

# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config
```

## Custom Backend Development

### Backend Interface

Implement the backend interface for custom backends:

```go
type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}
```

### Registration

Register your custom backend:

```go
func init() {
    backends.Register("custom", &CustomBackend{})
}
```
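As a starting point, here is a minimal, hedged sketch of a do-nothing backend that satisfies the interface above and registers itself. The `llamactl/backends` import path and the package-qualified type names (`backends.InstanceConfig`, `backends.Instance`, and so on) are assumptions; substitute your project's actual module path and type definitions.

```go
package custom

import (
	"llamactl/backends" // hypothetical import path; use your module's real path
)

// NoopBackend is a skeleton implementation of the Backend interface.
type NoopBackend struct{}

func (b *NoopBackend) Start(config backends.InstanceConfig) error {
	// Launch the inference process here (exec.Command, container runtime, etc.).
	return nil
}

func (b *NoopBackend) Stop(instance *backends.Instance) error {
	// Terminate the process and release any resources it holds.
	return nil
}

func (b *NoopBackend) Health(instance *backends.Instance) (*backends.HealthStatus, error) {
	// Probe the running instance; a real backend would call its health endpoint.
	return &backends.HealthStatus{}, nil
}

func (b *NoopBackend) Stats(instance *backends.Instance) (*backends.Stats, error) {
	// Report whatever metrics the underlying engine exposes.
	return &backends.Stats{}, nil
}

func init() {
	// Make the backend selectable as "noop" in instance configurations.
	backends.Register("noop", &NoopBackend{})
}
```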
## Best Practices

### Production Deployments

1. **Resource allocation**: Plan for peak usage
2. **Backend selection**: Choose based on requirements
3. **Monitoring**: Set up comprehensive monitoring
4. **Fallback**: Configure backup backends

### Development

1. **Rapid iteration**: Use smaller models
2. **Resource monitoring**: Track usage patterns
3. **Configuration testing**: Validate settings
4. **Performance profiling**: Optimize bottlenecks

## Next Steps

- Learn about [Monitoring](monitoring.md) backend performance
- Explore [Troubleshooting](troubleshooting.md) guides
- Set up [Production Monitoring](monitoring.md)