# Backends
Llamactl supports multiple backends for running large language models. This guide covers the available backends and their configuration.
## Llama.cpp Backend
The primary backend for Llamactl, providing robust support for GGUF models.
### Features
- GGUF Support: Native support for GGUF model format
- GPU Acceleration: CUDA, OpenCL, and Metal support
- Memory Optimization: Efficient memory usage and mapping
- Multi-threading: Configurable CPU thread utilization
- Quantization: Support for various quantization levels
### Configuration

```yaml
backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
```
### Supported Options

| Option | Description | Default |
|---|---|---|
| `threads` | Number of CPU threads | 4 |
| `context_size` | Context window size | 2048 |
| `batch_size` | Batch size for processing | 512 |
| `gpu_layers` | Layers to offload to GPU | 0 |
| `memory_lock` | Lock model in memory | false |
| `no_mmap` | Disable memory mapping | false |
| `rope_freq_base` | RoPE frequency base | 10000 |
| `rope_freq_scale` | RoPE frequency scale | 1.0 |
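As a rough sketch of how these options are applied in practice, the Go snippet below builds an `options` payload from the table and submits it to the instance API, following the same PUT-with-options pattern as the benchmarking example later in this guide. The exact endpoint path, HTTP method, and instance name are assumptions for illustration and may differ in your deployment.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical example: apply llama.cpp options from the table above to an
	// existing instance named "my-instance". The payload shape mirrors the
	// curl examples in this guide.
	payload := map[string]any{
		"options": map[string]any{
			"threads":      4,
			"context_size": 2048,
			"batch_size":   512,
			"gpu_layers":   0,
		},
	}

	body, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}

	// Assumed endpoint, based on the /api/instances/... paths used elsewhere here.
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8080/api/instances/my-instance", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```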
### GPU Acceleration

#### CUDA Setup

```bash
# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify CUDA installation
nvcc --version
nvidia-smi
```
#### Configuration for GPU

```json
{
  "name": "gpu-accelerated",
  "model_path": "/models/llama-2-13b.gguf",
  "port": 8081,
  "options": {
    "gpu_layers": 35,
    "threads": 2,
    "context_size": 4096
  }
}
```
### Performance Tuning

#### Memory Optimization

```yaml
# For limited-memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false
```

```yaml
# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
```
#### CPU Optimization

```yaml
# Match the thread count to your CPU cores.
# For an 8-core CPU:
options:
  threads: 6  # Leave 2 cores for the system
```

```yaml
# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
```
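If you prefer to derive the thread count programmatically rather than hard-coding it, a minimal sketch of the "leave a couple of cores for the system" rule used above looks like this; the two-core headroom is the same assumption as in the example, not a llamactl requirement.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Leave two cores for the OS and llamactl itself, but never go below one thread.
	threads := runtime.NumCPU() - 2
	if threads < 1 {
		threads = 1
	}
	fmt.Printf("suggested llama.cpp threads: %d\n", threads)
}
```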
## Future Backends
Llamactl is designed to support multiple backends. Planned additions:
### vLLM Backend
High-performance inference engine optimized for serving:
- Features: Fast inference, batching, streaming
- Models: Supports various model formats
- Scaling: Horizontal scaling support
### TensorRT-LLM Backend
NVIDIA's optimized inference engine:
- Features: Maximum GPU performance
- Models: Optimized for NVIDIA GPUs
- Deployment: Production-ready inference
### Ollama Backend
Integration with Ollama for easy model management:
- Features: Simplified model downloading
- Models: Large model library
- Integration: Seamless model switching
## Backend Selection

### Automatic Detection
Llamactl can automatically detect the best backend:
```yaml
backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"
```
### Manual Selection
Force a specific backend for an instance:
```json
{
  "name": "manual-backend",
  "backend": "llamacpp",
  "model_path": "/models/model.gguf",
  "port": 8081
}
```
## Backend-Specific Features

### Llama.cpp Features

#### Model Formats
- GGUF: Primary format, best compatibility
- GGML: Legacy format (limited support)
#### Quantization Levels

- Q2_K: Smallest size, lower quality
- Q4_K_M: Balanced size and quality
- Q5_K_M: Higher quality, larger size
- Q6_K: Near-original quality
- Q8_0: Minimal loss, largest size
#### Advanced Options

```yaml
advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true
```
## Monitoring Backend Performance

### Metrics Collection
Monitor backend-specific metrics:
```bash
# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats
```
Response:
```json
{
  "backend": "llamacpp",
  "version": "b1234",
  "metrics": {
    "tokens_per_second": 15.2,
    "memory_usage": 4294967296,
    "gpu_utilization": 85.5,
    "context_usage": 75.0
  }
}
```
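To consume these metrics from your own tooling, a small Go sketch such as the following can fetch and decode the stats endpoint; the struct fields simply mirror the example payload above and are not guaranteed to cover every field your backend reports.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// backendStats mirrors the example response shown above; treat the field list
// as illustrative rather than exhaustive.
type backendStats struct {
	Backend string `json:"backend"`
	Version string `json:"version"`
	Metrics struct {
		TokensPerSecond float64 `json:"tokens_per_second"`
		MemoryUsage     int64   `json:"memory_usage"`
		GPUUtilization  float64 `json:"gpu_utilization"`
		ContextUsage    float64 `json:"context_usage"`
	} `json:"metrics"`
}

func main() {
	resp, err := http.Get("http://localhost:8080/api/instances/my-instance/backend/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats backendStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}
	fmt.Printf("%s %s: %.1f tokens/s, %.1f%% GPU utilization\n",
		stats.Backend, stats.Version,
		stats.Metrics.TokensPerSecond, stats.Metrics.GPUUtilization)
}
```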
## Performance Optimization

### Benchmark Different Configurations
```bash
# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run performance test
done
```
### Memory Usage Optimization
```bash
# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
```
## Troubleshooting Backends

### Common Llama.cpp Issues
**Model won't load:**

```bash
# Check model file
file /path/to/model.gguf

# Verify format
llama-server --model /path/to/model.gguf --dry-run
```
**GPU not detected:**

```bash
# Check CUDA installation
nvidia-smi

# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
```
**Performance issues:**

```bash
# Check system resources
htop
nvidia-smi

# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config
```
## Custom Backend Development

### Backend Interface
Implement the backend interface for custom backends:
```go
type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}
```
### Registration
Register your custom backend:
```go
func init() {
    backends.Register("custom", &CustomBackend{})
}
```
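Putting the two pieces together, a skeletal custom backend might look like the sketch below. The `CustomBackend` type and its stubbed behaviour are purely illustrative: only the interface methods and the `backends.Register` call come from the snippets above, and the `InstanceConfig`, `Instance`, `HealthStatus`, and `Stats` types (as well as the `backends` registry package) are assumed to be provided by llamactl, so this is a sketch rather than a buildable file.

```go
// Illustrative skeleton only: it satisfies the Backend interface shown above
// but does not manage a real inference process.
type CustomBackend struct{}

func (b *CustomBackend) Start(config InstanceConfig) error {
	// Launch the external inference server for this instance here.
	return nil
}

func (b *CustomBackend) Stop(instance *Instance) error {
	// Terminate the process and release any resources here.
	return nil
}

func (b *CustomBackend) Health(instance *Instance) (*HealthStatus, error) {
	// Probe the instance (for example, an HTTP health endpoint) here.
	return &HealthStatus{}, nil
}

func (b *CustomBackend) Stats(instance *Instance) (*Stats, error) {
	// Collect backend-specific metrics here.
	return &Stats{}, nil
}

func init() {
	// Make the backend selectable as "custom" in instance definitions.
	backends.Register("custom", &CustomBackend{})
}
```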
## Best Practices

### Production Deployments
- Resource allocation: Plan for peak usage
- Backend selection: Choose based on requirements
- Monitoring: Set up comprehensive monitoring
- Fallback: Configure backup backends
### Development
- Rapid iteration: Use smaller models
- Resource monitoring: Track usage patterns
- Configuration testing: Validate settings
- Performance profiling: Optimize bottlenecks
## Next Steps
- Learn about Monitoring backend performance
- Explore Troubleshooting guides
- Set up Production Monitoring