Backends

Llamactl supports multiple backends for running large language models. This guide covers the available backends and their configuration.

Llama.cpp Backend

The primary backend for Llamactl, providing robust support for GGUF models.

Features

  • GGUF Support: Native support for GGUF model format
  • GPU Acceleration: CUDA, OpenCL, and Metal support
  • Memory Optimization: Efficient memory usage and mapping
  • Multi-threading: Configurable CPU thread utilization
  • Quantization: Support for various quantization levels

Configuration

backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
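
These defaults apply to every llama.cpp instance; per-instance options override them. For example, an instance with a larger context window can be created through the instance API (the endpoint and fields below follow the examples later in this guide and may differ in your llamactl version):

curl -X PUT http://localhost:8080/api/instances/my-model \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "llamacpp",
    "model_path": "/models/model.gguf",
    "port": 8081,
    "options": {"threads": 8, "context_size": 4096}
  }'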

Supported Options

Option            Description                                  Default
threads           Number of CPU threads                        4
context_size      Context window size (tokens)                 2048
batch_size        Batch size for prompt processing             512
gpu_layers        Number of layers to offload to the GPU       0
memory_lock       Lock the model in memory (mlock)             false
no_mmap           Disable memory mapping of the model file     false
rope_freq_base    RoPE frequency base                          10000
rope_freq_scale   RoPE frequency scale                         1.0
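
For linear RoPE scaling, rope_freq_scale is commonly set to roughly the model's trained context length divided by the desired context length; for example, running a model trained on 4096 tokens at 8192 tokens uses a scale of about 0.5. A sketch of such an instance configuration (values are illustrative):

{
  "name": "extended-context",
  "model_path": "/models/llama-2-7b.gguf",
  "port": 8082,
  "options": {
    "context_size": 8192,
    "rope_freq_scale": 0.5
  }
}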

GPU Acceleration

CUDA Setup

# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify CUDA installation
nvcc --version
nvidia-smi
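
CUDA acceleration also requires a llama-server binary that was built with CUDA enabled. With recent llama.cpp versions the build typically looks like this (the exact CMake flag depends on the llama.cpp version):

# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release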

Configuration for GPU

{
  "name": "gpu-accelerated",
  "model_path": "/models/llama-2-13b.gguf",
  "port": 8081,
  "options": {
    "gpu_layers": 35,
    "threads": 2,
    "context_size": 4096
  }
}
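
After the instance starts, confirm that layers were actually offloaded by watching GPU memory usage while the model loads:

# VRAM usage should rise as layers are offloaded
watch -n 1 nvidia-smi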

Performance Tuning

Memory Optimization

# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
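
To choose between these profiles, compare the model file size with the memory actually available on the host; the model plus context buffers should fit in RAM (or be memory-mapped):

# Available system memory
free -h

# Size of the model file(s)
ls -lh /models/*.gguf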

CPU Optimization

# Match thread count to CPU cores
# For 8-core CPU:
options:
  threads: 6  # Leave 2 cores for system
  
# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
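
When picking a thread count, check the physical core count first; hyper-threaded logical cores usually add little for llama.cpp inference:

# Logical CPUs
nproc

# Physical cores and sockets
lscpu | grep -E '^(Socket|Core|CPU\(s\))'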

Future Backends

Llamactl is designed to support multiple backends. Planned additions:

vLLM Backend

High-performance inference engine optimized for serving:

  • Features: Fast inference with continuous batching and streaming
  • Models: Serves Hugging Face Transformers-format models
  • Scaling: Horizontal scaling support

TensorRT-LLM Backend

NVIDIA's optimized inference engine:

  • Features: Maximum GPU performance
  • Models: Pre-built engines compiled for specific NVIDIA GPUs
  • Deployment: Production-ready inference

Ollama Backend

Integration with Ollama for easy model management:

  • Features: Simplified model downloading
  • Models: Large model library
  • Integration: Seamless model switching

Backend Selection

Automatic Detection

Llamactl can automatically detect the best backend:

backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"

Manual Selection

Force a specific backend for an instance:

{
  "name": "manual-backend",
  "backend": "llamacpp",
  "model_path": "/models/model.gguf",
  "port": 8081
}

Backend-Specific Features

Llama.cpp Features

Model Formats

  • GGUF: Primary format, best compatibility
  • GGML: Legacy format (not supported by recent llama.cpp builds; convert to GGUF)

Quantization Levels

  • Q2_K: Smallest size, lower quality
  • Q4_K_M: Balanced size and quality
  • Q5_K_M: Higher quality, larger size
  • Q6_K: Near-original quality
  • Q8_0: Minimal loss, largest size
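
If you only have a full-precision GGUF, these variants can be produced with llama.cpp's quantization tool (named llama-quantize or quantize depending on the llama.cpp version):

# Convert an F16 GGUF to Q4_K_M
llama-quantize /models/llama-2-7b-f16.gguf /models/llama-2-7b-Q4_K_M.gguf Q4_K_M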

Advanced Options

advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true

Monitoring Backend Performance

Metrics Collection

Monitor backend-specific metrics:

# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats

Response:

{
  "backend": "llamacpp",
  "version": "b1234",
  "metrics": {
    "tokens_per_second": 15.2,
    "memory_usage": 4294967296,
    "gpu_utilization": 85.5,
    "context_usage": 75.0
  }
}
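
Individual fields can be extracted with jq for scripting or dashboards:

# Throughput only
curl -s http://localhost:8080/api/instances/my-instance/backend/stats \
  | jq .metrics.tokens_per_second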

Performance Optimization

Benchmark Different Configurations

# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -H "Content-Type: application/json" \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run performance test
done
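
The performance test itself can be as simple as timing a completion request against the instance. A minimal sketch, assuming the instance exposes llama-server's /completion endpoint on its configured port (8081 in the examples above):

# Time a single completion request
time curl -s http://localhost:8081/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 128}' > /dev/null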

Memory Usage Optimization

# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'

Troubleshooting Backends

Common Llama.cpp Issues

Model won't load:

# Check model file
file /path/to/model.gguf

# Try a short test run to verify the model loads (uses the llama-cli tool from llama.cpp)
llama-cli -m /path/to/model.gguf -p "Hello" -n 8

GPU not detected:

# Check CUDA installation
nvidia-smi

# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
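
If the help output shows no GPU options, check whether the binary was built and linked with CUDA at all (assuming a dynamically linked build):

# A CUDA-enabled build links against libcuda/libcublas
ldd $(which llama-server) | grep -iE 'cuda|cublas'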

Performance issues:

# Check system resources
htop
nvidia-smi

# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config

Custom Backend Development

Backend Interface

Implement the backend interface for custom backends:

type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}

Registration

Register your custom backend:

func init() {
    backends.Register("custom", &CustomBackend{})
}

Best Practices

Production Deployments

  1. Resource allocation: Plan for peak usage
  2. Backend selection: Choose based on requirements
  3. Monitoring: Set up comprehensive monitoring
  4. Fallback: Configure backup backends

Development

  1. Rapid iteration: Use smaller models
  2. Resource monitoring: Track usage patterns
  3. Configuration testing: Validate settings
  4. Performance profiling: Optimize bottlenecks

Next Steps