Backends

Llamactl supports multiple backends for running large language models. This guide covers the available backends and their configuration.

Llama.cpp Backend

The primary backend for Llamactl, providing robust support for GGUF models.

Features

  • GGUF Support: Native support for GGUF model format
  • GPU Acceleration: CUDA, OpenCL, and Metal support
  • Memory Optimization: Efficient memory usage and mapping
  • Multi-threading: Configurable CPU thread utilization
  • Quantization: Support for various quantization levels

Configuration

backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
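
These defaults apply to every llama.cpp instance; per-instance options override them. For example, an instance with a larger context window can be created through the instance API (the endpoint and fields below follow the examples later in this guide and may differ in your llamactl version):

curl -X PUT http://localhost:8080/api/instances/my-model \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "llamacpp",
    "model_path": "/models/model.gguf",
    "port": 8081,
    "options": {"threads": 8, "context_size": 4096}
  }'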

Supported Options

Option            Description                                  Default
threads           Number of CPU threads                        4
context_size      Context window size (tokens)                 2048
batch_size        Batch size for prompt processing             512
gpu_layers        Number of layers to offload to the GPU       0
memory_lock       Lock the model in memory (mlock)             false
no_mmap           Disable memory mapping of the model file     false
rope_freq_base    RoPE frequency base                          10000
rope_freq_scale   RoPE frequency scale                         1.0
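
For linear RoPE scaling, rope_freq_scale is commonly set to roughly the model's trained context length divided by the desired context length; for example, running a model trained on 4096 tokens at 8192 tokens uses a scale of about 0.5. A sketch of such an instance configuration (values are illustrative):

{
  "name": "extended-context",
  "model_path": "/models/llama-2-7b.gguf",
  "port": 8082,
  "options": {
    "context_size": 8192,
    "rope_freq_scale": 0.5
  }
}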

GPU Acceleration

CUDA Setup

# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify CUDA installation
nvcc --version
nvidia-smi
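
CUDA acceleration also requires a llama-server binary that was built with CUDA enabled. With recent llama.cpp versions the build typically looks like this (the exact CMake flag depends on the llama.cpp version):

# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release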

Configuration for GPU

{
  "name": "gpu-accelerated",
  "model_path": "/models/llama-2-13b.gguf",
  "port": 8081,
  "options": {
    "gpu_layers": 35,
    "threads": 2,
    "context_size": 4096
  }
}
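
After the instance starts, confirm that layers were actually offloaded by watching GPU memory usage while the model loads:

# VRAM usage should rise as layers are offloaded
watch -n 1 nvidia-smi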

Performance Tuning

Memory Optimization

# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
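
To choose between these profiles, compare the model file size with the memory actually available on the host; the model plus context buffers should fit in RAM (or be memory-mapped):

# Available system memory
free -h

# Size of the model file(s)
ls -lh /models/*.gguf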

CPU Optimization

# Match thread count to CPU cores
# For 8-core CPU:
options:
  threads: 6  # Leave 2 cores for system
  
# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
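
When picking a thread count, check the physical core count first; hyper-threaded logical cores usually add little for llama.cpp inference:

# Logical CPUs
nproc

# Physical cores and sockets
lscpu | grep -E '^(Socket|Core|CPU\(s\))'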

Future Backends

Llamactl is designed to support multiple backends. Planned additions:

vLLM Backend

High-performance inference engine optimized for serving:

  • Features: Fast inference with continuous batching and streaming
  • Models: Serves Hugging Face Transformers-format models
  • Scaling: Horizontal scaling support

TensorRT-LLM Backend

NVIDIA's optimized inference engine:

  • Features: Maximum GPU performance
  • Models: Pre-built engines compiled for specific NVIDIA GPUs
  • Deployment: Production-ready inference

Ollama Backend

Integration with Ollama for easy model management:

  • Features: Simplified model downloading
  • Models: Large model library
  • Integration: Seamless model switching

Backend Selection

Automatic Detection

Llamactl can automatically detect the best backend:

backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"

Manual Selection

Force a specific backend for an instance:

{
  "name": "manual-backend",
  "backend": "llamacpp",
  "model_path": "/models/model.gguf",
  "port": 8081
}

Backend-Specific Features

Llama.cpp Features

Model Formats

  • GGUF: Primary format, best compatibility
  • GGML: Legacy format (not supported by recent llama.cpp builds; convert to GGUF)

Quantization Levels

  • Q2_K: Smallest size, lower quality
  • Q4_K_M: Balanced size and quality
  • Q5_K_M: Higher quality, larger size
  • Q6_K: Near-original quality
  • Q8_0: Minimal loss, largest size
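
If you only have a full-precision GGUF, these variants can be produced with llama.cpp's quantization tool (named llama-quantize or quantize depending on the llama.cpp version):

# Convert an F16 GGUF to Q4_K_M
llama-quantize /models/llama-2-7b-f16.gguf /models/llama-2-7b-Q4_K_M.gguf Q4_K_M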

Advanced Options

advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true

Monitoring Backend Performance

Metrics Collection

Monitor backend-specific metrics:

# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats

Response:

{
  "backend": "llamacpp",
  "version": "b1234",
  "metrics": {
    "tokens_per_second": 15.2,
    "memory_usage": 4294967296,
    "gpu_utilization": 85.5,
    "context_usage": 75.0
  }
}
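
Individual fields can be extracted with jq for scripting or dashboards:

# Throughput only
curl -s http://localhost:8080/api/instances/my-instance/backend/stats \
  | jq .metrics.tokens_per_second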

Performance Optimization

Benchmark Different Configurations

# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -H "Content-Type: application/json" \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run performance test
done
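
The performance test itself can be as simple as timing a completion request against the instance. A minimal sketch, assuming the instance exposes llama-server's /completion endpoint on its configured port (8081 in the examples above):

# Time a single completion request
time curl -s http://localhost:8081/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 128}' > /dev/null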

Memory Usage Optimization

# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'

Troubleshooting Backends

Common Llama.cpp Issues

Model won't load:

# Check model file
file /path/to/model.gguf

# Try a short test run to verify the model loads (uses the llama-cli tool from llama.cpp)
llama-cli -m /path/to/model.gguf -p "Hello" -n 8

GPU not detected:

# Check CUDA installation
nvidia-smi

# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
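
If the help output shows no GPU options, check whether the binary was built and linked with CUDA at all (assuming a dynamically linked build):

# A CUDA-enabled build links against libcuda/libcublas
ldd $(which llama-server) | grep -iE 'cuda|cublas'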

Performance issues:

# Check system resources
htop
nvidia-smi

# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config

Custom Backend Development

Backend Interface

Implement the backend interface for custom backends:

type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}

Registration

Register your custom backend:

func init() {
    backends.Register("custom", &CustomBackend{})
}

Best Practices

Production Deployments

  1. Resource allocation: Plan for peak usage
  2. Backend selection: Choose based on requirements
  3. Monitoring: Set up comprehensive monitoring
  4. Fallback: Configure backup backends

Development

  1. Rapid iteration: Use smaller models
  2. Resource monitoring: Track usage patterns
  3. Configuration testing: Validate settings
  4. Performance profiling: Optimize bottlenecks

Next Steps