Create initial documentation structure

2025-08-31 14:27:00 +02:00
parent 7675271370
commit bd31c03f4a
16 changed files with 3514 additions and 0 deletions

docs/advanced/backends.md Normal file

@@ -0,0 +1,316 @@
# Backends
LlamaCtl supports multiple backends for running large language models. This guide covers the available backends and their configuration.
## Llama.cpp Backend
The primary backend for LlamaCtl, providing robust support for GGUF models.
### Features
- **GGUF Support**: Native support for GGUF model format
- **GPU Acceleration**: CUDA, OpenCL, and Metal support
- **Memory Optimization**: Efficient memory usage and mapping
- **Multi-threading**: Configurable CPU thread utilization
- **Quantization**: Support for various quantization levels
### Configuration
```yaml
backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
```
### Supported Options
| Option | Description | Default |
|--------|-------------|---------|
| `threads` | Number of CPU threads | 4 |
| `context_size` | Context window size | 2048 |
| `batch_size` | Batch size for processing | 512 |
| `gpu_layers` | Layers to offload to GPU | 0 |
| `memory_lock` | Lock model in memory | false |
| `no_mmap` | Disable memory mapping | false |
| `rope_freq_base` | RoPE frequency base | 10000 |
| `rope_freq_scale` | RoPE frequency scale | 1.0 |
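As a concrete example, the Go sketch below applies several of these options to an instance through the HTTP API. It is a minimal sketch only: the `PUT /api/instances/{name}` route mirrors the API calls shown later in this guide, and the model path, port, and option values are illustrative assumptions rather than recommended settings.
```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Option names follow the table above; the values are illustrative.
    payload := map[string]any{
        "model_path": "/models/llama-2-7b.gguf", // placeholder path
        "port":       8081,
        "options": map[string]any{
            "threads":      8,
            "context_size": 4096,
            "gpu_layers":   35,
            "memory_lock":  true,
        },
    }
    body, err := json.Marshal(payload)
    if err != nil {
        panic(err)
    }

    // Assumed endpoint: create or update the instance named "my-instance".
    req, err := http.NewRequest(http.MethodPut,
        "http://localhost:8080/api/instances/my-instance", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}
```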
### GPU Acceleration
#### CUDA Setup
```bash
# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit
# Verify CUDA installation
nvcc --version
nvidia-smi
```
#### Configuration for GPU
```json
{
"name": "gpu-accelerated",
"model_path": "/models/llama-2-13b.gguf",
"port": 8081,
"options": {
"gpu_layers": 35,
"threads": 2,
"context_size": 4096
}
}
```
### Performance Tuning
#### Memory Optimization
```yaml
# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
```
#### CPU Optimization
```yaml
# Match thread count to CPU cores
# For an 8-core CPU:
options:
  threads: 6  # Leave 2 cores for the system

# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
```
## Future Backends
LlamaCtl is designed to support multiple backends. Planned additions:
### vLLM Backend
High-performance inference engine optimized for serving:
- **Features**: Fast inference, batching, streaming
- **Models**: Supports various model formats
- **Scaling**: Horizontal scaling support
### TensorRT-LLM Backend
NVIDIA's optimized inference engine:
- **Features**: Maximum GPU performance
- **Models**: Optimized for NVIDIA GPUs
- **Deployment**: Production-ready inference
### Ollama Backend
Integration with Ollama for easy model management:
- **Features**: Simplified model downloading
- **Models**: Large model library
- **Integration**: Seamless model switching
## Backend Selection
### Automatic Detection
LlamaCtl can automatically detect the best backend:
```yaml
backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"
```
### Manual Selection
Force a specific backend for an instance:
```json
{
"name": "manual-backend",
"backend": "llamacpp",
"model_path": "/models/model.gguf",
"port": 8081
}
```
## Backend-Specific Features
### Llama.cpp Features
#### Model Formats
- **GGUF**: Primary format, best compatibility
- **GGML**: Legacy format (limited support)
#### Quantization Levels
- `Q2_K`: Smallest size, lower quality
- `Q4_K_M`: Balanced size and quality
- `Q5_K_M`: Higher quality, larger size
- `Q6_K`: Near-original quality
- `Q8_0`: Minimal loss, largest size
#### Advanced Options
```yaml
advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true
```
## Monitoring Backend Performance
### Metrics Collection
Monitor backend-specific metrics:
```bash
# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats
```
**Response:**
```json
{
"backend": "llamacpp",
"version": "b1234",
"metrics": {
"tokens_per_second": 15.2,
"memory_usage": 4294967296,
"gpu_utilization": 85.5,
"context_usage": 75.0
}
}
```
### Performance Optimization
#### Benchmark Different Configurations
```bash
# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -H "Content-Type: application/json" \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run your performance test here
done
```
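To compare the runs, you can read `tokens_per_second` from the backend stats endpoint shown above after each configuration change. The Go sketch below assumes that endpoint and the response fields from the earlier example; it is not a complete benchmark harness.
```go
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// statsResponse mirrors the example stats payload shown earlier in this guide.
type statsResponse struct {
    Backend string `json:"backend"`
    Metrics struct {
        TokensPerSecond float64 `json:"tokens_per_second"`
        MemoryUsage     int64   `json:"memory_usage"`
    } `json:"metrics"`
}

func main() {
    // Assumed endpoint; "benchmark" matches the instance updated in the loop above.
    resp, err := http.Get("http://localhost:8080/api/instances/benchmark/backend/stats")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var stats statsResponse
    if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
        panic(err)
    }
    fmt.Printf("backend=%s tokens/s=%.1f memory=%d bytes\n",
        stats.Backend, stats.Metrics.TokensPerSecond, stats.Metrics.MemoryUsage)
}
```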
#### Memory Usage Optimization
```bash
# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
```
## Troubleshooting Backends
### Common Llama.cpp Issues
**Model won't load:**
```bash
# Check model file
file /path/to/model.gguf
# Verify format
llama-server --model /path/to/model.gguf --dry-run
```
**GPU not detected:**
```bash
# Check CUDA installation
nvidia-smi
# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
```
**Performance issues:**
```bash
# Check system resources
htop
nvidia-smi
# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config
```
## Custom Backend Development
### Backend Interface
Implement the backend interface for custom backends:
```go
type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}
```
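A minimal sketch of what an implementation might look like, assuming the `InstanceConfig`, `Instance`, `HealthStatus`, and `Stats` types referenced above; the field names, binary path, and health endpoint are illustrative assumptions, not the actual LlamaCtl types:
```go
import (
    "fmt"
    "net/http"
    "os/exec"
)

// CustomBackend is a hypothetical Backend implementation that launches one
// external server process per instance.
type CustomBackend struct {
    processes map[string]*exec.Cmd // keyed by instance name
}

func NewCustomBackend() *CustomBackend {
    return &CustomBackend{processes: make(map[string]*exec.Cmd)}
}

func (b *CustomBackend) Start(config InstanceConfig) error {
    // Binary path, flags, and field names are assumptions for illustration only.
    cmd := exec.Command("/usr/local/bin/custom-server",
        "--model", config.ModelPath,
        "--port", fmt.Sprintf("%d", config.Port))
    if err := cmd.Start(); err != nil {
        return fmt.Errorf("starting custom backend: %w", err)
    }
    b.processes[config.Name] = cmd
    return nil
}

func (b *CustomBackend) Stop(instance *Instance) error {
    if cmd, ok := b.processes[instance.Name]; ok {
        delete(b.processes, instance.Name)
        return cmd.Process.Kill()
    }
    return nil
}

func (b *CustomBackend) Health(instance *Instance) (*HealthStatus, error) {
    // Probe the instance over HTTP; the /health path is an assumption.
    resp, err := http.Get(fmt.Sprintf("http://localhost:%d/health", instance.Port))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return &HealthStatus{Healthy: resp.StatusCode == http.StatusOK}, nil
}

func (b *CustomBackend) Stats(instance *Instance) (*Stats, error) {
    // Collect backend-specific metrics here; this sketch returns empty stats.
    return &Stats{}, nil
}
```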
### Registration
Register your custom backend:
```go
func init() {
    backends.Register("custom", &CustomBackend{})
}
```
## Best Practices
### Production Deployments
1. **Resource allocation**: Plan for peak usage
2. **Backend selection**: Choose based on requirements
3. **Monitoring**: Set up comprehensive monitoring
4. **Fallback**: Configure backup backends
### Development
1. **Rapid iteration**: Use smaller models
2. **Resource monitoring**: Track usage patterns
3. **Configuration testing**: Validate settings
4. **Performance profiling**: Optimize bottlenecks
## Next Steps
- Learn about [Monitoring](monitoring.md) backend performance
- Explore [Troubleshooting](troubleshooting.md) guides
- Set up [Production Monitoring](monitoring.md)

docs/advanced/monitoring.md Normal file

@@ -0,0 +1,420 @@
# Monitoring
Comprehensive monitoring setup for LlamaCtl in production environments.
## Overview
Effective monitoring of LlamaCtl involves tracking:
- Instance health and performance
- System resource usage
- API response times
- Error rates and alerts
## Built-in Monitoring
### Health Checks
LlamaCtl provides built-in health monitoring:
```bash
# Check overall system health
curl http://localhost:8080/api/system/health
# Check specific instance health
curl http://localhost:8080/api/instances/{name}/health
```
### Metrics Endpoint
Access Prometheus-compatible metrics:
```bash
curl http://localhost:8080/metrics
```
**Available Metrics:**
- `llamactl_instances_total`: Total number of instances
- `llamactl_instances_running`: Number of running instances
- `llamactl_instance_memory_bytes`: Instance memory usage
- `llamactl_instance_cpu_percent`: Instance CPU usage
- `llamactl_api_requests_total`: Total API requests
- `llamactl_api_request_duration_seconds`: API response times
## Prometheus Integration
### Configuration
Add LlamaCtl as a Prometheus target:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'llamactl'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
```
### Custom Metrics
Enable additional metrics in LlamaCtl:
```yaml
# config.yaml
monitoring:
  enabled: true
  prometheus:
    enabled: true
    path: "/metrics"
  metrics:
    - instance_stats
    - api_performance
    - system_resources
```
## Grafana Dashboards
### LlamaCtl Dashboard
Import the official Grafana dashboard:
1. Download dashboard JSON from releases
2. Import into Grafana
3. Configure Prometheus data source
### Key Panels
**Instance Overview:**
- Instance count and status
- Resource usage per instance
- Health status indicators
**Performance Metrics:**
- API response times
- Tokens per second
- Memory usage trends
**System Resources:**
- CPU and memory utilization
- Disk I/O and network usage
- GPU utilization (if applicable)
### Custom Queries
**Instance Uptime:**
```promql
(time() - llamactl_instance_start_time_seconds) / 3600
```
**Memory Usage Percentage:**
```promql
(llamactl_instance_memory_bytes / llamactl_system_memory_total_bytes) * 100
```
**API Error Rate:**
```promql
rate(llamactl_api_requests_total{status=~"4.."}[5m]) / rate(llamactl_api_requests_total[5m]) * 100
```
## Alerting
### Prometheus Alerts
Configure alerts for critical conditions:
```yaml
# alerts.yml
groups:
  - name: llamactl
    rules:
      - alert: InstanceDown
        expr: llamactl_instance_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LlamaCtl instance {{ $labels.instance_name }} is down"

      - alert: HighMemoryUsage
        expr: llamactl_instance_memory_percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance_name }}"

      - alert: APIHighLatency
        expr: histogram_quantile(0.95, rate(llamactl_api_request_duration_seconds_bucket[5m])) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High API latency detected"
```
### Notification Channels
Configure alert notifications:
**Slack Integration:**
```yaml
# alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: 'LlamaCtl Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```
## Log Management
### Centralized Logging
Configure log aggregation:
```yaml
# config.yaml
logging:
  level: "info"
  output: "json"
  destinations:
    - type: "file"
      path: "/var/log/llamactl/app.log"
    - type: "syslog"
      facility: "local0"
    - type: "elasticsearch"
      url: "http://elasticsearch:9200"
```
### Log Analysis
Use ELK stack for log analysis:
**Elasticsearch Index Template:**
```json
{
"index_patterns": ["llamactl-*"],
"mappings": {
"properties": {
"timestamp": {"type": "date"},
"level": {"type": "keyword"},
"message": {"type": "text"},
"instance": {"type": "keyword"},
"component": {"type": "keyword"}
}
}
}
```
**Kibana Visualizations:**
- Log volume over time
- Error rate by instance
- Performance trends
- Resource usage patterns
## Application Performance Monitoring
### OpenTelemetry Integration
Enable distributed tracing:
```yaml
# config.yaml
telemetry:
  enabled: true
  otlp:
    endpoint: "http://jaeger:14268/api/traces"
  sampling_rate: 0.1
```
### Custom Spans
Add custom tracing to track operations:
```go
ctx, span := tracer.Start(ctx, "instance.start")
defer span.End()
// Track instance startup time
span.SetAttributes(
attribute.String("instance.name", name),
attribute.String("model.path", modelPath),
)
```
## Health Check Configuration
### Readiness Probes
Configure Kubernetes readiness probes:
```yaml
readinessProbe:
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```
### Liveness Probes
Configure liveness probes:
```yaml
livenessProbe:
  httpGet:
    path: /api/health/live
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
```
### Custom Health Checks
Implement custom health checks:
```go
func (h *HealthHandler) CustomCheck(ctx context.Context) error {
    // Check database connectivity
    if err := h.db.Ping(); err != nil {
        return fmt.Errorf("database unreachable: %w", err)
    }

    // Check instance responsiveness
    for _, instance := range h.instances {
        if !instance.IsHealthy() {
            return fmt.Errorf("instance %s unhealthy", instance.Name)
        }
    }

    return nil
}
```
## Performance Profiling
### pprof Integration
Enable Go profiling:
```yaml
# config.yaml
debug:
  pprof_enabled: true
  pprof_port: 6060
```
Access profiling endpoints:
```bash
# CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile
# Memory profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine
```
### Continuous Profiling
Set up continuous profiling with Pyroscope:
```yaml
# config.yaml
profiling:
  enabled: true
  pyroscope:
    server_address: "http://pyroscope:4040"
    application_name: "llamactl"
```
## Security Monitoring
### Audit Logging
Enable security audit logs:
```yaml
# config.yaml
audit:
  enabled: true
  log_file: "/var/log/llamactl/audit.log"
  events:
    - "auth.login"
    - "auth.logout"
    - "instance.create"
    - "instance.delete"
    - "config.update"
```
### Rate Limiting Monitoring
Track rate limiting metrics:
```bash
# Monitor rate limit hits
curl http://localhost:8080/metrics | grep rate_limit
```
## Troubleshooting Monitoring
### Common Issues
**Metrics not appearing:**
1. Check Prometheus configuration
2. Verify network connectivity
3. Review LlamaCtl logs for errors
**High memory usage:**
1. Check for memory leaks in profiles
2. Monitor garbage collection metrics
3. Review instance configurations
**Alert fatigue:**
1. Tune alert thresholds
2. Implement alert severity levels
3. Use alert routing and suppression
### Debug Tools
**Monitoring health:**
```bash
# Check monitoring endpoints
curl -v http://localhost:8080/metrics
curl -v http://localhost:8080/api/health
# Review logs
tail -f /var/log/llamactl/app.log
```
## Best Practices
### Production Monitoring
1. **Comprehensive coverage**: Monitor all critical components
2. **Appropriate alerting**: Balance sensitivity and noise
3. **Regular review**: Analyze trends and patterns
4. **Documentation**: Maintain runbooks for alerts
### Performance Optimization
1. **Baseline establishment**: Know normal operating parameters
2. **Trend analysis**: Identify performance degradation early
3. **Capacity planning**: Monitor resource growth trends
4. **Optimization cycles**: Regular performance tuning
## Next Steps
- Set up [Troubleshooting](troubleshooting.md) procedures
- Learn about [Backend optimization](backends.md)
- Configure [Production deployment](../development/building.md)

docs/advanced/troubleshooting.md Normal file

@@ -0,0 +1,560 @@
# Troubleshooting
Common issues and solutions for LlamaCtl deployment and operation.
## Installation Issues
### Binary Not Found
**Problem:** `llamactl: command not found`
**Solutions:**
1. Verify the binary is in your PATH:
```bash
echo $PATH
which llamactl
```
2. Add to PATH or use full path:
```bash
export PATH=$PATH:/path/to/llamactl
# or
/full/path/to/llamactl
```
3. Check binary permissions:
```bash
chmod +x llamactl
```
### Permission Denied
**Problem:** Permission errors when starting LlamaCtl
**Solutions:**
1. Check file permissions:
```bash
ls -la llamactl
chmod +x llamactl
```
2. Verify directory permissions:
```bash
# Check models directory
ls -la /path/to/models/
# Check logs directory
sudo mkdir -p /var/log/llamactl
sudo chown $USER:$USER /var/log/llamactl
```
3. Run with appropriate user:
```bash
# Don't run as root unless necessary
sudo -u llamactl ./llamactl
```
## Startup Issues
### Port Already in Use
**Problem:** `bind: address already in use`
**Solutions:**
1. Find process using the port:
```bash
sudo netstat -tulpn | grep :8080
# or
sudo lsof -i :8080
```
2. Kill the conflicting process:
```bash
sudo kill -9 <PID>
```
3. Use a different port:
```bash
llamactl --port 8081
```
### Configuration Errors
**Problem:** Invalid configuration preventing startup
**Solutions:**
1. Validate configuration file:
```bash
llamactl --config /path/to/config.yaml --validate
```
2. Check YAML syntax:
```bash
yamllint config.yaml
```
3. Use minimal configuration:
```yaml
server:
host: "localhost"
port: 8080
```
## Instance Management Issues
### Model Loading Failures
**Problem:** Instance fails to start with model loading errors
**Diagnostic Steps:**
1. Check model file exists:
```bash
ls -la /path/to/model.gguf
file /path/to/model.gguf
```
2. Verify model format:
```bash
# Check if it's a valid GGUF file
hexdump -C /path/to/model.gguf | head -5
```
3. Test with llama.cpp directly:
```bash
llama-server --model /path/to/model.gguf --port 8081
```
**Common Solutions:**
- **Corrupted model:** Re-download the model file
- **Wrong format:** Ensure the model is in GGUF format (see the format check sketch below)
- **Insufficient memory:** Reduce context size or use smaller model
- **Path issues:** Use absolute paths, check file permissions
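If you want to script the format check, valid GGUF files start with the 4-byte magic `GGUF`. A small Go sketch (the model path is a placeholder):
```go
package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open("/path/to/model.gguf") // placeholder path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Read the first four bytes and compare them against the GGUF magic.
    magic := make([]byte, 4)
    if _, err := io.ReadFull(f, magic); err != nil {
        panic(err)
    }
    if string(magic) == "GGUF" {
        fmt.Println("looks like a GGUF file")
    } else {
        fmt.Printf("unexpected magic %q, this is probably not a GGUF file\n", magic)
    }
}
```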
### Memory Issues
**Problem:** Out of memory errors or system becomes unresponsive
**Diagnostic Steps:**
1. Check system memory:
```bash
free -h
cat /proc/meminfo
```
2. Monitor memory usage:
```bash
top -p $(pgrep llamactl)
```
3. Check instance memory requirements:
```bash
curl http://localhost:8080/api/instances/{name}/stats
```
**Solutions:**
1. **Reduce context size:**
```json
{
"options": {
"context_size": 1024
}
}
```
2. **Enable memory mapping:**
```json
{
"options": {
"no_mmap": false
}
}
```
3. **Use quantized models:**
- Try Q4_K_M instead of higher precision models
- Use smaller model variants (7B instead of 13B)
### GPU Issues
**Problem:** GPU not detected or not being used
**Diagnostic Steps:**
1. Check GPU availability:
```bash
nvidia-smi
```
2. Verify CUDA installation:
```bash
nvcc --version
```
3. Check llama.cpp GPU support:
```bash
llama-server --help | grep -i gpu
```
**Solutions:**
1. **Install CUDA drivers:**
```bash
sudo apt update
sudo apt install nvidia-driver-470 nvidia-cuda-toolkit
```
2. **Rebuild llama.cpp with GPU support:**
```bash
cmake -DLLAMA_CUBLAS=ON ..
make
```
3. **Configure GPU layers:**
```json
{
"options": {
"gpu_layers": 35
}
}
```
## Performance Issues
### Slow Response Times
**Problem:** API responses are slow or timeouts occur
**Diagnostic Steps:**
1. Check API response times:
```bash
time curl http://localhost:8080/api/instances
```
2. Monitor system resources:
```bash
htop
iotop
```
3. Check instance logs:
```bash
curl http://localhost:8080/api/instances/{name}/logs
```
**Solutions:**
1. **Optimize thread count:**
```json
{
"options": {
"threads": 6
}
}
```
2. **Adjust batch size:**
```json
{
"options": {
"batch_size": 512
}
}
```
3. **Enable GPU acceleration:**
```json
{
"options": {
"gpu_layers": 35
}
}
```
### High CPU Usage
**Problem:** LlamaCtl consuming excessive CPU
**Diagnostic Steps:**
1. Identify CPU-intensive processes:
```bash
top -p $(pgrep -f llamactl)
```
2. Check thread allocation:
```bash
curl http://localhost:8080/api/instances/{name}/config
```
**Solutions:**
1. **Reduce thread count:**
```json
{
"options": {
"threads": 4
}
}
```
2. **Limit concurrent instances:**
```yaml
limits:
  max_instances: 3
```
## Network Issues
### Connection Refused
**Problem:** Cannot connect to LlamaCtl web interface
**Diagnostic Steps:**
1. Check if service is running:
```bash
ps aux | grep llamactl
```
2. Verify port binding:
```bash
netstat -tulpn | grep :8080
```
3. Test local connectivity:
```bash
curl http://localhost:8080/api/health
```
**Solutions:**
1. **Check firewall settings:**
```bash
sudo ufw status
sudo ufw allow 8080
```
2. **Bind to correct interface:**
```yaml
server:
host: "0.0.0.0" # Instead of "localhost"
port: 8080
```
### CORS Errors
**Problem:** Web UI shows CORS errors in browser console
**Solutions:**
1. **Enable CORS in configuration:**
```yaml
server:
  cors_enabled: true
  cors_origins:
    - "http://localhost:3000"
    - "https://yourdomain.com"
```
2. **Use reverse proxy:**
```nginx
server {
    listen 80;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
## Database Issues
### Startup Database Errors
**Problem:** Database connection failures on startup
**Diagnostic Steps:**
1. Check database service:
```bash
systemctl status postgresql
# or
systemctl status mysql
```
2. Test database connectivity:
```bash
psql -h localhost -U llamactl -d llamactl
```
**Solutions:**
1. **Start database service:**
```bash
sudo systemctl start postgresql
sudo systemctl enable postgresql
```
2. **Create database and user:**
```sql
CREATE DATABASE llamactl;
CREATE USER llamactl WITH PASSWORD 'password';
GRANT ALL PRIVILEGES ON DATABASE llamactl TO llamactl;
```
## Web UI Issues
### Blank Page or Loading Issues
**Problem:** Web UI doesn't load or shows blank page
**Diagnostic Steps:**
1. Check browser console for errors (F12)
2. Verify API connectivity:
```bash
curl http://localhost:8080/api/system/status
```
3. Check static file serving:
```bash
curl http://localhost:8080/
```
**Solutions:**
1. **Clear browser cache**
2. **Try different browser**
3. **Check for JavaScript errors in console**
4. **Verify API endpoint accessibility**
### Authentication Issues
**Problem:** Unable to login or authentication failures
**Diagnostic Steps:**
1. Check authentication configuration:
```bash
curl http://localhost:8080/api/config | jq .auth
```
2. Verify user credentials:
```bash
# Test login endpoint
curl -X POST http://localhost:8080/api/auth/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"password"}'
```
**Solutions:**
1. **Reset admin password:**
```bash
llamactl --reset-admin-password
```
2. **Disable authentication temporarily:**
```yaml
auth:
  enabled: false
```
## Log Analysis
### Enable Debug Logging
For detailed troubleshooting, enable debug logging:
```yaml
logging:
level: "debug"
output: "/var/log/llamactl/debug.log"
```
### Key Log Patterns
Look for these patterns in logs:
**Startup issues:**
```
ERRO Failed to start server
ERRO Database connection failed
ERRO Port binding failed
```
**Instance issues:**
```
ERRO Failed to start instance
ERRO Model loading failed
ERRO Process crashed
```
**Performance issues:**
```
WARN High memory usage detected
WARN Request timeout
WARN Resource limit exceeded
```
## Getting Help
### Collecting Information
When seeking help, provide:
1. **System information:**
```bash
uname -a
llamactl --version
```
2. **Configuration:**
```bash
llamactl --config-dump
```
3. **Logs:**
```bash
tail -100 /var/log/llamactl/app.log
```
4. **Error details:**
- Exact error messages
- Steps to reproduce
- Environment details
### Support Channels
- **GitHub Issues:** Report bugs and feature requests
- **Documentation:** Check this documentation first
- **Community:** Join discussions in GitHub Discussions
## Preventive Measures
### Health Monitoring
Set up monitoring to catch issues early:
```bash
# Regular health checks
*/5 * * * * curl -f http://localhost:8080/api/health || alert
```
### Resource Monitoring
Monitor system resources:
```bash
# Disk space monitoring
df -h /var/log/llamactl/
df -h /path/to/models/
# Memory monitoring
free -h
```
### Backup Configuration
Regular configuration backups:
```bash
# Backup configuration
cp ~/.llamactl/config.yaml ~/.llamactl/config.yaml.backup
# Backup instance configurations
curl http://localhost:8080/api/instances > instances-backup.json
```
## Next Steps
- Set up [Monitoring](monitoring.md) to prevent issues
- Learn about [Advanced Configuration](backends.md)
- Review [Best Practices](../development/contributing.md)