Remove misleading advanced section

2025-08-31 16:04:09 +02:00
parent 92af14b350
commit b08f15c5d0
9 changed files with 3 additions and 779 deletions


@@ -1,316 +0,0 @@
# Backends
Llamactl supports multiple backends for running large language models. This guide covers the available backends and their configuration.
## Llama.cpp Backend
The primary backend for Llamactl, providing robust support for GGUF models.
### Features
- **GGUF Support**: Native support for GGUF model format
- **GPU Acceleration**: CUDA, OpenCL, and Metal support
- **Memory Optimization**: Efficient memory usage and mapping
- **Multi-threading**: Configurable CPU thread utilization
- **Quantization**: Support for various quantization levels
### Configuration
```yaml
backends:
  llamacpp:
    binary_path: "/usr/local/bin/llama-server"
    default_options:
      threads: 4
      context_size: 2048
      batch_size: 512
    gpu:
      enabled: true
      layers: 35
```
### Supported Options
| Option | Description | Default |
|--------|-------------|---------|
| `threads` | Number of CPU threads | 4 |
| `context_size` | Context window size | 2048 |
| `batch_size` | Batch size for processing | 512 |
| `gpu_layers` | Layers to offload to GPU | 0 |
| `memory_lock` | Lock model in memory | false |
| `no_mmap` | Disable memory mapping | false |
| `rope_freq_base` | RoPE frequency base | 10000 |
| `rope_freq_scale` | RoPE frequency scale | 1.0 |
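For example, here is a minimal sketch of setting several of these options when creating an instance over the HTTP API. The endpoint and payload shape follow the JSON examples in this guide; treat the exact schema as an assumption for your version.
```go
package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    // Hypothetical payload; field names mirror the JSON examples in this guide.
    body := []byte(`{
        "name": "tuned-instance",
        "model_path": "/models/llama-2-7b.gguf",
        "port": 8081,
        "options": {"threads": 8, "context_size": 4096, "gpu_layers": 20}
    }`)

    resp, err := http.Post("http://localhost:8080/api/instances",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("create instance:", resp.Status)
}
```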
### GPU Acceleration
#### CUDA Setup
```bash
# Install CUDA toolkit
sudo apt update
sudo apt install nvidia-cuda-toolkit
# Verify CUDA installation
nvcc --version
nvidia-smi
```
#### Configuration for GPU
```json
{
  "name": "gpu-accelerated",
  "model_path": "/models/llama-2-13b.gguf",
  "port": 8081,
  "options": {
    "gpu_layers": 35,
    "threads": 2,
    "context_size": 4096
  }
}
```
### Performance Tuning
#### Memory Optimization
```yaml
# For limited memory systems
options:
  context_size: 1024
  batch_size: 256
  no_mmap: true
  memory_lock: false

# For high-memory systems
options:
  context_size: 8192
  batch_size: 1024
  memory_lock: true
  no_mmap: false
```
#### CPU Optimization
```yaml
# Match thread count to CPU cores
# For 8-core CPU:
options:
  threads: 6  # Leave 2 cores for system

# For high-performance CPUs:
options:
  threads: 16
  batch_size: 1024
```
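If you script instance setup, a tiny helper can derive a starting `threads` value from the host CPU. This is a heuristic sketch, not part of the llamactl API:
```go
package main

import (
    "fmt"
    "runtime"
)

// suggestThreads returns a starting point for the threads option:
// all logical cores minus headroom reserved for the OS and llamactl.
func suggestThreads(reserve int) int {
    n := runtime.NumCPU() - reserve
    if n < 1 {
        return 1
    }
    return n
}

func main() {
    // On an 8-core machine this prints 6, matching the example above.
    fmt.Println("suggested threads:", suggestThreads(2))
}
```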
## Future Backends
Llamactl is designed to support multiple backends. Planned additions:
### vLLM Backend
High-throughput inference engine optimized for serving:
- **Features**: Continuous batching, PagedAttention, response streaming
- **Models**: Hugging Face Transformers models
- **Scaling**: Tensor-parallel and multi-GPU serving
### TensorRT-LLM Backend
NVIDIA's optimized inference engine:
- **Features**: Highly optimized inference kernels for NVIDIA GPUs
- **Models**: Requires models compiled into TensorRT engines
- **Deployment**: Production-ready inference
### Ollama Backend
Integration with Ollama for easy model management:
- **Features**: Simplified model downloading
- **Models**: Large model library
- **Integration**: Seamless model switching
## Backend Selection
### Automatic Detection
Llamactl can automatically detect the best backend:
```yaml
backends:
  auto_detect: true
  preference_order:
    - "llamacpp"
    - "vllm"
    - "tensorrt"
```
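Conceptually, the preference order is a first-available scan. The following sketch shows the selection logic the config above implies; it is an illustration, not llamactl's actual implementation:
```go
package main

import (
    "errors"
    "fmt"
)

// pickBackend returns the first backend in the preference order
// that is available on this host.
func pickBackend(available map[string]bool, order []string) (string, error) {
    for _, name := range order {
        if available[name] {
            return name, nil
        }
    }
    return "", errors.New("no supported backend available")
}

func main() {
    avail := map[string]bool{"llamacpp": true} // e.g. only llama-server is installed
    backend, err := pickBackend(avail, []string{"llamacpp", "vllm", "tensorrt"})
    if err != nil {
        panic(err)
    }
    fmt.Println("selected backend:", backend)
}
```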
### Manual Selection
Force a specific backend for an instance:
```json
{
  "name": "manual-backend",
  "backend": "llamacpp",
  "model_path": "/models/model.gguf",
  "port": 8081
}
```
## Backend-Specific Features
### Llama.cpp Features
#### Model Formats
- **GGUF**: Primary format, best compatibility
- **GGML**: Legacy format (limited support)
#### Quantization Levels
- `Q2_K`: Smallest size, lower quality
- `Q4_K_M`: Balanced size and quality
- `Q5_K_M`: Higher quality, larger size
- `Q6_K`: Near-original quality
- `Q8_0`: Minimal loss, largest size
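A useful way to compare levels is bits per weight. The sketch below estimates file size from parameter count using rough, approximate figures; actual sizes vary by model architecture and llama.cpp version:
```go
package main

import "fmt"

// Rough bits-per-weight figures for common llama.cpp quantizations.
// These are rules of thumb, not exact values.
var bitsPerWeight = map[string]float64{
    "Q2_K":   2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K":   6.6,
    "Q8_0":   8.5,
}

func main() {
    const params = 7e9 // 7B-parameter model
    for _, q := range []string{"Q2_K", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"} {
        gb := params * bitsPerWeight[q] / 8 / 1e9
        fmt.Printf("%-7s ~%.1f GB\n", q, gb)
    }
}
```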
#### Advanced Options
```yaml
advanced:
  rope_scaling:
    type: "linear"
    factor: 2.0
  attention:
    flash_attention: true
    grouped_query: true
```
## Monitoring Backend Performance
### Metrics Collection
Monitor backend-specific metrics:
```bash
# Get backend statistics
curl http://localhost:8080/api/instances/my-instance/backend/stats
```
**Response:**
```json
{
  "backend": "llamacpp",
  "version": "b1234",
  "metrics": {
    "tokens_per_second": 15.2,
    "memory_usage": 4294967296,
    "gpu_utilization": 85.5,
    "context_usage": 75.0
  }
}
```
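To consume these stats from code rather than curl, decode only the fields you need. A sketch that takes the response shape above as given:
```go
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type backendStats struct {
    Backend string `json:"backend"`
    Metrics struct {
        TokensPerSecond float64 `json:"tokens_per_second"`
        MemoryUsage     int64   `json:"memory_usage"`
    } `json:"metrics"`
}

func main() {
    resp, err := http.Get("http://localhost:8080/api/instances/my-instance/backend/stats")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var s backendStats
    if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
        panic(err)
    }
    fmt.Printf("%s: %.1f tok/s, %d MiB\n",
        s.Backend, s.Metrics.TokensPerSecond, s.Metrics.MemoryUsage/(1<<20))
}
```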
### Performance Optimization
#### Benchmark Different Configurations
```bash
# Test various thread counts
for threads in 2 4 8 16; do
  echo "Testing $threads threads"
  curl -X PUT http://localhost:8080/api/instances/benchmark \
    -H "Content-Type: application/json" \
    -d "{\"options\": {\"threads\": $threads}}"
  # Run your benchmark workload here and record the results
done
```
#### Memory Usage Optimization
```bash
# Monitor memory usage
watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
```
## Troubleshooting Backends
### Common Llama.cpp Issues
**Model won't load:**
```bash
# Check model file
file /path/to/model.gguf
# Verify format
llama-server --model /path/to/model.gguf --dry-run
```
**GPU not detected:**
```bash
# Check CUDA installation
nvidia-smi
# Verify llama.cpp GPU support
llama-server --help | grep -i gpu
```
**Performance issues:**
```bash
# Check system resources
htop
nvidia-smi
# Verify configuration
curl http://localhost:8080/api/instances/my-instance/config
```
## Custom Backend Development
### Backend Interface
Implement the backend interface for custom backends:
```go
type Backend interface {
    Start(config InstanceConfig) error
    Stop(instance *Instance) error
    Health(instance *Instance) (*HealthStatus, error)
    Stats(instance *Instance) (*Stats, error)
}
```
### Registration
Register your custom backend:
```go
func init() {
    backends.Register("custom", &CustomBackend{})
}
```
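A minimal skeleton that satisfies the interface might look like the following; the types come from the interface definition above, and the method bodies are placeholders to fill in:
```go
// CustomBackend is a stub implementation of the Backend interface above.
type CustomBackend struct{}

func (b *CustomBackend) Start(config InstanceConfig) error {
    // Launch the inference server process here.
    return nil
}

func (b *CustomBackend) Stop(instance *Instance) error {
    // Terminate the process and release its resources.
    return nil
}

func (b *CustomBackend) Health(instance *Instance) (*HealthStatus, error) {
    // Probe the server, e.g. an HTTP GET against its health endpoint.
    return &HealthStatus{}, nil
}

func (b *CustomBackend) Stats(instance *Instance) (*Stats, error) {
    // Collect throughput and memory figures from the server.
    return &Stats{}, nil
}
```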
## Best Practices
### Production Deployments
1. **Resource allocation**: Plan for peak usage
2. **Backend selection**: Choose based on requirements
3. **Monitoring**: Set up comprehensive monitoring
4. **Fallback**: Configure backup backends
### Development
1. **Rapid iteration**: Use smaller models
2. **Resource monitoring**: Track usage patterns
3. **Configuration testing**: Validate settings
4. **Performance profiling**: Optimize bottlenecks
## Next Steps
- Learn about [Monitoring](monitoring.md) backend performance
- Explore [Troubleshooting](troubleshooting.md) guides
- Set up [Production Monitoring](monitoring.md)


@@ -1,420 +0,0 @@
# Monitoring
Comprehensive monitoring setup for Llamactl in production environments.
## Overview
Effective monitoring of Llamactl involves tracking:
- Instance health and performance
- System resource usage
- API response times
- Error rates and alerts
## Built-in Monitoring
### Health Checks
Llamactl provides built-in health monitoring:
```bash
# Check overall system health
curl http://localhost:8080/api/system/health
# Check specific instance health
curl http://localhost:8080/api/instances/{name}/health
```
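In automation you typically poll until an instance reports healthy before routing traffic to it. A sketch that only assumes the health endpoint shown above returns HTTP 200 when healthy:
```go
package main

import (
    "fmt"
    "net/http"
    "time"
)

// waitHealthy polls an instance's health endpoint until it returns
// HTTP 200 or the timeout elapses.
func waitHealthy(name string, timeout time.Duration) error {
    url := fmt.Sprintf("http://localhost:8080/api/instances/%s/health", name)
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("instance %s not healthy after %s", name, timeout)
}

func main() {
    if err := waitHealthy("my-instance", time.Minute); err != nil {
        panic(err)
    }
    fmt.Println("instance is healthy")
}
```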
### Metrics Endpoint
Access Prometheus-compatible metrics:
```bash
curl http://localhost:8080/metrics
```
**Available Metrics:**
- `llamactl_instances_total`: Total number of instances
- `llamactl_instances_running`: Number of running instances
- `llamactl_instance_memory_bytes`: Instance memory usage
- `llamactl_instance_cpu_percent`: Instance CPU usage
- `llamactl_api_requests_total`: Total API requests
- `llamactl_api_request_duration_seconds`: API response times
## Prometheus Integration
### Configuration
Add Llamactl as a Prometheus target:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'llamactl'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
```
### Custom Metrics
Enable additional metrics in Llamactl:
```yaml
# config.yaml
monitoring:
  enabled: true
  prometheus:
    enabled: true
    path: "/metrics"
  metrics:
    - instance_stats
    - api_performance
    - system_resources
```
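If you instrument your own services alongside these built-ins, the standard Prometheus Go client is the usual route. A minimal sketch, unrelated to llamactl's internal metric names:
```go
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A custom counter registered on the default registry.
var requestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "Total requests handled, by outcome.",
    },
    []string{"status"},
)

func main() {
    requestsTotal.WithLabelValues("ok").Inc()
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9091", nil))
}
```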
## Grafana Dashboards
### Llamactl Dashboard
Import the official Grafana dashboard:
1. Download dashboard JSON from releases
2. Import into Grafana
3. Configure Prometheus data source
### Key Panels
**Instance Overview:**
- Instance count and status
- Resource usage per instance
- Health status indicators
**Performance Metrics:**
- API response times
- Tokens per second
- Memory usage trends
**System Resources:**
- CPU and memory utilization
- Disk I/O and network usage
- GPU utilization (if applicable)
### Custom Queries
**Instance Uptime:**
```promql
(time() - llamactl_instance_start_time_seconds) / 3600
```
**Memory Usage Percentage:**
```promql
(llamactl_instance_memory_bytes / llamactl_system_memory_total_bytes) * 100
```
**API Error Rate:**
```promql
rate(llamactl_api_requests_total{status=~"4.."}[5m]) / rate(llamactl_api_requests_total[5m]) * 100
```
## Alerting
### Prometheus Alerts
Configure alerts for critical conditions:
```yaml
# alerts.yml
groups:
  - name: llamactl
    rules:
      - alert: InstanceDown
        expr: llamactl_instance_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Llamactl instance {{ $labels.instance_name }} is down"

      - alert: HighMemoryUsage
        expr: llamactl_instance_memory_percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance_name }}"

      - alert: APIHighLatency
        expr: histogram_quantile(0.95, rate(llamactl_api_request_duration_seconds_bucket[5m])) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High API latency detected"
```
### Notification Channels
Configure alert notifications:
**Slack Integration:**
```yaml
# alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: 'Llamactl Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```
## Log Management
### Centralized Logging
Configure log aggregation:
```yaml
# config.yaml
logging:
  level: "info"
  output: "json"
  destinations:
    - type: "file"
      path: "/var/log/llamactl/app.log"
    - type: "syslog"
      facility: "local0"
    - type: "elasticsearch"
      url: "http://elasticsearch:9200"
```
### Log Analysis
Use ELK stack for log analysis:
**Elasticsearch Index Template:**
```json
{
  "index_patterns": ["llamactl-*"],
  "mappings": {
    "properties": {
      "timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "message": {"type": "text"},
      "instance": {"type": "keyword"},
      "component": {"type": "keyword"}
    }
  }
}
```
**Kibana Visualizations:**
- Log volume over time
- Error rate by instance
- Performance trends
- Resource usage patterns
## Application Performance Monitoring
### OpenTelemetry Integration
Enable distributed tracing:
```yaml
# config.yaml
telemetry:
  enabled: true
  otlp:
    endpoint: "http://jaeger:14268/api/traces"
    sampling_rate: 0.1
```
### Custom Spans
Add custom tracing to track operations:
```go
// Assumes an initialized OpenTelemetry tracer (trace.Tracer) in scope.
ctx, span := tracer.Start(ctx, "instance.start")
defer span.End()

// Attach instance metadata so traces can be filtered per instance
span.SetAttributes(
    attribute.String("instance.name", name),
    attribute.String("model.path", modelPath),
)
```
## Health Check Configuration
### Readiness Probes
Configure Kubernetes readiness probes:
```yaml
readinessProbe:
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```
### Liveness Probes
Configure liveness probes:
```yaml
livenessProbe:
  httpGet:
    path: /api/health/live
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
```
### Custom Health Checks
Implement custom health checks:
```go
func (h *HealthHandler) CustomCheck(ctx context.Context) error {
    // Check database connectivity
    if err := h.db.Ping(); err != nil {
        return fmt.Errorf("database unreachable: %w", err)
    }

    // Check instance responsiveness
    for _, instance := range h.instances {
        if !instance.IsHealthy() {
            return fmt.Errorf("instance %s unhealthy", instance.Name)
        }
    }

    return nil
}
```
## Performance Profiling
### pprof Integration
Enable Go profiling:
```yaml
# config.yaml
debug:
  pprof_enabled: true
  pprof_port: 6060
```
Access profiling endpoints:
```bash
# CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile
# Memory profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine
```
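On the service side, exposing these endpoints in Go takes only the standard library. A sketch of what a setting like `pprof_enabled` would plausibly wire up:
```go
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Bind to localhost only: profiling endpoints must never be public.
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```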
### Continuous Profiling
Set up continuous profiling with Pyroscope:
```yaml
# config.yaml
profiling:
  enabled: true
  pyroscope:
    server_address: "http://pyroscope:4040"
    application_name: "llamactl"
```
## Security Monitoring
### Audit Logging
Enable security audit logs:
```yaml
# config.yaml
audit:
  enabled: true
  log_file: "/var/log/llamactl/audit.log"
  events:
    - "auth.login"
    - "auth.logout"
    - "instance.create"
    - "instance.delete"
    - "config.update"
```
### Rate Limiting Monitoring
Track rate limiting metrics:
```bash
# Monitor rate limit hits
curl http://localhost:8080/metrics | grep rate_limit
```
## Troubleshooting Monitoring
### Common Issues
**Metrics not appearing:**
1. Check Prometheus configuration
2. Verify network connectivity
3. Review Llamactl logs for errors
**High memory usage:**
1. Check for memory leaks in profiles
2. Monitor garbage collection metrics
3. Review instance configurations
**Alert fatigue:**
1. Tune alert thresholds
2. Implement alert severity levels
3. Use alert routing and suppression
### Debug Tools
**Monitoring health:**
```bash
# Check monitoring endpoints
curl -v http://localhost:8080/metrics
curl -v http://localhost:8080/api/health
# Review logs
tail -f /var/log/llamactl/app.log
```
## Best Practices
### Production Monitoring
1. **Comprehensive coverage**: Monitor all critical components
2. **Appropriate alerting**: Balance sensitivity and noise
3. **Regular review**: Analyze trends and patterns
4. **Documentation**: Maintain runbooks for alerts
### Performance Optimization
1. **Baseline establishment**: Know normal operating parameters
2. **Trend analysis**: Identify performance degradation early
3. **Capacity planning**: Monitor resource growth trends
4. **Optimization cycles**: Regular performance tuning
## Next Steps
- Set up [Troubleshooting](troubleshooting.md) procedures
- Learn about [Backend optimization](backends.md)
- Configure [Production deployment](../development/building.md)


@@ -148,15 +148,3 @@ llamactl --help
```
You can also override configuration using command line flags when starting llamactl.
## Next Steps
- Learn about [Managing Instances](../user-guide/managing-instances.md)
- Explore [Advanced Configuration](../advanced/backends.md)
- Set up [Monitoring](../advanced/monitoring.md)


@@ -40,14 +40,13 @@ Llamactl is designed to simplify the deployment and management of llama-server i
- [Web UI Guide](user-guide/web-ui.md) - Learn to use the web interface
- [Managing Instances](user-guide/managing-instances.md) - Instance lifecycle management
- [API Reference](user-guide/api-reference.md) - Complete API documentation
- [Monitoring](advanced/monitoring.md) - Health checks and monitoring
- [Backends](advanced/backends.md) - Backend configuration options
## Getting Help
If you need help or have questions:
- Check the [Troubleshooting](advanced/troubleshooting.md) guide
- Check the [Troubleshooting](user-guide/troubleshooting.md) guide
- Visit the [GitHub repository](https://github.com/lordmathis/llamactl)
- Review the [Configuration Guide](getting-started/configuration.md) for advanced settings


@@ -462,9 +462,3 @@ curl -X POST http://localhost:8080/api/instances/example/stop
# Delete instance
curl -X DELETE http://localhost:8080/api/instances/example
```
## Next Steps
- Learn about [Managing Instances](managing-instances.md) in detail
- Explore [Advanced Configuration](../advanced/backends.md)
- Set up [Monitoring](../advanced/monitoring.md) for production use


@@ -163,9 +163,3 @@ curl -X POST http://localhost:8080/api/instances/stop-all
# Get status of all instances
curl http://localhost:8080/api/instances
```
## Next Steps
- Learn about the [Web UI](web-ui.md) interface
- Explore the complete [API Reference](api-reference.md)
- Set up [Monitoring](../advanced/monitoring.md) for production use


@@ -552,9 +552,3 @@ cp ~/.llamactl/config.yaml ~/.llamactl/config.yaml.backup
# Backup instance configurations
curl http://localhost:8080/api/instances > instances-backup.json
```
## Next Steps
- Set up [Monitoring](monitoring.md) to prevent issues
- Learn about [Advanced Configuration](backends.md)
- Review [Best Practices](../development/contributing.md)


@@ -208,9 +208,3 @@ Some features may be limited on mobile:
- Log viewing (use horizontal scrolling)
- Complex configuration forms
- File browser functionality
## Next Steps
- Learn about [API Reference](api-reference.md) for programmatic access
- Set up [Monitoring](../advanced/monitoring.md) for production use
- Explore [Advanced Configuration](../advanced/backends.md) options