diff --git a/docs/advanced/backends.md b/docs/advanced/backends.md
deleted file mode 100644
index 0491bc4..0000000
--- a/docs/advanced/backends.md
+++ /dev/null
@@ -1,316 +0,0 @@
-# Backends
-
-Llamactl supports multiple backends for running large language models. This guide covers the available backends and their configuration.
-
-## Llama.cpp Backend
-
-The primary backend for Llamactl, providing robust support for GGUF models.
-
-### Features
-
-- **GGUF Support**: Native support for GGUF model format
-- **GPU Acceleration**: CUDA, OpenCL, and Metal support
-- **Memory Optimization**: Efficient memory usage and mapping
-- **Multi-threading**: Configurable CPU thread utilization
-- **Quantization**: Support for various quantization levels
-
-### Configuration
-
-```yaml
-backends:
-  llamacpp:
-    binary_path: "/usr/local/bin/llama-server"
-    default_options:
-      threads: 4
-      context_size: 2048
-      batch_size: 512
-    gpu:
-      enabled: true
-      layers: 35
-```
-
-### Supported Options
-
-| Option | Description | Default |
-|--------|-------------|---------|
-| `threads` | Number of CPU threads | 4 |
-| `context_size` | Context window size | 2048 |
-| `batch_size` | Batch size for processing | 512 |
-| `gpu_layers` | Layers to offload to GPU | 0 |
-| `memory_lock` | Lock model in memory | false |
-| `no_mmap` | Disable memory mapping | false |
-| `rope_freq_base` | RoPE frequency base | 10000 |
-| `rope_freq_scale` | RoPE frequency scale | 1.0 |
-
-### GPU Acceleration
-
-#### CUDA Setup
-
-```bash
-# Install CUDA toolkit
-sudo apt update
-sudo apt install nvidia-cuda-toolkit
-
-# Verify CUDA installation
-nvcc --version
-nvidia-smi
-```
-
-#### Configuration for GPU
-
-```json
-{
-  "name": "gpu-accelerated",
-  "model_path": "/models/llama-2-13b.gguf",
-  "port": 8081,
-  "options": {
-    "gpu_layers": 35,
-    "threads": 2,
-    "context_size": 4096
-  }
-}
-```
-
-### Performance Tuning
-
-#### Memory Optimization
-
-```yaml
-# For limited memory systems
-options:
-  context_size: 1024
-  batch_size: 256
-  no_mmap: true
-  memory_lock: false
-
-# For high-memory systems
-options:
-  context_size: 8192
-  batch_size: 1024
-  memory_lock: true
-  no_mmap: false
-```
-
-#### CPU Optimization
-
-```yaml
-# Match thread count to CPU cores
-# For 8-core CPU:
-options:
-  threads: 6  # Leave 2 cores for system
-
-# For high-performance CPUs:
-options:
-  threads: 16
-  batch_size: 1024
-```
-
-## Future Backends
-
-Llamactl is designed to support multiple backends. Planned additions:
-
-### vLLM Backend
-
-High-performance inference engine optimized for serving:
-
-- **Features**: Fast inference, batching, streaming
-- **Models**: Supports various model formats
-- **Scaling**: Horizontal scaling support
-
-### TensorRT-LLM Backend
-
-NVIDIA's optimized inference engine:
-
-- **Features**: Maximum GPU performance
-- **Models**: Optimized for NVIDIA GPUs
-- **Deployment**: Production-ready inference
-
-### Ollama Backend
-
-Integration with Ollama for easy model management:
-
-- **Features**: Simplified model downloading
-- **Models**: Large model library
-- **Integration**: Seamless model switching
-
-## Backend Selection
-
-### Automatic Detection
-
-Llamactl can automatically detect the best backend:
-
-```yaml
-backends:
-  auto_detect: true
-  preference_order:
-    - "llamacpp"
-    - "vllm"
-    - "tensorrt"
-```
-
-### Manual Selection
-
-Force a specific backend for an instance:
-
-```json
-{
-  "name": "manual-backend",
-  "backend": "llamacpp",
-  "model_path": "/models/model.gguf",
-  "port": 8081
-}
-```
-
-## Backend-Specific Features
-
-### Llama.cpp Features
-
-#### Model Formats
-
-- **GGUF**: Primary format, best compatibility
-- **GGML**: Legacy format (limited support)
-
-#### Quantization Levels
-
-- `Q2_K`: Smallest size, lower quality
-- `Q4_K_M`: Balanced size and quality
-- `Q5_K_M`: Higher quality, larger size
-- `Q6_K`: Near-original quality
-- `Q8_0`: Minimal loss, largest size
-
-#### Advanced Options
-
-```yaml
-advanced:
-  rope_scaling:
-    type: "linear"
-    factor: 2.0
-  attention:
-    flash_attention: true
-    grouped_query: true
-```
-
-## Monitoring Backend Performance
-
-### Metrics Collection
-
-Monitor backend-specific metrics:
-
-```bash
-# Get backend statistics
-curl http://localhost:8080/api/instances/my-instance/backend/stats
-```
-
-**Response:**
-```json
-{
-  "backend": "llamacpp",
-  "version": "b1234",
-  "metrics": {
-    "tokens_per_second": 15.2,
-    "memory_usage": 4294967296,
-    "gpu_utilization": 85.5,
-    "context_usage": 75.0
-  }
-}
-```
-
-### Performance Optimization
-
-#### Benchmark Different Configurations
-
-```bash
-# Test various thread counts
-for threads in 2 4 8 16; do
-  echo "Testing $threads threads"
-  curl -X PUT http://localhost:8080/api/instances/benchmark \
-    -d "{\"options\": {\"threads\": $threads}}"
-  # Run performance test
-done
-```
-
-#### Memory Usage Optimization
-
-```bash
-# Monitor memory usage
-watch -n 1 'curl -s http://localhost:8080/api/instances/my-instance/stats | jq .memory_usage'
-```
-
-## Troubleshooting Backends
-
-### Common Llama.cpp Issues
-
-**Model won't load:**
-```bash
-# Check model file
-file /path/to/model.gguf
-
-# Verify format
-llama-server --model /path/to/model.gguf --dry-run
-```
-
-**GPU not detected:**
-```bash
-# Check CUDA installation
-nvidia-smi
-
-# Verify llama.cpp GPU support
-llama-server --help | grep -i gpu
-```
-
-**Performance issues:**
-```bash
-# Check system resources
-htop
-nvidia-smi
-
-# Verify configuration
-curl http://localhost:8080/api/instances/my-instance/config
-```
-
-## Custom Backend Development
-
-### Backend Interface
-
-Implement the backend interface for custom backends:
-
-```go
-type Backend interface {
-    Start(config InstanceConfig) error
-    Stop(instance *Instance) error
-    Health(instance *Instance) (*HealthStatus, error)
-    Stats(instance *Instance) (*Stats, error)
-}
-```
-
-### Registration
-
-Register your custom backend:
-
-```go
-func init() {
-    backends.Register("custom", &CustomBackend{})
-}
-```
-
-## Best Practices
-
-### Production Deployments
-
-1. **Resource allocation**: Plan for peak usage
-2. **Backend selection**: Choose based on requirements
-3. **Monitoring**: Set up comprehensive monitoring
-4. **Fallback**: Configure backup backends
-
-### Development
-
-1. **Rapid iteration**: Use smaller models
-2. **Resource monitoring**: Track usage patterns
-3. **Configuration testing**: Validate settings
-4. **Performance profiling**: Optimize bottlenecks
-
-## Next Steps
-
-- Learn about [Monitoring](monitoring.md) backend performance
-- Explore [Troubleshooting](troubleshooting.md) guides
-- Set up [Production Monitoring](monitoring.md)
diff --git a/docs/advanced/monitoring.md b/docs/advanced/monitoring.md
deleted file mode 100644
index 71051e6..0000000
--- a/docs/advanced/monitoring.md
+++ /dev/null
@@ -1,420 +0,0 @@
-# Monitoring
-
-Comprehensive monitoring setup for Llamactl in production environments.
-
-## Overview
-
-Effective monitoring of Llamactl involves tracking:
-
-- Instance health and performance
-- System resource usage
-- API response times
-- Error rates and alerts
-
-## Built-in Monitoring
-
-### Health Checks
-
-Llamactl provides built-in health monitoring:
-
-```bash
-# Check overall system health
-curl http://localhost:8080/api/system/health
-
-# Check specific instance health
-curl http://localhost:8080/api/instances/{name}/health
-```
-
-### Metrics Endpoint
-
-Access Prometheus-compatible metrics:
-
-```bash
-curl http://localhost:8080/metrics
-```
-
-**Available Metrics:**
-- `llamactl_instances_total`: Total number of instances
-- `llamactl_instances_running`: Number of running instances
-- `llamactl_instance_memory_bytes`: Instance memory usage
-- `llamactl_instance_cpu_percent`: Instance CPU usage
-- `llamactl_api_requests_total`: Total API requests
-- `llamactl_api_request_duration_seconds`: API response times
-
-## Prometheus Integration
-
-### Configuration
-
-Add Llamactl as a Prometheus target:
-
-```yaml
-# prometheus.yml
-scrape_configs:
-  - job_name: 'llamactl'
-    static_configs:
-      - targets: ['localhost:8080']
-    metrics_path: '/metrics'
-    scrape_interval: 15s
-```
-
-### Custom Metrics
-
-Enable additional metrics in Llamactl:
-
-```yaml
-# config.yaml
-monitoring:
-  enabled: true
-  prometheus:
-    enabled: true
-    path: "/metrics"
-  metrics:
-    - instance_stats
-    - api_performance
-    - system_resources
-```
-
-## Grafana Dashboards
-
-### Llamactl Dashboard
-
-Import the official Grafana dashboard:
-
-1. Download dashboard JSON from releases
-2. Import into Grafana
-3. Configure Prometheus data source
-
-### Key Panels
-
-**Instance Overview:**
-- Instance count and status
-- Resource usage per instance
-- Health status indicators
-
-**Performance Metrics:**
-- API response times
-- Tokens per second
-- Memory usage trends
-
-**System Resources:**
-- CPU and memory utilization
-- Disk I/O and network usage
-- GPU utilization (if applicable)
-
-### Custom Queries
-
-**Instance Uptime:**
-```promql
-(time() - llamactl_instance_start_time_seconds) / 3600
-```
-
-**Memory Usage Percentage:**
-```promql
-(llamactl_instance_memory_bytes / llamactl_system_memory_total_bytes) * 100
-```
-
-**API Error Rate:**
-```promql
-rate(llamactl_api_requests_total{status=~"4.."}[5m]) / rate(llamactl_api_requests_total[5m]) * 100
-```
-
-## Alerting
-
-### Prometheus Alerts
-
-Configure alerts for critical conditions:
-
-```yaml
-# alerts.yml
-groups:
-  - name: llamactl
-    rules:
-      - alert: InstanceDown
-        expr: llamactl_instance_up == 0
-        for: 1m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Llamactl instance {{ $labels.instance_name }} is down"
-
-      - alert: HighMemoryUsage
-        expr: llamactl_instance_memory_percent > 90
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "High memory usage on {{ $labels.instance_name }}"
-
-      - alert: APIHighLatency
-        expr: histogram_quantile(0.95, rate(llamactl_api_request_duration_seconds_bucket[5m])) > 2
-        for: 2m
-        labels:
-          severity: warning
-        annotations:
-          summary: "High API latency detected"
-```
-
-### Notification Channels
-
-Configure alert notifications:
-
-**Slack Integration:**
-```yaml
-# alertmanager.yml
-route:
-  group_by: ['alertname']
-  receiver: 'slack'
-
-receivers:
-  - name: 'slack'
-    slack_configs:
-      - api_url: 'https://hooks.slack.com/services/...'
-        channel: '#alerts'
-        title: 'Llamactl Alert'
-        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
-```
-
-## Log Management
-
-### Centralized Logging
-
-Configure log aggregation:
-
-```yaml
-# config.yaml
-logging:
-  level: "info"
-  output: "json"
-  destinations:
-    - type: "file"
-      path: "/var/log/llamactl/app.log"
-    - type: "syslog"
-      facility: "local0"
-    - type: "elasticsearch"
-      url: "http://elasticsearch:9200"
-```
-
-### Log Analysis
-
-Use ELK stack for log analysis:
-
-**Elasticsearch Index Template:**
-```json
-{
-  "index_patterns": ["llamactl-*"],
-  "mappings": {
-    "properties": {
-      "timestamp": {"type": "date"},
-      "level": {"type": "keyword"},
-      "message": {"type": "text"},
-      "instance": {"type": "keyword"},
-      "component": {"type": "keyword"}
-    }
-  }
-}
-```
-
-**Kibana Visualizations:**
-- Log volume over time
-- Error rate by instance
-- Performance trends
-- Resource usage patterns
-
-## Application Performance Monitoring
-
-### OpenTelemetry Integration
-
-Enable distributed tracing:
-
-```yaml
-# config.yaml
-telemetry:
-  enabled: true
-  otlp:
-    endpoint: "http://jaeger:14268/api/traces"
-  sampling_rate: 0.1
-```
-
-### Custom Spans
-
-Add custom tracing to track operations:
-
-```go
-ctx, span := tracer.Start(ctx, "instance.start")
-defer span.End()
-
-// Track instance startup time
-span.SetAttributes(
-    attribute.String("instance.name", name),
-    attribute.String("model.path", modelPath),
-)
-```
-
-## Health Check Configuration
-
-### Readiness Probes
-
-Configure Kubernetes readiness probes:
-
-```yaml
-readinessProbe:
-  httpGet:
-    path: /api/health
-    port: 8080
-  initialDelaySeconds: 30
-  periodSeconds: 10
-```
-
-### Liveness Probes
-
-Configure liveness probes:
-
-```yaml
-livenessProbe:
-  httpGet:
-    path: /api/health/live
-    port: 8080
-  initialDelaySeconds: 60
-  periodSeconds: 30
-```
-
-### Custom Health Checks
-
-Implement custom health checks:
-
-```go
-func (h *HealthHandler) CustomCheck(ctx context.Context) error {
-    // Check database connectivity
-    if err := h.db.Ping(); err != nil {
-        return fmt.Errorf("database unreachable: %w", err)
-    }
-
-    // Check instance responsiveness
-    for _, instance := range h.instances {
-        if !instance.IsHealthy() {
-            return fmt.Errorf("instance %s unhealthy", instance.Name)
-        }
-    }
-
-    return nil
-}
-```
-
-## Performance Profiling
-
-### pprof Integration
-
-Enable Go profiling:
-
-```yaml
-# config.yaml
-debug:
-  pprof_enabled: true
-  pprof_port: 6060
-```
-
-Access profiling endpoints:
-```bash
-# CPU profile
-go tool pprof http://localhost:6060/debug/pprof/profile
-
-# Memory profile
-go tool pprof http://localhost:6060/debug/pprof/heap
-
-# Goroutine profile
-go tool pprof http://localhost:6060/debug/pprof/goroutine
-```
-
-### Continuous Profiling
-
-Set up continuous profiling with Pyroscope:
-
-```yaml
-# config.yaml
-profiling:
-  enabled: true
-  pyroscope:
-    server_address: "http://pyroscope:4040"
-    application_name: "llamactl"
-```
-
-## Security Monitoring
-
-### Audit Logging
-
-Enable security audit logs:
-
-```yaml
-# config.yaml
-audit:
-  enabled: true
-  log_file: "/var/log/llamactl/audit.log"
-  events:
-    - "auth.login"
-    - "auth.logout"
-    - "instance.create"
-    - "instance.delete"
-    - "config.update"
-```
-
-### Rate Limiting Monitoring
-
-Track rate limiting metrics:
-
-```bash
-# Monitor rate limit hits
-curl http://localhost:8080/metrics | grep rate_limit
-```
-
-## Troubleshooting Monitoring
-
-### Common Issues
-
-**Metrics not appearing:**
-1. Check Prometheus configuration
-2. Verify network connectivity
-3. Review Llamactl logs for errors
-
-**High memory usage:**
-1. Check for memory leaks in profiles
-2. Monitor garbage collection metrics
-3. Review instance configurations
-
-**Alert fatigue:**
-1. Tune alert thresholds
-2. Implement alert severity levels
-3. Use alert routing and suppression
-
-### Debug Tools
-
-**Monitoring health:**
-```bash
-# Check monitoring endpoints
-curl -v http://localhost:8080/metrics
-curl -v http://localhost:8080/api/health
-
-# Review logs
-tail -f /var/log/llamactl/app.log
-```
-
-## Best Practices
-
-### Production Monitoring
-
-1. **Comprehensive coverage**: Monitor all critical components
-2. **Appropriate alerting**: Balance sensitivity and noise
-3. **Regular review**: Analyze trends and patterns
-4. **Documentation**: Maintain runbooks for alerts
-
-### Performance Optimization
-
-1. **Baseline establishment**: Know normal operating parameters
-2. **Trend analysis**: Identify performance degradation early
-3. **Capacity planning**: Monitor resource growth trends
-4. **Optimization cycles**: Regular performance tuning
-
-## Next Steps
-
-- Set up [Troubleshooting](troubleshooting.md) procedures
-- Learn about [Backend optimization](backends.md)
-- Configure [Production deployment](../development/building.md)
diff --git a/docs/getting-started/configuration.md b/docs/getting-started/configuration.md
index 3a859ee..25256de 100644
--- a/docs/getting-started/configuration.md
+++ b/docs/getting-started/configuration.md
@@ -148,15 +148,3 @@ llamactl --help
 ```
 
 You can also override configuration using command line flags when starting llamactl.
-
-## Next Steps
-
-- Learn about [Managing Instances](../user-guide/managing-instances.md)
-- Explore [Advanced Configuration](../advanced/monitoring.md)
-- Set up [Monitoring](../advanced/monitoring.md)
-
-## Next Steps
-
-- Learn about [Managing Instances](../user-guide/managing-instances.md)
-- Explore [Advanced Configuration](../advanced/monitoring.md)
-- Set up [Monitoring](../advanced/monitoring.md)
diff --git a/docs/index.md b/docs/index.md
index a1730c7..0637fdc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -40,14 +40,13 @@ Llamactl is designed to simplify the deployment and management of llama-server i
 
 - [Web UI Guide](user-guide/web-ui.md) - Learn to use the web interface
 - [Managing Instances](user-guide/managing-instances.md) - Instance lifecycle management
 - [API Reference](user-guide/api-reference.md) - Complete API documentation
-- [Monitoring](advanced/monitoring.md) - Health checks and monitoring
-- [Backends](advanced/backends.md) - Backend configuration options
+
 
 ## Getting Help
 
 If you need help or have questions:
 
-- Check the [Troubleshooting](advanced/troubleshooting.md) guide
+- Check the [Troubleshooting](user-guide/troubleshooting.md) guide
 - Visit the [GitHub repository](https://github.com/lordmathis/llamactl)
 - Review the [Configuration Guide](getting-started/configuration.md) for advanced settings
diff --git a/docs/user-guide/api-reference.md b/docs/user-guide/api-reference.md
index 56c274d..fcd88f3 100644
--- a/docs/user-guide/api-reference.md
+++ b/docs/user-guide/api-reference.md
@@ -462,9 +462,3 @@ curl -X POST http://localhost:8080/api/instances/example/stop
 # Delete instance
 curl -X DELETE http://localhost:8080/api/instances/example
 ```
-
-## Next Steps
-
-- Learn about [Managing Instances](managing-instances.md) in detail
-- Explore [Advanced Configuration](../advanced/backends.md)
-- Set up [Monitoring](../advanced/monitoring.md) for production use
diff --git a/docs/user-guide/managing-instances.md b/docs/user-guide/managing-instances.md
index fa1102e..14bbd71 100644
--- a/docs/user-guide/managing-instances.md
+++ b/docs/user-guide/managing-instances.md
@@ -163,9 +163,3 @@ curl -X POST http://localhost:8080/api/instances/stop-all
 # Get status of all instances
 curl http://localhost:8080/api/instances
 ```
-
-## Next Steps
-
-- Learn about the [Web UI](web-ui.md) interface
-- Explore the complete [API Reference](api-reference.md)
-- Set up [Monitoring](../advanced/monitoring.md) for production use
diff --git a/docs/advanced/troubleshooting.md b/docs/user-guide/troubleshooting.md
similarity index 98%
rename from docs/advanced/troubleshooting.md
rename to docs/user-guide/troubleshooting.md
index 1d070d3..2cd299f 100644
--- a/docs/advanced/troubleshooting.md
+++ b/docs/user-guide/troubleshooting.md
@@ -552,9 +552,3 @@ cp ~/.llamactl/config.yaml ~/.llamactl/config.yaml.backup
 # Backup instance configurations
 curl http://localhost:8080/api/instances > instances-backup.json
 ```
-
-## Next Steps
-
-- Set up [Monitoring](monitoring.md) to prevent issues
-- Learn about [Advanced Configuration](backends.md)
-- Review [Best Practices](../development/contributing.md)
diff --git a/docs/user-guide/web-ui.md b/docs/user-guide/web-ui.md
index 9b1ba29..6a3c4c1 100644
--- a/docs/user-guide/web-ui.md
+++ b/docs/user-guide/web-ui.md
@@ -208,9 +208,3 @@ Some features may be limited on mobile:
 - Log viewing (use horizontal scrolling)
 - Complex configuration forms
 - File browser functionality
-
-## Next Steps
-
-- Learn about [API Reference](api-reference.md) for programmatic access
-- Set up [Monitoring](../advanced/monitoring.md) for production use
-- Explore [Advanced Configuration](../advanced/backends.md) options
diff --git a/mkdocs.yml b/mkdocs.yml
index 4e7e107..f9fbe3d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -57,10 +57,7 @@ nav:
       - Managing Instances: user-guide/managing-instances.md
       - Web UI: user-guide/web-ui.md
      - API Reference: user-guide/api-reference.md
-  - Advanced:
-      - Backends: advanced/backends.md
-      - Monitoring: advanced/monitoring.md
-      - Troubleshooting: advanced/troubleshooting.md
+  - Troubleshooting: user-guide/troubleshooting.md
 
 plugins:
   - search