
Monitoring

Comprehensive monitoring setup for LlamaCtl in production environments.

Overview

Effective monitoring of LlamaCtl involves tracking:

  • Instance health and performance
  • System resource usage
  • API response times
  • Error rates and alerts

Built-in Monitoring

Health Checks

LlamaCtl provides built-in health monitoring:

# Check overall system health
curl http://localhost:8080/api/system/health

# Check specific instance health
curl http://localhost:8080/api/instances/{name}/health

Metrics Endpoint

Access Prometheus-compatible metrics:

curl http://localhost:8080/metrics

Available Metrics:

  • llamactl_instances_total: Total number of instances
  • llamactl_instances_running: Number of running instances
  • llamactl_instance_memory_bytes: Instance memory usage
  • llamactl_instance_cpu_percent: Instance CPU usage
  • llamactl_api_requests_total: Total API requests
  • llamactl_api_request_duration_seconds: API response times

Prometheus Integration

Configuration

Add LlamaCtl as a Prometheus target:

# prometheus.yml
scrape_configs:
  - job_name: 'llamactl'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

Custom Metrics

Enable additional metrics in LlamaCtl:

# config.yaml
monitoring:
  enabled: true
  prometheus:
    enabled: true
    path: "/metrics"
  metrics:
    - instance_stats
    - api_performance
    - system_resources

Grafana Dashboards

LlamaCtl Dashboard

Import the official Grafana dashboard:

  1. Download dashboard JSON from releases
  2. Import into Grafana
  3. Configure Prometheus data source

Key Panels

Instance Overview:

  • Instance count and status
  • Resource usage per instance
  • Health status indicators

Performance Metrics:

  • API response times
  • Tokens per second
  • Memory usage trends

System Resources:

  • CPU and memory utilization
  • Disk I/O and network usage
  • GPU utilization (if applicable)

Custom Queries

Instance Uptime (hours):

(time() - llamactl_instance_start_time_seconds) / 3600

Memory Usage Percentage:

(llamactl_instance_memory_bytes / llamactl_system_memory_total_bytes) * 100

API Error Rate (4xx responses):

rate(llamactl_api_requests_total{status=~"4.."}[5m]) / rate(llamactl_api_requests_total[5m]) * 100
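
If dashboards or alerts reuse these expressions, they can be precomputed as Prometheus recording rules so each panel reads a single series. A sketch (the recorded rule names are illustrative, not emitted by LlamaCtl):

```yaml
# recording_rules.yml
groups:
  - name: llamactl_derived
    interval: 30s
    rules:
      - record: llamactl:instance_uptime_hours
        expr: (time() - llamactl_instance_start_time_seconds) / 3600
      - record: llamactl:instance_memory_ratio
        expr: llamactl_instance_memory_bytes / llamactl_system_memory_total_bytes
      - record: llamactl:api_error_rate_4xx
        expr: rate(llamactl_api_requests_total{status=~"4.."}[5m]) / rate(llamactl_api_requests_total[5m])
```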

Alerting

Prometheus Alerts

Configure alerts for critical conditions:

# alerts.yml
groups:
  - name: llamactl
    rules:
      - alert: InstanceDown
        expr: llamactl_instance_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LlamaCtl instance {{ $labels.instance_name }} is down"
          
      - alert: HighMemoryUsage
        expr: llamactl_instance_memory_percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance_name }}"
          
      - alert: APIHighLatency
        expr: histogram_quantile(0.95, rate(llamactl_api_request_duration_seconds_bucket[5m])) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High API latency detected"

Notification Channels

Configure alert notifications:

Slack Integration:

# alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: 'LlamaCtl Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Log Management

Centralized Logging

Configure log aggregation:

# config.yaml
logging:
  level: "info"
  output: "json"
  destinations:
    - type: "file"
      path: "/var/log/llamactl/app.log"
    - type: "syslog"
      facility: "local0"
    - type: "elasticsearch"
      url: "http://elasticsearch:9200"

Log Analysis

Use the ELK stack (Elasticsearch, Logstash, Kibana) for log analysis:

Elasticsearch Index Template:

{
  "index_patterns": ["llamactl-*"],
  "mappings": {
    "properties": {
      "timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "message": {"type": "text"},
      "instance": {"type": "keyword"},
      "component": {"type": "keyword"}
    }
  }
}

Kibana Visualizations:

  • Log volume over time
  • Error rate by instance
  • Performance trends
  • Resource usage patterns

Application Performance Monitoring

OpenTelemetry Integration

Enable distributed tracing:

# config.yaml
telemetry:
  enabled: true
  otlp:
    endpoint: "http://jaeger:4318"  # OTLP/HTTP port on the Jaeger collector
  sampling_rate: 0.1

Custom Spans

Add custom tracing to track operations:

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// Obtain a tracer and wrap the operation in a span
tracer := otel.Tracer("llamactl")
ctx, span := tracer.Start(ctx, "instance.start")
defer span.End()

// Attach instance metadata to the span
span.SetAttributes(
    attribute.String("instance.name", name),
    attribute.String("model.path", modelPath),
)

Health Check Configuration

Readiness Probes

Configure Kubernetes readiness probes:

readinessProbe:
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Liveness Probes

Configure liveness probes:

livenessProbe:
  httpGet:
    path: /api/health/live
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
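
Large models can take minutes to load, which would otherwise trip the liveness probe; a Kubernetes startupProbe holds off liveness checks until the first success (thresholds here are illustrative):

```yaml
startupProbe:
  httpGet:
    path: /api/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10   # allows up to ~5 minutes for model load
```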

Custom Health Checks

Implement custom health checks:

func (h *HealthHandler) CustomCheck(ctx context.Context) error {
    // Check database connectivity
    if err := h.db.Ping(); err != nil {
        return fmt.Errorf("database unreachable: %w", err)
    }
    
    // Check instance responsiveness
    for _, instance := range h.instances {
        if !instance.IsHealthy() {
            return fmt.Errorf("instance %s unhealthy", instance.Name)
        }
    }
    
    return nil
}

Performance Profiling

pprof Integration

Enable Go profiling:

# config.yaml
debug:
  pprof_enabled: true
  pprof_port: 6060

Access profiling endpoints:

# CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile

# Memory profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine

Continuous Profiling

Set up continuous profiling with Pyroscope:

# config.yaml
profiling:
  enabled: true
  pyroscope:
    server_address: "http://pyroscope:4040"
    application_name: "llamactl"

Security Monitoring

Audit Logging

Enable security audit logs:

# config.yaml
audit:
  enabled: true
  log_file: "/var/log/llamactl/audit.log"
  events:
    - "auth.login"
    - "auth.logout"
    - "instance.create"
    - "instance.delete"
    - "config.update"

Rate Limiting Monitoring

Track rate limiting metrics:

# Monitor rate limit hits
curl http://localhost:8080/metrics | grep rate_limit

Troubleshooting Monitoring

Common Issues

Metrics not appearing:

  1. Check Prometheus configuration
  2. Verify network connectivity
  3. Review LlamaCtl logs for errors

High memory usage:

  1. Check for memory leaks in profiles
  2. Monitor garbage collection metrics
  3. Review instance configurations

Alert fatigue:

  1. Tune alert thresholds
  2. Implement alert severity levels
  3. Use alert routing and suppression

Debug Tools

Monitoring health:

# Check monitoring endpoints
curl -v http://localhost:8080/metrics
curl -v http://localhost:8080/api/health

# Review logs
tail -f /var/log/llamactl/app.log

Best Practices

Production Monitoring

  1. Comprehensive coverage: Monitor all critical components
  2. Appropriate alerting: Balance sensitivity and noise
  3. Regular review: Analyze trends and patterns
  4. Documentation: Maintain runbooks for alerts

Performance Optimization

  1. Baseline establishment: Know normal operating parameters
  2. Trend analysis: Identify performance degradation early
  3. Capacity planning: Monitor resource growth trends
  4. Optimization cycles: Regular performance tuning

Next Steps