# Quick Start

This guide will help you get Llamactl up and running in just a few minutes.

**Before you begin:** Ensure you have at least one backend installed (llama.cpp, MLX, or vLLM). See the [Installation Guide](installation.md#prerequisites) for backend setup.

## Core Concepts

Before you start, let's clarify a few key terms:

- **Instance**: A running backend server that serves a specific model. Each instance has a unique name and runs independently.
- **Backend**: The inference engine that actually runs the model (llama.cpp, MLX, or vLLM). You need at least one backend installed before creating instances.
- **Node**: In multi-machine setups, a node represents one machine. Most users will just use the default "main" node for single-machine deployments.
- **Proxy Architecture**: Llamactl acts as a proxy in front of your instances. You make requests to Llamactl (e.g., `http://localhost:8080/v1/chat/completions`), and it routes them to the appropriate backend instance. This means you don't need to track individual instance ports or endpoints.

## Authentication

Llamactl uses two types of API keys:

- **Management API Key**: Used to authenticate with the Llamactl management API and web UI. If not configured, one is auto-generated at startup and printed to the terminal.
- **Inference API Key**: Used to authenticate requests to the OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`, etc.). These are created and managed via the web UI.

By default, authentication is required for both management and inference endpoints. You can configure custom management keys or disable authentication; see the [Configuration](configuration.md) guide.

## Start Llamactl

Start the Llamactl server:

```bash
llamactl
```

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️  MANAGEMENT AUTHENTICATION REQUIRED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔑 Generated Management API Key: sk-management-...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️  IMPORTANT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• This key is auto-generated and will change on restart
• For production, add explicit management_keys to your configuration
• Copy this key before it disappears from the terminal
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Llamactl server listening on 0.0.0.0:8080
```

Copy the **Management API Key** from the terminal output; you'll need it to access the web UI.

By default, Llamactl will start on `http://localhost:8080`.
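If you prefer the command line, you can verify that the server is running and that your management key works before opening the web UI. This is a minimal sketch, assuming the management API accepts the key as an `Authorization: Bearer` header; replace the placeholder with the key printed at startup:

```bash
# List instances via the management API (a fresh install returns an empty list)
curl -H "Authorization: Bearer your-management-api-key" \
  http://localhost:8080/api/v1/instances
```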
## Access the Web UI

Open your web browser and navigate to:

```
http://localhost:8080
```

Log in with the management API key from the terminal output. You should see the Llamactl web interface.

## Create Your First Instance

1. Click the "Add Instance" button
2. Fill in the instance configuration:
    - **Name**: Give your instance a descriptive name
    - **Node**: Select which node to deploy the instance to (defaults to "main" for single-node setups)
    - **Backend Type**: Choose from llama.cpp, MLX, or vLLM
    - **Model**: Model path or Hugging Face repository
    - **Additional Options**: Backend-specific parameters

    !!! tip "Auto-Assignment"
        Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and manages API keys if authentication is enabled. You typically don't need to specify these values manually.

    !!! note "Remote Node Deployment"
        If you have configured remote nodes in your configuration file, you can select which node to deploy the instance to. This allows you to distribute instances across multiple machines. See the [Configuration](configuration.md#remote-node-configuration) guide for details on setting up remote nodes.

3. Click "Create Instance"

## Start Your Instance

Once created, you can:

- **Start** the instance by clicking the start button
- **Monitor** its status in real-time
- **View logs** by clicking the logs button
- **Stop** the instance when needed

## Create an Inference API Key

To make inference requests to your instances, you'll need an inference API key:

1. In the web UI, click the **Settings** icon (gear icon in the top-right)
2. Navigate to the **API Keys** tab
3. Click **Create API Key**
4. Configure your key:
    - **Name**: Give it a descriptive name (e.g., "Production Key", "Development Key")
    - **Expiration**: Optionally set an expiration date for the key
    - **Permissions**: Choose whether the key can access all instances or only specific ones
5. Click **Create**
6. **Copy the generated key** - it will only be shown once!

The key will look like: `llamactl-...`

You can create multiple inference keys with different permissions for different use cases (e.g., one for development, one for production, or keys limited to specific instances).

## Example Configurations

Here are basic example configurations for each backend:

**llama.cpp backend:**

```json
{
  "name": "llama2-7b",
  "backend_type": "llama_cpp",
  "backend_options": {
    "model": "/path/to/llama-2-7b-chat.gguf",
    "threads": 4,
    "ctx_size": 2048,
    "gpu_layers": 32
  },
  "nodes": ["main"]
}
```

**MLX backend (macOS only):**

```json
{
  "name": "mistral-mlx",
  "backend_type": "mlx_lm",
  "backend_options": {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "temp": 0.7,
    "max_tokens": 2048
  },
  "nodes": ["main"]
}
```

**vLLM backend:**

```json
{
  "name": "dialogpt-vllm",
  "backend_type": "vllm",
  "backend_options": {
    "model": "microsoft/DialoGPT-medium",
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": 0.9
  },
  "nodes": ["main"]
}
```

**Remote node deployment example:**

```json
{
  "name": "distributed-model",
  "backend_type": "llama_cpp",
  "backend_options": {
    "model": "/path/to/model.gguf",
    "gpu_layers": 32
  },
  "nodes": ["worker1"]
}
```

## Docker Support

Llamactl can run backends in Docker containers. To enable Docker for a backend, add a `docker` section to that backend in your YAML configuration file (e.g., `config.yaml`) as shown below:

```yaml
backends:
  vllm:
    command: "vllm"
    args: ["serve"]
    docker:
      enabled: true
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
```

## Using the API

You can also manage instances via the REST API:

```bash
# List all instances
curl http://localhost:8080/api/v1/instances

# Create a new llama.cpp instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {
      "model": "/path/to/model.gguf"
    }
  }'

# Start an instance
curl -X POST http://localhost:8080/api/v1/instances/my-model/start
```

## OpenAI Compatible API

Llamactl provides OpenAI-compatible endpoints, making it easy to integrate with existing OpenAI client libraries and tools.

### Chat Completions

Once you have an instance running, you can use it with the OpenAI-compatible chat completions endpoint:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Hello! Can you help me write a Python function?"
      }
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```
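If authentication is enabled (the default), the request above also needs your inference API key. Here is a minimal sketch, assuming the key is sent as a standard `Authorization: Bearer` header (as OpenAI clients do); replace the placeholder with a key created in the web UI:

```bash
# Same request, with an inference API key created in the web UI (Settings → API Keys)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-inference-api-key" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```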
### Using with Python OpenAI Client

You can also use the official OpenAI Python client:

```python
from openai import OpenAI

# Point the client to your Llamactl server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-inference-api-key"  # Use an inference API key created in the web UI
)

# Create a chat completion
response = client.chat.completions.create(
    model="my-model",  # Use the name of your instance
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].message.content)
```

!!! note "API Key"
    If you disabled authentication in your config, you can use any value for `api_key` (e.g., `"not-needed"`). Otherwise, use the inference API key you created via the web UI (Settings → API Keys).

### List Available Models

Get a list of running instances (models) in OpenAI-compatible format:

```bash
curl http://localhost:8080/v1/models
```

## Next Steps

- Learn more about [Managing Instances](managing-instances.md)
- Explore the [API Reference](api-reference.md)
- Configure advanced settings in the [Configuration](configuration.md) guide