# Quick Start
This guide will help you get Llamactl up and running in just a few minutes.
## Core Concepts
Before you start, let's clarify a few key terms:
- **Instance**: A running backend server that serves a specific model. Each instance has a unique name and runs independently.
- **Backend**: The inference engine that actually runs the model (llama.cpp, MLX, or vLLM). You need at least one backend installed before creating instances.
- **Node**: In multi-machine setups, a node represents one machine. Most users will just use the default "main" node for single-machine deployments.
- **Proxy Architecture**: Llamactl acts as a proxy in front of your instances. You make requests to llamactl (e.g., `http://localhost:8080/v1/chat/completions`), and it routes them to the appropriate backend instance. This means you don't need to track individual instance ports or endpoints (see the sketch below).
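To make the proxy model concrete, here is a minimal sketch using the OpenAI Python client (covered in more detail later in this guide). The instance names `llama2-7b` and `mistral-mlx` are hypothetical; substitute the names of your own instances and your inference API key:

```python
from openai import OpenAI

# One client, one base URL: requests always go to llamactl itself,
# never to an individual instance's port.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-inference-api-key",
)

# The "model" field selects the instance by name; llamactl routes the
# request to the matching backend instance. Names here are hypothetical.
for instance_name in ["llama2-7b", "mistral-mlx"]:
    response = client.chat.completions.create(
        model=instance_name,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(instance_name, "->", response.choices[0].message.content)
```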
## Authentication
Llamactl uses two types of API keys:
- **Management API Key**: Used to authenticate with the Llamactl management API (creating, starting, stopping instances).
- **Inference API Key**: Used to authenticate requests to the OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`, etc.).
By default, authentication is required. If you don't set these keys in your configuration file, llamactl will auto-generate them and print them to the terminal on startup. You can also configure custom keys or disable authentication entirely; see the Configuration guide for details.
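As a quick illustration, here is a minimal sketch of using the two keys from Python. It assumes both keys are sent as standard Bearer tokens; check the Configuration guide if your deployment expects a different header:

```python
import requests

MANAGEMENT_KEY = "sk-management-..."  # from startup output or your config
INFERENCE_KEY = "sk-inference-..."    # from startup output or your config

# Management API: list instances (assumed Bearer-token header).
resp = requests.get(
    "http://localhost:8080/api/v1/instances",
    headers={"Authorization": f"Bearer {MANAGEMENT_KEY}"},
)
print(resp.status_code, resp.json())

# Inference API: OpenAI-compatible chat completion with the inference key.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"Authorization": f"Bearer {INFERENCE_KEY}"},
    json={"model": "my-model", "messages": [{"role": "user", "content": "Hi!"}]},
)
print(resp.status_code)
```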
## Start Llamactl
Start the Llamactl server:
```bash
llamactl
```

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️  MANAGEMENT AUTHENTICATION REQUIRED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔑 Generated Management API Key:
sk-management-...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️  INFERENCE AUTHENTICATION REQUIRED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔑 Generated Inference API Key:
sk-inference-...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️  IMPORTANT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• These keys are auto-generated and will change on restart
• For production, add explicit keys to your configuration
• Copy these keys before they disappear from the terminal
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Llamactl server listening on 0.0.0.0:8080
```
Copy the Management API Key from the terminal - you'll need it to access the web UI.
By default, Llamactl will start on http://localhost:8080.
## Access the Web UI
Open your web browser and navigate to:
```
http://localhost:8080
```
Log in with the management API key from the terminal output.
You should see the Llamactl web interface.
## Create Your First Instance
1. Click the "Add Instance" button
2. Fill in the instance configuration:
    - **Name**: Give your instance a descriptive name
    - **Node**: Select which node to deploy the instance to (defaults to "main" for single-node setups)
    - **Backend Type**: Choose from llama.cpp, MLX, or vLLM
    - **Model**: Model path or Hugging Face repo
    - **Additional Options**: Backend-specific parameters

    !!! tip "Auto-Assignment"
        Llamactl automatically assigns ports from the configured port range (default: 8000-9000) and generates API keys if authentication is enabled. You typically don't need to manually specify these values.

    !!! note "Remote Node Deployment"
        If you have configured remote nodes in your configuration file, you can select which node to deploy the instance to. This allows you to distribute instances across multiple machines. See the Configuration guide for details on setting up remote nodes.

3. Click "Create Instance"
## Start Your Instance
Once created, you can:
- Start the instance by clicking the start button
- Monitor its status in real-time
- View logs by clicking the logs button
- Stop the instance when needed
## Example Configurations
Here are basic example configurations for each backend:
**llama.cpp backend:**

```json
{
  "name": "llama2-7b",
  "backend_type": "llama_cpp",
  "backend_options": {
    "model": "/path/to/llama-2-7b-chat.gguf",
    "threads": 4,
    "ctx_size": 2048,
    "gpu_layers": 32
  },
  "nodes": ["main"]
}
```
**MLX backend (macOS only):**

```json
{
  "name": "mistral-mlx",
  "backend_type": "mlx_lm",
  "backend_options": {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "temp": 0.7,
    "max_tokens": 2048
  },
  "nodes": ["main"]
}
```
**vLLM backend:**

```json
{
  "name": "dialogpt-vllm",
  "backend_type": "vllm",
  "backend_options": {
    "model": "microsoft/DialoGPT-medium",
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": 0.9
  },
  "nodes": ["main"]
}
```
**Multi-node deployment example:**

```json
{
  "name": "distributed-model",
  "backend_type": "llama_cpp",
  "backend_options": {
    "model": "/path/to/model.gguf",
    "gpu_layers": 32
  },
  "nodes": ["worker1", "worker2"]
}
```
## Docker Support
Llamactl can run backends in Docker containers. To enable Docker for a backend, add a `docker` section to that backend in your YAML configuration file (e.g., `config.yaml`), as shown below:
```yaml
backends:
  vllm:
    command: "vllm"
    args: ["serve"]
    docker:
      enabled: true
      image: "vllm/vllm-openai:latest"
      args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g"]
```
## Using the API
You can also manage instances via the REST API:
```bash
# List all instances
curl http://localhost:8080/api/v1/instances

# Create a new llama.cpp instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {
      "model": "/path/to/model.gguf"
    }
  }'

# Start an instance
curl -X POST http://localhost:8080/api/v1/instances/my-model/start
```
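The same lifecycle from Python, as a minimal sketch using the `requests` library. Only the endpoints shown above are used; the Bearer-token header is an assumption for deployments with authentication enabled:

```python
import requests

BASE = "http://localhost:8080/api/v1"
HEADERS = {"Authorization": "Bearer sk-management-..."}  # assumed Bearer scheme

# Create a new llama.cpp instance (mirrors the curl call above).
requests.post(
    f"{BASE}/instances/my-model",
    headers=HEADERS,
    json={
        "backend_type": "llama_cpp",
        "backend_options": {"model": "/path/to/model.gguf"},
    },
).raise_for_status()

# Start it, then list all instances to confirm it is running.
requests.post(f"{BASE}/instances/my-model/start", headers=HEADERS).raise_for_status()
print(requests.get(f"{BASE}/instances", headers=HEADERS).json())
```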
## OpenAI Compatible API
Llamactl provides OpenAI-compatible endpoints, making it easy to integrate with existing OpenAI client libraries and tools.
### Chat Completions
Once you have an instance running, you can use it with the OpenAI-compatible chat completions endpoint:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Hello! Can you help me write a Python function?"
      }
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```
### Using with Python OpenAI Client
You can also use the official OpenAI Python client:
```python
from openai import OpenAI

# Point the client to your Llamactl server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-inference-api-key"  # Use the inference API key from terminal or config
)

# Create a chat completion
response = client.chat.completions.create(
    model="my-model",  # Use the name of your instance
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].message.content)
```
!!! note "API Key"
    If you disabled authentication in your config, you can use any value for `api_key` (e.g., `"not-needed"`). Otherwise, use the inference API key shown in the terminal output on startup.
### List Available Models
Get a list of running instances (models) in OpenAI-compatible format:
```bash
curl http://localhost:8080/v1/models
```
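The same listing through the OpenAI Python client; each returned model id is simply the name of a running instance:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-inference-api-key",
)

# Each entry corresponds to a running llamactl instance.
for model in client.models.list():
    print(model.id)
```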
## Next Steps
- Learn more about managing instances in the Managing Instances guide
- Explore the API Reference
- Configure advanced settings in the Configuration guide