API Reference
Complete reference for the Llamactl REST API.
Base URL
All API endpoints are relative to the base URL:
http://localhost:8080/api/v1
Authentication
Llamactl supports API key authentication. If authentication is enabled, include the API key in the Authorization header:
curl -H "Authorization: Bearer <your-api-key>" \
http://localhost:8080/api/v1/instances
The server supports two types of API keys:
- Management API Keys: Required for instance management operations (CRUD operations on instances)
- Inference API Keys: Required for OpenAI-compatible inference endpoints
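For example, a request to an OpenAI-compatible endpoint uses an inference API key instead (the key value below is a placeholder):
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-inference-api-key>" \
  -d '{"model": "llama2-7b", "messages": [{"role": "user", "content": "Hello"}]}'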
System Endpoints
Get Llamactl Version
Get the version information of the llamactl server.
GET /api/v1/version
Response:
Version: 1.0.0
Commit: abc123
Build Time: 2024-01-15T10:00:00Z
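Example (assuming management authentication is enabled; the key is a placeholder):
curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/version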
Get Llama Server Help
Get help text for the llama-server command.
GET /api/v1/server/help
Response: Plain text help output from llama-server --help
Get Llama Server Version
Get version information of the llama-server binary.
GET /api/v1/server/version
Response: Plain text version output from llama-server --version
List Available Devices
List available devices for llama-server.
GET /api/v1/server/devices
Response: Plain text device list from llama-server --list-devices
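Example (a minimal sketch querying the device list; the key is a placeholder):
curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/server/devices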
Instances
List All Instances
Get a list of all instances.
GET /api/v1/instances
Response:
[
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
]
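Example (the key is a placeholder):
curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances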
Get Instance Details
Get detailed information about a specific instance.
GET /api/v1/instances/{name}
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
Create Instance
Create and start a new instance.
POST /api/v1/instances/{name}
Request Body: JSON object with instance configuration. Common fields include:
- backend_type: Backend type (llama_cpp, mlx_lm, or vllm)
- backend_options: Backend-specific configuration
- auto_restart: Enable automatic restart on failure
- max_restarts: Maximum restart attempts
- restart_delay: Delay between restarts in seconds
- on_demand_start: Start instance when receiving requests
- idle_timeout: Idle timeout in minutes
- environment: Environment variables as key-value pairs
See Managing Instances for complete configuration options.
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
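Example (a sketch; the instance name, model path, and option values are illustrative):
curl -X POST http://localhost:8080/api/v1/instances/llama2-7b \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {
      "model": "/models/llama-2-7b.gguf"
    },
    "on_demand_start": true,
    "idle_timeout": 30
  }'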
Update Instance
Update an existing instance configuration. See Managing Instances for available configuration options.
PUT /api/v1/instances/{name}
Request Body: JSON object with configuration fields to update.
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
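Example (a sketch updating only the restart behavior; values are illustrative):
curl -X PUT http://localhost:8080/api/v1/instances/llama2-7b \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{"auto_restart": true, "max_restarts": 3}'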
Delete Instance
Stop and remove an instance.
DELETE /api/v1/instances/{name}
Response: 204 No Content
Instance Operations
Start Instance
Start a stopped instance.
POST /api/v1/instances/{name}/start
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
Error Responses:
- 409 Conflict: Maximum number of running instances reached
- 500 Internal Server Error: Failed to start instance
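Example (the instance name and key are placeholders):
curl -X POST -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances/llama2-7b/start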
Stop Instance
Stop a running instance.
POST /api/v1/instances/{name}/stop
Response:
{
"name": "llama2-7b",
"status": "stopped",
"created": 1705312200
}
Restart Instance
Restart an instance (stop then start).
POST /api/v1/instances/{name}/restart
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
Get Instance Logs
Retrieve instance logs.
GET /api/v1/instances/{name}/logs
Query Parameters:
- lines: Number of lines to return (default: all lines, use -1 for all)
Response: Plain text log output
Example:
curl "http://localhost:8080/api/v1/instances/my-instance/logs?lines=100"
Proxy to Instance
Proxy HTTP requests directly to the llama-server instance.
GET /api/v1/instances/{name}/proxy/*
POST /api/v1/instances/{name}/proxy/*
This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.
Example - Check Instance Health:
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model/proxy/health
This forwards the request to http://instance-host:instance-port/health on the actual llama-server instance.
Error Responses:
503 Service Unavailable: Instance is not running
OpenAI-Compatible API
Llamactl provides OpenAI-compatible endpoints for inference operations.
List Models
List all instances in OpenAI-compatible format.
GET /v1/models
Response:
{
"object": "list",
"data": [
{
"id": "llama2-7b",
"object": "model",
"created": 1705312200,
"owned_by": "llamactl"
}
]
}
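Example (uses an inference API key if inference authentication is enabled; the key is a placeholder):
curl -H "Authorization: Bearer <your-inference-api-key>" \
  http://localhost:8080/v1/models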
Chat Completions, Completions, Embeddings, Reranking
All OpenAI-compatible inference endpoints are available:
POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
POST /v1/rerank
POST /v1/reranking
Request Body: Standard OpenAI format, with the model field specifying the instance name
Example:
{
"model": "llama2-7b",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
]
}
The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.
Error Responses:
- 400 Bad Request: Invalid request body or missing instance name
- 503 Service Unavailable: Instance is not running and on-demand start is disabled
- 409 Conflict: Cannot start instance due to maximum instances limit
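A request to the embeddings endpoint follows the same pattern. Whether an instance can serve embeddings depends on its backend and model, so treat this as a sketch (the instance name is illustrative):
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-inference-api-key>" \
  -d '{
    "model": "llama2-7b",
    "input": "Hello, how are you?"
  }'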
Instance Status Values
Instances can have the following status values:
- stopped: Instance is not running
- running: Instance is running and ready to accept requests
- failed: Instance failed to start or crashed
Error Responses
All endpoints may return error responses in the following format:
{
"error": "Error message description"
}
Common HTTP Status Codes
- 200: Success
- 201: Created
- 204: No Content (successful deletion)
- 400: Bad Request (invalid parameters or request body)
- 401: Unauthorized (missing or invalid API key)
- 403: Forbidden (insufficient permissions)
- 404: Not Found (instance not found)
- 409: Conflict (instance already exists, max instances reached)
- 500: Internal Server Error
- 503: Service Unavailable (instance not running)
Examples
Complete Instance Lifecycle
# Create and start instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"backend_type": "llama_cpp",
"backend_options": {
"model": "/models/llama-2-7b.gguf",
"gpu_layers": 32
},
"environment": {
"CUDA_VISIBLE_DEVICES": "0",
"OMP_NUM_THREADS": "8"
}
}'
# Check instance status
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model
# Get instance logs
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:8080/api/v1/instances/my-model/logs?lines=50"
# Use OpenAI-compatible chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-inference-api-key" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 100
}'
# Stop instance
curl -X POST -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model/stop
# Delete instance
curl -X DELETE -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model
Using the Proxy Endpoint
You can also directly proxy requests to the llama-server instance:
# Direct proxy to instance (bypasses OpenAI compatibility layer)
curl -X POST http://localhost:8080/api/v1/instances/my-model/proxy/completion \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"prompt": "Hello, world!",
"n_predict": 50
}'
Backend-Specific Endpoints
Parse Commands
Llamactl provides endpoints to parse command strings from different backends into instance configuration options.
Parse Llama.cpp Command
Parse a llama-server command string into instance options.
POST /api/v1/backends/llama-cpp/parse-command
Request Body:
{
"command": "llama-server -m /path/to/model.gguf -c 2048 --port 8080"
}
Response:
{
"backend_type": "llama_cpp",
"llama_server_options": {
"model": "/path/to/model.gguf",
"ctx_size": 2048,
"port": 8080
}
}
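Example (a sketch; the command string is illustrative):
curl -X POST http://localhost:8080/api/v1/backends/llama-cpp/parse-command \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{"command": "llama-server -m /path/to/model.gguf -c 2048 --port 8080"}'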
Parse MLX-LM Command
Parse an MLX-LM server command string into instance options.
POST /api/v1/backends/mlx/parse-command
Request Body:
{
"command": "mlx_lm.server --model /path/to/model --port 8080"
}
Response:
{
"backend_type": "mlx_lm",
"mlx_server_options": {
"model": "/path/to/model",
"port": 8080
}
}
Parse vLLM Command
Parse a vLLM serve command string into instance options.
POST /api/v1/backends/vllm/parse-command
Request Body:
{
"command": "vllm serve /path/to/model --port 8080"
}
Response:
{
"backend_type": "vllm",
"vllm_server_options": {
"model": "/path/to/model",
"port": 8080
}
}
Error Responses for Parse Commands:
- 400 Bad Request: Invalid request body, empty command, or parse error
- 500 Internal Server Error: Encoding error
Auto-Generated Documentation
The API documentation is automatically generated from code annotations using Swagger/OpenAPI. To regenerate the documentation:
- Install the swag tool:
  go install github.com/swaggo/swag/cmd/swag@latest
- Generate docs:
  swag init -g cmd/server/main.go -o apidocs
Swagger Documentation
If Swagger documentation is enabled in the server configuration, you can access the interactive API documentation at:
http://localhost:8080/swagger/
This provides a complete interactive interface for testing all API endpoints.