API Reference
Complete reference for the Llamactl REST API.
Base URL
All API endpoints are relative to the base URL:
http://localhost:8080/api/v1
Authentication
Llamactl supports API key authentication. If authentication is enabled, include the API key in the Authorization header:
curl -H "Authorization: Bearer <your-api-key>" \
http://localhost:8080/api/v1/instances
The server supports two types of API keys:
- Management API Keys: Required for instance management operations (CRUD operations on instances)
- Inference API Keys: Required for OpenAI-compatible inference endpoints
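The split between the two key types can be sketched as a small helper. This is an illustrative Python sketch, not part of llamactl itself; it assumes management keys are used for `/api/v1/*` paths and inference keys for the OpenAI-compatible `/v1/*` paths, as described above.

```python
def auth_header(path: str, management_key: str, inference_key: str) -> dict:
    """Pick the Authorization header for a llamactl request path.

    Assumption (for illustration): inference keys cover the
    OpenAI-compatible /v1/* endpoints, management keys cover /api/v1/*.
    """
    key = inference_key if path.startswith("/v1/") else management_key
    return {"Authorization": f"Bearer {key}"}
```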
System Endpoints
Get Llamactl Version
Get the version information of the llamactl server.
GET /api/v1/version
Response:
Version: 1.0.0
Commit: abc123
Build Time: 2024-01-15T10:00:00Z
Get Llama Server Help
Get help text for the llama-server command.
GET /api/v1/server/help
Response: Plain text help output from llama-server --help
Get Llama Server Version
Get version information of the llama-server binary.
GET /api/v1/server/version
Response: Plain text version output from llama-server --version
List Available Devices
List available devices for llama-server.
GET /api/v1/server/devices
Response: Plain text device list from llama-server --list-devices
Instances
List All Instances
Get a list of all instances.
GET /api/v1/instances
Response:
[
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
]
Get Instance Details
Get detailed information about a specific instance.
GET /api/v1/instances/{name}
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
Create Instance
Create and start a new instance.
POST /api/v1/instances/{name}
Request Body: JSON object with instance configuration. See Managing Instances for available configuration options.
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
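For a client in code rather than curl, the create call can be assembled with the Python standard library. This is a hedged sketch: it only builds the request object (nothing is sent), and the model path and key are placeholders.

```python
import json
import urllib.request

def build_create_request(base_url: str, name: str, config: dict,
                         api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that creates and starts an instance.

    Send it later with urllib.request.urlopen(req) against a live server.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/v1/instances/{name}",
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```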
Update Instance
Update an existing instance configuration. See Managing Instances for available configuration options.
PUT /api/v1/instances/{name}
Request Body: JSON object with configuration fields to update.
Response:
{
"name": "llama2-7b",
"status": "running",
"created": 1705312200
}
Delete Instance
Stop and remove an instance.
DELETE /api/v1/instances/{name}
Response: 204 No Content
Instance Operations
Start Instance
Start a stopped instance.
POST /api/v1/instances/{name}/start
Response:
{
"name": "llama2-7b",
"status": "starting",
"created": 1705312200
}
Error Responses:
409 Conflict: Maximum number of running instances reached
500 Internal Server Error: Failed to start instance

Stop Instance
Stop a running instance.
POST /api/v1/instances/{name}/stop
Response:
{
"name": "llama2-7b",
"status": "stopping",
"created": 1705312200
}
Restart Instance
Restart an instance (stop then start).
POST /api/v1/instances/{name}/restart
Response:
{
"name": "llama2-7b",
"status": "restarting",
"created": 1705312200
}
Get Instance Logs
Retrieve instance logs.
GET /api/v1/instances/{name}/logs
Query Parameters:
lines: Number of lines to return, counted from the end of the log (defaults to all lines; -1 also returns all lines)
Response: Plain text log output
Example:
curl "http://localhost:8080/api/v1/instances/my-instance/logs?lines=100"
Proxy to Instance
Proxy HTTP requests directly to the llama-server instance.
GET /api/v1/instances/{name}/proxy/*
POST /api/v1/instances/{name}/proxy/*
This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.
Example - Check Instance Health:
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model/proxy/health
This forwards the request to http://instance-host:instance-port/health on the actual llama-server instance.
Error Responses:
503 Service Unavailable: Instance is not running
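The prefix-stripping rule above can be expressed as a small path-mapping function. This is an illustrative sketch of the documented behavior, not llamactl's actual implementation; the instance host and port are assumed to come from the instance's configuration.

```python
def proxy_target_url(proxy_path: str, instance_host: str,
                     instance_port: int) -> str:
    """Map a llamactl proxy path to the instance-local URL.

    Per the documented rule, everything after the
    /api/v1/instances/{name}/proxy prefix is forwarded to the instance.
    """
    remainder = proxy_path.split("/proxy", 1)[1] or "/"
    return f"http://{instance_host}:{instance_port}{remainder}"
```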
OpenAI-Compatible API
Llamactl provides OpenAI-compatible endpoints for inference operations.
List Models
List all instances in OpenAI-compatible format.
GET /v1/models
Response:
{
"object": "list",
"data": [
{
"id": "llama2-7b",
"object": "model",
"created": 1705312200,
"owned_by": "llamactl"
}
]
}
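The mapping from an instance record to an OpenAI model entry is direct, as the two examples above show. The following sketch (an illustration, not llamactl code) makes that correspondence explicit:

```python
def to_openai_model(instance: dict) -> dict:
    """Render a llamactl instance record in the OpenAI model-list shape.

    The instance name becomes the model id; owned_by is always
    "llamactl" per the example response.
    """
    return {
        "id": instance["name"],
        "object": "model",
        "created": instance["created"],
        "owned_by": "llamactl",
    }
```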
Chat Completions, Completions, Embeddings, Reranking
All OpenAI-compatible inference endpoints are available:
POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
POST /v1/rerank
POST /v1/reranking
Request Body: Standard OpenAI format with model field specifying the instance name
Example:
{
"model": "llama2-7b",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
]
}
The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.
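Because routing is driven entirely by the model field, a request body can be assembled generically. A minimal sketch (illustrative helper names; any extra OpenAI options such as max_tokens are passed through unchanged):

```python
def chat_completion_payload(instance_name: str, messages: list,
                            **options) -> dict:
    """Build an OpenAI-style chat request body for llamactl.

    The model field names the llamactl instance that should handle
    the request; options may carry fields like max_tokens.
    """
    payload = {"model": instance_name, "messages": messages}
    payload.update(options)
    return payload
```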
Error Responses:
400 Bad Request: Invalid request body or missing model name
503 Service Unavailable: Instance is not running and on-demand start is disabled
409 Conflict: Cannot start instance due to maximum instances limit
Instance Status Values
Instances can have the following status values:
stopped: Instance is not running
running: Instance is running and ready to accept requests
failed: Instance failed to start or crashed
Error Responses
All endpoints may return error responses in the following format:
{
"error": "Error message description"
}
Common HTTP Status Codes
200: Success
201: Created
204: No Content (successful deletion)
400: Bad Request (invalid parameters or request body)
401: Unauthorized (missing or invalid API key)
403: Forbidden (insufficient permissions)
404: Not Found (instance not found)
409: Conflict (instance already exists, max instances reached)
500: Internal Server Error
503: Service Unavailable (instance not running)
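A client can combine the error body format with the status code to decide how to react. This is a hedged sketch: treating 409 and 503 as retryable (e.g. after stopping another instance or starting this one) is an assumption for illustration, not a llamactl guarantee.

```python
import json

def parse_error(status: int, body: str) -> dict:
    """Parse a llamactl error response into a structured result.

    Assumption (for illustration only): 409 and 503 are treated as
    retryable, since they reflect transient instance state.
    """
    message = json.loads(body).get("error", "")
    return {
        "status": status,
        "error": message,
        "retryable": status in (409, 503),
    }
```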
Examples
Complete Instance Lifecycle
# Create and start instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "/models/llama-2-7b.gguf"
}'
# Check instance status
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model
# Get instance logs
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:8080/api/v1/instances/my-model/logs?lines=50"
# Use OpenAI-compatible chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-inference-api-key" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 100
}'
# Stop instance
curl -X POST -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model/stop
# Delete instance
curl -X DELETE -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model
Using the Proxy Endpoint
You can also directly proxy requests to the llama-server instance:
# Direct proxy to instance (bypasses OpenAI compatibility layer)
curl -X POST http://localhost:8080/api/v1/instances/my-model/proxy/completion \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"prompt": "Hello, world!",
"n_predict": 50
}'
Swagger Documentation
If swagger documentation is enabled in the server configuration, you can access the interactive API documentation at:
http://localhost:8080/swagger/
This provides a complete interactive interface for testing all API endpoints.