API Reference

Complete reference for the Llamactl REST API.

Base URL

All API endpoints are relative to the base URL:

http://localhost:8080/api/v1

Authentication

Llamactl supports API key authentication. If authentication is enabled, include the API key in the Authorization header:

curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances

The server supports two types of API keys:
- Management API Keys: Required for instance management operations (CRUD operations on instances)
- Inference API Keys: Required for OpenAI-compatible inference endpoints
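
For example, instance management endpoints typically take a management key, while the OpenAI-compatible endpoints take an inference key (the key values below are placeholders):

# Management endpoint: management API key
curl -H "Authorization: Bearer <management-api-key>" \
  http://localhost:8080/api/v1/instances

# Inference endpoint: inference API key
curl -H "Authorization: Bearer <inference-api-key>" \
  http://localhost:8080/v1/models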

System Endpoints

Get Llamactl Version

Get the version information of the llamactl server.

GET /api/v1/version

Response:

Version: 1.0.0
Commit: abc123
Build Time: 2024-01-15T10:00:00Z
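
Example (include the Authorization header only if authentication is enabled):

curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/version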

Get Llama Server Help

Get help text for the llama-server command.

GET /api/v1/server/help

Response: Plain text help output from llama-server --help

Get Llama Server Version

Get version information of the llama-server binary.

GET /api/v1/server/version

Response: Plain text version output from llama-server --version

List Available Devices

List available devices for llama-server.

GET /api/v1/server/devices

Response: Plain text device list from llama-server --list-devices
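
The llama-server endpoints above all follow the same request pattern; for example, to list devices:

curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/server/devices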

Instances

List All Instances

Get a list of all instances.

GET /api/v1/instances

Response:

[
  {
    "name": "llama2-7b",
    "status": "running",
    "created": 1705312200
  }
]

Get Instance Details

Get detailed information about a specific instance.

GET /api/v1/instances/{name}

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}

Create Instance

Create and start a new instance.

POST /api/v1/instances/{name}

Request Body: JSON object with instance configuration. See Managing Instances for available configuration options.
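
A minimal example request, using the model path shown in the Examples section below (available configuration fields depend on your setup; see Managing Instances):

curl -X POST http://localhost:8080/api/v1/instances/llama2-7b \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <management-api-key>" \
  -d '{
    "model": "/models/llama-2-7b.gguf"
  }'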

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}

Update Instance

Update an existing instance configuration. See Managing Instances for available configuration options.

PUT /api/v1/instances/{name}

Request Body: JSON object with configuration fields to update.
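
For example, to point the instance at a different model file (the field name mirrors the create example above; the path is illustrative):

curl -X PUT http://localhost:8080/api/v1/instances/llama2-7b \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <management-api-key>" \
  -d '{
    "model": "/models/llama-2-7b-chat.gguf"
  }'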

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}

Delete Instance

Stop and remove an instance.

DELETE /api/v1/instances/{name}

Response: 204 No Content

Instance Operations

Start Instance

Start a stopped instance.

POST /api/v1/instances/{name}/start

Response:

{
  "name": "llama2-7b",
  "status": "starting",
  "created": 1705312200
}
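
Example:

curl -X POST -H "Authorization: Bearer <management-api-key>" \
  http://localhost:8080/api/v1/instances/llama2-7b/start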

Error Responses:
- 409 Conflict: Maximum number of running instances reached
- 500 Internal Server Error: Failed to start instance

Stop Instance

Stop a running instance.

POST /api/v1/instances/{name}/stop

Response:

{
  "name": "llama2-7b",
  "status": "stopping",
  "created": 1705312200
}

Restart Instance

Restart an instance (stop then start).

POST /api/v1/instances/{name}/restart

Response:

{
  "name": "llama2-7b",
  "status": "restarting",
  "created": 1705312200
}

Get Instance Logs

Retrieve instance logs.

GET /api/v1/instances/{name}/logs

Query Parameters:
- lines: Number of lines to return (defaults to all lines; -1 also returns all lines)

Response: Plain text log output

Example:

curl "http://localhost:8080/api/v1/instances/my-instance/logs?lines=100"

Proxy to Instance

Proxy HTTP requests directly to the llama-server instance.

GET /api/v1/instances/{name}/proxy/*
POST /api/v1/instances/{name}/proxy/*

This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.

Example - Check Instance Health:

curl -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model/proxy/health

This forwards the request to http://instance-host:instance-port/health on the actual llama-server instance.

Error Responses:
- 503 Service Unavailable: Instance is not running

OpenAI-Compatible API

Llamactl provides OpenAI-compatible endpoints for inference operations.

List Models

List all instances in OpenAI-compatible format.

GET /v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama2-7b",
      "object": "model",
      "created": 1705312200,
      "owned_by": "llamactl"
    }
  ]
}

Chat Completions, Completions, Embeddings, Reranking

All OpenAI-compatible inference endpoints are available:

POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
POST /v1/rerank
POST /v1/reranking

Request Body: Standard OpenAI request format, with the model field specifying the instance name

Example:

{
  "model": "llama2-7b",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ]
}

The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.
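
For instance, an embeddings request uses the same routing convention (standard OpenAI request shape; embedding support depends on the underlying llama-server instance):

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <inference-api-key>" \
  -d '{
    "model": "llama2-7b",
    "input": "Hello, world!"
  }'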

Error Responses:
- 400 Bad Request: Invalid request body or missing model name
- 503 Service Unavailable: Instance is not running and on-demand start is disabled
- 409 Conflict: Cannot start instance due to maximum instances limit

Instance Status Values

Instances can have the following status values:
- stopped: Instance is not running
- running: Instance is running and ready to accept requests
- failed: Instance failed to start or crashed

Error Responses

All endpoints may return error responses in the following format:

{
  "error": "Error message description"
}

Common HTTP Status Codes

  • 200: Success
  • 201: Created
  • 204: No Content (successful deletion)
  • 400: Bad Request (invalid parameters or request body)
  • 401: Unauthorized (missing or invalid API key)
  • 403: Forbidden (insufficient permissions)
  • 404: Not Found (instance not found)
  • 409: Conflict (instance already exists, max instances reached)
  • 500: Internal Server Error
  • 503: Service Unavailable (instance not running)

Examples

Complete Instance Lifecycle

# Create and start instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "/models/llama-2-7b.gguf"
  }'

# Check instance status
curl -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model

# Get instance logs
curl -H "Authorization: Bearer your-api-key" \
  "http://localhost:8080/api/v1/instances/my-model/logs?lines=50"

# Use OpenAI-compatible chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-inference-api-key" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
  }'

# Stop instance
curl -X POST -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model/stop

# Delete instance
curl -X DELETE -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model

Using the Proxy Endpoint

You can also directly proxy requests to the llama-server instance:

# Direct proxy to instance (bypasses OpenAI compatibility layer)
curl -X POST http://localhost:8080/api/v1/instances/my-model/proxy/completion \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "prompt": "Hello, world!",
    "n_predict": 50
  }'

Swagger Documentation

If Swagger documentation is enabled in the server configuration, you can access the interactive API documentation at:

http://localhost:8080/swagger/

This provides a complete interactive interface for testing all API endpoints.