instances:
  port_range: [8000, 9000]  # Port range for instances (default: [8000, 9000])
@@ -983,7 +986,7 @@
- September 18, 2025
+ September 21, 2025
diff --git a/dev/getting-started/installation/index.html b/dev/getting-started/installation/index.html
index 0cced7b..4e5053e 100644
--- a/dev/getting-started/installation/index.html
+++ b/dev/getting-started/installation/index.html
@@ -825,18 +825,30 @@
pip install mlx-lm
Note: MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc.)
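If you want to sanity-check the MLX-LM installation before configuring a backend, a quick import test is enough (a minimal sketch; it assumes the pip package mlx-lm exposes the mlx_lm module, and it only works on Apple Silicon macOS):
# Verify MLX-LM is importable
python -c "import mlx_lm; print('mlx-lm installed')"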
+
For vLLM backend:
+
vLLM provides high-throughput distributed serving for LLMs. Install vLLM:
+
# Install via pip (requires Python 3.8+, GPU required)
+pip install vllm
+
+# Or in a virtual environment (recommended)
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install vllm
+
+# For production deployments, consider container-based installation
+
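As a quick check that the vLLM install succeeded, you can import the package (a minimal sketch, assuming a CUDA-capable GPU environment and that the package exposes a version attribute):
# Verify the vLLM package is importable
python -c "import vllm; print(vllm.__version__)"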
# Linux/macOS - Get latest version and download
-LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
-curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz
-sudo mv llamactl /usr/local/bin/
-
-# Or download manually from:
-# https://github.com/lordmathis/llamactl/releases/latest
-
-# Windows - Download from releases page
+
# Linux/macOS - Get latest version and download
+LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
+curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz
+sudo mv llamactl /usr/local/bin/
+
+# Or download manually from:
+# https://github.com/lordmathis/llamactl/releases/latest
+
+# Windows - Download from releases page
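After installing the binary, you can confirm it is on your PATH; since llamactl --help is documented in the configuration section, it makes a reasonable smoke test (a sketch, assuming /usr/local/bin is on your PATH):
# Confirm the binary is reachable and prints its usage
llamactl --help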
Requirements:
@@ -844,19 +856,19 @@
- Node.js 22 or later
- Git
If you prefer to build from source:
-
# Clone the repository
-git clone https://github.com/lordmathis/llamactl.git
-cd llamactl
-
-# Build the web UI
-cd webui && npm ci && npm run build && cd ..
-
-# Build the application
-go build -o llamactl ./cmd/server
+
# Clone the repository
+git clone https://github.com/lordmathis/llamactl.git
+cd llamactl
+
+# Build the web UI
+cd webui && npm ci && npm run build && cd ..
+
+# Build the application
+go build -o llamactl ./cmd/server
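The build places a llamactl binary in the repository root; a quick way to confirm the build worked (a sketch, assuming you run it from that directory) is:
# Run the freshly built binary and print its usage
./llamactl --help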
You can also use the official OpenAI Python client:
-
from openai import OpenAI
-
-# Point the client to your Llamactl server
-client = OpenAI(
-    base_url="http://localhost:8080/v1",
-    api_key="not-needed"  # Llamactl doesn't require API keys by default
-)
-
-# Create a chat completion
-response = client.chat.completions.create(
-    model="my-model",  # Use the name of your instance
-    messages=[
-        {"role": "user", "content": "Explain quantum computing in simple terms"}
-    ],
-    max_tokens=200,
-    temperature=0.7
-)
-
-print(response.choices[0].message.content)
+
from openai import OpenAI
+
+# Point the client to your Llamactl server
+client = OpenAI(
+    base_url="http://localhost:8080/v1",
+    api_key="not-needed"  # Llamactl doesn't require API keys by default
+)
+
+# Create a chat completion
+response = client.chat.completions.create(
+    model="my-model",  # Use the name of your instance
+    messages=[
+        {"role": "user", "content": "Explain quantum computing in simple terms"}
+    ],
+    max_tokens=200,
+    temperature=0.7
+)
+
+print(response.choices[0].message.content)
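The quick start continues with listing available models; because Llamactl exposes an OpenAI-compatible API, the standard models endpoint should report running instances. A minimal sketch, assuming the endpoint is served at /v1/models:
# List running instances in OpenAI-compatible format
curl http://localhost:8080/v1/models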
@@ -992,7 +1020,7 @@
- September 3, 2025
+ September 21, 2025
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index 4b42522..33d6a9b 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Llamactl Documentation","text":"
Welcome to the Llamactl documentation! Management server and proxy for multiple llama.cpp and MLX instances with OpenAI-compatible API routing.
"},{"location":"#what-is-llamactl","title":"What is Llamactl?","text":"
Llamactl is designed to simplify the deployment and management of llama-server and MLX instances. It provides a modern solution for running multiple large language models with centralized management and multi-backend support.
\ud83d\ude80 Multiple Model Serving: Run different models simultaneously (7B for speed, 70B for quality) \ud83d\udd17 OpenAI API Compatible: Drop-in replacement - route requests by model name \ud83c\udf4e Multi-Backend Support: Native support for both llama.cpp and MLX (Apple Silicon optimized) \ud83c\udf10 Web Dashboard: Modern React UI for visual management (unlike CLI-only tools) \ud83d\udd10 API Key Authentication: Separate keys for management vs inference access \ud83d\udcca Instance Monitoring: Health checks, auto-restart, log management \u26a1 Smart Resource Management: Idle timeout, LRU eviction, and configurable instance limits \ud83d\udca1 On-Demand Instance Start: Automatically launch instances upon receiving OpenAI-compatible API requests \ud83d\udcbe State Persistence: Ensure instances remain intact across server restarts
server:\n host: \"0.0.0.0\" # Server host to bind to (default: \"0.0.0.0\")\n port: 8080 # Server port to bind to (default: 8080)\n allowed_origins: [\"*\"] # CORS allowed origins (default: [\"*\"])\n enable_swagger: false # Enable Swagger UI (default: false)\n
Environment Variables: - LLAMACTL_HOST - Server host - LLAMACTL_PORT - Server port - LLAMACTL_ALLOWED_ORIGINS - Comma-separated CORS origins - LLAMACTL_ENABLE_SWAGGER - Enable Swagger UI (true/false)
auth:\n require_inference_auth: true # Require API key for OpenAI endpoints (default: true)\n inference_keys: [] # List of valid inference API keys\n require_management_auth: true # Require API key for management endpoints (default: true)\n management_keys: [] # List of valid management API keys\n
Environment Variables: - LLAMACTL_REQUIRE_INFERENCE_AUTH - Require auth for OpenAI endpoints (true/false) - LLAMACTL_INFERENCE_KEYS - Comma-separated inference API keys - LLAMACTL_REQUIRE_MANAGEMENT_AUTH - Require auth for management endpoints (true/false) - LLAMACTL_MANAGEMENT_KEYS - Comma-separated management API keys
"},{"location":"getting-started/configuration/#command-line-options","title":"Command Line Options","text":"
View all available command line options:
llamactl --help\n
You can also override configuration using command line flags when starting llamactl.
Download the latest release from the GitHub releases page:
# Linux/macOS - Get latest version and download\nLATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '\"tag_name\":' | sed -E 's/.*\"([^\"]+)\".*/\\1/')\ncurl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz\nsudo mv llamactl /usr/local/bin/\n\n# Or download manually from:\n# https://github.com/lordmathis/llamactl/releases/latest\n\n# Windows - Download from releases page\n
"},{"location":"getting-started/installation/#option-2-build-from-source","title":"Option 2: Build from Source","text":"
Requirements: - Go 1.24 or later - Node.js 22 or later - Git
If you prefer to build from source:
# Clone the repository\ngit clone https://github.com/lordmathis/llamactl.git\ncd llamactl\n\n# Build the web UI\ncd webui && npm ci && npm run build && cd ..\n\n# Build the application\ngo build -o llamactl ./cmd/server\n
"},{"location":"getting-started/quick-start/#using-the-api","title":"Using the API","text":"
You can also manage instances via the REST API:
# List all instances\ncurl http://localhost:8080/api/instances\n\n# Create a new instance\ncurl -X POST http://localhost:8080/api/instances \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"name\": \"my-model\",\n \"model_path\": \"/path/to/model.gguf\",\n }'\n\n# Start an instance\ncurl -X POST http://localhost:8080/api/instances/my-model/start\n
Once you have an instance running, you can use it with the OpenAI-compatible chat completions endpoint:
curl -X POST http://localhost:8080/v1/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"model\": \"my-model\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"Hello! Can you help me write a Python function?\"\n }\n ],\n \"max_tokens\": 150,\n \"temperature\": 0.7\n }'\n
"},{"location":"getting-started/quick-start/#using-with-python-openai-client","title":"Using with Python OpenAI Client","text":"
You can also use the official OpenAI Python client:
from openai import OpenAI\n\n# Point the client to your Llamactl server\nclient = OpenAI(\n base_url=\"http://localhost:8080/v1\",\n api_key=\"not-needed\" # Llamactl doesn't require API keys by default\n)\n\n# Create a chat completion\nresponse = client.chat.completions.create(\n model=\"my-model\", # Use the name of your instance\n messages=[\n {\"role\": \"user\", \"content\": \"Explain quantum computing in simple terms\"}\n ],\n max_tokens=200,\n temperature=0.7\n)\n\nprint(response.choices[0].message.content)\n
"},{"location":"getting-started/quick-start/#list-available-models","title":"List Available Models","text":"
Get a list of running instances (models) in OpenAI-compatible format:
The server supports two types of API keys: - Management API Keys: Required for instance management operations (CRUD operations on instances) - Inference API Keys: Required for OpenAI-compatible inference endpoints
"},{"location":"user-guide/api-reference/#get-llama-server-help","title":"Get Llama Server Help","text":"
Get help text for the llama-server command.
GET /api/v1/server/help\n
Response: Plain text help output from llama-server --help
"},{"location":"user-guide/api-reference/#get-llama-server-version","title":"Get Llama Server Version","text":"
Get version information of the llama-server binary.
GET /api/v1/server/version\n
Response: Plain text version output from llama-server --version
"},{"location":"user-guide/api-reference/#list-available-devices","title":"List Available Devices","text":"
List available devices for llama-server.
GET /api/v1/server/devices\n
Response: Plain text device list from llama-server --list-devices
"},{"location":"user-guide/api-reference/#instances","title":"Instances","text":""},{"location":"user-guide/api-reference/#list-all-instances","title":"List All Instances","text":"
"},{"location":"user-guide/api-reference/#proxy-to-instance","title":"Proxy to Instance","text":"
Proxy HTTP requests directly to the llama-server instance.
GET /api/v1/instances/{name}/proxy/*\nPOST /api/v1/instances/{name}/proxy/*\n
This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.
All OpenAI-compatible inference endpoints are available:
POST /v1/chat/completions\nPOST /v1/completions\nPOST /v1/embeddings\nPOST /v1/rerank\nPOST /v1/reranking\n
Request Body: Standard OpenAI format with model field specifying the instance name
Example:
{\n \"model\": \"llama2-7b\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"Hello, how are you?\"\n }\n ]\n}\n
The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.
Error Responses: - 400 Bad Request: Invalid request body or missing model name - 503 Service Unavailable: Instance is not running and on-demand start is disabled - 409 Conflict: Cannot start instance due to maximum instances limit
"},{"location":"user-guide/api-reference/#instance-status-values","title":"Instance Status Values","text":"
Instances can have the following status values: - stopped: Instance is not running - running: Instance is running and ready to accept requests - failed: Instance failed to start or crashed
Health status badge (unknown, ready, error, failed)
Action buttons (start, stop, edit, logs, delete)
"},{"location":"user-guide/managing-instances/#create-instance","title":"Create Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui","title":"Via Web UI","text":"
Click the \"Create Instance\" button on the dashboard
Enter a unique Name for your instance (only required field)
Choose Backend Type:
llama.cpp: For GGUF models using llama-server
MLX: For MLX-optimized models (macOS only)
Configure model source:
For llama.cpp: GGUF model path or HuggingFace repo
For MLX: MLX model path or identifier (e.g., mlx-community/Mistral-7B-Instruct-v0.3-4bit)
Configure optional instance management settings:
Auto Restart: Automatically restart instance on failure
Max Restarts: Maximum number of restart attempts
Restart Delay: Delay in seconds between restart attempts
On Demand Start: Start instance when receiving a request to the OpenAI compatible endpoint
Idle Timeout: Minutes before stopping idle instance (set to 0 to disable)
Configure backend-specific options:
llama.cpp: Threads, context size, GPU layers, port, etc.
MLX: Temperature, top-p, adapter path, Python environment, etc.
"},{"location":"user-guide/managing-instances/#start-instance","title":"Start Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_1","title":"Via Web UI","text":"
curl -X POST http://localhost:8080/api/instances/{name}/start\n
"},{"location":"user-guide/managing-instances/#stop-instance","title":"Stop Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_2","title":"Via Web UI","text":"
curl -X POST http://localhost:8080/api/instances/{name}/stop\n
"},{"location":"user-guide/managing-instances/#edit-instance","title":"Edit Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_3","title":"Via Web UI","text":"
Configuration changes require restarting the instance to take effect.
"},{"location":"user-guide/managing-instances/#view-logs","title":"View Logs","text":""},{"location":"user-guide/managing-instances/#via-web-ui_4","title":"Via Web UI","text":"
# Get instance details\ncurl http://localhost:8080/api/instances/{name}/logs\n
"},{"location":"user-guide/managing-instances/#delete-instance","title":"Delete Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_5","title":"Via Web UI","text":"
Llamactl proxies all requests to the underlying backend instances (llama-server or MLX).
# Get instance details\ncurl http://localhost:8080/api/instances/{name}/proxy/\n
Both backends provide OpenAI-compatible endpoints. Check the respective documentation: - llama-server docs - MLX-LM docs
"},{"location":"user-guide/managing-instances/#instance-health","title":"Instance Health","text":""},{"location":"user-guide/managing-instances/#via-web-ui_6","title":"Via Web UI","text":"
The health status badge is displayed on each instance card
Problem: Instance fails to start with model loading errors
Common Solutions: - llama-server not found: Ensure llama-server binary is in PATH - Wrong model format: Ensure model is in GGUF format - Insufficient memory: Use smaller model or reduce context size - Path issues: Use absolute paths to model files
# Test your model and parameters directly with llama-server\nllama-server --model /path/to/model.gguf --port 8081 --n-gpu-layers 35\n
This helps determine if the issue is with llamactl or with the underlying llama.cpp/llama-server.
"},{"location":"user-guide/troubleshooting/#api-and-network-issues","title":"API and Network Issues","text":""},{"location":"user-guide/troubleshooting/#cors-errors","title":"CORS Errors","text":"
Problem: Web UI shows CORS errors in browser console
"},{"location":"user-guide/troubleshooting/#debugging-and-logs","title":"Debugging and Logs","text":""},{"location":"user-guide/troubleshooting/#viewing-instance-logs","title":"Viewing Instance Logs","text":"
# Get instance logs via API\ncurl http://localhost:8080/api/v1/instances/{name}/logs\n\n# Or check log files directly\ntail -f ~/.local/share/llamactl/logs/{instance-name}.log\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Llamactl Documentation","text":"
Welcome to the Llamactl documentation! Management server and proxy for multiple llama.cpp and MLX instances with OpenAI-compatible API routing.
"},{"location":"#what-is-llamactl","title":"What is Llamactl?","text":"
Llamactl is designed to simplify the deployment and management of llama-server and MLX instances. It provides a modern solution for running multiple large language models with centralized management and multi-backend support.
\ud83d\ude80 Multiple Model Serving: Run different models simultaneously (7B for speed, 70B for quality) \ud83d\udd17 OpenAI API Compatible: Drop-in replacement - route requests by model name \ud83c\udf4e Multi-Backend Support: Native support for both llama.cpp and MLX (Apple Silicon optimized) \ud83c\udf10 Web Dashboard: Modern React UI for visual management (unlike CLI-only tools) \ud83d\udd10 API Key Authentication: Separate keys for management vs inference access \ud83d\udcca Instance Monitoring: Health checks, auto-restart, log management \u26a1 Smart Resource Management: Idle timeout, LRU eviction, and configurable instance limits \ud83d\udca1 On-Demand Instance Start: Automatically launch instances upon receiving OpenAI-compatible API requests \ud83d\udcbe State Persistence: Ensure instances remain intact across server restarts
server:\n host: \"0.0.0.0\" # Server host to bind to (default: \"0.0.0.0\")\n port: 8080 # Server port to bind to (default: 8080)\n allowed_origins: [\"*\"] # CORS allowed origins (default: [\"*\"])\n enable_swagger: false # Enable Swagger UI (default: false)\n
Environment Variables: - LLAMACTL_HOST - Server host - LLAMACTL_PORT - Server port - LLAMACTL_ALLOWED_ORIGINS - Comma-separated CORS origins - LLAMACTL_ENABLE_SWAGGER - Enable Swagger UI (true/false)
auth:\n require_inference_auth: true # Require API key for OpenAI endpoints (default: true)\n inference_keys: [] # List of valid inference API keys\n require_management_auth: true # Require API key for management endpoints (default: true)\n management_keys: [] # List of valid management API keys\n
Environment Variables: - LLAMACTL_REQUIRE_INFERENCE_AUTH - Require auth for OpenAI endpoints (true/false) - LLAMACTL_INFERENCE_KEYS - Comma-separated inference API keys - LLAMACTL_REQUIRE_MANAGEMENT_AUTH - Require auth for management endpoints (true/false) - LLAMACTL_MANAGEMENT_KEYS - Comma-separated management API keys
"},{"location":"getting-started/configuration/#command-line-options","title":"Command Line Options","text":"
View all available command line options:
llamactl --help\n
You can also override configuration using command line flags when starting llamactl.
MLX provides optimized inference on Apple Silicon. Install MLX-LM:
# Install via pip (requires Python 3.8+)\npip install mlx-lm\n\n# Or in a virtual environment (recommended)\npython -m venv mlx-env\nsource mlx-env/bin/activate\npip install mlx-lm\n
Note: MLX backend is only available on macOS with Apple Silicon (M1, M2, M3, etc.)
For vLLM backend:
vLLM provides high-throughput distributed serving for LLMs. Install vLLM:
# Install via pip (requires Python 3.8+, GPU required)\npip install vllm\n\n# Or in a virtual environment (recommended)\npython -m venv vllm-env\nsource vllm-env/bin/activate\npip install vllm\n\n# For production deployments, consider container-based installation\n
Download the latest release from the GitHub releases page:
# Linux/macOS - Get latest version and download\nLATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '\"tag_name\":' | sed -E 's/.*\"([^\"]+)\".*/\\1/')\ncurl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz\nsudo mv llamactl /usr/local/bin/\n\n# Or download manually from:\n# https://github.com/lordmathis/llamactl/releases/latest\n\n# Windows - Download from releases page\n
"},{"location":"getting-started/installation/#option-2-build-from-source","title":"Option 2: Build from Source","text":"
Requirements: - Go 1.24 or later - Node.js 22 or later - Git
If you prefer to build from source:
# Clone the repository\ngit clone https://github.com/lordmathis/llamactl.git\ncd llamactl\n\n# Build the web UI\ncd webui && npm ci && npm run build && cd ..\n\n# Build the application\ngo build -o llamactl ./cmd/server\n
"},{"location":"getting-started/quick-start/#using-the-api","title":"Using the API","text":"
You can also manage instances via the REST API:
# List all instances\ncurl http://localhost:8080/api/instances\n\n# Create a new llama.cpp instance\ncurl -X POST http://localhost:8080/api/instances/my-model \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"backend_type\": \"llama_cpp\",\n \"backend_options\": {\n \"model\": \"/path/to/model.gguf\"\n }\n }'\n\n# Start an instance\ncurl -X POST http://localhost:8080/api/instances/my-model/start\n
Once you have an instance running, you can use it with the OpenAI-compatible chat completions endpoint:
curl -X POST http://localhost:8080/v1/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"model\": \"my-model\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"Hello! Can you help me write a Python function?\"\n }\n ],\n \"max_tokens\": 150,\n \"temperature\": 0.7\n }'\n
"},{"location":"getting-started/quick-start/#using-with-python-openai-client","title":"Using with Python OpenAI Client","text":"
You can also use the official OpenAI Python client:
from openai import OpenAI\n\n# Point the client to your Llamactl server\nclient = OpenAI(\n base_url=\"http://localhost:8080/v1\",\n api_key=\"not-needed\" # Llamactl doesn't require API keys by default\n)\n\n# Create a chat completion\nresponse = client.chat.completions.create(\n model=\"my-model\", # Use the name of your instance\n messages=[\n {\"role\": \"user\", \"content\": \"Explain quantum computing in simple terms\"}\n ],\n max_tokens=200,\n temperature=0.7\n)\n\nprint(response.choices[0].message.content)\n
"},{"location":"getting-started/quick-start/#list-available-models","title":"List Available Models","text":"
Get a list of running instances (models) in OpenAI-compatible format:
The server supports two types of API keys: - Management API Keys: Required for instance management operations (CRUD operations on instances) - Inference API Keys: Required for OpenAI-compatible inference endpoints
"},{"location":"user-guide/api-reference/#get-llama-server-help","title":"Get Llama Server Help","text":"
Get help text for the llama-server command.
GET /api/v1/server/help\n
Response: Plain text help output from llama-server --help
"},{"location":"user-guide/api-reference/#get-llama-server-version","title":"Get Llama Server Version","text":"
Get version information of the llama-server binary.
GET /api/v1/server/version\n
Response: Plain text version output from llama-server --version
"},{"location":"user-guide/api-reference/#list-available-devices","title":"List Available Devices","text":"
List available devices for llama-server.
GET /api/v1/server/devices\n
Response: Plain text device list from llama-server --list-devices
"},{"location":"user-guide/api-reference/#instances","title":"Instances","text":""},{"location":"user-guide/api-reference/#list-all-instances","title":"List All Instances","text":"
"},{"location":"user-guide/api-reference/#proxy-to-instance","title":"Proxy to Instance","text":"
Proxy HTTP requests directly to the llama-server instance.
GET /api/v1/instances/{name}/proxy/*\nPOST /api/v1/instances/{name}/proxy/*\n
This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.
All OpenAI-compatible inference endpoints are available:
POST /v1/chat/completions\nPOST /v1/completions\nPOST /v1/embeddings\nPOST /v1/rerank\nPOST /v1/reranking\n
Request Body: Standard OpenAI format with model field specifying the instance name
Example:
{\n \"model\": \"llama2-7b\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"Hello, how are you?\"\n }\n ]\n}\n
The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.
Error Responses: - 400 Bad Request: Invalid request body or missing model name - 503 Service Unavailable: Instance is not running and on-demand start is disabled - 409 Conflict: Cannot start instance due to maximum instances limit
"},{"location":"user-guide/api-reference/#instance-status-values","title":"Instance Status Values","text":"
Instances can have the following status values: - stopped: Instance is not running - running: Instance is running and ready to accept requests - failed: Instance failed to start or crashed
Health status badge (unknown, ready, error, failed)
Action buttons (start, stop, edit, logs, delete)
"},{"location":"user-guide/managing-instances/#create-instance","title":"Create Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui","title":"Via Web UI","text":"
Click the \"Create Instance\" button on the dashboard
Enter a unique Name for your instance (only required field)
Choose Backend Type:
llama.cpp: For GGUF models using llama-server
MLX: For MLX-optimized models (macOS only)
vLLM: For distributed serving and high-throughput inference
Configure model source:
For llama.cpp: GGUF model path or HuggingFace repo
For MLX: MLX model path or identifier (e.g., mlx-community/Mistral-7B-Instruct-v0.3-4bit)
For vLLM: HuggingFace model identifier (e.g., microsoft/DialoGPT-medium)
Configure optional instance management settings:
Auto Restart: Automatically restart instance on failure
Max Restarts: Maximum number of restart attempts
Restart Delay: Delay in seconds between restart attempts
On Demand Start: Start instance when receiving a request to the OpenAI compatible endpoint
Idle Timeout: Minutes before stopping idle instance (set to 0 to disable)
Configure backend-specific options:
llama.cpp: Threads, context size, GPU layers, port, etc.
MLX: Temperature, top-p, adapter path, Python environment, etc.
vLLM: Tensor parallel size, GPU memory utilization, quantization, etc.
"},{"location":"user-guide/managing-instances/#start-instance","title":"Start Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_1","title":"Via Web UI","text":"
curl -X POST http://localhost:8080/api/instances/{name}/start\n
"},{"location":"user-guide/managing-instances/#stop-instance","title":"Stop Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_2","title":"Via Web UI","text":"
curl -X POST http://localhost:8080/api/instances/{name}/stop\n
"},{"location":"user-guide/managing-instances/#edit-instance","title":"Edit Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_3","title":"Via Web UI","text":"
Configuration changes require restarting the instance to take effect.
"},{"location":"user-guide/managing-instances/#view-logs","title":"View Logs","text":""},{"location":"user-guide/managing-instances/#via-web-ui_4","title":"Via Web UI","text":"
# Get instance details\ncurl http://localhost:8080/api/instances/{name}/logs\n
"},{"location":"user-guide/managing-instances/#delete-instance","title":"Delete Instance","text":""},{"location":"user-guide/managing-instances/#via-web-ui_5","title":"Via Web UI","text":"
Llamactl proxies all requests to the underlying backend instances (llama-server, MLX, or vLLM).
# Get instance details\ncurl http://localhost:8080/api/instances/{name}/proxy/\n
All backends provide OpenAI-compatible endpoints. Check the respective documentation: - llama-server docs - MLX-LM docs - vLLM docs
"},{"location":"user-guide/managing-instances/#instance-health","title":"Instance Health","text":""},{"location":"user-guide/managing-instances/#via-web-ui_6","title":"Via Web UI","text":"
The health status badge is displayed on each instance card
Problem: Instance fails to start with model loading errors
Common Solutions: - llama-server not found: Ensure llama-server binary is in PATH - Wrong model format: Ensure model is in GGUF format - Insufficient memory: Use smaller model or reduce context size - Path issues: Use absolute paths to model files
# Test your model and parameters directly with llama-server\nllama-server --model /path/to/model.gguf --port 8081 --n-gpu-layers 35\n
This helps determine if the issue is with llamactl or with the underlying llama.cpp/llama-server.
"},{"location":"user-guide/troubleshooting/#api-and-network-issues","title":"API and Network Issues","text":""},{"location":"user-guide/troubleshooting/#cors-errors","title":"CORS Errors","text":"
Problem: Web UI shows CORS errors in browser console
"},{"location":"user-guide/troubleshooting/#debugging-and-logs","title":"Debugging and Logs","text":""},{"location":"user-guide/troubleshooting/#viewing-instance-logs","title":"Viewing Instance Logs","text":"
# Get instance logs via API\ncurl http://localhost:8080/api/v1/instances/{name}/logs\n\n# Or check log files directly\ntail -f ~/.local/share/llamactl/logs/{instance-name}.log\n
@@ -1443,9 +1575,9 @@
- 503 Service Unavailable: Instance is not running and on-demand start is disabled
- 409 Conflict: Cannot start instance due to maximum instances limit
Instances can have the following status values:
-- stopped: Instance is not running
-- running: Instance is running and ready to accept requests
+
Instances can have the following status values:
+- stopped: Instance is not running
+- running: Instance is running and ready to accept requests
- failed: Instance failed to start or crashed
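To see which of these status values an instance currently reports, you can list instances via the management API and inspect the status field; a sketch, assuming the list endpoint follows the /api/v1/instances pattern used elsewhere in this reference:
# List instances and check each one's status field
curl http://localhost:8080/api/v1/instances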