diff --git a/README.md b/README.md
index 4865174..7f547cc 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,8 @@
 ### ⚡ Smart Operations
 
 - **Instance Monitoring**: Health checks, auto-restart, log management
-- **Smart Resource Management**: Idle timeout, LRU eviction, and configurable instance limits
+- **Smart Resource Management**: Idle timeout, LRU eviction, and configurable instance limits
+- **Environment Variables**: Set custom environment variables per instance for advanced configuration
 
 ![Dashboard Screenshot](docs/images/dashboard.png)
 
@@ -52,7 +53,8 @@ llamactl
 2. Click "Create Instance"
 3. Choose backend type (llama.cpp, MLX, or vLLM)
 4. Set model path and backend-specific options
-5. Start or stop the instance
+5. Configure environment variables (optional)
+6. Start or stop the instance
 
 ### Or use the REST API:
 ```bash
@@ -66,10 +68,10 @@ curl -X POST localhost:8080/api/v1/instances/my-mlx-model \
   -H "Authorization: Bearer your-key" \
   -d '{"backend_type": "mlx_lm", "backend_options": {"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit"}}'
 
-# Create vLLM instance
+# Create vLLM instance with environment variables
 curl -X POST localhost:8080/api/v1/instances/my-vllm-model \
   -H "Authorization: Bearer your-key" \
-  -d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}}'
+  -d '{"backend_type": "vllm", "backend_options": {"model": "microsoft/DialoGPT-medium", "tensor_parallel_size": 2}, "environment": {"CUDA_VISIBLE_DEVICES": "0,1", "NCCL_DEBUG": "INFO"}}'
 
 # Use with OpenAI SDK
 curl -X POST localhost:8080/v1/chat/completions \
diff --git a/docs/user-guide/api-reference.md b/docs/user-guide/api-reference.md
index 348c1c0..26e01e4 100644
--- a/docs/user-guide/api-reference.md
+++ b/docs/user-guide/api-reference.md
@@ -116,7 +116,18 @@ Create and start a new instance.
 POST /api/v1/instances/{name}
 ```
 
-**Request Body:** JSON object with instance configuration. See [Managing Instances](managing-instances.md) for available configuration options.
+**Request Body:** JSON object with instance configuration. Common fields include:
+
+- `backend_type`: Backend type (`llama_cpp`, `mlx_lm`, or `vllm`)
+- `backend_options`: Backend-specific configuration
+- `auto_restart`: Enable automatic restart on failure
+- `max_restarts`: Maximum restart attempts
+- `restart_delay`: Delay between restarts in seconds
+- `on_demand_start`: Start instance when receiving requests
+- `idle_timeout`: Idle timeout in minutes
+- `environment`: Environment variables as key-value pairs
+
+See [Managing Instances](managing-instances.md) for complete configuration options.
 
 **Response:**
 ```json
@@ -354,7 +365,15 @@ curl -X POST http://localhost:8080/api/v1/instances/my-model \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer your-api-key" \
   -d '{
-    "model": "/models/llama-2-7b.gguf"
+    "backend_type": "llama_cpp",
+    "backend_options": {
+      "model": "/models/llama-2-7b.gguf",
+      "gpu_layers": 32
+    },
+    "environment": {
+      "CUDA_VISIBLE_DEVICES": "0",
+      "OMP_NUM_THREADS": "8"
+    }
   }'
 
 # Check instance status
diff --git a/docs/user-guide/managing-instances.md b/docs/user-guide/managing-instances.md
index e094d42..824c4fe 100644
--- a/docs/user-guide/managing-instances.md
+++ b/docs/user-guide/managing-instances.md
@@ -53,6 +53,7 @@ Each instance is displayed as a card showing:
    - **Restart Delay**: Delay in seconds between restart attempts
    - **On Demand Start**: Start instance when receiving a request to the OpenAI compatible endpoint
    - **Idle Timeout**: Minutes before stopping idle instance (set to 0 to disable)
+   - **Environment Variables**: Set custom environment variables for the instance process
 6. Configure backend-specific options:
    - **llama.cpp**: Threads, context size, GPU layers, port, etc.
    - **MLX**: Temperature, top-p, adapter path, Python environment, etc.
@@ -101,7 +102,12 @@ curl -X POST http://localhost:8080/api/instances/my-vllm-instance \
       "gpu_memory_utilization": 0.9
     },
     "auto_restart": true,
-    "on_demand_start": true
+    "on_demand_start": true,
+    "environment": {
+      "CUDA_VISIBLE_DEVICES": "0,1",
+      "NCCL_DEBUG": "INFO",
+      "PYTHONPATH": "/custom/path"
+    }
   }'
 
 # Create llama.cpp instance with HuggingFace model
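
Usage note (not part of the patch): a minimal end-to-end sketch of the new `environment` field, assuming a llamactl server on `localhost:8080` with API key `your-key`, a model file at `/models/llama-2-7b.gguf`, and that the OpenAI-compatible endpoint takes the instance name as the `model` value, as the README example suggests. Adjust names, paths, and keys to your setup.

```bash
# Create a llama.cpp instance, passing environment variables alongside the
# backend options (field names follow the request-body list documented above).
curl -X POST localhost:8080/api/v1/instances/my-model \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-key" \
  -d '{
    "backend_type": "llama_cpp",
    "backend_options": {"model": "/models/llama-2-7b.gguf", "gpu_layers": 32},
    "environment": {"CUDA_VISIBLE_DEVICES": "0"}
  }'

# Query it through the OpenAI-compatible endpoint; the instance name is
# assumed to double as the model identifier, per the README example.
curl -X POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-key" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Pinning GPUs with `CUDA_VISIBLE_DEVICES`, as above, is a typical use of per-instance environment variables; any key-value pairs in `environment` are exported to the instance process.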