# vLLM Backend Implementation Specification

## Overview

This specification outlines the implementation of vLLM backend support for llamactl, following the existing patterns established by the llama.cpp and MLX backends.

## 1. Backend Configuration

### Basic Details

- **Backend Type**: `vllm`
- **Executable**: `vllm` (configured via `VllmExecutable`)
- **Subcommand**: `serve` (automatically prepended to arguments)
- **Default Host/Port**: Auto-assigned by llamactl
- **Health Check**: Uses `/health` endpoint (returns HTTP 200 with no content)
- **API Compatibility**: OpenAI-compatible endpoints

### Example Command

```bash
vllm serve --enable-log-outputs --tensor-parallel-size 2 --gpu-memory-utilization 0.5 --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
```

## 2. File Structure

Following the existing backend pattern:

```
pkg/backends/vllm/
├── vllm.go          # VllmServerOptions struct and methods
├── vllm_test.go     # Unit tests for VllmServerOptions
├── parser.go        # Command parsing logic
└── parser_test.go   # Parser tests
```

## 3. Core Implementation Files

### 3.1 `pkg/backends/vllm/vllm.go`

#### VllmServerOptions Struct

```go
type VllmServerOptions struct {
	// Basic connection options (auto-assigned by llamactl)
	Host string `json:"host,omitempty"`
	Port int    `json:"port,omitempty"`

	// Core model options
	Model string `json:"model,omitempty"`

	// Common serving options
	EnableLogOutputs     bool    `json:"enable_log_outputs,omitempty"`
	TensorParallelSize   int     `json:"tensor_parallel_size,omitempty"`
	GPUMemoryUtilization float64 `json:"gpu_memory_utilization,omitempty"`

	// Additional parameters to be added based on vLLM CLI documentation,
	// following the same comprehensive approach as llamacpp.LlamaServerOptions
}
```

#### Required Methods

- `UnmarshalJSON()` - Custom unmarshaling with alternative field name support (dash-to-underscore conversion)
- `BuildCommandArgs()` - Convert the struct to command-line arguments (excluding the "serve" subcommand)
- `NewVllmServerOptions()` - Constructor with vLLM defaults

#### Field Name Mapping

Support both CLI argument names (with dashes) and programmatic names (with underscores), similar to the llama.cpp implementation:

```go
fieldMappings := map[string]string{
	"enable-log-outputs":     "enable_log_outputs",
	"tensor-parallel-size":   "tensor_parallel_size",
	"gpu-memory-utilization": "gpu_memory_utilization",
	// ... other mappings
}
```

### 3.2 `pkg/backends/vllm/parser.go`

#### ParseVllmCommand Function

Following the same pattern as `llamacpp/parser.go` and `mlx/parser.go`:

```go
func ParseVllmCommand(command string) (*VllmServerOptions, error)
```

**Supported Input Formats:**

1. `vllm serve --model MODEL_NAME --other-args`
2. `/path/to/vllm serve --model MODEL_NAME`
3. `serve --model MODEL_NAME --other-args`
4. `--model MODEL_NAME --other-args` (args only)
5. Multiline commands with backslashes

**Implementation Details:**

- Handle "serve" subcommand detection and removal
- Support quoted strings and escaped characters
- Validate command structure
- Convert parsed arguments to `VllmServerOptions`
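To make the parsing flow concrete, the following is a minimal sketch, not the final implementation. It assumes that flag values can be round-tripped through JSON and that the custom `UnmarshalJSON()` from section 3.1 handles the dash-to-underscore field name mapping; quoted strings and escaped characters, which the real parser must support, are omitted for brevity.

```go
package vllm

import (
	"encoding/json"
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// ParseVllmCommand is a simplified sketch of the parsing flow described
// above. Quoting/escaping is not handled; flag values are routed through
// JSON so UnmarshalJSON can normalize dashed names such as
// "tensor-parallel-size" to the underscored struct fields.
func ParseVllmCommand(command string) (*VllmServerOptions, error) {
	// Join multiline commands that use trailing backslashes.
	command = strings.ReplaceAll(command, "\\\n", " ")

	tokens := strings.Fields(command)
	if len(tokens) == 0 {
		return nil, fmt.Errorf("empty command")
	}

	// Strip an optional executable ("vllm" or "/path/to/vllm") and the
	// optional "serve" subcommand so only flags remain.
	if filepath.Base(tokens[0]) == "vllm" {
		tokens = tokens[1:]
	}
	if len(tokens) > 0 && tokens[0] == "serve" {
		tokens = tokens[1:]
	}

	// Collect "--flag value" pairs and bare boolean flags into a map.
	flags := make(map[string]any)
	for i := 0; i < len(tokens); i++ {
		if !strings.HasPrefix(tokens[i], "--") {
			return nil, fmt.Errorf("unexpected token %q", tokens[i])
		}
		name := strings.TrimPrefix(tokens[i], "--")
		if i+1 < len(tokens) && !strings.HasPrefix(tokens[i+1], "--") {
			value := tokens[i+1]
			// Keep numeric values numeric so they decode into int/float fields.
			if n, err := strconv.ParseFloat(value, 64); err == nil {
				flags[name] = n
			} else {
				flags[name] = value
			}
			i++
		} else {
			flags[name] = true // e.g. --enable-log-outputs
		}
	}

	// Round-trip through JSON and let UnmarshalJSON map the field names.
	raw, err := json.Marshal(flags)
	if err != nil {
		return nil, err
	}
	var opts VllmServerOptions
	if err := json.Unmarshal(raw, &opts); err != nil {
		return nil, err
	}
	return &opts, nil
}
```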
## 4. Backend Integration

### 4.1 Backend Type Definition

**File**: `pkg/backends/backend.go`

```go
const (
	BackendTypeLlamaCpp BackendType = "llama_cpp"
	BackendTypeMlxLm    BackendType = "mlx_lm"
	BackendTypeVllm     BackendType = "vllm" // ADD THIS
)
```

### 4.2 Configuration Integration

**File**: `pkg/config/config.go`

#### BackendConfig Update

```go
type BackendConfig struct {
	LlamaExecutable string `yaml:"llama_executable"`
	MLXLMExecutable string `yaml:"mlx_lm_executable"`
	VllmExecutable  string `yaml:"vllm_executable"` // ADD THIS
}
```

#### Default Configuration

- **Default Value**: `"vllm"`
- **Environment Variable**: `LLAMACTL_VLLM_EXECUTABLE`

#### Environment Variable Loading

Add to the `loadEnvVars()` function:

```go
if vllmExec := os.Getenv("LLAMACTL_VLLM_EXECUTABLE"); vllmExec != "" {
	cfg.Backends.VllmExecutable = vllmExec
}
```

### 4.3 Instance Options Integration

**File**: `pkg/instance/options.go`

#### CreateInstanceOptions Update

```go
type CreateInstanceOptions struct {
	// existing fields...
	VllmServerOptions *vllm.VllmServerOptions `json:"-"`
}
```

#### JSON Marshaling/Unmarshaling

Update the `UnmarshalJSON()` and `MarshalJSON()` methods to handle the vLLM backend in the same way as the existing backends.

#### BuildCommandArgs Implementation

```go
case backends.BackendTypeVllm:
	if c.VllmServerOptions != nil {
		// Prepend "serve" as the first argument
		args := []string{"serve"}
		args = append(args, c.VllmServerOptions.BuildCommandArgs()...)
		return args
	}
```

**Key Point**: The "serve" subcommand is handled at the instance options level, keeping the `VllmServerOptions.BuildCommandArgs()` method focused only on vLLM-specific parameters.

## 5. Health Check Integration

### 5.1 Standard Health Check for vLLM

**File**: `pkg/instance/lifecycle.go`

vLLM provides a standard `/health` endpoint that returns HTTP 200 with no content, so no modifications are needed to the existing health check logic. The current `WaitForHealthy()` method will work as-is:

```go
healthURL := fmt.Sprintf("http://%s:%d/health", host, port)
```

### 5.2 Startup Time Considerations

- vLLM typically has longer startup times than llama.cpp
- The existing configurable timeout system should handle this adequately
- Users may need to adjust `on_demand_start_timeout` for larger models

## 6. Lifecycle Integration

### 6.1 Executable Selection

**File**: `pkg/instance/lifecycle.go`

Update the `Start()` method to handle the vLLM executable:

```go
switch i.options.BackendType {
case backends.BackendTypeLlamaCpp:
	executable = i.globalBackendSettings.LlamaExecutable
case backends.BackendTypeMlxLm:
	executable = i.globalBackendSettings.MLXLMExecutable
case backends.BackendTypeVllm: // ADD THIS
	executable = i.globalBackendSettings.VllmExecutable
default:
	return fmt.Errorf("unsupported backend type: %s", i.options.BackendType)
}

args := i.options.BuildCommandArgs()
i.cmd = exec.CommandContext(i.ctx, executable, args...)
```

### 6.2 Command Execution

The final executed command will be:

```bash
vllm serve --model MODEL_NAME --other-vllm-args
```

Where:

- `vllm` comes from the `VllmExecutable` configuration
- `serve` is prepended by `CreateInstanceOptions.BuildCommandArgs()`
- The remaining arguments come from `VllmServerOptions.BuildCommandArgs()`, sketched below
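Since the final command shape depends directly on it, here is a minimal, hedged sketch of `VllmServerOptions.BuildCommandArgs()` limited to the fields declared in section 3.1 (requires `strconv`); the real implementation is expected to cover the full vLLM flag set, analogous to the llama.cpp backend.

```go
// Sketch only: covers just the fields from section 3.1. The "serve"
// subcommand is intentionally omitted here; CreateInstanceOptions
// prepends it (see section 4.3).
func (o *VllmServerOptions) BuildCommandArgs() []string {
	var args []string

	if o.Model != "" {
		args = append(args, "--model", o.Model)
	}
	if o.Host != "" {
		args = append(args, "--host", o.Host)
	}
	if o.Port != 0 {
		args = append(args, "--port", strconv.Itoa(o.Port))
	}
	if o.EnableLogOutputs {
		args = append(args, "--enable-log-outputs")
	}
	if o.TensorParallelSize != 0 {
		args = append(args, "--tensor-parallel-size", strconv.Itoa(o.TensorParallelSize))
	}
	if o.GPUMemoryUtilization != 0 {
		args = append(args, "--gpu-memory-utilization",
			strconv.FormatFloat(o.GPUMemoryUtilization, 'f', -1, 64))
	}
	return args
}
```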
## 7. Server Handler Integration

### 7.1 New Handler Method

**File**: `pkg/server/handlers.go`

```go
// ParseVllmCommand godoc
// @Summary Parse vllm serve command
// @Description Parses a vLLM serve command string into instance options
// @Tags backends
// @Security ApiKeyAuth
// @Accept json
// @Produce json
// @Param request body ParseCommandRequest true "Command to parse"
// @Success 200 {object} instance.CreateInstanceOptions "Parsed options"
// @Failure 400 {object} map[string]string "Invalid request or command"
// @Router /backends/vllm/parse-command [post]
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
	// Implementation similar to ParseMlxCommand()
	// Uses vllm.ParseVllmCommand() internally
}
```

An illustrative sketch of the handler body follows section 9.

### 7.2 Router Integration

**File**: `pkg/server/routes.go`

Add the vLLM route:

```go
r.Route("/backends", func(r chi.Router) {
	r.Route("/llama-cpp", func(r chi.Router) {
		r.Post("/parse-command", handler.ParseLlamaCommand())
	})
	r.Route("/mlx", func(r chi.Router) {
		r.Post("/parse-command", handler.ParseMlxCommand())
	})
	r.Route("/vllm", func(r chi.Router) { // ADD THIS
		r.Post("/parse-command", handler.ParseVllmCommand())
	})
})
```

## 8. Validation Integration

### 8.1 Instance Options Validation

**File**: `pkg/validation/validation.go`

Add the vLLM validation case:

```go
func ValidateInstanceOptions(options *instance.CreateInstanceOptions) error {
	// existing validation...

	switch options.BackendType {
	case backends.BackendTypeLlamaCpp:
		return validateLlamaCppOptions(options)
	case backends.BackendTypeMlxLm:
		return validateMlxOptions(options)
	case backends.BackendTypeVllm: // ADD THIS
		return validateVllmOptions(options)
	default:
		return ValidationError(fmt.Errorf("unsupported backend type: %s", options.BackendType))
	}
}

func validateVllmOptions(options *instance.CreateInstanceOptions) error {
	if options.VllmServerOptions == nil {
		return ValidationError(fmt.Errorf("vLLM server options cannot be nil for vLLM backend"))
	}

	// Basic validation following the same pattern as other backends
	if err := validateStructStrings(options.VllmServerOptions, ""); err != nil {
		return err
	}

	// Port validation
	if options.VllmServerOptions.Port < 0 || options.VllmServerOptions.Port > 65535 {
		return ValidationError(fmt.Errorf("invalid port range: %d", options.VllmServerOptions.Port))
	}

	return nil
}
```

## 9. Testing Strategy

### 9.1 Unit Tests

- **`vllm_test.go`**: Test `VllmServerOptions` marshaling/unmarshaling and `BuildCommandArgs()`
- **`parser_test.go`**: Test command parsing for various formats
- **Integration tests**: Mock vLLM commands and validate parsing

### 9.2 Test Cases

```go
func TestBuildCommandArgs_VllmBasic(t *testing.T) {
	options := VllmServerOptions{
		Model:              "microsoft/DialoGPT-medium",
		Port:               8080,
		Host:               "localhost",
		EnableLogOutputs:   true,
		TensorParallelSize: 2,
	}

	args := options.BuildCommandArgs()
	// Validate expected arguments (excluding "serve")
	if len(args) == 0 {
		t.Fatal("expected non-empty argument list")
	}
}

func TestParseVllmCommand_FullCommand(t *testing.T) {
	command := "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2"
	result, err := ParseVllmCommand(command)
	// Validate parsing results
	if err != nil {
		t.Fatalf("unexpected parse error: %v", err)
	}
	if result.Model != "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g" {
		t.Errorf("unexpected model: %q", result.Model)
	}
}
```
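Before moving on to usage examples, here is a hedged sketch of one possible body for the `ParseVllmCommand()` handler outlined in section 7.1. The `ParseCommandRequest.Command` field name and the plain `http.Error`/`encoding/json` responses are assumptions for illustration only; the real implementation should mirror the existing `ParseMlxCommand()` handler and its response helpers, with imports matching the existing handlers file.

```go
// Sketch only; assumes ParseCommandRequest exposes a Command string field
// and that the vllm, instance, and backends packages are imported as in
// the existing handlers.go.
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req ParseCommandRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}

		// Delegate the actual parsing to the backend package.
		opts, err := vllm.ParseVllmCommand(req.Command)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}

		// Wrap the parsed options the same way the other backends do.
		result := instance.CreateInstanceOptions{
			BackendType:       backends.BackendTypeVllm,
			VllmServerOptions: opts,
		}

		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(result); err != nil {
			http.Error(w, "failed to encode response", http.StatusInternalServerError)
		}
	}
}
```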
## 10. Example Usage

### 10.1 Parse Existing vLLM Command

```bash
curl -X POST http://localhost:8080/api/v1/backends/vllm/parse-command \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "command": "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2 --gpu-memory-utilization 0.5"
  }'
```

### 10.2 Create vLLM Instance

```bash
curl -X POST http://localhost:8080/api/v1/instances/my-vllm-model \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "vllm",
    "backend_options": {
      "model": "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
      "tensor_parallel_size": 2,
      "gpu_memory_utilization": 0.5,
      "enable_log_outputs": true
    }
  }'
```

### 10.3 Use via OpenAI-Compatible API

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-inference-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-vllm-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## 11. Implementation Checklist

### Phase 1: Core Backend

- [ ] Create `pkg/backends/vllm/vllm.go`
- [ ] Implement `VllmServerOptions` struct with basic fields
- [ ] Implement `BuildCommandArgs()`, `UnmarshalJSON()`, `MarshalJSON()`
- [ ] Add comprehensive field mappings for CLI args
- [ ] Create unit tests for `VllmServerOptions`

### Phase 2: Command Parsing

- [ ] Create `pkg/backends/vllm/parser.go`
- [ ] Implement `ParseVllmCommand()` function
- [ ] Handle various command input formats
- [ ] Create comprehensive parser tests
- [ ] Test edge cases and error conditions

### Phase 3: Integration

- [ ] Add `BackendTypeVllm` to `pkg/backends/backend.go`
- [ ] Update `BackendConfig` in `pkg/config/config.go`
- [ ] Add environment variable support
- [ ] Update `CreateInstanceOptions` in `pkg/instance/options.go`
- [ ] Implement `BuildCommandArgs()` with "serve" prepending

### Phase 4: Lifecycle & Health Checks

- [ ] Update executable selection in `pkg/instance/lifecycle.go`
- [ ] Test instance startup and health checking (uses existing `/health` endpoint)
- [ ] Validate command execution flow

### Phase 5: API Integration

- [ ] Add `ParseVllmCommand()` handler in `pkg/server/handlers.go`
- [ ] Add vLLM route in `pkg/server/routes.go`
- [ ] Update validation in `pkg/validation/validation.go`
- [ ] Test API endpoints

### Phase 6: Testing & Documentation

- [ ] Create comprehensive integration tests
- [ ] Test with actual vLLM installation (if available)
- [ ] Update documentation
- [ ] Test OpenAI-compatible proxy functionality

## 12. Configuration Examples

### 12.1 YAML Configuration

```yaml
backends:
  llama_executable: "llama-server"
  mlx_lm_executable: "mlx_lm.server"
  vllm_executable: "vllm"

instances:
  # ... other instance settings
```

### 12.2 Environment Variables

```bash
export LLAMACTL_VLLM_EXECUTABLE="vllm"
# OR for custom installation
export LLAMACTL_VLLM_EXECUTABLE="python -m vllm"
# OR for containerized deployment
export LLAMACTL_VLLM_EXECUTABLE="docker run --rm --gpus all vllm/vllm-openai"
```
## 13. Notes and Considerations

### 13.1 Startup Time

- vLLM instances may take significantly longer to start than llama.cpp instances
- Consider documenting recommended timeout values
- The configurable `on_demand_start_timeout` should accommodate this

### 13.2 Resource Usage

- vLLM typically requires substantial GPU memory
- No special handling is needed in llamactl (follows the existing pattern)
- Resource management is left to the user/administrator

### 13.3 Model Compatibility

- Primarily designed for HuggingFace models
- Supports various quantization formats (GPTQ, AWQ, etc.)
- Model path validation can be basic (similar to other backends)

### 13.4 Future Enhancements

- Consider adding vLLM-specific parameter validation
- Could add model download/caching features
- May want to add vLLM version detection capabilities

This specification provides a comprehensive roadmap for implementing vLLM backend support while maintaining consistency with the existing llamactl architecture.