# vLLM Backend Implementation Specification

## Overview

This specification outlines the implementation of vLLM backend support for llamactl, following the existing patterns established by the llama.cpp and MLX backends.

## 1. Backend Configuration

### Basic Details
- **Backend Type**: `vllm`
- **Executable**: `vllm` (configured via `VllmExecutable`)
- **Subcommand**: `serve` (automatically prepended to arguments)
- **Default Host/Port**: Auto-assigned by llamactl
- **Health Check**: Uses the `/health` endpoint (returns HTTP 200 with no content)
- **API Compatibility**: OpenAI-compatible endpoints

### Example Command
```bash
vllm serve --enable-log-outputs --tensor-parallel-size 2 --gpu-memory-utilization 0.5 --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
```

## 2. File Structure

Following the existing backend pattern:

```
pkg/backends/vllm/
├── vllm.go          # VllmServerOptions struct and methods
├── vllm_test.go     # Unit tests for VllmServerOptions
├── parser.go        # Command parsing logic
└── parser_test.go   # Parser tests
```

## 3. Core Implementation Files

### 3.1 `pkg/backends/vllm/vllm.go`

#### VllmServerOptions Struct
```go
type VllmServerOptions struct {
    // Basic connection options (auto-assigned by llamactl)
    Host string `json:"host,omitempty"`
    Port int    `json:"port,omitempty"`

    // Core model options
    Model string `json:"model,omitempty"`

    // Common serving options
    EnableLogOutputs     bool    `json:"enable_log_outputs,omitempty"`
    TensorParallelSize   int     `json:"tensor_parallel_size,omitempty"`
    GPUMemoryUtilization float64 `json:"gpu_memory_utilization,omitempty"`

    // Additional parameters to be added based on the vLLM CLI documentation,
    // following the same comprehensive approach as llamacpp.LlamaServerOptions.
}
```

#### Required Methods
- `UnmarshalJSON()` - Custom unmarshaling with alternative field name support (dash-to-underscore conversion)
- `BuildCommandArgs()` - Convert the struct to command-line arguments, excluding the `serve` subcommand (see the sketch below)
- `NewVllmServerOptions()` - Constructor with vLLM defaults

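A minimal sketch of the constructor and `BuildCommandArgs()`, covering only the fields listed above and assuming `strconv` is imported; the defaults shown (tensor parallel size 1, GPU memory utilization 0.9) are illustrative values that mirror vLLM's own CLI defaults, and the final implementation would cover every supported flag:

```go
func NewVllmServerOptions() *VllmServerOptions {
    return &VllmServerOptions{
        TensorParallelSize:   1,   // illustrative: vLLM's CLI default
        GPUMemoryUtilization: 0.9, // illustrative: vLLM's CLI default
    }
}

// BuildCommandArgs emits only vLLM flags; the "serve" subcommand is
// prepended later at the instance options level (see section 4.3).
func (o *VllmServerOptions) BuildCommandArgs() []string {
    var args []string
    if o.Model != "" {
        args = append(args, "--model", o.Model)
    }
    if o.Host != "" {
        args = append(args, "--host", o.Host)
    }
    if o.Port > 0 {
        args = append(args, "--port", strconv.Itoa(o.Port))
    }
    if o.EnableLogOutputs {
        args = append(args, "--enable-log-outputs")
    }
    if o.TensorParallelSize > 0 {
        args = append(args, "--tensor-parallel-size", strconv.Itoa(o.TensorParallelSize))
    }
    if o.GPUMemoryUtilization > 0 {
        args = append(args, "--gpu-memory-utilization", strconv.FormatFloat(o.GPUMemoryUtilization, 'f', -1, 64))
    }
    return args
}
```
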
#### Field Name Mapping
Support both CLI argument names (with dashes) and programmatic names (with underscores), similar to the llama.cpp implementation:
```go
fieldMappings := map[string]string{
    "enable-log-outputs":     "enable_log_outputs",
    "tensor-parallel-size":   "tensor_parallel_size",
    "gpu-memory-utilization": "gpu_memory_utilization",
    // ... other mappings
}
```

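One way to apply this mapping in `UnmarshalJSON()` is to normalize dashed keys before decoding. The following is a sketch only, assuming `fieldMappings` is available as a package-level map and `encoding/json` is imported:

```go
func (o *VllmServerOptions) UnmarshalJSON(data []byte) error {
    var raw map[string]json.RawMessage
    if err := json.Unmarshal(data, &raw); err != nil {
        return err
    }

    // Rewrite dashed keys (CLI style) to their underscored equivalents so
    // both {"tensor-parallel-size": 2} and {"tensor_parallel_size": 2} work.
    normalized := make(map[string]json.RawMessage, len(raw))
    for key, value := range raw {
        if mapped, ok := fieldMappings[key]; ok {
            key = mapped
        }
        normalized[key] = value
    }

    buf, err := json.Marshal(normalized)
    if err != nil {
        return err
    }

    // Decode into an alias type to avoid recursing back into UnmarshalJSON.
    type vllmAlias VllmServerOptions
    var alias vllmAlias
    if err := json.Unmarshal(buf, &alias); err != nil {
        return err
    }

    *o = VllmServerOptions(alias)
    return nil
}
```
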
### 3.2 `pkg/backends/vllm/parser.go`

#### ParseVllmCommand Function
Following the same pattern as `llamacpp/parser.go` and `mlx/parser.go`:

```go
func ParseVllmCommand(command string) (*VllmServerOptions, error)
```

**Supported Input Formats:**
1. `vllm serve --model MODEL_NAME --other-args`
2. `/path/to/vllm serve --model MODEL_NAME`
3. `serve --model MODEL_NAME --other-args`
4. `--model MODEL_NAME --other-args` (args only)
5. Multiline commands with backslashes

**Implementation Details:**
- Handle `serve` subcommand detection and removal
- Support quoted strings and escaped characters
- Validate command structure
- Convert parsed arguments to `VllmServerOptions` (a simplified flow is sketched below)

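A simplified sketch of that flow, assuming `strings`, `strconv`, `fmt`, and `encoding/json` are imported; the real parser should reuse the existing tokenizer shared with the other backends so that quoted strings and escape sequences are handled consistently:

```go
func ParseVllmCommand(command string) (*VllmServerOptions, error) {
    // Join backslash line continuations, then split on whitespace.
    tokens := strings.Fields(strings.ReplaceAll(command, "\\\n", " "))
    if len(tokens) == 0 {
        return nil, fmt.Errorf("empty command")
    }

    // Drop the executable ("vllm" or /path/to/vllm) and the "serve" subcommand
    // so that only flags remain.
    if !strings.HasPrefix(tokens[0], "-") && tokens[0] != "serve" {
        tokens = tokens[1:]
    }
    if len(tokens) > 0 && tokens[0] == "serve" {
        tokens = tokens[1:]
    }

    // Collect flags into a JSON-friendly map; dashed names are accepted
    // because UnmarshalJSON normalizes them.
    values := make(map[string]any)
    for i := 0; i < len(tokens); i++ {
        if !strings.HasPrefix(tokens[i], "--") {
            return nil, fmt.Errorf("unexpected token: %s", tokens[i])
        }
        name := strings.TrimPrefix(tokens[i], "--")
        if i+1 < len(tokens) && !strings.HasPrefix(tokens[i+1], "--") {
            values[name] = coerce(tokens[i+1])
            i++
        } else {
            values[name] = true // boolean flag such as --enable-log-outputs
        }
    }

    data, err := json.Marshal(values)
    if err != nil {
        return nil, err
    }
    options := &VllmServerOptions{}
    if err := json.Unmarshal(data, options); err != nil {
        return nil, err
    }
    return options, nil
}

// coerce turns a raw CLI value into an int, float, bool, or string so that it
// unmarshals into the matching struct field type.
func coerce(raw string) any {
    if n, err := strconv.Atoi(raw); err == nil {
        return n
    }
    if f, err := strconv.ParseFloat(raw, 64); err == nil {
        return f
    }
    if b, err := strconv.ParseBool(raw); err == nil {
        return b
    }
    return raw
}
```
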
## 4. Backend Integration

### 4.1 Backend Type Definition
**File**: `pkg/backends/backend.go`
```go
const (
    BackendTypeLlamaCpp BackendType = "llama_cpp"
    BackendTypeMlxLm    BackendType = "mlx_lm"
    BackendTypeVllm     BackendType = "vllm" // ADD THIS
)
```

### 4.2 Configuration Integration
**File**: `pkg/config/config.go`

#### BackendConfig Update
```go
type BackendConfig struct {
    LlamaExecutable string `yaml:"llama_executable"`
    MLXLMExecutable string `yaml:"mlx_lm_executable"`
    VllmExecutable  string `yaml:"vllm_executable"` // ADD THIS
}
```

#### Default Configuration
- **Default Value**: `"vllm"`
- **Environment Variable**: `LLAMACTL_VLLM_EXECUTABLE`

#### Environment Variable Loading
Add to the `loadEnvVars()` function:
```go
if vllmExec := os.Getenv("LLAMACTL_VLLM_EXECUTABLE"); vllmExec != "" {
    cfg.Backends.VllmExecutable = vllmExec
}
```

### 4.3 Instance Options Integration
**File**: `pkg/instance/options.go`

#### CreateInstanceOptions Update
```go
type CreateInstanceOptions struct {
    // existing fields...
    VllmServerOptions *vllm.VllmServerOptions `json:"-"`
}
```

#### JSON Marshaling/Unmarshaling
Update the `UnmarshalJSON()` and `MarshalJSON()` methods to handle the vLLM backend in the same way as the existing backends.

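For illustration, the backend-specific branch added to `UnmarshalJSON()` might look like the sketch below; `optionsData` is a hypothetical variable standing in for the raw `backend_options` JSON captured earlier in the method, mirroring how the llama.cpp and MLX branches are expected to work rather than describing the existing code:

```go
case backends.BackendTypeVllm:
    // optionsData holds the raw "backend_options" object (assumption).
    vllmOptions := &vllm.VllmServerOptions{}
    if err := json.Unmarshal(optionsData, vllmOptions); err != nil {
        return fmt.Errorf("failed to parse vLLM options: %w", err)
    }
    c.VllmServerOptions = vllmOptions
```
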
#### BuildCommandArgs Implementation
```go
case backends.BackendTypeVllm:
    if c.VllmServerOptions != nil {
        // Prepend "serve" as the first argument
        args := []string{"serve"}
        args = append(args, c.VllmServerOptions.BuildCommandArgs()...)
        return args
    }
```

**Key Point**: The `serve` subcommand is handled at the instance options level, keeping the `VllmServerOptions.BuildCommandArgs()` method focused only on vLLM-specific parameters.

## 5. Health Check Integration

### 5.1 Standard Health Check for vLLM
**File**: `pkg/instance/lifecycle.go`

vLLM provides a standard `/health` endpoint that returns HTTP 200 with no content, so no modifications are needed to the existing health check logic. The current `WaitForHealthy()` method will work as-is:

```go
healthURL := fmt.Sprintf("http://%s:%d/health", host, port)
```

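For reference, the kind of polling loop this relies on looks roughly like the sketch below (assuming `context`, `fmt`, `net/http`, and `time` are imported). This is an illustration of the existing behavior rather than new code; the function name and the one-second interval are placeholders, not details of `WaitForHealthy()` itself:

```go
func waitForHealthy(ctx context.Context, host string, port int) error {
    healthURL := fmt.Sprintf("http://%s:%d/health", host, port)
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return fmt.Errorf("health check timed out: %w", ctx.Err())
        case <-ticker.C:
            resp, err := http.Get(healthURL)
            if err != nil {
                continue // server not accepting connections yet
            }
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil // vLLM returns 200 with no body
            }
        }
    }
}
```
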
### 5.2 Startup Time Considerations
- vLLM typically has longer startup times than llama.cpp
- The existing configurable timeout system should handle this adequately
- Users may need to increase `on_demand_start_timeout` for larger models

## 6. Lifecycle Integration

### 6.1 Executable Selection
**File**: `pkg/instance/lifecycle.go`

Update the `Start()` method to handle the vLLM executable:

```go
switch i.options.BackendType {
case backends.BackendTypeLlamaCpp:
    executable = i.globalBackendSettings.LlamaExecutable
case backends.BackendTypeMlxLm:
    executable = i.globalBackendSettings.MLXLMExecutable
case backends.BackendTypeVllm: // ADD THIS
    executable = i.globalBackendSettings.VllmExecutable
default:
    return fmt.Errorf("unsupported backend type: %s", i.options.BackendType)
}

args := i.options.BuildCommandArgs()
i.cmd = exec.CommandContext(i.ctx, executable, args...)
```

### 6.2 Command Execution
The final executed command will be:
```bash
vllm serve --model MODEL_NAME --other-vllm-args
```

Where:
- `vllm` comes from the `VllmExecutable` configuration
- `serve` is prepended by `CreateInstanceOptions.BuildCommandArgs()`
- The remaining args come from `VllmServerOptions.BuildCommandArgs()`

## 7. Server Handler Integration

### 7.1 New Handler Method
**File**: `pkg/server/handlers.go`

```go
// ParseVllmCommand godoc
// @Summary Parse vllm serve command
// @Description Parses a vLLM serve command string into instance options
// @Tags backends
// @Security ApiKeyAuth
// @Accept json
// @Produce json
// @Param request body ParseCommandRequest true "Command to parse"
// @Success 200 {object} instance.CreateInstanceOptions "Parsed options"
// @Failure 400 {object} map[string]string "Invalid request or command"
// @Router /backends/vllm/parse-command [post]
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
    // Implementation similar to ParseMlxCommand()
    // Uses vllm.ParseVllmCommand() internally
}
```

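A possible body for this handler, mirroring the flow of `ParseMlxCommand()`; the `Command` field on `ParseCommandRequest` (suggested by the JSON body in section 10.1) and the plain `http.Error` responses are assumptions made for brevity, and the real handler should use whatever error helpers the other parse handlers already use:

```go
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var req ParseCommandRequest
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "invalid request body", http.StatusBadRequest)
            return
        }

        // Delegate parsing to the backend package.
        parsed, err := vllm.ParseVllmCommand(req.Command)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        options := &instance.CreateInstanceOptions{
            BackendType:       backends.BackendTypeVllm,
            VllmServerOptions: parsed,
        }

        // CreateInstanceOptions.MarshalJSON is expected to emit backend_options.
        w.Header().Set("Content-Type", "application/json")
        if err := json.NewEncoder(w).Encode(options); err != nil {
            http.Error(w, "failed to encode response", http.StatusInternalServerError)
        }
    }
}
```
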
### 7.2 Router Integration
**File**: `pkg/server/routes.go`

Add the vLLM route:
```go
r.Route("/backends", func(r chi.Router) {
    r.Route("/llama-cpp", func(r chi.Router) {
        r.Post("/parse-command", handler.ParseLlamaCommand())
    })
    r.Route("/mlx", func(r chi.Router) {
        r.Post("/parse-command", handler.ParseMlxCommand())
    })
    r.Route("/vllm", func(r chi.Router) { // ADD THIS
        r.Post("/parse-command", handler.ParseVllmCommand())
    })
})
```

## 8. Validation Integration

### 8.1 Instance Options Validation
**File**: `pkg/validation/validation.go`

Add a vLLM validation case:
```go
func ValidateInstanceOptions(options *instance.CreateInstanceOptions) error {
    // existing validation...

    switch options.BackendType {
    case backends.BackendTypeLlamaCpp:
        return validateLlamaCppOptions(options)
    case backends.BackendTypeMlxLm:
        return validateMlxOptions(options)
    case backends.BackendTypeVllm: // ADD THIS
        return validateVllmOptions(options)
    default:
        return ValidationError(fmt.Errorf("unsupported backend type: %s", options.BackendType))
    }
}

func validateVllmOptions(options *instance.CreateInstanceOptions) error {
    if options.VllmServerOptions == nil {
        return ValidationError(fmt.Errorf("vLLM server options cannot be nil for vLLM backend"))
    }

    // Basic validation following the same pattern as other backends
    if err := validateStructStrings(options.VllmServerOptions, ""); err != nil {
        return err
    }

    // Port validation
    if options.VllmServerOptions.Port < 0 || options.VllmServerOptions.Port > 65535 {
        return ValidationError(fmt.Errorf("invalid port range: %d", options.VllmServerOptions.Port))
    }

    return nil
}
```

## 9. Testing Strategy

### 9.1 Unit Tests
- **`vllm_test.go`**: Test `VllmServerOptions` marshaling/unmarshaling and `BuildCommandArgs()` (a field-name-mapping test is sketched below)
- **`parser_test.go`**: Test command parsing for the various input formats
- **Integration tests**: Mock vLLM commands and validate parsing

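For example, a test of the dash-to-underscore handling described in section 3.1 could look like this sketch (assuming `encoding/json` and `testing` are imported):

```go
func TestUnmarshalJSON_DashedFieldNames(t *testing.T) {
    // Dashed keys, as produced by the command parser, should decode into the
    // underscored struct fields.
    data := []byte(`{"model": "test-model", "tensor-parallel-size": 2, "gpu-memory-utilization": 0.5}`)

    var options VllmServerOptions
    if err := json.Unmarshal(data, &options); err != nil {
        t.Fatalf("unmarshal failed: %v", err)
    }
    if options.Model != "test-model" {
        t.Errorf("expected model %q, got %q", "test-model", options.Model)
    }
    if options.TensorParallelSize != 2 {
        t.Errorf("expected tensor_parallel_size 2, got %d", options.TensorParallelSize)
    }
    if options.GPUMemoryUtilization != 0.5 {
        t.Errorf("expected gpu_memory_utilization 0.5, got %f", options.GPUMemoryUtilization)
    }
}
```
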
### 9.2 Test Cases
```go
func TestBuildCommandArgs_VllmBasic(t *testing.T) {
    options := VllmServerOptions{
        Model:              "microsoft/DialoGPT-medium",
        Port:               8080,
        Host:               "localhost",
        EnableLogOutputs:   true,
        TensorParallelSize: 2,
    }

    args := options.BuildCommandArgs()
    // Validate expected arguments (excluding "serve")
    if len(args) == 0 || args[0] == "serve" {
        t.Fatalf("expected vLLM flags without the serve subcommand, got %v", args)
    }
}

func TestParseVllmCommand_FullCommand(t *testing.T) {
    command := "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2"
    result, err := ParseVllmCommand(command)
    // Validate parsing results
    if err != nil {
        t.Fatalf("unexpected parse error: %v", err)
    }
    if result.Model != "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g" || result.TensorParallelSize != 2 {
        t.Errorf("unexpected parse result: %+v", result)
    }
}
```

## 10. Example Usage

### 10.1 Parse Existing vLLM Command
```bash
curl -X POST http://localhost:8080/api/v1/backends/vllm/parse-command \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "command": "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2 --gpu-memory-utilization 0.5"
  }'
```

### 10.2 Create vLLM Instance
```bash
curl -X POST http://localhost:8080/api/v1/instances/my-vllm-model \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "vllm",
    "backend_options": {
      "model": "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
      "tensor_parallel_size": 2,
      "gpu_memory_utilization": 0.5,
      "enable_log_outputs": true
    }
  }'
```

### 10.3 Use via OpenAI-Compatible API
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-inference-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-vllm-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## 11. Implementation Checklist

### Phase 1: Core Backend
- [ ] Create `pkg/backends/vllm/vllm.go`
- [ ] Implement `VllmServerOptions` struct with basic fields
- [ ] Implement `BuildCommandArgs()`, `UnmarshalJSON()`, `MarshalJSON()`
- [ ] Add comprehensive field mappings for CLI args
- [ ] Create unit tests for `VllmServerOptions`

### Phase 2: Command Parsing
- [ ] Create `pkg/backends/vllm/parser.go`
- [ ] Implement `ParseVllmCommand()` function
- [ ] Handle various command input formats
- [ ] Create comprehensive parser tests
- [ ] Test edge cases and error conditions

### Phase 3: Integration
- [ ] Add `BackendTypeVllm` to `pkg/backends/backend.go`
- [ ] Update `BackendConfig` in `pkg/config/config.go`
- [ ] Add environment variable support
- [ ] Update `CreateInstanceOptions` in `pkg/instance/options.go`
- [ ] Implement `BuildCommandArgs()` with "serve" prepending

### Phase 4: Lifecycle & Health Checks
- [ ] Update executable selection in `pkg/instance/lifecycle.go`
- [ ] Test instance startup and health checking (uses existing `/health` endpoint)
- [ ] Validate command execution flow

### Phase 5: API Integration
- [ ] Add `ParseVllmCommand()` handler in `pkg/server/handlers.go`
- [ ] Add vLLM route in `pkg/server/routes.go`
- [ ] Update validation in `pkg/validation/validation.go`
- [ ] Test API endpoints

### Phase 6: Testing & Documentation
- [ ] Create comprehensive integration tests
- [ ] Test with actual vLLM installation (if available)
- [ ] Update documentation
- [ ] Test OpenAI-compatible proxy functionality

## 12. Configuration Examples

### 12.1 YAML Configuration
```yaml
backends:
  llama_executable: "llama-server"
  mlx_lm_executable: "mlx_lm.server"
  vllm_executable: "vllm"

instances:
  # ... other instance settings
```

### 12.2 Environment Variables
```bash
export LLAMACTL_VLLM_EXECUTABLE="vllm"
# OR for custom installation
export LLAMACTL_VLLM_EXECUTABLE="python -m vllm"
# OR for containerized deployment
export LLAMACTL_VLLM_EXECUTABLE="docker run --rm --gpus all vllm/vllm-openai"
```

## 13. Notes and Considerations

### 13.1 Startup Time
- vLLM instances may take significantly longer to start than llama.cpp instances
- Consider documenting recommended timeout values
- The configurable `on_demand_start_timeout` should accommodate this

### 13.2 Resource Usage
- vLLM typically requires substantial GPU memory
- No special handling is needed in llamactl (it follows the existing pattern)
- Resource management is left to the user/administrator

### 13.3 Model Compatibility
- Primarily designed for Hugging Face models
- Supports various quantization formats (GPTQ, AWQ, etc.)
- Model path validation can be basic (similar to other backends)

### 13.4 Future Enhancements
- Consider adding vLLM-specific parameter validation
- Could add model download/caching features
- May want to add vLLM version detection capabilities

This specification provides a comprehensive roadmap for implementing vLLM backend support while maintaining consistency with the existing llamactl architecture.