vLLM Backend Implementation Specification

Overview

This specification outlines the implementation of vLLM backend support for llamactl, following the existing patterns established by the llama.cpp and MLX backends.

1. Backend Configuration

Basic Details

  • Backend Type: vllm
  • Executable: vllm (configured via VllmExecutable)
  • Subcommand: serve (automatically prepended to arguments)
  • Default Host/Port: Auto-assigned by llamactl
  • Health Check: Uses /health endpoint (returns HTTP 200 with no content)
  • API Compatibility: OpenAI-compatible endpoints

Example Command

vllm serve --enable-log-outputs --tensor-parallel-size 2 --gpu-memory-utilization 0.5 --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g

2. File Structure

Following the existing backend pattern:

pkg/backends/vllm/
├── vllm.go          # VllmServerOptions struct and methods
├── vllm_test.go     # Unit tests for VllmServerOptions
├── parser.go        # Command parsing logic
└── parser_test.go   # Parser tests

3. Core Implementation Files

3.1 pkg/backends/vllm/vllm.go

VllmServerOptions Struct

type VllmServerOptions struct {
    // Basic connection options (auto-assigned by llamactl)
    Host string `json:"host,omitempty"`
    Port int    `json:"port,omitempty"`
    
    // Core model options
    Model string `json:"model,omitempty"`
    
    // Common serving options
    EnableLogOutputs     bool    `json:"enable_log_outputs,omitempty"`
    TensorParallelSize   int     `json:"tensor_parallel_size,omitempty"`
    GPUMemoryUtilization float64 `json:"gpu_memory_utilization,omitempty"`
    
    // Additional parameters to be added based on vLLM CLI documentation
    // Following the same comprehensive approach as llamacpp.LlamaServerOptions
}

Required Methods

  • UnmarshalJSON() - Custom unmarshaling with alternative field name support (dash-to-underscore conversion)
  • BuildCommandArgs() - Convert struct to command line arguments (excluding "serve" subcommand)
  • NewVllmServerOptions() - Constructor with vLLM defaults (a sketch follows this list)
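
A minimal constructor sketch (the defaults here are illustrative: port 8000, tensor-parallel-size 1 and gpu-memory-utilization 0.9 mirror vLLM's documented serve defaults, and llamactl reassigns host/port in any case):

func NewVllmServerOptions() *VllmServerOptions {
    // Defaults chosen to match vLLM's own serve defaults; llamactl will
    // override Host/Port when it assigns the instance its address.
    return &VllmServerOptions{
        Port:                 8000,
        TensorParallelSize:   1,
        GPUMemoryUtilization: 0.9,
    }
}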

Field Name Mapping

Support both CLI argument names (with dashes) and programmatic names (with underscores), similar to the llama.cpp implementation:

fieldMappings := map[string]string{
    "enable-log-outputs":       "enable_log_outputs",
    "tensor-parallel-size":     "tensor_parallel_size", 
    "gpu-memory-utilization":   "gpu_memory_utilization",
    // ... other mappings
}
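
A sketch of the custom unmarshaling built on that mapping (assumes encoding/json and a package-level fieldMappings as shown above; the llama.cpp backend's exact mechanics may differ, but the dash-to-underscore normalization is the essential step):

func (o *VllmServerOptions) UnmarshalJSON(data []byte) error {
    // Decode into a generic map first so keys can be rewritten.
    var raw map[string]any
    if err := json.Unmarshal(data, &raw); err != nil {
        return err
    }

    // Rewrite dashed CLI-style keys ("tensor-parallel-size") into the
    // underscore form the struct tags expect ("tensor_parallel_size").
    normalized := make(map[string]any, len(raw))
    for key, value := range raw {
        if mapped, ok := fieldMappings[key]; ok {
            key = mapped
        }
        normalized[key] = value
    }

    buf, err := json.Marshal(normalized)
    if err != nil {
        return err
    }

    // The alias type avoids recursing back into this UnmarshalJSON.
    type plain VllmServerOptions
    var tmp plain
    if err := json.Unmarshal(buf, &tmp); err != nil {
        return err
    }
    *o = VllmServerOptions(tmp)
    return nil
}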

3.2 pkg/backends/vllm/parser.go

ParseVllmCommand Function

Following the same pattern as llamacpp/parser.go and mlx/parser.go:

func ParseVllmCommand(command string) (*VllmServerOptions, error)

Supported Input Formats:

  1. vllm serve --model MODEL_NAME --other-args
  2. /path/to/vllm serve --model MODEL_NAME
  3. serve --model MODEL_NAME --other-args
  4. --model MODEL_NAME --other-args (args only)
  5. Multiline commands with backslashes

Implementation Details (see the sketch after this list):

  • Handle "serve" subcommand detection and removal
  • Support quoted strings and escaped characters
  • Validate command structure
  • Convert parsed arguments to VllmServerOptions
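
A rough sketch of the parsing flow under those requirements (imports: encoding/json, fmt, strconv, strings; the quote-aware tokenizer used by the existing parsers is simplified to strings.Fields here, coerceValue is a hypothetical helper introduced for this sketch, and the flag-to-struct conversion leans on the UnmarshalJSON normalization from section 3.1):

func ParseVllmCommand(command string) (*VllmServerOptions, error) {
    // Multiline commands: a trailing backslash continues the line.
    command = strings.ReplaceAll(command, "\\\n", " ")

    tokens := strings.Fields(command) // real parser must also respect quoting
    if len(tokens) == 0 {
        return nil, fmt.Errorf("empty command")
    }

    // Accept "vllm serve ...", "/path/to/vllm serve ...", "serve ..." and
    // bare "--flag ..." input by stripping the executable and subcommand.
    if strings.HasSuffix(tokens[0], "vllm") {
        tokens = tokens[1:]
    }
    if len(tokens) > 0 && tokens[0] == "serve" {
        tokens = tokens[1:]
    }

    // Collect "--flag value" / "--flag" pairs, coercing values so they can
    // be decoded through VllmServerOptions.UnmarshalJSON.
    flags := make(map[string]any)
    for i := 0; i < len(tokens); i++ {
        if !strings.HasPrefix(tokens[i], "--") {
            continue
        }
        name := strings.TrimPrefix(tokens[i], "--")
        if i+1 < len(tokens) && !strings.HasPrefix(tokens[i+1], "--") {
            flags[name] = coerceValue(tokens[i+1])
            i++
        } else {
            flags[name] = true // value-less boolean flag
        }
    }

    data, err := json.Marshal(flags)
    if err != nil {
        return nil, err
    }
    var opts VllmServerOptions
    if err := json.Unmarshal(data, &opts); err != nil {
        return nil, err
    }
    return &opts, nil
}

// coerceValue guesses a JSON-friendly type for a raw CLI token.
func coerceValue(s string) any {
    if n, err := strconv.Atoi(s); err == nil {
        return n
    }
    if f, err := strconv.ParseFloat(s, 64); err == nil {
        return f
    }
    if b, err := strconv.ParseBool(s); err == nil {
        return b
    }
    return s
}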

4. Backend Integration

4.1 Backend Type Definition

File: pkg/backends/backend.go

const (
    BackendTypeLlamaCpp BackendType = "llama_cpp"
    BackendTypeMlxLm    BackendType = "mlx_lm" 
    BackendTypeVllm     BackendType = "vllm"     // ADD THIS
)

4.2 Configuration Integration

File: pkg/config/config.go

BackendConfig Update

type BackendConfig struct {
    LlamaExecutable string `yaml:"llama_executable"`
    MLXLMExecutable string `yaml:"mlx_lm_executable"`
    VllmExecutable  string `yaml:"vllm_executable"`  // ADD THIS
}

Default Configuration

  • Default Value: "vllm" (set in the default configuration; see the sketch below)
  • Environment Variable: LLAMACTL_VLLM_EXECUTABLE
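
Wherever llamactl constructs its default configuration (the exact location is assumed here), the new field only needs a default value alongside the existing executables:

// In the default BackendConfig (constructor/location assumed):
Backends: BackendConfig{
    LlamaExecutable: "llama-server",
    MLXLMExecutable: "mlx_lm.server",
    VllmExecutable:  "vllm", // ADD THIS
},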

Environment Variable Loading

Add to loadEnvVars() function:

if vllmExec := os.Getenv("LLAMACTL_VLLM_EXECUTABLE"); vllmExec != "" {
    cfg.Backends.VllmExecutable = vllmExec
}

4.3 Instance Options Integration

File: pkg/instance/options.go

CreateInstanceOptions Update

type CreateInstanceOptions struct {
    // existing fields...
    VllmServerOptions *vllm.VllmServerOptions `json:"-"`
}

JSON Marshaling/Unmarshaling

Update UnmarshalJSON() and MarshalJSON() methods to handle vLLM backend similar to existing backends.
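
Assuming the existing backends decode a raw backend_options map and re-marshal it into the backend-specific struct (the aux.BackendOptions name below is an assumption standing in for that intermediate value), the vLLM branch of UnmarshalJSON() would look roughly like:

case backends.BackendTypeVllm:
    if aux.BackendOptions != nil {
        data, err := json.Marshal(aux.BackendOptions)
        if err != nil {
            return fmt.Errorf("failed to marshal backend options: %w", err)
        }
        c.VllmServerOptions = &vllm.VllmServerOptions{}
        if err := json.Unmarshal(data, c.VllmServerOptions); err != nil {
            return fmt.Errorf("failed to unmarshal vLLM options: %w", err)
        }
    }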

BuildCommandArgs Implementation

case backends.BackendTypeVllm:
    if c.VllmServerOptions != nil {
        // Prepend "serve" as first argument
        args := []string{"serve"}
        args = append(args, c.VllmServerOptions.BuildCommandArgs()...)
        return args
    }

Key Point: The "serve" subcommand is handled at the instance options level, keeping the VllmServerOptions.BuildCommandArgs() method focused only on vLLM-specific parameters.
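
For illustration, a hand-written version of the backend-level method (the real implementation would more likely reuse the reflection-based tag walking from the llama.cpp backend rather than enumerating fields by hand):

// BuildCommandArgs converts the options into "--flag value" pairs. "serve"
// is deliberately not emitted here; the instance layer prepends it.
func (o *VllmServerOptions) BuildCommandArgs() []string {
    var args []string
    if o.Model != "" {
        args = append(args, "--model", o.Model)
    }
    if o.Host != "" {
        args = append(args, "--host", o.Host)
    }
    if o.Port != 0 {
        args = append(args, "--port", strconv.Itoa(o.Port))
    }
    if o.EnableLogOutputs {
        args = append(args, "--enable-log-outputs")
    }
    if o.TensorParallelSize != 0 {
        args = append(args, "--tensor-parallel-size", strconv.Itoa(o.TensorParallelSize))
    }
    if o.GPUMemoryUtilization != 0 {
        args = append(args, "--gpu-memory-utilization",
            strconv.FormatFloat(o.GPUMemoryUtilization, 'f', -1, 64))
    }
    return args
}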

5. Health Check Integration

5.1 Standard Health Check for vLLM

File: pkg/instance/lifecycle.go

vLLM provides a standard /health endpoint that returns HTTP 200 with no content, so no modifications are needed to the existing health check logic. The current WaitForHealthy() method will work as-is:

healthURL := fmt.Sprintf("http://%s:%d/health", host, port)

5.2 Startup Time Considerations

  • vLLM typically takes longer to start than llama.cpp
  • The existing configurable timeout system should handle this adequately
  • Users may need to adjust on_demand_start_timeout for larger models

6. Lifecycle Integration

6.1 Executable Selection

File: pkg/instance/lifecycle.go

Update the Start() method to handle vLLM executable:

switch i.options.BackendType {
case backends.BackendTypeLlamaCpp:
    executable = i.globalBackendSettings.LlamaExecutable
case backends.BackendTypeMlxLm:
    executable = i.globalBackendSettings.MLXLMExecutable
case backends.BackendTypeVllm:                              // ADD THIS
    executable = i.globalBackendSettings.VllmExecutable
default:
    return fmt.Errorf("unsupported backend type: %s", i.options.BackendType)
}

args := i.options.BuildCommandArgs()
i.cmd = exec.CommandContext(i.ctx, executable, args...)

6.2 Command Execution

The final executed command will be:

vllm serve --model MODEL_NAME --other-vllm-args

Where:

  • vllm comes from VllmExecutable configuration
  • serve is prepended by BuildCommandArgs()
  • Remaining args come from VllmServerOptions.BuildCommandArgs()

7. Server Handler Integration

7.1 New Handler Method

File: pkg/server/handlers.go

// ParseVllmCommand godoc
// @Summary Parse vllm serve command
// @Description Parses a vLLM serve command string into instance options
// @Tags backends
// @Security ApiKeyAuth
// @Accept json
// @Produce json
// @Param request body ParseCommandRequest true "Command to parse"
// @Success 200 {object} instance.CreateInstanceOptions "Parsed options"
// @Failure 400 {object} map[string]string "Invalid request or command"
// @Router /backends/vllm/parse-command [post]
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
    // Implementation similar to ParseMlxCommand()
    // Uses vllm.ParseVllmCommand() internally
}
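
Expanded, the stub above could look roughly like this (ParseCommandRequest comes from the existing handlers; its Command field and the plain http.Error handling are assumptions about those utilities):

func (h *Handler) ParseVllmCommand() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var req ParseCommandRequest // Command field assumed, per the shared request type
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "invalid request body", http.StatusBadRequest)
            return
        }

        parsed, err := vllm.ParseVllmCommand(req.Command)
        if err != nil {
            http.Error(w, "failed to parse command: "+err.Error(), http.StatusBadRequest)
            return
        }

        options := &instance.CreateInstanceOptions{
            BackendType:       backends.BackendTypeVllm,
            VllmServerOptions: parsed,
        }

        w.Header().Set("Content-Type", "application/json")
        if err := json.NewEncoder(w).Encode(options); err != nil {
            http.Error(w, "failed to encode response", http.StatusInternalServerError)
        }
    }
}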

7.2 Router Integration

File: pkg/server/routes.go

Add vLLM route:

r.Route("/backends", func(r chi.Router) {
    r.Route("/llama-cpp", func(r chi.Router) {
        r.Post("/parse-command", handler.ParseLlamaCommand())
    })
    r.Route("/mlx", func(r chi.Router) {
        r.Post("/parse-command", handler.ParseMlxCommand())
    })
    r.Route("/vllm", func(r chi.Router) {      // ADD THIS
        r.Post("/parse-command", handler.ParseVllmCommand())
    })
})

8. Validation Integration

8.1 Instance Options Validation

File: pkg/validation/validation.go

Add vLLM validation case:

func ValidateInstanceOptions(options *instance.CreateInstanceOptions) error {
    // existing validation...
    
    switch options.BackendType {
    case backends.BackendTypeLlamaCpp:
        return validateLlamaCppOptions(options)
    case backends.BackendTypeMlxLm:
        return validateMlxOptions(options)
    case backends.BackendTypeVllm:          // ADD THIS
        return validateVllmOptions(options)
    default:
        return ValidationError(fmt.Errorf("unsupported backend type: %s", options.BackendType))
    }
}

func validateVllmOptions(options *instance.CreateInstanceOptions) error {
    if options.VllmServerOptions == nil {
        return ValidationError(fmt.Errorf("vLLM server options cannot be nil for vLLM backend"))
    }
    
    // Basic validation following the same pattern as other backends
    if err := validateStructStrings(options.VllmServerOptions, ""); err != nil {
        return err
    }
    
    // Port validation
    if options.VllmServerOptions.Port < 0 || options.VllmServerOptions.Port > 65535 {
        return ValidationError(fmt.Errorf("invalid port range: %d", options.VllmServerOptions.Port))
    }
    
    return nil
}

9. Testing Strategy

9.1 Unit Tests

  • vllm_test.go: Test VllmServerOptions marshaling/unmarshaling, BuildCommandArgs()
  • parser_test.go: Test command parsing for various formats
  • Integration tests: Mock vLLM commands and validate parsing

9.2 Test Cases

func TestBuildCommandArgs_VllmBasic(t *testing.T) {
    options := VllmServerOptions{
        Model:              "microsoft/DialoGPT-medium",
        Port:               8080,
        Host:               "localhost", 
        EnableLogOutputs:   true,
        TensorParallelSize: 2,
    }
    
    args := options.BuildCommandArgs()
    // Validate expected arguments (excluding "serve")
}

func TestParseVllmCommand_FullCommand(t *testing.T) {
    command := "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2"
    result, err := ParseVllmCommand(command)
    // Validate parsing results
}

10. Example Usage

10.1 Parse Existing vLLM Command

curl -X POST http://localhost:8080/api/v1/backends/vllm/parse-command \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "command": "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2 --gpu-memory-utilization 0.5"
  }'

10.2 Create vLLM Instance

curl -X POST http://localhost:8080/api/v1/instances/my-vllm-model \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "vllm",
    "backend_options": {
      "model": "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
      "tensor_parallel_size": 2,
      "gpu_memory_utilization": 0.5,
      "enable_log_outputs": true
    }
  }'

10.3 Use via OpenAI-Compatible API

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-inference-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-vllm-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

11. Implementation Checklist

Phase 1: Core Backend

  • Create pkg/backends/vllm/vllm.go
  • Implement VllmServerOptions struct with basic fields
  • Implement BuildCommandArgs(), UnmarshalJSON(), MarshalJSON()
  • Add comprehensive field mappings for CLI args
  • Create unit tests for VllmServerOptions

Phase 2: Command Parsing

  • Create pkg/backends/vllm/parser.go
  • Implement ParseVllmCommand() function
  • Handle various command input formats
  • Create comprehensive parser tests
  • Test edge cases and error conditions

Phase 3: Integration

  • Add BackendTypeVllm to pkg/backends/backend.go
  • Update BackendConfig in pkg/config/config.go
  • Add environment variable support
  • Update CreateInstanceOptions in pkg/instance/options.go
  • Implement BuildCommandArgs() with "serve" prepending

Phase 4: Lifecycle & Health Checks

  • Update executable selection in pkg/instance/lifecycle.go
  • Test instance startup and health checking (uses existing /health endpoint)
  • Validate command execution flow

Phase 5: API Integration

  • Add ParseVllmCommand() handler in pkg/server/handlers.go
  • Add vLLM route in pkg/server/routes.go
  • Update validation in pkg/validation/validation.go
  • Test API endpoints

Phase 6: Testing & Documentation

  • Create comprehensive integration tests
  • Test with actual vLLM installation (if available)
  • Update documentation
  • Test OpenAI-compatible proxy functionality

12. Configuration Examples

12.1 YAML Configuration

backends:
  llama_executable: "llama-server"
  mlx_lm_executable: "mlx_lm.server"
  vllm_executable: "vllm"

instances:
  # ... other instance settings

12.2 Environment Variables

export LLAMACTL_VLLM_EXECUTABLE="vllm"
# OR for a custom installation
export LLAMACTL_VLLM_EXECUTABLE="python -m vllm"
# OR for containerized deployment
export LLAMACTL_VLLM_EXECUTABLE="docker run --rm --gpus all vllm/vllm-openai"

Note: multi-word values such as the python or docker forms only work if the executable string is split into a program and its leading arguments before it reaches exec.CommandContext(); the Start() pattern in section 6.1 passes the value as a single path, so a wrapper script is the safer choice there.

13. Notes and Considerations

13.1 Startup Time

  • vLLM instances may take significantly longer to start than llama.cpp instances
  • Consider documenting recommended timeout values
  • The configurable on_demand_start_timeout should accommodate this

13.2 Resource Usage

  • vLLM typically requires substantial GPU memory
  • No special handling needed in llamactl (follows existing pattern)
  • Resource management is left to the user/administrator

13.3 Model Compatibility

  • Primarily designed for HuggingFace models
  • Supports various quantization formats (GPTQ, AWQ, etc.)
  • Model path validation can be basic (similar to other backends)

13.4 Future Enhancements

  • Consider adding vLLM-specific parameter validation
  • Could add model download/caching features
  • May want to add vLLM version detection capabilities

This specification provides a comprehensive roadmap for implementing vLLM backend support while maintaining consistency with the existing llamactl architecture.