vLLM Backend Implementation Specification

Overview

This specification outlines the implementation of vLLM backend support for llamactl, following the existing patterns established by the llama.cpp and MLX backends.

1. Backend Configuration

Basic Details

  • Backend Type: vllm
  • Executable: vllm (configured via VllmExecutable)
  • Subcommand: serve (automatically prepended to arguments)
  • Default Host/Port: Auto-assigned by llamactl
  • Health Check: Uses /health endpoint (returns HTTP 200 with no content)
  • API Compatibility: OpenAI-compatible endpoints

Example Command

vllm serve --enable-log-outputs --tensor-parallel-size 2 --gpu-memory-utilization 0.5 --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g

2. File Structure

Following the existing backend pattern:

pkg/backends/vllm/
├── vllm.go          # VllmServerOptions struct and methods
├── vllm_test.go     # Unit tests for VllmServerOptions
├── parser.go        # Command parsing logic
└── parser_test.go   # Parser tests

3. Core Implementation Files

3.1 pkg/backends/vllm/vllm.go

VllmServerOptions Struct

type VllmServerOptions struct {
    // Basic connection options (auto-assigned by llamactl)
    Host string `json:"host,omitempty"`
    Port int    `json:"port,omitempty"`
    
    // Core model options
    Model string `json:"model,omitempty"`
    
    // Common serving options
    EnableLogOutputs     bool    `json:"enable_log_outputs,omitempty"`
    TensorParallelSize   int     `json:"tensor_parallel_size,omitempty"`
    GPUMemoryUtilization float64 `json:"gpu_memory_utilization,omitempty"`
    
    // Additional parameters to be added based on vLLM CLI documentation
    // Following the same comprehensive approach as llamacpp.LlamaServerOptions
}

Required Methods

  • UnmarshalJSON() - Custom unmarshaling with alternative field name support (dash-to-underscore conversion)
  • BuildCommandArgs() - Convert struct to command line arguments (excluding "serve" subcommand)
  • NewVllmServerOptions() - Constructor with vLLM defaults (a sketch follows this list)
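
A minimal constructor sketch (the defaults here are illustrative: port 8000, tensor-parallel-size 1 and gpu-memory-utilization 0.9 mirror vLLM's documented serve defaults, and llamactl reassigns host/port in any case):

func NewVllmServerOptions() *VllmServerOptions {
    // Defaults chosen to match vLLM's own serve defaults; llamactl will
    // override Host/Port when it assigns the instance its address.
    return &VllmServerOptions{
        Port:                 8000,
        TensorParallelSize:   1,
        GPUMemoryUtilization: 0.9,
    }
}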

Field Name Mapping

Support both CLI argument names (with dashes) and programmatic names (with underscores), similar to the llama.cpp implementation:

fieldMappings := map[string]string{
    "enable-log-outputs":       "enable_log_outputs",
    "tensor-parallel-size":     "tensor_parallel_size", 
    "gpu-memory-utilization":   "gpu_memory_utilization",
    // ... other mappings
}
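
A sketch of the custom unmarshaling built on that mapping (assumes encoding/json and a package-level fieldMappings as shown above; the llama.cpp backend's exact mechanics may differ, but the dash-to-underscore normalization is the essential step):

func (o *VllmServerOptions) UnmarshalJSON(data []byte) error {
    // Decode into a generic map first so keys can be rewritten.
    var raw map[string]any
    if err := json.Unmarshal(data, &raw); err != nil {
        return err
    }

    // Rewrite dashed CLI-style keys ("tensor-parallel-size") into the
    // underscore form the struct tags expect ("tensor_parallel_size").
    normalized := make(map[string]any, len(raw))
    for key, value := range raw {
        if mapped, ok := fieldMappings[key]; ok {
            key = mapped
        }
        normalized[key] = value
    }

    buf, err := json.Marshal(normalized)
    if err != nil {
        return err
    }

    // The alias type avoids recursing back into this UnmarshalJSON.
    type plain VllmServerOptions
    var tmp plain
    if err := json.Unmarshal(buf, &tmp); err != nil {
        return err
    }
    *o = VllmServerOptions(tmp)
    return nil
}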

3.2 pkg/backends/vllm/parser.go

ParseVllmCommand Function

Following the same pattern as llamacpp/parser.go and mlx/parser.go:

func ParseVllmCommand(command string) (*VllmServerOptions, error)

Supported Input Formats:

  1. vllm serve --model MODEL_NAME --other-args
  2. /path/to/vllm serve --model MODEL_NAME
  3. serve --model MODEL_NAME --other-args
  4. --model MODEL_NAME --other-args (args only)
  5. Multiline commands with backslashes

Implementation Details (see the sketch after this list):

  • Handle "serve" subcommand detection and removal
  • Support quoted strings and escaped characters
  • Validate command structure
  • Convert parsed arguments to VllmServerOptions
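
A rough sketch of the parsing flow under those requirements (imports: encoding/json, fmt, strconv, strings; the quote-aware tokenizer used by the existing parsers is simplified to strings.Fields here, coerceValue is a hypothetical helper introduced for this sketch, and the flag-to-struct conversion leans on the UnmarshalJSON normalization from section 3.1):

func ParseVllmCommand(command string) (*VllmServerOptions, error) {
    // Multiline commands: a trailing backslash continues the line.
    command = strings.ReplaceAll(command, "\\\n", " ")

    tokens := strings.Fields(command) // real parser must also respect quoting
    if len(tokens) == 0 {
        return nil, fmt.Errorf("empty command")
    }

    // Accept "vllm serve ...", "/path/to/vllm serve ...", "serve ..." and
    // bare "--flag ..." input by stripping the executable and subcommand.
    if strings.HasSuffix(tokens[0], "vllm") {
        tokens = tokens[1:]
    }
    if len(tokens) > 0 && tokens[0] == "serve" {
        tokens = tokens[1:]
    }

    // Collect "--flag value" / "--flag" pairs, coercing values so they can
    // be decoded through VllmServerOptions.UnmarshalJSON.
    flags := make(map[string]any)
    for i := 0; i < len(tokens); i++ {
        if !strings.HasPrefix(tokens[i], "--") {
            continue
        }
        name := strings.TrimPrefix(tokens[i], "--")
        if i+1 < len(tokens) && !strings.HasPrefix(tokens[i+1], "--") {
            flags[name] = coerceValue(tokens[i+1])
            i++
        } else {
            flags[name] = true // value-less boolean flag
        }
    }

    data, err := json.Marshal(flags)
    if err != nil {
        return nil, err
    }
    var opts VllmServerOptions
    if err := json.Unmarshal(data, &opts); err != nil {
        return nil, err
    }
    return &opts, nil
}

// coerceValue guesses a JSON-friendly type for a raw CLI token.
func coerceValue(s string) any {
    if n, err := strconv.Atoi(s); err == nil {
        return n
    }
    if f, err := strconv.ParseFloat(s, 64); err == nil {
        return f
    }
    if b, err := strconv.ParseBool(s); err == nil {
        return b
    }
    return s
}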

4. Backend Integration

4.1 Backend Type Definition

File: pkg/backends/backend.go

const (
    BackendTypeLlamaCpp BackendType = "llama_cpp"
    BackendTypeMlxLm    BackendType = "mlx_lm" 
    BackendTypeVllm     BackendType = "vllm"     // ADD THIS
)

4.2 Configuration Integration

File: pkg/config/config.go

BackendConfig Update

type BackendConfig struct {
    LlamaExecutable string `yaml:"llama_executable"`
    MLXLMExecutable string `yaml:"mlx_lm_executable"`
    VllmExecutable  string `yaml:"vllm_executable"`  // ADD THIS
}

Default Configuration

  • Default Value: "vllm" (set in the default configuration; see the sketch below)
  • Environment Variable: LLAMACTL_VLLM_EXECUTABLE
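
Wherever llamactl constructs its default configuration (the exact location is assumed here), the new field only needs a default value alongside the existing executables:

// In the default BackendConfig (constructor/location assumed):
Backends: BackendConfig{
    LlamaExecutable: "llama-server",
    MLXLMExecutable: "mlx_lm.server",
    VllmExecutable:  "vllm", // ADD THIS
},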

Environment Variable Loading

Add to loadEnvVars() function:

if vllmExec := os.Getenv("LLAMACTL_VLLM_EXECUTABLE"); vllmExec != "" {
    cfg.Backends.VllmExecutable = vllmExec
}

4.3 Instance Options Integration

File: pkg/instance/options.go

CreateInstanceOptions Update

type CreateInstanceOptions struct {
    // existing fields...
    VllmServerOptions *vllm.VllmServerOptions `json:"-"`
}

JSON Marshaling/Unmarshaling

Update UnmarshalJSON() and MarshalJSON() methods to handle vLLM backend similar to existing backends.
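
Assuming the existing backends decode a raw backend_options map and re-marshal it into the backend-specific struct (the aux.BackendOptions name below is an assumption standing in for that intermediate value), the vLLM branch of UnmarshalJSON() would look roughly like:

case backends.BackendTypeVllm:
    if aux.BackendOptions != nil {
        data, err := json.Marshal(aux.BackendOptions)
        if err != nil {
            return fmt.Errorf("failed to marshal backend options: %w", err)
        }
        c.VllmServerOptions = &vllm.VllmServerOptions{}
        if err := json.Unmarshal(data, c.VllmServerOptions); err != nil {
            return fmt.Errorf("failed to unmarshal vLLM options: %w", err)
        }
    }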

BuildCommandArgs Implementation

case backends.BackendTypeVllm:
    if c.VllmServerOptions != nil {
        // Prepend "serve" as first argument
        args := []string{"serve"}
        args = append(args, c.VllmServerOptions.BuildCommandArgs()...)
        return args
    }

Key Point: The "serve" subcommand is handled at the instance options level, keeping the VllmServerOptions.BuildCommandArgs() method focused only on vLLM-specific parameters.
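
For illustration, a hand-written version of the backend-level method (the real implementation would more likely reuse the reflection-based tag walking from the llama.cpp backend rather than enumerating fields by hand):

// BuildCommandArgs converts the options into "--flag value" pairs. "serve"
// is deliberately not emitted here; the instance layer prepends it.
func (o *VllmServerOptions) BuildCommandArgs() []string {
    var args []string
    if o.Model != "" {
        args = append(args, "--model", o.Model)
    }
    if o.Host != "" {
        args = append(args, "--host", o.Host)
    }
    if o.Port != 0 {
        args = append(args, "--port", strconv.Itoa(o.Port))
    }
    if o.EnableLogOutputs {
        args = append(args, "--enable-log-outputs")
    }
    if o.TensorParallelSize != 0 {
        args = append(args, "--tensor-parallel-size", strconv.Itoa(o.TensorParallelSize))
    }
    if o.GPUMemoryUtilization != 0 {
        args = append(args, "--gpu-memory-utilization",
            strconv.FormatFloat(o.GPUMemoryUtilization, 'f', -1, 64))
    }
    return args
}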

5. Health Check Integration

5.1 Standard Health Check for vLLM

File: pkg/instance/lifecycle.go

vLLM provides a standard /health endpoint that returns HTTP 200 with no content, so no modifications are needed to the existing health check logic. The current WaitForHealthy() method will work as-is:

healthURL := fmt.Sprintf("http://%s:%d/health", host, port)

5.2 Startup Time Considerations

  • vLLM typically takes longer to start than llama.cpp
  • The existing configurable timeout system should handle this adequately
  • Users may need to adjust on_demand_start_timeout for larger models

6. Lifecycle Integration

6.1 Executable Selection

File: pkg/instance/lifecycle.go

Update the Start() method to handle vLLM executable:

switch i.options.BackendType {
case backends.BackendTypeLlamaCpp:
    executable = i.globalBackendSettings.LlamaExecutable
case backends.BackendTypeMlxLm:
    executable = i.globalBackendSettings.MLXLMExecutable
case backends.BackendTypeVllm:                              // ADD THIS
    executable = i.globalBackendSettings.VllmExecutable
default:
    return fmt.Errorf("unsupported backend type: %s", i.options.BackendType)
}

args := i.options.BuildCommandArgs()
i.cmd = exec.CommandContext(i.ctx, executable, args...)

6.2 Command Execution

The final executed command will be:

vllm serve --model MODEL_NAME --other-vllm-args

Where:

  • vllm comes from VllmExecutable configuration
  • serve is prepended by BuildCommandArgs()
  • Remaining args come from VllmServerOptions.BuildCommandArgs()

7. Server Handler Integration

7.1 New Handler Method

File: pkg/server/handlers.go

// ParseVllmCommand godoc
// @Summary Parse vllm serve command
// @Description Parses a vLLM serve command string into instance options
// @Tags backends
// @Security ApiKeyAuth
// @Accept json
// @Produce json
// @Param request body ParseCommandRequest true "Command to parse"
// @Success 200 {object} instance.CreateInstanceOptions "Parsed options"
// @Failure 400 {object} map[string]string "Invalid request or command"
// @Router /backends/vllm/parse-command [post]
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
    // Implementation similar to ParseMlxCommand()
    // Uses vllm.ParseVllmCommand() internally
}
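
Expanded, the stub above could look roughly like this (ParseCommandRequest comes from the existing handlers; its Command field and the plain http.Error handling are assumptions about those utilities):

func (h *Handler) ParseVllmCommand() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var req ParseCommandRequest // Command field assumed, per the shared request type
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "invalid request body", http.StatusBadRequest)
            return
        }

        parsed, err := vllm.ParseVllmCommand(req.Command)
        if err != nil {
            http.Error(w, "failed to parse command: "+err.Error(), http.StatusBadRequest)
            return
        }

        options := &instance.CreateInstanceOptions{
            BackendType:       backends.BackendTypeVllm,
            VllmServerOptions: parsed,
        }

        w.Header().Set("Content-Type", "application/json")
        if err := json.NewEncoder(w).Encode(options); err != nil {
            http.Error(w, "failed to encode response", http.StatusInternalServerError)
        }
    }
}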

7.2 Router Integration

File: pkg/server/routes.go

Add vLLM route:

r.Route("/backends", func(r chi.Router) {
    r.Route("/llama-cpp", func(r chi.Router) {
        r.Post("/parse-command", handler.ParseLlamaCommand())
    })
    r.Route("/mlx", func(r chi.Router) {
        r.Post("/parse-command", handler.ParseMlxCommand())
    })
    r.Route("/vllm", func(r chi.Router) {      // ADD THIS
        r.Post("/parse-command", handler.ParseVllmCommand())
    })
})

8. Validation Integration

8.1 Instance Options Validation

File: pkg/validation/validation.go

Add vLLM validation case:

func ValidateInstanceOptions(options *instance.CreateInstanceOptions) error {
    // existing validation...
    
    switch options.BackendType {
    case backends.BackendTypeLlamaCpp:
        return validateLlamaCppOptions(options)
    case backends.BackendTypeMlxLm:
        return validateMlxOptions(options)
    case backends.BackendTypeVllm:          // ADD THIS
        return validateVllmOptions(options)
    default:
        return ValidationError(fmt.Errorf("unsupported backend type: %s", options.BackendType))
    }
}

func validateVllmOptions(options *instance.CreateInstanceOptions) error {
    if options.VllmServerOptions == nil {
        return ValidationError(fmt.Errorf("vLLM server options cannot be nil for vLLM backend"))
    }
    
    // Basic validation following the same pattern as other backends
    if err := validateStructStrings(options.VllmServerOptions, ""); err != nil {
        return err
    }
    
    // Port validation
    if options.VllmServerOptions.Port < 0 || options.VllmServerOptions.Port > 65535 {
        return ValidationError(fmt.Errorf("invalid port range: %d", options.VllmServerOptions.Port))
    }
    
    return nil
}

9. Testing Strategy

9.1 Unit Tests

  • vllm_test.go: Test VllmServerOptions marshaling/unmarshaling, BuildCommandArgs()
  • parser_test.go: Test command parsing for various formats
  • Integration tests: Mock vLLM commands and validate parsing

9.2 Test Cases

func TestBuildCommandArgs_VllmBasic(t *testing.T) {
    options := VllmServerOptions{
        Model:              "microsoft/DialoGPT-medium",
        Port:               8080,
        Host:               "localhost", 
        EnableLogOutputs:   true,
        TensorParallelSize: 2,
    }
    
    args := options.BuildCommandArgs()
    // Validate expected arguments (excluding "serve")
}

func TestParseVllmCommand_FullCommand(t *testing.T) {
    command := "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2"
    result, err := ParseVllmCommand(command)
    // Validate parsing results
}

10. Example Usage

10.1 Parse Existing vLLM Command

curl -X POST http://localhost:8080/api/v1/backends/vllm/parse-command \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "command": "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2 --gpu-memory-utilization 0.5"
  }'

10.2 Create vLLM Instance

curl -X POST http://localhost:8080/api/v1/instances/my-vllm-model \
  -H "Authorization: Bearer your-management-key" \
  -H "Content-Type: application/json" \
  -d '{
    "backend_type": "vllm",
    "backend_options": {
      "model": "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
      "tensor_parallel_size": 2,
      "gpu_memory_utilization": 0.5,
      "enable_log_outputs": true
    }
  }'

10.3 Use via OpenAI-Compatible API

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-inference-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-vllm-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

11. Implementation Checklist

Phase 1: Core Backend

  • Create pkg/backends/vllm/vllm.go
  • Implement VllmServerOptions struct with basic fields
  • Implement BuildCommandArgs(), UnmarshalJSON(), MarshalJSON()
  • Add comprehensive field mappings for CLI args
  • Create unit tests for VllmServerOptions

Phase 2: Command Parsing

  • Create pkg/backends/vllm/parser.go
  • Implement ParseVllmCommand() function
  • Handle various command input formats
  • Create comprehensive parser tests
  • Test edge cases and error conditions

Phase 3: Integration

  • Add BackendTypeVllm to pkg/backends/backend.go
  • Update BackendConfig in pkg/config/config.go
  • Add environment variable support
  • Update CreateInstanceOptions in pkg/instance/options.go
  • Implement BuildCommandArgs() with "serve" prepending

Phase 4: Lifecycle & Health Checks

  • Update executable selection in pkg/instance/lifecycle.go
  • Test instance startup and health checking (uses existing /health endpoint)
  • Validate command execution flow

Phase 5: API Integration

  • Add ParseVllmCommand() handler in pkg/server/handlers.go
  • Add vLLM route in pkg/server/routes.go
  • Update validation in pkg/validation/validation.go
  • Test API endpoints

Phase 6: Testing & Documentation

  • Create comprehensive integration tests
  • Test with actual vLLM installation (if available)
  • Update documentation
  • Test OpenAI-compatible proxy functionality

12. Configuration Examples

12.1 YAML Configuration

backends:
  llama_executable: "llama-server"
  mlx_lm_executable: "mlx_lm.server"
  vllm_executable: "vllm"

instances:
  # ... other instance settings

12.2 Environment Variables

export LLAMACTL_VLLM_EXECUTABLE="vllm"
# OR for a custom installation
export LLAMACTL_VLLM_EXECUTABLE="python -m vllm"
# OR for containerized deployment
export LLAMACTL_VLLM_EXECUTABLE="docker run --rm --gpus all vllm/vllm-openai"

Note: multi-word values such as the python or docker forms only work if the executable string is split into a program and its leading arguments before it reaches exec.CommandContext(); the Start() pattern in section 6.1 passes the value as a single path, so a wrapper script is the safer choice there.

13. Notes and Considerations

13.1 Startup Time

  • vLLM instances may take significantly longer to start than llama.cpp instances
  • Consider documenting recommended timeout values
  • The configurable on_demand_start_timeout should accommodate this

13.2 Resource Usage

  • vLLM typically requires substantial GPU memory
  • No special handling needed in llamactl (follows existing pattern)
  • Resource management is left to the user/administrator

13.3 Model Compatibility

  • Primarily designed for HuggingFace models
  • Supports various quantization formats (GPTQ, AWQ, etc.)
  • Model path validation can be basic (similar to other backends)

13.4 Future Enhancements

  • Consider adding vLLM-specific parameter validation
  • Could add model download/caching features
  • May want to add vLLM version detection capabilities

This specification provides a comprehensive roadmap for implementing vLLM backend support while maintaining consistency with the existing llamactl architecture.