vLLM Backend Implementation Specification
Overview
This specification outlines the implementation of vLLM backend support for llamactl, following the existing patterns established by the llama.cpp and MLX backends.
1. Backend Configuration
Basic Details
- Backend Type: vllm
- Executable: vllm (configured via VllmExecutable)
- Subcommand: serve (automatically prepended to arguments)
- Default Host/Port: Auto-assigned by llamactl
- Health Check: Uses the /health endpoint (returns HTTP 200 with no content)
- API Compatibility: OpenAI-compatible endpoints
Example Command
vllm serve --enable-log-outputs --tensor-parallel-size 2 --gpu-memory-utilization 0.5 --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
2. File Structure
Following the existing backend pattern:
pkg/backends/vllm/
├── vllm.go # VllmServerOptions struct and methods
├── vllm_test.go # Unit tests for VllmServerOptions
├── parser.go # Command parsing logic
└── parser_test.go # Parser tests
3. Core Implementation Files
3.1 pkg/backends/vllm/vllm.go
VllmServerOptions Struct
type VllmServerOptions struct {
// Basic connection options (auto-assigned by llamactl)
Host string `json:"host,omitempty"`
Port int `json:"port,omitempty"`
// Core model options
Model string `json:"model,omitempty"`
// Common serving options
EnableLogOutputs bool `json:"enable_log_outputs,omitempty"`
TensorParallelSize int `json:"tensor_parallel_size,omitempty"`
GPUMemoryUtilization float64 `json:"gpu_memory_utilization,omitempty"`
// Additional parameters to be added based on vLLM CLI documentation
// Following the same comprehensive approach as llamacpp.LlamaServerOptions
}
Required Methods
- UnmarshalJSON() - Custom unmarshaling with alternative field name support (dash-to-underscore conversion)
- BuildCommandArgs() - Convert struct to command line arguments (excluding the "serve" subcommand)
- NewVllmServerOptions() - Constructor with vLLM defaults
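A possible constructor sketch is shown below; the default values are assumptions to be checked against vLLM's documentation, and llamactl will override host/port in any case.
// Sketch only: defaults are assumptions, not confirmed vLLM defaults.
func NewVllmServerOptions() *VllmServerOptions {
	return &VllmServerOptions{
		Host:                 "127.0.0.1",
		Port:                 8000,
		TensorParallelSize:   1,
		GPUMemoryUtilization: 0.9,
	}
}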
Field Name Mapping
Support both CLI argument names (with dashes) and programmatic names (with underscores), similar to the llama.cpp implementation:
fieldMappings := map[string]string{
"enable-log-outputs": "enable_log_outputs",
"tensor-parallel-size": "tensor_parallel_size",
"gpu-memory-utilization": "gpu_memory_utilization",
// ... other mappings
}
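A minimal sketch of how UnmarshalJSON() could apply these mappings before decoding, assuming fieldMappings is declared at package level as shown above and following the alias-type pattern used elsewhere in the codebase:
// Sketch only: rewrites dashed keys to their underscore form, then decodes
// into an alias type so this method is not invoked recursively.
// Requires "encoding/json".
func (o *VllmServerOptions) UnmarshalJSON(data []byte) error {
	var raw map[string]any
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	normalized := make(map[string]any, len(raw))
	for key, value := range raw {
		if mapped, ok := fieldMappings[key]; ok {
			key = mapped
		}
		normalized[key] = value
	}
	buf, err := json.Marshal(normalized)
	if err != nil {
		return err
	}
	type alias VllmServerOptions
	var tmp alias
	if err := json.Unmarshal(buf, &tmp); err != nil {
		return err
	}
	*o = VllmServerOptions(tmp)
	return nil
}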
3.2 pkg/backends/vllm/parser.go
ParseVllmCommand Function
Following the same pattern as llamacpp/parser.go and mlx/parser.go:
func ParseVllmCommand(command string) (*VllmServerOptions, error)
Supported Input Formats:
- vllm serve --model MODEL_NAME --other-args
- /path/to/vllm serve --model MODEL_NAME
- serve --model MODEL_NAME --other-args
- --model MODEL_NAME --other-args (args only)
- Multiline commands with backslashes
Implementation Details:
- Handle "serve" subcommand detection and removal
- Support quoted strings and escaped characters
- Validate command structure
- Convert parsed arguments to VllmServerOptions
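A minimal sketch of the "serve" detection and removal step described above; quote and escape handling is omitted for brevity, and the helper name is illustrative:
// Sketch only: strips the executable prefix and the "serve" subcommand and
// joins backslash-continued lines; the resulting tokens would then be mapped
// onto VllmServerOptions fields. Requires "errors" and "strings".
func normalizeVllmTokens(command string) ([]string, error) {
	// Join multiline commands written with trailing backslashes.
	command = strings.ReplaceAll(command, "\\\n", " ")
	tokens := strings.Fields(command)
	if len(tokens) == 0 {
		return nil, errors.New("empty command")
	}
	// Drop a leading executable ("vllm", "/path/to/vllm", ...).
	if !strings.HasPrefix(tokens[0], "-") && tokens[0] != "serve" {
		tokens = tokens[1:]
	}
	// Drop the "serve" subcommand if present.
	if len(tokens) > 0 && tokens[0] == "serve" {
		tokens = tokens[1:]
	}
	return tokens, nil
}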
4. Backend Integration
4.1 Backend Type Definition
File: pkg/backends/backend.go
const (
BackendTypeLlamaCpp BackendType = "llama_cpp"
BackendTypeMlxLm BackendType = "mlx_lm"
BackendTypeVllm BackendType = "vllm" // ADD THIS
)
4.2 Configuration Integration
File: pkg/config/config.go
BackendConfig Update
type BackendConfig struct {
LlamaExecutable string `yaml:"llama_executable"`
MLXLMExecutable string `yaml:"mlx_lm_executable"`
VllmExecutable string `yaml:"vllm_executable"` // ADD THIS
}
Default Configuration
- Default Value: "vllm"
- Environment Variable: LLAMACTL_VLLM_EXECUTABLE
Environment Variable Loading
Add to loadEnvVars() function:
if vllmExec := os.Getenv("LLAMACTL_VLLM_EXECUTABLE"); vllmExec != "" {
cfg.Backends.VllmExecutable = vllmExec
}
4.3 Instance Options Integration
File: pkg/instance/options.go
CreateInstanceOptions Update
type CreateInstanceOptions struct {
// existing fields...
VllmServerOptions *vllm.VllmServerOptions `json:"-"`
}
JSON Marshaling/Unmarshaling
Update UnmarshalJSON() and MarshalJSON() methods to handle vLLM backend similar to existing backends.
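One possible shape for the vLLM branch, assuming backend options arrive as raw JSON selected by backend_type; field names other than those defined in this spec are assumptions based on the existing backends:
// Sketch only: decodes shared fields first, then the backend-specific options.
// The llama.cpp and MLX cases are elided. Requires "encoding/json".
func (c *CreateInstanceOptions) UnmarshalJSON(data []byte) error {
	type alias CreateInstanceOptions
	aux := &struct {
		BackendOptions json.RawMessage `json:"backend_options,omitempty"`
		*alias
	}{alias: (*alias)(c)}
	if err := json.Unmarshal(data, aux); err != nil {
		return err
	}
	switch c.BackendType {
	case backends.BackendTypeVllm:
		if len(aux.BackendOptions) > 0 {
			c.VllmServerOptions = &vllm.VllmServerOptions{}
			if err := json.Unmarshal(aux.BackendOptions, c.VllmServerOptions); err != nil {
				return err
			}
		}
		// ... existing llama.cpp and MLX cases unchanged
	}
	return nil
}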
BuildCommandArgs Implementation
case backends.BackendTypeVllm:
if c.VllmServerOptions != nil {
// Prepend "serve" as first argument
args := []string{"serve"}
args = append(args, c.VllmServerOptions.BuildCommandArgs()...)
return args
}
Key Point: The "serve" subcommand is handled at the instance options level, keeping the VllmServerOptions.BuildCommandArgs() method focused only on vLLM-specific parameters.
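For illustration, an explicit (non-reflective) VllmServerOptions.BuildCommandArgs() could look like the sketch below; the reflection-based approach used by the llama.cpp backend would work equally well.
// Sketch only: emits vLLM CLI flags for the fields defined so far.
// Note that "serve" is intentionally not included here. Requires "strconv".
func (o *VllmServerOptions) BuildCommandArgs() []string {
	var args []string
	if o.Model != "" {
		args = append(args, "--model", o.Model)
	}
	if o.Host != "" {
		args = append(args, "--host", o.Host)
	}
	if o.Port != 0 {
		args = append(args, "--port", strconv.Itoa(o.Port))
	}
	if o.EnableLogOutputs {
		args = append(args, "--enable-log-outputs")
	}
	if o.TensorParallelSize != 0 {
		args = append(args, "--tensor-parallel-size", strconv.Itoa(o.TensorParallelSize))
	}
	if o.GPUMemoryUtilization != 0 {
		args = append(args, "--gpu-memory-utilization", strconv.FormatFloat(o.GPUMemoryUtilization, 'f', -1, 64))
	}
	return args
}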
5. Health Check Integration
5.1 Standard Health Check for vLLM
File: pkg/instance/lifecycle.go
vLLM provides a standard /health endpoint that returns HTTP 200 with no content, so no modifications are needed to the existing health check logic. The current WaitForHealthy() method will work as-is:
healthURL := fmt.Sprintf("http://%s:%d/health", host, port)
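For reference, the existing health check amounts to polling this URL until it returns HTTP 200; the sketch below is an illustration only, since the real WaitForHealthy() already exists and needs no changes.
// Illustration only: the kind of poll WaitForHealthy() performs.
// Requires "context", "fmt", "net/http", and "time".
func pollHealth(ctx context.Context, host string, port int) error {
	healthURL := fmt.Sprintf("http://%s:%d/health", host, port)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(healthURL)
			if err != nil {
				continue // server still starting
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
	}
}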
5.2 Startup Time Considerations
- vLLM typically has longer startup times compared to llama.cpp
- The existing configurable timeout system should handle this adequately
- Users may need to adjust on_demand_start_timeout for larger models
6. Lifecycle Integration
6.1 Executable Selection
File: pkg/instance/lifecycle.go
Update the Start() method to handle vLLM executable:
switch i.options.BackendType {
case backends.BackendTypeLlamaCpp:
executable = i.globalBackendSettings.LlamaExecutable
case backends.BackendTypeMlxLm:
executable = i.globalBackendSettings.MLXLMExecutable
case backends.BackendTypeVllm: // ADD THIS
executable = i.globalBackendSettings.VllmExecutable
default:
return fmt.Errorf("unsupported backend type: %s", i.options.BackendType)
}
args := i.options.BuildCommandArgs()
i.cmd = exec.CommandContext(i.ctx, executable, args...)
6.2 Command Execution
The final executed command will be:
vllm serve --model MODEL_NAME --other-vllm-args
Where:
- vllm comes from the VllmExecutable configuration
- serve is prepended by BuildCommandArgs() at the instance options level
- Remaining args come from VllmServerOptions.BuildCommandArgs()
7. Server Handler Integration
7.1 New Handler Method
File: pkg/server/handlers.go
// ParseVllmCommand godoc
// @Summary Parse vllm serve command
// @Description Parses a vLLM serve command string into instance options
// @Tags backends
// @Security ApiKeyAuth
// @Accept json
// @Produce json
// @Param request body ParseCommandRequest true "Command to parse"
// @Success 200 {object} instance.CreateInstanceOptions "Parsed options"
// @Failure 400 {object} map[string]string "Invalid request or command"
// @Router /backends/vllm/parse-command [post]
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
// Implementation similar to ParseMlxCommand()
// Uses vllm.ParseVllmCommand() internally
}
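Expanding the stub above, one possible body using only standard-library helpers is sketched below; the real implementation should mirror ParseMlxCommand(), and the Command field on ParseCommandRequest is an assumption based on the request payload shown in section 10.1.
// Sketch only: decode the request, parse the command, and return the
// resulting instance options. Requires "encoding/json" and "net/http".
func (h *Handler) ParseVllmCommand() http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req ParseCommandRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}
		parsed, err := vllm.ParseVllmCommand(req.Command)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		options := &instance.CreateInstanceOptions{
			BackendType:       backends.BackendTypeVllm,
			VllmServerOptions: parsed,
		}
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(options); err != nil {
			http.Error(w, "failed to encode response", http.StatusInternalServerError)
		}
	}
}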
7.2 Router Integration
File: pkg/server/routes.go
Add vLLM route:
r.Route("/backends", func(r chi.Router) {
r.Route("/llama-cpp", func(r chi.Router) {
r.Post("/parse-command", handler.ParseLlamaCommand())
})
r.Route("/mlx", func(r chi.Router) {
r.Post("/parse-command", handler.ParseMlxCommand())
})
r.Route("/vllm", func(r chi.Router) { // ADD THIS
r.Post("/parse-command", handler.ParseVllmCommand())
})
})
8. Validation Integration
8.1 Instance Options Validation
File: pkg/validation/validation.go
Add vLLM validation case:
func ValidateInstanceOptions(options *instance.CreateInstanceOptions) error {
// existing validation...
switch options.BackendType {
case backends.BackendTypeLlamaCpp:
return validateLlamaCppOptions(options)
case backends.BackendTypeMlxLm:
return validateMlxOptions(options)
case backends.BackendTypeVllm: // ADD THIS
return validateVllmOptions(options)
default:
return ValidationError(fmt.Errorf("unsupported backend type: %s", options.BackendType))
}
}
func validateVllmOptions(options *instance.CreateInstanceOptions) error {
if options.VllmServerOptions == nil {
return ValidationError(fmt.Errorf("vLLM server options cannot be nil for vLLM backend"))
}
// Basic validation following the same pattern as other backends
if err := validateStructStrings(options.VllmServerOptions, ""); err != nil {
return err
}
// Port validation
if options.VllmServerOptions.Port < 0 || options.VllmServerOptions.Port > 65535 {
return ValidationError(fmt.Errorf("invalid port range: %d", options.VllmServerOptions.Port))
}
return nil
}
9. Testing Strategy
9.1 Unit Tests
- vllm_test.go: TestVllmServerOptions marshaling/unmarshaling, BuildCommandArgs()
- parser_test.go: Test command parsing for various formats
- Integration tests: Mock vLLM commands and validate parsing
9.2 Test Cases
func TestBuildCommandArgs_VllmBasic(t *testing.T) {
options := VllmServerOptions{
Model: "microsoft/DialoGPT-medium",
Port: 8080,
Host: "localhost",
EnableLogOutputs: true,
TensorParallelSize: 2,
}
args := options.BuildCommandArgs()
// Validate expected arguments (excluding "serve")
}
func TestParseVllmCommand_FullCommand(t *testing.T) {
command := "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2"
result, err := ParseVllmCommand(command)
// Validate parsing results
}
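The "validate expected arguments" step above could use a small helper like the one below; the exact flag spellings depend on the final BuildCommandArgs() implementation.
// Sketch only: checks that a flag/value pair appears in the generated args.
func containsArgPair(args []string, flag, value string) bool {
	for i := 0; i+1 < len(args); i++ {
		if args[i] == flag && args[i+1] == value {
			return true
		}
	}
	return false
}

// Example use inside TestBuildCommandArgs_VllmBasic:
//   if !containsArgPair(args, "--model", "microsoft/DialoGPT-medium") {
//       t.Errorf("expected --model with model name, got %v", args)
//   }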
10. Example Usage
10.1 Parse Existing vLLM Command
curl -X POST http://localhost:8080/api/v1/backends/vllm/parse-command \
-H "Authorization: Bearer your-management-key" \
-H "Content-Type: application/json" \
-d '{
"command": "vllm serve --model ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --tensor-parallel-size 2 --gpu-memory-utilization 0.5"
}'
10.2 Create vLLM Instance
curl -X POST http://localhost:8080/api/v1/instances/my-vllm-model \
-H "Authorization: Bearer your-management-key" \
-H "Content-Type: application/json" \
-d '{
"backend_type": "vllm",
"backend_options": {
"model": "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
"tensor_parallel_size": 2,
"gpu_memory_utilization": 0.5,
"enable_log_outputs": true
}
}'
10.3 Use via OpenAI-Compatible API
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer your-inference-key" \
-H "Content-Type: application/json" \
-d '{
"model": "my-vllm-model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
11. Implementation Checklist
Phase 1: Core Backend
- Create pkg/backends/vllm/vllm.go
- Implement VllmServerOptions struct with basic fields
- Implement BuildCommandArgs(), UnmarshalJSON(), MarshalJSON()
- Add comprehensive field mappings for CLI args
- Create unit tests for VllmServerOptions
Phase 2: Command Parsing
- Create pkg/backends/vllm/parser.go
- Implement ParseVllmCommand() function
- Handle various command input formats
- Create comprehensive parser tests
- Test edge cases and error conditions
Phase 3: Integration
- Add BackendTypeVllm to pkg/backends/backend.go
- Update BackendConfig in pkg/config/config.go
- Add environment variable support
- Update CreateInstanceOptions in pkg/instance/options.go
- Implement BuildCommandArgs() with "serve" prepending
Phase 4: Lifecycle & Health Checks
- Update executable selection in pkg/instance/lifecycle.go
- Test instance startup and health checking (uses existing /health endpoint)
- Validate command execution flow
Phase 5: API Integration
- Add ParseVllmCommand() handler in pkg/server/handlers.go
- Add vLLM route in pkg/server/routes.go
- Update validation in pkg/validation/validation.go
- Test API endpoints
Phase 6: Testing & Documentation
- Create comprehensive integration tests
- Test with actual vLLM installation (if available)
- Update documentation
- Test OpenAI-compatible proxy functionality
12. Configuration Examples
12.1 YAML Configuration
backends:
llama_executable: "llama-server"
mlx_lm_executable: "mlx_lm.server"
vllm_executable: "vllm"
instances:
# ... other instance settings
12.2 Environment Variables
export LLAMACTL_VLLM_EXECUTABLE="vllm"
# OR for custom installation
export LLAMACTL_VLLM_EXECUTABLE="python -m vllm"
# OR for containerized deployment
export LLAMACTL_VLLM_EXECUTABLE="docker run --rm --gpus all vllm/vllm-openai"
13. Notes and Considerations
13.1 Startup Time
- vLLM instances may take significantly longer to start than llama.cpp
- Consider documenting recommended timeout values
- The configurable on_demand_start_timeout should accommodate this
13.2 Resource Usage
- vLLM typically requires substantial GPU memory
- No special handling needed in llamactl (follows existing pattern)
- Resource management is left to the user/administrator
13.3 Model Compatibility
- Primarily designed for HuggingFace models
- Supports various quantization formats (GPTQ, AWQ, etc.)
- Model path validation can be basic (similar to other backends)
13.4 Future Enhancements
- Consider adding vLLM-specific parameter validation
- Could add model download/caching features
- May want to add vLLM version detection capabilities
This specification provides a comprehensive roadmap for implementing vLLM backend support while maintaining consistency with the existing llamactl architecture.