Inference Configuration#

Introduction#

This guide covers the configuration options available for inference in Oumi. The configuration system is designed to be:

  • Modular: Each aspect of inference (model, generation, remote settings) is configured separately

  • Type-safe: All configuration options are validated at runtime

  • Flexible: Supports various inference scenarios from local to remote inference

  • Extensible: Easy to add new configuration options and validate them

The configuration system is built on the InferenceConfig class, which contains all inference settings. This class is composed of several parameter classes:

  • model (ModelParams): model architecture, tokenizer, and loading settings

  • generation (GenerationParams): text generation parameters

  • engine (InferenceEngineType): which inference engine to use

  • remote_params (RemoteParams): optional settings for remote API endpoints

All configuration files in Oumi are YAML files, which provide a human-readable format for specifying inference settings. The configuration system automatically validates these files and converts them to the appropriate Python objects.
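
For example, a configuration file can be loaded and validated from Python. The sketch below assumes the parameter classes are exposed under oumi.core.configs and that InferenceConfig provides a from_yaml constructor; the file name is illustrative.

from oumi.core.configs import InferenceConfig

# Parse the YAML file and validate it against the typed parameter classes.
config = InferenceConfig.from_yaml("infer_config.yaml")

# The nested YAML sections become typed Python objects.
print(type(config.model).__name__)       # ModelParams
print(type(config.generation).__name__)  # GenerationParams
print(config.engine)                     # e.g. InferenceEngineType.VLLM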

Basic Structure#

A typical configuration file has this structure:

model:  # Model settings
  model_name: "meta-llama/Llama-3.1-8B-Instruct"
  trust_remote_code: true
  model_kwargs:
    device_map: "auto"
    torch_dtype: "float16"

generation:  # Generation parameters
  max_new_tokens: 100
  temperature: 0.7
  top_p: 0.9
  batch_size: 1

engine: "VLLM"  # VLLM, LLAMACPP, NATIVE, REMOTE_VLLM, etc.

remote_params:  # Optional remote settings
  api_url: "https://api.example.com/v1"
  api_key: "${API_KEY}"
  connection_timeout: 20.0
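
The same configuration can also be built programmatically from the parameter classes. This is a sketch assuming the classes below are importable from oumi.core.configs; the field names mirror the YAML keys above.

from oumi.core.configs import (
    GenerationParams,
    InferenceConfig,
    InferenceEngineType,
    ModelParams,
)

config = InferenceConfig(
    model=ModelParams(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        trust_remote_code=True,
        model_kwargs={"device_map": "auto", "torch_dtype": "float16"},
    ),
    generation=GenerationParams(
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        batch_size=1,
    ),
    engine=InferenceEngineType.VLLM,
)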

Configuration Components#

Model Configuration#

Configure the model architecture and loading using the ModelParams class:

model:
  # Required
  model_name: "meta-llama/Llama-3.1-8B-Instruct"    # Model ID or path (REQUIRED)

  # Model loading
  adapter_model: null                                # Path to adapter model (auto-detected if model_name is an adapter)
  tokenizer_name: null                               # Custom tokenizer name/path (defaults to model_name)
  tokenizer_pad_token: null                          # Override pad token
  tokenizer_kwargs: {}                               # Additional tokenizer args
  model_max_length: null                             # Max sequence length (positive int or null)
  load_pretrained_weights: true                      # Load pretrained weights
  trust_remote_code: false                           # Allow remote code execution (use with trusted models only)

  # Model precision and hardware
  torch_dtype_str: "float32"                         # Model precision (float32/float16/bfloat16/float64)
  device_map: "auto"                                 # Device placement strategy (auto/null)
  compile: false                                     # JIT compile model

  # Attention and optimization
  attn_implementation: null                          # Attention impl (null/sdpa/flash_attention_2/eager)
  enable_liger_kernel: false                         # Enable Liger CUDA kernel for potential speedup

  # Model behavior
  chat_template: null                                # Chat formatting template
  freeze_layers: []                                  # Layer names to freeze during training

  # Additional settings
  model_kwargs: {}                                   # Additional model constructor args
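
These fields map directly onto ModelParams when configuring inference from Python. The following is a brief sketch with illustrative values; the model_kwargs entry is a hypothetical pass-through argument.

from oumi.core.configs import ModelParams

model_params = ModelParams(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype_str="bfloat16",                # model precision
    device_map="auto",                         # automatic device placement
    attn_implementation="sdpa",                # scaled dot-product attention
    trust_remote_code=False,
    model_kwargs={"low_cpu_mem_usage": True},  # extra constructor args (illustrative)
)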

Using LoRA Adapters#

The adapter_model parameter allows you to load LoRA (Low-Rank Adaptation) adapters on top of a base model. This is useful when you’ve fine-tuned a model using LoRA and want to serve the adapted version.

Configuration Example:

model:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # Base model
  adapter_model: "path/to/lora/adapter"           # LoRA adapter path

Engine Support:

Not all inference engines support LoRA adapters. The following engines support LoRA adapter inference: VLLM, REMOTE_VLLM, NATIVE.

For detailed examples of serving LoRA adapters, see the Inference Engines guide.

Generation Configuration#

Configure text generation parameters using the GenerationParams class:

generation:
  max_new_tokens: 256                # Maximum number of new tokens to generate (default: 256)
  batch_size: 1                      # Number of sequences to generate in parallel (default: 1)
  exclude_prompt_from_response: true # Whether to remove the prompt from the response (default: true)
  seed: null                        # Random seed for reproducible generation (default: null)
  temperature: 0.0                  # Controls randomness in output (0.0 = deterministic) (default: 0.0)
  top_p: 1.0                       # Nucleus sampling probability threshold (default: 1.0)
  frequency_penalty: 0.0           # Penalize repeated tokens (default: 0.0)
  presence_penalty: 0.0            # Penalize tokens based on presence in text (default: 0.0)
  stop_strings: null               # List of sequences to stop generation (default: null)
  stop_token_ids: null            # List of token IDs to stop generation (default: null)
  logit_bias: {}                  # Token-level biases for generation (default: {})
  min_p: 0.0                      # Minimum probability threshold for tokens (default: 0.0)
  use_cache: false                # Whether to use model's internal cache (default: false)
  num_beams: 1                    # Number of beams for beam search (default: 1)
  use_sampling: false             # Whether to use sampling vs greedy decoding (default: false)
  guided_decoding: null           # Parameters for guided decoding (default: null)
  skip_special_tokens: true       # Whether to skip special tokens when decoding (default: true)
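
For example, to switch from greedy decoding to stochastic sampling, these fields can be combined as follows. This is a sketch; whether each field is honored depends on the chosen engine's supported parameters, as noted below.

from oumi.core.configs import GenerationParams

generation = GenerationParams(
    max_new_tokens=256,
    use_sampling=True,           # sample instead of greedy decoding
    temperature=0.7,             # > 0.0 introduces randomness
    top_p=0.9,                   # nucleus sampling threshold
    seed=42,                     # make sampled outputs reproducible
    stop_strings=["\n\nUser:"],  # stop once this sequence is generated
)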

Note

Not all inference engines support all generation parameters. Each engine has its own set of supported parameters, which can be checked via the get_supported_params attribute of the engine class. For example:

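The sketch below assumes the engine classes are importable from oumi.inference; the model name is an illustrative small model.

from oumi.core.configs import ModelParams
from oumi.inference import NativeTextInferenceEngine

# Constructing an engine loads its model, so a small model is used for illustration.
engine = NativeTextInferenceEngine(ModelParams(model_name="HuggingFaceTB/SmolLM2-135M-Instruct"))

# The set of generation parameter names this engine honors.
print(engine.get_supported_params)
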
Please refer to the specific engine’s documentation for details on supported parameters.

Special Tokens Handling#

The skip_special_tokens parameter controls whether special tokens (like <eos>, <pad>, <bos>, <think>) are included in the decoded output:

  • true (default): Special tokens are removed from the output, producing clean, readable text suitable for user-facing applications.

  • false: Special tokens are preserved in the output. This is useful for:

    • Reasoning models: Models like GPT-OSS (openai/gpt-oss-20b, openai/gpt-oss-120b) that output their internal reasoning using special tokens. Set to false to preserve these reasoning tokens.

    • Tool-calling models: Models that use special tokens to mark function calls or tool invocations.

    • Debugging: When you need to inspect the exact token sequence generated by the model.

    • Custom parsing: When implementing custom logic that relies on special tokens in the output format.

Note

The skip_special_tokens parameter is only supported by NativeTextInferenceEngine and VLLMInferenceEngine. Remote API engines typically handle special token filtering automatically and do not expose this parameter.
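
For example, to keep a reasoning model's special tokens in the output with one of the supported engines, the configuration might look like the sketch below (the model and values are illustrative):

from oumi.core.configs import (
    GenerationParams,
    InferenceConfig,
    InferenceEngineType,
    ModelParams,
)

config = InferenceConfig(
    model=ModelParams(model_name="openai/gpt-oss-20b"),
    generation=GenerationParams(
        max_new_tokens=512,
        skip_special_tokens=False,  # keep tokens such as <think> in the decoded output
    ),
    engine=InferenceEngineType.VLLM,  # one of the engines that supports this parameter
)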

Remote Configuration#

Configure remote API settings using the RemoteParams class:

remote_params:
  api_url: "https://api.example.com/v1"   # Required: URL of the API endpoint
  api_key: "your-api-key"                 # API key for authentication
  api_key_env_varname: null               # Environment variable for API key
  max_retries: 3                          # Maximum number of retries
  connection_timeout: 20.0                # Request timeout in seconds
  num_workers: 1                          # Number of parallel workers
  politeness_policy: 0.0                  # Sleep time between requests (in seconds)
  batch_completion_window: "24h"          # Time window for batch completion
  use_adaptive_concurrency: true          # Whether to adapt concurrency based on the error rate
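
For instance, to read the API key from an environment variable rather than storing it in the file, the same settings can be constructed in Python. This is a sketch; the variable name and values are illustrative.

from oumi.core.configs import RemoteParams

remote_params = RemoteParams(
    api_url="https://api.example.com/v1",
    api_key_env_varname="MY_API_KEY",  # read the key from this environment variable
    max_retries=3,
    connection_timeout=20.0,           # seconds
    num_workers=4,                     # parallel request workers
    politeness_policy=0.5,             # seconds to sleep between requests
)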

Engine Selection#

The engine parameter specifies which inference engine to use. Available options from InferenceEngineType include NATIVE, VLLM, LLAMACPP, and REMOTE_VLLM, among others; see the Inference Engines guide for the full list of supported engines.

Additional Configuration#

The following top-level parameters are also available in the configuration:

# Input/Output paths
input_path: null    # Path to input file containing prompts (JSONL format)
output_path: null   # Path to save generated outputs

The input_path should contain prompts in JSONL format, where each line is a JSON representation of an Oumi Conversation object.
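
As an illustration of the expected format (a sketch; the Conversation JSON schema may include additional optional fields such as an ID or metadata), each input line can be produced like this:

import json

# One Oumi Conversation per line: a list of role/content messages.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ]
}

with open("prompts.jsonl", "w") as f:
    f.write(json.dumps(conversation) + "\n")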

See Also#