Inference Configuration#
Introduction#
This guide covers the configuration options available for inference in Oumi. The configuration system is designed to be:
Modular: Each aspect of inference (model, generation, remote settings) is configured separately
Type-safe: All configuration options are validated at runtime
Flexible: Supports various inference scenarios from local to remote inference
Extensible: Easy to add new configuration options and validate them
The configuration system is built on the InferenceConfig class, which contains all inference settings. This class is composed of several parameter classes:
Model Configuration: Model architecture and loading settings via ModelParams
Generation Configuration: Text generation parameters via GenerationParams
Remote Configuration: Remote API settings via RemoteParams
All configuration files in Oumi are YAML files, which provide a human-readable format for specifying inference settings. The configuration system automatically validates these files and converts them to the appropriate Python objects.
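For example, a config file can be loaded and inspected directly from Python. The snippet below is a minimal sketch: it assumes InferenceConfig is importable from oumi.core.configs and exposes a from_yaml constructor, and the file path is hypothetical; check your installed Oumi version.

# Minimal sketch: load a YAML inference config into typed Python objects.
from oumi.core.configs import InferenceConfig

config = InferenceConfig.from_yaml("infer_config.yaml")  # hypothetical path
print(config.model.model_name)           # ModelParams
print(config.generation.max_new_tokens)  # GenerationParams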
Basic Structure#
A typical configuration file has this structure:
model:  # Model settings
  model_name: "meta-llama/Llama-3.1-8B-Instruct"
  trust_remote_code: true
  model_kwargs:
    device_map: "auto"
    torch_dtype: "float16"

generation:  # Generation parameters
  max_new_tokens: 100
  temperature: 0.7
  top_p: 0.9
  batch_size: 1

engine: "VLLM"  # VLLM, LLAMACPP, NATIVE, REMOTE_VLLM, etc.

remote_params:  # Optional remote settings
  api_url: "https://api.example.com/v1"
  api_key: "${API_KEY}"
  connection_timeout: 20.0
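Once saved, a config like this can drive an inference run from Python. The snippet below is a hedged sketch: it assumes Oumi exposes a top-level infer() helper that accepts an InferenceConfig, and the file name is hypothetical; verify the exact entry point against your Oumi version.

# Minimal sketch: run inference from a YAML config in Python.
from oumi import infer
from oumi.core.configs import InferenceConfig

config = InferenceConfig.from_yaml("infer_config.yaml")  # hypothetical path
conversations = infer(config)  # completed conversations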
Configuration Components#
Model Configuration#
Configure the model architecture and loading using the ModelParams class:
model:
  # Required
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # Model ID or path (REQUIRED)

  # Model loading
  adapter_model: null            # Path to adapter model (auto-detected if model_name is adapter)
  tokenizer_name: null           # Custom tokenizer name/path (defaults to model_name)
  tokenizer_pad_token: null      # Override pad token
  tokenizer_kwargs: {}           # Additional tokenizer args
  model_max_length: null         # Max sequence length (positive int or null)
  load_pretrained_weights: true  # Load pretrained weights
  trust_remote_code: false       # Allow remote code execution (use with trusted models only)

  # Model precision and hardware
  torch_dtype_str: "float32"     # Model precision (float32/float16/bfloat16/float64)
  device_map: "auto"             # Device placement strategy (auto/null)
  compile: false                 # JIT compile model

  # Attention and optimization
  attn_implementation: null      # Attention impl (null/sdpa/flash_attention_2/eager)
  enable_liger_kernel: false     # Enable Liger CUDA kernel for potential speedup

  # Model behavior
  chat_template: null            # Chat formatting template
  freeze_layers: []              # Layer names to freeze during training

  # Additional settings
  model_kwargs: {}               # Additional model constructor args
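The same settings can also be built programmatically. This is a minimal sketch, assuming ModelParams is importable from oumi.core.configs; field names mirror the YAML keys above.

# Minimal sketch: constructing ModelParams in Python instead of YAML.
from oumi.core.configs import ModelParams

model_params = ModelParams(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype_str="bfloat16",  # half precision for faster inference
    device_map="auto",           # let the runtime decide device placement
    trust_remote_code=False,
)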
Using LoRA Adapters#
The adapter_model parameter allows you to load LoRA (Low-Rank Adaptation) adapters on top of a base model. This is useful when you’ve fine-tuned a model using LoRA and want to serve the adapted version.
Configuration Example:
model:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # Base model
  adapter_model: "path/to/lora/adapter"           # LoRA adapter path
Engine Support:
Not all inference engines support LoRA adapters. The following engines support LoRA adapter inference: VLLM, REMOTE_VLLM, NATIVE.
For detailed examples of serving LoRA, see the Inference Engines guide.
Generation Configuration#
Configure text generation parameters using the GenerationParams class:
generation:
  max_new_tokens: 256                 # Maximum number of new tokens to generate (default: 256)
  batch_size: 1                       # Number of sequences to generate in parallel (default: 1)
  exclude_prompt_from_response: true  # Whether to remove the prompt from the response (default: true)
  seed: null                          # Seed for random number determinism (default: null)
  temperature: 0.0                    # Controls randomness in output (0.0 = deterministic) (default: 0.0)
  top_p: 1.0                          # Nucleus sampling probability threshold (default: 1.0)
  frequency_penalty: 0.0              # Penalize repeated tokens (default: 0.0)
  presence_penalty: 0.0               # Penalize tokens based on presence in text (default: 0.0)
  stop_strings: null                  # List of sequences to stop generation (default: null)
  stop_token_ids: null                # List of token IDs to stop generation (default: null)
  logit_bias: {}                      # Token-level biases for generation (default: {})
  min_p: 0.0                          # Minimum probability threshold for tokens (default: 0.0)
  use_cache: false                    # Whether to use model's internal cache (default: false)
  num_beams: 1                        # Number of beams for beam search (default: 1)
  use_sampling: false                 # Whether to use sampling vs greedy decoding (default: false)
  guided_decoding: null               # Parameters for guided decoding (default: null)
  skip_special_tokens: true           # Whether to skip special tokens when decoding (default: true)
Note
Not all inference engines support all generation parameters. Each engine has its own set of supported parameters which can be checked via the get_supported_params attribute of the engine class. For example:
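A minimal sketch, assuming the engine classes are importable from oumi.inference; depending on the Oumi version, get_supported_params may be exposed as a class-level attribute (as described above) or as a method that needs to be called.

# Minimal sketch: inspect which GenerationParams fields an engine honors.
from oumi.inference import NativeTextInferenceEngine

print(NativeTextInferenceEngine.get_supported_params)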
Please refer to the specific engine’s documentation for details on supported parameters.
Special Tokens Handling#
The skip_special_tokens parameter controls whether special tokens (like <eos>, <pad>, <bos>, <think>) are included in the decoded output:
true (default): Special tokens are removed from the output, producing clean, readable text suitable for user-facing applications.
false: Special tokens are preserved in the output. This is useful for:
Reasoning models: Models like GPT-OSS (openai/gpt-oss-20b, openai/gpt-oss-120b) that output their internal reasoning using special tokens. Set to false to preserve these reasoning tokens.
Tool-calling models: Models that use special tokens to mark function calls or tool invocations.
Debugging: When you need to inspect the exact token sequence generated by the model.
Custom parsing: When implementing custom logic that relies on special tokens in the output format.
Note
The skip_special_tokens parameter is only supported by NativeTextInferenceEngine and VLLMInferenceEngine. Remote API engines typically handle special token filtering automatically and do not expose this parameter.
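As an illustration, the sketch below preserves special tokens when generating with a reasoning model. It assumes GenerationParams is importable from oumi.core.configs; the token values and limits are illustrative only.

# Minimal sketch: keep special tokens (e.g. reasoning markers) in the output.
from oumi.core.configs import GenerationParams

generation_params = GenerationParams(
    max_new_tokens=512,
    skip_special_tokens=False,  # preserve tokens such as <think> in the decoded text
)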
Remote Configuration#
Configure remote API settings using the RemoteParams class:
remote_params:
  api_url: "https://api.example.com/v1"  # Required: URL of the API endpoint
  api_key: "your-api-key"                # API key for authentication
  api_key_env_varname: null              # Environment variable for API key
  max_retries: 3                         # Maximum number of retries
  connection_timeout: 20.0               # Request timeout in seconds
  num_workers: 1                         # Number of parallel workers
  politeness_policy: 0.0                 # Sleep time between requests
  batch_completion_window: "24h"         # Time window for batch completion
  use_adaptive_concurrency: true         # Whether to change concurrency based on error rate
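To avoid hard-coding credentials in config files, the API key can be read from an environment variable via api_key_env_varname. This is a hedged sketch, assuming RemoteParams is importable from oumi.core.configs; the endpoint URL and variable name are placeholders.

# Minimal sketch: reference an API key through an environment variable.
from oumi.core.configs import RemoteParams

remote_params = RemoteParams(
    api_url="https://api.example.com/v1",
    api_key_env_varname="MY_API_KEY",  # hypothetical variable resolved at runtime
    max_retries=3,
    num_workers=4,
)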
Engine Selection#
The engine parameter specifies which inference engine to use. Available options from InferenceEngineType:
ANTHROPIC: Use Anthropic’s API via AnthropicInferenceEngine
DEEPSEEK: Use DeepSeek Platform API via DeepSeekInferenceEngine
GOOGLE_GEMINI: Use Google Gemini via GoogleGeminiInferenceEngine
GOOGLE_VERTEX: Use Google Vertex AI via GoogleVertexInferenceEngine
LAMBDA: Use Lambda AI API via LambdaInferenceEngine
LLAMACPP: Use llama.cpp for CPU inference via LlamaCppInferenceEngine
NATIVE: Use native PyTorch inference via NativeTextInferenceEngine
OPENAI: Use OpenAI API via OpenAIInferenceEngine
PARASAIL: Use Parasail API via ParasailInferenceEngine
REMOTE_VLLM: Use external vLLM server via RemoteVLLMInferenceEngine
REMOTE: Use any OpenAI-compatible API via RemoteInferenceEngine
SAMBANOVA: Use SambaNova API via SambanovaInferenceEngine
SGLANG: Use SGLang inference engine via SGLangInferenceEngine
TOGETHER: Use Together API via TogetherInferenceEngine
VLLM: Use vLLM for optimized local inference via VLLMInferenceEngine
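In YAML, the engine is a single top-level field (engine: "VLLM", as in the basic structure above). It can also be set programmatically; the sketch below assumes InferenceEngineType and InferenceConfig are importable from oumi.core.configs and the file path is hypothetical.

# Minimal sketch: select an inference engine on a loaded config.
from oumi.core.configs import InferenceConfig, InferenceEngineType

config = InferenceConfig.from_yaml("infer_config.yaml")  # hypothetical path
config.engine = InferenceEngineType.VLLM  # switch to the local vLLM engine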
Additional Configuration#
The following top-level parameters are also available in the configuration:
# Input/Output paths
input_path: null # Path to input file containing prompts (JSONL format)
output_path: null # Path to save generated outputs
The input_path should contain prompts in JSONL format, where each line is a JSON representation of an Oumi Conversation object.
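For reference, the sketch below writes a one-line JSONL input file. It assumes the serialized Conversation format is a JSON object with a "messages" list of role/content entries; verify the exact schema against the Conversation class in your Oumi version.

# Minimal sketch: create a JSONL input file with a single conversation.
import json

conversation = {"messages": [{"role": "user", "content": "What is Oumi?"}]}
with open("prompts.jsonl", "w") as f:
    f.write(json.dumps(conversation) + "\n")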
See Also#
Inference Engines for local and remote inference engines usage
Common Workflows for common workflows
Inference Configuration for detailed parameter documentation