oumi.core.collators
Submodules
oumi.core.collators.text_collator_with_padding module
oumi.core.collators.text_completions_collator_with_padding module
oumi.core.collators.vision_language_collator_with_padding module
- class oumi.core.collators.vision_language_collator_with_padding.VisionLanguageCollatorWithPadding(tokenizer: PreTrainedTokenizerBase, *, max_length: int | None, truncation: bool = False, label_ignore_index: int | None = None, allow_multi_image_inputs: bool = True, main_image_feature: str = 'pixel_values')[source]
Bases: object
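No docstring is rendered here for this class, so the following is a minimal usage sketch rather than source documentation. It assumes each batch item has already been processed into tokenized input_ids and image features stored under the default main_image_feature key ("pixel_values"); the dummy tensor shape and token IDs are hypothetical.

>>> import torch
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.vision_language_collator_with_padding import (
...     VisionLanguageCollatorWithPadding,
... )
>>> tokenizer = build_tokenizer(ModelParams(model_name="llava-hf/llava-1.5-7b-hf"))
>>> collator = VisionLanguageCollatorWithPadding(tokenizer, max_length=512)
>>> # Items carry pre-tokenized text and pre-extracted image features;
>>> # the collator pads the text fields and stacks the image features.
>>> batch = collator([
...     {"input_ids": [101, 2023, 102], "pixel_values": torch.zeros(3, 336, 336)},
...     {"input_ids": [101, 102], "pixel_values": torch.zeros(3, 336, 336)},
... ])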
oumi.core.collators.vision_language_sft_collator module
Vision-Language SFT collator for conversation-based multimodal training.
This module provides a collator specifically designed for supervised fine-tuning (SFT) of vision-language models using conversation data.
Unlike VisionLanguageCollatorWithPadding, which expects pre-processed features, this collator works with raw conversation objects and handles the complete feature generation pipeline.
Example
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> tokenizer = build_tokenizer(ModelParams(model_name="llava-hf/llava-1.5-7b-hf"))
>>> collator = VisionLanguageSftCollator(
... tokenizer=tokenizer,
... processor_name="llava-hf/llava-1.5-7b-hf",
... max_length=512,
... truncation=True
... )
>>> # Expects batch items with conversation_json field
>>> batch = collator([{"conversation_json": conversation1.to_json()}, ...])
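The class signature below also supports restricting the training loss to assistant responses. A hedged variant of the construction above (the response_template value is an assumption for the LLaVA chat format, not taken from the source):

>>> collator = VisionLanguageSftCollator(
...     tokenizer=tokenizer,
...     processor_name="llava-hf/llava-1.5-7b-hf",
...     max_length=512,
...     truncation=True,
...     train_on_completions_only=True,  # mask loss on non-assistant tokens
...     response_template="ASSISTANT:",  # assumed template for LLaVA-1.5
... )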
- class oumi.core.collators.vision_language_sft_collator.VisionLanguageSftCollator(tokenizer: PreTrainedTokenizerBase, processor_name: str, *, processor_kwargs: dict[str, Any] | None = None, max_length: int | None = None, truncation: bool = False, truncation_side: str = 'right', label_ignore_index: int | None = None, allow_multi_image_inputs: bool = True, trust_remote_code: bool = False, train_on_completions_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, process_individually: bool = False)[source]
Bases: object
Collator for vision-language SFT that processes conversation data.
This collator is designed for supervised fine-tuning of vision-language models where training data comes in the form of conversations containing both text and images. It handles the complete pipeline from raw conversations to model-ready tensor batches.
Key Features:
- Processes Conversation objects containing text and image data
- Uses model-specific processors to extract image features
- Handles tokenization and feature generation in one step
- Supports various vision-language architectures
- Manages padding, truncation, and label masking
The collator expects batch items with a "conversation_json" field containing serialized Conversation objects (a preparation sketch follows the list below). These conversations can include:
- Multiple turns of dialogue
- Image references (paths, URLs, or base64 data)
- System prompts and user/assistant messages
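For illustration, a minimal sketch of preparing such a batch item; the constructor usage mirrors the Example under __call__ below, and the Conversation import path is an assumption:

>>> from oumi.core.types import Conversation
>>> conversation = Conversation(messages=[
...     {"role": "user", "content": "Describe the image."},
...     {"role": "assistant", "content": "A dog on a beach."}
... ], images=["path/to/image.jpg"])
>>> # Each batch item serializes one conversation to JSON.
>>> batch = collator([{"conversation_json": conversation.to_json()}])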
- __call__(batch) → dict[str, Any][source]
Process a batch of conversation data into model-ready features.
This method converts serialized conversations into the tensor format expected by vision-language models. It handles the complete pipeline:
1. Deserializes conversation JSON strings
2. Passes conversations to the feature generator
3. Returns batched tensors ready for training
- Parameters:
batch – List of dictionaries, where each dictionary must contain a "conversation_json" field with a serialized Conversation object.
Expected format:
[
    {"conversation_json": '{"messages": [...], "images": [...]}'},
    {"conversation_json": '{"messages": [...], "images": [...]}'},
    ...
]
The conversation JSON should include:
- messages: List of message dictionaries with role and content
- images: Optional list of image data (paths, URLs, or base64)
- Returns:
Dictionary containing all features needed for model training:
- "input_ids": Token IDs including image placeholders
- "attention_mask": Attention masks for the input
- "labels": Target labels with appropriate masking
- "pixel_values" or model-specific image features
- Additional model-specific features (cross_attention_mask, etc.)
The exact keys depend on the model architecture and processor used.
- Return type:
dict[str, Any]
- Raises:
ValueError – If the batch is empty or any item lacks a "conversation_json" field.
Example
>>> conversation = Conversation(messages=[
...     {"role": "user", "content": "What's in this image?"},
...     {"role": "assistant", "content": "I see a cat."}
... ], images=["path/to/image.jpg"])
>>> batch_item = {"conversation_json": conversation.to_json()}
>>> features = collator([batch_item])
>>> print(features.keys())
dict_keys(['input_ids', 'attention_mask', 'labels', 'pixel_values'])
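In practice the collator is typically passed to a PyTorch DataLoader as collate_fn. A minimal sketch, assuming a map-style dataset whose items each carry a "conversation_json" field:

>>> from torch.utils.data import DataLoader
>>> items = [{"conversation_json": conversation.to_json()} for _ in range(4)]
>>> loader = DataLoader(items, batch_size=2, collate_fn=collator)
>>> for features in loader:
...     # features["input_ids"], features["labels"], etc. are batched tensors
...     pass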