oumi.core.collators#

Submodules#

oumi.core.collators.text_collator_with_padding module#

class oumi.core.collators.text_collator_with_padding.TextCollatorWithPadding(tokenizer: PreTrainedTokenizerBase, *, max_length: int | None, truncation: bool = False, label_ignore_index: int | None = None, max_variable_sized_dims: int = 1, debug: bool = False)[source]#

Bases: object

__call__(batch) dict[str, Any][source]#

Pads to the longest length present in the batch.

Parameters:

batch – List of batch items.

Returns:

Processed batch.

Return type:

Dict[str, torch.Tensor]
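
Example

A minimal usage sketch. The model name and the pre-tokenized item schema ({"input_ids": [...]}) are illustrative assumptions; the documented contract only states that the collator pads each batch to the longest sequence present.

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.text_collator_with_padding import TextCollatorWithPadding
>>> # Assumes the tokenizer (or build_tokenizer) provides a pad token.
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> collator = TextCollatorWithPadding(tokenizer, max_length=64, truncation=True)
>>> # Items of different lengths are padded to the longest one in the batch.
>>> batch = collator([
...     {"input_ids": [101, 2023, 2003]},
...     {"input_ids": [101, 2003]},
... ])
>>> batch["input_ids"].shape  # (batch_size, longest_sequence_in_batch)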

oumi.core.collators.text_completions_collator_with_padding module#

class oumi.core.collators.text_completions_collator_with_padding.TextCompletionsCollatorWithPadding(tokenizer: PreTrainedTokenizerBase, instruction_prefix: str, response_prefix: str, debug: bool = False)[source]#

Bases: object

__call__(batch) dict[str, Any][source]#

Pads to the longest length present in the batch.

Parameters:

batch – List of batch items.

Returns:

Processed batch.

Return type:

Dict[str, torch.Tensor]
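
Example

A minimal usage sketch. The model name, the prefix strings, and the pre-tokenized item schema are illustrative assumptions; the prefixes must match how the prompts were rendered, and the collator presumably masks labels outside the response spans (as its name suggests) in addition to padding to the longest length in the batch.

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.text_completions_collator_with_padding import (
...     TextCompletionsCollatorWithPadding,
... )
>>> tokenizer = build_tokenizer(ModelParams(model_name="microsoft/Phi-3-mini-4k-instruct"))
>>> collator = TextCompletionsCollatorWithPadding(
...     tokenizer=tokenizer,
...     instruction_prefix="<|user|>",
...     response_prefix="<|assistant|>",
... )
>>> # Items are assumed to carry pre-tokenized input_ids, as with TextCollatorWithPadding.
>>> text = "<|user|> What is 2 + 2? <|assistant|> 4"
>>> batch = collator([{"input_ids": tokenizer(text)["input_ids"]}])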

oumi.core.collators.vision_language_collator_with_padding module#

class oumi.core.collators.vision_language_collator_with_padding.VisionLanguageCollatorWithPadding(tokenizer: PreTrainedTokenizerBase, *, max_length: int | None, truncation: bool = False, label_ignore_index: int | None = None, allow_multi_image_inputs: bool = True, main_image_feature: str = 'pixel_values')[source]#

Bases: object

__call__(batch) dict[str, Any][source]#

Custom collator for multi-modal vision-language training.

Parameters:

batch – List of batch items.

Returns:

Processed batch.

Return type:

Dict[str, torch.Tensor]
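
Example

A minimal usage sketch. It assumes batch items already carry processor outputs: token ids plus an image feature stored under the main_image_feature key ("pixel_values" by default). The item schema and the 3x336x336 image shape are assumptions, not part of the documented API.

>>> import torch
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.vision_language_collator_with_padding import (
...     VisionLanguageCollatorWithPadding,
... )
>>> tokenizer = build_tokenizer(ModelParams(model_name="llava-hf/llava-1.5-7b-hf"))
>>> collator = VisionLanguageCollatorWithPadding(
...     tokenizer=tokenizer,
...     max_length=1024,
...     truncation=True,
...     label_ignore_index=-100,
... )
>>> # Assumed pre-processed items: text features are padded, image features are batched.
>>> batch = collator([
...     {"input_ids": [1, 2, 3], "pixel_values": torch.rand(3, 336, 336)},
...     {"input_ids": [4, 5], "pixel_values": torch.rand(3, 336, 336)},
... ])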

collate_images(images) Tensor[source]#

Collate images for multi-modal training.

Parameters:

images – List of images to collate.

Returns:

Batch of processed images.

Return type:

torch.Tensor
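
Example

Continuing the sketch above, collate_images can also be called directly. It is assumed here that each image is already a preprocessed tensor of identical shape, so the result is a single batched tensor.

>>> images = [torch.rand(3, 336, 336), torch.rand(3, 336, 336)]
>>> pixel_batch = collator.collate_images(images)
>>> pixel_batch.shape  # one batch dimension prepended to the per-image shape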

oumi.core.collators.vision_language_sft_collator module#

Vision-Language SFT collator for conversation-based multimodal training.

This module provides a collator specifically designed for supervised fine-tuning (SFT) of vision-language models using conversation data.

Unlike VisionLanguageCollatorWithPadding, which expects pre-processed features, this collator works with raw conversation objects and handles the complete feature generation pipeline.

Example

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> tokenizer = build_tokenizer(ModelParams(model_name="llava-hf/llava-1.5-7b-hf"))
>>> collator = VisionLanguageSftCollator(
...     tokenizer=tokenizer,
...     processor_name="llava-hf/llava-1.5-7b-hf",
...     max_length=512,
...     truncation=True
... )
>>> # Expects batch items with conversation_json field
>>> batch = collator([{"conversation_json": conversation1.to_json()}, ...])
class oumi.core.collators.vision_language_sft_collator.VisionLanguageSftCollator(tokenizer: PreTrainedTokenizerBase, processor_name: str, *, processor_kwargs: dict[str, Any] | None = None, max_length: int | None = None, truncation: bool = False, truncation_side: str = 'right', label_ignore_index: int | None = None, allow_multi_image_inputs: bool = True, trust_remote_code: bool = False, train_on_completions_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, process_individually: bool = False)[source]#

Bases: object

Collator for vision-language SFT that processes conversation data.

This collator is designed for supervised fine-tuning of vision-language models where training data comes in the form of conversations containing both text and images. It handles the complete pipeline from raw conversations to model-ready tensor batches.

Key Features:
  • Processes Conversation objects containing text and image data

  • Uses model-specific processors to extract image features

  • Handles tokenization and feature generation in one step

  • Supports various vision-language architectures

  • Manages padding, truncation, and label masking

The collator expects batch items with a “conversation_json” field containing serialized Conversation objects. These conversations can include:

  • Multiple turns of dialogue

  • Image references (paths, URLs, or base64 data)

  • System prompts and user/assistant messages

__call__(batch) dict[str, Any][source]#

Process a batch of conversation data into model-ready features.

This method converts serialized conversations into the tensor format expected by vision-language models. It handles the complete pipeline:

  1. Deserializes conversation JSON strings.

  2. Passes conversations to the feature generator.

  3. Returns batched tensors ready for training.

Parameters:

batch – List of dictionaries, where each dictionary must contain a “conversation_json” field with a serialized Conversation object.

Expected format:

[
    {"conversation_json": '{"messages": [...], "images": [...]}'},
    {"conversation_json": '{"messages": [...], "images": [...]}'},
    ...
]

The conversation JSON should include:

  • messages: List of message dictionaries with role and content

  • images: Optional list of image data (paths, URLs, or base64)

Returns:

  • “input_ids”: Token IDs including image placeholders

  • “attention_mask”: Attention masks for the input

  • “labels”: Target labels with appropriate masking

  • “pixel_values” or model-specific image features

  • Additional model-specific features (cross_attention_mask, etc.)

The exact keys depend on the model architecture and processor used.

Return type:

Dictionary containing all features needed for model training

Raises:

ValueError – If the batch is empty or any item lacks the “conversation_json” field.

Example

>>> conversation = Conversation(messages=[
...     {"role": "user", "content": "What's in this image?"},
...     {"role": "assistant", "content": "I see a cat."}
... ], images=["path/to/image.jpg"])
>>> batch_item = {"conversation_json": conversation.to_json()}
>>> features = collator([batch_item])
>>> print(features.keys())
dict_keys(['input_ids', 'attention_mask', 'labels', 'pixel_values'])