oumi.core.synthesis#

Submodules#

oumi.core.synthesis.attribute_formatter module#

class oumi.core.synthesis.attribute_formatter.AttributeFormatter(params: GeneralSynthesisParams)[source]#

Bases: object

Formats a sample using a format string.

Integrates information from permutable attributes so that placeholders in the format string (e.g. {attribute_id.value}) can be resolved.

format(sample: dict[str, str], format_string: str, missing_values_allowed: bool = False) → str[source]#

Format a sample using a format string.

Parameters:
  • sample – The sample to format.

  • format_string – The format string to use.

  • missing_values_allowed – If True, missing values are allowed in the sample.

Returns:

The formatted string.
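The placeholder substitution described above can be illustrated with a small standalone sketch. This is not the AttributeFormatter implementation: `format_sample` is a hypothetical helper, and its handling of missing values (leave the placeholder untouched when allowed, raise otherwise) is an assumption made only for illustration.

```python
import re

# Hypothetical stand-in for illustration; not oumi's AttributeFormatter.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def format_sample(
    sample: dict[str, str],
    format_string: str,
    missing_values_allowed: bool = False,
) -> str:
    """Replace each {key} placeholder with the matching value from the sample."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key in sample:
            return sample[key]
        if missing_values_allowed:
            # Assumption: unresolved placeholders are left as they are.
            return match.group(0)
        raise ValueError(f"Missing value for placeholder: {key}")

    return _PLACEHOLDER.sub(substitute, format_string)


sample = {"topic": "tides", "style": "formal"}
prompt = format_sample(sample, "Write a {style} note about {topic}.")
print(prompt)
```

With the real class, the equivalent call would presumably go through `AttributeFormatter(params).format(sample, format_string)`, where `params` is a GeneralSynthesisParams instance.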

oumi.core.synthesis.attribute_synthesizer module#

class oumi.core.synthesis.attribute_synthesizer.AttributeSynthesizer(params: GeneralSynthesisParams)[source]#

Bases: object

Synthesizes values for a generated attribute based on the given samples.

Parameters:

params – The parameters for the attribute synthesizer.

synthesize(samples: list[dict], generated_attribute: GeneratedAttribute) → list[Conversation][source]#

Synthesize values for the generated attribute.

oumi.core.synthesis.dataset_ingestion module#

class oumi.core.synthesis.dataset_ingestion.DatasetPath(path: str)[source]#

Bases: object

Path to a dataset in some storage location.

get_file_extension() → str[source]#

Get the file extension.

get_path_str() → str[source]#

Get the path.

get_storage_type() → DatasetStorageType[source]#

Get the storage type.

class oumi.core.synthesis.dataset_ingestion.DatasetReader[source]#

Bases: object

Reads a dataset from some storage location.

Supports:
  • HuggingFace

  • Local files (JSONL, CSV, TSV, Parquet, JSON)

  • Glob patterns

read(data_source: DatasetSource) → list[dict][source]#

Read the data from the data path.
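To make the `list[dict]` return shape concrete, here is a minimal sketch of reading local files by extension using only the standard library. It is not DatasetReader's code: `read_local_rows` is a hypothetical helper, and Parquet, HuggingFace sources, and glob patterns are left out.

```python
import csv
import json
import os
import tempfile
from pathlib import Path


def read_local_rows(path: str) -> list[dict]:
    """Hypothetical helper: load a local file into a list of row dictionaries."""
    suffix = Path(path).suffix.lower()
    with open(path, newline="", encoding="utf-8") as handle:
        if suffix == ".jsonl":
            # One JSON object per non-empty line.
            return [json.loads(line) for line in handle if line.strip()]
        if suffix == ".json":
            data = json.load(handle)
            # Assumption: a single top-level object counts as one row.
            return data if isinstance(data, list) else [data]
        if suffix in (".csv", ".tsv"):
            delimiter = "\t" if suffix == ".tsv" else ","
            return [dict(row) for row in csv.DictReader(handle, delimiter=delimiter)]
    raise ValueError(f"Unsupported file extension: {suffix!r}")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "rows.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("question,answer\n2+2,4\n")
    rows = read_local_rows(csv_path)

print(rows)
```

Each row is a dictionary keyed by column name, which is the shape the other classes in this package accept as samples.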

class oumi.core.synthesis.dataset_ingestion.DatasetStorageType(value)[source]#

Bases: Enum

Storage location for a dataset (local, HuggingFace, etc.).

HF = 'hf'#

HuggingFace

LOCAL = 'local'#

Local files

oumi.core.synthesis.document_ingestion module#

class oumi.core.synthesis.document_ingestion.DocumentReader[source]#

Bases: object

Reader for documents.

read(document_path: str) → list[str][source]#

Read the document.

class oumi.core.synthesis.document_ingestion.DocumentSegmenter(params: DocumentSegmentationParams)[source]#

Bases: object

Segmenter for documents.

segment(document: str) → list[str][source]#

Segment the document.

segment_batch(documents: list[str]) → list[str][source]#

Segment multiple documents.

The segments from all documents are returned together as a single flat list, not grouped by document.
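The flat-list behavior can be sketched as follows. The segmentation strategy here (fixed-size word windows) and the `segment_length` parameter are assumptions for illustration only; the actual strategy is determined by DocumentSegmentationParams, whose fields are not described in this section.

```python
from itertools import chain


def segment(document: str, segment_length: int) -> list[str]:
    """Hypothetical segmenter: split a document into windows of N words."""
    words = document.split()
    return [
        " ".join(words[i : i + segment_length])
        for i in range(0, len(words), segment_length)
    ]


def segment_batch(documents: list[str], segment_length: int) -> list[str]:
    """Segment every document and flatten the results into one list."""
    return list(
        chain.from_iterable(segment(doc, segment_length) for doc in documents)
    )


segments = segment_batch(["a b c d e", "f g"], segment_length=2)
print(segments)  # ['a b', 'c d', 'e', 'f g']
```

Note that the flat result carries no record of which document a segment came from.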

oumi.core.synthesis.planner module#

class oumi.core.synthesis.planner.DatasetPlanner[source]#

Bases: object

plan(synthesis_params: GeneralSynthesisParams, sample_count: int) → list[dict][source]#

Set up the dataset’s attributes for inference.

This function creates a list of dictionaries, each representing one sample of the dataset with a specific value for each attribute.

  • Example sources are used to populate the dataset plan with a set of examples for specific attributes, with each example being used round-robin.

  • Document sources are used to populate the dataset plan with documents and/or document segments, each sample of a document source being used round-robin.

  • Dataset sources are used to populate the dataset plan with values for the attributes, with each sample of a dataset source being used round-robin.

  • Permutable attributes have their values sampled from a distribution.

  • Combination sampling overrides the distribution for particular attribute-value combinations.

The final list of dictionaries will be used to create a dataset.

Parameters:
  • synthesis_params – The synthesis parameters.

  • sample_count – The number of samples to plan.

Returns:

A list of dictionaries, each representing a sample of the dataset with the attribute values for each attribute.
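Two of the rules above, round-robin reuse of dataset rows and sampling permutable attributes from a distribution, can be sketched in a few lines. This is a simplified illustration, not DatasetPlanner's implementation: `plan_samples`, its arguments, and the weight dictionaries are hypothetical, and example sources, document sources, and combination sampling are omitted.

```python
import random


def plan_samples(
    dataset_rows: list[dict],
    permutable: dict[str, dict[str, float]],
    sample_count: int,
    seed: int = 0,
) -> list[dict]:
    """Hypothetical planner: one dictionary per planned sample."""
    rng = random.Random(seed)
    plan = []
    for index in range(sample_count):
        # Round-robin: cycle through the dataset rows in order.
        sample = dict(dataset_rows[index % len(dataset_rows)]) if dataset_rows else {}
        # Permutable attributes: draw one value according to its weights.
        for attribute_id, weighted_values in permutable.items():
            values = list(weighted_values)
            weights = list(weighted_values.values())
            sample[attribute_id] = rng.choices(values, weights=weights, k=1)[0]
        plan.append(sample)
    return plan


rows = [{"question": "Q1"}, {"question": "Q2"}]
tones = {"tone": {"formal": 0.7, "casual": 0.3}}
plan = plan_samples(rows, tones, sample_count=5)
print([sample["question"] for sample in plan])  # ['Q1', 'Q2', 'Q1', 'Q2', 'Q1']
```

Each planned sample carries a value for every attribute, which matches the return value described above.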