oumi.core.synthesis
Submodules
oumi.core.synthesis.attribute_formatter module
- class oumi.core.synthesis.attribute_formatter.AttributeFormatter(params: GeneralSynthesisParams)
Bases: object
Formats a sample using a format string.
Integrates information from permutable attributes so that placeholders in the format string (e.g. {attribute_id.value}) can be resolved.
- format(sample: dict[str, str], format_string: str, missing_values_allowed: bool = False) -> str
Format a sample using a format string.
- Parameters:
sample – The sample to format.
format_string – The format string to use.
missing_values_allowed – If True, missing values are allowed in the sample.
- Returns:
The formatted string.
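The placeholder substitution described above can be illustrated with a minimal, standalone sketch. It does not use or reproduce `AttributeFormatter`; the function name `format_sample` is hypothetical, and two behaviors are assumptions made only for illustration: a missing value is replaced by an empty string when `missing_values_allowed` is True, and a `ValueError` is raised otherwise. Dotted placeholders such as `{attribute_id.value}` are looked up here as plain dictionary keys.

```python
from string import Formatter


def format_sample(
    sample: dict[str, str],
    format_string: str,
    missing_values_allowed: bool = False,
) -> str:
    """Hypothetical stand-in that fills {placeholders} from a sample dict."""
    parts = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for literal, field, _spec, _conversion in Formatter().parse(format_string):
        parts.append(literal)
        if field is None:
            continue  # Trailing literal text with no placeholder after it.
        if field in sample:
            parts.append(sample[field])
        elif missing_values_allowed:
            parts.append("")  # Assumption: missing values become empty strings.
        else:
            # Assumption: the error type is chosen for this sketch only.
            raise ValueError(f"Missing value for placeholder: {field}")
    return "".join(parts)


sample = {"style": "short", "topic": "rain"}
print(format_sample(sample, "Write a {style} poem about {topic}."))
# prints: Write a short poem about rain.
```

Parsing the format string explicitly, rather than calling `str.format`, keeps dotted names from being interpreted as attribute access on the looked-up value.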
oumi.core.synthesis.attribute_synthesizer module
- class oumi.core.synthesis.attribute_synthesizer.AttributeSynthesizer(params: GeneralSynthesisParams)
Bases: object
Synthesizes values for a generated attribute based on the given samples.
- Parameters:
params – The parameters for the attribute synthesizer.
- synthesize(samples: list[dict], generated_attribute: GeneratedAttribute) -> list[Conversation]
Synthesize values for the generated attribute.
oumi.core.synthesis.dataset_ingestion module
- class oumi.core.synthesis.dataset_ingestion.DatasetPath(path: str)
Bases: object
Path to a dataset in some storage location.
- get_storage_type() -> DatasetStorageType
Get the storage type.
- class oumi.core.synthesis.dataset_ingestion.DatasetReader
Bases: object
Reads a dataset from some storage location.
Supports:
- HuggingFace
- Local files (JSONL, CSV, TSV, Parquet, JSON)
- Glob patterns
- read(data_source: DatasetSource) -> list[dict]
Read the data from the data path.
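As an illustration of reading local files through a glob pattern into a `list[dict]`, the following standalone sketch uses only the standard library. It is not the `DatasetReader` implementation: the function name `read_records` is hypothetical, only JSONL, CSV, and TSV are handled, and raising `ValueError` for other extensions is an assumption of this sketch.

```python
import csv
import glob
import json
from pathlib import Path


def read_records(pattern: str) -> list[dict]:
    """Hypothetical helper: load every file matching a glob pattern as rows."""
    records: list[dict] = []
    # Sort the matches so the row order is stable across runs.
    for name in sorted(glob.glob(pattern)):
        path = Path(name)
        suffix = path.suffix.lower()
        if suffix == ".jsonl":
            # One JSON object per non-empty line.
            with path.open(encoding="utf-8") as f:
                records.extend(json.loads(line) for line in f if line.strip())
        elif suffix in (".csv", ".tsv"):
            delimiter = "\t" if suffix == ".tsv" else ","
            # The first row is treated as the header.
            with path.open(newline="", encoding="utf-8") as f:
                records.extend(csv.DictReader(f, delimiter=delimiter))
        else:
            # Assumption: other formats are out of scope for this sketch.
            raise ValueError(f"Unsupported file type: {path}")
    return records
```

A call such as `read_records("data/*.jsonl")` (the path is a placeholder) would return one dictionary per line across all matching files.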
oumi.core.synthesis.document_ingestion module
- class oumi.core.synthesis.document_ingestion.DocumentReader
Bases: object
Reader for documents.
- class oumi.core.synthesis.document_ingestion.DocumentSegmenter(params: DocumentSegmentationParams)
Bases: object
Segmenter for documents.
oumi.core.synthesis.planner module
- class oumi.core.synthesis.planner.DatasetPlanner
Bases: object
- plan(synthesis_params: GeneralSynthesisParams, sample_count: int) -> list[dict]
Set up the dataset’s attributes for inference.
This method creates a list of dictionaries, each representing one sample of the dataset with a value for every attribute.
Example sources are used to populate the dataset plan with a set of examples for specific attributes, with each example being used round-robin.
Document sources are used to populate the dataset plan with documents and/or document segments, each sample of a document source being used round-robin.
Dataset sources are used to populate the dataset plan with values for the attributes, with each sample of a dataset source being used round-robin.
Permutable attributes have their values sampled from a distribution.
Combination sampling overrides the distribution for particular attribute-value combinations.
The final list of dictionaries will be used to create a dataset.
- Parameters:
synthesis_params – The synthesis parameters.
sample_count – The number of samples to plan.
- Returns:
A list of dictionaries, each representing a sample of the dataset with the attribute values for each attribute.
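The round-robin and distribution-sampling behavior described for `plan` can be sketched in isolation as follows. This is not the `DatasetPlanner` implementation: the function name `plan_samples`, its arguments, and the `seed` parameter are hypothetical, permutable attributes are modeled as a mapping from each value to a weight, and combination sampling is omitted.

```python
import random


def plan_samples(
    dataset_rows: list[dict],
    permutable_attributes: dict[str, dict[str, float]],
    sample_count: int,
    seed: int = 0,
) -> list[dict]:
    """Hypothetical sketch: build one dict of attribute values per sample."""
    rng = random.Random(seed)  # Seeded so the plan is reproducible.
    plan = []
    for i in range(sample_count):
        sample: dict = {}
        if dataset_rows:
            # Round-robin: cycle through the source rows in order.
            sample.update(dataset_rows[i % len(dataset_rows)])
        for attribute_id, weighted_values in permutable_attributes.items():
            values = list(weighted_values)
            weights = list(weighted_values.values())
            # Draw one value according to the attribute's weights.
            sample[attribute_id] = rng.choices(values, weights=weights, k=1)[0]
        plan.append(sample)
    return plan


rows = [{"question": "q1"}, {"question": "q2"}]
attributes = {"tone": {"formal": 0.8, "casual": 0.2}}
for sample in plan_samples(rows, attributes, sample_count=5):
    print(sample)
```

With two source rows and five planned samples, the `question` values cycle q1, q2, q1, q2, q1, while each `tone` is drawn independently from the weighted values.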