oumi.core.analyze#

Sample analyzer plugin system for OUMI.

This package provides a plugin-based architecture for analyzing conversation data with different types of sample analyzers (length, safety, etc.).

class oumi.core.analyze.DatasetAnalysisResult(dataset_name: str, total_conversations: int, conversations_analyzed: int, total_messages: int, messages: list[MessageAnalysisResult])[source]#

Bases: object

Complete result of dataset analysis.

Variables:
  • dataset_name (str) – Name of the analyzed dataset

  • total_conversations (int) – Total number of conversations in the dataset

  • conversations_analyzed (int) – Number of conversations actually analyzed

  • total_messages (int) – Total number of messages analyzed

  • messages (list[oumi.core.analyze.dataset_analyzer.MessageAnalysisResult]) – List of analysis results for each individual message

conversations_analyzed: int#
dataset_name: str#
messages: list[MessageAnalysisResult]#
to_dataframe() DataFrame[source]#

Convert the analysis results to a pandas DataFrame.

Returns:

DataFrame with flattened analyzer metrics for easy querying. Each row represents one message with all its analysis metrics.

to_dict() dict[str, Any][source]#

Convert the analysis result to a dictionary.

Returns:

Dictionary representation of the analysis result

total_conversations: int#
total_messages: int#
class oumi.core.analyze.DatasetAnalyzer(config: AnalyzeConfig)[source]#

Bases: object

Orchestrates dataset analysis by creating and managing sample analyzers.

property analysis_results: DatasetAnalysisResult | None#

Get the analysis results if available.

Returns:

DatasetAnalysisResult if analysis has been run, None otherwise

analyze_dataset() None[source]#

Analyze the dataset and store results internally.

This method performs sample-level analysis using the configured sample analyzers. Each sample analyzer processes individual messages and returns metrics for each message. Results are stored internally and can be accessed via the query() method.

Raises:

ValueError – If no analyzers are configured for analysis.

filter(query_expression: str) BaseMapDataset[source]#

Filter the original dataset based on analysis results.

This method uses analysis results to filter the original dataset, returning a new dataset object containing only the conversations that match the query.

Parameters:

query_expression – Pandas query expression to filter analysis results

Returns:

A new dataset object containing only the filtered conversations

Examples:

# Filter for conversations with short messages
short_dataset = analyzer.filter("length_word_count < 10")

# Filter for conversations with assistant messages
assistant_dataset = analyzer.filter("role == 'assistant'")

# Filter for conversations with long user messages
long_user_dataset = analyzer.filter(
    "role == 'user' and length_word_count > 100")

query(query_expression: str) DataFrame[source]#

Query analysis results using pandas query expression.

Parameters:

query_expression – Pandas query expression to filter analysis results. See the pandas DataFrame.query documentation for details: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Returns:

DataFrame with filtered analysis results

Examples:

# Filter for short messages
short_messages = analyzer.query("length_word_count < 10")

# Filter for assistant messages
assistant_messages = analyzer.query("role == 'assistant'")

# Filter for long user messages
long_user = analyzer.query("role == 'user' and length_word_count > 100")

class oumi.core.analyze.LengthAnalyzer(*, char_count: bool = True, word_count: bool = True, sentence_count: bool = True, token_count: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, include_special_tokens: bool = True)[source]#

Bases: SampleAnalyzer

Analyzer that computes various length metrics for text content.

analyze_message(text_content: str, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None) dict[str, Any][source]#

Analyze text content and return length metrics.

Parameters:
  • text_content – The text content to analyze

  • tokenizer – Optional tokenizer to use for token counting

Returns:

Dictionary containing requested length metrics
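As an illustration of what these metrics look like, here is a rough, self-contained approximation (not the actual LengthAnalyzer code; the real sentence-splitting heuristic may differ, and token counting is omitted because it requires a tokenizer):

```python
import re

def length_metrics(text: str) -> dict:
    """Approximate the char/word/sentence metrics described above."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        # Naive sentence split on ., !, ? — the real analyzer may use
        # a different heuristic.
        "sentence_count": len(
            [s for s in re.split(r"[.!?]+", text) if s.strip()]
        ),
    }

metrics = length_metrics("Hello world. How are you?")
# {'char_count': 25, 'word_count': 5, 'sentence_count': 2}
```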

class oumi.core.analyze.MessageAnalysisResult(conversation_id: str, conversation_index: int, message_index: int, role: str, message_id: str, text_content: str, analyzer_metrics: dict[str, Any])[source]#

Bases: object

Result of analyzing a single message in a conversation.

Variables:
  • conversation_id (str) – Unique identifier for the conversation

  • conversation_index (int) – Index of the conversation in the dataset

  • message_index (int) – Index of the message within the conversation

  • role (str) – Role of the message sender (e.g., ‘user’, ‘assistant’)

  • message_id (str) – Unique identifier for the message

  • text_content (str) – The text content of the message

  • analyzer_metrics (dict[str, Any]) – Dictionary of metrics computed by sample analyzers, with keys prefixed by analyzer ID to avoid conflicts

ANALYZER_METRICS_FIELD = 'analyzer_metrics'#
analyzer_metrics: dict[str, Any]#
conversation_id: str#
conversation_index: int#
message_id: str#
message_index: int#
role: str#
text_content: str#
to_dict() dict[str, Any][source]#

Convert the analysis result to a dictionary with flattened analyzer metrics.

Returns:

Dictionary representation of the analysis result with analyzer metrics flattened into the main dictionary (prefixed by analyzer ID)
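To illustrate the prefixing scheme described above, here is a hypothetical sketch (the analyzer IDs and metric names are invented for the example; the actual to_dict implementation may differ):

```python
def flatten_with_prefix(core_fields: dict, per_analyzer: dict[str, dict]) -> dict:
    """Merge each analyzer's metrics into one flat dict, prefixing every
    metric key with its analyzer ID so that two analyzers exposing the
    same metric name cannot collide."""
    flat = dict(core_fields)
    for analyzer_id, metrics in per_analyzer.items():
        for name, value in metrics.items():
            flat[f"{analyzer_id}_{name}"] = value
    return flat

flat = flatten_with_prefix(
    {"conversation_id": "conv_0", "role": "user"},
    {
        "length": {"word_count": 7},
        "safety": {"score": 0.98},
    },
)
# {'conversation_id': 'conv_0', 'role': 'user',
#  'length_word_count': 7, 'safety_score': 0.98}
```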

class oumi.core.analyze.SampleAnalyzer[source]#

Bases: ABC

Base class for sample analyzer plugins that analyze individual samples.

abstractmethod analyze_message(text_content: str, tokenizer: Any | None = None) dict[str, Any][source]#

Analyze a single message and return metrics.

Parameters:
  • text_content – The text content to analyze

  • tokenizer – Optional tokenizer to use for tokenization-based analysis

Returns:

Dictionary containing analysis metrics
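A minimal custom analyzer might look like the sketch below. In practice you would subclass oumi.core.analyze.SampleAnalyzer directly; the base class is stubbed here so the example is self-contained, and QuestionAnalyzer and its metric names are hypothetical:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class SampleAnalyzer(ABC):
    """Stand-in for oumi.core.analyze.SampleAnalyzer, reproduced here
    so the sketch runs on its own."""

    @abstractmethod
    def analyze_message(
        self, text_content: str, tokenizer: Optional[Any] = None
    ) -> dict[str, Any]:
        ...

class QuestionAnalyzer(SampleAnalyzer):
    """Hypothetical plugin that flags messages containing questions."""

    def analyze_message(
        self, text_content: str, tokenizer: Optional[Any] = None
    ) -> dict[str, Any]:
        return {
            "contains_question": "?" in text_content,
            "question_marks": text_content.count("?"),
        }

metrics = QuestionAnalyzer().analyze_message("What time is it?")
# {'contains_question': True, 'question_marks': 1}
```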