oumi.core.analyze#

Sample analyzer plugin system for OUMI.

This package provides a plugin-based architecture for analyzing conversation data with different types of sample analyzers (length, safety, etc.).

class oumi.core.analyze.DatasetAnalysisResult(dataset_name: str, total_conversations: int, conversations_analyzed: int, total_messages: int, messages: list[MessageAnalysisResult])[source]#

Bases: object

Complete result of dataset analysis.

Variables:
  • dataset_name (str) – Name of the analyzed dataset

  • total_conversations (int) – Total number of conversations in the dataset

  • conversations_analyzed (int) – Number of conversations actually analyzed

  • total_messages (int) – Total number of messages analyzed

  • messages (list[oumi.core.analyze.dataset_analyzer.MessageAnalysisResult]) – List of analysis results for each individual message

conversations_analyzed: int#
dataset_name: str#
messages: list[MessageAnalysisResult]#
to_dataframe() DataFrame[source]#

Convert the analysis results to a pandas DataFrame.

Returns:

DataFrame with flattened analyzer metrics for easy querying. Each row represents one message with all its analysis metrics.

to_dict() dict[str, Any][source]#

Convert the analysis result to a dictionary.

Returns:

Dictionary representation of the analysis result

total_conversations: int#
total_messages: int#
class oumi.core.analyze.DatasetAnalyzer(config: AnalyzeConfig)[source]#

Bases: object

Orchestrates dataset analysis by creating and managing sample analyzers.

property analysis_results: DatasetAnalysisResult | None#

Get the analysis results if available.

Returns:

DatasetAnalysisResult if analysis has been run, None otherwise

analyze_dataset() None[source]#

Analyze the dataset and store results internally.

This method performs sample-level analysis using the configured sample analyzers. Each sample analyzer processes individual messages and returns metrics for each message. Results are stored internally and can be accessed via the query() method.

Raises:

ValueError – If no analyzers are configured for analysis.

filter(query_expression: str) BaseMapDataset[source]#

Filter the original dataset based on analysis results.

This method uses analysis results to filter the original dataset, returning a new dataset object containing only the conversations that match the query.

Parameters:

query_expression – Pandas query expression to filter analysis results

Returns:

A new dataset object containing only the filtered conversations

Examples:

# Filter for conversations with short messages
short_dataset = analyzer.filter("length_word_count < 10")

# Filter for conversations with assistant messages
assistant_dataset = analyzer.filter("role == 'assistant'")

# Filter for conversations with long user messages
long_user_dataset = analyzer.filter(
    "role == 'user' and length_word_count > 100")

query(query_expression: str) DataFrame[source]#

Query analysis results using pandas query expression.

Parameters:

query_expression – Pandas query expression to filter analysis results. See the pandas DataFrame.query documentation for details: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Returns:

DataFrame with filtered analysis results

Examples:

# Filter for short messages
short_messages = analyzer.query("length_word_count < 10")

# Filter for assistant messages
assistant_messages = analyzer.query("role == 'assistant'")

# Filter for long user messages
long_user = analyzer.query("role == 'user' and length_word_count > 100")

class oumi.core.analyze.LengthAnalyzer(*, char_count: bool = True, word_count: bool = True, sentence_count: bool = True, token_count: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, include_special_tokens: bool = True)[source]#

Bases: SampleAnalyzer

Analyzer that computes various length metrics for text content.

analyze_message(text_content: str, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None) dict[str, Any][source]#

Analyze text content and return length metrics.

Parameters:
  • text_content – The text content to analyze

  • tokenizer – Optional tokenizer to use for token counting

Returns:

Dictionary containing requested length metrics
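As an illustration of what these metrics look like, here is a rough, self-contained approximation (not the actual LengthAnalyzer code; the real sentence-splitting heuristic may differ, and token counting is omitted because it requires a tokenizer):

```python
import re

def length_metrics(text: str) -> dict:
    """Approximate the char/word/sentence metrics described above."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        # Naive sentence split on ., !, ? — the real analyzer may use
        # a different heuristic.
        "sentence_count": len(
            [s for s in re.split(r"[.!?]+", text) if s.strip()]
        ),
    }

metrics = length_metrics("Hello world. How are you?")
# {'char_count': 25, 'word_count': 5, 'sentence_count': 2}
```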

class oumi.core.analyze.MessageAnalysisResult(conversation_id: str, conversation_index: int, message_index: int, role: str, message_id: str, text_content: str, analyzer_metrics: dict[str, Any])[source]#

Bases: object

Result of analyzing a single message in a conversation.

Variables:
  • conversation_id (str) – Unique identifier for the conversation

  • conversation_index (int) – Index of the conversation in the dataset

  • message_index (int) – Index of the message within the conversation

  • role (str) – Role of the message sender (e.g., ‘user’, ‘assistant’)

  • message_id (str) – Unique identifier for the message

  • text_content (str) – The text content of the message

  • analyzer_metrics (dict[str, Any]) – Dictionary of metrics computed by sample analyzers, with keys prefixed by analyzer ID to avoid conflicts

ANALYZER_METRICS_FIELD = 'analyzer_metrics'#
analyzer_metrics: dict[str, Any]#
conversation_id: str#
conversation_index: int#
message_id: str#
message_index: int#
role: str#
text_content: str#
to_dict() dict[str, Any][source]#

Convert the analysis result to a dictionary with flattened analyzer metrics.

Returns:

Dictionary representation of the analysis result with analyzer metrics flattened into the main dictionary (prefixed by analyzer ID)
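To illustrate the prefixing scheme described above, here is a hypothetical sketch (the analyzer IDs and metric names are invented for the example; the actual to_dict implementation may differ):

```python
def flatten_with_prefix(core_fields: dict, per_analyzer: dict[str, dict]) -> dict:
    """Merge each analyzer's metrics into one flat dict, prefixing every
    metric key with its analyzer ID so that two analyzers exposing the
    same metric name cannot collide."""
    flat = dict(core_fields)
    for analyzer_id, metrics in per_analyzer.items():
        for name, value in metrics.items():
            flat[f"{analyzer_id}_{name}"] = value
    return flat

flat = flatten_with_prefix(
    {"conversation_id": "conv_0", "role": "user"},
    {
        "length": {"word_count": 7},
        "safety": {"score": 0.98},
    },
)
# {'conversation_id': 'conv_0', 'role': 'user',
#  'length_word_count': 7, 'safety_score': 0.98}
```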

class oumi.core.analyze.SampleAnalyzer[source]#

Bases: ABC

Base class for sample analyzer plugins that analyze individual samples.

abstractmethod analyze_message(text_content: str, tokenizer: Any | None = None) dict[str, Any][source]#

Analyze a single message and return metrics.

Parameters:
  • text_content – The text content to analyze

  • tokenizer – Optional tokenizer to use for tokenization-based analysis

Returns:

Dictionary containing analysis metrics
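A minimal custom analyzer might look like the sketch below. In practice you would subclass oumi.core.analyze.SampleAnalyzer directly; the base class is stubbed here so the example is self-contained, and QuestionAnalyzer and its metric names are hypothetical:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class SampleAnalyzer(ABC):
    """Stand-in for oumi.core.analyze.SampleAnalyzer, reproduced here
    so the sketch runs on its own."""

    @abstractmethod
    def analyze_message(
        self, text_content: str, tokenizer: Optional[Any] = None
    ) -> dict[str, Any]:
        ...

class QuestionAnalyzer(SampleAnalyzer):
    """Hypothetical plugin that flags messages containing questions."""

    def analyze_message(
        self, text_content: str, tokenizer: Optional[Any] = None
    ) -> dict[str, Any]:
        return {
            "contains_question": "?" in text_content,
            "question_marks": text_content.count("?"),
        }

metrics = QuestionAnalyzer().analyze_message("What time is it?")
# {'contains_question': True, 'question_marks': 1}
```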