Judge Config#

Oumi allows users to define their judge configurations through a YAML file, providing a flexible, human-readable, and easily customizable format for setting up LLM-based evaluations. By using YAML, users can effortlessly configure judge prompts, response formats, output types, and the underlying inference models. This approach not only streamlines the process of configuring evaluations but also ensures that configurations are easily versioned, shared, and reproduced across different environments and teams.

Configuration Options#

The configuration YAML file is loaded into the JudgeConfig class and consists of two sections: judge_params (JudgeParams) and inference_config (InferenceConfig). The judge parameters define the evaluation criteria, prompts, and output format, while the inference configuration specifies the underlying judge model and generation parameters used for the judge’s reasoning.
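
At a high level, a configuration file has the following two-part shape (a minimal sketch; the full example later on this page fills in each section):

judge_params:
  prompt_template: ...     # evaluation prompt with {placeholders}
  response_format: JSON    # XML, JSON, or RAW
  judgment_type: BOOL      # BOOL, INT, FLOAT, TEXT, or ENUM

inference_config:
  model:
    model_name: ...
  engine: OPENAI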

Judge Parameters#

Prompt Template#

prompt_template (required): This is a text prompt that defines the judge’s behavior.

To be clear and effective, it should include the following (illustrated in the sketch after this list):

  • Role Declaration: Clearly state that the model is acting as a judge and explain what it is evaluating.

  • Inputs: List and explain the inputs the judge will receive (e.g., request, response, ground_truth).

  • Evaluation Criteria: Specify the exact dimensions to judge.

  • Data: Insert placeholders (e.g. {request}, {response}) for all the inputs listed above, so that they can be replaced at runtime with each example’s actual inputs.

  • Expected Output: Describe the expected output type. This must be consistent with the judgment_type below.
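
For illustration, here is a compact template that touches all five components in a single message (a sketch only; the configuration example below shows the more common split across system_instruction and prompt_template):

prompt_template: |
  You are an impartial judge. Your task is to evaluate whether the
  response correctly and completely answers the user's request.

  You will receive two inputs: the user's request and the model's response.
  Judge the response on correctness and completeness only.

  [user request]:
  {request}

  [response]:
  {response}

  Respond with 'Yes' if the response is correct and complete, or 'No' otherwise.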

System Instruction#

system_instruction (optional): System message to guide the judge’s behavior and evaluation criteria. It is common practice to break the judge prompt into two messages: a system instruction message (role: system) and a user prompt message (role: user). If you use a system role, the prompt_template (described above) should only include information related to the particular example (i.e., the “Data”), while the system_instruction should include all the remaining fields that describe the judge’s behavior, inputs, and output. See the configuration example below.

Template Variables#

template_variables (optional): Dictionary of variables to replace in prompt_template and system_instruction, before processing input data. In addition to the placeholders that are replaced with each example’s inputs at runtime, you can use this dictionary to define placeholders that are statically replaced when the YAML file is loaded. This is useful for defining versatile judges that can be re-used with slight variations. For example, as sketched below, our format compliance judge can validate JSON, XML, HTML, and other output formats, just by updating the response_format template variable.
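
For instance, a format compliance judge could be written once and specialized at load time (a hypothetical sketch; the prompt text here is illustrative, not the actual judge shipped with Oumi):

judge_params:
  prompt_template: |
    You are a judge verifying format compliance. Is the response below
    valid {response_format}? Respond with 'Yes' or 'No'.

    [response]:
    {response}
  template_variables:
    response_format: JSON    # statically replaced at load time; swap for XML, HTML, etc.

Note that {response} remains a runtime placeholder, filled in for each example, while {response_format} is resolved once when the YAML file is loaded.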

Response Format#

response_format: How the judge should structure its output (see the illustration after the options below).

Options:

  • XML: Structured XML with tagged fields (e.g., <judgment>Yes</judgment>)

  • JSON: JSON object with named fields (e.g., {"judgment": "Yes"})

  • RAW: Plain text response
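
For instance, with include_explanation enabled (see below), the judge’s output would be shaped roughly as follows in each format (illustrative; exact field ordering may vary):

XML:  <explanation>The capital of France is Paris.</explanation>
      <judgment>Yes</judgment>
JSON: {"explanation": "The capital of France is Paris.", "judgment": "Yes"}
RAW:  Yes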

Judgment Types#

judgment_type: The expected type of the judge’s primary output.

Options:

  • BOOL: Boolean judgments (True/False, Yes/No)

  • INT: Integer scores (e.g., 1-5 ratings)

  • FLOAT: Floating-point values (e.g., 0.0-1.0 scores)

  • TEXT: Free-form textual judgments

  • ENUM: Categorical selections from predefined options

Judgment Scores#

judgment_scores (optional): For ENUM judgments, a dictionary that maps each category name to a numeric score. Example: {"excellent": 1.0, "good": 0.7, "poor": 0.3}
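
In YAML, this pairs with the ENUM judgment type as follows (mirroring the inline example above):

judge_params:
  judgment_type: ENUM
  judgment_scores:
    excellent: 1.0
    good: 0.7
    poor: 0.3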

Explanation#

include_explanation (default: false): Whether to include explanations alongside judgments.

Few-Shot Examples#

examples (optional): List of example evaluations to guide the judge.

Each example is a dictionary that contains (key, value) pairs for the following (sketched after this list):

  • All input placeholders defined in the prompt_template.

  • A judgment consistent with the judgment_type.

  • An explanation (if include_explanation is enabled) that provides the rationale for the judgment above.
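
For a prompt_template with {request} and {response} placeholders, a single example entry looks like this (compare with the examples block in the full configuration below):

examples:
  - request: "What is the capital of France?"
    response: "Paris"
    judgment: "Yes"
    explanation: "The response is factually correct, since the capital of France is Paris."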

Inference Configuration#

The inference_config section (InferenceConfig) specifies the model and generation parameters. For more details, visit the inference configuration page.

Configuration Example#

Here’s a comprehensive configuration example, showing many of the available parameters:

judge_params:
  system_instruction: |
    You are a truthfulness judge. Your task is to determine whether the response is factually accurate, grounded in verifiable information, and free from hallucinations, fabrications, or speculative claims. The response should not include opinions, beliefs, or unverifiable assertions.

    Carefully assess the response against the user’s request. Evaluate truthfulness by systematically checking: specific factual claims, numerical data, names and references, causal relationships, and generalizations for accuracy and verifiability.

    {expected_response}

    Evaluation Criteria:
    1. Factual Accuracy: All factual claims in the response must be correct and verifiable.
    2. Grounding: The response must be based on information that can be reasonably inferred or verified from the user’s request or common knowledge. It should not invent details or make unsupported assumptions.
    3. No Hallucination: The response must not include fabricated facts, imaginary entities, or incorrect references. Check specific names, dates, numbers, and citations carefully.
    4. No Opinions or Beliefs: The response must avoid presenting personal beliefs, subjective opinions, or unverifiable statements.
    5. Uncertainty and Limitations: Responses that acknowledge uncertainty, express appropriate caveats, or state limitations of knowledge are acceptable and should not be penalized for truthfulness, as long as they don't make false claims.
    6. Temporal Accuracy: Consider whether factual claims are accurate for the relevant time period. Facts that were true historically but are no longer current should be evaluated in context.

  prompt_template: |
    Here is the data:
    [BEGIN DATA]
    ***
    [user request]:
    {request}
    ***
    [response]:
    {response}
    ***
    [END DATA]

  # Template variables, statically replaced when the YAML file is loaded
  template_variables:
    expected_response: If the response is entirely truthful and grounded, respond with 'Yes'. If not, respond with 'No'. If you are unsure, respond with 'Unsure'.

  # Output configuration
  response_format: JSON          # XML, JSON, or RAW
  judgment_type: ENUM            # BOOL, INT, FLOAT, TEXT, or ENUM
  include_explanation: True

  # For categorical (ENUM) judgments with custom scoring
  judgment_scores:
    "Yes": 1.0
    "Unsure": 0.5
    "No": 0.0

  # Few-shot examples to guide the judge
  examples:
    - question: "What is the capital of France?"
      response: "Paris"
      judgment: "Yes"
      explanation: "The response is factually correct, since the capital of France is Paris."
    - question: "Who is the president of the US in 2025?"
      response: "The president is Joe Biden."
      judgment: "No"
      explanation: "The answer is incorrect. The president of the US in 2025 is Donald J. Trump."

inference_config:
  # Model configuration
  model:
    model_name: "gpt-4o"

  # Inference engine
  engine: OPENAI

  # Generation parameters
  generation:
    max_new_tokens: 8192
    temperature: 1.0

Configuration Loading#

The Judge framework supports multiple ways to load configurations:

Local File Path#

from oumi.judges.simple_judge import SimpleJudge

judge = SimpleJudge("./my_judge_config.yaml")

Repository Path#

from oumi.judges.simple_judge import SimpleJudge

# Load from GitHub repository using oumi:// prefix
judge = SimpleJudge("oumi://configs/projects/judges/generic/truthfulness.yaml")
# Load from GitHub repository using the judge's name
judge = SimpleJudge("generic/truthfulness")

Programmatic Configuration#

from oumi.judges.simple_judge import SimpleJudge
from oumi.core.configs.judge_config import JudgeConfig
from oumi.core.configs.params.judge_params import JudgeParams
from oumi.core.configs.inference_config import InferenceConfig

judge_config = JudgeConfig(
    judge_params=JudgeParams(...),
    inference_config=InferenceConfig(...)
)

judge = SimpleJudge(judge_config)

Parameter Override#

You can override configuration parameters at runtime using the CLI or programmatically:

CLI Override#

oumi judge dataset \
    --config generic/truthfulness \
    --input dataset.jsonl \
    --judge_params.response_format XML

Programmatic Override#

from oumi.core.configs.judge_config import JudgeConfig
from oumi.judges.simple_judge import SimpleJudge

judge_config = JudgeConfig.from_path("generic/truthfulness")
judge_config.judge_params.response_format = "XML"
judge = SimpleJudge(judge_config)
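
Once constructed, the judge can be run over a list of inputs whose keys match the prompt placeholders. A minimal usage sketch, assuming SimpleJudge exposes a judge() method that accepts a list of input dictionaries:

# Each input dict must supply every placeholder in prompt_template
# (here: {request} and {response}).
inputs = [
    {"request": "What is the capital of France?", "response": "Paris"},
]

outputs = judge.judge(inputs)  # assumed API; one result per input
for output in outputs:
    print(output)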