Guided Decoding

Use the auto_evaluation.guided_decoding task type to configure auto evaluation tasks where the set of possible results is well defined.

Example Usage

The following is a basic example in which a guided decoding task evaluates the correctness of a generated output against the ground truth.
client.evaluations.create(
  name="Example Correctness Evaluation",
  # Each item pairs a user input with the expected (ground-truth) answer and the model's generated output.
  data=[
    {
      "input": "What color is the sky?",
      "expected_output": "Blue",
      "generated_output": "The sky appears blue during ..."
    },
    ...
  ],
  tasks=[
    {
      "task_type": "auto_evaluation.guided_decoding",
      "alias": "correctness",
      "configuration": {
        "model": "openai/o3-mini",
        "prompt": """
          Given the user's query: {{item.input}},
          The agent's response was: {{item.generated_output}}
          The expected response is: {{item.expected_output}}
          Did the agent's response fully represent the expected response?
        """,
        "choices": ["Yes", "No"]
      }
    }
  ]
)
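The {{item.*}} placeholders in the prompt reference the fields of each record in data, so the judge model receives a fully rendered prompt per item. The snippet below is a minimal illustration of that substitution; it is a rough sketch, not the platform's actual templating engine.
# Illustrative only: how the {{item.*}} placeholders might resolve for one data item.
# The evaluation service performs this substitution internally; this sketch just
# makes the rendered prompt visible.
item = {
    "input": "What color is the sky?",
    "expected_output": "Blue",
    "generated_output": "The sky appears blue during ...",
}

prompt_template = """
Given the user's query: {{item.input}},
The agent's response was: {{item.generated_output}}
The expected response is: {{item.expected_output}}
Did the agent's response fully represent the expected response?
"""

rendered = prompt_template
for field, value in item.items():
    rendered = rendered.replace("{{item.%s}}" % field, value)

print(rendered)  # the judge model answers with one of the configured choices: "Yes" or "No"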
When defining a task, you can also customize the response format. For example, you can have the judge LLM return a reasoning string alongside the final score, which is often helpful when reviewing results. The following example demonstrates several options, including boolean, numeric, enum, and free-text fields.
tasks = [
  {
    "task_type": "auto_evaluation.guided_decoding",
    "alias": "multi_response_option_judge",
    "configuration": {
      "model": "openai/gpt-4o",
      "prompt": "Evaluate this response...",
      "response_format": {
        "type": "object",
        "properties": {
          "is_helpful": {
            "type": "boolean",
            "description": "Whether the response is helpful"
          },
          "quality_score": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5,
            "description": "Quality score from 1 to 5"
          },
          "accuracy_score": {
            "type": "number",
            "minimum": 0.0,
            "maximum": 1.0,
            "description": "Accuracy as a decimal"
          },
          "category": {
            "type": "string",
            "enum": ["excellent", "good", "fair", "poor"],
            "description": "Quality category"
          },
          "reasoning": {
            "type": "string",
            "description": "Explanation of the evaluation"
          }
        },
        "required": ["is_helpful", "quality_score", "accuracy_score", "category", "reasoning"]
      }
    }
  }
]
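With a response_format configured, the judge model returns a JSON object conforming to the schema above rather than a single choice. The snippet below is a rough sketch of what a conforming output might look like, validated locally with the open-source jsonschema package; the package is used here only for illustration and is not part of the evaluation API.
# Illustrative only: an example judge output that satisfies the schema above,
# checked locally with jsonschema. The evaluation service enforces the schema
# itself; this sketch just shows what a conforming result looks like.
from jsonschema import validate

response_format = {
    "type": "object",
    "properties": {
        "is_helpful": {"type": "boolean"},
        "quality_score": {"type": "integer", "minimum": 1, "maximum": 5},
        "accuracy_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "category": {"type": "string", "enum": ["excellent", "good", "fair", "poor"]},
        "reasoning": {"type": "string"},
    },
    "required": ["is_helpful", "quality_score", "accuracy_score", "category", "reasoning"],
}

example_judge_output = {
    "is_helpful": True,
    "quality_score": 4,
    "accuracy_score": 0.9,
    "category": "good",
    "reasoning": "The response answers the question but omits a caveat.",
}

validate(instance=example_judge_output, schema=response_format)  # raises on mismatch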