Guided Decoding

Use the auto_evaluation.guided_decoding task type to configure auto evaluation tasks where the set of possible results is well defined.

Example Usage

The following is a basic example in which a guided decoding task evaluates the correctness of a generated output against the ground truth.
client.evaluations.create(
  name="Example Correctness Evaluation",
  # Each item pairs a user input with the expected (ground-truth) answer and the model's generated output.
  data=[
    {
      "input": "What color is the sky?",
      "expected_output": "Blue",
      "generated_output": "The sky appears blue during ..."
    },
    ...
  ],
  tasks=[
    {
      "task_type": "auto_evaluation.guided_decoding",
      "alias": "correctness",
      "configuration": {
        "model": "openai/o3-mini",
        "prompt": """
          Given the user's query: {{item.input}},
          The agent's response was: {{item.generated_output}}
          The expected response is: {{item.expected_output}}
          Did the agent's response fully represent the expected response?
        """,
        "choices": ["Yes", "No"]
      }
    }
  ]
)
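The {{item.*}} placeholders in the prompt reference the fields of each record in data, so the judge model receives a fully rendered prompt per item. The snippet below is a minimal illustration of that substitution; it is a rough sketch, not the platform's actual templating engine.
# Illustrative only: how the {{item.*}} placeholders might resolve for one data item.
# The evaluation service performs this substitution internally; this sketch just
# makes the rendered prompt visible.
item = {
    "input": "What color is the sky?",
    "expected_output": "Blue",
    "generated_output": "The sky appears blue during ...",
}

prompt_template = """
Given the user's query: {{item.input}},
The agent's response was: {{item.generated_output}}
The expected response is: {{item.expected_output}}
Did the agent's response fully represent the expected response?
"""

rendered = prompt_template
for field, value in item.items():
    rendered = rendered.replace("{{item.%s}}" % field, value)

print(rendered)  # the judge model answers with one of the configured choices: "Yes" or "No"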
When defining a task, you can also customize the response format. For example, you can have the judge LLM return a reasoning string alongside the final score, which is often helpful when reviewing results. The following example demonstrates several options, including boolean, numeric, enum, and free-text fields.
tasks = [
  {
    "task_type": "auto_evaluation.guided_decoding",
    "alias": "multi_response_option_judge",
    "configuration": {
      "model": "openai/gpt-4o",
      "prompt": "Evaluate this response...",
      "response_format": {
        "type": "object",
        "properties": {
          "is_helpful": {
            "type": "boolean",
            "description": "Whether the response is helpful"
          },
          "quality_score": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5,
            "description": "Quality score from 1 to 5"
          },
          "accuracy_score": {
            "type": "number",
            "minimum": 0.0,
            "maximum": 1.0,
            "description": "Accuracy as a decimal"
          },
          "category": {
            "type": "string",
            "enum": ["excellent", "good", "fair", "poor"],
            "description": "Quality category"
          },
          "reasoning": {
            "type": "string",
            "description": "Explanation of the evaluation"
          }
        },
        "required": ["is_helpful", "quality_score", "accuracy_score", "category", "reasoning"]
      }
    }
  }
]
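With a response_format configured, the judge model returns a JSON object conforming to the schema above rather than a single choice. The snippet below is a rough sketch of what a conforming output might look like, validated locally with the open-source jsonschema package; the package is used here only for illustration and is not part of the evaluation API.
# Illustrative only: an example judge output that satisfies the schema above,
# checked locally with jsonschema. The evaluation service enforces the schema
# itself; this sketch just shows what a conforming result looks like.
from jsonschema import validate

response_format = {
    "type": "object",
    "properties": {
        "is_helpful": {"type": "boolean"},
        "quality_score": {"type": "integer", "minimum": 1, "maximum": 5},
        "accuracy_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "category": {"type": "string", "enum": ["excellent", "good", "fair", "poor"]},
        "reasoning": {"type": "string"},
    },
    "required": ["is_helpful", "quality_score", "accuracy_score", "category", "reasoning"],
}

example_judge_output = {
    "is_helpful": True,
    "quality_score": 4,
    "accuracy_score": 0.9,
    "category": "good",
    "reasoning": "The response answers the question but omits a caveat.",
}

validate(instance=example_judge_output, schema=response_format)  # raises on mismatch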