Agent Evaluation Suites give you a repeatable way to verify that an agent still behaves the way you expect after prompt, model, tool, or skill changes. You define evaluations that score responses, then attach them to scenarios on individual agents so each scenario produces a pass/fail result you can review over time.
Use evaluation suites to:
- Define reusable test suites on agents with conversation scenarios.
- Assert specific behavior, such as the tools and skills that should load during a conversation.
- Score agent responses with LLM-as-judge evaluations against a passing threshold.
Concepts
- Evaluation — A reusable LLM-as-judge definition. An evaluation has a name, a prompt that tells the model what to score (for example, whether the agent followed the right instructions, avoided hallucinations, or used the correct tone), a passing threshold, and the model used to score.
- Scenario — A test case attached to a specific agent. A scenario describes a conversation prompt to run, an assertion (for example, that a particular tool or skill should load), and one or more evaluations that score the response.
- Run — A single execution of a scenario. Each run produces results for the assertion and every attached evaluation, and is recorded in run history for that scenario.
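To make the relationship between the three concepts concrete, here is a minimal sketch of them as Python dataclasses. The class and field names are hypothetical and chosen only to mirror the descriptions above; the product's actual schema is not documented here. The rule that a run passes only when the assertion holds and every attached evaluation meets its threshold is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evaluation:
    """Reusable LLM-as-judge definition (hypothetical field names)."""
    name: str
    prompt: str
    passing_threshold: float
    model: str


@dataclass
class Scenario:
    """Test case attached to one specific agent."""
    agent: str
    prompt: str
    assertion: str  # e.g. the tool or skill expected to load
    evaluations: list[Evaluation] = field(default_factory=list)


@dataclass
class Run:
    """One execution of a scenario."""
    scenario: Scenario
    assertion_passed: bool
    scores: dict[str, float]  # evaluation name -> judge score

    def passed(self) -> bool:
        # Assumed rule: the assertion holds and every attached
        # evaluation meets or exceeds its passing threshold.
        return self.assertion_passed and all(
            self.scores[e.name] >= e.passing_threshold
            for e in self.scenario.evaluations
        )


# Placeholder values for illustration only.
tone = Evaluation("Tone Check", "Score how polite the reply is.", 0.8, "example-judge-model")
scenario = Scenario("support-agent", "Where is my order?", "search_orders", [tone])
run = Run(scenario, assertion_passed=True, scores={"Tone Check": 0.9})
print(run.passed())
```

Note that the evaluation is defined independently of any agent, while the scenario holds the reference to both the agent and the evaluations, which matches the reuse model described above.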
Create an evaluation
App Administrators create evaluations once and reuse them across any number of agents and scenarios.
- Open Intelligence → Evaluations in your app.
- Click New Evaluation in the top-right corner.
- Enter the evaluation details:
  - Name — A descriptive name (for example, Hallucination Judge or Tone Check).
  - Prompt — Instructions that tell the model what to score. Be specific about what counts as passing and failing behavior.
  - Passing Threshold — The score the response must meet or exceed to pass.
- Choose a Model from the dropdown to act as the judge.
- Click Save.
Reuse the same evaluation across multiple agents and scenarios to standardize scoring — for example, a shared hallucination judge or a tone judge that every customer-facing agent should pass.
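As an illustration of a prompt that is specific about passing and failing behavior, the sketch below writes out one evaluation as a plain Python dictionary. The keys simply mirror the form fields above. The 0.0 to 1.0 score scale, the threshold value, and the model name are assumptions for the example, not documented product behavior.

```python
# Hypothetical evaluation definition; keys mirror the form fields.
HALLUCINATION_JUDGE = {
    "name": "Hallucination Judge",
    "prompt": (
        "Score the response from 0.0 to 1.0. "
        "High score: every factual claim is supported by the conversation "
        "or by tool output. "
        "Low score: the response states facts that appear nowhere in the context."
    ),
    "passing_threshold": 0.7,        # assumed 0.0 to 1.0 scale
    "model": "example-judge-model",  # placeholder, not a real model name
}


def meets_threshold(score: float, evaluation: dict) -> bool:
    """A response passes when its score meets or exceeds the threshold."""
    return score >= evaluation["passing_threshold"]


print(meets_threshold(0.7, HALLUCINATION_JUDGE))   # True: equal to the threshold passes
print(meets_threshold(0.69, HALLUCINATION_JUDGE))  # False: just below the threshold
```

Spelling out both ends of the scale in the prompt gives the judge model a clearer basis for scoring than a one-line instruction such as "check for hallucinations".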
Attach an evaluation to an agent scenario
Scenarios live on the agent and pair a test prompt with the assertions and evaluations that should run against it.
- Open Intelligence → Agents and select the agent you want to evaluate.
- Open the Scenarios tab on the agent page and click + Scenario.
- Configure the scenario:
  - Name and Description — Identify the scenario and what it covers.
  - Prompt — The user message or conversation that drives the scenario.
  - Timeout — How long the scenario is allowed to run before failing.
  - Model — The model the agent should use for this run.
  - Assertion — A required behavior to verify, such as which tool or skill should load.
  - Evaluations — Select one or more evaluations to score the response.
- Click Save.
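The scenario fields above can be pictured as a small configuration record plus a check that runs before any scoring. This is a sketch under stated assumptions: the dictionary keys, the tool name `search_orders`, the model name, and the `check_run` helper are all hypothetical, and the product may represent timeouts and assertions differently.

```python
from typing import Optional

# Hypothetical scenario configuration; keys mirror the form fields.
SCENARIO = {
    "name": "Order lookup loads the search tool",
    "description": "Asking about an order should load the order search tool.",
    "prompt": "Where is order 1042?",
    "timeout_seconds": 60,
    "model": "example-agent-model",  # placeholder, not a real model name
    "assertion": {"kind": "tool", "name": "search_orders"},
    "evaluations": ["Hallucination Judge", "Tone Check"],
}


def check_run(scenario: dict, loaded: set, elapsed_seconds: float) -> Optional[str]:
    """Return a failure reason, or None when the timeout and assertion both hold."""
    if elapsed_seconds > scenario["timeout_seconds"]:
        return "timeout"
    if scenario["assertion"]["name"] not in loaded:
        return "assertion failed"
    return None


# Names of tools and skills observed during an imagined run.
print(check_run(SCENARIO, {"search_orders"}, 12.0))  # None
print(check_run(SCENARIO, set(), 12.0))              # assertion failed
```

The assertion is a deterministic check on what loaded, while the evaluations are judged scores, so a scenario can fail its assertion even when the response text itself reads well.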
Run a scenario and review results
- From the scenario configuration page, click Run Scenario.
- Watch the live response in the panel and review the assertion result and each evaluation’s score.
- Open the Run history section on the same page to compare the latest run against previous runs.
Because run history is preserved per scenario, you can use evaluation suites as a regression check: rerun a scenario after changing a prompt, model, tool, or skill to confirm the agent still meets your expectations before promoting the change.
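The regression check described above amounts to comparing the latest run with an earlier one and flagging anything that used to pass and now fails. The sketch below assumes you have copied the per-check results of two runs into dictionaries by hand; the `regressions` helper and the result shape are hypothetical and do not reflect a documented export format or API.

```python
def regressions(previous: dict[str, bool], latest: dict[str, bool]) -> list[str]:
    """Names of checks that passed in the previous run but fail in the latest."""
    return sorted(
        name
        for name, passed in latest.items()
        if not passed and previous.get(name, False)
    )


# Results before and after an imagined prompt change.
previous_run = {"assertion": True, "Hallucination Judge": True, "Tone Check": True}
latest_run = {"assertion": True, "Hallucination Judge": True, "Tone Check": False}

print(regressions(previous_run, latest_run))  # ['Tone Check']
```

A check that failed in both runs is not reported, since it was already failing before the change and is not a new regression.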