Test Agent Behavior

Agent Evaluation Suites give you a repeatable way to verify that an agent still behaves the way you expect after prompt, model, tool, or skill changes. You define evaluations that score responses, then attach them to scenarios on individual agents so each scenario produces a pass/fail result you can review over time. Use evaluation suites to:

- Define reusable test suites on agents with conversation scenarios.
- Assert specific behavior, such as the tools and skills that should load during a conversation.
- Score agent responses with LLM-as-judge evaluations against a passing threshold.

Concepts

Evaluation — A reusable LLM-as-judge definition. An evaluation has a name, a prompt that tells the model what to score (for example, whether the agent followed the right instructions, avoided hallucinations, or used the correct tone), a passing threshold, and the model used to score.
Scenario — A test case attached to a specific agent. A scenario describes a conversation prompt to run, an assertion (for example, that a particular tool or skill should load), and one or more evaluations that score the response.
Run — A single execution of a scenario. Each run produces results for the assertion and every attached evaluation, and is recorded in run history for that scenario.
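
To make the relationship between the three concepts concrete, here is a minimal sketch of them as plain Python data classes. It illustrates the concepts only and is not the product's API or storage format: every class name, field name, and sample value is hypothetical, and treating a run as passed only when the assertion and every evaluation pass is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Evaluation:
    """Reusable LLM-as-judge definition (hypothetical field names)."""
    name: str
    prompt: str               # tells the judge model what to score
    passing_threshold: float  # score the response must meet or exceed
    model: str                # model that acts as the judge


@dataclass
class Scenario:
    """Test case attached to one specific agent."""
    agent: str
    prompt: str                    # conversation prompt to run
    assertion: str                 # tool or skill expected to load
    evaluations: List[Evaluation]  # one or more evaluations


@dataclass
class Run:
    """Single execution of a scenario, as kept in run history."""
    scenario: Scenario
    assertion_passed: bool
    evaluation_passed: Dict[str, bool]  # evaluation name -> pass/fail

    def passed(self) -> bool:
        # Assumption: the run passes only if the assertion and
        # every attached evaluation pass.
        return self.assertion_passed and all(self.evaluation_passed.values())


# Illustrative values only.
tone = Evaluation("Tone Check", "Score whether the reply is polite.", 0.8, "judge-model")
scenario = Scenario("support-agent", "Where is my order?", "order_lookup", [tone])
run = Run(scenario, assertion_passed=True, evaluation_passed={"Tone Check": True})
```

Note that the evaluation is defined once and only referenced from the scenario, which is what lets the same evaluation be shared across agents.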

Create an evaluation

App Administrators create evaluations once and reuse them across any number of agents and scenarios.

Open Intelligence → Evaluations in your app.
Click New Evaluation in the top-right corner.
Enter the evaluation details:
- Name — A descriptive name (for example, Hallucination Judge or Tone Check).
- Prompt — Instructions that tell the model what to score. Be specific about what counts as passing and failing behavior.
- Passing Threshold — The score the response must meet or exceed to pass.
Choose a Model from the dropdown to act as the judge.
Click Save.

Reuse the same evaluation across multiple agents and scenarios to standardize scoring — for example, a shared hallucination judge or a tone judge that every customer-facing agent should pass.
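
The "meet or exceed" rule for the passing threshold can be sketched as a single comparison. This is an illustration under stated assumptions, not the product's scoring code: the function name, the agent names, and the 0-to-1 score scale are all hypothetical.

```python
def meets_threshold(score: float, passing_threshold: float) -> bool:
    """Return True when the judge score meets or exceeds the threshold."""
    return score >= passing_threshold


# One shared threshold applied to responses from several agents.
# Agent names and scores are made up; a 0-to-1 scale is assumed.
HALLUCINATION_THRESHOLD = 0.7
judge_scores = {"support-agent": 0.7, "sales-agent": 0.65, "billing-agent": 0.9}

results = {
    agent: meets_threshold(score, HALLUCINATION_THRESHOLD)
    for agent, score in judge_scores.items()
}
```

A score exactly equal to the threshold passes, so `support-agent` passes here while `sales-agent` does not.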

Attach an evaluation to an agent scenario

Scenarios live on the agent and pair a test prompt with the assertions and evaluations that should run against it.

Open Intelligence → Agents and select the agent you want to evaluate.
Open the Scenarios tab on the agent page and click + Scenario.
Configure the scenario:
- Name and Description — Identify the scenario and what it covers.
- Prompt — The user message or conversation that drives the scenario.
- Timeout — How long the scenario is allowed to run before failing.
- Model — The model the agent should use for this run.
- Assertion — A required behavior to verify, such as which tool or skill should load.
- Evaluations — Select one or more evaluations to score the response.
Click Save.
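
As a rough mental model of how the Timeout and Assertion fields interact, the sketch below checks the timeout first and then whether the expected tool or skill was among those loaded. The function, its parameters, and the outcome strings are hypothetical, and treating a run that takes exactly the timeout as still in time is an assumption.

```python
from typing import Iterable


def scenario_outcome(
    expected: str,
    loaded: Iterable[str],
    elapsed_seconds: float,
    timeout_seconds: float,
) -> str:
    """Classify a finished scenario run (illustrative logic only)."""
    # A scenario that runs past its timeout fails outright.
    if elapsed_seconds > timeout_seconds:
        return "failed: timeout"
    # The assertion requires the expected tool or skill to have loaded.
    if expected not in set(loaded):
        return "failed: assertion"
    return "assertion passed"


# Hypothetical tool names.
outcome = scenario_outcome(
    "order_lookup", ["order_lookup", "refund_policy"], 12.0, 60.0
)
```

Evaluations would then score the response separately; this sketch covers only the assertion and timeout.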

Run a scenario and review results

From the scenario configuration page, click Run Scenario.
Watch the live response in the panel and review the assertion result and each evaluation’s score.
Open the Run history section on the same page to compare the latest run against previous runs over time.

Because run history is preserved per scenario, you can use evaluation suites as a regression check: rerun a scenario after changing a prompt, model, tool, or skill to confirm the agent still meets your expectations before promoting the change.
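
The regression check described above amounts to comparing the latest run with the one before it. The sketch below assumes you have copied each run's pass/fail results into a dictionary by hand; the data shape and function name are hypothetical, and counting a check that is missing from the latest run as failing is an assumption.

```python
from typing import Dict, List


def regressions(history: List[Dict[str, bool]]) -> List[str]:
    """Names of checks that passed in the previous run but not in the latest.

    `history` is ordered oldest first; each entry maps a check name
    (the assertion or an evaluation) to its pass/fail result.
    """
    if len(history) < 2:
        return []  # nothing to compare against yet
    previous, latest = history[-2], history[-1]
    return sorted(
        name for name, ok in previous.items()
        if ok and not latest.get(name, False)
    )


# Illustrative history: the second run follows a prompt change.
history = [
    {"assertion": True, "Tone Check": True, "Hallucination Judge": True},
    {"assertion": True, "Tone Check": False, "Hallucination Judge": True},
]
broken = regressions(history)
```

Here `broken` contains only `"Tone Check"`, which signals that the prompt change affected tone and should be reviewed before promoting it.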
