
Evaluations and Improvements

A Test Suite is how you stop a good fix from quietly causing a regression somewhere else.

For operators, the most reliable cases do not start as synthetic prompts. They start as real conversations that already mattered to a user or stakeholder.

Step 1: Start from a real failure

The fastest path is:

  1. open a risky thread in Histories
  2. read the stakeholder feedback or the problematic reply
  3. ask Copilot what likely caused it
  4. decide whether the issue belongs in instructions, tools, or data

If the answer is clearly something you never want to repeat, save it as a case.
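The case you save from a real failure is, conceptually, just the original input plus a note on what went wrong plus the standards a good reply must meet. A minimal sketch of that shape, with hypothetical field names (the platform stores this for you when you use Add Case; nothing here is its actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    user_input: str                # the real message from the history thread
    failure_note: str              # what went wrong, in your own words
    standards: list[str] = field(default_factory=list)  # checkable pass criteria

# Example built from a realistic (invented) failure
case = Case(
    user_input="My tooth has been aching for two days, can I get an appointment?",
    failure_note="Agent booked immediately without asking about urgency.",
    standards=[
        "Must ask at least one clarifying question before recommending a consultation",
        "Must mention callback when urgency or uncertainty is high",
    ],
)
```

The point of the shape is that the real user input travels with the standards, so whoever reruns the case later is testing the exact failure, not a paraphrase of it.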

Step 2: Use Add Case from the history thread

In the history thread, click Add Case.

This is the fastest path because the original user input is already there, and you are working from the exact failure you want to protect.

Creating a case directly from a history thread

Step 3: Write a strong Standard

Inside the case detail, the most important field is usually Standard.

Write it like a checklist, not a vague aspiration.

Weak:

  • Should answer well

Strong:

  • Must ask at least one clarifying question before recommending a consultation
  • Must not jump straight to booking
  • Must mention callback when urgency or uncertainty is high

Keep standards checkable

A good standard is specific enough that another operator could read the reply and decide whether it passed without guessing what you meant.
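"Checkable" means a reply can be marked pass or fail without interpretation. A toy sketch of that idea, using crude, hypothetical predicate functions over the reply text (real grading is usually done by a human or a model, not string matching — this only illustrates what pass/fail per standard looks like):

```python
def asks_clarifying_question(reply: str) -> bool:
    # Crude proxy: the reply contains at least one question
    return "?" in reply

def jumps_straight_to_booking(reply: str) -> bool:
    # Crude proxy: the reply opens by confirming a booking
    return reply.lower().startswith("i have booked")

standards = {
    "asks a clarifying question first": asks_clarifying_question,
    "does not jump straight to booking": lambda r: not jumps_straight_to_booking(r),
}

reply = "I have booked you in for tomorrow at 9am."
results = {name: check(reply) for name, check in standards.items()}
print(results)  # both standards fail for this reply
```

Each standard yields its own verdict, which is exactly what makes a checklist-style Standard stronger than "should answer well": you can see which specific expectation broke.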

Step 4: Run Test Suite against the version you want to trust

Once the case set is ready, run Test Suite on the current working version.

You are looking for two things:

  • does the fixed case now pass
  • did any previously strong case become worse

Comparing results in Test Suite after changes
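The two questions above amount to a diff between two test runs. A sketch with hypothetical pass/fail maps keyed by case name (the Test Suite view shows you this comparison; the logic is just set arithmetic):

```python
# Results before and after the fix: True = passed, False = failed
run_before = {"urgent toothache": False, "routine cleaning": True, "pricing question": True}
run_after  = {"urgent toothache": True,  "routine cleaning": True, "pricing question": False}

# Did the fixed case now pass?
newly_fixed = [c for c in run_after if run_after[c] and not run_before[c]]
# Did any previously strong case become worse?
regressions = [c for c in run_after if run_before[c] and not run_after[c]]

print("fixed:", newly_fixed)        # ['urgent toothache']
print("regressed:", regressions)    # ['pricing question']
```

In this invented example the fix worked, but it also broke an unrelated case, which is precisely the quiet regression a case set exists to catch.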

Step 5: Fix the agent and rerun until the important cases are stable

When a case fails, map it to one concrete change:

  • rewrite a rule in Instructions
  • tighten a tool's When to Use
  • move missing knowledge into Knowledge Base or Attachments
  • change the handoff boundary between agents

Then rerun the same case set. The point is not to chase a perfect score everywhere. The point is to keep important behavior stable before release.

Practical rules for operators

  • Start with a small set of high-value cases, not a giant spreadsheet.
  • Protect the failures that affect trust, routing quality, or costly handoff mistakes first.
  • Prefer real user language over invented QA phrasing.
  • Add a case as soon as you catch yourself saying, "We should never miss this again."

When to publish

Publish when the version now does what you need on the important cases and does not regress the behaviors you already trust.
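That publish condition can be stated as a simple gate over two runs. A hedged sketch with a hypothetical helper (not a platform feature, just the decision rule in code):

```python
def safe_to_publish(important_cases, run_before, run_after):
    """Release only if every important case passes and nothing that
    previously passed now fails."""
    important_pass = all(run_after[c] for c in important_cases)
    no_regressions = all(run_after[c] for c in run_after if run_before.get(c))
    return important_pass and no_regressions

# The fix landed and nothing regressed: safe
print(safe_to_publish(
    ["urgent toothache"],
    {"urgent toothache": False, "routine cleaning": True},
    {"urgent toothache": True, "routine cleaning": True},
))  # True
```

Both conditions matter: a version that fixes the important case but drops a previously passing one fails the gate, which is the evidence-over-score point made below.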

That is the real job of evaluations: not scoring for its own sake, but creating evidence that a release is safer than the last one.