Evaluations and Improvements
A Test Suite is how you stop a good fix from creating a quiet regression somewhere else.
For operators, the most reliable cases do not start as synthetic prompts. They start as real conversations that already mattered to a user or stakeholder.
Step 1: Start from a real failure
The fastest path is:
- open a risky thread in Histories
- read the stakeholder feedback or the problematic reply
- ask Copilot what likely caused it
- decide whether the issue belongs in instructions, tools, or data
If the answer is clearly something you never want to repeat, save it as a case.
Step 2: Use Add Case from the history thread
In the history thread, click Add Case.
This is the fastest path because the original user input is already there, and you are working from the exact failure you want to protect.

Step 3: Write a strong Standard
Inside the case detail, the most important field is usually Standard.
Write it like a checklist, not a vague aspiration.
Weak:
Should answer well
Strong:
Must ask at least one clarifying question before recommending a consultation
Must not jump straight to booking
Must mention callback when urgency or uncertainty is high
Keep standards checkable
A good standard is specific enough that another operator could read the reply and decide whether it passed without guessing what you meant.
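One way to picture a checkable standard is as a list of yes/no checks that any reviewer could apply to a reply without guessing. The sketch below is purely illustrative, not the product's internal format: the check functions and the string matching are invented stand-ins for a human reviewer's judgment.

```python
# Hypothetical sketch: a "standard" written as explicit yes/no checks.
# The check functions below are invented proxies for human review.

def asks_clarifying_question(reply: str) -> bool:
    return "?" in reply  # crude proxy: at least one question asked

def avoids_instant_booking(reply: str) -> bool:
    return "book now" not in reply.lower()

def mentions_callback(reply: str) -> bool:
    return "callback" in reply.lower()

STANDARD = [asks_clarifying_question, avoids_instant_booking, mentions_callback]

def passes(reply: str) -> bool:
    # A reply passes only if every checklist item holds.
    return all(check(reply) for check in STANDARD)

print(passes("Could you tell me how urgent this is? We can arrange a callback."))
```

The point of the structure is that each item is independently decidable, so two operators reading the same reply reach the same verdict.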
Step 4: Run Test Suite against the version you want to trust
Once the case set is ready, run Test Suite on the current working version.
You are looking for two things:
- does the fixed case now pass
- did any previously strong case become worse
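Those two questions amount to diffing pass/fail results between the version you trusted and the version you want to trust. A minimal sketch, with invented case names and results (the product surfaces this in the Test Suite UI rather than as data you script against):

```python
# Hypothetical sketch: compare case results across two versions to spot
# fixes and regressions. Case names and outcomes are invented.

previous = {"missed-callback": False, "urgent-routing": True, "pricing-faq": True}
current  = {"missed-callback": True,  "urgent-routing": True, "pricing-faq": False}

fixed = [c for c in current if current[c] and not previous[c]]
regressed = [c for c in current if previous[c] and not current[c]]

print("fixed:", fixed)          # cases that now pass but previously failed
print("regressed:", regressed)  # cases that previously passed but now fail
```

A non-empty `regressed` list is the signal to keep iterating before release, even if the original failure is now fixed.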

Step 5: Fix the agent and rerun until the important cases are stable
When a case fails, map it to one concrete change:
- rewrite a rule in Instructions
- tighten a tool's When to Use
- move missing knowledge into Knowledge Base or Attachments
- change the handoff boundary between agents
Then rerun the same case set. The point is not to chase a perfect score everywhere. The point is to keep important behavior stable before release.
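The stopping condition for the rerun loop can be stated as a simple gate: every case you have marked as important must pass, while less critical cases may still vary. A hypothetical sketch, with invented case names (in the product this judgment happens by reading the Test Suite results):

```python
# Hypothetical sketch: the release gate implied by Step 5.
# Case names are invented for illustration.

IMPORTANT = {"missed-callback", "urgent-routing"}

def stable(results: dict[str, bool]) -> bool:
    # Ready only when every important case passes; others may still vary.
    return all(results.get(case, False) for case in IMPORTANT)

print(stable({"missed-callback": True, "urgent-routing": True, "pricing-faq": False}))  # True
print(stable({"missed-callback": False, "urgent-routing": True}))                       # False
```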
Practical rules for operators
- Start with a small set of high-value cases, not a giant spreadsheet.
- Protect the failures that affect trust, routing quality, or costly handoff mistakes first.
- Prefer real user language over invented QA phrasing.
- Add a case as soon as you say, "We should never miss this again."
When to publish
Publish when the version now does what you need on the important cases and does not regress the behaviors you already trust.
That is the real job of evaluations: not scoring for its own sake, but creating evidence that a release is safer than the last one.