Skip to content

Evaluations and Improvements

Test Suite is how you stop a good fix from creating a quiet regression somewhere else.

It is also how operators make expected behavior checkable. A useful case pairs a realistic input with a Standard that says what the agent must do, must not do, and when it should hand off.

The most reliable cases do not start as synthetic prompts. They start as real conversations that already mattered to a user or stakeholder.

Step 1: Start from a real failure

The fastest path is:

  1. open a risky thread in Histories
  2. read the stakeholder feedback or the problematic reply
  3. ask Copilot what likely caused it
  4. decide whether the issue belongs in instructions, tools, or data

If the answer is clearly something you never want to repeat, save it as a case.

Step 2: Use Add Case from the history thread

In the history thread, click Add Case.

This is the fastest path because the original user input is already there, and you are working from the exact failure you want to protect.

Creating a case directly from a history thread

Step 3: Write a strong Standard

Inside the case detail, the most important field is usually Standard.

Write it like a checklist, not a vague aspiration.

Weak:

  • Should answer well

Strong:

  • Must ask at least one clarifying question before recommending a consultation
  • Must not jump straight to booking
  • Must mention callback when urgency or uncertainty is high

Keep standards checkable

A good standard is specific enough that another operator could read the reply and decide whether it passed without guessing what you meant.

Inspect tool steps when the failure depends on tool use

Some failures are not only about the final wording. The agent may have searched the wrong knowledge source, used a weak query, skipped a required tool, or ignored the result it retrieved.

In the case detail page, open Tool Steps above the AI response. Expand a step to inspect:

  • which tool was called
  • what arguments were sent
  • what result came back

Use this evidence when deciding whether the fix belongs in Instructions, a tool's When to Use, or the underlying Knowledge Base.

If you create an evaluator that should judge tool behavior, include {tool_steps} in the evaluator system prompt. Existing evaluators do not use tool steps unless their prompt explicitly asks for this placeholder.

Use the canonical tool type in the standard so the evaluator can match the tool step reliably.

English name 中文名稱 Canonical tool type
Search Knowledge Base 搜尋知識庫 consultant_retrieve_context_objs
List Knowledge Base Files 列出知識庫檔案 consultant_list_kb_files
Read Knowledge Base File Lines 讀取知識庫檔案行數 consultant_get_context_obj_lines
Search Web 搜尋網頁 consultant_search_web
Fetch Web Page Content 讀取網頁內容 consultant_fetch_web_content
Call Agent 呼叫其他 Agent consultant_call_agent
Request Form 要求填寫表單 consultant_request_form
Payment 建立付款 consultant_payment
Memory 記住使用者資訊 consultant_memory
HTTP Request 呼叫 HTTP API consultant_http_request
Generate Image 產生圖片 consultant_generate_image

Example standard for searching the Knowledge Base:

Required tool:
- Search Knowledge Base (`consultant_retrieve_context_objs`)

Parameter requirements:
- The question should ask about refund policy, refund conditions, or 退費政策.
- The keywords should include refund-related terms.

Result requirements:
- The result should contain refund-related policy information.

Final answer requirements:
- The response should be grounded in the retrieved policy result.

Example standard for finding and reading a specific Knowledge Base file:

Required tools:
- List Knowledge Base Files (`consultant_list_kb_files`)
- Read Knowledge Base File Lines (`consultant_get_context_obj_lines`)

Parameter requirements:
- `list_kb_files.pattern` should target the expected file name or path.
- `get_context_obj_lines.knowledge_node_id` should come from the file list result.
- The requested line range should be narrow enough to inspect the relevant section.

Result requirements:
- The file result should contain the policy, procedure, or source section needed to answer.

Final answer requirements:
- The response should use the file content, not only general knowledge.

Example standard for web research:

Required tools:
- Search Web (`consultant_search_web`)
- Fetch Web Page Content (`consultant_fetch_web_content`) when a search result must be opened before answering

Parameter requirements:
- The search question and keywords should match the user's requested topic.
- The fetched URL should come from a relevant search result.

Result requirements:
- The fetched content should contain evidence for the final answer.

Final answer requirements:
- The response should not claim facts that are missing from the fetched content.

Example standard for routing to another agent:

Required tool:
- Call Agent (`consultant_call_agent`)

Parameter requirements:
- `agent_name` should match the specialist agent responsible for the request.
- `query` should include the user's key need and enough context for the specialist agent.

Final answer requirements:
- The response should use the specialist agent's result or clearly explain the handoff outcome.

Step 4: Run Test Suite against the version you want to trust

Once the case set is ready, run Test Suite on the current working version.

You are looking for two things:

  • does the fixed case now pass
  • did any previously strong case become worse

Comparing results in Test Suite after changes

Step 5: Fix the agent and rerun until the important cases are stable

When a case fails, map it to one concrete change:

  • rewrite a rule in Instructions
  • tighten a tool's When to Use
  • move missing knowledge into Knowledge Base or Instructions
  • change the handoff boundary between agents

Then rerun the same case set. The point is not to chase a perfect score everywhere. The point is to keep important behavior stable before release.

Practical rules for operators

  • Start with a small set of high-value cases, not a giant spreadsheet.
  • Protect the failures that affect trust, routing quality, or costly handoff mistakes first.
  • Prefer real user language over invented QA phrasing.
  • Add a case as soon as you say, We should never miss this again.

When to publish

Publish when the version now does what you need on the important cases and does not regress the behaviors you already trust.

That is the real job of evaluations: not scoring for its own sake, but creating evidence that a release is safer than the last one.